nips nips2009 nips2009-207 knowledge-graph by maker-knowledge-mining

207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output


Source: pdf

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 de Abstract Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. [sent-3, score-0.651]

2 Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. [sent-4, score-1.163]

3 In this paper we focus on the case of regression with manifold-valued input and output. [sent-5, score-0.131]

4 We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. [sent-6, score-0.706]

5 1 Introduction In recent years there has been an increasing interest in learning with output which differs from the case of standard classification and regression. [sent-7, score-0.157]

6 In structured output learning, see [1, 2, 3] and references therein, one generalizes multiclass classification to more general discrete output spaces, in particular incooperating structure of the joint input and output space. [sent-9, score-0.683]

7 On the other hand there has been a recent series of work which generalizes regression with multivariate output to the case where the output space is a Riemannian manifold, see [4, 5, 6, 7], with applications in signal processing, computer vision, computer graphics and robotics. [sent-11, score-0.52]

8 One can also see this branch as structured output learning if one thinks of a Riemannian manifold as isometrically embedded in a Euclidean space. [sent-12, score-0.316]

9 Then the restriction that the output has to lie on the manifold can be interpreted as constrained regression in Euclidean space, where the constraints couple several output features together. [sent-13, score-0.481]

10 In this paper we propose a family of kernel estimators for regression with metric-space valued input and output motivated by estimators proposed in [6, 8] for manifold-valued regression. [sent-14, score-0.833]

11 We discuss loss functions and the corresponding Bayesian decision theory for this general regression problem. [sent-15, score-0.235]

12 Moreover, we show that this family of estimators has several well known estimators as special cases for certain choices of the output space and its metric. [sent-16, score-0.532]

13 However, our main emphasis lies on the problem of regression with manifold-valued input and output which includes the multivariate Euclidean case. [sent-17, score-0.362]

14 In particular, we show for all our proposed estimators their pointwise and Bayes consistency, that is in the limit as the sample size goes to infinity the estimated mapping converges to the Bayes optimal mapping. [sent-18, score-0.308]

15 This includes estimators implementing several robust loss functions like the L1 -loss, Huber loss or the ε-insensitive loss. [sent-19, score-0.457]

16 This generality is possible since our proof considers directly the functional which is minimized instead of its minimizer as it is usually done in consistency proofs of the Nadaraya-Watson estimator. [sent-20, score-0.183]

17 1 2 Bayesian decision theory and loss functions for metric-space valued output We consider the structured output learning problem where the task is to learn a mapping φ : M → N between two metric spaces M and N , where dM denotes the metric of M and dN the metric of N . [sent-22, score-1.153]

18 We assume that both metric spaces M and N are separable1 . [sent-23, score-0.188]

19 In order to prove later on consistency of our metric-space valued estimator we first have to define the Bayes optimal mapping φ∗ : M → N in the case where M and N are general metric spaces which depends on the employed loss function. [sent-28, score-0.7]

20 In multivariate regression the most common loss function 2 is, L(y, f (x)) = y − f (x) 2 . [sent-29, score-0.283]

21 However, it is well known that this loss is sensitive to outliers. [sent-30, score-0.127]

22 In univariate regression one therefore uses the L1 -loss or other robust loss functions like the Huber or εinsensitive loss. [sent-31, score-0.278]

23 For the L1 -loss the Bayes optimal function f ∗ is given as f ∗ (x) = Med[Y |X = x], where Med denotes the median of P(Y |X = x) which is a robust location measure. [sent-32, score-0.136]

24 Several generalizations of the median for multivariate output have been proposed, see e. [sent-33, score-0.298]

25 In this paper we refer to the minimizer of the loss function L(y, f (x)) = y − f (x) Rn resp. [sent-36, score-0.173]

26 L(y, f (x)) = dN (y, f (x)) as the (generalized) median, since this seems to be the only generalization of the univariate median which has a straightforward extension to metric spaces. [sent-37, score-0.235]

27 In analogy to Euclidean case, we will therefore use loss functions penalizing the distance between predicted output and desired output: L(y, φ(x)) = Γ dN (y, φ(x)) , y ∈ N, x ∈ M, where Γ : R+ → R+ . [sent-38, score-0.308]

28 The associated risk (or expected loss) is: RΓ (φ) = E[L(Y, φ(X))] and its Bayes optimal mapping φ∗ : M → N Γ can then be determined by φ∗ := Γ = arg min RΓ (φ) = φ:M →N, φ measurable arg min φ:M →N, φ measurable arg min φ:M →N, φ measurable E[Γ dN (Y, φ(X)) ] EX [EY |X [Γ dN (Y, φ(X)) | X]. [sent-40, score-0.663]

29 (1) In the second step we used a result of [10] which states that a joint probability measure on the product of two separable metric spaces can always be factorized into a conditional probability measure and the marginal. [sent-41, score-0.241]

30 In order that the risk is well-defined, we assume that there exists a measurable mapping φ : M → N so that E[Γ dN (Y, φ(X)) ] < ∞. [sent-42, score-0.161]

31 Apart from the global risk RΓ (φ) we analyze for each x ∈ M the pointwise risk RΓ (x, φ(x)), RΓ (x, φ(x)) = EY |X [Γ dN (Y, φ(X)) | X = x], which measures the loss suffered by predicting φ(x) for the input x ∈ M . [sent-44, score-0.349]

32 The total loss RΓ (φ) of the mapping φ is then RΓ (φ) = E[RΓ (X, φ(X))]. [sent-45, score-0.16]

33 As in standard regression the factorization allows to find the Bayes optimal mapping φ∗ pointwise, φ∗ (x) = arg min RΓ (x, p) = arg min E[Γ dN (Y, p) | X = x] = arg min Γ p∈N p∈N Γ dN (y, p) dµx (y), p∈N N where dµx is the conditional probability of Y conditioned on X = x. [sent-46, score-0.555]

34 Later on we prove consistency for a set of kernel estimators each using a different loss function Γ from the following class of functions. [sent-47, score-0.455]

35 Several functions Γ corresponding to standard loss functions in regression are (α, s)-bounded: • Lp -type loss: Γ(x) = xγ for γ ≥ 1 is (2γ , 1)-bounded, • Huber-loss: Γ(x) = 1 2x2 ε for x ≤ ε 2 and Γ(x) = 2x − ε 2 A metric space is separable if it contains a countable dense subset. [sent-49, score-0.431]

36 While uniqueness of the minimizer of the pointwise loss functional RΓ (x, ·) cannot be guaranteed anymore in the case of metric space valued output, the following lemma shows that RΓ (x, ·) has reasonable properties (all longer proofs can be found in Section 7 or in the supplementary material). [sent-52, score-0.676]

37 Lemma 1 Let N be a complete and separable metric space such that d(x, y) < ∞ for all x, y ∈ N and every closed and bounded set is compact. [sent-54, score-0.219]

38 If Γ is (α, s)-bounded and RΓ (x, q) < ∞ for some q ∈ N , then • RΓ (x, p) < ∞ for all p ∈ N , • RΓ (x, ·) is continuous on N , • The set of minimizers Q∗ = arg min RΓ (x, q) exists and is compact. [sent-55, score-0.232]

39 The minimizer of the pointwise risk, d2 (y, p) dµx (y), N F (p) = arg min p∈N N is called the Frech´ t mean2 or Karcher mean in the case where N is a manifold. [sent-57, score-0.299]

40 It is the generalizae tion of a mean in Euclidean space to a general metric space. [sent-58, score-0.165]

41 A simple example is the sphere as the output space together with a uniform probability measure on it. [sent-60, score-0.241]

42 For a discussion of the computation of the median in general metric spaces see [14]. [sent-64, score-0.281]

43 3 A family of kernel estimators with metric-space valued input and output In the following we provide the definition of the kernel estimator with metric-space valued output motivated by the two estimators proposed in [6, 8] for manifold-valued output. [sent-65, score-1.238]

44 We use in the following the notation kh (x) = h1 k(x/h). [sent-66, score-0.471]

45 The metric-space-valued i=1 kernel estimator φl : M → N from metric space M to metric space N is defined for all x ∈ M as φl (x) = arg min q∈N 1 l l Γ dN (q, Yi ) kh dM (x, Xi ) , (2) i=1 where Γ : R+ → R+ is (α, s)-bounded and k : R+ → R+ . [sent-68, score-1.188]

46 If the data contains a large fraction of outliers one should use a robust loss function Γ, see Section 6. [sent-69, score-0.277]

47 Usually the kernel function should be monotonically decreasing since the interpretation of kh dM (x, Xi ) is to measure the similarity between x and Xi in M which should decrease as the distance increases. [sent-70, score-0.611]

48 The computational complexity to determine φl (x) is quite high as for each test point one has to solve an optimization problem but comparable to structured output learning (see discussion below) where one maximizes for each test point the score function over the output space. [sent-71, score-0.414]

49 For manifold-valued output we will describe in the next section a simple gradient-descent type optimization scheme in order to determine φl (x). [sent-72, score-0.157]

50 It is interesting to see that several well-known nonparametric estimators for classification and regression can be seen as special cases of this estimator (or a slightly more general form) for different choices of the output space, its metric and the loss function. [sent-73, score-0.901]

51 In particular, the approach shows a certain analogy of a generalization of regression into a continuous space (manifold-valued regression) and regression into a discrete space (structured output learning). [sent-74, score-0.481]

52 If there is no special class-structure, then we use the discrete metric on N , dN (q, q ) = 1 if q = q and 0 else leads for any Γ to the standard multiclass classification scheme using a majority vote. [sent-80, score-0.182]

53 Multivariate regression: Let N = Rn and M be a metric space. [sent-83, score-0.142]

54 Then for Γ(x) = x2 , one gets 1 l φl (x) = arg min q∈N which has the solution, φl (x) = l i=1 1 l 1 l l q − Yi kh dM (x, Xi ) , i=1 kh dM (x,Xi ) Yi l i=1 2 kh dM (x,Xi ) . [sent-84, score-1.58]

55 This is the well-known Nadaraya-Watson estimator, see [15, 16], on a metric space. [sent-85, score-0.142]

56 In [17] a related estimator is discussed when M is a closed Riemannian manifold and [18] discusses the Nadaraya-Watson estimator when M is a metric space. [sent-86, score-0.55]

57 Manifold-valued regression: In [6] the estimator φl (x) has been proposed for the case where N is a Riemannian manifold and Γ(x) = x2 , in particular with the emphasis on N being the manifold of shapes. [sent-87, score-0.307]

58 While it has been shown in [7] that an approach using a global smoothness regularizer outperforms the estimator φl (x), it is a well working baseline with a simple implementation, see Section 4. [sent-89, score-0.163]

59 The similarity to our estimator φl (x) in (2) becomes more obvious when we use that in the framework of [1] the learned score function can be written as Ψl (x) = arg max q∈N 1 l l αi k (x, q), (Xi , Yi ) , (4) i=1 where α ∈ Rl is the learned coefficient vector. [sent-93, score-0.275]

60 Apart from the coefficient vector α this has almost the form of the previously discussed estimator in Equation (3), using a joint similarity function on input and output space. [sent-94, score-0.428]

61 Clearly, a structured output method where the coefficients α have been optimized, should perform better than αi = const. [sent-95, score-0.257]

62 In cases where training time is prohibitive the estimator without α is an alternative, at least it provides a useful baseline for structured output learning. [sent-96, score-0.42]

63 , then one can rewrite the problem in (4) as, Ψl (x) = arg min q∈N 1 l l αi kM (x, Xi )d2 (q, Yi ), N i=1 3 where dN is the induced (semi)-metric of kN . [sent-98, score-0.138]

64 N 4 4 Implementation of the kernel estimator for manifold-valued output For fixed x ∈ M , the functional F (q) for q ∈ N which is optimized in the kernel estimator φl (x) can be rewritten with wi = kh (dM (x, Xi )) as, l F (q) = wi Γ dN (q, Yi ) . [sent-103, score-1.248]

65 i=1 l The covariant gradient of F (q) is given as, F q = i=1 wi Γ dN (p, Yi ) vi , where vi ∈ Tq N is a tangent vector at q with vi Tq N = 1 given by the tangent vector at q of the minimizing4 geodesic from Yi to q (pointing “away” from Yi ). [sent-104, score-0.202]

66 5 Consistency of the kernel estimator for manifold-valued input and output In this section we show the pointwise and Bayes consistency of the kernel estimator φl in the case where M and N are Riemannian manifolds. [sent-111, score-0.875]

67 The proof of consistency of the general metric-space valued kernel estimator (for a restricted class of metric spaces including all Riemannian manifolds) requires high technical overload which is interesting in itself but which would make the paper hard accessible. [sent-113, score-0.649]

68 The consistency of φl will be proven under the following assumptions: Assumptions (A1): The loss Γ : R+ → R+ is (α, s)-bounded. [sent-114, score-0.209]

69 The marginal density on M fulfills: p(x) ≥ pmin , ∀ x ∈ M , 6. [sent-119, score-0.122]

70 In the following dV = det g dx denotes the natural volume element of a Riemannian manifold with metric g, vol(S) and diam(N ) are the volume and diameter of the set S. [sent-127, score-0.201]

71 Proposition 2 Let the assumptions A1 hold, then if f is continuous we get for any x ∈ M \∂M , lim h→0 kh (dM (x, z))f (z) dV (z) = Cx f (x), M where Cx = limh→0 M kh (dM (x, z)) dV (z) > 0. [sent-135, score-1.128]

72 If moreover f is Lipschitz continuous with Lipschitz constant L, then there exists a h0 > 0 such that for all h < h0 (x), kh (dM (x, z))f (z) dV (z) = Cx f (x) + O(h). [sent-136, score-0.567]

73 M The following main theorem proves the almost sure pointwise convergence of the manifold-valued kernel estimator for all (α, s)-bounded loss functions Γ. [sent-137, score-0.557]

74 Let φl (x) be the estimate of the kernel estimator for sample size l. [sent-139, score-0.249]

75 If h → 0 and lhm / log l → ∞, then for any x ∈ M \∂M , lim |RΓ (x, φl (x)) − arg min RΓ (x, q)| = 0, l→∞ almost surely. [sent-140, score-0.367]

76 q∈N If additionally p(·, y) is Lipschitz-continuous for any y ∈ N , then lim |RΓ (x, φl (x)) − arg min RΓ (x, q)| = O(h) + O l→∞ log l/(l hm ) , almost surely. [sent-141, score-0.352]

77 q∈N 1 The optimal rate is given by h = O (log l/l) 2+m so that lim RΓ (x, φl (x)) − arg min RΓ (x, q) = O l→∞ log l/l 1 2+m , almost surely. [sent-142, score-0.298]

78 q∈N Note, that the condition l hm / log l → ∞ for convergence is the same as for the Nadaraya-Watson estimator on a m-dimensional Euclidean space. [sent-143, score-0.247]

79 Thus, doing regression with manifold-valued output is not more “difficult” than standard regression with multivariate output. [sent-145, score-0.421]

80 Next, we show Bayes consistency of the manifold-valued kernel estimator. [sent-146, score-0.168]

81 If h → 0 and lhm / log l → ∞, then lim RΓ (φl ) − RΓ (φ∗ ) = 0, l→∞ almost surely. [sent-148, score-0.229]

82 6 Experiments We illustrate the differences of median and mean type estimator on a synthetic dataset with the task of estimating a curve on the sphere, that is M = [0, 1] and N = S 1 . [sent-152, score-0.256]

83 As expected the the L1 -loss and the Huber loss as robust loss functions outperform the L2 -loss in the presence of outliers, whereas the L2 -loss outperforms the robust versions when no outliers are present. [sent-156, score-0.447]

84 Note, that the Huber loss as a hybrid version between L1 - and L2 -loss is even slightly better than the L1 -loss in the presence of outliers as well as in the outlier free case. [sent-157, score-0.234]

85 Thus for a given dataset it makes sense not only to do cross-validation of the parameter h of the kernel function but also over different loss functions in order to adapt to possible outliers in the data. [sent-158, score-0.32]

86 In particular, the curves found using L1 and Huber loss are very close to the ground truth. [sent-172, score-0.127]

87 Table 1: Mean squared error (unit 10−1 ) for regression on the sphere - for different noise levels k, number of labeled points, without and with outliers. [sent-173, score-0.169]

88 Note that φl (x) = arg min RΓ,l (x, q) as we have only divided by a constant factor. [sent-251, score-0.138]

89 We use the standard technique for q∈N the pointwise estimate, RΓ (x, φl (x)) − min RΓ (x, q) ≤ RΓ (x, φl (x)) − RΓ,l (x, φl (x)) + RΓ,l (x, φl (x)) − min RΓ (x, q) q∈N q∈N ≤ 2 sup |RΓ,l (x, q) − RΓ (x, q)|. [sent-252, score-0.221]

90 q∈N In order to bound the supremum, we will work on the event E, where we assume, 1 l l i=1 kh (dM (x,Xi )) E[kh (dM (x,X))] m − 1 < 1 , which holds with probability 1 − 2 e−C l h for some constant C. [sent-253, score-0.492]

91 For every x ∈ M \∂M we get, lim h→0 kh (dM (x, z))p(z, y)dV (z) = Cx p(x, y), M lim h→0 kh (dM (x, z))p(z)dV (z) = Cx p(x), M where Cx > 0. [sent-261, score-1.19]

92 Thus with fh = kh (dM (x, z))p(z, y)dV (z), gh = M kh (dM (x, z))p(z)dV (z), M we get for every x ∈ M \∂M , lim h→0 fh f |fh − f | |gh − g| − ≤ lim + lim f = 0, h→0 h→0 gh g gh g gh where we have used gh ≥ aS1 r1 pmin > 0 and g = Cx p(x) > 0. [sent-262, score-2.175]

93 Moreover, using results from the proof of Proposition 2 one can show fh < C for some constant C. [sent-263, score-0.135]

94 Thus fh /gh < C for some constant and fh /gh → f /g as h → 0. [sent-264, score-0.224]

95 Using the dominated convergence theorem we thus get E lim RΓ (x, q) = lim h→0 h→0 E[Γ(dN (q, Y ))kh (dM (x, X))] = E[kh (dM (x, X))] Γ dN (q, y) N p(x, y) dy = RΓ (x, q). [sent-265, score-0.324]

96 For the case where p(·, y) is Lipschitz continuous for all E y ∈ N we have RΓ (x, q) = RΓ (x, q) + O(h) and can choose s large enough so that the bound lhm from the approximation error dominates the one of the covering. [sent-269, score-0.128]

97 Acknowledgments We thank Florian Steinke for helpful discussions about relations between generalized kernel estimators and structured output learning. [sent-272, score-0.503]

98 Large margin methods for structured and interdependent output variables. [sent-279, score-0.257]

99 The geometric median on Riemannian manifolds with application to robust atlas estimation. [sent-331, score-0.204]

100 Consistency conditions for probability estimators and integrals of density estimators. [sent-416, score-0.189]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kh', 0.471), ('dm', 0.406), ('dn', 0.273), ('riemannian', 0.218), ('estimator', 0.163), ('diam', 0.162), ('estimators', 0.16), ('output', 0.157), ('metric', 0.142), ('loss', 0.127), ('lim', 0.124), ('pointwise', 0.115), ('fh', 0.112), ('regression', 0.108), ('outliers', 0.107), ('valued', 0.107), ('cx', 0.105), ('gh', 0.104), ('structured', 0.1), ('huber', 0.099), ('dv', 0.097), ('median', 0.093), ('pmin', 0.093), ('vol', 0.087), ('kernel', 0.086), ('arg', 0.085), ('consistency', 0.082), ('kn', 0.081), ('yi', 0.081), ('frech', 0.069), ('lhm', 0.069), ('manifolds', 0.068), ('sphere', 0.061), ('wi', 0.061), ('tq', 0.061), ('manifold', 0.059), ('measurable', 0.058), ('lipschitz', 0.056), ('bayes', 0.055), ('hm', 0.054), ('min', 0.053), ('uniqueness', 0.052), ('euclidean', 0.049), ('multivariate', 0.048), ('proposition', 0.048), ('minimizer', 0.046), ('spaces', 0.046), ('qk', 0.045), ('mse', 0.045), ('nonparametric', 0.044), ('robust', 0.043), ('xi', 0.043), ('ful', 0.042), ('risk', 0.042), ('limh', 0.041), ('steinke', 0.041), ('multiclass', 0.04), ('continuous', 0.038), ('fletcher', 0.037), ('almost', 0.036), ('hein', 0.035), ('med', 0.033), ('mapping', 0.033), ('lemma', 0.032), ('proofs', 0.032), ('family', 0.032), ('saarland', 0.031), ('absolutely', 0.031), ('separable', 0.031), ('convergence', 0.03), ('lls', 0.03), ('moreover', 0.03), ('apart', 0.029), ('semi', 0.029), ('gets', 0.029), ('density', 0.029), ('exists', 0.028), ('minimizers', 0.028), ('monotonically', 0.027), ('similarity', 0.027), ('generalizes', 0.027), ('emphasis', 0.026), ('ey', 0.026), ('geodesic', 0.025), ('multiscale', 0.025), ('tangent', 0.025), ('therein', 0.024), ('analogy', 0.024), ('get', 0.024), ('km', 0.024), ('coef', 0.023), ('space', 0.023), ('closed', 0.023), ('proof', 0.023), ('stopping', 0.023), ('input', 0.023), ('vi', 0.022), ('dominated', 0.022), ('joint', 0.022), ('bound', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1

2 0.20958453 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . , qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . Since K( up (q) h ) = 0 for all q with dp (q) > Rp (h), we have, ξh (p) = 1 hn K y ≤Rp (h) up (y) h ξ(y) g(y)dy 1 . . . dy n , (15) 4 Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p, and gij (p) = δij [1]. Normal coordinates can be defined in a “geodesic ball” of radius less than rinj . 5 where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get, K( z )dn z = ξh (p) − ξ(p) up (zh) h K z ≤Rp (h)/h up (zh) h K = z ≤1 ξ(zh) K z ≤1 z ≤1 g(zh) − 1 dn z + ξ(zh) up (zh) h K( z )dn z g(zh)dn z − ξ(0) ξ(zh) − K( z ) dn z + K( z ) (ξ(zh) − ξ(0)) dn z + z ≤1 K 1< z ≤Rp (h)/h up (zh) h ξ(zh) g(zh)dn z . Thus, K ( z ) dn z ≤ ξh (p) − ξ(p) (16) t∈R z ≤1 sup |ξ(zh)| . sup K( z ≤1 z ≤1 dn z + g(zh) − 1 . sup K(t) . sup |ξ(zh)| . sup z ≤1 (17) z ≤1 up (zh) ) − K( z ) . h dn z + (18) z ≤1 K( z )(ξ(zh) − ξ(0))dn z + (19) z ≤1 sup K(t) . t∈R sup g(zh) . 1< z ≤Rp (h)/h sup dn z . (20) |ξ(zh)| . 1< z ≤Rp (h)/h 1< z ≤Rp (h)/h Letting h → 0, the terms (17)-(20) approach zero at the following rates: (17): K(t) is bounded and ξ(y) is continuous at y = 0, so the first two terms can be bounded by constants as h → 0. In normal coordinates y, gij (y) = δij + O( y 2 ) as y → 0, so, sup z ≤1 g(zh) − 1 = O(h2 ) as h → 0. (18): Since K is assumed to be differentiable with a bounded derivative in [0, 1), we get K(b) − u (zh) K(a) = O(b − a) as b → a. By Lemma 3.1 we have p h − z = O(h2 ) as h → 0. Thus, K up (zh) h − K( z ) = O(h2 ) as h → 0. (19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for z ≤ 1, Taylor’s theorem gives, n zi ξ(zh) = ξ(0) + h i=1 as h → 0. Since h → 0. z ≤1 zK( z )dn z = 0, we get ∂ξ(y) ∂y i z ≤1 + O(h2 ) (21) y=0 K( z )(ξ(zh) − ξ(0))dn z = O(h2 ) as (20): The first three terms can be bounded by constants. By Lemma 3.3, Rp (h) = h + O(h3 ) as h → 0. A spherical shell 1 < z ≤ 1 + ǫ has volume O(ǫ) as ǫ → 0+ . Thus, the volume of 1 < z ≤ Rp (h)/h is O(Rp (h)/h − 1) = O(h2 ) as h → 0. Thus, the sum of the terms (17-20), is O(h2 ) as h → 0, as claimed in Theorem 3.1. 6 4 Bias, variance and mean squared error ˆ Let M , f , fm , K, p be as in Theorem 2.1 and assume hm → 0 as m → ∞. ˆ Lemma 4.1. Bias fm (p) = O(h2 ), as m → ∞. m u (q) Proof. We have Bias[fm (p)] = Bias h1 K ph m follows from Theorem 3.1 with ξ replaced with f . , so recalling Rn K( z )dn z = 1, the lemma Lemma 4.2. If in addition to hm → 0, we have mhn → ∞ as m → ∞, then, Var[fm (p)] = m O 1 mhn m , as m → ∞. Proof. Var[fm (p)] = 1 1 Var n K m hm up (q) hm (22) (23) Now, Var 1 K hn m up (q) hm =E 1 K2 h2n m up (q) hm 1 hn m f (q) − E 1 K hn m 2 up (q) hm , (24) and, E 1 K2 h2n m up (q) hm = M 1 2 K hn m up (q) hm dV (q) . (25) By Theorem 3.1, the integral in (25) converges to f (p) K 2 ( z )dn z, so, the right hand side of (25) is O 1 hn m ˆ Var[fm (p)] = O as m → ∞. By Lemma 4.1 we have, E 1 mhn m 1 hn K m up (q) hm 2 → f 2 (p). Thus, as m → ∞. ˆ ˆ ˆ Proof of Theorem 2.1 Finally, since MSE fm (p) = Bias2 [fm (p)] + Var[fm (p)], the theorem follows from Lemma 4.1 and 4.2. 5 Experiments and discussion We have empirically tested the estimator (3) on two datasets: A unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x2 + y 2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, 1 − x2 + y 2 ), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m−1/(n+4) where m is the number of data points. We performed live-one-out estimates of the density on the data points, and obtained the MSE for a range of ms. See Figure 5. 6 Conclusion and future work We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of RN , and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M , the density f , and the kernel K. We find it likely that our estimator or slight modifications of it will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold. 7 MSE MSE Mean squared error for the hemisphere data Mean squared error for the spiral data 0.000175 0.0008 0.00015 0.000125 0.0006 0.0001 0.000075 0.0004 0.00005 0.0002 0.000025 # of data points 50000 100000 150000 200000 # of data points 50000 100000 150000 200000 Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data. In each case, we fit a curve of the form M SE(m) = amb , which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by Cm−4/(n+4) , which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere. References [1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006. [2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM New York, NY, USA, 2006. [3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966. [4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990. [5] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008. [6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity . IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001. [7] J. Lee. Riemannian manifolds: an introduction to curvature. Springer Verlag, 1997. [8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962. [9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. [10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999. [11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006. [12] B. Silverman. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1986. [13] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in Neural Information Processing Systems, pages 849–856, 2003. 8

3 0.11784661 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

4 0.11176599 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

5 0.10136647 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1

6 0.091394708 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

7 0.091025479 71 nips-2009-Distribution-Calibrated Hierarchical Classification

8 0.090787202 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

9 0.085942566 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

10 0.08461985 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

11 0.080996923 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

12 0.079605103 87 nips-2009-Exponential Family Graph Matching and Ranking

13 0.078717113 95 nips-2009-Fast subtree kernels on graphs

14 0.077456512 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

15 0.075681247 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

16 0.072413899 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

17 0.071853846 147 nips-2009-Matrix Completion from Noisy Entries

18 0.069146372 223 nips-2009-Sparse Metric Learning via Smooth Optimization

19 0.067690969 72 nips-2009-Distribution Matching for Transduction

20 0.067208618 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.205), (1, 0.165), (2, -0.018), (3, 0.125), (4, -0.025), (5, -0.062), (6, -0.019), (7, -0.046), (8, 0.04), (9, 0.01), (10, 0.105), (11, -0.045), (12, 0.043), (13, -0.027), (14, 0.069), (15, 0.0), (16, -0.025), (17, -0.039), (18, 0.08), (19, 0.057), (20, -0.029), (21, 0.116), (22, 0.004), (23, -0.074), (24, -0.01), (25, 0.04), (26, 0.0), (27, 0.055), (28, -0.002), (29, 0.042), (30, 0.028), (31, 0.097), (32, 0.052), (33, 0.129), (34, -0.155), (35, -0.019), (36, 0.097), (37, 0.05), (38, 0.137), (39, 0.045), (40, -0.122), (41, 0.159), (42, 0.005), (43, 0.13), (44, 0.012), (45, -0.071), (46, 0.041), (47, -0.022), (48, -0.098), (49, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94965684 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1

2 0.86531675 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . , qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . Since K( up (q) h ) = 0 for all q with dp (q) > Rp (h), we have, ξh (p) = 1 hn K y ≤Rp (h) up (y) h ξ(y) g(y)dy 1 . . . dy n , (15) 4 Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p, and gij (p) = δij [1]. Normal coordinates can be defined in a “geodesic ball” of radius less than rinj . 5 where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get, K( z )dn z = ξh (p) − ξ(p) up (zh) h K z ≤Rp (h)/h up (zh) h K = z ≤1 ξ(zh) K z ≤1 z ≤1 g(zh) − 1 dn z + ξ(zh) up (zh) h K( z )dn z g(zh)dn z − ξ(0) ξ(zh) − K( z ) dn z + K( z ) (ξ(zh) − ξ(0)) dn z + z ≤1 K 1< z ≤Rp (h)/h up (zh) h ξ(zh) g(zh)dn z . Thus, K ( z ) dn z ≤ ξh (p) − ξ(p) (16) t∈R z ≤1 sup |ξ(zh)| . sup K( z ≤1 z ≤1 dn z + g(zh) − 1 . sup K(t) . sup |ξ(zh)| . sup z ≤1 (17) z ≤1 up (zh) ) − K( z ) . h dn z + (18) z ≤1 K( z )(ξ(zh) − ξ(0))dn z + (19) z ≤1 sup K(t) . t∈R sup g(zh) . 1< z ≤Rp (h)/h sup dn z . (20) |ξ(zh)| . 1< z ≤Rp (h)/h 1< z ≤Rp (h)/h Letting h → 0, the terms (17)-(20) approach zero at the following rates: (17): K(t) is bounded and ξ(y) is continuous at y = 0, so the first two terms can be bounded by constants as h → 0. In normal coordinates y, gij (y) = δij + O( y 2 ) as y → 0, so, sup z ≤1 g(zh) − 1 = O(h2 ) as h → 0. (18): Since K is assumed to be differentiable with a bounded derivative in [0, 1), we get K(b) − u (zh) K(a) = O(b − a) as b → a. By Lemma 3.1 we have p h − z = O(h2 ) as h → 0. Thus, K up (zh) h − K( z ) = O(h2 ) as h → 0. (19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for z ≤ 1, Taylor’s theorem gives, n zi ξ(zh) = ξ(0) + h i=1 as h → 0. Since h → 0. z ≤1 zK( z )dn z = 0, we get ∂ξ(y) ∂y i z ≤1 + O(h2 ) (21) y=0 K( z )(ξ(zh) − ξ(0))dn z = O(h2 ) as (20): The first three terms can be bounded by constants. By Lemma 3.3, Rp (h) = h + O(h3 ) as h → 0. A spherical shell 1 < z ≤ 1 + ǫ has volume O(ǫ) as ǫ → 0+ . Thus, the volume of 1 < z ≤ Rp (h)/h is O(Rp (h)/h − 1) = O(h2 ) as h → 0. Thus, the sum of the terms (17-20), is O(h2 ) as h → 0, as claimed in Theorem 3.1. 6 4 Bias, variance and mean squared error ˆ Let M , f , fm , K, p be as in Theorem 2.1 and assume hm → 0 as m → ∞. ˆ Lemma 4.1. Bias fm (p) = O(h2 ), as m → ∞. m u (q) Proof. We have Bias[fm (p)] = Bias h1 K ph m follows from Theorem 3.1 with ξ replaced with f . , so recalling Rn K( z )dn z = 1, the lemma Lemma 4.2. If in addition to hm → 0, we have mhn → ∞ as m → ∞, then, Var[fm (p)] = m O 1 mhn m , as m → ∞. Proof. Var[fm (p)] = 1 1 Var n K m hm up (q) hm (22) (23) Now, Var 1 K hn m up (q) hm =E 1 K2 h2n m up (q) hm 1 hn m f (q) − E 1 K hn m 2 up (q) hm , (24) and, E 1 K2 h2n m up (q) hm = M 1 2 K hn m up (q) hm dV (q) . (25) By Theorem 3.1, the integral in (25) converges to f (p) K 2 ( z )dn z, so, the right hand side of (25) is O 1 hn m ˆ Var[fm (p)] = O as m → ∞. By Lemma 4.1 we have, E 1 mhn m 1 hn K m up (q) hm 2 → f 2 (p). Thus, as m → ∞. ˆ ˆ ˆ Proof of Theorem 2.1 Finally, since MSE fm (p) = Bias2 [fm (p)] + Var[fm (p)], the theorem follows from Lemma 4.1 and 4.2. 5 Experiments and discussion We have empirically tested the estimator (3) on two datasets: A unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x2 + y 2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, 1 − x2 + y 2 ), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m−1/(n+4) where m is the number of data points. We performed live-one-out estimates of the density on the data points, and obtained the MSE for a range of ms. See Figure 5. 6 Conclusion and future work We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of RN , and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M , the density f , and the kernel K. We find it likely that our estimator or slight modifications of it will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold. 7 MSE MSE Mean squared error for the hemisphere data Mean squared error for the spiral data 0.000175 0.0008 0.00015 0.000125 0.0006 0.0001 0.000075 0.0004 0.00005 0.0002 0.000025 # of data points 50000 100000 150000 200000 # of data points 50000 100000 150000 200000 Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data. In each case, we fit a curve of the form M SE(m) = amb , which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by Cm−4/(n+4) , which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere. References [1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006. [2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM New York, NY, USA, 2006. [3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966. [4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990. [5] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008. [6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity . IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001. [7] J. Lee. Riemannian manifolds: an introduction to curvature. Springer Verlag, 1997. [8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962. [9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. [10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999. [11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006. [12] B. Silverman. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1986. [13] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in Neural Information Processing Systems, pages 849–856, 2003. 8

3 0.61156785 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

Author: Samory Kpotufe

Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1

4 0.56944317 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

Author: Xiaolin Yang, Seyoung Kim, Eric P. Xing

Abstract: Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their input-output mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a highdimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as L1 /L∞ or L1 /L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks. 1

5 0.50988901 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

6 0.50647104 71 nips-2009-Distribution-Calibrated Hierarchical Classification

7 0.50541717 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

8 0.50506514 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

9 0.50377107 94 nips-2009-Fast Learning from Non-i.i.d. Observations

10 0.49068344 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

11 0.44438881 124 nips-2009-Lattice Regression

12 0.41264045 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

13 0.40965247 55 nips-2009-Compressed Least-Squares Regression

14 0.40635023 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

15 0.40617374 223 nips-2009-Sparse Metric Learning via Smooth Optimization

16 0.40446883 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

17 0.39679584 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

18 0.3957856 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

19 0.39571494 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

20 0.39519265 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (24, 0.079), (25, 0.08), (35, 0.035), (36, 0.162), (39, 0.051), (58, 0.104), (61, 0.054), (71, 0.049), (83, 0.219), (86, 0.055), (91, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86769342 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

Author: Shuai Huang, Jing Li, Liang Sun, Jun Liu, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman, Jieping Ye

Abstract: Recent advances in neuroimaging techniques provide great potentials for effective diagnosis of Alzheimer’s disease (AD), the most common form of dementia. Previous studies have shown that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions. In this paper, we consider the problem of learning functional brain connectivity from neuroimaging, which holds great promise for identifying image-based markers used to distinguish Normal Controls (NC), patients with Mild Cognitive Impairment (MCI), and patients with AD. More specifically, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. In particular, we apply SICE to learn and analyze functional brain connectivity patterns from different subject groups, based on a key property of SICE, called the “monotone property” we established in this paper. Our experimental results on neuroimaging PET data of 42 AD, 116 MCI, and 67 NC subjects reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 1 In trod u cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. It is the most common form of dementia and currently affects over five million Americans; this number will grow to as many as 14 million by year 2050. The current knowledge about the cause of AD is very limited; clinical diagnosis is imprecise with definite diagnosis only possible by autopsy; also, there is currently no cure for AD, while most drugs only alleviate the symptoms. To tackle these challenging issues, the rapidly advancing neuroimaging techniques provide great potentials. These techniques, such as MRI, PET, and fMRI, produce data (images) of brain structure and function, making it possible to identify the difference between AD and normal brains. Recent studies have demonstrated that neuroimaging data provide more sensitive and consistent measures of AD onset and progression than conventional clinical assessment and neuropsychological tests [1]. Recent studies have found that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions [ 2]-[3]. Specifically, it has been shown that functional connectivity substantially decreases between the hippocampus and other regions of AD brains [3]-[4]. Also, some studies have found increased connectivity between the regions in the frontal lobe [ 6]-[7]. Learning functional brain connectivity from neuroimaging data holds great promise for identifying image-based markers used to distinguish among AD, MCI (Mild Cognitive Impairment), and normal aging. Note that MCI is a transition stage from normal aging to AD. Understanding and precise diagnosis of MCI have significant clinical value since it can serve as an early warning sign of AD. Despite all these, existing research in functional brain connectivity modeling suffers from limitations. A large body of functional connectivity modeling has been based on correlation analysis [2]-[3], [5]. However, correlation only captures pairwise information and fails to provide a complete account for the interaction of many (more than two) brain regions. Other multivariate statistical methods have also been used, such as Principle Component Analysis (PCA) [8], PCA-based Scaled Subprofile Model [9], Independent Component Analysis [10]-[11], and Partial Least Squares [12]-[13], which group brain regions into latent components. The brain regions within each component are believed to have strong connectivity, while the connectivity between components is weak. One major drawback of these methods is that the latent components may not correspond to any biological entities, causing difficulty in interpretation. In addition, graphical models have been used to study brain connectivity, such as structural equation models [14]-[15], dynamic causal models [16], and Granger causality. However, most of these approaches are confirmative, rather than exploratory, in the sense that they require a prior model of brain connectivity to begin with. This makes them inadequate for studying AD brain connectivity, because there is little prior knowledge about which regions should be involved and how they are connected. This makes exploratory models highly desirable. In this paper, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. Inverse covariance matrix has a clear interpretation that the off-diagonal elements correspond to partial correlations, i.e., the correlation between each pair of brain regions given all other regions. This provides a much better model for brain connectivity than simple correlation analysis which models each pair of regions without considering other regions. Also, imposing sparsity on the inverse covariance estimation ensures a reliable brain connectivity to be modeled with limited sample size, which is usually the case in AD studies since clinical samples are difficult to obtain. From a domain perspective, imposing sparsity is also valid because neurological findings have demonstrated that a brain region usually only directly interacts with a few other brain regions in neurological processes [ 2]-[3]. Various algorithms for achieving SICE have been developed in recent year [ 17]-[22]. In addition, SICE has been used in various applications [17], [21], [23]-[26]. In this paper, we apply SICE to learn functional brain connectivity from neuroimaging and analyze the difference among AD, MCI, and NC based on a key property of SICE, called the “monotone property” we established in this paper. Unlike the previous study which is based on a specific level of sparsity [26], the monotone property allows us to study the connectivity pattern using different levels of sparsity and obtain an order for the strength of connection between pairs of brain regions. In addition, we apply bootstrap hypothesis testing to assess the significance of the connection. Our experimental results on PET data of 42 AD, 116 MCI, and 67 NC subjects enrolled in the Alzheimer’s Disease Neuroimaging Initiative project reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 2 S ICE : B ack grou n d an d th e Mon oton e P rop erty An inverse covariance matrix can be represented graphically. If used to represent brain connectivity, the nodes are activated brain regions; existence of an arc between two nodes means that the two brain regions are closely related in the brain's functiona l process. Let be all the brain regions under study. We assume that follows a multivariate Gaussian distribution with mean and covariance matrix . Let be the inverse covariance matrix. Suppose we have samples (e.g., subjects with AD) for these brain regions. Note that we will only illustrate here the SICE for AD, whereas the SICE for MCI and NC can be achieved in a similar way. We can formulate the SICE into an optimization problem, i.e., (1) where is the sample covariance matrix; , , and denote the determinant, trace, and sum of the absolute values of all elements of a matrix, respectively. The part “ ” in (1) is the log-likelihood, whereas the part “ ” represents the “sparsity” of the inverse covariance matrix . (1) aims to achieve a tradeoff between the likelihood fit of the inverse covariance estimate and the sparsity. The tradeoff is controlled by , called the regularization parameter; larger will result in more sparse estimate for . The formulation in (1) follows the same line of the -norm regularization, which has been introduced into the least squares formulation to achieve model sparsity and the resulting model is called Lasso [27]. We employ the algorithm in [19] in this paper. Next, we show that with going from small to large, the resulting brain connectivity models have a monotone property. Before introducing the monotone property, the following definitions are needed. Definition: In the graphical representation of the inverse covariance, if node to by an arc, then is called a “neighbor” of . If is connected to chain of arcs, then is called a “connectivity component” of . is connected though some Intuitively, being neighbors means that two nodes (i.e., brain regions) are directly connected, whereas being connectivity components means that two brain regions are indirectly connected, i.e., the connection is mediated through other regions. In other words, not being connectivity components (i.e., two nodes completely separated in the graph) means that the two corresponding brain regions are completely independent of each other. Connectivity components have the following monotone property: Monotone property of SICE: Let components of with and and be the sets of all the connectivity , respectively. If , then . Intuitively, if two regions are connected (either directly or indirectly) at one level of sparseness ( ), they will be connected at all lower levels of sparseness ( ). Proof of the monotone property can be found in the supplementary file [29]. This monotone property can be used to identify how strongly connected each node (brain region) to its connectivity components. For example, assuming that and , this means that is more strongly connected to than . Thus, by changing from small to large, we can obtain an order for the strength of connection between pairs of brain regions. As will be shown in Section 3, this order is different among AD, MCI, and NC. 3 3.1 Ap p l i cati on i n B rai n Con n ecti vi ty M od el i n g of AD D a t a a c q u i s i t i o n a n d p re p ro c e s s i n g We apply SICE on FDG-PET images for 49 AD, 116 MCI, and 67 NC subjects downloaded from the ADNI website. We apply Automated Anatomical Labeling (AAL) [28] to extract data from each of the 116 anatomical volumes of interest (AVOI), and derived average of each AVOI for every subject. The AVOIs represent different regions of the whole brain. 3.2 B r a i n c o n n e c t i v i t y mo d e l i n g b y S I C E 42 AVOIs are selected for brain connectivity modeling, as they are considered to be potentially related to AD. These regions distribute in the frontal, parietal, occipital, and temporal lobes. Table 1 list of the names of the AVOIs with their corresponding lobes. The number before each AVOI is used to index the node in the connectivity models. We apply the SICE algorithm to learn one connectivity model for AD, one for MCI, and one for NC, for a given . With different ’s, the resulting connectivity models hold a monotone property, which can help obtain an order for the strength of connection between brain regions. To show the order clearly, we develop a tree-like plot in Fig. 1, which is for the AD group. To generate this plot, we start at a very small value (i.e., the right-most of the horizontal axis), which results in a fully-connected connectivity model. A fully-connected connectivity model is one that contains no region disconnected with the rest of the brain. Then, we decrease by small steps and record the order of the regions disconnected with the rest of the brain regions. Table 1: Names of the AVOIs for connectivity modeling (“L” means that the brain region is located at the left hemisphere; “R” means right hemisphere.) Frontal lobe Parietal lobe Occipital lobe Temporal lobe 1 Frontal_Sup_L 13 Parietal_Sup_L 21 Occipital_Sup_L 27 T emporal_Sup_L 2 Frontal_Sup_R 14 Parietal_Sup_R 22 Occipital_Sup_R 28 T emporal_Sup_R 3 Frontal_Mid_L 15 Parietal_Inf_L 23 Occipital_Mid_L 29 T emporal_Pole_Sup_L 4 Frontal_Mid_R 16 Parietal_Inf_R 24 Occipital_Mid_R 30 T emporal_Pole_Sup_R 5 Frontal_Sup_Medial_L 17 Precuneus_L 25 Occipital_Inf_L 31 T emporal_Mid_L 6 Frontal_Sup_Medial_R 18 Precuneus_R 26 Occipital_Inf_R 32 T emporal_Mid_R 7 Frontal_Mid_Orb_L 19 Cingulum_Post_L 33 T emporal_Pole_Mid_L 8 Frontal_Mid_Orb_R 20 Cingulum_Post_R 34 T emporal_Pole_Mid_R 9 Rectus_L 35 T emporal_Inf_L 8301 10 Rectus_R 36 T emporal_Inf_R 8302 11 Cingulum_Ant_L 37 Fusiform_L 12 Cingulum_Ant_R 38 Fusiform_R 39 Hippocampus_L 40 Hippocampus_R 41 ParaHippocampal_L 42 ParaHippocampal_R For example, in Fig. 1, as decreases below (but still above ), region “Tempora_Sup_L” is the first one becoming disconnected from the rest of the brain. As decreases below (but still above ), the rest of the brain further divides into three disconnected clusters, including the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, the cluster of “Fusiform_R” up to “Hippocampus_L”, and the cluster of the other regions. As continuously decreases, each current cluster will split into smaller clusters; eventually, when reaches a very large value, there will be no arc in the IC model, i.e., each region is now a cluster of itself and the split will stop. The sequence of the splitting gives an order for the strength of connection between brain regions. Specifically, the earlier (i.e., smaller ) a region or a cluster of regions becomes disconnected from the rest of the brain, the weaker it is connected with the rest of the brain. For example, in Fig. 1, it can be known that “Tempora_Sup_L” may be the weakest region in the brain network of AD; the second weakest ones are the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, and the cluster of “Fusiform_R” up to “Hippocampus_L”. It is very interesting to see that the weakest and second weakest brain regions in the brain network include “Cingulum_Post_R” and “Cingulum_Post_L” as well as regions all in the temporal lobe, all of which have been found to be affected by AD early and severely [3]-[5]. Next, to facilitate the comparison between AD and NC, a tree-like plot is also constructed for NC, as shown in Fig. 2. By comparing the plots for AD and NC, we can observe the following two distinct phenomena: First, in AD, between-lobe connectivity tends to be weaker than within-lobe connectivity. This can be seen from Fig. 1 which shows a clear pattern that the lobes become disconnected with each other before the regions within each lobe become disconnected with each other, as goes from small to large. This pattern does not show in Fig. 2 for NC. Second, the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. This can be seen from Fig. 2 for NC, in which the same brain regions in the left and right hemisphere are still connected even at a very large for NC. However, this pattern does not show in Fig. 1 for AD. Furthermore, a tree-like plot is also constructed for MCI (Fig. 3), and compared with the plots for AD and NC. In terms of the two phenomena discussed previously, MCI shows similar patterns to AD, but these patterns are not as distinct from NC as AD. Specifically, in terms of the first phenomenon, MCI also shows weaker between-lobe connectivity than within-lobe connectivity, which is similar to AD. However, the degree of weakerness is not as distinctive as AD. For example, a few regions in the temporal lobe of MCI, including “Temporal_Mid_R” and “Temporal_Sup_R”, appear to be more strongly connected with the occipital lobe than with other regions in the temporal lobe. In terms of the second phenomenon, MCI also shows weaker between-hemisphere connectivity in the same brain region than NC. However, the degree of weakerness is not as distinctive as AD. For example, several left-right pairs of the same brain regions are still connected even at a very large , such as “Rectus_R” and “Rectus_L”, “Frontal_Mid_Orb_R” and “Frontal_Mid_Orb _L”, “Parietal_Sup_R” and “Parietal_Sup_L”, as well as “Precuneus_R” and “Precuneus_L”. All above findings are consistent with the knowledge that MCI is a transition stage between normal aging and AD. Large λ λ3 λ2 λ1 Small λ Fig 1: Order for the strength of connection between brain regions of AD Large λ Small λ Fig 2: Order for the strength of connection between brain regions of NC Fig 3: Order for the strength of connection between brain regions of MCI Furthermore, we would like to compare how within-lobe and between-lobe connectivity is different across AD, MCI, and NC. To achieve this, we first learn one connectivity model for AD, one for MCI, and one for NC. We adjust the in the learning of each model such that the three models, corresponding to AD, MCI, and NC, respectively, will have the same total number of arcs. This is to “normalize” the models, so that the comparison will be more focused on how the arcs distribute differently across different models. By selecting different values for the total number of arcs, we can obtain models representing the brain connectivity at different levels of strength. Specifically, given a small value for the total number of arcs, only strong arcs will show up in the resulting connectivity model, so the model is a model of strong brain connectivity; when increasing the total number of arcs, mild arcs will also show up in the resulting connectivity model, so the model is a model of mild and strong brain connectivity. For example, Fig. 4 shows the connectivity models for AD, MCI, and NC with the total number of arcs equal to 50 (Fig. 4(a)), 120 (Fig. 4(b)), and 180 (Fig. 4(c)). In this paper, we use a “matrix” representation for the SICE of a connectivity model. In the matrix, each row represents one node and each column also represents one node. Please see Table 1 for the correspondence between the numbering of the nodes and the brain region each number represents. The matrix contains black and white cells: a black cell at the -th row, -th column of the matrix represents existence of an arc between nodes and in the SICE-based connectivity model, whereas a white cell represents absence of an arc. According to this definition, the total number of black cells in the matrix is equal to twice the total number of arcs in the SICE-based connectivity model. Moreover, on each matrix, four red cubes are used to highlight the brain regions in each of the four lobes; that is, from top-left to bottom-right, the red cubes highlight the frontal, parietal, occipital, and temporal lobes, respectively. The black cells inside each red cube reflect within-lobe connectivity, whereas the black cells outside the cubes reflect between-lobe connectivity. While the connectivity models in Fig. 4 clearly show some connectivity difference between AD, MCI, and NC, it is highly desirable to test if the observed difference is statistically significant. Therefore, we further perform a hypothesis testing and the results are summarized in Table 2. Specifically, a P-value is recorded in the sub-table if it is smaller than 0.1, such a P-value is further highlighted if it is even smaller than 0.05; a “---” indicates that the corresponding test is not significant (P-value>0.1). We can observe from Fig. 4 and Table 2: Within-lobe connectivity: The temporal lobe of AD has significantly less connectivity than NC. This is true across different strength levels (e.g., strong, mild, and weak) of the connectivity; in other words, even the connectivity between some strongly-connected brain regions in the temporal lobe may be disrupted by AD. In particular, it is clearly from Fig. 4(b) that the regions “Hippocampus” and “ParaHippocampal” (numbered by 39-42, located at the right-bottom corner of Fig. 4(b)) are much more separated from other regions in AD than in NC. The decrease in connectivity in the temporal lobe of AD, especially between the Hippocampus and other regions, has been extensively reported in the literature [3]-[5]. Furthermore, the temporal lobe of MCI does not show a significant decrease in connectivity, compared with NC. This may be because MCI does not disrupt the temporal lobe as badly as AD. AD MCI NC Fig 4(a): SICE-based brain connectivity models (total number of arcs equal to 50) AD MCI NC Fig 4(b): SICE-based brain connectivity models (total number of arcs equal to 120) AD MCI NC Fig 4(c): SICE-based brain connectivity models (total number of arcs equal to 180) The frontal lobe of AD has significantly more connectivity than NC, which is true across different strength levels of the connectivity. This has been interpreted as compensatory reallocation or recruitment of cognitive resources [6]-[7]. Because the regions in the frontal lobe are typically affected later in the course of AD (our data are early AD), the increased connectivity in the frontal lobe may help preserve some cognitive functions in AD patients. Furthermore, the frontal lobe of MCI does not show a significant increase in connectivity, compared with NC. This indicates that the compensatory effect in MCI brain may not be as strong as that in AD brains. Table 2: P-values from the statistical significance test of connectivity difference among AD, MCI, and NC (a) Total number of arcs = 50 (b) Total number of arcs = 120 (c) Total number of arcs = 180 There is no significant difference among AD, MCI, and NC in terms of the connectivity within the parietal lobe and within the occipital lobe. Another interesting finding is that all the P-values in the third sub-table of Table 2(a) are insignificant. This implies that distribution of the strong connectivity within and between lobes for MCI is very similar to NC; in other words, MCI has not been able to disrupt the strong connectivity among brain regions (it disrupts some mild and weak connectivity though). Between-lobe connectivity: In general, human brains tend to have less between-lobe connectivity than within-lobe connectivity. A majority of the strong connectivity occurs within lobes, but rarely between lobes. These can be clearly seen from Fig. 4 (especially Fig. 4(a)) in which there are much more black cells along the diagonal direction than the off-diagonal direction, regardless of AD, MCI, and NC. The connectivity between the parietal and occipital lobes of AD is significantly more than NC which is true especially for mild and weak connectivity. The increased connectivity between the parietal and occipital lobes of AD has been previously reported in [3]. It is also interpreted as a compensatory effect in [6]-[7]. Furthermore, MCI also shows increased connectivity between the parietal and occipital lobes, compared with NC, but the increase is not as significant as AD. While the connectivity between the frontal and occipital lobes shows little difference between AD and NC, such connectivity for MCI shows a significant decrease especially for mild and weak connectivity. Also, AD may have less temporal-occipital connectivity, less frontal-parietal connectivity, but more parietal-temporal connectivity than NC. Between-hemisphere connectivity: Recall that we have observed from the tree-like plots in Figs. 3 and 4 that the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. It is desirable to test if this observed difference is statistically significant. To achieve this, we test the statistical significance of the difference among AD, MCI, and NC, in term of the number of connected same-region left-right pairs. Results show that when the total number of arcs in the connectivity models is equal to 120 or 90, none of the tests is significant. However, when the total number of arcs is equal to 50, the P-values of the tests for “AD vs. NC”, “AD vs. MCI”, and “MCI vs. NC” are 0.009, 0.004, and 0.315, respectively. We further perform tests for the total number of arcs equal to 30 and find the P-values to be 0. 0055, 0.053, and 0.158, respectively. These results indicate that AD disrupts the strong connectivity between the same regions of the left and right hemispheres, whereas this disruption is not significant in MCI. 4 Con cl u si on In the paper, we applied SICE to model functional brain connectivity of AD, MCI, and NC based on PET neuroimaging data, and analyze the patterns based on the monotone property of SICE. Our findings were consistent with the previous literature and also showed some new aspects that may suggest further investigation in brain connectivity research in the future. R e f e re n c e s [1] S. Molchan. (2005) The Alzheimer's disease neuroimaging initiative. Business Briefing: US Neurology Review, pp.30-32, 2005. [2] C.J. Stam, B.F. Jones, G. Nolte, M. Breakspear, and P. Scheltens. (2007) Small-world networks and functional connectivity in Alzheimer’s disease. Cerebral Corter 17:92-99. [3] K. Supekar, V. Menon, D. Rubin, M. Musen, M.D. Greicius. (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer's Disease. PLoS Comput Biol 4(6) 1-11. [4] K. Wang, M. Liang, L. Wang, L. Tian, X. Zhang, K. Li and T. Jiang. (2007) Altered Functional Connectivity in Early Alzheimer’s Disease: A Resting-State fMRI Study, Human Brain Mapping 28, 967978. [5] N.P. Azari, S.I. Rapoport, C.L. Grady, M.B. Schapiro, J.A. Salerno, A. Gonzales-Aviles. (1992) Patterns of interregional correlations of cerebral glucose metabolic rates in patients with dementia of the Alzheimer type. Neurodegeneration 1: 101–111. [6] R.L. Gould, B.Arroyo, R,G. Brown, A.M. Owen, E.T. Bullmore and R.J. Howard. (2006) Brain Mechanisms of Successful Compensation during Learning in Alzheimer Disease, Neurology 67, 1011-1017. [7] Y. Stern. (2006) Cognitive Reserve and Alzheimer Disease, Alzheimer Disease Associated Disorder 20, 69-74. [8] K.J. Friston. (1994) Functional and effective connectivity: A synthesis. Human Brain Mapping 2, 56-78. [9] G. Alexander, J. Moeller. (1994) Application of the Scaled Subprofile model: a statistical approach to the analysis of functional patterns in neuropsychiatric disorders: A principal component approach to modeling regional patterns of brain function in disease. Human Brain Mapping, 79-94. [10] V.D. Calhoun, T. Adali, G.D. Pearlson, J.J. Pekar. (2001) Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Hum.Brain Mapp. 13, 43-53. [11] V.D. Calhoun, T. Adali, J.J. Pekar, G.D. Pearlson. (2003) Latency (in)sensitive ICA. Group independent component analysis of fMRI data in the temporal frequency domain. Neuroimage. 20, 1661-1669. [12] A.R. McIntosh, F.L. Bookstein, J.V. Haxby, C.L. Grady. (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage. 3, 143-157. [13] K.J. Worsley, J.B. Poline, K.J. Friston, A.C. Evans. (1997) Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage. 6, 305-319. [14] E. Bullmore, B. Horwitz, G. Honey, M. Brammer, S. Williams, T. Sharma. (2000) How good is good enough in path analysis of fMRI data? NeuroImage 11, 289–301. [15] A.R. McIntosh, C.L. Grady, L.G. Ungerieider, J.V. Haxby, S.I. Rapoport, B. Horwitz. (1994) Network analysis of cortical visual pathways mapped with PET. J. Neurosci. 14 (2), 655–666. [16] K.J. Friston, L. Harrison, W. Penny. (2003) Dynamic causal modelling. Neuroimage 19, 1273-1302. [17] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. (2008) Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research 9:485516. [18] J. Dahl, L. Vandenberghe, and V. Roycowdhury. (2008) Covariance selection for nonchordal graphs via chordal embedding. Optimization Methods Software 23(4):501-520. [19] J. Friedman, T.astie, and R. Tibsirani. (2007) Spares inverse covariance estimation with the graphical lasso, Biostatistics 8(1):1-10. [20] J.Z. Huang, N. Liu, M. Pourahmadi, and L. Liu. (2006) Covariance matrix selection and estimation via penalized normal likelihood. Biometrika, 93(1):85-98. [21] H. Li and J. Gui. (2005) Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2):302-317. [22] Y. Lin. (2007) Model selection and estimation in the gaussian graphical model. Biometrika 94(1)19-35, 2007. [23] A. Dobra, C. Hans, B. Jones, J.R. Nevins, G. Yao, and M. West. (2004) Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90(1):196-212. [24] A. Berge, A.C. Jensen, and A.H.S. Solberg. (2007) Sparse inverse covariance estimates for hyperspectral image classification, Geoscience and Remote Sensing, IEEE Transactions on, 45(5):1399-1407. [25] J.A. Bilmes. (2000) Factored sparse inverse covariance matrices. In ICASSP:1009-1012. [26] L. Sun and et al. (2009) Mining Brain Region Connectivity for Alzheimer's Disease Study via Sparse Inverse Covariance Estimation. In KDD: 1335-1344. [27] R. Tibshirani. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58(1):267-288. [28] N. Tzourio-Mazoyer and et al. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single subject brain. Neuroimage 15:273-289. [29] Supplemental information for “Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data”. http://www.public.asu.edu/~jye02/Publications/AD-supplemental-NIPS09.pdf

same-paper 2 0.86112034 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1

3 0.7195912 180 nips-2009-On the Convergence of the Concave-Convex Procedure

Author: Gert R. Lanckriet, Bharath K. Sriperumbudur

Abstract: The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms like sparse support vector machines (SVMs), transductive SVMs, sparse principal component analysis, etc. Though widely used in many applications, the convergence behavior of CCCP has not gotten a lot of specific attention. Yuille and Rangarajan analyzed its convergence in their original paper, however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), its proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwill’s global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwill’s theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization, generalized alternating minimization, etc. In this paper, we provide a rigorous analysis of the convergence of CCCP by addressing these questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.

4 0.71948898 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

5 0.71855164 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

6 0.71635026 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

7 0.71324772 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

8 0.71268415 72 nips-2009-Distribution Matching for Transduction

9 0.71094656 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

10 0.70961905 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

11 0.70915848 71 nips-2009-Distribution-Calibrated Hierarchical Classification

12 0.70913047 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

13 0.7081641 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

14 0.70764792 129 nips-2009-Learning a Small Mixture of Trees

15 0.7074616 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

16 0.70638603 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

17 0.70619106 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

18 0.70476031 220 nips-2009-Slow Learners are Fast

19 0.70470589 97 nips-2009-Free energy score space

20 0.70277542 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors