nips nips2009 nips2009-179 knowledge-graph by maker-knowledge-mining

179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation


Source: pdf

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated by real-world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over a product of simplexes, which is not a well-studied case in the literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD-based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Motivated by real-world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). [sent-38, score-0.235]

2 It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. [sent-39, score-0.332]

3 The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. [sent-40, score-0.84]

4 While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. [sent-41, score-0.17]

5 Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD-based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort. [sent-43, score-0.505]

6 1 Introduction In this paper, the problem of Multiple Kernel Learning (MKL) is studied, where the given kernels are assumed to be grouped into distinct components and each component is crucial for the learning task at hand. [sent-44, score-0.332]

7 The focus of this paper is to study the formalism and algorithmics of a specific mixed-norm regularization based MKL formulation suited for such tasks. [sent-45, score-0.295]

8 The majority of the existing MKL literature has considered employing a block l1-norm regularization, leading to the selection of a few of the given kernels [8, 1, 16, 14, 20]. [sent-46, score-0.46]

9 Such formulations tend to select the “best” among the given kernels and consequently the decision functions tend to depend only on the selected kernel. [sent-47, score-0.274]

10 Recently, [17] extended the framework of MKL to the case where kernels are partitioned into groups and introduced a generic mixed-norm regularization based MKL formulation in order to handle groups of kernels. [sent-48, score-0.582]

11 The proposed MKL formulation hence employs l∞ regularization and promotes combinations of kernels at the component level. [sent-52, score-0.674]

12 Moreover it employs l1 regularization for promoting sparsity among kernels in each component. [sent-53, score-0.455]

13 The formulation studied here is motivated by real-world learning applications like object categorization where multiple feature representations need to be employed simultaneously for achieving good generalization. [sent-54, score-0.457]

14 Combining feature descriptors using the framework of Multiple Kernel Learning (MKL) [8] for object categorization has been a topic of interest for many recent studies [19, 13]. [sent-55, score-0.337]

15 For example, in the case of flower classification, feature descriptors for shape, color and texture need to be employed in order to achieve good visual discrimination as well as to handle significant within-class variation [12]. [sent-58, score-0.155]

16 A key finding of [12] is the following: in object categorization tasks, employing few of the feature descriptors or employing a canonical combination of them often leads to sub-optimal solutions. [sent-59, score-0.48]

17 Hence, in the framework of MKL, employing an l1 regularization, which is equivalent to selecting one of the given kernels, as well as employing an l2 regularization, which is equivalent to working with a canonical combination of the given kernels, may lead to sub-optimality. [sent-60, score-0.143]

18 This important finding clearly motivates the use of l∞ norm regularization for combining kernels generated from different feature descriptors and l1 norm regularization for selecting kernels generated from the same feature descriptor. [sent-61, score-0.892]

19 Hence, by grouping kernels generated from the same feature descriptor together and employing the new MKL formulation, classifiers which are potentially well-suited for object categorization tasks can be built. [sent-62, score-0.63]

20 Apart from the novel MKL formulation the main contribution of the paper is a highly efficient algorithm for solving it. [sent-63, score-0.213]

21 Since the formulation is an instance of a Second Order Cone Program (SOCP), it can be solved using generic interior point algorithms. [sent-64, score-0.2]

22 Also the generic wrapper approach proposed in [17] cannot be employed as it solves a non-convex variant of the proposed (convex) formulation. [sent-66, score-0.202]

23 The proposed algorithm employs mirror-descent [3, 2, 9] leading to extremely scalable solutions. [sent-67, score-0.102]

24 The feasibility set for the minimization problem tackled by Mirror-Descent (MD) turns out to be a direct product of simplexes, which is not a standard set-up discussed in the optimization literature. [sent-68, score-0.095]

25 We employ a weighted version of the entropy function as the prox-function in the auxiliary problem solved by MD at each iteration and justify its suitability for the case of direct product of simplexes. [sent-69, score-0.131]

26 The remainder of this paper is organized as follows: in section 2, details of the new MKL formulation and its dual are presented. [sent-72, score-0.201]

27 The mirror-descent based algorithm which efficiently solves the dual is presented in section 3. [sent-73, score-0.123]

28 In particular, the empirical findings are: a) the new MKL formulation is well-suited for object categorization tasks, and b) the MD-based algorithm scales better than state-of-the-art gradient descent methods (e.g. [sent-75, score-0.454]

29 simpleMKL) in solving the special case where the number of components (groups) of kernels is unity. [sent-77, score-0.301]

30 2 Mixed-norm based MKL Formulation This section presents the novel mixed-norm regularization based MKL formulation and its dual. [sent-78, score-0.26]

31 Suppose the given kernels are divided into n groups (components) and the j-th component has n_j kernels. [sent-87, score-0.644]

32 Let the feature-space mapping generated from the k-th kernel of the j-th component be φ_jk(·) and the corresponding gram-matrix of the training data points be K_jk. [sent-88, score-0.192]

33 The aim is to learn a classifier of the form Σ_{j=1}^{n} Σ_{k=1}^{n_j} w_jk^T φ_jk(x) − b = 0. [sent-90, score-0.18]

34 As discussed above, we wish to perform a block l∞ regularization over the model parameters wjk associated with distinct components and l1 regularization for those associated with the same component. [sent-91, score-0.477]

35 Intuitively, such a regularization promotes combinations of kernels belonging to different components and selections among kernels of the same component. [sent-92, score-0.702]

36 Following the framework of MKL and the mixed-norm regularization detailed here, the following formulation is immediate: min_{w_jk, b, ξ_i} (1/2) max_j (Σ_{k=1}^{n_j} ||w_jk||)² + C Σ_i ξ_i, s.t. y_i (Σ_{j=1}^{n} Σ_{k=1}^{n_j} w_jk^T φ_jk(x_i) − b) ≥ 1 − ξ_i, ξ_i ≥ 0 ∀ i (1) [sent-93, score-0.294]

37 MKL formulation in (1) is convex and moreover an instance of SOCP. [sent-96, score-0.17]

38 This formulation can also be realized as a limiting case of the generic CAP formulation presented in [17] (with γ = 1, γ0 → ∞). [sent-97, score-0.342]

39 Moreover, the generic wrapper approach of [17] is inappropriate for solving this limiting case as that approach would solve a non-convex variant of this (convex) formulation. [sent-99, score-0.136]

40 Introduce variables λ_j ≡ [λ_j1, ..., λ_jn_j] ∈ ∆_{n_j} for each component j and re-write (1) as follows: [sent-105, score-0.218]

41 min_{w_jk, b, ξ_i} (1/2) max_j min_{λ_j ∈ ∆_{n_j}} Σ_{k=1}^{n_j} ||w_jk||²/λ_jk + C Σ_i ξ_i, s.t. y_i (Σ_{j=1}^{n} Σ_{k=1}^{n_j} w_jk^T φ_jk(x_i) − b) ≥ 1 − ξ_i, ξ_i ≥ 0 ∀ i (2) This is because, for any vector [a_1, ..., a_d], (Σ_k |a_k|)² = min_{λ ∈ ∆_d} Σ_k a_k²/λ_k. [sent-107, score-1.158]
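For completeness, here is the standard Cauchy-Schwarz argument behind the identity just quoted; it is not spelled out in the extracted text, but it is the step that links (1) and (2):

    \left(\sum_k |a_k|\right)^2
      = \left(\sum_k \frac{|a_k|}{\sqrt{\lambda_k}}\,\sqrt{\lambda_k}\right)^2
      \le \left(\sum_k \frac{a_k^2}{\lambda_k}\right)\left(\sum_k \lambda_k\right)
      = \sum_k \frac{a_k^2}{\lambda_k}
      \qquad \text{for any } \lambda \in \Delta_d,

with equality iff λ_k ∝ |a_k|, i.e. λ_k* = |a_k| / Σ_l |a_l|. Applying this with a_k = ||w_jk|| inside each component turns the squared l1 norm of (1) into the weighted form appearing in (2).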

42 To see this, rewrite max_j min_{λ_j ∈ ∆_{n_j}} Σ_{k=1}^{n_j} ||w_jk||²/λ_jk as min_t t with constraints min_{λ_j ∈ ∆_{n_j}} Σ_{k=1}^{n_j} ||w_jk||²/λ_jk ≤ t ∀ j, where t is a new decision variable. [sent-112, score-0.145]

43 This problem is feasible in both the λ_j's and t, and hence we can drop the minimization over λ_j in the individual constraints to obtain an equivalent problem: min_{λ_j ∈ ∆_{n_j} ∀ j, t} t subject to Σ_{k=1}^{n_j} ||w_jk||²/λ_jk ≤ t ∀ j. [sent-113, score-0.347]

44 One can now eliminate t by reintroducing the max_j and interchange the min_{λ_j ∈ ∆_{n_j} ∀ j} with the other variables to obtain: [sent-114, score-0.693]

45 min_{λ_j ∈ ∆_{n_j} ∀ j} min_{w_jk, b, ξ_i} (1/2) max_j Σ_{k=1}^{n_j} ||w_jk||²/λ_jk + C Σ_i ξ_i, s.t. y_i (Σ_{j=1}^{n} Σ_{k=1}^{n_j} w_jk^T φ_jk(x_i) − b) ≥ 1 − ξ_i, ξ_i ≥ 0 ∀ i (3) Now one can derive the standard dual of (3) wrt. [sent-116, score-1.218]

46 the variables w_jk, b, ξ_i alone, leading to: min_{λ_j ∈ ∆_{n_j} ∀ j} max_{α ∈ S_m(C), γ ∈ ∆_n} 1^T α − (1/2) α^T (Σ_{j=1}^{n} (Σ_{k=1}^{n_j} λ_jk Q_jk)/γ_j) α (4) where α, γ are Lagrange multipliers, S_m(C) ≡ {x ∈ R^m | 0 ≤ x ≤ C1, Σ_i x_i y_i = 0} and Q_jk ≡ Y K_jk Y. [sent-117, score-0.541]

47 In other words, (4) is equivalent to the well-known SVM [18] formulation with kernel K_eff ≡ Σ_{j=1}^{n} (1/γ_j*) Σ_{k=1}^{n_j} λ_jk* K_jk, where 1/γ_j* is the weight given to the j-th component and λ_jk* is the weight given to the k-th kernel of the j-th component. [sent-119, score-0.468]
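To make this concrete, here is a minimal sketch (illustrative code, not taken from the paper; the names and conventions are assumptions) of assembling the effective kernel K_eff from the given gram matrices once γ* and λ* are available:

    import numpy as np

    def effective_kernel(K, gamma, lam):
        """Assemble K_eff = sum_j (1/gamma[j]) * sum_k lam[j][k] * K[j][k].

        K:     list over components; K[j] is a list of (m x m) gram matrices K_jk
        gamma: length-n array on the simplex (component weights)
        lam:   list of arrays; lam[j] lies on the simplex of size n_j (kernel weights)
        """
        m = K[0][0].shape[0]
        K_eff = np.zeros((m, m))
        for j, (Kj, lj) in enumerate(zip(K, lam)):
            Kj_mix = sum(l * Kjk for l, Kjk in zip(lj, Kj))  # sparse mix within a component
            K_eff += Kj_mix / gamma[j]                       # combination across components
        return K_eff

The resulting matrix can be handed to any standard SVM solver that accepts a precomputed kernel.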

48 These facts readily justify the suitability of the particular mixed norm regularization for object categorization. [sent-126, score-0.336]

49 Indeed, in-sync with the findings of [12], kernels from different feature descriptors (components) are combined using non-trivial weights (i.e., every component receives a non-zero weight). [sent-127, score-0.328]

50 Moreover, only the “best” kernels from each feature descriptor (component) are utilized by the model. [sent-130, score-0.31]

51 In the following section an efficient iterative algorithm for solving the dual (4) is presented. [sent-132, score-0.163]

52 3 Efficient Algorithm for Solving the Dual This section presents an efficient algorithm for solving the dual (4). [sent-133, score-0.132]

53 Note that typically in object categorization or other such multi-modal learning tasks, the number of feature descriptors (i.e., the number of components n) is small. [sent-134, score-0.337]

54 However, the kernels constructed from each feature descriptor can be very high in number, i.e., the n_j can be large. [sent-137, score-0.31]

55 Using the minimax theorem [15], one can interchange the min over the λ_j's and the max over γ to obtain: max_{γ ∈ ∆_n} min_{λ_j ∈ ∆_{n_j} ∀ j} g_γ(λ_1, ..., λ_n), where g_γ(λ_1, ..., λ_n) ≡ max_{α ∈ S_m(C)} 1^T α − (1/2) α^T (Σ_{j=1}^{n} (Σ_{k=1}^{n_j} λ_jk Q_jk)/γ_j) α (5) [sent-145, score-0.469]

56 The proposed algorithm performs alternate minimization over the variables γ and (λ_1, ..., λ_n). [sent-149, score-0.121]

57 This leads to the following optimization problem: min_{γ ∈ ∆_n} Σ_{j=1}^{n} W_j/γ_j, where W_j = α^T (Σ_{k=1}^{n_j} λ_jk Q_jk) α. [sent-158, score-0.361]
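The extracted text does not show how this γ-subproblem is solved; one standard route (an assumption here, not a quotation from the paper) is the closed form obtained from the same Cauchy-Schwarz argument used earlier:

    \min_{\gamma \in \Delta_n} \sum_{j=1}^{n} \frac{W_j}{\gamma_j}
      = \Big(\sum_{j=1}^{n} \sqrt{W_j}\Big)^{2},
    \qquad \text{attained at} \quad
    \gamma_j^{\star} = \frac{\sqrt{W_j}}{\sum_{l=1}^{n} \sqrt{W_l}} ,

so, with λ and α held fixed, the γ update is available in closed form.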

58 In the following, we present a mirror-descent (MD) based algorithm which evaluates f(γ) ≡ min_{λ_j ∈ ∆_{n_j} ∀ j} g_γ(λ_1, ..., λ_n) to sufficient accuracy in O(log[max_j n_j]) · O(SVM_m) time, where O(SVM_m) denotes the cost of a single SVM solve with m training points. [sent-166, score-0.349]

59 The following points highlight the merits and de-merits of these two methods: • In the case of SD, the per-step auxiliary problem has no closed-form solution and projections onto the feasibility set need to be performed, which is computationally intensive, especially for problems with high dimensions. [sent-170, score-0.1]

60 s_t is a regularization parameter and also determines the step-size. [sent-191, score-0.157]

61 Intuitively (7) minimizes a weighted sum of the local linear approximation of the original objective and a regularization term that penalizes solutions far from the current iterate. [sent-193, score-0.119]

62 We choose the distance-generating function to be the following modified entropy function: ω(x) ≡ Σ_{j=1}^{n} Σ_{k=1}^{n_j} (x_jk n^{-1} + δ n^{-1} n_j^{-1}) log(x_jk n^{-1} + δ n^{-1} n_j^{-1}), where δ is a small positive number (say, 10^{-16}). [sent-197, score-1.039]

63 The gradient of g_γ can be computed using ∂g_γ/∂λ_jk |_{λ^(t)} = −(1/2) (α^(t))^T Q_jk α^(t) / γ_j, where α^(t) is the optimal α obtained by solving an SVM with kernel Σ_{j=1}^{n} (Σ_{k=1}^{n_j} λ_jk^(t) K_jk)/γ_j. [sent-202, score-0.109]

64 With this notation, it is easy to show that the optimal update (7) has the following analytical form: λ_jk^(t+1) = exp(−g̃_γ,jk^(t) n) / Σ_{k=1}^{n_j} exp(−g̃_γ,jk^(t) n) (8). The following text discusses the convergence issues with MD. [sent-203, score-0.932]
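As an illustration, the following sketch performs one entropic mirror-descent step over a product of simplexes in the generic multiplicative-update (exponentiated-gradient) form; it follows the standard MD recipe rather than reproducing the exact scaling of update (8), so the constants here are assumptions:

    import numpy as np

    def md_step(lam, grad, step):
        """One entropic mirror-descent step on a product of simplexes.

        lam:  list of 1-D arrays; lam[j] lies on the simplex of size n_j
        grad: list of 1-D arrays; grad[j][k] = d g_gamma / d lambda_jk at lam
        step: positive step-size s_t
        """
        new_lam = []
        for lj, gj in zip(lam, grad):
            z = -step * gj
            w = lj * np.exp(z - z.max())   # multiplicative update, shifted for stability
            new_lam.append(w / w.sum())    # re-normalize within each group separately
        return new_lam

Because each group is normalized on its own simplex, the update never leaves the feasibility set, which is what makes MD attractive here compared to projection-based methods.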

65 It is easy to verify that σ = O(1) n^{-2} and Θ = O(log[max_j n_j]) in our case. [sent-210, score-0.323]

66 The convergence and its efficiency follow from this result [3, 2, 9]: Result 1. With step-sizes s_t = √(Θσ) / (||g_γ'||_* √t), one has the following bound on the error after iteration T: ε_T ≡ min_{t≤T} g_γ(λ^(t)) − g_γ(λ*) ≤ O(1) √Θ L_{||·||}(g_γ) / √(σT), where ||·||_* is the dual norm of the norm with respect to which ω is strongly convex (with modulus σ) and L_{||·||}(g_γ) is the corresponding Lipschitz constant of g_γ. [sent-211, score-0.128]

67 Substituting the particular values for our case, we obtain s_t = √(log[max_j n_j]) / (n ||g_γ'||_∞ √t) (9), and T ∝ √(log[max_j n_j]) up to factors that do not depend on the number of kernels. [sent-214, score-0.684]

68 In other words, for reaching a reasonable approximation of the optimum, the number of iterations T required is O(log[max_j n_j]), which is nearly independent of the number of kernels. (Footnote 4: since the term involving δ is negligible compared to λ_jk, it is neglected in this computation.) [sent-215, score-0.323]

69 Note that the iterative algorithm can be improved by improving the algorithm for solving the SVM problem. [sent-218, score-0.129]

70 The MKL formulation presented here exploits the special structure in the kernels. Algorithm 1 (Mirror-descent based alternate minimization algorithm). Data: labels and gram-matrices of the training examples, [sent-220, score-0.484]

71 the component-id of each kernel, and the regularization parameter C. Result: optimal values of α, γ, λ in (4). Begin: set γ, λ to some initial feasible values. [sent-221, score-0.119]

72 Moreover, the proposed iterative algorithm solves the formulation with a per-step complexity of (almost) O(SVM_m), which is the same as that of traditional MKL formulations (which do not exploit this structure). [sent-223, score-0.314]
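Putting the pieces together, the following is a hedged sketch of the alternate optimization outlined in Algorithm 1. The choice of SVM solver (scikit-learn's SVC with a precomputed kernel), the 1/√t step-size schedule, the iteration counts, and the closed-form γ update are assumptions filled in from the surrounding text, not a transcription of the authors' implementation:

    import numpy as np
    from sklearn.svm import SVC  # any SVM solver accepting a precomputed kernel works

    def mixed_norm_mkl(K, y, C=10.0, T_outer=10, T_md=50):
        """Alternate optimization sketch for the dual (4).

        K: list over components; K[j] is a list of (m x m) gram matrices K_jk.
        y: array of +/-1 labels of length m.
        """
        n, m = len(K), len(y)
        lam = [np.ones(len(Kj)) / len(Kj) for Kj in K]   # uniform lambda_j on each simplex
        gamma = np.ones(n) / n                           # uniform gamma on the simplex
        for _ in range(T_outer):
            for t in range(1, T_md + 1):
                # effective kernel: sum_j (1/gamma_j) sum_k lambda_jk K_jk
                K_eff = sum(sum(l * Kjk for l, Kjk in zip(lam[j], K[j])) / gamma[j]
                            for j in range(n))
                svm = SVC(C=C, kernel='precomputed').fit(K_eff, y)
                ay = np.zeros(m)
                ay[svm.support_] = svm.dual_coef_.ravel()      # ay = alpha * y
                # gradient of g_gamma w.r.t. lambda_jk: -(1/2) ay' K_jk ay / gamma_j
                grad = [np.array([-0.5 * ay @ Kjk @ ay / gamma[j] for Kjk in K[j]])
                        for j in range(n)]
                # entropic mirror-descent step on each simplex (assumed 1/sqrt(t) schedule)
                for j in range(n):
                    z = -grad[j] / np.sqrt(t)
                    w = lam[j] * np.exp(z - z.max())
                    lam[j] = w / w.sum()
            # gamma step: argmin over the simplex of sum_j W_j / gamma_j (closed form)
            W = np.array([ay @ sum(l * Kjk for l, Kjk in zip(lam[j], K[j])) @ ay
                          for j in range(n)])
            gamma = np.sqrt(np.maximum(W, 1e-12))
            gamma /= gamma.sum()
        return gamma, lam

With this structure, the per-step cost is dominated by the single SVM solve on the effective kernel, which is consistent with the (almost) O(SVM_m) per-step complexity claimed above.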

73 4 Numerical Experiments This section presents results of experiments which empirically verify the major claims of the paper: a) The proposed formulation is well-suited for object categorization b) In the case n = 1, the proposed algorithm outperforms simpleMKL wrt. [sent-231, score-0.507]

74 In the following, the experiments done on real-world object categorization datasets are summarized. [sent-233, score-0.281]

75 The proposed MKL formulation is compared with state-of-the-art methodology for object categorization [19, 13] that employs a block l1 regularization based MKL formulation with additional constraints for including prior information regarding weights of kernels. [sent-234, score-0.734]

76 In all cases, the hyper-parameters of the various formulations were tuned using suitable cross-validation procedures and the accuracies reported denote testset accuracies achieved by the respective classifiers using the tuned set of hyper-parameters. [sent-242, score-0.213]

77 The Caltech-101 dataset has 101 categories of objects whereas Caltech-5 dataset is a subset of the Caltech-101 dataset including images of Airplanes, Car sides, Faces, Leopards and Motorbikes alone. [sent-271, score-0.195]

78 In order to make the results presented here comparable to others in literature we have followed the usual practice of generating training and test sets using a fixed number of pictures from each object category and repeating the experiments with different random selections of pictures. [sent-275, score-0.216]

79 For the Caltech-5, Caltech-101 and Oxford flowers datasets we have used 50, 15, 60 images per object category as training images and 50, 15, 20 images per object category as testing images respectively. [sent-276, score-0.501]

80 Also, in case of Caltech-5 and Oxford flowers datasets, the accuracies reported are the testset accuracies averaged over 10 such randomly sampled training and test datasets. [sent-277, score-0.188]

81 Since the Caltech-101 dataset has large number of classes and the experiments are computationally intensive (100 choose 2 classifiers need to be built in each case), the results are averaged over 3 sets of training and test datasets only. [sent-278, score-0.11]

82 Using each feature descriptor, nine kernels were generated by varying the width-parameter of the Gaussian kernel. [sent-281, score-0.258]

83 The kernels can be grouped based on the feature descriptor they were generated from and the proposed formulation can be employed to construct classifiers well-suited for object categorization. [sent-282, score-0.667]

84 In the case of the Caltech datasets, n = 5 and n_j = 9 ∀ j, and in the case of the Oxford flowers dataset, n = 7 and n_j = 9 ∀ j. [sent-284, score-0.646]
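For concreteness, a minimal sketch of how such a grouped kernel set might be assembled, assuming one precomputed pairwise squared-distance matrix per feature descriptor; the particular bandwidth grid below is illustrative, not the one used in the paper:

    import numpy as np

    def grouped_rbf_kernels(sq_dists, widths=tuple(2.0 ** k for k in range(-4, 5))):
        """Build K[j][k] = exp(-sq_dists[j] / (2 * widths[k]**2)).

        sq_dists: list of (m x m) pairwise squared-distance matrices, one per
                  feature descriptor, so n = len(sq_dists) components.
        widths:   bandwidth grid, so n_j = len(widths) kernels per component
                  (nine values here, mirroring the nine widths per descriptor).
        """
        return [[np.exp(-D / (2.0 * w ** 2)) for w in widths] for D in sq_dists]

The output has exactly the grouped structure assumed by the formulation: one component per feature descriptor, with the Gaussian-width variants of that descriptor as the kernels inside the component.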

85 Each plot shows the % gain in accuracy achieved by MixNorm-MKL over L1-MKL and L2-MKL for each object category. [sent-287, score-0.175]

86 For most object categories, the gains are positive and moreover quite high. [sent-303, score-0.116]

87 The gains in terms of numbers for the other two datasets are not as high merely because the baseline accuracies were themselves high. [sent-308, score-0.167]

88 The figures clearly show that the proposed formulation outperforms state-of-the-art object categorization techniques and is hence highly-suited for such tasks. [sent-315, score-0.453]

89 Another observation was that the average sparsity (% of kernels with zero weightages) with the methods MixNorm-MKL, L1-MKL and L2-MKL is 57%, 96% and 0% respectively. [sent-316, score-0.248]

90 Also, it was observed that L1-MKL almost always selected kernels from one or two components (feature descriptors) only, whereas MixNorm-MKL (and of course L2-MKL) selected kernels from all the components. [sent-317, score-0.477]

91 These observations clearly show that the proposed formulation combines important kernels while eliminating redundant and noisy kernels using the information embedded in the group structure of the kernels. [sent-318, score-0.612]

92 Note that in the special case, n = 1, the proposed formulation is exactly same as the l1 regularization based formulation. [sent-321, score-0.287]

93 Hence the mirror-descent based iterative algorithm proposed here can also be employed for solving l1 regularization based MKL. [sent-322, score-0.298]

94 Figure 2 shows plots of the training times as a function of number of kernels with the algorithms on two binary classification problems encountered in the object categorization experiments. [sent-323, score-0.48]

95 We think this is the crucial advantage of the proposed iterative algorithm over the gradient-descent based algorithms which were traditionally employed for solving the MKL formulations. [sent-327, score-0.204]

96 5 Conclusions This paper makes two important contributions: a) a specific mixed-norm regularization based MKL formulation which is well-suited for object categorization and multi-modal tasks b) An efficient mirror-descent based algorithm for solving the new formulation. [sent-328, score-0.593]

97 Empirical results on real-world datasets show that the new formulation achieves far better generalization than state-of-the-art object categorization techniques. [sent-329, score-0.422]

98 In some cases, the average gain in testset accuracy compared to state-of-the-art was as high as 37%. [sent-330, score-0.104]

99 The mirror-descent based algorithm presented in the paper not only solves the proposed formulation efficiently but also outperforms simpleMKL in solving the traditional l1 regularization based MKL. [sent-331, score-0.423]

100 Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. [sent-371, score-0.143]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mkl', 0.582), ('nj', 0.323), ('jk', 0.288), ('kernels', 0.222), ('md', 0.183), ('wjk', 0.18), ('formulation', 0.141), ('simplemkl', 0.14), ('owers', 0.122), ('regularization', 0.119), ('maxj', 0.117), ('object', 0.116), ('categorization', 0.115), ('kjk', 0.1), ('qjk', 0.1), ('oxford', 0.089), ('sd', 0.085), ('nilsback', 0.08), ('simplexes', 0.08), ('indian', 0.072), ('descriptors', 0.07), ('kernel', 0.063), ('feasibility', 0.063), ('categories', 0.062), ('ower', 0.06), ('arkadi', 0.06), ('kef', 0.06), ('svmm', 0.06), ('dual', 0.06), ('gain', 0.059), ('employing', 0.059), ('accuracies', 0.058), ('caltech', 0.057), ('svm', 0.053), ('descriptor', 0.052), ('formulations', 0.052), ('wj', 0.052), ('aharon', 0.052), ('datasets', 0.05), ('employed', 0.049), ('employs', 0.049), ('solving', 0.046), ('selections', 0.045), ('testset', 0.045), ('dinesh', 0.04), ('flowers', 0.04), ('mixnorm', 0.04), ('pnj', 0.04), ('suitability', 0.04), ('mirror', 0.039), ('promoting', 0.039), ('st', 0.038), ('min', 0.038), ('solves', 0.037), ('th', 0.037), ('auxiliary', 0.037), ('automation', 0.036), ('feature', 0.036), ('alternate', 0.036), ('yi', 0.035), ('algorithmics', 0.035), ('xjk', 0.035), ('norm', 0.034), ('sm', 0.034), ('images', 0.034), ('groups', 0.034), ('dataset', 0.033), ('components', 0.033), ('analytical', 0.033), ('flower', 0.032), ('interchange', 0.032), ('minimization', 0.032), ('generic', 0.032), ('iterative', 0.031), ('combinations', 0.031), ('tasks', 0.03), ('promotes', 0.03), ('wrapper', 0.03), ('convex', 0.029), ('mint', 0.028), ('claims', 0.028), ('limiting', 0.028), ('category', 0.028), ('component', 0.028), ('training', 0.027), ('proposed', 0.027), ('outperforms', 0.027), ('justify', 0.027), ('regularizations', 0.027), ('modulus', 0.027), ('hence', 0.027), ('xi', 0.027), ('solved', 0.027), ('descent', 0.026), ('algorithm', 0.026), ('block', 0.026), ('sparsity', 0.026), ('crucial', 0.025), ('canonical', 0.025), ('grouped', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated by real-world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over a product of simplexes, which is not a well-studied case in the literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD-based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort. 1

2 0.56215072 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, 1 -norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary p -norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply p -norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1

3 0.15048678 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.11425147 119 nips-2009-Kernel Methods for Deep Learning

Author: Youngmin Cho, Lawrence K. Saul

Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1

5 0.10039666 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

6 0.097839773 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

7 0.091844395 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

8 0.083562553 133 nips-2009-Learning models of object structure

9 0.080072649 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

10 0.079896197 104 nips-2009-Group Sparse Coding

11 0.079186633 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

12 0.07547304 33 nips-2009-Analysis of SVM with Indefinite Kernels

13 0.072115719 95 nips-2009-Fast subtree kernels on graphs

14 0.07100293 236 nips-2009-Structured output regression for detection with partial truncation

15 0.069427066 201 nips-2009-Region-based Segmentation and Object Detection

16 0.06675826 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

17 0.062401947 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

18 0.053606842 47 nips-2009-Boosting with Spatial Regularization

19 0.052373186 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

20 0.051484749 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.199), (1, 0.071), (2, -0.118), (3, 0.077), (4, -0.071), (5, -0.01), (6, 0.003), (7, 0.19), (8, -0.017), (9, 0.06), (10, 0.246), (11, -0.291), (12, 0.007), (13, 0.027), (14, -0.232), (15, 0.155), (16, -0.126), (17, -0.019), (18, -0.088), (19, 0.107), (20, 0.051), (21, -0.281), (22, -0.054), (23, 0.102), (24, 0.114), (25, -0.198), (26, -0.054), (27, 0.082), (28, 0.012), (29, -0.006), (30, 0.127), (31, 0.214), (32, 0.121), (33, 0.031), (34, 0.071), (35, -0.128), (36, 0.021), (37, -0.137), (38, -0.139), (39, -0.062), (40, 0.095), (41, 0.062), (42, 0.039), (43, 0.078), (44, -0.073), (45, -0.072), (46, 0.033), (47, 0.006), (48, 0.006), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94124329 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated by real-world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over a product of simplexes, which is not a well-studied case in the literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD-based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort. 1

2 0.89221829 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, 1 -norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary p -norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply p -norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1

3 0.45773473 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.42742461 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

5 0.38893929 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

6 0.3843855 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

7 0.33747417 119 nips-2009-Kernel Methods for Deep Learning

8 0.30342418 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

9 0.27675831 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.26395243 76 nips-2009-Efficient Learning using Forward-Backward Splitting

11 0.2525712 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

12 0.25073031 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

13 0.24331962 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

14 0.24203436 46 nips-2009-Bilinear classifiers for visual recognition

15 0.24053708 95 nips-2009-Fast subtree kernels on graphs

16 0.23849835 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

17 0.23309226 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

18 0.22685057 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

19 0.22649921 72 nips-2009-Distribution Matching for Transduction

20 0.2237564 236 nips-2009-Structured output regression for detection with partial truncation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.015), (24, 0.025), (25, 0.055), (35, 0.033), (36, 0.556), (39, 0.048), (58, 0.046), (61, 0.013), (71, 0.042), (86, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98891342 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

same-paper 2 0.98671186 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated by real-world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels is grouped into distinct components, where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of a non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over a product of simplexes, which is not a well-studied case in the literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD-based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort. 1

3 0.98490739 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1

4 0.98184609 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

5 0.97358149 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. 
Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. 
In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . 
, qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. 
For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . 
Proof of Theorem 3.1. We begin by showing that for small enough h there is no contribution to the integral in the definition of ξ_h(p) (see (5)) from outside the coordinate patch covered by normal coordinates.^4 Let h_0 > 0 be such that R_p(h_0) < r_inj (such an h_0 exists since lim_{h→0} R_p(h) = 0). For any h ≤ h_0, all points q with d_p(q) > r_inj satisfy u_p(q) > h. This means that if h is small enough, K(u_p(q)/h) = 0 for all points outside the injectivity radius, and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y^1, …, y^n) around the point p with y(p) = 0, we have d_p(q) = \|y(q)\| [7, 1]. With a slight abuse of notation we write u_p(y(q)) = u_p(q), ξ(y(q)) = ξ(q), and g(y(q)) = g(q), where g is the metric tensor of M.

Since K(u_p(q)/h) = 0 for all q with d_p(q) > R_p(h), we have

ξ_h(p) = \frac{1}{h^n} \int_{\|y\| ≤ R_p(h)} K\!\left( \frac{u_p(y)}{h} \right) ξ(y) \sqrt{g(y)} \, dy^1 … dy^n,   (15)

where g now denotes the determinant of the metric as calculated in normal coordinates.

[Footnote 4] Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p and g_{ij}(p) = δ_{ij} [1]. Normal coordinates can be defined in a "geodesic ball" of radius less than r_inj.

Changing the variable of integration to z = y/h, we get

ξ_h(p) − ξ(p) \int K(\|z\|) \, d^n z
  = \int_{\|z\| ≤ R_p(h)/h} K\!\left( \frac{u_p(zh)}{h} \right) ξ(zh) \sqrt{g(zh)} \, d^n z − ξ(0) \int_{\|z\| ≤ 1} K(\|z\|) \, d^n z
  = \int_{\|z\| ≤ 1} K\!\left( \frac{u_p(zh)}{h} \right) ξ(zh) \left[ \sqrt{g(zh)} − 1 \right] d^n z
  + \int_{\|z\| ≤ 1} \left[ K\!\left( \frac{u_p(zh)}{h} \right) − K(\|z\|) \right] ξ(zh) \, d^n z
  + \int_{\|z\| ≤ 1} K(\|z\|) \left[ ξ(zh) − ξ(0) \right] d^n z
  + \int_{1 < \|z\| ≤ R_p(h)/h} K\!\left( \frac{u_p(zh)}{h} \right) ξ(zh) \sqrt{g(zh)} \, d^n z.   (16)

Thus,

\left| ξ_h(p) − ξ(p) \int K(\|z\|) d^n z \right|
  ≤ \sup_{t} K(t) · \sup_{\|z\| ≤ 1} |ξ(zh)| · \sup_{\|z\| ≤ 1} \left| \sqrt{g(zh)} − 1 \right| · \int_{\|z\| ≤ 1} d^n z   (17)
  + \sup_{\|z\| ≤ 1} |ξ(zh)| · \sup_{\|z\| ≤ 1} \left| K\!\left( \frac{u_p(zh)}{h} \right) − K(\|z\|) \right| · \int_{\|z\| ≤ 1} d^n z   (18)
  + \left| \int_{\|z\| ≤ 1} K(\|z\|) \left[ ξ(zh) − ξ(0) \right] d^n z \right|   (19)
  + \sup_{t} K(t) · \sup_{1 < \|z\| ≤ R_p(h)/h} |ξ(zh)| · \sup_{1 < \|z\| ≤ R_p(h)/h} \sqrt{g(zh)} · \int_{1 < \|z\| ≤ R_p(h)/h} d^n z.   (20)

Letting h → 0, the terms (17)-(20) approach zero at the following rates:

(17): K(t) is bounded and ξ(y) is continuous at y = 0, so the first two factors can be bounded by constants as h → 0. In normal coordinates y, g_{ij}(y) = δ_{ij} + O(\|y\|^2) as \|y\| → 0, so sup_{\|z\| ≤ 1} |\sqrt{g(zh)} − 1| = O(h^2) as h → 0.

(18): Since K is assumed to be differentiable with a bounded derivative on [0, 1), we have K(b) − K(a) = O(b − a) as b → a. By Lemma 3.1, u_p(zh)/h − \|z\| = O(h^2) as h → 0. Thus, K(u_p(zh)/h) − K(\|z\|) = O(h^2) as h → 0.

(19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for \|z\| ≤ 1 Taylor's theorem gives

ξ(zh) = ξ(0) + h \sum_{i=1}^{n} z^i \left. \frac{∂ξ(y)}{∂y^i} \right|_{y=0} + O(h^2)   (21)

as h → 0. Since \int_{\|z\| ≤ 1} z K(\|z\|) d^n z = 0, we get \int_{\|z\| ≤ 1} K(\|z\|)[ξ(zh) − ξ(0)] d^n z = O(h^2) as h → 0.

(20): The first three factors can be bounded by constants. By Lemma 3.3, R_p(h) = h + O(h^3) as h → 0. A spherical shell 1 < \|z\| ≤ 1 + ε has volume O(ε) as ε → 0^+. Thus, the volume of the region 1 < \|z\| ≤ R_p(h)/h is O(R_p(h)/h − 1) = O(h^2) as h → 0.

Thus, the sum of the terms (17)-(20) is O(h^2) as h → 0, as claimed in Theorem 3.1.

4 Bias, variance and mean squared error

Let M, f, \hat{f}_m, K, and p be as in Theorem 2.1, and assume h_m → 0 as m → ∞.

Lemma 4.1. Bias[\hat{f}_m(p)] = O(h_m^2) as m → ∞.

Proof. We have Bias[\hat{f}_m(p)] = Bias[ h_m^{−n} K(u_p(q)/h_m) ], so, recalling that \int_{R^n} K(\|z\|) d^n z = 1, the lemma follows from Theorem 3.1 with ξ replaced by f.

Lemma 4.2. If, in addition to h_m → 0, we have m h_m^n → ∞ as m → ∞, then

Var[\hat{f}_m(p)] = O\!\left( \frac{1}{m h_m^n} \right)   (22)

as m → ∞.

Proof. Since \hat{f}_m(p) is an average of m independent and identically distributed terms,

Var[\hat{f}_m(p)] = \frac{1}{m} Var\!\left[ \frac{1}{h_m^n} K\!\left( \frac{u_p(q)}{h_m} \right) \right].   (23)

Now,

Var\!\left[ \frac{1}{h_m^n} K\!\left( \frac{u_p(q)}{h_m} \right) \right] = E\!\left[ \frac{1}{h_m^{2n}} K^2\!\left( \frac{u_p(q)}{h_m} \right) \right] − \left( E\!\left[ \frac{1}{h_m^n} K\!\left( \frac{u_p(q)}{h_m} \right) \right] \right)^2,   (24)

and

E\!\left[ \frac{1}{h_m^{2n}} K^2\!\left( \frac{u_p(q)}{h_m} \right) \right] = \frac{1}{h_m^n} \int_M \frac{1}{h_m^n} K^2\!\left( \frac{u_p(q)}{h_m} \right) f(q) \, dV(q).   (25)

By Theorem 3.1, the integral on the right-hand side of (25) converges to f(p) \int K^2(\|z\|) d^n z, so the right-hand side of (25) is O(1/h_m^n) as m → ∞. By Lemma 4.1 we have ( E[ h_m^{−n} K(u_p(q)/h_m) ] )^2 → f^2(p). Thus, Var[\hat{f}_m(p)] = O(1/(m h_m^n)) as m → ∞.

Proof of Theorem 2.1. Finally, since MSE[\hat{f}_m(p)] = Bias^2[\hat{f}_m(p)] + Var[\hat{f}_m(p)], the theorem follows from Lemmas 4.1 and 4.2.
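The rates in Lemmas 4.1 and 4.2 can be checked numerically in the simplest setting: the uniform density f = 1/(2π) on the unit circle (n = 1, N = 2). The sketch below computes the bias and the single-sample variance of the kernel term by quadrature over the circle; the biweight kernel, the bandwidths, and the grid resolution are illustrative assumptions, not the authors' setup. As h → 0, the printed ratios bias/h^2 and h·Var should level off at constants, consistent with Bias = O(h^2) and Var[\hat{f}_m(p)] = O(1/(m h^n)).

```python
import numpy as np

def kernel(t):
    # Normalized biweight on R (n = 1): integrates to 1 over the real line.
    return np.where(np.abs(t) < 1.0, (15.0 / 16.0) * (1.0 - t ** 2) ** 2, 0.0)

f_p = 1.0 / (2.0 * np.pi)                  # true (uniform) density on the unit circle
s = np.linspace(-np.pi, np.pi, 400001)     # arc-length coordinate of q relative to p
u = 2.0 * np.abs(np.sin(s / 2.0))          # u_p(q): chordal distance from p to q

for h in (0.2, 0.1, 0.05, 0.025):
    Y = kernel(u / h) / h                  # single-sample term (1/h) K(u_p(q)/h)
    EY, EY2 = Y.mean(), (Y ** 2).mean()    # expectations over q ~ uniform on the circle
    bias = EY - f_p                        # Bias[\hat f_m(p)], independent of m
    var1 = EY2 - EY ** 2                   # Var[\hat f_m(p)] = var1 / m
    print(f"h={h:6.3f}  bias/h^2={bias / h ** 2:9.6f}  h*var1={h * var1:9.6f}")
```

Here the bias comes entirely from the curvature of the circle (the chord/arc discrepancy), since the density is constant, which is exactly the O(h^2) effect isolated in the proof of Theorem 3.1.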
5 Experiments and discussion

We have empirically tested the estimator (3) on two datasets: a unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x^2 + y^2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, \sqrt{1 − x^2 − y^2}), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m^{−1/(n+4)}, where m is the number of data points. We performed leave-one-out estimates of the density at the data points and obtained the MSE for a range of values of m; see Figure 1. (A small sketch reproducing the spiral experiment at reduced sample sizes is given after the references.)

6 Conclusion and future work

We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of R^N, and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M, the density f, and the kernel K. We find it likely that our estimator, or slight modifications of it, will remain consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lies exactly on a smooth Riemannian manifold.

Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data (right). In each case we fit a curve of the form MSE(m) = a m^b, which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by C m^{−4/(n+4)}, which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere.

References

[1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006.
[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104. ACM, New York, NY, USA, 2006.
[3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966.
[4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990.
[5] J. Jost. Riemannian Geometry and Geometric Analysis. Springer, 2008.
[6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity. IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001.
[7] J. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer Verlag, 1997.
[8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962.
[9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005.
[10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999.
[11] X. Pennec.
Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006.
[12] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[13] P. Vincent and Y. Bengio. Manifold Parzen windows. Advances in Neural Information Processing Systems, pages 849–856, 2003.
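As promised in Section 5, here is a rough, scaled-down sketch of the spiral experiment. Since the exact spiral, the proportionality constant for the bandwidth, and the kernel are not specified above, the ones below are assumptions made for illustration: a standard normal sample is mapped onto the curve t → ((t+6) cos t, (t+6) sin t), the leave-one-out version of estimator (3) is run with h = m^{−1/5}, and the MSE exponent is fit as in Figure 1. At these small sample sizes the fitted exponent is noisy, but it should come out roughly near the theoretical value −4/(n+4) = −0.8.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(t):
    # Normalized biweight on R (n = 1): integrates to 1 over the real line.
    return np.where(np.abs(t) < 1.0, (15.0 / 16.0) * (1.0 - t ** 2) ** 2, 0.0)

def spiral_mse(m):
    t = rng.standard_normal(m)                                     # N(0,1) parameter sample
    pts = np.c_[(t + 6.0) * np.cos(t), (t + 6.0) * np.sin(t)]      # mapped onto the spiral
    # True density w.r.t. arc length: phi(t) / |gamma'(t)|, with |gamma'(t)| = sqrt(1+(t+6)^2).
    f_true = np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi) / np.sqrt(1.0 + (t + 6.0) ** 2)
    h = m ** (-1.0 / 5.0)                                          # h ~ m^{-1/(n+4)}, n = 1
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)  # ambient pairwise distances
    Kmat = kernel(d / h)
    np.fill_diagonal(Kmat, 0.0)                                    # leave-one-out: drop i = j
    f_hat = Kmat.sum(axis=1) / ((m - 1) * h)
    return np.mean((f_hat - f_true) ** 2)

ms = np.array([250, 500, 1000, 2000])
mse = np.array([spiral_mse(m) for m in ms])
print(mse)
print("fitted exponent b:", np.polyfit(np.log(ms), np.log(mse), 1)[0])  # roughly -0.8
```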

6 0.9642517 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

7 0.80594081 128 nips-2009-Learning Non-Linear Combinations of Kernels

8 0.79924965 193 nips-2009-Potential-Based Agnostic Boosting

9 0.79831696 76 nips-2009-Efficient Learning using Forward-Backward Splitting

10 0.79697138 129 nips-2009-Learning a Small Mixture of Trees

11 0.7935822 27 nips-2009-Adaptive Regularization of Weight Vectors

12 0.78563625 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

13 0.78323996 72 nips-2009-Distribution Matching for Transduction

14 0.77960831 166 nips-2009-Noisy Generalized Binary Search

15 0.77554798 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

16 0.77434504 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

17 0.76751888 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

18 0.76501554 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

19 0.75954318 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

20 0.75852185 180 nips-2009-On the Convergence of the Concave-Convex Procedure