nips nips2009 nips2009-47 knowledge-graph by maker-knowledge-mining

47 nips-2009-Boosting with Spatial Regularization


Source: pdf

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. [sent-3, score-0.704]

2 From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. [sent-4, score-0.526]

3 We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. [sent-5, score-0.466]

4 The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e. [sent-6, score-0.291]

5 1 Introduction When applying off-the-shelf machine learning algorithms to data with spatial dimensions (images, geo-spatial data, fMRI, etc) a central question arises: how to incorporate prior information on the spatial characteristics of the data? [sent-10, score-0.496]

6 For example, if we feed a boosting or SVM algorithm with individual image voxels as features, the voxel spatial information is ignored. [sent-11, score-0.761]

7 Yet in many cases the spatial arrangement of the voxels together with prior information about expected spatial characteristics of the data may be very helpful. [sent-13, score-0.719]

8 We are particularly interested in the situation when the trained classifier is used to identify relevant spatial regions. [sent-14, score-0.248]

9 Successful classification suggests that the voxels used are important in discriminating between the two classes. [sent-16, score-0.265]

10 We expect that these voxels will be spatially compact and clustered. [sent-18, score-0.318]

11 In summary, our primary objective is improving the ability of the trained classifier to usefully identify the spatial pattern of discriminative information. [sent-20, score-0.291]

12 However, incorporating spatial information into boosting may also improve classification accuracy. [sent-21, score-0.475]

13 Our key contribution is the development of a framework for spatially regularized boosting. [sent-22, score-0.198]

14 We do this by adding a spatial regularization kernel to the standard loss minimization formulation of boosting. [sent-23, score-0.354]

15 We then design an associated boosting algorithm by using coordinate descent on the regularized loss. [sent-24, score-0.363]

16 We show that the algorithm minimizes the regularized loss function and has a natural interpretation of boosting with additional adaptive priors/weights on both spatial locations and training examples. [sent-25, score-0.701]

17 We also show that it exhibits a natural grouping effect on nearby spatial locations with similar discriminative power. [sent-26, score-0.464]

18 1 Briefly, the fMRI voxel selection problem is to use the fMRI signal to identify a subset of voxels that are key in discriminating between two stimuli. [sent-29, score-0.38]

19 One expects such voxels to be spatially compact and clustered. [sent-30, score-0.318]

20 Traditionally this is done by thresholding a statistical univariate test score on each voxel [1]. [sent-31, score-0.169]

21 An extreme case is hypothesis testings on clusters of voxels rather than on voxels themselves [2]. [sent-33, score-0.39]

22 The problem with these methods is that they greatly sacrifice the spatial resolution of the results and averaging could hide fine patterns in data. [sent-34, score-0.248]

23 An alternative is to spatially average the univariate test scores, e. [sent-35, score-0.19]

24 However, this also compromises the spatial accuracy of the result because one selects discriminating wavelet components, not voxels. [sent-40, score-0.422]

25 A more promising spatially aware approach selects voxels with tree-based spatial regularization of a univariate statistic [5, 6]. [sent-41, score-0.734]

26 This can achieve both spatial precision and smoothness but uses a complex regularization method. [sent-42, score-0.328]

27 Our proposed method also selects single voxels with the help of spatial regularization but operates in a multivariate classifier framework using a simpler form of regularization. [sent-43, score-0.544]

28 To ensure spatial clustering of selected voxels, one can run a searchlight (a spherical mask) [9] to pre-select clustered informative features. [sent-48, score-0.33]

29 A variant of this two-stage framework is to train classifiers on a few predefined masks, and then aggregate these classifiers by boosting [10, 11]. [sent-51, score-0.227]

30 Unlike two-stage approaches, [12] directly uses AdaBoost to train classifiers with “rich features” (features involving the values of several adjacent voxels) to capture spatial structure in the data. [sent-53, score-0.248]

31 Moreover, there is no control on the spatial smoothness of the results. [sent-55, score-0.275]

32 Our method is similar to [12] in that we combine the feature selection and classification into one boosting process. [sent-56, score-0.279]

33 But our algorithm operates on single voxels and uses simple spatial regularization to incorporate spatial information. [sent-57, score-0.772]

34 After introducing notation in §2, we formulate our spatial regularization approach in §3 and derive an associated spatially regularized boosting algorithm in §4. [sent-59, score-0.754]

35 We prove an interesting property of the algorithm in §5 that guarantees the simultaneous selection of equivalent locations that are spatially close. [sent-60, score-0.268]

36 Using the training instances X , we select a pool of base classifiers H = {hj : Rn → {−1, +1}, j = 1, . [sent-69, score-0.173]

37 Our objective is to train p a composite binary classifier of the form hα (xi ) = sgn( j=1 αj hj (xi )). [sent-73, score-0.608]

38 We can further assume that hj ∈ H ⇒ −hj ∈ H, thus all values in α can be assumed to be nonnegative. [sent-74, score-0.569]

39 (1) i=1 Various boosting algorithms can be derived as iterative greedy coordinate descent procedures to minimize (1) [13]. [sent-77, score-0.293]

40 The result of a conventional boosting algorithm is determined by the m × p matrix M = [yi hj (xi )] ˆ [15]. [sent-79, score-0.824]

41 Under a component permutation xi = P xi , the base classifiers become hj = hj · P −1 ; so ˆ ˆ x ˆ M = [yi hj (ˆi )] = [yi hj (xi )] = M . [sent-80, score-2.699]

42 Hence training on {P xi , yi } or {xi , yi } yields the same α, i. [sent-81, score-0.483]

43 To see this, assume each 2 hj depends on only a single component of x ∈ Rn , i. [sent-85, score-0.607]

44 , for some standard basis vector ek , and function gj : R → {−1, +1}, hj (x) = gj (eT x) (the base classifiers are decision stumps). [sent-87, score-0.814]

45 To make k the association between base classifiers and components explicit, let s be the function s(j) = k if hj (x) = gj (eT x) and Q = [qkj ] be the n × p matrix with qkj = 1[s(j)=k] . [sent-88, score-0.819]

46 Although we used decision stumps above for simplicity, more complex base classifiers such as decision trees could be used with proper modification of mapping from α to β. [sent-90, score-0.175]

47 Suppose the instance components reflect spatial structure in the data, e. [sent-92, score-0.287]

48 Then the component importance map is indicating the spatial distribution of weights that the classifier employs. [sent-95, score-0.376]

49 Presumably a good classifier distributes the weights in accordance with the discriminative power of the components; in which case, the map is indicating how discriminative information is spatially distributed. [sent-96, score-0.25]

50 Now as shown above, conventional boosting ignores spatial information. [sent-98, score-0.475]

51 Our objective, pursued in the next sections, is to incorporated prior information on spatial structure, e. [sent-99, score-0.248]

52 a prior on the component importance map, into the boosting problem. [sent-101, score-0.314]

53 3 Adding Spatial Regularization To incorporate spatial information we add spatial regularization of the form β T Kβ to the loss (1) where the kernel K ∈ Rn×n is positive definite. [sent-102, score-0.602]

54 Thus the regularized loss is: p m Lexp (X , Y, α) = reg αj hj (xi )) + λβ T Kβ i=1 (3) j=1 p m = (2) αj hj (xi )) + λαT QT KQα. [sent-104, score-1.302]

55 exp(−yi exp(−yi i=1 j=1 The term β T Kβ imposes a spatial smoothness constraint on β. [sent-105, score-0.275]

56 The regularization imposes a spatial smoothness constraint by encouraging β to give more weight to the eigenimages with smaller eigenvalues, e. [sent-123, score-0.403]

57 4 A Spatially Regularized Boosting Algorithm We now derive a spatially regularized boosting algorithm (abbreviated as SRB) using coordinate descent on (3). [sent-126, score-0.486]

58 This results in an algorithm similar to AdaBoost, but with additional consideration of spatial location. [sent-128, score-0.276]

59 αj : − ∂ Lexp (X , Y, α) = ∂αj reg p m αj hj (xi )) − 2eT λQT KQα. [sent-132, score-0.605]

60 j yi hj (xi ) exp(−yi i=1 j=1 Here ej is the j -th standard basis vector, so eT λQT KQα is the j -th element of λQT KQα. [sent-133, score-0.742]

61 This corresponds to choosing the best base classifier under the current weight distribution. [sent-137, score-0.181]

62 However, here we have an additional term: the performance of base classifier hj is enhanced by a weight γs(j ) on its corresponding component s(j ). [sent-138, score-0.788]

63 To proceed, we choose a base classifier hj to maximize the sum of these two terms and then increase the weight of that base classifier by a step size ε. [sent-140, score-0.893]

64 The key differences from AdaBoost are: (a) the new algorithm maintains a new set of “spatial compensation weights” γ; (b) the weights on training examples wi are not normalized at the end of each iteration. [sent-142, score-0.301]

65 Therefore, 4 ¯ a component receives a high compensation weight γk = 2λ(βk − µβk ) if some neighboring spatial locations have already been selected (i. [sent-147, score-0.42]

66 So the algorithm encourages the selection of base classifiers associated with “inactive” locations that are close to “active” locations. [sent-153, score-0.292]

67 We can enhance the algorithm by including a backward step each iteration: αj ← αj − ε , where m j = arg min 1≤j≤p,αj >0 yi hj (xi )wi + γs(j) . [sent-154, score-0.747]

68 Alternatively, one can be greedy and select ε to minimize the value of the loss function (3) after the change αj ← αj + ε: W− eε + W+ e−ε + λ(β + εek )T K(β + εek ), where W− = i:yi hj (xi )=−1 exp(−yi hα (xi )), W+ = s(j ). [sent-162, score-0.655]

69 Setting the derivative of (6) to 0 yields: i:yi hj (xi )=1 (6) exp(−yi hα (xi )) and k = W− eε − W+ e−ε − γk + 2λεKk k = 0. [sent-163, score-0.569]

70 − 5 The Grouping Effect: Asymptotic Analysis Recall our objective of using the component importance map of the trained classifier to ascertain the spatial distribution of informative components in the data. [sent-176, score-0.374]

71 In general, however, a boosting algorithm will select a sufficient but incomplete collection of base classifiers (and hence components) to accomplish the classification. [sent-178, score-0.398]

72 For example, after selecting one base classifier hj , AdaBoost will adjust the weights of training examples to make the weighted training error of hj exactly 1 (totally uninformative), thus preventing 2 the selection of any classifiers similar to hj in the next iteration. [sent-179, score-2.035]

73 In fact, for AdaBoost we can prove that in the optimal solution α∗ , we can transfer coefficient weights between any two equivalent base classifiers without impacting optimality. [sent-180, score-0.209]

74 Assume hj1 and hj2 , j1 < j2 , are base classifiers with s(j1 ) = s(j2 ), and hj1 (xi ) = ∗ ∗ hj2 (xi ) for all xi ∈ X . [sent-184, score-0.264]

75 Let Hk denote the subset of base classifiers acting on component k, i. [sent-192, score-0.181]

76 The following lemma is proved in the supplementary material: ∗ Lemma: For any k, 1 ≤ k ≤ n, −γk ≥ maxhj ∈Hk m i=1 ∗ ∗ yi hj (xi )wi with equality if βk > 0. [sent-195, score-0.812]

77 Then for any k1 and k2 : ∗ ∗ |βk1 − βk2 | ≤ m i=1 where d(k1 , k2 ) = | maxhj ∈Hk1 1 ¯∗ 1 ¯∗ |βk1 − βk2 | + d(k1 , k2 ), µ λµ ∗ yi hj (xi )wi − maxhj ∈Hk2 m i=1 (10) ∗ yi hj (xi )wi |. [sent-198, score-1.624]

78 Then −γk1 ≥ m m ∗ ∗ ∗ maxhj ∈Hk1 i=1 yi hj (xi )wi and −γk2 = maxhj ∈Hk2 i=1 yi hj (xi )wi . [sent-207, score-1.624]

79 This gives: m hj ∈Hk2 m ∗ yi hj (xi )wi ≤ d(vk1 , vk2 ). [sent-208, score-1.288]

80 ∗ yi hj (xi )wi − max ∗ ∗ γk1 − γk2 ≤ max i=1 hj ∈Hk1 i=1 ¯ ¯∗ Substituting the definition of γ: γ = 2λGβ − 2λµβ = 2λβ − 2λµβ, yields (2λβk1 − 2λµ0) − ∗ ∗ ¯∗ ¯∗ ¯∗ (2λβk2 − 2λµβk2 ) ≤ d(vk1 , vk2 ). [sent-209, score-1.32]

81 The theorem upper bounds the difference in the importance coefficient of two components by the ¯∗ ¯∗ sum of two terms: the first, |βk1 − βk2 |, takes into account the importance weight of nearby locations. [sent-214, score-0.175]

82 The component importance map β of SRB reveals both eyes as discriminating areas and demonstrates the grouping effect. [sent-231, score-0.239]

83 We also tried all methods with Gaussian spatial pre-smoothing as a preprocessing step. [sent-249, score-0.248]

84 In all cases, local spatial averaging deteriorates the classification performance of boosting. [sent-257, score-0.248]

85 We note that spatially regularized boosting yields a more clustered and interpretable selection of voxels. [sent-273, score-0.546]

86 The result for one subject (Figure 6) shows that standard boosting (AdaBoost) selects voxels scattered in the brain, while SRB selects clustered voxels and nicely highlights the relevant FFA area [21] and posterior central sulcus [22, 23]. [sent-274, score-0.75]

87 7 Conclusions The proposed SRB algorithm is applicable to a variety of situations in which one needs to boost the performance of base classifiers with spatial structure. [sent-275, score-0.419]

88 92 Spatial Regularized Boosting Spatial Regularized Boosting with smoothing AdaBoost AdaBoost with smoothing Gaussian naive Bayes Gaussian naive Bayes with smoothing 0. [sent-279, score-0.359]

89 86 0 50 100 150 iterations of boosting 1 classification accuracy on test images 0. [sent-282, score-0.312]

90 98 classification accuracy on test images classification accuracy on test images 1 0. [sent-284, score-0.17]

91 7 0 200 50 (a) 100 150 iterations of boosting 0. [sent-289, score-0.227]

92 8 0 200 50 (b) 100 150 iterations of boosting 200 (c) 1 1 1 0. [sent-293, score-0.227]

93 Gaussian noise, (e) poisson noise, (f) spatial correlated Gaussian noise. [sent-356, score-0.248]

94 The additional set of location weights encourages or discourages the selection of certain base classifiers based on the spatial location of base classifiers that have already been selected. [sent-359, score-0.704]

95 The grouping effect is clearly demonstrated in the face gender detection experiment. [sent-362, score-0.194]

96 In the OCR classification experiment, the algorithm shows superior performance in pixel selection accuracy without loss of classification accuracy. [sent-363, score-0.19]

97 The algorithm matches the performance of the state-of-the-art set estimation methods [6] that use a more complex spatial regularization and cycle spinning technique. [sent-364, score-0.329]

98 In the fMRI experiment, the algorithm yields a clustered selection of voxels in positions relevant to the task. [sent-365, score-0.344]

99 Integrated wavelet processing and spatial statistical testing of fMRI data. [sent-390, score-0.279]

100 Optimal aggregation of classifiers and boosting maps in functional magnetic resonance imaging. [sent-449, score-0.227]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hj', 0.569), ('srb', 0.296), ('spatial', 0.248), ('adaboost', 0.228), ('boosting', 0.227), ('voxels', 0.195), ('fmri', 0.166), ('yi', 0.15), ('wi', 0.146), ('base', 0.143), ('classi', 0.141), ('spatially', 0.123), ('xi', 0.121), ('er', 0.116), ('kq', 0.111), ('ers', 0.105), ('maxhj', 0.093), ('smoothing', 0.089), ('grouping', 0.082), ('regularized', 0.075), ('discriminating', 0.07), ('univariate', 0.067), ('voxel', 0.063), ('compensation', 0.056), ('lexp', 0.056), ('regularization', 0.053), ('loss', 0.053), ('selection', 0.052), ('importance', 0.049), ('qt', 0.049), ('selects', 0.048), ('princeton', 0.046), ('uj', 0.046), ('naive', 0.046), ('hasson', 0.045), ('searchlight', 0.045), ('discriminative', 0.043), ('face', 0.042), ('ocr', 0.042), ('gender', 0.042), ('shuf', 0.042), ('weights', 0.041), ('locations', 0.04), ('ek', 0.04), ('components', 0.039), ('thresholding', 0.039), ('composite', 0.039), ('neuroimage', 0.039), ('component', 0.038), ('weight', 0.038), ('clustered', 0.037), ('blu', 0.037), ('eigenimages', 0.037), ('imm', 0.037), ('qkj', 0.037), ('ville', 0.037), ('willett', 0.037), ('kk', 0.036), ('reg', 0.036), ('classification', 0.035), ('greedy', 0.033), ('maxj', 0.033), ('coordinate', 0.033), ('mart', 0.032), ('stumps', 0.032), ('adjust', 0.032), ('pixel', 0.032), ('yields', 0.032), ('pixels', 0.032), ('hk', 0.031), ('gj', 0.031), ('wavelet', 0.031), ('training', 0.03), ('norman', 0.03), ('bayes', 0.029), ('brain', 0.029), ('encourages', 0.029), ('annotated', 0.028), ('gij', 0.028), ('levy', 0.028), ('arrangement', 0.028), ('algorithm', 0.028), ('effect', 0.028), ('smoothness', 0.027), ('gaussian', 0.027), ('koltchinskii', 0.026), ('pca', 0.026), ('legend', 0.025), ('brings', 0.025), ('haxby', 0.025), ('accuracy', 0.025), ('inequality', 0.025), ('prove', 0.025), ('images', 0.025), ('noise', 0.024), ('location', 0.024), ('digits', 0.023), ('exhibits', 0.023), ('vj', 0.023), ('ej', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999863 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

2 0.19102865 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

Author: Barry Chai, Dirk Walther, Diane Beck, Li Fei-fei

Abstract: In this study, we present a new method for establishing fMRI pattern-based functional connectivity between brain regions by estimating their multivariate mutual information. Recent advances in the numerical approximation of highdimensional probability distributions allow us to successfully estimate mutual information from scarce fMRI data. We also show that selecting voxels based on the multivariate mutual information of local activity patterns with respect to ground truth labels leads to higher decoding accuracy than established voxel selection methods. We validate our approach with a 6-way scene categorization fMRI experiment. Multivariate information analysis is able to find strong information sharing between PPA and RSC, consistent with existing neuroscience studies on scenes. Furthermore, an exploratory whole-brain analysis uncovered other brain regions that share information with the PPA-RSC scene network.

3 0.17237249 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form ∗ ∗ f ∗ (X1 , . . . , Xp ) = j∈S hj (Xj ), where each component function hj lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max s log(p/s) , s ǫ2 (H) . The first term reflects the sample size required for n n performing subset selection, and is independent of the function class H. The second term s ǫ2 (H) is an s-dimensional estimation term corresponding to the sample size required for n estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form sǫ2 (H) ≍ s n−2m/(2m+1) . Either of n the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

4 0.14106639 70 nips-2009-Discriminative Network Models of Schizophrenia

Author: Irina Rish, Benjamin Thyreau, Bertrand Thirion, Marion Plaze, Marie-laure Paillere-martinot, Catherine Martelli, Jean-luc Martinot, Jean-baptiste Poline, Guillermo A. Cecchi

Abstract: Schizophrenia is a complex psychiatric disorder that has eluded a characterization in terms of local abnormalities of brain activity, and is hypothesized to affect the collective, “emergent” working of the brain. We propose a novel data-driven approach to capture emergent features using functional brain networks [4] extracted from fMRI data, and demonstrate its advantage over traditional region-of-interest (ROI) and local, task-specific linear activation analyzes. Our results suggest that schizophrenia is indeed associated with disruption of global brain properties related to its functioning as a network, which cannot be explained by alteration of local activation patterns. Moreover, further exploitation of interactions by sparse Markov Random Field classifiers shows clear gain over linear methods, such as Gaussian Naive Bayes and SVM, allowing to reach 86% accuracy (over 50% baseline - random guess), which is quite remarkable given that it is based on a single fMRI experiment using a simple auditory task. 1

5 0.13445398 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

6 0.13280827 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

7 0.12416276 260 nips-2009-Zero-shot Learning with Semantic Output Codes

8 0.11352241 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

9 0.096438527 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

10 0.094489738 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

11 0.085973628 71 nips-2009-Distribution-Calibrated Hierarchical Classification

12 0.077617317 43 nips-2009-Bayesian estimation of orientation preference maps

13 0.076398239 72 nips-2009-Distribution Matching for Transduction

14 0.076216139 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

15 0.075857021 204 nips-2009-Replicated Softmax: an Undirected Topic Model

16 0.075448811 237 nips-2009-Subject independent EEG-based BCI decoding

17 0.072090782 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

18 0.070104085 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

19 0.067510821 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

20 0.067449227 129 nips-2009-Learning a Small Mixture of Trees


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.229), (1, -0.016), (2, -0.08), (3, 0.109), (4, -0.024), (5, 0.089), (6, -0.003), (7, -0.188), (8, -0.136), (9, 0.079), (10, 0.002), (11, -0.045), (12, -0.035), (13, 0.039), (14, 0.05), (15, -0.015), (16, 0.023), (17, -0.017), (18, 0.104), (19, -0.042), (20, -0.007), (21, -0.157), (22, -0.109), (23, -0.061), (24, 0.017), (25, -0.009), (26, 0.037), (27, 0.076), (28, -0.072), (29, 0.001), (30, -0.083), (31, -0.063), (32, 0.083), (33, 0.173), (34, 0.051), (35, -0.158), (36, -0.074), (37, 0.038), (38, 0.021), (39, 0.015), (40, 0.036), (41, 0.064), (42, -0.2), (43, -0.146), (44, -0.04), (45, -0.069), (46, -0.035), (47, -0.095), (48, 0.08), (49, 0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94917786 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

2 0.63119984 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

3 0.58204585 70 nips-2009-Discriminative Network Models of Schizophrenia

Author: Irina Rish, Benjamin Thyreau, Bertrand Thirion, Marion Plaze, Marie-laure Paillere-martinot, Catherine Martelli, Jean-luc Martinot, Jean-baptiste Poline, Guillermo A. Cecchi

Abstract: Schizophrenia is a complex psychiatric disorder that has eluded a characterization in terms of local abnormalities of brain activity, and is hypothesized to affect the collective, “emergent” working of the brain. We propose a novel data-driven approach to capture emergent features using functional brain networks [4] extracted from fMRI data, and demonstrate its advantage over traditional region-of-interest (ROI) and local, task-specific linear activation analyzes. Our results suggest that schizophrenia is indeed associated with disruption of global brain properties related to its functioning as a network, which cannot be explained by alteration of local activation patterns. Moreover, further exploitation of interactions by sparse Markov Random Field classifiers shows clear gain over linear methods, such as Gaussian Naive Bayes and SVM, allowing to reach 86% accuracy (over 50% baseline - random guess), which is quite remarkable given that it is based on a single fMRI experiment using a simple auditory task. 1

4 0.52679247 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

Author: Barry Chai, Dirk Walther, Diane Beck, Li Fei-fei

Abstract: In this study, we present a new method for establishing fMRI pattern-based functional connectivity between brain regions by estimating their multivariate mutual information. Recent advances in the numerical approximation of highdimensional probability distributions allow us to successfully estimate mutual information from scarce fMRI data. We also show that selecting voxels based on the multivariate mutual information of local activity patterns with respect to ground truth labels leads to higher decoding accuracy than established voxel selection methods. We validate our approach with a 6-way scene categorization fMRI experiment. Multivariate information analysis is able to find strong information sharing between PPA and RSC, consistent with existing neuroscience studies on scenes. Furthermore, an exploratory whole-brain analysis uncovered other brain regions that share information with the PPA-RSC scene network.

5 0.5040679 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

6 0.47721085 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

7 0.47316036 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

8 0.46456426 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

9 0.45887408 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

10 0.44907662 237 nips-2009-Subject independent EEG-based BCI decoding

11 0.44843686 72 nips-2009-Distribution Matching for Transduction

12 0.42763281 260 nips-2009-Zero-shot Learning with Semantic Output Codes

13 0.41956431 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

14 0.40420252 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

15 0.40233588 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

16 0.39963251 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

17 0.39579284 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

18 0.38791302 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

19 0.3869437 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

20 0.38316143 129 nips-2009-Learning a Small Mixture of Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.045), (25, 0.054), (35, 0.039), (36, 0.516), (39, 0.042), (58, 0.081), (71, 0.033), (81, 0.011), (86, 0.06), (91, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9914549 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

2 0.99106538 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

3 0.98992378 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1

4 0.98731512 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms stateof-the-art MKL solvers like simpleMKL in terms of computational effort. 1

5 0.98229599 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . , qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . Since K( up (q) h ) = 0 for all q with dp (q) > Rp (h), we have, ξh (p) = 1 hn K y ≤Rp (h) up (y) h ξ(y) g(y)dy 1 . . . dy n , (15) 4 Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at p, and gij (p) = δij [1]. Normal coordinates can be defined in a “geodesic ball” of radius less than rinj . 5 where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get, K( z )dn z = ξh (p) − ξ(p) up (zh) h K z ≤Rp (h)/h up (zh) h K = z ≤1 ξ(zh) K z ≤1 z ≤1 g(zh) − 1 dn z + ξ(zh) up (zh) h K( z )dn z g(zh)dn z − ξ(0) ξ(zh) − K( z ) dn z + K( z ) (ξ(zh) − ξ(0)) dn z + z ≤1 K 1< z ≤Rp (h)/h up (zh) h ξ(zh) g(zh)dn z . Thus, K ( z ) dn z ≤ ξh (p) − ξ(p) (16) t∈R z ≤1 sup |ξ(zh)| . sup K( z ≤1 z ≤1 dn z + g(zh) − 1 . sup K(t) . sup |ξ(zh)| . sup z ≤1 (17) z ≤1 up (zh) ) − K( z ) . h dn z + (18) z ≤1 K( z )(ξ(zh) − ξ(0))dn z + (19) z ≤1 sup K(t) . t∈R sup g(zh) . 1< z ≤Rp (h)/h sup dn z . (20) |ξ(zh)| . 1< z ≤Rp (h)/h 1< z ≤Rp (h)/h Letting h → 0, the terms (17)-(20) approach zero at the following rates: (17): K(t) is bounded and ξ(y) is continuous at y = 0, so the first two terms can be bounded by constants as h → 0. In normal coordinates y, gij (y) = δij + O( y 2 ) as y → 0, so, sup z ≤1 g(zh) − 1 = O(h2 ) as h → 0. (18): Since K is assumed to be differentiable with a bounded derivative in [0, 1), we get K(b) − u (zh) K(a) = O(b − a) as b → a. By Lemma 3.1 we have p h − z = O(h2 ) as h → 0. Thus, K up (zh) h − K( z ) = O(h2 ) as h → 0. (19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for z ≤ 1, Taylor’s theorem gives, n zi ξ(zh) = ξ(0) + h i=1 as h → 0. Since h → 0. z ≤1 zK( z )dn z = 0, we get ∂ξ(y) ∂y i z ≤1 + O(h2 ) (21) y=0 K( z )(ξ(zh) − ξ(0))dn z = O(h2 ) as (20): The first three terms can be bounded by constants. By Lemma 3.3, Rp (h) = h + O(h3 ) as h → 0. A spherical shell 1 < z ≤ 1 + ǫ has volume O(ǫ) as ǫ → 0+ . Thus, the volume of 1 < z ≤ Rp (h)/h is O(Rp (h)/h − 1) = O(h2 ) as h → 0. Thus, the sum of the terms (17-20), is O(h2 ) as h → 0, as claimed in Theorem 3.1. 6 4 Bias, variance and mean squared error ˆ Let M , f , fm , K, p be as in Theorem 2.1 and assume hm → 0 as m → ∞. ˆ Lemma 4.1. Bias fm (p) = O(h2 ), as m → ∞. m u (q) Proof. We have Bias[fm (p)] = Bias h1 K ph m follows from Theorem 3.1 with ξ replaced with f . , so recalling Rn K( z )dn z = 1, the lemma Lemma 4.2. If in addition to hm → 0, we have mhn → ∞ as m → ∞, then, Var[fm (p)] = m O 1 mhn m , as m → ∞. Proof. Var[fm (p)] = 1 1 Var n K m hm up (q) hm (22) (23) Now, Var 1 K hn m up (q) hm =E 1 K2 h2n m up (q) hm 1 hn m f (q) − E 1 K hn m 2 up (q) hm , (24) and, E 1 K2 h2n m up (q) hm = M 1 2 K hn m up (q) hm dV (q) . (25) By Theorem 3.1, the integral in (25) converges to f (p) K 2 ( z )dn z, so, the right hand side of (25) is O 1 hn m ˆ Var[fm (p)] = O as m → ∞. By Lemma 4.1 we have, E 1 mhn m 1 hn K m up (q) hm 2 → f 2 (p). Thus, as m → ∞. ˆ ˆ ˆ Proof of Theorem 2.1 Finally, since MSE fm (p) = Bias2 [fm (p)] + Var[fm (p)], the theorem follows from Lemma 4.1 and 4.2. 5 Experiments and discussion We have empirically tested the estimator (3) on two datasets: A unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x2 + y 2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, 1 − x2 + y 2 ), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m−1/(n+4) where m is the number of data points. We performed live-one-out estimates of the density on the data points, and obtained the MSE for a range of ms. See Figure 5. 6 Conclusion and future work We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of RN , and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M , the density f , and the kernel K. We find it likely that our estimator or slight modifications of it will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold. 7 MSE MSE Mean squared error for the hemisphere data Mean squared error for the spiral data 0.000175 0.0008 0.00015 0.000125 0.0006 0.0001 0.000075 0.0004 0.00005 0.0002 0.000025 # of data points 50000 100000 150000 200000 # of data points 50000 100000 150000 200000 Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data. In each case, we fit a curve of the form M SE(m) = amb , which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by Cm−4/(n+4) , which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere. References [1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006. [2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM New York, NY, USA, 2006. [3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966. [4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990. [5] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008. [6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity . IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001. [7] J. Lee. Riemannian manifolds: an introduction to curvature. Springer Verlag, 1997. [8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962. [9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. [10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999. [11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006. [12] B. Silverman. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1986. [13] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in Neural Information Processing Systems, pages 849–856, 2003. 8

6 0.97981024 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

7 0.83095163 76 nips-2009-Efficient Learning using Forward-Backward Splitting

8 0.82840908 128 nips-2009-Learning Non-Linear Combinations of Kernels

9 0.82270241 193 nips-2009-Potential-Based Agnostic Boosting

10 0.81948578 129 nips-2009-Learning a Small Mixture of Trees

11 0.81591594 27 nips-2009-Adaptive Regularization of Weight Vectors

12 0.8086713 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

13 0.80805552 166 nips-2009-Noisy Generalized Binary Search

14 0.80801755 72 nips-2009-Distribution Matching for Transduction

15 0.80233926 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

16 0.79839504 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

17 0.79755342 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

18 0.79476267 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

19 0.79373479 180 nips-2009-On the Convergence of the Concave-Convex Procedure

20 0.7937066 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm