nips nips2009 nips2009-57 knowledge-graph by maker-knowledge-mining

57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Source: pdf

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efﬁcient inference algorithms for a conditional random ﬁeld using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efﬁcient learning algorithms for these conditional random ﬁelds. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 sg Abstract Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. [sent-8, score-0.377]

2 However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. [sent-9, score-0.294]

3 In this paper, we show that it is possible to design efﬁcient inference algorithms for a conditional random ﬁeld using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. [sent-10, score-1.154]

4 We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. [sent-12, score-0.511]

5 1 Introduction In a sequence labeling problem, we are given an input sequence x and need to label each component of x with its class to produce a label sequence y. [sent-13, score-0.829]

6 Examples of sequence labeling problems include labeling words in sentences with its type in named-entity recognition problems [16], handwriting recognition problems [15], and deciding whether each DNA base in a DNA sequence is part of a gene in gene prediction problems [2]. [sent-14, score-0.751]

7 Hence, dependencies are usually assumed to exist only between adjacent components of y, giving rise to linear-chain CRFs which limits the order of the features to one. [sent-19, score-0.314]

8 We give an algorithm for computing the marginals and the CRF log likelihood gradient that runs in time polynomial in the number and length of the label sequences that the features depend on. [sent-21, score-0.769]

9 We also provide an efﬁcient decoding algorithm for ﬁnding the most probable label sequence in the presence of long label sequence features. [sent-23, score-0.661]

10 This can be used with cutting plane methods to train max-margin solutions for sequence labeling problems in polynomial time [18]. [sent-24, score-0.266]

11 We show experimentally that using high-order features can improve performance in sequence labeling problems. [sent-25, score-0.409]

12 We show that in handwriting recognition, using even simple high-order indicator features improves performance over using linear-chain CRFs, and impressive performance improvement is observed when the maximum order of the indicator features is increased. [sent-26, score-0.76]

13 We also use a synthetic data set to discuss the conditions under which higher order features can be helpful. [sent-27, score-0.316]

14 We further show that higher order label features can sometimes be more stable under change of data distribution using a named entity data set. [sent-28, score-0.794]

15 For sequence models, a feature of order k can be incorporated into a k-order Markov chain, but the complexity of inference is again exponential in k. [sent-38, score-0.26]

16 In the special case of a semi-Markov random ﬁelds, where high-order features depend on segments of identical labels, the complexity of inference is linear in the maximum length of the segments [13]. [sent-40, score-0.429]

17 Compared to this approach, our algorithm has the advantage of being able to efﬁciently handle high-order features having arbitrary label patterns. [sent-42, score-0.403]

18 The time complexity of inference in an HHMM is O(min{nl3 , n2 l}) [4, 10], where n is the number of states and l is the length of the sequence. [sent-44, score-0.236]

19 For example, if we use PCFG to do named entity recognition, we need to provide the parse trees for efﬁcient learning; providing the named entity labels for each word is not sufﬁcient. [sent-49, score-0.765]

20 Using high-order features captures less expressive form of dependencies than these models but allows efﬁcient learning without relabeling the training set with hierarchical labels. [sent-54, score-0.373]

21 Similar work on using higher order features for CRFs was independently done in [11]. [sent-55, score-0.28]

22 3 CRF with High-order Features Throughout the remainder of this paper, x, y, z (with or without decorations) respectively denote an observation sequence of length T , a label sequence of length T , and an arbitrary label sequence. [sent-57, score-0.815]

23 Each feature fi is associated with a label sequence zi , called fi ’s label pattern, and fi has the form fi (x, y, t) = gi (x, t), if yt−|zi |+1:t = zi 0, otherwise. [sent-75, score-1.699]

24 Consider, for example, the problem of named entity recognition. [sent-77, score-0.313]

25 A CRF deﬁnes conditional probability distributions P (y|x) = Zx (y)/Zx , where Zx (y) = m T exp( i=1 t=|zi | λi fi (x, y, t)), and Zx = y Zx (y). [sent-82, score-0.259]

26 1 Inference for High-order CRF In this section, we describe the algorithms for computing the partition function, the marginals and the most likely label sequence for high-order CRFs. [sent-86, score-0.434]

27 It can also be veriﬁed that the algorithms for linear chain CRF [8] are special cases of our algorithms when only zero-th and ﬁrst order features are considered. [sent-89, score-0.238]

28 The pattern set, Z, is the set of distinct label patterns used in the m features. [sent-96, score-0.339]

29 p|P| }, consists of distinct elements in Y ∪ {zj }0≤k≤|zj |−1,1≤j≤M ; that is, P consists of all labels and all proper 1:k preﬁxes (including ) of label patterns, with duplicates removed. [sent-104, score-0.293]

30 We shall give rough time bounds in terms of m (the total number of features), n (the number of labels), T (the length of the sequence), M (the number of distinct label patterns in Z), and the maximum order K = max{|z1 | − 1, . [sent-115, score-0.537]

31 Suppose z ≤p y, then deﬁne y’s preﬁx |z| m p score Zx (z) = exp( i=1 t=|zi | λi fi (x, y, t)). [sent-122, score-0.223]

32 Similarly, if z ≤s y, then deﬁne y’s sufﬁx score s Zx (z) = exp( m i=1 T t=T −|z|+|zi | p s λi fi (x, y, t)). [sent-123, score-0.223]

33 Let αx (t, pi ) p Zx (z) = z:|z|=t,pi ≤s z P βx (t, si ) s Zx (z). [sent-125, score-0.463]

34 = z:|z|=T +1−t,si ≤p z S The variable αx (t, pi ) computes for x1:t the sum of the scores of all its label sequences z having pi as the longest sufﬁx. [sent-126, score-1.107]

35 Similarly, the variable βx (t, si ) computes for xt:T the sum of scores of all its label sequence z having si as the longest preﬁx. [sent-127, score-0.489]

36 For y with p ≤s y1:t , this function counts the contribux tion towards Zx (y) by all features fi with their label patterns ending at position t and being sufﬁxes of p. [sent-131, score-0.676]

37 Let pi y be the concatenation of pi with a label y. [sent-132, score-0.991]

38 Proposition 1 (a) For any z, there is a unique pi such that pi ≤s z. [sent-134, score-0.79]

39 P p p (b) For any z, y, if pi ≤s z and pk ≤s pi y, then pk ≤s zy and Zx (zy) = Ψp (t, pi y)Zx (z). [sent-135, score-1.441]

40 x P P P Proposition 1(a) means that we can induce partitions of label sequences using the forward states. [sent-136, score-0.361]

41 and Proposition 1(b) shows how to make well-deﬁned transition from one forward state at a time slice to another forward state at the next time slice. [sent-137, score-0.236]

42 By deﬁnition, αx (0, ) = 1, and αx (0, pi ) = 0 for all pi = . [sent-138, score-0.79]

43 Using Proposition 1(b), the recurrence for αx is αx (t, pk ) Ψp (t, pi y)αx (t − 1, pi ), for 1 ≤ t ≤ T. [sent-139, score-0.935]

44 x = (pi ,y):pk ≤s pi y P Similarly, for the backward vectors βx , let Ψs (t, s) = exp( i:zi ≤p s λi gi (x, t + |zi | − 1)). [sent-140, score-0.557]

45 x = (si ,y):sk ≤p ysi S Once αx or βx is computed, then using Proposition 1(a), Zx can be easily obtained: |P| |S| αx (T, pi ) = Zx = i=1 βx (1, si ). [sent-143, score-0.509]

46 3 Computing the Most Likely Label Sequence As in the case of HMM [12], Viterbi decoding (calculating the most likely label sequence) is obtained by replacing the sum operator in the forward backward algorithm with the max operator. [sent-152, score-0.436]

47 p Formally, let δx (t, pi ) = maxz:|z|=t,pi ≤s z Zx (z). [sent-153, score-0.395]

48 By deﬁnition, δx (0, ) = 1, and δx (0, pi ) = 0 P i for all p = , and using Proposition 1, we have δx (t, pk ) = max (pi ,y):pk ≤s pi y P Ψp (t, pi y)δx (t − 1, pi ), for 1 ≤ t ≤ T. [sent-154, score-1.688]

49 x We use Φx (t, pk ) to record the pair (pi , y) chosen to obtain δx (t, pk ), Φx (t, pk ) = arg max(pi ,y):pk ≤s pi y Ψp (t, pi y)δx (t − 1, pi ). [sent-155, score-1.509]

50 x P ∗ ∗ ∗ Let p∗ = arg maxpi δx (T, pi ), then the most likely path y∗ = (y1 , . [sent-156, score-0.424]

51 , yT ) has yT as the last label T in p∗ , and the full sequence can be traced backwards using Φx (·, ·) as follows T ∗ (p∗ , yt ) t = Φx (t + 1, p∗ ), for 1 ≤ t < T. [sent-159, score-0.425]

52 4 Computing the Marginals We need to compute marginals of label sequences and single variables, that is, compute P (yt−|z|:t = z|x) for z ∈ Z ∪ Y. [sent-163, score-0.369]

53 Unlike in the traditional HMM, additional care need to be taken regarding features having label patterns that are super or sub sequences of z. [sent-164, score-0.553]

54 If z1:|z|−1 ≤s pi and z2:|z| ≤p sj , deﬁne [pi , z, sj ] as the sequence pi i |−(|z|−1) zsj 1:|p |z|−1:|sj | , and Ox (t, pi , sj , z) = λk gk (x, t − |pi | + k − 1)). [sent-167, score-1.73]

55 exp( (k,k ):z⊆zk ,zk ⊆k [pi ,z,sj ] Ox (t, pi , sj , z) counts the contribution of features with their label patterns properly containing z but within [pi , z, sj ]. [sent-168, score-1.164]

56 For any y with yt−|z|+1:t = z, there exists unique pi , sj such that z1:|z|−1 ≤s pi , z2:|z| ≤p sj , pi ≤s y1:t−1 , and sj ≤p yt−|z|+2:T . [sent-170, score-1.62]

57 p s Multiplying by Ox counts features that are not counted in Zx Zx while division by Wx removes features that are double-counted. [sent-172, score-0.404]

58 By Proposition 2, we have P (yt−|z|+1:t = z|x) = (i,j):z1:|z|−1 ≤s pi ,z2:|z| ≤p sj αx (t − 1, pi )βx (t − |z| + 2, sj )Ox (t, pi , sj , z) Zx Wx (t, z) Time Complexity: Both Wx (t, z) and Ox (t, pi , sj , z) can be computed in O(|pi ||sj |) = O(K 2 ) time (with some precomputation). [sent-173, score-2.192]

59 , λm , and its maximum is achieved when ∂LT λk ˜ = E(fi ) − E(fi ) − 2 = 0 ∂λi σreg ˜ where E(fi ) = (x,y)∈T |x| t=|zi | fi (x, y, t) is the empirical sum of the feature fi in the observed |x| data, and E(fi ) = (x,y)∈T |y |=|x| P (y |x) t=|zi | fi (x, y , t) is the expected sum of fi . [sent-180, score-0.858]

60 straightforward, and E(fi ) can be computed using marginals computed in previous section: |x| E(fi ) P (yt−|zi |+1:t = zi |x)gi (x, t). [sent-184, score-0.237]

61 4 Experiments The practical feasibility of making use of high-order features based on our algorithm lies in the observation that the pattern sparsity assumption often holds. [sent-190, score-0.306]

62 Our algorithm can be applied to take those high-order features into consideration; high-order features now form a component that one can play with in feature engineering. [sent-191, score-0.436]

63 We ﬁrst use a synthetic data set to explore conditions under which high-order features can be expected to help. [sent-193, score-0.238]

64 We then use a handwritten character recognition problem to demonstrate that even incorporating simple highorder features can lead to impressive performance improvement on a naturally occurring dataset. [sent-194, score-0.478]

65 Finally, we use a named entity data set to show that for some data sets, higher order label features may be more robust to changes in data distributions than observation features. [sent-195, score-0.827]

66 This limits the number of possible label sequences in each length k + 1 segment from nk+1 to nk r. [sent-202, score-0.386]

67 If σ is very small as compared to most µij ’s, then using the observations alone as features is likely to be good enough to obtain a good classiﬁer of the states; the label correlations becomes less important for classiﬁcation. [sent-211, score-0.467]

68 However, if σ is large, then it is difﬁcult to distinguish the states based on the observations alone and the label correlations, particularly those captured by higher order features are likely to be helpful. [sent-212, score-0.587]

69 For higher order features, we simply use all indicator features that appeared in the training data up to a maximum order. [sent-215, score-0.433]

70 The training set and test set each contains 500 sequences of length 20; each sequence was initialized with a random sequence of length k and generated using the randomly generated order k Markov model. [sent-217, score-0.538]

71 Figure 1 shows that the high-order indicator features are useful in this case. [sent-220, score-0.269]

72 In particular, we can see that it is beneﬁcial to increase the order of the high-order features when the underlying model has longer distance correlations. [sent-221, score-0.238]

73 As expected, increasing the order of the features beyond the order of the underlying model is not helpful. [sent-222, score-0.274]

74 it is less important to use high-order features to capture label correlations. [sent-231, score-0.403]

75 On the other hand, when such coupling is not clear, it becomes important to capture the label correlations, and high-order features can be useful. [sent-232, score-0.403]

76 2 Handwriting Recognition We used the handwriting recognition data set from [15], consisting of around 6100 handwritten words with an average length of around 8 characters. [sent-234, score-0.32]

77 The experimental setup is the same as that used in [15]: the data set was divided into 10 folds with each fold having approximately 600 training and 5500 test examples and the zero-th order features for a character are the pixel values. [sent-238, score-0.408]

78 For higher order features, we again used all indicator features that appeared in the training data up to a maximum order. [sent-239, score-0.433]

79 The running time appears to grow no more than linearly with the maximum order of the features for this data set. [sent-242, score-0.308]

80 In such cases, we hypothesize that higher order label features may be more stable than observation features and can sometimes offer performance gain. [sent-247, score-0.716]

81 62 70 Linear Chain Second Order 65 60 F1 Score 55 50 45 40 35 30 un nw l: l: w w l bc l: w un :w :w l un :b c un :n w :u n nw nw l :b c bc :w nw bc :u n 25 bc :n w We use all pairs of genres as training and test data. [sent-252, score-0.92]

82 The features used are previous word, next word, current word, case patterns for these words, and all indicator label features of order up to k. [sent-254, score-0.784]

83 Introducing second order indicator features shows improvement in 10 out of the 12 combinations and degrades performance in two of the combinations. [sent-256, score-0.35]

84 Training Domain : Test Domain Figure 3: Named entity recognition results. [sent-259, score-0.24]

85 In our experiments, we used indicator features of all label patterns that appear in the training data. [sent-260, score-0.594]

86 For real applications, if the pattern sparsity assumption is not satisﬁed, but certain patterns do not appear frequently enough and are not really important, then it is useful to see how we can select a subset of features with few distinct label patterns automatically. [sent-261, score-0.658]

87 An alternate approach to feature selection is to use all possible features and maximize the margin of the solution instead. [sent-263, score-0.234]

88 Generalization error bounds [15] show that it is possible to obtain good generalization with a relatively small training set size despite having a very large number of features if the margin is large. [sent-264, score-0.25]

89 Theoretically, it is also interesting to note that minimizing the regularized training cost when all possible high-order features of arbitrary length are used is computationally tractable. [sent-266, score-0.363]

90 Hence, even when we are learning with arbitrary sets of high-order features, we only need to use the features that appear in the training set to obtain the optimal solution. [sent-268, score-0.25]

91 Given a training set of N sequences of length l, only O(l2 N ) long label sequences of all orders are observed. [sent-269, score-0.477]

92 Using cutting plane techniques [18] the computational complexity of optimization is polynomial in inverse accuracy parameter, the training set size and maximum length of the sequences. [sent-270, score-0.226]

93 However, we note that the advantage of the higher order features may become less substantial as the observations become more powerful in distinguishing the classes. [sent-274, score-0.346]

94 Whether the use of higher order features together with kernels brings substantial improvement in performance is likely to be problem dependent. [sent-275, score-0.385]

95 Similarly, observation features that are more distribution invariant such as comprehensive name lists can be used for the NER task we experimented with and may reduce the improvements offered by higher order features. [sent-276, score-0.313]

96 5 Conclusion The pattern sparsity assumption often holds in real applications, and we give efﬁcient inference algorithms for CRF with high-order features when the pattern sparsity assumption is satisﬁed. [sent-277, score-0.393]

97 This allows high-order features to be explored in feature engineering for real applications. [sent-278, score-0.234]

98 We studied the conditions that are favourable for using high-order features using a synthetic data set, and demonstrated that using simple high-order features can lead to performance improvement on a handwriting recognition problem and a named entity recognition problem. [sent-279, score-1.077]

99 Wu, “Sparse higher order conditional random ﬁelds for improved sequence labeling,” in ICML, 2009, p. [sent-340, score-0.25]

100 Meulder, “Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition,” in Proceedings of Conference on Computational Natural Language Learning, 2003. [sent-365, score-0.313]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zx', 0.46), ('pi', 0.395), ('features', 0.202), ('label', 0.201), ('fi', 0.197), ('named', 0.161), ('entity', 0.152), ('sj', 0.145), ('zi', 0.143), ('crf', 0.136), ('sigma', 0.12), ('yt', 0.114), ('ox', 0.11), ('sequence', 0.11), ('pk', 0.108), ('handwriting', 0.103), ('crfs', 0.098), ('labeling', 0.097), ('character', 0.094), ('marginals', 0.094), ('recognition', 0.088), ('forward', 0.086), ('lt', 0.083), ('backward', 0.081), ('gi', 0.081), ('length', 0.08), ('nw', 0.078), ('dependencies', 0.076), ('patterns', 0.076), ('pre', 0.074), ('pessimistic', 0.074), ('sequences', 0.074), ('bc', 0.072), ('proposition', 0.071), ('wx', 0.069), ('pcfg', 0.069), ('si', 0.068), ('indicator', 0.067), ('markov', 0.065), ('conditional', 0.062), ('elds', 0.061), ('ace', 0.06), ('labels', 0.06), ('reg', 0.06), ('un', 0.058), ('handwritten', 0.049), ('inference', 0.049), ('subsequence', 0.049), ('singapore', 0.049), ('training', 0.048), ('hierarchical', 0.047), ('hhmm', 0.046), ('kassel', 0.046), ('ysi', 0.046), ('improvement', 0.045), ('word', 0.045), ('shall', 0.042), ('higher', 0.042), ('states', 0.042), ('longest', 0.042), ('xes', 0.042), ('zj', 0.042), ('sparsity', 0.041), ('dso', 0.04), ('genres', 0.04), ('zy', 0.04), ('clique', 0.039), ('decoding', 0.039), ('hmm', 0.038), ('maximum', 0.038), ('recurrence', 0.037), ('order', 0.036), ('synthetic', 0.036), ('observations', 0.035), ('parse', 0.034), ('complexity', 0.033), ('regularized', 0.033), ('observation', 0.033), ('gradient', 0.032), ('feature', 0.032), ('distinct', 0.032), ('time', 0.032), ('hidden', 0.031), ('nk', 0.031), ('substantial', 0.031), ('consecutive', 0.03), ('grammar', 0.03), ('pattern', 0.03), ('gene', 0.029), ('likely', 0.029), ('dna', 0.028), ('folds', 0.028), ('ner', 0.028), ('polynomial', 0.027), ('xt', 0.027), ('zm', 0.027), ('kt', 0.027), ('extraction', 0.027), ('depend', 0.027), ('score', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

2 0.21219808 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1

3 0.14078298 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1

4 0.11002272 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

5 0.10976627 72 nips-2009-Distribution Matching for Transduction

Author: Novi Quadrianto, James Petterson, Alex J. Smola

Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modiﬁcation for classiﬁcation, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efﬁciently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1

6 0.10633473 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

7 0.10565042 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.10257319 56 nips-2009-Conditional Neural Fields

9 0.10198131 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

10 0.10186651 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

11 0.099372596 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

12 0.098374844 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

13 0.088966481 157 nips-2009-Multi-Label Prediction via Compressed Sensing

14 0.086246505 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

15 0.085202947 71 nips-2009-Distribution-Calibrated Hierarchical Classification

16 0.084179372 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

17 0.082763597 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

18 0.081340395 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

19 0.080216683 97 nips-2009-Free energy score space

20 0.07980001 122 nips-2009-Label Selection on Graphs

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.247), (1, 0.023), (2, -0.004), (3, -0.065), (4, 0.038), (5, 0.003), (6, -0.033), (7, 0.007), (8, -0.108), (9, -0.007), (10, 0.026), (11, 0.095), (12, -0.044), (13, -0.006), (14, -0.037), (15, -0.014), (16, 0.216), (17, -0.101), (18, -0.07), (19, -0.109), (20, 0.29), (21, -0.0), (22, -0.164), (23, 0.096), (24, 0.026), (25, 0.055), (26, -0.075), (27, -0.076), (28, -0.034), (29, 0.132), (30, -0.006), (31, 0.088), (32, -0.205), (33, 0.078), (34, -0.012), (35, -0.052), (36, -0.005), (37, -0.007), (38, -0.016), (39, -0.086), (40, -0.005), (41, -0.003), (42, 0.101), (43, -0.023), (44, -0.028), (45, 0.089), (46, 0.069), (47, 0.043), (48, -0.054), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95976812 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

2 0.65042931 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

3 0.62896174 56 nips-2009-Conditional Neural Fields

Author: Jian Peng, Liefeng Bo, Jinbo Xu

Abstract: Conditional random ﬁelds (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs signiﬁcantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.

4 0.61322713 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

5 0.55165029 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

Author: Natasha Singh-miller, Michael Collins

Abstract: We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P (y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning label embeddings (similar to error-correcting output codes (ECOCs)) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate signiﬁcant improvements in terms of word error rate (WER) on a lecture recognition task over a state-of-the-art baseline GMM model. 1

6 0.50143939 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

7 0.4921183 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

8 0.46908906 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

9 0.46778738 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

10 0.4644931 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

11 0.46163964 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

12 0.44528791 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

13 0.44226211 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

14 0.44045198 97 nips-2009-Free energy score space

15 0.42567039 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.41611913 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

17 0.40552652 72 nips-2009-Distribution Matching for Transduction

18 0.39972395 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

19 0.38796151 157 nips-2009-Multi-Label Prediction via Compressed Sensing

20 0.38783944 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.04), (25, 0.055), (35, 0.031), (36, 0.574), (39, 0.021), (58, 0.036), (61, 0.02), (71, 0.076), (86, 0.051), (91, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99013007 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

2 0.98273814 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming ﬁxed tic-tac-toe pruning. 1

3 0.97809684 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.

Abstract: Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efﬁcient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms stateof-the-art MKL solvers like simpleMKL in terms of computational effort. 1

4 0.97427857 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efﬁcient boosting algorithm that uses additional weights/priors on the base classiﬁers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classiﬁers. The algorithm’s primary advantage is in applications where the trained classiﬁer is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

5 0.97181618 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modiﬁcation to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modiﬁed estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisﬁes certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As ﬁrst clariﬁed by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is deﬁned in terms of a metric tensor is called a Riemannian manifold (for a rigorous deﬁnition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier deﬁnes an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “inﬁnitesimal distance” ds between two points whose P coordinates differ by the inﬁnitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes inﬁnite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is inﬁnite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modiﬁcation of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal deﬁnition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function deﬁned on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisﬁes vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisﬁes, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . , qm of size m drawn from the density f , deﬁne an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisﬁes limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Deﬁnition 3.2. For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see ﬁgure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisﬁes, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the deﬁnitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the deﬁnition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the deﬁnition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . Since K( up (q) h ) = 0 for all q with dp (q) > Rp (h), we have, ξh (p) = 1 hn K y ≤Rp (h) up (y) h ξ(y) g(y)dy 1 . . . dy n , (15) 4 Normal coordinates at a point p in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing ﬁrst derivatives at p, and gij (p) = δij [1]. Normal coordinates can be deﬁned in a “geodesic ball” of radius less than rinj . 5 where g denotes the determinant of g as calculated in normal coordinates. Changing the variable of integration to z = y/h, we get, K( z )dn z = ξh (p) − ξ(p) up (zh) h K z ≤Rp (h)/h up (zh) h K = z ≤1 ξ(zh) K z ≤1 z ≤1 g(zh) − 1 dn z + ξ(zh) up (zh) h K( z )dn z g(zh)dn z − ξ(0) ξ(zh) − K( z ) dn z + K( z ) (ξ(zh) − ξ(0)) dn z + z ≤1 K 1< z ≤Rp (h)/h up (zh) h ξ(zh) g(zh)dn z . Thus, K ( z ) dn z ≤ ξh (p) − ξ(p) (16) t∈R z ≤1 sup |ξ(zh)| . sup K( z ≤1 z ≤1 dn z + g(zh) − 1 . sup K(t) . sup |ξ(zh)| . sup z ≤1 (17) z ≤1 up (zh) ) − K( z ) . h dn z + (18) z ≤1 K( z )(ξ(zh) − ξ(0))dn z + (19) z ≤1 sup K(t) . t∈R sup g(zh) . 1< z ≤Rp (h)/h sup dn z . (20) |ξ(zh)| . 1< z ≤Rp (h)/h 1< z ≤Rp (h)/h Letting h → 0, the terms (17)-(20) approach zero at the following rates: (17): K(t) is bounded and ξ(y) is continuous at y = 0, so the ﬁrst two terms can be bounded by constants as h → 0. In normal coordinates y, gij (y) = δij + O( y 2 ) as y → 0, so, sup z ≤1 g(zh) − 1 = O(h2 ) as h → 0. (18): Since K is assumed to be differentiable with a bounded derivative in [0, 1), we get K(b) − u (zh) K(a) = O(b − a) as b → a. By Lemma 3.1 we have p h − z = O(h2 ) as h → 0. Thus, K up (zh) h − K( z ) = O(h2 ) as h → 0. (19): Since ξ(y) is assumed to have partial derivatives up to second order in a neighborhood of y(p) = 0, for z ≤ 1, Taylor’s theorem gives, n zi ξ(zh) = ξ(0) + h i=1 as h → 0. Since h → 0. z ≤1 zK( z )dn z = 0, we get ∂ξ(y) ∂y i z ≤1 + O(h2 ) (21) y=0 K( z )(ξ(zh) − ξ(0))dn z = O(h2 ) as (20): The ﬁrst three terms can be bounded by constants. By Lemma 3.3, Rp (h) = h + O(h3 ) as h → 0. A spherical shell 1 < z ≤ 1 + ǫ has volume O(ǫ) as ǫ → 0+ . Thus, the volume of 1 < z ≤ Rp (h)/h is O(Rp (h)/h − 1) = O(h2 ) as h → 0. Thus, the sum of the terms (17-20), is O(h2 ) as h → 0, as claimed in Theorem 3.1. 6 4 Bias, variance and mean squared error ˆ Let M , f , fm , K, p be as in Theorem 2.1 and assume hm → 0 as m → ∞. ˆ Lemma 4.1. Bias fm (p) = O(h2 ), as m → ∞. m u (q) Proof. We have Bias[fm (p)] = Bias h1 K ph m follows from Theorem 3.1 with ξ replaced with f . , so recalling Rn K( z )dn z = 1, the lemma Lemma 4.2. If in addition to hm → 0, we have mhn → ∞ as m → ∞, then, Var[fm (p)] = m O 1 mhn m , as m → ∞. Proof. Var[fm (p)] = 1 1 Var n K m hm up (q) hm (22) (23) Now, Var 1 K hn m up (q) hm =E 1 K2 h2n m up (q) hm 1 hn m f (q) − E 1 K hn m 2 up (q) hm , (24) and, E 1 K2 h2n m up (q) hm = M 1 2 K hn m up (q) hm dV (q) . (25) By Theorem 3.1, the integral in (25) converges to f (p) K 2 ( z )dn z, so, the right hand side of (25) is O 1 hn m ˆ Var[fm (p)] = O as m → ∞. By Lemma 4.1 we have, E 1 mhn m 1 hn K m up (q) hm 2 → f 2 (p). Thus, as m → ∞. ˆ ˆ ˆ Proof of Theorem 2.1 Finally, since MSE fm (p) = Bias2 [fm (p)] + Var[fm (p)], the theorem follows from Lemma 4.1 and 4.2. 5 Experiments and discussion We have empirically tested the estimator (3) on two datasets: A unit normal distribution mapped onto a piece of a spiral in the plane, so that n = 1 and N = 2, and a uniform distribution on the unit disc x2 + y 2 ≤ 1 mapped onto the unit hemisphere by (x, y) → (x, y, 1 − x2 + y 2 ), so that n = 2 and N = 3. We picked the bandwidths to be proportional to m−1/(n+4) where m is the number of data points. We performed live-one-out estimates of the density on the data points, and obtained the MSE for a range of ms. See Figure 5. 6 Conclusion and future work We have proposed a small modiﬁcation of the usual KDE in order to estimate the density of data that lives on an n-dimensional submanifold of RN , and proved that the rate of convergence of the estimator is determined by the intrinsic dimension n. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality n is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold M , the density f , and the kernel K. We ﬁnd it likely that our estimator or slight modiﬁcations of it will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold. 7 MSE MSE Mean squared error for the hemisphere data Mean squared error for the spiral data 0.000175 0.0008 0.00015 0.000125 0.0006 0.0001 0.000075 0.0004 0.00005 0.0002 0.000025 # of data points 50000 100000 150000 200000 # of data points 50000 100000 150000 200000 Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data. In each case, we ﬁt a curve of the form M SE(m) = amb , which gave b = −0.80 for the spiral and b = −0.69 for the hemisphere. Theorem 2.1 bounds the MSE by Cm−4/(n+4) , which gives the exponent as −0.80 for the spiral and −0.67 for the hemisphere. References [1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006. [2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM New York, NY, USA, 2006. [3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966. [4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990. [5] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008. [6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity . IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001. [7] J. Lee. Riemannian manifolds: an introduction to curvature. Springer Verlag, 1997. [8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962. [9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. [10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999. [11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006. [12] B. Silverman. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1986. [13] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in Neural Information Processing Systems, pages 849–856, 2003. 8

6 0.95798528 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

7 0.79244202 193 nips-2009-Potential-Based Agnostic Boosting

8 0.78929168 129 nips-2009-Learning a Small Mixture of Trees

9 0.78929061 128 nips-2009-Learning Non-Linear Combinations of Kernels

10 0.78832775 76 nips-2009-Efficient Learning using Forward-Backward Splitting

11 0.78729934 27 nips-2009-Adaptive Regularization of Weight Vectors

12 0.7775206 166 nips-2009-Noisy Generalized Binary Search

13 0.77443302 72 nips-2009-Distribution Matching for Transduction

14 0.77237183 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

15 0.76705927 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.76401615 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

17 0.76057839 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

18 0.75677019 180 nips-2009-On the Convergence of the Concave-Convex Procedure

19 0.75630265 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

20 0.75186688 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains