nips nips2009 nips2009-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein
Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We present a novel approach which employs a randomized sequence of pruning masks. [sent-8, score-0.543]
2 Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. [sent-9, score-0.378]
3 Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. [sent-10, score-0.373]
4 Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. [sent-11, score-0.283]
5 Problem scale comes from a combination of large constant factors (such as the massive grammar sizes in monolingual parsing) and high-degree algorithms (such as the many dimensions of bitext parsing). [sent-13, score-0.416]
6 For example, in monolingual parsing, entire labeled spans may be skipped on the basis of posterior probabilities in a coarse grammar [17, 7]. [sent-15, score-0.448]
7 Unfortunately, aggressive pruning introduces biases in the resulting expectations. [sent-17, score-0.52]
8 In randomized pruning, multiple pruning masks are used in sequence. [sent-21, score-0.706]
9 Our approach is based on the idea of auxiliary variable sampling [31], where a set of auxiliary variables formalizes the idea of a pruning mask. [sent-24, score-1.139]
10 Resampling the auxiliary variables changes the mask at each iteration, so that the portion of the chart that is unconstrained at a given iteration can improve the mask for subsequent iterations. [sent-25, score-1.064]
11 In other words, pruning decisions are continuously revisited and revised. [sent-26, score-0.411]
12 The settings in which randomized pruning will be most advantageous will be those in which high-order dynamic programs can be vastly sped up by masking, yet no single aggressive mask is likely to be adequate. [sent-30, score-1.174]
13 2 Randomized pruning; 2.1 The need for expectations. [sent-31, score-0.671]
14 Figure 1: A parse tree (a) and the corresponding chart cells (b), from which the assignment vector (c) is extracted; the surrounding text defines expectations of the form f(ti, wi) − Eθ[f(T, wi) | y(T) = wi], computed under Pθ(T = ti | y(T) = wi). [sent-48, score-0.639]
15 Not shown are the labels of the dynamic programming chart cells. [sent-49, score-0.298]
16 The same dynamic program (the inside-outside algorithm) computes constituent posteriors for the chart cells shown in Figure 1. [sent-54, score-0.287]
17 Hence, computing expectations is the bottleneck here; in practice, of course, we are only interested in the behavior after a finite number of iterations, although more iterations can be performed with a guarantee of convergence to the true expectations. [sent-55, score-0.553]
18 Here, we investigate empirical performance on English-Chinese bitext parsing, showing that the bias of our randomized pruning method decreases over time. [sent-58, score-0.744]
19 Such a method would be useless if it did not outperform previous heuristics in a time range bounded by the exact computation. [sent-59, score-0.613]
20 We therefore also show that it outperforms standard single-mask tic-tac-toe pruning [40], achieving lower bias over a range of total computation times. [sent-61, score-0.778]
21 Our technique is orthogonal to approaches that use parallel computation [9, 38], and can be additionally parallelized at the sentence level. [sent-62, score-1.201]
22 As in the case of monolingual parsing, expectations in bitext parsing are usually approximated with a single pruning mask. [sent-63, score-0.712]
23 Note, however, that the applicability of this approach is not limited to parsing. [sent-64, score-0.32]
24 A pruning mask is a map from the set M of all possible spans to the set {prune, keep}, indicating whether a given span is to be ignored. [sent-68, score-0.695]
25 The settings in which randomized pruning will be most advantageous will be those in which high-order dynamic programs can be vastly sped up by masking, yet no single aggressive mask is likely to be adequate. [sent-72, score-1.585]
26 Algorithms for discriminative training, consensus decoding, and unsupervised learning typically involve repetitively computing a large number of expectations. [sent-74, score-0.7]
27 To train probabilistic parsers, for example [18, 32], one needs to repeatedly parse the entire training set in order to compute the necessary expected feature counts. [sent-76, score-0.31]
28 Hence, computing expectations is the bottleneck; here {wi : i ∈ I} are the training sentences with corresponding gold trees {ti}. [sent-83, score-0.336]
29 Constituent posteriors are computed for all possible spans of words (the chart cells in Figure 1). [sent-86, score-0.376]
30 While it is not impossible to calculate these expectations exactly, it is computationally very expensive, limiting previous work to toy setups with 15-word sentences [18, 32, 35], or necessitating aggressive pruning [26, 12] that is not well understood. [sent-89, score-0.917]
31 2.2 Approximate expectations with a single pruning mask. [sent-90, score-1.082]
32 In the case of monolingual parsing, the computation of feature count expectations is usually approximated with a pruning mask, which allows the omission of low probability constituents. [sent-93, score-1.655]
33 Formally, a pruning mask is a map from the set M of all possible spans to the set {prune, keep}, indicating whether a given span is to be ignored. [sent-94, score-0.893]
34 However, the expected feature counts Em [f ] computed by pruned inside-outside with a single mask m are not exact, introducing a systematic error and biasing the model in undesirable ways. [sent-96, score-0.43]
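To make the two preceding sentences concrete, the sketch below shows one way a pruning mask over spans and a pruned inside pass could be written. The unlabeled chart, the dict-based mask, and the grammar_score callable are illustrative simplifications and assumptions, not the paper's implementation (which uses a full labeled inside-outside algorithm).

```python
# Illustrative sketch (not the paper's code): a pruning mask over spans (i, j)
# and a CKY-style inside pass that skips pruned cells. Nonterminal labels are
# omitted for brevity; grammar_score(words, i, j, k) is an assumed callable
# returning the score of building span (i, j) with split point k (k=None for words).

def all_spans(n):
    """All spans (i, j) with 0 <= i < j <= n for a sentence of length n."""
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]

def make_mask(n, pruned_spans):
    """A mask maps every span to 'prune' or 'keep'."""
    pruned = set(pruned_spans)
    return {span: ("prune" if span in pruned else "keep") for span in all_spans(n)}

def pruned_inside(words, grammar_score, mask):
    """Inside pass that skips chart cells marked 'prune'; the skipping is the speedup."""
    n = len(words)
    inside = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if mask[(i, j)] == "prune":
                continue  # the entire cell is omitted from the dynamic program
            if length == 1:
                inside[(i, j)] = grammar_score(words, i, j, None)
                continue
            total = 0.0
            for k in range(i + 1, j):
                if mask[(i, k)] == "keep" and mask[(k, j)] == "keep":
                    total += grammar_score(words, i, j, k) * inside[(i, k)] * inside[(k, j)]
            inside[(i, j)] = total
    return inside
```

With a single fixed mask, the expectations read off such a chart are exactly the biased Em[f] described in the sentence above.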
35 2.3 Approximate expectations with a sequence of masks: To reduce the bias resulting from the use of a single pruning mask, we propose a novel algorithm that can combine several masks. [sent-98, score-0.821]
36 The key is to define a set of auxiliary variables, and we present this construction in more detail in the following sections. [sent-104, score-0.3]
37 The masks are defined via two vector-valued Markov chains: a selection chain with current value denoted by s, and an assignment chain with current value a. [sent-106, score-0.432]
38 Our masks m = m(s, a) are generated deterministically from the selection and assignment vectors. [sent-110, score-0.324]
39 The deterministic procedure uses s to pick a few spans and values to fix from a, forming a mask m. [sent-111, score-0.415]
40 Note that a single span ι that is both positive and selected implies that all of the spans κ crossing ι should be pruned (i.e., mκ ← prune). [sent-112, score-0.433]
41 This compilation of the pruning constraints is described in Algorithm 2. [sent-115, score-0.411]
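Algorithm 2 itself does not appear on this page; the sketch below is one plausible reading of the compilation just described, under the assumption that a positive selected span prunes every crossing span while a negative selected span prunes only its own cell (function names are invented for illustration).

```python
def crosses(a, b):
    """Two spans cross if they overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return (i < k < j < l) or (k < i < l < j)

def create_mask(assignment, selection, spans):
    """Compile a pruning mask from the current selection vector s and assignment vector a.

    assignment: dict span -> '+' or '-' (current value of the assignment chain)
    selection:  the selected spans (current value of the selection chain)
    spans:      the set M of all spans
    """
    mask = {span: "keep" for span in spans}
    for sel in selection:
        if assignment[sel] == "+":
            # a positive selected span implies all crossing spans are pruned
            for span in spans:
                if crosses(sel, span):
                    mask[span] = "prune"
        else:
            # a negative selected span prunes only that single chart cell
            mask[sel] = "prune"
    return mask
```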
42 We can now summarize how randomized pruning works (see Algorithm 1 for pseudocode). [sent-118, score-0.543]
43 Once a new mask m has been precomputed from the current selection vector and assignments, pruned inside-outside scores are computed using this mask. [sent-121, score-0.494]
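Read together with the surrounding sentences, Algorithm 1 alternates mask compilation, pruned inside-outside, soft-count accumulation, tree sampling, and chain updates. The loop below is a schematic reconstruction only; prunedinside, prunedoutside, soft_counts, sample_tree, and resample_selection are assumed interfaces, the sampled tree t is assumed to be representable as a set of spans, and counts as a numeric vector.

```python
def randomized_pruning(words, spans, num_iters, prunedinside, prunedoutside,
                       soft_counts, sample_tree, resample_selection, init_assignment):
    """Schematic randomized-pruning loop; reuses create_mask from the sketch above."""
    a = dict(init_assignment)      # assignment chain: span -> '+'/'-'
    s = resample_selection(a)      # selection chain: currently selected auxiliary variables
    accumulated, n = None, 0
    for _ in range(num_iters):
        m = create_mask(a, s, spans)                 # compile the current pruning mask
        inside = prunedinside(words, m)              # pruned inside scores
        outside = prunedoutside(words, m, inside)    # pruned outside scores
        counts = soft_counts(inside, outside)        # expected feature counts under this mask
        accumulated = counts if accumulated is None else accumulated + counts
        n += 1
        t = sample_tree(inside, m)                   # sample a tree from the pruned chart
        for span in spans:                           # excluded auxiliaries reset from the tree
            if span not in s:
                a[span] = "+" if span in t else "-"
        s = resample_selection(a)                    # k*: choose the next Gibbs operator
    return accumulated / n                           # averaged (approximate) expectations
```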
44 We start by developing the form of the posterior distribution over trees when there is a single selected auxiliary variable. [sent-132, score-0.416]
45 If aι = −, sampling from T | Aι = aι requires the same dynamic program as for exact sampling, except that a single cell in the chart is pruned (the cell ι), using PrunedInside(w, m) and PrunedOutside(w, m). [sent-135, score-0.8]
46 Consider the problem of jointly resampling the block containing T and a collection of excluded auxiliary variables {Aι : ι ∉ s} given a collection of selected ones. [sent-143, score-0.859]
47 The chart computed by the pruned dynamic program (described in Algorithm 3) can be used in two ways: first, to calculate the expected feature counts in the standard way; second, to sample a tree. [sent-149, score-0.386]
48 This step can be written as a transition kernel on the space {+, −}^|M| of auxiliary variable assignments: ks(a, a′). [sent-156, score-0.655]
49 We will denote this property by [ι ∈ t]. [sent-164, score-0.364]
50 This mechanism corresponds to picking which Gibbs operator ks will be used to transition in the Markov chain on assignments described above. [sent-166, score-0.753]
51 The original state space (the space of trees) does not easily decompose into a graphical model where textbook Gibbs sampling could be applied, so we first augment the state space with auxiliary variables. [sent-185, score-0.344]
52 Broadly speaking, an auxiliary variable is a state augmentation such that the target distribution is a marginal of the expanded distribution. [sent-186, score-0.393]
53 It is called auxiliary because the parts of the samples corresponding to the augmentation are discarded at the end of the computation. [sent-187, score-0.359]
54 The auxiliary variable corresponding to span ι ∈ M will be denoted by Aι . [sent-191, score-0.401]
55 We define the auxiliary variables by specifying their conditional distribution Aι |(T = t). [sent-192, score-0.35]
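The conditional itself is not spelled out on this page, but the indicator functions in the decomposition quoted further below imply that it is the deterministic constituency indicator; a reconstruction:

```latex
% Reconstruction (consistent with the decomposition below): given the tree,
% the auxiliary variable for span \iota deterministically records constituency.
P(A_\iota = + \mid T = t) = \mathbf{1}[\iota \in t], \qquad
P(A_\iota = - \mid T = t) = \mathbf{1}[\iota \notin t].
```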
56 The blocks of resampled variables will always contain T as well as a subset of the excluded auxiliary variables. [sent-198, score-0.474]
57 Note that when conditioning on all of the auxiliary variables, the posterior distribution on T is deterministic. [sent-199, score-0.371]
58 We start by developing the form of the posterior distribution over trees when there is a single selected auxiliary variable, i.e., when s consists of a single span ι. [sent-204, score-0.416]
59 If aι = −, sampling from T | Aι = aι requires the same dynamic program as for exact sampling, except that a single cell in the chart is pruned (the cell ι). [sent-207, score-0.699]
60 Consider now the problem of jointly resampling the block containing T and a collection of excluded auxiliary variables {Aι : ι ∉ s} given a collection of selected ones. [sent-214, score-0.706]
61 We can write the decomposition: P(T = t, S | C) = P(T = t | C) ∏_{ι ∉ s} P(Aι = aι | T = t) = P(T = t | C) ∏_{ι ∉ s} 1{aι = [ι ∈ t]}, where S = {Aι = aι : ι ∉ s} is a configuration of the excluded auxiliary variables and C = {Aι = aι : ι ∈ s} is a configuration of the selected ones. [sent-215, score-0.521]
62 The first factor in the second line is again a pruned dynamic program (described in Algorithm 3). [sent-216, score-0.36]
63 The product of indicator functions shows that once a tree has been picked, the excluded auxiliary variables can be set to new values deterministically by reading from the sampled tree t whether ι is a constituent, for each ι ∉ s. [sent-217, score-0.596]
64 Since this kernel depends on the previous state only through the assignments of the auxiliary variables, we can also write it as a transition kernel on the space {+, −}^|M| of auxiliary variable assignments: ks(a, a′). [sent-219, score-0.864]
65 The selection chain: there is a separate mechanism, k∗, that updates at each iteration the selection s of the auxiliary variables. [sent-221, score-0.506]
66 This mechanism corresponds to picking which Gibbs operator ks will be used to transition in the Markov chain on assignments described above. [sent-222, score-0.279]
67 To understand why, recall that in the situation where (Aι = −), a single cell in the chart is pruned, whereas in the case where (Aι = +), a large fraction of the chart can be ignored. [sent-228, score-0.411]
68 Note that the algorithm can recover from mistakes in the simpler model, since the assignments of the auxiliary variables are also resampled. [sent-230, score-0.415]
69 First, it uses a simpler model (for instance a grammar with fewer non-terminal symbols) to pick a subset M′ ⊆ M of the spans that have high posterior probability. [sent-241, score-0.281]
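The exact rule by which k∗ draws the selection is not reproduced here; the factory below makes the simplest concrete assumption, namely a small uniform-random subset of the high-confidence spans M′, and ignores the dependence on the current assignment discussed in the next sentence.

```python
import random

def selection_sampler(high_confidence_spans, k=5, seed=None):
    """Return a k*-style sampler over subsets of the high-confidence set M' (an assumed rule)."""
    rng = random.Random(seed)
    pool = list(high_confidence_spans)

    def resample_selection(assignment):
        # Uniform choice of k spans from M'; the assignment argument is accepted but
        # unused in this simplified sketch.
        return set(rng.sample(pool, min(k, len(pool))))

    return resample_selection
```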
70 Given the asymmetric effect between conditioning on positive versus negative auxiliary variables, it is tempting to let k∗ depend on the current assignment of the auxiliary variables. [sent-246, score-0.689]
71 More formally, the accelerated averaging method consists of adding the following soft count instead: ∫ f(x) k_{S(i)}(X(i), dx), which can be computed with one extra pruned outside computation in our parsing setup. [sent-272, score-0.448]
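Written out, the plain Monte Carlo average over sampled states is replaced by an average of one-step kernel expectations (a Rao-Blackwellization); a reconstruction of the garbled integral above:

```latex
% Standard average over n samples versus the accelerated average reconstructed
% from the soft-count expression quoted above.
\hat{E}[f] = \frac{1}{n}\sum_{i=1}^{n} f\bigl(X^{(i)}\bigr)
\quad\text{becomes}\quad
\hat{E}_{\mathrm{acc}}[f] = \frac{1}{n}\sum_{i=1}^{n} \int f(x)\, k_{S^{(i)}}\bigl(X^{(i)}, dx\bigr).
```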
72 4 Experiments: While we used the task of monolingual parsing to illustrate our randomized pruning procedure, the technique is most powerful when the dynamic program is a higher-order polynomial. [sent-275, score-1.113]
73 We therefore demonstrate the utility of randomized pruning on a bitext parsing task. [sent-276, score-0.97]
74 In bitext parsing, we have sentence-aligned corpora from two languages, and are computing expectations over aligned parse trees [6, 28]. [sent-277, score-0.547]
75 Our auxiliary variable sampling scheme also substantially outperforms the tic-tac-toe pruning heuristic (d). [sent-289, score-0.789]
76 We picked this particular bitext parsing formalism for two reasons. [sent-296, score-0.477]
77 Second, the randomized pruning method is most competitive in cases where the dynamic program has a sufficiently high degree. [sent-299, score-0.736]
78 We did experiments on monolingual parsing that showed that the improvements were not significant for most sentence lengths, and were inferior to the coarse-to-fine method of [25]. [sent-300, score-0.455]
79 The bitext parsing version of the randomized pruning algorithm is very similar to the monolingual case. [sent-301, score-1.104]
80 Rather than being over constituent spans, our auxiliary variables in the bitext case are over induced alignments of synchronous derivations. [sent-302, score-0.736]
81 Given two aligned sentences of lengths p and q, the auxiliary variables Ai,j are the pq binary random variables indicating whether word i is aligned with word j. [sent-305, score-0.496]
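In code, these bitext auxiliary variables amount to a p by q boolean alignment grid; the representation below, including initialization from a set of high-precision links, is an illustrative assumption.

```python
import numpy as np

def alignment_auxiliaries(p, q, sure_links=()):
    """A[i, j] is True when word i of the first sentence is aligned with word j of the second."""
    A = np.zeros((p, q), dtype=bool)
    for i, j in sure_links:
        A[i, j] = True   # e.g. intersected aligner output, as mentioned a few sentences below
    return A
```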
82 To compare our approximate inference procedure to exact inference, we follow previous work [15, 29] and measure the L2 distance between the pruned expectations and the exact expectations. [sent-306, score-0.418]
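The evaluation metric is simple to implement; a minimal version over expectation vectors keyed by feature is sketched below (the dictionary representation is an assumption).

```python
import math

def l2_distance(exact, approx):
    """L2 distance between two feature-expectation vectors stored as feature -> value dicts."""
    feats = set(exact) | set(approx)
    return math.sqrt(sum((exact.get(f, 0.0) - approx.get(f, 0.0)) ** 2 for f in feats))
```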
83 Note that our pruning method, in contrast, can handle much longer sentences without problem—one pass through all 1493 sentences with a product length of less than 1000 took 28 minutes on one 2. [sent-310, score-0.682]
84 We used the BerkeleyAligner [21] to obtain high-precision, intersected alignments to construct the high-confidence set M′ of auxiliary variables needed for k∗ (Section 3). [sent-312, score-0.4]
85 For randomized pruning to be efficient, we need to be able to extract a large number of samples within the time required for computing the exact expectations. [sent-314, score-0.58]
86 Figure 4(a) shows the average time required to compute the full dynamic program and the dynamic program required to extract a single sample for varying sentence product lengths. [sent-315, score-0.464]
87 To determine the number of samples in this experiment, we measured the time required for exact inference, and ran the auxiliary variable sampler for half of that time. [sent-321, score-0.45]
88 The main point of Figure 4(c) is to show that under realistic running time conditions, the bias of the auxiliary variable sampler stays roughly constant as a function of sentence length. [sent-322, score-0.561]
89 Finally, we compared the auxiliary variable algorithm to tic-tac-toe pruning, a heuristic proposed in [40] and improved in [41]. [sent-323, score-0.334]
90 With tic-tac-toe, aggressiveness is increased via the cut-off threshold, while with the auxiliary variable sampler, it is controlled by letting the sampler run for more iterations. [sent-329, score-0.413]
91 The plot shows that there is a large regime where the auxiliary variable algorithm dominates tic-tac-toe for this task. [sent-331, score-0.334]
92 When the goal is to maximize an objective function, simple beam pruning [10] can be sufficient. [sent-334, score-0.458]
93 However, as argued in [4], beam pruning is not appropriate for computing expectations because the resulting approximation is too concentrated around the mode. [sent-335, score-0.635]
94 Their approach is quite different to ours as no auxiliary variables are used. [sent-337, score-0.35]
95 In computer vision, in particular, an auxiliary variable sampler developed by [30] is widely used for image segmentation [27]. [sent-339, score-0.413]
96 6 Conclusion: Mask-based pruning is an effective way to speed up large dynamic programs for calculating feature expectations. [sent-340, score-0.521]
97 Our results show that, at least for bitext parsing, using many randomized aggressive masks generated with an auxiliary variable sampler is superior in time and bias to using a single, more conservative one. [sent-342, score-1.071]
98 Randomized pruning will be most advantageous when high-order dynamic programs can be vastly sped up by masking, yet no single aggressive mask is likely to be adequate. [sent-344, score-1.042]
99 Bilingual parsing with factored estimation: Using english to parse korean. [sent-503, score-0.343]
100 Scalable discriminative learning for natural language parsing and translation. [sent-548, score-0.297]
wordName wordTfidf (topN-words)
[('pruning', 0.411), ('auxiliary', 0.3), ('mask', 0.263), ('parsing', 0.243), ('chart', 0.188), ('bitext', 0.184), ('expectations', 0.177), ('pruned', 0.167), ('masks', 0.163), ('spans', 0.152), ('monolingual', 0.134), ('randomized', 0.132), ('excluded', 0.124), ('sentences', 0.121), ('dynamic', 0.11), ('aggressive', 0.109), ('emnlp', 0.102), ('prune', 0.101), ('ks', 0.101), ('parse', 0.1), ('grammar', 0.098), ('constituent', 0.094), ('acl', 0.091), ('prunedinside', 0.083), ('program', 0.083), ('sampler', 0.079), ('chain', 0.078), ('sentence', 0.078), ('resampling', 0.072), ('keep', 0.071), ('bias', 0.07), ('ti', 0.067), ('span', 0.067), ('sped', 0.067), ('wi', 0.066), ('assignments', 0.065), ('selection', 0.064), ('augmentation', 0.059), ('synchronous', 0.058), ('guration', 0.058), ('em', 0.056), ('mcmc', 0.054), ('discriminative', 0.054), ('masking', 0.054), ('alignments', 0.05), ('bilingual', 0.05), ('blunsom', 0.05), ('constituents', 0.05), ('createmask', 0.05), ('prunedoutside', 0.05), ('variables', 0.05), ('assignment', 0.049), ('aligned', 0.048), ('deterministically', 0.048), ('gibbs', 0.047), ('vastly', 0.047), ('beam', 0.047), ('selected', 0.047), ('loop', 0.046), ('sampling', 0.044), ('omission', 0.044), ('merit', 0.044), ('return', 0.044), ('necessitating', 0.04), ('conditioning', 0.04), ('trees', 0.038), ('computation', 0.038), ('collection', 0.038), ('grammars', 0.038), ('exact', 0.037), ('block', 0.037), ('tree', 0.037), ('rk', 0.037), ('applicability', 0.037), ('cells', 0.036), ('continue', 0.036), ('mechanism', 0.035), ('advantageous', 0.035), ('cell', 0.035), ('outer', 0.034), ('variable', 0.034), ('justify', 0.034), ('alternation', 0.034), ('approxin', 0.033), ('bilistic', 0.033), ('enumerations', 0.033), ('skipped', 0.033), ('slav', 0.033), ('terminals', 0.033), ('kernel', 0.032), ('markov', 0.032), ('paths', 0.031), ('limiting', 0.031), ('posterior', 0.031), ('inside', 0.031), ('kernels', 0.03), ('consensus', 0.029), ('length', 0.029), ('repetitively', 0.029), ('isotonic', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein
Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1
2 0.10346539 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
3 0.082089193 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
Author: Finale Doshi-velez, Shakir Mohamed, Zoubin Ghahramani, David A. Knowles
Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible. 1
4 0.080322638 175 nips-2009-Occlusive Components Analysis
Author: Jörg Lücke, Richard Turner, Maneesh Sahani, Marc Henniges
Abstract: We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods. 1
5 0.078447595 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
Author: Kuzman Ganchev, Ben Taskar, Fernando Pereira, João Gama
Abstract: We address the problem of learning structured unsupervised models with moment sparsity typical in many natural language induction tasks. For example, in unsupervised part-of-speech (POS) induction using hidden Markov models, we introduce a bias for words to be labeled by a small number of tags. In order to express this bias of posterior sparsity as opposed to parametric sparsity, we extend the posterior regularization framework [7]. We evaluate our methods on three languages — English, Bulgarian and Portuguese — showing consistent and significant accuracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing Dirichlet priors trained by variational EM. We increase accuracy with respect to EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weaklysupervised setting where the closed-class words are provided. Finally, we show improvements using our method when using the induced clusters as features of a discriminative model in a semi-supervised setting. 1
6 0.072835341 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
7 0.070892088 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
8 0.067693658 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
9 0.06297864 226 nips-2009-Spatial Normalized Gamma Processes
10 0.061676499 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
11 0.060794272 74 nips-2009-Efficient Bregman Range Search
12 0.058488674 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
13 0.058221847 129 nips-2009-Learning a Small Mixture of Trees
14 0.054215632 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
15 0.052842639 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
16 0.051826503 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
17 0.051788878 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
18 0.051535394 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
19 0.051243771 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
20 0.050497536 246 nips-2009-Time-Varying Dynamic Bayesian Networks
topicId topicWeight
[(0, -0.183), (1, -0.018), (2, -0.012), (3, -0.075), (4, -0.006), (5, -0.079), (6, 0.053), (7, 0.023), (8, -0.019), (9, -0.052), (10, -0.011), (11, -0.052), (12, 0.038), (13, 0.032), (14, 0.003), (15, 0.028), (16, 0.017), (17, -0.042), (18, 0.013), (19, -0.019), (20, 0.008), (21, -0.036), (22, 0.049), (23, 0.059), (24, -0.009), (25, 0.002), (26, 0.022), (27, 0.15), (28, 0.104), (29, 0.029), (30, -0.091), (31, -0.028), (32, -0.163), (33, -0.078), (34, -0.025), (35, -0.088), (36, 0.031), (37, -0.035), (38, -0.002), (39, 0.045), (40, 0.005), (41, -0.054), (42, -0.095), (43, 0.007), (44, -0.01), (45, -0.073), (46, -0.028), (47, -0.028), (48, 0.074), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.94588035 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein
Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1
2 0.6130594 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
Author: Theodore J. Perkins
Abstract: Continuous-time Markov chains are used to model systems in which transitions between states as well as the time the system spends in each state are random. Many computational problems related to such chains have been solved, including determining state distributions as a function of time, parameter estimation, and control. However, the problem of inferring most likely trajectories, where a trajectory is a sequence of states as well as the amount of time spent in each state, appears unsolved. We study three versions of this problem: (i) an initial value problem, in which an initial state is given and we seek the most likely trajectory until a given final time, (ii) a boundary value problem, in which initial and final states and times are given, and we seek the most likely trajectory connecting them, and (iii) trajectory inference under partial observability, analogous to finding maximum likelihood trajectories for hidden Markov models. We show that maximum likelihood trajectories are not always well-defined, and describe a polynomial time test for well-definedness. When well-definedness holds, we show that each of the three problems can be solved in polynomial time, and we develop efficient dynamic programming algorithms for doing so. 1
3 0.51533026 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
4 0.50173813 51 nips-2009-Clustering sequence sets for motif discovery
Author: Jong K. Kim, Seungjin Choi
Abstract: Most of existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences where we group sets associated with the same motif into a cluster, assuming that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery where we identify multiple motifs through searching for patterns which are shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, where these two tasks interact with each other. We show that our model can handle various motif discovery problems, depending on how to construct multiple sets of sequences. Experiments on three different problems for discovering DNA motifs emphasize the useful behavior and confirm the substantial gains over existing methods where only a single set of sequences is considered.
5 0.4949244 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum
Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of realworld tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision. 1
6 0.49028048 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
7 0.48977199 48 nips-2009-Bootstrapping from Game Tree Search
8 0.48302871 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
9 0.48179531 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale
10 0.47925106 233 nips-2009-Streaming Pointwise Mutual Information
11 0.47621217 129 nips-2009-Learning a Small Mixture of Trees
12 0.46305946 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs
13 0.4611598 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
14 0.46007976 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
15 0.45162627 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
16 0.44562811 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
17 0.44431189 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
18 0.44009107 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
19 0.43555823 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
20 0.42711544 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
topicId topicWeight
[(24, 0.031), (25, 0.043), (35, 0.054), (36, 0.562), (39, 0.03), (58, 0.048), (61, 0.034), (71, 0.043), (81, 0.017), (86, 0.05)]
simIndex simValue paperId paperTitle
1 0.98650515 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling
Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu
Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1
same-paper 2 0.98530054 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein
Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning. 1
3 0.9778657 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
Author: Saketha N. Jagarlapudi, Dinesh G, Raman S, Chiranjib Bhattacharyya, Aharon Ben-tal, Ramakrishnan K.r.
Abstract: Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs l∞ regularization for promoting combinations at the component level and l1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms stateof-the-art MKL solvers like simpleMKL in terms of computational effort. 1
4 0.97626889 47 nips-2009-Boosting with Spatial Regularization
Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang
Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1
5 0.96967143 238 nips-2009-Submanifold density estimation
Author: Arkadas Ozakin, Alexander G. Gray
Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory. 1 Introduction: Density estimation and the curse of dimensionality Kernel density estimation (KDE) [8] is one of the most popular methods for estimating the underlying probability density function (PDF) of a dataset. Roughly speaking, KDE consists of having the data points “contribute” to the estimate at a given point according to their distances from the ˆ point. In the simplest multi-dimensional KDE [3], the estimate fm (y0 ) of the PDF f (y0 ) at a point N y0 ∈ R is given in terms of a sample {y1 , . . . , ym } as, 1 ˆ fm (y0 ) = m m i=1 1 K hN m yi − y0 hm , (1) where hm > 0, the bandwidth, is chosen to approach to zero at a suitable rate as the number m of data points increases, and K : [0.∞) → [0, ∞) is a kernel function that satisfies certain properties such as boundedness. Various theorems exist on the different types of convergence of the estimator to the correct result and the rates of convergence. The earliest result on the pointwise convergence rate in the multivariable case seems to be given in [3], where it is stated that under certain conditions for f and K, assuming hm → 0 and mhm → ∞ as m → ∞, the mean squared ˆ ˆ error in the estimate f (y0 ) of the density at a point goes to zero with the rate, MSE[fm (y0 )] = E ˆ fm (y0 ) − f (y0 ) 2 = O h4 + m 1 mhN m as m → ∞. If hm is chosen to be proportional to m−1/(N +4) , one gets, ˆ MSE[fm (p)] = O 1 m4/(N +4) , (2) as m → ∞. This is an example of a curse of dimensionality; the convergence rate slows as the dimensionality N of the data set increases. In Table 4.2 of [12], Silverman demonstrates how the sample size required for a given mean square error for the estimate of a multivariable normal distribution increases with the dimensionality. The numbers look as discouraging as the formula 2. 1 One source of optimism towards various curses of dimensionality is the fact that although the data for a given problem may have many features, in reality the intrinsic dimensionality of the “data subspace” of the full feature space may be low. This may result in there being no curse at all, if the performance of the method/algorithm under consideration can be shown to depend only on the intrinsic dimensionality of the data. Alternatively, one may be able to avoid the curse by devising ways to work with the low-dimensional data subspace by using dimensional reduction techniques on the data. One example of the former case is the results on nearest neighbor search [6, 2] which indicate that the performance of certain nearest-neighbor search algortihms is determined not by the full dimensionality of the feature space, but only on the intrinsic dimensionality of the data subspace. 
Riemannian manifolds. In this paper, we will assume that the data subspace is a Riemannian manifold. Riemannian manifolds provide a generalization of the notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the two-dimensional case (and by Riemann in the general case) it turns out that intrinsic features of the geometry of a surface such as lengths of its curves or intrinsic distances between its points, etc., can be given in terms of the so-called metric tensor1 g without referring to the particular way the the surface is embedded in R3 . A space whose geometry is defined in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see, e.g., [5, 7, 1]). Previous work. In [9], Pelletier defines an estimator of a PDF on a Riemannian manifold M by using the distances measured on M via its metric tensor, and obtains the same convergence rate as in (2), with N being replaced by the dimensionality of the Riemannian manifold. Thus, if we know that the data lives on a Riemannian manifold M , the convergence rate of this estimator will be determined by the dimensionality of M , instead of the full dimensionality of the feature space on which the data may have been originally sampled. While an interesting generalization of the usual KDE, this approach assumes that the data manifold M is known in advance, and that we have access to certain geometric quantities related to this manifold such as intrinsic distances between its points and the so-called volume density function. Thus, this Riemannian KDE cannot be used directly in a case where the data lives on an unknown Riemannian submanifold of RN . Certain tools from existing nonlinear dimensionality reduction methods could perhaps be utilized to estimate the quantities needed in the estimator of [9], however, a more straightforward method that directly estimates the density of the data as measured in the subspace is desirable. Other related works include [13], where the authors propose a submanifold density estimation method that uses a kernel function with a variable covariance but do not present theorerical results, [4] where the author proposes a method for doing density estimation on a Riemannian manifold by using the eigenfunctions of the Laplace-Beltrami operator, which, as in [9], assumes that the manifold is known in advance, together with intricate geometric information pertaining to it, and [10, 11], which discuss various issues related to statistics on a Riemannian manifold. This paper. In this paper, we propose a direct way to estimate the density of Euclidean data that lives on a Riemannian submanifold of RN with known dimension n < N . We prove the pointwise consistency of the estimator, and prove bounds on its convergence rates given in terms of the intrinsic dimension of the submanifold the data lives in. This is an example of the avoidance of the curse of dimensionality in the manner mentioned above, by a method whose performance depends on the intrinsic dimensionality of the data instead of the full dimensionality of the feature space. Our method is practical in that it works with Euclidean distances on RN . In particular, we do not assume any knowledge of the quantities pertaining to the intrinsic geometry of the underlying submanifold such as its metric tensor, geodesic distances between its points, its volume form, etc. 2 The estimator and its convergence rate Motivation. 
In this paper, we are concerned with the estimation of a PDF that lives on an (unknown) n-dimensional Riemannian submanifold M of RN , where N > n. Usual, N -dimensional kernel density estimation would not work for this problem, since if interpreted as living on RN , the 1 The metric tensor can be thought of as giving the “infinitesimal distance” ds between two points whose P coordinates differ by the infinitesimal amounts (dy 1 , . . . , dy N ) as ds2 = ij gij dy i dy j . 2 underlying PDF would involve a “delta function” that vanishes when one moves away from M , and “becomes infinite” on M in order to have proper normalization. More formally, the N -dimensional probability measure for such an n-dimensional PDF on M will have support only on M , will not be absolutely continuous with respect to the Lebesgue measure on RN , and will not have a probability density function on RN . If one attempts to use the usual, N -dimensional KDE for data drawn from such a probability measure, the estimator will “try to converge” to a singular PDF, one that is infinite on M , zero outside. In order to estimate the probability density function on M by using data given in RN , we propose a simple modification of usual KDE on RN , namely, to use a kernel that is normalized for n-dimensions instead of N , while still using the Euclidean distances in RN . The intuition behind this approach is based on three facts: 1) For small distances, an n-dimensional Riemannian manifold “looks like” Rn , and densities in Rn should be estimated by an n-dimensional kernel, 2) For points of M that are close enough to each other, the intrinsic distances as measured on M are close to Euclidean distances as measured in RN , and, 3) For small bandwidths, the main contribution to the estimate at a point comes from data points that are nearby. Thus, as the number of data points increases and the bandwidth is taken to be smaller and smaller, estimating the density by using a kernel normalized for n-dimensions and distances as measured in RN should give a result closer and closer to the correct value. We will next give the formal definition of the estimator motivated by these considerations, and state our theorem on its asymptotics. As in the original work of Parzen [8], the proof that the estimator is asymptotically unbiased consists of proving that as the bandwidth converges to zero, the kernel function becomes a “delta function”. This result is also used in showing that with an appropriate choice of vanishing rate for the bandwidth, the variance also vanishes asymptotically, hence the estimator is pointwise consistent. Statement of the theorem Let M be an n-dimensional, embedded, complete Riemannian submanifold of RN (n < N ) with an induced metric g and injectivity radius rinj > 0.2 Let d(p, q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and let u(p, q) be the geodesic (linear) distance between p and q as measured in RN . Note that u(p, q) ≤ d(p, q). We will use the notation up (q) = u(p, q) and dp (q) = d(p, q). We will denote the Riemannian volume measure on M by V , and the volume form by dV . Theorem 2.1. Let f : M → [0, ∞) be a probability density function defined on M (so that the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continous function that satisfies vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies, n z ≤1 K( z )d z = 1. Assume f is differentiable to second order in a neighborhood of p ∈ M , ˆ and for a sample q1 , . . . 
, qm of size m drawn from the density f , define an estimator fm (p) of f (p) as, m 1 1 up (qj ) ˆ fm (p) = (3) K n m j=1 hm hm where hm > 0. If hm satisfies limm→∞ hm = 0 and limm→∞ mhn = ∞, then, there exists m non-negative numbers m∗ , Cb , and CV such that for all m > m∗ we have, ˆ MSE fm (p) = E ˆ fm (p) − f (p) 2 < Cb h4 + m CV . mhn m If hm is chosen to be proportional to m−1/(n+4) , this gives, E (fm (p) − f (p))2 = O as m → ∞. (4) 1 m4/(n+4) Thus, the convergence rate of the estimator is given as in [3, 9], with the dimensionality replaced by the intrinsic dimension n of M . The proof will follow from the two lemmas below on the convergence rates of the bias and the variance. 2 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e., curves with zero intrinsic acceleration) of length less than rinj minimize the length between their endpoints. On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-antipodal points on the sphere can be connected with two geodesics with different lengths, namely, the two pieces of the great circle passing throught the points. For a detailed discussion of these issues, see, e.g., [1]. 3 3 Preliminary results The following theorem, which is analogous to Theorem 1A in [8], tells that up to a constant, the kernel becomes a “delta function” as the bandwidth gets smaller. Theorem 3.1. Let K : [0, ∞) → [0, ∞) be a continuous function that vanishes outside [0, 1) and is differentiable with a bounded derivative in [0, 1), and let ξ : M → R be a function that is differentiable to second order in a neighborhood of p ∈ M . Let ξh (p) = 1 hn K M up (q) h ξ(q) dV (q) , (5) where h > 0 and dV (q) denotes the Riemannian volume form on M at point q. Then, as h → 0, K( z )dn z = O(h2 ) , ξh (p) − ξ(p) (6) Rn where z = (z 1 , . . . , z n ) denotes the Cartesian coordinates on Rn and dn z = dz 1 . . . dz n denotes the volume form on Rn . In particular, limh→0 ξh (p) = ξ(p) Rn K( z )dn z. Before proving this theorem, we prove some results on the relation between up (q) and dp (q). Lemma 3.1. There exist δup > 0 and Mup > 0 such that for all q with dp (q) ≤ δup , we have, 3 dp (q) ≥ up (q) ≥ dp (q) − Mup [dp (q)] . In particular, limq→p up (q) dp (q) (7) = 1. Proof. Let cv0 (s) be a geodesic in M parametrized by arclength s, with c(0) = p and initial vedcv locity ds0 s=0 = v0 . When s < rinj , s is equal to dp (cv0 (s)) [7, 1]. Now let xv0 (s) be the representation of cv0 (s) in RN in terms of Cartesian coordinates with the origin at p. We have up (cv0 (s)) = xv0 (s) and x′ 0 (s) = 1, which gives3 x′ 0 (s) · x′′0 (s) = 0. Using these v v v we get, dup (cv0 (s)) ds s=0 the absolute value of the third d3 up (cv0 (s)) ds3 d2 up (cv0 (s)) = ds2 s=0 derivative of up (cv0 (s)) = 1 , and 0. Let M3 ≥ 0 be an upper bound on for all s ≤ rinj and all unit length v0 : 3 ≤ M3 . Taylor’s theorem gives up (cv0 (s)) = s + Rv0 (s) where |Rv0 (s)| ≤ M3 s . 3! Thus, (7) holds with Mup = M3 , for all r < rinj . For later convenience, instead of δu = rinj , 3! we will pick δup as follows. The polynomial r − Mup r3 is monotonically increasing in the interval 0 ≤ r ≤ 1/ 3Mup . We let δup = min{rinj , 1/ Mup }, so that r − Mup r3 is ensured to be monotonic for 0 ≤ r ≤ δup . Definition 3.2. 
For 0 ≤ r1 < r2 , let, Hp (r1 , r2 ) = Hp (r) = inf{up (q) : r1 ≤ dp (q) < r2 } , Hp (r, ∞) = inf{up (q) : r1 ≤ dp (q)} , (8) (9) i.e., Hp (r1 , r2 ) is the smallest u-distance from p among all points that have a d-distance between r1 and r2 . Since M is assumed to be an embedded submanifold, we have Hp (r) > 0 for all r > 0. In the below, we will assume that all radii are smaller than rinj , in particular, a set of the form {q : r1 ≤ dp (q) < r2 } will be assumed to be non-empty and so, due to the completeness of M , to contain a point q ∈ M such that dp (q) = r1 . Note that, Hp (r1 ) = min{H(r1 , r2 ), H(r2 )} . (10) Lemma 3.2. Hp (r) is a non-decreasing, non-negative function, and there exist δHp > 0 and MHp ≥ H (r) 0 such that, r ≥ Hp (r) ≥ r − MHp r3 , for all r < δHp . In particular, limr→0 pr = 1. 3 Primes denote differentiation with respect to s. 4 Proof. Hp (r) is clearly non-decreasing and Hp (r) ≤ r follows from up (q) ≤ dp (q) and the fact that there exists at least one point q with dp (q) = r in the set {q : r ≤ dp (q)} Let δHp = Hp (δup ) where δup is as in the proof of Lemma 3.1 and let r < δHp . Since r < δHp = Hp (δup ) ≤ δup , by Lemma 3.1 we have, r ≥ up (r) ≥ r − Mup r3 , (11) for some Mup > 0. Now, since r and r − Mup r3 are both monotonic for 0 ≤ r ≤ δup , we have (see figure) (12) r ≥ Hp (r, δup ) ≥ r − Mup r3 . In particular, H(r, δup ) ≤ r < δHp = Hp (δup ), i.e, H(r, δup ) < Hp (δup ). Using (10) this gives, Hp (r) = Hp (r, δup ). Combining this with (12), we get r ≥ Hp (r) ≥ r − Mup r3 for all r < δHp . Next we show that for all small enough h, there exists some radius Rp (h) such that for all points q with a dp (q) ≥ Rp (h), we have up (q) ≥ h. Rp (h) will roughly be the inverse function of Hp (r). Lemma 3.3. For any h < Hp (rinj ), let Rp (h) = sup{r : Hp (r) ≤ h}. Then, up (q) ≥ h for all q with dp (q) ≥ Rp (h) and there exist δRp > 0 and MRp > 0 such that for all h ≤ δRp , Rp (h) satisfies, (13) h ≤ Rp (h) ≤ h + MRp h3 . In particular, limh→0 Rp (h) h = 1. Proof. That up (q) ≥ h when dq (q) ≥ Rp (h) follows from the definitions. In order to show (13), we will use Lemma 3.2. Let α(r) = r − MHp r3 , where MHp is as in Lemma 3.2. Then, α(r) is oneto-one and continuous in the interval 0 ≤ r ≤ δHp ≤ δup . Let β = α−1 be the inverse function of α in this interval. From the definition of Rp (h) and Lemma 3.2, it follows that h ≤ Rp (h) ≤ β(h) for all h ≤ α(δHp ). Now, β(0) = 0, β ′ (0) = 1, β ′′ (0) = 0, so by Taylor’s theorem and the fact that the third derivative of β is bounded in a neighborhood of 0, there exists δg and MRp such that β(h) ≤ h + MRp h3 for all h ≤ δg . Thus, h ≤ Rp (h) ≤ h + MRp h3 , (14) for all h ≤ δR where δR = min{α(δHp ), δg }. Proof of Theorem 3.1. We will begin by proving that for small enough h, there is no contribution to the integral in the definition of ξh (p) (see (5)) from outside the coordinate patch covered by normal coordinates.4 Let h0 > 0 be such that Rp (h0 ) < rinj (such an h0 exists since limh→ 0 Rp (h) = 0). For any h ≤ h0 , all points q with dp (q) > rinj will satisfy up (q) > h. This means if h is small enough, u (q) K( ph ) = 0 for all points outside the injectivity radius and we can perform the integral in (5) solely in the patch of normal coordinates at p. For normal coordinates y = (y 1 , . . . , y n ) around the point p with y(p) = 0, we have dp (q) = y(q) [7, 1]. With slight abuse of notation, we will write up (y(q)) = up (q), ξ(y(q)) = ξ(q) and g(q) = g(y(q)), where g is the metric tensor of M . 
Proof of Theorem 3.1. We begin by proving that, for small enough $h$, there is no contribution to the integral in the definition of $\xi_h(p)$ (see (5)) from outside the coordinate patch covered by normal coordinates.⁴ Let $h_0 > 0$ be such that $R_p(h_0) < r_{inj}$ (such an $h_0$ exists since $\lim_{h\to 0} R_p(h) = 0$). For any $h \le h_0$, all points $q$ with $d_p(q) > r_{inj}$ will satisfy $u_p(q) > h$. This means that if $h$ is small enough, $K\!\left(\frac{u_p(q)}{h}\right) = 0$ for all points outside the injectivity radius, and we can perform the integral in (5) solely in the patch of normal coordinates at $p$.

⁴ Normal coordinates at a point $p$ in a Riemannian manifold are a close approximation to Cartesian coordinates, in the sense that the components of the metric have vanishing first derivatives at $p$, and $g_{ij}(p) = \delta_{ij}$ [1]. Normal coordinates can be defined in a "geodesic ball" of radius less than $r_{inj}$.

For normal coordinates $y = (y^1, \dots, y^n)$ around the point $p$ with $y(p) = 0$, we have $d_p(q) = \|y(q)\|$ [7, 1]. With a slight abuse of notation, we will write $u_p(y(q)) = u_p(q)$, $\xi(y(q)) = \xi(q)$, and $g(y(q)) = g(q)$, where $g$ is the metric tensor of $M$.

Since $K\!\left(\frac{u_p(q)}{h}\right) = 0$ for all $q$ with $d_p(q) > R_p(h)$, we have
$$\xi_h(p) = \frac{1}{h^n}\int_{\|y\| \le R_p(h)} K\!\left(\frac{u_p(y)}{h}\right)\xi(y)\,\sqrt{g(y)}\;dy^1\dots dy^n\,, \qquad (15)$$
where $g$ denotes the determinant of the metric tensor as calculated in normal coordinates. Changing the variable of integration to $z = y/h$, we get
$$\xi_h(p) - \xi(p)\int_{\mathbb{R}^n}K(\|z\|)\,d^n z = \int_{\|z\|\le R_p(h)/h} K\!\left(\frac{u_p(zh)}{h}\right)\xi(zh)\sqrt{g(zh)}\,d^n z \;-\; \xi(0)\int_{\|z\|\le 1}K(\|z\|)\,d^n z$$
$$= \int_{\|z\|\le 1} K\!\left(\frac{u_p(zh)}{h}\right)\xi(zh)\left(\sqrt{g(zh)} - 1\right)d^n z \;+\; \int_{\|z\|\le 1}\left[K\!\left(\frac{u_p(zh)}{h}\right) - K(\|z\|)\right]\xi(zh)\,d^n z$$
$$\;+\; \int_{\|z\|\le 1} K(\|z\|)\left(\xi(zh) - \xi(0)\right)d^n z \;+\; \int_{1<\|z\|\le R_p(h)/h} K\!\left(\frac{u_p(zh)}{h}\right)\xi(zh)\sqrt{g(zh)}\,d^n z\,.$$
Thus,
$$\left|\xi_h(p) - \xi(p)\int_{\mathbb{R}^n}K(\|z\|)\,d^n z\right| \;\le \qquad (16)$$
$$\sup_{t\in\mathbb{R}} K(t)\cdot\sup_{\|z\|\le 1}|\xi(zh)|\cdot\sup_{\|z\|\le 1}\left|\sqrt{g(zh)}-1\right|\cdot\int_{\|z\|\le 1}d^n z \;+ \qquad (17)$$
$$\sup_{\|z\|\le 1}|\xi(zh)|\cdot\sup_{\|z\|\le 1}\left|K\!\left(\frac{u_p(zh)}{h}\right) - K(\|z\|)\right|\cdot\int_{\|z\|\le 1}d^n z \;+ \qquad (18)$$
$$\left|\int_{\|z\|\le 1} K(\|z\|)\left(\xi(zh)-\xi(0)\right)d^n z\right| \;+ \qquad (19)$$
$$\sup_{t\in\mathbb{R}} K(t)\cdot\sup_{1<\|z\|\le R_p(h)/h}|\xi(zh)|\cdot\sup_{1<\|z\|\le R_p(h)/h}\sqrt{g(zh)}\cdot\int_{1<\|z\|\le R_p(h)/h}d^n z\,. \qquad (20)$$
Letting $h \to 0$, the terms (17)–(20) approach zero at the following rates:

(17): $K(t)$ is bounded and $\xi(y)$ is continuous at $y = 0$, so the first two factors can be bounded by constants as $h \to 0$. In normal coordinates $y$, $g_{ij}(y) = \delta_{ij} + O(\|y\|^2)$ as $y \to 0$, so $\sup_{\|z\|\le 1}\left|\sqrt{g(zh)} - 1\right| = O(h^2)$ as $h \to 0$.

(18): Since $K$ is assumed to be differentiable with a bounded derivative in $[0,1)$, we get $K(b) - K(a) = O(b - a)$ as $b \to a$. By Lemma 3.1 we have $\frac{u_p(zh)}{h} - \|z\| = O(h^2)$ as $h \to 0$. Thus, $K\!\left(\frac{u_p(zh)}{h}\right) - K(\|z\|) = O(h^2)$ as $h \to 0$.

(19): Since $\xi(y)$ is assumed to have partial derivatives up to second order in a neighborhood of $y(p) = 0$, for $\|z\| \le 1$ Taylor's theorem gives
$$\xi(zh) = \xi(0) + h\sum_{i=1}^{n} z^i \left.\frac{\partial\xi(y)}{\partial y^i}\right|_{y=0} + O(h^2) \qquad (21)$$
as $h \to 0$. Since $\int_{\|z\|\le 1} z\,K(\|z\|)\,d^n z = 0$, we get $\int_{\|z\|\le 1} K(\|z\|)\left(\xi(zh) - \xi(0)\right)d^n z = O(h^2)$ as $h \to 0$.

(20): The first three factors can be bounded by constants. By Lemma 3.3, $R_p(h) = h + O(h^3)$ as $h \to 0$. A spherical shell $1 < \|z\| \le 1 + \epsilon$ has volume $O(\epsilon)$ as $\epsilon \to 0^+$. Thus, the volume of $1 < \|z\| \le R_p(h)/h$ is $O(R_p(h)/h - 1) = O(h^2)$ as $h \to 0$.

Thus, the sum of the terms (17)–(20) is $O(h^2)$ as $h \to 0$, as claimed in Theorem 3.1.

4 Bias, variance and mean squared error

Let $M$, $f$, $\hat{f}_m$, $K$, $p$ be as in Theorem 2.1 and assume $h_m \to 0$ as $m \to \infty$.

Lemma 4.1. $\mathrm{Bias}\!\left[\hat{f}_m(p)\right] = O(h_m^2)$ as $m \to \infty$.

Proof. We have $\mathrm{Bias}\!\left[\hat{f}_m(p)\right] = \mathrm{Bias}\!\left[\frac{1}{h_m^n}K\!\left(\frac{u_p(q)}{h_m}\right)\right]$, so, recalling $\int_{\mathbb{R}^n}K(\|z\|)\,d^n z = 1$, the lemma follows from Theorem 3.1 with $\xi$ replaced by $f$.

Lemma 4.2. If in addition to $h_m \to 0$ we have $m h_m^n \to \infty$ as $m \to \infty$, then $\mathrm{Var}\!\left[\hat{f}_m(p)\right] = O\!\left(\frac{1}{m h_m^n}\right)$ as $m \to \infty$.

Proof.
$$\mathrm{Var}\!\left[\hat{f}_m(p)\right] = \mathrm{Var}\!\left[\frac{1}{m}\sum_{j=1}^{m}\frac{1}{h_m^n}K\!\left(\frac{u_p(q_j)}{h_m}\right)\right] \qquad (22)$$
$$= \frac{1}{m}\,\mathrm{Var}\!\left[\frac{1}{h_m^n}K\!\left(\frac{u_p(q)}{h_m}\right)\right]. \qquad (23)$$
Now,
$$\mathrm{Var}\!\left[\frac{1}{h_m^n}K\!\left(\frac{u_p(q)}{h_m}\right)\right] = E\!\left[\frac{1}{h_m^{2n}}K^2\!\left(\frac{u_p(q)}{h_m}\right)\right] - \left(E\!\left[\frac{1}{h_m^n}K\!\left(\frac{u_p(q)}{h_m}\right)\right]\right)^2\,, \qquad (24)$$
and
$$E\!\left[\frac{1}{h_m^{2n}}K^2\!\left(\frac{u_p(q)}{h_m}\right)\right] = \frac{1}{h_m^n}\int_M \frac{1}{h_m^n}K^2\!\left(\frac{u_p(q)}{h_m}\right)f(q)\,dV(q)\,. \qquad (25)$$
By Theorem 3.1 (applied with $K^2$ in place of $K$ and $\xi = f$), the integral in (25) converges to $f(p)\int K^2(\|z\|)\,d^n z$, so the right-hand side of (25) is $O\!\left(\frac{1}{h_m^n}\right)$ as $m \to \infty$. By Lemma 4.1 we have $\left(E\!\left[\frac{1}{h_m^n}K\!\left(\frac{u_p(q)}{h_m}\right)\right]\right)^2 \to f^2(p)$. Thus, $\mathrm{Var}\!\left[\hat{f}_m(p)\right] = O\!\left(\frac{1}{m h_m^n}\right)$ as $m \to \infty$.

Proof of Theorem 2.1. Finally, since $\mathrm{MSE}\!\left[\hat{f}_m(p)\right] = \mathrm{Bias}^2\!\left[\hat{f}_m(p)\right] + \mathrm{Var}\!\left[\hat{f}_m(p)\right]$, the theorem follows from Lemmas 4.1 and 4.2.
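To make the construction concrete, the following sketch (not from the paper; the quadratic kernel profile, the function and variable names, and the use of NumPy are our own choices) implements the estimator (3): an ordinary KDE evaluated with ambient Euclidean distances $u_p(q_j) = \|p - q_j\|_{\mathbb{R}^N}$, but with the kernel normalized over $\mathbb{R}^n$ and the bandwidth scaled as $h_m \propto m^{-1/(n+4)}$, where $n$ is the intrinsic dimension.

```python
import numpy as np
from math import gamma, pi

def submanifold_kde(data, queries, intrinsic_dim, bandwidth_scale=1.0):
    """Sketch of the estimator in Eq. (3).

    data:    (m, N) array of samples q_j lying on an n-dimensional submanifold of R^N.
    queries: (k, N) array of evaluation points p.
    """
    data = np.asarray(data, dtype=float)
    queries = np.asarray(queries, dtype=float)
    m, n = len(data), intrinsic_dim
    h = bandwidth_scale * m ** (-1.0 / (n + 4))   # h_m proportional to m^{-1/(n+4)}

    # Quadratic (Epanechnikov-type) profile K(t) = c_n (1 - t^2) on [0, 1),
    # with c_n chosen so that the integral of K(||z||) over R^n equals 1.
    unit_ball_vol = pi ** (n / 2) / gamma(n / 2 + 1)
    c_n = (n + 2) / (2 * unit_ball_vol)

    # u_p(q_j): Euclidean distances in the ambient space R^N.
    u = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1)
    t = u / h
    K = np.where(t < 1.0, c_n * (1.0 - t ** 2), 0.0)

    # (1/m) sum_j (1/h^n) K(u_p(q_j)/h), evaluated at each query point.
    return K.sum(axis=1) / (m * h ** n)
```

Note that the estimator itself never needs geodesic distances or a parametrization of $M$; the intrinsic dimension $n$ enters only through the bandwidth exponent and the kernel normalization.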
5 Experiments and discussion

We have empirically tested the estimator (3) on two datasets: a unit normal distribution mapped onto a piece of a spiral in the plane, so that $n = 1$ and $N = 2$, and a uniform distribution on the unit disc $x^2 + y^2 \le 1$ mapped onto the unit hemisphere by $(x, y) \to (x, y, \sqrt{1 - x^2 - y^2})$, so that $n = 2$ and $N = 3$. We picked the bandwidths to be proportional to $m^{-1/(n+4)}$, where $m$ is the number of data points. We performed leave-one-out estimates of the density on the data points and obtained the MSE for a range of values of $m$. See Figure 1; a sketch of this experimental setup is given after the figure caption below.

6 Conclusion and future work

We have proposed a small modification of the usual KDE in order to estimate the density of data that lives on an $n$-dimensional submanifold of $\mathbb{R}^N$, and proved that the rate of convergence of the estimator is determined by the intrinsic dimension $n$. This shows that the curse of dimensionality in KDE can be overcome for data with low intrinsic dimension. Our method assumes that the intrinsic dimensionality $n$ is given, so it has to be supplemented with an estimator of the dimension. We have assumed various smoothness properties for the submanifold $M$, the density $f$, and the kernel $K$. We find it likely that our estimator, or slight modifications of it, will be consistent under weaker requirements. Such a relaxation of requirements would have practical consequences, since it is unlikely that a generic data set lives on a smooth Riemannian manifold.

[Figure: two panels plotting MSE against the number of data points, for the spiral data and the hemisphere data; numerical axis ticks omitted.]

Figure 1: Mean squared error as a function of the number of data points for the spiral data (left) and the hemisphere data (right). In each case, we fit a curve of the form $MSE(m) = a\,m^b$, which gave $b = -0.80$ for the spiral and $b = -0.69$ for the hemisphere. Theorem 2.1 bounds the MSE by $C m^{-4/(n+4)}$, which gives the exponent as $-0.80$ for the spiral and $-0.67$ for the hemisphere.
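The following sketch (again not from the paper; the data-generation details and helper names are our own, and it reuses the submanifold_kde function sketched above) reproduces the flavor of the hemisphere experiment: sample uniformly on the unit disc, lift to the hemisphere, and compare leave-one-out density estimates at the sample points against the true density.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hemisphere(m):
    """Uniform samples on the unit disc, lifted to the unit hemisphere in R^3."""
    pts = []
    while len(pts) < m:                       # rejection-sample the unit disc
        x, y = rng.uniform(-1, 1, size=2)
        if x * x + y * y < 1.0:
            pts.append((x, y, np.sqrt(1.0 - x * x - y * y)))
    return np.array(pts)

def loo_mse(data, intrinsic_dim, true_density):
    """Leave-one-out MSE of the estimator at the sample points."""
    m = len(data)
    errs = []
    for i in range(m):
        rest = np.delete(data, i, axis=0)
        est = submanifold_kde(rest, data[i:i + 1], intrinsic_dim)[0]
        errs.append((est - true_density(data[i])) ** 2)
    return float(np.mean(errs))

# The lifted uniform-disc distribution has density f(x, y, z) = z / pi with
# respect to surface measure, since the area element of the graph is dx dy / z.
data = sample_hemisphere(2000)
print(loo_mse(data, intrinsic_dim=2, true_density=lambda q: q[2] / np.pi))
```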
References

[1] M. Berger and N. Hitchin. A panoramic view of Riemannian geometry. The Mathematical Intelligencer, 28(2):73–74, 2006.
[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104. ACM, New York, NY, USA, 2006.
[3] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18(1):179–189, 1966.
[4] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990.
[5] J. Jost. Riemannian Geometry and Geometric Analysis. Springer, 2008.
[6] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity. IEEE Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001.
[7] J. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer-Verlag, 1997.
[8] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962.
[9] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005.
[10] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In IEEE Workshop on Nonlinear Signal and Image Processing, volume 4. Citeseer, 1999.
[11] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006.
[12] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[13] P. Vincent and Y. Bengio. Manifold Parzen windows. In Advances in Neural Information Processing Systems, pages 849–856, 2003.
6 0.96094 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
7 0.7943545 76 nips-2009-Efficient Learning using Forward-Backward Splitting
8 0.79190141 128 nips-2009-Learning Non-Linear Combinations of Kernels
9 0.78801543 193 nips-2009-Potential-Based Agnostic Boosting
10 0.78449756 129 nips-2009-Learning a Small Mixture of Trees
11 0.78254598 27 nips-2009-Adaptive Regularization of Weight Vectors
12 0.77283108 72 nips-2009-Distribution Matching for Transduction
13 0.76645142 166 nips-2009-Noisy Generalized Binary Search
14 0.76589173 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
15 0.76462358 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors
16 0.76344216 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
17 0.76141775 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
18 0.76042521 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
19 0.75543064 180 nips-2009-On the Convergence of the Concave-Convex Procedure
20 0.75031716 220 nips-2009-Slow Learners are Fast