nips nips2013 nips2013-63 knowledge-graph by maker-knowledge-mining

63 nips-2013-Cluster Trees on Manifolds

Source: pdf

Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman

Abstract: unknown-abstract

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 School of Computer Science† and Department of Statistics‡, Carnegie Mellon University. In this paper we investigate the problem of estimating the cluster tree for a density f supported on or near a smooth d-dimensional manifold M isometrically embedded in $\mathbb{R}^D$. [sent-11, score-0.884]

2 Finally, we sketch a construction of a sample complexity lower bound instance for a natural class of manifold oblivious clustering algorithms. [sent-14, score-0.536]

3 1 Introduction. In this paper, we study the problem of estimating the cluster tree of a density when the density is supported on or near a manifold. [sent-15, score-0.81]

4 The connected components Cf (λ) of the upper level set {x : f (x) ≥ λ} are called density clusters. [sent-23, score-0.368]

5 The collection C = {Cf (λ) : λ ≥ 0} of all such clusters is called the cluster tree and estimating this cluster tree is referred to as density clustering. [sent-24, score-1.123]

6 The density clustering paradigm is attractive for various reasons. [sent-25, score-0.261]

7 One of the main difficulties of clustering is that the true goals of clustering are often unclear, which makes clusters, and clustering as a task, seem poorly defined. [sent-26, score-0.288]

8 Density clustering, however, estimates a well-defined population quantity, which makes its goal clear: consistent recovery of the population density clusters. [sent-27, score-0.343]

9 Typically only mild assumptions are made on the density f and this allows extremely general shapes and numbers of clusters at each level. [sent-28, score-0.333]

10 Finally, the cluster tree is an inherently hierarchical object and thus density clustering algorithms typically do not require speciﬁcation of the “right” level, rather they capture a summary of the density across all levels. [sent-29, score-0.823]

11 The search for a simple, statistically consistent estimator of the cluster tree has a long history. [sent-30, score-0.445]

12 They showed that, as long as the parameters of the algorithm are chosen appropriately, the resulting collection of connected components correctly estimates the cluster tree with high probability. [sent-34, score-0.543]

13 In this paper, we are concerned with the problem of estimating the cluster tree when the density f is supported on or near a low dimensional manifold. [sent-35, score-0.721]

14 This so-called manifold hypothesis motivates the study of data generated on or near low dimensional manifolds and the study of procedures that can adapt effectively to the intrinsic dimensionality of this data. [sent-39, score-0.452]

15 Here is a brief summary of the main contributions of our paper: (1) We show that the simple algorithm studied in the paper Chaudhuri and Dasgupta (2010) is consistent and has fast rates of convergence for data on or near a low dimensional manifold M . [sent-40, score-0.457]

16 (2) We show that the sample complexity for identifying salient clusters is independent of the ambient dimension. [sent-43, score-0.249]

17 (3) We sketch a construction of a sample complexity lower bound instance for a natural class of clustering algorithms that we study in this paper. [sent-44, score-0.297]

18 (4) We introduce a framework for studying consistency of clustering when the distribution is not supported on a manifold but rather, is concentrated near a manifold. [sent-45, score-0.52]

19 The generative model in this case is that the data are ﬁrst sampled from a distribution on a manifold and then noise is added. [sent-46, score-0.275]

20 We show that for certain noise models we can still efﬁciently recover the cluster tree on the latent samples. [sent-48, score-0.433]

21 1 Related Work. The idea of using probability density functions for clustering dates back to Wishart (1969). [sent-50, score-0.261]

22 Hartigan (1981) expanded on this idea and formalized the notions of high-density clustering, of the cluster tree and of consistency and fractional consistency of clustering algorithms. [sent-51, score-0.693]

23 In particular, Hartigan (1981) showed that single linkage clustering is consistent when D = 1 but is only fractionally consistent when D > 1. [sent-52, score-0.3]

24 (2010) and Stuetzle (2003) have also proposed procedures for recovering the cluster tree. [sent-54, score-0.285]

25 In the last two decades, much of the research effort involving the use of nonparametric density estimators for clustering has focused on the more specialized problems of optimal estimation of the support of the distribution or of a ﬁxed level set. [sent-57, score-0.318]

26 However, consistency of estimators of a ﬁxed level set does not imply cluster tree consistency, and extending the techniques and analyses mentioned above to hold simultaneously over a variety of density levels is non-trivial. [sent-58, score-0.719]

27 Estimating the cluster tree has more recently been considered by Kpotufe and von Luxburg (2011) who also give a simple pruning procedure for removing spurious clusters. [sent-64, score-0.429]

28 Steinwart (2011) and Sriperumbudur and Steinwart (2012) propose procedures for determining recursively the lowest split in the cluster tree and give conditions for asymptotic consistency with minimal assumptions on the density. [sent-65, score-0.531]

29 2 Background and Assumptions. Let P be a distribution supported on an unknown d-dimensional manifold M. [sent-66, score-0.239]

30 We assume that the manifold M is a d-dimensional Riemannian manifold without boundary embedded in a compact set X ⊂ RD with d < D. [sent-67, score-0.478]

31 We further assume that the volume of the manifold is bounded from above by a constant, i. [sent-68, score-0.239]

32 The Euclidean norm is denoted by $\|\cdot\|$ and $v_d$ denotes the volume of the d-dimensional unit ball in $\mathbb{R}^d$. [sent-75, score-0.505]

33 B(x, r) denotes the full-dimensional ball of radius r centered at x and BM (x, r) . [sent-76, score-0.29]

34 For λ ≥ 0, let Cf (λ) be the collection of connected components of the level set {x ∈ X : f (x) ≥ λ} and deﬁne the cluster tree of f to be the hierarchy C = {Cf (λ) : λ ≥ 0}. [sent-82, score-0.6]
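The definition of density clusters can be illustrated on a one-dimensional grid. The following is a minimal sketch, not taken from the paper: the function name `level_set_components` and the toy two-bump density are assumptions made for illustration. It lists the connected components of {x : f(x) ≥ λ} at two levels, showing one cluster at a low level splitting into two at a higher level.

```python
import numpy as np

def level_set_components(xs, fx, lam):
    """Connected components of {x : f(x) >= lam} on a sorted 1-D grid,
    returned as a list of (left_endpoint, right_endpoint) pairs."""
    above = fx >= lam
    components = []
    start = None  # grid index where the current component began
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            components.append((xs[start], xs[i - 1]))
            start = None
    if start is not None:  # component runs to the end of the grid
        components.append((xs[start], xs[-1]))
    return components

# Toy density (hypothetical): equal mixture of two Gaussians with sd 0.5.
xs = np.linspace(-4.0, 4.0, 801)
norm = 0.5 * np.sqrt(2.0 * np.pi)
fx = 0.5 * np.exp(-0.5 * ((xs + 1.5) / 0.5) ** 2) / norm \
   + 0.5 * np.exp(-0.5 * ((xs - 1.5) / 0.5) ** 2) / norm

low_level = level_set_components(xs, fx, 0.001)   # both bumps merged
high_level = level_set_components(xs, fx, 0.2)    # bumps separated
```

Collecting these components over all λ ≥ 0 gives the hierarchy that the text calls the cluster tree.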

35 For a cluster C its restriction to the sample X is deﬁned to be C[X] = C ∩ X. [sent-84, score-0.321]

36 The restriction of the cluster tree C to X is deﬁned to be C[X] = {C ∩ X : C ∈ C}. [sent-85, score-0.431]

37 Our deﬁnitions are slight modiﬁcations of those in Chaudhuri and Dasgupta (2010) to take into account the manifold assumption. [sent-88, score-0.239]

38 In the ﬁrst stage, the sample is cleaned by thresholding the k-nearest neighbor distance of the sample points at a radius r and then, in the second stage, the cleaned sample is connected at a connection radius R. [sent-95, score-0.751]
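The two stages described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `rsl_level` is hypothetical, the k-nearest neighbor radius is assumed to count the point itself (the smallest ball around x containing k sample points), and distances are plain Euclidean distances computed by brute force.

```python
import numpy as np

def rsl_level(X, k, r, R):
    """Clean-then-connect at a single scale.

    Stage 1 keeps points whose k-NN radius is at most r.
    Stage 2 joins kept points whose distance is at most R.
    Returns (kept_indices, labels); labels[i] is the component of X[kept_indices[i]].
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # O(n^2) pairwise distances
    # Assumed convention: the ball counts the point itself, hence index k - 1.
    knn_radius = np.sort(dist, axis=1)[:, k - 1]
    kept = np.flatnonzero(knn_radius <= r)

    parent = list(range(len(kept)))               # union-find over kept points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    sub = dist[np.ix_(kept, kept)]
    for a in range(len(kept)):
        for b in range(a + 1, len(kept)):
            if sub[a, b] <= R:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[ra] = rb

    roots = [find(i) for i in range(len(kept))]
    ids = {root: j for j, root in enumerate(dict.fromkeys(roots))}
    labels = np.array([ids[root] for root in roots], dtype=int)
    return kept, labels

# Two tight clumps and one isolated point (toy data).
clump = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
X_demo = np.vstack([clump, clump + 5.0, [[2.5, 2.5]]])
kept_demo, labels_demo = rsl_level(X_demo, k=3, r=0.2, R=0.5)
```

In the toy data the isolated point is removed by the cleaning stage and the two clumps come out as separate components. Running the procedure over a range of r values would trace out a hierarchy of clusters.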

39 We deﬁne two notions of consistency for an estimator C of the cluster tree: Deﬁnition 3 (Hartigan consistency) For any sets A, A′ ⊂ X , let An (resp. [sent-99, score-0.32]

40 We say $\hat{C}_n$ is consistent if, whenever A and A′ are different connected components of {x : f (x) ≥ λ} (for some λ > 0), the probability that $A_n$ is disconnected from $A'_n$ approaches 1 as n → ∞. [sent-101, score-0.297]

41 We say $\hat{C}_n$ is consistent if, whenever A and A′ are different connected components of {x : f (x) ≥ λ} (for some λ > 0), the probability that $A_n$ is disconnected from $A'_n$ approaches 1 as n → ∞. [sent-104, score-0.297]

42 The notion of (σ, ε) consistency is similar to that of Hartigan consistency, except that it is restricted to (σ, ε) separated clusters A and A′. [sent-105, score-0.375]

43 In their result there is no manifold and f is a density with respect to the Lebesgue measure on RD . [sent-107, score-0.404]

44 Their result in essence says that if $n \ge O\left(\frac{D}{\lambda\epsilon^2 v_D (\sigma/2)^D} \log \frac{D}{\lambda\epsilon^2 v_D (\sigma/2)^D}\right)$ then an RSL algorithm with appropriately chosen parameters can resolve any pair of (σ, ε) clusters at level at least λ. [sent-108, score-0.261]

45 Figure 1: Robust Single Linkage (RSL) Algorithm. 3 Clustering on Manifolds. In this section we show that the RSL algorithm can be adapted to recover the cluster tree of a distribution supported on a manifold of dimension d < D with the rates depending only on d. [sent-117, score-0.681]

46 In place of the cluster salience parameter σ, our rates involve a new parameter ρ, defined as $\rho := \min\left(\frac{3\sigma}{16}, \frac{\epsilon\tau}{72d}, \frac{\tau}{16}\right)$. [sent-118, score-0.265]
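A small helper makes it easy to see which of the three terms limits ρ. This sketch assumes the definition reads ρ := min(3σ/16, ετ/(72d), τ/16); the function name `salience_radius` and the returned labels are hypothetical.

```python
def salience_radius(sigma, eps, tau, d):
    """Return (rho, binding_term) for rho = min(3*sigma/16, eps*tau/(72*d), tau/16)."""
    terms = {
        "separation": 3.0 * sigma / 16.0,        # driven by the cluster separation sigma
        "density_gap": eps * tau / (72.0 * d),   # driven by eps and the curvature scale tau
        "curvature": tau / 16.0,                 # driven by the curvature scale tau alone
    }
    name = min(terms, key=terms.get)
    return terms[name], name

rho_demo, binding_demo = salience_radius(sigma=1.0, eps=0.5, tau=1.0, d=2)
```

For moderate σ and τ the ετ/(72d) term is typically the smallest, so the density gap ε and the intrinsic dimension d control the usable radius.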

47 In particular, the clusters containing A[X] and A′[X], where A and A′ are (σ, ε) separated, are internally connected and mutually disconnected in C(r) for r defined by $v_d r^d \lambda = \frac{1}{1-\epsilon/6}\left(\frac{k}{n} + \frac{C_2 \log(1/\delta)}{n}\sqrt{k\mu}\right)$, provided $\lambda \ge \frac{2}{v_d \rho^d}\frac{k}{n}$. [sent-124, score-0.842]

48 The sample complexity of the RSL algorithm for recovering (σ, ε) clusters at level at least λ on a manifold M with condition number at most 1/τ is $n = O\left(\frac{d}{\lambda\epsilon^2 v_d \rho^d} \log \frac{d}{\lambda\epsilon^2 v_d \rho^d}\right)$, where $\rho = C \min(\sigma, \epsilon\tau/d, \tau)$. [sent-127, score-1.334]
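To get a feel for this rate, the sketch below evaluates the expression inside the O(·), reading the bound as (d / (λ ε² v_d ρ^d)) · log(d / (λ ε² v_d ρ^d)) and dropping the unspecified constants. The function names are hypothetical, and the unit-ball volume uses the standard formula v_d = π^{d/2} / Γ(d/2 + 1). The ambient dimension D does not appear as an argument, which is the point of the result.

```python
import math

def unit_ball_volume(d):
    """Volume of the unit ball in R^d: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

def sample_complexity_scale(lam, eps, rho, d):
    """Scale of the sample complexity bound, with constants omitted.

    Only meaningful in the regime where the base quantity exceeds 1,
    so that the logarithm is positive.
    """
    base = d / (lam * eps ** 2 * unit_ball_volume(d) * rho ** d)
    return base * math.log(base)

# Halving rho multiplies the base by 2^d (illustrative values).
coarse = sample_complexity_scale(lam=1.0, eps=0.5, rho=0.1, d=2)
fine = sample_complexity_scale(lam=1.0, eps=0.5, rho=0.05, d=2)
```

The numbers are only a scale, not a usable sample size, because the constants hidden by the O(·) and in ρ are not given in the text.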

49 Ignoring constants that depend on d the main difference between this result and the result of Chaudhuri and Dasgupta (2010) is that our results only depend on the manifold dimension d and not the ambient dimension D (typically D ≫ d). [sent-128, score-0.322]

50 Another aspect is that our choice of the connection radius R depends on the (typically) unknown ρ, while for comparison, the connection radius in Chaudhuri and Dasgupta (2010) is chosen to be $\sqrt{2}\,r$. [sent-132, score-0.302]

51 It is easy to see that this theorem also establishes consistency for recovering the entire cluster tree by selecting an appropriate schedule on σn , ǫn and kn that ensures that all clusters are distinguished for n large enough (see Chaudhuri and Dasgupta (2010) for a formal proof). [sent-137, score-0.658]

52 2 we establish (σ, ǫ) consistency by showing that the clusters are mutually disjoint and internally connected. [sent-142, score-0.368]

53 The main technical challenge is that the curvature of the manifold, modulated by its condition number 1/τ , limits our ability to resolve the density level sets from a ﬁnite sample, by limiting the maximum cleaning and connection radii the algorithm can use. [sent-143, score-0.492]

54 In what follows, we carefully analyze this effect and show that somewhat surprisingly, despite this curvature, essentially the same algorithm is able to adapt to the unknown manifold and produce a consistent estimate of the entire cluster tree. [sent-144, score-0.536]

55 Similar manifold adaptivity results have been shown in classiﬁcation Dasgupta and Freund (2008) and in non-parametric regression Kpotufe and Dasgupta (2012); Bickel and Li (2006). [sent-145, score-0.239]

56 We get around this obstacle by using the insight that, in order to analyze the RSL algorithms, uniform convergence for Euclidean balls around the sample points and around a ﬁxed minimum s-net N of M (for an appropriately chosen s) sufﬁce to analyze the RSL algorithm. [sent-150, score-0.416]

57 This bounds the distortion of the apparent density due to the curvature of the manifold and is central to many of our arguments. [sent-161, score-0.462]

58 Then $\left(1 - \frac{r^2}{4\tau^2}\right)^{d/2} v_d r^d \le \mathrm{vol}_d(S) \le v_d \left(\frac{\tau}{\tau - 2r_1}\right)^d r_1^d$, where $r_1 = \tau - \tau\sqrt{1 - 2r/\tau}$. [sent-168, score-0.929]

59 In particular, if $r \le \epsilon\tau/(72d)$ for $0 \le \epsilon < 1$, then $v_d r^d (1 - \epsilon/6) \le \mathrm{vol}_d(S) \le v_d r^d (1 + \epsilon/6)$. [sent-169, score-1.034]
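The step from the general volume bounds to the (1 ± ε/6) form can be checked numerically. The sketch below assumes the general bounds read (1 − r²/(4τ²))^{d/2} v_d r^d ≤ vol_d(S) ≤ v_d (τ/(τ − 2r₁))^d r₁^d with r₁ = τ − τ√(1 − 2r/τ), and it reports both sides as ratios to v_d r^d. The function name is hypothetical.

```python
import math

def ball_volume_ratio_bounds(r, tau, d):
    """Lower and upper bounds on vol_d(S) / (v_d r^d).

    Requires r < 3*tau/8 so that tau - 2*r1 stays positive.
    """
    lower = (1.0 - r ** 2 / (4.0 * tau ** 2)) ** (d / 2.0)
    r1 = tau - tau * math.sqrt(1.0 - 2.0 * r / tau)
    upper = (tau / (tau - 2.0 * r1)) ** d * (r1 / r) ** d
    return lower, upper

# At the largest allowed radius r = eps*tau/(72 d), both ratios sit inside
# the interval [1 - eps/6, 1 + eps/6] for these illustrative values.
eps_demo, tau_demo, d_demo = 0.5, 1.0, 3
r_demo = eps_demo * tau_demo / (72.0 * d_demo)
lower_demo, upper_demo = ball_volume_ratio_bounds(r_demo, tau_demo, d_demo)
```

Because r/τ is at most ε/(72d), the curvature distortion accumulated over d dimensions stays well below ε/6, which is why the small-radius regime behaves almost like flat d-dimensional space.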

60 2 Separation and Connectedness. Lemma 8 (Separation). Assume that we pick k, r and R to satisfy the conditions: $r \le \rho$, $R = 4\rho$, $v_d r^d (1 - \epsilon/6)\lambda \ge \frac{k}{n} + \frac{C_\delta}{n}\sqrt{k\mu}$, and $v_d r^d (1 + \epsilon/6)\lambda(1 - \epsilon) \le \frac{k}{n} - \frac{C_\delta}{n}\sqrt{k\mu}$. [sent-171, score-0.47]

61 Then with probability 1 − δ, we have: (1) All points in $A_{\sigma-r}$ and $A'_{\sigma-r}$ are kept, and all points in $S_{\sigma-r}$ are removed. [sent-172, score-0.602]

62 Most importantly, we need to ensure that despite the curvature of the manifold we can still resolve the density well enough to guarantee that we can identify and eliminate points in the region of separation. [sent-176, score-0.573]

63 Since $r \le \epsilon\tau/(72d)$, by Lemma 7 $\mathrm{vol}(B_M(x, r))$ is between $v_d r^d (1 - \epsilon/6)$ and $v_d r^d (1 + \epsilon/6)$, for any $x \in M$. [sent-178, score-0.94]

64 So if $X_i \in A \cup A'$, then $B_M(X_i, r)$ has mass at least $v_d r^d (1 - \epsilon/6) \cdot \lambda$. [sent-179, score-0.578]

65 On the other hand, if $X_i \in S_{\sigma-r}$, then the set $B_M(X_i, r)$ contains mass at most $v_d r^d (1 + \epsilon/6) \cdot \lambda(1 - \epsilon)$. [sent-181, score-0.549]

66 Now, notice that if the graph is connected there must be an edge that connects two points that are at a geodesic distance of at least 2(σ − r). [sent-189, score-0.304]

67 All the conditions in Lemma 8 can be simultaneously satisfied by setting $k := 16 C_\delta^2 (\mu/\epsilon^2)$ and $v_d r^d (1 - \epsilon/6) \cdot \lambda = \frac{k}{n} + \frac{C_\delta}{n}\sqrt{k\mu}$. The condition on r is satisfied since $\lambda \ge \frac{2}{v_d \rho^d}\frac{k}{n}$. [sent-192, score-0.521]
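The parameter choice can be sketched in code. This is an illustration under stated assumptions, not the paper's procedure: it reads the setting as k := 16 C_δ² μ/ε² and v_d r^d (1 − ε/6) λ = k/n + (C_δ/n)√(kμ), it treats C_δ and μ as given inputs (their definitions are not in the extracted text, so the values below are placeholders), and the function names are hypothetical. The second function checks only the two mass inequalities of Lemma 8, not the conditions r ≤ ρ and R = 4ρ.

```python
import math

def _unit_ball_volume(d):
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

def rsl_parameters(n, lam, eps, d, c_delta, mu):
    """Choose k and solve for the cleaning radius r (assumed reading of the text)."""
    k = math.ceil(16.0 * c_delta ** 2 * mu / eps ** 2)
    deviation = c_delta / n * math.sqrt(k * mu)
    mass = (k / n + deviation) / (1.0 - eps / 6.0)   # equals v_d * r^d * lam
    r = (mass / (_unit_ball_volume(d) * lam)) ** (1.0 / d)
    return k, r

def lemma8_mass_conditions_hold(n, lam, eps, d, c_delta, mu, k, r):
    """Check the 'keep' and 'remove' mass inequalities for the given k and r."""
    ball = _unit_ball_volume(d) * r ** d
    deviation = c_delta / n * math.sqrt(k * mu)
    # Small relative tolerance because r was solved to make this an equality.
    keep = ball * (1.0 - eps / 6.0) * lam >= (k / n + deviation) * (1.0 - 1e-12)
    remove = ball * (1.0 + eps / 6.0) * lam * (1.0 - eps) <= k / n - deviation
    return keep and remove

# Placeholder values for c_delta and mu, chosen only to exercise the code.
k_demo, r_cleaning_demo = rsl_parameters(n=100000, lam=1.0, eps=0.5, d=2,
                                         c_delta=1.0, mu=5.0)
ok_demo = lemma8_mass_conditions_hold(100000, 1.0, 0.5, 2, 1.0, 5.0,
                                      k_demo, r_cleaning_demo)
```

With this k, the deviation term (C_δ/n)√(kμ) is at most (ε/4)(k/n), which leaves enough room between the "keep" and "remove" thresholds for both inequalities to hold.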

68 Since $y_i \in A$, we have $z_i \in A_{M,R/4}$, and hence the ball $B_M(z_i, R/4)$ lies completely inside $A_{M,R/2} \subseteq A_{M,\sigma-r}$. [sent-208, score-0.308]

69 In particular, the density inside the ball is at least λ everywhere, and hence the mass inside it is at least $v_d (R/4)^d (1 - \epsilon/6)\lambda \ge \frac{C_\delta \mu}{n}$. [sent-209, score-0.917]

70 Thus Lemma 6 guarantees that the ball BM (zi , R/4) contains at least one sample point, say xi . [sent-211, score-0.31]

71 ) Since the ball lies completely in AM,σ−r , the sample point xi is not removed in the cleaning step (Lemma 8). [sent-213, score-0.364]

72 Finally, we bound d(xi−1 , xi ) by considering the sequence of points (xi−1 , zi−1 , yi−1 , yi , zi , xi ). [sent-214, score-0.358]

73 4 A lower bound instance for the class of RSL algorithms. Recall that the sample complexity in Theorem 5 scales as $n = O\left(\frac{d}{\lambda\epsilon^2 v_d \rho^d} \log \frac{d}{\lambda\epsilon^2 v_d \rho^d}\right)$, where $\rho = C \min(\sigma, \epsilon\tau/d, \tau)$. [sent-217, score-0.858]

74 The manifold M consists of two disjoint components, C and C ′ (whose sole function is to ensure f integrates to 1). [sent-223, score-0.239]

75 This can be ﬁxed without affecting the essence of the construction by smoothing this intersection by rolling a ball of radius τ around it (a similar construction is made rigorous in Theorem 6 of Genovese et al. [sent-236, score-0.359]

76 Let P be the distribution on M whose density over C is λ if |x1 | > 1/2, and λ(1 − ǫ) if |x1 | ≤ 1/2, where λ is chosen small enough such that λ vold (C) ≤ 1. [sent-238, score-0.259]

77 The density over C ′ is chosen such that the total mass of the manifold is 1. [sent-239, score-0.483]

78 on algorithms that do in fact use a cleaning step, ignoring the single linkage algorithm which is known to be inconsistent for full dimensional densities. [sent-245, score-0.294]

79 Intuitively, because of the curvature of the described instance, the mass of a sufﬁciently large Euclidean ball in the separator set is larger than the mass of a corresponding ball in the true clusters. [sent-246, score-0.553]

80 This means that any algorithm that uses large balls cannot reliably clean the sample and this restricts the size of the balls that can be used. [sent-247, score-0.326]

81 Now if points in the regions of high density are to survive then there must be k sample points in the small ball around any point in the true clusters and this gives us a lower bound on the necessary sample size. [sent-248, score-0.789]

82 The RSL algorithms work by counting the number of sample points inside the balls B(x, r) centered at the sample points x, for some radius r. [sent-249, score-0.586]

83 In order for the algorithm to reliably resolve (σ, ǫ) clusters, it should distinguish points in the separator set S ⊂ M2 from those in the level λ clusters M1 ∪M3 . [sent-250, score-0.355]

84 A necessary condition for this is that the mass of a ball B(x, r) for x ∈ Sσ−r should be strictly smaller than the mass inside B(y, r) for y ∈ M1 ∪ M3 . [sent-251, score-0.404]

85 Since x0 should not be removed during the cleaning step, the ball B(x0 , r) must contain some other sample point (indeed, it must contain at least k − 1 more sample points). [sent-255, score-0.386]

86 5 Cluster tree recovery in the presence of noise. So far we have considered the problem of recovering the cluster tree given samples from a density supported on a lower dimensional manifold. [sent-258, score-0.912]

87 Indeed it can be argued that the manifold + noise model is a natural and general model for highdimensional data. [sent-260, score-0.275]

88 In the noisy setting, it is clear that we can infer the cluster tree of the noisy density in a straightforward way. [sent-261, score-0.562]

89 Following the literature on manifold estimation (Balakrishnan et al. [sent-263, score-0.239]

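90 In particular, the clusters containing A[X] and A′[X] are internally connected and mutually disconnected in C(r) for r defined by $\pi v_d r^d \lambda = \frac{1}{1-\epsilon/6}\left(\frac{k}{n} + \frac{C_2 \log(1/\delta)}{n}\sqrt{k\mu}\right)$, provided $\lambda \ge \max\left\{\frac{2}{v_d \rho^d}\frac{k}{n},\ \frac{2 v_D^{d/D}(1-\pi)}{v_d\,\epsilon^{d/D}\,\pi}\left(\frac{k}{n}\right)^{1-d/D}\right\}$, where ρ is now slightly modified (in constants), i.e., [sent-283, score-0.582]

91 $\rho := \min\left(\frac{\sigma}{7}, \frac{\epsilon\tau}{72d}, \frac{\tau}{24}\right)$. [sent-285, score-0.365]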

92 In particular, the clusters containing {Yi : Xi ∈ A} and {Yi : Xi ∈ A′ } are internally connected and mutually disconnected in C(r) for r deﬁned by Cδ k vd rd (1 − ǫ/12)(1 − ǫ/6)λ = + kµ n n k τ ǫτ if λ ≥ vd2 d n and θ ≤ ρǫ/24d, where ρ := min σ , 24 , 144d . [sent-301, score-0.947]

93 For the clutter noise case we produce a tree that is consistent for samples drawn from P (which are exactly on M ), while in the additive noise case we produce a tree on the observed Yi s which is (σ, ǫ) consistent for the latent Xi s (for θ small enough). [sent-305, score-0.596]
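The two noise settings can be simulated to build intuition. The sketch below is an assumption-laden illustration, not the paper's experimental setup: the latent manifold is a unit circle placed in the first two coordinates of R^D, clutter noise is modeled as replacing each sample, with probability 1 − π, by a uniform draw from a cube, and additive noise is modeled as Y_i = X_i + η_i with ‖η_i‖ ≤ θ. All function names and the cube half-width are hypothetical.

```python
import numpy as np

def sample_circle(n, D, rng):
    """Latent samples X_i: uniform on a unit circle embedded in R^D (D >= 2)."""
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.zeros((n, D))
    X[:, 0] = np.cos(angle)
    X[:, 1] = np.sin(angle)
    return X

def add_clutter(X, pi_manifold, rng, half_width=2.0):
    """Mixture model: keep each X_i with probability pi_manifold, otherwise
    replace it by a uniform draw from the cube [-half_width, half_width]^D."""
    n, D = X.shape
    is_clutter = rng.uniform(size=n) >= pi_manifold
    Y = X.copy()
    Y[is_clutter] = rng.uniform(-half_width, half_width,
                                size=(int(is_clutter.sum()), D))
    return Y, is_clutter

def add_bounded_noise(X, theta, rng):
    """Additive model: Y_i = X_i + eta_i with eta_i uniform in the ball of radius theta."""
    n, D = X.shape
    direction = rng.normal(size=(n, D))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = theta * rng.uniform(size=(n, 1)) ** (1.0 / D)
    return X + radius * direction

rng_demo = np.random.default_rng(0)
X_latent = sample_circle(200, 5, rng_demo)
Y_clutter, clutter_mask = add_clutter(X_latent, 0.8, rng_demo)
Y_additive = add_bounded_noise(X_latent, 0.05, rng_demo)
```

In the clutter case the uncorrupted samples lie exactly on the manifold, while in the additive case no observed point does, which matches the distinction the text draws between a tree for samples from P and a tree on the Y_i that is consistent for the latent X_i.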

94 It is worth noting that in the case of clutter noise we can still consistently recover the entire cluster tree. [sent-306, score-0.33]

95 As a result the clutter noise only affects a vanishingly low level set of the cluster tree. [sent-308, score-0.387]

96 An upper bound for the volume of geodesic balls in submanifolds of euclidean spaces. [sent-338, score-0.358]

97 Measuring mass concentrations and estimating density contour clusters: an excess mass approach. [sent-415, score-0.357]

98 Fast rates for plug-in estimators of density level sets. [sent-420, score-0.267]

99 Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. [sent-456, score-0.739]

100 A generalized single linkage method for estimating the cluster tree of a density. [sent-462, score-0.539]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rsl', 0.4), ('vd', 0.365), ('manifold', 0.239), ('cluster', 0.22), ('chaudhuri', 0.217), ('dasgupta', 0.201), ('tree', 0.177), ('density', 0.165), ('ball', 0.14), ('clusters', 0.13), ('rinaldo', 0.125), ('radius', 0.118), ('balls', 0.115), ('bm', 0.115), ('linkage', 0.108), ('connected', 0.106), ('rd', 0.105), ('disconnected', 0.103), ('geodesic', 0.103), ('consistency', 0.1), ('cf', 0.098), ('clustering', 0.096), ('hartigan', 0.096), ('cuevas', 0.094), ('internally', 0.094), ('vold', 0.094), ('cleaning', 0.083), ('mass', 0.079), ('dimensional', 0.076), ('clutter', 0.074), ('xi', 0.074), ('kpotufe', 0.072), ('niyogi', 0.072), ('genovese', 0.071), ('sample', 0.067), ('points', 0.066), ('stuetzle', 0.062), ('euclidean', 0.062), ('yi', 0.059), ('curvature', 0.058), ('separator', 0.057), ('level', 0.057), ('inside', 0.055), ('manifolds', 0.054), ('lemma', 0.054), ('zi', 0.054), ('ambient', 0.052), ('singh', 0.052), ('balakrishnan', 0.051), ('condition', 0.051), ('near', 0.049), ('consistent', 0.048), ('connectedness', 0.047), ('hemisphere', 0.047), ('resp', 0.047), ('submanifolds', 0.047), ('resolve', 0.045), ('rates', 0.045), ('separated', 0.045), ('mutually', 0.044), ('wishart', 0.043), ('homology', 0.042), ('maier', 0.042), ('components', 0.04), ('pn', 0.04), ('cleaned', 0.038), ('rigollet', 0.038), ('separation', 0.038), ('mild', 0.038), ('construction', 0.037), ('noise', 0.036), ('sketch', 0.036), ('aarti', 0.036), ('concentrated', 0.036), ('estimating', 0.034), ('restriction', 0.034), ('procedures', 0.034), ('connection', 0.033), ('universal', 0.033), ('larry', 0.033), ('neighbor', 0.033), ('von', 0.032), ('centered', 0.032), ('sriperumbudur', 0.031), ('wasserman', 0.031), ('steinwart', 0.031), ('recovering', 0.031), ('bound', 0.031), ('constants', 0.031), ('lower', 0.03), ('restricts', 0.029), ('bickel', 0.029), ('apart', 0.029), ('analyze', 0.029), ('least', 0.029), ('uniform', 0.029), ('lebesgue', 0.029), ('around', 0.027), ('modi', 0.027), ('ignoring', 
0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 63 nips-2013-Cluster Trees on Manifolds

Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman

Abstract: unknown-abstract

2 0.14600147 256 nips-2013-Probabilistic Principal Geodesic Analysis

Author: Miaomiao Zhang, P.T. Fletcher

Abstract: Principal geodesic analysis (PGA) is a generalization of principal component analysis (PCA) for dimensionality reduction of data on a Riemannian manifold. Currently PGA is deﬁned as a geometric ﬁt to the data, rather than as a probabilistic model. Inspired by probabilistic PCA, we present a latent variable model for PGA that provides a probabilistic framework for factor analysis on manifolds. To compute maximum likelihood estimates of the parameters in our model, we develop a Monte Carlo Expectation Maximization algorithm, where the expectation is approximated by Hamiltonian Monte Carlo sampling of the latent variables. We demonstrate the ability of our method to recover the ground truth parameters in simulated sphere data, as well as its effectiveness in analyzing shape variability of a corpus callosum data set from human brain images. 1

3 0.12370621 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

Abstract: While several papers have investigated computationally and statistically efﬁcient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efﬁcient procedure. Our results provide the ﬁrst step of a theoretical basis for recent methods that combine feature selection and clustering. 1

4 0.10868599 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. 
We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors inﬂuencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-ﬁrst branch-and-bound algorithm. Algorithm 1 presents a simpliﬁed version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-ﬁrst tree traversal to a speciﬁed depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. 
Our results provide theoretical evidence, for the ﬁrst time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Speciﬁcally, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically veriﬁed in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. 
O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-ﬁrst branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-ﬁrst traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. 
On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the ﬁrst approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of ﬁnding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difﬁculty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. 
The quantization error of the region $A$ is then given by

$$V_S(A) = \frac{1}{|A \cap S|} \sum_{x \in A \cap S} \|x - \mu(A)\|_2^2, \tag{1}$$

and the average quantization error of a disjoint partition of region $A$ into $A_l$ and $A_r$ is given by:

$$V_S(\{A_l, A_r\}) = \left(|A_l \cap S|\, V_S(A_l) + |A_r \cap S|\, V_S(A_r)\right) / |A \cap S|. \tag{2}$$

Tree-based structured vector quantization is used for efficient vector quantization: a BSP-tree of depth $\log_2 k$ partitions the space containing $S$ into $k$ disjoint regions to produce a $k$-quantization of $S$. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoint regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]):

Definition 2.1. For a set $S \subset \mathbb{R}^D$, a region $A$ partitioned into two disjoint regions $\{A_l, A_r\}$, and a data-dependent quantity $\beta > 1$, the quantization error improvement is characterized by:

$$V_S(\{A_l, A_r\}) < (1 - 1/\beta)\, V_S(A). \tag{3}$$

Tree        Definition of β
PA-tree     $O(\varrho^2)$ : $\varrho = \left(\sum_{i=1}^{D} \lambda_i\right) / \lambda_1$
RP-tree     $O(d_c)$
kd-tree     ×
2M-tree     optimal (smallest possible)
MM-tree*    $O(\rho)$ : $\rho = \left(\sum_{i=1}^{D} \lambda_i\right) / \gamma^2$

Table 1: β for various trees. $\lambda_1, \ldots, \lambda_D$ are the sorted eigenvalues of the covariance matrix of $A \cap S$ in descending order, and $d_c < D$ is the covariance dimension of $A \cap S$. The results for PA-tree and 2M-tree are due to Verma et al. (2009) [2]. The PA-tree result can be improved to $O(\varrho)$ from $O(\varrho^2)$ with an additional assumption [2]. The RP-tree result is in Freund et al. (2007) [14], which also has the precise definition of $d_c$. We establish the result for MM-tree in Section 4. $\gamma$ is the margin size of the large margin partition. No such guarantee for kd-trees is known to us.

The quantization performance depends inversely on the data-dependent quantity β: lower β implies better quantization. We present the definition of β for different BSP-trees in Table 1. For the PA-tree, β depends on the ratio of the sum of the eigenvalues of the covariance matrix of data ($A \cap S$) to the principal eigenvalue. The improvement rate β for the RP-tree depends on the covariance dimension of the data in the node $A$ ($\beta = O(d_c)$) [7], which roughly corresponds to the lowest dimensionality of an affine plane that captures most of the data covariance. The 2M-tree does not have an explicit β but it has the optimal theoretical improvement rate for a single partition because the 2-means clustering objective is equal to $|A_l| V(A_l) + |A_r| V(A_r)$ and minimizing this objective minimizes β. The 2-means problem is NP-hard and an approximate solution is used in practice.

These theoretical results are valid under the condition that there are no outliers in $A \cap S$. This is characterized as $\max_{x,y \in A \cap S} \|x - y\|^2 \le \eta V_S(A)$ for a fixed $\eta > 0$. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as "points that are much farther away from the mean than the typical distance-from-mean". In this situation, an alternate type of partition is used to remove these outliers that are farther away from the mean than expected. For $\eta \ge 8$, this alternate partitioning is guaranteed to reduce the data diameter ($\max_{x,y \in A \cap S} \|x - y\|$) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contains no outliers, at which point the usual hyperplane partition can be used with its respective theoretical quantization guarantee. The implicit assumption is that the alternate partitioning scheme is employed rarely.

These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper.
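To make Equations (1) to (3) concrete, the sketch below computes the quantization error of a node, the average error after a split, and the value of β at which Equation (3) would hold with equality for that particular split. It is an illustration under simple assumptions (points stored as rows of a NumPy array, a split given as a boolean mask); the function names are hypothetical.

```python
import numpy as np

def quantization_error(points):
    """V_S(A): mean squared distance of the points to their centroid (Equation (1))."""
    centroid = points.mean(axis=0)
    return float(np.mean(np.sum((points - centroid) ** 2, axis=1)))

def split_quantization_error(left, right):
    """V_S({A_l, A_r}): size-weighted average of the child errors (Equation (2))."""
    n_left, n_right = len(left), len(right)
    total = n_left * quantization_error(left) + n_right * quantization_error(right)
    return total / (n_left + n_right)

def improvement_rate(points, left_mask):
    """Boundary value of beta: V_S({A_l, A_r}) = (1 - 1/beta) * V_S(A) for this split."""
    parent = quantization_error(points)
    children = split_quantization_error(points[left_mask], points[~left_mask])
    return parent / (parent - children)

# Two well-separated pairs on a line: the split between them removes
# almost all of the quantization error, so beta is close to 1.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
mask = np.array([True, True, False, False])
print(quantization_error(pts))      # 26.0
print(improvement_rate(pts, mask))  # 26/25 = 1.04
```

A smaller value returned by `improvement_rate` corresponds to a better split in the sense of Definition 2.1.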
3 Approximation guarantees for BSP-tree search

In this section, we formally present the data- and tree-dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantified in two ways: (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error (see footnote 2). For a query $q$, a set of points $S$ and a neighbor candidate $p \in S$, the distance error is $\epsilon(q) = \frac{\|q - p\|}{\min_{r \in S} \|q - r\|} - 1$, and the rank is $\tau(q) = |\{r \in S : \|q - r\| < \|q - p\|\}| + 1$.

Algorithm 1 requires the query traversal depth $l$ as an input. The search runtime is $O(l + (n/2^l))$. The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates $m$; for a balanced binary tree on a dataset $S$ of $n$ points with leaf nodes containing a single point, the appropriate depth is $l = \log_2 n - \log_2 m$ (rounded to an integer). We will be building on the existing results on vector quantization error [2] to present the worst-case error guarantee for Algorithm 1. We need the following definitions to precisely state our results:

Definition 3.1. An ω-balanced split partitioning a region $A$ into disjoint regions $\{A_1, A_2\}$ implies $\big|\,|A_1 \cap S| - |A_2 \cap S|\,\big| \le \omega |A \cap S|$.

For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, $\omega \approx 0$. Non-zero values of $\omega \ll 1$, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of $O(\log n)$ [8, Theorem 3.1]. For a tree with ω-balanced splits, the worst-case runtime of Algorithm 1 is $O\!\left(l + \left(\frac{1+\omega}{2}\right)^{l} n\right)$. For the 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree.

Definition 3.2.
Let $B_{\ell_2}(p, \Delta) = \{r \in S : \|p - r\| < \Delta\}$ denote the points in $S$ contained in a ball of radius $\Delta$ around some $p \in S$ with respect to the $\ell_2$ metric. The expansion constant of $(S, \ell_2)$ is defined as the smallest $c \ge 2$ such that $|B_{\ell_2}(p, 2\Delta)| \le c\, |B_{\ell_2}(p, \Delta)|$ for all $p \in S$ and all $\Delta > 0$.

Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and $c \sim 2^{O(d)}$ where $d$ is the doubling dimension of the set $S$ with respect to the $\ell_2$ metric. The relationship is exact for points on a $D$-dimensional grid (i.e., $c = \Theta(2^D)$). Equipped with these definitions, we have the following guarantee for Algorithm 1:

Theorem 3.1. Consider a dataset $S \subset \mathbb{R}^D$ of $n$ points with $\psi = \frac{1}{2n^2} \sum_{x,y \in S} \|x - y\|^2$, the BSP-tree $T$ built on $S$ and a query $q \in \mathbb{R}^D$ with the following conditions:

(C1) Let $(A \cap (S \cup \{q\}), \ell_2)$ have an expansion constant at most $\tilde{c}$ for any convex set $A \subset \mathbb{R}^D$.
(C2) Let $T$ be complete till a depth $L < \log_2(n/\tilde{c}) \,/\, (1 - \log_2(1 - \omega))$ with ω-balanced splits.
(C3) Let $\beta^*$ correspond to the worst quantization error improvement rate over all splits in $T$.
(C4) For any node $A$ in the tree $T$, let $\max_{x,y \in A \cap S} \|x - y\|^2 \le \eta V_S(A)$ for a fixed $\eta \ge 8$.

For $\alpha = 1/(1 - \omega)$, the upper bound $d^u$ on the distance of $q$ to the neighbor candidate $p$ returned by Algorithm 1 with depth $l \le L$ is given by

$$\|q - p\| \le d^u = \frac{2\sqrt{\eta\psi} \cdot (2\alpha)^{l/2} \cdot \exp(-l/2\beta^*)}{\left(n/(2\alpha)^l\right)^{1/\log_2 \tilde{c}} - 2}. \tag{4}$$

Footnote 2: The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in $S$ which are better neighbor candidates than $p$. The nearest-neighbor of $q$ has rank 1 and distance error 0. The appropriate notion of error depends on the search application.

Now $\eta$ is fixed, and $\psi$ is fixed for a dataset $S$.
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller $\beta^*$). Moreover, for a particular tree with $\beta^* \ge \log_2 e$, $d^u$ is non-decreasing in $l$. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level ($l = 0$), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth $l$ in Algorithm 1, the candidate neighbor distance is inversely proportional to $\left(n/(2\alpha)^l\right)^{1/\log_2 \tilde{c}}$, implying deteriorating bounds $d^u$ with increasing $\tilde{c}$. Since $\log_2 \tilde{c} \sim O(d)$, larger intrinsic dimensionality implies worse guarantees, as expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result:

Lemma 3.1. Under the conditions of Theorem 3.1, for any node $A$ at a depth $l$ in the BSP-tree $T$ on $S$, $V_S(A) \le \psi \left(2/(1 - \omega)\right)^{l} \exp(-l/\beta^*)$.

This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over $l$ levels of the tree (the proof is in Appendix A).

Proof of Theorem 3.1. Consider the node $A$ at depth $l$ in the tree containing $q$, and let $m = |A \cap S|$. Let $D = \max_{x,y \in A \cap S} \|x - y\|$, let $d = \min_{x \in A \cap S} \|q - x\|$, and let $B_{\ell_2}(q, \Delta) = \{x \in A \cap (S \cup \{q\}) : \|q - x\| < \Delta\}$. Then, by Definition 3.2 and condition C1,

$$|B_{\ell_2}(q, D + d)| \le \tilde{c}^{\log_2 \lceil (D+d)/d \rceil}\, |B_{\ell_2}(q, d)| = \tilde{c}^{\log_2 \lceil (D+d)/d \rceil} \le \tilde{c}^{\log_2 ((D+2d)/d)},$$

where the equality follows from the fact that $B_{\ell_2}(q, d) = \{q\}$. Now $|B_{\ell_2}(q, D + d)| \ge m$. Using this above gives us $m^{1/\log_2 \tilde{c}} \le (D/d) + 2$. By condition C2, $m^{1/\log_2 \tilde{c}} > 2$. Hence we have $d \le D / (m^{1/\log_2 \tilde{c}} - 2)$. By construction and condition C4, $D \le \sqrt{\eta V_S(A)}$. Now $m \ge n/(2\alpha)^l$.
Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1.

Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows:

Corollary 3.1. Under the conditions of Theorem 3.1, for any query $q$ at a desired depth $l \le L$ in Algorithm 1, the distance error $\epsilon(q)$ is bounded as $\epsilon(q) \le (d^u / d_q^*) - 1$, and the rank $\tau(q)$ is bounded as $\tau(q) \le \tilde{c}^{\lceil \log_2 (d^u / d_q^*) \rceil}$, where $d_q^* = \min_{r \in S} \|q - r\|$.

Proof. The distance error bound follows from the definition of distance error. Let $R = \{r \in S : \|q - r\| < d^u\}$. By definition, $\tau(q) \le |R| + 1$. Let $B_{\ell_2}(q, \Delta) = \{x \in (S \cup \{q\}) : \|q - x\| < \Delta\}$. Since $B_{\ell_2}(q, d^u)$ contains $q$ and $R$, and $q \notin S$, $|B_{\ell_2}(q, d^u)| = |R| + 1 \ge \tau(q)$. From Definition 3.2 and condition C1, $|B_{\ell_2}(q, d^u)| \le \tilde{c}^{\lceil \log_2 (d^u / d_q^*) \rceil}\, |B_{\ell_2}(q, d_q^*)|$. Using the fact that $|B_{\ell_2}(q, d_q^*)| = |\{q\}| = 1$ gives us the upper bound on $\tau(q)$.

The upper bounds on both forms of search error are directly proportional to $d^u$. Hence, the BSP-tree with better quantization performance has better search performance guarantees, and increasing traversal depth $l$ implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on $\beta^*$ and $\tilde{c}$. While our result bounds the worst-case performance of Algorithm 1, an average-case performance guarantee on the distance error is given by $E_q\, \epsilon(q) \le d^u\, E_q[1/d_q^*] - 1$, and on the rank is given by $E_q\, \tau(q) \le \tilde{c}^{\lceil \log_2 d^u \rceil}\, E_q\, \tilde{c}^{-(\log_2 d_q^*)}$, since the expectation is over the queries $q$ and $d^u$ does not depend on $q$. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on $d^u$ since the term within the expectation over $q$ is tree independent.

Dependence of the nearest-neighbor search error on the partition margins.
The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance $d_q^*$ of any query $q$, of which we have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of $q$ or that $d_q^*$ is greater than the size of the margin. We characterize the influence of the margin size with the following result:

Corollary 3.2. Consider the conditions of Theorem 3.1 and a query $q$ at a depth $l \le L$ in Algorithm 1. Further assume that $\gamma$ is the smallest margin size on both sides of any partition in the tree $T$. Then the distance error is bounded as $\epsilon(q) \le d^u/\gamma - 1$, and the rank is bounded as $\tau(q) \le \tilde{c}^{\lceil \log_2 (d^u/\gamma) \rceil}$.

This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees for Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree.

Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set $A \subset \mathbb{R}^D$, $(A \cap (S \cup \{q\}), \ell_2)$ has an expansion constant at most $\tilde{c}$. A bounded $\tilde{c}$ implies that no subset of $(S \cup \{q\})$, contained in a convex set, has a very high expansion constant. This condition implies that $((S \cup \{q\}), \ell_2)$ also has an expansion constant at most $\tilde{c}$ (since $(S \cup \{q\})$ is contained in its convex hull). However, if $(S \cup \{q\}, \ell_2)$ has an expansion constant $c$, this does not imply that the data lying within any convex set has an expansion constant at most $c$. Hence a bounded expansion constant assumption for $(A \cap (S \cup \{q\}), \ell_2)$ for every convex set $A \subset \mathbb{R}^D$ is stronger than a bounded expansion constant assumption for $(S \cup \{q\}, \ell_2)$ (see footnote 3). Condition C2 ensures that the tree is complete so that for every query $q$ and a depth $l \le L$, there exists a large enough tree node which contains $q$.
Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. Condition C4 implies that the squared data diameter of any node $A$ ($\max_{x,y \in A \cap S} \|x - y\|^2$) is within a constant factor of its quantization error $V_S(A)$. This refers to the assumption that the node $A$ contains no outliers as described in Section 2, that only hyperplane partitions are used, and that their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-trees. For two different kinds of hyperplane splits, if the alternate split is invoked the same number of times in the tree, the difference in the worst-case guarantees for the two trees would again be governed by their worst-case quantization performance ($\beta^*$). However, for any fixed $\eta$, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. We do not yet have a theoretical answer for this (see footnote 4).

Empirical validation. We examine our theoretical results with 4 datasets: OPTDIGITS (D = 64, n = 3823, 1797 queries), TINY IMAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), and IMAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers ($\Delta_S^2(A) > \eta V_S(A)$) can be computationally expensive for large $n$, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees.
The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees is presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2(b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree is significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms the PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to the 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of $O(l + n(1 + \omega)^l / 2^l)$ for any given depth $l$. No such balance constraint is enforced in the 2-means algorithm, and hence the 2M-tree can be heavily unbalanced. The absence of a complete 2M-tree beyond depth 4 and 6 in Figures 2(b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average).

Footnote 3: A subset of a growth-restricted metric space $(S, \ell_2)$ may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form $(A \cap S)$ where $A \subset \mathbb{R}^D$ is a convex set. So our condition does not imply that all subsets of $(S, \ell_2)$ are growth-restricted.
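The two error measures used in the guarantees and in the empirical comparison, the distance error and the rank defined at the start of Section 3, can be computed directly from their definitions. The sketch below is illustrative only; the function name `search_errors` is hypothetical, and it assumes the query is not itself a point of the dataset, so that the true nearest-neighbor distance is nonzero.

```python
import numpy as np

def search_errors(points, query, candidate_index):
    """Return (distance error, rank) of a candidate neighbor for a query.

    distance error = ||q - p|| / min_r ||q - r|| - 1
    rank           = number of points strictly closer to q than p, plus one
    """
    dists = np.linalg.norm(points - query, axis=1)
    candidate_dist = dists[candidate_index]
    distance_error = candidate_dist / dists.min() - 1.0
    rank = int(np.sum(dists < candidate_dist)) + 1
    return float(distance_error), rank

# Three points at distances 1, 2 and 4 from the query.
points = np.array([[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
query = np.array([0.0, 0.0])
print(search_errors(points, query, 0))  # (0.0, 1): the true nearest neighbor
print(search_errors(points, query, 1))  # (1.0, 2)
```

The true nearest neighbor has distance error 0 and rank 1, matching footnote 2.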
Footnote 4: We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of $\eta$, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with $\eta \ge 8$, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition.

Figure 2: Performance of BSP-trees with increasing traversal depth. Panels: (a) OPTDIGITS, (b) TINY IMAGES, (c) MNIST, (d) IMAGES. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank $\tau$ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color).

4 Large margin BSP-tree

We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane $(w^*, b^*)$ from the following optimization problem (see footnote 5) given a set of points $S = \{x_1, x_2, \ldots, x_m\} \subset \mathbb{R}^D$:

$$\min_{w, b, \xi_i} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \xi_i \tag{5}$$

$$\text{s.t.} \quad |\langle w, x_i \rangle + b| \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i = 1, \ldots, m \tag{6}$$

$$-\omega m \le \sum_{i=1}^{m} \operatorname{sgn}(\langle w, x_i \rangle + b) \le \omega m. \tag{7}$$

MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split.
The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree:

Theorem 4.1. Given a set of points $S \subset \mathbb{R}^D$ and a region $A$ containing $m$ points, consider an ω-balanced max-margin split $(w, b)$ of the region $A$ into $\{A_l, A_r\}$ with at most $\alpha m$ support vectors and a split margin of size $\gamma = 1/\|w\|$. Then the quantization error improvement is given by:

$$V_S(\{A_l, A_r\}) \le \left(1 - \frac{\gamma^2 (1 - \alpha)^2 \left(\frac{1-\omega}{1+\omega}\right)}{\sum_{i=1}^{D} \lambda_i}\right) V_S(A), \tag{8}$$

where $\lambda_1, \ldots, \lambda_D$ are the eigenvalues of the covariance matrix of $A \cap S$.

The result indicates that larger margin sizes (large $\gamma$ values) and a smaller number of support vectors (small $\alpha$) imply better quantization performance. Larger ω implies smaller improvement, but ω is generally restricted algorithmically in MMC. If $\gamma = O(\sqrt{\lambda_1})$ then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1:

Proposition 4.1. [7, Lemma 15] Given a set $S$, for any partition $\{A_1, A_2\}$ of a set $A$,

$$V_S(A) - V_S(\{A_1, A_2\}) = \frac{|A_1 \cap S|\,|A_2 \cap S|}{|A \cap S|^2}\, \|\mu(A_1) - \mu(A_2)\|^2, \tag{9}$$

where $\mu(A)$ is the centroid of the points in the region $A$.

This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition.

Footnote 5: This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels $y_i$ and optimizes over them. We consider this form of the problem since it makes our analysis easier to follow.

Proof of Theorem 4.1. For a feasible solution $(w, b, \xi_i|_{i=1,\ldots,m})$ to the MMC problem,

$$\sum_{i=1}^{m} |\langle w, x_i \rangle + b| \ge m - \sum_{i=1}^{m} \xi_i.$$

Let $\tilde{x}_i = \langle w, x_i \rangle + b$, $m_p = |\{i : \tilde{x}_i > 0\}|$, $m_n = |\{i : \tilde{x}_i \le 0\}|$, $\tilde{\mu}_p = \left(\sum_{i : \tilde{x}_i > 0} \tilde{x}_i\right)/m_p$ and $\tilde{\mu}_n = \left(\sum_{i : \tilde{x}_i \le 0} \tilde{x}_i\right)/m_n$. Then $m_p \tilde{\mu}_p - m_n \tilde{\mu}_n \ge m - \sum_i \xi_i$.

Without loss of generality, we assume that $m_p \ge m_n$. Then the balance constraint (Equation (7)) tells us that $m_p \le m(1 + \omega)/2$ and $m_n \ge m(1 - \omega)/2$. Then $\tilde{\mu}_p - \tilde{\mu}_n + \omega(\tilde{\mu}_p + \tilde{\mu}_n) \ge 2 - \frac{2}{m} \sum_i \xi_i$. Since $\tilde{\mu}_p > 0$ and $\tilde{\mu}_n \le 0$, $|\tilde{\mu}_p + \tilde{\mu}_n| \le (\tilde{\mu}_p - \tilde{\mu}_n)$. Hence $(1 + \omega)(\tilde{\mu}_p - \tilde{\mu}_n) \ge 2 - \frac{2}{m} \sum_i \xi_i$. For an unsupervised split, the data is always separable since there is no misclassification. This implies that $\xi_i^* \le 1 \;\forall i$. Hence,

$$\tilde{\mu}_p - \tilde{\mu}_n \ge \left(2 - \frac{2}{m} |\{i : \xi_i > 0\}|\right) / (1 + \omega) \ge 2\left(\frac{1 - \alpha}{1 + \omega}\right), \tag{10}$$

since the term $|\{i : \xi_i > 0\}|$ corresponds to the number of support vectors in the solution.

Cauchy-Schwarz implies that $\|\mu(A_l) - \mu(A_r)\| \ge |\langle w, \mu(A_l) - \mu(A_r) \rangle| / \|w\| = (\tilde{\mu}_p - \tilde{\mu}_n)\gamma$, since $\tilde{\mu}_n = \langle w, \mu(A_l) \rangle + b$ and $\tilde{\mu}_p = \langle w, \mu(A_r) \rangle + b$. From Equation (10), we can say that $\|\mu(A_l) - \mu(A_r)\|^2 \ge 4\gamma^2 (1 - \alpha)^2 / (1 + \omega)^2$. Also, for ω-balanced splits, $|A_l||A_r| \ge (1 - \omega^2) m^2 / 4$. Combining these into Equation (9) from Proposition 4.1, we have

$$V_S(A) - V_S(\{A_l, A_r\}) \ge (1 - \omega^2)\gamma^2 \left(\frac{1 - \alpha}{1 + \omega}\right)^2 = \gamma^2 (1 - \alpha)^2 \left(\frac{1 - \omega}{1 + \omega}\right). \tag{11}$$

Let $\mathrm{Cov}(A \cap S)$ be the covariance matrix of the data contained in region $A$ and $\lambda_1, \ldots, \lambda_D$ be the eigenvalues of $\mathrm{Cov}(A \cap S)$. Then, we have:

$$V_S(A) = \frac{1}{|A \cap S|} \sum_{x \in A \cap S} \|x - \mu(A)\|^2 = \operatorname{tr}\left(\mathrm{Cov}(A \cap S)\right) = \sum_{i=1}^{D} \lambda_i.$$

Then dividing Equation (11) by $V_S(A)$ gives us the statement of the theorem.

5 Conclusions and future directions

Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees, as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low $\beta^*$ in Corollary 3.1) and large partition margins (large $\gamma$ in Corollary 3.2).
The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes $\beta^*$ while the MM-tree explicitly maximizes $\gamma$ (which also implies smaller $\beta^*$ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic: for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one-dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance.

This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21].
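Returning to the one-dimensional large margin split suggested earlier in this section, a possible sketch is given below. It is not the authors' implementation: it simply projects the points onto a chosen direction, sorts the projections, and picks the widest gap among the split positions that satisfy the ω-balance condition of Definition 3.1. The function name `one_dim_margin_split` is hypothetical, and the returned margin is half the gap width, i.e. the distance from the threshold to the nearest projected point on either side.

```python
import numpy as np

def one_dim_margin_split(points, direction, omega):
    """Return (threshold, margin) of the widest gap among omega-balanced splits.

    Sorting dominates the cost, so this runs in O(m log m) for m points.
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    proj = np.sort(points @ direction)
    m = len(proj)
    best_gap, best_threshold = -1.0, None
    for k in range(1, m):  # k points fall on the left, m - k on the right
        if abs(k - (m - k)) > omega * m:
            continue  # violates the omega-balance condition
        gap = proj[k] - proj[k - 1]
        if gap > best_gap:
            best_gap = gap
            best_threshold = 0.5 * (proj[k] + proj[k - 1])
    if best_threshold is None:
        raise ValueError("no omega-balanced split exists for this omega")
    return float(best_threshold), float(best_gap / 2.0)

# Eight points on a line; the far outlier at 20 creates the widest gap,
# but a tight balance constraint rules that split out.
pts = np.array([[0.0], [1.0], [2.0], [6.0], [7.0], [8.0], [9.0], [20.0]])
print(one_dim_margin_split(pts, [1.0], omega=0.25))  # (4.0, 2.0)
print(one_dim_margin_split(pts, [1.0], omega=1.0))   # (14.5, 5.5)
```

The example illustrates why the balance constraint matters: without it, the widest gap isolates a single outlying point and yields a badly unbalanced tree.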
Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes.

References

[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Narendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computers, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

5 0.10793579 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

Author: Nitish Srivastava, Ruslan Salakhutdinov

Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier's parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set.

6 0.099220037 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap

7 0.097039551 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

8 0.095931336 158 nips-2013-Learning Multiple Models via Regularized Weighting

9 0.095481068 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

10 0.090396717 344 nips-2013-Using multiple samples to learn mixture models

11 0.089834429 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

12 0.088872567 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

13 0.081387222 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

14 0.080331519 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

15 0.079369135 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

16 0.077588014 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

17 0.076747052 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

18 0.074150681 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

19 0.073981948 148 nips-2013-Latent Maximum Margin Clustering

20 0.073396601 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, 0.062), (2, 0.04), (3, 0.012), (4, 0.052), (5, 0.117), (6, 0.059), (7, -0.024), (8, -0.053), (9, 0.066), (10, 0.014), (11, 0.015), (12, 0.062), (13, 0.068), (14, 0.066), (15, 0.037), (16, 0.067), (17, -0.048), (18, 0.036), (19, 0.066), (20, 0.082), (21, 0.148), (22, 0.01), (23, 0.036), (24, -0.077), (25, -0.012), (26, -0.056), (27, -0.061), (28, -0.017), (29, -0.048), (30, -0.062), (31, -0.046), (32, 0.171), (33, -0.026), (34, 0.031), (35, 0.034), (36, 0.069), (37, -0.03), (38, 0.003), (39, 0.029), (40, 0.018), (41, 0.026), (42, -0.12), (43, -0.033), (44, 0.011), (45, 0.035), (46, -0.101), (47, -0.033), (48, 0.035), (49, 0.117)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95449609 63 nips-2013-Cluster Trees on Manifolds

Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman

Abstract: unknown-abstract

2 0.76142275 344 nips-2013-Using multiple samples to learn mixture models

Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana

Abstract: In the mixture models problem it is assumed that there are K distributions θ_1, ..., θ_K and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data.

3 0.67337626 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

4 0.63995272 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

5 0.62209731 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

Author: Matus Telgarsky, Sanjoy Dasgupta

Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as m^{min{-1/4, -1/2+2/p}}. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure.

6 0.61004651 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

7 0.60136122 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

8 0.58468747 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

9 0.58225071 158 nips-2013-Learning Multiple Models via Regularized Weighting

10 0.57859248 256 nips-2013-Probabilistic Principal Geodesic Analysis

11 0.54142845 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

12 0.52406222 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

13 0.51974779 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

14 0.50878441 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

15 0.50815254 202 nips-2013-Multiclass Total Variation Clustering

16 0.49836516 340 nips-2013-Understanding variable importances in forests of randomized trees

17 0.49536964 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

18 0.48839688 148 nips-2013-Latent Maximum Margin Clustering

19 0.48661003 47 nips-2013-Bayesian Hierarchical Community Discovery

20 0.47883064 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.019), (16, 0.04), (33, 0.114), (34, 0.157), (41, 0.023), (49, 0.027), (56, 0.152), (70, 0.033), (78, 0.227), (85, 0.045), (89, 0.058), (93, 0.022), (95, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83155048 63 nips-2013-Cluster Trees on Manifolds

Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman

Abstract: unknown-abstract

2 0.82115996 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on R^d and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework.

3 0.77153808 230 nips-2013-Online Learning with Costly Features and Labels

Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari

Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem.

4 0.73754919 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis.

5 0.73459184 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/.

6 0.73136759 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

7 0.7308774 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

8 0.72902811 249 nips-2013-Polar Operators for Structured Sparse Estimation

9 0.72817039 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

10 0.72725147 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

11 0.72599441 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

12 0.7258876 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

13 0.72458202 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

14 0.72431904 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

15 0.72280616 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

16 0.72277534 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

17 0.72254783 184 nips-2013-Marginals-to-Models Reducibility

18 0.72129363 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

19 0.72092259 280 nips-2013-Robust Data-Driven Dynamic Programming

20 0.72046703 348 nips-2013-Variational Policy Search via Trajectory Optimization