nips nips2013 nips2013-119 knowledge-graph by maker-knowledge-mining

119 nips-2013-Fast Template Evaluation with Vector Quantization


Source: pdf

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. [sent-3, score-0.516]

2 Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. [sent-5, score-0.341]

3 Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. [sent-6, score-0.223]

4 We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. [sent-8, score-0.362]

5 One core operation in computer vision involves evaluating a bank of templates at a set of sample locations in an image. [sent-9, score-0.405]

6 This is by far the most computationally demanding task in current popular object detection algorithms including canonical pedestrian [3] and face detection [4] methods (modern practice uses a linear SVM); the deformable part models [2]; and exemplar SVMs [1]. [sent-11, score-0.965]

7 The accuracy and flexibility of these algorithms have turned them into the building blocks of many modern computer vision systems, all of which would benefit from a fast template evaluation algorithm. [sent-12, score-0.509]

8 There is a vast literature of models that are variants of these methods, but they mostly evaluate banks of templates at a set of sample locations in images. [sent-13, score-0.287]

9 Because this operation is important, there is now a range of methods to speed up this process, either by pruning locations to evaluate a template [7, 8] or by using fast convolution techniques. [sent-14, score-0.66]

10 Our method rests on the idea that it is sufficient to compute an accurate, fixed-precision approximation to the value the original template would produce. [sent-17, score-0.402]

11 We use Vector Quantization speedups, together with a variety of evaluation techniques and a cascade to exclude unpromising sample locations, to produce this approximation quickly. [sent-18, score-0.223]

12 This library provides simple interfaces for evaluating templates in dense or sparse grids of locations. [sent-20, score-0.29]

13 We used this library to implement a deformable part model algorithm that runs nearly two orders of magnitude faster than the original implementation [2]. [sent-21, score-0.56]

14 This library is also used to obtain an order of magnitude speed-up for the exemplar SVM detectors of [1]. [sent-22, score-0.345]

15 Computation costs break into two major terms: per image terms, like computing HOG features; and per (image×category) terms, where the cost scales with the number of categories as well as the number of images. [sent-28, score-0.303]

16 At heart, evaluating a deformable part model involves evaluating a bank of templates at a set of locations in a scaled feature pyramid. [sent-32, score-0.746]

17 At each iteration it evaluates the corresponding template only if the current score of the object is higher than a certain threshold (trained in advance), resulting in an order of magnitude speed-up without significant loss of accuracy. [sent-38, score-0.726]

18 [8] follow a similar approach but estimate the score of a location using a lower resolution version of the templates. [sent-40, score-0.237]

19 Transform methods evaluate templates at all locations simultaneously by exploiting properties of the Fast Fourier Transform. [sent-41, score-0.287]

20 These methods, e.g. [9], result in a several-fold speed-up while being exact; however, there is a per-image overhead of computing an FFT at the start, and a per (image × category) overhead of computing an inverse FFT at the end. [sent-43, score-0.336]

21 Furthermore, the approach computes the scores of all locations at once, and so is not random-access; it cannot be efficiently combined with a cascade detection process. [sent-44, score-0.497]

22 In contrast, our template evaluation algorithm does not require batching template evaluations. [sent-45, score-0.768]

23 As a result, we can combine our evaluation speedups with the cascade framework of [7]. [sent-46, score-0.237]

24 We show that using our method in a cascade framework leads to a two-orders-of-magnitude speed-up compared to the original deformable part model implementation. [sent-47, score-0.578]

25 Extreme category scaling methods exploit locality-sensitive hashing to build a system that can detect 100,000 object categories in a matter of tens of seconds [10]. [sent-48, score-0.322]

26 However, the method cannot speed up detection of the 20 VOC challenge objects without significant loss of accuracy. [sent-50, score-0.29]

27 In contrast, because our method relies on evaluation speedups, it can speed up evaluation of even a single template. [sent-51, score-0.207]

28 They represent a vector by a short code composed of a number of subspace quantization indices. [sent-60, score-0.263]

29 This method can efficiently estimate the score of a template at a certain location by looking up a number of tables. [sent-64, score-0.57]

30 Model quantization: Our method is similar to [12] as we both use Vector Quantization to speed up template evaluation. [sent-67, score-0.424]

31 In contrast, our method uses legacy models (that were trained on a low-dimensional dense feature space) and quantizes the space only at the level of evaluating the scores. [sent-70, score-0.226]

32 (a) is the original image, (b) is the HOG visualization, (c) is the visualization of the Vector Quantized HOG features with c = 256 clusters, and (d) is the visualization of the Vector Quantized HOG features with c = 16 clusters. [sent-73, score-0.213]

33 The vast majority of modern object detectors work as follows: • In a preprocessing stage, an image pyramid and a set of underlying features for each layer of the pyramid are computed. [sent-76, score-0.454]

34 • For each location in each layer of the pyramid, a fixed size window of the image features spanning the location is extracted. [sent-77, score-0.289]

35 The linear functions are then assembled into a score for each category at that location. [sent-79, score-0.332]

36 Precisely how the score is computed from linear functions varies from detector to detector. [sent-81, score-0.256]

37 For example, exemplar SVMs directly use the score; deformable part models summarize a score from several linear functions in nearby windows; and so on. [sent-82, score-0.696]

38 Typically, detectors are evaluated by marking true windows in test data; establishing an overlap criterion to distinguish between false and true detections; plotting precision as a function of recall; and then computing the average precision (AP; the integral of this plot). [sent-84, score-0.231]

39 A detector that gets a good AP does so by assigning high values of the score to windows that strongly overlap the right answer. [sent-85, score-0.344]

40 Notice that what matters here is the ranking of windows, rather than the actual value of the score; some inaccuracy in score computation might not affect the AP. [sent-86, score-0.192]

41 HOG features for a window consist of a grid of cells, where each cell contains a d-dimensional vector (typically d = 32) that corresponds to a small region of the image (typically 8 × 8 pixels). [sent-88, score-0.267]

42 The linear template is usually thought of as an m × n table of vectors. [sent-89, score-0.378]

43 The score at location (x, y) is given by S(x, y) = Σ_{∆y=1}^{m} Σ_{∆x=1}^{n} w(∆x, ∆y) · h(x + ∆x − 1, y + ∆y − 1), where w is a weight vector and h is the feature vector at a certain cell (both d-dimensional vectors). [sent-91, score-0.348]
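To make the notation concrete, here is a minimal NumPy sketch of this exhaustive score for one window (illustrative identifiers and array layout only; this is not the authors' code):

```python
import numpy as np

def exact_score(template, features, x, y):
    """Exhaustive linear template score S(x, y) for one window.

    template: (m, n, d) array of per-cell weight vectors w(dx, dy).
    features: (H, W, d) array of HOG cells h(., .), typically d = 32.
    Cost is Theta(m * n * d) per evaluated location.
    """
    m, n, d = template.shape
    window = features[y:y + m, x:x + n, :]       # the m x n feature window at (x, y)
    return float(np.sum(template * window))      # sum of per-cell dot products
```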

44 We wish to compute an approximation to this score where (a) the accuracy of the approximation is ... [sent-92, score-0.264]

45 The two scatter-plots illustrate template score estimates using 10^7 sample points. [sent-121, score-0.525]

46 To do so, we quantize the feature vectors in each cell h(x, y) into c clusters using a basic k-means procedure and encode each quantized cell q(x, y) using its cluster ID (which can range from 1 to c). [sent-124, score-0.4]

47 We pre-compute the partial dot product of each template cell w(∆x, ∆y) with all 1 ≤ i ≤ c possible centroids and store them in a lookup table T(∆x, ∆y, i). [sent-126, score-0.649]

48 The approximate score is then S′(x, y) = Σ_{∆y=1}^{m} Σ_{∆x=1}^{n} T(∆x, ∆y, q(x + ∆x − 1, y + ∆y − 1)). This reduces the per-template computation complexity of exhaustive search from Θ(mnd) to Θ(mn). [sent-128, score-0.446]
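The following sketch shows the quantize / precompute / look-up pipeline described above; it assumes scikit-learn's KMeans for the clustering, reuses the hypothetical array layout of exact_score, and is an illustration of the idea rather than the authors' library:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_cells(features, kmeans):
    """Assign every d-dimensional HOG cell its k-means cluster ID (0..c-1 here, 1..c in the paper)."""
    H, W, d = features.shape
    return kmeans.predict(features.reshape(-1, d)).reshape(H, W)

def build_lookup_table(template, centroids):
    """T[dy, dx, i] = w(dx, dy) . centroid_i, precomputed once per template."""
    m, n, d = template.shape
    return template.reshape(m * n, d).dot(centroids.T).reshape(m, n, -1)

def vq_score(table, cell_ids, x, y):
    """Approximate score S'(x, y): one table lookup and one addition per cell."""
    m, n, _ = table.shape
    ids = cell_ids[y:y + m, x:x + n]                                   # (m, n) cluster IDs
    return float(np.take_along_axis(table, ids[..., None], axis=2).sum())

# usage sketch: fit c = 256 centroids on a sample of HOG cells, then score a window
# kmeans = KMeans(n_clusters=256, n_init=1).fit(sample_cells)   # sample_cells: (N, 32)
# table  = build_lookup_table(template, kmeans.cluster_centers_)
# ids    = quantize_cells(hog_features, kmeans)
# s_hat  = vq_score(table, ids, x, y)
```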

49 In practice 32 multiplications and 32 additions are replaced by one lookup and one addition. [sent-129, score-0.239]

50 Table lookup is often slower than multiplication, so gaining the full speed-up requires certain implementation techniques that we explain in the next section. [sent-131, score-0.244]

51 Most object detection algorithms only need the small fraction of scores that are higher than a certain threshold. [sent-145, score-0.373]

52 Our experimental results show that the described Vector Quantized convolution, coupled with a re-estimation step, significantly speeds up the detection process without any loss of accuracy. [sent-148, score-0.378]
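A minimal sketch of this two-pass scheme, reusing the hypothetical helpers above (assumed names, not the paper's interface): VQ-score every window, keep the top fraction, and rescore only those with the original template.

```python
import numpy as np

def detect_with_rescoring(template, features, kmeans, keep_frac=0.01):
    """VQ-score every window, then exactly rescore only the highest-scoring fraction."""
    m, n, _ = template.shape
    H, W, _ = features.shape
    ids = quantize_cells(features, kmeans)
    table = build_lookup_table(template, kmeans.cluster_centers_)

    # cheap pass: approximate scores for all valid window positions
    approx = np.array([[vq_score(table, ids, x, y)
                        for x in range(W - n + 1)]
                       for y in range(H - m + 1)])          # rows = y, cols = x

    # expensive pass: exact evaluation only for the surviving candidates
    k = max(1, int(keep_frac * approx.size))
    top = np.argsort(approx, axis=None)[-k:]
    rescored = {}
    for idx in top:
        y, x = divmod(int(idx), approx.shape[1])
        rescored[(x, y)] = exact_score(template, features, x, y)
    return rescored        # {(x, y): exact score} for the top keep_frac of windows
```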

53 Figure 3: Left: A single template can be zero-padded spatially to generate multiple larger templates. [sent-149, score-0.333]

54 We pack the spatially padded templates to evaluate several locations in one pass. [sent-150, score-0.321]

55 To estimate the maximum score we start from the center and move to the highest-scoring neighbour until we reach a local maximum. [sent-152, score-0.303]

56 In this example we compute the template on 17 locations in three steps (right most image). [sent-154, score-0.445]

57 We incorporated our Vector Quantization technique into the cascade detection algorithm of [7], resulting in a few-fold speed-up with no loss of accuracy. [sent-159, score-0.375]

58 The cascade algorithm estimates the root score and the part scores iteratively (based on a pre-trained order). [sent-160, score-0.457]

59 At each iteration it prunes the locations whose score falls below a certain threshold. [sent-161, score-0.304]

60 This process is done in two passes; the first pass uses a fast score estimation technique while the second pass uses the original template evaluation. [sent-162, score-0.603]

61 In the case of deformable part models this procedure limits the computation spent on both the convolution and the distance transform. [sent-166, score-0.405]

62 Fast deformation estimates: To find the best deformation for a part template, Felzenszwalb et al. [sent-168, score-0.196]

63 [7] perform an exhaustive search over a 9 × 9 grid of locations and find the deformation (∆x, ∆y) that maximizes S(∆x, ∆y) = Sapp(∆x, ∆y) + Sdef(∆x, ∆y) subject to −4 ≤ ∆x, ∆y ≤ 4, where Sapp is the appearance score and Sdef is the deformation score. [sent-169, score-0.496]

64 In a hill-climbing process we start from S(0, 0) and iteratively move to the neighbouring location with the highest score among all neighbours. [sent-171, score-0.281]
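A sketch of this hill-climbing search; score(dx, dy) is an assumed callable standing in for Sapp(∆x, ∆y) + Sdef(∆x, ∆y), and radius = 4 corresponds to the 9 × 9 grid:

```python
def hill_climb(score, radius=4):
    """Greedy ascent from (0, 0) to a local maximum of score(dx, dy)
    over the grid -radius <= dx, dy <= radius."""
    dx, dy = 0, 0
    best = score(dx, dy)
    while True:
        neighbours = [(dx + a, dy + b)
                      for a in (-1, 0, 1) for b in (-1, 0, 1)
                      if (a, b) != (0, 0)
                      and abs(dx + a) <= radius and abs(dy + b) <= radius]
        scored = [(score(nx, ny), (nx, ny)) for nx, ny in neighbours]
        top_score, top_pos = max(scored)         # best neighbouring cell
        if top_score <= best:                    # no better neighbour: local maximum
            return (dx, dy), best
        best, (dx, dy) = top_score, top_pos
```

This matches the example above where only 17 of the 81 grid locations are evaluated in three steps.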

65 Packed Lookup Tables: Depending on the detailed structure of memory, a table lookup instruction can be a few times slower than a multiplication instruction. [sent-174, score-0.286]

66 When there are multiple templates to be evaluated at a certain location we pack their corresponding lookup tables and index them all in one memory access, thereby reducing the number of individual memory references. [sent-175, score-0.606]

67 Padding Templates: Packing lookup tables appears unhelpful when there is only one template to evaluate. [sent-177, score-0.613]

68 However, we can obtain multiple templates in this case by zero-padding the original template (to represent various translates of that template; Figure 3). [sent-178, score-0.541]

69 This allows packing the lookup tables to obtain the score of multiple locations in one pass. [sent-179, score-0.63]
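In NumPy terms the packing idea looks roughly like the sketch below; it only illustrates the memory layout (the speed-up in the paper comes from a low-level implementation), and the identifiers are assumptions:

```python
import numpy as np

def pack_tables(tables):
    """Stack p lookup tables of shape (m, n, c) into one (m, n, c, p) array so
    that the p partial scores for a given (cell, cluster ID) sit contiguously."""
    return np.ascontiguousarray(np.stack(tables, axis=-1))

def packed_scores(packed, cell_ids, x, y):
    """Scores of all p packed templates (or padded translates) at location (x, y)."""
    m, n, c, p = packed.shape
    ids = cell_ids[y:y + m, x:x + n]                       # (m, n) cluster IDs
    rows = packed[np.arange(m)[:, None],                   # one length-p row per cell,
                  np.arange(n)[None, :],                   # fetched in a single access
                  ids]                                     # -> (m, n, p)
    return rows.sum(axis=(0, 1))                           # length-p score vector
```

With a single template, the same routine applies after zero-padding it into several shifted copies, so one pass yields the scores of multiple neighbouring locations.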

70 Feature computation, per-image preprocessing, per (image×category) processing, and per-category preprocessing. [sent-182, score-0.443]

71 Sparse lookup tables: Depending on the design of the features and the clustering approach, lookup tables can be sparse in some applications. [sent-184, score-0.539]

72 Packing p dense lookup tables would require a dense c × p table. [sent-185, score-0.28]

73 However, if the lookup tables are sparse, each row of the table could be stored in a sparse data structure. [sent-186, score-0.325]

74 Since the template evaluation process in this paper does not involve multiplication, the magnitude of the data stays in about the same range, so one could keep the data in fixed-point format, which requires only simpler addition arithmetic. [sent-191, score-0.391]

75 In order to assess detection speed we need to understand the underlying computation cost. [sent-193, score-0.263]

76 Computation costs break into two major terms: per image terms, where the cost scales with the number of images and per (image×category) terms, where the cost scales with the number of categories as well as the number of images. [sent-201, score-0.303]

77 The total time taken is the sum of four costs: • Computing HOG features is a mandatory, per-image step, shared by all HOG-based detection algorithms. [sent-202, score-0.401]

78 • Per-image preprocessing is any processing of the image data structure other than HOG feature extraction. [sent-203, score-0.324]

79 • Per-category preprocessing establishes the required detector data structures. [sent-205, score-0.269]
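Written out, the cost breakdown above amounts to the following back-of-the-envelope model (symbol names are illustrative, not measurements from the paper):

```python
def total_detection_time(n_images, n_categories,
                         t_hog, t_image_pre, t_category_pre, t_image_category):
    """Total time = per-image terms + per-category terms + per-(image x category) terms.

    t_hog             -- mandatory HOG feature computation per image
    t_image_pre       -- remaining per-image preprocessing
    t_category_pre    -- per-category preprocessing (detector data structures)
    t_image_category  -- template evaluation per (image, category) pair
    """
    return (n_images * (t_hog + t_image_pre)
            + n_categories * t_category_pre
            + n_images * n_categories * t_image_category)
```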

80 Table 2: Comparison of various object detection methods on the PASCAL VOC 2007 dataset. [sent-235, score-0.289]

81 The reported time here is the time to complete the detection of 20 categories starting from the raw image. [sent-236, score-0.237]

82 We tested our template evaluation library on two well-known detection methods. [sent-239, score-0.461]

83 (a) Deformable part models and (b) exemplar SVM detectors. [sent-240, score-0.23]

84 We used the PASCAL VOC 2007 dataset, an established benchmark for object detection algorithms. [sent-241, score-0.289]

85 The deformable part models algorithm is the standard object detection baseline. [sent-253, score-0.341]

86 Table 2 compares our implementation to ten prominent methods including the original deformable part models versions 3, 4 and 5. [sent-255, score-0.4]

87 Detailed per-category average precisions are published in the reference papers. [sent-257, score-0.293]

88 It then evaluates the original templates on the candidates to fine-tune the scores. [sent-263, score-0.208]

89 We first estimate template scores using our Vector Quantization based library. [sent-268, score-0.417]

90 For the convolution we get roughly a 25-fold speedup compared to the baseline implementation. [sent-269, score-0.193]

91 We re-estimate the score of the top 1% of locations for each category and we are virtually able to reproduce the original average precisions (Table 3). [sent-271, score-0.509]

92 Including MATLAB implementation overhead, our version of exemplar SVM is roughly 8-fold faster than the baseline without any loss in accuracy. [sent-272, score-0.287]

93 The top three rows refer to the DPM implementation while the last two rows refer to exemplar SVMs. [sent-381, score-0.219]

94 The two bottom rows compare the performance of our exemplar SVM implementation with the baseline. [sent-383, score-0.219]

95 In this paper we present a method to speed up object detection by two orders of magnitude with little or no loss of accuracy. [sent-386, score-0.415]

96 The main contribution of this paper lies in selecting techniques that are mutually compatible and together lead to a major speedup in template evaluation. [sent-387, score-0.459]

97 This library is of special interest in large-scale and real-time object detection tasks. [sent-389, score-0.359]

98 Our application-specific implementation of PEGASOS [24] trains an SVM classifier for a 12 × 12 template with 10^8 training examples (uniformly distributed in the training set) in about one minute. [sent-397, score-0.374]

99 Being able to access the whole training set plus faster template evaluation could make hard negative mining either faster or unnecessary. [sent-398, score-0.391]

100 Rapid object detection using a boosted cascade of simple features. In Conference on Computer Vision and Pattern Recognition, 2001. [sent-432, score-0.474]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('template', 0.333), ('deformable', 0.274), ('hog', 0.269), ('quantization', 0.263), ('lookup', 0.203), ('score', 0.192), ('exemplar', 0.178), ('templates', 0.175), ('detection', 0.172), ('dpm', 0.154), ('category', 0.14), ('quantized', 0.139), ('cascade', 0.129), ('voc', 0.125), ('object', 0.117), ('felzenszwalb', 0.113), ('locations', 0.112), ('sdef', 0.111), ('image', 0.108), ('speed', 0.091), ('rescoring', 0.089), ('windows', 0.088), ('scores', 0.084), ('speedup', 0.082), ('convolution', 0.079), ('sapp', 0.078), ('vq', 0.078), ('tables', 0.077), ('pascal', 0.075), ('vision', 0.073), ('neighbour', 0.072), ('deformation', 0.072), ('library', 0.07), ('svm', 0.069), ('girshick', 0.068), ('cell', 0.068), ('dubout', 0.066), ('legacy', 0.066), ('categories', 0.065), ('per', 0.065), ('detector', 0.064), ('pyramid', 0.062), ('fft', 0.059), ('simd', 0.059), ('evaluation', 0.058), ('published', 0.056), ('features', 0.056), ('part', 0.052), ('speedups', 0.05), ('detectors', 0.049), ('overhead', 0.049), ('cascades', 0.048), ('vedaldi', 0.048), ('magnitude', 0.048), ('exhaustive', 0.048), ('clusters', 0.048), ('visualization', 0.047), ('precision', 0.047), ('packing', 0.046), ('evaluating', 0.045), ('location', 0.045), ('table', 0.045), ('running', 0.045), ('fast', 0.045), ('pattern', 0.045), ('batching', 0.044), ('ffld', 0.044), ('illinois', 0.044), ('neighbouring', 0.044), ('padding', 0.044), ('pedersoli', 0.044), ('quantizing', 0.044), ('rescore', 0.044), ('compatible', 0.044), ('feature', 0.043), ('orders', 0.042), ('implementation', 0.041), ('discriminatively', 0.04), ('recognition', 0.039), ('felzenswalb', 0.039), ('malisiewicz', 0.039), ('quantizes', 0.039), ('scoring', 0.039), ('pca', 0.038), ('folds', 0.038), ('nearest', 0.037), ('loss', 0.036), ('additions', 0.036), ('maji', 0.036), ('memory', 0.036), ('approximation', 0.036), ('ap', 0.035), ('window', 0.035), ('quantize', 0.034), ('pack', 0.034), ('original', 0.033), ('trained', 0.033), ('precisions', 0.032), ('baseline', 0.032), ('european', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

2 0.20042124 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

3 0.1914365 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

4 0.18297106 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-space-partitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearest-neighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1
(a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). 
The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. 
For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. 
We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. 
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. 
While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. 
For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. 
Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . 
i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. 
If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.
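As promised above, here is a sketch of the one-dimensional large-margin enhancement for a simple space-partition heuristic (illustrative code only, not the paper's MM-tree construction; the function name, the balance parameter and the two-blob data are assumptions). The points are projected onto the principal eigenvector of their covariance, and the split is placed in the widest gap among the ω-balanced split positions:

import numpy as np

# Illustrative sketch: a 1-D large-margin, omega-balanced split along the principal
# eigenvector of the data covariance. Project the points, sort the projections, and
# split at the widest gap among the positions that leave each side with at least
# (1 - omega)/2 of the points.
def large_margin_1d_split(X, omega=0.2):
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

    proj = np.sort(X @ direction)
    m = len(proj)
    lo = int(np.ceil(m * (1 - omega) / 2))     # smallest admissible left-side size
    hi = int(np.floor(m * (1 + omega) / 2))    # largest admissible left-side size
    gaps = proj[lo:hi] - proj[lo - 1:hi - 1]   # gap after each admissible split position
    k = lo + int(np.argmax(gaps))
    threshold = 0.5 * (proj[k - 1] + proj[k])  # midpoint of the widest admissible gap
    left = X[X @ direction <= threshold]
    right = X[X @ direction > threshold]
    return direction, threshold, left, right

# Two well-separated blobs: the split lands in the empty region between them.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 5)), rng.normal(3.0, 1.0, size=(200, 5))])
_, threshold, left, right = large_margin_1d_split(X)
print(len(left), len(right))                   # roughly balanced split with a large margin

The split direction could equally be a coordinate axis, as in a kd-tree; only the one-dimensional gap search changes.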
References

[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computing, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

5 0.13431792 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros

Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. 1

6 0.10140637 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

7 0.099168286 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

8 0.09194915 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

9 0.083350249 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

10 0.082351699 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

11 0.081163771 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

12 0.079285607 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

13 0.07754641 148 nips-2013-Latent Maximum Margin Clustering

14 0.074474469 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths

15 0.072869383 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

16 0.071437664 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

17 0.068251438 251 nips-2013-Predicting Parameters in Deep Learning

18 0.068082899 335 nips-2013-Transfer Learning in a Transductive Setting

19 0.067378059 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

20 0.064169899 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.17), (1, 0.077), (2, -0.112), (3, -0.062), (4, 0.102), (5, -0.049), (6, -0.043), (7, 0.012), (8, -0.065), (9, 0.042), (10, -0.14), (11, 0.036), (12, 0.049), (13, 0.034), (14, -0.042), (15, 0.027), (16, -0.032), (17, -0.12), (18, -0.028), (19, 0.076), (20, 0.069), (21, 0.041), (22, -0.033), (23, 0.046), (24, -0.069), (25, 0.039), (26, 0.027), (27, -0.048), (28, 0.014), (29, -0.044), (30, 0.06), (31, 0.004), (32, 0.064), (33, 0.003), (34, 0.008), (35, 0.007), (36, -0.003), (37, 0.007), (38, 0.005), (39, -0.123), (40, 0.032), (41, -0.036), (42, 0.063), (43, -0.005), (44, -0.069), (45, 0.018), (46, 0.071), (47, -0.071), (48, -0.034), (49, 0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93997949 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

2 0.79624653 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

3 0.7831921 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

4 0.712075 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

Author: Vibhav Vineet, Carsten Rother, Philip Torr

Abstract: Many methods have been proposed to solve the problems of recovering intrinsic scene properties such as shape, reflectance and illumination from a single image, and object class segmentation separately. While these two problems are mutually informative, in the past not many papers have addressed this topic. In this work we explore such joint estimation of intrinsic scene properties recovered from an image, together with the estimation of the objects and attributes present in the scene. In this way, our unified framework is able to capture the correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor), and materials (wooden, plastic) in a given scene. For example, our model is able to enforce the condition that if a set of pixels take same object label, e.g. table, most likely those pixels would receive similar reflectance values. We cast the problem in an energy minimization framework and demonstrate the qualitative and quantitative improvement in the overall accuracy on the NYU and Pascal datasets. 1

5 0.68531173 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros

Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. 1

6 0.67979515 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

7 0.6588183 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

8 0.62729794 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

9 0.59445554 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

10 0.58619303 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

11 0.57516426 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

12 0.56168008 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

13 0.54875499 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars

14 0.53662282 226 nips-2013-One-shot learning by inverting a compositional causal process

15 0.53569186 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

16 0.53279626 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

17 0.52330476 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

18 0.51958746 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

19 0.51880974 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

20 0.51572269 355 nips-2013-Which Space Partitioning Tree to Use for Search?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.047), (19, 0.321), (33, 0.136), (34, 0.087), (41, 0.021), (49, 0.024), (56, 0.069), (70, 0.068), (85, 0.034), (89, 0.04), (93, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77288145 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

2 0.75946778 9 nips-2013-A Kernel Test for Three-Variable Interactions

Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma

Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

3 0.75856763 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari

Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement. 1

4 0.62366217 201 nips-2013-Multi-Task Bayesian Optimization

Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams

Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1

5 0.57801646 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

Author: Adel Javanmard, Andrea Montanari

Abstract: Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a ‘de-biased’ version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem. 1

6 0.56337476 173 nips-2013-Least Informative Dimensions

7 0.54029143 166 nips-2013-Learning invariant representations and applications to face verification

8 0.53685212 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

9 0.53506899 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

10 0.53337842 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

11 0.53223324 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

12 0.53122652 251 nips-2013-Predicting Parameters in Deep Learning

13 0.53044635 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

14 0.53035307 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

15 0.52749145 64 nips-2013-Compete to Compute

16 0.52649671 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

17 0.52562511 5 nips-2013-A Deep Architecture for Matching Short Texts

18 0.52509958 331 nips-2013-Top-Down Regularization of Deep Belief Networks

19 0.52436411 30 nips-2013-Adaptive dropout for training deep neural networks

20 0.52332723 183 nips-2013-Mapping paradigm ontologies to and from the brain