nips nips2013 nips2013-355 knowledge-graph by maker-knowledge-mining

355 nips-2013-Which Space Partitioning Tree to Use for Search?

Source: pdf

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors inﬂuencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-ﬁrst branch-and-bound algorithm. Algorithm 1 presents a simpliﬁed version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-ﬁrst tree traversal to a speciﬁed depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the ﬁrst time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Speciﬁcally, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically veriﬁed in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-ﬁrst branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-ﬁrst traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the ﬁrst approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of ﬁnding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difﬁculty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efﬁcient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Deﬁnition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Deﬁnition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the deﬁnition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise deﬁnition of dc . an afﬁne plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a ﬁxed η > 0. This notion of the absence of outliers was ﬁrst introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following deﬁnitions to precisely state our results: Deﬁnition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any speciﬁed value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Deﬁnition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is deﬁned as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these deﬁnitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a ﬁxed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is ﬁxed, and ψ is ﬁxed for a dataset S. Then, for a ﬁxed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Deﬁnition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Deﬁnition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the deﬁnition of distance error. Let R = {r ∈ S : q − r < du }. By deﬁnition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Deﬁnition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the inﬂuence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any ﬁxed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, ﬁrstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are signiﬁcantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reﬂected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any ﬁxed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a signiﬁcant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to ﬁnd the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC ﬁnds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassiﬁcation. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efﬁciently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, 2009. [3] R.F. Sproull. Reﬁnements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difﬁculty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efﬁcient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artiﬁcial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search? [sent-12, score-0.792]

2 ” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. [sent-13, score-1.078]

3 We also explore another factor affecting the search performance – margins of the partitions in these trees. [sent-14, score-0.493]

4 We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. [sent-15, score-0.827]

5 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. [sent-16, score-0.486]

6 Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. [sent-17, score-0.821]

7 Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). [sent-19, score-0.693]

8 This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. [sent-20, score-0.419]

9 This allows us to identify implicit factors inﬂuencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. [sent-25, score-0.549]

10 Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). [sent-26, score-0.545]

11 A search query on this tree is usually answered with a depth-ﬁrst branch-and-bound algorithm. [sent-27, score-0.666]

12 Algorithm 1 presents a simpliﬁed version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-ﬁrst tree traversal to a speciﬁed depth. [sent-28, score-0.913]

13 These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). [sent-33, score-0.622]

14 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc . [sent-34, score-1.444]

15 We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. [sent-40, score-1.033]

16 Our results provide theoretical evidence, for the ﬁrst time, that better quantization performance implies better search performance1 . [sent-41, score-0.957]

17 These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. [sent-42, score-0.621]

18 After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. [sent-43, score-1.836]

19 This implies that better quantization produces better worst-case search results. [sent-45, score-0.897]

20 • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . [sent-49, score-0.618]

21 • We present the quantization performance guarantee of a large margin BSP tree in Section 4. [sent-50, score-1.112]

22 O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. [sent-51, score-0.992]

23 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. [sent-53, score-0.708]

24 The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. [sent-56, score-0.632]

25 The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. [sent-57, score-0.655]

26 Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. [sent-58, score-0.533]

27 The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). [sent-59, score-0.428]

28 The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). [sent-60, score-0.584]

29 This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. [sent-64, score-0.437]

30 After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. [sent-67, score-0.447]

31 The candidate set size implies the worst-case search time bound. [sent-68, score-0.399]

32 The hash table construction guarantees the set size and search approximation. [sent-69, score-0.38]

33 Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. [sent-70, score-0.577]

34 For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. [sent-71, score-0.906]

35 For any choice of the depth l, we present the ﬁrst approximation guarantee for this search process. [sent-72, score-0.468]

36 Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. [sent-73, score-0.889]

37 The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. [sent-75, score-0.605]

38 Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. [sent-80, score-1.284]

39 The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. [sent-82, score-1.573]

40 (2) Tree-based structured vector quantization is used for efﬁcient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. [sent-83, score-1.572]

41 The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. [sent-84, score-1.511]

42 For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). [sent-87, score-0.828]

43 D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. [sent-90, score-0.683]

44 These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. [sent-117, score-0.656]

45 Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. [sent-119, score-0.894]

46 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. [sent-120, score-0.909]

47 The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. [sent-121, score-0.474]

48 For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. [sent-123, score-0.482]

49 Algorithm 1 requires the query traversal depth l as an input. [sent-124, score-0.365]

50 Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . [sent-127, score-0.837]

51 We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. [sent-128, score-0.771]

52 Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3. [sent-135, score-0.36]

53 Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . [sent-151, score-0.514]

54 c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . [sent-153, score-0.814]

55 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a ﬁxed η ≥ 8. [sent-154, score-0.363]

56 For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . [sent-155, score-0.729]

57 Then, for a ﬁxed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). [sent-161, score-1.072]

58 Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. [sent-162, score-0.436]

59 As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. [sent-165, score-0.721]

60 For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. [sent-167, score-0.637]

61 This result is obtained by recursively applying the quantization error improvement in Deﬁnition 2. [sent-174, score-0.69]

62 Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. [sent-178, score-0.538]

63 1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . [sent-195, score-0.498]

64 Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . [sent-205, score-0.47]

65 Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. [sent-206, score-1.247]

66 For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. [sent-209, score-0.485]

67 Dependence of the nearest-neighbor search error on the partition margins. [sent-210, score-0.399]

68 However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. [sent-213, score-0.458]

69 Further assume that γ is the smallest margin size on both sides of any partition in the tree T . [sent-218, score-0.547]

70 ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. [sent-220, score-0.859]

71 This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. [sent-221, score-0.61]

72 Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. [sent-229, score-0.924]

73 Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. [sent-230, score-0.814]

74 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). [sent-231, score-0.782]

75 This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. [sent-232, score-1.068]

76 If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. [sent-234, score-0.713]

77 For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). [sent-235, score-1.108]

78 We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. [sent-240, score-0.357]

79 We only use hyperplane partitions for the tree construction. [sent-241, score-0.501]

80 The quantization performance of the 2M-tree, PA-tree and MM-tree are signiﬁcantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3. [sent-246, score-0.651]

81 The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. [sent-248, score-0.622]

82 The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. [sent-252, score-0.534]

83 Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). [sent-256, score-0.485]

84 The results imply that for any ﬁxed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. [sent-261, score-0.376]

85 Moreover, with η ≥ 8, for only one of the datasets would a signiﬁcant fraction of the partitions in the tree (of any type) need to be the alternate partition. [sent-262, score-0.528]

86 The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. [sent-264, score-0.913]

87 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. [sent-266, score-1.254]

88 The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. [sent-267, score-0.885]

89 In this section, we establish a theoretical guarantee for the MM-tree quantization performance. [sent-268, score-0.674]

90 Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4. [sent-282, score-0.69]

91 Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . [sent-284, score-0.517]

92 Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . [sent-285, score-0.69]

93 The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. [sent-289, score-0.81]

94 If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). [sent-291, score-0.622]

95 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. [sent-301, score-0.833]

96 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. [sent-329, score-1.133]

97 This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3. [sent-330, score-0.865]

98 Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. [sent-336, score-0.597]

99 This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. [sent-343, score-0.642]

100 Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. [sent-348, score-0.945]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('quantization', 0.593), ('tree', 0.284), ('search', 0.243), ('depth', 0.175), ('ar', 0.166), ('margin', 0.156), ('vs', 0.152), ('du', 0.152), ('partitions', 0.144), ('al', 0.128), ('trees', 0.115), ('partition', 0.107), ('query', 0.102), ('mmc', 0.102), ('alternate', 0.1), ('neighbor', 0.1), ('tc', 0.096), ('defeatist', 0.096), ('verma', 0.096), ('candidate', 0.095), ('splits', 0.094), ('traversal', 0.088), ('split', 0.085), ('hash', 0.082), ('hashing', 0.082), ('node', 0.079), ('expansion', 0.079), ('margins', 0.077), ('mages', 0.077), ('outliers', 0.076), ('hyperplane', 0.073), ('region', 0.071), ('balanced', 0.07), ('implies', 0.061), ('bsptree', 0.058), ('guarantees', 0.055), ('distance', 0.055), ('lc', 0.052), ('nearestneighbor', 0.051), ('bsp', 0.051), ('guarantee', 0.05), ('points', 0.049), ('partitioning', 0.049), ('error', 0.049), ('improvement', 0.048), ('rd', 0.047), ('minr', 0.047), ('lsh', 0.044), ('dc', 0.042), ('mp', 0.042), ('condition', 0.042), ('disjoint', 0.04), ('axis', 0.04), ('queries', 0.04), ('nearest', 0.039), ('runtime', 0.039), ('ptdigits', 0.038), ('spill', 0.038), ('imply', 0.038), ('attributed', 0.037), ('answered', 0.037), ('dasgupta', 0.037), ('covariance', 0.036), ('balance', 0.036), ('intrinsic', 0.035), ('deteriorating', 0.034), ('iny', 0.034), ('quantizer', 0.034), ('heuristics', 0.034), ('candidates', 0.034), ('principal', 0.033), ('mn', 0.032), ('rank', 0.032), ('farther', 0.031), ('inlier', 0.031), ('theoretical', 0.031), ('cov', 0.031), ('dimensionality', 0.03), ('bounded', 0.03), ('worst', 0.03), ('enforced', 0.03), ('corollary', 0.03), ('algorithmically', 0.029), ('worstcase', 0.029), ('invoked', 0.029), ('curse', 0.029), ('performance', 0.029), ('freund', 0.028), ('doubling', 0.027), ('regions', 0.027), ('proportional', 0.026), ('clustering', 0.026), ('rigorously', 0.026), ('atlanta', 0.026), ('georgia', 0.026), ('xi', 0.025), ('child', 0.025), ('nition', 0.025), ('leaf', 0.025), ('desired', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

2 0.18297106 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a signiﬁcant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

3 0.16927516 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

Author: Nitish Srivastava, Ruslan Salakhutdinov

Abstract: High capacity classiﬁers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classiﬁcation performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classiﬁer’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-speciﬁed tree, this algorithm modiﬁes the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classiﬁcation task. Our method achieves state-of-the-art classiﬁcation results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1

4 0.15765722 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩S2 ∩S3 | , S1 , S2 , S3 ∈ C, where C is a |S1 ∪S2 ∪S3 | size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality. 1 Introduction and Motivation Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, the resemblance R2way between S1 and S2 is deﬁned as: R2way = |S1 ∩S2 | . Existing notions of similarity in search problems mainly work with |S1 ∪S2 | pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], deﬁned over k sets {S1 , S2 , ..., Sk } as Rk−way = |S1 ∩S2 ∩...∩Sk | . |S1 ∪S2 ∪...∪Sk | Binary high-dimensional data The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of speciﬁc feature words captures sufﬁcient information [7, 16, 20], especially with (e.g.,) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice. (i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate list of similar words by typing only few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term document binary representation of each word w in the database, then given query w1 and w2 , we show that |w1 ∩w2 ∩w| |w1 ∪w2 ∪w| turns out to be a very good similarity measure for this application (see Section 7.1). 1 (ii) Joint recommendations: Users A and B would like to watch a movie together. The proﬁle of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user proﬁle may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A∩B∩M | . The problem of group recommendation [3] is applicable |A∪B∪M | in many more settings such as recommending people to join circles, etc. (iii) Improving retrieval quality: We are interested in ﬁnding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more reﬁned than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can reﬁne our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user. (iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the afﬁnity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity. Efﬁciency is crucial With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulﬁlled if it admits efﬁcient algorithms. This paper fulﬁlls this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest. Organization In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications. 2 Classical c-NN and Locality Sensitive Hashing (LSH) Initial attempts of ﬁnding efﬁcient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN). Deﬁnition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD , and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exist an R0 -near neighbor of q in P, it reports some cR0 -near neighbor of q in P with probability 1 − δ. The usual notion of c-NN is for distance. Since we deal with similarities, we deﬁne R0 -near neighbor of point q as a point p with Sim(q, p) ≥ R0 , where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S Deﬁnition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0 , cR0 , p1 , p2 )-sensitive if for any two points x, y ∈ RD and h chosen uniformly from H satisﬁes the following: • if Sim(x, y) ≥ R0 then P rH (h(x) = h(y)) ≥ p1 • if Sim(x, y) ≤ cR0 then P rH (h(x) = h(y)) ≤ p2 For approximate nearest neighbor search typically, p1 > p2 and c < 1 is needed. Note, c < 1 as we are deﬁning neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line. 2 Fact 1 Given a family of (R0 , cR0 , p1 , p2 ) -sensitive hash functions, one can construct a data structure for c-NN with O(nρ log1/p2 n) query time and space O(n1+ρ ), where ρ = log 1/p1 . log 1/p2 Minwise Hashing for Pairwise Resemblance One popular choice of LSH family of functions associated with resemblance similarity is, Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation π : Ω → Ω, on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by elementary probability argument that P r (min(π(S1 )) = min(π(S2 ))) = |S1 ∩ S2 | = R2way . |S1 ∪ S2 | (1) The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: min(π(S1 )), min(π(S2 )). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing. 3 3-way Similarity Search Formulation Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method |S1 ∩S2 ∩S3 | using 3-way resemblance similarity function Sim(S1 , S2 , S3 ) = |S1 ∪S2 ∪S3 | . The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C ⊆ 2Ω of sets (or binary vectors), we are particularly interested in the following three problems: 1. Given two query sets S1 and S2 , ﬁnd S3 ∈ C that maximizes Sim(S1 , S2 , S3 ). 2. Given a query set S1 , ﬁnd two sets S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). 3. Find three sets S1 , S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). The brute force way of enumerating all possibilities leads to the worst case query time of O(n), O(n2 ) and O(n3 ) for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we deﬁne the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ: 1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2 , if there ′ exists S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report some S3 ∈ C so that ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1 , if there exists a pair of ′ ′ set S2 , S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S2 , S3 ∈ C so that ′ ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1 , S2 , S3 ∈ C with ′ ′ ′ ′ ′ ′ Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S1 , S2 , S3 ∈ C so that Sim(S1 , S2 , S3 ) ≥ cR0 . 4 Sub-linear Algorithm for 3-way c-NN The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can ﬁlter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically ﬁlter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we ﬁrst focus on 3-way c-NN problem for binary data. Theorem 1 For R3way c-NN one can construct a data structure with O(nρ log1/cR0 n) query time and O(n1+ρ ) space, where ρ = 1 − log 1/c log 1/c+log 1/R0 . The argument for 2-way resemblance can be naturally extended to k-way resemblance. Speciﬁcally, given three sets S1 , S2 , S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have: P r (min(π(S1 )) = min(π(S2 )) = min(π(S3 ))) = R3way . (2) Eq.( 2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efﬁcient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case 3 of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3 , and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme. Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3 )) and f2 : Ω × Ω → N for manipulating both min(π(S1 )) and min(π(S2 )). Let a ∈ N+ such that |Ω| = D < 10a . We deﬁne f1 (x) = (10a + 1) × x and f2 (x, y) = 10a x + y. This choice ensures that given query S1 and S2 , for any S3 ∈ C, f1 (min(π(S3 ))) = f2 (min(π(S1 )), min(π(S2 ))) holds if and only if min(π(S1 )) = min(π(S2 )) = min(π(S2 )), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Deﬁne a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hK (S3 ))] where hi is chosen uniformly from minwise hashing family. For given query points S1 and S2 , retrieve only points in the bucket g2 (S1 , S2 ) = [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hK (S1 ), hK (S2 ))]. Repeat this process L times independently. For any K S3 ∈ C, with Sim(S1 , S2 , S3 ) ≥ R0 , is retrieved with probability at least 1 − (1 − R0 )L . Using log 1/c log K = ⌈ log n ⌉ and L = ⌈nρ log( 1 )⌉, where ρ = 1 − log 1/c+log 1/R0 , the proof can be obtained 1 δ cR0 using standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log( 1 ). Note, the process is δ stopped as soon as we ﬁnd some element with R3way ≥ cR0 . Theorem 1 can be easily extended to k-way resemblance with same query time and space guarantees. Note that k-way c-NN is at least as hard as k ∗ -way c-NN for any k ∗ ≤ k, because we can always choose (k −k ∗ +1) identical query sets in k-way c-NN, and it reduces to k ∗ -way c-NN problem. So, any improvements in R3way c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense. The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option, if we have two different queries S1 , S2 , then we retrieve from bucket g2 (S1 , S2 ) and that is usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1 , then we will look into bucket g2 (S1 , S1 ), and we know that the 3-way resemblance between S1 , S1 , S3 boils down to the pairwise resemblance between S1 and S3 . So, the same hash tables can be used for both the purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k ∗ -way similarity search so long as k ∗ ≤ k. The approximation guarantees still holds. This ﬂexibility makes k-way c-NN bucketing scheme more advantageous over the pairwise scheme. ρ 1 One of the peculiarity of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice R0=0.01 0.8 of the threshold R0 we are interested in and the value of c 0.05 0.1 0.3 0.6 which is the approximation ratio that we will tolerate. Figure 1 0.2 0.4 0.8 log 1/c 0.5 plots ρ = 1− log 1/c+log 1/R0 with respect to c, for selected R0 0.4 0.6 0.9 0.7 values from 0.01 to 0.99. For instance, if we are interested in 0.2 0.95 highly similar pairs, i.e. R0 ≈ 1, then we are looking at near R =0.99 0 O(log n) query complexity for c-NN problem as ρ ≈ 0. On 0 0 0.2 0.4 0.6 0.8 1 the other hand, for very lower threshold R0 , there is no much c log 1/c of hope of time-saving because ρ is close to 1. Figure 1: ρ = 1 − log 1/c+log 1/R0 . 5 Other Efﬁcient k-way Similarities We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efﬁcient . We have demonstrated existence of one such example of efﬁcient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve existence of efﬁcient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transfor∑∞ mation on S. PGF transformation on S is deﬁned as P GF (S) = i=1 pi S i , where S ∈ [0, 1] and ∑∞ pi ≥ 0 satisﬁes i=1 pi = 1. Similar theorem can also be shown in the case of 3-way resemblance. 4 Theorem 2 Any PGF transformation on 3-way resemblance R3way is efﬁcient. Recall in the proof of Theorem 1, we created hash assignments f1 (min(π(S3 ))) and f2 (min(π(S1 )), min(π(S2 ))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1 (min(π(S3 )) = f2 (min(π(S1 )), min(π(S2 )))} happens with probability P r(E) = R3way . To prove the above Theorem 2, we will need to create hash events ∑∞ i having probability P GF (R3way ) = i=1 pi (R3way ) . Note that 0 ≤ P GF (R3way ) ≤ 1. We will make use of the following simple lemma. Lemma 1 (R3way )n is efﬁcient for all n ∈ N. n n Proof: Deﬁne new hash assignments g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hn (S3 ))] and g2 (S1 , S2 ) = n n [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hn (S1 ), hn (S2 ))]. The collision event g1 (S3 ) = g2 (S1 , S2 ) has n n probability (R3way )n . We now use the pair < g1 , g2 > instead of < f1 , f2 > and obtain same 3way n guarantees, as in Theorem 1, for (R ) as well. i i Proof of Theorem 2: From Lemma 1, let < g1 , g2 > be the hash pair corresponding to (R3way )i i i as used in above lemma. We sample one hash pair from the set {< g1 , g2 >: i ∈ N}, where i i the probability of sampling < g1 , g2 > is proportional to pi . Note that pi ≥ 0, and satisﬁes ∑∞ is i=1 pi = 1, and so the above sampling ∑ valid. It is not difﬁcult to see that the collision of the ∞ sampled hash pair has probability exactly i=1 pi (R3way )i . Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have inﬁnitely many k-way similarity functions admitting efﬁcient sub-linear search. One, that might be interesting, because of its radial basis kernel like nature, is shown in the following corollary. Corollary 1 eR k−way −1 is efﬁcient. Proof: Use the expansion of eR k−way normalized by e to see that eR k−way −1 is a PGF on Rk−way . 6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems For 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family will save even more computations. Theorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with log 1/c O(n2ρ log1/cR0 n) query time and O(n1+2ρ ) space, where ρ = 1 − log 1/c+log 1/R0 . Note that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n2 ) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n2ρ log1/cR0 n), but uses a lot more space, O(n2(1+ρ )), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space. Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc ) = [h1 (Sc ); h2 (Sc ); ...; hK (Sc )], where hi is chosen uniformly from minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq ) = [h1 (Sq ); h2 (Sq ); ...; hK (Sq )] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1 , S2 ∈ C with Sim(Sq , S1 , S2 ) ≥ R0 , using K Eq. (2), we can see that we will ﬁnd that pair in bucket B(Sq ) with probability 1 − (1 − R0 )L . Here, we cannot use traditional choice of K and L, similar to what we did in Theorem 1, as there 2 log are O(n2 ) instead of O(n) possible pairs. We instead use K = ⌈ log 1n ⌉ and L = ⌈n2ρ log( 1 )⌉, δ cR0 log 1/c with ρ = 1 − log 1/c+log 1/R0 . With this choice of K and L, the result follows. Note, the process is stopped as soon as we ﬁnd pairs S1 and S2 with Sim(Sq , S1 , S2 ) ≥ cR0 . The key argument that saves space from O(n2(1+ρ) ) to O(n1+2ρ ) is that we hash n points individually. Eq. (2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself ﬁlter out most of the unnecessary combinations. 5 Theorem 4 For R3way c-Best Cluster Problem (or c-BC) there exist an algorithm with running time log 1/c O(n1+2ρ log1/cR0 n), where ρ = 1 − log 1/c+log 1/R0 . The argument similar to one used in proof of Theorem 3 leads to the running time of O(n1+3ρ log1/cR0 n) as we need L = O(n3ρ ), and we have to processes all points at least once. Proof of Theorem 4: Repeat c-CP problem n times for every element in collection C acting as query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n1+2ρ log1/cR0 n) evaluations of hash functions and the total querying time is O(n × n2ρ log1/cR0 n), which makes the total running time O(n1+2ρ log1/cR0 n). For k-way c-BC Problem, we can achieve O(n1+(k−1)ρ log1/cR0 n) running time. If we are interested in very high similarity cluster, with R0 ≈ 1, then ρ ≈ 0, and the running time is around O(n log n). This is a huge saving over the brute force O(nk ). In most practical cases, specially in big data regime where we have enormous amount of data, we can expect the k-way similarity of good clusters to be high and ﬁnding them should be efﬁcient. We can see that with increasing k, hashing techniques save more computations. 7 Experiments In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) Improving retrieval quality. 7.1 Google Sets: Generating Semantically Similar Words Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks. More speciﬁcally, every word is represented as a 1.2 million dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artiﬁcial” and “intelligence”; (iii) “milky” and “way” ; (iv) “ﬁnger” and “lakes”. Given the query words w1 and w2 , we compare the results obtained by the following four methods. • Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data. • 3-way Resemblance (3-way): We use 3-way resemblance |w1 ∩w2 ∩w| to rank every word |w1 ∪w2 ∪w| w and report top 5 words based on this ranking. • Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resem|w2 ∩w| blance |w1 ∩w| + |w2 ∪w| and report top 5 words based on this ranking. |w1 ∪w| • Pairwise Intersection (PI): We ﬁrst retrieve top 100 words based on pairwise resemblance for each w1 and w2 independently. We then report the words common in both. If there is no word in common we do not report anything. The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “ﬁnger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application. The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artiﬁcial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “ﬁnger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top ﬁve words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we choose two public datasets: MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reﬂect diversity both in terms of task and scale that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixel yielding a 784 dimension vector with the associated class label (digit 0 − 9). We binarize the data by settings all non zeros to be 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representation of emails labeled as spam or not. We randomly sample 70,000 data points and partitioned them into two independent sets of size 35,000 each. Table 2: Percentage of top candidates with the same labels as that of query retrieved using various similarity criteria. More indicates better retrieval quality (Best marked in bold). T OP Pairwise 3-way NNbor 4-way NNbor 1 94.20 96.90 97.70 MNIST 10 20 92.33 91.10 96.13 95.36 96.89 96.28 50 89.06 93.78 95.10 1 98.45 99.75 99.90 WEBSPAM 10 20 96.94 96.46 98.68 97.80 98.87 98.15 50 95.12 96.11 96.45 For evaluation, we need to generate potential similar search query candidates for k-way search. It makes no sense in trying to search for object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partition it according to the class labels. We then run a cheap k-mean clustering on each class, and randomly sample triplets < x1 , x2 , x3 > from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For MNIST dataset, the standard 10,000 test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we choose 10 triplets randomly from each cluster. In all we had 100 such triplets for each class, and thus 1000 overall query triplets. For WEBSPAM, which consists only of 2 classes, we choose one of the independent set and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class making the total number of 2000 query candidates. The above procedures ensure that the elements in each triplets < x1 , x2 , x3 > are not very far from each other and are of the same class label. For each triplet < x1 , x2 , x3 >, we sort all the points x in the other independent set based on the following: • Pairwise: We only use the information in x1 and rank x based on resemblance 7 |x1 ∩x| |x1 ∪x| . • 3-way NN: We rank x based on 3-way resemblance • 4-way NN: We rank x based on 4-way resemblance |x1 ∩x2 ∩x| |x1 ∪x2 ∪x| . |x1 ∩x2 ∩x3 ∩x| |x1 ∪x2 ∪x3 ∪x| . We look at the top 1, 10, 20 and 50 points based on orderings described above. Since, all the query triplets are of the same label, The percentage of top retrieved candidates having same label as that of the query items is a natural metric to evaluate the retrieval quality. This percentage values accumulated over all the triplets are summarized in Table 2. We can see that top candidates retrieved by 3-way resemblance similarity, using 2 query points, are of better quality than vanilla pairwise similarity search. Also 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance similarity search. This clearly demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend. 8 Future Work While the work presented in this paper is promising for efﬁcient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples. One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing un-seen queries which have not been processed. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to ﬁll the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search. Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to (real-valued) sparse data matrix. CRS performs very well for certain applications but it does not work in our context because the bottom (nonzero) subsets are not properly aligned. Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated the signiﬁcant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search. k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might be also a (monotonic) function of the determinant. 9 Conclusions We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go. Acknowledgement The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research. 8 References [1] http://www.howtogeek.com/howto/15799/how-to-use-autoﬁll-on-a-google-docs-spreadsheet-quick-tips/. [2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005. [3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efﬁciency. Proc. VLDB Endow., 2(1):754–765, 2009. [4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011. [5] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997. [6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998. [7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classiﬁcation. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999. [8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002. [9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012. [10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003. [11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for ﬁnding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975. [12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisﬁability problems using semideﬁnite programming. Journal of ACM, 42(6):1115–1145, 1995. [13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012. [14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998. [15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010. [16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007. [17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013. [18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007. [19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006. [20] Ping Li and Arnd Christian K¨ nig. b-bit minwise hashing. In Proceedings of the 19th International o Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010. [21] Ping Li, Arnd Christian K¨ nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way simio larities. In NIPS, Vancouver, Canada, 2010. [22] Ping Li, Art B Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012. [23] Ping Li, Anshumali Shrivastava, and Arnd Christian K¨ nig. b-bit minwise hashing in practice. In Intero netware, Changsha, China, 2013. [24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008. [25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010. [26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012. [27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014. [28] Roger Weber, Hans-J¨ rg Schek, and Stephen Blott. A quantitative analysis and performance study for o similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998. [29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008. [30] D. Zhou, J. Huang, and B. Sch¨ lkopf. Beyond pairwise classiﬁcation and clustering using hypergraphs. o In NIPS, Vancouver, Canada, 2006. 9

5 0.14021252 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

Author: Leonid Boytsov, Bilegsaikhan Naidan

Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efﬁcient for the same rank approximation quality. 1

6 0.10868599 63 nips-2013-Cluster Trees on Manifolds

7 0.10361787 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

8 0.1030203 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

9 0.097899914 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

10 0.097461239 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

11 0.08585991 47 nips-2013-Bayesian Hierarchical Community Discovery

12 0.085221693 148 nips-2013-Latent Maximum Margin Clustering

13 0.084502347 158 nips-2013-Learning Multiple Models via Regularized Weighting

14 0.081402756 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

15 0.080237933 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

16 0.078790091 326 nips-2013-The Power of Asymmetry in Binary Hashing

17 0.076312289 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

18 0.074739188 340 nips-2013-Understanding variable importances in forests of randomized trees

19 0.068626344 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

20 0.068517119 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, 0.04), (2, -0.033), (3, -0.002), (4, 0.12), (5, 0.055), (6, 0.037), (7, -0.067), (8, -0.07), (9, 0.118), (10, -0.053), (11, -0.03), (12, 0.152), (13, 0.055), (14, 0.049), (15, 0.007), (16, -0.043), (17, 0.073), (18, 0.07), (19, 0.094), (20, 0.122), (21, 0.135), (22, -0.074), (23, 0.126), (24, 0.066), (25, -0.075), (26, -0.072), (27, -0.058), (28, 0.11), (29, -0.036), (30, 0.05), (31, 0.03), (32, 0.089), (33, -0.071), (34, 0.124), (35, 0.05), (36, 0.082), (37, 0.072), (38, -0.009), (39, -0.177), (40, -0.047), (41, 0.01), (42, -0.015), (43, -0.068), (44, -0.011), (45, 0.011), (46, -0.028), (47, -0.012), (48, 0.013), (49, 0.188)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96963894 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

2 0.62281859 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

3 0.6166358 340 nips-2013-Understanding variable importances in forests of randomized trees

Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts

Abstract: Despite growing interest and practical use in various scientiﬁc areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientiﬁc ﬁelds is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientiﬁc areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simpliﬁed to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at ﬁlling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artiﬁcial problem; ﬁnally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we ﬁrst describe decision trees, as well as forests of randomized trees. Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classiﬁcation and regression trees and random forests A binary classiﬁcation (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identiﬁes at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efﬁcient and simple way to address this ﬂaw is to use them in the context of randomization-based ensemble methods. Speciﬁcally, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be deﬁned for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 1 More generally, splits are deﬁned by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not conﬁrmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simpliﬁed version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as deﬁned by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an inﬁnitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is deﬁned as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint conﬁgurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an inﬁnite ensemble of fully developed totally randomized trees and an inﬁnitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The ﬁrst level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of ﬁxed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us deﬁne as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we deﬁne as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by deﬁnition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its inﬁnite sample size importance as computed with an inﬁnite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The inﬁnite sample size / importance of Xm ∈ V as computed with an inﬁnite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The inﬁnite sample size importance of any variable Xm ∈ VR as computed with an inﬁnite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We ﬁrst show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of ﬁxed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the deﬁnition of irrelevant variables. In particular, our results thus hold for the Gini index in classiﬁcation and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q ﬁrst terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an inﬁnite ensemble of pruned totally randomized trees built up to depth q ≤ p and an inﬁnitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an inﬁnite ensemble of pruned totally randomized trees built up to depth q ≤ p and an inﬁnitely large training sample is identical to the importance as computed for Y with an inﬁnite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as deﬁned before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an inﬁnite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K ﬁxed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. In conclusion, the importances as computed with totally randomized trees exhibit properties that do not possess, by extension, neither random forests nor extra-trees. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, it may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a ﬁnite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm ; Y |B) cannot be measured neither for all Xm nor for all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased. 6 Illustration on a digit recognition problem In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K. Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1 , ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1. Since Table 1 perfectly deﬁnes the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances. To verify our 6 X1 X2 y 0 1 2 3 4 5 6 7 8 9 X3 X4 X5 X6 X7 Eqn. 3 0.412 0.581 0.531 0.542 0.656 0.225 0.372 3.321 K=1 0.414 0.583 0.532 0.543 0.658 0.221 0.368 3.321 x2 1 0 0 0 1 1 1 0 1 1 x3 1 1 1 1 1 0 0 1 1 1 x4 0 0 1 1 1 1 1 0 1 1 x5 1 0 1 0 0 0 1 0 1 0 x6 1 1 0 1 1 1 1 1 1 1 x7 1 0 1 1 0 1 1 0 1 1 Table 1: Values of Y, X1 , ..., X7 Figure 1: 7-segment display X1 X2 X3 X4 X5 X6 X7 x1 1 0 1 1 0 1 1 1 1 1 K=2 0.362 0.663 0.512 0.525 0.731 0.140 0.385 3.321 K=3 0.327 0.715 0.496 0.484 0.778 0.126 0.392 3.321 K=4 0.309 0.757 0.489 0.445 0.810 0.122 0.387 3.321 K=5 0.304 0.787 0.483 0.414 0.827 0.122 0.382 3.321 K=6 0.305 0.801 0.475 0.409 0.831 0.121 0.375 3.321 K=7 0.306 0.799 0.475 0.412 0.835 0.120 0.372 3.321 Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1 , . . . , X7 ; Y ) = H(Y ) = log2 (10) = 3.321 since inputs allow to perfectly predict the output. theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of ﬁnite size that perfectly follows the data distribution amounts to building them on a sample of inﬁnite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y . As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which conﬁrms Theorem 1. Small differences stem from the fact that a ﬁnite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards inﬁnity. It also shows that importances indeed add up to I(X1 , ...X7 ; Y ), which conﬁrms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2 , X4 and X3 which also show large importances. X1 , X7 and X6 appear to be the less informative. The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1 , X3 , X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus conﬁrms that importances are differently affected. To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importances terms, as shown in Figure 2. Each row in the plots of the ﬁgure corresponds to one the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds the importance of Xm when conditioned with k other variables 1 1 (e.g., to the term C k p−k B∈Pk (V −m ) I(Xm ; Y |B) in Equation 3 in the case of totally randomized p trees). In the left plot, for K = 1, the ﬁgure ﬁrst illustrates how importances yielded by totally randomized trees decomposes along the degrees k of interactions terms. We can observe that they each equally contribute (at most) the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are signiﬁcantly larger than 0). By contrast, it also clearly shows why 7 K=1 0.5 K=7 X1 X1 X2 X3 X4 X4 X5 X5 X6 X6 X7 0.375 X2 X3 X7 0 1 2 3 4 5 6 0.25 0.125 0 1 2 3 4 5 6 0.0 Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones. At K = 1, all I(Xm ; Y |B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects. X6 is not important: neither alone nor combined with others X6 seems to be very informative (all of its terms are close to 0). More interestingly, this ﬁgure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the ﬁrst terms are signiﬁcantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0). The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The ﬁrst thing to notice is masking effects. Some of the I(Xm ; Y |B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others. As a consequence, this also means that the importances of the remaining variables is biased and that it actually only accounts of their relevance when conditioned to X2 or X5 , but not of their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2 ; Y ) (resp. I(X5 ; Y )) since X2 (resp. X5 ) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned to many others. All in all, this ﬁgure thus indeed conﬁrms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances. 7 Conclusions In this work, we made a ﬁrst step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: it is equal to zero if and only if the variable is irrelevant and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and ﬁnally illustrated our developments on an artiﬁcial but insightful example (Section 6). There remain several limitations to our framework that we would like address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a ﬁnite setting. Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its ﬁnancial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Ofﬁce. 8 References Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 98888:1063–1095. Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classiﬁers. The Journal of Machine Learning Research, 9:2015–2033. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2):123–140. Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3. 1. Statistics Department University of California Berkeley, CA, USA. Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classiﬁcation and regression trees. Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42. Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537. Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artiﬁcial intelligence, 97(1):273–324. Liaw, A. and Wiener, M. (2002). Classiﬁcation and regression by randomforest. R news, 2(3):18–22. Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590. Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25. White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329. Zhao, G. (2000). A new perspective on classiﬁcation. PhD thesis, Utah State University, Department of Mathematics and Statistics. 9

4 0.60476106 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games

Author: Viliam Lisy, Vojta Kovarik, Marc Lanctot, Branislav Bosansky

Abstract: unkown-abstract

5 0.580365 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi

Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classiﬁcation. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efﬁciently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1

6 0.57846528 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

7 0.56777322 47 nips-2013-Bayesian Hierarchical Community Discovery

8 0.56738043 63 nips-2013-Cluster Trees on Manifolds

9 0.5586254 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

10 0.54554808 326 nips-2013-The Power of Asymmetry in Binary Hashing

11 0.51113051 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

12 0.49486986 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

13 0.49069941 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

14 0.4801302 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

15 0.44877729 119 nips-2013-Fast Template Evaluation with Vector Quantization

16 0.42908716 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

17 0.42574063 344 nips-2013-Using multiple samples to learn mixture models

18 0.41400939 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

19 0.4036949 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

20 0.39310557 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.022), (16, 0.09), (21, 0.017), (31, 0.165), (33, 0.136), (34, 0.082), (41, 0.024), (49, 0.036), (56, 0.137), (70, 0.041), (85, 0.061), (89, 0.041), (93, 0.033), (95, 0.032), (99, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85943049 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

2 0.81242299 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent

Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying datagenerating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justiﬁcation which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-inﬁnitesimal corruption noise (or non-inﬁnitesimal contractive penalty). 1

3 0.79065412 252 nips-2013-Predictive PAC Learning and Process Decompositions

Author: Cosma Shalizi, Aryeh Kontorovitch

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with ﬁnite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This deﬁnition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1

4 0.78335553 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change ﬂexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efﬁciently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1

5 0.78298706 232 nips-2013-Online PCA for Contaminated Data

Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan

Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the ﬁnal result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efﬁcient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1

6 0.78206491 233 nips-2013-Online Robust PCA via Stochastic Optimization

7 0.78191304 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

8 0.78183407 249 nips-2013-Polar Operators for Structured Sparse Estimation

9 0.77887917 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

10 0.77860117 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

11 0.77822286 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

12 0.77779347 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

13 0.77752084 186 nips-2013-Matrix factorization with binary components

14 0.77626455 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks

15 0.77552283 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

16 0.77552271 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

17 0.77544343 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

18 0.77495921 350 nips-2013-Wavelets on Graphs via Deep Learning

19 0.77420235 344 nips-2013-Using multiple samples to learn mixture models

20 0.77408701 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding