nips nips2013 nips2013-332 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. [sent-5, score-0.285]
2 Real-world data often come from generating models that are only locally stationary. [sent-7, score-0.11]
3 In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. [sent-8, score-0.294]
4 We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. [sent-9, score-0.281]
5 Almost all standard algorithms for learning graphical model structure [9, 10, 12, 3] assume that the underlying generating structure does not change over the course of data collection, and so the data are i.i.d. [sent-11, score-0.377]
6 In the real world, however, generating structures often change and it can be critical to quickly detect the structure change and then learn the new one. [sent-18, score-0.358]
7 That is, we cannot learn in “batch mode,” but must instead learn the novel structure in an online manner, processing the data as it arrives. [sent-20, score-0.159]
8 Current online learning algorithms can detect and handle changes in the learning environment, but none are capable of general, graphical model structure learning. [sent-21, score-0.267]
9 , and learns graphical model structure in an online fashion. [sent-25, score-0.197]
10 We then present the details of our algorithm and provide simulation evidence that it can successfully learn graphical model structure in an online manner. [sent-27, score-0.223]
11 Importantly, when there is a stable generating structure, the algorithm’s performance is indistinguishable from a standard batch-mode structure learning algorithm. [sent-28, score-0.192]
12 2 Related work We focus here on graphical models based on directed acyclic graphs (DAGs) over random variables with corresponding quantitative components, whether Bayesian networks or recursive Structural Equation Models (SEMs) [3, 12, 10]. [sent-30, score-0.144]
13 All of our observations in this paper, as well as the core algorithm, are readily adaptable to learn structure for models based on undirected graphs, such as Markov random fields or Gaussian graphical models [6, 9]. [sent-31, score-0.194]
14 Standard graphical model structure learning algorithms divide into two rough types. [sent-32, score-0.151]
15 Constraint-based structure learning algorithms leverage the fact that every graphical model predicts a pattern of (conditional) independencies over the variables, though multiple models can predict the same pattern. [sent-34, score-0.173]
16 Such algorithms (e.g., [10, 12]) find the set of graphical models that best predict the (conditional) independencies in the data. [sent-37, score-0.112]
17 Both types of structure learning algorithms assume that the data come from a single generating structure, and so neither is directly usable for learning when structure change is possible. [sent-38, score-0.279]
18 Since we are focused on situations in which the underlying structure can change, we do not want the same output. [sent-44, score-0.087]
19 One could instead look to online learning methods that track some environmental feature. [sent-45, score-0.093]
20 In general, TDL (temporal-difference learning) methods are good at tracking slow-moving environmental changes, but perform suboptimally during times of either high stability or dramatic change, such as when the generating model structure abruptly changes. [sent-48, score-0.198]
21 Both Bayesian [1] and frequentist [4] online changepoint detection (CPD) algorithms are effective at detecting abrupt changes, but do so by storing substantial portions of the input data. [sent-49, score-0.197]
22 For example, a Bayesian CPD [1] outputs the probability of a changepoint having occurred r timesteps ago, and so the algorithm must store more than r datapoints. [sent-50, score-0.096]
23 Furthermore, CPD algorithms assume a model of the environment that has only abrupt changes separated by periods of stability. [sent-51, score-0.073]
24 Two previous papers have aimed to learn time-indexed graph structures from time-series data, though both require full datasets as input, so cannot function in real-time [14, 11]. [sent-53, score-0.094]
25 Talih and Hengartner (2005) take an ordered data set and divide it into a fixed number of (possibly empty) data intervals, each with an associated undirected graph that differs by one edge from its neighbors. [sent-54, score-0.101]
26 In contrast with our work, they focus on a particular type of graph structure change (single edge addition or deletion), operate solely in “batch mode,” and use undirected graphs instead of directed acyclic graph models. [sent-55, score-0.315]
27 Our approach differs by using frequentist methods instead of Bayesian ones (since we would otherwise need to maintain a probability distribution over the superexponential number of graphical models), and by being able to operate in real-time on an incoming data stream. [sent-57, score-0.116]
28 In contrast to previous work on structure learning, we assume only that the generating process is locally stationary: for each time r, data are generated i.i.d. [sent-60, score-0.171]
29 At a high level, the Locally Stationary Structure Tracker (LoSST) algorithm takes, at each timestep r, a new datapoint as input and outputs a graphical model Mr . [sent-69, score-0.218]
30 Obviously, a single datapoint is insufficient to learn graphical model structure. [sent-70, score-0.244]
31 The LoSST algorithm instead tracks the locally stationary sufficient statistics—for recursive SEMs, the means, covariances, and sample size—in an online fashion, and then dynamically (re)learns the graphical model structure as appropriate. [sent-71, score-0.325]
32 The LoSST algorithm processes each datapoint only once, and so LoSST can also function as a single-pass graphical model structure learner for very large datasets. [sent-72, score-0.279]
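To make the per-timestep flow concrete, here is a minimal Python skeleton of the loop just described. It is only an illustrative sketch, not the authors' implementation: the helper callables (compute_weight, update_stats, should_relearn, learn_structure) are hypothetical placeholders for the components detailed in the remainder of this section.

```python
def losst(datastream, compute_weight, update_stats, should_relearn, learn_structure):
    """Single-pass skeleton: consume one datapoint per timestep r and yield a model M_r."""
    state = {"model": None}                  # sufficient statistics, weights, etc. live here
    for x in datastream:                     # each datapoint is processed exactly once
        a = compute_weight(x, state)         # weight grows when recent datapoints fit poorly
        update_stats(x, a, state)            # weighted means, covariances, effective sample size
        if should_relearn(state):            # probabilistic relearning schedule (see below)
            state["model"] = learn_structure(state)  # e.g., a constraint-based structure learner
        yield state["model"]                 # output the current graphical model M_r
```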
33 Let X^r be the r-th multivariate datapoint and let X^r_i be the value of V_i for that datapoint. [sent-73, score-0.128]
34 To track the potentially changing generating structure, the datapoints must potentially be differentially weighted. [sent-74, score-0.276]
35 In particular, datapoints should be weighted more heavily after a change occurs. [sent-75, score-0.278]
36 Let a_r ∈ (0, ∞) be the weight on X^r, and let b_r = Σ_{k=1}^{r} a_k be the sum of those weights over time. [sent-76, score-0.398]
37 The update equation for the covariance is C^r_{ij} = (1/b_r) Σ_{k=1}^{r} a_k (X^k_i − µ^r_i)(X^k_j − µ^r_j). [sent-78, score-0.12]
38 If ar = αbr , then the learning is the same as if one uses TD(0) learning for each covariance with a learning rate of α. [sent-80, score-0.291]
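As a concrete illustration of this weighted tracking, the following Python sketch updates the weighted mean and covariance incrementally from only the previous values and b_r. The incremental form is a standard one and is assumed here; it is not taken from the paper.

```python
import numpy as np

class WeightedStats:
    """Illustrative incremental tracking of the weighted mean and covariance."""

    def __init__(self, num_vars):
        self.mu = np.zeros(num_vars)              # weighted mean of X^1, ..., X^r
        self.C = np.zeros((num_vars, num_vars))   # weighted covariance C^r
        self.b = 0.0                              # b_r, the running sum of the weights a_k

    def update(self, x, a):
        """Fold in datapoint x with weight a; equivalent to recomputing
        C_ij = (1/b_r) * sum_k a_k (X_i^k - mu_i)(X_j^k - mu_j) from scratch."""
        x = np.asarray(x, dtype=float)
        b_new = self.b + a
        delta = x - self.mu
        mu_new = self.mu + (a / b_new) * delta
        self.C = (self.b * self.C + a * np.outer(delta, x - mu_new)) / b_new
        self.mu, self.b = mu_new, b_new
```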
39 The sample size S^r is more complicated, since datapoints are weighted differently and so the “effective” sample size can differ from the actual sample size (though it should always be less than or equal to it). [sent-81, score-0.271]
40 Because X^{r+1} comes from the current generating structure, it should always contribute 1 to the effective sample size S^{r+1}. [sent-82, score-0.122]
41 If we adjust the natural sample size update equation to satisfy these two constraints, then the update equation becomes: S^{r+1} = (a_r / a_{r+1}) S^r + 1 (3). If a_{r+1} ≥ a_r for all r (as in the method we use below), then S^{r+1} ≤ S^r + 1. [sent-84, score-0.593]
42 If ar+1 = ar for all r, then S r = r; that is, if the datapoint weights are constant, then S r is the true sample size. [sent-85, score-0.417]
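A small sketch of the effective sample size recursion in Equation (3), illustrating the two properties just noted: it grows by exactly 1 per datapoint under constant weights, and stays bounded when the weights grow geometrically. The weight sequences here are illustrative only.

```python
def effective_sample_size(weights):
    """S^{r+1} = (a_r / a_{r+1}) * S^r + 1, per Equation (3)."""
    S, a_prev = 0.0, None
    for a in weights:
        S = 1.0 if a_prev is None else (a_prev / a) * S + 1.0
        a_prev = a
    return S

print(effective_sample_size([1.0] * 100))                     # constant weights -> 100.0 (= r)
print(effective_sample_size([2.0 ** i for i in range(100)]))  # doubling weights -> bounded near 2.0
```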
43 Sufficient statistics tracking—µr+1 , Cr+1 , and S r+1 —thus requires remembering only their previous values and br , assuming that ar+1 can be efficiently computed. [sent-86, score-0.102]
44 The ar+1 weights are based on the “fit” between the current estimated covariance matrix and the input data: poor fit implies that a change in the underlying generating structure is more likely. [sent-87, score-0.249]
45 A large Mahalanobis distance (i.e., poor fit) for some datapoint could indicate simply an outlier; inferring that the underlying generating structure has changed requires large Mahalanobis distances over multiple datapoints. [sent-91, score-0.35]
46 We instead base the ar+1 weights on the (weighted) pooled p-value of the individual p-values for the Mahalanobis distance of each datapoint. [sent-93, score-0.087]
47 The Mahalanobis distance of a V -dimensional datapoint from a covariance matrix estimated from a sample of size N is distributed as Hotelling’s T 2 with parameters p = V and m = N − 1. [sent-94, score-0.192]
48 The p-value for the Mahalanobis distance D^{r+1} is thus: p^{r+1} = T^2(x > D^{r+1} | p = V, m = S^r − 1), where S^r is the effective sample size. [sent-95, score-0.073]
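For concreteness, one way to compute this p-value in Python uses the standard rescaling of Hotelling's T^2 to an F distribution; the small-sample guard and the function name are illustrative choices, not from the paper.

```python
import numpy as np
from scipy import stats

def mahalanobis_pvalue(x, mu, C, S_eff):
    """Upper-tail p-value of the (squared) Mahalanobis distance of x
    under Hotelling's T^2 with p = V and m = S_eff - 1."""
    V = len(x)
    d = np.asarray(x, dtype=float) - mu
    D = float(d @ np.linalg.solve(C, d))      # squared Mahalanobis distance D^{r+1}
    m = max(S_eff - 1.0, V + 1.0)             # guard: the F mapping below needs m >= V
    # T^2(p, m) = (p * m / (m - p + 1)) * F(p, m - p + 1), so rescale and use the F tail.
    f_stat = D * (m - V + 1.0) / (V * m)
    return stats.f.sf(f_stat, V, m - V + 1.0)
```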
49 Then Liptak’s method for weighted pooling of the individual p-values [7] gives the following definition: ρ_{r+1} = Φ(Σ_{i=1}^{r} a_i Φ^{-1}(p_i, 1), √(Σ_{i=1}^{r} a_i^2)) = Φ(η_{r+1}, τ_{r+1}), where the update equations for η and τ are η_{r+1} = η_r + a_r Φ^{-1}(p_r, 1) and τ_{r+1}^2 = τ_r^2 + a_r^2. [sent-97, score-0.298]
50 Mathematically, this transformation is: a_{r+1} = a_r · max(1, (γT − γρ_{r+1} + ρ_{r+1})/T) (4). Efficient computation of a_{r+1} thus requires additionally tracking only ρ_r, η_r, and τ_r. [sent-103, score-0.318]
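The pooling and weight transformation can then be sketched as follows. The incremental bookkeeping of η and τ mirrors the update equations above; the settings γ = 2.0 and T = 0.05 are illustrative defaults, not values from the paper.

```python
import numpy as np
from scipy import stats

def update_weight(a, p_value, eta, tau_sq, gamma=2.0, T=0.05):
    """Liptak-pooled p-value rho_{r+1} and the next weight a_{r+1} via Equation (4)."""
    eta = eta + a * stats.norm.ppf(p_value)        # running sum of a_i * Phi^{-1}(p_i)
    tau_sq = tau_sq + a ** 2                       # running sum of a_i^2
    rho = stats.norm.cdf(eta / np.sqrt(tau_sq))    # pooled p-value rho_{r+1}
    a_next = a * max(1.0, (gamma * T - gamma * rho + rho) / T)   # Equation (4)
    return a_next, rho, eta, tau_sq
```

With T read as a threshold on the pooled p-value and γ > 1, the multiplier exceeds 1 (up to a maximum of γ) only when ρ_{r+1} < T, so the weights grow only when the pooled evidence of misfit is strong.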
51 We can efficiently track the relevant sufficient statistics in an online fashion, and so the only remaining step is to learn the corresponding graphical model. [sent-104, score-0.191]
52 Learning graphical model structure is computationally expensive [2] and so one should balance the accuracy of the current model against the computational cost of relearning. [sent-107, score-0.151]
53 More precisely, graph relearning should be most frequent after an inferred underlying change, though there should be a non-zero chance of relearning even when the structure appears to be relatively stable (since the structure could be slowly drifting). [sent-108, score-0.629]
54 In practice, the LoSST algorithm probabilistically relearns based on the inverse of ρ_r: the probability of relearning at time r + 1 is a noisy-OR gate of the probability of relearning at time r and a weighted (1 − ρ_{r+1}). [sent-109, score-0.446]
55 Mathematically, P_{r+1}(relearn) = P_r(relearn) + ν(1 − ρ_{r+1}) − P_r(relearn)·ν(1 − ρ_{r+1}), where ν ∈ [0, 1] modifies the frequency of graph relearning: large values result in more frequent relearning and small values in less frequent relearning. [sent-110, score-0.276]
56 If a relearning event is triggered at datapoint r, then a new graphical model structure and parameters are learned, and Pr (relearn) is set to 0. [sent-111, score-0.487]
57 In general, ρr is lower when changepoints are detected, so Pr (relearn) will increase more quickly around changepoints, and graph relearning will become more frequent. [sent-112, score-0.299]
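A sketch of this relearning schedule (the noisy-OR accumulation and the reset after a relearning event); ν is left as a parameter since its exact simulation value is not legible in this excerpt, and the random number generator is injectable for testing.

```python
import random

def relearn_step(p_relearn, rho, nu, rng=random.random):
    """P_{r+1}(relearn) = P_r + nu*(1 - rho_{r+1}) - P_r * nu * (1 - rho_{r+1})  (noisy-OR).
    Returns (relearn_now, updated relearning probability)."""
    bump = nu * (1.0 - rho)
    p_relearn = p_relearn + bump - p_relearn * bump
    if rng() < p_relearn:
        return True, 0.0          # trigger graph relearning and reset the probability
    return False, p_relearn
```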
58 Convergence is a standard desideratum: if there is a stable structure in the limit, then the algorithm’s output should stabilize on that structure. [sent-117, score-0.105]
59 Both diligence (roughly, responding quickly when the underlying structure changes) and convergence are desirable methodological virtues, but they are provably incompatible: no learning algorithm can be both diligent and convergent [5]. [sent-119, score-0.219]
60 Intuitively, they are incompatible because they must respond differently to improbable datapoints: convergent algorithms must tolerate them (since such data occur with probability 1 in the infinite limit), while diligent algorithms must regard them as signals that the structure has changed. [sent-120, score-0.241]
61 If γ = 1, then LoSST is a convergent algorithm, since it follows that ar+1 = ar for all r (which is a sufficient condition for convergence). [sent-121, score-0.311]
62 If T < 0, then we again have ar+1 = ar for all r, and so LoSST is convergent. [sent-123, score-0.268]
63 LoSST is also provably convergent if T is time-indexed such that T_r = f(S_r) for some f with range (0, 1], where ∏_{i=1}^{∞} (1 − f(i)) converges. [sent-124, score-0.072]
64 Proof sketch: ∏_{i=r}^{∞} (1 − q_i) can be shown to be an upper bound on the probability that (1 − ρ_i) > q_i will occur for some i in [r, ∞), where q_i is the i-th element of the sequence Q of lower threshold values. [sent-127, score-0.087]
65 All simulations used scenarios in which either the ground truth parameters or ground truth graph (and parameters) changed during the course of data collection. [sent-142, score-0.123]
66 Before the first changepoint, there should be no significant difference between LoSST and a standard batch-mode learner, since those datapoints are globally i.i.d. [sent-143, score-0.198]
67 Performance on these datapoints thus provides information about the performance cost (if any) of online learning using LoSST, relative to traditional algorithms. [sent-146, score-0.224]
68 We used the PC algorithm [12] as our baseline batch-mode learning algorithm; we conjecture that any other standard graphical model structure learning algorithm would perform similarly, given the graphs and sample sizes in our simulations. [sent-154, score-0.191]
69 The first set of simulations used datasets with 2000 datapoints, where the SEM graph and parameters both changed after the first 1000 datapoints. [sent-156, score-0.123]
70 Figures 1(a-c) show the mean edge addition, removal, and orientation errors (respectively) by LoSST as a function of time, and Figures 1(d-f) show the means of #errors_PC − #errors_LoSST for each error type (i.e., positive values indicate that LoSST outperformed PC). [sent-158, score-0.09]
71 After the underlying generating model changes, however, there are significant differences. [sent-163, score-0.095]
72 In contrast, the LoSST algorithm finds large Mahalanobis distances for those datapoints, which lead to higher weights, which lead it to learn (approximately) the new underlying graphical model. [sent-165, score-0.164]
73 In practice, LoSST typically stabilized on a new model by roughly 250 datapoints after the changepoint. [sent-166, score-0.178]
74 The second set of simulations was identical to the first (500 runs each for various pairs of variable number and edge degree), except that the graph was held constant throughout and only the SEM parameters changed after 1000 datapoints. [sent-167, score-0.16]
75 Performance after the parameter change did not follow quite the same pattern as before, however. [sent-170, score-0.07]
76 LoSST again does much better on edge addition and orientation errors, but performed significantly worse on edge removal errors for the first 200 points following the parameter change. (Proof sketch: by equation (4), T > 1 and γ > 1 imply γ − (γ−1)/T > 1, and so a_{r+1} ≥ a_r (γ − (γ−1)/T) > a_r for all r.) [sent-171, score-0.715]
77 This last strict inequality implies that the effective sample size has a finite upper bound (= (γT − γ + 1)/((γ−1)(T−1)) if ρ_r = 1 for all r), and the majority of the effective sample comes from recent data points. [sent-172, score-0.106]
78 LoSST relearned graphs, and PC was rerun, after datapoints {25, 50, 100, 200, 300, 500, 750, 1000, 1025, 1050, 1100, 1200, 1300, 1500, 1750, 2000}. [sent-174, score-0.197]
79 When a change occurs, PC initially responds by adding edges to the output, while LoSST responds by being more cautious in its inferences (since the effective sample size shrinks after a change). [sent-176, score-0.195]
80 As a result, PC can outperform LoSST for a brief time on the edge removal metric in these types of cases in which the change involves only parameters, not graph structure. [sent-178, score-0.188]
81 We randomly generated a single dataset with 10,000 datapoints, where the underlying SEM graph and parameters changed after every 1000 datapoints. [sent-180, score-0.117]
82 We then ran LoSST with probabilistic relearning (ν = . [sent-182, score-0.208]
83 As expected, there are substantial relearning peaks after each structure shift, and the expected number of relearnings persisted at roughly 0. [sent-185, score-0.333]
84 Figures 3(b-d) provide error information: the smooth green lines indicate the mean edge addition, removal, and orientation errors (respectively) during learning, and the blocky blue lines indicate the LoSST errors if graph relearning occurred after every datapoint. [sent-187, score-0.404]
85 Although there are many fewer graph relearnings with the probabilistic schedule, overall errors did not significantly increase. [sent-188, score-0.139]
86 Figure 4(a) shows the change in effective sample size, where the key observation is that change detection prompts significant drops in the effective sample size. [sent-195, score-0.246]
87 Figures 4(b) and 4(c) show the pooled p-value and Mahalanobis distance for each month, which are the drivers of sample size changes. [sent-196, score-0.108]
88 LoSST appears to detect exactly such a shift in the volatility of the relationships between these price indexes, though it detects another shift shortly after 2000. [sent-200, score-0.124]
89 This real-world case study also demonstrates the importance of using pooled p-values, as that is why LoSST does not respond to the single-month spike in Mahalanobis distance in 1995, but does respond to the extended sequence of slightly above average Mahalanobis distances around 1980. [sent-201, score-0.207]
90 6 Discussion and future research The LoSST algorithm is suitable for locally stationary structures, but there are obviously limits. [sent-202, score-0.07]
91 In particular, it will perform poorly if the generating structure changes very rapidly, or if the datapoints are a random-order mixture from multiple structures. [sent-203, score-0.353]
92 This algorithm can also be extended to have the current learned model influence the ar weights. [sent-208, score-0.268]
93 Suppose particular graphical edges or adjacencies have not changed over a long period of time, or have been stable over multiple relearnings. [sent-209, score-0.198]
94 In that case, one might plausibly conclude that those connections are less likely to change, and so much greater error should be required to relearn those connections. [sent-210, score-0.112]
95 In practice, this extension would require the ar weights to vary across Vi , Vj pairs, which significantly complicates the mathematics and memory requirements of the sufficient statistic tracking. [sent-211, score-0.268]
96 We have focused on SEMs, but there are many other types of graphical models; for example, Bayesian networks have the same graph-type but are defined over discrete variables with conditional probability tables. [sent-213, score-0.09]
97 Tracking the sufficient statistics for Bayes net structure learning is substantially more costly, and we are currently investigating ways to learn the necessary information in a tractable, online fashion. [sent-214, score-0.133]
98 Similarly, our graph learning relies on constraint-based structure learning since the relevant scores in score-based methods (such as [3]) do not decompose in a manner that is suitable for online learning. [sent-215, score-0.154]
99 There are many real-world contexts in which batch-mode structure learning is either infeasible or inappropriate. [sent-217, score-0.091]
100 The online structure learning algorithm presented here has great potential to perform well in a range of challenging contexts, and at little cost in “traditional” settings. [sent-219, score-0.107]
wordName wordTfidf (topN-words)
[('losst', 0.785), ('ar', 0.268), ('relearning', 0.208), ('datapoints', 0.178), ('datapoint', 0.128), ('mahalanobis', 0.118), ('relearn', 0.112), ('pc', 0.102), ('br', 0.102), ('graphical', 0.09), ('xr', 0.083), ('cr', 0.072), ('change', 0.07), ('generating', 0.069), ('pooled', 0.067), ('changepoint', 0.065), ('diligence', 0.064), ('diligent', 0.064), ('relearnings', 0.064), ('structure', 0.061), ('sem', 0.061), ('sems', 0.057), ('dr', 0.053), ('gr', 0.052), ('pr', 0.05), ('tracking', 0.05), ('respond', 0.049), ('figures', 0.048), ('cpd', 0.048), ('graph', 0.047), ('online', 0.046), ('changes', 0.045), ('stable', 0.044), ('changed', 0.044), ('convergent', 0.043), ('vi', 0.042), ('locally', 0.041), ('vj', 0.04), ('edge', 0.037), ('environments', 0.036), ('removal', 0.034), ('kummerfeld', 0.032), ('siracusa', 0.032), ('talih', 0.032), ('tdl', 0.032), ('simulations', 0.032), ('effective', 0.032), ('occurred', 0.031), ('contexts', 0.03), ('bayesian', 0.03), ('weighted', 0.03), ('stationary', 0.029), ('provably', 0.029), ('track', 0.029), ('qi', 0.029), ('errors', 0.028), ('abrupt', 0.028), ('changepoints', 0.028), ('virtues', 0.028), ('shift', 0.028), ('ak', 0.028), ('pittsburgh', 0.027), ('frequentist', 0.026), ('responds', 0.026), ('learn', 0.026), ('underlying', 0.026), ('detect', 0.025), ('orientation', 0.025), ('tracker', 0.024), ('aji', 0.024), ('incompatible', 0.024), ('pa', 0.024), ('volatility', 0.023), ('covariance', 0.023), ('independencies', 0.022), ('fashion', 0.022), ('distances', 0.022), ('structures', 0.021), ('carnegie', 0.021), ('mellon', 0.021), ('frequent', 0.021), ('sample', 0.021), ('globally', 0.02), ('edges', 0.02), ('distance', 0.02), ('td', 0.02), ('price', 0.02), ('tracks', 0.019), ('methodological', 0.019), ('graphs', 0.019), ('equation', 0.018), ('recursive', 0.018), ('indistinguishable', 0.018), ('basically', 0.018), ('environmental', 0.018), ('heuristic', 0.018), ('neither', 0.018), ('directed', 0.017), ('undirected', 0.017), ('quickly', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
2 0.057430353 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-space-partitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearest-neighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1
(a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). 
The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. 
For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. 
We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. 
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. 
While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. 
For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. 
Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . 
i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. 
This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.
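As a point of comparison for the hashing discussion above, here is a minimal single-table sketch of the plain-vanilla LSH scheme with random linear projections and random thresholds. The class name, the number of bits, and the choice to draw thresholds uniformly over the observed projection range are illustrative assumptions rather than details taken from the cited papers.

```python
import numpy as np
from collections import defaultdict

class PlainVanillaLSH:
    """Single hash table built from random linear projections and random thresholds.
    Points that fall in the same bin as the query are returned as candidate neighbors."""

    def __init__(self, dim, n_bits=16, seed=0):
        self.rng = np.random.default_rng(seed)
        self.proj = self.rng.standard_normal((dim, n_bits))   # random projection directions
        self.thresholds = None
        self.table = defaultdict(list)

    def _key(self, x):
        return tuple((x @ self.proj) > self.thresholds)

    def fit(self, X):
        P = X @ self.proj
        # Random thresholds, drawn uniformly over the observed range of each projection.
        self.thresholds = self.rng.uniform(P.min(axis=0), P.max(axis=0))
        for i, x in enumerate(X):
            self.table[self._key(x)].append(i)
        self.X = X
        return self

    def query(self, q, k=1):
        candidates = self.table.get(self._key(q), [])
        if not candidates:
            return []                                          # empty bin: no candidates
        dists = np.linalg.norm(self.X[candidates] - q, axis=1)
        return [candidates[i] for i in np.argsort(dists)[:k]]
```

In practice one would use several such tables; the eigenvector-projection and learned-rotation variants cited above replace the random directions in self.proj with data-dependent ones.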
References
[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computing, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of the ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
3 0.053134184 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
4 0.044483766 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
Author: Arash Amini, Xuanlong Nguyen
Abstract: We propose a general formalism of iterated random functions with semigroup property, under which exact and approximate Bayesian posterior updates can be viewed as specific instances. A convergence theory for iterated random functions is presented. As an application of the general theory we analyze convergence behaviors of exact and approximate message-passing algorithms that arise in a sequential change point detection problem formulated via a latent variable directed graphical model. The sequential inference algorithm and its supporting theory are illustrated by simulated examples.
5 0.043194052 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
6 0.040193237 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
7 0.038287897 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
8 0.037479252 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
9 0.036561411 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
10 0.036260057 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
11 0.034993045 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
12 0.032873064 188 nips-2013-Memory Limited, Streaming PCA
13 0.03249168 232 nips-2013-Online PCA for Contaminated Data
14 0.031912558 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
15 0.031887505 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
16 0.031549536 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
17 0.031108834 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking
18 0.030897627 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
19 0.030682128 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
20 0.030354308 217 nips-2013-On Poisson Graphical Models
topicId topicWeight
[(0, 0.102), (1, 0.018), (2, -0.001), (3, 0.001), (4, 0.004), (5, 0.032), (6, 0.023), (7, -0.008), (8, -0.015), (9, 0.004), (10, -0.011), (11, -0.042), (12, 0.037), (13, 0.014), (14, 0.018), (15, 0.032), (16, 0.026), (17, -0.011), (18, -0.032), (19, 0.022), (20, 0.004), (21, 0.007), (22, -0.025), (23, -0.023), (24, -0.005), (25, 0.012), (26, -0.009), (27, 0.025), (28, 0.024), (29, 0.013), (30, 0.012), (31, 0.037), (32, -0.007), (33, 0.029), (34, -0.022), (35, 0.024), (36, 0.03), (37, 0.006), (38, -0.021), (39, -0.007), (40, -0.061), (41, -0.021), (42, -0.004), (43, -0.04), (44, -0.004), (45, -0.065), (46, -0.005), (47, -0.017), (48, -0.049), (49, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.84366459 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
2 0.57198232 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
Author: Mahito Sugiyama, Karsten Borgwardt
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1
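One plausible reading of such a sampling-based scheme is sketched below: each point is scored by its distance to the nearest member of a small random sample of the data. This is an illustrative interpretation of the abstract, not necessarily the exact method evaluated in the paper, and the function name and sample size are our own choices.

```python
import numpy as np

def sampling_outlier_scores(X, sample_size=20, seed=0):
    """Distance-based outlier scores via sampling: each point is scored by its distance
    to its nearest neighbor within a small random sample; larger scores = more outlying."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]
    d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)   # (n, sample_size) distances
    d[np.arange(len(X))[:, None] == idx[None, :]] = np.inf      # a sampled point ignores itself
    return d.min(axis=1)
```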
3 0.54541904 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
4 0.54212093 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
Author: Ulrike Von Luxburg, Morteza Alamgir
Abstract: Consider an unweighted k-nearest neighbor graph on n points that have been sampled i.i.d. from some unknown density p on Rd . We prove how one can estimate the density p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or any distance or similarity scores. The key insights are that local differences in link numbers can be used to estimate a local function of the gradient of p, and that integrating this function along shortest paths leads to an estimate of the underlying density. 1
5 0.51492137 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
Author: Paul Valiant, Gregory Valiant
Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1
6 0.51420051 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
7 0.50948417 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
8 0.50908035 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
9 0.50890607 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
10 0.50743008 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
11 0.48609847 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
12 0.48532277 355 nips-2013-Which Space Partitioning Tree to Use for Search?
13 0.483163 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?
14 0.48199707 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
15 0.48116517 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
16 0.48096424 134 nips-2013-Graphical Models for Inference with Missing Data
17 0.4801971 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
18 0.47058803 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
19 0.46112332 25 nips-2013-Adaptive Anonymity via $b$-Matching
20 0.45889613 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
topicId topicWeight
[(16, 0.029), (33, 0.108), (34, 0.073), (41, 0.021), (49, 0.023), (56, 0.112), (70, 0.031), (85, 0.419), (89, 0.041), (93, 0.02), (95, 0.016)]
simIndex simValue paperId paperTitle
1 0.91409934 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
Author: Tai Qin, Karl Rohe
Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1
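A minimal sketch in the spirit of the regularized spectral clustering described above: degrees are inflated by a regularizer τ before normalizing the adjacency matrix and clustering the leading eigenvectors. The τ = mean-degree default and the row-normalization step are assumptions for illustration; the paper's exact regularization and guidance on choosing τ may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def regularized_spectral_clustering(A, k, tau=None, seed=0):
    """Spectral clustering on a degree-regularized graph: normalize the adjacency matrix
    with inflated degrees (deg + tau), take the top-k eigenvectors, and run k-means."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    if tau is None:
        tau = deg.mean()                                  # illustrative default; a tuning parameter
    d_inv_sqrt = 1.0 / np.sqrt(deg + tau)
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]     # D_tau^{-1/2} A D_tau^{-1/2}
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]                                      # top-k eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```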
2 0.88459784 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
3 0.8206768 168 nips-2013-Learning to Pass Expectation Propagation Messages
Author: Nicolas Heess, Daniel Tarlow, John Winn
Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex non-Gaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1
same-paper 4 0.8170855 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
5 0.8033275 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal “-discounted optimal policy. We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal Ï advantage. We show that Howard’s PI terminates 1 2Ì 1 1 22 1 1 nm 1 after at most n(m ≠ 1) 1≠“ log 1≠“ = O 1≠“ log 1≠“ iterations, improving by a factor O(log 1 a result by [3], while Simplex-PI terminates n) 1 22 1 2 1 22 2 1 1 2 after at most n (m ≠ 1) 1 + 1≠“ log 1≠“ = O n m log 1≠“ 1≠“ iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor “: given a measure of the maximal transient time ·t and the maximal time ·r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n2 (m≠ # $ 1) (Á·r log(n·r )Ë + Á·r log(n·t )Ë) (m ≠ 1)Án·t log(n·t )Ë + Án·t log(n2 ·t )Ë = !
6 0.69786489 89 nips-2013-Dimension-Free Exponentiated Gradient
7 0.67621732 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
8 0.66722322 196 nips-2013-Modeling Overlapping Communities with Node Popularities
9 0.64494383 232 nips-2013-Online PCA for Contaminated Data
10 0.63150477 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
11 0.62059194 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
12 0.6180107 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.61352956 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
14 0.57863867 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
15 0.57266814 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
16 0.57231873 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
17 0.56818467 25 nips-2013-Adaptive Anonymity via $b$-Matching
18 0.56144953 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
19 0.55901217 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
20 0.55891162 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks