nips nips2013 nips2013-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
Reference: text
sentIndex sentText sentNum sentScore
1 It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. [sent-2, score-0.244]
2 To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. [sent-3, score-0.897]
3 The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. [sent-4, score-0.313]
4 This enables the development of reductions from the structure learning problem to propositional satisfiability (SAT) [14] and its generalizations such as maximum satisfiability (MAXSAT) [15], and satisfiability modulo theories (SMT) [16], as well as answer-set programming (ASP) [17]. [sent-28, score-0.232]
5 A main novelty is the recognition of maximum weight spanning trees of the clique graph by a condition on the cardinalities of occurrences of variables in cliques and separators, which we call the balancing condition. [sent-29, score-1.639]
6 To enable efficient encodings of Markov network learning as a constraint satisfaction problem, in Section 3 we establish a new characterization of the separators of a Markov network based on a balancing condition. [sent-32, score-0.696]
7 In Section 4, we provide a high-level description of how the learning problem can be expressed using constraints and sketch the actual translations into propositional satisfiability (SAT) and its generalizations. [sent-33, score-0.255]
8 2 Structure Learning for Markov Networks An undirected graph G = ⟨V, E⟩ consists of a set of nodes V, which represents a set of random variables, and a set of undirected edges E ⊆ {{n, n′} | n, n′ ∈ V and n ≠ n′}. [sent-36, score-0.45]
9 A path in a graph is a sequence of nodes such that every two consecutive nodes are connected by an edge. [sent-37, score-0.389]
10 Two sets of nodes A and B are said to be separated by a third set of nodes D if every path between a node in A and a node in B contains at least one node in D. [sent-38, score-0.565]
11 An undirected graph is chordal if for all paths n0, . . . , nk with k ≥ 4 and n0 = nk there exist two nodes ni, nj in the path connected by an edge such that j ≠ i ± 1. [sent-39, score-0.387] [sent-42, score-0.278]
13 A clique in a graph is a set of nodes c such that every two nodes in it are connected by an edge. [sent-43, score-0.736]
14 In addition, there may not exist a set of nodes c′ such that c ⊂ c′ and every two nodes in c′ are connected by an edge. [sent-44, score-0.28]
15 Given the set of cliques C in a chordal graph, the set of separators S can be obtained through intersections of the cliques ordered in terms of a junction tree [19]; this operation is considered thoroughly in Section 3. [sent-45, score-1.35]
16 For a Markov network with a chordal graph G, the probability of a joint outcome x factorizes as P_V(x) = ∏_{ci∈C} P_{ci}(x_{ci}) / ∏_{si∈S} P_{si}(x_{si}). [sent-49, score-0.388]
17 Following this factorization, the marginal likelihood of a dataset X given a Markov network with a chordal graph G can be written P(X|G) = ∏_{ci∈C} P_{ci}(X_{ci}) / ∏_{si∈S} P_{si}(X_{si}). [sent-50, score-0.514]
18 Let a denote an arbitrary clique or separator containing the variables X_a whose outcome space has cardinality k. [sent-52, score-0.46]
19 Further, let n_a^{(j)} denote the number of occurrences where X_a = x_a^{(j)} in the dataset X_a. [sent-53, score-0.28]
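To make the factorization concrete, here is a minimal Python sketch of how a per-clique (or per-separator) marginal-likelihood term and the resulting log-score could be computed from the counts n_a^(j). Equation (1) itself is not reproduced in this excerpt, so the Dirichlet-multinomial form, the symmetric pseudocount alpha, and the function names are assumptions for illustration, not the paper's exact score.

```python
from math import lgamma
from collections import Counter

def log_marginal_term(data_rows, cols, cards, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of the projection of the
    data onto the variable set `cols` (a clique or a separator).
    `cards[i]` is the number of values of variable i; `alpha` is a symmetric
    Dirichlet pseudocount (both are illustrative assumptions)."""
    counts = Counter(tuple(row[i] for i in cols) for row in data_rows)
    k = 1
    for i in cols:                 # cardinality k of the outcome space of X_a
        k *= cards[i]
    n = sum(counts.values())
    res = lgamma(k * alpha) - lgamma(k * alpha + n)
    for n_j in counts.values():    # n_j plays the role of n_a^(j)
        res += lgamma(alpha + n_j) - lgamma(alpha)
    return res

def log_score(data_rows, cliques, separators, cards, alpha=1.0):
    """log P(X|G): sum of clique terms minus sum of separator terms."""
    return (sum(log_marginal_term(data_rows, c, cards, alpha) for c in cliques)
            - sum(log_marginal_term(data_rows, s, cards, alpha) for s in separators))
```

A score file of the kind mentioned in the experiments below would essentially tabulate such a term for every clique candidate, so that the optimization only manipulates precomputed numbers.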
20 3 Fundamental Properties and Characterization Results In this section, we point out some properties of chordal graphs and clique graphs that can be utilized in the encodings of the learning problem. [sent-65, score-0.637]
21 In particular, we develop a characterization of maximum weight spanning trees in terms of a balancing condition on separators. [sent-66, score-0.736]
22 The separators needed for determining the score (1) of a candidate Markov network are defined as follows. [sent-67, score-0.322]
23 Given the cliques, we can form the clique graph, in which the nodes are the cliques and there is an edge between two nodes if the corresponding cliques have a non-empty intersection. [sent-68, score-1.383]
24 The separators are the edge labels (the clique intersections) of a maximum weight spanning tree of the clique graph. [sent-70, score-1.175]
25 Maximum weight spanning trees of arbitrary graphs can be found in polynomial time by reducing the problem to finding minimum weight spanning trees. [sent-71, score-0.79]
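A minimal sketch of that reduction, assuming the clique-graph edge weight is the size of the clique intersection (the label size, as used in the balancing argument below): negating the weights and running a standard minimum weight spanning tree algorithm is the same as running Kruskal's algorithm on edges sorted by descending weight.

```python
def max_weight_spanning_tree(cliques):
    """Maximum weight spanning tree (or forest) of the clique graph.
    Nodes are the cliques; there is an edge {c, c'} iff c and c' intersect,
    weighted by |c & c'|.  Kruskal on descending weights, with union-find."""
    cliques = [frozenset(c) for c in cliques]
    edges = [(len(a & b), i, j)
             for i, a in enumerate(cliques)
             for j, b in enumerate(cliques)
             if i < j and a & b]
    parent = list(range(len(cliques)))

    def find(x):                   # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], cliques[i] & cliques[j]))
    return tree                    # (clique, clique, separator label) triples
```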
26 There may be several maximum weight spanning trees, but they induce exactly the same separators, and they only differ in terms of which pairs of cliques induce the separators. [sent-73, score-0.737]
27 To restrict the search space we can observe that a chordal graph with n nodes has at most n maximal cliques [19]. [sent-74, score-0.868]
28 This gives an immediate upper bound on the number of cliques chosen to build a Markov network, which can be encoded as a simple cardinality constraint. [sent-75, score-0.414]
29 3.1 Characterization of Maximum Weight Spanning Trees To simplify the encoding of maximum weight spanning trees (and forests) of chordal clique graphs, we introduce the notion of balanced spanning trees (respectively, forests), and show that these two concepts coincide for chordal graphs. [sent-77, score-1.756]
30 Definition 1 (Balancing) A spanning tree (or forest) of a clique graph is balanced if for every node n, the number of cliques containing n is one higher than the number of labeled edges containing n. [sent-79, score-1.511]
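Definition 1 is a pure counting condition and can be checked directly; the sketch below compares, for every node, its number of occurrences in the chosen cliques with its occurrences in the multiset of tree-edge labels (the separators). This is a reference check for the definition, not part of the constraint encoding.

```python
from collections import Counter

def is_balanced(cliques, separator_labels):
    """Definition 1: for every node n, (#cliques containing n) must be exactly
    one higher than (#tree-edge labels containing n).  `separator_labels` is
    the multiset (list) of labels of the spanning tree's edges."""
    in_cliques = Counter(n for c in cliques for n in c)
    in_labels = Counter(n for s in separator_labels for n in s)
    return all(in_cliques[n] - in_labels[n] == 1 for n in in_cliques)
```

Combined with the spanning-tree sketch above, e.g. `is_balanced(cliques, [s for _, _, s in max_weight_spanning_tree(cliques)])`, this can be used to confirm Lemma 4 below on small examples: a maximum weight spanning tree of the clique graph of a connected chordal graph is balanced.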
31 While in the following we state many results for spanning trees only, they can be straightforwardly generalized to spanning forests as well (in case the Markov networks are disconnected). [sent-80, score-0.756]
32 ) Lemma 2 For any clique graph, all its balanced spanning trees have the same weight. [sent-81, score-0.793]
33 Proof: This holds in general because the balancing condition requires exactly the same number of occurrences of any node in the separator edges for any balanced spanning tree, and the weight is defined as the sum of the occurrences of nodes in the edge labels. [sent-82, score-1.401]
34 Lemma 3 ([21, 22]) Any maximum weight spanning tree of the clique graph is a junction tree, and hence satisfies the running intersection property: for every pair of nodes c and c′, (c ∩ c′) ⊆ c′′ for all nodes c′′ on the unique path between c and c′. [sent-83, score-1.332]
35 Lemma 4 Let T = ⟨V, E_T⟩ be a maximum weight spanning tree of the clique graph ⟨V, E⟩ of a connected chordal graph. Then T is balanced. [sent-84, score-1.215]
36 Proof: We order the tree by choosing an arbitrary clique as the root and by assigning a depth to all nodes according to their distance from the root node. [sent-86, score-0.624]
37 The rest of the proof proceeds by induction on the height of subtrees starting from the leaf nodes as the base case. [sent-87, score-0.375]
38 The induction hypothesis says that all subtrees satisfy the balancing condition. [sent-88, score-0.383]
39 The base cases are trivial: each leaf node (clique) trivially satisfies the balancing condition, as there are no separators to consider. [sent-89, score-0.565]
40 In the inductive cases, we have a clique c at depth d, connected to one or more subtrees rooted at neighboring cliques c1, . . . , ck at depth d + 1, with the subtrees satisfying the balancing condition. [sent-90, score-0.908] [sent-93, score-0.4]
42 We show that the tree consisting of the clique c, the labeled edges connecting c respectively to cliques c1, . . . , ck, and the subtrees rooted at these cliques satisfies the balancing condition. [sent-94, score-0.931]
43 First note that by Lemma 3, any maximum weight spanning tree of the clique graph is a junction tree and hence satisfies the running intersection property, meaning that for any two cliques c1 and c2 in the tree, every clique on the unique path connecting them includes c1 ∩ c2 . [sent-101, score-1.88]
44 We have to show that the subtree rooted at c is balanced, given that its subtrees are balanced. [sent-102, score-0.307]
45 We show that the balancing condition is satisfied for each node separately. [sent-103, score-0.335]
46 Now each of the subtrees rooted at some c_i has either 0 occurrences of n, or k_i ≥ 1 occurrences in the cliques and k_i − 1 occurrences in the edge labels, because by the induction hypothesis the balancing condition is satisfied. [sent-105, score-1.393]
47 Now the balancing condition is trivially satisfied for the subtree rooted at c, because n either does not occur in c, or it occurs in c but does not occur in the label of any of the edges to the subtrees. [sent-108, score-0.547]
48 Since any maximum weight spanning tree is a junction tree by Lemma 3, n must occur also in c and in the labels of the edges between c and the cliques in which the subtrees with n are rooted. [sent-111, score-1.326]
49 Let s1, . . . , sj be the numbers of occurrences of n in the edge labels in the subtrees with at least one occurrence of n, and t1, . . . , tj the numbers of occurrences of n in the cliques in the same subtrees. [sent-115, score-0.35] [sent-118, score-0.48]
51 By the induction hypothesis, these subtrees are balanced, and hence t_i − s_i = 1 for all i ∈ {1, . . . , j}. [sent-119, score-0.275]
52 The subtree rooted at c now has 1 + Σ_{i=1}^{j} t_i occurrences of n in the nodes (once in c itself and then in the subtrees) and j + Σ_{i=1}^{j} s_i occurrences in the edge labels, where the j occurrences are in the edges between c and the j subtrees. [sent-123, score-0.966]
53 4 We establish the balancing condition through a sequence of equalities. [sent-124, score-0.24]
54 (1 + Σ_{i=1}^{j} t_i) − (j + Σ_{i=1}^{j} s_i) = 1 − j + Σ_{i=1}^{j} (t_i − s_i)  (reordering the terms)  = 1 − j + j  (since t_i − s_i = 1 for every subtree)  = 1. Hence also the subtree rooted at c is balanced. [sent-126, score-0.476]
55 Since any maximum weight spanning tree is a junction tree by Lemma 3, n must occur also in the clique ci . [sent-130, score-1.153]
56 Since the subtree is balanced, the new graph obtained by adding the clique c and the edge with a label containing n is also balanced. [sent-132, score-0.614]
57 Further, adding all the other subtrees that do not contain n will not affect the balancing of n. [sent-133, score-0.34]
58 Since there are no occurrences of n in any of the other subtrees, in c, or in the edge labels between c and any of the subtrees, the balancing condition holds. [sent-136, score-0.444]
59 This completes the induction step and consequently, the whole spanning tree is balanced. [sent-137, score-0.476]
60 Lemma 5 Assume T = ⟨V, E_B⟩ is a spanning tree of the clique graph G_C = ⟨V, E⟩ of a chordal graph, and that T satisfies the balancing condition. Then T is a maximum weight spanning tree of G_C. [sent-138, score-1.421]
61 Proof: Let TM be one of the spanning trees of GC with the maximum weight w. [sent-140, score-0.463]
62 By Lemma 4, this maximum weight spanning tree is balanced. [sent-141, score-0.53]
63 Hence also T is a maximum weight spanning tree of GC . [sent-143, score-0.53]
64 Theorem 6 For any clique graph of a chordal graph, any of its subgraphs is a maximum weight spanning tree if and only if it is a balanced acyclic subgraph. [sent-144, score-1.295]
65 The encoding enforces the balancing condition (cf. Definition 1) for a set C of cliques forming a Markov network and the set S of separators induced by the tree structure. [sent-149, score-0.761]
66 A solution to the Markov network learning problem is a set of cliques C = {c1, . . . , ck} that satisfies the following conditions. [sent-155, score-0.394]
67 Every node is included in at least one of the chosen cliques in C, i.e., the cliques in C cover V. [sent-159, score-0.439]
68 The graph ⟨V, E⟩ with the set of edges E = ∪_{c∈C} edges(c) is chordal. [sent-167, score-0.212]
69 The set C has a balanced spanning tree labeled by a set of separators S = {s1, . . . , sk−1}. [sent-169, score-0.743]
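Condition 3 above requires the graph induced by the chosen cliques to be chordal. As a reference for what the encoding has to enforce, the sketch below tests chordality directly with maximum cardinality search followed by a perfect-elimination-ordering check; the helper names are ours and the routine is not optimized.

```python
def union_graph(cliques):
    """Nodes and edges of the graph <V, E> with E the union of edges(c)."""
    nodes = {n for c in cliques for n in c}
    edges = {frozenset((a, b)) for c in cliques for a in c for b in c if a != b}
    return nodes, edges

def is_chordal(nodes, edges):
    """Maximum cardinality search + perfect elimination ordering check."""
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    weight, visited, order = {n: 0 for n in nodes}, set(), []
    for _ in nodes:
        v = max((n for n in nodes if n not in visited), key=lambda n: weight[n])
        visited.add(v)
        order.append(v)
        for u in adj[v] - visited:
            weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}        # MCS visit order
    for v in nodes:
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        if earlier:
            parent = max(earlier, key=lambda u: pos[u])
            if not (earlier - {parent}) <= adj[parent]:
                return False                          # a chord is missing
    return True
```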
70 4.1 Graph Properties We assume that clique candidates – which are the non-empty subsets of V – are indexed from 1 to 2^{|V|}. [sent-180, score-0.347]
71 Each clique candidate c ⊆ V has an associated score v(c). [sent-182, score-0.389]
72 To encode the search space for Markov networks, we introduce, for every clique candidate c, a propositional variable x_c denoting that c is part of the learned network. [sent-183, score-0.631]
73 We also introduce propositional variables e_{n,m} that represent edges {n, m} that are in at least one chosen clique. [sent-184, score-0.238]
74 To satisfy the maximality condition 2(a), we require that if a clique is chosen, then at least one edge in each of its super-cliques is not chosen. [sent-189, score-0.461]
75 We first make the edges of the chosen cliques explicit by the next constraint, for all {n, m} ⊆ V and all cliques c1, . . . , ck that contain both n and m: [sent-190, score-0.829]
76 e_{n,m} ↔ (x_{c1} ∨ · · · ∨ x_{ck})   (3)   Then for every clique candidate c = {n1, . . . , nk} and every node n ∈ V \ c we have the constraint [sent-194, score-0.347]
77 x_c → (¬e_{n1,n} ∨ · · · ∨ ¬e_{nk,n})   (4)   where e_{n1,n}, . . . , e_{nk,n} denote the edges between n and the nodes n1, . . . , nk of c. [sent-197, score-0.271]
78 For each pair of clique candidates c and c′ such that c ⊂ c′, ¬x_c ∨ ¬x_{c′} is a logical consequence of the constraints (4). [sent-201, score-0.41]
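A sketch of how constraints (3) and (4) could be emitted as CNF clauses with integer (DIMACS-style) literals. The variable numbering, the use of frozensets for clique candidates, and the function name are assumptions made here for illustration; enumerating all candidates is of course only feasible for small variable sets such as the 6-variable heart data mentioned later.

```python
from itertools import combinations

def encode_edges_and_maximality(V, clique_candidates):
    """Constraint (3): e_{n,m} <-> (x_c1 v ... v x_ck) over all candidates
    containing both n and m.  Constraint (4): x_c -> (~e_{n1,n} v ... v ~e_{nk,n})
    for every node n outside c.  Positive/negative integers are literals."""
    clique_candidates = [frozenset(c) for c in clique_candidates]
    var_x = {c: i + 1 for i, c in enumerate(clique_candidates)}
    var_e = {frozenset(p): len(var_x) + i + 1
             for i, p in enumerate(combinations(sorted(V), 2))}
    clauses = []
    for pair, ev in var_e.items():
        containing = [var_x[c] for c in clique_candidates if pair <= c]
        clauses.append([-ev] + containing)            # e_{n,m} -> some x_ci
        clauses += [[-xv, ev] for xv in containing]   # x_ci -> e_{n,m}
    for c in clique_candidates:
        for n in set(V) - c:
            clauses.append([-var_x[c]] +
                           [-var_e[frozenset((m, n))] for m in c])
    return clauses, var_x, var_e
```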
79 For condition 2(b) we use propositional variables z_c which mean that either c or one of its super-cliques is chosen, and propositional variables w_c which mean that all edges of c are chosen. [sent-203, score-0.534]
80 For 2-element cliques c = {n1, n2} we have w_c ↔ e_{n1,n2}.   (5) [sent-204, score-0.416]
81 For larger cliques c we have w_c ↔ w_{c1} ∧ · · · ∧ w_{ck}   (6)   where c1, . . . , ck are subcliques of c. [sent-205, score-0.416]
82 If all edges of a clique are chosen, then the clique itself or one of its super-cliques must be chosen. [sent-210, score-0.797]
83 If c1, . . . , ck are all cliques that extend c by one node, this is encoded as follows. [sent-214, score-0.404]
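The paper's formula for this step is not included in the excerpt, so the following is only a plausible clause-level rendering of the constraints just described: w_c is grounded directly in the edge variables of c (intended to be equivalent to the recursive definition via (5) and (6)), z_c is defined through the candidates that extend c by one node, and condition 2(b) becomes w_c implies z_c. The variable numbering continues the scheme of the previous sketch and is again an assumption.

```python
from itertools import combinations

def encode_condition_2b(clique_candidates, var_x, var_e, next_var):
    """w_c: all edges of c are chosen; z_c: c or one of its super-cliques is
    chosen.  Emits w_c <-> (conjunction of c's edge variables),
    z_c <-> x_c v z_{c1} v ... v z_{ck} for the one-node extensions c1..ck,
    and w_c -> z_c."""
    cands = [frozenset(c) for c in clique_candidates]
    var_w, var_z = {}, {}
    for c in cands:
        var_z[c] = next_var
        next_var += 1
        if len(c) >= 2:
            var_w[c] = next_var
            next_var += 1
    clauses = []
    for c, wv in var_w.items():
        evs = [var_e[frozenset(p)] for p in combinations(sorted(c), 2)]
        clauses += [[-wv, ev] for ev in evs]          # w_c -> every edge of c
        clauses.append([wv] + [-ev for ev in evs])    # all edges of c -> w_c
        clauses.append([-wv, var_z[c]])               # condition 2(b)
    for c, zv in var_z.items():
        exts = [d for d in cands if c < d and len(d) == len(c) + 1]
        body = [var_x[c]] + [var_z[d] for d in exts]
        clauses.append([-zv] + body)                  # z_c -> x_c or some z_c'
        clauses += [[-bv, zv] for bv in body]         # x_c or z_c' -> z_c
    return clauses, var_w, var_z, next_var
```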
84 Let us consider all cycles these nodes could form in the graph ⟨V, E⟩ of condition 3 in Definition 7. [sent-221, score-0.295]
85 To avoid redundant cycle constraints, we arbitrarily fix the starting node and require that the index of the second node in the cycle is lower than the index of the second-last node. [sent-232, score-0.294]
86 Now, the chordality constraint says that if there is an edge between every pair of consecutive nodes in n1, . . . , nk (including between nk and n1), then some pair of non-consecutive nodes in the cycle must also be connected by an edge. [sent-234, score-0.317]
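For small variable sets the cycle constraints can be generated by brute force; the sketch below emits one clause per candidate cycle, with the symmetry breaking described above (the cycle starts at its smallest node and the two traversal directions are collapsed). The edge-variable map var_e is assumed to be the one from the earlier sketch, and the enumeration is exponential in |V|, so this is meant purely as an illustration of the constraint, e.g. for the 6-variable heart data.

```python
from itertools import permutations

def chordality_clauses(V, var_e):
    """One clause per cycle n1,...,nk (k >= 4): if every pair of consecutive
    cycle nodes is connected by an edge, then some chord (an edge between two
    non-consecutive cycle nodes) must be present as well."""
    V = sorted(V)
    clauses = []
    for k in range(4, len(V) + 1):
        for idx, first in enumerate(V):
            for perm in permutations(V[idx + 1:], k - 1):
                if perm[0] > perm[-1]:                 # break direction symmetry
                    continue
                cycle = (first,) + perm                # starts at its smallest node
                consecutive = [frozenset((cycle[i], cycle[(i + 1) % k]))
                               for i in range(k)]
                chords = [frozenset((cycle[i], cycle[j]))
                          for i in range(k) for j in range(i + 2, k)
                          if not (i == 0 and j == k - 1)]
                clauses.append([-var_e[e] for e in consecutive] +
                               [var_e[e] for e in chords])
    return clauses
```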
87 4.3 Separators Separators for pairs c and c′ of clique candidates can be formalized as propositional variables s_{c,c′}, meaning that c ∩ c′ is a separator and there is an edge in the spanning tree between c and c′ labeled by c ∩ c′. [sent-242, score-1.026]
88 The lack of the converse implication in (10) formalizes the choice of the spanning tree, i.e., not every pair of chosen cliques with a non-empty intersection needs to be joined by a separator edge. [sent-244, score-0.296]
89 The remaining constraints on separators fall into two cases. [sent-247, score-0.293]
90 First, we have cardinality constraints encoding the balancing condition (cf. Section 3.1): [sent-248, score-0.415]
91 each variable occurs in the chosen cliques one more time than it occurs in the separators which label the spanning tree. [sent-250, score-0.956]
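Each such cardinality constraint is a pseudo-Boolean equality, one per variable of the Markov network. The sketch below produces the equalities in a solver-neutral form; var_x is assumed to map clique candidates to their Boolean variables and var_s to map unordered candidate pairs (c, c') to the separator-edge variables s_{c,c'}. Translating each equality into SAT clauses, an SMT atom, or an ASP aggregate is back-end specific and not shown here.

```python
def balancing_constraints(V, var_x, var_s):
    """One pseudo-Boolean equality per node n (the balancing condition):
       (sum over cliques c with n in c of x_c)
     - (sum over pairs (c, c') with n in (c & c') of s_{c,c'})  =  1.
    Each constraint is returned as (terms, rhs) with terms a list of
    (coefficient, variable) pairs."""
    constraints = []
    for n in V:
        terms = [(1, xv) for c, xv in var_x.items() if n in c]
        terms += [(-1, sv) for (c, c2), sv in var_s.items() if n in (c & c2)]
        constraints.append((terms, 1))
    return constraints
```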
92 Second, the graph formed by the cliques with the separators as edges must be acyclic. [sent-252, score-0.786]
93 The acyclicity test repeatedly removes leaves, i.e., nodes with at most one neighbor, until all nodes have been removed. [sent-255, score-0.28]
94 A graph with m nodes is acyclic iff all its nodes are level m leaves. [sent-261, score-0.389]
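The encoding mirrors the following plain leaf-elimination test on the graph whose nodes are the chosen cliques and whose edges are the chosen separator edges: a node that never becomes a leaf lies on a cycle, so requiring every node to be a level-m leaf is equivalent to acyclicity. This is a reference implementation of the test itself, not of its propositional encoding.

```python
def is_acyclic(nodes, edges):
    """Repeatedly delete nodes with at most one remaining neighbor; the graph
    is acyclic iff nothing remains after |nodes| rounds (i.e., every node is a
    'level m' leaf in the terminology above)."""
    neighbors = {v: set() for v in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = set(nodes)
    for _ in range(len(neighbors)):
        leaves = {v for v in remaining
                  if len(neighbors[v] & remaining) <= 1}
        remaining -= leaves
    return not remaining
```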
95 For both datasets, we computed the respective score file that specifies the score of each clique candidate, i.e., the value v(c) obtained from the data. [sent-286, score-0.431]
96 The heart data involves 6 variables giving rise to 2^6 = 64 clique candidates in total and a search space of 2^15 undirected networks of which a subset are decomposable. [sent-293, score-0.556]
97 [Flattened table: input file sizes for the heart and econ instances across the different encodings; the heart encodings range from roughly 197 kB to 8120 kB and the econ encodings from roughly 130 MB to 1060 MB.] [sent-296, score-0.233]
98 We were able to solve this instance optimally with only one solver, HC LASP, which allows for a more refined control of the search heuristic: we forced HC LASP to try cliques ordered by size, with the largest cliques first. [sent-301, score-0.801]
99 The related problem of structure learning of Bayesian networks has been addressed by general-purpose combinatorial search methods, including MAXSAT [30] and a constraint-programming solver with a linear-programming solver as a subprocedure [31, 32]. [sent-305, score-0.237]
100 On the history of the minimum spanning tree problem. [sent-390, score-0.433]
wordName wordTfidf (topN-words)
[('clique', 0.347), ('cliques', 0.344), ('spanning', 0.296), ('separators', 0.23), ('chordal', 0.229), ('balancing', 0.194), ('subtrees', 0.146), ('lasp', 0.141), ('maxsat', 0.141), ('nodes', 0.14), ('tree', 0.137), ('occurrences', 0.136), ('propositional', 0.135), ('smt', 0.124), ('graph', 0.109), ('asp', 0.109), ('xa', 0.107), ('edges', 0.103), ('xc', 0.103), ('markov', 0.098), ('node', 0.095), ('subtree', 0.09), ('econ', 0.088), ('xci', 0.088), ('finland', 0.086), ('mb', 0.083), ('balanced', 0.08), ('kb', 0.077), ('ci', 0.073), ('wc', 0.072), ('pci', 0.072), ('rooted', 0.071), ('sat', 0.071), ('chordality', 0.071), ('xsi', 0.071), ('trees', 0.07), ('cardinality', 0.07), ('edge', 0.068), ('solver', 0.067), ('junction', 0.066), ('weight', 0.064), ('hc', 0.064), ('constraints', 0.063), ('encodings', 0.061), ('ck', 0.06), ('heart', 0.057), ('translations', 0.057), ('networks', 0.057), ('psi', 0.054), ('abo', 0.053), ('akademi', 0.053), ('biere', 0.053), ('jukka', 0.053), ('pwbo', 0.053), ('si', 0.053), ('cycle', 0.052), ('network', 0.05), ('boolean', 0.05), ('undirected', 0.049), ('gms', 0.047), ('search', 0.046), ('condition', 0.046), ('leaf', 0.046), ('gc', 0.046), ('satis', 0.044), ('della', 0.043), ('pietra', 0.043), ('separator', 0.043), ('zc', 0.043), ('lauritzen', 0.043), ('occurs', 0.043), ('induction', 0.043), ('score', 0.042), ('encoding', 0.042), ('satisfaction', 0.04), ('pa', 0.039), ('solvers', 0.039), ('ability', 0.038), ('constraint', 0.038), ('forests', 0.037), ('na', 0.037), ('graphical', 0.037), ('nk', 0.035), ('corander', 0.035), ('enk', 0.035), ('expressible', 0.035), ('gebser', 0.035), ('janhunen', 0.035), ('pti', 0.035), ('sebastiani', 0.035), ('tomi', 0.035), ('pv', 0.033), ('theories', 0.033), ('characterization', 0.033), ('ti', 0.033), ('maximum', 0.033), ('modulo', 0.031), ('helsinki', 0.031), ('giudici', 0.031), ('aalto', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
2 0.11444416 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
3 0.10030184 66 nips-2013-Computing the Stationary Distribution Locally
Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah
Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1
4 0.097461239 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . 
(a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). 
The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. 
For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. 
We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. 
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. 
While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. 
For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. 
Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . 
i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. 
If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.

References
[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Narendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computers, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. In Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. In Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. In Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of the ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
5 0.097395048 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, state-of-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n^4), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n^2(m + log n + δ^2 + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
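To make the quoted O(n^4) cost of the shortest-path kernel concrete, here is a minimal sketch of the naive kernel between two attributed graphs; the Gaussian attribute comparison, the exact-length matching, and all parameters are illustrative assumptions of ours, not the faster construction proposed in the paper above.

```python
import numpy as np
from scipy.sparse.csgraph import floyd_warshall

def shortest_path_kernel(A1, X1, A2, X2, sigma=1.0):
    """Naive shortest-path kernel between two graphs with continuous node attributes.

    A1, A2: (n x n) adjacency matrices; X1, X2: (n x d) node attributes.
    Every pair of node pairs is compared, hence the O(n^4) cost mentioned above.
    """
    D1 = floyd_warshall(A1, directed=False)
    D2 = floyd_warshall(A2, directed=False)
    k = 0.0
    for i in range(len(A1)):
        for j in range(len(A1)):
            if i == j or np.isinf(D1[i, j]):
                continue
            for u in range(len(A2)):
                for v in range(len(A2)):
                    if u == v or np.isinf(D2[u, v]):
                        continue
                    if D1[i, j] == D2[u, v]:
                        # Endpoint attributes compared with a Gaussian kernel.
                        k += np.exp(-(np.linalg.norm(X1[i] - X2[u]) ** 2 +
                                      np.linalg.norm(X1[j] - X2[v]) ** 2) / (2 * sigma ** 2))
    return k

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float); A = np.triu(A, 1); A = A + A.T
B = (rng.random((5, 5)) < 0.4).astype(float); B = np.triu(B, 1); B = B + B.T
print(shortest_path_kernel(A, rng.normal(size=(6, 3)), B, rng.normal(size=(5, 3))))
```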
6 0.096890427 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
7 0.096208036 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
8 0.096100286 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
9 0.092047244 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
10 0.083679378 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
11 0.080455221 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
12 0.070849285 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
13 0.069702543 7 nips-2013-A Gang of Bandits
14 0.067516245 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
15 0.06750872 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
16 0.067495011 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
17 0.06698399 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
18 0.064583085 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
19 0.06225821 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
20 0.060878988 347 nips-2013-Variational Planning for Graph-based MDPs
topicId topicWeight
[(0, 0.154), (1, 0.026), (2, -0.042), (3, -0.008), (4, 0.067), (5, 0.077), (6, 0.062), (7, -0.121), (8, 0.032), (9, 0.023), (10, 0.052), (11, -0.107), (12, 0.232), (13, -0.019), (14, 0.005), (15, -0.006), (16, -0.002), (17, 0.059), (18, -0.016), (19, -0.019), (20, 0.112), (21, 0.088), (22, -0.01), (23, -0.09), (24, 0.059), (25, 0.006), (26, 0.063), (27, 0.076), (28, -0.011), (29, 0.071), (30, -0.073), (31, -0.054), (32, -0.021), (33, -0.001), (34, 0.01), (35, 0.046), (36, 0.062), (37, 0.029), (38, -0.021), (39, -0.099), (40, -0.049), (41, 0.016), (42, -0.076), (43, -0.032), (44, 0.023), (45, -0.003), (46, -0.074), (47, -0.005), (48, -0.044), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.97148067 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
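The abstract above refers to maximum-weight spanning trees of the clique graph and to separators between cliques. As generic background, here is a minimal networkx sketch that builds a clique (junction) tree of a chordal graph as a maximum-weight spanning tree over separator sizes and tallies, for each variable, how often it occurs in cliques versus separators. The example graph is ours, and whether the printed count identity coincides with the paper's balancing condition is not something this sketch claims; it only illustrates the objects involved.

```python
import itertools
import networkx as nx

# A small chordal Markov network structure (an illustrative example of ours).
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
assert nx.is_chordal(G)

cliques = [frozenset(c) for c in nx.chordal_graph_cliques(G)]

# Clique graph: nodes are maximal cliques, edge weight = size of the separator.
CG = nx.Graph()
CG.add_nodes_from(cliques)
for c1, c2 in itertools.combinations(cliques, 2):
    sep = c1 & c2
    if sep:
        CG.add_edge(c1, c2, weight=len(sep), separator=sep)

# A clique (junction) tree is a maximum-weight spanning tree of the clique graph.
T = nx.maximum_spanning_tree(CG, weight="weight")

separators = [d["separator"] for _, _, d in T.edges(data=True)]
for v in G.nodes:
    in_cliques = sum(v in c for c in cliques)
    in_seps = sum(v in s for s in separators)
    # For a valid clique tree, each variable occurs once more in cliques than in separators.
    print(v, in_cliques, in_seps, in_cliques - in_seps)
```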
2 0.73020011 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
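As background for the abstract above, the following is a minimal sketch of greedy sensor selection by mutual information in a multivariate Gaussian, where the mutual information is computed from log-determinants of covariance blocks. The greedy criterion, names, and toy covariance are generic illustrations of ours, not the decompositions or message-passing algorithms proposed in the paper.

```python
import numpy as np

def gaussian_mi(Sigma, rel, obs):
    """Mutual information I(x_rel ; x_obs) for a jointly Gaussian vector with covariance Sigma."""
    rel, obs = list(rel), list(obs)
    if not obs:
        return 0.0
    log_det = lambda idx: np.linalg.slogdet(Sigma[np.ix_(idx, idx)])[1]
    return 0.5 * (log_det(rel) + log_det(obs) - log_det(rel + obs))

def greedy_select(Sigma, relevant, candidates, budget):
    """Greedily pick `budget` sensors maximizing MI with the relevant latent variables."""
    chosen = []
    for _ in range(budget):
        gains = [(gaussian_mi(Sigma, relevant, chosen + [c]), c)
                 for c in candidates if c not in chosen]
        _, best = max(gains)
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
Sigma = A @ A.T + 8 * np.eye(8)   # a positive-definite toy covariance
print(greedy_select(Sigma, relevant=[0, 1], candidates=list(range(2, 8)), budget=3))
```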
3 0.67497844 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
Author: Ying Liu, Alan Willsky
Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k^2 n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the network is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn^2 + n^2 log n) if the FVS is known, or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn^2 + n^2 log n) per iteration. We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 1
4 0.63719809 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying
Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies. 1 The Need For Bushy Hierarchical Clustering Hierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies. We review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman’s coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman’s coalescent has attractive properties—it is probabilistic and has edge “lengths” that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. However, na¨ve inference is impractical, because bushy trees are more complex: we need ı to consider all possible subsets of nodes to construct each internal nodes in the hierarchy. Our first contribution is a generalization of the belief propagation framework [9] for beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman’s coalescent (Section 5). 2 Bayesian Clustering Approaches Recent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build 1 binary trees based on the marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Ryan et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes.1 However, these models do not distinguish edge lengths, an important property in distinguishing how “tight” the clustering is at particular nodes. 
Hierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents. Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time ti , a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained. Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λk , the rate at which k out of n nodes merge into a parent node. n From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ −2 Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate 1 λk = n γ k−2 (1 − γ)n−k Λ(dγ) (2 ≤ k ≤ n). (1) 0 Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λk is 1 when k = 2 and zero otherwise. Because this n gives zero probability to non-binary coalescent events, this only creates binary trees. Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have ni−1 nodes at time ti−1 in a beta coalescent, the rate λkii−1 for a children set of ki nodes at time n ti and the total rate λni−1 of any children set merging—summing over all possible mergers—is λkii−1 = n Γ(ki − α)Γ(ni−1 − ki + α) and λni−1 = Γ(2 − α)Γ(α)Γ(ni−1 ) ni−1 ni−1 ki λkii−1 . n (2) ki =2 Each coalescent event also has an edge length—duration—δi . The duration of an event comes from an exponential distribution, δi ∼ exp(λni−1 ), and the parent node forms at time ti = ti−1 − δi . Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . 
m, merging ki children after duration δi , m m λkii−1 · exp(−λni−1 δi ). n p(ki |ni−1 ) · p(δi |ki , ni−1 ) = p(π) = i=1 Merge ki nodes After duration δi (3) i=1 1 This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2. 2 Algorithm 1 MCMC inference for generating a tree 1: for Particle s = 1, 2, · · · , S do s 2: Initialize ns = n, i = 0, ts = 0, w0 = 1. 0 s 3: Initialize the node set V = {ρ0 , ρ1 , · · · , ρn }. 4: while ∃s ∈ {1 · · · S} where ns > 1 do 5: Update i = i + 1. 6: for Particle s = 1, 2, · · · , S do (a) Kingman’s coalescent 7: if ns == 1 then 8: Continue. s 9: Propose a duration δi by Equation 10. s 10: Set coalescent time ts = ts − δi . i i−1 11: Sample partitions ps from DPMM. i s 12: Propose a set ρci according to Equation 11. s 13: Update weight wi by Equation 13. s s s 14: Update n = n − |ρci | + 1. s (b) the beta coalescent 15: Remove ρci from V s , add ρs to V s . i 16: Compute effective sample size ESS [16]. Figure 1: The beta coalescent can merge four simi17: if ESS < S/2 then lar nodes at once, while Kingman’s coalescent only 18: Resample particles [17]. merges two each time. 3 Beta Coalescent Belief Propagation The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees. An internal node ρi is defined as the merger of other nodes. The children set of node ρi , ρci , coalesces into a new node ρi ≡ ∪b∈ci ρb . This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector yi . Two terms specify the relationship between nodes’ features: an initial distribution p0 (yi ) and a transition kernel κti tb (yi , yb ). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature yb (at time tb ) to resemble feature yi (formed at ti ); shorter durations tb − ti increase the resemblance. Intuitively, the transition kernel can be thought as a similarity score; the more similar the features are, the more likely nodes are. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is decided by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations. To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a Beta coalescent tree [19]. We define a subtree structure θi = {θi−1 , δi , ρci }, thus the tree θm after the final coalescent event m is a complete tree π. The message for node ρi marginalizes over the features of the nodes in its children set.2 The total message for a parent node ρi is −1 Mρi (yi ) = Zρi (x|θi ) κti tb (yi , yb )Mρb (yb )dyb . 
(4) b∈ci where Zρi (x|θi ) is the local normalizer, which can be computed as the combination of initial distribution and messages from a set of children, Zρi (x|θi ) = p0 (yi ) κti tb (yi , yb )Mρb (yb )dyb dyi . b∈ci 2 When ρb is a leaf, the message Mρb (yb ) is a delta function centered on the observation. 3 (5) Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root, Z−∞ (x|θm ) = p0 (y−∞ )κ−∞,tm (y−∞ , ym )Mρm (ym )dym dy−∞ (6) where p0 (y−∞ ) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree, m p(x|π) = Z−∞ (x|θm ) Zρi (x|θi ), (7) i=1 The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7), m λkii−1 exp(−λni−1 δi ) · Zρi (x|θi ). n p(x, π) = Z−∞ (x|θm ) (8) i=1 3.1 Sequential Monte Carlo Inference Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster;3 in subsequent particles (i > 0), points coalesce into more complicated tree s structures θi , where s is the particle index and we add superscript s to all the related notations to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each s particle s at time ti , denoted as wi . The weights from SIR approximate the posterior. Computing the weights requires a conditional distris s s bution of data given a latent state p(x|θi ), a transition distribution between latent states p(θi |θi−1 ), s s and a proposal distribution f (θi |θi−1 , x). Together, these distributions define weights s s wi = wi−1 s s s p(x | θi )p(θi | θi−1 ) . s s f (θi | θi−1 , x) (9) Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles. To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle s space structure. The sth particle represents a subtree θi−1 at time ts , and a transition to a new i−1 s s s s subtree θi takes a set of nodes ρci from θi−1 , and merges them at ts , where ts = ts − δi and i i i−1 s s s s s θi = {θi−1 , δi , ρci }. Our proposal distribution must provide the duration δi and the children set ρsi c s to merge based on the previous subtree θi−1 . s We propose the duration δi from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers. 4 This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given ni−1 nodes at time ti , we must consider all possible n children sets ni−1 + ni−1 + · · · + ni−1 . The computational complexity grows from O(n2 ) i−1 i−1 2 3 ni−1 (Kingman’s coalescent) to O(2 ) (beta coalescent). 4 Efficiently Finding Children Sets with DPMM We need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Gorur et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). 
While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied. 3 The relationship between time and particles is non-intuitive. Time t goes backward with subsequent particles. When we use time-specific adjectives for particles, this is with respect to inference. 4 This is a special case of Section 4.2’s algorithm, where the restriction set Ωi is all possible subsets. 4 In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference. The DPMM is defined by a concentration β and a base distribution G0 . A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G0 ). Each observation xi is assigned to a mixture component µi drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µi = µj ). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material. 4.1 Attractive Properties of DPMMs DPMM s and Coalescents Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect DPMM with coalescents by setting the kernel covariance Σ = µI to Σ = β/2I. The base distribution G0 is also related with nodes’ feature. The base distribution G0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution which fits the distribution of the samples in coalescent process. For example, if we use Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution. Effectiveness as a Proposal The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions. Varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood; i.e., ensuring the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)). 4.2 Incorporating DPMM in SMC Proposals To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. 
Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ωi = {ωi1 , ωi2 , · · · }, where each ωij is a subset of the ni−1 nodes. A standard Gibbs sampler provides these partitions (see supplemental). s With this restriction set Ωi , we propose the duration time δi from the exponential distribution and s propose a children set ρci based on the local normalizers s s Zρi (x|θi−1 , δi , ρsi ) c s s s s s s · I ρci ∈ Ωs , (11) fi (ρci |δi , θi−1 ) = fi (δi ) = λs i−1 exp(−λs i−1 δi ) (10) i n n Z0 s where Ωs restricts the candidate children sets, I is the indicator, and we replace Zρi (x|θi ) with i s s s Zρi (x|θi−1 , δi , ρci ) since they are equivalent here. The normalizer is Z0 = ρc s s Zρi (x|θi−1 , δi , ρc ) · I [ρc ∈ Ωs ] = i ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ). (12) Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9), |ρs | c s s wi = wi−1 i λni−1 · ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ) λs i−1 n 5 , (13) |ρs | c s i where |ρsi | is the size of children set ρci ; parameter λni−1 is the rate of the children set ρsi (Equac c s tion 2); and λni−1 is the rate of all possible sets given a total number of nodes ni−1 (Equation 2). We can view this new proposal as a coarse-to-fine process: DPMM proposes candidate children sets; SMC selects a children set from DPMM to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ωi has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters). In such cases it is simpler to enumerate all children sets. 4.3 Example Transition Kernel: Brownian Diffusion This section uses Brownian diffusion as an example for message passing framework. The initial distribution p0 (y) of each node is N (0, ∞); the transition kernel κti tb (y, ·) is a Gaussian centered at y with variance (ti − tb )Σ, where Σ = µI, µ = β/2, β is the concentration parameter of DPMM. Then the local normalizer Zρi (x|θi ) is Zρi (x|θi ) = N (yi ; 0, ∞) b∈ci N (yi ; yb , Σ(vρb + tb − ti ))dyi , ˆ (14) and the node message Mρi (yi ) is normally distributed Mρi (yi ) ∼ N (yi ; yρi , Σvρi ), where ˆ vρi = 5 b∈ci (vρb + tb − ti )−1 −1 , yρi = ˆ b∈ci yρb ˆ vρ b + t b − t i vρ i . Experiments: Finding Bushy Trees In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show beta performs best and can capture data in more interpretable, bushier trees. Setup The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, bushier the tree is, and we set α = 1.2.5 We set the mutation rate as 1, thus the DPMM parameter is initialized as β = 2, and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference with 30 more iterations after each coalescent event (forming a new particle). Metrics We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. 
For any two leaf nodes, we find the least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with same class labels. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains. 5.1 Synthetic Hierarchies To test our inference method, we generated synthetic data with edge length (full details available in the supplemental material); we also assume each child of the root has a unique label and the descendants also have the same label as their parent node (except the root node). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and feature dimensions (Figure 2(b)). In both cases, beta is comparable to kingman and hac (no edge length). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffice to construct a good tree. 5 With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values. 6 0.8 0.6 beta hac kingman 0.4 0.8 0.6 beta hac kingman 8 60 80 100 beta kingman length Number of Observations Scores 0.6 beta kingman 0.4 0.2 0.0 4 6 length Dimension 0.6 0.4 0.2 0.0 20 40 60 80 100 0.6 beta 2 4 6 length Dimension 0.6 8 enum beta 10 enum 0.4 0.2 0.0 2 Number of Observations (a) Increasing observations 0.8 0.4 2 Scores 40 1.0 10 0.4 20 Scores purity 1.0 Scores purity Scores Scores purity 1.0 4 6 8 Dimension (b) Increasing dimension 10 2 4 6 8 10 Dimension (c) beta v.s. enum Figure 2: Figure 2(a) and 2(b) show the effect of changing the underlying data size or number of dimension. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum). To evaluate the effectiveness of using our DPMM as a proposal distribution, we compare exhaustively enumerating all children set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieved similar accuracy (Figure 2(c)) while greatly improving efficiency. 5.2 Human Tissue Development Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways is a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression. 
To investigate this question we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182727 dimensions via principle component analysis [30, PCA]. We use five chains with five particles per chain. Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds up a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. It also successfully captures ectodermal tissue lineage. However, mesodermic and endodermic tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together. 5.3 Clustering 20-newsgroups Data Following Heller et al. [10], we also compare the three models on 20-newsgroups,6 a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey.7 This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus, and use the topic distribution for each document as the document feature. We use five chains with eighty particles per chain. 6 http://qwone.com/˜jason/20Newsgroups/ 7 We use “rec.autos”, “rec.sport.baseball”, “rec.sport.hockey”, “sci.space” newsgroups but also—in contrast to Heller et al. [10]—added “soc.religion.christian”. 7 ectoderm Stomach Pancreas mesoderm placenta Placenta endoderm Bone Marrow Thyroid Colon Kidney Heart PeripheralBlood Lymphocytes Brain Hypothalamus Brain Amygdala Prostate Uterus Lung Brain Thalamus BrainCorpus Callosum Spleen Thymus Spinal Cord Brain Cerebellum BrainCaudate Nucleus Doc Label rec.sport.baseball rec.autos rec.sport.hocky sci.space soc.religion.christian Trachea Small Intestine Retina Monocytes Mammary Gland ... purity ↑ subtree ↑ length ↓ ... ... ... ... Bladder Figure 3: One sample hierarchy of human tissue from beta. Color indicates germ layer origin of tissue. Dashed border indicates secretory function. While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The cluster also preferred placing secretory tissues together and higher in the tree. hac 0.453 0.240 − True Tree Figure 4: One sample hierarchy of the 20newsgroups from beta. Each small square is a document colored by its class label. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three “rec” groups are merged together first, and then merged with the religion and space groups. Biological Data kingman beta 0.474 ± 0.029 0.492 ± 0.028 0.302 ± 0.033 0.331 ± 0.050 0.654 ± 0.041 0.586 ± 0.051 hac 0.465 0.571 − 20-newsgroups Data kingman beta 0.510 ± 0.047 0.565 ± 0.081 0.651 ± 0.013 0.720 ± 0.013 0.477 ± 0.027 0.333 ± 0.047 Table 1: Comparing the three models: beta performs best on all three scores. As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovered the true hierarchy. 
Documents within a newsgroup merge first, then the three “rec” groups, followed by “space” and “religion” groups. We only use topic distribution as features, so better results could be possible with more comprehensive features. 6 Conclusion This paper generalizes Bayesian hierarchical clustering, moving from Kingman’s coalescent to the beta coalescent. Our novel inference scheme based on SMC and DPMM make this generalization practical and efficient. This new model provides a bushier tree, often a more realistic view of data. While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions. Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic). Acknowledgments We would like to thank the anonymous reviewers for their helpful comments, and thank H´ ctor e Corrada Bravo for pointing us to human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990. [2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010. [3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18:18–4, 1990. [4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV. 1992. [5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821– 7826, 2002. [6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. [7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999. [8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matematicos, vol. 16. 2009. e [9] Teh, Y. W., H. Daum´ III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS. 2008. [10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML. 2005. [11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI. 2010. [12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS. 2010. [13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI. 2011. [14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003. [15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999. [16] Neal, R. M. Annealed importance sampling. 
Technical report 9805, University of Toronto, 1998. [17] Fearhhead, P. Sequential Monte Carlo method in filter theory. PhD thesis, University of Oxford, 1998. [18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet, 25(5):471–492, 1973. [19] Birkner, M., J. Blath, M. Steinrucken. Importance sampling for lambda-coalescents in the infinitely many sites model. Theoretical population biology, 79(4):155–73, 2011. [20] Doucet, A., N. De Freitas, N. Gordon, eds. Sequential Monte Carlo methods in practice. 2001. [21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993. ou [22] G¨ r¨ r, D., L. Boyles, M. Welling. Scalable inference on Kingman’s coalescent using pair similarity. JMLR, 22:440–448, 2012. [23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974. [24] Cappe, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. PROCEEDINGS-IEEE, 95(5):899, 2007. [25] Chen, Z. Bayesian filtering: From kalman filters to particle filters, and beyond. McMaster, [Online], 2003. [26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007. [27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [28] Powers, D. M. W. Unsupervised learning of linguistic structure an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997. [29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (mpss). Genome Res, 15:1007–1014, 2005. [30] Shlens, J. A tutorial on principal component analysis. In Systems Neurobiology Laboratory, Salk Institute for Biological Studies. 2005. [31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 2003. [32] Rai, P., H. Daum´ III. The infinite hierarchical factor regression model. In NIPS. 2008. e [33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL. 2008. [34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML. 2009. 9
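Before moving on: the beta-coalescent text above quotes, in its Equation (2), the rate at which a particular set of k out of n lineages merges, λ_k^n = Γ(k−α)Γ(n−k+α) / (Γ(2−α)Γ(α)Γ(n)). The following is a minimal sketch of ours (not code from that paper) that evaluates this rate in log space and shows how the merger-size distribution shifts with α; the choice n = 10 and the α values are illustrative.

```python
import numpy as np
from scipy.special import gammaln, comb

def merge_rate(n, k, alpha):
    """Rate lambda_k^n at which a particular set of k of n lineages merges (Equation (2) above)."""
    return np.exp(gammaln(k - alpha) + gammaln(n - k + alpha)
                  - gammaln(2 - alpha) - gammaln(alpha) - gammaln(n))

def total_rate(n, alpha):
    """Total rate of any merger among n lineages: sum over sizes k of C(n, k) * lambda_k^n."""
    return sum(comb(n, k) * merge_rate(n, k, alpha) for k in range(2, n + 1))

# alpha close to 1 favors larger (bushier) mergers; alpha close to 2 approaches Kingman's coalescent.
for alpha in (1.2, 1.9):
    probs = [comb(10, k) * merge_rate(10, k, alpha) / total_rate(10, alpha) for k in range(2, 11)]
    print(alpha, np.round(probs, 3))
```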
5 0.63259685 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi
Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1
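The abstract above states that node splitting and merging in a decision jungle are driven by minimizing the weighted sum of entropies at the leaves. The following is a minimal sketch of that objective for candidate merges of leaf class-histograms; the weighting by sample counts, the pairwise candidate enumeration, and the toy histograms are assumptions of ours, not the paper's optimization algorithms.

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weighted_leaf_entropy(histograms):
    """Objective from the abstract: sum over leaves of (#samples at leaf) * entropy(leaf)."""
    return sum(h.sum() * entropy(h) for h in histograms)

def best_merge(histograms):
    """Try merging every pair of leaves; return the pair whose merge increases the objective least."""
    base = weighted_leaf_entropy(histograms)
    best = None
    for i in range(len(histograms)):
        for j in range(i + 1, len(histograms)):
            merged = (histograms[:i] + histograms[i + 1:j] + histograms[j + 1:]
                      + [histograms[i] + histograms[j]])
            cost = weighted_leaf_entropy(merged) - base
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best

leaves = [np.array(h, dtype=float) for h in ([40, 2, 1], [35, 5, 0], [1, 30, 2], [0, 3, 25])]
print(best_merge(leaves))   # merging the two class-0-dominated leaves costs the least entropy
```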
6 0.6300965 47 nips-2013-Bayesian Hierarchical Community Discovery
7 0.62901998 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
8 0.57966638 340 nips-2013-Understanding variable importances in forests of randomized trees
9 0.55121922 289 nips-2013-Scalable kernels for graphs with continuous attributes
10 0.55083799 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
11 0.54267907 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
12 0.53256434 66 nips-2013-Computing the Stationary Distribution Locally
13 0.52389634 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
14 0.52385867 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
15 0.5198155 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
16 0.51557356 184 nips-2013-Marginals-to-Models Reducibility
17 0.51266092 355 nips-2013-Which Space Partitioning Tree to Use for Search?
18 0.50524908 332 nips-2013-Tracking Time-varying Graphical Structure
19 0.50503486 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
20 0.49726835 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
topicId topicWeight
[(2, 0.022), (16, 0.048), (33, 0.087), (34, 0.105), (41, 0.026), (49, 0.025), (56, 0.084), (70, 0.058), (85, 0.05), (89, 0.02), (93, 0.034), (95, 0.015), (99, 0.345)]
simIndex simValue paperId paperTitle
same-paper 1 0.77408791 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
2 0.74900734 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
Author: Hossein Azari Soufiani, William Chen, David C. Parkes, Lirong Xia
Abstract: In this paper we propose a class of efficient Generalized Method-of-Moments (GMM) algorithms for computing parameters of the Plackett-Luce model, where the data consists of full rankings over alternatives. Our technique is based on breaking the full rankings into pairwise comparisons, and then computing parameters that satisfy a set of generalized moment conditions. We identify conditions for the output of GMM to be unique, and identify a general class of consistent and inconsistent breakings. We then show by theory and experiments that our algorithms run significantly faster than the classical Minorize-Maximization (MM) algorithm, while achieving competitive statistical efficiency. 1
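To illustrate the pairwise-breaking idea in the abstract above, here is a minimal sketch of a GMM-style estimator for Plackett-Luce parameters: it uses the standard pairwise marginal Pr(i beats j) = γ_i/(γ_i + γ_j) to form moment conditions and solves the resulting homogeneous linear system. The specific breaking (all pairs from each full ranking) and the SVD-based solver are assumptions of this sketch, not necessarily the exact estimators analyzed in the paper.

```python
import numpy as np

def gmm_plackett_luce(rankings, n_items):
    """Estimate Plackett-Luce weights gamma from full rankings via pairwise breaking.

    rankings: list of permutations, most-preferred item first.
    Moment conditions: P[i, j] * gamma[j] - P[j, i] * gamma[i] = 0, where P[i, j] is the
    empirical probability that item i is ranked above item j.
    """
    P = np.zeros((n_items, n_items))
    for r in rankings:
        for a, i in enumerate(r):
            for j in r[a + 1:]:
                P[i, j] += 1.0              # i is preferred to j in this ranking
    P /= max(len(rankings), 1)

    # Stack the (summed) moment conditions into M @ gamma = 0 and take a null-space direction.
    M = P.copy()
    np.fill_diagonal(M, 0.0)
    M -= np.diag(P.sum(axis=0))             # row i: sum_j P[i,j] g_j - (sum_j P[j,i]) g_i
    _, _, vt = np.linalg.svd(M)
    gamma = np.abs(vt[-1])
    return gamma / gamma.sum()

rng = np.random.default_rng(0)
true_gamma = np.array([0.5, 0.25, 0.15, 0.1])

def sample_ranking(gamma):
    items, order = list(range(len(gamma))), []
    g = np.array(gamma, dtype=float)
    while items:
        p = g[items] / g[items].sum()
        order.append(items.pop(rng.choice(len(items), p=p)))
    return order

data = [sample_ranking(true_gamma) for _ in range(5000)]
print(np.round(gmm_plackett_luce(data, 4), 3))   # should be close to true_gamma
```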
3 0.67926729 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality-sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
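As background for the abstract above, here is a minimal sketch of a vantage-point tree with the textbook triangle-inequality pruning rule. This is the metric-space baseline only; the learned pruners and the handling of non-metric distances studied in the paper are not reproduced, and all names and parameters are illustrative choices of ours.

```python
import random

def build_vp_tree(points, dist):
    """Recursively build a VP-tree: pick a vantage point, split the rest by median distance."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return {"vp": vp, "mu": 0.0, "inside": None, "outside": None}
    dists = [dist(vp, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]
    return {"vp": vp, "mu": mu,
            "inside": build_vp_tree([p for p, d in zip(rest, dists) if d <= mu], dist),
            "outside": build_vp_tree([p for p, d in zip(rest, dists) if d > mu], dist)}

def nn_search(node, q, dist, best=(float("inf"), None)):
    """Exact NN in a metric space: visit a subtree only if it can beat the current best."""
    if node is None:
        return best
    d = dist(q, node["vp"])
    if d < best[0]:
        best = (d, node["vp"])
    near, far = ("inside", "outside") if d <= node["mu"] else ("outside", "inside")
    best = nn_search(node[near], q, dist, best)
    if abs(d - node["mu"]) < best[0]:          # triangle-inequality pruning condition
        best = nn_search(node[far], q, dist, best)
    return best

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(2000)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
tree = build_vp_tree(pts, euclid)
print(nn_search(tree, (0.5, 0.5), euclid))
```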
4 0.64883256 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
5 0.63939512 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on principal component regression and its application to high-dimensional non-Gaussian data. The major contributions are twofold. First, in low dimensions and under the Gaussian model, by borrowing strength from recent developments in minimax optimal principal component estimation, we sharply characterize, for the first time, the potential advantage of classical principal component regression over least squares estimation. Secondly, we propose and analyze a new robust sparse principal component regression for high-dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and to have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients at the optimal parametric rate and therefore is a good alternative to Gaussian-based methods. Experiments on synthetic and real-world data are conducted to illustrate the empirical usefulness of the proposed method. 1
6 0.60423338 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
7 0.46703103 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
8 0.46239123 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
9 0.4616372 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
10 0.46013626 15 nips-2013-A memory frontier for complex synapses
11 0.45814738 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
12 0.45713103 47 nips-2013-Bayesian Hierarchical Community Discovery
13 0.45694023 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
14 0.4557932 86 nips-2013-Demixing odors - fast inference in olfaction
15 0.45576873 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
16 0.45527688 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
17 0.45515534 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
18 0.45360032 121 nips-2013-Firing rate predictions in optimal balanced networks
19 0.45334759 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
20 0.45220307 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit