nips nips2004 nips2004-19 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto
Abstract: This paper presents an application of Boosting to the classification of labeled graphs, general structures for modeling many kinds of real-world data, such as chemical compounds, natural language texts, and biological sequences. The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments, using natural language data and chemical compounds, show that our method achieves performance comparable to or better than that of SVMs with convolution kernels while also improving testing efficiency. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper presents an application of Boosting to the classification of labeled graphs, general structures for modeling many kinds of real-world data, such as chemical compounds, natural language texts, and biological sequences. [sent-9, score-0.261]
2 The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. [sent-10, score-1.017]
3 We also discuss the relation between our algorithm and SVMs with convolution kernels. [sent-11, score-0.128]
4 Two experiments, using natural language data and chemical compounds, show that our method achieves performance comparable to or better than that of SVMs with convolution kernels while also improving testing efficiency. [sent-12, score-0.424]
5 Much real-world data is represented not as numerical vectors but as structures, such as biological sequences (e.g., DNA and RNA), chemical compounds, natural language texts, and semi-structured data (e.g., XML and HTML documents). [sent-17, score-0.133]
6 Recently, a number of kernels have been proposed for such structured data, such as sequences [7], trees [2, 5], and graphs [6]. [sent-22, score-0.212]
7 Most are based on the idea that a feature vector is implicitly composed of the counts of substructures (e.g., substrings, subtrees, or subgraphs). [sent-23, score-0.174]
8 Although kernel methods show remarkable performance, their implicit definition of the feature space makes it difficult to know what kinds of features (substructures) are relevant or which features are used in classification. [sent-26, score-0.281]
9 In this paper, we present a new machine learning algorithm for classifying labeled graphs that has the following characteristics: 1) It performs learning and classification using the structural information of a given graph (see Figure 1: labeled connected graphs and the subgraph relation). [sent-29, score-0.561]
10 2) It uses a set of all subgraphs (bag-of-subgraphs) as a feature set without any constraints, which is essentially the same idea as a convolution kernel [4]. [sent-30, score-0.505]
11 3) Even though the size of the candidate feature set becomes quite large, it automatically selects a compact and relevant feature set based on Boosting. [sent-31, score-0.128]
12 2 Classifier for Graphs We first assume that an instance is represented as a labeled graph. [sent-32, score-0.13]
13 The problem we focus on can be formalized as the general graph classification problem. [sent-33, score-0.11]
14 The graph classification problem is to induce a mapping f(x): X → {±1} from given training examples T = {⟨x_i, y_i⟩}_{i=1}^{L}, where x_i ∈ X is a labeled graph and y_i ∈ {±1} is the class label associated with x_i. [sent-34, score-0.739]
15 The important characteristic is that the input example x_i is represented not as a numerical feature vector but as a labeled graph. [sent-36, score-0.273]
16 1 Preliminaries In this paper we focus on undirected, labeled, and connected graphs, since we can easily extend our algorithm to directed or unlabeled graphs with minor modifications. [sent-38, score-0.163]
17 Let us introduce definitions and notation for labeled connected graphs (or simply labeled graphs). [sent-39, score-0.344]
18 Definition 1 Labeled Connected Graph A labeled graph is represented as a 4-tuple G = (V, E, L, l), where V is a set of vertices, E ⊆ V × V is a set of edges, L is a set of labels, and l : V ∪ E → L is a mapping that assigns labels to the vertices and edges. [sent-40, score-0.267]
19 A labeled connected graph is a labeled graph in which there is a path between any pair of vertices. [sent-41, score-0.454]
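As a concrete (and purely illustrative) rendering of Definition 1, a labeled graph could be held in a small container like the following; the paper itself prescribes no data structure:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LabeledGraph:
    # G = (V, E, L, l) of Definition 1. Edges are undirected pairs (u, v);
    # `labels` plays the role of l, mapping vertices and edges to labels.
    vertices: tuple
    edges: tuple
    labels: dict = field(default_factory=dict)
```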
20 Definition 2 Subgraph Let G' = (V', E', L', l') and G = (V, E, L, l) be labeled connected graphs. [sent-42, score-0.135]
21 G' matches G, or G' is a subgraph of G (G' ⊆ G), if the following conditions are satisfied: (1) V' ⊆ V, (2) E' ⊆ E, (3) L' ⊆ L, and (4) l'(x) = l(x) for all x ∈ V' ∪ E'. [sent-43, score-0.123]
22 If G' is a subgraph of G, then G is a supergraph of G'. [sent-44, score-0.201]
23 Figure 1 shows an example of a labeled graph together with one of its subgraphs and a non-subgraph. [sent-45, score-0.332]
24 2 Decision Stumps Decision stumps are simple classifiers in which the final decision is made by a single hypothesis or feature. [sent-47, score-0.41]
25 BoosTexter [10] uses word-based decision stumps for text classification. [sent-48, score-0.41]
26 To classify graphs, we define subgraph-based decision stumps as follows. [sent-49, score-0.442]
27 Definition 3 Decision Stumps for Graphs Let t and x be labeled graphs and y be a class label (y ∈ {±1}). [sent-50, score-0.226]
28 A decision stump classifier for graphs is given by: h_{⟨t,y⟩}(x) def= y if t ⊆ x, and −y otherwise. [sent-51, score-0.343]
29 The parameter for classification is the tuple ⟨t, y⟩, hereafter referred to as the rule of the decision stump. [sent-52, score-0.155]
30 The gain function for a rule ⟨t, y⟩ is defined as gain(⟨t, y⟩) def= Σ_{i=1}^{L} y_i h_{⟨t,y⟩}(x_i). [sent-56, score-0.429]
31 In this paper, we use gain instead of error rate for clarity. [sent-58, score-0.196]
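In code, the stump and its gain fit in a few lines. The sketch below is illustrative only: `is_subgraph` stands in for any subgraph-isomorphism test (the paper does not name such a function), and examples are (graph, label) pairs.

```python
def stump_predict(t, y, x, is_subgraph):
    # h_{<t,y>}(x) = y if t is a subgraph of x, -y otherwise
    return y if is_subgraph(t, x) else -y

def gain(t, y, examples, is_subgraph):
    # gain(<t,y>) = sum_i y_i * h_{<t,y>}(x_i); maximizing the gain is
    # equivalent to minimizing the error rate of the stump
    return sum(y_i * stump_predict(t, y, x_i, is_subgraph)
               for x_i, y_i in examples)
```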
32 3 Applying Boosting The decision stump classifiers are too inaccurate to be applied to real applications, since the final decision relies on the existence of a single subgraph. [sent-60, score-0.269]
33 Boosting repeatedly calls a given weak learner and finally produces a hypothesis f, which is a linear combination of K hypotheses produced by the weak learners, i.e., f(x) = sgn(Σ_{k=1}^{K} α_k h_{⟨t_k,y_k⟩}(x)). [sent-62, score-0.166]
34 A weak learner is built at each iteration k with a different distribution d^{(k)} = (d_1^{(k)}, ..., d_L^{(k)}) on the training data, where Σ_{i=1}^{L} d_i^{(k)} = 1 and d_i^{(k)} ≥ 0. [sent-67, score-0.318]
35 To use decision stumps as the weak learner of Boosting, we redefine the gain function (2) as: gain(⟨t, y⟩) def= Σ_{i=1}^{L} y_i d_i h_{⟨t,y⟩}(x_i). [sent-69, score-0.895]
36 We adopt the standard AdaBoost weight-update scheme; however, it is trivial to plug our decision stumps into other boosting algorithms, such as Arc-GV [1] and Boosting with soft margins [8]. [sent-71, score-0.855]
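For concreteness, here is a minimal sketch of plain AdaBoost driven by such stumps. `find_best_rule` is a placeholder for the branch-and-bound search described below, and the update shown is textbook AdaBoost rather than the authors' exact implementation:

```python
import math

def adaboost_with_stumps(examples, find_best_rule, K):
    # examples: list of (x_i, y_i) pairs with y_i in {+1, -1}.
    # find_best_rule(examples, d) returns (t, y, preds) where preds[i] is
    # h_{<t,y>}(x_i) for the gain-maximizing rule under distribution d.
    L = len(examples)
    d = [1.0 / L] * L
    model = []
    for _ in range(K):
        t, y, preds = find_best_rule(examples, d)
        eps = sum(d_i for d_i, (_, y_i), p in zip(d, examples, preds)
                  if p != y_i)                       # weighted error
        alpha = 0.5 * math.log((1.0 - eps) / max(eps, 1e-12))
        model.append((alpha, t, y))
        d = [d_i * math.exp(-alpha * y_i * p)        # re-weight
             for d_i, (_, y_i), p in zip(d, examples, preds)]
        z = sum(d)
        d = [d_i / z for d_i in d]                   # normalize
    return model  # f(x) = sign(sum_k alpha_k * h_{<t_k,y_k>}(x))
```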
37 Let T = {⟨x_1, y_1, d_1⟩, ..., ⟨x_L, y_L, d_L⟩} be training data, where x_i is a labeled graph, y_i ∈ {±1} is the class label associated with x_i, and d_i (with Σ_{i=1}^{L} d_i = 1, d_i ≥ 0) is a normalized weight assigned to x_i. [sent-77, score-0.944]
38 The problem is to find the optimal rule ⟨t̂, ŷ⟩ = argmax_{t∈F, y∈{±1}} Σ_{i=1}^{L} d_i y_i h_{⟨t,y⟩}(x_i), where F = ∪_{i=1}^{L} {t | t ⊆ x_i}. [sent-80, score-0.369]
39 The most naive and exhaustive method, in which we first enumerate all subgraphs in F and then calculate the gains for all of them, is usually impractical, since the number of subgraphs is exponential in the graph size. [sent-81, score-0.674]
40 The method to find the optimal rule is modeled as a variant of the branch-and-bound algorithm and is summarized by the following strategies (see Figure 2: example of a DFS Code Tree for a graph): 1) Define a canonical search space in which the whole set of subgraphs can be enumerated. [sent-83, score-0.478]
41 2) Find the optimal rule by traversing this search space. [sent-84, score-0.12]
42 3) Prune the search space by proposing a criterion for the upper bound of the gain. [sent-85, score-0.144]
43 Yan et al. proposed an efficient depth-first search algorithm to enumerate all subgraphs from a given graph [12]. [sent-89, score-0.441]
44 The search tree given by the DFS code is called a DFS Code Tree. [sent-91, score-0.302]
45 Leaving the details to [12], the order of the DFS code is defined by the lexicographic order of labels as well as the topology of graphs. [sent-92, score-0.199]
46 Each node in this tree is represented by a 5-tuple [i, j, v_i, e_{ij}, v_j], where e_{ij}, v_i, and v_j are the labels of the edge (i, j), the i-th vertex, and the j-th vertex, respectively. [sent-94, score-0.216]
47 By performing a pre-order search of the DFS Code Tree, we can obtain all the subgraphs of a graph in order of their DFS code. [sent-95, score-0.391]
48 However, one cannot avoid enumerating isomorphic graphs even with a pre-order traversal, since one graph can have several DFS codes in a DFS Code Tree. [sent-96, score-0.218]
49 The canonical DFS code (minimum DFS code) of a graph is therefore defined as the first of its codes encountered in the pre-order search of the DFS Code Tree. [sent-97, score-0.379]
50 Yan et al. show that two graphs G and G' are isomorphic if and only if the minimum DFS codes of the two graphs, min(G) and min(G'), are the same [12]. [sent-99, score-0.362]
51 We can thus ignore non-minimum DFS codes in subgraph enumerations. [sent-100, score-0.153]
52 In other words, in the depth-first traversal, we can prune a node with DFS code c if c is not minimum. [sent-101, score-0.204]
53 The isomorphic graph represented by the minimum code has already been enumerated earlier in the depth-first traversal. [sent-102, score-0.352]
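The duplicate check is then a one-line guard. In this sketch, `graph_of` and `min_dfs_code` are assumed helpers (an efficient minimality test is part of gSpan [12]), not functions given in the paper:

```python
def should_expand(code, graph_of, min_dfs_code):
    # Expand a DFS Code Tree node only if its code is the minimum DFS code
    # of the graph it spells out; otherwise an isomorphic copy of that graph
    # was already enumerated earlier in the pre-order traversal.
    return code == min_dfs_code(graph_of(code))
```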
54 2 Upper bound of gain The DFS Code Tree defines a canonical search space in which one can enumerate all subgraphs from a given set of graphs. [sent-106, score-0.601]
55 We consider an upper bound of the gain that allows pruning of subspaces in this canonical search space. [sent-107, score-0.38]
56 The following lemma gives a convenient method for computing a tight upper bound on gain(⟨t', y⟩) for any supergraph t' of t. [sent-108, score-0.149]
57 Lemma 1 Upper bound of the gain, µ(t): for any t' ⊇ t and y ∈ {±1}, the gain of ⟨t', y⟩ is bounded by µ(t), i.e., gain(⟨t', y⟩) ≤ µ(t). [sent-109, score-0.23]
58 Here µ(t) def= max( 2 Σ_{i: y_i=+1, t⊆x_i} d_i − Σ_{i=1}^{L} y_i·d_i , 2 Σ_{i: y_i=−1, t⊆x_i} d_i + Σ_{i=1}^{L} y_i·d_i ). [sent-111, score-0.953]
59 i=1 Proof 1 L gain( t , y ) = L di yi h i=1 t ,y (xi ) = di yi · y · (2I(t ⊆ xi ) − 1), i=1 where I(·) is the indicator function. [sent-112, score-0.659]
60 If we focus on the case y = +1, then L gain( t , +1 ) = 2 yi di − {i|t ⊆xi } L yi · di ≤ 2 di − {i|yi =+1,t ⊆xi } i=1 yi · di i=1 L ≤ 2 di − yi · di , {i|yi =+1,t⊆xi } i=1 since |{i|yi = +1, t ⊆ xi }| ≤ |{i|yi = +1, t ⊆ xi }| for any t ⊇ t. [sent-113, score-1.636]
61 Similarly, L gain( t , −1 ) ≤ 2 di + {i|yi =−1,t⊆xi } yi · di . [sent-114, score-0.449]
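Computed directly from the weighted training data, the bound of Lemma 1 looks as follows; `is_subgraph` is again an assumed matcher and the data layout is illustrative:

```python
def mu(t, examples, d, is_subgraph):
    # Lemma 1: an upper bound on gain(<t', y>) for every supergraph t' of t.
    s = sum(d_i * y_i for d_i, (_, y_i) in zip(d, examples))   # sum_i y_i d_i
    pos = sum(d_i for d_i, (x_i, y_i) in zip(d, examples)
              if y_i == +1 and is_subgraph(t, x_i))
    neg = sum(d_i for d_i, (x_i, y_i) in zip(d, examples)
              if y_i == -1 and is_subgraph(t, x_i))
    return max(2.0 * pos - s, 2.0 * neg + s)
```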
62 We can efficiently prune the DFS Code Tree using the upper bound of the gain, µ(t). [sent-116, score-0.338]
63 During the pre-order traversal of a DFS Code Tree, we always maintain the current suboptimal gain τ, the best among all the gains calculated previously. [sent-117, score-0.319]
64 If µ(t) < τ, the gain of any supergraph t' ⊇ t is no greater than τ, and therefore we can safely prune the search space spanned from the subgraph t. [sent-118, score-0.541]
65 If µ(t) ≥ τ, then we cannot prune this space, since a supergraph t' ⊇ t might exist such that gain(⟨t', y⟩) ≥ τ. [sent-119, score-0.149]
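Putting the pieces together, the search for the optimal rule is a depth-first branch-and-bound over the DFS Code Tree. Every callback in this sketch (`children`, `gain_of`, `mu_of`, `is_minimum`) is an assumed hook rather than the paper's API:

```python
def find_optimal_rule(root_codes, children, gain_of, mu_of, is_minimum):
    best, tau = None, float("-inf")   # tau: best-so-far (suboptimal) gain
    stack = list(root_codes)
    while stack:
        c = stack.pop()
        if not is_minimum(c):         # non-minimum DFS code: isomorphic
            continue                  # duplicate, skip its whole subtree
        for y in (+1, -1):
            g = gain_of(c, y)
            if g > tau:
                tau, best = g, (c, y)
        if mu_of(c) >= tau:           # Lemma 1: only then can a supergraph
            stack.extend(children(c)) # of c still beat tau
    return best
```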
66 However, if we can calculate a tighter upper bound in advance, the search space can be pruned more effectively. [sent-122, score-0.144]
67 The suboptimal value τ is initialized by selecting the rule from the cache that maximizes the gain under the current distribution. [sent-124, score-0.309]
68 This idea is based on our observation that a rule in the cache tends to be reused as the number of Boosting iterations increases. [sent-125, score-0.113]
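A sketch of that warm start, assuming the cache simply stores the rules selected in earlier iterations:

```python
def initial_tau(cache, gain_of):
    # Seed the best-so-far gain from cached rules of earlier iterations;
    # a tight initial tau makes the mu(t)-based pruning fire sooner.
    return max((gain_of(t, y) for t, y in cache), default=float("-inf"))
```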
69 The margin of example x_i under the final hypothesis is y_i Σ_{j=1}^{J} w_j h_j(x_i) (4), where J is the number of hypotheses. [sent-133, score-0.131]
70 Note that in the case of decision stumps for graphs, J = |{±1} × F| = 2|F|. [sent-134, score-0.41]
71 The ℓ2-norm margin gives a separating hyperplane expressed by dot products in the feature space. [sent-142, score-0.164]
72 The feature space in SVMs is thus expressed implicitly by using a Mercer kernel function, which is a generalized dot product between two objects, i.e., K(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩. [sent-143, score-0.201]
73 The best known kernel for modeling structured data is the convolution kernel [4]. [sent-146, score-0.376]
74 Examples include the string kernel [7] and tree kernels [2, 5], in which a feature vector is implicitly composed of the counts of substructures (footnote 2). [sent-148, score-0.402]
75 The implicit mapping defined by the convolution kernel is given as Φ(x) = (#(t_1 ⊆ x), ..., #(t_|F| ⊆ x)), where #(t ⊆ x) is the number of occurrences of substructure t in x. [sent-149, score-0.233]
76 Noticing that a decision stump can be expressed as h_{⟨t,y⟩}(x) = y · (2I(t ⊆ x) − 1), we see that the constraints, or the feature space, of Boosting with substructure-based decision stumps are essentially the same as those of SVMs with the convolution kernel (footnote 3). [sent-153, score-0.868]
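The correspondence can be made explicit. The sketch below writes out the binary variant of the convolution-kernel feature map and the sparse linear function that Boosting learns over it; `F`, `is_subgraph`, and the model layout are illustrative assumptions:

```python
def phi(x, F, is_subgraph):
    # Binary version of Phi(x) = (#(t_1 in x), ..., #(t_|F| in x)):
    # existence instead of cardinality (cf. footnote 3 below).
    return [1 if is_subgraph(t, x) else 0 for t in F]

def boosted_score(x, model, is_subgraph):
    # f(x) = sum_k alpha_k * y_k * (2 * I(t_k in x) - 1): a sparse linear
    # function over the same feature space as the convolution kernel.
    return sum(a * y * (2 * int(is_subgraph(t, x)) - 1)
               for a, t, y in model)
```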
77 The classification complexity of SVMs with the tree kernel is O(l·|n_1|·|n_2|), where |n_1| and |n_2| are the sizes of the two trees being compared and l is the number of support vectors, which is too heavy for real applications. [sent-170, score-0.201]
78 Boosting, in contrast, classifies faster, since its complexity depends only on a small number of decision stumps. [sent-171, score-0.108]
79 Second, sparse hypotheses are useful in practice as they provide “transparent” models with which we can analyze how the model performs or what kind of features are useful. [sent-172, score-0.126]
80 It is difficult to give such an analysis with kernel methods, since they define the feature space implicitly. [sent-173, score-0.169]
81 (1) Cellphone review classification (REV) The goal of this task is to classify reviews for cellphones as positive or negative. [sent-175, score-0.115]
82 Each sentence is represented as a word-based dependency tree using the Japanese dependency parser CaboCha (footnote 4). [sent-177, score-0.127]
83 (2) Toxicology prediction of chemical compounds (PTC) The task is to classify chemical compounds by carcinogenicity. [sent-178, score-0.462]
84 We used the PTC data set (footnote 5) consisting of 417 compounds with 4 types of test animals: male mouse (MM), female mouse (FM), male rat (MR), and female rat (FR). Footnote 2: Strictly speaking, the graph kernel [6] is not a convolution kernel, because it is based not on the count of subgraphs but on random walks in a graph. [sent-179, score-0.699]
85 Footnote 3: The difference between decision stumps and the convolution kernels is that the former use a binary feature denoting the existence (or absence) of each substructure, whereas the latter use the cardinality of each substructure. [sent-180, score-0.687]
86 However, this makes little difference in practice, since a given graph is often sparse and the cardinality of a substructure is well approximated by its existence. [sent-181, score-0.253]
87 We compared the performance of our Boosting algorithm and support vector machines with tree kernel [2, 5] (for REV) and graph kernel [6] (for PTC) according to their F-score in 5-fold cross validation. [sent-210, score-0.416]
88 Hyperparameters include the maximum number of Boosting iterations, the soft-margin parameter of SVMs, and the termination probability of random walks in the graph kernel [6]. [sent-213, score-0.357]
89 In most tasks and categories, ML algorithms with structural features outperform the baseline systems (BOL). [sent-215, score-0.105]
90 These results support our first intuition that structural features are important for the classification of structured data, such as natural language texts and chemical compounds. [sent-216, score-0.321]
91 However, in the PTC task, our method outperforms SVMs using graph kernel on the categories MM, FM, and FR at a statistically significant level. [sent-218, score-0.215]
92 With our method, about 1800 and 50 features (subgraphs) are used in the REV and PTC tasks respectively, while the potential number of features is quite large. [sent-220, score-0.112]
93 Even when given all subgraphs as feature candidates, Boosting selects a small and highly relevant subset of features. [sent-221, score-0.272]
94 These features are interesting because we cannot determine the correct label (positive/negative) using only such bag-of-label features as “charging,” “short,” or “long.” [sent-224, score-0.112]
95 We cannot perform such an analysis with kernel methods, since they define their feature space implicitly. [sent-233, score-0.169]
96 The reason why the graph kernel shows poor performance on the PTC data set is that it cannot identify subtle differences between two graphs, because it is based on random walks in a graph. [sent-234, score-0.408]
97 For example, the kernel dot product between the similar but different structures in Figures 3(c) and 3(d) becomes quite large, although they show different behavior. [sent-235, score-0.134]
98 To classify chemical compounds by their functions, the system must be capable of capturing subtle differences among given graphs. [sent-236, score-0.274]
99 The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are applied as the weak learners. [sent-245, score-0.927]
100 Two experiments confirm the importance of subgraph features. [sent-246, score-0.123]
wordName wordTfidf (topN-words)
[('dfs', 0.408), ('boosting', 0.384), ('stumps', 0.302), ('ptc', 0.223), ('subgraphs', 0.208), ('gain', 0.196), ('svms', 0.175), ('di', 0.159), ('rev', 0.156), ('code', 0.133), ('yi', 0.131), ('convolution', 0.128), ('graphs', 0.127), ('subgraph', 0.123), ('compounds', 0.116), ('graph', 0.11), ('decision', 0.108), ('kernel', 0.105), ('chemical', 0.099), ('labeled', 0.099), ('tree', 0.096), ('xi', 0.079), ('isomorphic', 0.078), ('substructures', 0.078), ('supergraph', 0.078), ('search', 0.073), ('prune', 0.071), ('margin', 0.068), ('bol', 0.067), ('substructure', 0.067), ('cache', 0.066), ('feature', 0.064), ('classi', 0.06), ('traverse', 0.058), ('features', 0.056), ('def', 0.055), ('stump', 0.053), ('yan', 0.053), ('enumerate', 0.05), ('structural', 0.049), ('weak', 0.048), ('kernels', 0.047), ('fm', 0.047), ('rule', 0.047), ('argmaxt', 0.045), ('cellphone', 0.045), ('cellphones', 0.045), ('hisashi', 0.045), ('recharging', 0.045), ('tthms', 0.045), ('mm', 0.045), ('fr', 0.045), ('texts', 0.045), ('hypotheses', 0.043), ('canonical', 0.04), ('walks', 0.039), ('boostexter', 0.039), ('kashima', 0.039), ('lexicographic', 0.039), ('cardinality', 0.038), ('reviews', 0.038), ('structured', 0.038), ('upper', 0.037), ('connected', 0.036), ('mouse', 0.035), ('nara', 0.035), ('soft', 0.035), ('robert', 0.035), ('bound', 0.034), ('language', 0.034), ('accuracies', 0.033), ('female', 0.033), ('subtrees', 0.033), ('suboptimal', 0.033), ('maintain', 0.032), ('hyperplane', 0.032), ('ml', 0.032), ('classify', 0.032), ('implicitly', 0.032), ('represented', 0.031), ('rat', 0.031), ('eij', 0.031), ('codes', 0.03), ('instances', 0.03), ('adaboost', 0.03), ('dl', 0.03), ('yoav', 0.03), ('structures', 0.029), ('male', 0.028), ('sparse', 0.027), ('labels', 0.027), ('argmin', 0.027), ('subtle', 0.027), ('icml', 0.027), ('learner', 0.027), ('mr', 0.026), ('margins', 0.026), ('proposal', 0.026), ('find', 0.025), ('cation', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 19 nips-2004-An Application of Boosting to Graph Classification
Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto
Abstract: This paper presents an application of Boosting to the classification of labeled graphs, general structures for modeling many kinds of real-world data, such as chemical compounds, natural language texts, and biological sequences. The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments, using natural language data and chemical compounds, show that our method achieves performance comparable to or better than that of SVMs with convolution kernels while also improving testing efficiency. 1
2 0.16834593 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse
Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1
3 0.15837646 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
Author: Antonio Torralba, Kevin P. Murphy, William T. Freeman
Abstract: We seek to both detect and segment objects in images. To exploit both local image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which uses Boosting to learn the graph structure and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. 1
4 0.1347885 141 nips-2004-Optimal sub-graphical models
Author: Mukund Narasimhan, Jeff A. Bilmes
Abstract: We investigate the problem of reducing the complexity of a graphical model (G, PG) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H ∈ H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k − 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1
5 0.12314686 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
Author: Ligen Wang, Balázs Kégl
Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve AdaBoost by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use AdaBoost's efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1
6 0.11702821 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
7 0.115391 34 nips-2004-Breaking SVM Complexity with Cross-Training
8 0.11516642 178 nips-2004-Support Vector Classification with Input Data Uncertainty
9 0.11279693 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
10 0.098256677 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
11 0.095849283 165 nips-2004-Semi-supervised Learning on Directed Graphs
12 0.091431491 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
13 0.084152162 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
14 0.083542578 92 nips-2004-Kernel Methods for Implicit Surface Modeling
15 0.08252579 115 nips-2004-Maximum Margin Clustering
16 0.080820523 168 nips-2004-Semigroup Kernels on Finite Sets
17 0.079722829 23 nips-2004-Analysis of a greedy active learning strategy
18 0.079664692 44 nips-2004-Conditional Random Fields for Object Recognition
19 0.077544957 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
20 0.077112958 70 nips-2004-Following Curved Regularized Optimization Solution Paths
topicId topicWeight
[(0, -0.231), (1, 0.111), (2, -0.029), (3, 0.161), (4, 0.046), (5, 0.074), (6, 0.126), (7, 0.006), (8, -0.154), (9, 0.025), (10, -0.087), (11, -0.073), (12, -0.075), (13, 0.053), (14, 0.03), (15, 0.027), (16, 0.12), (17, -0.006), (18, 0.112), (19, 0.064), (20, 0.042), (21, -0.014), (22, -0.024), (23, 0.073), (24, -0.04), (25, -0.008), (26, 0.049), (27, -0.138), (28, 0.05), (29, -0.16), (30, -0.025), (31, -0.002), (32, 0.002), (33, -0.131), (34, -0.124), (35, 0.091), (36, 0.072), (37, -0.045), (38, 0.14), (39, -0.11), (40, -0.065), (41, 0.095), (42, -0.126), (43, -0.019), (44, -0.063), (45, -0.008), (46, 0.158), (47, -0.04), (48, -0.026), (49, -0.093)]
simIndex simValue paperId paperTitle
same-paper 1 0.93296266 19 nips-2004-An Application of Boosting to Graph Classification
Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto
Abstract: This paper presents an application of Boosting to the classification of labeled graphs, general structures for modeling many kinds of real-world data, such as chemical compounds, natural language texts, and biological sequences. The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments, using natural language data and chemical compounds, show that our method achieves performance comparable to or better than that of SVMs with convolution kernels while also improving testing efficiency. 1
2 0.72408843 141 nips-2004-Optimal sub-graphical models
Author: Mukund Narasimhan, Jeff A. Bilmes
Abstract: We investigate the problem of reducing the complexity of a graphical model (G, PG ) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H ∈ H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k − 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries The complexity of inference in graphical models is typically exponential in some parameter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability distribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graphical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph. 1.1 Notation and Terminology A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S ⊆ V such that {a, b} ∈ E for all ∗ This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant. {b, c, d} d {c, f, g} {b, c} {b, e, c} b c {f, c} {c, e} {e, c, f } g {b, e} a e f {a, b, e} Figure 1: A triangulated graph G and a junction-tree for G a, b ∈ S. A clique S is maximal if S is not properly contained in another clique. If α and β are non-adjacent vertices of G then a set of vertices S ⊆ V \ {α, β} is called an (α, β)-separator if α and β are in distinct components of G[V \ S]. S is a minimal (α, β)-separator if no proper subset of S is an (α, β)-separator. S is said to be a minimal separator if S is a minimal (α, β)-separator for some non adjacent a, b ∈ V . If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximalcliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d} , {a, b, e} , {b, e, c} , {e, c, f } , {c, f, g}}. 
The links S of T correspond to minimal-separators of G in the following way. If Vi Vj ∈ S (where Vi , Vj ∈ K and hence are cliques of G), then Vi ∩ Vj = φ. We label each edge Vi Vj ∈ S with the set Vij = Vi ∩ Vj , which is a non-empty complete separator in G. The removal of any link Vi Vj ∈ S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi ). We will let K(i) be the nodes of T (i) , and V (i) = ∪V ∈K (i) V be the set of vertices corresponding to the subtree T (i) . The junction tree property ensures that V (i) ∩ V (j) = Vi ∩ Vj = Vij . We will let G(i) be the subgraph induced by V (i) . A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1 , X2 , . . . , Xn , and G is a graph with vertex set V (G) = {X1 , X2 , . . . , Xn } such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P . The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG ) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG ) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model. 2 Graph Decompositions and Divide-and-Conquer Algorithms For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if Vi Vj ∈ S, then the removal {b, c, d} d {b, c} {b, e, c} b c c {c, f, g} {f, c} {e, c, f } g {b, e} a e e f {a, b, e} Figure 2: The graphs G(i) , G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e} of Vij = Vi ∩ Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting subgraphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum independent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). 
We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges or a bound on the tree-width, and which optimize separable objective functions (described in Section 2.1). One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this: if Vij is a separator decomposing G into G(i) and G(j), then a divide-and-conquer approach would suggest that we try to find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G(i) and G(j) need not necessarily match up on Vij, the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated).

To rectify the situation, we can do the following. We partition the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph of G(j) corresponding to the same class are compatible, in the sense that they match up on their intersection, namely Vij, and so that by splicing the two subgraphs together we get a subgraph of G which is acceptable (and in particular is triangulated). Then, given optimal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a.k.a. dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V|) time as long as we have only a polynomial number of classes. Now, the classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of as the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here: if |Vij| = k, then in any subgraph H of G, H[Vij] must be one of the 2^(k(k−1)/2) possible subgraphs of G[Vij]. So, if k is sufficiently small (so that 2^(k(k−1)/2) is bounded by some polynomial in |V|), this procedure results in a polynomial-time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant.

2.1 Separable objective functions

For cases where exact inference in the graphical model (G, PG) is intractable, it is natural to try to find a subgraphical model (H, PH) such that D(PG ‖ PH) is minimized and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG ‖ PH). Hence, finding an optimal subgraphical model is equivalent to finding a subgraph H for which D(PG ‖ PH) is minimized.
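A toy illustration (our own sketch; the helper name is hypothetical) of why a bounded separator yields few classes — any acceptable subgraph must induce one of finitely many edge sets on Vij:

    from itertools import combinations

    def separator_classes(sep):
        """All 2^(k(k-1)/2) possible edge sets H[Vij] over a k-vertex separator."""
        pairs = list(combinations(sorted(sep), 2))
        for r in range(len(pairs) + 1):
            for subset in combinations(pairs, r):
                yield frozenset(subset)

    print(len(list(separator_classes({"c", "e", "f"}))))  # k = 3 -> 2**3 = 8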
If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G(i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criterion also decomposes along the separator Vij. Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V(i) and V(j) respectively, and let PH(i) and PH(j) be the (marginalized) distributions of PH on V(i) and V(j), where H(i) = H[V(i)] and H(j) = H[V(j)]. The following result assures us that the KL-divergence also factors according to the separator Vij.

Lemma 1. Suppose that (G, PG) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then

D(PG ‖ PH) = D(PG(i) ‖ PH(i)) + D(PG(j) ‖ PH(j)) − D(PG[Vij] ‖ PH[Vij]).

Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a separator of H. Therefore,

PH({Xv}v∈V) = PH(i)({Xv}v∈V(i)) · PH(j)({Xv}v∈V(j)) / PH[Vij]({Xv}v∈Vij).

The result follows immediately.

Therefore, there is hope that we can reduce our original problem of finding an optimal subgraph H ∈ H to one of finding subgraphs H(i) ⊆ G(i) and H(j) ⊆ G(j) that are compatible, in the sense that they match up on the overlap Vij, and for which D(PG ‖ PH) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they “separate” in the sense that, for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i)) + r(G(j)) if G has a separator G(i) ∩ G(j). Therefore, the representation cost reduction would satisfy r(G) − r(H) = (r(G(i)) − r(H(i))) + (r(G(j)) − r(H(j))), and so it also factors according to the separators. Finally, note that any linear combination of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss, for example). In Section 4 we discuss some issues regarding computing this function.

2.2 Decompositions and decomposition trees

For the algorithms considered in this paper, we will be mostly interested in the decompositions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche’s dtrees [6], which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and how it is used to solve a number of optimization problems.

Figure 3: The separator tree corresponding to Figure 1.

A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junction-tree T = (K, S) for G. The interior nodes of the decomposition tree R(T) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G).
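Spelling out the step the proof leaves implicit (our own derivation sketch; it additionally uses the analogous factorization of PG over the separator, shown after the junction-tree definitions above):

    $$
    D(P_G \,\|\, P_H)
      = \mathbb{E}_{P_G}\Big[\log \tfrac{P_{G^{(i)}}}{P_{H^{(i)}}}\Big]
      + \mathbb{E}_{P_G}\Big[\log \tfrac{P_{G^{(j)}}}{P_{H^{(j)}}}\Big]
      - \mathbb{E}_{P_G}\Big[\log \tfrac{P_{G[V_{ij}]}}{P_{H[V_{ij}]}}\Big],
    $$

and each log-ratio depends only on the variables in V(i), V(j), or Vij, so each expectation reduces to a KL-divergence between the corresponding marginals, giving the three terms of Lemma 1.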
R(T) can be recursively constructed from T as follows: if T consists of just one node K (and hence no links), then R consists of just one node, which is given the label K as well. If, however, T has more than one node, then T must contain at least one link. To begin, let ViVj ∈ S be any link in T. The removal of the link ViVj results in two disjoint junction-trees T(i) and T(j). We label the root of R with the decomposition (V(i); Vij; V(j)). The rest of R is recursively built by successively picking links of T(i) and T(j) (decompositions of G(i) and G(j)) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with each interior node is shown inside the node. Let M be the set of all nodes of R(T). For an interior node M induced by the link ViVj ∈ S of T, we will let M(i) and M(j) represent the left and right children of M, and R(i) and R(j) the left and right subtrees below M.

3 Finding optimal subgraphical models

3.1 Optimal sub-(k − 1)-trees of k-trees

Suppose that G is a k-tree. A sub-(k − 1)-tree of G is a subgraph H of G that is a (k − 1)-tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V(i) and V(j) respectively. It is clear that the induced subgraphs H[V(i)] and H[V(j)] are subgraphs of G(i) and G(j) and are partial (k − 1)-trees. We will be interested in finding sub-(k − 1)-trees of k-trees; this problem is trivial by the result of [1] when k = 2, so we assume that k ≥ 3. The following result characterizes the various possibilities for H[Vij] in this case.

Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corresponding to the link ij of the junction-tree T. In any (k − 1)-tree H ⊆ G, either

1. there is a u ∈ S such that u is not connected to vertices in both V(i) \ S and V(j) \ S; then S \ {u} is a minimal separator in H and hence is complete; or

2. every vertex in S is connected to vertices in both V(i) \ S and V(j) \ S; then there are vertices {x, y} ⊆ S such that H[S] is missing only the edge {x, y}. Further, either H[V(i)] or H[V(j)] does not contain an unchorded x–y path.

Proof. We consider two possibilities. In the first, there is some vertex u ∈ S such that u is not connected to vertices in both V(i) \ S and V(j) \ S. Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k − 1)-tree, all minimal separators of H must contain k − 1 vertices, and this separator must therefore be S \ {u}. This corresponds to case (1) above; clearly this possibility can occur. If there is no such u ∈ S, then every vertex in S is connected to vertices in both V(i) \ S and V(j) \ S. If x ∈ S is connected to some yi ∈ V(i) \ S and yj ∈ V(j) \ S, then x is contained in every minimal (yi, yj)-separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k − 1 vertices, there must be at least two distinct minimal separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w ∈ S, with z ≠ w and {z, w} ≠ {x, y} (as sets), then either {z, w} ⊂ Sy or {z, w} ⊂ Sx. Since both Sx and Sy are complete in H, this edge must be present in H.
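A compact sketch of this recursive construction (our own rendering, with an assumed data layout: the junction tree is a networkx graph whose nodes are frozensets of vertices):

    import networkx as nx

    def decomposition_tree(T):
        """Return a leaf clique, or (separator, left subtree, right subtree)."""
        if T.number_of_nodes() == 1:
            (clique,) = T.nodes               # leaf: a single maximal clique
            return clique
        Vi, Vj = next(iter(T.edges))          # pick any link ViVj of T
        T2 = T.copy()
        T2.remove_edge(Vi, Vj)                # removal splits T into T(i), T(j)
        comp_i, comp_j = (T2.subgraph(c).copy()
                          for c in nx.connected_components(T2))
        if Vi not in comp_i:                  # orient so T(i) contains Vi
            comp_i, comp_j = comp_j, comp_i
        sep = Vi & Vj                         # the separator Vij = Vi ∩ Vj
        return (sep, decomposition_tree(comp_i), decomposition_tree(comp_j))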
The edge {x, y} is not present in H[S], because all minimal separators in H must be of size k − 1. Further, if both V(i) and V(j) contain an unchorded path between x and y, then by joining the two paths at x and y we get an unchorded cycle in H, which contradicts the fact that H is triangulated.

Therefore, we may associate 2k + k(k − 1) constraints with each separator Vij of G, as follows. There are 2k possible constraints corresponding to case (1) above (one for each choice of x and each side), and k(k − 1) = 2·(k choose 2) choices corresponding to case (2) above: for each pair {x, y} corresponding to the missing edge, either V(i) contains no unchorded x–y paths or V(j) contains no unchorded x–y paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = {(x, y, s) : x ∈ S, y ∈ S, s ∈ {i, j}}. If y = x, then this corresponds to case (1) of the above lemma: if s = i, then x is connected only to H(i), and if s = j, then x is connected only to H(j). If y ≠ x, then this corresponds to case (2) of the above lemma: if s = i, then H(i) does not contain any unchorded path between x and y, and there is no constraint on H(j); similarly, if s = j, then H(j) does not contain any unchorded path between x and y, and there is no constraint on H(i). Now suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) respectively. It is clear that if H(i) and H(j) both satisfy the same constraint, they must match up on the common vertices Vij. Therefore, to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated.

Lemma 3. Suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) respectively, such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H(i) and H(j) together is triangulated.

Proof. Suppose that H(i) and H(j) are both triangulated and satisfy the same constraint. If both satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H(i) and H(j). Therefore, there must be two vertices of S \ {u} on the cycle, and hence this cycle has a chord, as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t ≥ 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H(i) and H(j) are individually triangulated, the cycle must contain x, y, and vertices of both V(i) \ S and V(j) \ S. We may split this unchorded cycle into two unchorded paths, one contained in V(i) and one in V(j), thus violating our assumption that both H(i) and H(j) satisfy the same constraint.

If |S| = k, then there are 2k + k(k − 1) ∈ O(k²) ⊆ O(n²) constraints per separator. We can use a divide-and-conquer strategy to find the optimal sub-(k − 1)-tree once we have taken care of the base case, in which G is just a single clique (of k + 1 elements). For this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge is a (k − 1)-tree, and every sub-(k − 1)-tree results from this operation. Therefore, the optimal sub-(k − 1)-tree can be found using Algorithm 1, and in this case the complexity of Algorithm 1 is O(n(k + 1)²).
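A quick sanity check of this count (our own sketch; the names are ours, not the paper's):

    from itertools import product

    def constraints(S):
        """Enumerate CM = {(x, y, s)}, merging (x, y) and (y, x)."""
        return {(frozenset((x, y)), s)
                for x, y, s in product(S, S, ("i", "j"))}

    S = {"c", "e", "f"}                                  # k = 3
    k = len(S)
    assert len(constraints(S)) == 2 * k + k * (k - 1)    # 6 + 6 = 12 constraints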
This procedure can be generalized to find the optimal sub-(k − d)-tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore, for small fixed values of d, we can compute the optimal sub-(k − d)-tree of G. While we can compute (k − d)-trees of G greedily, by first going from a k-tree to a (k − 1)-tree, then from a (k − 1)-tree to a (k − 2)-tree, and so on, this will not be optimal in general. However, this might be a good algorithm to try when d is large.

3.2 Optimal triangulated subgraphs with |E(G)| − d edges

Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges than G does. That is, we want to find an optimal subgraph H ⊂ G such that |E(H)| = |E(G)| − d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (if d < |E(G)|). Two possibilities for finding such an optimal subgraph are:

1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each stage, the edge is picked from the set of edges whose deletion leaves a triangulated graph, and among these, the edge which causes the least increase in KL-divergence is chosen.

2. For each possible subset A of E(G) of size d whose deletion leaves a triangulated graph, compute the KL-divergence using the formula above, and then pick the optimal one. Since there are C(|E(G)|, d) such sets, this can be done in time polynomial in |V(G)| when d is a constant.

The first, greedy algorithm is not guaranteed to yield the optimal solution. The second takes time that is O(n^(2d)). Now let us solve this problem using the framework we have described. Let H be the set of subgraphs of G which may be obtained by deleting d edges. For each M = ij ∈ M corresponding to the separator Vij, let CM = {(l, r, c, s, A) : l + r − c = d, s a d-bit string, A a set of c edges of G[Vij]}. The constraint represented by (l, r, c, s, A) is this: A is the set of edges of G[Vij] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph; c represents the double count, and so is subtracted from the total. If k is the size of the largest clique, then the total number of such constraints is bounded by 2d · 2^d · C(C(k, 2), d) ∈ O(k^(2d)), which could be better than O(n^(2d)) and is polynomial in |V| when d is constant. See [10] for additional details.
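To see the gain concretely, a small numeric check (our own arithmetic, using the standard edge count of a k-tree) comparing the brute-force subset count with the separator-based constraint bound:

    from math import comb

    n, k, d = 100, 5, 2
    num_edges = n * k - comb(k + 1, 2)   # a k-tree on n vertices has nk - C(k+1,2) edges
    brute_force = comb(num_edges, d)                     # O(n^(2d)) subsets to test
    per_separator = 2 * d * 2**d * comb(comb(k, 2), d)   # O(k^(2d)) constraints
    print(brute_force, per_separator)                    # 117370 vs. 720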
4 Conclusions

Algorithm 1 will compute the optimal H ∈ H for the two examples discussed above, and it is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which allows finding the optimal solution for other classes of subgraphical models. We assume an oracle model for computing KL-divergences of probability distributions on vertex sets of cliques. These KL-divergences can clearly be computed efficiently for distributions such as Gaussians, but for discrete distributions this may not be possible when k is large. However, even in this case the algorithm makes only polynomially many calls to the oracle. The standard algorithm [3], which is exponential in the treewidth, makes O(2^k) calls to this oracle. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive, as it yields an exponential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution using the same algorithm.

    R ← separator-tree for G;
    for each vertex M of R in order of increasing height (bottom up) do
        for each constraint cM of M do
            if M is an interior vertex of R corresponding to edge ij of the junction tree then
                Let Ml and Mr be the left and right children of M;
                Pick constraint cl ∈ CMl compatible with cM to minimize table[Ml, cl];
                Pick constraint cr ∈ CMr compatible with cM to minimize table[Mr, cr];
                loss ← D(PG[M] ‖ PH[M]);
                table[M, cM] ← table[Ml, cl] + table[Mr, cr] − loss;
            else
                table[M, cM] ← D(PG[M] ‖ PH[M]);
            end
        end
    end
Algorithm 1: Finding an optimal set of constraints.

References

[1] C. Chow and C. Liu. “Approximating discrete probability distributions with dependence trees.” IEEE Transactions on Information Theory, v. 14, 1968, pp. 462–467.
[2] F. Gavril. “Algorithms on clique separable graphs.” Discrete Mathematics, v. 9, 1977, pp. 159–165.
[3] R. E. Tarjan. “Decomposition by clique separators.” Discrete Mathematics, v. 55, 1985, pp. 221–232.
[4] U. Kjaerulff. “Reduction of computational complexity in Bayesian networks through removal of weak dependencies.” Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, pp. 374–382, 1994.
[5] T. Kloks. “Treewidth: Computations and Approximations.” Springer-Verlag, 1994.
[6] A. Darwiche and M. Hopkins. “Using recursive decomposition to construct elimination orders, jointrees and dtrees.” Technical Report D-122, Computer Science Dept., UCLA.
[7] S. Lauritzen. “Graphical Models.” Oxford University Press, Oxford, 1996.
[8] T. A. McKee and F. R. McMorris. “Topics in Intersection Graph Theory.” SIAM Monographs on Discrete Mathematics and Applications, 1999.
[9] D. Karger and N. Srebro. “Learning Markov networks: maximum bounded tree-width graphs.” In Symposium on Discrete Algorithms, 2001, pp. 391–401.
[10] M. Narasimhan and J. Bilmes. “Optimization on separator-clique trees.” Technical Report UWEETR-2004-10, June 2004.
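To make Algorithm 1 above concrete, a runnable rendering (our own Python sketch; `constraints`, `compatible`, and `kl_loss` are assumed interfaces standing in for CM, the compatibility check, and the KL-divergence oracle D(PG[M] ‖ PH[M])):

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        children: list = field(default_factory=list)   # [] for leaf cliques

    def optimal_constraint(root, constraints, compatible, kl_loss):
        table = {}

        def bottom_up(node):                 # children before parents
            for child in node.children:
                yield from bottom_up(child)
            yield node

        for M in bottom_up(root):
            for cM in constraints(M):
                if M.children:               # interior node: splice children
                    Ml, Mr = M.children
                    left = min((table[Ml, cl] for cl in constraints(Ml)
                                if compatible(cM, cl)), default=float("inf"))
                    right = min((table[Mr, cr] for cr in constraints(Mr)
                                 if compatible(cM, cr)), default=float("inf"))
                    # subtract the separator term, exactly as in Lemma 1
                    table[M, cM] = left + right - kl_loss(M, cM)
                else:                        # leaf clique: base case
                    table[M, cM] = kl_loss(M, cM)

        best = min(constraints(root), key=lambda c: table[root, c])
        return best, table[root, best]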
3 0.61139673 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
Author: Antonio Torralba, Kevin P. Murphy, William T. Freeman
Abstract: We seek to both detect and segment objects in images. To exploit both local image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which use Boosting to learn the graph structure and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. 1
4 0.56431001 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse
Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1
5 0.49689454 165 nips-2004-Semi-supervised Learning on Directed Graphs
Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf
Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that end, we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1
6 0.46071112 100 nips-2004-Learning Preferences for Multiclass Problems
7 0.43361568 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
8 0.42560118 34 nips-2004-Breaking SVM Complexity with Cross-Training
9 0.42114121 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
10 0.41463241 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
11 0.41257894 177 nips-2004-Supervised Graph Inference
12 0.40600133 44 nips-2004-Conditional Random Fields for Object Recognition
13 0.38953581 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
14 0.3857837 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
15 0.374116 82 nips-2004-Incremental Algorithms for Hierarchical Classification
16 0.36986029 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
17 0.3677156 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
18 0.3669934 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
19 0.3641434 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
20 0.35941881 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
topicId topicWeight
[(13, 0.074), (15, 0.154), (17, 0.016), (26, 0.077), (31, 0.018), (32, 0.015), (33, 0.181), (35, 0.011), (39, 0.027), (50, 0.027), (76, 0.01), (87, 0.012), (94, 0.298)]
simIndex simValue paperId paperTitle
1 0.88556755 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
Author: Fei Sha, Lawrence K. Saul
Abstract: An auditory “scene”, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.
2 0.80621493 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini
Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem or that theoretical results in learning theory can be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and the optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1
same-paper 3 0.80553782 19 nips-2004-An Application of Boosting to Graph Classification
Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto
Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1
4 0.76983047 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
5 0.67095822 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1
6 0.65774804 68 nips-2004-Face Detection --- Efficient and Rank Deficient
7 0.64973658 161 nips-2004-Self-Tuning Spectral Clustering
8 0.64962786 178 nips-2004-Support Vector Classification with Input Data Uncertainty
9 0.64941978 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
10 0.64913446 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
11 0.64885122 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
12 0.64751315 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
13 0.64748412 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
14 0.64747614 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
15 0.64605182 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
16 0.64577812 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
17 0.64534605 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
18 0.64532721 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
19 0.64411587 69 nips-2004-Fast Rates to Bayes for Kernel Machines
20 0.64344448 131 nips-2004-Non-Local Manifold Tangent Learning