nips nips2001 nips2001-190 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. [sent-7, score-1.048]
2 By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. [sent-8, score-0.782]
3 This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. [sent-9, score-0.394]
4 We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection. [sent-10, score-0.279]
5 In such problems, the goal is generally that of characterizing the principal dependencies in the data, a problem which is often cast within the framework of multivariate density estimation. [sent-12, score-0.086]
6 Simple models are often preferred in this setting, both for their computational tractability and their relative immunity to overfitting. [sent-13, score-0.1]
7 A number of methods have been developed for selecting models of higher-order dependencies in data, either within the maximum entropy setting—in which features are selected [9, 16]—or within the graphical model setting—in which edges are selected [8]. [sent-18, score-0.423]
8 Simplicity also plays an important role in the design of these algorithms; in particular, greedy methods that add or subtract a single feature or edge at a time are generally employed. [sent-19, score-0.208]
9 In the current paper we describe a methodology that can be viewed as a generalization of the Chow-Liu algorithm for constructing tree models [2]. [sent-21, score-0.338]
10 Note that tree models have the property that their junction trees have no more than two nodes in any clique—the treewidth of tree models is one. [sent-22, score-1.513]
11 In our generalization, we allow the treewidth to be a larger, but still controlled, value. [sent-23, score-0.471]
12 We fit data within the space of models having “thin” junction trees. [sent-24, score-0.449]
13 Models with thin junction trees are tractable for exact inference; indeed, the complexity of any type of inference (joint, marginal, conditional) is controlled by the upper bound that is imposed on the treewidth. [sent-25, score-0.876]
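As a standard junction-tree fact (our own addition, assuming discrete variables with at most d states each and a treewidth bound k; none of these symbols appear in the text), the cost of one full propagation pass over the cliques of the triangulated graph is

\[
O\Big(\sum_{C \in \mathcal{C}} d^{\,|C|}\Big) \;\le\; O\big(|\mathcal{C}|\, d^{\,k+1}\big),
\]

so bounding the treewidth directly bounds the cost of joint, marginal, and conditional queries.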
14 Partially observed data (e.g., occluded digits) are handled in a simple and direct way—we simply marginalize away the unobserved variables, an operation which is tractable in our models. [sent-29, score-0.092]
15 We illustrate this capability in a study of handwritten digit recognition in Section 4.2, where we compare thin junction trees and support vector machines (SVMs), a discriminative technique which does not come equipped with a simple and principled method for handling partially observed data. [sent-30, score-0.159] [sent-31, score-0.84]
17 As we will see, thin junction trees are quite robust to missing data in this domain. [sent-32, score-0.84]
18 In particular, tree models come equipped with particularly efficient algorithms for parameter estimation and model selection—algorithms which do not generalize readily to non-tree models, including thin junction tree models. [sent-34, score-1.25]
19 1 Feature induction. We assume an input space with variables and a target probability distribution. [sent-38, score-0.152]
20 Consider a vector-valued “feature” or “sufficient statistic”; its dimension is the dimensionality of the feature space. [sent-40, score-0.109]
21 The feature can also be thought of in terms of its components as a set of real-valued features. [sent-41, score-0.248]
22 We focus on exponential family distributions (also known as “Gibbs” or “maximum entropy” distributions) based on these features, defined by a parameter vector, a base measure (typically uniform), and a normalizing constant. [sent-42, score-0.144]
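In standard notation (the symbols below—theta for the parameter vector, f for the feature vector, p_0 for the base measure, and Z for the normalizing constant—are our own choice rather than the paper's), this family can be written as

\[
p_\theta(x) \;=\; \frac{1}{Z(\theta)}\, p_0(x)\, \exp\!\big(\theta^{\top} f(x)\big),
\qquad
Z(\theta) \;=\; \sum_{x} p_0(x)\, \exp\!\big(\theta^{\top} f(x)\big).
\]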
23 (Section 3 considers the closely-related problem of inducing edges rather than features). [sent-43, score-0.104]
24 Each feature is a function of a certain subset of variables, and we let an index set denote the subset of variables referred to by each feature. [sent-46, score-0.256]
25 Let us consider the undirected graphical model whose edges are all pairs of variables that appear together in the support of at least one feature. [sent-47, score-0.436]
26 With this definition, the feature supports are contained in the maximal cliques of the graph and, if the reference distribution is decomposable in this graph, the exponential family distribution with these features and reference distribution is also decomposable in this graph. [sent-48, score-0.328]
27 For each possible triangulation of the graph, we can define a junction tree [4], in which for every feature there exists a maximal clique containing its support. [sent-50, score-0.96]
28 The complexity of exact inference depends on the size of the maximal clique of the triangulated graph. [sent-51, score-0.448]
29 We define the treewidth of our original graph to be one less than the minimum possible value of this maximal clique size over all possible triangulations. [sent-52, score-0.798]
30 We say that a graphical model has a thin junction tree if its treewidth is small. [sent-53, score-1.353]
31 Our basic feature induction algorithm is a constrained variant of that proposed by [9]. [sent-54, score-0.265]
32 Given a set of available features, we perform a greedy search to find the set of features that enables the best possible fit to the target distribution, under the constraint of having a thin junction tree. [sent-55, score-0.75]
33 At each step, candidates are ranked according to the gain in KL divergence, with respect to the empirical distribution, that would be achieved by their addition to the current set of features. [sent-56, score-0.098]
34 Features that would generate a graphical cover with treewidth greater than a given upper bound are removed from the ranking. [sent-57, score-0.541]
35 The parameter values are held fixed during each step of the feature ranking process. [sent-58, score-0.217]
36 Once a set of candidate features is chosen, however, we reestimate all of the parameters (using the algorithm to be described in Section 2) and iterate. [sent-59, score-0.181]
37 FEATURE INDUCTION (algorithm box; its inputs include a set of available features). [sent-60, score-0.139]
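A minimal sketch of the greedy loop described above, not the authors' code; the helpers `gain`, `induced_treewidth_ok`, and `fit_parameters` are assumed, hypothetical functions passed in by the caller.

```python
def feature_induction(candidates, data, max_treewidth, n_rounds,
                      gain, induced_treewidth_ok, fit_parameters):
    """Greedy feature induction under a treewidth constraint (sketch).

    gain(f, model, data) ranks a candidate feature by its estimated
    KL improvement; induced_treewidth_ok(features, max_treewidth)
    checks a heuristic triangulation of the induced graph;
    fit_parameters(features, data) re-estimates all parameters.
    All three are assumed, hypothetical helpers.
    """
    selected, model = [], None
    for _ in range(n_rounds):
        # Keep only candidates whose addition leaves the junction tree thin.
        feasible = [f for f in candidates
                    if f not in selected
                    and induced_treewidth_ok(selected + [f], max_treewidth)]
        if not feasible:
            break
        # Rank feasible candidates with the current parameters held fixed.
        best = max(feasible, key=lambda f: gain(f, model, data))
        selected.append(best)
        # Re-estimate all parameters, then iterate.
        model = fit_parameters(selected, data)
    return selected, model
```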
38 In particular, as shown by [9], under these conditions we can rank a new feature by solving a polynomial equation whose degree is the number of values the feature can take minus one, and whose coefficients are expectations, under the current model, of functions of the feature. [sent-65, score-0.146]
39 When the feature is binary the process is even more efficient—the equation is linear and can be solved directly. [sent-67, score-0.109]
40 Consequently, with a single set of samples from the current model, we can rank many features very cheaply. [sent-68, score-0.213]
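For a binary candidate feature, the one-dimensional gain of [9] has a closed form: with q the empirical expectation of the feature and p its expectation under the current model (estimated from the samples), the best achievable improvement, with all other parameters held fixed, is the KL divergence between Bernoulli(q) and Bernoulli(p), and the stationarity condition is linear in exp(alpha). A small sketch; the function names are ours.

```python
import math

def binary_feature_gain(q_empirical, p_model, eps=1e-12):
    """Closed-form gain for adding one binary feature (sketch).

    q_empirical: E[f] under the empirical distribution.
    p_model:     E[f] under the current model, estimated from samples.
    Returns KL(Bernoulli(q) || Bernoulli(p)).
    """
    q = min(max(q_empirical, eps), 1.0 - eps)
    p = min(max(p_model, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def optimal_weight(q_empirical, p_model, eps=1e-12):
    """Weight alpha solving the equation that is linear in exp(alpha)."""
    q = min(max(q_empirical, eps), 1.0 - eps)
    p = min(max(p_model, eps), 1.0 - eps)
    return math.log(q * (1.0 - p) / (p * (1.0 - q)))
```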
41 For the feature elimination operation, algorithms exist that determine, in time linear in the number of nodes, whether a graph has treewidth smaller than a given bound, and if so output a triangulation in which all cliques are of correspondingly bounded size [1]. [sent-69, score-0.954]
42 In practice we have had success using fast heuristic triangulation methods [11] that allow us to guarantee the existence of a junction tree with a maximal clique no larger than the bound for a given model. [sent-71, score-0.96]
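One simple way to implement such a check (not the specific heuristic of [11]) is a min-fill triangulation heuristic such as the one shipped with networkx; the graph construction below is a sketch under that assumption.

```python
import itertools

import networkx as nx
from networkx.algorithms.approximation import treewidth_min_fill_in

def is_thin(feature_supports, max_treewidth):
    """Heuristically check that the graph induced by the feature supports
    admits a triangulation with treewidth <= max_treewidth (sketch).

    feature_supports: iterable of variable-index tuples, one per feature.
    The heuristic returns an upper bound on the true treewidth, so a True
    answer is a guarantee; a False answer may be overly conservative.
    """
    g = nx.Graph()
    for support in feature_supports:
        g.add_nodes_from(support)
        # Connect every pair of variables appearing in the same feature.
        g.add_edges_from(itertools.combinations(support, 2))
    width, _tree_decomposition = treewidth_min_fill_in(g)
    return width <= max_treewidth
```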
43 A critical bottleneck in the algorithm is the parameter estimation step, and it is important to develop a parameter estimation algorithm that exploits the bounded treewidth property. [sent-73, score-0.814]
44 2 Iterative Scaling using the junction tree. Fitting an exponential family distribution under expectation constraints is a well-studied problem; the basic technique is known as Iterative Scaling. [sent-75, score-0.695]
45 Algorithms that update the parameters in parallel have also been proposed; in particular the Generalized Iterative Scaling algorithm [6], which imposes the constraint that the features sum to one, and the Improved Iterative Scaling algorithm [9], which removes this constraint. [sent-77, score-0.223]
46 These algorithms have an important advantage in our setting in that, for each set of parameter updates, they only require computations of expectation that can all be estimated with a single set of samples from the current distribution. [sent-78, score-0.149]
47 To do so we exploit the bounded treewidth of our models. [sent-80, score-0.501]
48 We present a novel algorithm that uses the junction tree and the structure of the problem to speed up parameter estimation. [sent-81, score-0.679]
49 The algorithm generalizes to Gibbs distributions the “effective IPF” algorithm of [10]. [sent-82, score-0.084]
50 When working with a junction tree, an efficient way of performing Iterative Scaling is to update parameters block by block so that each update is performed for a relatively small number of features on a small number of variables. [sent-83, score-0.606]
51 Each block can be fit with any parameter estimation algorithm, in particular Improved Iterative Scaling (IIS). [sent-84, score-0.136]
52 The following algorithm exploits this idea by grouping the features whose supports are in the same clique of the triangulated graph. [sent-85, score-0.481]
53 Thus, parameter estimation is done in spaces of bounded dimension, and all the needed expectations can be evaluated cheaply. [sent-86, score-0.099]
54 Let the maximal cliques of the triangulated graph be given, each with an associated potential. [sent-92, score-0.339]
55 We assign each feature to one of the cliques that contains its support. [sent-93, score-0.255]
56 For each clique we consider the set of features assigned to it. [sent-94, score-0.312]
57 Replace the clique marginal by this distribution and add the resulting parameters (one for each feature assigned to the clique) to the corresponding entries of the parameter vector. [sent-101, score-0.109]
58 Moreover, each pass of the features through all the cliques is equivalent to one pass of Iterative Scaling, and therefore this algorithm converges to the maximum likelihood distribution. [sent-105, score-0.402]
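A skeleton of the block-by-block update; the junction-tree object and the helpers `clique_marginal` and `fit_block` are hypothetical interfaces assumed for illustration, not the authors' code.

```python
def blocked_iterative_scaling(junction_tree, clique_features, empirical,
                              clique_marginal, fit_block, n_sweeps=10):
    """Block Iterative Scaling over the cliques of the junction tree (sketch).

    clique_features: dict mapping each clique to the features assigned to it
                     (each feature's support lies inside that clique).
    clique_marginal: assumed helper returning the current model marginal on a
                     clique (one junction-tree propagation).
    fit_block:       assumed helper running Improved Iterative Scaling in the
                     small clique space, returning one increment per feature.
    """
    for _ in range(n_sweeps):
        for clique, features in clique_features.items():
            if not features:
                continue
            # Current model marginal on this clique (small, tractable space).
            marginal = clique_marginal(junction_tree, clique)
            # Fit only this block's parameters against the empirical targets.
            deltas = fit_block(features, marginal, empirical)
            # Fold the parameter increments into the clique potential.
            for feature, delta in zip(features, deltas):
                junction_tree.potentials[clique].add_feature_term(feature, delta)
    return junction_tree
```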
59 3 Edge induction. Thus far we have emphasized the exponential family representation. [sent-107, score-0.215]
60 Our algorithm can, however, be adapted readily to the problem of learning the structure of a graphical model. [sent-108, score-0.112]
61 This is achieved by using features that are indicators of subsets of variables, ensuring that there is one such indicator for every combination of values of the variables in a clique. [sent-109, score-0.209]
62 In particular, we evaluate an edge only in terms of the two variables associated directly with the edge. [sent-112, score-0.137]
63 The clique formed by the addition of the edge, however, may involve additional higher-order dependencies, which can be parameterized and incorporated in the model. [sent-113, score-0.173]
64 Evaluating edges in this way thus underestimates the potential gain in KL divergence. [sent-114, score-0.103]
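One way to implement this deliberately conservative score is to compare the empirical and current-model marginals on just the two endpoint variables; a sketch with assumed inputs follows.

```python
import math

def pairwise_edge_score(empirical_joint, model_joint, eps=1e-12):
    """Score a candidate edge by the KL divergence between the empirical and
    current-model joint distributions of its two endpoint variables (sketch).

    Both inputs are dicts mapping value pairs to probabilities. Because the
    clique created by adding the edge may support extra higher-order terms,
    this underestimates the true gain, as noted in the text.
    """
    score = 0.0
    for values, p_emp in empirical_joint.items():
        if p_emp <= eps:
            continue
        p_mod = max(model_joint.get(values, 0.0), eps)
        score += p_emp * math.log(p_emp / p_mod)
    return score
```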
65 Figure 1: (Left) Circular Boltzmann machine of treewidth 4. [sent-115, score-0.471]
66 We should not expect to be able to find an exact edge-selection method—recent work by Srebro [15] has shown that the related problem of finding the maximum likelihood graphical model with bounded treewidth is NP-hard. [sent-117, score-0.676]
67 1 Small graphs with known generative model. In this experiment we generate samples from a known graphical model and fit our model to the data. [sent-119, score-0.162]
68 We consider circular Boltzmann machines of known treewidth equal to 4 as shown in Figure 1. [sent-120, score-0.508]
69 Our networks all have 32 nodes and the weights were selected from a uniform distribution—so that each edge is significant. [sent-121, score-0.099]
70 For an increasing number of training samples, ten replications were performed for each case using our feature induction algorithm with maximum treewidth equal to 4. [sent-122, score-0.885]
71 2 MNIST digit dataset. In this section we study the performance of the thin junction tree method on the MNIST dataset of handwritten digits. [sent-125, score-1.087]
72 It is of interest to see how much performance loss we incur and how much robustness we gain by using a sophisticated generative model for this problem. [sent-127, score-0.092]
73 The MNIST training set is composed of 4-bit grayscale pixels that have been resized and cropped to binary images (an example is provided in the leftmost plots in Figure 2). [sent-128, score-0.267]
74 We used thin junction trees as density estimators in the 256-dimensional pixel space by training ten different models, one for each of the ten classes. [sent-129, score-1.0]
75 We used ten percent fractions of the training data for cross-validation and testing. [sent-132, score-0.166]
76 Density estimation: The leftmost plot in Figure 3 shows how increasing the maximal allowed treewidth, ranging from 1 (trees) to 15, enables a better fit to data. [sent-134, score-0.134]
77 Classification: We built classifiers from the bank of ten thin junction tree (“TJT”) models using one of the following strategies: (1) take the maximum likelihood among the ten models (TJT-ML), or (2) train a discriminative model using the outputs of the ten models. [sent-135, score-1.165] [sent-140, score-0.213]
78 Figure 2: Digit from the MNIST database. From left to right, original digit, cropped and resized digits used in our experiments, 50% of missing values, 75% of missing values, occluded digit. [sent-136, score-0.382]
79 Figure 3: (Left) Negative log likelihood for the digit 2 vs maximal allowed treewidth. [sent-137, score-0.248]
80 (Right) Error rate as a function of the percentage of erased pixels for the TJT classifier (plain) and a support vector machine (dotted). [sent-138, score-0.22]
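A minimal sketch of the TJT-ML rule, strategy (1) above, assuming each per-class density model exposes a `log_likelihood` method (an assumed interface) that marginalizes over any unobserved pixels in its junction tree.

```python
import numpy as np

def classify_tjt_ml(x_observed, class_models, log_priors=None):
    """Pick the class whose thin-junction-tree density assigns the highest
    (log) likelihood to the observed pixels (sketch).

    class_models: list of ten density models, one per digit class; each is
                  assumed to expose log_likelihood(x_observed), computed by
                  marginalizing missing pixels in the junction tree.
    log_priors:   optional per-class log priors (uniform if None).
    """
    scores = np.array([m.log_likelihood(x_observed) for m in class_models])
    if log_priors is not None:
        scores = scores + np.asarray(log_priors)
    return int(np.argmax(scores))
```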
82 It is important to emphasize that our models are tractable for full joint inference; indeed, the junction trees have a maximal clique size of 10 in the largest models we used on the ten classes. [sent-152, score-1.069]
83 Missing pixels: We ran an experiment in which pixels were chosen uniformly at random and erased, as shown in Figure 2. [sent-155, score-0.094]
84 In our generative model, we treat them as hidden variables that are marginalized out. [sent-156, score-0.137]
85 The rightmost plot in Figure 3 shows the error rate on the testing set as a function of the percentage of unknown pixels, for our models and for an SVM. [sent-157, score-0.099]
86 The best classification performance was achieved by replacing the missing values with the value of a blank pixel. [sent-159, score-0.138]
87 Note that very little performance decrement is seen for our classifier even with up to 50 percent of the pixels missing, while for the SVM, although performance is better at small percentages, it degrades more rapidly as the percentage of erased pixels increases. [sent-160, score-0.271]
88 Reconstruction: We conducted an additional experiment in which the upper halves of images were erased. [sent-161, score-0.085]
89 We ran the junction tree inference algorithm to fill in these missing values, choosing the maximizing value of the conditional probability (max-propagation). [sent-162, score-0.782]
90 3 SPLICE dataset. The task in this dataset is to classify splice junctions in DNA sequences. [sent-168, score-0.263]
91 The three different classes are: an EI junction exactly at the middle (between the 30th and 31st bases), an IE junction exactly at the middle (between the 30th and 31st bases), and no splice junction. [sent-172, score-0.12]
92 We cast classification as a density estimation problem in this case by treating the class variable as just another variable. [sent-175, score-0.088]
93 We tested both feature induction and edge induction; in the former case, only binary features that are products of features of the given form were tested and induced. [sent-177, score-0.6]
94 MDL was used to pick the number of features or edges. [sent-178, score-0.139]
95 Our feature induction algorithm, with a maximum treewidth equal to 5, gave an error rate of , while the edge induction algorithm gave an error rate of . [sent-179, score-0.987]
96 5 Conclusions. We have described a methodology for feature selection, edge selection and parameter estimation that can be viewed as a generalization of the Chow-Liu algorithm. [sent-181, score-0.4]
97 Drawing on the feature selection methods of [9, 16], our method is quite general, building an exponential family model from the general vocabulary of features on overlapping subsets of variables. [sent-182, score-0.403]
98 Our methodology applies equally well to feature or edge selection. [sent-184, score-0.247]
99 In large-scale, sparse domains in which overfitting is of particular concern, however, feature selection may be the preferred approach, in that it provides a finer-grained search in the space of simple models than is allowed by the edge selection approach. [sent-185, score-0.372]
R. Jiroušek and S. Přeučil, On the effective implementation of the iterative proportional fitting procedure, Computational Statistics and Data Analysis, 19:177–189, 1995. [sent-233, score-0.117]
wordName wordTfidf (topN-words)
[('treewidth', 0.471), ('junction', 0.393), ('thin', 0.218), ('tree', 0.201), ('clique', 0.173), ('cliques', 0.146), ('features', 0.139), ('trees', 0.135), ('splice', 0.12), ('iterative', 0.117), ('induction', 0.114), ('digit', 0.114), ('ten', 0.111), ('feature', 0.109), ('edge', 0.099), ('maximal', 0.097), ('triangulation', 0.096), ('triangulated', 0.096), ('missing', 0.094), ('pixels', 0.094), ('scaling', 0.084), ('erased', 0.083), ('graphical', 0.07), ('mnist', 0.07), ('classi', 0.07), ('ef', 0.069), ('edges', 0.066), ('ranking', 0.065), ('candidates', 0.061), ('dataset', 0.058), ('graph', 0.057), ('estimation', 0.056), ('models', 0.056), ('fractions', 0.055), ('ipf', 0.055), ('resized', 0.055), ('tjt', 0.055), ('yccc', 0.055), ('generative', 0.055), ('family', 0.055), ('selection', 0.054), ('dependencies', 0.054), ('dna', 0.052), ('inference', 0.052), ('digits', 0.051), ('bases', 0.05), ('junctions', 0.048), ('halves', 0.048), ('equipped', 0.048), ('chow', 0.048), ('srebro', 0.048), ('tractable', 0.048), ('discriminative', 0.046), ('exponential', 0.046), ('handwritten', 0.045), ('iis', 0.044), ('blank', 0.044), ('cropped', 0.044), ('tractability', 0.044), ('decomposable', 0.044), ('marginalized', 0.044), ('occluded', 0.044), ('reconstructions', 0.044), ('percentage', 0.043), ('parameter', 0.043), ('algorithm', 0.042), ('della', 0.041), ('pietra', 0.041), ('mdl', 0.041), ('elimination', 0.041), ('ih', 0.04), ('root', 0.039), ('repeat', 0.039), ('methodology', 0.039), ('kl', 0.039), ('svmtorch', 0.038), ('inducing', 0.038), ('variables', 0.038), ('cient', 0.038), ('maximum', 0.038), ('samples', 0.037), ('likelihood', 0.037), ('images', 0.037), ('gain', 0.037), ('block', 0.037), ('rank', 0.037), ('classify', 0.037), ('leftmost', 0.037), ('circular', 0.037), ('setting', 0.035), ('algorithms', 0.034), ('ensuring', 0.032), ('boltzmann', 0.032), ('liu', 0.032), ('density', 0.032), ('exploits', 0.031), ('bounded', 0.03), ('gibbs', 0.03), ('jordan', 0.03), ('exact', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
2 0.19355595 118 nips-2001-Matching Free Trees with Replicator Equations
Author: Marcello Pelillo
Abstract: Motivated by our recent work on rooted tree matching, in this paper we provide a solution to the problem of matching two free (i.e., unrooted) trees by constructing an association graph whose maximal cliques are in one-to-one correspondence with maximal common subtrees. We then solve the problem using simple replicator dynamics from evolutionary game theory. Experiments on hundreds of uniformly random trees are presented. The results are impressive: despite the inherent inability of these simple dynamics to escape from local optima, they always returned a globally optimal solution.
3 0.15976356 193 nips-2001-Unsupervised Learning of Human Motion Models
Author: Yang Song, Luis Goncalves, Pietro Perona
Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.
4 0.14906634 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
Author: James M. Coughlan, Alan L. Yuille
Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (d. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]) . We first use our analysis of the g-factor to determine when the clique potentials decouple for different features . Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi[8] approximations to simplify the GIS procedure. We support our analysis by computer simulations. 1
5 0.14812267 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
6 0.13481462 56 nips-2001-Convolution Kernels for Natural Language
7 0.11659446 105 nips-2001-Kernel Machines and Boolean Functions
8 0.10708399 22 nips-2001-A kernel method for multi-labelled classification
9 0.10207732 149 nips-2001-Probabilistic Abstraction Hierarchies
10 0.10095941 46 nips-2001-Categorization by Learning and Combining Object Parts
11 0.099319018 43 nips-2001-Bayesian time series classification
12 0.096927106 115 nips-2001-Linear-time inference in Hierarchical HMMs
13 0.092347205 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
14 0.090686172 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
15 0.089634508 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
16 0.081339717 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
17 0.078731656 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
18 0.077733628 8 nips-2001-A General Greedy Approximation Algorithm with Applications
19 0.076660827 188 nips-2001-The Unified Propagation and Scaling Algorithm
20 0.075074084 89 nips-2001-Grouping with Bias
topicId topicWeight
[(0, -0.266), (1, 0.08), (2, -0.028), (3, 0.046), (4, -0.1), (5, -0.123), (6, -0.126), (7, -0.075), (8, -0.061), (9, -0.017), (10, -0.033), (11, -0.146), (12, 0.125), (13, -0.018), (14, -0.008), (15, -0.144), (16, -0.205), (17, 0.219), (18, -0.032), (19, -0.053), (20, -0.015), (21, -0.002), (22, -0.028), (23, 0.024), (24, -0.074), (25, 0.061), (26, 0.1), (27, 0.025), (28, 0.131), (29, 0.069), (30, 0.131), (31, 0.034), (32, 0.105), (33, -0.147), (34, -0.033), (35, 0.104), (36, 0.043), (37, -0.043), (38, -0.026), (39, 0.196), (40, 0.028), (41, 0.014), (42, 0.062), (43, -0.03), (44, -0.006), (45, 0.081), (46, 0.017), (47, -0.045), (48, -0.009), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.94012815 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
2 0.76654053 118 nips-2001-Matching Free Trees with Replicator Equations
Author: Marcello Pelillo
Abstract: Motivated by our recent work on rooted tree matching, in this paper we provide a solution to the problem of matching two free (i.e., unrooted) trees by constructing an association graph whose maximal cliques are in one-to-one correspondence with maximal common subtrees. We then solve the problem using simple replicator dynamics from evolutionary game theory. Experiments on hundreds of uniformly random trees are presented. The results are impressive: despite the inherent inability of these simple dynamics to escape from local optima, they always returned a globally optimal solution.
3 0.67733264 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
Author: James M. Coughlan, Alan L. Yuille
Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (d. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]) . We first use our analysis of the g-factor to determine when the clique potentials decouple for different features . Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi[8] approximations to simplify the GIS procedure. We support our analysis by computer simulations. 1
4 0.59945685 193 nips-2001-Unsupervised Learning of Human Motion Models
Author: Yang Song, Luis Goncalves, Pietro Perona
Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.
5 0.57225293 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
6 0.48285857 56 nips-2001-Convolution Kernels for Natural Language
7 0.46367753 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
8 0.45922601 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
9 0.4345499 108 nips-2001-Learning Body Pose via Specialized Maps
10 0.42774519 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
11 0.38892975 105 nips-2001-Kernel Machines and Boolean Functions
12 0.3617599 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
13 0.3615157 120 nips-2001-Minimax Probability Machine
14 0.35805136 43 nips-2001-Bayesian time series classification
15 0.35648668 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
16 0.35612333 22 nips-2001-A kernel method for multi-labelled classification
17 0.34981573 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
18 0.3444154 115 nips-2001-Linear-time inference in Hierarchical HMMs
19 0.34001896 84 nips-2001-Global Coordination of Local Linear Models
20 0.31771749 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
topicId topicWeight
[(14, 0.037), (17, 0.018), (19, 0.036), (27, 0.147), (30, 0.104), (36, 0.016), (38, 0.02), (59, 0.049), (72, 0.065), (79, 0.047), (83, 0.027), (88, 0.22), (91, 0.145)]
simIndex simValue paperId paperTitle
1 0.95316589 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
2 0.9205519 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
Author: Brendan J. Frey, Nebojsa Jojic
Abstract: In previous work on “transformed mixtures of Gaussians” and “transformed hidden Markov models”, we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is acheived through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N ×N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N )N 2 scalar operations per iteration. In contrast, the original algorithm takes CN 6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320 ×240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm. 1
same-paper 3 0.87616551 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
4 0.78806531 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder
Author: Shiro Ikeda, Toshiyuki Tanaka, Shun-ichi Amari
Abstract: The mystery of belief propagation (BP) decoder, especially of the turbo decoding, is studied from information geometrical viewpoint. The loopy belief network (BN) of turbo codes makes it difficult to obtain the true “belief” by BP, and the characteristics of the algorithm and its equilibrium are not clearly understood. Our study gives an intuitive understanding of the mechanism, and a new framework for the analysis. Based on the framework, we reveal basic properties of the turbo decoding.
5 0.76159114 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
6 0.73500121 13 nips-2001-A Natural Policy Gradient
7 0.73335177 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
8 0.73022223 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
9 0.73018968 46 nips-2001-Categorization by Learning and Combining Object Parts
10 0.73010212 56 nips-2001-Convolution Kernels for Natural Language
11 0.72972107 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
12 0.72943169 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
13 0.72893369 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
14 0.72775203 8 nips-2001-A General Greedy Approximation Algorithm with Applications
15 0.7270568 60 nips-2001-Discriminative Direction for Kernel Classifiers
17 0.72555631 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
18 0.72544181 193 nips-2001-Unsupervised Learning of Human Motion Models
19 0.72431248 89 nips-2001-Grouping with Bias
20 0.72392219 185 nips-2001-The Method of Quantum Clustering