nips nips2004 nips2004-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. [sent-3, score-0.785]
2 The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. [sent-4, score-0.074]
3 We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. [sent-5, score-0.373]
4 1 Introduction The Gaussian mixture model (MoG) is a flexible and powerful parametric framework for unsupervised data grouping. [sent-6, score-0.258]
5 Mixture models, however, are often involved in other learning processes whose goals extend beyond simple density estimation to hierarchical clustering, grouping of discrete categories or model simplification. [sent-7, score-0.186]
6 This grouping results in a compact representation of the original mixture of many Gaussians that respects the original component structure in the sense that no original component is split in the reduced representation. [sent-9, score-0.643]
7 We can view the problem of Gaussian component clustering as general data-point clustering with side information that points belonging to the same original Gaussian component should end up in the same final cluster. [sent-10, score-0.679]
8 Several algorithms that perform clustering of data points given such constraints were recently proposed [11, 5, 12]. [sent-11, score-0.237]
9 Of course, one could always generate data by sampling from the model, enforcing the constraint that any two samples generated by the same mixture component must end up in the same final cluster. [sent-13, score-0.287]
10 In other situations we want to collapse a MoG into a mixture of fewer components in order to reduce computation complexity. [sent-15, score-0.277]
11 One common solution to this problem is grouping the Gaussians according to their common history in recent timesteps and collapsing Gaussians grouped together into a single Gaussian [1]. [sent-17, score-0.251]
12 Other instances in which collapsing MoGs is relevant are variants of particle filtering [10], non-parametric belief propagation [7] and fault detection in dynamical systems [3]. [sent-19, score-0.204]
13 A straightforward solution for these situations is first to produce samples from the original MoG and then to apply the EM algorithm to learn a reduced model; however, this is computationally inefficient and does not preserve the component structure of the original mixture. [sent-20, score-0.36]
14 2 The Clustering Algorithm We assume that we are given a mixture density f composed of k d-dimensional Gaussian components: $f(y) = \sum_{i=1}^{k} \alpha_i N(y; \mu_i, \Sigma_i) = \sum_{i=1}^{k} \alpha_i f_i(y)$ (1). We want to cluster the components of f into a reduced mixture of m < k components. [sent-21, score-0.847]
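As a concrete illustration of the mixture density in (1), here is a minimal Python sketch; the parameter values are toy values chosen for the example, not anything taken from the paper.

import numpy as np
from scipy.stats import multivariate_normal

def mog_density(y, alphas, mus, sigmas):
    # f(y) = sum_i alpha_i N(y; mu_i, Sigma_i), evaluated at a single point y
    return sum(a * multivariate_normal.pdf(y, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, sigmas))

# A toy 2-dimensional mixture with k = 3 components (illustrative values only).
alphas = np.array([0.5, 0.3, 0.2])
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
print(mog_density(np.array([1.0, 1.0]), alphas, mus, sigmas))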
15 If we denote the set of all (d-dimensional) Gaussian mixture models with at most m components by MoG(m), one way to formalize the goal of clustering is to say that we wish to find the element g of MoG(m) “closest” to f under some distance measure. [sent-22, score-0.52]
16 $\hat{g} = \arg\min_g KL(f \| g) = \arg\max_g \int f \log g$, where $KL(\cdot \| \cdot)$ is the Kullback-Leibler divergence and the minimization is performed over all g in MoG(m). [sent-25, score-0.326]
17 Furthermore, minimizing a KL-based criterion does not preserve the original component structure of f. [sent-27, score-0.157]
18 The KL divergence between two Gaussians has the closed form $KL(N(\mu_1, \Sigma_1) \| N(\mu_2, \Sigma_2)) = \frac{1}{2}\left(\log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_1 - \mu_2)^{\top}\Sigma_2^{-1}(\mu_1 - \mu_2) - d\right)$. Under this distance, the optimal reduced MoG representation $\hat{g}$ is the solution to the minimization of (2) over MoG(m): $\hat{g} = \arg\min_g d(f, g)$. [sent-31, score-0.254]
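A minimal sketch of the closed-form Gaussian KL divergence above, together with a distance computation that assumes (2) charges each component f_i the KL divergence to its closest component of g, weighted by alpha_i; this form is an assumption consistent with the matching function pi^g defined below, and the function names are ours.

import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    # Closed-form KL(N(mu1, S1) || N(mu2, S2)) for d-dimensional Gaussians.
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(S2_inv @ S1)
                  + diff @ S2_inv @ diff
                  - d)

def mog_distance(alphas, mus_f, sigs_f, mus_g, sigs_g):
    # Assumed form of d(f, g): each component of f pays the KL to its closest component of g.
    return sum(a * min(kl_gauss(mu, S, mu_g, S_g)
                       for mu_g, S_g in zip(mus_g, sigs_g))
               for a, mu, S in zip(alphas, mus_f, sigs_f))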
19 Although the minimization ranges over all of MoG(m), we prove that the optimal density $\hat{g}$ is a MoG obtained by grouping the components of f into clusters and collapsing all Gaussians within a cluster into a single Gaussian. [sent-32, score-0.397]
20 For a given g ∈ MoG(m), we associate a matching function $\pi^g \in S$: $\pi^g(i) = \arg\min_{j=1,\dots,m} KL(f_i \| g_j)$ for $i = 1, \dots, k$. [sent-42, score-0.163]
21 Using (5) to define our main optimization, we obtain the optimal reduced model as the solution of the following double minimization problem: $\hat{g} = \arg\min_{g} \min_{\pi \in S} d(f, g, \pi)$ (6). For m > 1 the double minimization (6) cannot be solved analytically. [sent-48, score-0.411]
22 Instead, we can use alternating minimization to obtain a local minimum. [sent-49, score-0.082]
23 Let $g_j^{\pi} = N(\mu_j^{\pi}, \Sigma_j^{\pi})$ be the Gaussian distribution obtained by collapsing the set $f_j^{\pi}$ into a single Gaussian. [sent-52, score-0.757]
24 It satisfies: $g_j^{\pi} = N(\mu_j^{\pi}, \Sigma_j^{\pi}) = \arg\min_{g} KL(f_j^{\pi} \| g) = \arg\min_{g} d(f_j^{\pi}, g)$, where the minimization is performed over all d-dimensional Gaussian densities. [sent-53, score-0.814]
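The minimizing Gaussian is obtained by moment matching the weighted group of components assigned to cluster j. A minimal sketch follows; the variable and function names are ours, and `members` holds the indices i with pi(i) = j.

import numpy as np

def collapse(alphas, mus, sigmas, members):
    # Collapse the components indexed by `members` into a single weighted Gaussian.
    beta = sum(alphas[i] for i in members)                   # cluster weight
    mu = sum(alphas[i] * mus[i] for i in members) / beta     # weighted mean
    sigma = sum(alphas[i] * (sigmas[i]
                             + np.outer(mus[i] - mu, mus[i] - mu))
                for i in members) / beta                     # within-component + between-component spread
    return beta, mu, sigma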
25 Denote the collapsed version of f according to π by $g^{\pi}$, i.e.: [sent-54, score-0.193]
26 $g^{\pi} = \sum_{j=1}^{m} \beta_j^{\pi} g_j^{\pi}$ (8). Lemma 1: Given a MoG f and a matching function π ∈ S, $g^{\pi}$ is the unique minimum point of $d(f, g, \pi)$. [sent-56, score-0.557]
27 More precisely, $d(f, g^{\pi}, \pi) \le d(f, g, \pi)$ for all $g \in \mathrm{MoG}(m)$, and if $d(f, g^{\pi}, \pi) = d(f, g, \pi)$ then $g_j = g_j^{\pi}$ for all $j = 1, \dots, m$, [sent-57, score-1.032]
28 where $g_j$ and $g_j^{\pi}$ are the Gaussian components of $g$ and $g^{\pi}$ respectively. [sent-59, score-1.08]
29 Proof: Denote $c = \sum_{i=1}^{k} \alpha_i \int f_i \log f_i$ (a constant independent of g). [sent-60, score-0.599]
30 For all $j = 1, \dots, m$ (such that $\pi^{-1}(j)$ is not empty) the Gaussian densities $g_j$ and $g_j^{\pi}$ are equal. [sent-64, score-1.032]
31 The iterative algorithm monotonically decreases the distance measure d(f, g). [sent-66, score-0.112]
32 The next theorem ensures that once the iterative algorithm converges we obtain a clustering of the MoG components. [sent-68, score-0.321]
33 Definition 1: A MoG g ∈ MoG(m) is an m-mixture collapsed version of f if there exists a matching function π ∈ S such that g is obtained by collapsing f according to π, i.e. $g = g^{\pi}$. [sent-69, score-0.204]
34 Theorem 1: If applying a single iteration (expressions (regroup) and (refit)) to a function g ∈ MoG(m) does not decrease the distance function (2), then necessarily g is a collapsed version of f. [sent-73, score-0.265]
35 Let $g^{\pi}$ be a collapsed version of f according to π. [sent-75, score-0.193]
36 According to Lemma 1, this implies that $g_j = g_j^{\pi}$ for all $j = 1, \dots, m$. [sent-89, score-1.032]
37 Theorem 1 implies that each local minimum of the proposed iterative algorithm is a collapsed version of f. [sent-94, score-0.231]
38 Given the optimal matching function π, the last step of the algorithm is to set the weights of the reduced representation. [sent-95, score-0.151]
39 These weights are automatically obtained via the collapsing process. [sent-97, score-0.141]
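Putting the regroup and refit steps together, here is a minimal sketch of the alternating minimization; it reuses kl_gauss and collapse from the sketches above, and the initialization and stopping rule are our own choices, not prescribed by the text. The cluster weights returned by collapse are the sums of the original alpha_i assigned to each cluster, which is how the weights of the reduced representation are obtained.

import numpy as np

def cluster_mog(alphas, mus, sigmas, m, init_assign, max_iter=100):
    # Group the k components of f into m clusters and collapse each cluster.
    # Assumes kl_gauss(...) and collapse(...) from the earlier sketches are in scope,
    # and (for simplicity) that no cluster ever becomes empty.
    k = len(alphas)
    assign = np.asarray(init_assign)          # pi: component index -> cluster index
    for _ in range(max_iter):
        # refit: collapse every cluster into a single weighted Gaussian
        reduced = [collapse(alphas, mus, sigmas, np.flatnonzero(assign == j))
                   for j in range(m)]
        # regroup: send each original component to its closest reduced component
        new_assign = np.array([
            min(range(m), key=lambda j: kl_gauss(mus[i], sigmas[i],
                                                 reduced[j][1], reduced[j][2]))
            for i in range(k)])
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, reduced                    # pi and the list of (beta_j, mu_j, Sigma_j)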
40 3 Experimental Results In this section we evaluate the performance of our semi-supervised clustering algorithm and compare it to the standard “flat” clustering approach that does not respect the original component structure. [sent-98, score-0.635]
41 We have applied both methods to clustering handwritten digits and natural scene images. [sent-99, score-0.341]
42 For each category c we learn from a labeled training set a Gaussian distribution f (x|c). [sent-101, score-0.096]
43 The goal is to cluster the objects into a small number of clusters (fewer than the number of class labels). [sent-103, score-0.141]
44 The standard (flat) approach is to apply an unsupervised clustering to the entire collection of original objects, ignoring their class labels. [sent-104, score-0.392]
45 Alternatively, we can utilize the given categorization as side information in order to obtain an improved reduced clustering which also respects the original labels, thus inducing a hierarchical structure. [sent-105, score-0.483]
46 Figure 1: (top) Means of 10 models of digit classes. [sent-106, score-0.112]
47 (bottom) Means of two clusters after our algorithm has grouped 0,2,3,5,6,8 and 1,4,7,9. [sent-107, score-0.128]
48 Table 1: Clustering results showing the purity of a 2-cluster reduced model learned from a training set of handwritten digits in 10 original classes. [sent-108, score-0.312]
Digit:                      0    1    2    3    4    5    6    7    8    9
This paper, Class A:      100    4   99   99    3   99   99    0   94    1
This paper, Class B:        0   96    1    1   98    2    1  100    6   99
Unsupervised EM, Class 1:  93   16   93   87   22   66   96   16   23   25
Unsupervised EM, Class 2:   7   85    7   14   78   34    4   84   77   76
49 For each true label, the percentage of cases (from an unseen test set) falling into each of the two reduced classes is shown. [sent-109, score-0.087]
50 The top two lines show the purity of assignments provided by our clustering algorithm; the bottom two lines show assignments from a flat unsupervised fitting of a two-component mixture. [sent-110, score-0.476]
51 In the next step we want to divide the digits into two natural clusters, while taking into account their original 10-way structure. [sent-113, score-0.127]
52 We applied our semi-supervised algorithm to reduce the mixture of 10 Gaussians into a mixture of two Gaussians. [sent-114, score-0.391]
53 The minimal distance (2) is obtained when the ten digits are divided into the two groups {0, 2, 3, 5, 6, 8} and {1, 4, 7, 9}. [sent-115, score-0.108]
54 The means of the two resulting clusters are shown in Figure 1. [sent-116, score-0.081]
55 To evaluate the purity of this clustering, the reduced MoG was used to label a test set consisting of 4000 previously unseen examples. [sent-117, score-0.182]
56 The binary labels on the test set are obtained by comparing the likelihood of the two components in the reduced mixture. [sent-118, score-0.177]
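A minimal sketch of this labeling step and of the purity tabulation in Table 1; the function names and array layout are ours, and, following the text, each test point is assigned by comparing the component likelihoods of the reduced mixture.

import numpy as np
from scipy.stats import multivariate_normal

def assign_reduced(Y, mus_g, sigs_g):
    # For each row of Y, the index of the reduced component with the highest likelihood.
    logp = np.column_stack([multivariate_normal.logpdf(Y, mean=mu, cov=S)
                            for mu, S in zip(mus_g, sigs_g)])
    return np.argmax(logp, axis=1)

def purity_table(true_labels, reduced_labels, n_classes, n_clusters=2):
    # Percentage of each true class falling into each reduced cluster (as in Table 1).
    table = np.zeros((n_classes, n_clusters))
    for c, r in zip(true_labels, reduced_labels):
        table[c, r] += 1
    return 100.0 * table / table.sum(axis=1, keepdims=True)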
57 Alternatively, we can apply a standard EM algorithm to learn, by maximum likelihood, a flat mixture of 2 Gaussians directly from the 7000 training examples, without utilizing their class labels. [sent-120, score-0.31]
58 Although the likelihood of the unsupervised mixture model was significantly better than that of the semi-supervised model on both the train and test data sets, the purity of the clusters it learns is clearly much worse, since it does not preserve the hierarchical class structure. [sent-122, score-0.584]
59 Comparing the top and bottom of Table 1, we can see that using the side information we obtain a clustering of the digit database which is much more strongly correlated with the categorization of the set into ten digits than the clustering produced by the unsupervised procedure. [sent-123, score-0.456]
60 In a second experiment, we evaluate the performance of our proposed algorithm on image category models. [sent-124, score-0.146]
61 The database used consists of 1460 images selectively handpicked from the COREL database to create 16 categories. [sent-125, score-0.085]
62 The images within each category have a similar color spatial layout, and are labeled with a high-level semantic description. [sent-126, score-0.449]
63 5 B D Figure 2: Hierarchical clustering of natural image categories. [sent-139, score-0.27]
64 (left) Mutual information between reduced cluster index and original class. [sent-140, score-0.162]
65 (right) Sample images from the sets A,B,C,D learned by hierarchical clustering. [sent-141, score-0.089]
66 From all the pixels belonging to the same category we learn a single Gaussian. [sent-146, score-0.117]
67 We grouped the 16 category Gaussians into k sets (for several values of k up to 6) using our algorithm and compared the results to unsupervised clustering obtained from an EM procedure that learned a mixture of k Gaussians. [sent-150, score-0.518]
68 In order to evaluate the quality of the clustering in terms of correlation with the category information we computed the mutual information (MI) between the clustering result (into k clusters) and the category affiliation of the images in a test set. [sent-151, score-0.702]
69 A high value of mutual information indicates a strong resemblance between the content of the learned clusters and the hand-picked image categories. [sent-152, score-0.151]
70 It can be verified from the results summarized in Figure 2 that, as expected, the MI in the case of semi-supervised clustering is consistently larger than the MI in the case of completely unsupervised clustering. [sent-153, score-0.311]
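A minimal sketch of the MI score reported in Figure 2, computed from the empirical joint distribution of cluster index and category label on the test set; the value is returned in nats, and the function name is ours.

import numpy as np

def mutual_information(cluster_ids, category_ids):
    # MI (in nats) between two equal-length vectors of discrete labels.
    clusters = np.unique(cluster_ids)
    categories = np.unique(category_ids)
    joint = np.zeros((len(clusters), len(categories)))
    for a, b in zip(cluster_ids, category_ids):
        joint[np.searchsorted(clusters, a), np.searchsorted(categories, b)] += 1
    joint /= joint.sum()
    pc = joint.sum(axis=1, keepdims=True)     # marginal over clusters
    pk = joint.sum(axis=0, keepdims=True)     # marginal over categories
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pc @ pk)[mask])))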
71 A semi-supervised clustering of the image database yields clusters that are based on both low-level features and the available high-level categorization. [sent-154, score-0.398]
72 Sample images from the clustering into 4 sets are presented in Figure 2. [sent-155, score-0.27]
73 4 A Stochastic Model for the Proposed Distance In this section we describe a stochastic process that induces a likelihood function which coincides with the distance measure d(f, g) presented in section 2. [sent-156, score-0.093]
74 Suppose we are given two MoGs: $f(y) = \sum_{i=1}^{k} \alpha_i N(y; \mu_i, \Sigma_i) = \sum_{i=1}^{k} \alpha_i f_i(y)$ and $g(y) = \sum_{j=1}^{m} \beta_j N(y; \mu_j, \Sigma_j) = \sum_{j=1}^{m} \beta_j g_j(y)$. Consider an i.i.d. sample set of size n, drawn from f(y). [sent-157, score-0.796]
75 The samples can be arranged in k blocks according to the Gaussian component that was selected to produce the sample. [sent-158, score-0.126]
76 Assume that $n_i$ samples were drawn from the i-th component $f_i$ and denote these samples by $y_i = \{y_{i1}, \dots, y_{in_i}\}$. [sent-159, score-0.515]
77 Next, we compute the likelihood of the sample set according to the model g, but under the constraint that samples within the same block must be assigned to the same mixture component of g. [sent-163, score-0.374]
78 The likelihood of the sample set $y^n$ according to the MoG g under this constraint is: $L_n(g) = g(y_1, \dots, y_k) = \prod_{i=1}^{k} \sum_{j=1}^{m} \beta_j \prod_{t=1}^{n_i} N(y_{it}; \mu_j, \Sigma_j)$ (9). [sent-165, score-0.206]
79 Surprisingly, as noted earlier, the mixture weights $\beta_j$ do not appear in the asymptotic likelihood function of the generative model presented in this section. [sent-169, score-0.252]
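A minimal sketch of evaluating the constrained likelihood in (9) in log space; `blocks` holds one array of samples per original component, and the names are ours.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def constrained_loglik(blocks, betas, mus_g, sigs_g):
    # log L_n(g): each block of samples (drawn from one f_i) must be explained by a
    # single component of g, so we sum the block's log-densities per component and
    # only then mix over components.
    total = 0.0
    for Y in blocks:
        scores = [np.log(b) + multivariate_normal.logpdf(Y, mean=mu, cov=S).sum()
                  for b, mu, S in zip(betas, mus_g, sigs_g)]
        total += logsumexp(scores)
    return total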
80 Lemma 2: Let $\{x_{jn}\}$, $j = 1, \dots, m$, be a set of m sequences of real positive numbers such that $x_{jn} \to x_j$, and let $\{\beta_j\}$ be a set of positive numbers. [sent-172, score-0.137]
81 Then $\frac{1}{n} \log \sum_j \beta_j (x_{jn})^n \to \max_j \log x_j$. [This can be shown as follows: Let $a = \arg\max_j x_j$. [sent-173, score-0.348]
82 Hence $\log x_a \le \lim_{n\to\infty} \frac{1}{n} \log \sum_j \beta_j (x_{jn})^n \le \log x_a$.] [sent-175, score-0.311]
83 The samples $\{y_{i1}, \dots, y_{in_i}\}$ are independently sampled from the Gaussian distribution $f_i$. [sent-179, score-0.193]
84 Therefore, the law of large numbers implies: $\frac{1}{n_i} \log \prod_{t=1}^{n_i} N(y_{it}; \mu_j, \Sigma_j) \to \int f_i \log g_j$. [sent-180, score-1.203]
85 Hence, substituting $x_{jn_i} = \left(\prod_{t=1}^{n_i} N(y_{it}; \mu_j, \Sigma_j)\right)^{1/n_i} \to \exp\left(\int f_i \log g_j\right) = x_j$ in Lemma 2, we obtain: $\frac{1}{n_i} \log \sum_{j=1}^{m} \beta_j \prod_{t=1}^{n_i} N(y_{it}; \mu_j, \Sigma_j) \to \max_{j=1,\dots,m} \int f_i \log g_j$. In a similar manner, the law of large numbers, applied to the discrete distribution $(\alpha_1, \dots, \alpha_k)$, [sent-181, score-2.247]
86 Hence $\frac{1}{n} \log L_n(g) = \frac{1}{n} \log g(y_1, \dots, y_k) = \sum_{i=1}^{k} \frac{n_i}{n} \cdot \frac{1}{n_i} \log \sum_{j=1}^{m} \beta_j \prod_{t=1}^{n_i} N(y_{it}; \mu_j, \Sigma_j) \to \sum_{i=1}^{k} \alpha_i \max_{j=1,\dots,m} \int f_i \log g_j$. [sent-185, score-0.404]
87 [5] utilized the generative model described in the previous section and the EM algorithm derived from it, to learn a MoG from a data set endowed with equivalence constraints that enforce equivalent points to be assigned to the same cluster. [sent-190, score-0.1]
88 Vasconcelos and Lippman [9] proposed a similar EM-based clustering algorithm for constructing mixture hierarchies using a finite set of virtual samples. [sent-191, score-0.472]
89 Given the generative model presented above, we can apply the EM algorithm to learn the (locally) maximum likelihood parameters of the reduced MoG model g(y). [sent-192, score-0.206]
90 This EM-based approach, however, is not precisely suitable for our component clustering problem. [sent-193, score-0.305]
91 The EM update rule for the weights of the reduced mixture density is based only on the number of the original components that are clustered into a single component, without taking into account their relative weights [9]. [sent-194, score-0.472]
92 Slonim and Weiss [6] showed that the clustering algorithm in this case can be motivated either from the EM algorithm applied to a suitable generative model [4] or from the hard-decision version of the IB principle [8]. [sent-197, score-0.345]
93 However, when we want to represent the clustering result as a mixture density, there is a difference in the resulting mixture coefficients between the EM and the IB-based algorithms. [sent-198, score-0.664]
94 Unlike the IB update equation (10) for the coefficients $w_{ij}$, the EM update equation is based only on the number of components that are collapsed into a single Gaussian. [sent-199, score-0.24]
95 In the case of a mixture of Gaussians, applying the IB principle results only in a partitioning of the original components but does not deliver a reduced representation in the form of a smaller mixture [2]. [sent-200, score-0.608]
96 If we modify gj in equation (10) by collapsing the mixture gj into a single Gaussian we obtain a soft version of our algorithm. [sent-201, score-1.39]
97 To conclude, we have presented an efficient Gaussian component clustering algorithm that can be used for object category clustering and for MoG collapsing. [sent-203, score-0.633]
98 In this study we have assumed that the desired number of clusters is given as part of the problem setup, but if this is not the case, standard methods for model selection can be applied. [sent-205, score-0.081]
99 Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. [sent-215, score-0.411]
100 Computing Gaussian mixture models with EM using equivalence constraints. [sent-236, score-0.074]
wordName wordTfidf (topN-words)
[('gj', 0.516), ('mog', 0.462), ('fi', 0.258), ('clustering', 0.237), ('og', 0.185), ('mixture', 0.184), ('collapsing', 0.141), ('collapsed', 0.137), ('ni', 0.119), ('xjn', 0.115), ('yit', 0.115), ('kl', 0.106), ('ib', 0.103), ('fj', 0.1), ('mogs', 0.092), ('gaussians', 0.088), ('reduced', 0.087), ('log', 0.083), ('em', 0.083), ('clusters', 0.081), ('arg', 0.076), ('unsupervised', 0.074), ('purity', 0.073), ('gaussian', 0.069), ('category', 0.068), ('component', 0.068), ('grouping', 0.063), ('digits', 0.057), ('hierarchical', 0.056), ('wij', 0.055), ('minimization', 0.054), ('lemma', 0.054), ('distance', 0.051), ('components', 0.048), ('original', 0.048), ('handwritten', 0.047), ('min', 0.046), ('maxm', 0.046), ('refit', 0.046), ('regroup', 0.046), ('xan', 0.046), ('yini', 0.046), ('likelihood', 0.042), ('matching', 0.041), ('preserving', 0.041), ('gm', 0.04), ('fault', 0.04), ('iterative', 0.038), ('digit', 0.038), ('mutual', 0.037), ('ming', 0.037), ('vasconcelos', 0.037), ('slonim', 0.037), ('density', 0.037), ('principle', 0.036), ('samples', 0.035), ('shental', 0.034), ('image', 0.033), ('images', 0.033), ('class', 0.033), ('version', 0.033), ('coe', 0.032), ('bottleneck', 0.031), ('maxj', 0.031), ('xa', 0.031), ('categories', 0.03), ('jensen', 0.029), ('respects', 0.029), ('learn', 0.028), ('virtual', 0.028), ('alternating', 0.028), ('cluster', 0.027), ('database', 0.026), ('histograms', 0.026), ('categorization', 0.026), ('generative', 0.026), ('mi', 0.026), ('yk', 0.025), ('law', 0.025), ('grouped', 0.024), ('double', 0.024), ('ln', 0.024), ('bottom', 0.024), ('belief', 0.023), ('algorithm', 0.023), ('iteration', 0.023), ('according', 0.023), ('equivalence', 0.023), ('situations', 0.023), ('theorem', 0.023), ('xj', 0.022), ('evaluate', 0.022), ('want', 0.022), ('empty', 0.022), ('sample', 0.022), ('yields', 0.021), ('applying', 0.021), ('alternatively', 0.021), ('belonging', 0.021), ('shall', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
2 0.30460766 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
3 0.15651901 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
4 0.15507647 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
5 0.13224824 115 nips-2004-Maximum Margin Clustering
Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans
Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
6 0.12569498 40 nips-2004-Common-Frame Model for Object Recognition
7 0.12001093 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
8 0.11336454 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
9 0.10880237 161 nips-2004-Self-Tuning Spectral Clustering
10 0.10160291 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
11 0.091594197 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
12 0.089945316 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
13 0.089208208 103 nips-2004-Limits of Spectral Clustering
14 0.085119016 163 nips-2004-Semi-parametric Exponential Family PCA
15 0.081644617 52 nips-2004-Discrete profile alignment via constrained information bottleneck
16 0.070017733 164 nips-2004-Semi-supervised Learning by Entropy Minimization
17 0.069580123 158 nips-2004-Sampling Methods for Unsupervised Learning
18 0.06738364 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
19 0.061377738 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
20 0.059992962 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
topicId topicWeight
[(0, -0.214), (1, 0.05), (2, -0.079), (3, -0.112), (4, -0.021), (5, -0.02), (6, -0.199), (7, 0.18), (8, -0.074), (9, 0.085), (10, -0.037), (11, 0.201), (12, 0.112), (13, -0.132), (14, -0.038), (15, -0.221), (16, -0.028), (17, 0.079), (18, -0.102), (19, -0.06), (20, 0.122), (21, -0.147), (22, -0.091), (23, 0.087), (24, 0.022), (25, -0.136), (26, -0.025), (27, -0.087), (28, 0.045), (29, -0.056), (30, -0.09), (31, -0.223), (32, 0.098), (33, -0.041), (34, -0.135), (35, -0.035), (36, 0.037), (37, -0.065), (38, -0.063), (39, -0.04), (40, 0.016), (41, -0.074), (42, -0.018), (43, 0.001), (44, 0.025), (45, 0.05), (46, 0.043), (47, -0.025), (48, -0.045), (49, -0.088)]
simIndex simValue paperId paperTitle
same-paper 1 0.94804418 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
2 0.73642373 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
3 0.72072774 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.
4 0.5371294 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
5 0.42706987 158 nips-2004-Sampling Methods for Unsupervised Learning
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
6 0.41925934 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
7 0.40829247 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
8 0.38860616 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
9 0.38332728 115 nips-2004-Maximum Margin Clustering
10 0.37891471 40 nips-2004-Common-Frame Model for Object Recognition
11 0.35645917 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
12 0.35500604 161 nips-2004-Self-Tuning Spectral Clustering
13 0.3414619 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
14 0.3246462 103 nips-2004-Limits of Spectral Clustering
15 0.308835 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
16 0.2933366 163 nips-2004-Semi-parametric Exponential Family PCA
17 0.28380311 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
18 0.28284523 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
19 0.26077953 138 nips-2004-Online Bounds for Bayesian Algorithms
20 0.26019377 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
topicId topicWeight
[(13, 0.087), (15, 0.159), (25, 0.02), (26, 0.051), (31, 0.051), (33, 0.197), (35, 0.029), (39, 0.023), (50, 0.035), (57, 0.225), (87, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.8549403 77 nips-2004-Hierarchical Clustering of a Mixture Model
Author: Jacob Goldberger, Sam T. Roweis
Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1
2 0.77311361 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
3 0.76693237 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
4 0.76545876 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.76472378 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.76266241 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
7 0.76189768 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
8 0.76131642 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
9 0.76126385 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.76122445 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
11 0.76113051 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
12 0.7609545 127 nips-2004-Neighbourhood Components Analysis
13 0.76086736 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
14 0.75882638 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
15 0.75876123 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
16 0.75732195 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.75643146 102 nips-2004-Learning first-order Markov models for control
18 0.75594002 69 nips-2004-Fast Rates to Bayes for Kernel Machines
19 0.75526077 177 nips-2004-Supervised Graph Inference
20 0.75522703 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms