NIPS 2003, Paper 40
Bias-Corrected Bootstrap and Model Uncertainty
Authors: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and real-world data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike's penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike's penalty (rather than one half).
1 Introduction

Efron's bootstrap is a powerful tool for estimating various properties of a given statistic, most commonly its bias and variance (cf. …). However, the bootstrap procedure also involves various problems (e.g., …). For instance, in the non-parametric bootstrap, where bootstrap samples D(b) (b = 1, …, B) are generated by drawing the data points from the given data D with replacement, each bootstrap sample D(b) often contains multiple identical data points, which is a typical property of discrete data. When given continuous data, as in gene-expression data, the bootstrap procedure thus introduces a spurious discreteness in the samples D(b). A statistic computed from these discrete bootstrap samples may differ from the one based on the continuous data D. As noted in [4], however, the effects due to this induced spurious discreteness are typically negligible.
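The spurious discreteness noted above is easy to see numerically: a non-parametric bootstrap sample of size N contains on average only a fraction 1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 63.2% distinct points of the original data. A minimal sketch (the data here are illustrative uniform draws, not from the paper's experiments):

```python
import random

random.seed(0)

N = 1000
# Continuous "data": all N points are distinct with probability one.
data = [random.random() for _ in range(N)]

# One non-parametric bootstrap sample: N draws with replacement.
boot = [random.choice(data) for _ in range(N)]

distinct_fraction = len(set(boot)) / N
expected = 1.0 - (1.0 - 1.0 / N) ** N  # approx 1 - 1/e for large N

print(distinct_fraction, expected)
```

So roughly a third of the bootstrap sample consists of repeated points, even though the original data contained none.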
In this paper, we focus on the spurious dependences induced by the bootstrap procedure, even when given discrete data. These spurious dependences cannot be neglected when exploring model (structure) uncertainty by means of the bootstrap, whether parametric or non-parametric. Graphical models learned from the bootstrap samples are biased towards too complex models, and this bias can be considerably larger than the variability of the graph structure, especially in the interesting case of limited data. As a result, too many edges are present in the learned model structures, and the confidence in the presence of edges is overestimated. This suggests that a bias-corrected bootstrap procedure is essential for exploring model structure uncertainty. Similarly to the statistics literature, we give a derivation of the bias-correction term to amend several popular scoring functions when applied to bootstrap samples (cf. …). This bias-correction term asymptotically equals one half of the penalty term for model complexity in the Akaike Information Criterion (AIC). The (huge) effects of this bias and the proposed bias-correction are illustrated in our experiments in Section 5. As the maximum likelihood score and the entropy are intimately tied to each other in the exponential family of probability distributions, we also relate this bias towards too complex models to the bias of the plug-in estimator for entropy (Section 3.1). Moreover, we show in Section 4, similarly to [13, 1], how the (bootstrap) bias-correction can be used to obtain a scoring function whose penalty for model complexity asymptotically equals Akaike's penalty (rather than one half of that).
2 Bootstrap Bias-Estimation and Bias-Correction

In this section, we introduce relevant notation and briefly review the bootstrap bias estimation of an arbitrary statistic, as well as the bootstrap bias-correction (cf. …). A statistic T is any number that can be computed from the given data D. Its bias is defined as

  Bias_T = ⟨T(D)⟩_{D∼p} − T(p),

where ⟨T(D)⟩_{D∼p} denotes the expectation over the data sets D of size N sampled from the (unknown) true distribution p. This bias cannot be computed exactly, since p is unknown. However, it can be approximated by the bootstrap bias-estimate,

  Bias_T ≈ ⟨T(D(b))⟩_b − T(p̂),   (1)

where p is replaced by the empirical distribution p̂, and the average over the data sets D is replaced by the one over the bootstrap samples D(b) generated from p̂, where b = 1, …, B.
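The bootstrap bias-estimate of Eq. 1 takes only a few lines of code. In this sketch the statistic is the plug-in variance; the helper names (`plugin_variance`, `bootstrap_bias`) are ours, not the paper's:

```python
import random

def plugin_variance(data):
    """Plug-in statistic T_sigma2(p_hat) = IE(X^2) - IE(X)^2 under p_hat."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def bootstrap_bias(statistic, data, B=2000, seed=0):
    """Bootstrap bias estimate (Eq. 1): <T(D^(b))>_b - T(p_hat)."""
    rng = random.Random(seed)
    n = len(data)
    t_boot = [statistic([rng.choice(data) for _ in range(n)]) for _ in range(B)]
    return sum(t_boot) / B - statistic(data)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(30)]

# The plug-in variance is concave in p_hat, so its estimated bias is negative,
# close to -T(D)/N for this statistic.
bias_hat = bootstrap_bias(plugin_variance, data)
print(bias_hat)
```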
For example, T_σ²(p̂) = IE(X²) − IE(X)², where IE denotes the expectation under p̂, is the familiar plug-in statistic for the variance, while T_σ²^unbiased(D) = N/(N−1) · T_σ²(p̂) is the unbiased estimator. Obviously, a plug-in statistic yields an unbiased estimate concerning the distribution that is plugged in. Consequently, when the empirical distribution is plugged in, a plug-in statistic typically does not give an unbiased estimate concerning the (unknown) true distribution. In this case, the bias does not vanish in general. In the special case where a plug-in statistic is a convex (concave) function of p̂, it follows immediately from the Jensen inequality that its bias is positive (negative). For example, the statistic T_σ²(p̂) is a negative quadratic, and thus concave, function of p̂, and hence underestimates the variance of the (unknown) true distribution.
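This systematic underestimation can be checked by simulation: averaged over many data sets of size N, the plug-in variance recovers only (N−1)/N of the true variance. A quick illustrative check (our own example, not from the paper):

```python
import random

rng = random.Random(42)
N, reps = 10, 20000

def plugin_variance(data):
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / n

# Average the plug-in variance over many data sets of size N from N(0, 1);
# by Jensen / direct calculation its expectation is (N - 1)/N * sigma^2 = 0.9.
avg = sum(plugin_variance([rng.gauss(0.0, 1.0) for _ in range(N)])
          for _ in range(reps)) / reps
print(avg)
```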
The general procedure of bias-correction can be used to reduce the bias of a biased statistic considerably. The bootstrap bias-corrected estimator T^BC is given by

  T^BC(D) = T(D) − Bias_T = 2 T(D) − ⟨T(D(b))⟩_b,   (2)

where Bias_T is the bootstrap bias estimate according to Eq. 1. Typically, T^BC(D) agrees with the corresponding unbiased estimator in leading order in N (cf. …). In general, the bootstrap bias estimate depends on the empirical distribution p̂. However, this is not an issue in this paper, since the "estimate" of the bias turns out to be independent of the empirical distribution (in leading order in N).
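Applying Eq. 2 to the plug-in variance recovers, to leading order, the familiar unbiased estimator N/(N−1)·T_σ²(p̂). A minimal sketch (helper names are ours):

```python
import random

def plugin_variance(data):
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / n

def bias_corrected(statistic, data, B=4000, seed=0):
    """Bootstrap bias-corrected estimator (Eq. 2):
    T_BC(D) = 2 T(D) - <T(D^(b))>_b."""
    rng = random.Random(seed)
    n = len(data)
    t_boot = [statistic([rng.choice(data) for _ in range(n)]) for _ in range(B)]
    return 2.0 * statistic(data) - sum(t_boot) / B

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(25)]

t_bc = bias_corrected(plugin_variance, data)
t_unbiased = len(data) / (len(data) - 1) * plugin_variance(data)
print(t_bc, t_unbiased)
```

The two values agree up to Monte Carlo noise and O(1/N²) terms, illustrating the "agrees in leading order" statement above.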
3 Bias-Corrected Scoring-Functions

In this section, we show that the above popular scoring functions are (considerably) biased towards too complex models when applied to bootstrap samples (in place of the given data). These scoring functions can be amended by an additional penalty term that accounts for this bias. Using the bootstrap bias-correction in a slightly non-standard way, a simple expression for this penalty term follows easily (Section 3.2) from the well-known bias of the plug-in estimator of the entropy, which is reviewed in Section 3.1.

Since the entropy H(p(X)) = −Σ_x p(x) log p(x) is a concave function of the p's, the plug-in estimator H(p̂(X)) tends to underestimate the true entropy H(p(X)) (cf. …). In leading order, this bias is Bias_H ≈ −(1/(2N)) (|X| − 1), where |X| denotes the number of (joint) states of X. Note that this approximation can be applied analogously to Bias_H itself (instead of the bootstrap estimate of Bias_H), and the same leading-order term is obtained. The bootstrap bias-corrected estimator for the entropy of the (unknown true) distribution is thus given by

  H^BC(p̂(X)) = H(p̂(X)) + (1/(2N)) (|X| − 1) + O(1/N²).
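This leading-order correction, H(p̂) + (|X|−1)/(2N), is the classical Miller–Madow correction for the plug-in entropy. A short sketch (natural-log entropy; variable names are ours):

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """H(p_hat) = -sum_x p_hat(x) log p_hat(x), natural log."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def corrected_entropy(samples, num_states):
    """H_BC(p_hat) = H(p_hat) + (|X| - 1) / (2 N), up to O(1/N^2) terms."""
    return plugin_entropy(samples) + (num_states - 1) / (2 * len(samples))

samples = list("aabacbbacab")   # N = 11 observations of a 3-state variable
h_plug = plugin_entropy(samples)
h_bc = corrected_entropy(samples, num_states=3)
print(h_plug, h_bc)
```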
3.2 Bias-Correction for Bootstrapped Scoring-Functions

This section is concerned with the bias of popular scoring functions that is induced by the bootstrap procedure. When the BIC is applied to a bootstrap sample D(b) (instead of the given data D), the BIC cannot be expected to yield an "unbiased" estimate of the true graph. This is because the maximum likelihood term in the BIC is biased when computed from the bootstrap sample D(b) instead of the given data D. This bias reads

  Bias_{T_BIC} = ⟨T_BIC(D(b))⟩_b − T_BIC(D).

Note that this differs from the bootstrap bias estimate in Eq. 1: first, it is the (exact) bias induced by the bootstrap procedure, while Eq. 1 is a bootstrap approximation of the (unknown) true bias; second, while Eq. 1 applies to a statistic in general, the last term here is the score computed from the given data D. Since the maximum likelihood term is intimately tied to the entropy in the exponential family of probability distributions, the leading-order approximation of the bias of the entropy carries over (cf. Section 3.1). Note that this bias is identical to one half of the penalty for model complexity in the Akaike Information Criterion (AIC). Hence, this bias due to the bootstrap cannot be neglected compared to the penalty terms inherent in all popular scoring functions. Also our experiments in Section 5 confirm the dominating effect of this bias when exploring model uncertainty. This bias in the maximum likelihood gives rise to spurious dependences induced by the bootstrap (a well-known property).
In this context, the bootstrap procedure obviously gives rise to a (considerable) bias towards too complex models. As a consequence, too many edges are present in the learned graph structure, and the confidence in the presence of edges is overestimated. Moreover, the amount of the induced bias generally varies among the different edges in the graph, because the bias is proportional to the number of joint states of the parents of a variable (cf. …). Consequently, the BIC has to be amended when applied to a bootstrap sample D(b) (instead of the given data D), yielding the bias-corrected score T_BIC^BC (Eq. 7). Since the bias originates from the maximum likelihood term involved in the BIC, the same bias-correction applies to the AIC and MDL scores. The bias-correction of Eq. 7 can also be expected to account for most of the bias of log p(D(b)|m) when applied to bootstrap samples D(b).
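In practice, the amendment amounts to subtracting one half of Akaike's penalty, i.e. |θ|/2 in log-likelihood units, whenever the score is evaluated on a bootstrap sample. A hedged sketch (function names and interface are ours; `loglik_boot` is the maximized log-likelihood of graph m on D(b), and `num_params` is |θ|, the number of free parameters):

```python
import math

def bic_score(loglik, num_params, n):
    """Standard BIC score (larger is better): log L - (|theta| / 2) log N."""
    return loglik - 0.5 * num_params * math.log(n)

def bic_score_bootstrap(loglik_boot, num_params, n):
    """BIC evaluated on a bootstrap sample D(b), amended by subtracting the
    induced bias of the maximum likelihood term: in leading order one half
    of Akaike's penalty, |theta| / 2 in log-likelihood units."""
    return bic_score(loglik_boot, num_params, n) - 0.5 * num_params

# Hypothetical numbers: a graph with |theta| = 20 on a sample of N = 500.
naive = bic_score(-1200.0, 20, 500)
amended = bic_score_bootstrap(-1200.0, 20, 500)
print(naive, amended)
```

The extra penalty of |θ|/2 makes complex graphs less attractive on bootstrap samples, counteracting the spurious dependences.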
4 Bias-Corrected Maximum-Likelihood

It may be surprising that the bias derived in Eq. … amounts to only one half of Akaike's penalty. Applying the bootstrap bias-correction in the standard way (cf. Section 2), we obtain a scoring function that asymptotically equals the AIC. The goal is to learn a Bayesian network model p(X|θ, m), or p(X|m) in short, where m is the graph structure and θ̂ are the maximum likelihood parameter estimates given data D. Since the entropy of the true distribution p(X) is an irrelevant constant when comparing different graphs, minimizing the KL-divergence is equivalent to minimizing the statistic

  T(p, p̂, m) = − Σ_x p(x) log p̂(x|m),   (8)

which is the test error of the learned model when using the log loss. This test error cannot be computed, as the true distribution p is unknown, and the plug-in training error T(p̂, m) is biased. However, the "bias-corrected training error",

  T^BC(p̂, m) = T(p̂, m) − Bias_{T(p̂,m)},   (10)

can serve as a surrogate, (nearly) unbiased estimator for the unknown test error T(p, p̂, m), and hence as a scoring function for model selection. The bias is given by the difference between the expected training error and the expected test error,

  Bias_{T(p̂,m)} = ⟨T(p̂, m) − T(p, p̂, m)⟩_{D∼p} ≈ − (1/N) |θ|.

The overall bias amounts to |θ|/N, which exactly equals the AIC penalty for model complexity. When the bootstrap estimate of the (exact) bias in Eq. …
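In summary, applying the standard bias-correction of Section 2 to the training error yields (up to higher-order terms) exactly Akaike's criterion; a compact way to write the chain of identities implied above is:

```latex
% Bias-corrected training error (Eq. 10) with the leading-order bias -|\theta|/N:
T^{\mathrm{BC}}(\hat p, m)
  \;=\; T(\hat p, m) \;-\; \widehat{\mathrm{Bias}}_{T(\hat p, m)}
  \;\approx\; T(\hat p, m) \;+\; \frac{|\theta|}{N}
  \;=\; -\frac{1}{N}\Big( \log p(D \mid \hat\theta, m) \;-\; |\theta| \Big),
```

i.e., minimizing T^BC over graphs m is asymptotically equivalent to maximizing the AIC score log p(D|θ̂, m) − |θ|.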
5 Experiments

In our experiments with artificial and real-world data, we demonstrate the crucial effect of the bias induced by the bootstrap procedure when exploring model uncertainty, and show that the bias-correction of Eq. 7 can compensate for most of this (possibly large) bias in structure learning of Bayesian networks. We examined two different scoring functions, namely the BIC and the posterior probability (uniform prior over network structures, equivalent sample size α = 1, cf. …). … Consequently, the reported variability in the learned network structures tends to be smaller than the uncertainty determined by local search (without this additional information). However, we are mainly interested in the bias induced by the bootstrap here, which can be expected to be largely unaffected by the search strategy.

Although the true alarm network is known, we use the network structures learned from the given data D as a reference in our experiments: as expected, the optimal graphs learned from our small data sets tend to be sparser than the original graph in order to avoid over-fitting (cf. …). We generated 200 bootstrap samples from the given data D (as suggested in [5]), and then learned the network structure from each. Table 1 shows that the bias induced by the bootstrap procedure is considerable for both the BIC and the posterior probability: it cannot be neglected compared to the standard deviation of the distribution over the number of edges. Also note that, despite the small data sets, the bootstrap yields graphs that have even more edges than the true alarm network. In contrast, Table 1 illustrates that this bias towards too complex models can be reduced dramatically by the bias-correction outlined in Section 3.2.

The jackknife is an alternative resampling method, and can be viewed as an approximation to the bootstrap (cf. …). As a consequence, unlike bootstrap samples, jackknife samples do not contain multiple identical data points when generated from a given continuous data set (cf. …).
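The contrast between the two resampling schemes is immediate in code: a delete-1 jackknife sample simply omits one data point (so it never duplicates continuous data), while a bootstrap sample almost surely does. A small illustration with synthetic continuous data:

```python
import random

rng = random.Random(3)
N = 200
data = [rng.random() for _ in range(N)]   # continuous data: all points distinct

# Delete-1 jackknife sample: the data with point i left out.
i = 0
jackknife_sample = data[:i] + data[i + 1:]

# Non-parametric bootstrap sample: N draws with replacement.
bootstrap_sample = [rng.choice(data) for _ in range(N)]

print(len(set(jackknife_sample)), len(set(bootstrap_sample)))
```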
83 1 ± ± ± ± alarm network data N = 300 N = 1, 000 posterior BIC posterior 40 43 44 4. [sent-219, score-0.214]
84 3 Table 1: Number of edges (mean ± standard deviation) in the network structures learned from the given data set D, and when using various resampling methods: bias-corrected bootstrap (boot BC), naive bootstrap (boot), delete-1 jackknife (jack 1), and delete-d jackknife (jack d; here d = N/10). [sent-257, score-2.019]
[Figure 1: The axes of these scatter plots show the confidence in the presence of the edges in the graphs learned from the pheromone data. Axis labels: bootstrap, corrected bootstrap.]
In the context of model selection, however, one may take advantage of the extremely small bias of the "raw" jackknife estimate when determining, e.g., … Table 1 shows that the "raw" jackknife is typically less biased than the bias-corrected bootstrap in our experiments. However, it is not clear in the context of model selection how meaningful the "raw" jackknife estimate of model variability is.

Since the correct network structure is unknown in this experiment, we used local search combined with simulated annealing in order to optimize the BIC score and the posterior probability (α = 25, cf. …). As a reference in this experiment, we used 320 network structures learned from the given (discretized) data D, each of which is the highest-scoring graph found in a run of local search combined with simulated annealing. The bias-correction of Eq. 7 also applies to the joint optimization of the discretization and graph structure when given a bootstrap sample. Using the annealing parameters as suggested in [10], each run of simulated annealing resulted in a different network structure (local optimum) in practice.
While the pheromone data experiments in Table 1 qualitatively confirm the previous results, the bias induced by the bootstrap is even larger here. We suspect that this difference in the bias is caused by the rather extreme parameter values in the original alarm network model, which lead to a relatively large signal-to-noise ratio even in small data sets. Another effect of the spurious dependences induced by the bootstrap procedure is shown in Figure 1: the overestimation of the confidence in the presence of individual edges in the network structures. Obviously, the naive application of the bootstrap leads to a considerable overestimation of the confidence in the presence of many edges in Figure 1, particularly of those whose absence is favored by both our reference and the bias-corrected bootstrap. In contrast, the confidence estimated by the bias-corrected bootstrap aligns quite well with the confidence determined by our reference in Figure 1, leading to more trustworthy results in our experiments.
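The confidence measure in Figure 1 is the standard one for bootstrapped structure learning: the fraction of learned graphs that contain a given edge. A toy sketch with hand-made edge sets (the graphs below are invented for illustration; in the experiments they would come from structure learning on each D(b)):

```python
from collections import Counter

def edge_confidence(graphs):
    """Fraction of (resampled) learned graphs that contain each directed edge."""
    counts = Counter(edge for g in graphs for edge in g)
    return {edge: counts[edge] / len(graphs) for edge in counts}

# Invented edge sets of four graphs learned from four resampled data sets:
graphs = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("A", "C")},
    {("A", "B"), ("B", "C")},
]
conf = edge_confidence(graphs)
print(conf)   # ("A", "B") appears in all four graphs -> confidence 1.0
```

With a biased (naive) bootstrap, spurious edges inflate these fractions, which is exactly the overestimation visible in Figure 1.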
References:
- On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks.
- Asymptotic bias in information estimates and the exponential (Bell) polynomials.
- …
2 0.44061336 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
3 0.12769219 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. 
If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. 
Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 
5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. 
Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. 
incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 
2.3 Statistical power

Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots.

Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details.

Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC's significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small.

In the left plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section.
While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-specific tuning and don't generalize as well as other methods, such as cross-validation.

3 Experiments

Table 1 shows the results from running G-means and X-means on many large synthetic datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally

Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.
dataset           d   method    k found        BIC(×10^4)      distortion(× optimal)   time(× k-means)
synthetic k=5     2   G-means     9.1± 9.9      -0.19±2.70       0.89± 0.23             13.2
                      X-means    18.1± 3.2       0.70±0.93       0.37± 0.12              2.8
synthetic k=20    2   G-means    20.1± 0.6       0.21±0.18       0.99± 0.01              2.1
                      X-means    70.5±11.6      14.83±3.50       9.45±28.02              1.2
synthetic k=80    2   G-means    80.0± 0.2       1.84±0.12       1.00± 0.01              2.2
                      X-means   171.7±23.7      40.16±6.59      48.49±70.04              1.8
synthetic k=5     8   G-means     5.0± 0.0      -0.74±0.16       1.00± 0.00              4.6
                      X-means   *20.0± 0.0      -2.28±0.20       0.47± 0.03             11.0
synthetic k=20    8   G-means    20.0± 0.1      -0.18±0.17       0.99± 0.00              2.6
                      X-means   *80.0± 0.0      14.36±0.21       0.47± 0.01              4.0
synthetic k=80    8   G-means    80.2± 0.5       1.45±0.20       0.99± 0.00              2.9
                      X-means   229.2±36.8      52.28±9.26       0.57± 0.06              6.5
synthetic k=5    32   G-means     5.0± 0.0      -3.36±0.21       1.00± 0.00              4.4
                      X-means   *20.0± 0.0     -27.92±0.22       0.76± 0.00             29.9
synthetic k=20   32   G-means    20.0± 0.0      -2.73±0.22       1.00± 0.00              2.3
                      X-means   *80.0± 0.0     -11.13±0.23       0.76± 0.01             21.2
synthetic k=80   32   G-means    80.0± 0.0      -1.10±0.16       1.00± 0.00              2.8
                      X-means   171.5±10.9      11.78±2.74       0.84± 0.01             53.3

Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters.

well at finding the correct k and maximizing the BIC statistic, so we don't show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center.
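The data-generation recipe above can be sketched as follows. This is our own helper, not the authors' code: the 0.5-2.0 range for the random per-axis scaling is an assumption (the paper does not give the exact range), and a QR decomposition of a Gaussian matrix supplies the random orthogonal (rotation) matrix:

```python
import numpy as np

def make_synthetic(k, d, n_total, sigma, seed=0):
    """Non-spherical Gaussian clusters, as in the experiments: centers drawn
    uniformly from the unit hypercube, at least 3*sigma apart, each cluster
    transformed by a random scaling and rotation."""
    rng = np.random.default_rng(seed)
    centers = []
    while len(centers) < k:                      # rejection-sample well-separated centers
        c = rng.uniform(size=d)
        if all(np.linalg.norm(c - c0) >= 3 * sigma for c0 in centers):
            centers.append(c)
    X, y = [], []
    per = n_total // k
    for j, c in enumerate(centers):
        pts = rng.normal(scale=sigma, size=(per, d))
        scale = np.diag(rng.uniform(0.5, 2.0, size=d))   # random scaling (assumed range)
        q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthogonal matrix
        X.append(pts @ scale @ q + c)
        y += [j] * per
    return np.vstack(X), np.array(y)
```

With σ small relative to the unit hypercube, the rejection loop terminates quickly even for k = 80.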
We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters.

Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.

Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.

3.1 Discovering true clusters in labeled data

We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels).
We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt. We define the partition quality (PQ) as

    PQ = \left( \sum_{i=1}^{k_t} \sum_{j=1}^{k_c} p(i,j)^2 \right) / \sum_{i=1}^{k_t} p(i)^2

where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3].
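The PQ metric transcribes directly into code; this is our own hypothetical helper, assuming label sequences of equal length with hashable label values:

```python
import numpy as np

def partition_quality(true_labels, cluster_labels):
    """PQ = sum_ij p(i,j)^2 / sum_i p(i)^2, with frequency-based
    probabilities estimated from the two label sequences."""
    n = len(true_labels)
    t_vals = sorted(set(true_labels))
    c_vals = sorted(set(cluster_labels))
    joint = np.zeros((len(t_vals), len(c_vals)))
    for t, c in zip(true_labels, cluster_labels):
        joint[t_vals.index(t), c_vals.index(c)] += 1.0 / n   # p(i, j)
    # denominator: marginal true-class probabilities p(i), squared and summed
    return (joint ** 2).sum() / (joint.sum(axis=1) ** 2).sum()
```

A clustering that matches the true partition exactly scores 1; over-splitting spreads each p(i, j) across many clusters and drives the numerator, and hence PQ, down.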
4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. 
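Pulling the pieces together, the whole wrapper can be sketched in a few dozen lines. This is a toy re-implementation under our own naming, not the authors' Matlab code: a bare-bones Lloyd's k-means stands in for a production implementation, the corrected Anderson-Darling statistic is restated so the block runs on its own, and splitting uses the deterministic principal-component rule m = s·sqrt(2λ/π):

```python
import numpy as np
from math import erf, sqrt

def ad_star(x):
    # corrected Anderson-Darling statistic A_*^2 (Equations 1-2 in the text)
    n = len(x)
    z = np.sort((x - x.mean()) / x.std())
    F = 0.5 * (1 + np.array([erf(v / sqrt(2)) for v in z]))
    i = np.arange(1, n + 1)
    a2 = -np.sum((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1]))) / n - n
    return a2 * (1 + 4 / n - 25 / n ** 2)

def lloyd(X, C, iters=100):
    # bare-bones k-means: returns refined centers and final assignments
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def g_means(X, critical=1.8692):   # critical value for alpha = 0.0001 (Section 2.2)
    C = X.mean(0, keepdims=True)
    while True:
        C, labels = lloyd(X, C)
        newC, grew = [], False
        for j in range(len(C)):
            Xj = X[labels == j]
            # deterministic split along the main principal component
            w, V = np.linalg.eigh(np.atleast_2d(np.cov(Xj.T)))
            m = V[:, -1] * sqrt(2 * w[-1] / np.pi)
            children, _ = lloyd(Xj, np.array([Xj.mean(0) + m, Xj.mean(0) - m]))
            v = children[0] - children[1]
            proj = Xj @ v / (v @ v)          # 1-d projection onto the split direction
            if ad_star(proj) <= critical:
                newC.append(C[j])            # accept H0: keep one center
            else:
                newC.extend(children)        # reject H0: keep the split
                grew = True
        C = np.array(newC)
        if not grew:
            return C
```

On two well-separated 2-d Gaussian clusters this loop splits once and then stops, returning two centers, mirroring the example of Section 2.2.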
References

[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.
same-paper 1 0.96694374 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
2 0.79959059 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
3 0.3728382 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
Author: Yoshua Bengio, Yves Grandvalet
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as con£rmed by numerical experiments. 1
4 0.33647087 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. 
If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. 
Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 
5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. 
Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. 
incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 
2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. 
While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and Δk (the change in the number of model parameters between two models). As n or Δk decreases, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problem-specific tuning and do not generalize as well as other methods, such as cross-validation.

3 Experiments

Table 1 shows the results from running G-means and X-means on many large synthetic datasets. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally

Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed.
dataset          d   method    k found        BIC(×10⁴)      distortion(×opt)  time(×k-means)
synthetic k=5    2   G-means    9.1±  9.9     -0.19± 2.70     0.89± 0.23        13.2
                     X-means   18.1±  3.2      0.70± 0.93     0.37± 0.12         2.8
synthetic k=20   2   G-means   20.1±  0.6      0.21± 0.18     0.99± 0.01         2.1
                     X-means   70.5± 11.6     14.83± 3.50     9.45±28.02         1.2
synthetic k=80   2   G-means   80.0±  0.2      1.84± 0.12     1.00± 0.01         2.2
                     X-means  171.7± 23.7     40.16± 6.59    48.49±70.04         1.8
synthetic k=5    8   G-means    5.0±  0.0     -0.74± 0.16     1.00± 0.00         4.6
                     X-means  *20.0±  0.0     -2.28± 0.20     0.47± 0.03        11.0
synthetic k=20   8   G-means   20.0±  0.1     -0.18± 0.17     0.99± 0.00         2.6
                     X-means  *80.0±  0.0     14.36± 0.21     0.47± 0.01         4.0
synthetic k=80   8   G-means   80.2±  0.5      1.45± 0.20     0.99± 0.00         2.9
                     X-means  229.2± 36.8     52.28± 9.26     0.57± 0.06         6.5
synthetic k=5   32   G-means    5.0±  0.0     -3.36± 0.21     1.00± 0.00         4.4
                     X-means  *20.0±  0.0    -27.92± 0.22     0.76± 0.00        29.9
synthetic k=20  32   G-means   20.0±  0.0     -2.73± 0.22     1.00± 0.00         2.3
                     X-means  *80.0±  0.0    -11.13± 0.23     0.76± 0.01        21.2
synthetic k=80  32   G-means   80.0±  0.0     -1.10± 0.16     1.00± 0.00         2.8
                     X-means  171.5± 10.9     11.78± 2.74     0.84± 0.01        53.3

Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters.

well at finding the correct k and maximizing the BIC statistic, so we do not show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2, 8, or 32 dimensions. The true values of k are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly at random from the unit hypercube, choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center.
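The data-generation recipe above can be sketched as follows. The specific σ, the rejection-sampling of well-separated means, and the form of the random linear map are assumptions for illustration; the paper does not give these details.

```python
import numpy as np

def make_dataset(k=5, d=2, n=5000, sigma=0.012, seed=0):
    """Sketch of the synthetic-data recipe: k cluster means drawn
    uniformly in the unit hypercube, re-drawn until no two are closer
    than 3*sigma, each cluster skewed by a random linear map."""
    rng = np.random.default_rng(seed)
    while True:  # rejection-sample well-separated means (assumed method)
        means = rng.uniform(size=(k, d))
        gaps = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
        if gaps[np.triu_indices(k, 1)].min() > 3 * sigma:
            break
    X, y = [], []
    for j, mu in enumerate(means):
        # random "scaling and rotation" map; exact construction is a guess
        A = np.eye(d) + 0.5 * rng.normal(size=(d, d))
        pts = rng.normal(scale=sigma, size=(n // k, d)) @ A + mu
        X.append(pts)
        y.extend([j] * (n // k))
    return np.vstack(X), np.array(y)
```

Each call yields one dataset of non-spherical clusters; repeating it 30 times with different seeds mirrors the experimental setup described above.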
We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses was formulated to maximize the likelihood for spherically distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters.

Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels.

Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many.

G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures that make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C.

3.1 Discovering true clusters in labeled data

We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and to measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels).
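For data of this dimensionality, a cheap Gaussian random projection in the spirit of [3] can be sketched as follows. The scaling of the projection matrix is an assumption (one common convention); the paper does not specify the construction it used.

```python
import numpy as np

def random_projection(X, target_dim, seed=0):
    """Project rows of X to target_dim via a random Gaussian matrix with
    i.i.d. N(0, 1/target_dim) entries, which roughly preserves pairwise
    distances with high probability (a sketch, not the authors' code)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(target_dim),
                   size=(X.shape[1], target_dim))
    return X @ R
```

Applying such a map to 784-dimensional digit images yields a much smaller representation whose inter-point distances are approximately preserved, which is what makes low-dimensional clustering afterwards reasonable.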
We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way.

We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt. We define the partition quality (PQ) as

    pq = [ Σ_{i=1}^{k_t} Σ_{j=1}^{k_c} p(i, j)² ] / [ Σ_{i=1}^{k_t} p(i)² ]

where k_t is the true number of classes, and k_c is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc. This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8].

For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057.

Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques; see [3].
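The PQ metric defined above translates directly into code, with p(i, j) estimated from label co-occurrence frequencies:

```python
import numpy as np

def partition_quality(true_labels, cluster_labels):
    """PQ: sum over (i, j) of p(i, j)^2, normalized by the sum over i
    of p(i)^2, with probabilities estimated from label frequencies."""
    t = np.asarray(true_labels)
    c = np.asarray(cluster_labels)
    ts, cs = np.unique(t), np.unique(c)
    joint = np.zeros((len(ts), len(cs)))
    for a, ti in enumerate(ts):
        for b, cj in enumerate(cs):
            joint[a, b] = np.mean((t == ti) & (c == cj))  # p(i, j)
    marg = joint.sum(axis=1)                              # p(i)
    return (joint ** 2).sum() / (marg ** 2).sum()
```

A perfect clustering (the true partition, possibly with renamed cluster labels) scores 1, while over-splitting a true class pushes the score below 1, which is why X-means's over-segmented clusterings score so poorly here.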
4 Discussion and conclusions

We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions.

Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction.

Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These "spectral" clustering algorithms still use k-means as a post-processing step to find the actual segmentation, and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering.
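Putting the pieces together, the wrapper loop described in this section can be sketched as below. This is a simplified reconstruction, not the authors' implementation: the split direction here is a crude per-coordinate-spread axis rather than the paper's principal-component/child-center projection, and the normality test is the corrected Anderson-Darling statistic with an assumed correction factor.

```python
import numpy as np
from math import erf, sqrt

def lloyd(X, centers, iters=50):
    """Plain Lloyd's algorithm: a minimal stand-in for k-means."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def looks_gaussian(x, critical=1.8692):
    """1-d normality check via the corrected Anderson-Darling statistic."""
    n = len(x)
    z = np.sort((x - x.mean()) / (x.std(ddof=1) + 1e-12))
    F = np.clip(np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z]), 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    A2 = -n - np.mean((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1])))
    return A2 * (1 + 4.0 / n - 25.0 / n ** 2) < critical

def g_means(X, max_k=32):
    """Sketch of the wrapper: split each center, project its points onto
    the split axis, and keep the split only if normality is rejected."""
    centers = X.mean(axis=0, keepdims=True).copy()
    while len(centers) < max_k:
        centers, labels = lloyd(X, centers)
        new = []
        for j, c in enumerate(centers):
            pts = X[labels == j]
            if len(pts) < 25:          # too few points to test; keep center
                new.append(c)
                continue
            axis = pts.std(axis=0)     # crude split axis (assumption)
            axis = axis / (np.linalg.norm(axis) + 1e-12)
            proj = (pts - c) @ axis
            if looks_gaussian(proj):
                new.append(c)          # accept H0: keep one center
            else:
                off = axis * proj.std()
                new.append(c + off)    # reject H0: keep the split
                new.append(c - off)
        if len(new) == len(centers):
            break
        centers = np.array(new)
    centers, _ = lloyd(X, centers)
    return centers
```

On two well-separated Gaussian blobs this loop splits once, then accepts H0 for both children and stops at k = 2, mirroring the worked example in Section 2.2.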
References

[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59-72, 1999.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143-151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913-937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51-57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435-475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928-934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21-30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727-734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730-737, September 1974.
5 0.28318965 30 nips-2003-Approximability of Probability Distributions
Author: Alina Beygelzimer, Irina Rish
Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1
6 0.26988825 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
7 0.21501049 115 nips-2003-Linear Dependent Dimensionality Reduction
8 0.2125362 85 nips-2003-Human and Ideal Observers for Detecting Image Curves
9 0.19091794 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
10 0.1760028 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
11 0.17203894 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
12 0.16863163 1 nips-2003-1-norm Support Vector Machines
13 0.16827828 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
14 0.1635965 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
15 0.16059279 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
16 0.16031781 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
17 0.15971811 100 nips-2003-Laplace Propagation
18 0.15855399 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
19 0.15698606 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
20 0.15689407 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
simIndex simValue paperId paperTitle
1 0.84421206 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
Author: Anton Schwaighofer, Marian Grigoras, Volker Tresp, Clemens Hoffmann
Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile user’s position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the user’s position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network. 1
same-paper 2 0.8166579 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
3 0.7906636 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
4 0.71579373 112 nips-2003-Learning to Find Pre-Images
Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces. 1
5 0.55792648 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
6 0.55116028 189 nips-2003-Tree-structured Approximations by Expectation Propagation
7 0.55081767 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
8 0.54605901 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
9 0.54597664 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.54581207 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
11 0.5452978 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
12 0.54518294 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
13 0.54508775 107 nips-2003-Learning Spectral Clustering
14 0.54496175 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
15 0.54474634 126 nips-2003-Measure Based Regularization
16 0.54433918 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
17 0.54340982 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
18 0.54182696 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
19 0.54182351 78 nips-2003-Gaussian Processes in Reinforcement Learning
20 0.54004115 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images