nips nips2011 nips2011-241 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Scalable Training of Mixture Models via Coresets Dan Feldman MIT Matthew Faulkner Caltech Andreas Krause ETH Zurich Abstract How can we train a statistical mixture model on a massive data set? [sent-1, score-0.116]
2 In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. [sent-2, score-0.704]
3 A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. [sent-3, score-1.388]
4 We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. [sent-4, score-0.747]
5 More precisely, we prove that a weighted set of O(dk³/ε²) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. [sent-5, score-0.148]
6 Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. [sent-6, score-0.781]
7 We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. [sent-8, score-0.165]
8 1 Introduction We consider the problem of training statistical mixture models, in particular mixtures of Gaussians and some natural generalizations, on massive data sets. [sent-9, score-0.233]
9 In contrast to parameter estimation for models with compact sufficient statistics, mixture models generally require inference over latent variables, which in turn depends on the full data set. [sent-11, score-0.135]
10 In this paper, we show that Gaussian mixture models (GMMs), and some generalizations, admit small coresets: A coreset is a weighted subset of the data which guarantees that models fitting the coreset will also provide a good fit for the original data set. [sent-12, score-1.527]
11 Perhaps surprisingly, we show that Gaussian mixtures admit coresets of size independent of the size of the data set. [sent-13, score-0.729]
12 In particular, we show how, given a data set D of n points in Rd, ε > 0 and k ∈ N, one can efficiently construct a weighted set C of O(dk³/ε²) points, such that for any mixture of k ε-semispherical Gaussians θ = [(w1, µ1, Σ1), . [sent-15, score-0.24]
13 , (wk , µk , Σk )] it holds that the log-likelihood ln P (D | θ) of D under θ is approximated by the (properly weighted) log-likelihood ln P (C | θ) of C under θ to arbitrary accuracy as ε → 0. [sent-18, score-0.15]
14 Thus solving the estimation problem on the coreset C (e. [sent-19, score-0.656]
15 Moreover, coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting (using space and update time per point of poly(dkε−1 log n log(1/δ))). [sent-24, score-0.805]
16 Existence and construction of coresets have been investigated for a number of problems in computational geometry (such as k-means and k-median) in many recent papers (cf. [sent-25, score-0.693]
17 In particular, we use our approach for density estimation for acceleration data, motivated by an application in earthquake detection using mobile phones. [sent-30, score-0.133]
18 , wk ≥ 0 are the mixture weights, Σi wi = 1, and µi and Σi are the mean and covariance of the i-th mixture component, which is modeled as a multivariate normal distribution N(x, µi, Σi) = (2π)^(−d/2) |Σi|^(−1/2) exp(−½ (x − µi)^T Σi^(−1) (x − µi)). [sent-44, score-0.312]
19 , the negative log-likelihood of the data is L(D | θ) = −Σj ln P(xj | θ), and we wish to obtain the maximum likelihood estimate (MLE) of the parameters θ∗ = argmin_{θ∈C} L(D | θ), where C is a set of constraints ensuring that degenerate solutions are avoided. [sent-49, score-0.187]
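For concreteness, the following small sketch (our own illustration, not from the paper; function and variable names are ours) evaluates the negative log-likelihood L(D | θ) of a GMM with a numerically stable log-sum-exp:

```python
# Minimal sketch: L(D | theta) = -sum_j ln P(x_j | theta) for a Gaussian mixture.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_neg_log_likelihood(X, weights, means, covs):
    """X: (n, d) data; weights: (k,); means: list of (d,); covs: list of (d, d)."""
    # log of w_i * N(x | mu_i, Sigma_i) for every point and every component
    log_terms = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    ], axis=1)                              # shape (n, k)
    log_px = logsumexp(log_terms, axis=1)   # ln P(x_j | theta) for each point
    return -log_px.sum()
```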
20 The above observations hold even for the case k = 1 and Σ = I, where the mixture θ consists of a single Gaussian, and the log-likelihood is the sum of squared distances to a point µ and an additive term. [sent-68, score-0.139]
21 We call a weighted data set C a (k, ε)-coreset for another (possibly weighted) set D ⊆ Rd , if for all mixtures θ ∈ C of k Gaussians it holds that (1 − ε)φ(D | θ) ≤ φ(C | θ) ≤ φ(D | θ)(1 + ε). [sent-76, score-0.169]
22 Figure 1: Illustration of the coreset construction for an example data set (a); panel (d) shows the final approximation B. [sent-94, score-0.76]
23 Solid squares are points sampled uniformly from remaining points, hollow squares are points selected in previous iterations. [sent-96, score-0.137]
24 Further, under the additional condition that all variances are sufficiently large (formally ∏_{λ∈spec(Σi)} λ ≥ 1/(2π)^d for all components i), the log-normalizer ln Z(θ) is negative, and consequently the coreset in fact provides a multiplicative (1 + ε) approximation to the log-likelihood, i. [sent-105, score-0.785]
25 Note that if we had access to a (k, ε)-coreset C, then we could reduce the problem of fitting a mixture model on D to one of fitting a model on C, since the optimal solution θC is a good approximation (in terms of log-likelihood) of θ∗ . [sent-109, score-0.121]
26 We show that, perhaps surprisingly, one can efficiently find coresets C of size independent of the size n of D, and with polynomial dependence on 1/ε, d and k. [sent-116, score-0.63]
27 In particular, suppose the data set is generated from a mixture of two spherical Gaussians (Σi = I) with weights w1 = 1/√n and w2 = 1 − 1/√n. [sent-119, score-0.135]
28 This example already suggests that, intuitively, in order to achieve small multiplicative error, we must devise a sampling scheme that adaptively selects representative points from all “clusters” present in the data set. [sent-123, score-0.134]
29 However, this suggests that obtaining a coreset requires solving a chicken-and-egg problem, where we need to understand the density of the data to obtain the coreset, but simultaneously would like to use the coreset for density estimation. [sent-124, score-1.376]
30 The key idea behind the coreset construction is that we can break the chicken-and-egg problem by first obtaining a rough approximation B of the clustering solution (using more than k components, but far fewer than n), and then using this solution to bias the random sampling. [sent-126, score-0.787]
31 This initial clustering is then used to sample the data points comprising coreset C according to probabilities which are roughly proportional to the squared distance to the set B. [sent-128, score-0.777]
32 The same general idea has been found successful in constructing coresets for geometric clustering problems such as k-means and k-median [4]. [sent-130, score-0.687]
33 The pseudocode for obtaining the approximation B, and for using it to obtain coreset C is given in Algorithm 1. [sent-131, score-0.683]
34 , (γ(x|C|), x|C|). D′ ← D; B ← ∅; while |D′| > 10dk ln(1/δ) do: Sample a set S of β = 10dk ln(1/δ) points uniformly at random from D′; Remove the |D′|/2 points x ∈ D′ closest to S (i. [sent-135, score-0.118]
35 In our experiments, we compare the performance of clustering on coresets constructed via adaptive sampling, vs. [sent-142, score-0.663]
36 By replacing B in the algorithm with a constant-factor approximation B′ (with |B′| = l) for the k-means problem, we can get a coreset C of size independent of n. [sent-145, score-0.683]
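The two-stage construction described above can be summarized in a simplified sketch (our own illustration, not the paper's Algorithm 1: the exact sampling probabilities, weights and constants differ, and `rough_approximation` and `sample_coreset` are hypothetical names):

```python
# Simplified sketch of the adaptive coreset construction (illustrative only).
import numpy as np

def rough_approximation(X, beta, rng):
    """Rough bicriteria-style approximation B: repeatedly sample beta points
    uniformly at random and discard the half of the remaining data closest to them."""
    B, remaining = [], X.copy()
    while len(remaining) > beta:
        S = remaining[rng.choice(len(remaining), size=beta, replace=False)]
        B.append(S)
        # squared distance of each remaining point to its nearest sampled point
        d2 = ((remaining[:, None, :] - S[None, :, :]) ** 2).sum(-1).min(axis=1)
        remaining = remaining[d2 > np.median(d2)]   # drop (roughly) the closest half
    B.append(remaining)
    return np.vstack(B)

def sample_coreset(X, B, m, rng):
    """Importance-sample m points with probability roughly proportional to the
    squared distance to B (mixed with a uniform term); weight by 1/probability."""
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1).min(axis=1)
    p = 0.5 * d2 / (d2.sum() + 1e-12) + 0.5 / len(X)
    p /= p.sum()                                    # guard against rounding
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])                    # inverse-probability weights
    return X[idx], weights
```

The mixture with a uniform term in the sampling probabilities is a common stabilization choice; the paper's actual probabilities and weights γ(x) are defined in Algorithm 1.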
37 There exist functions φ, ψ, and f such that, for any point x ∈ Rd and mixture model θ, ln P(x | θ) = −f_{φ(x)}(ψ(θ)) + Z(θ), where f_x̃(y) = −ln Σi wi exp(−Wi dist(x̃ − µi, si)²). [sent-152, score-0.235]
38 Hereby, φ is a function that maps a point x ∈ Rd into x̃ = φ(x) ∈ R2d, and ψ is a function that maps a mixture model θ into a tuple y = (s, w, µ, W), where w is a k-tuple of nonnegative weights w1, . [sent-153, score-0.201]
39 2 is that level sets of distances between points and subspaces are quadratic forms, and can thus represent level sets of the Gaussian probability density function (see Figure 2(a) for an illustration). [sent-161, score-0.155]
40 (b) Tree construction for generating coresets in parallel or from data streams. [sent-163, score-0.732]
41 , C7 are enumerated in the order in which they would be generated in the streaming case. [sent-168, score-0.144]
42 In the parallel case, C1 , C2 , C4 and C5 would be constructed in parallel, followed by parallel construction of C3 and C6 , finally resulting in C7 . [sent-169, score-0.206]
43 −ln Σi wi exp(−ηi) as an approximation upper-bounding the minimum min(η) = mini ηi, for ηi = Wi dist(x̃ − µi, si)² and η = [η1, . [sent-170, score-0.168]
44 The motivation behind this transformation is that it allows expressing the likelihood P(x | θ) of a data point x given a model θ in a purely geometric manner, as a soft-min over distances between points and subspaces in a transformed space. [sent-174, score-0.24]
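A tiny numerical check of this soft-min behaviour (all values are arbitrary, illustrative choices):

```python
# -ln sum_i w_i exp(-eta_i) is a soft minimum: it is at least min_i eta_i,
# and at most eta_j - ln(w_j) for every component j.
import numpy as np
from scipy.special import logsumexp

w   = np.array([0.5, 0.3, 0.2])      # arbitrary mixture weights
eta = np.array([4.0, 1.0, 9.0])      # arbitrary eta_i = W_i * dist(x~ - mu_i, s_i)^2
soft_min = -logsumexp(-eta, b=w)     # = -ln sum_i w_i exp(-eta_i)
print(soft_min, eta.min())           # ~2.12 vs 1.0: dominated by the smallest eta_i
```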
45 For semi-spherical Gaussians, it can be shown that the subspaces can be chosen as points while incurring a multiplicative error of at most 1/ε, and thus we recover the well-known k-means problem in the transformed space. [sent-176, score-0.143]
46 This insight suggests using a known coreset construction for k-means, adapted to the transformation employed. [sent-177, score-0.728]
47 We tackle this challenge by proving a generalized triangle inequality adapted to the exponential transformation, and employing the framework described in [4], which provides a general method for constructing coresets for clustering problems of the form min_s Σ_x̃ f_x̃(s). [sent-179, score-0.734]
48 As proved in [4], the key quantity that controls the size of a coreset is the pseudo-dimension of the functions Fd = {f_x̃ for x̃ ∈ R2d}. [sent-180, score-0.675]
49 1 is a new bound on the complexity of mixtures of k Gaussians in d dimensions proved in the supplemental material. [sent-183, score-0.148]
50 2 Streaming and Parallel Computation One major advantage of coresets is that they can be constructed in parallel, as well as in a streaming setting where data points arrive one by one, and it is impossible to remember the entire data set due to memory constraints. [sent-185, score-0.915]
51 The key insight is that coresets satisfy certain composition properties, which have previously been used by [6] for streaming and parallel construction of coresets for geometric clustering problems such as k-median and k-means. [sent-186, score-1.535]
52 In the following, we review how to exploit these properties for parallel and streaming computation. [sent-193, score-0.208]
53 In the streaming setting, we assume that points arrive one-by-one, but we do not have enough memory to remember the entire data set. [sent-195, score-0.279]
54 Thus, we wish to maintain a coreset over time, while keeping only a small subset of O(log n) coresets in memory. [sent-196, score-1.247]
55 There is a general reduction showing that a small coreset scheme for a given problem suffices to solve the corresponding problem on a streaming input [7, 6]. [sent-197, score-0.8]
56 The idea is to construct and save in memory a coreset for every block of poly(dk/ε) consecutive points arriving in a stream. [sent-198, score-0.751]
57 When we have two coresets in memory, we can merge them (resulting in a (k, ε)-coreset via property (1)), and compress by computing a single coreset from the merged coresets (via property (2)) to avoid increase in the coreset size. [sent-199, score-2.494]
58 An important subtlety arises: While merging two coresets (via property (1)) does not increase the approximation error, compressing a coreset (via property (2)) does increase the error. [sent-200, score-1.274]
59 A naive approach that merges and compresses immediately as soon as two coresets have been constructed can incur an exponential increase in approximation error. [sent-201, score-0.639]
60 Fortunately, it is possible to organize the merge-and-compress operations in a binary tree of height O(log n), where we need to store in memory only a single coreset for each level of the tree (thus requiring only poly(dkε⁻¹ log n) memory). [sent-202, score-0.705]
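A sketch of this merge-and-compress tree in the streaming setting (our own illustration; the class and method names are hypothetical, and `coreset_fn` stands in for a weighted coreset constructor such as the weighted version of Algorithm 1 described in the next sentence):

```python
# Streaming merge-and-compress tree: buckets[i] holds at most one coreset covering
# 2^i blocks of the stream, so only O(log n) coresets are kept in memory.
import numpy as np

class StreamingCoresets:
    def __init__(self, block_size, coreset_fn):
        self.block_size = block_size   # raw points per leaf block
        self.coreset_fn = coreset_fn   # (points, weights) -> (fewer points, weights)
        self.buffer = []               # raw points of the current, unfinished block
        self.buckets = []              # one optional coreset per tree level

    def add_point(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.block_size:
            X = np.asarray(self.buffer)
            self.buffer = []
            self._insert(self.coreset_fn(X, np.ones(len(X))), level=0)

    def _insert(self, cs, level):
        # carry upwards while a sibling at this level exists: merge, then compress
        while level < len(self.buckets) and self.buckets[level] is not None:
            sib = self.buckets[level]
            self.buckets[level] = None
            merged_pts = np.vstack([cs[0], sib[0]])
            merged_wts = np.concatenate([cs[1], sib[1]])
            cs = self.coreset_fn(merged_pts, merged_wts)   # compress the union
            level += 1
        if level == len(self.buckets):
            self.buckets.append(None)
        self.buckets[level] = cs

    def current(self):
        """Union of all stored coresets (merging alone does not add error).
        Assumes at least one full block has been processed."""
        parts = [c for c in self.buckets if c is not None]
        return (np.vstack([p[0] for p in parts]),
                np.concatenate([p[1] for p in parts]))
```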
61 In order to construct a coreset for the union of two (weighted) coresets, we use a weighted version of Algorithm 1, where we consider a weighted point as duplicate copies of a non-weighted point (possibly with fractional weight). [sent-204, score-0.784]
62 We summarize our streaming result in the following theorem. [sent-206, score-0.144]
63 Using the same ideas from the streaming model, a (nonparallel) coreset construction can be transformed into a parallel one. [sent-211, score-0.946]
64 We partition the data into sets, and compute coresets for each set, independently, on different computers in a cluster. [sent-212, score-0.613]
65 We then (in parallel) merge (via property (1)) two coresets, and compute a single coreset for every pair of such coresets (via property (2)). [sent-213, score-1.247]
66 This computation is also naturally suited for map-reduce [9] style computations, where the map tasks compute coresets for disjoint parts of D, and the reduce tasks perform the merge-and-compress operations. [sent-215, score-0.614]
67 3 Fitting a GMM on the Coreset using Weighted EM One approach, which we employ in our experiments, is to use a natural generalization of the EM algorithm, which takes the coreset weights into account. [sent-221, score-0.656]
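A minimal sketch of such a weighted EM update (our own implementation, not the paper's code; `weighted_em` is a hypothetical name, the sketch assumes d ≥ 2, uses a simple covariance regularizer, and omits the constraint set C and convergence checks):

```python
# Weighted EM for a GMM: E-step responsibilities are multiplied by the point
# weights before the M-step averages, so heavier coreset points count more.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def weighted_em(X, w, k, n_iter=100, reg=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * k)
    mix = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log responsibilities, normalized per point
        log_r = np.stack([np.log(mix[i]) +
                          multivariate_normal.logpdf(X, means[i], covs[i])
                          for i in range(k)], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r) * w[:, None]          # weight each point's contribution
        # M-step: weighted updates of mixture weights, means and covariances
        Nk = r.sum(axis=0) + 1e-12
        mix = Nk / Nk.sum()
        means = (r.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (r[:, i, None] * diff).T @ diff / Nk[i] + reg * np.eye(d)
    return mix, means, covs
```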
68 4 Extensions and Generalizations We now show how the connection between estimating the parameters for mixture models and problems in computational geometry can be leveraged further. [sent-230, score-0.141]
69 Our observations are based on the link between mixture of Gaussians and projective clustering (multiple subspace approximation) as shown in Lemma 3. [sent-231, score-0.163]
70 For simplicity, we generalized the coreset construction for the k-means problem, which required the assumption that the Gaussians are ε-semispherical. [sent-234, score-0.711]
71 However, several more complex coresets for projective clustering were suggested recently (cf. [sent-235, score-0.66]
72 Using the tools developed in this article, each such coreset implies a corresponding coreset for GMMs and generalizations. [sent-237, score-1.312]
73 As an example, the coresets for approximating points by lines [10] imply that we can construct small coresets for GMMs even if the smallest singular value of one of the corresponding covariance matrices is zero. [sent-238, score-1.293]
74 We compare the likelihood of the best model obtained on subsets C constructed by uniform sampling and by the adaptive coreset sampling procedure. [sent-248, score-0.785]
75 For example, replacing the squared distances by non-squared distances yields a coreset for mixtures of Laplace distributions. [sent-250, score-0.84]
76 at least 1 − δ it holds that for all θ ∈ Cε, φ(D | θ)(1 − ε) ≤ φ(C | θ) ≤ φ(D | θ)(1 + ε), where Z(θ) = Σi wi g(θi), φ(D | θ) = −Σ_{x∈D} ln Σ_{i=1}^k (wi/(Z(θ) g(θi))) exp(−½ ‖Σi^(−1/2)(x − µi)‖^q), using the normalizer g(θi) = ∫ exp(−½ ‖Σi^(−1/2)(x − µi)‖^q) dx. [sent-259, score-0.246]
77 5 Experiments We experimentally evaluate the effectiveness of using coresets of different sizes for training mixture models. [sent-260, score-0.709]
78 From the training set, we produce coresets and uniformly sampled subsets of sizes between 30 and 5000, using the parameters k = 10 (a cluster for each digit), β = 20 and δ = 0. [sent-266, score-0.652]
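For illustration, a toy end-to-end run that chains the sketches above on synthetic data (everything here is hypothetical; the actual experiment uses MNIST features and the parameters just quoted):

```python
# Toy end-to-end run of the earlier sketches (illustrative only, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
# 10 well-separated clusters in 10 dimensions, 500 points each
X = np.vstack([rng.normal(loc=5.0 * c, scale=1.0, size=(500, 10)) for c in range(10)])

B = rough_approximation(X, beta=20, rng=rng)   # rough clustering B
C, w = sample_coreset(X, B, m=500, rng=rng)    # weighted coreset of 500 points
mix, means, covs = weighted_em(C, w, k=10)     # fit the GMM on the coreset only
```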
79 Notice that coresets significantly outperform uniform samples of the same size, and even a coreset of 30 points performs very well. [sent-269, score-1.335]
80 Constructing the coreset and running EM on this size takes 7. [sent-271, score-0.656]
81 We also compare coresets and uniform sampling on a large dataset containing 319,209 records of rat hippocampal action potentials, measured by four co-located electrodes. [sent-275, score-0.664]
82 From the training set, we produce coresets and uniformly sampled subsets of sizes between 70 and 1000, using the parameters k = 33 (as in [11]), β = 66, and δ = 0. [sent-279, score-0.652]
83 Coreset GMMs obtain consistently higher LLH than uniform sample GMMs for sets of the same size, and even a coreset of 100 points performs very well. [sent-282, score-0.744]
84 Overall, training on coresets achieves approximately the same likelihood as training on the full set about 95 times faster (1. [sent-283, score-0.691]
85 Motivated by the limited storage on smart phones, we evaluate coresets on a data set of 40,000 accelerometer feature vectors, using the parameters k = 6, β = 12, and δ = 0. [sent-294, score-0.672]
86 Notice that on this data set, coresets show an even larger improvement over uniform sampling. [sent-297, score-0.651]
87 We hypothesize that this is due to the fact that the recorded accelerometer data is imbalanced, and contains clusters of vastly varying size, so uniform sampling does not represent smaller clusters well. [sent-298, score-0.134]
88 Overall, the coresets obtain a speedup of approximately 35 compared to training on the full set. [sent-299, score-0.634]
89 We also evaluate how GMMs trained on the coreset compare with the baseline GMMs in terms of anomaly detection performance. [sent-300, score-0.698]
90 Note that even very small coresets lead to performance comparable to training on the full set, drastically outperforming uniform sampling (Fig. [sent-303, score-0.707]
91 However, in contrast to our results, all the results described above crucially rely on the fact that the data set D is actually generated by a mixture of Gaussians. [sent-317, score-0.116]
92 The problem of fitting a mixture model with near-optimal log-likelihood for arbitrary data is studied by [3], who provides a PTAS for this problem. [sent-318, score-0.116]
93 Furthermore, none of the algorithms described above applies to the streaming or parallel setting. [sent-321, score-0.208]
94 Coresets also imply streaming algorithms for many of these problems [6, 1, 23, 8]. [sent-329, score-0.144]
95 7 Conclusion We have shown how to construct coresets for estimating parameters of GMMs and natural generalizations. [sent-331, score-0.611]
96 To our knowledge, our results provide the first rigorous guarantees for obtaining compressed ε-approximations of the log-likelihood of mixture models for large data sets. [sent-333, score-0.116]
97 The coreset construction relies on an intuitive adaptive sampling scheme, and can be easily implemented. [sent-334, score-0.746]
98 Unlike most of the related work, our coresets provide guarantees for any given (possibly unstructured) data, without assumptions on the distribution or model that generated it. [sent-336, score-0.591]
99 Maximum likelihood from incomplete data via the EM algorithm. [sent-418, score-0.12]
100 PAC learning axis-aligned mixtures of Gaussians with no separation assumption. [sent-445, score-0.222]
wordName wordTfidf (topN-words)
[('coreset', 0.656), ('coresets', 0.591), ('streaming', 0.144), ('gmms', 0.137), ('gaussians', 0.101), ('mixture', 0.094), ('mixtures', 0.093), ('ln', 0.075), ('dist', 0.07), ('wi', 0.066), ('em', 0.065), ('parallel', 0.064), ('csn', 0.059), ('construction', 0.055), ('feldman', 0.055), ('poly', 0.054), ('weighted', 0.054), ('generalizations', 0.052), ('earthquake', 0.051), ('dk', 0.051), ('gmm', 0.05), ('points', 0.05), ('clustering', 0.049), ('geometry', 0.047), ('hereby', 0.047), ('distances', 0.045), ('lold', 0.044), ('subspaces', 0.039), ('wk', 0.039), ('seismic', 0.039), ('normalizer', 0.039), ('accelerometer', 0.039), ('spec', 0.039), ('uniform', 0.038), ('xj', 0.037), ('supplemental', 0.036), ('sampling', 0.035), ('likelihood', 0.033), ('fx', 0.032), ('focs', 0.032), ('tting', 0.029), ('accelerometers', 0.029), ('chandy', 0.029), ('faulkner', 0.029), ('llh', 0.029), ('mle', 0.029), ('rd', 0.029), ('separation', 0.028), ('transformed', 0.027), ('multiplicative', 0.027), ('approximation', 0.027), ('stream', 0.026), ('tetrode', 0.026), ('phones', 0.026), ('db', 0.026), ('foundations', 0.025), ('memory', 0.025), ('geometric', 0.024), ('log', 0.024), ('mnist', 0.024), ('training', 0.024), ('ptas', 0.024), ('spheres', 0.024), ('constructing', 0.023), ('style', 0.023), ('andreas', 0.023), ('krause', 0.023), ('admit', 0.023), ('constructed', 0.023), ('mobile', 0.022), ('anomaly', 0.022), ('phone', 0.022), ('arbitrarily', 0.022), ('surprisingly', 0.022), ('data', 0.022), ('approximating', 0.022), ('stoc', 0.021), ('naive', 0.021), ('triangle', 0.021), ('polynomial', 0.021), ('density', 0.021), ('detection', 0.02), ('fd', 0.02), ('projective', 0.02), ('smart', 0.02), ('fitting', 0.02), ('construct', 0.02), ('arrive', 0.02), ('covariance', 0.019), ('full', 0.019), ('acceleration', 0.019), ('sampled', 0.019), ('suppose', 0.019), ('carrying', 0.019), ('proved', 0.019), ('uniformly', 0.018), ('perhaps', 0.018), ('remember', 0.018), ('inequality', 0.018), ('insight', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
2 0.2073745 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
3 0.067065917 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
4 0.06358619 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
5 0.057481255 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
6 0.05512967 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.048567139 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
8 0.0484199 186 nips-2011-Noise Thresholds for Spectral Clustering
9 0.047819179 198 nips-2011-On U-processes and clustering performance
10 0.045436576 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
11 0.042115655 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
12 0.041178402 4 nips-2011-A Convergence Analysis of Log-Linear Training
13 0.040815573 109 nips-2011-Greedy Model Averaging
14 0.040709604 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
15 0.039081469 54 nips-2011-Co-regularized Multi-view Spectral Clustering
16 0.038417425 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
17 0.038182676 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
18 0.037860241 104 nips-2011-Generalized Beta Mixtures of Gaussians
19 0.037657976 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
20 0.037127316 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
topicId topicWeight
[(0, 0.125), (1, -0.004), (2, -0.019), (3, -0.038), (4, -0.009), (5, -0.021), (6, -0.012), (7, -0.014), (8, -0.025), (9, -0.012), (10, -0.025), (11, 0.037), (12, 0.009), (13, -0.122), (14, 0.082), (15, 0.021), (16, -0.052), (17, -0.028), (18, -0.038), (19, -0.018), (20, 0.021), (21, 0.032), (22, 0.042), (23, 0.037), (24, 0.008), (25, 0.022), (26, 0.058), (27, 0.113), (28, 0.045), (29, 0.007), (30, -0.078), (31, 0.013), (32, 0.049), (33, 0.067), (34, 0.005), (35, 0.056), (36, 0.101), (37, -0.005), (38, 0.02), (39, -0.031), (40, -0.013), (41, -0.081), (42, -0.03), (43, 0.036), (44, 0.047), (45, -0.071), (46, -0.094), (47, 0.152), (48, 0.1), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.88906878 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
2 0.76607221 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
3 0.5170005 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
Author: Kumar Sricharan, Alfred O. Hero
Abstract: Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BPkNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1OkNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes.
4 0.51661235 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
Author: Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari
Abstract: We present a computationally efficient technique to compute the distance of highdimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. 1
5 0.50296164 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
6 0.49655175 64 nips-2011-Convergent Bounds on the Euclidean Distance
7 0.48138914 299 nips-2011-Variance Penalizing AdaBoost
8 0.46212909 217 nips-2011-Practical Variational Inference for Neural Networks
9 0.45740303 198 nips-2011-On U-processes and clustering performance
10 0.44146484 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies
11 0.43302104 186 nips-2011-Noise Thresholds for Spectral Clustering
12 0.43127036 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
13 0.43113953 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
14 0.42926347 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
15 0.38871887 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
16 0.38093242 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
17 0.3789846 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
18 0.37808099 263 nips-2011-Sparse Manifold Clustering and Embedding
19 0.37172785 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
20 0.36275405 282 nips-2011-The Fast Convergence of Boosting
topicId topicWeight
[(0, 0.024), (4, 0.045), (20, 0.047), (26, 0.033), (28, 0.023), (31, 0.369), (33, 0.022), (43, 0.063), (45, 0.088), (57, 0.036), (74, 0.048), (83, 0.034), (84, 0.012), (87, 0.011), (99, 0.049)]
simIndex simValue paperId paperTitle
1 0.99001503 225 nips-2011-Probabilistic amplitude and frequency demodulation
Author: Richard Turner, Maneesh Sahani
Abstract: A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequencydemodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings. 1
2 0.98858255 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
3 0.98616958 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
4 0.97701603 23 nips-2011-Active dendrites: adaptation to spike-based communication
Author: Balazs B. Ujfalussy, Máté Lengyel
Abstract: Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. 1
5 0.97011 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
6 0.95980525 240 nips-2011-Robust Multi-Class Gaussian Process Classification
same-paper 7 0.91440517 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.89943343 249 nips-2011-Sequence learning with hidden units in spiking neural networks
9 0.88222682 75 nips-2011-Dynamical segmentation of single trials from population neural data
10 0.88200229 229 nips-2011-Query-Aware MCMC
11 0.88013285 131 nips-2011-Inference in continuous-time change-point models
12 0.86936504 221 nips-2011-Priors over Recurrent Continuous Time Processes
13 0.86817378 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
14 0.85183537 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
15 0.85101104 87 nips-2011-Energetically Optimal Action Potentials
16 0.85028577 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
17 0.83873075 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
18 0.83537561 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
19 0.83361977 301 nips-2011-Variational Gaussian Process Dynamical Systems
20 0.83355349 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search