nips nips2013 nips2013-204 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State-of-the-art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. [sent-17, score-0.101]
2 It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. [sent-18, score-0.101]
3 We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. [sent-19, score-0.849]
4 A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. [sent-20, score-0.171]
5 State-of-the-art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. [sent-22, score-0.138]
6 Parsimonious models for such big data assume that the density in the ambient space concentrates around a lower-dimensional (possibly nonlinear) subspace. [sent-26, score-0.252]
7 A plethora of methods are emerging to estimate such lower-dimensional subspaces [1, 2]. [sent-27, score-0.08]
8 We are interested in using such lower-dimensional embeddings to obtain estimates of the conditional distribution of some target variable(s). [sent-28, score-0.101]
9 This conditional density estimation setting arises in a number of important application areas, including neuroscience, genetics, and video processing. [sent-29, score-0.202]
10 For example, one might desire automated estimation of a predictive density for a neurologic phenotype of interest, such as intelligence, on the basis of available patient data, including neuroimaging. [sent-30, score-0.276]
11 The challenge is to estimate the probability density function of the phenotype nonparametrically based on a 10^6-dimensional image of the subject’s brain. [sent-31, score-0.141]
12 It is crucial to avoid parametric assumptions on the density, such as Gaussianity, while allowing the density to change flexibly with predictors. [sent-32, score-0.101]
13 Otherwise, one can obtain misleading predictions and poorly characterize predictive uncertainty. [sent-33, score-0.127]
14 There is a rich machine learning and statistical literature on conditional density estimation of a response y ∈ Y given a set of features (predictors) x = (x1 , x2 , . [sent-34, score-0.202]
15 However, as the number of features increases, the problem of finding the best splitting attribute becomes intractable, so that CART, MARS and multiple tree models cannot be efficiently applied. [sent-47, score-0.085]
16 Similarly, mixture of experts becomes computationally demanding, since both mixture weights and dictionary densities are predictor dependent. [sent-48, score-0.293]
17 However, performing variable selection in high dimensions is effectively intractable: algorithms need to efficiently search for the best subsets of predictors to include in weight and mean functions within a mixture model, an NP-hard problem [16]. [sent-50, score-0.096]
18 In order to efficiently deal with massive datasets, we propose a novel multiscale approach which starts by learning a multiscale dictionary of densities. [sent-51, score-0.593]
19 This tree is efficiently learned in a first stage using a fast and scalable graph partitioning algorithm applied to the high-dimensional observations [17]. [sent-52, score-0.171]
20 Expressing the conditional densities f (y|x) for each x ∈ X as a convex combination of coarse-to-fine scale dictionary densities, the learning problem in the second stage estimates the corresponding multiscale probability tree. [sent-53, score-0.589]
21 This is accomplished in a Bayesian manner using a novel multiscale stick-breaking process, which allows the data to inform about the optimal bias-variance tradeoff; weighting coarse scale dictionary densities more highly decreases variance while adding to bias. [sent-54, score-0.488]
22 Note that M need not be a linear subspace of X ; rather, M could be, for example, a union of affine subspaces, or a smooth compact Riemannian manifold. [sent-62, score-0.11]
23 This is a much less restrictive assumption than the commonplace assumption in manifold learning that the marginal distribution fX concentrates around a low-dimensional latent space. [sent-65, score-0.166]
24 To provide some intuition for our model, we give the following concrete example, where the distribution of y ∈ R is a Gaussian function of the coordinate η ∈ M along the swissroll, which is embedded in a high-dimensional ambient space. [sent-66, score-0.171]
25 Specifically, we sample the manifold coordinate, η ∼ U (0, 1). [sent-67, score-0.092]
26 In particular, x lives on a swissroll embedded in a p-dimensional ambient space, but y is only a function of the coordinate η along the swissroll M. [sent-76, score-0.7]
27 The left panels of Figure 1 depict this example when µ(η) = η and σ(η) = η + 1. [sent-77, score-0.179]
28 The top left panel shows the manifold M (a swissroll) embedded in a p-dimensional ambient space, where the color indicates the coordinate along the manifold, η (only the first 3 dimensions are shown for visualization purposes). [sent-79, score-0.263]
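To make the toy construction above concrete, the following is a minimal sketch (not the authors' code) of how such data could be generated. The swissroll parametrization, the noise level, and the use of isotropic Gaussian noise in all p ambient coordinates are illustrative assumptions; µ(η) = η and σ(η) = η + 1 follow the example in the text.

```python
import numpy as np

def sample_swissroll_example(n=500, p=1000, noise=0.1, seed=0):
    """Toy generator: y depends only on the manifold coordinate eta, while x lies
    (up to noise) on a swissroll embedded in a p-dimensional ambient space.
    The parametrization and noise model are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    eta = rng.uniform(0.0, 1.0, size=n)        # manifold coordinate, eta ~ U(0, 1)
    t = 3.0 * np.pi * (1.0 + 2.0 * eta)        # a common swissroll parametrization
    x = np.zeros((n, p))
    x[:, 0] = t * np.cos(t)                    # the roll lives in the first two coordinates
    x[:, 1] = t * np.sin(t)
    x += noise * rng.standard_normal((n, p))   # small Gaussian noise in every ambient dimension
    y = rng.normal(eta, eta + 1.0)             # y | eta ~ N(mu(eta), sigma(eta)^2), mu = eta, sigma = eta + 1
    return x, y, eta
```

Calling sample_swissroll_example(n=500, p=1000) reproduces the qualitative setup of Figure 1, with only the first two ambient coordinates carrying the manifold structure.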
29 The middle and right panels show our estimates of fY |η at scales 3 and 4, respectively, which follow from partitioning our data. [sent-81, score-0.256]
30 To our knowledge, no extant approach for estimating conditional densities or posteriors thereof satisfies even our first criterion. [sent-90, score-0.282]
31 Ms. Deeds Framework: We propose here a general modular approach which we refer to as multiscale dictionary learning for estimating conditional distributions (“Ms. Deeds”). [sent-93, score-0.439]
32 Ms. Deeds consists of two components: (i) a tree decomposition of the space, and (ii) an assumed form of the conditional probability model. [sent-96, score-0.222]
33 Tree Decomposition: A tree decomposition τ yields a multiscale partition of the data or the ambient space in which the data live. [sent-97, score-0.534]
34 A partition tree τ of W consists of a collection of cells, τ = {Cj,k }j∈Z,k∈Kj . [sent-102, score-0.131]
35 At each scale j, the set of cells Cj = {Cj,k }k∈Kj provides a disjoint partition of W almost everywhere. [sent-103, score-0.091]
36 [18] developed a multiscale measure estimation strategy, and proved that there exists a scale j such that the approximate measure is within some bound of the true measure, under certain relatively general assumptions. [sent-111, score-0.344]
37 We decided to simply partition the x’s, ignoring the y’s in the partitioning strategy. [sent-112, score-0.132]
38 In such cases, we do not want to bias the partitioning to any specific y’s, all the more so when new unknown y’s may later emerge. [sent-115, score-0.086]
39 Second, because the x’s are so much higher dimensional than the y’s in our applications of interest, the partitions would be dominated by the x’s, unless we chose a partitioning strategy that emphasized the y’s. [sent-116, score-0.157]
40 Specifically, we employ METIS [17], a well-known and relatively efficient multiscale partitioning algorithm with demonstrably good empirical performance on a wide range of graphs. [sent-121, score-0.385]
41 Given data points xi = (xi1 , . . . , xip )T ∈ X for i ∈ [n], the graph construction follows via computing all pairwise distances using ρ(xu , xv ) = ∥x̃u − x̃v ∥2 , where x̃ denotes the whitened x. [sent-127, score-0.139]
42 Applying METIS recursively on the graph constructed in this way yields a single tree (see supplementary material for further details). [sent-131, score-0.085]
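As a rough illustration of the recursive bipartitioning (and not the authors' actual pipeline), the sketch below builds the multiscale cells by recursively splitting the whitened data. The bipartition callback stands in for a METIS cut of the pairwise-distance graph; the toy median_split, the min_size stopping rule, and the coordinate-wise whitening are all assumptions made so that the sketch runs on its own.

```python
import numpy as np

def whiten(X):
    """Coordinate-wise whitening (zero mean, unit variance per feature); a simple
    stand-in for whatever whitening transform is actually used."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def median_split(X, idx):
    """Toy bipartition: split on the highest-variance feature at its median.
    In practice this is where a METIS cut of the distance graph would go."""
    j = int(np.argmax(X[idx].var(axis=0)))
    thr = np.median(X[idx, j])
    return idx[X[idx, j] <= thr], idx[X[idx, j] > thr]

def build_partition_tree(X, bipartition=median_split, min_size=20, depth=0, idx=None, tree=None):
    """Recursively bipartition the samples; tree[j] lists the index sets of the cells at scale j."""
    if idx is None:
        idx = np.arange(X.shape[0])
    if tree is None:
        tree = {}
    tree.setdefault(depth, []).append(idx)
    if len(idx) <= min_size:
        return tree
    left, right = bipartition(X, idx)
    if len(left) == 0 or len(right) == 0:      # degenerate cut: stop refining this cell
        return tree
    build_partition_tree(X, bipartition, min_size, depth + 1, left, tree)
    build_partition_tree(X, bipartition, min_size, depth + 1, right, tree)
    return tree
```

A call such as build_partition_tree(whiten(x)) returns, at every scale j, a disjoint partition of the sample indices, mirroring the cells Cj = {Cj,k} described above.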
43 Conditional Probability Model: Given the tree decomposition of the data, we place a nonparametric prior over the tree. [sent-132, score-0.152]
44 Specifically, we define fY |X as fY |X (y|x) = Σj∈Z πj,kj (x) fj,kj (x) (y|x) (1), where kj (x) is the set at scale j to which x has been allocated and πj,kj (x) are weights across scales such that Σj∈Z πj,kj (x) = 1. [sent-133, score-0.183]
45 We refer to this prior as a multiscale stick-breaking process. [sent-140, score-0.255]
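For intuition, Equation (1) can be evaluated along a single root-to-leaf path as sketched below. The construction πj = Vj ∏_{l<j} (1 − Vl) with Vj,k ∼ Beta(1, α), and the final stick set to 1 so that the truncated weights sum to one, is the usual stick-breaking form and is assumed here, as is the Gaussian form of the dictionary densities introduced later in the text.

```python
import numpy as np
from scipy.stats import norm

def stick_breaking_weights(V_path):
    """pi_j = V_j * prod_{l<j} (1 - V_l) along a root-to-leaf path.
    Setting the final V to 1 truncates the process so the weights sum to one."""
    V = np.asarray(V_path, dtype=float)
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

def conditional_density(y, V_path, mu_path, sd_path):
    """f(y | x) = sum_j pi_{j,k_j(x)} * N(y; mu_{j,k_j(x)}, sd_{j,k_j(x)}^2),
    where the path (k_j(x))_j is obtained by dropping x down the partition tree."""
    pi = stick_breaking_weights(V_path)
    return float(np.sum(pi * norm.pdf(y, np.asarray(mu_path), np.asarray(sd_path))))
```

Here V_path, mu_path, and sd_path hold, for each scale j, the quantities attached to the cell that contains x.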
46 Thus, by adopting this Bayesian formulation, we are able to obtain posterior estimates for any newly observed data, regardless of the amount and variability of training data. [sent-142, score-0.088]
47 Moreover, the family can adapt with j or k, being more complex at the coarser scales (for which nj,k ’s are larger), and simpler for the finer scales (or partitions with fewer samples). [sent-151, score-0.138]
48 We let the family of conditional densities for y be Gaussian for simplicity, that is, we assume that fj,k = N (µj,k , σj,k ) with µj,k ∈ R and σj,k ∈ R+ . [sent-152, score-0.206]
49 Because we are interested in posteriors over the conditional distribution fY |X , we place relatively uninformative but conjugate priors on µj,k and σj,k ; specifically, assuming the y’s have been whitened and are unidimensional, µj,k ∼ N (0, 1) and σj,k ∼ IG(a, b). [sent-153, score-0.251]
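Under these priors the full conditionals for a single cell are available in closed form; a minimal Gibbs-step sketch is given below. It is not the authors' sampler: treating σj,k as a variance and the shape/scale parametrization of IG(a, b) are assumptions of this sketch.

```python
import numpy as np

def gibbs_update_cell(y_cell, mu, sigma2, a=3.0, b=1.0, rng=None):
    """One Gibbs sweep for the Gaussian dictionary density of one cell, with priors
    mu ~ N(0, 1) and sigma^2 ~ IG(a, b); y_cell holds the (whitened) responses
    currently allocated to this cell."""
    rng = np.random.default_rng(rng)
    n = len(y_cell)
    # mu | sigma^2, y ~ N(m_n, v_n): prior precision 1 plus data precision n / sigma^2
    v_n = 1.0 / (1.0 + n / sigma2)
    m_n = v_n * (np.sum(y_cell) / sigma2)
    mu = rng.normal(m_n, np.sqrt(v_n))
    # sigma^2 | mu, y ~ IG(a + n/2, b + 0.5 * sum((y - mu)^2)), drawn as 1 / Gamma
    shape = a + 0.5 * n
    rate = b + 0.5 * np.sum((np.asarray(y_cell) - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    return mu, sigma2
```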
50 Inference: We introduce a latent variable for each observation i ∈ [n], taking values in Z and denoting the multiscale level used by the ith observation. [sent-156, score-0.319]
51 Parameters (a, b) and α, involved in the prior densities of the σj,k ’s and Vj,k ’s, were set to (3, 1) and 1, respectively. [sent-164, score-0.101]
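A sketch of the corresponding allocation step for the latent level of one observation is given below; it assumes the stick-breaking weights and Gaussian dictionary densities from the sketches above (restated so the snippet is self-contained), and the exact blocking used in the paper's sampler may differ.

```python
import numpy as np
from scipy.stats import norm

def sample_level(y_i, V_path, mu_path, sd_path, rng=None):
    """Resample the multiscale level used by observation i: the probability of level j
    is proportional to pi_{j,k_j(x_i)} * N(y_i; mu_{j,k_j(x_i)}, sd_{j,k_j(x_i)}^2)."""
    rng = np.random.default_rng(rng)
    V = np.asarray(V_path, dtype=float)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))   # stick-breaking weights
    probs = pi * norm.pdf(y_i, np.asarray(mu_path), np.asarray(sd_path))
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))
```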
52 , p}, and a uniform distribution over the latent manifold η ∼ sin(U (0, c)). [sent-170, score-0.127]
53 (2) Swissroll: We then return to the swissroll example of Figure 1; in Figure 3 we show results for (µ, σ) = (η, 1). [sent-174, score-0.249]
54 (4) Union of Linear Subspaces: This model is a direct extension of the linear subspace model, as it is a union of subspaces. [sent-181, score-0.11]
55 4 Neuroscience Applications: We assessed the predictive performance of the proposed method on two very different neuroimaging datasets. [sent-192, score-0.109]
56 In the first experiment we investigated the extent to which we could predict creativity (as measured via the Composite Creativity Index [22]) from a structural connectome dataset collected at the Mind Research Network (data were collected as described in Jung et al. [sent-197, score-0.264]
57 The second dataset comes from a resting-state functional magnetic resonance imaging experiment as part of the Autism Brain Imaging Data Exchange [26]. [sent-202, score-0.087]
58 To ensure the existence of a nonlinear signal relating these predictors to the response, we let yi correspond to an estimate of overall head motion in the scanner, called mean framewise displacement (FD), computed as described in Power et al. [sent-206, score-0.082]
59 To obtain mean-squared error estimates from MSB, we use the posterior mean as a point estimate (the comparison algorithms do not generate posterior predictions, only point estimates). [sent-211, score-0.106]
60 Results (Illustrative Example): The middle and right panels of Figure 1 depict the quality of partitioning and density estimation for the swissroll example described in §2, with the ambient dimension p = 1000 and the predictive manifold dimension d = 1. [sent-218, score-0.899]
61 At scale 3 we have 4 partitions, and at scale 4 we have 8 (note that the partition tree, in general, need not be binary). [sent-220, score-0.136]
62 The top panels are color coded to indicate which xi ’s fall into which partition. [sent-221, score-0.157]
63 The bottom panels show the resulting estimate of the posteriors at the two scales. [sent-223, score-0.228]
64 These posteriors are piecewise constant, as they are invariant to the manifold coordinate within a given partition. [sent-224, score-0.197]
65 Figure 2 shows the estimated density of two observations of model (1) with parameters (µ1 , σ1 ) = (−2, 1), (µ2 , σ2 ) = (2, 1), σx = 0. [sent-226, score-0.101]
66 Posteriors of the conditional density fY |X were computed for various sample sizes. [sent-228, score-0.202]
67 Figure 2 suggests that our estimate of fY |X approaches the true density as the number of observations in the training set increases. [sent-229, score-0.101]
68 We are unable to compare our strategy for posterior estimation to previous literature because we are unaware of previous Bayesian approaches for this problem that scale up to problems of this size. [sent-230, score-0.127]
69 Therefore, we numerically compare the performance of our point estimates (which we define as the posterior mean of fY |X ) with the predictions of the competitor algorithms. [sent-231, score-0.141]
70 Note that these three simulations span a wide range of models, including nonlinear smooth manifolds such as the swissroll
[Figure 2 panel labels: n = 100, n = 150; f(y|η = −0.5)]
Figure 2: Illustrative example of model (1) suggesting that our posterior estimates of the conditional density are converging as n increases even when fY |η is highly nonlinear and fX|η is very high-dimensional. [sent-246, score-0.304]
72 True (red) and estimated (black) density (50th percentile: solid line, 2. [sent-247, score-0.101]
73 5th percentiles: dashed lines) for two data positions along the manifold (top panels: η ≈ −0. [sent-249, score-0.092]
74 (model 2), relatively simple linear subspace manifolds (model 3), and a union of linear subspaces model (model 4; which is neither linear nor a manifold). [sent-258, score-0.234]
75 In terms of predictive accuracy, the top panels show that for all three simulations, in every dimensionality that we considered—including p = 0. [sent-259, score-0.202]
76 Note that this is the case even though MSB provides much more information about the posterior fY |X , yielding an entire posterior over fY |X , rather than merely a point estimate. [sent-261, score-0.106]
77 Thus, it is clear from these simulations that MSB has better scaling properties—in terms of both predictive accuracy and computational time—than the competitor methods. [sent-266, score-0.166]
78 Top plots depict the relative mean-squared error of MSB (our approach) versus CART (red), Lasso (black), and PC regression (green), as a function of the ambient dimension of x. [sent-268, score-0.208]
79 The three simulation scenarios are: swissroll (left), linear subspaces (middle), union of linear subspaces (right). [sent-270, score-0.509]
80 MSB outperforms CART, Lasso, and PC regression in all three scenarios regardless of ambient dimension (rmse < 1 for all p). [sent-271, score-0.186]
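For reference, one natural reading of the relative error reported here is sketched below: MSB's mean-squared prediction error divided by a competitor's, so that values below 1 favor MSB. The exact normalization used in the paper is not restated in this summary, so the direction of the ratio is an assumption.

```python
import numpy as np

def relative_mse(y_true, y_pred_msb, y_pred_competitor):
    """Relative mean-squared error of MSB against one competitor; values < 1 favor MSB."""
    mse_msb = np.mean((np.asarray(y_true) - np.asarray(y_pred_msb)) ** 2)
    mse_cmp = np.mean((np.asarray(y_true) - np.asarray(y_pred_competitor)) ** 2)
    return mse_msb / mse_cmp
```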
81 Squared error predictive accuracy per subject (using leave-one-out) was computed. [sent-278, score-0.11]
82 We compare multiscale stick-breaking (MSB), CART, Lasso, random forest (RF), and PC regression. [sent-282, score-0.255]
83 MSB outperforms all the competitors in terms of predictive accuracy and scalability. [sent-283, score-0.125]
84 For the creativity application, p was relatively small, “merely” 2,415, so we could run Lasso, CART, and random forests (RF) [30]. [sent-320, score-0.248]
85 For both applications, MSB yielded improved predictive accuracy over all competitors. [sent-322, score-0.08]
86 Although CART and Lasso were faster than MSB on the relatively low-dimensional predictor example (creativity), their computational scaling was poor, such that CART yielded a memory fault on the higher-dimensional case, and Lasso required substantially more time than MSB. [sent-323, score-0.077]
87 6 Discussion: In this work we have introduced a general formalism to estimate conditional distributions via multiscale dictionary learning. [sent-324, score-0.439]
88 We considered simulations and real-data examples where the dimensionality of the predictor space approached one million. [sent-326, score-0.078]
89 To our knowledge, no other approach to learn conditional distributions can run at this scale. [sent-327, score-0.101]
90 Because a fully Bayesian strategy remains computationally intractable at this scale, we developed an empirical Bayes approach, estimating the partition tree based on the data, but integrating over scales and posteriors. [sent-330, score-0.236]
91 This improvement was demonstrated when M was a swissroll, a latent subspace, a union of latent subspaces, and real data (for which the latent space may not even exist). [sent-332, score-0.156]
92 Indeed, while multiscale methods benefit from a rich theoretical foundation [2], the relative advantages and disadvantages of a fully Bayesian approach, in which one can estimate posteriors over all functionals of fY |X at all scales, remain relatively unexplored. [sent-334, score-0.375]
93 Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. [sent-377, score-0.255]
94 Fast kernel conditional density estimation: a dual-tree Monte Carlo approach. [sent-386, score-0.202]
95 Regression density estimation with variational methods and stochastic approximation. [sent-402, score-0.101]
96 Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts. [sent-410, score-0.169]
97 Bayesian density regression with logistic Gaussian process and subspace projection. [sent-446, score-0.199]
98 A fast and high quality multilevel scheme for partitioning irregular graphs. [sent-459, score-0.086]
99 A fast multiscale framework for data in high-dimensions: Measure estimation, anomaly detection, and compressive measurements. [sent-466, score-0.255]
100 Principles of diffusion tensor imaging and its applications to basic neuroscience research. [sent-513, score-0.134]
wordName wordTfidf (topN-words)
[('msb', 0.521), ('fy', 0.362), ('multiscale', 0.255), ('swissroll', 0.249), ('cart', 0.221), ('creativity', 0.204), ('panels', 0.122), ('ambient', 0.112), ('densities', 0.105), ('density', 0.101), ('conditional', 0.101), ('lasso', 0.099), ('manifold', 0.092), ('pc', 0.089), ('partitioning', 0.086), ('tree', 0.085), ('dictionary', 0.083), ('fx', 0.08), ('predictive', 0.08), ('ig', 0.08), ('subspaces', 0.08), ('posteriors', 0.076), ('connectomes', 0.068), ('deeds', 0.068), ('cpu', 0.064), ('bayesian', 0.063), ('kj', 0.062), ('connectome', 0.06), ('subspace', 0.059), ('neuroscience', 0.058), ('depict', 0.057), ('automated', 0.055), ('pipeline', 0.054), ('posterior', 0.053), ('union', 0.051), ('nonlinear', 0.049), ('simulation', 0.049), ('scales', 0.048), ('diffusion', 0.048), ('vogelstein', 0.047), ('predictions', 0.047), ('partition', 0.046), ('competitors', 0.045), ('dunson', 0.045), ('chavez', 0.045), ('grazioplene', 0.045), ('simulations', 0.045), ('scale', 0.045), ('mri', 0.044), ('magnetic', 0.044), ('relatively', 0.044), ('resonance', 0.043), ('partitions', 0.042), ('competitor', 0.041), ('xv', 0.04), ('metis', 0.04), ('phenotype', 0.04), ('gurable', 0.04), ('nott', 0.04), ('sampler', 0.039), ('concentrates', 0.039), ('rf', 0.039), ('regression', 0.039), ('carolina', 0.037), ('wavelets', 0.037), ('movement', 0.036), ('mixture', 0.036), ('decomposition', 0.036), ('xi', 0.035), ('cj', 0.035), ('latent', 0.035), ('asso', 0.035), ('regardless', 0.035), ('predictor', 0.033), ('durham', 0.033), ('jung', 0.033), ('yi', 0.033), ('mse', 0.033), ('fw', 0.031), ('lives', 0.031), ('nonparametric', 0.031), ('rp', 0.031), ('predictors', 0.031), ('subject', 0.03), ('whitened', 0.03), ('bottom', 0.03), ('embedded', 0.03), ('xu', 0.029), ('xr', 0.029), ('north', 0.029), ('neuroimaging', 0.029), ('coordinate', 0.029), ('strategy', 0.029), ('variable', 0.029), ('exibly', 0.028), ('genetics', 0.028), ('tensor', 0.028), ('intractable', 0.028), ('ii', 0.028), ('allocated', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
2 0.089834429 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.087278731 109 nips-2013-Estimating LASSO Risk and Noise Level
Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari
Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1
4 0.084797733 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
5 0.072845198 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
6 0.072001174 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
7 0.071354754 350 nips-2013-Wavelets on Graphs via Deep Learning
8 0.069700502 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
9 0.068609729 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
10 0.067959294 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
11 0.066800892 256 nips-2013-Probabilistic Principal Geodesic Analysis
12 0.066780061 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
13 0.065541424 137 nips-2013-High-Dimensional Gaussian Process Bandits
14 0.064952418 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
15 0.062630378 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
16 0.062315699 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
17 0.061324932 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
18 0.060432289 91 nips-2013-Dirty Statistical Models
19 0.060286183 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
20 0.059334323 355 nips-2013-Which Space Partitioning Tree to Use for Search?
topicId topicWeight
[(0, 0.19), (1, 0.08), (2, -0.008), (3, 0.032), (4, -0.004), (5, 0.081), (6, 0.069), (7, 0.012), (8, -0.012), (9, -0.013), (10, 0.016), (11, -0.007), (12, -0.021), (13, -0.034), (14, -0.013), (15, -0.005), (16, 0.043), (17, -0.093), (18, -0.0), (19, -0.022), (20, 0.018), (21, 0.096), (22, 0.003), (23, 0.005), (24, 0.044), (25, 0.033), (26, -0.009), (27, 0.03), (28, 0.009), (29, -0.0), (30, 0.018), (31, -0.007), (32, 0.076), (33, -0.003), (34, -0.007), (35, 0.114), (36, 0.072), (37, -0.048), (38, -0.024), (39, -0.01), (40, 0.014), (41, 0.002), (42, -0.057), (43, 0.011), (44, -0.084), (45, -0.036), (46, -0.066), (47, -0.048), (48, 0.038), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.90827364 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
2 0.62798971 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.61100066 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
Author: Jeffrey W. Miller, Matthew T. Harrison
Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1
4 0.60733992 4 nips-2013-A Comparative Framework for Preconditioned Lasso Algorithms
Author: Fabian L. Wauthier, Nebojsa Jojic, Michael Jordan
Abstract: The Lasso is a cornerstone of modern multivariate data analysis, yet its performance suffers in the common situation in which covariates are correlated. This limitation has led to a growing number of Preconditioned Lasso algorithms that pre-multiply X and y by matrices PX , Py prior to running the standard Lasso. A direct comparison of these and similar Lasso-style algorithms to the original Lasso is difficult because the performance of all of these methods depends critically on an auxiliary penalty parameter λ. In this paper we propose an agnostic framework for comparing Preconditioned Lasso algorithms to the Lasso without having to choose λ. We apply our framework to three Preconditioned Lasso instances and highlight cases when they will outperform the Lasso. Additionally, our theory reveals fragilities of these algorithms to which we provide partial solutions. 1
5 0.60611248 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
Author: Benigno Uria, Iain Murray, Hugo Larochelle
Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of onedimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradientbased optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case. 1
6 0.60524023 256 nips-2013-Probabilistic Principal Geodesic Analysis
7 0.60186851 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
8 0.60072356 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
9 0.59542036 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
10 0.59044003 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
11 0.57996941 344 nips-2013-Using multiple samples to learn mixture models
12 0.57655102 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
13 0.57358479 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
14 0.56696248 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
15 0.56532621 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
17 0.53907979 332 nips-2013-Tracking Time-varying Graphical Structure
18 0.53745908 167 nips-2013-Learning the Local Statistics of Optical Flow
19 0.53359997 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
20 0.51622319 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
topicId topicWeight
[(2, 0.011), (16, 0.394), (33, 0.125), (34, 0.121), (41, 0.018), (49, 0.021), (56, 0.089), (70, 0.031), (85, 0.033), (89, 0.036), (93, 0.035)]
simIndex simValue paperId paperTitle
1 0.97508597 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes
Author: Abbas Edalat
Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean filed equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored pattens and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1 Introduction: Multiply learned patterns in Hopfield networks The Hopfield network as a model of associative memory and unsupervised learning was introduced in [23] and has been intensively studied from a wide range of viewpoints in the past thirty years. However, properties of a strong pattern, as a pattern that has been multiply stored or learned in these networks, have only been examined very recently, a surprising delay given that repetition of an activity is the basis of learning by the Hebbian rule and long term potentiation. In particular, while the storage capacity of a Hopfield network with certain correlated patterns has been tackled [13, 25], the storage capacity of a Hopfield network in the presence of strong as well as random patterns has not been hitherto addressed. The notion of a strong pattern of a Hopfield network has been proposed in [15] to model attachment types and behavioural prototypes in developmental psychology and psychotherapy. This suggestion has been motivated by reviewing the pioneering work of Bowlby [9] in attachment theory and highlighting how a number of academic biologists, psychiatrists, psychologists, sociologists and neuroscientists have consistently regarded Hopfield-like artificial neural networks as suitable tools to model cognitive and behavioural constructs as patterns that are deeply and repeatedly learned by individuals [11, 22, 24, 30, 29, 10]. A number of mathematical properties of strong patterns in Hopfield networks, which give rise to strong attractors, have been derived in [15]. These show in particular that strong attractors are strongly stable; a series of experiments have also been carried out which confirm the mathematical 1 results and also indicate that a strong pattern stored in the network can be retrieved even in the presence of a large number of simple patterns, far exceeding the well-known maximum load parameter or storage capacity of the Hopfield network with random patterns (αc ≈ 0.138). In this paper, we consider strong patterns in stochastic Hopfield model with temperature, which accounts for various types of noise in the network. In these networks, the updating rule is probabilistic and depend on the temperature. 
Since analytical solution of such a system is not possible in general, one strives to obtain the average behaviour of the network when the input to each node, the so-called field at the node, is replaced with its mean. This is the basis of mean field theory for these networks. Due to the close connection between the Hopfield network and the Ising model in ferromagnetism [1, 8], the mean field approach for the Hopfield network and its variations has been tackled using the replica method, starting with the pioneering work of Amit, Gutfreund and Sompolinsky [3, 2, 4, 19, 31, 1, 13]. Although this method has been widely used in the theory of spin glasses in statistical physics [26, 16] its mathematical justification has proved to be elusive as we will discuss in the next section; see for example [20, page 264], [14, page 27], and [7, page 9]. In [17] and independently in [27], an alternative technique to the replica method for solving the mean field equations has been proposed which is reproduced and characterised as heuristic in [20, section 2.5] since it relies on a number of assumptions that are not later justified and uses a number of mathematical steps that are not validated. Here, we use the basic idea of the above heuristic to develop a verifiable mathematical framework with provable results grounded on elements of probability theory, with which we assume the reader is familiar. This technique allows us to solve the mean field equations for the Hopfield network in the presence of strong patterns and use the results to study, first, the stability of these patterns in the presence of temperature (noise) and, second, the storage capacity of the network with a single strong pattern at temperature zero. We show that the critical temperature for the stability of a strong pattern is equal to its degree (i.e., its multiplicity) when the ratio of the sum of the squares of degrees of the patterns to the network size tends to zero when the latter tends to infinity. In the case that there is only one strong pattern present with its degree small compared to the number of patterns and the latter is a fixed multiple of the number of nodes, we find the distribution of the overlap of the mean field and the patterns when the strong pattern is being retrieved. We use these distributions to prove that the storage capacity for retrieving a strong pattern exceeds that for a simple pattern by a multiplicative factor equal to the square of the degree of the strong attractor. This result matches the finding in [15] regarding the capacity of a network to recall strong patterns as mentioned above. Our results therefore show that strong patterns are robust and persistent in the network memory as attachment types and behavioural prototypes are in the human memory system. In this paper, we will several times use Lyapunov’s theorem in probability which provides a simple sufficient condition to generalise the Central Limit theorem when we deal with independent but not necessarily identically distributed random variables. We require a general form of this theorem kn as follows. Let Yn = N i=1 Yni , for n ∈ I , be a triangular array of random variables such that for each n, the random variables Yni , for 1 ≤ i ≤ kn are independent with E(Yni ) = 0 2 2 and E(Yni ) = σni , where E(X) stands for the expected value of the random variable X. Let kn 2 2 sn = i=1 σni . We use the notation X ∼ Y when the two random variables X and Y have the same distribution (for large n if either or both of them depend on n). 
Theorem 1.1 (Lyapunov’s theorem [6, page 368]) If for some δ > 0, we have the condition: 1 E(|Yn |2+δ |) → 0 s2+δ n d d as n → ∞ then s1 Yn −→ N (0, 1) as n → ∞ where −→ denotes convergence in distribution, and we denote n by N (a, σ 2 ) the normal distribution with mean a and variance σ 2 . Thus, for large n we have Yn ∼ N (0, s2 ). n 2 2 Mean field theory We consider a Hopfield network with N neurons i = 1, . . . , N with values Si = ±1 and follow the notations in [20]. As in [15], we assume patterns can be multiply stored and the degree of a pattern is defined as its multiplicity. The total number of patterns, counting their multiplicity, is denoted by p and we assume there are n patterns ξ 1 , . . . , ξ n with degrees d1 , . . . , dn ≥ 1 respectively and that n the remaining p − k=1 dk ≥ 0 patterns are simple, i.e., each has degree one. Note that by our assumptions there are precisely n p0 = p + n − dk k=1 distinct patterns, which we assume are independent and identically distributed with equal probability of taking value ±1 for each node. More generally, for any non-negative integer k ∈ I , we let N p0 dk . µ pk = µ=1 p µ µ 0 1 We use the generalized Hebbian rule for the synaptic couplings: wij = N µ=1 dµ ξi ξj for i = j with wii = 0 for 1 ≤ i, j ≤ N . As in the standard stochastic Hopfield model [20], we use Glauber dynamics [18] for the stochastic updating rule with pseudo-temperature T > 0, which accounts for various types of noise in the network, and assume zero bias in the local field. Putting β = 1/T (i.e., with the Boltzmann constant kB = 1) and letting fβ (h) = 1/(1 + exp(−2βh)), the stochastic updating rule at time t is given by: N Pr(Si (t + 1) = ±1) = fβ (±hi (t)), where hi (t) = wij Sj (t), (1) j=1 is the local field at i at time t. The updating is implemented asynchronously in a random way. The energy of the network in the configuration S = (Si )N is given by i=1 N 1 Si Sj wij . H(S) = − 2 i,j=1 For large N , this specifies a complex system, with an underlying state space of dimension 2N , which in general cannot be solved exactly. However, mean field theory has proved very useful in studying Hopfield networks. The average updated value of Si (t + 1) in Equation (1) is Si (t + 1) = 1/(1 + e−2βhi (t) ) − 1/(1 + e2βhi (t) ) = tanh(βhi (t)), (2) where . . . denotes taking average with respect to the probability distribution in the updating rule in Equation (1). The stationary solution for the mean field thus satisfies: Si = tanh(βhi ) , (3) The average overlap of pattern ξ µ with the mean field at the nodes of the network is given by: mν = 1 N N ν ξi Si (4) i=1 The replica technique for solving the mean field problem, used in the case p/N = α > 0 as N → ∞, seeks to obtain the average of the overlaps in Equation (4) by evaluating the partition function of the system, namely, Z = TrS exp(−βH(S)), where the trace TrS stands for taking sum over all possible configurations S = (Si )N . As it i=1 is generally the case in statistical physics, once the partition function of the system is obtained, 3 all required physical quantities can in principle be computed. However, in this case, the partition function is very difficult to compute since it entails computing the average log Z of log Z, where . . . indicates averaging over the random distribution of the stored patterns ξ µ . To overcome this problem, the identity Zk − 1 log Z = lim k→0 k is used to reduce the problem to finding the average Z k of Z k , which is then computed for positive integer values of k. 
For such k, we have: Z k = TrS 1 TrS 2 . . . TrS k exp(−β(H(S 1 ) + H(S 1 ) + . . . + H(S k ))), where for each i = 1, . . . , k the super-scripted configuration S i is a replica of the configuration state. In computing the trace over each replica, various parameters are obtained and the replica symmetry condition assumes that these parameters are independent of the particular replica under consideration. Apart from this assumption, there are two basic mathematical problems with the technique which makes it unjustifiable [20, page 264]. Firstly, the positive integer k above is eventually treated as a real number near zero without any mathematical justification. Secondly, the order of taking limits, in particular the order of taking the two limits k → 0 and N → ∞, are several times interchanged again without any mathematical justification. Here, we develop a mathematically rigorous method for solving the mean field problem, i.e., computing the average of the overlaps in Equation (4) in the case of p/N = α > 0 as N → ∞. Our method turns the basic idea of the heuristic presented in [17] and reproduced in [20] for solving the mean field equation into a mathematically verifiable formalism, which for the standard Hopfield network with random stored patterns gives the same result as the replica method, assuming replica symmetry. In the presence of strong patterns we obtain a set of new results as explained in the next two sections. The mean field equation is obtained from Equation (3) by approximating the right hand side of N this equation by the value of tanh at the mean field hi = j=1 wij Sj , ignoring the sum N j=1 wij (Sj − Sj ) for large N [17, page 32]: Si = tanh(β hi ) = tanh β N N j=1 p0 µ=1 µ µ dµ ξi ξj Sj . (5) Equation (5) gives the mean field equation for the Hopfield network with n possible strong patterns n ξ µ (1 ≤ µ ≤ n) and p − µ=1 dµ simple patterns ξ µ with n + 1 ≤ µ ≤ p0 . As in the standard Hopfield model, where all patterns are simple, we have two cases to deal with. However, we now have to account for the presence of strong attractors and our two cases will be as follows: (i) In the p0 first case we assume p2 := µ=1 d2 = o(N ), which includes the simpler case p2 N when p2 µ is fixed and independent of N . (ii) In the second case we assume we have a single strong attractor with the load parameter p/N = α > 0. 3 Stability of strong patterns with noise: p2 = o(N ) The case of constant p and N → ∞ is usually referred to as α = 0 in the standard Hopfield model. Here, we need to consider the sum of degrees of all stored patterns (and not just the number of patterns) compared to N . We solve the mean field equation with T > 0 by using a method similar in spirit to [20, page 33] for the standard Hopfield model, but in our case strong patterns induce a sequence of independent but non-identically distributed random variables in the crosstalk term, where the Central Limit Theorem cannot be used; we show however that Lyapunov’s theorem (Theorem (1.1) can be invoked. In retrieving pattern ξ 1 , we look for a solution of the mean filed 1 equation of the form: Si = mξi , where m > 0 is a constant. Using Equation (5) and separating 1 the contribution of ξ in the argument of tanh, we obtain: 1 mξi = tanh mβ 1 d1 ξi + N 4 µ µ 1 dµ ξi ξj ξj . j=i,µ>1 (6) For each N , µ > 1 and j = i, let dµ µ µ 1 (7) ξ ξ ξ . N i j j 2 This gives (p0 − 1)(N − 1) independent random variables with E(YN µj ) = 0, E(YN µj ) = d2 /N 2 , µ 3 3 3 and E(|YN µj |) = dµ /N . 
We have: YN µj = s2 := N 2 E(YN µj ) = µ>1,j=i 1 N −1 d2 ∼ N 2 µ>1 µ N d2 . µ (8) µ>1 Thus, as N → ∞, we have: 1 s3 N 3 E(|YN µj |) ∼ √ µ>1,j=i µ>1 N( d3 µ µ>1 d2 )3/2 µ → 0. (9) as N → ∞ since for positive numbers dµ we always have µ>1 d3 < ( µ>1 d2 )3/2 . Thus the µ µ Lyapunov condition is satisfied for δ = 1. By Lyapunov’s theorem we deduce: 1 N µ µ 1 dµ ξi ξj ξj ∼ N d2 /N µ 0, (10) µ>1 µ>1,j=i Since we also have p2 = o(N ), it follows that we can ignore the second term, i.e., the crosstalk term, in the argument of tanh in Equation (6) as N → ∞; we thus obtain: m = tanh βd1 m. (11) To examine the fixed points of the Equation (11), we let d = d1 for convenience and put x = βdm = dm/T , so that tanh x = T x/d; see Figure 1. It follows that Tc = d is the critical temperature. If T < d then there is a non-zero (non-trivial) solution for m, whereas for T > d we only have the trivial solution. For d = 1 our solution is that of the standard Hopfield network as in [20, page 34]. (d < T) y>x y = x ( d = T) y = tanh x y
2 0.93861282 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
3 0.91834003 245 nips-2013-Pass-efficient unsupervised feature selection
Author: Crystal Maung, Haim Schweitzer
Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
4 0.87756085 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
same-paper 5 0.83393759 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
6 0.81999743 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
7 0.6560393 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
8 0.64758635 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.63908273 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
10 0.62527579 252 nips-2013-Predictive PAC Learning and Process Decompositions
11 0.62354088 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
12 0.62325627 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
13 0.61139292 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
14 0.60923851 350 nips-2013-Wavelets on Graphs via Deep Learning
15 0.60808754 47 nips-2013-Bayesian Hierarchical Community Discovery
16 0.60750353 355 nips-2013-Which Space Partitioning Tree to Use for Search?
17 0.59984452 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
18 0.59864086 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
19 0.59782648 86 nips-2013-Demixing odors - fast inference in olfaction
20 0.59728795 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models