nips nips2013 nips2013-204 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State-of-the-art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. [sent-17, score-0.101]
2 It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. [sent-18, score-0.101]
3 We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. [sent-19, score-0.849]
4 A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. [sent-20, score-0.171]
5 State-of-the-art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. [sent-22, score-0.138]
6 Parsimonious models for such big data assume that the density in the ambient space concentrates around a lower-dimensional (possibly nonlinear) subspace. [sent-26, score-0.252]
7 A plethora of methods are emerging to estimate such lower-dimensional subspaces [1, 2]. [sent-27, score-0.08]
8 We are interested in using such lower-dimensional embeddings to obtain estimates of the conditional distribution of some target variable(s). [sent-28, score-0.101]
9 This conditional density estimation setting arises in a number of important application areas, including neuroscience, genetics, and video processing. [sent-29, score-0.202]
10 For example, one might desire automated estimation of a predictive density for a neurologic phenotype of interest, such as intelligence, on the basis of available patient data, including neuroimaging. [sent-30, score-0.276]
11 The challenge is to estimate the probability density function of the phenotype nonparametrically based on a 10^6-dimensional image of the subject’s brain. [sent-31, score-0.141]
12 It is crucial to avoid parametric assumptions on the density, such as Gaussianity, while allowing the density to change flexibly with predictors. [sent-32, score-0.101]
13 Otherwise, one can obtain misleading predictions and poorly characterize predictive uncertainty. [sent-33, score-0.127]
14 There is a rich machine learning and statistical literature on conditional density estimation of a response y ∈ Y given a set of features (predictors) x = (x1 , x2 , . [sent-34, score-0.202]
15 However, as the number of features increases, the problem of finding the best splitting attribute becomes intractable, so that CART, MARS and multiple tree models cannot be efficiently applied. [sent-47, score-0.085]
16 Similarly, mixture of experts becomes computationally demanding, since both mixture weights and dictionary densities are predictor dependent. [sent-48, score-0.293]
17 However, performing variable selection in high dimensions is effectively intractable: algorithms need to efficiently search for the best subsets of predictors to include in weight and mean functions within a mixture model, an NP-hard problem [16]. [sent-50, score-0.096]
18 In order to efficiently deal with massive datasets, we propose a novel multiscale approach which starts by learning a multiscale dictionary of densities. [sent-51, score-0.593]
19 This tree is efficiently learned in a first stage using a fast and scalable graph partitioning algorithm applied to the high-dimensional observations [17]. [sent-52, score-0.171]
20 Expressing the conditional densities f (y|x) for each x ∈ X as a convex combination of coarse-to-fine scale dictionary densities, the learning problem in the second stage estimates the corresponding multiscale probability tree. [sent-53, score-0.589]
21 This is accomplished in a Bayesian manner using a novel multiscale stick-breaking process, which allows the data to inform about the optimal bias-variance tradeoff; weighting coarse scale dictionary densities more highly decreases variance while adding to bias. [sent-54, score-0.488]
22 Note that M need not be a linear subspace of X ; rather, M could be, for example, a union of affine subspaces, or a smooth compact Riemannian manifold. [sent-62, score-0.11]
23 This is a much less restrictive assumption than the commonplace assumption in manifold learning that the marginal distribution fX concentrates around a low-dimensional latent space. [sent-65, score-0.166]
24 To provide some intuition for our model, we give the following concrete example, where the distribution of y ∈ R is a Gaussian function of the coordinate η ∈ M along the swissroll, which is embedded in a high-dimensional ambient space. [sent-66, score-0.171]
25 Specifically, we sample the manifold coordinate, η ∼ U (0, 1). [sent-67, score-0.092]
26 In particular, x lives on a swissroll embedded in a p-dimensional ambient space, but y is only a function of the coordinate η along the swissroll M. [sent-76, score-0.7]
27 The left panels of Figure 1 depict this example when µ(η) = η and σ(η) = η + 1. [sent-77, score-0.179]
28 The top left panel shows the manifold M (a swissroll) embedded in a p-dimensional ambient space, where the color indicates the coordinate along the manifold, η (only the first 3 dimensions are shown for visualization purposes). [sent-79, score-0.263]
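To make the toy construction above concrete, the following is a minimal sketch (not the authors' code) of how such data could be generated. The swissroll parametrization, the noise level, and the use of isotropic Gaussian noise in all p ambient coordinates are illustrative assumptions; µ(η) = η and σ(η) = η + 1 follow the example in the text.

```python
import numpy as np

def sample_swissroll_example(n=500, p=1000, noise=0.1, seed=0):
    """Toy generator: y depends only on the manifold coordinate eta, while x lies
    (up to noise) on a swissroll embedded in a p-dimensional ambient space.
    The parametrization and noise model are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    eta = rng.uniform(0.0, 1.0, size=n)        # manifold coordinate, eta ~ U(0, 1)
    t = 3.0 * np.pi * (1.0 + 2.0 * eta)        # a common swissroll parametrization
    x = np.zeros((n, p))
    x[:, 0] = t * np.cos(t)                    # the roll lives in the first two coordinates
    x[:, 1] = t * np.sin(t)
    x += noise * rng.standard_normal((n, p))   # small Gaussian noise in every ambient dimension
    y = rng.normal(eta, eta + 1.0)             # y | eta ~ N(mu(eta), sigma(eta)^2), mu = eta, sigma = eta + 1
    return x, y, eta
```

Calling sample_swissroll_example(n=500, p=1000) reproduces the qualitative setup of Figure 1, with only the first two ambient coordinates carrying the manifold structure.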
29 The middle and right panels show our estimates of fY |η at scales 3 and 4, respectively, which follow from partitioning our data. [sent-81, score-0.256]
30 To our knowledge, no extant approach for estimating conditional densities or posteriors thereof satisfies even our first criterion. [sent-90, score-0.282]
31 Ms. Deeds Framework: We propose here a general modular approach which we refer to as multiscale dictionary learning for estimating conditional distributions (“Ms. Deeds”). [sent-93, score-0.439]
32 Ms. Deeds consists of two components: (i) a tree decomposition of the space, and (ii) an assumed form of the conditional probability model. [sent-96, score-0.222]
33 Tree Decomposition: A tree decomposition τ yields a multiscale partition of the data or the ambient space in which the data live. [sent-97, score-0.534]
34 A partition tree τ of W consists of a collection of cells, τ = {Cj,k }j∈Z,k∈Kj . [sent-102, score-0.131]
35 At each scale j, the set of cells Cj = {Cj,k }k∈Kj provides a disjoint partition of W almost everywhere. [sent-103, score-0.091]
36 [18] developed a multiscale measure estimation strategy, and proved that there exists a scale j such that the approximate measure is within some bound of the true measure, under certain relatively general assumptions. [sent-111, score-0.344]
37 We decided to simply partition the x’s, ignoring the y’s in the partitioning strategy. [sent-112, score-0.132]
38 In such cases, we do not want to bias the partitioning to any specific y’s, all the more so when new unknown y’s may later emerge. [sent-115, score-0.086]
39 Second, because the x’s are so much higher dimensional than the y’s in our applications of interest, the partitions would be dominated by the x’s, unless we chose a partitioning strategy that emphasized the y’s. [sent-116, score-0.157]
40 Specifically, we employ METIS [17], a well-known and relatively efficient multiscale partitioning algorithm with demonstrably good empirical performance on a wide range of graphs. [sent-121, score-0.385]
41 Given data points xi = (xi1 , . . . , xip )T ∈ X for i ∈ [n], the graph construction follows via computing all pairwise distances using ρ(xu , xv ) = ∥x̃u − x̃v ∥2 , where x̃ denotes the whitened x. [sent-127, score-0.139]
42 Applying METIS recursively on the graph constructed in this way yields a single tree (see supplementary material for further details). [sent-131, score-0.085]
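As a rough illustration of the recursive bipartitioning (and not the authors' actual pipeline), the sketch below builds the multiscale cells by recursively splitting the whitened data. The bipartition callback stands in for a METIS cut of the pairwise-distance graph; the toy median_split, the min_size stopping rule, and the coordinate-wise whitening are all assumptions made so that the sketch runs on its own.

```python
import numpy as np

def whiten(X):
    """Coordinate-wise whitening (zero mean, unit variance per feature); a simple
    stand-in for whatever whitening transform is actually used."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def median_split(X, idx):
    """Toy bipartition: split on the highest-variance feature at its median.
    In practice this is where a METIS cut of the distance graph would go."""
    j = int(np.argmax(X[idx].var(axis=0)))
    thr = np.median(X[idx, j])
    return idx[X[idx, j] <= thr], idx[X[idx, j] > thr]

def build_partition_tree(X, bipartition=median_split, min_size=20, depth=0, idx=None, tree=None):
    """Recursively bipartition the samples; tree[j] lists the index sets of the cells at scale j."""
    if idx is None:
        idx = np.arange(X.shape[0])
    if tree is None:
        tree = {}
    tree.setdefault(depth, []).append(idx)
    if len(idx) <= min_size:
        return tree
    left, right = bipartition(X, idx)
    if len(left) == 0 or len(right) == 0:      # degenerate cut: stop refining this cell
        return tree
    build_partition_tree(X, bipartition, min_size, depth + 1, left, tree)
    build_partition_tree(X, bipartition, min_size, depth + 1, right, tree)
    return tree
```

A call such as build_partition_tree(whiten(x)) returns, at every scale j, a disjoint partition of the sample indices, mirroring the cells Cj = {Cj,k} described above.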
43 Conditional Probability Model: Given the tree decomposition of the data, we place a nonparametric prior over the tree. [sent-132, score-0.152]
44 Specifically, we define fY |X as fY |X (y|x) = Σj∈Z πj,kj (x) fj,kj (x) (y|x) (1), where kj (x) is the set at scale j to which x has been allocated and πj,kj (x) are weights across scales such that Σj∈Z πj,kj (x) = 1. [sent-133, score-0.183]
45 We refer to this prior as a multiscale stick-breaking process. [sent-140, score-0.255]
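For intuition, Equation (1) can be evaluated along a single root-to-leaf path as sketched below. The construction πj = Vj ∏_{l<j} (1 − Vl) with Vj,k ∼ Beta(1, α), and the final stick set to 1 so that the truncated weights sum to one, is the usual stick-breaking form and is assumed here, as is the Gaussian form of the dictionary densities introduced later in the text.

```python
import numpy as np
from scipy.stats import norm

def stick_breaking_weights(V_path):
    """pi_j = V_j * prod_{l<j} (1 - V_l) along a root-to-leaf path.
    Setting the final V to 1 truncates the process so the weights sum to one."""
    V = np.asarray(V_path, dtype=float)
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

def conditional_density(y, V_path, mu_path, sd_path):
    """f(y | x) = sum_j pi_{j,k_j(x)} * N(y; mu_{j,k_j(x)}, sd_{j,k_j(x)}^2),
    where the path (k_j(x))_j is obtained by dropping x down the partition tree."""
    pi = stick_breaking_weights(V_path)
    return float(np.sum(pi * norm.pdf(y, np.asarray(mu_path), np.asarray(sd_path))))
```

Here V_path, mu_path, and sd_path hold, for each scale j, the quantities attached to the cell that contains x.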
46 Thus, by adopting this Bayesian formulation, we are able to obtain posterior estimates for any newly observed data, regardless of the amount and variability of training data. [sent-142, score-0.088]
47 Moreover, the family can adapt with j or k, being more complex at the coarser scales (for which nj,k ’s are larger), and simpler for the finer scales (or partitions with fewer samples). [sent-151, score-0.138]
48 We let the family of conditional densities for y be Gaussian for simplicity, that is, we assume that fj,k = N (µj,k , σj,k ) with µj,k ∈ R and σj,k ∈ R+ . [sent-152, score-0.206]
49 Because we are interested in posteriors over the conditional distribution fY |X , we place relatively uninformative but conjugate priors on µj,k and σj,k ; specifically, assuming the y’s have been whitened and are unidimensional, µj,k ∼ N (0, 1) and σj,k ∼ IG(a, b). [sent-153, score-0.251]
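Under these priors the full conditionals for a single cell are available in closed form; a minimal Gibbs-step sketch is given below. It is not the authors' sampler: treating σj,k as a variance and the shape/scale parametrization of IG(a, b) are assumptions of this sketch.

```python
import numpy as np

def gibbs_update_cell(y_cell, mu, sigma2, a=3.0, b=1.0, rng=None):
    """One Gibbs sweep for the Gaussian dictionary density of one cell, with priors
    mu ~ N(0, 1) and sigma^2 ~ IG(a, b); y_cell holds the (whitened) responses
    currently allocated to this cell."""
    rng = np.random.default_rng(rng)
    n = len(y_cell)
    # mu | sigma^2, y ~ N(m_n, v_n): prior precision 1 plus data precision n / sigma^2
    v_n = 1.0 / (1.0 + n / sigma2)
    m_n = v_n * (np.sum(y_cell) / sigma2)
    mu = rng.normal(m_n, np.sqrt(v_n))
    # sigma^2 | mu, y ~ IG(a + n/2, b + 0.5 * sum((y - mu)^2)), drawn as 1 / Gamma
    shape = a + 0.5 * n
    rate = b + 0.5 * np.sum((np.asarray(y_cell) - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    return mu, sigma2
```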
50 Inference: We introduce a latent variable for each observation i ∈ [n], taking values in Z and denoting the multiscale level used by the ith observation. [sent-156, score-0.319]
51 Parameters (a, b) and α, involved in the prior densities of the σj,k ’s and Vj,k ’s, were set to (3, 1) and 1, respectively. [sent-164, score-0.101]
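A sketch of the corresponding allocation step for the latent level of one observation is given below; it assumes the stick-breaking weights and Gaussian dictionary densities from the sketches above (restated so the snippet is self-contained), and the exact blocking used in the paper's sampler may differ.

```python
import numpy as np
from scipy.stats import norm

def sample_level(y_i, V_path, mu_path, sd_path, rng=None):
    """Resample the multiscale level used by observation i: the probability of level j
    is proportional to pi_{j,k_j(x_i)} * N(y_i; mu_{j,k_j(x_i)}, sd_{j,k_j(x_i)}^2)."""
    rng = np.random.default_rng(rng)
    V = np.asarray(V_path, dtype=float)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))   # stick-breaking weights
    probs = pi * norm.pdf(y_i, np.asarray(mu_path), np.asarray(sd_path))
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))
```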
52 , p}, and a uniform distribution over the latent manifold η ∼ sin(U (0, c)). [sent-170, score-0.127]
53 (2) Swissroll: We then return to the swissroll example of Figure 1; in Figure 3 we show results for (µ, σ) = (η, 1). [sent-174, score-0.249]
54 (4) Union of Linear Subspaces: This model is a direct extension of the linear subspace model, as it is a union of subspaces. [sent-181, score-0.11]
55 4 Neuroscience Applications: We assessed the predictive performance of the proposed method on two very different neuroimaging datasets. [sent-192, score-0.109]
56 In the first experiment we investigated the extent to which we could predict creativity (as measured via the Composite Creativity Index [22]) from a structural connectome dataset collected at the Mind Research Network (data were collected as described in Jung et al. [sent-197, score-0.264]
57 The second dataset comes from a resting-state functional magnetic resonance imaging experiment as part of the Autism Brain Imaging Data Exchange [26]. [sent-202, score-0.087]
58 To ensure the existence of a nonlinear signal relating these predictors to the response, we let yi correspond to an estimate of overall head motion in the scanner, called mean framewise displacement (FD), computed as described in Power et al. [sent-206, score-0.082]
59 To obtain mean-squared error estimates from MSB, we use the posterior mean as a point estimate (the comparison algorithms do not generate posterior predictions, only point estimates). [sent-211, score-0.106]
60 Results (Illustrative Example): The middle and right panels of Figure 1 depict the quality of partitioning and density estimation for the swissroll example described in §2, with the ambient dimension p = 1000 and the predictive manifold dimension d = 1. [sent-218, score-0.899]
61 At scale 3 we have 4 partitions, and at scale 4 we have 8 (note that the partition tree, in general, need not be binary). [sent-220, score-0.136]
62 The top panels are color coded to indicate which xi ’s fall into which partition. [sent-221, score-0.157]
63 The bottom panels show the resulting estimate of the posteriors at the two scales. [sent-223, score-0.228]
64 These posteriors are piecewise constant, as they are invariant to the manifold coordinate within a given partition. [sent-224, score-0.197]
65 Figure 2 shows the estimated density of two observations of model (1) with parameters (µ1 , σ1 ) = (−2, 1), (µ2 , σ2 ) = (2, 1), σx = 0. [sent-226, score-0.101]
66 Posteriors of the conditional density fY |X were computed for various sample sizes. [sent-228, score-0.202]
67 Figure 2 suggests that our estimate of fY |X approaches the true density as the number of observations in the training set increases. [sent-229, score-0.101]
68 We are unable to compare our strategy for posterior estimation to previous literature because we are unaware of previous Bayesian approaches for this problem that scale up to problems of this size. [sent-230, score-0.127]
69 Therefore, we numerically compare the performance of our point estimates (which we define as the posterior mean of fY |X ) with the predictions of the competitor algorithms. [sent-231, score-0.141]
70 Note that these three simulations span a wide range of models, including nonlinear smooth manifolds such as the swissroll
[Figure 2 panel labels: n = 100, n = 150; f(y|η = −0.5)]
Figure 2: Illustrative example of model (1) suggesting that our posterior estimates of the conditional density are converging as n increases even when fY |η is highly nonlinear and fX|η is very high-dimensional. [sent-246, score-0.304]
72 True (red) and estimated (black) density (50th percentile: solid line, 2. [sent-247, score-0.101]
73 5th percentiles: dashed lines) for two data positions along the manifold (top panels: η ≈ −0. [sent-249, score-0.092]
74 (model 2), relatively simple linear subspace manifolds (model 3), and a union of linear subspaces model (model 4; which is neither linear nor a manifold). [sent-258, score-0.234]
75 In terms of predictive accuracy, the top panels show that for all three simulations, in every dimensionality that we considered—including p = 0. [sent-259, score-0.202]
76 Note that this is the case even though MSB provides much more information about the posterior fY |X , yielding an entire posterior over fY |X , rather than merely a point estimate. [sent-261, score-0.106]
77 Thus, it is clear from these simulations that MSB has better scaling properties—in terms of both predictive accuracy and computational time—than the competitor methods. [sent-266, score-0.166]
78 Top plots depict the relative mean-squared error of MSB (our approach) versus CART (red), Lasso (black), and PC regression (green), as a function of the ambient dimension of x. [sent-268, score-0.208]
79 The three simulation scenarios are: swissroll (left), linear subspaces (middle), union of linear subspaces (right). [sent-270, score-0.509]
80 MSB outperforms CART, Lasso, and PC regression in all three scenarios regardless of ambient dimension (rmse < 1 for all p). [sent-271, score-0.186]
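For reference, one natural reading of the relative error reported here is sketched below: MSB's mean-squared prediction error divided by a competitor's, so that values below 1 favor MSB. The exact normalization used in the paper is not restated in this summary, so the direction of the ratio is an assumption.

```python
import numpy as np

def relative_mse(y_true, y_pred_msb, y_pred_competitor):
    """Relative mean-squared error of MSB against one competitor; values < 1 favor MSB."""
    mse_msb = np.mean((np.asarray(y_true) - np.asarray(y_pred_msb)) ** 2)
    mse_cmp = np.mean((np.asarray(y_true) - np.asarray(y_pred_competitor)) ** 2)
    return mse_msb / mse_cmp
```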
81 Squared error predictive accuracy per subject (using leave-one-out) was computed. [sent-278, score-0.11]
82 We compare multiscale stick-breaking (MSB), CART, Lasso, random forest (RF), and PC regression. [sent-282, score-0.255]
83 MSB outperforms all the competitors in terms of predictive accuracy and scalability. [sent-283, score-0.125]
84 For the creativity application, p was relatively small, “merely” 2,415, so we could run Lasso, CART, and random forests (RF) [30]. [sent-320, score-0.248]
85 For both applications, MSB yielded improved predictive accuracy over all competitors. [sent-322, score-0.08]
86 Although CART and Lasso were faster than MSB on the relatively low-dimensional predictor example (creativity), their computational scaling was poor, such that CART yielded a memory fault on the higher-dimensional case, and Lasso required substantially more time than MSB. [sent-323, score-0.077]
87 6 Discussion: In this work we have introduced a general formalism to estimate conditional distributions via multiscale dictionary learning. [sent-324, score-0.439]
88 We considered simulations and real-data examples where the dimensionality of the predictor space approached one million. [sent-326, score-0.078]
89 To our knowledge, no other approach to learn conditional distributions can run at this scale. [sent-327, score-0.101]
90 Because a fully Bayesian strategy remains computationally intractable at this scale, we developed an empirical Bayes approach, estimating the partition tree based on the data, but integrating over scales and posteriors. [sent-330, score-0.236]
91 This improvement was demonstrated when M was a swissroll, a latent subspace, a union of latent subspaces, and real data (for which the latent space may not even exist). [sent-332, score-0.156]
92 Indeed, while multiscale methods benefit from a rich theoretical foundation [2], the relative advantages and disadvantages of a fully Bayesian approach, in which one can estimate posteriors over all functionals of fY |X at all scales, remain relatively unexplored. [sent-334, score-0.375]
93 Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. [sent-377, score-0.255]
94 Fast kernel conditional density estimation: a dual-tree Monte Carlo approach. [sent-386, score-0.202]
95 Regression density estimation with variational methods and stochastic approximation. [sent-402, score-0.101]
96 Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts. [sent-410, score-0.169]
97 Bayesian density regression with logistic Gaussian process and subspace projection. [sent-446, score-0.199]
98 A fast and high quality multilevel scheme for partitioning irregular graphs. [sent-459, score-0.086]
99 A fast multiscale framework for data in high-dimensions: Measure estimation, anomaly detection, and compressive measurements. [sent-466, score-0.255]
100 Principles of diffusion tensor imaging and its applications to basic neuroscience research. [sent-513, score-0.134]
wordName wordTfidf (topN-words)
[('msb', 0.521), ('fy', 0.362), ('multiscale', 0.255), ('swissroll', 0.249), ('cart', 0.221), ('creativity', 0.204), ('panels', 0.122), ('ambient', 0.112), ('densities', 0.105), ('density', 0.101), ('conditional', 0.101), ('lasso', 0.099), ('manifold', 0.092), ('pc', 0.089), ('partitioning', 0.086), ('tree', 0.085), ('dictionary', 0.083), ('fx', 0.08), ('predictive', 0.08), ('ig', 0.08), ('subspaces', 0.08), ('posteriors', 0.076), ('connectomes', 0.068), ('deeds', 0.068), ('cpu', 0.064), ('bayesian', 0.063), ('kj', 0.062), ('connectome', 0.06), ('subspace', 0.059), ('neuroscience', 0.058), ('depict', 0.057), ('automated', 0.055), ('pipeline', 0.054), ('posterior', 0.053), ('union', 0.051), ('nonlinear', 0.049), ('simulation', 0.049), ('scales', 0.048), ('diffusion', 0.048), ('vogelstein', 0.047), ('predictions', 0.047), ('partition', 0.046), ('competitors', 0.045), ('dunson', 0.045), ('chavez', 0.045), ('grazioplene', 0.045), ('simulations', 0.045), ('scale', 0.045), ('mri', 0.044), ('magnetic', 0.044), ('relatively', 0.044), ('resonance', 0.043), ('partitions', 0.042), ('competitor', 0.041), ('xv', 0.04), ('metis', 0.04), ('phenotype', 0.04), ('gurable', 0.04), ('nott', 0.04), ('sampler', 0.039), ('concentrates', 0.039), ('rf', 0.039), ('regression', 0.039), ('carolina', 0.037), ('wavelets', 0.037), ('movement', 0.036), ('mixture', 0.036), ('decomposition', 0.036), ('xi', 0.035), ('cj', 0.035), ('latent', 0.035), ('asso', 0.035), ('regardless', 0.035), ('predictor', 0.033), ('durham', 0.033), ('jung', 0.033), ('yi', 0.033), ('mse', 0.033), ('fw', 0.031), ('lives', 0.031), ('nonparametric', 0.031), ('rp', 0.031), ('predictors', 0.031), ('subject', 0.03), ('whitened', 0.03), ('bottom', 0.03), ('embedded', 0.03), ('xu', 0.029), ('xr', 0.029), ('north', 0.029), ('neuroimaging', 0.029), ('coordinate', 0.029), ('strategy', 0.029), ('variable', 0.029), ('exibly', 0.028), ('genetics', 0.028), ('tensor', 0.028), ('intractable', 0.028), ('ii', 0.028), ('allocated', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
2 0.089834429 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.087278731 109 nips-2013-Estimating LASSO Risk and Noise Level
Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari
Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1
4 0.084797733 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
5 0.072845198 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
6 0.072001174 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
7 0.071354754 350 nips-2013-Wavelets on Graphs via Deep Learning
8 0.069700502 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
9 0.068609729 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
10 0.067959294 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
11 0.066800892 256 nips-2013-Probabilistic Principal Geodesic Analysis
12 0.066780061 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
13 0.065541424 137 nips-2013-High-Dimensional Gaussian Process Bandits
14 0.064952418 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
15 0.062630378 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
16 0.062315699 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
17 0.061324932 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
18 0.060432289 91 nips-2013-Dirty Statistical Models
19 0.060286183 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
20 0.059334323 355 nips-2013-Which Space Partitioning Tree to Use for Search?
topicId topicWeight
[(0, 0.19), (1, 0.08), (2, -0.008), (3, 0.032), (4, -0.004), (5, 0.081), (6, 0.069), (7, 0.012), (8, -0.012), (9, -0.013), (10, 0.016), (11, -0.007), (12, -0.021), (13, -0.034), (14, -0.013), (15, -0.005), (16, 0.043), (17, -0.093), (18, -0.0), (19, -0.022), (20, 0.018), (21, 0.096), (22, 0.003), (23, 0.005), (24, 0.044), (25, 0.033), (26, -0.009), (27, 0.03), (28, 0.009), (29, -0.0), (30, 0.018), (31, -0.007), (32, 0.076), (33, -0.003), (34, -0.007), (35, 0.114), (36, 0.072), (37, -0.048), (38, -0.024), (39, -0.01), (40, 0.014), (41, 0.002), (42, -0.057), (43, 0.011), (44, -0.084), (45, -0.036), (46, -0.066), (47, -0.048), (48, 0.038), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.90827364 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
2 0.62798971 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.61100066 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
Author: Jeffrey W. Miller, Matthew T. Harrison
Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1
4 0.60733992 4 nips-2013-A Comparative Framework for Preconditioned Lasso Algorithms
Author: Fabian L. Wauthier, Nebojsa Jojic, Michael Jordan
Abstract: The Lasso is a cornerstone of modern multivariate data analysis, yet its performance suffers in the common situation in which covariates are correlated. This limitation has led to a growing number of Preconditioned Lasso algorithms that pre-multiply X and y by matrices PX , Py prior to running the standard Lasso. A direct comparison of these and similar Lasso-style algorithms to the original Lasso is difficult because the performance of all of these methods depends critically on an auxiliary penalty parameter λ. In this paper we propose an agnostic framework for comparing Preconditioned Lasso algorithms to the Lasso without having to choose λ. We apply our framework to three Preconditioned Lasso instances and highlight cases when they will outperform the Lasso. Additionally, our theory reveals fragilities of these algorithms to which we provide partial solutions. 1
5 0.60611248 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
Author: Benigno Uria, Iain Murray, Hugo Larochelle
Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of onedimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradientbased optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case. 1
6 0.60524023 256 nips-2013-Probabilistic Principal Geodesic Analysis
7 0.60186851 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
8 0.60072356 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
9 0.59542036 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
10 0.59044003 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
11 0.57996941 344 nips-2013-Using multiple samples to learn mixture models
12 0.57655102 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
13 0.57358479 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
14 0.56696248 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
15 0.56532621 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
17 0.53907979 332 nips-2013-Tracking Time-varying Graphical Structure
18 0.53745908 167 nips-2013-Learning the Local Statistics of Optical Flow
19 0.53359997 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
20 0.51622319 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
topicId topicWeight
[(2, 0.011), (16, 0.394), (33, 0.125), (34, 0.121), (41, 0.018), (49, 0.021), (56, 0.089), (70, 0.031), (85, 0.033), (89, 0.036), (93, 0.035)]
simIndex simValue paperId paperTitle
1 0.97508597 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes
Author: Abbas Edalat
Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean filed equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored pattens and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1 Introduction: Multiply learned patterns in Hopfield networks The Hopfield network as a model of associative memory and unsupervised learning was introduced in [23] and has been intensively studied from a wide range of viewpoints in the past thirty years. However, properties of a strong pattern, as a pattern that has been multiply stored or learned in these networks, have only been examined very recently, a surprising delay given that repetition of an activity is the basis of learning by the Hebbian rule and long term potentiation. In particular, while the storage capacity of a Hopfield network with certain correlated patterns has been tackled [13, 25], the storage capacity of a Hopfield network in the presence of strong as well as random patterns has not been hitherto addressed. The notion of a strong pattern of a Hopfield network has been proposed in [15] to model attachment types and behavioural prototypes in developmental psychology and psychotherapy. This suggestion has been motivated by reviewing the pioneering work of Bowlby [9] in attachment theory and highlighting how a number of academic biologists, psychiatrists, psychologists, sociologists and neuroscientists have consistently regarded Hopfield-like artificial neural networks as suitable tools to model cognitive and behavioural constructs as patterns that are deeply and repeatedly learned by individuals [11, 22, 24, 30, 29, 10]. A number of mathematical properties of strong patterns in Hopfield networks, which give rise to strong attractors, have been derived in [15]. These show in particular that strong attractors are strongly stable; a series of experiments have also been carried out which confirm the mathematical 1 results and also indicate that a strong pattern stored in the network can be retrieved even in the presence of a large number of simple patterns, far exceeding the well-known maximum load parameter or storage capacity of the Hopfield network with random patterns (αc ≈ 0.138). In this paper, we consider strong patterns in stochastic Hopfield model with temperature, which accounts for various types of noise in the network. In these networks, the updating rule is probabilistic and depend on the temperature. 
Since analytical solution of such a system is not possible in general, one strives to obtain the average behaviour of the network when the input to each node, the so-called field at the node, is replaced with its mean. This is the basis of mean field theory for these networks. Due to the close connection between the Hopfield network and the Ising model in ferromagnetism [1, 8], the mean field approach for the Hopfield network and its variations has been tackled using the replica method, starting with the pioneering work of Amit, Gutfreund and Sompolinsky [3, 2, 4, 19, 31, 1, 13]. Although this method has been widely used in the theory of spin glasses in statistical physics [26, 16] its mathematical justification has proved to be elusive as we will discuss in the next section; see for example [20, page 264], [14, page 27], and [7, page 9]. In [17] and independently in [27], an alternative technique to the replica method for solving the mean field equations has been proposed which is reproduced and characterised as heuristic in [20, section 2.5] since it relies on a number of assumptions that are not later justified and uses a number of mathematical steps that are not validated. Here, we use the basic idea of the above heuristic to develop a verifiable mathematical framework with provable results grounded on elements of probability theory, with which we assume the reader is familiar. This technique allows us to solve the mean field equations for the Hopfield network in the presence of strong patterns and use the results to study, first, the stability of these patterns in the presence of temperature (noise) and, second, the storage capacity of the network with a single strong pattern at temperature zero. We show that the critical temperature for the stability of a strong pattern is equal to its degree (i.e., its multiplicity) when the ratio of the sum of the squares of degrees of the patterns to the network size tends to zero when the latter tends to infinity. In the case that there is only one strong pattern present with its degree small compared to the number of patterns and the latter is a fixed multiple of the number of nodes, we find the distribution of the overlap of the mean field and the patterns when the strong pattern is being retrieved. We use these distributions to prove that the storage capacity for retrieving a strong pattern exceeds that for a simple pattern by a multiplicative factor equal to the square of the degree of the strong attractor. This result matches the finding in [15] regarding the capacity of a network to recall strong patterns as mentioned above. Our results therefore show that strong patterns are robust and persistent in the network memory as attachment types and behavioural prototypes are in the human memory system. In this paper, we will several times use Lyapunov’s theorem in probability which provides a simple sufficient condition to generalise the Central Limit theorem when we deal with independent but not necessarily identically distributed random variables. We require a general form of this theorem kn as follows. Let Yn = N i=1 Yni , for n ∈ I , be a triangular array of random variables such that for each n, the random variables Yni , for 1 ≤ i ≤ kn are independent with E(Yni ) = 0 2 2 and E(Yni ) = σni , where E(X) stands for the expected value of the random variable X. Let kn 2 2 sn = i=1 σni . We use the notation X ∼ Y when the two random variables X and Y have the same distribution (for large n if either or both of them depend on n). 
Theorem 1.1 (Lyapunov’s theorem [6, page 368]) If for some δ > 0, we have the condition: 1 E(|Yn |2+δ |) → 0 s2+δ n d d as n → ∞ then s1 Yn −→ N (0, 1) as n → ∞ where −→ denotes convergence in distribution, and we denote n by N (a, σ 2 ) the normal distribution with mean a and variance σ 2 . Thus, for large n we have Yn ∼ N (0, s2 ). n 2 2 Mean field theory We consider a Hopfield network with N neurons i = 1, . . . , N with values Si = ±1 and follow the notations in [20]. As in [15], we assume patterns can be multiply stored and the degree of a pattern is defined as its multiplicity. The total number of patterns, counting their multiplicity, is denoted by p and we assume there are n patterns ξ 1 , . . . , ξ n with degrees d1 , . . . , dn ≥ 1 respectively and that n the remaining p − k=1 dk ≥ 0 patterns are simple, i.e., each has degree one. Note that by our assumptions there are precisely n p0 = p + n − dk k=1 distinct patterns, which we assume are independent and identically distributed with equal probability of taking value ±1 for each node. More generally, for any non-negative integer k ∈ I , we let N p0 dk . µ pk = µ=1 p µ µ 0 1 We use the generalized Hebbian rule for the synaptic couplings: wij = N µ=1 dµ ξi ξj for i = j with wii = 0 for 1 ≤ i, j ≤ N . As in the standard stochastic Hopfield model [20], we use Glauber dynamics [18] for the stochastic updating rule with pseudo-temperature T > 0, which accounts for various types of noise in the network, and assume zero bias in the local field. Putting β = 1/T (i.e., with the Boltzmann constant kB = 1) and letting fβ (h) = 1/(1 + exp(−2βh)), the stochastic updating rule at time t is given by: N Pr(Si (t + 1) = ±1) = fβ (±hi (t)), where hi (t) = wij Sj (t), (1) j=1 is the local field at i at time t. The updating is implemented asynchronously in a random way. The energy of the network in the configuration S = (Si )N is given by i=1 N 1 Si Sj wij . H(S) = − 2 i,j=1 For large N , this specifies a complex system, with an underlying state space of dimension 2N , which in general cannot be solved exactly. However, mean field theory has proved very useful in studying Hopfield networks. The average updated value of Si (t + 1) in Equation (1) is Si (t + 1) = 1/(1 + e−2βhi (t) ) − 1/(1 + e2βhi (t) ) = tanh(βhi (t)), (2) where . . . denotes taking average with respect to the probability distribution in the updating rule in Equation (1). The stationary solution for the mean field thus satisfies: Si = tanh(βhi ) , (3) The average overlap of pattern ξ µ with the mean field at the nodes of the network is given by: mν = 1 N N ν ξi Si (4) i=1 The replica technique for solving the mean field problem, used in the case p/N = α > 0 as N → ∞, seeks to obtain the average of the overlaps in Equation (4) by evaluating the partition function of the system, namely, Z = TrS exp(−βH(S)), where the trace TrS stands for taking sum over all possible configurations S = (Si )N . As it i=1 is generally the case in statistical physics, once the partition function of the system is obtained, 3 all required physical quantities can in principle be computed. However, in this case, the partition function is very difficult to compute since it entails computing the average log Z of log Z, where . . . indicates averaging over the random distribution of the stored patterns ξ µ . To overcome this problem, the identity Zk − 1 log Z = lim k→0 k is used to reduce the problem to finding the average Z k of Z k , which is then computed for positive integer values of k. 
For such k, we have: Z k = TrS 1 TrS 2 . . . TrS k exp(−β(H(S 1 ) + H(S 1 ) + . . . + H(S k ))), where for each i = 1, . . . , k the super-scripted configuration S i is a replica of the configuration state. In computing the trace over each replica, various parameters are obtained and the replica symmetry condition assumes that these parameters are independent of the particular replica under consideration. Apart from this assumption, there are two basic mathematical problems with the technique which makes it unjustifiable [20, page 264]. Firstly, the positive integer k above is eventually treated as a real number near zero without any mathematical justification. Secondly, the order of taking limits, in particular the order of taking the two limits k → 0 and N → ∞, are several times interchanged again without any mathematical justification. Here, we develop a mathematically rigorous method for solving the mean field problem, i.e., computing the average of the overlaps in Equation (4) in the case of p/N = α > 0 as N → ∞. Our method turns the basic idea of the heuristic presented in [17] and reproduced in [20] for solving the mean field equation into a mathematically verifiable formalism, which for the standard Hopfield network with random stored patterns gives the same result as the replica method, assuming replica symmetry. In the presence of strong patterns we obtain a set of new results as explained in the next two sections. The mean field equation is obtained from Equation (3) by approximating the right hand side of N this equation by the value of tanh at the mean field hi = j=1 wij Sj , ignoring the sum N j=1 wij (Sj − Sj ) for large N [17, page 32]: Si = tanh(β hi ) = tanh β N N j=1 p0 µ=1 µ µ dµ ξi ξj Sj . (5) Equation (5) gives the mean field equation for the Hopfield network with n possible strong patterns n ξ µ (1 ≤ µ ≤ n) and p − µ=1 dµ simple patterns ξ µ with n + 1 ≤ µ ≤ p0 . As in the standard Hopfield model, where all patterns are simple, we have two cases to deal with. However, we now have to account for the presence of strong attractors and our two cases will be as follows: (i) In the p0 first case we assume p2 := µ=1 d2 = o(N ), which includes the simpler case p2 N when p2 µ is fixed and independent of N . (ii) In the second case we assume we have a single strong attractor with the load parameter p/N = α > 0. 3 Stability of strong patterns with noise: p2 = o(N ) The case of constant p and N → ∞ is usually referred to as α = 0 in the standard Hopfield model. Here, we need to consider the sum of degrees of all stored patterns (and not just the number of patterns) compared to N . We solve the mean field equation with T > 0 by using a method similar in spirit to [20, page 33] for the standard Hopfield model, but in our case strong patterns induce a sequence of independent but non-identically distributed random variables in the crosstalk term, where the Central Limit Theorem cannot be used; we show however that Lyapunov’s theorem (Theorem (1.1) can be invoked. In retrieving pattern ξ 1 , we look for a solution of the mean filed 1 equation of the form: Si = mξi , where m > 0 is a constant. Using Equation (5) and separating 1 the contribution of ξ in the argument of tanh, we obtain: 1 mξi = tanh mβ 1 d1 ξi + N 4 µ µ 1 dµ ξi ξj ξj . j=i,µ>1 (6) For each N , µ > 1 and j = i, let dµ µ µ 1 (7) ξ ξ ξ . N i j j 2 This gives (p0 − 1)(N − 1) independent random variables with E(YN µj ) = 0, E(YN µj ) = d2 /N 2 , µ 3 3 3 and E(|YN µj |) = dµ /N . 
We have: YN µj = s2 := N 2 E(YN µj ) = µ>1,j=i 1 N −1 d2 ∼ N 2 µ>1 µ N d2 . µ (8) µ>1 Thus, as N → ∞, we have: 1 s3 N 3 E(|YN µj |) ∼ √ µ>1,j=i µ>1 N( d3 µ µ>1 d2 )3/2 µ → 0. (9) as N → ∞ since for positive numbers dµ we always have µ>1 d3 < ( µ>1 d2 )3/2 . Thus the µ µ Lyapunov condition is satisfied for δ = 1. By Lyapunov’s theorem we deduce: 1 N µ µ 1 dµ ξi ξj ξj ∼ N d2 /N µ 0, (10) µ>1 µ>1,j=i Since we also have p2 = o(N ), it follows that we can ignore the second term, i.e., the crosstalk term, in the argument of tanh in Equation (6) as N → ∞; we thus obtain: m = tanh βd1 m. (11) To examine the fixed points of the Equation (11), we let d = d1 for convenience and put x = βdm = dm/T , so that tanh x = T x/d; see Figure 1. It follows that Tc = d is the critical temperature. If T < d then there is a non-zero (non-trivial) solution for m, whereas for T > d we only have the trivial solution. For d = 1 our solution is that of the standard Hopfield network as in [20, page 34]. (d < T) y>x y = x ( d = T) y = tanh x y
2 0.93861282 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
3 0.91834003 245 nips-2013-Pass-efficient unsupervised feature selection
Author: Crystal Maung, Haim Schweitzer
Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
4 0.87756085 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
same-paper 5 0.83393759 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
6 0.81999743 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
7 0.6560393 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
8 0.64758635 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.63908273 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
10 0.62527579 252 nips-2013-Predictive PAC Learning and Process Decompositions
11 0.62354088 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
12 0.62325627 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
13 0.61139292 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
14 0.60923851 350 nips-2013-Wavelets on Graphs via Deep Learning
15 0.60808754 47 nips-2013-Bayesian Hierarchical Community Discovery
16 0.60750353 355 nips-2013-Which Space Partitioning Tree to Use for Search?
17 0.59984452 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
18 0.59864086 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
19 0.59782648 86 nips-2013-Demixing odors - fast inference in olfaction
20 0.59728795 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models