nips nips2009 nips2009-255 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chong Wang, David M. Blei
Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. [sent-6, score-0.27]
2 Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. [sent-7, score-0.173]
3 In this paper, we develop an alternative inference technique based on variational methods. [sent-8, score-0.475]
4 To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. [sent-9, score-0.932]
5 In such settings, however, the combinatoric space of tree structures makes model selection unusually daunting. [sent-12, score-0.193]
6 The nested Chinese restaurant process (nCRP) [1] addresses this problem by specifying a generative probabilistic model for tree structures. [sent-14, score-0.434]
7 The nCRP is based on the Chinese restaurant process (CRP) [4], which is closely linked to the Dirichlet process in its application to mixture models [5]. [sent-17, score-0.229]
8 As a complicated Bayesian nonparametric model, posterior inference in an nCRP-based model is intractable, and previous approaches all rely on Gibbs sampling [1, 2, 3]. [sent-18, score-0.231]
9 Here, we develop an alternative for posterior inference for nCRP-based models. [sent-20, score-0.151]
10 Our solution is to use optimization-based variational methods [8]. [sent-21, score-0.384]
11 The idea behind variational methods is to posit a simple distribution over the latent variables, and then to fit this distribution to be close to the posterior of interest. [sent-22, score-0.518]
12 Variational methods have been successfully applied to several Bayesian nonparametric models, such as Dirichlet process (DP) mixtures [9, 10, 11], hierarchical Dirichlet processes (HDP) [12], Pitman-Yor processes [13] and Indian buffet processes (IBP) [14]. [sent-23, score-0.155]
13 The work presented here is unique in that our optimization of the variational distribution searches the combinatorial space of trees. [sent-24, score-0.384]
14 First, we describe the tree-based stick-breaking construction of nCRP, which is needed for variational inference. [sent-26, score-0.425]
15 Second, we develop our variational inference algorithm, which explores the infinite tree space associated with the nCRP. [sent-27, score-0.668]
16 2 Nested Chinese restaurant process mixtures The nested Chinese restaurant process (nCRP) is a distribution over hierarchical partitions [1]. [sent-29, score-0.477]
17 Imagine a restaurant with an infinite number of tables, and imagine customers entering the restaurant in sequence. [sent-32, score-0.282]
18 Each customer enters at the first level and comes out at the Lth level, generating a path with L tables as she sits in each restaurant. [sent-36, score-0.34]
19 Moving from a table at level ℓ to one of its subtables at level ℓ + 1, the customer draws a subtable following the CRP of Equation 1. [sent-37, score-0.156]
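As an illustration of this nested seating process, the sketch below draws an L-level path by running a CRP over the sub-tables at each level. The data structures, function names, and the reuse of a single concentration parameter γ at every level are our own illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

def crp_draw(counts, gamma, rng):
    """Seat a customer: an occupied table is chosen with probability proportional to its
    count, and a new table with probability proportional to gamma."""
    total = sum(counts.values()) + gamma
    r = rng.random() * total
    for table, count in counts.items():
        r -= count
        if r <= 0:
            return table
    return max(counts, default=0) + 1          # open a new table

def ncrp_draw_path(child_counts, L, gamma, rng):
    """Draw an L-level path: every customer enters at the root table and, at each level,
    picks a sub-table of the current table by a CRP over that table's children."""
    path = (1,)                                # shared root
    for _ in range(1, L):
        children = child_counts[path]          # {child index: number of previous customers}
        child = crp_draw(children, gamma, rng)
        children[child] = children.get(child, 0) + 1
        path = path + (child,)
    return path

rng = random.Random(0)
child_counts = defaultdict(dict)
print([ncrp_draw_path(child_counts, L=3, gamma=1.0, rng=rng) for _ in range(5)])
```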
20 Each data point is drawn by first choosing a path in the tree according to the nCRP, and then choosing its value from a distribution that depends on the parameters in that path. [sent-42, score-0.391]
21 For data D = {t_n}_{n=1}^{N}, the nCRP mixture assumes that the nth data point t_n is drawn as follows: [sent-45, score-0.239]
22 1. Draw a path c_n | c_{1:(n−1)} ∼ nCRP(γ, c_{1:(n−1)}), which contains L nodes from the tree. [sent-46, score-0.591]
23 Note that W_{c_n} contains the w_i's selected by the path c_n. [sent-52, score-0.589]
24 The corresponding posterior of the latent variables decomposes the data into a collection of paths, and provides distributions of the parameters attached to each node in those paths. [sent-54, score-0.198]
25 Through this posterior, the nCRP mixture can be used as a flexible tree-based mixture model that does not assume a particular tree structure in advance of the data. [sent-56, score-0.319]
26 The nCRP mixture described above includes the hierarchical topic model of [1] as a special case. [sent-58, score-0.192]
27 The nodes of the tree are associated with distributions over words (“topics”), and each document is associated with both a path in the tree and with a vector of proportions over its levels. [sent-62, score-0.749]
28 In the notation above, p(w) is a Dirichlet distribution over the vocabulary simplex, p(x) is a joint distribution of level proportions (from a Dirichlet) and level assignments (N draws from the proportions), and p(t|Wc , x) are the N draws from the topics (for each word) associated with x. [sent-64, score-0.248]
29 For continuous data, if p(w), p(x) and p(t|Wc , x) are appropriate Gaussian distributions, we obtain hierarchical component analysis, a generalization of probabilistic principal component analysis (PPCA) [16, 17]. [sent-66, score-0.178]
30 Each path c can be thought of as a PPCA model with factor loading W_c specified by that path. [sent-68, score-0.198]
31 Then each data point chooses a path (also a PPCA model specified by that path) and draws the factors x. [sent-69, score-0.23]
32 A draw from a DP(γ, G_0) is described as v_i ∼ Beta(1, γ), w_i ∼ G_0, i ∈ {1, 2, · · · }, π_i = v_i ∏_{j=1}^{i−1} (1 − v_j), and G = ∑_{i=1}^{∞} π_i δ_{w_i}, where π are the stick lengths and ∑_{i=1}^{∞} π_i = 1 almost surely. [sent-80, score-0.404]
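A minimal numerical sketch of this stick-breaking construction, with a finite truncation level K and a standard Gaussian base measure G0 chosen purely for illustration:

```python
import numpy as np

def stick_breaking_weights(gamma, K, rng):
    """Truncated stick-breaking: v_i ~ Beta(1, gamma), pi_i = v_i * prod_{j<i} (1 - v_j)."""
    v = rng.beta(1.0, gamma, size=K)
    v[-1] = 1.0                                  # close the stick at the truncation level K
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(gamma=1.0, K=20, rng=rng)
atoms = rng.normal(size=20)                      # w_i ~ G0, here a standard Gaussian base measure
print(pi.sum())                                  # equals 1 because the last stick is closed
```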
33 For all the nodes at the second level, their stick lengths are constructed as for the DP. [sent-84, score-0.173]
34 Although this stick represents an infinite tree, the nodes are countable and each node is uniquely identified by a sequence of L numbers. [sent-92, score-0.204]
35 The tree-based stick-breaking construction lets us calculate the conditional probability of a path given V . [sent-94, score-0.239]
36 Let the path be c = [1, c_2, · · · , c_L]; then p(c|V) = ∏_{ℓ=1}^{L} π_{1,c_2,··· ,c_ℓ} = ∏_{ℓ=1}^{L} v_{1,c_2,··· ,c_ℓ} ∏_{j=1}^{c_ℓ−1} (1 − v_{1,c_2,··· ,c_{ℓ−1},j}). [sent-95, score-0.198]
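The sketch below evaluates this product, assuming the sticks are stored in a dictionary keyed by node index sequences (1, c_2, ..., c_ℓ); we treat the shared root as contributing probability one, so only levels 2 through L add factors.

```python
import numpy as np

def path_log_prob(path, v):
    """log p(c | V) for a path c = (1, c_2, ..., c_L): at each level the chosen child
    contributes v[node], and every earlier sibling contributes (1 - v[sibling])."""
    logp = 0.0
    for level in range(1, len(path)):
        node = path[:level + 1]                 # (1, c_2, ..., c_l)
        logp += np.log(v[node])
        for j in range(1, path[level]):         # earlier siblings 1, ..., c_l - 1
            logp += np.log1p(-v[path[:level] + (j,)])
    return logp

# toy tree: two children under the root, two grandchildren under each child
v = {(1, 1): 0.6, (1, 2): 0.7,
     (1, 1, 1): 0.5, (1, 1, 2): 0.5, (1, 2, 1): 0.4, (1, 2, 2): 0.9}
print(np.exp(path_log_prob((1, 2, 1), v)))      # = (1 - 0.6) * 0.7 * 0.4
```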
37 Given Equation 2, the joint probability of a data set under the nCRP mixture is p(t_{1:N}, x_{1:N}, c_{1:N}, V, W) = p(V) p(W) ∏_{n=1}^{N} p(c_n|V) p(x_n) p(t_n | W_{c_n}, x_n). [sent-97, score-0.151]
38 (3) This representation is the basis for variational inference. [sent-98, score-0.384]
39 3 Variational inference for the nCRP mixture The central computational problem in Bayesian modeling is posterior inference: Given data, what is the conditional distribution of the latent variables in the model? [sent-99, score-0.259]
40 In the nCRP mixture, these latent variables provide the tree structure and node parameters. [sent-100, score-0.298]
41 This model expresses a different philosophy: Their tree reflects the actual conditional dependencies among the components. [sent-102, score-0.193]
42 Data are not generated by choosing a path first, but by a linear transformation of all components in the tree. [sent-103, score-0.198]
43 Posterior inference in an nCRP mixture has previously relied on Gibbs sampling, in which we sample from a Markov chain whose stationary distribution is the posterior [1, 2, 3]. [sent-104, score-0.236]
44 Variational inference for Bayesian nonparametric models uses a truncated stick-breaking representation in the variational distribution [9] – free variational parameters are allowed only up to the truncation level. [sent-109, score-1.052]
45 If the truncation is too large, the variational algorithm will still isolate only a subset of components; if the truncation is too small, methods have been developed to expand the truncated stick as part of the variational algorithm [10]. [sent-110, score-1.122]
46 In the nCRP mixture, however, the challenge is that the tree structure is too large even to effectively truncate. [sent-111, score-0.193]
47 We will address this by defining search criteria for adaptively adjusting the structure of the variational distribution, searching over the set of trees to best accommodate the data. [sent-112, score-0.384]
48 3.1 Variational inference based on the tree-based stick-breaking construction We first address the problem of variational inference with a truncated tree of fixed structure. [sent-114, score-0.935]
49 Suppose that we have a truncated tree T and let MT be the set of all nodes in T . [sent-115, score-0.384]
50 We will address this issue below. The distribution q(x_n) is the variational distribution for the latent variable x_n, and it is in the same family of distributions as p(x_n). [sent-118, score-0.546]
51 In summary, this family of distributions retains the infinite tree structure. [sent-119, score-0.255]
52 Moreover, this family is nested [10, 11]: If a truncated tree T1 is a subtree of a truncated tree T2 then variational distributions defined over T1 are a special case of those defined over T2 . [sent-120, score-1.205]
53 This allows us to use greedy search to find a better tree structure. [sent-122, score-0.193]
54 With the variational distributions (Equation 4) and the joint distributions (Equation 3), we turn to the details of posterior inference. [sent-123, score-0.51]
55 Two issues arise: 1) the variational distribution q(c_n) has an infinite number of values, and we need an efficient way to manipulate it. [sent-130, score-0.384]
56 In the appendix, we show that all the operations can be done using only the truncated tree T. [sent-132, score-0.328]
57 Let c̄ be a path in T, either an inner path (a path ending at an inner node) or a full path (a path ending at a leaf node). [sent-134, score-1.116]
58 Note that inner paths are only defined for the truncated tree T. [sent-135, score-0.565]
59 Consequently, iterating over the truncated tree T using c̄ is the same as iterating over all the full paths in the nCRP tree. [sent-140, score-0.463]
60 These are all we need for variational inference. [sent-141, score-0.384]
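Under our own representation of the truncated tree as a set of index tuples, this bookkeeping amounts to enumerating two kinds of representatives, full paths and inner paths:

```python
def enumerate_representatives(nodes, L):
    """Split a truncated tree (a set of index tuples rooted at (1,)) into full paths of
    depth L and inner paths; an inner path stands in for all full nCRP paths that pass
    through it but are not represented in the truncated tree, so q(c_n) only needs one
    value per representative."""
    full_paths = sorted(n for n in nodes if len(n) == L)
    inner_paths = sorted(n for n in nodes if len(n) < L)
    return full_paths, inner_paths

T = {(1,), (1, 1), (1, 2), (1, 1, 1), (1, 1, 2)}
print(enumerate_representatives(T, L=3))
# ([(1, 1, 1), (1, 1, 2)], [(1,), (1, 1), (1, 2)])
```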
61 Next, we optimize q(v_i | a_i, b_i) for i ∈ MT, where a_i and b_i are the variational parameters of the Beta distribution q(v_i). [sent-142, score-0.43]
62 Let the path containing v_i be [1, c_2, · · · , c_{ℓ_0}], where ℓ_0 ≤ L. [sent-143, score-0.313]
63 We isolate the term that only contains v_i from the lower bound (Equation 5): L(q(v_i)) = E_q[log p(v_i) − log q(v_i)] + ∑_{n=1}^{N} ∑_c q(c_n = c) log p(c_n = c|V). [sent-144, score-0.249]
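Maximizing this bound term gives q(v_i) a Beta form whose parameters accumulate expected path traffic: a_i grows with paths that choose node i, b_i with paths that pass over i to a later sibling. The sketch below is our restatement of this standard stick-breaking update, not the paper's exact expression, and it assumes the responsibilities q(c_n = c) are already summarized per representative path.

```python
from collections import defaultdict

def update_beta_params(path_resp, gamma):
    """Coordinate-ascent update for q(v_i) = Beta(a_i, b_i).
    path_resp maps (n, path) -> q(c_n = path); a node's `a` grows with the expected
    number of paths choosing it, its `b` with the expected number passing it over."""
    a = defaultdict(lambda: 1.0)       # prior Beta(1, gamma)
    b = defaultdict(lambda: gamma)
    for (n, path), resp in path_resp.items():
        for level in range(1, len(path)):
            chosen = path[:level + 1]
            a[chosen] += resp
            parent = path[:level]
            for j in range(1, path[level]):     # earlier siblings were passed over
                b[parent + (j,)] += resp
    return a, b

path_resp = {(0, (1, 1, 1)): 0.8, (0, (1, 2, 1)): 0.2, (1, (1, 2, 1)): 1.0}
a, b = update_beta_params(path_resp, gamma=1.0)
print(a[(1, 1)], b[(1, 1)])   # a = 1 + 0.8 = 1.8, b = gamma + 0.2 + 1.0 = 2.2
```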
64 The variational update functions for W and x depend on the actual distributions we use, and deriving them is straightforward. [sent-146, score-0.417]
65 3.2 Refining the tree structure during variational inference Since our variational distribution is nested, a larger truncated tree will always (theoretically) achieve a lower bound at least as tight as that of a smaller truncated tree. [sent-149, score-1.515]
66 This allows us to search the infinite tree space until a certain criterion is satisfied (e. [sent-150, score-0.193]
67 All these operations are performed on the truncated tree T . [sent-154, score-0.328]
68 This operation is similar to what Gibbs sampling does in searching the tree space. [sent-156, score-0.244]
69 We implement two heuristics: 1) Randomly choose several data points, and for each of them sample a path c̄ according to q(c_n = c̄). [sent-157, score-0.198]
70 If it is an inner path, expand it into a full path; 2) For every inner path in T, first compute the quantity g(c̄) = ∑_{n=1}^{N} q(c_n = c̄). [sent-158, score-0.305]
71 Then sample an inner path (say c̄*) according to g(c̄), and expand it to a full path. [sent-159, score-0.266]
72 If a certain path gets very little probability assignment from all data points, we eliminate this path – for path c, the criterion is ∑_{n=1}^{N} q(c_n = c) < δ, where δ is a small number. [sent-161, score-0.62]
73 This mimics Gibbs sampling in the sense that for nCRP (or CRP), if a certain path (table) gets no assignments in the sampling process, it will never get any assignment any more according to Equation 1. [sent-163, score-0.326]
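The grow and prune heuristics can be sketched as follows; the per-path responsibility masses and the first-child expansion rule are our own illustrative choices rather than the paper's exact procedure.

```python
import random

def prune_paths(path_mass, delta):
    """Eliminate any path whose total responsibility sum_n q(c_n = path) is below delta."""
    return {path for path, mass in path_mass.items() if mass >= delta}

def grow_inner_path(path_mass, inner_paths, L, rng):
    """Heuristic 2): sample an inner path with probability proportional to
    g(path) = sum_n q(c_n = path), then expand it into a full path of depth L."""
    weights = [path_mass.get(p, 0.0) for p in inner_paths]
    if sum(weights) == 0.0:
        return None
    chosen = rng.choices(inner_paths, weights=weights, k=1)[0]
    while len(chosen) < L:
        chosen = chosen + (1,)          # expand by adding (arbitrary) first children
    return chosen

rng = random.Random(0)
path_mass = {(1, 1): 0.6, (1, 2): 0.05, (1, 1, 1): 2.3}
print(prune_paths(path_mass, delta=0.1))            # drops (1, 2)
print(grow_inner_path(path_mass, [(1, 1), (1, 2)], L=3, rng=rng))
```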
74 If paths i and j give almost equal posterior distributions, the two paths are merged [24]. [sent-165, score-0.23]
75 4 Experiments In this section, we demonstrate variational inference for the nCRP. [sent-173, score-0.475]
76 Variational inference (G): variational inference initialized with the initialization used for Gibbs sampling. [sent-198, score-0.566]
77 4.1 Hierarchical topic modeling For discrete data, we compare variational inference with Gibbs sampling for hierarchical topic modeling. [sent-201, score-0.726]
78 Local maxima can be a problem for both Gibbs sampling and variational inference. [sent-205, score-0.435]
79 ) We run the variational inference until the relative change of log-likelihood is less than 0. [sent-214, score-0.475]
80 Specifically, we use posterior means θ̂ and β̂ to represent the estimated topic mixture proportions over L levels and the topic multinomial parameters. [sent-221, score-0.317]
81 For the variational method, we use p({t_1, · · · , t_N}_test) = ∏_{n=1}^{N} ∑_c q(c_n = c) ∏_j ∑_ℓ θ̂_{n,ℓ} β̂_{c_ℓ, t_{nj}}, where θ̂ and β̂ are estimated using mean values from the variational distributions. [sent-222, score-0.768]
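A sketch of this held-out computation for a single document; the per-document level proportions theta_hat, the per-node topic parameters beta_hat keyed by node tuple, and the word encoding are our own assumed data layout.

```python
import numpy as np

def doc_log_lik(words, q_path, theta_hat, beta_hat):
    """log p(t_n) under the variational fit: a mixture over paths weighted by q(c_n = c),
    where each word is explained by a mixture over the levels of the chosen path."""
    terms = []
    for path, q_c in q_path.items():
        logp = np.log(q_c)
        for w in words:
            per_level = [theta_hat[l] * beta_hat[path[:l + 1]][w] for l in range(len(path))]
            logp += np.log(sum(per_level))
        terms.append(logp)
    return np.logaddexp.reduce(terms)

# toy 2-level tree with a vocabulary of 3 words
theta_hat = np.array([0.5, 0.5])                       # this document's level proportions
beta_hat = {(1,): np.array([0.5, 0.3, 0.2]),
            (1, 1): np.array([0.1, 0.1, 0.8]),
            (1, 2): np.array([0.8, 0.1, 0.1])}
q_path = {(1, 1): 0.7, (1, 2): 0.3}
print(doc_log_lik([0, 2, 2], q_path, theta_hat, beta_hat))
```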
82 Actually, (1/S) ∑_{s=1}^{S} δ_{c_n^s} gives the empirical estimate of p(c_n), which in variational inference we approximate using q(c_n). [sent-225, score-0.384]
83 This discrepancy is similar to that in [12], where variational inference is compared with collapsed Gibbs sampling for the HDP. [sent-228, score-0.558]
84 4.2 Modeling handwritten digits using hierarchical component analysis For continuous data, we use hierarchical component analysis for modeling handwritten digits (http://archive. [sent-233, score-0.343]
85 The whole tree has 30 nodes, with an average branching factor 2. [sent-238, score-0.218]
86 The whole tree has 45 nodes, with an average branching factor 2. [sent-241, score-0.218]
87 We use a global mean parameter µ for all paths, although a model with an individual mean parameter for each path can be similarly derived. [sent-246, score-0.198]
88 We put broad priors over the parameters similar to those in variational Bayesian PCA [17]. [sent-247, score-0.384]
89 Again, we run the variational inference until the relative change of log-likelihood is less than 0. [sent-250, score-0.475]
90 To compute the reconstruction error for our model, we first select the path for each data point using its MAP estimate, ĉ_n = arg max_c q(c_n = c). [sent-253, score-0.583]
91 Then we use a similar approach to [26, 24] to reconstruct t_n: t̂_n = Ŵ_{ĉ_n} (Ŵ_{ĉ_n}^T Ŵ_{ĉ_n})^{−1} Ŵ_{ĉ_n}^T (t_n − µ) + µ. [sent-254, score-0.352]
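A small sketch of this projection-based reconstruction, with randomly generated stand-ins for the fitted loading matrix and mean:

```python
import numpy as np

def reconstruct(t, W, mu):
    """Project t - mu onto the column space of the selected loading matrix W and add back mu:
    t_hat = W (W^T W)^{-1} W^T (t - mu) + mu."""
    return W @ np.linalg.solve(W.T @ W, W.T @ (t - mu)) + mu

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 5))        # stand-in for the fitted loadings along the MAP path
mu = rng.normal(size=64)            # stand-in for the fitted global mean
t = rng.normal(size=64)
t_hat = reconstruct(t, W, mu)
print(np.linalg.norm(t - t_hat))    # per-point reconstruction error
```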
92 5 Conclusions In this paper, we presented the variational inference algorithm for the nested Chinese restaurant process based on its tree-based stick-breaking construction. [sent-259, score-0.716]
93 Our result indicates that the variational ... Table: Reconstruction error on handwritten digits (columns: #Depth, HCA (tr), PCA (tr), HCA (te), PCA (te)). [sent-260, score-0.453]
94 We have Case 1: All nodes of the path are in T , c ⊂ MT . [sent-291, score-0.254]
95 In the truncated tree T, let j_0 be the maximum index of the child nodes whose parent path is c̄; then we know that if c_{ℓ_0+1} > j_0, the path [c̄, c_{ℓ_0+1}, · · · , c_L] ⊄ MT. [sent-296, score-0.7]
96 Now we fix the sub-path c̄ and let [c_{ℓ_0+1}, · · · , c_L] vary (satisfying c_{ℓ_0+1} > j_0). [sent-297, score-0.241]
97 According to Equation 4, for any c ∈ child(c̄), E_q[log p(t_n | x_n, W_c)] is constant, since the variational distribution for the w_c outside the truncated tree is the same as the prior distribution. [sent-299, score-0.712]
98 Such cases contain all inner nodes in the truncated tree T . [sent-301, score-0.423]
99 We note that this organization only depends on the truncated tree T and is sufficient for variational inference. [sent-305, score-0.712]
100 Hierarchical topic models and the nested Chinese restaurant process. [sent-313, score-0.284]
wordName wordTfidf (topN-words)
[('ncrp', 0.49), ('variational', 0.384), ('cn', 0.337), ('path', 0.198), ('tree', 0.193), ('tn', 0.176), ('eq', 0.17), ('wcn', 0.145), ('truncated', 0.135), ('mt', 0.122), ('vi', 0.115), ('child', 0.114), ('ppca', 0.111), ('restaurant', 0.11), ('gibbs', 0.109), ('crp', 0.106), ('nested', 0.103), ('wc', 0.102), ('inference', 0.091), ('xn', 0.088), ('stick', 0.088), ('paths', 0.085), ('cl', 0.083), ('chinese', 0.077), ('hca', 0.073), ('topic', 0.071), ('dirichlet', 0.065), ('jacm', 0.064), ('mixture', 0.063), ('posterior', 0.06), ('node', 0.06), ('hierarchical', 0.058), ('nodes', 0.056), ('kurihara', 0.054), ('wi', 0.054), ('proportions', 0.052), ('sampling', 0.051), ('dp', 0.051), ('reconstruction', 0.048), ('latent', 0.045), ('log', 0.045), ('pnas', 0.044), ('isolate', 0.044), ('te', 0.044), ('sub', 0.043), ('tables', 0.042), ('construction', 0.041), ('blei', 0.041), ('mixtures', 0.04), ('inner', 0.039), ('digits', 0.038), ('beta', 0.038), ('draws', 0.038), ('abstracts', 0.037), ('mice', 0.036), ('sits', 0.036), ('customers', 0.033), ('distributions', 0.033), ('equation', 0.033), ('component', 0.033), ('collapsed', 0.032), ('customer', 0.032), ('level', 0.032), ('pca', 0.032), ('dtest', 0.032), ('dtrain', 0.032), ('recombination', 0.032), ('draw', 0.032), ('nite', 0.031), ('handwritten', 0.031), ('principal', 0.031), ('vocabulary', 0.03), ('nonparametric', 0.029), ('imagine', 0.029), ('posit', 0.029), ('cardiac', 0.029), ('bayesian', 0.029), ('family', 0.029), ('truncation', 0.029), ('lengths', 0.029), ('expand', 0.029), ('process', 0.028), ('lth', 0.027), ('assignments', 0.026), ('tipping', 0.026), ('years', 0.025), ('iterating', 0.025), ('branching', 0.025), ('document', 0.024), ('species', 0.024), ('ending', 0.024), ('bi', 0.023), ('continuous', 0.023), ('cs', 0.023), ('test', 0.022), ('pj', 0.022), ('dna', 0.022), ('relied', 0.022), ('table', 0.022), ('welling', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
Author: Chong Wang, David M. Blei
Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1
2 0.17549804 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
Author: Peter Carbonetto, Matthew King, Firas Hamze
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1
3 0.13995744 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons
Author: Romain Brasselet, Roland Johansson, Angelo Arleo
Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve fibers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efficiency of the haptic discrimination process is quantified by a novel definition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the first afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1
4 0.12278223 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
5 0.10638063 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
Author: Feng Yan, Ningyi Xu, Yuan Qi
Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1
6 0.10629134 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
7 0.10622972 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
8 0.10617921 129 nips-2009-Learning a Small Mixture of Trees
9 0.10224071 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters
10 0.10195772 226 nips-2009-Spatial Normalized Gamma Processes
11 0.10188527 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
12 0.093163088 100 nips-2009-Gaussian process regression with Student-t likelihood
13 0.090194114 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
14 0.087113708 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
15 0.085693114 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
16 0.079748534 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation
17 0.07816802 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
18 0.078024961 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
19 0.075314611 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations
20 0.075047031 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
topicId topicWeight
[(0, -0.205), (1, -0.084), (2, -0.03), (3, -0.108), (4, 0.112), (5, -0.209), (6, 0.08), (7, 0.038), (8, -0.04), (9, -0.028), (10, -0.092), (11, -0.017), (12, 0.051), (13, 0.109), (14, 0.032), (15, -0.038), (16, -0.043), (17, 0.039), (18, 0.073), (19, 0.051), (20, 0.049), (21, -0.059), (22, -0.031), (23, 0.025), (24, 0.002), (25, -0.131), (26, 0.012), (27, 0.012), (28, 0.116), (29, 0.047), (30, 0.096), (31, 0.035), (32, -0.06), (33, 0.092), (34, -0.093), (35, 0.127), (36, -0.015), (37, -0.017), (38, 0.176), (39, -0.024), (40, 0.094), (41, -0.03), (42, 0.055), (43, -0.089), (44, -0.024), (45, -0.014), (46, -0.038), (47, -0.086), (48, 0.031), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96194977 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
Author: Chong Wang, David M. Blei
Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1
2 0.64713049 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
Author: Peter Carbonetto, Matthew King, Firas Hamze
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1
3 0.59750694 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
Author: Feng Yan, Ningyi Xu, Yuan Qi
Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1
4 0.56044078 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
5 0.52524745 226 nips-2009-Spatial Normalized Gamma Processes
Author: Vinayak Rao, Yee W. Teh
Abstract: Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally DP distributed. They are used in Bayesian nonparametric models when the usual exchangeability assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each associated with a point in a space such that neighbouring DPs are more dependent. We describe Markov chain Monte Carlo inference involving Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence on a synthetic dataset and demonstrate an application of the model to topic modeling through time. 1
6 0.51039618 205 nips-2009-Rethinking LDA: Why Priors Matter
7 0.50464553 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
8 0.47689873 100 nips-2009-Gaussian process regression with Student-t likelihood
9 0.47145048 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
10 0.4676322 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale
11 0.46546146 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
12 0.44641516 204 nips-2009-Replicated Softmax: an Undirected Topic Model
13 0.44208705 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
14 0.42780441 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
15 0.42196682 129 nips-2009-Learning a Small Mixture of Trees
16 0.41217148 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
17 0.40476608 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
18 0.40369159 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
19 0.39882532 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
20 0.39127555 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
topicId topicWeight
[(7, 0.012), (24, 0.067), (25, 0.059), (31, 0.262), (35, 0.053), (36, 0.082), (39, 0.07), (55, 0.011), (58, 0.057), (61, 0.034), (66, 0.015), (71, 0.086), (81, 0.026), (86, 0.07), (91, 0.014)]
simIndex simValue paperId paperTitle
1 0.92097187 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons
Author: Romain Brasselet, Roland Johansson, Angelo Arleo
Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve fibers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efficiency of the haptic discrimination process is quantified by a novel definition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the first afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1
2 0.8611415 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity
Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge
Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1
same-paper 3 0.8076008 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
Author: Chong Wang, David M. Blei
Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1
4 0.8020972 256 nips-2009-Which graphical models are difficult to learn?
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 
1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. 
L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. 
The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). 
When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. 
Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. 
(10) of Lemma 3.1, we get the following upper bound on the success probability:

Psucc(G_p) ≤ 4∆^2 p^{−δ_A^2 K′ ∆} + 2∆ e^{−n F(p) δ_B^2}.   (12)

In particular, Psucc(G_p) → 0 as p → ∞.

3.2 Proofs of auxiliary lemmas

Proof. (Lemma 3.1) We will show that, under the assumptions of the lemma, if θ̂ = (θ̂_S, θ̂_{S^c}) = (θ̂_S, 0), then the probability that the i-th component of any subgradient of L(θ; {x^(ℓ)}) + λ||θ||_1 vanishes for any θ̂_S > 0 (component-wise) is upper bounded as in Eq. (10). To simplify notation we will omit {x^(ℓ)} in all the expressions derived from L.

Let ẑ be a subgradient of ||θ||_1 at θ̂ and assume ∇L(θ̂) + λẑ = 0. An application of the mean value theorem yields

∇^2 L(θ^*)[θ̂ − θ^*] = W^n − λẑ + R^n,   (13)

where W^n = −∇L(θ^*) and [R^n]_j = [∇^2 L(θ̄^(j)) − ∇^2 L(θ^*)]_j^T (θ̂ − θ^*), with θ̄^(j) a point on the line from θ̂ to θ^*. Notice that by definition ∇^2 L(θ^*) = Q^{n*} = Q^n(θ^*). To simplify notation we will omit the * in all Q^{n*}; all Q^n in this proof are thus evaluated at θ^*.

Breaking this expression into its S and S^c components, and since θ̂_{S^c} = θ^*_{S^c} = 0, we can eliminate θ̂_S − θ^*_S from the two expressions obtained and write

[W^n_{S^c} − R^n_{S^c}] − Q^n_{S^c S} (Q^n_{SS})^{−1} [W^n_S − R^n_S] + λ Q^n_{S^c S} (Q^n_{SS})^{−1} ẑ_S = λ ẑ_{S^c}.

Now notice that Q^n_{S^c S} (Q^n_{SS})^{−1} = T_1 + T_2 + T_3 + T_4, where

T_1 = Q^*_{S^c S} [(Q^n_{SS})^{−1} − (Q^*_{SS})^{−1}],   T_2 = [Q^n_{S^c S} − Q^*_{S^c S}] (Q^*_{SS})^{−1},
T_3 = [Q^n_{S^c S} − Q^*_{S^c S}] [(Q^n_{SS})^{−1} − (Q^*_{SS})^{−1}],   T_4 = Q^*_{S^c S} (Q^*_{SS})^{−1}.   (14)

We will assume that the samples {x^(ℓ)} are such that the following event holds:

E ≡ { ||Q^n_{SS} − Q^*_{SS}||_∞ < ξ_A, ||Q^n_{S^c S} − Q^*_{S^c S}||_∞ < ξ_B, ||W^n_S/λ||_∞ < ξ_C },   (15)

where ξ_A ≡ C_min^2 ǫ/(16∆), ξ_B ≡ C_min ǫ/(8√∆) and ξ_C ≡ C_min ǫ/(8∆). Since E_{G,θ}(Q^n) = Q^* and E_{G,θ}(W^n) = 0, and noticing that both Q^n and W^n are sums of bounded i.i.d. random variables, a simple application of the Azuma–Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σ_min(Q^n_{SS}) > σ_min(Q^*_{SS}) − C_min/2 > C_min/2. We can therefore lower bound the absolute value of the i-th component of ẑ_{S^c} by

|ẑ_i| ≥ |[Q^*_{S^c S} (Q^*_{SS})^{−1} 𝟙_S]_i| − ||T_{1,i}||_∞ − ||T_{2,i}||_∞ − ||T_{3,i}||_∞ − |W^n_i|/λ − |R^n_i|/λ − (∆/C_min) ( ||W^n_S/λ||_∞ + ||R^n_S/λ||_∞ ),

where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that each of the last 7 terms in this expression is smaller than ǫ/8. Since |[Q^*_{S^c S} (Q^*_{SS})^{−1}]_i^T ẑ^*_S| ≥ 1 + ǫ by assumption, this implies |ẑ_i| ≥ (1 + ǫ) − 7ǫ/8 = 1 + ǫ/8 > 1, which cannot be, since any subgradient of the 1-norm has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations (see Lemma 7 from [7]) imply

||T_{1,i}||_∞ ≤ (∆/C_min^2) ||Q^n_{SS} − Q^*_{SS}||_∞,   ||T_{2,i}||_∞ ≤ (√∆/C_min) ||[Q^n_{S^c S} − Q^*_{S^c S}]_i||_∞,
||T_{3,i}||_∞ ≤ (2∆/C_min^2) ||Q^n_{SS} − Q^*_{SS}||_∞ ||[Q^n_{S^c S} − Q^*_{S^c S}]_i||_∞,

and thus all will be bounded by ǫ/8 when E holds. The upper bound on R^n follows along similar lines via a mean value theorem, and is deferred to a longer version of this paper.

Proof. (Lemma 3.3) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (V_T, E_T) be the regular rooted tree of t generations and define the associated Ising measure as

µ^+_{T,θ}(x) = (1/Z_{T,θ}) ∏_{(i,j)∈E_T} e^{θ x_i x_j} ∏_{i∈∂T(t)} e^{h^* x_i}.   (16)

Here ∂T(t) is the set of leaves of T(t) and h^* is the unique positive solution of h = (∆ − 1) atanh{ tanh θ tanh h }.
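As a numerical aside, both the boundary field h^* defined above and the constant θ̃ from Lemma 3.3 can be evaluated in a few lines. The sketch below is illustrative only: the fixed-point iteration, its starting point, the bisection bracket, and the example parameters are assumptions, and the θ̃ formula follows the statement of Lemma 3.3 as given here.

```python
import numpy as np

def boundary_field(theta, delta, tol=1e-12, max_iter=100_000):
    """h* solving h = (delta - 1) * atanh(tanh(theta) * tanh(h)) by fixed-point
    iteration from h = 1; convergence for the parameters of interest is assumed."""
    h = 1.0
    for _ in range(max_iter):
        h_new = (delta - 1) * np.arctanh(np.tanh(theta) * np.tanh(h))
        if abs(h_new - h) < tol:
            break
        h = h_new
    return h

def theta_tilde(tol=1e-12):
    """Constant in Lemma 3.3: hbar is the positive root of
    hbar*tanh(hbar) = (1 - tanh(hbar)**2)**2, and theta_tilde = tanh(hbar)/hbar."""
    f = lambda h: h * np.tanh(h) - (1.0 - np.tanh(h) ** 2) ** 2
    lo, hi = 1e-9, 5.0                      # f(lo) < 0 < f(hi), and f is increasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    hbar = 0.5 * (lo + hi)
    return np.tanh(hbar) / hbar

print(boundary_field(theta=0.5, delta=4))   # boundary field entering the tree measure (16)
print(theta_tilde())                        # theta_thr(Delta) ~ theta_tilde/Delta for large Delta
```

For ∆ = 4 the large-∆ asymptotic is only indicative; the value θthr(∆ = 4) ≈ 0.4203 quoted in Section 2 comes from the full recursive tree calculation.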
It can be proved, using [17] and uniform continuity with respect to the 'external field', that non-trivial local expectations with respect to µ_{G,θ}(x) converge to local expectations with respect to µ^+_{T,θ}(x) as p → ∞.

More precisely, let B_r(t) denote the ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that B_r(t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(x_{B_r(t)}) be any function of the variables in B_r(t) such that g(x_{B_r(t)}) = g(−x_{B_r(t)}). Then, almost surely over graph sequences G_p of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)),

lim_{p→∞} E_{G,θ}{ g(X_{B_r(t)}) } = E_{T(t),θ,+}{ g(X_{T(t)}) }.   (17)

The proof consists in considering [Q^*_{S^c S} (Q^*_{SS})^{−1} ẑ^*_S]_i for t = dist(r, i) finite. We then write (Q^*_{SS})_{lk} = E{ g_{l,k}(X_{B_r(t)}) } and (Q^*_{S^c S})_{il} = E{ g_{i,l}(X_{B_r(t)}) } for some functions g_{·,·}(X_{B_r(t)}), and apply the weak convergence result (17) to these expectations. We have thus reduced the calculation of [Q^*_{S^c S} (Q^*_{SS})^{−1} ẑ^*_S]_i to the calculation of expectations with respect to the tree measure (16). The latter can be carried out explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σ_min(Q^*_{SS}) is proved by a similar calculation.

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and a Portuguese Doctoral FCT fellowship.

References

[1] P. Abbeel, D. Koller and A. Ng, "Learning factor graphs in polynomial time and sample complexity", Journal of Machine Learning Research, 2006, Vol. 7, 1743–1788.
[2] M. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting", arXiv:math/0702301v2 [math.ST], 2007.
[3] N. Santhanam and M. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions", arXiv:0905.2639v1 [cs.IT], 2009.
[4] G. Bresler, E. Mossel and A. Sly, "Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms", Proceedings of the 11th international workshop APPROX 2008 and 12th international workshop RANDOM 2008, 2008, 343–356.
[5] I. Csiszár and Z. Talata, "Consistent estimation of the basic neighborhood structure of Markov random fields", The Annals of Statistics, 2006, Vol. 34, No. 1, 123–145.
[6] N. Friedman, I. Nachman and D. Peer, "Learning Bayesian network structure from massive datasets: The sparse candidate algorithm", in UAI, 1999.
[7] P. Ravikumar, M. Wainwright and J. Lafferty, "High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression", arXiv:0804.4202v1 [math.ST], 2008.
[8] M. Wainwright, P. Ravikumar and J. Lafferty, "Inferring graphical model structure using l1-regularized pseudolikelihood", in NIPS, 2006.
[9] H. Höfling and R. Tibshirani, "Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods", Journal of Machine Learning Research, 2009, Vol. 10, 883–906.
[10] O. Banerjee, L. El Ghaoui and A. d'Aspremont, "Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data", Journal of Machine Learning Research, March 2008, Vol. 9, 485–516.
[11] M. Yuan and Y. Lin, "Model Selection and Estimation in Regression with Grouped Variables", J. Royal
Statist. Soc. B, 2006, Vol. 68, 49–67.
[12] N. Meinshausen and P. Bühlmann, "High dimensional graphs and variable selection with the lasso", Annals of Statistics, 2006, Vol. 34, No. 3.
[13] R. Tibshirani, "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society, Series B, 1996, Vol. 58, 267–288.
[14] P. Zhao and B. Yu, "On model selection consistency of Lasso", Journal of Machine Learning Research, 2006, Vol. 7, 2541–2563.
[15] D. Zobin, "Critical behavior of the bond-dilute two-dimensional Ising model", Phys. Rev., 1978, Vol. 18, No. 5, 2387–2390.
[16] M. Fisher, "Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds", Phys. Rev., Oct. 1967, Vol. 162, No. 2, 480–485.
[17] A. Dembo and A. Montanari, "Ising Models on Locally Tree-Like Graphs", Ann. Appl. Prob., 2008, to appear, arXiv:0804.4726v2 [math.PR].
5 0.68952137 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
Author: Rob Fergus, Yair Weiss, Antonio Torralba
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1
6 0.58395779 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
7 0.57828403 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
8 0.5760563 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
9 0.56873429 260 nips-2009-Zero-shot Learning with Semantic Output Codes
10 0.56859201 226 nips-2009-Spatial Normalized Gamma Processes
11 0.56724733 204 nips-2009-Replicated Softmax: an Undirected Topic Model
12 0.56655306 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
13 0.56591129 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
14 0.56553996 113 nips-2009-Improving Existing Fault Recovery Policies
15 0.56543159 112 nips-2009-Human Rademacher Complexity
16 0.56496131 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
17 0.56459922 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
18 0.56383222 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
19 0.56234384 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
20 0.56227076 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization