nips nips2013 nips2013-46 knowledge-graph by maker-knowledge-mining

46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models


Source: pdf

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. [sent-7, score-0.294]

2 Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). [sent-9, score-0.299]

3 Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. [sent-11, score-0.157]

4 In these large-scale networks, similar kinds of relations can occur frequently and give rise to repeated occurrences of similar parameters, but the grouping pattern among the parameters is usually unknown. [sent-17, score-0.193]

5 For a social network example, suppose that we collect voting data over the last 20 years from a group of 1,000 people who are related to each other through different types of relations (such as family, co-workers, classmates, friends and so on), but the relation types are usually unknown. [sent-18, score-0.157]

6 This can be done via standard maximum likelihood estimation (MLE), but the latent grouping pattern among the parameters is totally ignored, and the model can be overparametrized. [sent-21, score-0.278]

7 Can MRF parameter learners automatically identify these latent parameter groups during learning? [sent-23, score-0.153]

8 This paper shows that it is feasible and potentially beneficial to identify the latent parameter groups during MRF parameter learning. [sent-25, score-0.153]

9 We propose two approximation algorithms, a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with stripped Beta approximation (Gibbs SBA). [sent-28, score-0.369]

10 The Gibbs SBA algorithm performs very close to the Gibbs sampling algorithm with exact likelihood calculation. [sent-31, score-0.157]

11 We also assume that each potential function φc is parameterized by one parameter θc, namely φc(X; θc) = θc^{I(Xu=Xv)} (1−θc)^{1−I(Xu=Xv)}, where I(Xu=Xv) indicates whether the two nodes u and v connected by edge c take the same value, and 0 < θc < 1, ∀c = 1, ... [sent-44, score-0.135]
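
A minimal sketch (not the authors' code) of how this single-parameter edge potential and the resulting unnormalized MRF density could be evaluated; the function and variable names here are illustrative assumptions.

```python
import numpy as np

def edge_potential(x_u, x_v, theta_c):
    """phi_c(X; theta_c) = theta_c^I(x_u == x_v) * (1 - theta_c)^(1 - I(x_u == x_v)),
    with 0 < theta_c < 1 and x_u, x_v the values of the two nodes joined by edge c."""
    same = 1.0 if x_u == x_v else 0.0
    return theta_c ** same * (1.0 - theta_c) ** (1.0 - same)

def unnormalized_density(x, edges, theta):
    """Product of edge potentials over all edges; the partition function is not computed here.
    x: sequence of node values; edges: list of (u, v) index pairs; theta: one parameter per edge."""
    return np.prod([edge_potential(x[u], x[v], t) for (u, v), t in zip(edges, theta)])
```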

12 Note that contrastive divergence is related to pseudo-likelihood [4], ratio matching [17, 16], and together with other MRF parameter estimators [13, 31, 12] can be unified as minimum KL contraction [18]. [sent-64, score-0.191]

13 Therefore by ignoring y, we can generate the posterior samples of θ via Metropolis-Hastings. [sent-73, score-0.118]

14 Technically, this auxiliary variable approach requires perfect sampling [25], but [20] pointed out that other, simpler Markov chain methods also work, provided that they converge adequately to the equilibrium distribution. [sent-74, score-0.212]

15 The second is a Gibbs sampling algorithm with stripped Beta approximation, as introduced in Section 3. [sent-98, score-0.241]

16 We update each element of c in turn; when resampling ci , we fix c−i , all elements in c other than ci . [sent-126, score-0.55]

17 When updating ci, we repeatedly (M times) propose a new value ci* according to the proposal Q(ci*|ci) and accept the move with probability min{1, a(ci*|ci)}, where a(ci*|ci) is the MH ratio. [sent-127, score-0.355]

18 After we update every element of c in the current iteration, we draw a posterior sample of φ according to the current grouping c. [sent-128, score-0.346]
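
Below is a hedged sketch of this per-element MH sweep; propose and mh_ratio are hypothetical callbacks (in the paper the ratio requires auxiliary variables because the MRF likelihood is intractable), and M is the number of repeated proposals per element.

```python
import numpy as np

def mh_sweep_over_assignments(c, M, propose, mh_ratio, rng=np.random.default_rng()):
    """One iteration of the MH update over the group labels c.

    propose(c_i)            -- draws a candidate c_i* from the proposal Q(. | c_i)
    mh_ratio(c_star, i, c)  -- returns a(c_i* | c_i), the MH acceptance ratio
    """
    c = list(c)
    for i in range(len(c)):              # update each element of c in turn, fixing c_{-i}
        for _ in range(M):               # repeat the proposal M times
            c_star = propose(c[i])
            a = mh_ratio(c_star, i, c)
            if rng.random() < min(1.0, a):
                c[i] = c_star            # accept the move
    return c
```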

19 Unlike the tractable Algorithm 5 in [22], we need to introduce auxiliary variables to bypass MRF’s intractable likelihood in two places, namely calculating the MH ratio (in Section 3. [sent-130, score-0.313]

20 Calculating the Metropolis-Hastings Ratio: the MH ratio of proposing a new value ci* for ci according to the proposal Q(ci*|ci) is given alongside Algorithm 1 (the Metropolis-Hastings algorithm; input: observed data X = {x1, ...}). [sent-137, score-0.353]

21 Then, the MH ratio with the auxiliary variable is a(ci*, Z*|ci, Z) = P(Z*; θ̃)P(X; θ̃)... [sent-166, score-0.112]

22 Drawing Posterior Samples of φ|c: we draw posterior samples of φ under grouping c via the MH algorithm, again following [20]. [sent-177, score-0.323]

23 We move the Markov chain for S steps, and get S samples of φ by ignoring Y. [sent-186, score-0.137]

24 Gibbs Sampling with Stripped Beta Approximation: in the Gibbs sampling algorithm (see Algorithm 2), the initialization of the Markov chain is exactly the same as in the MH algorithm; the input is the observed data X = {x1, x2, ... [sent-189, score-0.179]

25 The procedure outputs T posterior samples of θ|X; it resembles Algorithm 2 in [22] and can be shown to be ergodic. [sent-197, score-0.122]

26 When we update c, we fix the values in φ, except that we may add one new value to φ or remove a value from φ (for i = 1 to r: if the current ci is unique in c, remove φci from φ). [sent-201, score-0.347]

27 When we update ci, we first examine whether ci is unique in c (in the algorithm, if the new ci is not an existing group, we draw a value for φci and add it to φ). [sent-204, score-0.869]

28 If so, we remove φci from φ first (the algorithm then draws a posterior sample of φ according to the current c, and sets θ̂i(t) = φci for i = 1, ..., r). [sent-205, score-0.127]

29 We then update ci by assigning it to an existing group or a new group. [sent-206, score-0.287]

30 The probability of each assignment is proportional to a product of two quantities, namely P(ci = c | c−i, X, φc−i) ∝ [n−i,c / (r−1+α0)] · P(X; φc, φc−i) if c ∈ c−i, and ∝ [α0 / (r−1+α0)] · ∫ P(X; θi, φc−i) dG0(θi) if c ∉ c−i. [sent-209, score-0.152]
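
As a rough illustration of this conditional (Eq. (4) in the paper), the sketch below computes the normalized assignment probabilities under the Dirichlet process prior; existing_lik and new_group_lik are placeholders for the likelihood terms, which the paper approximates with the stripped Beta approximation.

```python
import numpy as np

def assignment_probabilities(i, c, alpha0, existing_lik, new_group_lik):
    """Probabilities for resampling c_i given c_{-i} (a sketch of Eq. (4)).

    existing_lik[g]  -- stands in for P(X; phi_g, phi_{c_{-i}}) for an existing group g
    new_group_lik    -- stands in for the integral of P(X; theta_i, phi_{c_{-i}}) dG0(theta_i)
    """
    r = len(c)
    c_minus_i = np.delete(np.asarray(c), i)
    groups, counts = np.unique(c_minus_i, return_counts=True)        # n_{-i,g}
    weights = {g: (n / (r - 1 + alpha0)) * existing_lik[g] for g, n in zip(groups, counts)}
    weights["new"] = (alpha0 / (r - 1 + alpha0)) * new_group_lik     # open a new group
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```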

31 The second quantity is the likelihood of X after assigning ci to the new value c conditional on φc−i . [sent-212, score-0.344]

32 After ci is resampled, it is either set to be an existing group or a new group. [sent-217, score-0.318]

33 After updating every element of c in the current iteration, we draw a posterior sample of φ under the current grouping c. [sent-219, score-0.322]

34 In total, we run T iterations, and get T posterior samples of θ. [sent-220, score-0.122]

35 This Gibbs sampling algorithm involves two intractable calculations, namely (i) calculating P(X; φc, φc−i) and ∫ P(X; θi, φc−i) dG0(θi) in (4), and (ii) drawing posterior samples for φ. [sent-221, score-0.291]

36 We use a stripped Beta approximation in both places, as in Sections 3. [sent-222, score-0.212]

37 Then L(θi|X, θ−i) = ∏_{j=1}^n P(x^j_u, x^j_v | x^j_−uv; θi, θ−i) P(x^j_−uv; θi, θ−i) ≈ ∏_{j=1}^n P(x^j_u, x^j_v | x^j_−uv; θi, θ−i) P(x^j_−uv; θ−i) ∝ ∏_{j=1}^n P(x^j_u, x^j_v | x^j_−uv; θi, θ−i). [sent-241, score-0.423]

38 The term P(x^j_−uv; θ−i) can be dropped since θ−i is fixed, and we only have to consider P(x^j_u, x^j_v | x^j_−uv; θi, θ−i). [sent-243, score-0.141]

39 Since θ−i is fixed and we are conditioning on x^j_−uv, they together can be regarded as a fixed potential function telling how likely the rest of the graph thinks Xu and Xv should take the same value. [sent-244, score-0.169]

40 Suppose that this fixed potential function (the message from the rest of the network x^j_−uv) is parameterized as ηi (0 < ηi < 1). [sent-245, score-0.189]

41 Then ∏_{j=1}^n P(x^j_u, x^j_v | x^j_−uv; θi, θ−i) ∝ ∏_{j=1}^n λ^{I(x^j_u=x^j_v)} (1−λ)^{1−I(x^j_u=x^j_v)} = λ^{∑_{j=1}^n I(x^j_u=x^j_v)} (1−λ)^{n−∑_{j=1}^n I(x^j_u=x^j_v)} (5), where λ = θi ηi / {θi ηi + (1−θi)(1−ηi)}. [sent-246, score-0.141]
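
A small sketch of the resulting stripped Beta likelihood from Eq. (5); treating the message from the rest of the graph as a single fixed eta_i is the approximation described above, and the names here are illustrative.

```python
import numpy as np

def stripped_beta_loglik(theta_i, eta_i, n_same, n):
    """Approximate log-likelihood of theta_i implied by Eq. (5).

    n_same -- number of training samples with x_u == x_v on the edge(s) in group i
    n      -- total number of training samples
    """
    lam = theta_i * eta_i / (theta_i * eta_i + (1.0 - theta_i) * (1.0 - eta_i))
    return n_same * np.log(lam) + (n - n_same) * np.log(1.0 - lam)
```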

42 The simulation results indicate that the performance of the stripped Beta approximation is very close to using exact calculation. [sent-254, score-0.237]

43 Also this approximation only requires as much computation as in the tractable tree-structure MRFs, and it does not require generating expensive particles as in the MH algorithm with auxiliary variables. [sent-255, score-0.143]

44 Drawing Posterior Samples of φ|c: the stripped Beta approximation also allows us to draw posterior samples from φ|c approximately. [sent-260, score-0.366]

45 Suppose that there are k groups according to c, and we have estimates for φ, denoted as φ̂ = {φ̂1, ..., φ̂k}. [sent-261, score-0.125]

46 We denote the numbers of elements in the k groups by m = {m1, ..., mk}. [sent-265, score-0.125]

47 For group i, we draw a posterior sample for φi from Beta(mi n φ̂i + 1, mi n − mi n φ̂i + 1). [sent-269, score-0.208]
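
The corresponding draw is simple to sketch (again with illustrative names; phi_hat, m, and n are the group estimates, group sizes, and sample count defined above).

```python
import numpy as np

def draw_phi_given_grouping(phi_hat, m, n, rng=np.random.default_rng()):
    """Approximate posterior draw of phi | c via the stripped Beta approximation:
    phi_i ~ Beta(m_i * n * phi_hat_i + 1,  m_i * n - m_i * n * phi_hat_i + 1)."""
    phi_hat = np.asarray(phi_hat, dtype=float)
    m = np.asarray(m, dtype=float)
    a = m * n * phi_hat + 1.0
    b = m * n * (1.0 - phi_hat) + 1.0
    return rng.beta(a, b)
```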

48 Simulations (Section 4): we investigate the performance of our Bayesian estimators on three models: (i) a tree-MRF, (ii) a small grid-MRF whose likelihood is tractable, and (iii) a large grid-MRF whose likelihood is intractable. [sent-270, score-0.228]

49 On training data, we apply our grouping-aware Bayesian estimators and two baseline estimators, namely a grouping-blind estimator and an oracle estimator. [sent-272, score-0.336]

50 The grouping-blind estimator does not know that groups exist among the parameters, and estimates the parameters in the standard MLE fashion. [sent-273, score-0.197]

51 The oracle estimator knows the ground truth of the groupings, and ties the parameters from the same group and estimates them via MLE. [sent-274, score-0.222]

52 For the tree-MRF, our Bayesian estimator is exact since the likelihood is tractable. [sent-275, score-0.178]

53 For the small grid-MRF, we have three variations for the Bayesian estimator, namely Gibbs sampling with exact likelihood computation, MH with auxiliary variables, and Gibbs sampling with stripped Beta approximation. [sent-276, score-0.524]

54 For the large grid-MRF, the computational burden only allows us to apply Gibbs sampling with stripped Beta approximation. [sent-277, score-0.241]

55 The second measure is the log likelihood of the testing data, or the log pseudo-likelihood [4] of the testing data when exact likelihood is intractable. [sent-280, score-0.247]
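
For the intractable grids, a log pseudo-likelihood of the form below could serve as the test measure; this is only a sketch under an assumed data layout (samples × nodes, one edge parameter per edge as in the potential above), not the authors' implementation.

```python
import numpy as np

def log_pseudo_likelihood(X, edges, theta, states=(0, 1)):
    """Sum over samples and nodes of log P(x_u | x_neighbors) under the pairwise MRF."""
    incident = {}                                    # node -> list of (neighbor, edge parameter)
    for (u, v), t in zip(edges, theta):
        incident.setdefault(u, []).append((v, t))
        incident.setdefault(v, []).append((u, t))

    total = 0.0
    for x in X:
        for u, nbrs in incident.items():
            # log of the unnormalized conditional of node u, for each candidate state s
            logits = np.array([sum(np.log(t if s == x[v] else 1.0 - t) for v, t in nbrs)
                               for s in states])
            total += logits[states.index(x[u])] - np.logaddexp.reduce(logits)
    return total
```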

56 Thirdly, we evaluate how informative the grouping yielded by the Bayesian estimator is. [sent-281, score-0.296]

57 We use the variation of information metric [19] between the inferred grouping Ĉ and the ground truth grouping C, namely VI(Ĉ, C). [sent-282, score-0.411]

58 Since VI(Ĉ, C) is sensitive to the number of groups in Ĉ, we contrast it with VI(C̄, C), where C̄ is a random grouping with the same number of groups as Ĉ. [sent-283, score-0.461]
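
A sketch of this evaluation (illustrative only: the paper's exact construction of the random grouping is not specified here, so a uniform random relabeling with the same number of groups is assumed).

```python
import numpy as np
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI [19] between two groupings of the same items, in nats: VI = H(A) + H(B) - 2 I(A, B)."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in pab.items():
        p_ab, p_a, p_b = n_ab / n, pa[a] / n, pb[b] / n
        vi -= p_ab * (np.log(p_ab / p_a) + np.log(p_ab / p_b))
    return vi

def vi_difference(inferred, truth, rng=np.random.default_rng()):
    """VI(random grouping, truth) - VI(inferred grouping, truth); larger means more informative."""
    k = len(set(inferred))
    random_grouping = rng.integers(0, k, size=len(inferred))
    return (variation_of_information(list(random_grouping), list(truth))
            - variation_of_information(list(inferred), list(truth)))
```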

59 A larger value of VI difference indicates a more informative grouping yielded by our Bayesian estimator. [sent-285, score-0.266]

60 Because we have one grouping in each of the T MCMC steps, we average the VI difference yielded in each of the T steps. [sent-286, score-0.246]

61 We assume there are 25 groups among the 8,190 parameters. [sent-291, score-0.125]

62 We first generate the true parameters for the 25 groups from Unif(0, 1). [sent-293, score-0.145]

63 Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping. [sent-305, score-0.318]

64 testing samples and n training samples (n = 100, 200, ...). [sent-306, score-0.151]

65 Eventually, we apply the grouping-blind MLE, the oracle MLE, and our grouping-aware Bayesian estimator to the training samples. [sent-310, score-0.26]

66 Our grouping-aware Bayesian estimator has a lower estimate error and a higher log likelihood of test data, compared with the grouping-blind MLE, demonstrating the “blessing of abstraction”. [sent-315, score-0.206]

67 Our Bayesian estimator performs worse than oracle MLE, as we expect. [sent-316, score-0.167]

68 In addition, as the training sample size increases, the performance of our Bayesian estimator approaches that of the oracle estimator. (Figure 2: Number of groups inferred by the Bayesian estimator and its run time.) [sent-317, score-0.551]

69 The VI difference in Figure 1(c) indicates that the Bayesian estimator also recovers the latent grouping to some extent, and the inferred groupings become more and more reliable as the training size increases. [sent-319, score-0.489]

70 The number of groups inferred by the Bayesian estimator and its running time are in Figure 2. [sent-320, score-0.27]

71 We first generate the true parameters for the five groups from Unif(0, 1). [sent-329, score-0.145]

72 We then generate 1,000 testing samples and n training samples (n = 100, 200, ...). [sent-331, score-0.171]

73 (Figure panels: (a) Gibbs_ExactL, (b) MH_AuxVar, (c) Gibbs_SBA.) Our grouping-aware Bayesian estimators have a lower estimate error and a higher log likelihood of test data, compared with the grouping-blind MLE, demonstrating the blessing of abstraction. [sent-344, score-0.264]

74 All three Bayesian estimators perform worse than oracle MLE, as we expect. [sent-345, score-0.161]

75 (Figure 3: The number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA.) [sent-346, score-0.24]

76 The VI difference in Figure 4(c) indicates that the Bayesian estimators also recover the grouping to some extent, and the inferred groupings become more and more reliable as the training size increases. [sent-347, score-0.413]

77 In Figure 3, we provide the boxplots of the number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA. [sent-348, score-0.198]

78 Figure 4: Performance of grouping-blind MLE, oracle MLE, Gibbs ExactL, MH AuxVar, and Gibbs SBA on the small grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. [sent-369, score-0.148]

79 Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimators and random grouping. [sent-370, score-0.312]

80 Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping. [sent-372, score-0.318]

81 Among the three Bayesian estimators, Gibbs ExactL has the lowest estimate error. (Table 1: The run time (in seconds) of Gibbs ExactL, MH AuxVar and Gibbs SBA when training size is n.) [sent-373, score-0.112]

82 Gibbs SBA runs fast, with its burden mainly from running PCD under a specific grouping in each Gibbs sampling step, and it scales well. [sent-390, score-0.22]

83 Exact likelihood is intractable for this large model, and we cannot run Gibbs ExactL. [sent-395, score-0.149]

84 We assume that there are 10 groups among the 1,740 parameters. [sent-398, score-0.125]

85 For all 10 training sets, our Bayesian estimator Gibbs SBA has a lower estimate error and a higher log likelihood of test data, compared with the grouping-blind MLE (via the PCD algorithm). [sent-403, score-0.267]

86 Gibbs SBA has a higher estimate error and a lower pseudo-likelihood of test data than the oracle MLE. [sent-404, score-0.148]

87 The VI difference in Figure 5(c) indicates that Gibbs SBA gradually recovers the latent grouping. (Figure 6: Number of groups inferred by Gibbs SBA and its run time.) [sent-405, score-0.264]

88 The number of groups inferred by Gibbs SBA and its running time are provided in Figure 6. [sent-407, score-0.198]

89 Gibbs SBA finishes the simulations on 900 nodes and 1,740 edges in hundreds of minutes (depending on the training size), which is considered to be very fast. [sent-410, score-0.138]

90 The 109th Congress has two sessions, the first session in 2005 and the second session in 2006. [sent-425, score-0.122]

91 There are 366 votes and 278 votes in the two sessions, respectively. [sent-426, score-0.198]

92 There are 100 senators in both sessions, but Senator Corzine only served the first session and Senator Menendez only served the second session. [sent-427, score-0.147]

93 In total, we have 99 senators in our experiments, and we treat the votes from the 99 senators as the 99 variables in the MRF. [sent-429, score-0.183]

94 We only consider contested votes, namely we remove the votes with fewer than ten or more than ninety supporters. [sent-430, score-0.171]

95 In total, there are 292 votes and 221 votes left in the two sessions, respectively. [sent-431, score-0.198]

96 First, we train the MRF using the first session data, and test on the second session data. [sent-437, score-0.148]

97 We compare our Bayesian estimator (via Gibbs SBA) and MLE (via PCD) by the log pseudo-likelihood of testing data since exact likelihood is intractable. [sent-439, score-0.208]

98 Accordingly, we propose two types of approximation, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with stripped Beta approximation. [sent-457, score-0.367]

99 An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. [sent-583, score-0.121]

100 Markov chain sampling methods for Dirichlet process mixture models. [sent-596, score-0.128]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sba', 0.46), ('mle', 0.343), ('gibbs', 0.293), ('ci', 0.263), ('mh', 0.254), ('pcd', 0.193), ('stripped', 0.19), ('grouping', 0.169), ('uv', 0.143), ('auxvar', 0.142), ('xj', 0.141), ('mrfs', 0.138), ('exactl', 0.127), ('groups', 0.125), ('bayesian', 0.118), ('mrf', 0.112), ('beta', 0.108), ('votes', 0.099), ('oracle', 0.095), ('vi', 0.089), ('auxiliary', 0.084), ('likelihood', 0.081), ('chain', 0.077), ('inferred', 0.073), ('contrastive', 0.073), ('estimator', 0.072), ('posterior', 0.068), ('estimators', 0.066), ('proposal', 0.062), ('session', 0.061), ('markov', 0.061), ('training', 0.061), ('voting', 0.057), ('draw', 0.056), ('ibbs', 0.056), ('yielded', 0.055), ('group', 0.055), ('unif', 0.053), ('senate', 0.051), ('sampling', 0.051), ('dirichlet', 0.048), ('xv', 0.047), ('sessions', 0.044), ('groupings', 0.044), ('intractable', 0.044), ('namely', 0.042), ('senators', 0.042), ('particles', 0.037), ('calculating', 0.034), ('blessing', 0.032), ('groupingblind', 0.032), ('simulations', 0.032), ('replicate', 0.031), ('testing', 0.03), ('move', 0.03), ('samples', 0.03), ('remove', 0.03), ('sample', 0.029), ('senator', 0.028), ('potential', 0.028), ('ratio', 0.028), ('latent', 0.028), ('estimate', 0.027), ('sub', 0.027), ('dept', 0.026), ('gutmann', 0.026), ('xu', 0.026), ('test', 0.026), ('exact', 0.025), ('eventually', 0.025), ('nodes', 0.025), ('carlo', 0.025), ('monte', 0.024), ('divergence', 0.024), ('relations', 0.024), ('update', 0.024), ('metropolishastings', 0.024), ('pseudolikelihood', 0.024), ('resembles', 0.024), ('run', 0.024), ('congress', 0.023), ('drawing', 0.022), ('approximation', 0.022), ('undirected', 0.022), ('served', 0.022), ('nonparametric', 0.022), ('difference', 0.022), ('state', 0.022), ('social', 0.021), ('places', 0.021), ('wisconsin', 0.02), ('generate', 0.02), ('jmlr', 0.02), ('edges', 0.02), ('indicates', 0.02), ('abstraction', 0.02), ('parameterized', 0.02), ('nite', 0.02), ('ller', 0.019), ('persistent', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

2 0.18826334 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

Author: Edoardo M. Airoldi, Thiago B. Costa, Stanley H. Chan

Abstract: Non-parametric approaches for analyzing network data based on exchangeable graph models (ExGM) have recently gained interest. The key object that defines an ExGM is often referred to as a graphon. This non-parametric perspective on network modeling poses challenging questions on how to make inference on the graphon underlying observed network data. In this paper, we propose a computationally efficient procedure to estimate a graphon from a set of observed networks generated from it. This procedure is based on a stochastic blockmodel approximation (SBA) of the graphon. We show that, by approximating the graphon with a stochastic block model, the graphon can be consistently estimated, that is, the estimation error vanishes as the size of the graph approaches infinity.

3 0.17727166 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

4 0.15254205 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

5 0.13226143 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

6 0.11520632 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

7 0.10557924 161 nips-2013-Learning Stochastic Inverses

8 0.1008717 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

9 0.093021743 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks

10 0.081901871 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

11 0.081900254 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

12 0.080367319 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation

13 0.080178872 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

14 0.072887361 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

15 0.068154909 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

16 0.066014774 36 nips-2013-Annealing between distributions by averaging moments

17 0.065730438 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

18 0.06230133 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

19 0.060405195 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

20 0.059865918 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, 0.059), (2, -0.053), (3, 0.011), (4, 0.033), (5, 0.132), (6, 0.168), (7, -0.01), (8, 0.12), (9, -0.038), (10, 0.042), (11, 0.032), (12, 0.011), (13, 0.01), (14, 0.017), (15, 0.013), (16, -0.048), (17, -0.078), (18, -0.019), (19, 0.055), (20, -0.019), (21, 0.001), (22, 0.005), (23, -0.078), (24, 0.054), (25, 0.105), (26, 0.009), (27, -0.019), (28, -0.052), (29, 0.145), (30, 0.083), (31, -0.003), (32, 0.069), (33, -0.113), (34, 0.044), (35, -0.085), (36, -0.015), (37, -0.073), (38, 0.06), (39, -0.053), (40, 0.052), (41, 0.115), (42, 0.088), (43, 0.141), (44, 0.01), (45, -0.053), (46, 0.031), (47, 0.018), (48, -0.146), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94849455 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

2 0.78813416 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Author: Matthew Johnson, James Saunderson, Alan Willsky

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1

3 0.73244727 161 nips-2013-Learning Stochastic Inverses

Author: Andreas Stuhlmüller, Jacob Taylor, Noah Goodman

Abstract: We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model’s joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets. 1

4 0.73156482 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

5 0.70074958 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

6 0.69376779 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

7 0.56130356 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

8 0.51737881 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

9 0.50875735 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

10 0.50504959 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

11 0.48869428 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

12 0.48510095 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

13 0.48088637 160 nips-2013-Learning Stochastic Feedforward Neural Networks

14 0.46478075 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

15 0.46401829 36 nips-2013-Annealing between distributions by averaging moments

16 0.45420021 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

17 0.44852358 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

18 0.44757193 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

19 0.43318826 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

20 0.41584232 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.052), (33, 0.582), (34, 0.079), (49, 0.022), (56, 0.065), (70, 0.018), (85, 0.031), (89, 0.033), (93, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99479938 88 nips-2013-Designed Measurements for Vector Count Data

Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin

Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ Rn , and the + observed data are a vector of counts, Y ∈ Zm . The projection matrix is designed + by maximizing mutual information between Y and X, I(Y ; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y ; C). New analytic expressions for the gradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting). 1

2 0.99298251 217 nips-2013-On Poisson Graphical Models

Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu

Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1

3 0.99011213 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov

Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1

4 0.98974597 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh

Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub–sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable. 1

5 0.98507845 160 nips-2013-Learning Stochastic Feedforward Neural Networks

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1

same-paper 6 0.98127031 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

7 0.97318053 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

8 0.96505225 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

9 0.95859885 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

10 0.93487865 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

11 0.92387909 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

12 0.90241158 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

13 0.90114087 67 nips-2013-Conditional Random Fields via Univariate Exponential Families

14 0.90015012 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

15 0.89868343 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

16 0.8956781 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

17 0.8945182 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

18 0.88806516 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

19 0.88712025 226 nips-2013-One-shot learning by inverting a compositional causal process

20 0.88615179 335 nips-2013-Transfer Learning in a Transductive Setting