nips nips2010 nips2010-101 knowledge-graph by maker-knowledge-mining

101 nips-2010-Gaussian sampling by local perturbations


Source: pdf

Author: George Papandreou, Alan L. Yuille

Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Gaussian sampling by local perturbations George Papandreou Department of Statistics University of California, Los Angeles gpapan@stat. [sent-1, score-0.32]

2 Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. [sent-7, score-0.273]

3 Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. [sent-8, score-0.223]

4 Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. [sent-9, score-0.183]

5 First introduced in statistical physics, MRFs and related models such as Boltzmann machines have proved particularly successful in computer vision and machine learning tasks such as image segmentation, signal recovery, texture modeling, classification, and unsupervised learning [1, 3, 5]. [sent-13, score-0.226]

6 Sampling of MRFs also plays an important role within algorithms for model parameter fitting [7], signal estimation, and in image analysis for texture synthesis or inpainting [16, 19, 37]. [sent-15, score-0.31]

7 In this paper we study a technique which allows drawing exact samples from a GMRF in a single shot by first perturbing it and then computing the least energy configuration of the perturbed model. [sent-20, score-0.335]

8 The perturbation involved amounts to independently injecting noise to each of the Gaussian factors/potentials in a fully distributed manner, as discussed in detail in Sec. [sent-21, score-0.148]

9 This reduction of sampling to quadratic energy minimization allows us to employ as black-box GMRF simulator any existing algorithm for MAP computation which is effective for a particular Gaussian graphical model. [sent-23, score-0.184]

10 Marginal variances also arise in computations within non-linear sparse Bayesian learning and compressed sensing models [11, 26, 32]. [sent-25, score-0.11]

11 Gaussian models have proven inadequate for image modeling as they fail to capture important aspects of natural image statistics such as the heavy tails in marginal histograms of linear filter responses. [sent-30, score-0.308]

12 Nevertheless, much richer statistical image tools can be built if we also incorporate into our models latent variables or allow nonlinear interactions between multiple Gaussian fields and thus the GMRF sampling technique we describe here is very useful within this wider setting [10, 16, 19, 34]. [sent-31, score-0.473]

13 5 we discuss the integration of our GMRF sampling algorithm in a block-Gibbs sampling context, where the conditionally Gaussian continuous variables and the conditionally independent latent variables are sampled alternately. [sent-33, score-0.752]

14 Further, our sampling technique also applies when the latent variables are distributed, with each hidden variable affecting multiple experts. [sent-36, score-0.42]

15 Our GMRF sampling algorithm relies on a property of Gaussian densities (see Sec. [sent-39, score-0.184]

16 However, [21, 22] emphasize direct matrix factorization methods for solving the linear system arising in computing the Gaussian mean, which cannot handle the large models we consider here and do not discuss models with hidden variables. [sent-41, score-0.203]

17 Variations of the sampling technique we study here have also been used in the image modeling work of [16] and very recently of [23]. [sent-42, score-0.375]

18 However, in those papers the sampling technique is used as a tool and not studied in its own right. [sent-43, score-0.244]

19 1 Gaussian graphical models. The linear Gaussian model: We are working in the context of linear Gaussian models [20], in which a hidden vector x ∈ R^N is assumed to follow a prior distribution P(x) and noisy linear measurements y ∈ R^M of it are drawn with likelihood P(y|x). [sent-46, score-0.182]

20 We recall that the information form of the Gaussian density N_I(x; k, J) ∝ exp(−(1/2) x^T J x + k^T x) employs the precision matrix J and the potential vector k [13]. [sent-50, score-0.094]

21 By Bayes’ rule the posterior distribution of x given y is the product of the prior and likelihood terms and also has Gaussian density P(x|y) = N(x; µ, Σ), with µ = J^{-1}(G^T Σ_p^{-1} µ_p + H^T Σ_n^{-1}(y − c)) and Σ^{-1} = J = G^T Σ_p^{-1} G + H^T Σ_n^{-1} H. [sent-55, score-0.21]
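
For illustration, here is a minimal dense-matrix sketch of these posterior parameters, assuming G, H, Sigma_p, Sigma_n, mu_p, y and c are plain numpy arrays; this is only an illustration, and large models would apply J matrix-free rather than forming it explicitly.

```python
import numpy as np

def posterior_params(G, Sigma_p, mu_p, H, Sigma_n, y, c):
    """Information-form posterior of the linear Gaussian model:
    J = G' Sp^-1 G + H' Sn^-1 H,  k = G' Sp^-1 mu_p + H' Sn^-1 (y - c),
    so that P(x|y) = N(x; J^-1 k, J^-1)."""
    Sp_inv = np.linalg.inv(Sigma_p)
    Sn_inv = np.linalg.inv(Sigma_n)
    J = G.T @ Sp_inv @ G + H.T @ Sn_inv @ H
    k = G.T @ Sp_inv @ mu_p + H.T @ Sn_inv @ (y - c)
    mu = np.linalg.solve(J, k)   # posterior mean, equal to the mode
    return J, k, mu
```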

22 The respective filter responses Gx and Hx determine the prior and likelihood models of Eq. [sent-65, score-0.117]

23 We stack the filters [f_1; . . . ; f_L], L = K+M, as the union of {g_k} and {h_m}, and further assume that any two filter responses are conditionally independent given x or, equivalently, that the covariance matrices in Eq. [sent-70, score-0.211]

24 Then the posterior factorizes as a product of L Gaussian experts, P(x|y) ∝ ∏_{l=1}^{L} exp(−(1/2) x^T J_l x + k_l^T x) ∝ ∏_{l=1}^{L} N(f_l^T x; µ_l, Σ_l), (4) where the variances are Σ_l = Σ_{p,l}, l = 1 . [sent-88, score-0.329]

25 (3), we see that the posterior Gaussian information parameters split additively as J = ∑_{l=1}^{L} J_l and k = ∑_{l=1}^{L} k_l. [sent-96, score-0.211]

26 The individual Gaussian factors have potential vectors k_l = f_l Σ_l^{-1} µ_l and rank-one precision matrices J_l = f_l Σ_l^{-1} f_l^T. [sent-97, score-0.416]
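
A small sketch of this factor-wise decomposition, assuming the filters f_l are the columns of a dense matrix F and each factor has a scalar variance Sigma[l] and mean mu[l]; in practice J stays sparse because each f_l has only a few non-zero entries, so the dense accumulation below is purely illustrative.

```python
import numpy as np

def assemble_information_form(F, Sigma, mu):
    """Accumulate k = sum_l f_l Sigma_l^-1 mu_l and J = sum_l Sigma_l^-1 f_l f_l^T
    from per-factor filters (columns of F), variances Sigma and means mu.
    Each J_l is rank one; J is built densely here only for illustration."""
    N, L = F.shape
    k = np.zeros(N)
    J = np.zeros((N, N))
    for l in range(L):
        f = F[:, l]
        k += f * (mu[l] / Sigma[l])
        J += np.outer(f, f) / Sigma[l]
    return J, k
```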

27 We see that there is a one-to-one correspondence between factors and filters; moreover, the (i, j) entry of J_l is non-zero iff both the i-th and j-th entries of f_l are non-zero. [sent-99, score-0.18]

28 It is straightforward to jointly model conditionally dependent filter responses by letting Σp or Σn have block diagonal structure, yielding multivariate Gaussian factors in Eq. [sent-103, score-0.256]

29 3 Inference: efficiently computing the posterior mean. Conceptually, the Gaussian posterior distribution is fully characterized by the posterior mean µ and covariance matrix Σ, which are given in closed form in Eq. [sent-106, score-0.638]

30 For example, a typical 1 MP image model involves N = 10^6 variables; the corresponding symmetric covariance matrix Σ is generally dense and occupies as much space as about 5×10^5 equally-sized images. [sent-109, score-0.138]

31 In certain special cases direct methods can be very efficient. Figure 1: (a) The factor graph for the posterior GMRF contains the union f_{1:L} of prior and likelihood factors/filters. [sent-111, score-0.259]

32 For example, spatially homogeneous GMRFs give rise to a block-circulant precision matrix and exact computations can be carried out in O(N log N ) complexity with DFT-based techniques [10]. [sent-116, score-0.103]
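
A minimal sketch of this DFT route, assuming a stationary GMRF with circular boundaries whose precision is built from a small set of real kernels; the eps guard for the zero DC eigenvalue of improper priors, and all names and values, are assumptions of this illustration rather than details from the paper.

```python
import numpy as np

def sample_stationary_gmrf(kernels, variances, shape, eps=1e-8, rng=None):
    """Exact O(N log N) sample from a zero-mean stationary GMRF whose precision
    is J = sum_l (1/variances[l]) Phi_l^T Phi_l, with Phi_l circular convolution
    by the real kernel kernels[l]; J is (block-)circulant, hence diagonal in the
    DFT basis."""
    rng = np.random.default_rng() if rng is None else rng
    lam = np.zeros(shape)
    for ker, var in zip(kernels, variances):
        pad = np.zeros(shape)
        pad[:ker.shape[0], :ker.shape[1]] = ker
        lam += np.abs(np.fft.fft2(pad)) ** 2 / var   # eigenvalues of J
    lam = np.maximum(lam, eps)       # improper priors have a zero DC eigenvalue
    w = rng.standard_normal(shape)   # white noise
    # colour the noise: the result has covariance F^-1 diag(1/lam) F = J^-1
    return np.real(np.fft.ifft2(np.fft.fft2(w) / np.sqrt(lam)))

# e.g. a thin-membrane-like prior with first-order derivative kernels:
# x = sample_stationary_gmrf([np.array([[-1., 1.]]), np.array([[-1.], [1.]])],
#                            [0.1, 0.1], (256, 256))
```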

33 Exact inference can also be carried out in chain or tree-structured GMRFs using O(N ) Kalman filter equations which correspond to belief propagation (BP) updates recursively in time or scale [36]. [sent-117, score-0.123]

34 A related direct approach which in the context of GMRFs has been studied in detail by [21, 22] relies on the Cholesky factorization of the precision matrix by efficient sparse matrix techniques, which typically re-order the variables in x so as to minimize the bandwidth W of J. [sent-118, score-0.133]

35 The resulting algorithm has O(W^2 N) speed and O(W N) space complexity, which is still quite expensive for very large scale 2-D lattice image models, since the bandwidth W increases linearly with the spatial extent of the image and the support of the filters. [sent-119, score-0.227]

36 2, this essentially amounts to computing the filter responses z_l = f_l^T x and the backprojection ∑_{l=1}^{L} Σ_l^{-1} z_l f_l, which respectively involve sending messages from the variables to the factors and back in the diagram of Fig. [sent-126, score-0.374]

37 The GMRFs arising in image modeling are typically defined on the image responses to a bank of linear filters {φℓ }, ℓ = 1 . [sent-128, score-0.363]

38 The time complexity per iteration is thus low, typically O(N) or O(N log N), provided that the filter kernels φℓ have small spatial support or correspond to wavelet or Fourier atoms for which fast discrete transforms exist, while computations can also be carried out on the GPU. [sent-134, score-0.117]
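
As an illustration of such a matrix-free product, the sketch below specializes to the two first-order derivative filters of the thin-membrane prior and assumes circular boundary handling so that the adjoint filters reduce to simple shifts; the boundary choice is an assumption of the sketch, not the paper's treatment.

```python
import numpy as np

def apply_J_membrane(x, var):
    """Matrix-free product J x for the 2-D thin-membrane prior with circular
    boundaries: the filter responses (messages to the factors) are first-order
    differences, and the backprojection (messages back to the variables) is the
    adjoint difference; each call costs O(N)."""
    dx = np.roll(x, -1, axis=1) - x                      # responses of phi1
    dy = np.roll(x, -1, axis=0) - x                      # responses of phi2
    back = (np.roll(dx, 1, axis=1) - dx) + (np.roll(dy, 1, axis=0) - dy)
    return back / var
```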

39 When multigrid applies, as in the example of Sec. [sent-138, score-0.113]

40 3 Gaussian sampling by independent factor perturbations Unlike direct methods, the iterative techniques discussed in Sec. [sent-141, score-0.333]

41 3 have typically been restricted to computing the posterior mode µ and are considered less suited to posterior sampling or variance computation (but see Sec. [sent-143, score-0.738]

42 A sample xs from the posterior distribution P (x|y) = N (x; µ, Σ) of Eq. [sent-147, score-0.304]

43 (3) can be drawn using the following procedure: (1) Perturb the prior mean filter responses, µ̃_p ∼ N(µ_p, Σ_p). [sent-148, score-0.118]

44 (3) Use the procedure for computing the posterior mode, keeping the same system matrix J and only replacing µ_p and y with their perturbed versions: x_s = J^{-1}(G^T Σ_p^{-1} µ̃_p + H^T Σ_n^{-1}(ỹ − c)). [sent-150, score-0.495]

45 Indeed, x_s is a Gaussian random vector, as a linear combination of Gaussians, and has the desired mean E{x_s} = µ and covariance E{(x_s − µ)(x_s − µ)^T} = J^{-1} = Σ, as can readily be verified. [sent-151, score-0.21]
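
To make the procedure concrete, here is a hedged sketch of the perturbation sampler in moment form, using scipy's conjugate-gradient solver as the black-box mode computation; the function name, the dense G and H inputs, and the absence of a preconditioner are assumptions of this illustration, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sample_posterior(G, Sigma_p, mu_p, H, Sigma_n, y, c, rng=None):
    """Single-shot posterior sample by local perturbation (moment form):
    (1) perturb the prior means,    mu_p~ ~ N(mu_p, Sigma_p)
    (2) perturb the measurements,   y~    ~ N(y, Sigma_n)
    (3) solve J x_s = G' Sp^-1 mu_p~ + H' Sn^-1 (y~ - c) with any black-box
        solver for symmetric positive definite systems (plain CG here)."""
    rng = np.random.default_rng() if rng is None else rng
    Sp_inv, Sn_inv = np.linalg.inv(Sigma_p), np.linalg.inv(Sigma_n)
    mu_p_t = rng.multivariate_normal(mu_p, Sigma_p)
    y_t = rng.multivariate_normal(y, Sigma_n)
    rhs = G.T @ Sp_inv @ mu_p_t + H.T @ Sn_inv @ (y_t - c)
    n = G.shape[1]
    J = LinearOperator((n, n), matvec=lambda v: G.T @ (Sp_inv @ (G @ v))
                                              + H.T @ (Sn_inv @ (H @ v)))
    x_s, _ = cg(J, rhs)
    return x_s
```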

46 The reduction above implies that posterior sampling under the linear Gaussian model is computationally as hard as mode computation, provided that the structure of Σp and Σn allows efficient sampling from the corresponding distributions, using, e. [sent-153, score-0.629]

47 The sampling algorithm takes a particularly simple and intuitive form for the GMRFs discussed in Sec. [sent-159, score-0.184]

48 In this case Σ_p and Σ_n are diagonal, and thus for sampling we independently perturb the factor means, µ̃_l ∼ N(µ_l, Σ_l), l = 1 . [sent-162, score-0.282]

49 L, followed by finding the mode of the so-perturbed GMRF in Eq. [sent-165, score-0.191]

50 The perturbation can equivalently be seen in the information parameterization as injecting Gaussian noise into each potential vector, k̃_l = k_l + f_l Σ_l^{-1/2} ǫ_l with ǫ_l ∼ N(0, 1), a simple local operation carried out independently at each factor of the diagram in Fig. [sent-167, score-0.494]
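
The same perturbation written in information form, as a short sketch with the filters again collected as columns of a matrix F and Sigma holding the per-factor variances (both are conventions of this illustration):

```python
import numpy as np

def perturb_potentials(F, Sigma, k, rng=None):
    """Information-form perturbation: k~ = k + sum_l f_l Sigma_l^(-1/2) eps_l
    with eps_l ~ N(0, 1), carried out independently for every factor."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(F.shape[1])
    return k + F @ (eps / np.sqrt(Sigma))
```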

51 2 an image inpainting example in which we fill in the occluded parts of a 498×495 image under a 2-D thin-membrane prior GMRF model [12, 29, 31], in which the Gaussian factors are induced by the first-order spatial derivative filters φ1 = [−1 1] and φ2 = [−1 1]^T. [sent-170, score-0.38]

52 The shared variance parameter Σl for the experts has been matched to the variance of the image derivative histogram. [sent-171, score-0.342]

53 To transform this efficient MAP computation technique into a powerful sampling algorithm for the thin-membrane GMRF, it suffices to inject noise into the factors, which only perturbs the linear system’s right-hand side. [sent-174, score-0.318]

54 Using a multigrid solver originally developed for solving PDE problems, we can draw about 4 posterior samples per second from the 2-D thin-membrane model of Fig. [sent-175, score-0.329]

55 2, which is particularly impressive given its size; the multilevel Gibbs sampling technique of [30] is the only other algorithm that could potentially achieve such speed in a similar setup, yet it cannot produce exact single-shot samples as our algorithm can. [sent-176, score-0.286]
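
The following is a small self-contained sketch of such a thin-membrane inpainting sampler, under the assumption that the known pixels enter as noisy observations with a small variance noise_var and with plain conjugate gradients standing in for the multigrid solver used in the paper; all names and parameters are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def sample_inpainting(img, mask, prior_var, noise_var, rng=None):
    """One posterior sample for thin-membrane inpainting: derivative factors
    (phi1 = [-1 1], phi2 = [-1 1]^T) with variance prior_var, and the known
    pixels (mask == True) treated as noisy observations with variance
    noise_var.  Every factor is perturbed independently and the perturbed
    system is solved with CG (the paper uses a multigrid solver instead)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    dw = sp.diags([-np.ones(w - 1), np.ones(w - 1)], [0, 1], shape=(w - 1, w))
    dh = sp.diags([-np.ones(h - 1), np.ones(h - 1)], [0, 1], shape=(h - 1, h))
    G = sp.vstack([sp.kron(sp.eye(h), dw), sp.kron(dh, sp.eye(w))]).tocsr()
    H = sp.eye(h * w, format='csr')[np.flatnonzero(mask.ravel())]
    J = (G.T @ G) / prior_var + (H.T @ H) / noise_var
    mu_t = rng.normal(0.0, np.sqrt(prior_var), G.shape[0])  # perturbed factor means
    y_t = rng.normal(img.ravel()[mask.ravel()], np.sqrt(noise_var))
    rhs = G.T @ mu_t / prior_var + H.T @ y_t / noise_var
    xs, _ = cg(J, rhs, x0=img.ravel().astype(float), maxiter=500)
    return xs.reshape(h, w)
```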

56 Figure 2: Image inpainting by exact sampling from the posterior under a 2-D thin-membrane prior GMRF model, conditional on the image values at the known sites. [sent-185, score-0.607]

57 4 Posterior variance estimation. It is often desirable not only to compute the mode µ but also to recover aspects of the covariance structure in the posterior distribution. [sent-187, score-0.389]

58 For example, the diagonal of Σ contains the variance of each variable and thus, along with the mean, fully describes the posterior marginal densities [29]. [sent-191, score-0.306]

59 For many of these models variance estimation is the main computational bottleneck in applications involving large scale datasets. [sent-193, score-0.158]

60 A number of techniques have been proposed for posterior variance estimation. [sent-194, score-0.258]

61 One approach has been to employ modified conjugate gradient algorithms which allow forming variance estimates in parallel to computing the posterior mode when solving the linear system in Eq. [sent-195, score-0.388]

62 It is well known that belief propagation computes exact variances in tree-structured GMRFs [36]. [sent-199, score-0.112]

63 However, in graphs with cycles its loopy version typically underestimates the marginal variances since it overcounts the evidence, even when it converges to the correct means [13, 33]. [sent-200, score-0.158]

64 The variance estimator of [28] is only applicable to GMRFs for which just a small number of edges violates the graph’s tree structure. [sent-201, score-0.159]

65 The method in [12] relies on a low-rank approximation of the N × N unit matrix, carefully adapted to the problem covariance structure, also employing a wavelet hierarchy for models exhibiting long-range dependencies. [sent-202, score-0.114]

66 The ability to efficiently sample from the Gaussian posterior distribution using the algorithm of Sec. [sent-205, score-0.174]

67 3 immediately suggests the following Monte Carlo estimator of the posterior covariance matrix: Σ̂ = (1/S) ∑_{s=1}^{S} (x_s − µ)(x_s − µ)^T. [sent-206, score-0.293]

68 (5) If only the posterior variances are required, one will obviously just evaluate and retain the diagonal of the outer products in the sum; any other selected elements of Σ̂ can similarly be obtained. [sent-207, score-0.249]

69 The error drops quite slowly with the number of samples (S = 2/r^2 samples are required to reach a desired relative error r), so the technique is best suited if rough variance estimates suffice, which is often the case in practical applications [26]; e. [sent-210, score-0.228]
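
A minimal sketch of the resulting marginal-variance estimator; draw_sample() stands for any exact posterior sampler such as the perturbation scheme above, and the name is an assumption of this sketch.

```python
import numpy as np

def mc_marginal_variances(draw_sample, mu, S):
    """Monte Carlo estimate of the posterior marginal variances,
    diag(Sigma) ~ (1/S) sum_s (x_s - mu)^2, unbiased when mu is the exact
    posterior mean; the relative error scales roughly as sqrt(2/S)."""
    acc = np.zeros_like(mu, dtype=float)
    for _ in range(S):
        d = draw_sample() - mu
        acc += d * d
    return acc / S
```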

70 The proposed variance estimation technique can thus be readily applied to every GMRF at a cost of S times that of computing µ. [sent-214, score-0.144]

71 2 the result of applying the proposed variance estimator for the thin-membrane GMRF example considered in Sec. [sent-216, score-0.159]

72 5 Block Gibbs sampling in conditionally Gaussian Markov random fields. Following the intuition behind Gaussian sampling by local perturbations, one could try to inject noise into the local potentials and find the mode of the perturbed model, even in the presence of non-quadratic MRF factors. [sent-220, score-0.792]

73 Although such a randomization process is interesting in its own right and deserves further study, it is not feasible to design it in a way that leads to single-shot algorithms for exact sampling of non-Gaussian MRFs. [sent-221, score-0.239]

74 Without completely abandoning the Gaussian realm, we can get versatile models in which some hidden variables q control the mean and/or variance of the Gaussian factors. [sent-222, score-0.273]

75 Sampling from this model can be carried out efficiently (but not in a single shot any more) by alternately block sampling from P (x|q) and P (q|x), which typically mixes rapidly and is much more efficient than single-site Gibbs sampling [35]. [sent-224, score-0.557]

76 For large models this is feasible because, given the hidden variables, we can update the visible units collectively using the GMRF sampling by local perturbations algorithm, similarly to [16, 23]. [sent-225, score-0.605]

77 We assume that block sampling of the hidden units given the visible variables is also feasible, by considering their conditional distribution independent or tree-structured [16]. [sent-226, score-0.528]

78 One typically employs one discrete hidden variable q_l per factor f_l, leading to mixture-of-Gaussian local experts for which the joint distribution of visible and hidden units is P(x, q) ∝ ∏_{l=1}^{L} π_{l,q_l} N(f_l^T x; µ_{l,q_l}, Σ_{l,q_l}), with q_l ∈ {1, . . . , J_l} [4, 16, 23, 34]. [sent-227, score-0.778]

79 Intuitively, the discrete latent unit ql turns off the smoothness constraint enforced by the factor fl by assigning a large variance Σl,j to it when an image edge is detected. [sent-228, score-0.462]
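
For concreteness, a hedged sketch of one block-Gibbs sweep for such mixture-of-Gaussian experts; sample_gmrf stands in for the perturbation sampler of Sec. 3, and the (L, J)-shaped arrays pi, mu and var are conventions of this illustration rather than the paper's notation.

```python
import numpy as np

def gibbs_sweep(x, F, pi, mu, var, sample_gmrf, rng=None):
    """One block-Gibbs sweep for a GMRF with mixture-of-Gaussian experts.
    F: (N, L) filters; pi, mu, var: (L, J) mixture weights, means, variances.
    Step 1: sample each assignment q_l from its posterior given z_l = f_l^T x.
    Step 2: given q, the model is a plain GMRF with per-factor parameters
    (mu[l, q_l], var[l, q_l]); draw x with the perturbation sampler."""
    rng = np.random.default_rng() if rng is None else rng
    z = F.T @ x
    logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (z[:, None] - mu) ** 2 / var)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    q = np.array([rng.choice(p.shape[1], p=row) for row in p])
    sel = np.arange(F.shape[1])
    x_new = sample_gmrf(F, mu[sel, q], var[sel, q], rng)
    return x_new, q
```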

80 The block-Gibbs sampler leads to a rapidly mixing Markov chain which after a few burn-in iterations generates a sequence of samples {{x1 , q1 }, . [sent-229, score-0.106]

81 If we strive to minimize the estimator’s mean square error, as is typically the case in image denoising, our goal should be to estimate the posterior mean from the sample sequence [23]. [sent-234, score-0.375]

82 Figure 3: Signal restoration under a total variation prior model and alternative estimation criteria. [sent-252, score-0.122]

83 The heavy tailed histograms of natural image filter responses are often conveniently approximated by kurtotic continuous parametric distributions [10,19,35]. [sent-253, score-0.14]

84 We can still resort to block Gibbs sampling for efficiently exploring the posterior distribution of the signal x if each expert can be represented as a continuous Gaussian scale mixture (GSM) [2], as has been done before for Student-t experts [35]. [sent-254, score-0.622]

85 Motivated by [14, 23], we show here how this can lead to a novel Bayesian treatment of signal restoration under a total variation (TV) prior P(x) ∝ ∏_{l=1}^{N−1} L(∆x_l; α), which imposes an L1 penalty on the signal differences ∆x_l = x_l − x_{l+1}. [sent-255, score-0.317]

86 Thanks to the GSM nature of this representation and assuming a Gaussian measurement model, the conditionally Gaussian visible variables are easy to sample. [sent-257, score-0.269]

87 Further, the latent variances v_l conditionally decouple and have density P(v_l|x) ∝ v_l^{-1/2} exp(−|∆x_l|^2/(2v_l) − v_l/(2α^2)), which can be recognized as a generalized inverse Gaussian distribution for which standard sampling routines exist. [sent-258, score-0.642]
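
For instance, with scipy's geninvgauss distribution the decoupled variances could be sampled as in the sketch below; the parameter mapping and the small eps guard against zero differences are assumptions of this illustration, not details taken from the paper.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_latent_variances(x, alpha, rng=None, eps=1e-12):
    """Sample v_l | x for the TV/Laplacian GSM prior: p(v_l | x) is
    GIG(p = 1/2, a = 1/alpha^2, b = dx_l^2), i.e. proportional to
    v^{-1/2} exp(-b/(2v) - a v / 2).  scipy's geninvgauss(p, sqrt(a b)) with
    scale sqrt(b/a) realizes exactly this distribution."""
    rng = np.random.default_rng() if rng is None else rng
    dx = np.diff(x)
    a = 1.0 / alpha ** 2
    b = dx ** 2 + eps                      # guard against the degenerate b = 0
    return geninvgauss.rvs(0.5, np.sqrt(a * b), scale=np.sqrt(b / a),
                           random_state=rng)
```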

88 We demonstrate our Bayesian TV restoration method in a signal denoising experiment illustrated in Fig. [sent-260, score-0.142]

89 We synthesized a length-1000 signal by integrating Laplacian noise (α = 1/8), also adding jumps of height 5 at four locations (outliers), and subsequently degraded it by adding Gaussian noise (with variance 1). [sent-262, score-0.14]

90 We depict the standard TV-MAP restoration result, as well as plausible solutions extracted from a 10-step block-Gibbs sampling run with our GSM-based Bayesian algorithm: the 10-th sample itself, and the two MMSE estimates outlined above (sample mean and RB). [sent-263, score-0.306]

91 We must emphasize that the block Gibbs sampling strategy outlined above in conjunction with our GMRF sampling by local perturbations algorithm is equally well applicable when the latent variables are distributed, with each hidden variable affecting multiple experts, as illustrated in Fig. [sent-268, score-0.732]

92 Training GRBMs with contrastive divergence [7] requires drawing random samples from the model. [sent-271, score-0.117]

93 Sampling the visible layer given the layer of discrete hidden variables is easy if there are no sideways connections between the continuous visible units, as assumed in [9]. [sent-272, score-0.424]

94 To take into account residual correlations among the visible units, the authors of the factored GRBM in [18] drop the conditional independence assumption, but resort to difficult-to-tune hybrid Monte Carlo (HMC) for sampling. [sent-273, score-0.174]

95 Employing our Gaussian sampling by local perturbations scheme we can efficiently jointly sample the correlated visible units, which allows us to still use the more efficient block-Gibbs sampler in training the model of [18]. [sent-274, score-0.49]

96 To verify this, we have accordingly replaced the sampling module in the publicly available implementation of [18], and have closely followed their setup leaving their model otherwise unchanged. [sent-275, score-0.184]

97 For conditionally Gaussian sampling of the correlated visible units we have used our local perturbation algorithm, coupled with 5 iterations of conjugate gradients running on the GPU. [sent-276, score-0.605]

98 Figure 4: (a) Each hidden unit can control a single factor (such as q1) or it can affect multiple experts, resulting in models with distributed latent representations. [sent-281, score-0.259]

99 (b) The visible-to-factor filters arising in the factored GRBM model of [18], as learned using block Gibbs sampling. [sent-282, score-0.177]

100 Image inpainting with a wavelet domain hidden Markov tree model. [sent-387, score-0.23]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gmrf', 0.516), ('gmrfs', 0.226), ('sampling', 0.184), ('posterior', 0.174), ('gaussian', 0.167), ('jx', 0.159), ('fl', 0.143), ('xs', 0.13), ('conditionally', 0.121), ('inpainting', 0.119), ('cg', 0.113), ('flt', 0.113), ('multigrid', 0.113), ('grbm', 0.109), ('visible', 0.106), ('perturbed', 0.104), ('gsm', 0.102), ('perturbations', 0.1), ('mrfs', 0.099), ('lter', 0.098), ('image', 0.094), ('mode', 0.087), ('restoration', 0.086), ('variance', 0.084), ('xl', 0.083), ('gibbs', 0.081), ('experts', 0.08), ('jy', 0.079), ('hidden', 0.076), ('estimator', 0.075), ('variances', 0.075), ('ht', 0.069), ('units', 0.068), ('vl', 0.068), ('factored', 0.068), ('mmse', 0.068), ('db', 0.067), ('jl', 0.067), ('sampler', 0.064), ('technique', 0.06), ('elds', 0.06), ('injecting', 0.06), ('jrss', 0.06), ('tv', 0.059), ('latent', 0.058), ('arising', 0.057), ('precision', 0.056), ('signal', 0.056), ('pde', 0.055), ('psnr', 0.055), ('shot', 0.055), ('boltzmann', 0.053), ('bayesian', 0.053), ('kt', 0.052), ('block', 0.052), ('factor', 0.049), ('perturb', 0.049), ('marginal', 0.048), ('perturbation', 0.047), ('gt', 0.047), ('layer', 0.047), ('carried', 0.047), ('responses', 0.046), ('seeger', 0.045), ('covariance', 0.044), ('conjugate', 0.043), ('markov', 0.043), ('rb', 0.043), ('samples', 0.042), ('variables', 0.042), ('texture', 0.041), ('distributed', 0.041), ('ce', 0.04), ('drawing', 0.04), ('papandreou', 0.04), ('lanczos', 0.04), ('inject', 0.04), ('scale', 0.039), ('diagram', 0.038), ('bp', 0.038), ('employs', 0.038), ('kl', 0.037), ('factors', 0.037), ('belief', 0.037), ('mixture', 0.037), ('modeling', 0.037), ('mean', 0.036), ('local', 0.036), ('prior', 0.036), ('multiplication', 0.035), ('pami', 0.035), ('contrastive', 0.035), ('typically', 0.035), ('models', 0.035), ('wavelet', 0.035), ('ql', 0.034), ('malioutov', 0.034), ('zl', 0.034), ('hx', 0.034), ('perturbing', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 101 nips-2010-Gaussian sampling by local perturbations

Author: George Papandreou, Alan L. Yuille

Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1

2 0.13610685 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

Author: Iain Murray, Ryan P. Adams

Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1

3 0.13483411 103 nips-2010-Generating more realistic images using gated MRF's

Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton

Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009. [32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002. [33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008. 9

4 0.12775934 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton

Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1
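As a rough illustration of the claim that every configuration of precision units induces its own precision matrix, the sketch below forms the conditional Gaussian over the acoustic vector. The exact mcRBM parameterization (pooling, signs, biases, filter normalization) differs in detail, so the formula should be read as an assumption in the spirit of the model rather than its published definition.

```python
import numpy as np

def conditional_acoustic_gaussian(C, P, W, h_prec, h_mean):
    """Conditional N(mean, cov) over the visible (acoustic) vector.

    C: (d, f) covariance filters, P: (f, k) pooling matrix,
    W: (d, m) mean-unit weights, h_prec: (k,) binary precision units,
    h_mean: (m,) binary mean units.  Parameter roles are assumptions.
    """
    # Each setting of h_prec gates a different set of filters, hence a
    # different precision matrix; the identity comes from the mean part.
    precision = C @ np.diag(P @ h_prec) @ C.T + np.eye(C.shape[0])
    cov = np.linalg.inv(precision)
    mean = cov @ (W @ h_mean)
    return mean, cov
```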

5 0.11089094 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

Author: Matthias Broecheler, Lise Getoor

Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
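The closing remark, that the standard deviation of a marginal can act as a confidence measure, is straightforward to realize once any sampler for the continuous MRF is available; the sampler interface below is hypothetical.

```python
import numpy as np

def marginal_summaries(draw_sample, n_samples=1000):
    """Estimate per-variable marginal means and standard deviations
    from repeated draws of a (hypothetical) MRF sampler."""
    samples = np.stack([draw_sample() for _ in range(n_samples)])  # (n_samples, n_vars)
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)   # a wide marginal signals low confidence
    return mean, std
```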

6 0.10220849 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

7 0.1016603 99 nips-2010-Gated Softmax Classification

8 0.099774122 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points

9 0.095253378 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data

10 0.087676309 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

11 0.086280085 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage

12 0.08429385 117 nips-2010-Identifying graph-structured activation patterns in networks

13 0.083449028 224 nips-2010-Regularized estimation of image statistics by Score Matching

14 0.08283101 284 nips-2010-Variational bounds for mixed-data factor analysis

15 0.080107011 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

16 0.076489545 118 nips-2010-Implicit Differentiation by Perturbation

17 0.071576245 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

18 0.0681739 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition

19 0.068067499 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior

20 0.067910582 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.25), (1, 0.057), (2, -0.077), (3, 0.048), (4, -0.088), (5, -0.026), (6, 0.055), (7, 0.079), (8, -0.065), (9, 0.008), (10, -0.097), (11, -0.12), (12, 0.043), (13, -0.113), (14, -0.131), (15, -0.014), (16, 0.025), (17, 0.02), (18, 0.075), (19, 0.035), (20, -0.06), (21, 0.105), (22, -0.075), (23, 0.012), (24, -0.047), (25, 0.034), (26, -0.015), (27, -0.021), (28, -0.023), (29, -0.098), (30, 0.034), (31, -0.064), (32, -0.005), (33, -0.007), (34, 0.006), (35, -0.063), (36, -0.074), (37, -0.03), (38, 0.027), (39, -0.051), (40, 0.146), (41, 0.064), (42, -0.027), (43, 0.0), (44, -0.047), (45, 0.087), (46, -0.105), (47, 0.067), (48, 0.066), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95777351 101 nips-2010-Gaussian sampling by local perturbations

Author: George Papandreou, Alan L. Yuille

Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1
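To make the perturb-then-solve recipe in this abstract concrete, here is a minimal sketch on a hypothetical 1-D chain GMRF with unary and pairwise Gaussian factors; the factor layout, precision values, and the choice of conjugate gradients are illustrative assumptions rather than the authors' code.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 200                          # sites in a hypothetical 1-D chain GMRF
lam_u, lam_p = 1.0, 10.0         # unary and pairwise factor precisions (assumed values)
y = rng.normal(size=n)           # observations pulled in by the unary factors

# Each Gaussian factor k contributes G_k^T P_k G_k to the precision J and
# G_k^T P_k mu_k to the potential h.  Unary factors: G = I, mu = y.
# Pairwise smoothness factors: G rows are (+1, -1) differences, mu = 0.
D = sp.diags([np.ones(n - 1), -np.ones(n - 1)], [0, 1], shape=(n - 1, n))
J = lam_u * sp.eye(n) + lam_p * (D.T @ D)       # GMRF precision (SPD)

def sample_by_local_perturbation():
    # Perturb each factor's mean with noise drawn from that factor's own
    # covariance, then return the mean/mode of the perturbed model.
    eps_u = rng.normal(scale=lam_u ** -0.5, size=n)
    eps_p = rng.normal(scale=lam_p ** -0.5, size=n - 1)
    h_tilde = lam_u * (y + eps_u) + lam_p * (D.T @ eps_p)
    x, info = cg(J, h_tilde, atol=1e-10)        # any SPD solver works here
    assert info == 0
    return x                                    # exact draw from N(J^{-1} h, J^{-1})
```

Because the perturbed potential vector has exactly the right covariance, solving the single SPD system yields an exact sample; repeating the draw and taking per-site sample variances gives the kind of unbiased marginal-variance estimate the abstract mentions.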

2 0.75966173 103 nips-2010-Generating more realistic images using gated MRF's

Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton

Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1 Introduction and Prior Work The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF’s) provide a very general framework for modelling natural images. In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family1 could be learned for small image patches and their work was extended to high-resolution images by Roth and Black [6] who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and 1 Product of Student’s t models (without pooling) may not appear to have latent variables but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable. 1 checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 
2 in [7] which is one of the best models of highresolution images reported in the literature so far). In fact, as our experiments demonstrate the generated samples from these models are more similar to random images than to natural images! When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap. 2 Augmenting Gated MRF’s with Mean Hidden Units A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two-pixel have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: The posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: Modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. 
This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-proessed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the times nearby pixels are strongly correlated. Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering 2 Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can represent better structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, overall when each image can have non-zero mean. the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. 
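The two-pixel toy example above can be reproduced numerically. The sketch below contrasts a zero-mean two-component mixture (the cRBM/PoT view) with a mixture whose components can also shift their means (the mcRBM/mPoT view); the mixing proportion, covariances and mean offsets are made-up values chosen only to mimic the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two candidate covariances: tight and round for smooth pixel pairs,
# elongated along the anti-diagonal for edge pairs (values are illustrative).
cov_smooth = np.array([[0.05, 0.04], [0.04, 0.05]])
cov_edge   = np.array([[1.0, -0.9], [-0.9, 1.0]])

def sample_zero_mean_mixture(n):
    smooth = rng.random(n) < 0.8
    return np.stack([rng.multivariate_normal(np.zeros(2),
                                             cov_smooth if s else cov_edge)
                     for s in smooth])

def sample_mean_plus_covariance_mixture(n):
    out = np.empty((n, 2))
    for i, s in enumerate(rng.random(n) < 0.8):
        # mean units shift edge components along the shared-intensity direction
        mean = np.zeros(2) if s else rng.normal(scale=1.5) * np.ones(2)
        out[i] = rng.multivariate_normal(mean, cov_smooth if s else cov_edge)
    return out
```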
mPoT (and similarly mcRBM [4]) is not only better at modelling zero mean images but it can also represent images that have non zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is: E PoT (x, hc ) = i 1 [hc (1 + (Ci T x)2 ) + (1 − γ) log hc ] i i 2 (1) where x is a vectorized image patch, hc is a vector of Gamma “covariance” latent variables, C is a filter bank matrix and γ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to exp(−E PoT (x, hc )). Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to: Σc = (Cdiag(hc )C T )−1 . (2) In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. The overall energy function becomes: E mPoT (x, hc , hm ) = E PoT (x, hc ) + E m (x, hm ) 3 (3) Figure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away form each other. Best viewed in color. where hm is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is: E m (x, hm ) = 1 T x x− 2 hm Wj T x j (4) j yielding the following conditional distribution over the input pixels: p(x|hc , hm ) = N (Σ(W hm ), Σ), Σ = (Σc + I)−1 (5) with Σc defined in eq. 2. As desired, the conditional distribution has non-zero mean2 . Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different mean. Unfortunately, replicating potentials over the image ignoring variations of mean intensity has been the leading strategy to date [6]3 . This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM. 3 Weight-Sharing Schemes By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF. Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. 
This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size. Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while 2 The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction. 3 The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables. 4 sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weightsharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation has not only high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT causing the learned filters to become “random” looking (see fig 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: p(y) = i S(yi ), where yi = Ci T x and S is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6]. Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, overall when the number of potentials is limited (see also fig. 2-F). TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. 
We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme. 4 Experiments We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistence Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is: F mPoT (x) = − log(p(x))+const. = k,i 1 1 γ log(1+ (Cik T xk )2 )+ xT x− 2 2 T log(1+exp(Wjk xk )) (6) k,j where the index k runs over spatial locations and xk is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps. We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance 5 Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoTTConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar correspond to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv. filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 
4), and 0.001 for the other parameters (γ of eq. 1). During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training4 . Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters have usually non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images 4 The code used in the experiments can be found at the first author’s web-page. 6 MODEL F (R) − F (T ) (104 ) KL(R G) KL(T G) KL(R G) − KL(T PoT - Conv 2.9 0.3 0.6 PoT - TConv 2.8 0.4 1.0 -0.6 mPoT - TConv 5.2 1.0 0.2 0.8 mHPoT - TConv 4.9 1.7 0.8 0.9 mcRBM - TConv 3.5 1.5 1.0 G) -0.3 0.5 Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters. that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images. 
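The evaluation protocol in the preceding paragraph, comparing histograms of Gabor filter responses on natural, generated and random images via KL divergence, can be sketched as follows; the filter bank and binning choices are assumptions rather than the exact settings behind Table 1.

```python
import numpy as np
from scipy.signal import fftconvolve

def response_histogram(images, gabor_filters, bins):
    """Normalized histogram of filter responses pooled over a Gabor bank."""
    resp = [fftconvolve(img, g, mode='valid').ravel()
            for img in images for g in gabor_filters]
    hist, _ = np.histogram(np.concatenate(resp), bins=bins, density=True)
    return hist + 1e-12            # avoid zero bins before taking logs

def kl_divergence(p, q):
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Usage (with hypothetical image arrays): compare KL(random || generated)
# against KL(natural || generated); a good model should make the former
# large and the latter small, as in the comparison discussed above.
```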
4.1 Discriminative Experiments on Weight-Sharing Schemes In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global. 
To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving 7 6 2.4 error rate % 5 error rate % 2.6 Global Local TConv Conv 4 3 2 1 0 2.2 Global Local 2 Conv 1.8 1000 2000 3000 4000 5000 dimensionality 6000 7000 1.6 0 8000 2 4 6 8 # flops (relative to # flops per epoch of Global model) 10 Figure 4: Experiments on MNIST using RBM’s with different weight-sharing schemes. Left: Error rate as a function of the dimensionality of the latent representation. Right: Error rate as a function of the number of operations (normalized to those needed to perform one epoch in the Global model); all models have a twice overcomplete latent representation. similar performance in discrimination. Also, Local can easily scale to larger images while Global cannot. Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM’s are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then “fine-tuned” on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM’s. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2 F). The model trained with Conv weight-sharing scheme yields an accuracy equal to 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day. 5 Conclusions and Future Work This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme. By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF’s into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs. References [1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005. 8 [2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. 
John Wiley & Sons, 2001. [3] G.E. Hinton and R. R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order boltzmann machines. In CVPR, 2010. [5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of gaussians and the statistics of natural images. In NIPS, 2000. [6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005. [7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on mrfs in low-level vision. In CVPR, 2010. [8] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. PAMI, 6:721–741, 1984. [9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In NIPS, 2003. [10] S.C. Zhu and D. Mumford. Prior learning and gibbs reaction diffusion. PAMI, pages 1236–1250, 1997. [11] S. Osindero, M. Welling, and G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006. [12] S. Osindero and G. E. Hinton. Modeling image patches with a directed hierarchy of markov random fields. In NIPS, 2008. [13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009. [14] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37:3311–3325, 1997. [15] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003. [16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007. [17] S. Roth and M. J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009. [18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010. [19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010. [20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, 2010. [21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-ofexperts. In BMCV, 2009. [22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto. [23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989. [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009. [26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996. [27] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008. [28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/. [29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005. [30] http://yann.lecun.com/exdb/mnist/. [31] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. 
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009. [32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002. [33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008. 9

3 0.74229378 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

Author: Iain Murray, Ryan P. Adams

Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
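A minimal univariate slice sampler (stepping out plus shrinkage, in the style of Neal) applied to a single covariance hyperparameter looks as follows; the surrogate-data reparameterizations that make the paper's approach mix well under non-Gaussian observations are not shown, and `log_post` is a placeholder for the user's unnormalized log posterior.

```python
import numpy as np

def slice_sample(log_post, x0, w=1.0, max_steps=50, rng=None):
    """One univariate slice-sampling update of x0 under log_post (unnormalized)."""
    if rng is None:
        rng = np.random.default_rng()
    log_y = log_post(x0) + np.log(rng.random())        # auxiliary slice height
    # Stepping out: place an interval of width w around x0 and grow it
    # until both ends fall below the slice (or the step cap is reached).
    left = x0 - w * rng.random()
    right = left + w
    for _ in range(max_steps):
        if log_post(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_post(right) < log_y:
            break
        right += w
    # Shrinkage: sample uniformly on the interval, shrinking towards x0 on rejection.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_post(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Sweeping this update over each hyperparameter in turn, with `log_post` evaluating the (hypothetical) GP log marginal or conditional posterior at that hyperparameter, gives the kind of low-tuning sampler the abstract compares against.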

4 0.716519 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics

Author: Chang Su, Sargur Srihari

Abstract: A method for computing the rarity of latent fingerprints represented by minutiae is given. It allows determining the probability of finding a match for an evidence print in a database of n known prints. The probability of random correspondence between evidence and database is determined in three procedural steps. In the registration step the latent print is aligned by finding its core point, which is done using a machine-learning procedure based on Gaussian processes. In the evidence probability evaluation step a generative model based on Bayesian networks is used to determine the probability of the evidence; it takes into account both the dependency of each minutia on nearby minutiae and the confidence of their presence in the evidence. In the specific probability of random correspondence step the evidence probability is used to determine the probability of a match among n for a given tolerance; the last evaluation is similar to the birthday correspondence probability for a specific birthday. The generative model is validated using a goodness-of-fit test evaluated with a standard database of fingerprints. The probability of random correspondence for several latent fingerprints is evaluated for varying numbers of minutiae. 1
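The last step described above, converting the evidence probability into a probability of random correspondence within a database of n prints, follows the "specific birthday" pattern; the closed form below is the standard expression for that analogy and is offered as an illustration, not as the paper's exact estimator.

```python
def specific_nprc(p_evidence: float, n: int) -> float:
    """Probability that at least one of n independent prints matches the
    evidence within tolerance, given per-print match probability p_evidence."""
    return 1.0 - (1.0 - p_evidence) ** n

# e.g. specific_nprc(1e-6, 100_000) is roughly 0.095
```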

5 0.65096289 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters

Author: Jose Puertas, Joerg Bornschein, Joerg Luecke

Abstract: We study the application of a strongly non-linear generative model to image patches. As in standard approaches such as Sparse Coding or Independent Component Analysis, the model assumes a sparse prior with independent hidden variables. However, in the place where standard approaches use the sum to combine basis functions we use the maximum. To derive tractable approximations for parameter estimation we apply a novel approach based on variational Expectation Maximization. The derived learning algorithm can be applied to large-scale problems with hundreds of observed and hidden variables. Furthermore, we can infer all model parameters including observation noise and the degree of sparseness. In applications to image patches we find that Gabor-like basis functions are obtained. Gabor-like functions are thus not a feature exclusive to approaches assuming linear superposition. Quantitatively, the inferred basis functions show a large diversity of shapes with many strongly elongated and many circular symmetric functions. The distribution of basis function shapes reflects properties of simple cell receptive fields that are not reproduced by standard linear approaches. In the study of natural image statistics, the implications of using different superposition assumptions have so far not been investigated systematically because models with strong non-linearities have been found analytically and computationally challenging. The presented algorithm represents the first large-scale application of such an approach. 1
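To make the "maximum instead of sum" combination rule concrete, here is a toy generator that combines basis functions with an element-wise max; the sparsity level, noise model and dimensions are made-up values.

```python
import numpy as np

def sample_max_causes(W, pi=0.05, sigma=0.1, rng=None):
    """W: (H, D) basis functions.  Hidden causes are sparse binary; observed
    means combine active causes with an element-wise maximum, not a sum."""
    if rng is None:
        rng = np.random.default_rng()
    H, D = W.shape
    s = rng.random(H) < pi                       # sparse binary hidden vector
    contrib = np.where(s[:, None], W, 0.0)       # s_h * W_h for binary s_h
    mean = contrib.max(axis=0)                   # max-superposition of causes
    return mean + sigma * rng.normal(size=D), s
```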

6 0.63811278 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

7 0.60934329 262 nips-2010-Switched Latent Force Models for Movement Segmentation

8 0.5960623 215 nips-2010-Probabilistic Deterministic Infinite Automata

9 0.59205538 224 nips-2010-Regularized estimation of image statistics by Score Matching

10 0.58911389 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

11 0.57427365 284 nips-2010-Variational bounds for mixed-data factor analysis

12 0.56770986 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits

13 0.56587225 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage

14 0.56467837 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach

15 0.54951048 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

16 0.54296726 99 nips-2010-Gated Softmax Classification

17 0.54251707 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs

18 0.53947532 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

19 0.53299046 256 nips-2010-Structural epitome: a way to summarize one’s visual experience

20 0.53284824 120 nips-2010-Improvements to the Sequence Memoizer


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.032), (17, 0.014), (27, 0.051), (30, 0.046), (35, 0.01), (45, 0.191), (50, 0.459), (52, 0.036), (60, 0.022), (77, 0.032), (78, 0.013), (90, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93649781 120 nips-2010-Improvements to the Sequence Memoizer

Author: Jan Gasthaus, Yee W. Teh

Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1

2 0.93611974 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models

Author: Danny Bickson, Carlos Guestrin

Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1

same-paper 3 0.91814905 101 nips-2010-Gaussian sampling by local perturbations

Author: George Papandreou, Alan L. Yuille

Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1

4 0.89987361 42 nips-2010-Boosting Classifier Cascades

Author: Nuno Vasconcelos, Mohammad J. Saberian

Abstract: The problem of optimal and automatic design of a detector cascade is considered. A novel mathematical model is introduced for a cascaded detector. This model is analytically tractable, leads to recursive computation, and accounts for both classification and complexity. A boosting algorithm, FCBoost, is proposed for fully automated cascade design. It exploits the new cascade model, minimizes a Lagrangian cost that accounts for both classification risk and complexity. It searches the space of cascade configurations to automatically determine the optimal number of stages and their predictors, and is compatible with bootstrapping of negative examples and cost sensitive learning. Experiments show that the resulting cascades have state-of-the-art performance in various computer vision problems. 1

5 0.86017412 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Author: Yi Zhang, Jeff G. Schneider

Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.

6 0.81185627 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

7 0.67196625 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata

8 0.66949171 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

9 0.6682955 54 nips-2010-Copula Processes

10 0.65399122 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage

11 0.64353991 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

12 0.63379747 96 nips-2010-Fractionally Predictive Spiking Neurons

13 0.62792021 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

14 0.62407178 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

15 0.6142903 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

16 0.6086939 217 nips-2010-Probabilistic Multi-Task Feature Selection

17 0.60203463 158 nips-2010-Learning via Gaussian Herding

18 0.6018557 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

19 0.6012131 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices

20 0.60022908 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers