nips nips2010 nips2010-224 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Diederik P. Kingma, Yann LeCun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problems by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
Reference: text
sentIndex sentText sentNum sentScore
1 Regularized estimation of image statistics by Score Matching. Diederik P. Kingma, Yann LeCun. [sent-1, score-0.154]
2 Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. [sent-8, score-0.143]
3 It has been applied to learning natural image statistics but has so far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. [sent-9, score-0.282]
4 We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. [sent-10, score-0.234]
5 In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problems by suppressing instabilities that occur with finite training sample sizes and quantized input values. [sent-11, score-0.187]
6 1 Introduction Consider the subject of density estimation for high-dimensional continuous random variables, like images. [sent-13, score-0.127]
7 Approaches for normalized density estimation, like mixture models, often suffer from the curse of dimensionality. [sent-14, score-0.072]
8 An alternative approach is Product-of-Experts (PoE) [7], where we model the density as a product, rather than a sum, of component (expert) densities. [sent-15, score-0.072]
9 The multiplicative nature of PoE models makes them able to form complex densities: in contrast to mixture models, each expert has the ability to have a strongly negative influence on the density at any point by assigning it a very low component density. [sent-16, score-0.106]
10 However, Maximum Likelihood Estimation (MLE) of the model requires differentiation of a normalizing term, which is infeasible even for low data dimensionality. [sent-17, score-0.25]
11 A recently introduced estimation method is Score Matching (SM) [10], which involves minimizing the squared distance between the model log-density slope (score) and the data log-density slope; the model score is independent of the normalizing term. [sent-18, score-0.208]
12 Unfortunately, applications of SM estimation have thus far been limited. [sent-19, score-0.055]
13 Differentiating the SM loss with respect to the parameters can be very challenging, which somewhat complicates the use of SM in many situations. [sent-21, score-0.089]
14 Furthermore, the proof of the SM estimator [10] requires certain conditions that are often violated, like a smooth underlying density or an infinite number of samples. [sent-22, score-0.101]
15 Contrastive Divergence (CD) and Basis Rotation estimation will be used as a basis for comparison. [sent-29, score-0.133]
16 In section 3 we show how computation and differentiation of the SM loss can be performed in automated fashion. [sent-31, score-0.323]
17 2 Regularized Score Matching Consider an energy-based [17] model E(x; w), where “energy” is the unnormalized negative log-density such that the pdf is p(x; w) = exp(−E(x; w)) / Z(w), where Z(w) is the normalizing constant. [sent-33, score-0.253]
18 In other words, low energies correspond to high probability density, and high energies correspond to low probability density. [sent-34, score-0.074]
19 Score Matching works by fitting the slope (score) of the model density to the slope of the true, underlying density at the data points, which is obviously independent of the vertical offset of the log-density (the normalizing constant). [sent-35, score-0.381]
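For context, the population-level Score Matching objective of [10], to which equation (1) referenced below presumably corresponds, can be written in the energy parameterization as follows (a reconstruction of the standard form, not a verbatim quote of the paper):

```latex
J(w) = \int p_x(x) \sum_{i=1}^{N}
  \left[ \frac{1}{2}\left(\frac{\partial E(x;w)}{\partial x_i}\right)^{2}
         - \frac{\partial^{2} E(x;w)}{(\partial x_i)^{2}} \right] dx
  \; + \; \text{const.}
```

Its sample version over T data points is exactly the loss given in equation (2) below.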
20 Among the conditions (footnote 1) are (1) that p_x(x) is differentiable, and (2) that the log-density is finite everywhere. [sent-37, score-0.117]
21 In practice, the true pdf is unknown, and we have a finite sample of T discrete data points. [sent-38, score-0.103]
22 The sample version of the SM loss function is: $J^S(w) = \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{N}\left[\frac{1}{2}\left(\frac{\partial E(x^{(t)};w)}{\partial x_i}\right)^{2} - \frac{\partial^{2} E(x^{(t)};w)}{(\partial x_i)^{2}}\right]$ (2), which is asymptotically equivalent to equation (1) as T approaches infinity, due to the law of large numbers. [sent-39, score-0.089]
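As a concrete illustration of equation (2), here is a minimal sketch that evaluates the sample SM loss with automatic differentiation (JAX is an assumed choice of framework; `energy` stands for any scalar E(x; w) with the weights closed over, and is not code from the paper):

```python
import jax
import jax.numpy as jnp

def sample_sm_loss(energy, X):
    """Sample Score Matching loss, eq. (2):
    mean over data points of sum_i [ 1/2 (dE/dx_i)^2 - d^2E/dx_i^2 ]."""
    def per_example(x):
        grad = jax.grad(energy)(x)                    # first derivatives dE/dx_i
        diag_hess = jnp.diag(jax.hessian(energy)(x))  # second derivatives d^2E/dx_i^2
        return jnp.sum(0.5 * grad ** 2 - diag_hess)
    return jnp.mean(jax.vmap(per_example)(X))         # average over the T samples
```

Forming the full Hessian only to read off its diagonal is wasteful for large N; the two-pass procedure of section 3 avoids this, but the sketch keeps the correspondence with equation (2) explicit.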
23 This loss function was used in previous publications on SM [10, 12, 13, 15]. [sent-40, score-0.156]
24 2.1 Issues Should these conditions be violated, then (theoretically) the pdf cannot be estimated using equation (1). [sent-42, score-0.132]
25 Unfortunately, situations where the mentioned conditions are violated are not rare. [sent-46, score-0.069]
26 For proper estimation with SM, data can be smoothed by whitening; however, common whitening methods (such as PCA or SVD) are computationally infeasible for large data dimensionality, and generally destroy the local structure of spatial and temporal data such as image and audio. [sent-49, score-0.243]
27 Some previous publications on Score Matching apply zero-phase whitening (ZCA) [13], which computes a weighted sum over an input patch, which removes some of the original quantization, and can potentially be applied convolutionally. [sent-50, score-0.156]
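For reference, a minimal sketch of zero-phase (ZCA) whitening as it is commonly computed on image patches (a standard construction, not taken from the paper; `patches` is an assumed array of shape (num_patches, dim), and `eps` is a small ridge added for numerical stability):

```python
import jax.numpy as jnp

def zca_whitening_matrix(patches, eps=1e-5):
    """ZCA whitening: W_zca = U diag(1/sqrt(lambda + eps)) U^T, where U and lambda
    are the eigenvectors/eigenvalues of the patch covariance matrix."""
    X = patches - patches.mean(axis=0)     # center each dimension
    cov = X.T @ X / X.shape[0]
    lam, U = jnp.linalg.eigh(cov)          # symmetric eigendecomposition
    return U @ jnp.diag(1.0 / jnp.sqrt(lam + eps)) @ U.T

# usage: whitened = (patches - patches.mean(axis=0)) @ zca_whitening_matrix(patches)
```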
28 However, the amount of information removed from the input by such whitening is not parameterized and potentially large. [sent-51, score-0.308]
29 (Footnote 1) The conditions are: the true (underlying) pdf p_x(x) is differentiable, the expectations E[ ||∂ log p_x(x)/∂x||^2 ] and E[ ||∂E(x; w)/∂x||^2 ] w.r.t. x are finite for any w, [sent-54, score-0.088]
30 and p_x(x) ∂E(x; w)/∂x goes to zero for any w when ||x|| → ∞. [sent-55, score-0.089]
31 By this replacement (of each data point x with a noise-perturbed version x + ǫ), the sample pdf becomes smooth and the conditions for proper SM estimation become satisfied. [sent-61, score-0.187]
32 The expected value of the sample loss is: $E\left[J^S(x+\epsilon; w)\right] = \sum_{i=1}^{N}\frac{1}{2} E\left[\left(\frac{\partial E(x+\epsilon; w)}{\partial (x_i+\epsilon_i)}\right)^{2}\right] - \sum_{i=1}^{N} E\left[\frac{\partial^{2} E(x+\epsilon; w)}{(\partial (x_i+\epsilon_i))^{2}}\right]$ (3). We approximate the first and second term with a simple first-order Taylor expansion. [sent-62, score-0.089]
33 While minimization of the above loss may be feasible in some situations, in general it requires differentiation of the full Hessian w.r.t. the weights. [sent-69, score-0.278]
34 This regularized loss is computationally convenient: the added complexity is almost negligible, since differentiation of the second-derivative terms ∂²E/(∂x_i)² w.r.t. [sent-75, score-0.329]
35 the weights is already required for unregularized Score Matching. [sent-78, score-0.13]
36 The regularizer is related to Tikhonov regularization [22] and curvature-driven smoothing [2], where the square of the curvature of the energy surface at the data points is also penalized. [sent-79, score-0.148]
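Putting this together, the regularized loss adds λ times the squared diagonal second derivatives to equation (2); this form is consistent with the curvature penalty described here and with the K + L split given in section 3. A sketch in the same JAX conventions as above:

```python
import jax
import jax.numpy as jnp

def regularized_sm_loss(energy, X, lam):
    """Regularized SM loss: eq. (2) plus lam * sum_i (d^2E/dx_i^2)^2 per data point,
    i.e. a penalty on the curvature of the energy surface at the data."""
    def per_example(x):
        grad = jax.grad(energy)(x)
        diag_hess = jnp.diag(jax.hessian(energy)(x))
        plain = jnp.sum(0.5 * grad ** 2 - diag_hess)
        penalty = lam * jnp.sum(diag_hess ** 2)   # Tikhonov-like curvature term
        return plain + penalty
    return jnp.mean(jax.vmap(per_example)(X))
```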
37 (Figure caption) Black lines: computation of the quantities δ_j = ∂E/∂g_j, δ′_j = ∂²E/(∂g_j)², and the SM loss J(x; w). [sent-82, score-0.089]
38 Red lines indicate the computational flow for differentiation of the Score Matching loss. [sent-83, score-0.189]
39 The influence of the weights is not shown; their derivatives are computed in the last step. [sent-86, score-0.106]
40 3 Automatic differentiation of J(x; w) In most optimization methods for energy-based models [17], the sample loss is defined in terms of quantities readily obtained by forward inference in the model. [sent-87, score-0.374]
41 Derivatives of such a loss w.r.t. the weights can then be obtained in a straightforward and efficient fashion by standard application of the backpropagation algorithm. [sent-91, score-0.165]
42 For Score Matching, the situation is more complex since the (regularized) loss (equations 2, 7) is defined in terms of {∂E/∂x_i} and {∂²E/(∂x_i)²}, each term being some function of x and w. [sent-92, score-0.089]
43 In earlier publications on Score Matching for continuous variables [10, 12, 13, 15], the authors rewrote {∂E/∂x_i} and {∂²E/(∂x_i)²} into their explicit forms in terms of x and w by manually differentiating the energy (footnote 2). [sent-93, score-0.127]
44 This manual differentiation was repeated for different models, and is arguably a rather inflexible approach. [sent-98, score-0.189]
45 A procedure that could automatically (1) compute and (2) differentiate the loss would make SM estimation more accessible and flexible in practice. [sent-99, score-0.144]
46 Consequently, the terms {∂E/∂x_i} and {∂²E/(∂x_i)²} can be efficiently computed using a forward and backward pass: the first pass performs forward inference (computation of E(x; w)) and the second pass applies the backpropagation algorithm [3] to obtain the derivatives of the energy w.r.t. the input. [sent-103, score-0.505]
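In an automatic-differentiation framework these first two passes amount to a single reverse-mode call; a minimal sketch (JAX assumed, with a toy quadratic energy standing in for E(x; w)):

```python
import jax
import jax.numpy as jnp

A = 2.0 * jnp.eye(4)                        # toy precision matrix, for illustration only
energy = lambda x: 0.5 * x @ A @ x          # any scalar energy E(x; w) works here

x = jnp.ones(4)
E_value, dE_dx = jax.value_and_grad(energy)(x)   # forward pass (E) + backward pass (dE/dx)
```

The diagonal second derivatives require the extended backward recursion described next.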
47 However, only the loss J(x; w) is obtained by these two steps. [sent-107, score-0.089]
48 For differentiation of this loss, one must perform an additional forward and backward pass. [sent-108, score-0.294]
49 3.1 Obtaining the loss Consider a feed-forward neural network with input vector x and weights w and an ordered set of nodes indexed 1 ... [sent-110, score-0.216]
50 N, each node j with child nodes i ∈ children(j) with j < i and parent nodes k ∈ parents(j) with k < j. [sent-113, score-0.115]
51 The first D < N nodes are input nodes, for which the activation value is g_j = x_j. [sent-114, score-0.614]
52 For the other nodes (hidden units and output unit), the activation value is determined by a differentiable scalar function g_j({g_i}_{i ∈ parents(j)}, w). [sent-115, score-0.616]
53 However, backpropagation of the full Hessian scales like O(W²), where W is the number of model weights. [sent-119, score-0.144]
54 Here, we limit backpropagation to the diagonal approximation, which scales like O(W) [1]. [sent-120, score-0.144]
55 This will still result in the correct gradients ∂²E/(∂x_j)² for one-layer models and the models considered in this paper. [sent-121, score-0.068]
56 The SM loss is split into two terms: J(x; w) = K + L with $K = \frac{1}{2}\sum_{j=1}^{D} \delta_j^2$ and $L = \sum_{j=1}^{D}\left[-\delta'_j + \lambda (\delta'_j)^2\right]$. [sent-124, score-0.089]
57 The equations for inference and backpropagation are given as the first two for-loops in Algorithm 1. [sent-125, score-0.115]
58 (Footnote 2) Most previous publications do not express the unnormalized negative log-density as “energy”. [sent-126, score-0.1]
59 Algorithm 1. Input: x, w (data and weight vectors); for j ← D + 1 to N do: compute g_j({g_i}_{i ∈ parents(j)}, w). [sent-127, score-0.504]
60 The combined network can be differentiated by an extended version of the double-backpropagation procedure [5], with the main difference that the appended network not only computes {δ_j}, but also {δ′_j}. [sent-135, score-0.15]
61 Automatic differentiation of the combined network consists of two phases, corresponding to reverse traversal of the appended and original network respectively: (1) obtaining ∂K/∂δ_j, ∂L/∂δ_j and ∂L/∂δ′_j for each node j in order 1 to N; (2) obtaining ∂J/∂g_j for each node j in order N to D + 1. [sent-136, score-0.413]
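A minimal sketch of these quantities for the one-layer PoE-style energy of equation (8) below, for which the diagonal second-derivative backpropagation is exact; g1 and g2 denote the first and second derivatives of the expert nonlinearity g and are assumed to be supplied by the caller:

```python
import jax.numpy as jnp

def sm_loss_one_layer(x, W, alpha, g1, g2, lam=0.0):
    """Forward/backward passes for E(x) = sum_i alpha_i g(w_i^T x):
        delta_j  = dE/dx_j     = sum_i alpha_i g'(w_i^T x)  W_ij
        delta2_j = d^2E/dx_j^2 = sum_i alpha_i g''(w_i^T x) W_ij^2
    and the (regularized) SM loss J = K + L built from them."""
    a = W @ x                                  # forward pass: filter responses
    delta = W.T @ (alpha * g1(a))              # backward pass: first derivatives
    delta2 = (W ** 2).T @ (alpha * g2(a))      # diagonal second derivatives
    K = 0.5 * jnp.sum(delta ** 2)
    L = jnp.sum(-delta2 + lam * delta2 ** 2)
    return K + L
```

For a Student-t style expert g(u) = log(1 + u²/2), one would pass g1(u) = u / (1 + u²/2) and g2(u) = (1 − u²/2) / (1 + u²/2)²; this particular g is an assumption, since the extracted text does not spell it out.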
62 4 Experiments Consider the following Product-of-Experts (PoE) model: $E(x; W, \alpha) = \sum_{i=1}^{M} \alpha_i\, g(w_i^T x)$ (8), where M is the number of experts, w_i is an image filter (the i-th row of W), and the α_i are scaling parameters. [sent-139, score-0.163]
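A sketch of equation (8) as a differentiable function that can be plugged into the loss sketches above; the Student-t expert g(u) = log(1 + u²/2) is an assumed choice, not stated in the extracted text:

```python
import jax.numpy as jnp

def poe_energy(x, W, alpha):
    """E(x; W, alpha) = sum_i alpha_i g(w_i^T x), with g(u) = log(1 + u^2 / 2)."""
    return jnp.sum(alpha * jnp.log1p(0.5 * (W @ x) ** 2))

# e.g. W and alpha could then be fit by gradient descent on
# regularized_sm_loss(lambda x: poe_energy(x, W, alpha), X, lam)
```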
63 4.1 MNIST The first task is to estimate a density model of the MNIST handwritten digits [16]. [sent-146, score-0.072]
64 Since a large number of models need to be learned, a 2× downsampled version of MNIST was used. [sent-147, score-0.076]
65 The MNIST dataset is highly non-smooth: for each pixel, the extreme values (0 and 1) are highly frequent leading to sharp discontinuities in the data density at these points. [sent-148, score-0.072]
66 It is well known that for models with square weight matrix W, normalized g(·) [sent-149, score-0.066]
67 (meaning $\int_{-\infty}^{\infty} \exp(-g(x))\,dx = 1$) and α_i = 1, the normalizing constant can be computed [10]: Z(w) = |det W|. [sent-150, score-0.061]
68 Unregularized and regularized models for different choices of λ were estimated, and log-likelihood values were computed. [sent-152, score-0.127]
69 As expected, unregularized estimation did not result in an accurate model. [sent-159, score-0.135]
70 4.2 Denoising Consider grayscale natural image data from the Berkeley dataset [18]. [sent-164, score-0.099]
71 The data is quantized and therefore non-smooth, so regularization is potentially beneficial. [sent-165, score-0.098]
72 In order to estimate the correct regularization magnitude, we again estimated a PoE model as in equation (8) with square W, such that Z(w) = |det W|, and computed the log-likelihood of 10,000 random natural image patches. [sent-166, score-0.075]
73 This value is lower than for MNIST data since natural image data is “less unsmooth”. [sent-169, score-0.099]
74 Subsequently, a convolutional PoE model known as Fields-of-Experts (FoE) [21] was estimated using regularized SM: $E(x; W, \alpha) = \sum_{p} \sum_{i=1}^{M} \alpha_i\, g(w_i^T x^{(p)})$ (9), where p runs over image positions and x^{(p)} is a square image patch at position p. [sent-170, score-0.281]
75 This is approximately equivalent to ZCA whitening used in [15]. [sent-175, score-0.089]
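A sketch of the convolutional (FoE) energy of equation (9), summing the patch-wise PoE energy over all image positions (a plain loop over positions, written for clarity rather than efficiency; the filter size k and the nonlinearity g are assumed inputs):

```python
import jax.numpy as jnp

def foe_energy(image, filters, alpha, g):
    """E(x; W, alpha) = sum_p sum_i alpha_i g(w_i^T x^(p)), with x^(p) the k x k
    patch at image position p and `filters` an array of shape (M, k, k)."""
    M, k, _ = filters.shape
    H, W_ = image.shape
    Wmat = filters.reshape(M, k * k)
    total = 0.0
    for r in range(H - k + 1):                              # all valid patch positions
        for c in range(W_ - k + 1):
            patch = image[r:r + k, c:c + k].reshape(k * k)  # x^(p)
            total = total + jnp.sum(alpha * g(Wmat @ patch))
    return total
```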
76 Models are evaluated by means of Bayesian denoising using maximum a posteriori (MAP) estimation. [sent-176, score-0.158]
77 As in a general Bayesian image restoration framework, the goal is to estimate the original input x given a noisy image y using the Bayesian proportionality p(x|y) ∝ p(y|x)p(x). [sent-177, score-0.198]
78 The assumption is white Gaussian noise distributed as N(0, σ²I), so that the likelihood is p(y|x) = N(y; x, σ²I). [sent-178, score-0.083]
79 The gradient of the log-posterior is: $\nabla_x \log p(x|y) = -\nabla_x E(x; w) - \nabla_x \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - x_i)^2$ (10). Denoising is performed by initializing x to the noisy image, followed by 300 steps of x′ ← x + α∇_x log p(x|y) (steepest ascent on the log-posterior), with α annealed from 2·10⁻² to 5·10⁻⁴. [sent-180, score-0.205]
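A sketch of the denoising loop described here, using the log-posterior gradient of equation (10); the 300 steps and the 2·10⁻² to 5·10⁻⁴ range come from the text, while the geometric annealing schedule and the use of autodiff for ∇_x E are assumptions:

```python
import jax
import jax.numpy as jnp

def map_denoise(y, energy, sigma, steps=300, alpha0=2e-2, alpha1=5e-4):
    """Iterative MAP denoising: ascend log p(x|y) = -E(x; w) - ||y - x||^2 / (2 sigma^2) + const."""
    grad_logpost = jax.grad(
        lambda x: -energy(x) - jnp.sum((y - x) ** 2) / (2.0 * sigma ** 2)
    )
    x = y                                                        # start from the noisy image
    for t in range(steps):
        alpha = alpha0 * (alpha1 / alpha0) ** (t / (steps - 1))  # annealed step size
        x = x + alpha * grad_logpost(x)
    return x
```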
80 For comparison, we ran the same denoising procedure with models estimated by CD-1 and Basis Rotation, from [21] and [23] respectively. [sent-181, score-0.192]
81 The CD-1 model has been extensively applied to denoising before [21] and shown to compare favourably to specialized denoising methods. [sent-183, score-0.316]
82 Regularization turns out to be important for optimal denoising (see figure 2[e-g]). [sent-186, score-0.158]
83 See table 1 for denoising performance of the optimal model for specific standard images. [sent-187, score-0.158]
84 (Figure 2 caption, partially recovered) (c) denoised PSNR at σ_noise = 1/256. [sent-210, score-0.085]
85 Middle and bottom: random sample of filters from unregularized and regularized models. [sent-214, score-0.131]
86 (d) Average log-likelihood of 10,000 random natural image patches for the complete model, for different choices of λ. [sent-222, score-0.189]
87 (e-g) PSNR of 500 denoised images, for different levels of noise and choices of λ. [sent-223, score-0.173]
88 Note that λ∗ ≈ 10⁻⁵, both for maximum likelihood and best denoising performance. [sent-224, score-0.195]
89 Table 1: Peak signal-to-noise ratio (PSNR) of denoised images with σ_noise = 5/256. [sent-232, score-0.128]
90 For the super-resolution task, an original image x_orig is sampled down to an image x_small by averaging blocks of 2 × 2 pixels into a single pixel. [sent-266, score-0.254]
91 A first approximation x is computed by linearly scaling up x_small and subsequent application of a low-pass filter to remove false high-frequency information. [sent-267, score-0.121]
92 The image is then fine-tuned by 200 repetitions of two subsequent steps: (1) refining the image slightly using x′ ← x + α∇_x E(x; w) with α annealed from 2·10⁻² to 5·10⁻⁴; (2) updating each k × k block of pixels such that their average corresponds to the down-sampled value. [sent-268, score-0.275]
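A sketch of the super-resolution fine-tuning loop: a small gradient step on the image under the energy model, alternated with a projection that restores the observed 2 × 2 block averages. Treating step (1) as descent on the energy and using an exact block-mean shift for step (2) are assumptions about details the extracted text leaves open:

```python
import jax
import jax.numpy as jnp

def super_resolve(x_small, x_init, energy, steps=200, alpha0=2e-2, alpha1=5e-4, k=2):
    """Refine an upsampled image under the prior E while keeping every k x k block
    average equal to the corresponding pixel of the down-sampled image x_small."""
    grad_E = jax.grad(energy)
    x = x_init
    for t in range(steps):
        alpha = alpha0 * (alpha1 / alpha0) ** (t / (steps - 1))
        x = x - alpha * grad_E(x)                     # (1) refine slightly under the prior
        H, W_ = x_small.shape                         # (2) restore the k x k block averages
        blocks = x.reshape(H, k, W_, k)
        shift = (x_small - blocks.mean(axis=(1, 3)))[:, None, :, None]
        x = (blocks + shift).reshape(H * k, W_ * k)
    return x
```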
93 Note: the simple block-downsampling results in serious aliasing artifacts in the Barbara image, so the Castle image is used instead. [sent-269, score-0.099]
94 The considered models give slight improvements in terms of PSNR over the initial solution with the low-pass filter. [sent-272, score-0.097]
95 (Table: PSNR of super-resolved images) Columns: Weights, Peppers, House, Lena, Boat, Castle; first row: Low pass filter. [sent-275, score-0.063]
96 5 Conclusion We have shown how the addition of a principled regularization term to the expression of the Score Matching loss lifts continuity assumptions on the data density, such that the estimation method becomes more generally applicable. [sent-295, score-0.187]
97 The effectiveness of the regularizer was verified with the discontinuous MNIST and Berkeley datasets, with respect to the likelihood of test data under the model. [sent-296, score-0.107]
98 For both datasets, the optimal regularization parameter is approximately equal for both likelihood and subsequent classification and denoising tasks. [sent-297, score-0.273]
99 In addition, we showed how computation and differentiation of the Score Matching loss can be automated using an efficient algorithm. [sent-298, score-0.323]
100 Modeling image patches with a directed hierarchy of markov random fields. [sent-449, score-0.147]
wordName wordTfidf (topN-words)
[('gj', 0.504), ('sm', 0.299), ('poe', 0.196), ('gk', 0.189), ('differentiation', 0.189), ('rotation', 0.18), ('score', 0.169), ('denoising', 0.158), ('hyv', 0.147), ('psnr', 0.136), ('mnist', 0.124), ('backpropagation', 0.115), ('matching', 0.103), ('pdf', 0.103), ('gi', 0.1), ('parents', 0.1), ('image', 0.099), ('loss', 0.089), ('whitening', 0.089), ('px', 0.088), ('denoised', 0.085), ('lena', 0.084), ('nce', 0.084), ('sesm', 0.084), ('lters', 0.084), ('ica', 0.083), ('xi', 0.082), ('rbm', 0.081), ('unregularized', 0.08), ('basis', 0.078), ('appended', 0.074), ('density', 0.072), ('children', 0.07), ('publications', 0.067), ('pass', 0.063), ('forward', 0.062), ('normalizing', 0.061), ('slope', 0.06), ('cd', 0.06), ('differentiating', 0.06), ('castle', 0.056), ('lindgren', 0.056), ('logdensity', 0.056), ('peppers', 0.056), ('xsmall', 0.056), ('derivatives', 0.056), ('estimation', 0.055), ('quantized', 0.055), ('regularized', 0.051), ('osindero', 0.05), ('weights', 0.05), ('zca', 0.049), ('lecun', 0.049), ('hinton', 0.049), ('patches', 0.048), ('subsequently', 0.048), ('noise', 0.046), ('automated', 0.045), ('contrastive', 0.044), ('editors', 0.044), ('backward', 0.043), ('images', 0.043), ('regularization', 0.043), ('boat', 0.042), ('ster', 0.042), ('annealed', 0.042), ('downsampled', 0.042), ('choices', 0.042), ('energy', 0.041), ('violated', 0.04), ('yann', 0.04), ('barbara', 0.04), ('nodes', 0.039), ('discontinuous', 0.038), ('differentiable', 0.038), ('network', 0.038), ('likelihood', 0.037), ('node', 0.037), ('house', 0.037), ('energies', 0.037), ('xj', 0.036), ('hessian', 0.036), ('activation', 0.035), ('subsequent', 0.035), ('wi', 0.034), ('models', 0.034), ('unnormalized', 0.033), ('experts', 0.033), ('square', 0.032), ('peak', 0.032), ('berkeley', 0.032), ('regularizer', 0.032), ('becker', 0.032), ('gure', 0.031), ('scaling', 0.03), ('ranzato', 0.03), ('student', 0.029), ('conditions', 0.029), ('scales', 0.029), ('fields', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann LeCun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
2 0.15591404 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
3 0.13915032 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
4 0.13099892 103 nips-2010-Generating more realistic images using gated MRF's
Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton
Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1 Introduction and Prior Work The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF’s) provide a very general framework for modelling natural images. In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family1 could be learned for small image patches and their work was extended to high-resolution images by Roth and Black [6] who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and 1 Product of Student’s t models (without pooling) may not appear to have latent variables but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable. 1 checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 
2 in [7] which is one of the best models of highresolution images reported in the literature so far). In fact, as our experiments demonstrate the generated samples from these models are more similar to random images than to natural images! When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap. 2 Augmenting Gated MRF’s with Mean Hidden Units A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two-pixel have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: The posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: Modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. 
This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-proessed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the times nearby pixels are strongly correlated. Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering 2 Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can represent better structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, overall when each image can have non-zero mean. the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. 
mPoT (and similarly mcRBM [4]) is not only better at modelling zero mean images but it can also represent images that have non zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is: E PoT (x, hc ) = i 1 [hc (1 + (Ci T x)2 ) + (1 − γ) log hc ] i i 2 (1) where x is a vectorized image patch, hc is a vector of Gamma “covariance” latent variables, C is a filter bank matrix and γ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to exp(−E PoT (x, hc )). Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to: Σc = (Cdiag(hc )C T )−1 . (2) In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. The overall energy function becomes: E mPoT (x, hc , hm ) = E PoT (x, hc ) + E m (x, hm ) 3 (3) Figure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away form each other. Best viewed in color. where hm is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is: E m (x, hm ) = 1 T x x− 2 hm Wj T x j (4) j yielding the following conditional distribution over the input pixels: p(x|hc , hm ) = N (Σ(W hm ), Σ), Σ = (Σc + I)−1 (5) with Σc defined in eq. 2. As desired, the conditional distribution has non-zero mean2 . Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different mean. Unfortunately, replicating potentials over the image ignoring variations of mean intensity has been the leading strategy to date [6]3 . This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM. 3 Weight-Sharing Schemes By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF. Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. 
This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size. Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while 2 The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction. 3 The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables. 4 sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weightsharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation has not only high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT causing the learned filters to become “random” looking (see fig 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: p(y) = i S(yi ), where yi = Ci T x and S is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6]. Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, overall when the number of potentials is limited (see also fig. 2-F). TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. 
We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme. 4 Experiments We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistence Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is: F mPoT (x) = − log(p(x))+const. = k,i 1 1 γ log(1+ (Cik T xk )2 )+ xT x− 2 2 T log(1+exp(Wjk xk )) (6) k,j where the index k runs over spatial locations and xk is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps. We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance 5 Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoTTConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar correspond to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv. filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 
4), and 0.001 for the other parameters (γ of eq. 1). During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training4 . Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters have usually non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images 4 The code used in the experiments can be found at the first author’s web-page. 6 MODEL F (R) − F (T ) (104 ) KL(R G) KL(T G) KL(R G) − KL(T PoT - Conv 2.9 0.3 0.6 PoT - TConv 2.8 0.4 1.0 -0.6 mPoT - TConv 5.2 1.0 0.2 0.8 mHPoT - TConv 4.9 1.7 0.8 0.9 mcRBM - TConv 3.5 1.5 1.0 G) -0.3 0.5 Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters. that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images. 
4.1 Discriminative Experiments on Weight-Sharing Schemes In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global. 
To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving 7 6 2.4 error rate % 5 error rate % 2.6 Global Local TConv Conv 4 3 2 1 0 2.2 Global Local 2 Conv 1.8 1000 2000 3000 4000 5000 dimensionality 6000 7000 1.6 0 8000 2 4 6 8 # flops (relative to # flops per epoch of Global model) 10 Figure 4: Experiments on MNIST using RBM’s with different weight-sharing schemes. Left: Error rate as a function of the dimensionality of the latent representation. Right: Error rate as a function of the number of operations (normalized to those needed to perform one epoch in the Global model); all models have a twice overcomplete latent representation. similar performance in discrimination. Also, Local can easily scale to larger images while Global cannot. Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM’s are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then “fine-tuned” on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM’s. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2 F). The model trained with Conv weight-sharing scheme yields an accuracy equal to 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day. 5 Conclusions and Future Work This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme. By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF’s into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs. References [1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005. 8 [2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. 
John Wiley & Sons, 2001. [3] G.E. Hinton and R. R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order boltzmann machines. In CVPR, 2010. [5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of gaussians and the statistics of natural images. In NIPS, 2000. [6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005. [7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on mrfs in low-level vision. In CVPR, 2010. [8] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. PAMI, 6:721–741, 1984. [9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In NIPS, 2003. [10] S.C. Zhu and D. Mumford. Prior learning and gibbs reaction diffusion. PAMI, pages 1236–1250, 1997. [11] S. Osindero, M. Welling, and G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006. [12] S. Osindero and G. E. Hinton. Modeling image patches with a directed hierarchy of markov random fields. In NIPS, 2008. [13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009. [14] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37:3311–3325, 1997. [15] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003. [16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007. [17] S. Roth and M. J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009. [18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010. [19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010. [20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, 2010. [21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-ofexperts. In BMCV, 2009. [22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto. [23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989. [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009. [26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996. [27] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008. [28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/. [29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005. [30] http://yann.lecun.com/exdb/mnist/. [31] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. 
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009. [32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002. [33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008.
5 0.10583823 99 nips-2010-Gated Softmax Classification
Author: Roland Memisevic, Christopher Zach, Marc Pollefeys, Geoffrey E. Hinton
Abstract: We describe a “log-bilinear” model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets.
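The exact marginalization this abstract describes factorizes over the binary latent variables, so class probabilities can be computed in closed form. Below is a minimal NumPy sketch of that computation for an unfactored weight tensor; the array names, shapes, and values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gated_softmax_probs(x, W, b_hid, b_cls):
    """Class probabilities of a gated softmax ("log-bilinear") model.

    Marginalizing the binary latent variables h exactly gives
        p(c|x) proportional to exp(b_cls[c] + sum_i softplus(W[c, i] . x + b_hid[c, i]))
    because, conditioned on (x, c), the h_i are independent.

    x     : (D,)      input vector
    W     : (C, H, D) weight tensor (class, hidden unit, input)
    b_hid : (C, H)    hidden biases
    b_cls : (C,)      class biases
    """
    pre = np.einsum('chd,d->ch', W, x) + b_hid           # (C, H) hidden pre-activations
    log_scores = b_cls + np.logaddexp(0.0, pre).sum(1)   # softplus = log(1 + e^pre)
    log_scores -= log_scores.max()                       # stabilize the softmax
    p = np.exp(log_scores)
    return p / p.sum()

# toy usage with random parameters
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W = 0.1 * rng.normal(size=(10, 32, 16))
print(gated_softmax_probs(x, W, np.zeros((10, 32)), np.zeros(10)))
```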
6 0.092045978 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
7 0.091436185 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
8 0.091367014 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
9 0.083449028 101 nips-2010-Gaussian sampling by local perturbations
10 0.075679451 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
11 0.07203161 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
12 0.070022829 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
13 0.069347724 228 nips-2010-Reverse Multi-Label Learning
14 0.067627251 149 nips-2010-Learning To Count Objects in Images
15 0.066424839 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
16 0.065271556 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
17 0.065139592 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
18 0.063860439 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
19 0.063521311 235 nips-2010-Self-Paced Learning for Latent Variable Models
20 0.062511168 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
topicId topicWeight
[(0, 0.21), (1, 0.074), (2, -0.051), (3, -0.021), (4, -0.009), (5, -0.033), (6, -0.001), (7, 0.068), (8, -0.049), (9, 0.005), (10, -0.084), (11, -0.061), (12, -0.009), (13, -0.055), (14, -0.096), (15, -0.042), (16, 0.019), (17, -0.044), (18, -0.011), (19, -0.087), (20, 0.031), (21, -0.014), (22, 0.052), (23, -0.042), (24, -0.036), (25, 0.109), (26, 0.099), (27, -0.045), (28, 0.038), (29, -0.036), (30, -0.014), (31, 0.055), (32, 0.131), (33, -0.059), (34, 0.004), (35, -0.018), (36, -0.094), (37, -0.069), (38, 0.069), (39, -0.046), (40, 0.016), (41, 0.031), (42, 0.123), (43, 0.179), (44, 0.032), (45, 0.072), (46, -0.142), (47, -0.042), (48, 0.127), (49, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.9226386 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann L. Cun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
2 0.65406168 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
Author: Vicky Froyen, Jacob Feldman, Manish Singh
Abstract: Figure/ground assignment, in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing, but its underlying computational mechanisms are poorly understood. Figural assignment (often referred to as border ownership) can vary along a contour, suggesting a spatially distributed process whereby local and global cues are combined to yield local estimates of border ownership. In this paper we model figure/ground estimation in a Bayesian belief network, attempting to capture the propagation of border ownership across the image as local cues (contour curvature and T-junctions) interact with more global cues to yield a figure/ground assignment. Our network includes as a nonlocal factor skeletal (medial axis) structure, under the hypothesis that medial structure “draws” border ownership so that borders are owned by the skeletal hypothesis that best explains them. We also briefly present a psychophysical experiment in which we measured local border ownership along a contour at various distances from an inducing cue (a T-junction). Both the human subjects and the network show similar patterns of performance, converging rapidly to a similar pattern of spatial variation in border ownership along contours. Figure/ground assignment (further referred to as f/g), in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing. A number of factors are known to affect f/g assignment, including region size [9], convexity [7, 16], and symmetry [1, 7, 11]. Figural assignment (often referred to as border ownership, under the assumption that the figural side “owns” the border) is usually studied globally, meaning that entire surfaces and their enclosing boundaries are assumed to receive a globally consistent figural status. But recent psychophysical findings [8] have suggested that border ownership can vary locally along a boundary, even leading to a globally inconsistent figure/ground assignment—broadly consistent with electrophysiological evidence showing local coding for border ownership in area V2 as early as 68 msec after image onset [20]. This suggests a spatially distributed and potentially competitive process of figural assignment [15], in which adjacent surfaces compete to own their common boundary, with figural status propagating across the image as this competition proceeds. But both the principles and computational mechanisms underlying this process are poorly understood. ∗ V.F. was supported by a Fullbright Honorary fellowship and by the Rutgers NSF IGERT program in Perceptual Science, NSF DGE 0549115, J.F. by NIH R01 EY15888, and M.S. by NSF CCF-0541185 1 In this paper we consider how border ownership might propagate over both space and time—that is, across the image as well as over the progression of computation. Following Weiss et al. [18] we adopt a Bayesian belief network architecture, with nodes along boundaries representing estimated border ownership, and connections arranged so that both neighboring nodes and nonlocal integrating nodes combine to influence local estimates of border ownership. Our model is novel in two particular respects: (a) we combine both local and global influences on border ownership in an integrated and principled way; and (b) we include as a nonlocal factor skeletal (medial axis) influences on f/g assignment. 
Skeletal structure has not been previously considered as a factor on border ownership, but its relevance follows from a model [4] in which shapes are conceived of as generated by or “grown” from an internal skeleton, with the consequence that their boundaries are perceptually “owned” by the skeletal side. We also briey present a psychophysical experiment in which we measured local border ownership along a contour, at several distances from a strong local f/g inducing cue, and at several time delays after the onset of the cue. The results show measurable spatial differences in judged border ownership, with judgments varying with distance from the inducer; but no temporal effect, with essentially asymptotic judgments even after very brief exposures. Both results are consistent with the behavior of the network, which converges quickly to an asymptotic but spatially nonuniform f/g assignment. 1 The Model The Network. For simplicity, we take an edge map as input for the model, assuming that edges and T-junctions have already been detected. From this edge map we then create a Bayesian belief network consisting of four hierarchical levels. At the input level the model receives evidence E from the image, consisting of local contour curvature and T-junctions. The nodes for this level are placed at equidistant locations along the contour. At the first level the model estimates local border ownership. The border ownership, or B-nodes at this level are at the same locations as the E-nodes, but are connected to their nearest neighbors, and are the parent of the E-node at their location. (As a simplifying assumption, such connections are broken at T-junctions in such a way that the occluded contour is disconnected from the occluder.) The highest level has skeletal nodes, S, whose positions are defined by the circumcenters of the Delaunay triangulation on all the E-nodes, creating a coarse medial axis skeleton [13]. Because of the structure of the Delaunay, each S-node is connected to exactly three E-nodes from which they receive information about the position and the local tangent of the contour. In the current state of the model the S-nodes are “passive”, meaning their posteriors are computed before the model is initiated. Between the S nodes and the B nodes are the grouping nodes G. They have the same positions as the S-nodes and the same Delaunay connections, but to B-nodes that have the same image positions as the E-nodes. They will integrate information from distant B-nodes, applying an interiority cue that is influenced by the local strength of skeletal axes as computed by the S-nodes (Fig. 1). Although this is a multiply connected network, we have found that given reasonable parameters the model converges to intuitive posteriors for a variety of shapes (see below). Updating. Our goal is to compute the posterior p(Bi |I), where I is the whole image. Bi is a binary variable coding for the local direction of border ownership, that is, the side that owns the border. In order for border ownership estimates to be influenced by image structure elsewhere in the image, information has to propagate throughout the network. To achieve this propagation, we use standard equations for node updating [14, 12]. However while to all other connections being directed, connections at the B-node level are undirected, causing each node to be child and parent node at the same time. Considering only the B-node level, a node Bi is only separated from the rest of the network by its two neighbors. 
Hence the Markovian property applies, in that Bi only needs to get iterative information from its neighbors to eventually compute p(Bi |I). So considering the whole network, at each iteration t, Bi receives information from both its child, Ei and from its parents—that is neigbouring nodes (Bi+1 and Bi−1 )—as well as all grouping nodes connected to it (Gj , ..., Gm ). The latter encode for interiority versus exteriority, interiority meaning that the B-node’s estimated gural direction points towards the G-node in question, exteriority meaning that it points away. Integrating all this information creates a multidimensional likelihood function: p(Bi |Bi−1 , Bi+1 , Gj , ..., Gm ). Because of its complexity we choose to approximate it (assuming all nodes are marginally independent of each other when conditioned on Bi ) by 2 Figure 1: Basic network structure of the model. Both skeletal (S-nodes) and border-ownerhsip nodes (B-nodes) get evidence from E-nodes, though different types. S-nodes receive mere positional information, while B-nodes receive information about local curvature and the presence of T-junctions. Because of the structure of the Delaunay triangulation S-nodes and G-nodes (grouping nodes) always get input from exactly three nodes, respectively E and B-nodes. The gray color depicts the fact that this part of the network is computed before the model is initiated and does not thereafter interact with the dynamics of the model. m p(Bi |Pj , ..., Pm ) ∝ p(Bi |Pj ) (1) j where the Pj ’s are the parents of Bi . Given this, at each iteration, each node Bi performs the following computation: Bel(Bi ) ← cλ(Bi )π(Bi )α(Bi )β(Bi ) (2) where conceptually λ stands for bottom-up information, π for top down information and α and β for information received from within the same level. More formally, λ(Bi ) ← p(E|Bi ) (3) m π(Bi ) ← p(Bi |Gj )πGj (Bi ) j (4) Gj and analogously to equation 4 for α(Bi ) and β(Bi ), which compute information coming from Bi−1 and Bi+1 respectively. For these πBi−1 (Bi ), πBi+1 (Bi ), and πGj (Bi ): πGj (Bi ) ← c π(G) λBk (Gj ) (5) k=i πBi−1 (Bi ) ← c β(Bi−1 )λ(Bi−1 )π(Bi−1 ) 3 (6) and πBi+1 (Bi ) is analogous to πBi−1 (Bi ), with c and c being normalization constants. Finally for the G-nodes: Bel(Gi ) ← cλ(Gi )π(Gi ) λ(Gi ) ← (7) λBj (Gi ) (8) j m λBj (Gi ) ← λ(Bj )p(Bi |Gj )[α(Bj )β(Bj ) Bj p(Bi |Gk )πGk (Bi )] (9) k=i Gk The posteriors of the S-nodes are used to compute the π(Gi ). This posterior computes how well the S-node at each position explains the contour—that is, how well it accounts for the cues flowing from the E-nodes it is connected to. Each Delaunay connection between S- and E-nodes can be seen as a rib that sprouts from the skeleton. More specifically each rib sprouts in a direction that is normal (perpendicular) to the tangent of the contour at the E-node plus a random error φi chosen independently for each rib from a von Mises distribution centered on zero, i.e. φi ∼ V (0, κS ) with spread parameter κS [4]. The rib lengths are drawn from an exponential decreasing density function p(ρi ) ∝ e−λS ρi [4]. We can now express how well this node “explains” the three E-nodes it is connected to via the probability that this S-node deserves to be a skeletal node or not, p(S = true|E1 , E2 , E3 ) ∝ p(ρi )p(φi ) (10) i with S = true depicting that this S-node deserves to be a skeletal node. 
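As a concrete illustration of this scoring step, the sketch below evaluates the unnormalized skeletal posterior of eq. 10 for one candidate S-node given the three E-nodes it is connected to, using the exponential rib-length density and the von Mises angular density defined above. The function name, parameter values, and orientation conventions are illustrative assumptions, not part of the original model code.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function, the von Mises normalizer

def skeletal_score(s, e_pts, e_tangents, lam_S=0.1, kappa_S=4.0):
    """Unnormalized p(S = true | E1, E2, E3) as in eq. 10.

    s          : (2,)   candidate skeletal point (e.g. a Delaunay circumcenter)
    e_pts      : (3, 2) positions of the three connected E-nodes
    e_tangents : (3,)   contour tangent angle at each E-node (counterclockwise convention)
    """
    score = 1.0
    for p, t in zip(e_pts, e_tangents):
        rib = s - p                                  # rib between the contour point and the candidate
        rho = np.linalg.norm(rib)                    # rib length
        inward_normal = t + np.pi / 2                # assumed convention: interior lies left of the tangent
        phi = np.arctan2(rib[1], rib[0]) - inward_normal   # angular deviation of the rib from the normal
        p_rho = lam_S * np.exp(-lam_S * rho)         # exponentially decreasing rib-length density
        p_phi = np.exp(kappa_S * np.cos(phi)) / (2 * np.pi * i0(kappa_S))  # von Mises V(0, kappa_S)
        score *= p_rho * p_phi
    return score

# toy usage: three points on a unit circle with counterclockwise tangents;
# a candidate at the centre scores higher than an off-centre one
e_pts = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
e_tangents = np.array([np.pi, np.pi / 2, -np.pi / 2])
print(skeletal_score(np.array([0.0, 0.0]), e_pts, e_tangents))
print(skeletal_score(np.array([0.6, 0.6]), e_pts, e_tangents))
```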
From this we then compute the prior π(Gi ) in such a way that good (high posterior) skeletal nodes induce a high interiority bias, hence a stronger tendency to induce figural status. Conversely, bad (low posterior) skeletal nodes create a prior close to indifferent (uniform) and thus have less (or no) influence on figural status. Likelihood functions Finally we need to express the likelihood function necessary for the updating rules described above. The first two likelihood functions are part of p(Ei |Bi ), one for each of the local cues. The first one, reflecting local curvature, gives the probability of the orientations of the two vectors inherent to Ei (α1 and α2 ) given both direction of figure (θ) encoded in Bi as a von Mises density centered on θ, i.e. αi ∼ V (θ, κEB ). The second likelihood function, reflecting the presence of a T-junction, simply assumes a fixed likelihood when a T-junction is present—that is p(T-junction = true|Bi ) = θT , where Bi places the direction of figure in the direction of the occluder. This likelihood function is only in effect when a T-junction is present, replacing the curvature cue at that node. The third likelihood function serves to keep consistency between nodes of the first level. This function p(Bi |Bi−1 ) or p(Bi |Bi+1 ) is used to compute α(B) and β(B) and is defined 2x2 conditional probability matrix with a single free parameter, θBB (the probability that figural direction at both B-nodes are the same). A fourth and final likelihood function p(Bi |Gj ) serves to propagate information between level one and two. This likelihood function is 2x2 conditional probability matrix matrix with one free parameter, θBG . In this case θBG encodes the probability that the figural direction of the B-node is in the direction of the exterior or interior preference of the G-node. In total this brings us to six free parameters in the model: κS , λS , κEB , θT , θBB , and θBG . 2 Basic Simulations To evaluate the performance of the model, we first tested it on several basic stimulus configurations in which the desired outcome is intuitively clear: a convex shape, a concave shape, a pair of overlapping shapes, and a pair of non-overlapping shapes (Fig. 2,3). The convex shape is the simplest in that curvature never changes sign. The concave shape includes a region with oppositely signed curvature. (The shape is naturally described as predominantly positively curved with a region of negative curvature, i.e. a concavity. But note that it can also be interpreted as predominantly negatively curved “window” with a region of positive curvature, although this is not the intuitive interpretation.) 4 The overlapping pair of shapes consists of two convex shapes with one partly occluding the other, creating a competition between the two shapes for the ownership of the common borderline. Finally the non-overlapping shapes comprise two simple convex shapes that do not touch—again setting up a competition for ownership of the two inner boundaries (i.e. between each shape and the ground space between them). Fig. 2 shows the network structures for each of these four cases. Figure 2: Network structure for the four shape categories (left to right: convex, concave, overlapping, non-overlapping shapes). Blue depict the locations of the B-nodes (and also the E-nodes), the red connections are the connections between B-nodes, the green connections are connections between B-nodes and G-nodes, and the G-nodes (and also the S-nodes) go from orange to dark red. 
This colour code depicts low (orange) to high (dark red) probability that this is a skeletal node, and hence the strength of the interiority cue. Running our model with hand-estimated parameter values yields highly intuitive posteriors (Fig. 3), an essential “sanity check” to ensure that the network approximates human judgments in simple cases. For the convex shape the model assigns figure to the interior just as one would expect even based solely on local curvature (Fig. 3A). In the concave figure (Fig. 3B), estimated border ownership begins to reverse inside the deep concavity. This may seem surprising, but actually closely matches empirical results obtained when local border ownership is probed psychophysically inside a similarly deep concavity, i.e. a “negative part” in which f/g seems to partly reverse [8]. For the overlapping shapes posteriors were also intuitive, with the occluding shape interpreted as in front and owning the common border (Fig. 3C). Finally, for the two non-overlapping shapes the model computed border-ownership just as one would expect if each shape were run separately, with each shape treated as figural along its entire boundary (Fig. 3D). That is, even though there is skeletal structure in the ground-region between the two shapes (see Fig. 2D), its posterior is weak compared to the skeletal structure inside the shapes, which thus loses the competition to own the boundary between them. For all these configurations, the model not only converged to intuitive estimates but did so rapidly (Fig. 4), always in fewer cycles than would be expected by pure lateral propagation, niterations < Nnodes [18] (with these parameters, typically about five times faster). Figure 3: Posteriors after convergence for the four shape categories (left to right: convex, concave, overlapping, non-overlapping). Arrows indicate estimated border ownership, with direction pointing to the perceived figural side, and length proportional to the magnitude of the posterior. All four simulations used the same parameters. 5 Figure 4: Convergence of the model for the basic shape categories. The vertical lines represent the point of convergence for each of the three shape categories. The posterior change is calculated as |p(Bi = 1|I)t − p(Bi = 1|I)t−1 | at each iteration. 3 Comparison to human data Beyond the simple cases reviewed above, we wished to submit our network to a more fine-grained comparison with human data. To this end we compared its performance to that of human subjects in an experiment we conducted (to be presented in more detail in a future paper). Briefly, our experiment involved finding evidence for propagation of f/g signals across the image. Subjects were first shown a stimulus in which the f/g configuration was globally and locally unambiguous and consistent: a smaller rectangle partly occluding a larger one (Fig. 5A), meaning that the smaller (front) one owns the common border. Then this configuration was perturbed by adding two bars, of which one induced a local f/g reversal—making it now appear locally that the larger rectangle owned the border (Fig. 5B). (The other bar in the display does not alter f/g interpretation, but was included to control for the attentional affects of introducing a bar in the image.) The inducing bar creates T-junctions that serve as strong local f/g cues, in this case tending to reverse the prior global interpretation of the figure. 
We then measured subjective border ownership along the central contour at various distances from the inducing bar, and at different times after the onset of the bar (25ms, 100ms and 250ms). We measured border ownership locally using a method introduced in [8] in which a local motion probe is introduced at a point on the boundary between two color regions of different colors, and the subject is asked which color appeared to move. Because the figural side “owns” the border, the response reflects perceived figural status. The goal of the experiment was to actually measure the progression of the influence of the inducing T-junction as it (hypothetically) propagated along the boundary. Briefly, we found no evidence of temporal differences, meaning that f/g judgments were essentially constant over time, suggesting rapid convergence of local f/g assignment. (This is consistent with the very rapid convergence of our network, which would suggest a lack of measurable temporal differences except at much shorter time scales than we measured.) But we did find a progressive reduction of f/g reversal with increasing distance from the inducer—that is, the influence of the T-junction decayed with distance. Mean responses aggregated over subjects (shortest delay only) are shown in Fig. 6. In order to run our model on this stimulus (which has a much more complex structure than the simple figures tested above) we had to make some adjustments. We removed the bars from the edge map, leaving only the T-junctions as underlying cues. This was a necessary first step because our model is not yet able to cope with skeletons that are split up by occluders. (The larger rectangle’s skeleton has been split up by the lower bar.) In this way all contours except those created by the bars were used to create the network (Fig. 7). Given this network we ran the model using hand-picked parameters that 6 Figure 5: Stimuli used in the experiment. A. Initial stimulus with locally and globally consistent and unambiguous f/g. B. Subsequently bars were added of which one (the top bar in this case) created a local reversal of f/g. C. Positions at which local f/g judgments of subjects were probed. Figure 6: Results from our experiment aggregated for all 7 subjects (shortest delay only) are shown in red. The x-axis shows distance from the inducing bar at which f/g judgment was probed. The y-axis shows the proportion of trials on which subjects judged the smaller rectangle to own the boundary. As can be seen, the further from the T-junction, the lower the f/g reversal. The fitted model (green curve) shows very similar pattern. Horizontal black line indicates chance performance (ambiguous f/g). gave us the best possible qualitative similarity to the human data. The parameters used never entailed total elimination of the influence of any likelihood function (κS = 16, λS = .025, κEB = .5, θT = .9, θBB = .9, and θBG = .6). As can be seen in Fig. 6 the border-ownership estimates at the locations where we had data show compelling similarities to human judgments. Furthermore along the entire contour the model converged to intuitive border-ownership estimates (Fig. 7) very rapidly (within 36 iterations). The fact that our model yielded intuitive estimates for the current network in which not all contours were completed shows another strength of our model. Because our model included grouping nodes, it did not require contours to be amodally completed [6] in order for information to propagate. 
4 Conclusion In this paper we proposed a model rooted in Bayesian belief networks to compute figure/ground. The model uses both local and global cues, combined in a principled way, to achieve a stable and apparently psychologically reasonable estimate of border ownership. Local cues included local curvature and T-junctions, both well-established cues to f/g. Global cues included skeletal structure, 7 Figure 7: (left) Node structure for the experimental stimulus. (right) The model’s local borderownership estimates after convergence. a novel cue motivated by the idea that strongly axial shapes tend to be figural and thus own their boundaries. We successfully tested this model on both simple displays, in which it gave intuitive results, and on a more complex experimental stimulus, in which it gave a close match to the pattern of f/g propagation found in our subjects. Specifically, the model, like the human subjects rapidly converged to a stable local f/g interpretation. Our model’s structure shows several interesting parallels to properties of neural coding of border ownership in visual cortex. Some cortical cells (end-stopped cells) appear to code for local curvature [3] and T-junctions [5]. The B-nodes in our model could be seen as corresponding to cells that code for border ownership [20]. Furthermore, some authors [2] have suggested that recurrent feedback loops between border ownership cells in V2 and cells in V4 (corresponding to G-nodes in our model) play a role in the rapid computation of border ownership. The very rapid convergence we observed in our model likewise appears to be due to the connections between B-nodes and G-nodes. Finally scale-invariant shape representations (such as, speculatively, those based on skeletons) are thought to be present in higher cortical regions such as IT [17], which project down to earlier areas in ways that are not yet understood. A number of parallels to past models of f/g should be mentioned. Weiss [18] pioneered the application of belief networks to the f/g problem, though their network only considered a more restricted set of local cues and no global ones, such that information only propagated along the contour. Furthermore it has not been systematically compared to human judgments. Kogo et al. [10] proposed an exponential decay of f/g signals as they spread throughout the image. Our model has a similar decay for information going through the G-nodes, though it is also influenced by an angular factor defined by the position of the skeletal node. Like the model by Li Zhaoping [19], our model includes horizontal propagation between B-nodes, analogous to border-ownership cells in her model. A neurophysiological model by Craft et al. [2] defines grouping cells coding for an interiority preference that decays with the size of the receptive fields of these grouping cells. Our model takes this a step further by including shape (skeletal) structure as a factor in interiority estimates, rather than simply size of receptive fields (which is similar to the rib lengths in our model). Currently, our use of skeletons as shape representations is still limited to medial axis skeletons and surfaces that are not split up by occluders. Our future goals including integrating skeletons in a more robust way following the probabilistic account suggested by Feldman and Singh [4]. 
Eventually, we hope to fully integrate skeleton computation with f/g computation so that the more general problem of shape and surface estimation can be approached in a coherent and unified fashion. 8 References [1] P. Bahnsen. Eine untersuchung uber symmetrie und assymmetrie bei visuellen wahrnehmungen. Zeitschrift fur psychology, 108:129–154, 1928. [2] E. Craft, H. Sch¨ tze, E. Niebur, and R. von der Heydt. A neural model of figure-ground u organization. Journal of Neurophysiology, 97:4310–4326, 2007. [3] A. Dobbins, S. W. Zucker, and M. S. Cyander. Endstopping and curvature. Vision Research, 29:1371–1387, 1989. [4] J. Feldman and M. Singh. Bayesian estimation of the shape skeleton. Proceedings of the National Academy of Sciences, 103:18014–18019, 2006. [5] B. Heider, V. Meskenaite, and E. Peterhans. Anatomy and physiology of a neural mechanism defining depth order and contrast polarity at illusory contours. European Journal of Neuroscience, 12:4117–4130, 2000. [6] G. Kanizsa. Organization inVision. New York: Praeger, 1979. [7] G. Kanizsa and W. Gerbino. Vision and Artifact, chapter Convexity and symmetry in figureground organisation, pages 25–32. New York: Springer, 1976. [8] S. Kim and J. Feldman. Globally inconsistent figure/ground relations induced by a negative part. Journal of Vision, 9:1534–7362, 2009. [9] K. Koffka. Principles of Gestalt Psychology. Lund Humphries, London, 1935. [10] N. Kogo, C. Strecha, L. Van Gool, and J. Wagemans. Surface construction by a 2-d differentiation-integration process: a neurocomputational model for perceived border ownership, depth, and lightness in kanizsa figures. Psychological Review, 117:406–439, 2010. [11] B. Machielsen, M. Pauwels, and J. Wagemans. The role of vertical mirror-symmetry in visual shape detection. Journal of Vision, 9:1–11, 2009. [12] K. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: an empirical study. Proceedings of Uncertainty in AI, pages 467–475, 1999. [13] R. L. Ogniewicz and O. K¨ bler. Hierarchic Voronoi skeletons. Pattern Recognition, 28:343– u 359, 1995. [14] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. [15] M. A. Peterson and E. Skow. Inhibitory competition between shape properties in figureground perception. Journal of Experimental Psychology: Human Perception and Performance, 34:251–267, 2008. [16] K. A. Stevens and A. Brookes. The concave cusp as a determiner of figure-ground. Perception, 17:35–42, 1988. [17] K. Tanaka, H. Saito, Y. Fukada, and M. Moriya. Coding visual images of object in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66:170–189, 1991. [18] Y. Weiss. Interpreting images by propagating Bayesian beliefs. Adv. in Neural Information Processing Systems, 9:908915, 1997. [19] L. Zhaoping. Border ownership from intracortical interactions in visual area V2. Neuron, 47(1):143–153, Jul 2005. [20] H. Zhou, H. S. Friedman, and R. von der Heydt. Coding of border ownerschip in monkey visual cortex. The Journal of Neuroscience, 20:6594–6611, 2000. 9
3 0.61974394 103 nips-2010-Generating more realistic images using gated MRF's
Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton
Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1 Introduction and Prior Work The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF’s) provide a very general framework for modelling natural images. In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family1 could be learned for small image patches and their work was extended to high-resolution images by Roth and Black [6] who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and 1 Product of Student’s t models (without pooling) may not appear to have latent variables but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable. 1 checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 
2 in [7] which is one of the best models of highresolution images reported in the literature so far). In fact, as our experiments demonstrate the generated samples from these models are more similar to random images than to natural images! When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap. 2 Augmenting Gated MRF’s with Mean Hidden Units A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two-pixel have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: The posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: Modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. 
This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-proessed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the times nearby pixels are strongly correlated. Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering 2 Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can represent better structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, overall when each image can have non-zero mean. the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. 
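The toy example can be reproduced numerically. The sketch below draws two-pixel "images" from a mixture of two Gaussians, one tight and correlated (smooth images) and one elongated along the anti-diagonal (edges), once with zero component means as in a cRBM/PoT-style model and once with component means switched on as in mcRBM/mPoT. The specific covariances, mixing proportion, and mean distribution are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# two covariance "states": smooth (correlated) vs. edge (anti-correlated)
cov_smooth = np.array([[1.0, 0.95], [0.95, 1.0]])
cov_edge   = np.array([[1.0, -0.9], [-0.9, 1.0]])

def sample(n, with_mean_units):
    xs = []
    for _ in range(n):
        cov = cov_edge if rng.random() < 0.2 else cov_smooth      # latent covariance state
        mean = rng.normal(scale=2.0, size=2) if with_mean_units else np.zeros(2)
        xs.append(rng.multivariate_normal(mean, cov))
    return np.array(xs)

print(sample(5, with_mean_units=False))  # zero-mean mixture: cRBM/PoT-like behaviour
print(sample(5, with_mean_units=True))   # mean + covariance: mcRBM/mPoT-like behaviour
```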
mPoT (and similarly mcRBM [4]) is not only better at modelling zero mean images but it can also represent images that have non zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is: E^{PoT}(x, h^c) = \sum_i [ h^c_i (1 + \frac{1}{2}(C_i^T x)^2) + (1 - \gamma) \log h^c_i ] (1) where x is a vectorized image patch, h^c is a vector of Gamma “covariance” latent variables, C is a filter bank matrix and γ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to exp(−E^{PoT}(x, h^c)). Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to: \Sigma^c = (C \,\mathrm{diag}(h^c)\, C^T)^{-1} (2). In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. The overall energy function becomes: E^{mPoT}(x, h^c, h^m) = E^{PoT}(x, h^c) + E^m(x, h^m) (3)
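For concreteness, here is a small NumPy sketch of eqs. 1–2, evaluating the PoT energy and the covariance of the zero-mean conditional Gaussian over the pixels for a toy filter bank; the shapes and parameter values are ours and only meant to illustrate the formulas.

```python
import numpy as np

def pot_energy(x, h_c, C, gamma):
    """E^PoT(x, h^c) of eq. 1; C holds one filter per column (C_i = C[:, i])."""
    proj = C.T @ x                                    # filter responses C_i^T x
    return np.sum(h_c * (1.0 + 0.5 * proj**2) + (1.0 - gamma) * np.log(h_c))

def pot_conditional_cov(h_c, C):
    """Sigma^c = (C diag(h^c) C^T)^{-1} of eq. 2, the covariance of the zero-mean conditional over x."""
    return np.linalg.inv(C @ np.diag(h_c) @ C.T)

# toy usage: 4 pixels, 6 filters, positive Gamma-distributed covariance units
rng = np.random.default_rng(0)
C = rng.normal(size=(4, 6))
h_c = rng.gamma(shape=2.0, scale=1.0, size=6)
x = rng.normal(size=4)
print(pot_energy(x, h_c, C, gamma=1.5))
print(pot_conditional_cov(h_c, C))
```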
This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size. Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while 2 The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction. 3 The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables. 4 sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weightsharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation has not only high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT causing the learned filters to become “random” looking (see fig 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: p(y) = i S(yi ), where yi = Ci T x and S is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6]. Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, overall when the number of potentials is limited (see also fig. 2-F). TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. 
We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme. 4 Experiments We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistence Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is: F mPoT (x) = − log(p(x))+const. = k,i 1 1 γ log(1+ (Cik T xk )2 )+ xT x− 2 2 T log(1+exp(Wjk xk )) (6) k,j where the index k runs over spatial locations and xk is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps. We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance 5 Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoTTConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar correspond to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv. filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 
4), and 0.001 for the other parameters (γ of eq. 1). During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training4 . Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters have usually non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images 4 The code used in the experiments can be found at the first author’s web-page. 6 MODEL F (R) − F (T ) (104 ) KL(R G) KL(T G) KL(R G) − KL(T PoT - Conv 2.9 0.3 0.6 PoT - TConv 2.8 0.4 1.0 -0.6 mPoT - TConv 5.2 1.0 0.2 0.8 mHPoT - TConv 4.9 1.7 0.8 0.9 mcRBM - TConv 3.5 1.5 1.0 G) -0.3 0.5 Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters. that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images. 
4.1 Discriminative Experiments on Weight-Sharing Schemes In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global. 
To summarize, Local provides more compact representations than Conv and is much faster than Global, while achieving similar performance in discrimination. Also, Local can easily scale to larger images while Global cannot.

Figure 4: Experiments on MNIST using RBM's with different weight-sharing schemes. Left: error rate as a function of the dimensionality of the latent representation. Right: error rate as a function of the number of operations (normalized to the number needed to perform one epoch with the Global model); all models have a twice-overcomplete latent representation.

Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM's are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then "fine-tuned" on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000-dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. As in fig. 3-iii, the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features; indeed, careful injection of sparsity and long training times seem necessary [31] for these RBM's. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2 F). The model trained with the Conv weight-sharing scheme yields an accuracy of 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour, while training with the Global weight-sharing scheme required more than a day.

5 Conclusions and Future Work

This work is motivated by the poor generative quality of currently popular MRF models of natural images, which generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and 2) the use of a less constrained weight-sharing scheme. By augmenting these models with an extra set of latent variables that model mean intensity, we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high-frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF's into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer-range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs.

References
[1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005.
[2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[3] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.
[5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In NIPS, 2000.
[6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.
[7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on MRFs in low-level vision. In CVPR, 2010.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI, 6:721–741, 1984.
[9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of Student-t distributions. In NIPS, 2003.
[10] S.C. Zhu and D. Mumford. Prior learning and Gibbs reaction-diffusion. PAMI, pages 1236–1250, 1997.
[11] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006.
[12] S. Osindero and G.E. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In NIPS, 2008.
[13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009.
[14] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[15] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003.
[16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007.
[17] S. Roth and M.J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009.
[18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
[19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010.
[20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
[21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-of-experts. In BMVC, 2009.
[22] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Dept. of Computer Science, Univ. of Toronto, 2009.
[23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37:328–339, 1989.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[26] R.M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
[27] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/
[29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005.
[30] http://yann.lecun.com/exdb/mnist/
[31] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008.
4 0.58996564 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.
5 0.58599067 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2-F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2-G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines.
6 0.57019174 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
7 0.56644297 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
8 0.56599265 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
9 0.56405109 99 nips-2010-Gated Softmax Classification
10 0.56200969 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
11 0.54916769 245 nips-2010-Space-Variant Single-Image Blind Deconvolution for Removing Camera Shake
12 0.52625275 101 nips-2010-Gaussian sampling by local perturbations
13 0.50078326 267 nips-2010-The Multidimensional Wisdom of Crowds
14 0.49672547 17 nips-2010-A biologically plausible network for the computation of orientation dominance
15 0.47635862 256 nips-2010-Structural epitome: a way to summarize one’s visual experience
16 0.46829957 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
17 0.4531126 234 nips-2010-Segmentation as Maximum-Weight Independent Set
18 0.45291939 118 nips-2010-Implicit Differentiation by Perturbation
19 0.44246837 120 nips-2010-Improvements to the Sequence Memoizer
20 0.44144231 149 nips-2010-Learning To Count Objects in Images
topicId topicWeight
[(13, 0.026), (17, 0.056), (27, 0.076), (30, 0.046), (35, 0.042), (45, 0.271), (50, 0.085), (52, 0.033), (60, 0.025), (77, 0.065), (88, 0.182), (90, 0.032)]
simIndex simValue paperId paperTitle
1 0.93029821 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
Author: Ziv Bar-joseph, Hai-son P. Le
Abstract: Recent studies compare gene expression data across species to identify core and species specific genes in biological systems. To perform such comparisons researchers need to match genes across species. This is a challenging task since the correct matches (orthologs) are not known for most genes. Previous work in this area used deterministic matchings or reduced multidimensional expression data to binary representation. Here we develop a new method that can utilize soft matches (given as priors) to infer both, unique and similar expression patterns across species and a matching for the genes in both species. Our method uses a Dirichlet process mixture model which includes a latent data matching variable. We present learning and inference algorithms based on variational methods for this model. Applying our method to immune response data we show that it can accurately identify common and unique response patterns by improving the matchings between human and mouse genes.
2 0.90089279 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.
same-paper 3 0.90080857 224 nips-2010-Regularized estimation of image statistics by Score Matching
Author: Diederik P. Kingma, Yann L. Cun
Abstract: Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
4 0.84663016 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model.
5 0.84347469 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem.
6 0.84345269 103 nips-2010-Generating more realistic images using gated MRF's
7 0.84227866 96 nips-2010-Fractionally Predictive Spiking Neurons
8 0.83956844 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
9 0.83844984 158 nips-2010-Learning via Gaussian Herding
10 0.83821899 257 nips-2010-Structured Determinantal Point Processes
11 0.83676046 155 nips-2010-Learning the context of a category
12 0.83611435 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
13 0.83535475 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
14 0.83533973 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
15 0.83470821 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
16 0.83461529 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
17 0.83452684 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
18 0.83411735 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
19 0.83370912 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
20 0.83364409 172 nips-2010-Multi-Stage Dantzig Selector