nips nips2004 nips2004-25 knowledge-graph by maker-knowledge-mining

25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images


Source: pdf

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 uk Abstract In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. [sent-7, score-0.183]

2 GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. [sent-8, score-0.208]

3 We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. [sent-9, score-0.094]

4 We demonstrate the efficacy of the approach on both synthetic and image data. [sent-10, score-0.154]

5 A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. [sent-14, score-0.105]

6 1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). [sent-16, score-0.869]

7 GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen. [sent-17, score-0.142]

8 4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis. [sent-18, score-1.048]

9 7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control. [sent-19, score-0.105]

10 16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure. [sent-21, score-0.092]

11 One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. [sent-23, score-0.984]

12 Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. [sent-25, score-0.151]

13 In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. [sent-26, score-0.106]

14 Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. [sent-27, score-0.906]

15 We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. [sent-28, score-0.238]

16 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . [sent-30, score-1.017]

17 The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. [sent-37, score-0.125]

18 The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. [sent-51, score-1.703]

19 In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. [sent-54, score-0.179]

20 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . [sent-92, score-1.824]

21 B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. [sent-93, score-0.203]

22 For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. [sent-94, score-0.101]

23 C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. [sent-95, score-0.931]

24 C Distribution of estimate of the mixer variable vα . [sent-96, score-0.869]

25 Note that mixer variable values are by definition positive. [sent-97, score-0.869]

26 E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . [sent-99, score-0.184]

27 generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. [sent-100, score-0.823]

28 Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. [sent-101, score-0.849]

29 This poses a critical assignment problem of working out which filter responses share which mixer variables. [sent-102, score-1.037]

30 We first study this issue using synthetic data in which two groups of filter responses l1 . [sent-103, score-0.179]

31 l40 are generated by two mixer variables vα and vβ (figure 1). [sent-109, score-0.869]

32 Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. [sent-111, score-1.022]

33 For assignments including more filter responses, the estimates become good. [sent-114, score-0.104]

34 However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). [sent-115, score-0.2]

35 In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 . [sent-116, score-0.134]

36 2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. [sent-125, score-0.909]

37 The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = . [sent-126, score-0.863]

38 B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . [sent-128, score-0.23]

39 C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. [sent-129, score-0.875]

40 Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. [sent-131, score-1.095]

41 Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). [sent-134, score-0.204]

42 We have observed qualitatively similar statistics for estimation based on coefficients in natural images. [sent-135, score-0.094]

43 26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. [sent-137, score-0.293]

44 Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. [sent-138, score-0.096]

45 Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. [sent-139, score-0.166]

46 Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. [sent-140, score-0.157]

47 To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . [sent-142, score-1.2]

48 We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . [sent-146, score-0.99]

49 First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. [sent-152, score-0.121]

50 Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . [sent-155, score-0.115]

51 Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. [sent-156, score-2.509]

52 We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). [sent-157, score-1.048]

53 The true generative model specifies probabilistic overlap of 3 mixer variables. [sent-158, score-0.86]

54 In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. [sent-163, score-0.951]

55 In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). [sent-164, score-0.138]

56 The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. [sent-165, score-0.895]

57 In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. [sent-166, score-0.879]

58 Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. [sent-167, score-1.072]

59 3 Image data Having validated the inference model using synthetic data, we turned to natural images. [sent-169, score-0.148]

60 We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. [sent-170, score-0.133]

61 The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. [sent-171, score-0.232]

62 We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = . [sent-172, score-0.137]

63 Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. [sent-174, score-0.886]

64 Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). [sent-176, score-1.082]

65 The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. [sent-177, score-0.936]

66 For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). [sent-178, score-0.27]

67 The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. [sent-179, score-0.175]

68 This can also be seen by averaging the 200 image patches with the maximum log likelihood. [sent-180, score-0.171]

69 Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). [sent-181, score-0.949]

70 Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0. [sent-183, score-0.172]

71 15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. [sent-186, score-0.982]

72 The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). [sent-187, score-0.869]

73 B First two image ensemble neighborhoods obtained from Gibbs sampling. [sent-192, score-0.179]

74 D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. [sent-195, score-0.846]

75 From the obtained associations, we estimated the mixer and Gaussian variables according to our model. [sent-201, score-0.869]

76 The learned distributions of Gaussian and mixer variables are quite close to our assumptions. [sent-203, score-0.891]

77 The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. [sent-204, score-1.055]

78 We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. [sent-205, score-0.395]

79 In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). [sent-206, score-1.152]

80 As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. [sent-207, score-0.507]

81 Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. [sent-208, score-0.15]

82 Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). [sent-210, score-0.986]

83 Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). [sent-211, score-0.117]

84 not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. [sent-212, score-0.196]

85 This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). [sent-213, score-0.202]

86 In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. [sent-214, score-0.268]

87 This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). [sent-215, score-0.254]

88 We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). [sent-221, score-1.103]

89 We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. [sent-222, score-0.206]

90 We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. [sent-223, score-0.165]

91 The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). [sent-225, score-1.031]

92 Scale mixtures of Gaussians and the statistics of natural images. [sent-236, score-0.118]

93 Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. [sent-247, score-0.122]

94 Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. [sent-251, score-0.126]

95 Image compression via joint statistical characterization in the wavelet domain. [sent-254, score-0.129]

96 Relations between the statistics of natural images and the response properties of cortical cells. [sent-260, score-0.215]

97 Bayesian wavelet domain image modeling using hidden Markov trees. [sent-285, score-0.166]

98 A parametric texture model based on joint statistics of complex wavelet coefficients. [sent-295, score-0.182]

99 Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. [sent-345, score-0.128]

100 Image denoising using a scale mixture of Gaussians in the wavelet domain. [sent-349, score-0.128]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mixer', 0.823), ('gsm', 0.217), ('lter', 0.204), ('neighborhood', 0.113), ('responses', 0.11), ('gibbs', 0.103), ('patches', 0.086), ('image', 0.085), ('lters', 0.082), ('wavelet', 0.081), ('filter', 0.072), ('associations', 0.072), ('synthetic', 0.069), ('neighborhoods', 0.068), ('portilla', 0.064), ('quadrature', 0.058), ('gaussian', 0.057), ('assignment', 0.056), ('statistics', 0.053), ('estimates', 0.052), ('assignments', 0.052), ('stripes', 0.051), ('joint', 0.048), ('coordinated', 0.047), ('variable', 0.046), ('patch', 0.046), ('variables', 0.046), ('cortical', 0.045), ('bowtie', 0.044), ('gsms', 0.044), ('hhmi', 0.044), ('rayleigh', 0.044), ('zebra', 0.044), ('phase', 0.044), ('solla', 0.044), ('wainwright', 0.044), ('int', 0.042), ('natural', 0.041), ('orientations', 0.041), ('eg', 0.04), ('response', 0.04), ('horizontal', 0.039), ('inference', 0.038), ('hyv', 0.037), ('fit', 0.037), ('pij', 0.037), ('generative', 0.037), ('spatial', 0.037), ('li', 0.037), ('images', 0.036), ('proc', 0.036), ('coef', 0.035), ('hierarchy', 0.035), ('actual', 0.034), ('conditional', 0.033), ('multiply', 0.032), ('yielded', 0.031), ('cells', 0.031), ('receptive', 0.031), ('bowties', 0.029), ('conf', 0.029), ('divisively', 0.029), ('engle', 0.029), ('mixers', 0.029), ('odelia', 0.029), ('salk', 0.029), ('trans', 0.029), ('coordination', 0.028), ('simoncelli', 0.028), ('editors', 0.027), ('ensemble', 0.026), ('share', 0.026), ('association', 0.026), ('filters', 0.026), ('bubbles', 0.026), ('hoyer', 0.026), ('karklin', 0.026), ('ruderman', 0.026), ('strela', 0.026), ('mixture', 0.026), ('mixtures', 0.024), ('cients', 0.023), ('estimations', 0.023), ('gn', 0.023), ('schwartz', 0.023), ('vertical', 0.023), ('gure', 0.023), ('distributions', 0.022), ('critical', 0.022), ('posterior', 0.022), ('phases', 0.022), ('gi', 0.021), ('denoising', 0.021), ('jolla', 0.021), ('column', 0.02), ('hierarchical', 0.02), ('volume', 0.02), ('multiplicative', 0.02), ('rinen', 0.02), ('wavelets', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

2 0.091062754 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

Author: Hyun J. Park, Te W. Lee

Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 In trod u ction Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence mostly due to a much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithm for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eq.(1,2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution. X = Au (1) P (u ) = ∏ P (u i ) (2) But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. ICA learns only linear dependency between pixels by finding strongly correlated linear axis. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. The focus of these efforts has centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used Gaussian Scale Mixture (GSM) to model variance dependency in wavelet domain. This model can learn variance correlation in source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance related distribution to model the variance correlated source prior. This model can learn grouping of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product of expert model where each expert represents a variance correlated group [7]. The product form of the model enables applications to image denoising. But these models don’t reveal higher-order structures explicitly. Our model is motivated by Lewicki and Karklin who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2 nd-stage is a linear generative model where another source v generates logarithmic variance for the 1st stage (Eq. (4)) [8]. This model captures variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of source signal u and to apply the model for image processing problems. ( P (u | λ ) = c exp − u / λ q ) (3) log[λ ] = Bv (4) We propose a hierarchical model based on ICA and a mixture of Laplacian distribution. Our model can be considered as a simplification of model in [8] by constraining v to be 0/1 random vector where only one element can be 1. Our model is computationally simpler but still can capture variance dependency. Experiments show that our model can reveal higher order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications on image segmentation and image denoising. Our model provides an improved statistic model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher order structures. 2 Modeling nonlinear dependencies We propose a hierarchical or 2-stage model where the 1 st stage is an ICA source signal model and the 2nd stage is modeled by a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context dependent variance correlation, we use mixture models where Laplacian distributions with different variance represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent ICA source signal u. Laplacian distributions have 0-mean but different variances. The advantage of Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution. But we need more than two Gaussian distributions to do the same thing. Also conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections. Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch) 2.1 Mixture of Laplacian Distribution We define a PDF for mixture of M-dimensional Laplacian Distribution as Eq.(5), where N is the number of data samples, and K is the number of mixtures. N N K M N K r r r P(U | Λ, Π) = ∏ P(u n | Λ, Π) = ∏∑ π k P(u n | λk ) = ∏∑ π k ∏ n n k n k m 1 (2λ ) k ,m  u n,m exp −  λk , m      (5) r r r r r un = (un,1 , un , 2 , , , un,M ) : n-th data sample, U = (u1 , u 2 , , , ui , , , u N ) r r r r r λk = (λk ,1 , λk , 2 ,..., λk ,M ) : Variance of k-th Laplacian distribution, Λ = (λ1 , λ2 , , , λk , , , λK ) πk : probability of Laplacian distribution k, Π = (π 1 , , , π K ) and ∑ k πk =1 It is not easy to maximize Eq.(5) directly, and we use EM (expectation maximization) algorithm for parameter estimation. Here we introduce a new hidden context variable Z that represents which Laplacian k, is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of data and Z as Eq.(6), n zk K   N r  (π )zkn   1  ⋅ exp − z n u n ,m   P(U , Z | Λ, Π ) = ∏ P(u n , Z | Λ, Π ) = ∏ ∏ k ∏      k   k λk , m n n m   2λk ,m        N               (6) n z k : Hidden binary random variable, 1 if n-th data sample is generated from k-th n Laplacian, 0 other wise. ( Z = (z kn ) and ∑ z k = 1 for all n = 1…N) k 2.2 EM algorithm for learning the mixture model The EM algorithm maximizes the log likelihood of data averaged over hidden variable Z. The log likelihood and its expectation can be computed as in Eq.(7,8).   u 1 n n log P(U , Z | Λ, Π ) = ∑  z k log(π k ) + ∑ z k  log( ) − n ,m  2λk ,m λk , m n ,k  m       (7)   u 1 n E {log P (U , Z | Λ, Π )} = ∑ E z k log(π k ) + ∑  log( ) − n ,m  2λ k , m λk , m n ,k m    { }     (8) The expectation in Eq.(8) can be evaluated, if we are given the data U and estimated parameters Λ and Π. For Λ and Π, EM algorithm uses current estimation Λ’ and Π’. { } { } ∑ z P( z n n E z k ≡ E zk | U , Λ' , Π ' = 1 n z k =0 n k n k n | u n , Λ' , Π ' ) = P( z k = 1 | u n , Λ' , Π ' ) (9) = n n P (u n | z k = 1, Λ' , Π ' ) P( z k = 1 | Λ ' , Π ' ) P(u n | Λ' , Π ' ) = M u n ,m 1 1 1 ∏ 2λ ' exp(− λ ' ) ⋅ π k ' = c P (u n | Λ ' , Π ' ) m k ,m k ,m n M πk ' ∏ 2λ m k ,m ' exp(− u n ,m λk , m ' ) Where the normalization constant can be computed as K K M k k =1 m =1 n cn = P (u n | Λ ' , Π ' ) = ∑ P (u n | z k , Λ ' , Π ' ) P ( z kn | Λ ' , Π ' ) = ∑ π k ∏ 1 (2λ ) exp( − k ,m u n ,m λk ,m ) (10) The EM algorithm works by maximizing Eq.(8), given the expectation computed from Eq.(9,10). Eq.(9,10) can be computed using Λ’ and Π’ estimated in the previous iteration of EM algorithm. This is E-step of EM algorithm. Then in M-step of EM algorithm, we need to maximize Eq.(8) over parameter Λ and Π. First, we can maximize Eq.(8) with respect to Λ, by setting the derivative as 0.  1 u n,m  ∂E{log P (U , Z | Λ, Π )} n  = 0 = ∑ E z k  − +  λ k , m (λ k , m ) 2   ∂λ k ,m  n   { } ⇒ λ k ,m ∑ E{z }⋅ u = ∑ E{z } n k n ,m n (11) n k n Second, for maximization of Eq.(8) with respect to Π, we can rewrite Eq.(8) as below. n (12) E {log P (U , Z | Λ , Π )} = C + ∑ E {z k ' }log(π k ' ) n ,k ' As we see, the derivative of Eq.(12) with respect to Π cannot be 0. Instead, we need to use Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq.(14) where ρ is a Lagrange multiplier. { } (13) n L (Π , ρ ) = − ∑ E z k ' log(π k ' ) + ρ (∑ π k ' − 1) n,k ' k' By setting the derivative of Eq.(13) to be 0 with respect to ρ and Π, we can simply get the maximization solution with respect to Π. We just show the solution in Eq.(14). ∂L(Π, ρ ) ∂L(Π, ρ ) =0 = 0, ∂Π ∂ρ  n   n  ⇒ π k =  ∑ E z k  /  ∑∑ E z k     k n  n { } { } (14) Then the EM algorithm can be summarized as figure 2. For the convergence criteria, we can use the expectation of log likelihood, which can be calculated from Eq. (8). πk = { } , λk , m = E um + e (e is small random noise) 2. Calculate the Expectation by 1. Initialize 1 K u n ,m 1 M πk ' ∏ 2λ ' exp( − λ ' ) cn m k ,m k ,m 3. Maximize the log likelihood given the Expectation { } { } n n E z k ≡ E zk | U , Λ' , Π ' =     λk ,m ←  ∑ E {z kn }⋅ u n,m  /  ∑ E {z kn } ,     π k ←  ∑ E {z kn } /  ∑∑ E {z kn }   n   n   k n  4. If (converged) stop, otherwise repeat from step 2.  n Figure 2: Outline of EM algorithm for Learning the Mixture Model 3 Experimental Results Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide visualization of learned variances that reveal the structure of variance correlation and an application to image denoising. 3.1 Learning Nonlinear Dependencies in Natural images As shown in figure 1, the 1 st stage of the proposed model is simply the linear ICA. The ICA matrix A and W(=A-1) are learned by the FastICA algorithm [9]. We sampled 105(=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. ICA input dimension is 256, and source dimension is set to be 160(=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals. In the 2 nd stage the number of mixtures is set to 16, 64, or 256(=K). Training by the EM algorithm is fast and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC). For the visualization of learned variance, we adapted the visualization method from [8]. Each dimension of ICA source signal corresponds to an ICA basis (columns of A) and each ICA basis is localized in both image and frequency space. Then for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as figure 3. (a1) (a2) (b1) (b2) Figure 3: Visualization of learned variances (a1 and a2 visualize variance of Laplacian #4 and b1 and 2 show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where left side of the patch is clean but right side is filled with randomly oriented edges.) A key idea of our model is that we can mix up independent distributions to get nonlinearly dependent distribution. This modeling power can be shown by figure 4. Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from learned mixture model, and (c) is from learned Laplacian model. In (a), variance of u2 is smaller than u1 at center area (arrow A), but almost equal to u1 at outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by mixture model in (b), but not in (c).) 3.2 Unsupervised Image Segmentation The idea behind our model is that the image can be modeled as mixture of different variance correlated “contexts”. We show how the learned model can be used to classify different context by an unsupervised image segmentation task. Given learned model and data, we can compute the expectation of a hidden variable Z from Eq. (9). Then for an image patch, we can select a Laplacian distribution with highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables abstract partitioning of images and we can visualize organization of images more clearly (figure 5). Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.) 3.3 Application to Image Restoration The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. An advantage is in the MAP (maximum a posterior) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq.(15). Then, we can compute MAP estimation of ICA source signal u by Eq.(16) and reconstruct the original image. (15) X = Au + n (16) ˆ u = argmax log P (u | X , A) = argmax (log P ( X | u , A) + log P (u ) ) u u Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or a mixture of Laplacian distribution. The mixture distribution can be approximated by a maximum explaining Laplacian. We evaluated 3 different methods for image restoration including ICA MAP estimation with simple Laplacian prior, same with Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels. As shown MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10]. Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81) 16 ICA MAP (Mixture prior) ICA MAP (Laplacian prior) W iener 14 0.8 SSIM Index SNR 12 10 8 6 0.6 0.4 0.2 4 2 ICA MAP(Mixture prior) ICA MAP(Laplacian prior) W iener Noisy Image 1 0 0.5 1 1.5 Noise variance 2 2.5 0 0 0.5 1 1.5 Noise variance 2 2.5 Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0) 4 D i s c u s s i on We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance correlated signals grouped as different mixtures and learn highlevel structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which are similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces. The learned mixture model can be used to discover hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image. The segmentation results show regularity across image space and strong correlation with high-level concepts. Finally, we showed applications of the model for image restoration. We compare the performance with the conventional ICA MAP estimation and Wiener filter. Our results suggest that the proposed model outperforms other traditional methods. It is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to lean more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising. References [1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997. [2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation,Neural Computation, 11(7):1739-1768, 1999. [3] T. Lee, M. Lewicki, and T. Sejnowski., ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation. PAMI, 22(10), October 2000. [4] J. Portilla, V. Strela, M. J. Wainwright and E. P Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. On Image Processing, Vol.12, No. 11, 1338-1351, 2003. [5] A. Hyvarinen, P. O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neurocomputing, 1999. [6] A. Hyvarinen, P.O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001. [7] M. Welling and G. E. Hinton, S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002. [8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499. [9] A.Hyvarinen, P.O. Hoyer, Fast ICA matlab code., http://www.cis.hut.fi/projects/compneuro/extensions.html/ [10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.

3 0.088636704 99 nips-2004-Learning Hyper-Features for Visual Identification

Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik

Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1

4 0.059446886 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

Author: Erik G. Learned-miller, Parvez Ahammad

Abstract: The correction of bias in magnetic resonance images is an important problem in medical image processing. Most previous approaches have used a maximum likelihood method to increase the likelihood of the pixels in a single image by adaptively estimating a correction to the unknown image bias field. The pixel likelihoods are defined either in terms of a pre-existing tissue model, or non-parametrically in terms of the image’s own pixel values. In both cases, the specific location of a pixel in the image is not used to calculate the likelihoods. We suggest a new approach in which we simultaneously eliminate the bias from a set of images of the same anatomy, but from different patients. We use the statistics from the same location across different images, rather than within an image, to eliminate bias fields from all of the images simultaneously. The method builds a “multi-resolution” non-parametric tissue model conditioned on image location while eliminating the bias fields associated with the original image set. We present experiments on both synthetic and real MR data sets, and present comparisons with other methods. 1

5 0.058893066 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

6 0.057278372 68 nips-2004-Face Detection --- Efficient and Rank Deficient

7 0.056669269 192 nips-2004-The power of feature clustering: An application to object detection

8 0.054083258 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

9 0.051242672 160 nips-2004-Seeing through water

10 0.049862839 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

11 0.049463026 50 nips-2004-Dependent Gaussian Processes

12 0.049042817 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

13 0.046875659 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

14 0.045783639 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters

15 0.045029629 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

16 0.044438824 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

17 0.043751359 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.043279067 40 nips-2004-Common-Frame Model for Object Recognition

19 0.040956728 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

20 0.039421007 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, -0.012), (2, -0.053), (3, -0.107), (4, 0.032), (5, 0.002), (6, 0.054), (7, -0.005), (8, -0.037), (9, -0.002), (10, 0.039), (11, 0.003), (12, 0.052), (13, -0.066), (14, -0.002), (15, 0.032), (16, 0.025), (17, 0.073), (18, 0.074), (19, -0.073), (20, 0.05), (21, -0.01), (22, -0.046), (23, -0.065), (24, 0.013), (25, 0.009), (26, 0.023), (27, -0.026), (28, -0.051), (29, -0.008), (30, 0.026), (31, 0.017), (32, -0.035), (33, -0.06), (34, 0.085), (35, 0.097), (36, -0.132), (37, -0.042), (38, -0.013), (39, 0.065), (40, 0.021), (41, 0.039), (42, -0.035), (43, 0.06), (44, 0.081), (45, -0.009), (46, -0.09), (47, -0.019), (48, -0.182), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92577076 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

2 0.60744709 99 nips-2004-Learning Hyper-Features for Visual Identification

Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik

Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1

3 0.51566541 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

Author: Matthias O. Franz, Bernhard Schölkopf

Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images. Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e. phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16×16 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements. 1 Modeling pixel interactions with Wiener functionals For our analysis, we adopt a prediction framework: Given a d × d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested to which extent interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours xi , i = 1, . . . , m = d2 − 1 can be modeled as a series y = H0 [x] + H1 [x] + H2 [x] + · · · + Hn [x] + · · · (1) of discrete Volterra functionals H0 [x] = h0 = const. and Hn [x] = m i1 =1 ··· m in =1 (n) hi1 ...in xi1 . . . xin . (2) Here, we have stacked the grayvalues of the neighbourhood into the vector x = (x1 , . . . , xm ) ∈ Rm . The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x with mn coefficients (n) hi1 ...in . Volterra functionals provide a controlled way of introducing multiplicative interactions of image points since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels. Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally leads to additional lower-order interactions. As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. If one wants to estimate the zeroeth-order component of an image (i.e., the constant component created without pixel interactions) the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by decomposing it into a new series y = G0 [x] + G1 [x] + · · · + Gn [x] + · · · of functionals Gn [x] that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals1 Gn [x] are linear combinations of Volterra functionals up to order n. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5]. So far, we have not gained anything compared to classical estimation of higher-order moments or spectra: an nth-order Volterra functional contains the same number of terms as 1 Strictly speaking, the term Wiener functional is reserved for orthogonal Volterra functionals with respect to Gaussian input. Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions. the corresponding n + 1-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients as it consists also of lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals. 2 Estimating Wiener series by regression in RKHS Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. We can interpret the evaluation of this functional for a given input x as a map φn defined for n = 0, 1, 2, . . . as φ0 (x) = 1 and φn (x) = (xn , xn−1 x2 , . . . , x1 xn−1 , xn , . . . , xn ) 1 2 m 1 2 (3) n such that φn maps the input x ∈ Rm into a vector φn (x) ∈ Fn = Rm containing all mn ordered monomials of degree n. Using φn , we can write the nth-order Volterra functional in Eq. (2) as a scalar product in Fn , Hn [x] = ηn φn (x), (n) (4) (n) (n) with the coefficients stacked into the vector ηn = (h1,1,..1 , h1,2,..1 , h1,3,..1 , . . . ) ∈ Fn . The same idea can be applied to the entire pth-order Volterra series. By stacking the maps φn into a single map φ(p) (x) = (φ0 (x), φ1 (x), . . . , φp (x)) , one obtains a mapping from p+1 2 p Rm into F(p) = R × Rm × Rm × . . . Rm = RM with dimensionality M = 1−m . The 1−m entire pth-order Volterra series can be written as a scalar product in F(p) p n=0 Hn [x] = (η (p) ) φ(p) (x) (5) with η (p) ∈ F(p) . Below, we will show how we can express η (p) as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate. This procedure works because the space Fn of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in Fn can be computed by evaluating a positive definite kernel function kn (x1 , x2 ). For monomials, one can easily show that (e.g., [6]) φn (x1 ) φn (x2 ) = (x1 x2 )n =: kn (x1 , x2 ). (6) Since F(p) is generated as a direct sum of the single spaces Fn , the associated scalar product is simply the sum of the scalar products in the Fn : φ(p) (x1 ) φ(p) (x2 ) = p n=0 (x1 x2 )n = k (p) (x1 , x2 ). (7) Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in a RKHS2 . Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so called representer theorem (e.g., [6]). It states the following: suppose we are given N observations 2 A similar approach has been taken by [1] using the inhomogeneous polynomial kernel = (1 + x1 x2 )p . This kernel implies a map φinh into the same space of monomials, but it weights the degrees of the monomials differently as can be seen by applying the binomial theorem. (p) kinh (x1 , x2 ) (x1 , y1 ), . . . , (xN , yN ) of the function (1) and an arbitrary cost function c, Ω is a nondecreasing function on R>0 and . F is the norm of the RKHS associated with the kernel k. If we minimize an objective function c((x1 , y1 , f (x1 )), . . . , (xN , yN , f (xN ))) + Ω( f F ), (8) 3 over all functions in the RKHS, then an optimal solution can be expressed as N f (x) = j=1 aj k(x, xj ), aj ∈ R. (9) In other words, although we optimized over the entire RKHS including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations xj only. Hence the optimization problem over the extremely large number of coefficients η (p) in Eq. (5) is transformed into one over N variables aj . Let us consider the special case where the cost function is the mean squared error, N 1 c((x1 , y1 , f (x1 )), . . . , (xN , yN , f (xN ))) = N j=1 (f (xj ) − yj )2 , and the regularizer Ω is zero4 . The solution for a = (a1 , . . . , aN ) is readily computed by setting the derivative of (8) with respect to the vector a equal to zero; it takes the form a = K −1 y with the Gram matrix defined as Kij = k(xi , xj ), hence5 y = f (x) = a z(x) = y K −1 z(x), (10) N where z(x) = (k(x, x1 ), k(x, x2 ), . . . k(x, xN )) ∈ R . Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property6 . From Eqn. (10), we obtain the following expressions for the implicit Wiener series p p 1 −1 G0 [x] = y 1, Hn [x] = y Kp z(p) (x) (11) Gn [x] = n=0 n=0 N (p) where the Gram matrix Kp and the coefficient vector z (x) are computed using the kernel from Eq. (7) and 1 = (1, 1, . . . ) ∈ RN . Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the “curse of dimensionality”, i.e., there is no need to compute the possibly large number of coefficients explicitly. The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by −1 Hn [x] = y Kp zn (x) n n (12) n with zn (x) = ((x1 x) , (x2 x) , . . . , (xN x) ) . For p = 0 the only term is the constant zero-order Volterra functional H0 [x] = G0 [x]. The coefficient vector ηn = (n) (n) (n) (h1,1,...1 , h1,2,...1 , h1,3,...1 , . . . ) of the explicit Volterra functional is obtained as −1 η n = Φ n Kp y 3 (13) for conditions on uniqueness of the solution, see [6] Note that this is different from the regularized approach used by [1]. If Ω is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input. 5 If K is not invertible, K −1 denotes the pseudo-inverse of K. 6 assuming symmetrized Volterra kernels which can be obtained from any Volterra expanson. 4 using the design matrix Φn = (φn (x1 ) , φn (x1 ) , . . . , φn (x1 ) ) . The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solution for the kernels k (n) (x1 , x2 ) and k (n−1) (x1 , x2 ). The Wiener functional for n > 0 is then obtained from the difference of the two results as Gn [x] = n i=0 Gi [x] − n−1 i=0 Gi [x] = y −1 −1 Kn z(n) (x) − Kn−1 z(n−1) (x) . (14) The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. (12) and (13) [3]. Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following Theorem 1 The functionals obtained from Eq. (14) fulfill the orthogonality condition E [m(x)Gp [x]] = 0 (15) where E denotes the expectation over the input distribution and m(x) an arbitrary ithorder monomial with i < p. We will show that this a consequence of the least squares fit of any linear expansion in a set M of basis functions of the form y = j=1 γj ϕj (x). In the case of the Wiener and Volterra expansions, the basis functions ϕj (x) are monomials of the components of x. M We denote the error of the expansion as e(x) = y − j=1 γj ϕj (xi ). The minimum of the expected quadratic loss L with respect to the expansion coefficient γk is given by ∂ ∂L = E e(x) ∂γk ∂γk 2 = −2E [ϕk (x)e(x)] = 0. (16) This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is given by the ∞ sum of the higher-order Wiener functionals e(x) = n=p Gn [x], so Gp [x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies ∞ n=p ∞ E [ϕk (x)Gn [x]] = 0 and n=p+1 E [ϕk (x)Gn [x]] = 0 (17) for any φk of order less than p. The difference of both equations yields E [ϕk (x)Gp [x]] = 0, so that Gp [x] must be orthogonal to any of the lower order basis functions, namely to all monomials with order smaller than p. 3 Experiments Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 × 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 × 5 neighbourhood, we predicted the central pixel using (11). original image 0th-order component/ reconstruction 1st-order reconstruction 1st-order component 2nd-order reconstruction 2nd-order component 3rd-order reconstruction mse = 583.7 mse = 0.006 mse = 0 mse = 1317 mse = 37.4 mse = 0.001 mse = 1845 mse = 334.9 3rd-order component mse = 19.0 Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings. Models of orders 0 . . . 3 were learned from the image by extracting the maximal training set of 76 × 106 patches of size 5 × 57 . The corresponding image components of order 0 to 3 were computed according to (14). Note the different components generated by the Wiener functionals can also be negative. In Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test image shown in the corresponding row in Fig. 1. Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings which cannot be predicted to a higher accuracy than one pixel. Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 × 110 is depicted in Fig. 2). Our 7 In contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand. original image 0th-order component/ reconstruction 1st-order reconstruction mse = 1070 1st-order component 2nd-order reconstruction mse = 957.4 2nd-order component 3rd-order reconstruction mse = 414.6 3rd-order component 4th-order reconstruction mse = 98.5 4th-order component 5th-order reconstruction mse = 18.5 5th-order component 6th-order reconstruction mse = 4.98 6th-order component 7th-order reconstruction mse = 1.32 7th-order component 8th-order reconstruction mse = 0.41 8th-order component Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order. results on a set of 10 natural images of size 50 × 70 show an an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models. 4 Conclusion The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to the classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners. There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other 700 Figure 3: Mean square reconstruction error of 600 models of different order for a set of 10 natural images. mse 500 400 300 200 100 0 0 1 2 3 4 5 6 7 model order hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created. In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most of the currently used low-level feature detectors such as edge or corner detectors maximally use third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g. [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions: 1. A representative image database has to be analyzed. The images must be carefully calibrated since nonlinear statistics can be highly calibrationsensitive. In addition, the contribution of image noise has to be investigated. 2. Currently, only images up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. 11. To accomodate for larger images, our method has to be reformulated in an iterative algorithm. 3. So far, we only considered 5 × 5-patches. To systematically investigate patch size effects, the analysis has to be conducted in a multi-scale framework. References [1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-Rom Proc. 2002 IFAC World Congress, 2002. [2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559 – 601, 1994. [3] M. O. Franz and B. Sch¨lkopf. Implicit Wiener series. Technical Report 114, Max-Plancko Institut f¨r biologische Kybernetik, T¨bingen, June 2003. u u [4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993. [5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989. [6] B. Sch¨lkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002. o [7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neurosc., 4(8):819 – 825, 2001. [8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt.Soc. Am. A, 16(7):1549 – 1553, 1999. [9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Compt. Neural Syst., 12:271 – 287, 2001.

4 0.49896187 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

Author: Hyun J. Park, Te W. Lee

Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 In trod u ction Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence mostly due to a much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithm for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eq.(1,2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution. X = Au (1) P (u ) = ∏ P (u i ) (2) But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. ICA learns only linear dependency between pixels by finding strongly correlated linear axis. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. The focus of these efforts has centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used Gaussian Scale Mixture (GSM) to model variance dependency in wavelet domain. This model can learn variance correlation in source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance related distribution to model the variance correlated source prior. This model can learn grouping of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product of expert model where each expert represents a variance correlated group [7]. The product form of the model enables applications to image denoising. But these models don’t reveal higher-order structures explicitly. Our model is motivated by Lewicki and Karklin who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2 nd-stage is a linear generative model where another source v generates logarithmic variance for the 1st stage (Eq. (4)) [8]. This model captures variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of source signal u and to apply the model for image processing problems. ( P (u | λ ) = c exp − u / λ q ) (3) log[λ ] = Bv (4) We propose a hierarchical model based on ICA and a mixture of Laplacian distribution. Our model can be considered as a simplification of model in [8] by constraining v to be 0/1 random vector where only one element can be 1. Our model is computationally simpler but still can capture variance dependency. Experiments show that our model can reveal higher order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications on image segmentation and image denoising. Our model provides an improved statistic model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher order structures. 2 Modeling nonlinear dependencies We propose a hierarchical or 2-stage model where the 1 st stage is an ICA source signal model and the 2nd stage is modeled by a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context dependent variance correlation, we use mixture models where Laplacian distributions with different variance represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent ICA source signal u. Laplacian distributions have 0-mean but different variances. The advantage of Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution. But we need more than two Gaussian distributions to do the same thing. Also conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections. Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch) 2.1 Mixture of Laplacian Distribution We define a PDF for mixture of M-dimensional Laplacian Distribution as Eq.(5), where N is the number of data samples, and K is the number of mixtures. N N K M N K r r r P(U | Λ, Π) = ∏ P(u n | Λ, Π) = ∏∑ π k P(u n | λk ) = ∏∑ π k ∏ n n k n k m 1 (2λ ) k ,m  u n,m exp −  λk , m      (5) r r r r r un = (un,1 , un , 2 , , , un,M ) : n-th data sample, U = (u1 , u 2 , , , ui , , , u N ) r r r r r λk = (λk ,1 , λk , 2 ,..., λk ,M ) : Variance of k-th Laplacian distribution, Λ = (λ1 , λ2 , , , λk , , , λK ) πk : probability of Laplacian distribution k, Π = (π 1 , , , π K ) and ∑ k πk =1 It is not easy to maximize Eq.(5) directly, and we use EM (expectation maximization) algorithm for parameter estimation. Here we introduce a new hidden context variable Z that represents which Laplacian k, is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of data and Z as Eq.(6), n zk K   N r  (π )zkn   1  ⋅ exp − z n u n ,m   P(U , Z | Λ, Π ) = ∏ P(u n , Z | Λ, Π ) = ∏ ∏ k ∏      k   k λk , m n n m   2λk ,m        N               (6) n z k : Hidden binary random variable, 1 if n-th data sample is generated from k-th n Laplacian, 0 other wise. ( Z = (z kn ) and ∑ z k = 1 for all n = 1…N) k 2.2 EM algorithm for learning the mixture model The EM algorithm maximizes the log likelihood of data averaged over hidden variable Z. The log likelihood and its expectation can be computed as in Eq.(7,8).   u 1 n n log P(U , Z | Λ, Π ) = ∑  z k log(π k ) + ∑ z k  log( ) − n ,m  2λk ,m λk , m n ,k  m       (7)   u 1 n E {log P (U , Z | Λ, Π )} = ∑ E z k log(π k ) + ∑  log( ) − n ,m  2λ k , m λk , m n ,k m    { }     (8) The expectation in Eq.(8) can be evaluated, if we are given the data U and estimated parameters Λ and Π. For Λ and Π, EM algorithm uses current estimation Λ’ and Π’. { } { } ∑ z P( z n n E z k ≡ E zk | U , Λ' , Π ' = 1 n z k =0 n k n k n | u n , Λ' , Π ' ) = P( z k = 1 | u n , Λ' , Π ' ) (9) = n n P (u n | z k = 1, Λ' , Π ' ) P( z k = 1 | Λ ' , Π ' ) P(u n | Λ' , Π ' ) = M u n ,m 1 1 1 ∏ 2λ ' exp(− λ ' ) ⋅ π k ' = c P (u n | Λ ' , Π ' ) m k ,m k ,m n M πk ' ∏ 2λ m k ,m ' exp(− u n ,m λk , m ' ) Where the normalization constant can be computed as K K M k k =1 m =1 n cn = P (u n | Λ ' , Π ' ) = ∑ P (u n | z k , Λ ' , Π ' ) P ( z kn | Λ ' , Π ' ) = ∑ π k ∏ 1 (2λ ) exp( − k ,m u n ,m λk ,m ) (10) The EM algorithm works by maximizing Eq.(8), given the expectation computed from Eq.(9,10). Eq.(9,10) can be computed using Λ’ and Π’ estimated in the previous iteration of EM algorithm. This is E-step of EM algorithm. Then in M-step of EM algorithm, we need to maximize Eq.(8) over parameter Λ and Π. First, we can maximize Eq.(8) with respect to Λ, by setting the derivative as 0.  1 u n,m  ∂E{log P (U , Z | Λ, Π )} n  = 0 = ∑ E z k  − +  λ k , m (λ k , m ) 2   ∂λ k ,m  n   { } ⇒ λ k ,m ∑ E{z }⋅ u = ∑ E{z } n k n ,m n (11) n k n Second, for maximization of Eq.(8) with respect to Π, we can rewrite Eq.(8) as below. n (12) E {log P (U , Z | Λ , Π )} = C + ∑ E {z k ' }log(π k ' ) n ,k ' As we see, the derivative of Eq.(12) with respect to Π cannot be 0. Instead, we need to use Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq.(14) where ρ is a Lagrange multiplier. { } (13) n L (Π , ρ ) = − ∑ E z k ' log(π k ' ) + ρ (∑ π k ' − 1) n,k ' k' By setting the derivative of Eq.(13) to be 0 with respect to ρ and Π, we can simply get the maximization solution with respect to Π. We just show the solution in Eq.(14). ∂L(Π, ρ ) ∂L(Π, ρ ) =0 = 0, ∂Π ∂ρ  n   n  ⇒ π k =  ∑ E z k  /  ∑∑ E z k     k n  n { } { } (14) Then the EM algorithm can be summarized as figure 2. For the convergence criteria, we can use the expectation of log likelihood, which can be calculated from Eq. (8). πk = { } , λk , m = E um + e (e is small random noise) 2. Calculate the Expectation by 1. Initialize 1 K u n ,m 1 M πk ' ∏ 2λ ' exp( − λ ' ) cn m k ,m k ,m 3. Maximize the log likelihood given the Expectation { } { } n n E z k ≡ E zk | U , Λ' , Π ' =     λk ,m ←  ∑ E {z kn }⋅ u n,m  /  ∑ E {z kn } ,     π k ←  ∑ E {z kn } /  ∑∑ E {z kn }   n   n   k n  4. If (converged) stop, otherwise repeat from step 2.  n Figure 2: Outline of EM algorithm for Learning the Mixture Model 3 Experimental Results Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide visualization of learned variances that reveal the structure of variance correlation and an application to image denoising. 3.1 Learning Nonlinear Dependencies in Natural images As shown in figure 1, the 1 st stage of the proposed model is simply the linear ICA. The ICA matrix A and W(=A-1) are learned by the FastICA algorithm [9]. We sampled 105(=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. ICA input dimension is 256, and source dimension is set to be 160(=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals. In the 2 nd stage the number of mixtures is set to 16, 64, or 256(=K). Training by the EM algorithm is fast and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC). For the visualization of learned variance, we adapted the visualization method from [8]. Each dimension of ICA source signal corresponds to an ICA basis (columns of A) and each ICA basis is localized in both image and frequency space. Then for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as figure 3. (a1) (a2) (b1) (b2) Figure 3: Visualization of learned variances (a1 and a2 visualize variance of Laplacian #4 and b1 and 2 show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where left side of the patch is clean but right side is filled with randomly oriented edges.) A key idea of our model is that we can mix up independent distributions to get nonlinearly dependent distribution. This modeling power can be shown by figure 4. Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from learned mixture model, and (c) is from learned Laplacian model. In (a), variance of u2 is smaller than u1 at center area (arrow A), but almost equal to u1 at outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by mixture model in (b), but not in (c).) 3.2 Unsupervised Image Segmentation The idea behind our model is that the image can be modeled as mixture of different variance correlated “contexts”. We show how the learned model can be used to classify different context by an unsupervised image segmentation task. Given learned model and data, we can compute the expectation of a hidden variable Z from Eq. (9). Then for an image patch, we can select a Laplacian distribution with highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables abstract partitioning of images and we can visualize organization of images more clearly (figure 5). Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.) 3.3 Application to Image Restoration The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. An advantage is in the MAP (maximum a posterior) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq.(15). Then, we can compute MAP estimation of ICA source signal u by Eq.(16) and reconstruct the original image. (15) X = Au + n (16) ˆ u = argmax log P (u | X , A) = argmax (log P ( X | u , A) + log P (u ) ) u u Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or a mixture of Laplacian distribution. The mixture distribution can be approximated by a maximum explaining Laplacian. We evaluated 3 different methods for image restoration including ICA MAP estimation with simple Laplacian prior, same with Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels. As shown MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10]. Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81) 16 ICA MAP (Mixture prior) ICA MAP (Laplacian prior) W iener 14 0.8 SSIM Index SNR 12 10 8 6 0.6 0.4 0.2 4 2 ICA MAP(Mixture prior) ICA MAP(Laplacian prior) W iener Noisy Image 1 0 0.5 1 1.5 Noise variance 2 2.5 0 0 0.5 1 1.5 Noise variance 2 2.5 Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0) 4 D i s c u s s i on We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance correlated signals grouped as different mixtures and learn highlevel structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which are similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces. The learned mixture model can be used to discover hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image. The segmentation results show regularity across image space and strong correlation with high-level concepts. Finally, we showed applications of the model for image restoration. We compare the performance with the conventional ICA MAP estimation and Wiener filter. Our results suggest that the proposed model outperforms other traditional methods. It is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to lean more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising. References [1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997. [2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation,Neural Computation, 11(7):1739-1768, 1999. [3] T. Lee, M. Lewicki, and T. Sejnowski., ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation. PAMI, 22(10), October 2000. [4] J. Portilla, V. Strela, M. J. Wainwright and E. P Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. On Image Processing, Vol.12, No. 11, 1338-1351, 2003. [5] A. Hyvarinen, P. O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neurocomputing, 1999. [6] A. Hyvarinen, P.O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001. [7] M. Welling and G. E. Hinton, S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002. [8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499. [9] A.Hyvarinen, P.O. Hoyer, Fast ICA matlab code., http://www.cis.hut.fi/projects/compneuro/extensions.html/ [10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.

5 0.49871272 160 nips-2004-Seeing through water

Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai

Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1

6 0.48497033 192 nips-2004-The power of feature clustering: An application to object detection

7 0.47635964 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

8 0.44822082 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

9 0.4457849 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

10 0.42647275 68 nips-2004-Face Detection --- Efficient and Rank Deficient

11 0.41791704 158 nips-2004-Sampling Methods for Unsupervised Learning

12 0.40214646 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

13 0.3941136 44 nips-2004-Conditional Random Fields for Object Recognition

14 0.37012473 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

15 0.36552817 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

16 0.35430974 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

17 0.34389365 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

18 0.32997742 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

19 0.32914782 50 nips-2004-Dependent Gaussian Processes

20 0.32798299 109 nips-2004-Mass Meta-analysis in Talairach Space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.239), (13, 0.117), (15, 0.123), (26, 0.06), (31, 0.022), (33, 0.164), (35, 0.035), (39, 0.018), (50, 0.025), (76, 0.025), (87, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89221489 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

2 0.87470812 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

same-paper 3 0.817415 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

4 0.80759883 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

5 0.75231761 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

6 0.71831167 145 nips-2004-Parametric Embedding for Class Visualization

7 0.71498907 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

8 0.70939904 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

9 0.70613027 200 nips-2004-Using Random Forests in the Structured Language Model

10 0.70412415 131 nips-2004-Non-Local Manifold Tangent Learning

11 0.70261061 54 nips-2004-Distributed Information Regularization on Graphs

12 0.70023113 17 nips-2004-Adaptive Manifold Learning

13 0.69963759 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

14 0.69937378 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

15 0.69810998 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

16 0.69767821 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

17 0.69659549 102 nips-2004-Learning first-order Markov models for control

18 0.6954118 100 nips-2004-Learning Preferences for Multiclass Problems

19 0.69296515 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

20 0.69252247 178 nips-2004-Support Vector Classification with Input Data Uncertainty