nips2004-81: Implicit Wiener Series for Higher-Order Image Analysis
Source: pdf
Author: Matthias O. Franz, Bernhard Schölkopf
Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images.

Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: from the intensity of a single point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e., phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16 x 16 image contains roughly 10^12 coefficients, about 10^10 of which would have to be estimated independently if all symmetries in the spectrum are taken into account. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9].

Here, we propose an alternative approach that models the interactions of image points by a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements.

1 Modeling pixel interactions with Wiener functionals

For our analysis, we adopt a prediction framework: given a d x d neighbourhood of an image pixel, we want to predict its grey value from the grey values of its neighbours. We are particularly interested in the extent to which interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours x_i, i = 1, ..., m = d^2 - 1, can be modeled as a series

    y = H_0[x] + H_1[x] + H_2[x] + \cdots + H_n[x] + \cdots    (1)

of discrete Volterra functionals

    H_0[x] = h_0 = \text{const.} \quad\text{and}\quad H_n[x] = \sum_{i_1=1}^{m} \cdots \sum_{i_n=1}^{m} h^{(n)}_{i_1 \ldots i_n}\, x_{i_1} \cdots x_{i_n}.    (2)

Here, we have stacked the grey values of the neighbourhood into the vector x = (x_1, ..., x_m)^T in R^m. The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x, with m^n coefficients h^{(n)}_{i_1 ... i_n}. Volterra functionals provide a controlled way of introducing multiplicative interactions of image points, since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used: an nth-order Volterra series leads to dependencies between at most n + 1 pixels.

Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally also produces lower-order interactions. As a result, the output of the functional contains components that are proportional to those of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a non-zero mean [5]. If one wants to estimate the zeroth-order component of an image (i.e., the constant component created without pixel interactions), the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by decomposing the series into a new series y = G_0[x] + G_1[x] + ... + G_n[x] + ... of functionals G_n[x] that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals(1) G_n[x] are linear combinations of Volterra functionals up to order n. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5].

(1) Strictly speaking, the term Wiener functional is reserved for Volterra functionals orthogonalized with respect to Gaussian input. Here, the term is used for orthogonalized Volterra functionals with arbitrary input distributions.

So far, we have not gained anything compared to the classical estimation of higher-order moments or spectra: an nth-order Volterra functional contains the same number of terms as the corresponding (n+1)th-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients since it also contains lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals.
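To make the size of the explicit representation concrete, the following minimal sketch (our own illustration, not part of the paper; all function names are ours) builds the ordered monomial vector underlying Eq. (2) for a small neighbourhood and evaluates a Volterra functional directly. The point is the m^n growth of the explicit representation, which the implicit, kernel-based estimation introduced in the next section avoids.

```python
# Illustrative sketch (not from the paper): the explicit n-th order monomial vector
# of Eq. (2) and a direct evaluation of H_n[x]; note the m**n size of the vector.
import itertools
import numpy as np

def monomials(x, n):
    """All ordered n-th order monomials x_{i1}*...*x_{in}, as a vector of length m**n."""
    if n == 0:
        return np.array([1.0])
    m = len(x)
    return np.array([np.prod([x[i] for i in idx])
                     for idx in itertools.product(range(m), repeat=n)])

def volterra_functional(x, h_n):
    """H_n[x] = sum over i1..in of h^(n)_{i1..in} x_{i1}...x_{in}, with h_n of shape (m,)*n."""
    n = h_n.ndim
    return float(h_n.ravel() @ monomials(x, n))   # both sides use the same C-order indexing

m = 24                                            # 5x5 neighbourhood minus the centre pixel
x = np.random.randn(m)
h2 = np.random.randn(m, m)                        # a random second-order Volterra kernel
print(volterra_functional(x, h2), x @ h2 @ x)     # the two evaluations agree
print(m ** 2, m ** 5)                             # 576 vs. roughly 8 million coefficients
```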
2 Estimating Wiener series by regression in RKHS

Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. We can interpret the evaluation of this functional for a given input x as a map phi_n, defined for n = 0, 1, 2, ... as

    \phi_0(x) = 1 \quad\text{and}\quad \phi_n(x) = \left(x_1^n,\, x_1^{n-1}x_2,\, \ldots,\, x_1 x_m^{n-1},\, x_2^n,\, \ldots,\, x_m^n\right),    (3)

such that phi_n maps the input x in R^m into a vector phi_n(x) in F_n = R^{m^n} containing all m^n ordered monomials of degree n. Using phi_n, we can write the nth-order Volterra functional in Eq. (2) as a scalar product in F_n,

    H_n[x] = \eta_n^\top \phi_n(x),    (4)

with the coefficients stacked into the vector eta_n = (h^{(n)}_{1,1,...,1}, h^{(n)}_{1,2,...,1}, h^{(n)}_{1,3,...,1}, ...)^T in F_n. The same idea can be applied to the entire pth-order Volterra series. By stacking the maps phi_n into a single map phi^{(p)}(x) = (phi_0(x), phi_1(x), ..., phi_p(x))^T, one obtains a mapping from R^m into F^{(p)} = R x R^m x R^{m^2} x ... x R^{m^p} = R^M with dimensionality M = (1 - m^{p+1})/(1 - m). The entire pth-order Volterra series can be written as a scalar product in F^{(p)},

    \sum_{n=0}^{p} H_n[x] = \left(\eta^{(p)}\right)^\top \phi^{(p)}(x),    (5)

with eta^{(p)} in F^{(p)}. Below, we will show how we can express eta^{(p)} as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate.

This procedure works because the space F_n of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in F_n can be computed by evaluating a positive definite kernel function k_n(x_1, x_2). For monomials, one can easily show that (e.g., [6])

    \phi_n(x_1)^\top \phi_n(x_2) = (x_1^\top x_2)^n =: k_n(x_1, x_2).    (6)

Since F^{(p)} is the direct sum of the single spaces F_n, the associated scalar product is simply the sum of the scalar products in the F_n:

    \phi^{(p)}(x_1)^\top \phi^{(p)}(x_2) = \sum_{n=0}^{p} (x_1^\top x_2)^n = k^{(p)}(x_1, x_2).    (7)

Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in an RKHS. (A similar approach has been taken by [1] using the inhomogeneous polynomial kernel k^{(p)}_{inh}(x_1, x_2) = (1 + x_1^T x_2)^p. This kernel implies a map phi_inh into the same space of monomials, but it weights the degrees of the monomials differently, as can be seen by applying the binomial theorem.)

Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so-called representer theorem (e.g., [6]). It states the following: suppose we are given N observations (x_1, y_1), ..., (x_N, y_N) of the function (1), an arbitrary cost function c, a nondecreasing function Omega on R_{>0}, and the norm ||.||_F of the RKHS associated with the kernel k. If we minimize an objective function

    c\big((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))\big) + \Omega(\|f\|_F)    (8)

over all functions in the RKHS (for conditions on the uniqueness of the solution, see [6]), then an optimal solution can be expressed as

    f(x) = \sum_{j=1}^{N} a_j\, k(x, x_j), \qquad a_j \in \mathbb{R}.    (9)

In other words, although we optimized over the entire RKHS, including functions defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations x_j only. Hence, the optimization problem over the extremely large number of coefficients eta^{(p)} in Eq. (5) is transformed into one over N variables a_j.

Let us consider the special case where the cost function is the mean squared error, c((x_1, y_1, f(x_1)), ..., (x_N, y_N, f(x_N))) = (1/N) \sum_{j=1}^{N} (f(x_j) - y_j)^2, and the regularizer Omega is zero. (Note that this differs from the regularized approach used by [1]: if Omega is not zero, the resulting Volterra series differ from the Wiener series, since they are not orthogonal with respect to the input.) The solution for a = (a_1, ..., a_N)^T is readily computed by setting the derivative of (8) with respect to the vector a equal to zero; it takes the form a = K^{-1} y with the Gram matrix defined as K_{ij} = k(x_i, x_j), where K^{-1} denotes the pseudo-inverse if K is not invertible. Hence,

    y = f(x) = a^\top z(x) = y^\top K^{-1} z(x),    (10)

where z(x) = (k(x, x_1), k(x, x_2), ..., k(x, x_N))^T in R^N.
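As a concrete illustration of Eqs. (7) and (10), the following minimal sketch performs the unregularized kernel regression with the Volterra kernel k^(p). It is our own code, not the authors'; the names volterra_kernel and fit_predict, and the toy data, are assumptions made for the example.

```python
# Minimal sketch of Eqs. (7) and (10): unregularized regression with the Volterra
# kernel k^(p)(x1, x2) = sum_{n=0}^{p} (x1 . x2)^n.
import numpy as np

def volterra_kernel(X1, X2, p):
    """Gram matrix of k^(p) between the rows of X1 and the rows of X2 (Eq. 7)."""
    dots = X1 @ X2.T                              # pairwise scalar products x1^T x2
    return sum(dots ** n for n in range(p + 1))   # sum of the monomial kernels k_n

def fit_predict(X, y, X_eval, p):
    """a = K^{-1} y and f(x) = a^T z(x) = y^T K^{-1} z(x) for each row of X_eval (Eq. 10)."""
    K = volterra_kernel(X, X, p)
    a = np.linalg.pinv(K) @ y                     # pseudo-inverse in case K is singular
    Z = volterra_kernel(X, X_eval, p)             # column i is z(x_eval_i)
    return a @ Z

# toy usage: N patches with m = 24 surrounding pixels, second-order model
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 24)), rng.standard_normal(200)
print(fit_predict(X, y, X[:3], p=2))              # reproduces y[:3] when K is invertible
```

Since training and evaluation points coincide and no regularizer is used, the prediction interpolates the training targets whenever the Gram matrix is invertible, which is the situation exploited in the experiments below.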
Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series, since no other Volterra series has this property (assuming symmetrized Volterra kernels, which can be obtained from any Volterra expansion). From Eq. (10), we obtain the following expressions for the implicit Wiener series:

    G_0[x] = \frac{1}{N}\, y^\top \mathbf{1}, \qquad \sum_{n=0}^{p} G_n[x] = \sum_{n=0}^{p} H_n[x] = y^\top K_p^{-1} z^{(p)}(x),    (11)

where the Gram matrix K_p and the coefficient vector z^{(p)}(x) are computed using the kernel from Eq. (7), and 1 = (1, 1, ...)^T in R^N. Note that the Wiener series is represented only implicitly, since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the "curse of dimensionality", i.e., there is no need to compute the possibly large number of coefficients explicitly.

The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by

    H_n[x] = y^\top K_p^{-1} z_n(x),    (12)

with z_n(x) = ((x_1^T x)^n, (x_2^T x)^n, ..., (x_N^T x)^n)^T. For p = 0, the only term is the constant zero-order Volterra functional H_0[x] = G_0[x]. The coefficient vector eta_n = (h^{(n)}_{1,1,...,1}, h^{(n)}_{1,2,...,1}, h^{(n)}_{1,3,...,1}, ...)^T of the explicit Volterra functional is obtained as

    \eta_n = \Phi_n K_p^{-1} y,    (13)

using the design matrix Phi_n = (phi_n(x_1), phi_n(x_2), ..., phi_n(x_N)), which collects the monomial feature vectors of the training points.

The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solutions for the kernels k^{(n)}(x_1, x_2) and k^{(n-1)}(x_1, x_2). The Wiener functional for n > 0 is then obtained as the difference of the two results,

    G_n[x] = \sum_{i=0}^{n} G_i[x] - \sum_{i=0}^{n-1} G_i[x] = y^\top \left( K_n^{-1} z^{(n)}(x) - K_{n-1}^{-1} z^{(n-1)}(x) \right).    (14)

The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqs. (12) and (13) [3].
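A short sketch of how Eqs. (11) and (14) translate into code is given below. It reuses the hypothetical volterra_kernel helper from the previous sketch; again the naming is ours, not the paper's.

```python
# Sketch of Eqs. (11) and (14): implicit Volterra sums of consecutive degree and
# their difference, which isolates the n-th degree Wiener functional.
import numpy as np

def implicit_volterra_sum(X, y, X_eval, p):
    """sum_{n=0}^{p} G_n[x] = y^T K_p^{-1} z^(p)(x) for each row of X_eval (Eq. 11)."""
    K_p = volterra_kernel(X, X, p)
    Z_p = volterra_kernel(X, X_eval, p)          # column i is z^(p)(x_eval_i)
    return y @ np.linalg.pinv(K_p) @ Z_p

def wiener_functional(X, y, X_eval, n):
    """G_n[x]: the mean of y for n = 0 (Eq. 11), otherwise the difference in Eq. (14)."""
    if n == 0:
        return np.full(len(X_eval), y.mean())
    return (implicit_volterra_sum(X, y, X_eval, n)
            - implicit_volterra_sum(X, y, X_eval, n - 1))

# usage: decompose the prediction at the training points into interaction orders 0..3
# components = [wiener_functional(X, y, X, n) for n in range(4)]
```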
Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition, which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following

Theorem 1. The functionals obtained from Eq. (14) fulfill the orthogonality condition

    E[m(x)\, G_p[x]] = 0,    (15)

where E denotes the expectation over the input distribution and m(x) an arbitrary ith-order monomial with i < p.

We will show that this is a consequence of the least-squares fit of any linear expansion in a set of basis functions of the form y = \sum_{j=1}^{M} \gamma_j \varphi_j(x). In the case of the Wiener and Volterra expansions, the basis functions varphi_j(x) are monomials of the components of x. We denote the error of the expansion as e(x) = y - \sum_{j=1}^{M} \gamma_j \varphi_j(x). The minimum of the expected quadratic loss L with respect to the expansion coefficient gamma_k is given by

    \frac{\partial L}{\partial \gamma_k} = \frac{\partial}{\partial \gamma_k} E\left[e(x)^2\right] = -2\, E[\varphi_k(x)\, e(x)] = 0.    (16)

This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p - 1. The approximation error is given by the sum of the higher-order Wiener functionals, e(x) = \sum_{n=p}^{\infty} G_n[x], so G_p[x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies

    \sum_{n=p}^{\infty} E[\varphi_k(x)\, G_n[x]] = 0 \quad\text{and}\quad \sum_{n=p+1}^{\infty} E[\varphi_k(x)\, G_n[x]] = 0    (17)

for any varphi_k of order less than p. The difference of the two equations yields E[varphi_k(x) G_p[x]] = 0, so that G_p[x] must be orthogonal to any of the lower-order basis functions, namely to all monomials of order smaller than p.

3 Experiments

Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 x 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 x 5 neighbourhood, we predicted the central pixel using Eq. (11).

[Figure 1 panels, for three toy images: the original image, the 0th-order component/reconstruction, and the 1st- to 3rd-order components and reconstructions, each reconstruction labeled with its mse (values ranging from 1845 down to 0).]

Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0th-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true, since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings.

Models of orders 0 to 3 were learned from the image by extracting the maximal training set of 76 x 106 patches of size 5 x 5. (In contrast to the usual setting in machine learning, training and test set are identical in our case, since we are not interested in generalization to other images but in analyzing the higher-order components of the image at hand.) The corresponding image components of order 0 to 3 were computed according to Eq. (14). Note that the different components generated by the Wiener functionals can also be negative; in Fig. 1, they are scaled to the grey values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image, leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test images shown in the corresponding rows of Fig. 1.
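To make the experimental procedure concrete, here is a rough sketch of the patch extraction and order-wise reconstruction described above. It is our own code, not the authors'; it reuses the hypothetical wiener_functional helper from the earlier sketch, and the data-handling details are assumptions rather than the paper's implementation.

```python
# Rough sketch of the experimental setup: extract all d x d patches of a grey-value
# image, predict each central pixel from its d*d-1 neighbours, and accumulate the
# Wiener components into order-n reconstructions of the image interior.
import numpy as np

def patch_dataset(img, d=5):
    """X: rows of d*d-1 neighbour grey values; y: the corresponding central pixels."""
    r, centre = d // 2, (d * d) // 2
    X, y = [], []
    for i in range(r, img.shape[0] - r):
        for j in range(r, img.shape[1] - r):
            patch = img[i - r:i + r + 1, j - r:j + r + 1].ravel()
            X.append(np.delete(patch, centre))
            y.append(patch[centre])
    return np.array(X), np.array(y)

def order_reconstructions(img, max_order=3, d=5):
    """Order-0..max_order reconstructions of the image interior and their mse values."""
    X, y = patch_dataset(img, d)
    components = [wiener_functional(X, y, X, n) for n in range(max_order + 1)]
    recons = np.cumsum(components, axis=0)       # order-n reconstruction = G_0 + ... + G_n
    return recons, [float(np.mean((rec - y) ** 2)) for rec in recons]
```

Note that the N x N Gram matrices grow with the number of patches, which is what limits the analyzable image size mentioned in the conclusion.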
Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings, which cannot be predicted to a higher accuracy than one pixel.

Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 x 110 is depicted in Fig. 2).

[Figure 2 panels: the original image, the 0th-order component/reconstruction, and the 1st- to 8th-order components and reconstructions with mse = 1070, 957.4, 414.6, 98.5, 18.5, 4.98, 1.32 and 0.41, respectively.]

Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order.

Our results on a set of 10 natural images of size 50 x 70 show an approximately exponential decay of the reconstruction error as more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost zero. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models.

[Figure 3: Mean square reconstruction error of models of different order for a set of 10 natural images (y-axis: mse, 0 to 700; x-axis: model order, 0 to 7).]

4 Conclusion

The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase in the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners.

There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created.
In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of a preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most of the currently used low-level feature detectors, such as edge or corner detectors, use at most third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g., [2, 7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions.

Before any further conclusions can be drawn, however, our study needs to be extended in several directions: 1. A representative image database has to be analyzed. The images must be carefully calibrated, since nonlinear statistics can be highly calibration-sensitive. In addition, the contribution of image noise has to be investigated. 2. Currently, only images of up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. (11). To accommodate larger images, our method has to be reformulated as an iterative algorithm. 3. So far, we have only considered 5 x 5 patches. To systematically investigate patch-size effects, the analysis has to be conducted in a multi-scale framework.

References
[1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-ROM Proc. 2002 IFAC World Congress, 2002.
[2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559-601, 1994.
[3] M. O. Franz and B. Schölkopf. Implicit Wiener series. Technical Report 114, Max-Planck-Institut für biologische Kybernetik, Tübingen, June 2003.
[4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989.
[6] B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819-825, 2001.
[8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt. Soc. Am. A, 16(7):1549-1553, 1999.
[9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Comput. Neural Syst., 12:271-287, 2001.
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. [sent-5, score-0.234]
2 We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. [sent-6, score-0.331]
3 Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. [sent-7, score-0.377]
4 First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images. [sent-8, score-0.479]
5 Most of the interesting structure in a natural image is characterized by its higher-order statistics. [sent-9, score-0.122]
6 Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. [sent-10, score-0.167]
7 , we have to consider some third-order statistics which describe the interactions between triplets of points. [sent-13, score-0.138]
8 In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i. [sent-15, score-0.203]
9 Classically, harmonic phase interactions are measured by higher-order spectra [4]. [sent-18, score-0.196]
10 Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. [sent-19, score-0.184]
11 For instance, a sixth-order spectrum of a 16×16 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. [sent-20, score-0.153]
12 First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. [sent-21, score-0.17]
13 Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. [sent-22, score-0.32]
14 A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. [sent-23, score-0.474]
15 In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. [sent-24, score-0.174]
16 This allows us to decompose an image into components that are characterized by interactions of a given order. [sent-25, score-0.237]
17 1 Modeling pixel interactions with Wiener functionals For our analysis, we adopt a prediction framework: Given a d × d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. [sent-32, score-0.63]
18 We are particularly interested to which extent interactions of different orders contribute to the overall prediction. [sent-33, score-0.141]
19 , m = d2 − 1 can be modeled as a series y = H0 [x] + H1 [x] + H2 [x] + · · · + Hn [x] + · · · (1) of discrete Volterra functionals H0 [x] = h0 = const. [sent-37, score-0.353]
20 The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x with mn coefficients (n) hi1 . [sent-49, score-0.327]
21 Volterra functionals provide a controlled way of introducing multiplicative interactions of image points since a functional of order n contains all products of the input of order n. [sent-53, score-0.702]
22 In terms of higher-order statistics, this means that we can control the order of the statistics used since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels. [sent-54, score-0.197]
23 Unfortunately, Volterra functionals are not orthogonal to each other, i. [sent-55, score-0.284]
24 , depending on the input distribution, a functional of order n generally leads to additional lower-order interactions. [sent-57, score-0.176]
25 As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. [sent-58, score-0.168]
26 For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. [sent-59, score-0.15]
27 If one wants to estimate the zeroeth-order component of an image (i. [sent-60, score-0.151]
28 , the constant component created without pixel interactions) the constant component created by the second-order interactions needs to be subtracted. [sent-62, score-0.366]
29 For general Volterra series, this correction can be achieved by decomposing it into a new series y = G0 [x] + G1 [x] + · · · + Gn [x] + · · · of functionals Gn [x] that are uncorrelated, i. [sent-63, score-0.373]
30 The resulting Wiener functionals1 Gn [x] are linear combinations of Volterra functionals up to order n. [sent-66, score-0.261]
31 It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. [sent-68, score-0.282]
32 The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. [sent-69, score-0.444]
33 In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5]. [sent-70, score-0.383]
34 Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions. [sent-72, score-0.252]
35 the corresponding n + 1-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients as it consists also of lower-order Volterra functionals. [sent-73, score-0.159]
36 In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals. [sent-74, score-0.22]
37 2 Estimating Wiener series by regression in RKHS Volterra series as linear functionals in RKHS. [sent-75, score-0.487]
38 The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. [sent-76, score-0.325]
39 We can interpret the evaluation of this functional for a given input x as a map φn defined for n = 0, 1, 2, . [sent-77, score-0.15]
40 , xn ) 1 2 m 1 2 (3) n such that φn maps the input x ∈ Rm into a vector φn (x) ∈ Fn = Rm containing all mn ordered monomials of degree n. [sent-86, score-0.294]
41 Using φn , we can write the nth-order Volterra functional in Eq. [sent-87, score-0.133]
42 The 1−m entire pth-order Volterra series can be written as a scalar product in F(p) p n=0 Hn [x] = (η (p) ) φ(p) (x) (5) with η (p) ∈ F(p) . [sent-106, score-0.151]
43 This procedure works because the space Fn of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). [sent-109, score-0.194]
44 (7) Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in a RKHS2 . [sent-115, score-0.251]
45 For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so called representer theorem (e. [sent-117, score-0.136]
46 This kernel implies a map φinh into the same space of monomials, but it weights the degrees of the monomials differently as can be seen by applying the binomial theorem. [sent-121, score-0.194]
47 As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. [sent-146, score-0.233]
48 This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. [sent-147, score-0.471]
49 This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property6 . [sent-148, score-0.236]
50 (10), we obtain the following expressions for the implicit Wiener series p p 1 −1 G0 [x] = y 1, Hn [x] = y Kp z(p) (x) (11) Gn [x] = n=0 n=0 N (p) where the Gram matrix Kp and the coefficient vector z (x) are computed using the kernel from Eq. [sent-150, score-0.188]
51 Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. [sent-155, score-0.193]
52 (11) by collecting all terms containing monomials of the desired order and summing them up. [sent-160, score-0.201]
53 The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by −1 Hn [x] = y Kp zn (x) n n (12) n with zn (x) = ((x1 x) , (x2 x) , . [sent-161, score-0.443]
54 For p = 0 the only term is the constant zero-order Volterra functional H0 [x] = G0 [x]. [sent-165, score-0.133]
55 ) of the explicit Volterra functional is obtained as −1 η n = Φ n Kp y 3 (13) for conditions on uniqueness of the solution, see [6] Note that this is different from the regularized approach used by [1]. [sent-178, score-0.133]
56 If Ω is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input. [sent-179, score-0.285]
57 The individual Wiener functionals can only be recovered by applying the regression procedure twice. [sent-186, score-0.271]
58 The Wiener functional for n > 0 is then obtained from the difference of the two results as Gn [x] = n i=0 Gi [x] − n−1 i=0 Gi [x] = y −1 −1 Kn z(n) (x) − Kn−1 z(n−1) (x) . [sent-188, score-0.133]
59 (14) The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. [sent-189, score-0.388]
60 The resulting Wiener functionals must fulfill the orthogonality condition which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. [sent-192, score-0.644]
61 Formally, we will prove the following Theorem 1 The functionals obtained from Eq. [sent-193, score-0.235]
62 In the case of the Wiener and Volterra expansions, the basis functions ϕj (x) are monomials of the components of x. [sent-196, score-0.227]
63 (16) This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. [sent-199, score-0.205]
64 Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. [sent-200, score-0.255]
65 The approximation error is given by the ∞ sum of the higher-order Wiener functionals e(x) = n=p Gn [x], so Gp [x] is part of the error. [sent-201, score-0.262]
66 The difference of both equations yields E [ϕk (x)Gp [x]] = 0, so that Gp [x] must be orthogonal to any of the lower order basis functions, namely to all monomials with order smaller than p. [sent-204, score-0.293]
67 As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. [sent-208, score-0.186]
68 original image 0th-order component/ reconstruction 1st-order reconstruction 1st-order component 2nd-order reconstruction 2nd-order component 3rd-order reconstruction mse = 583. [sent-212, score-1.103]
69 The image components of different orders are created by the corresponding Wiener functionals. [sent-219, score-0.205]
70 Note that the constant 0-order component and reconstruction are identical. [sent-221, score-0.223]
71 The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. [sent-222, score-0.359]
72 Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). [sent-223, score-0.266]
73 The third-order component is only significant at crossings, corners and line endings. [sent-225, score-0.137]
74 The corresponding image components of order 0 to 3 were computed according to (14). [sent-230, score-0.158]
75 Note the different components generated by the Wiener functionals can also be negative. [sent-231, score-0.27]
76 The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. [sent-236, score-0.312]
77 Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. [sent-238, score-0.155]
78 Note that the third-order component is only significant at crossings, corners and line endings. [sent-241, score-0.137]
79 The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings which cannot be predicted to a higher accuracy than one pixel. [sent-243, score-0.336]
80 Are there further predictable structures in natural images that are not due to lines, crossings or corners? [sent-245, score-0.165]
81 Our 7 In contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand. [sent-248, score-0.15]
82 original image 0th-order component/ reconstruction 1st-order reconstruction mse = 1070 1st-order component 2nd-order reconstruction mse = 957. [sent-249, score-1.102]
83 4 2nd-order component 3rd-order reconstruction mse = 414. [sent-250, score-0.445]
84 6 3rd-order component 4th-order reconstruction mse = 98. [sent-251, score-0.445]
85 5 4th-order component 5th-order reconstruction mse = 18. [sent-252, score-0.445]
86 5 5th-order component 6th-order reconstruction mse = 4. [sent-253, score-0.445]
87 98 6th-order component 7th-order reconstruction mse = 1. [sent-254, score-0.445]
88 32 7th-order component 8th-order reconstruction mse = 0. [sent-255, score-0.445]
89 results on a set of 10 natural images of size 50 × 70 show an an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. [sent-259, score-0.435]
90 Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. [sent-261, score-0.243]
91 Note that the nonlinear terms reduce the reconstruction error to almost 0. [sent-263, score-0.196]
92 This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models. [sent-264, score-0.148]
93 4 Conclusion The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. [sent-265, score-0.484]
94 When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners. [sent-267, score-0.441]
95 There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. [sent-268, score-0.174]
96 On the other 700 Figure 3: Mean square reconstruction error of 600 models of different order for a set of 10 natural images. [sent-270, score-0.247]
97 mse 500 400 300 200 100 0 0 1 2 3 4 5 6 7 model order hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. [sent-271, score-0.558]
98 Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created. [sent-272, score-0.139]
99 In contrast, our method is able to isolate image components that are created by a single order of interaction. [sent-273, score-0.195]
100 [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. [sent-279, score-0.153]
wordName wordTfidf (topN-words)
[('volterra', 0.619), ('wiener', 0.508), ('functionals', 0.235), ('mse', 0.222), ('monomials', 0.175), ('reconstruction', 0.169), ('functional', 0.133), ('series', 0.118), ('gn', 0.106), ('interactions', 0.105), ('image', 0.097), ('crossings', 0.073), ('spectra', 0.071), ('rkhs', 0.071), ('corners', 0.064), ('hn', 0.064), ('pixel', 0.063), ('fn', 0.063), ('xn', 0.061), ('neighbourhood', 0.058), ('expansion', 0.056), ('component', 0.054), ('implicit', 0.051), ('rm', 0.051), ('kp', 0.05), ('orthogonal', 0.049), ('multiplicative', 0.045), ('images', 0.045), ('gp', 0.044), ('kn', 0.041), ('squared', 0.039), ('created', 0.037), ('orders', 0.036), ('coef', 0.035), ('orthogonality', 0.035), ('components', 0.035), ('lines', 0.035), ('moments', 0.034), ('biologische', 0.033), ('cumulants', 0.033), ('kybernetik', 0.033), ('classical', 0.033), ('statistics', 0.033), ('scalar', 0.033), ('kurtosis', 0.029), ('polynomial', 0.029), ('spectrum', 0.028), ('error', 0.027), ('aj', 0.027), ('cients', 0.027), ('stacked', 0.027), ('franz', 0.027), ('order', 0.026), ('natural', 0.025), ('estimation', 0.025), ('implicitly', 0.024), ('predicted', 0.024), ('explosion', 0.023), ('expansions', 0.023), ('yn', 0.023), ('consequence', 0.023), ('sparser', 0.022), ('zn', 0.022), ('fth', 0.022), ('kernels', 0.022), ('degree', 0.022), ('structures', 0.022), ('corner', 0.021), ('correction', 0.02), ('maximally', 0.02), ('minimizes', 0.02), ('phase', 0.02), ('bingen', 0.02), ('analogously', 0.02), ('detectors', 0.02), ('redundancy', 0.02), ('recovered', 0.02), ('line', 0.019), ('ful', 0.019), ('mn', 0.019), ('kernel', 0.019), ('toy', 0.018), ('gray', 0.018), ('predict', 0.018), ('prediction', 0.018), ('play', 0.018), ('usual', 0.018), ('fourier', 0.018), ('huge', 0.018), ('products', 0.018), ('basis', 0.017), ('gram', 0.017), ('interaction', 0.017), ('input', 0.017), ('oriented', 0.016), ('regression', 0.016), ('needs', 0.016), ('xj', 0.016), ('cant', 0.016), ('gi', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
Author: Matthias O. Franz, Bernhard Schölkopf
Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images. Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e. phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16×16 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements. 1 Modeling pixel interactions with Wiener functionals For our analysis, we adopt a prediction framework: Given a d × d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested to which extent interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours xi , i = 1, . . . 
, m = d2 − 1 can be modeled as a series y = H0 [x] + H1 [x] + H2 [x] + · · · + Hn [x] + · · · (1) of discrete Volterra functionals H0 [x] = h0 = const. and Hn [x] = m i1 =1 ··· m in =1 (n) hi1 ...in xi1 . . . xin . (2) Here, we have stacked the grayvalues of the neighbourhood into the vector x = (x1 , . . . , xm ) ∈ Rm . The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x with mn coefficients (n) hi1 ...in . Volterra functionals provide a controlled way of introducing multiplicative interactions of image points since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels. Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally leads to additional lower-order interactions. As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. If one wants to estimate the zeroeth-order component of an image (i.e., the constant component created without pixel interactions) the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by decomposing it into a new series y = G0 [x] + G1 [x] + · · · + Gn [x] + · · · of functionals Gn [x] that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals1 Gn [x] are linear combinations of Volterra functionals up to order n. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5]. So far, we have not gained anything compared to classical estimation of higher-order moments or spectra: an nth-order Volterra functional contains the same number of terms as 1 Strictly speaking, the term Wiener functional is reserved for orthogonal Volterra functionals with respect to Gaussian input. Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions. the corresponding n + 1-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients as it consists also of lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals. 2 Estimating Wiener series by regression in RKHS Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. We can interpret the evaluation of this functional for a given input x as a map φn defined for n = 0, 1, 2, . . . as φ0 (x) = 1 and φn (x) = (xn , xn−1 x2 , . . . 
, x1 xn−1 , xn , . . . , xn ) 1 2 m 1 2 (3) n such that φn maps the input x ∈ Rm into a vector φn (x) ∈ Fn = Rm containing all mn ordered monomials of degree n. Using φn , we can write the nth-order Volterra functional in Eq. (2) as a scalar product in Fn , Hn [x] = ηn φn (x), (n) (4) (n) (n) with the coefficients stacked into the vector ηn = (h1,1,..1 , h1,2,..1 , h1,3,..1 , . . . ) ∈ Fn . The same idea can be applied to the entire pth-order Volterra series. By stacking the maps φn into a single map φ(p) (x) = (φ0 (x), φ1 (x), . . . , φp (x)) , one obtains a mapping from p+1 2 p Rm into F(p) = R × Rm × Rm × . . . Rm = RM with dimensionality M = 1−m . The 1−m entire pth-order Volterra series can be written as a scalar product in F(p) p n=0 Hn [x] = (η (p) ) φ(p) (x) (5) with η (p) ∈ F(p) . Below, we will show how we can express η (p) as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate. This procedure works because the space Fn of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in Fn can be computed by evaluating a positive definite kernel function kn (x1 , x2 ). For monomials, one can easily show that (e.g., [6]) φn (x1 ) φn (x2 ) = (x1 x2 )n =: kn (x1 , x2 ). (6) Since F(p) is generated as a direct sum of the single spaces Fn , the associated scalar product is simply the sum of the scalar products in the Fn : φ(p) (x1 ) φ(p) (x2 ) = p n=0 (x1 x2 )n = k (p) (x1 , x2 ). (7) Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in a RKHS2 . Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so called representer theorem (e.g., [6]). It states the following: suppose we are given N observations 2 A similar approach has been taken by [1] using the inhomogeneous polynomial kernel = (1 + x1 x2 )p . This kernel implies a map φinh into the same space of monomials, but it weights the degrees of the monomials differently as can be seen by applying the binomial theorem. (p) kinh (x1 , x2 ) (x1 , y1 ), . . . , (xN , yN ) of the function (1) and an arbitrary cost function c, Ω is a nondecreasing function on R>0 and . F is the norm of the RKHS associated with the kernel k. If we minimize an objective function c((x1 , y1 , f (x1 )), . . . , (xN , yN , f (xN ))) + Ω( f F ), (8) 3 over all functions in the RKHS, then an optimal solution can be expressed as N f (x) = j=1 aj k(x, xj ), aj ∈ R. (9) In other words, although we optimized over the entire RKHS including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations xj only. Hence the optimization problem over the extremely large number of coefficients η (p) in Eq. (5) is transformed into one over N variables aj . Let us consider the special case where the cost function is the mean squared error, N 1 c((x1 , y1 , f (x1 )), . . . , (xN , yN , f (xN ))) = N j=1 (f (xj ) − yj )2 , and the regularizer Ω is zero4 . The solution for a = (a1 , . . . , aN ) is readily computed by setting the derivative of (8) with respect to the vector a equal to zero; it takes the form a = K −1 y with the Gram matrix defined as Kij = k(xi , xj ), hence5 y = f (x) = a z(x) = y K −1 z(x), (10) N where z(x) = (k(x, x1 ), k(x, x2 ), . . . k(x, xN )) ∈ R . 
Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property6 . From Eqn. (10), we obtain the following expressions for the implicit Wiener series p p 1 −1 G0 [x] = y 1, Hn [x] = y Kp z(p) (x) (11) Gn [x] = n=0 n=0 N (p) where the Gram matrix Kp and the coefficient vector z (x) are computed using the kernel from Eq. (7) and 1 = (1, 1, . . . ) ∈ RN . Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the “curse of dimensionality”, i.e., there is no need to compute the possibly large number of coefficients explicitly. The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by −1 Hn [x] = y Kp zn (x) n n (12) n with zn (x) = ((x1 x) , (x2 x) , . . . , (xN x) ) . For p = 0 the only term is the constant zero-order Volterra functional H0 [x] = G0 [x]. The coefficient vector ηn = (n) (n) (n) (h1,1,...1 , h1,2,...1 , h1,3,...1 , . . . ) of the explicit Volterra functional is obtained as −1 η n = Φ n Kp y 3 (13) for conditions on uniqueness of the solution, see [6] Note that this is different from the regularized approach used by [1]. If Ω is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input. 5 If K is not invertible, K −1 denotes the pseudo-inverse of K. 6 assuming symmetrized Volterra kernels which can be obtained from any Volterra expanson. 4 using the design matrix Φn = (φn (x1 ) , φn (x1 ) , . . . , φn (x1 ) ) . The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solution for the kernels k (n) (x1 , x2 ) and k (n−1) (x1 , x2 ). The Wiener functional for n > 0 is then obtained from the difference of the two results as Gn [x] = n i=0 Gi [x] − n−1 i=0 Gi [x] = y −1 −1 Kn z(n) (x) − Kn−1 z(n−1) (x) . (14) The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. (12) and (13) [3]. Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following Theorem 1 The functionals obtained from Eq. (14) fulfill the orthogonality condition E [m(x)Gp [x]] = 0 (15) where E denotes the expectation over the input distribution and m(x) an arbitrary ithorder monomial with i < p. We will show that this a consequence of the least squares fit of any linear expansion in a set M of basis functions of the form y = j=1 γj ϕj (x). In the case of the Wiener and Volterra expansions, the basis functions ϕj (x) are monomials of the components of x. 
M We denote the error of the expansion as e(x) = y − j=1 γj ϕj (xi ). The minimum of the expected quadratic loss L with respect to the expansion coefficient γk is given by ∂ ∂L = E e(x) ∂γk ∂γk 2 = −2E [ϕk (x)e(x)] = 0. (16) This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is given by the ∞ sum of the higher-order Wiener functionals e(x) = n=p Gn [x], so Gp [x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies ∞ n=p ∞ E [ϕk (x)Gn [x]] = 0 and n=p+1 E [ϕk (x)Gn [x]] = 0 (17) for any φk of order less than p. The difference of both equations yields E [ϕk (x)Gp [x]] = 0, so that Gp [x] must be orthogonal to any of the lower order basis functions, namely to all monomials with order smaller than p. 3 Experiments Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 × 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 × 5 neighbourhood, we predicted the central pixel using (11). original image 0th-order component/ reconstruction 1st-order reconstruction 1st-order component 2nd-order reconstruction 2nd-order component 3rd-order reconstruction mse = 583.7 mse = 0.006 mse = 0 mse = 1317 mse = 37.4 mse = 0.001 mse = 1845 mse = 334.9 3rd-order component mse = 19.0 Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings. Models of orders 0 . . . 3 were learned from the image by extracting the maximal training set of 76 × 106 patches of size 5 × 57 . The corresponding image components of order 0 to 3 were computed according to (14). Note the different components generated by the Wiener functionals can also be negative. In Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test image shown in the corresponding row in Fig. 1. 
Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings, which cannot be predicted to an accuracy higher than one pixel.

Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 × 110 is depicted in Fig. 2).

[Figure 2: Higher-order components and reconstructions of a photograph. Panels: original image, 0th-order component/reconstruction, and components and reconstructions of orders 1 to 8, with reconstruction errors mse = 1070 (1st order), 957.4 (2nd), 414.6 (3rd), 98.5 (4th), 18.5 (5th), 4.98 (6th), 1.32 (7th) and 0.41 (8th). Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order.]

Our results on a set of 10 natural images of size 50 × 70 show an approximately exponential decay of the reconstruction error as more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models.

[Figure 3: Mean square reconstruction error of models of different order for a set of 10 natural images; mse (0 to 700) plotted against model order (0 to 7).]

4 Conclusion

The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners. There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created.
In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of a preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most of the currently used low-level feature detectors, such as edge or corner detectors, use at most third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g. [2], [7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions:

1. A representative image database has to be analyzed. The images must be carefully calibrated, since nonlinear statistics can be highly calibration-sensitive. In addition, the contribution of image noise has to be investigated.
2. Currently, only images of up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. (11). To accommodate larger images, our method has to be reformulated as an iterative algorithm.
3. So far, we have only considered 5 × 5 patches. To systematically investigate patch-size effects, the analysis has to be conducted in a multi-scale framework.

References

[1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-ROM Proc. 2002 IFAC World Congress, 2002.
[2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559-601, 1994.
[3] M. O. Franz and B. Schölkopf. Implicit Wiener series. Technical Report 114, Max-Planck-Institut für biologische Kybernetik, Tübingen, June 2003.
[4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989.
[6] B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819-825, 2001.
[8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt. Soc. Am. A, 16(7):1549-1553, 1999.
[9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Comput. Neural Syst., 12:271-287, 2001.
2 0.079266526 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
Author: Eizaburo Doi, Michael S. Lewicki
Abstract: It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce the high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only clue to reduce the effects of noise. Here we propose a model that best balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model of incorporating useful redundancy into efficient neural representations. 1
3 0.068890676 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). 
Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. 
If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. 
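A minimal sketch of this numerical construction of the equivalent kernel in D = 1 with the squared exponential kernel, using the parameter values quoted in the surrounding text (lengthscale squared 0.004, noise variance 0.1, rho = 100, rho_grid/rho = 5/3, interval [-3/2, 3/2]); this is our own illustration, and all names are assumptions rather than code from the paper.

```python
import numpy as np

def numerical_equivalent_kernel(rho=100.0, grid_ratio=5/3, ell2=0.004,
                                sigma2=0.1, half_width=1.5):
    """Numerical EK in D = 1: build a dense grid, form the smoother matrix
    K (K + sigma_grid^2 I)^{-1} with sigma_grid^2 = sigma^2 * rho_grid / rho,
    and read off its central row."""
    rho_grid = rho * grid_ratio
    x = np.arange(-half_width, half_width, 1.0 / rho_grid)       # dense grid of inputs
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * ell2))   # SE kernel k(r) = exp(-r^2 / 2 l^2)
    sigma2_grid = sigma2 * rho_grid / rho
    smoother = K @ np.linalg.inv(K + sigma2_grid * np.eye(len(x)))
    weights = smoother[len(x) // 2]    # weight vector at the central grid point
    return x, weights                  # scale by rho_grid to compare with the continuum EK h(x)

grid, ek_weights = numerical_equivalent_kernel()
```

The central row can then be plotted against the sinc approximation of Proposition 1; whether the rho_grid rescaling is applied depends only on the normalization convention chosen for h(x).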
Note that in theory one should use a grid over the whole of RD but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects. 2 To understand this scaling of σgrid consider the case where ρgrid > ρ which means that the effective variance at each of the ρgrid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there 2 are ρgrid /ρ independent Gaussian observations with variance σgrid at a single x-point; this would 2 be equivalent to one Gaussian observation with variance σ . In effect the ρ observations per unit x-space have been smoothed out uniformly. 1 0.16 0.35 0.35 Numerical Proposition 1 Proposition 2 0.3 0.25 0.14 0.2 0.2 0.15 0.12 0.15 0.1 0.1 0.05 0.1 0.05 0 0 −0.05 0.08 Numerical Proposition 1 Proposition 2 0.3 0.25 −0.05 −0.1 0 5 10 −0.1 0 15 5 10 15 0.06 0.04 0.02 0 −0.02 Numerical Proposition 1 Sample −0.04 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 104 (right). Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0.004 and σ 2 = 0.1, with ρgrid /ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 104 (a = 9.67, right). As expected, the addition of the 1/a2 -correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/ log(ρ). The is larger than the na¨ve estimate (1/a2 )2 ∼ 1/(log(ρ))4 based on the first correction term from Proposition ı 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down. 2.2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via σ 2 /ρ = S(sc ), so that ˜ h(s) = [1 + S(sc )/S(s)]−1 . The EK will then have the limiting form given in Proposi˜ tion 1 if h(s) approaches a step function Θ(sc − s), i.e. if it becomes infinitely “steep” around the point s = sc for sc → ∞. 
A quantitative criterion for this is that the slope ˜ |h (sc )| should become much larger than 1/sc , the inverse of the range of the step func˜ tion. Since h (s) = S (s)S(sc )S −2 (s)[1 + S(sc )/S(s)]−2 , this is equivalent to requiring that −sc S (sc )/4S(sc ) ∝ −d log S(sc )/d log sc must diverge for sc → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2 ; the asymptotic behaviour of sc is then governed by the smallest . A less obvious case is the “rational quadratic” k(r) = [1 + (r/l)2 ]−(D+1)/2 which has an exponentially decaying power spectrum S(s) ∝ exp(−2π s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ ).) Proposition 1 then applies, with the width of the EK now scaling as 1/sc ∝ 1/ log(ρ). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales , k(r) = exp(−r 2 /2 2 )p( ) d . This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, §2.10]. Such a kernel has power spectrum ∞ S(s) = (2π)D/2 D exp(−2π 2 2 s2 )p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2π 0 s), is obtained for p( ) ∝ −D−2 exp(− 2 /2 2 ). More generally, because the exponential 0 1/s D factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) ∼ 0 p( ) d for large s. This will decay more strongly than any power of 1/s for s → ∞ if p( ) itself decreases more strongly than any power of for → 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK (x) = h(x − x )y(x )dx in the continuum limit of “smoothed out” input locations. We assume as before 2 that y = target + noise, i.e. y(x) = η(x) + ν(x) where E[ν(x)ν(x )] = (σ∗ /ρ)δ(x − x ). 2 Here σ∗ denotes the true noise variance, as opposed to the noise variance assumed in the 2 EK; the scaling of σ∗ with ρ is explained in footnote 1. For a fixed target η, the MSE is = ( dx)−1 [η(x) − fEK (x)]2 dx. Averaging over the noise process ν and target function η gives in Fourier space 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 ds [1 + σ 2 /(ρS(s))]2 (7) where Sη (s) is the power spectrum of the prior over target functions. In the case S(s) = 2 Sη (s) and σ 2 = σ∗ where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (σ 2 /ρ) [1 + σ 2 /(ρS(s))]−1 ds (see also [5, eq. 14-16]). 
Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] = σ2 2 ˜ ˜ Sη (s)[1 − h(s)]2 + (σ∗ /ρ)h2 (s) ds = ρ α=2 0.5 0.03 0.025 ε 0.02 0.015 0.01 α=4 0.1 0.005 0 −0.005 1 0.05 1 0.5 0.5 0 0 −0.5 −0.5 −1 25 −1 50 100 ρ 250 500 Figure 2: Left: plot of the asymptotic form of the EK (sc /r)J1 (2πsc r) for D = 2 and ρ = 1225. Right: log-log plot of against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc , which is justified for large ρ, one obtains for an SE target and learner ≈ σ 2 sc /ρ ∝ (log(ρ))D/2 /ρ. To get analogous predictions for the mismatched case, one can write equation 7 as = 2 σ∗ ρ [1 + σ 2 /(ρS(s))] − σ 2 /(ρS(s)) ds + [1 + σ 2 /(ρS(s))]2 Sη (s) ds. [S(s)ρ/σ 2 + 1]2 2 The first integral is smaller than (σ∗ /σ 2 ) B and can be neglected as long as B . In the second integral we can again make the cutoff approximation—though now with s having ∞ to be above sc – to get the scaling ∝ sc sD−1 Sη (s) ds. For target functions with a power-law decay Sη (s) ∝ s−α of the power spectrum at large s this predicts ∝ sD−α ∝ c (log(ρ))(D−α)/2 . So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ∼ (log(ρ)) −1/2 , and for the Matern-class covariance function k(r) = (1 + r/ ) exp(−r/ ) (which has power spectrum ∝ (3/ 2 + 4π 2 s2 )−2 , so α = 4) we get ∼ (log(ρ))−3/2 . These predictions were tested experimentally using a GP learner with SE covariance function ( = 0.1 and assumed noise level σ 2 = 0.1) against targets from the OU and Matern-class priors (with 2 = 0.05) and with noise level σ∗ = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of on log(ρ), in Figure 2(right) we make a log-log plot of against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ. 3.1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ−1 i h(x∗ − xi )yi at x∗ . Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting · denote averaging over noise, training input points and the test point and setting fη (x∗ ) = h(x, x∗ )η(x)dx, the average MSE of the EK predictor is pred = [η(x) − (1/ρ) i h(x, xi )yi ]2 = [η(x) − fη (x)]2 + = σ2 ρ 2 σ∗ ρ h2 (x, x )dx + 1 ρ h2 (x, x )η 2 (x )dx − 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 η2 ds + 2 /(ρS(s))]2 [1 + σ ρ 1 ρ 2 fη (x) ds [1 + σ 2 /(ρS(s))]2 Here we have set η 2 = ( dx)−1 η 2 (x) dx = Sη (s) ds for the spatial average of the 2 squared target amplitude. 
Taking the matched case, (Sη (s) = S(s) and σ∗ = σ 2 ) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ 2 sD /ρ, while the second one is ∼ η 2 sD /ρ. Thus c c both terms scale in the same way, but the ratio of the second term to the first is the signal2 2 to-noise ratio η /σ , which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x∗ ) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets. In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner’s covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves. Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views. References [1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984. [2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998. [3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990. [4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995. [5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1991. Third Edition. [6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996. [7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985. [8] B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, 2002. o [9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999. [10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999. [11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002. [12] P. Sollich. In NIPS 14, pages 519–526, 2002.
4 0.06653896 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
Author: Hyun J. Park, Te W. Lee
Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 In trod u ction Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence mostly due to a much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithm for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eq.(1,2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution. X = Au (1) P (u ) = ∏ P (u i ) (2) But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. ICA learns only linear dependency between pixels by finding strongly correlated linear axis. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. The focus of these efforts has centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used Gaussian Scale Mixture (GSM) to model variance dependency in wavelet domain. This model can learn variance correlation in source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance related distribution to model the variance correlated source prior. This model can learn grouping of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product of expert model where each expert represents a variance correlated group [7]. The product form of the model enables applications to image denoising. But these models don’t reveal higher-order structures explicitly. 
Our model is motivated by Lewicki and Karklin who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2 nd-stage is a linear generative model where another source v generates logarithmic variance for the 1st stage (Eq. (4)) [8]. This model captures variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of source signal u and to apply the model for image processing problems. ( P (u | λ ) = c exp − u / λ q ) (3) log[λ ] = Bv (4) We propose a hierarchical model based on ICA and a mixture of Laplacian distribution. Our model can be considered as a simplification of model in [8] by constraining v to be 0/1 random vector where only one element can be 1. Our model is computationally simpler but still can capture variance dependency. Experiments show that our model can reveal higher order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications on image segmentation and image denoising. Our model provides an improved statistic model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher order structures. 2 Modeling nonlinear dependencies We propose a hierarchical or 2-stage model where the 1 st stage is an ICA source signal model and the 2nd stage is modeled by a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context dependent variance correlation, we use mixture models where Laplacian distributions with different variance represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent ICA source signal u. Laplacian distributions have 0-mean but different variances. The advantage of Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution. But we need more than two Gaussian distributions to do the same thing. Also conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections. Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch) 2.1 Mixture of Laplacian Distribution We define a PDF for mixture of M-dimensional Laplacian Distribution as Eq.(5), where N is the number of data samples, and K is the number of mixtures. N N K M N K r r r P(U | Λ, Π) = ∏ P(u n | Λ, Π) = ∏∑ π k P(u n | λk ) = ∏∑ π k ∏ n n k n k m 1 (2λ ) k ,m u n,m exp − λk , m (5) r r r r r un = (un,1 , un , 2 , , , un,M ) : n-th data sample, U = (u1 , u 2 , , , ui , , , u N ) r r r r r λk = (λk ,1 , λk , 2 ,..., λk ,M ) : Variance of k-th Laplacian distribution, Λ = (λ1 , λ2 , , , λk , , , λK ) πk : probability of Laplacian distribution k, Π = (π 1 , , , π K ) and ∑ k πk =1 It is not easy to maximize Eq.(5) directly, and we use EM (expectation maximization) algorithm for parameter estimation. 
Here we introduce a new hidden context variable Z that represents which Laplacian k, is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of data and Z as Eq.(6), n zk K N r (π )zkn 1 ⋅ exp − z n u n ,m P(U , Z | Λ, Π ) = ∏ P(u n , Z | Λ, Π ) = ∏ ∏ k ∏ k k λk , m n n m 2λk ,m N (6) n z k : Hidden binary random variable, 1 if n-th data sample is generated from k-th n Laplacian, 0 other wise. ( Z = (z kn ) and ∑ z k = 1 for all n = 1…N) k 2.2 EM algorithm for learning the mixture model The EM algorithm maximizes the log likelihood of data averaged over hidden variable Z. The log likelihood and its expectation can be computed as in Eq.(7,8). u 1 n n log P(U , Z | Λ, Π ) = ∑ z k log(π k ) + ∑ z k log( ) − n ,m 2λk ,m λk , m n ,k m (7) u 1 n E {log P (U , Z | Λ, Π )} = ∑ E z k log(π k ) + ∑ log( ) − n ,m 2λ k , m λk , m n ,k m { } (8) The expectation in Eq.(8) can be evaluated, if we are given the data U and estimated parameters Λ and Π. For Λ and Π, EM algorithm uses current estimation Λ’ and Π’. { } { } ∑ z P( z n n E z k ≡ E zk | U , Λ' , Π ' = 1 n z k =0 n k n k n | u n , Λ' , Π ' ) = P( z k = 1 | u n , Λ' , Π ' ) (9) = n n P (u n | z k = 1, Λ' , Π ' ) P( z k = 1 | Λ ' , Π ' ) P(u n | Λ' , Π ' ) = M u n ,m 1 1 1 ∏ 2λ ' exp(− λ ' ) ⋅ π k ' = c P (u n | Λ ' , Π ' ) m k ,m k ,m n M πk ' ∏ 2λ m k ,m ' exp(− u n ,m λk , m ' ) Where the normalization constant can be computed as K K M k k =1 m =1 n cn = P (u n | Λ ' , Π ' ) = ∑ P (u n | z k , Λ ' , Π ' ) P ( z kn | Λ ' , Π ' ) = ∑ π k ∏ 1 (2λ ) exp( − k ,m u n ,m λk ,m ) (10) The EM algorithm works by maximizing Eq.(8), given the expectation computed from Eq.(9,10). Eq.(9,10) can be computed using Λ’ and Π’ estimated in the previous iteration of EM algorithm. This is E-step of EM algorithm. Then in M-step of EM algorithm, we need to maximize Eq.(8) over parameter Λ and Π. First, we can maximize Eq.(8) with respect to Λ, by setting the derivative as 0. 1 u n,m ∂E{log P (U , Z | Λ, Π )} n = 0 = ∑ E z k − + λ k , m (λ k , m ) 2 ∂λ k ,m n { } ⇒ λ k ,m ∑ E{z }⋅ u = ∑ E{z } n k n ,m n (11) n k n Second, for maximization of Eq.(8) with respect to Π, we can rewrite Eq.(8) as below. n (12) E {log P (U , Z | Λ , Π )} = C + ∑ E {z k ' }log(π k ' ) n ,k ' As we see, the derivative of Eq.(12) with respect to Π cannot be 0. Instead, we need to use Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq.(14) where ρ is a Lagrange multiplier. { } (13) n L (Π , ρ ) = − ∑ E z k ' log(π k ' ) + ρ (∑ π k ' − 1) n,k ' k' By setting the derivative of Eq.(13) to be 0 with respect to ρ and Π, we can simply get the maximization solution with respect to Π. We just show the solution in Eq.(14). ∂L(Π, ρ ) ∂L(Π, ρ ) =0 = 0, ∂Π ∂ρ n n ⇒ π k = ∑ E z k / ∑∑ E z k k n n { } { } (14) Then the EM algorithm can be summarized as figure 2. For the convergence criteria, we can use the expectation of log likelihood, which can be calculated from Eq. (8). πk = { } , λk , m = E um + e (e is small random noise) 2. Calculate the Expectation by 1. Initialize 1 K u n ,m 1 M πk ' ∏ 2λ ' exp( − λ ' ) cn m k ,m k ,m 3. Maximize the log likelihood given the Expectation { } { } n n E z k ≡ E zk | U , Λ' , Π ' = λk ,m ← ∑ E {z kn }⋅ u n,m / ∑ E {z kn } , π k ← ∑ E {z kn } / ∑∑ E {z kn } n n k n 4. If (converged) stop, otherwise repeat from step 2. 
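A compact sketch of these E- and M-steps (Eqs. (9), (11) and (14) of this section) in NumPy may help; the responsibilities are computed in the log domain for numerical stability, a fixed number of iterations replaces the log-likelihood convergence check mentioned above, and all names are our own rather than the authors' code.

```python
import numpy as np

def em_laplacian_mixture(U, K=16, n_iter=300, seed=0):
    """U: (N, M) array of ICA source signals u_n. Returns mixture weights
    pi (K,) and Laplacian scales lam (K, M), following the outlined EM steps."""
    rng = np.random.default_rng(seed)
    N, M = U.shape
    pi = np.full(K, 1.0 / K)
    lam = np.mean(np.abs(U), axis=0) * (1.0 + 0.1 * rng.random((K, M)))  # E|u| plus small noise
    for _ in range(n_iter):
        # E-step: log P(u_n | lambda_k) = sum_m [ -log(2 lam_km) - |u_nm| / lam_km ]
        loglik = -(np.abs(U)[:, None, :] / lam[None] + np.log(2.0 * lam)[None]).sum(axis=-1)
        logpost = np.log(pi)[None, :] + loglik
        logpost -= logpost.max(axis=1, keepdims=True)
        resp = np.exp(logpost)
        resp /= resp.sum(axis=1, keepdims=True)          # responsibilities E[z_k^n], Eq. (9)
        # M-step: Eq. (11) for the scales, Eq. (14) for the mixture weights
        Nk = resp.sum(axis=0)
        lam = (resp.T @ np.abs(U)) / (Nk[:, None] + 1e-12)
        pi = Nk / N
    return pi, lam
```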
n Figure 2: Outline of EM algorithm for Learning the Mixture Model 3 Experimental Results Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide visualization of learned variances that reveal the structure of variance correlation and an application to image denoising. 3.1 Learning Nonlinear Dependencies in Natural images As shown in figure 1, the 1 st stage of the proposed model is simply the linear ICA. The ICA matrix A and W(=A-1) are learned by the FastICA algorithm [9]. We sampled 105(=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. ICA input dimension is 256, and source dimension is set to be 160(=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals. In the 2 nd stage the number of mixtures is set to 16, 64, or 256(=K). Training by the EM algorithm is fast and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC). For the visualization of learned variance, we adapted the visualization method from [8]. Each dimension of ICA source signal corresponds to an ICA basis (columns of A) and each ICA basis is localized in both image and frequency space. Then for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as figure 3. (a1) (a2) (b1) (b2) Figure 3: Visualization of learned variances (a1 and a2 visualize variance of Laplacian #4 and b1 and 2 show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where left side of the patch is clean but right side is filled with randomly oriented edges.) A key idea of our model is that we can mix up independent distributions to get nonlinearly dependent distribution. This modeling power can be shown by figure 4. Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from learned mixture model, and (c) is from learned Laplacian model. In (a), variance of u2 is smaller than u1 at center area (arrow A), but almost equal to u1 at outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by mixture model in (b), but not in (c).) 3.2 Unsupervised Image Segmentation The idea behind our model is that the image can be modeled as mixture of different variance correlated “contexts”. We show how the learned model can be used to classify different context by an unsupervised image segmentation task. Given learned model and data, we can compute the expectation of a hidden variable Z from Eq. (9). Then for an image patch, we can select a Laplacian distribution with highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables abstract partitioning of images and we can visualize organization of images more clearly (figure 5). 
Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.) 3.3 Application to Image Restoration The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. An advantage is in the MAP (maximum a posterior) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq.(15). Then, we can compute MAP estimation of ICA source signal u by Eq.(16) and reconstruct the original image. (15) X = Au + n (16) ˆ u = argmax log P (u | X , A) = argmax (log P ( X | u , A) + log P (u ) ) u u Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or a mixture of Laplacian distribution. The mixture distribution can be approximated by a maximum explaining Laplacian. We evaluated 3 different methods for image restoration including ICA MAP estimation with simple Laplacian prior, same with Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels. As shown MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10]. Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81) 16 ICA MAP (Mixture prior) ICA MAP (Laplacian prior) W iener 14 0.8 SSIM Index SNR 12 10 8 6 0.6 0.4 0.2 4 2 ICA MAP(Mixture prior) ICA MAP(Laplacian prior) W iener Noisy Image 1 0 0.5 1 1.5 Noise variance 2 2.5 0 0 0.5 1 1.5 Noise variance 2 2.5 Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0) 4 D i s c u s s i on We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance correlated signals grouped as different mixtures and learn highlevel structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which are similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces. The learned mixture model can be used to discover hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image. The segmentation results show regularity across image space and strong correlation with high-level concepts. 
Finally, we showed applications of the model for image restoration. We compare the performance with the conventional ICA MAP estimation and Wiener filter. Our results suggest that the proposed model outperforms other traditional methods. It is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to lean more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising. References [1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997. [2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation,Neural Computation, 11(7):1739-1768, 1999. [3] T. Lee, M. Lewicki, and T. Sejnowski., ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation. PAMI, 22(10), October 2000. [4] J. Portilla, V. Strela, M. J. Wainwright and E. P Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. On Image Processing, Vol.12, No. 11, 1338-1351, 2003. [5] A. Hyvarinen, P. O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neurocomputing, 1999. [6] A. Hyvarinen, P.O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001. [7] M. Welling and G. E. Hinton, S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002. [8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499. [9] A.Hyvarinen, P.O. Hoyer, Fast ICA matlab code., http://www.cis.hut.fi/projects/compneuro/extensions.html/ [10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.
5 0.06537649 166 nips-2004-Semi-supervised Learning via Gaussian Processes
Author: Neil D. Lawrence, Michael I. Jordan
Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1
6 0.059517149 92 nips-2004-Kernel Methods for Implicit Surface Modeling
7 0.057552643 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
8 0.053418513 132 nips-2004-Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis
9 0.051636152 50 nips-2004-Dependent Gaussian Processes
10 0.049841195 179 nips-2004-Surface Reconstruction using Learned Shape Models
11 0.047634661 168 nips-2004-Semigroup Kernels on Finite Sets
12 0.047226533 68 nips-2004-Face Detection --- Efficient and Rank Deficient
13 0.045695327 177 nips-2004-Supervised Graph Inference
14 0.043560155 192 nips-2004-The power of feature clustering: An application to object detection
15 0.042442869 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
16 0.039347645 94 nips-2004-Kernels for Multi--task Learning
17 0.039054081 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
18 0.03887248 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
19 0.038631473 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
20 0.037855834 124 nips-2004-Multiple Alignment of Continuous Time Series
same-paper 1 0.93642199 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
Author: Matthias O. Franz, Bernhard Schölkopf
Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images. Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e. phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16×16 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements. 1 Modeling pixel interactions with Wiener functionals For our analysis, we adopt a prediction framework: Given a d × d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested to which extent interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours xi , i = 1, . . . 
, $m = d^2 - 1$, can be modeled as a series
$$ y = H_0[x] + H_1[x] + H_2[x] + \cdots + H_n[x] + \cdots \qquad (1) $$
of discrete Volterra functionals
$$ H_0[x] = h_0 = \text{const.} \quad\text{and}\quad H_n[x] = \sum_{i_1=1}^{m} \cdots \sum_{i_n=1}^{m} h^{(n)}_{i_1 \dots i_n}\, x_{i_1} \cdots x_{i_n}. \qquad (2) $$
Here, we have stacked the gray values of the neighbourhood into the vector $x = (x_1, \dots, x_m)^\top \in \mathbb{R}^m$. The discrete $n$th-order Volterra functional is, accordingly, a linear combination of all ordered $n$th-order monomials of the elements of $x$, with $m^n$ coefficients $h^{(n)}_{i_1 \dots i_n}$. Volterra functionals provide a controlled way of introducing multiplicative interactions of image points, since a functional of order $n$ contains all products of the input of order $n$. In terms of higher-order statistics, this means that we can control the order of the statistics used: an $n$th-order Volterra series leads to dependencies between at most $n+1$ pixels.
Unfortunately, Volterra functionals are not orthogonal to each other: depending on the input distribution, a functional of order $n$ generally also produces lower-order interactions. As a result, the output of the functional contains components that are proportional to those of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a nonzero mean [5]. If one wants to estimate the zeroth-order component of an image (i.e., the constant component created without pixel interactions), the constant component created by the second-order interactions needs to be subtracted. For a general Volterra series, this correction can be achieved by decomposing the series into a new series $y = G_0[x] + G_1[x] + \cdots + G_n[x] + \cdots$ of functionals $G_n[x]$ that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals $G_n[x]$ (strictly speaking, the term Wiener functional is reserved for Volterra functionals orthogonalized with respect to Gaussian input; here it is used for orthogonalized Volterra functionals with arbitrary input distributions) are linear combinations of Volterra functionals up to order $n$. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order $n$ captures only the component of the image created by the multiplicative interaction of $n$ pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5].
So far, we have not gained anything compared to classical estimation of higher-order moments or spectra: an $n$th-order Volterra functional contains the same number of terms as the corresponding $(n+1)$th-order spectrum, and a Wiener functional of the same order has an even larger number of coefficients since it also includes lower-order Volterra functionals. In the next section, we introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals.
2 Estimating Wiener series by regression in RKHS
Volterra series as linear functionals in RKHS. The $n$th-order Volterra functional is a weighted sum of all $n$th-order monomials of the input vector $x$. We can interpret the evaluation of this functional for a given input $x$ as a map $\phi_n$, defined for $n = 0, 1, 2, \dots$ by $\phi_0(x) = 1$ and
$$ \phi_n(x) = \big(x_1^n,\; x_1^{n-1}x_2,\; \dots,\; x_1 x_2^{n-1},\; x_2^n,\; \dots,\; x_m^n\big), \qquad (3) $$
such that $\phi_n$ maps the input $x \in \mathbb{R}^m$ into a vector $\phi_n(x) \in F_n = \mathbb{R}^{m^n}$ containing all $m^n$ ordered monomials of degree $n$. Using $\phi_n$, we can write the $n$th-order Volterra functional in Eq. (2) as a scalar product in $F_n$,
$$ H_n[x] = \eta_n^\top \phi_n(x), \qquad (4) $$
with the coefficients stacked into the vector $\eta_n = (h^{(n)}_{1,1,\dots,1}, h^{(n)}_{1,2,\dots,1}, h^{(n)}_{1,3,\dots,1}, \dots)^\top \in F_n$. The same idea can be applied to the entire $p$th-order Volterra series. By stacking the maps $\phi_n$ into a single map $\phi^{(p)}(x) = (\phi_0(x), \phi_1(x), \dots, \phi_p(x))^\top$, one obtains a mapping from $\mathbb{R}^m$ into $F^{(p)} = \mathbb{R} \times \mathbb{R}^m \times \mathbb{R}^{m^2} \times \dots \times \mathbb{R}^{m^p} = \mathbb{R}^M$ with dimensionality $M = \frac{1 - m^{p+1}}{1 - m}$. The entire $p$th-order Volterra series can be written as a scalar product in $F^{(p)}$,
$$ \sum_{n=0}^{p} H_n[x] = (\eta^{(p)})^\top \phi^{(p)}(x), \qquad (5) $$
with $\eta^{(p)} \in F^{(p)}$. Below, we will show how we can express $\eta^{(p)}$ as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate.
This procedure works because the space $F_n$ of $n$th-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in $F_n$ can be computed by evaluating a positive definite kernel function $k_n(x_1, x_2)$. For monomials, one can easily show that (e.g., [6])
$$ \phi_n(x_1)^\top \phi_n(x_2) = (x_1^\top x_2)^n =: k_n(x_1, x_2). \qquad (6) $$
Since $F^{(p)}$ is generated as a direct sum of the single spaces $F_n$, the associated scalar product is simply the sum of the scalar products in the $F_n$:
$$ \phi^{(p)}(x_1)^\top \phi^{(p)}(x_2) = \sum_{n=0}^{p} (x_1^\top x_2)^n = k^{(p)}(x_1, x_2). \qquad (7) $$
Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in an RKHS. (A similar approach has been taken by [1] using the inhomogeneous polynomial kernel $k^{(p)}_{\mathrm{inh}}(x_1, x_2) = (1 + x_1^\top x_2)^p$. This kernel implies a map $\phi_{\mathrm{inh}}$ into the same space of monomials, but it weights the degrees of the monomials differently, as can be seen by applying the binomial theorem.)
Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so-called representer theorem (e.g., [6]). It states the following: suppose we are given $N$ observations $(x_1, y_1), \dots, (x_N, y_N)$ of the function (1), an arbitrary cost function $c$, a nondecreasing function $\Omega$ on $\mathbb{R}_{>0}$, and the norm $\|\cdot\|_F$ of the RKHS associated with the kernel $k$. If we minimize an objective function
$$ c\big((x_1, y_1, f(x_1)), \dots, (x_N, y_N, f(x_N))\big) + \Omega(\|f\|_F) \qquad (8) $$
over all functions in the RKHS (for conditions on uniqueness of the solution, see [6]), then an optimal solution can be expressed as
$$ f(x) = \sum_{j=1}^{N} a_j\, k(x, x_j), \qquad a_j \in \mathbb{R}. \qquad (9) $$
In other words, although we optimized over the entire RKHS, including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations $x_j$ only. Hence the optimization problem over the extremely large number of coefficients $\eta^{(p)}$ in Eq. (5) is transformed into one over $N$ variables $a_j$.
Let us consider the special case where the cost function is the mean squared error, $c((x_1, y_1, f(x_1)), \dots, (x_N, y_N, f(x_N))) = \frac{1}{N}\sum_{j=1}^{N} (f(x_j) - y_j)^2$, and the regularizer $\Omega$ is zero. (Note that this is different from the regularized approach used by [1]: if $\Omega$ is not zero, the resulting Volterra series differ from the Wiener series since they are not orthogonal with respect to the input.) The solution for $a = (a_1, \dots, a_N)^\top$ is readily computed by setting the derivative of (8) with respect to the vector $a$ equal to zero; it takes the form $a = K^{-1} y$ with the Gram matrix defined as $K_{ij} = k(x_i, x_j)$ (if $K$ is not invertible, $K^{-1}$ denotes the pseudo-inverse of $K$), hence
$$ y = f(x) = a^\top z(x) = y^\top K^{-1} z(x), \qquad (10) $$
where $z(x) = (k(x, x_1), k(x, x_2), \dots, k(x, x_N))^\top \in \mathbb{R}^N$.
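The kernel identities (6) and (7), and the regression solution (10), are easy to verify numerically. The following sketch is our own illustration (the helper names `monomials`, `stack`, and `k_poly` are not from the paper): it builds the explicit monomial map for a small $m$ and $p$, checks that its dot products reproduce the polynomial kernel, and confirms that kernel regression with $k^{(p)}$ gives the same prediction as explicit least squares on the stacked monomials.

```python
import numpy as np
from itertools import product

def monomials(x, n):
    """Explicit map phi_n: the m^n ordered degree-n monomials of x (phi_0 = 1)."""
    if n == 0:
        return np.array([1.0])
    return np.array([np.prod([x[i] for i in idx])
                     for idx in product(range(len(x)), repeat=n)])

def k_poly(x1, x2, p):
    """Summed polynomial kernel k^(p)(x1, x2) = sum_{n=0}^p (x1.x2)^n, Eq. (7)."""
    return sum(np.dot(x1, x2) ** n for n in range(p + 1))

rng = np.random.default_rng(0)
m, p = 4, 3
x1, x2 = rng.normal(size=m), rng.normal(size=m)

# Eq. (6): phi_n(x1).phi_n(x2) == (x1.x2)^n for every order n
for n in range(p + 1):
    assert np.isclose(monomials(x1, n) @ monomials(x2, n), np.dot(x1, x2) ** n)

# Eq. (7): stacking the maps phi_0 ... phi_p gives the summed kernel
def stack(x):
    return np.concatenate([monomials(x, n) for n in range(p + 1)])

assert np.isclose(stack(x1) @ stack(x2), k_poly(x1, x2, p))

# Eq. (10): kernel regression with k^(p) agrees with least squares on the explicit monomials
X, y = rng.normal(size=(30, m)), rng.normal(size=30)
K = np.array([[k_poly(a, b, p) for b in X] for a in X])
x_new = rng.normal(size=m)
f_kernel = y @ np.linalg.pinv(K) @ np.array([k_poly(xi, x_new, p) for xi in X])
Phi = np.array([stack(x) for x in X])
f_explicit = stack(x_new) @ np.linalg.pinv(Phi) @ y
print(f"kernel form: {f_kernel:.6f}   explicit form: {f_explicit:.6f}")
```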
Implicit Wiener series estimation. As we stated above, the $p$th-degree Wiener expansion is the $p$th-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the $p$th-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the $p$th-degree Wiener series, since no other Volterra series has this property (assuming symmetrized Volterra kernels, which can be obtained from any Volterra expansion). From Eq. (10), we obtain the following expressions for the implicit Wiener series:
$$ G_0[x] = \frac{1}{N}\, y^\top \mathbf{1}, \qquad \sum_{n=0}^{p} G_n[x] = \sum_{n=0}^{p} H_n[x] = y^\top K_p^{-1} z^{(p)}(x), \qquad (11) $$
where the Gram matrix $K_p$ and the coefficient vector $z^{(p)}(x)$ are computed using the kernel from Eq. (7) and $\mathbf{1} = (1, 1, \dots)^\top \in \mathbb{R}^N$. Note that the Wiener series is represented only implicitly, since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the “curse of dimensionality”, i.e., there is no need to compute the possibly large number of coefficients explicitly.
The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual $n$th-order Volterra functionals in a Wiener series of degree $p > 0$ are given implicitly by
$$ H_n[x] = y^\top K_p^{-1} z_n(x), \qquad (12) $$
with $z_n(x) = \big((x_1^\top x)^n, (x_2^\top x)^n, \dots, (x_N^\top x)^n\big)^\top$. For $p = 0$ the only term is the constant zero-order Volterra functional $H_0[x] = G_0[x]$. The coefficient vector $\eta_n = (h^{(n)}_{1,1,\dots,1}, h^{(n)}_{1,2,\dots,1}, h^{(n)}_{1,3,\dots,1}, \dots)^\top$ of the explicit Volterra functional is obtained as
$$ \eta_n = \Phi_n K_p^{-1} y \qquad (13) $$
using the design matrix $\Phi_n = (\phi_n(x_1), \phi_n(x_2), \dots, \phi_n(x_N))$.
The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the $n$th-degree Wiener functional, we have to compute the solution for the kernels $k^{(n)}(x_1, x_2)$ and $k^{(n-1)}(x_1, x_2)$. The Wiener functional for $n > 0$ is then obtained from the difference of the two results as
$$ G_n[x] = \sum_{i=0}^{n} G_i[x] - \sum_{i=0}^{n-1} G_i[x] = y^\top \left[ K_n^{-1} z^{(n)}(x) - K_{n-1}^{-1} z^{(n-1)}(x) \right]. \qquad (14) $$
The corresponding $i$th-order Volterra functionals of the $n$th-degree Wiener functional are computed analogously to Eqs. (12) and (13) [3].
Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition, which in its strictest form states that a $p$th-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following
Theorem 1. The functionals obtained from Eq. (14) fulfill the orthogonality condition
$$ E\,[\, m(x)\, G_p[x] \,] = 0, \qquad (15) $$
where $E$ denotes the expectation over the input distribution and $m(x)$ an arbitrary $i$th-order monomial with $i < p$.
We will show that this is a consequence of the least squares fit of any linear expansion in a set of $M$ basis functions of the form $y = \sum_{j=1}^{M} \gamma_j \varphi_j(x)$. In the case of the Wiener and Volterra expansions, the basis functions $\varphi_j(x)$ are monomials of the components of $x$.
M We denote the error of the expansion as e(x) = y − j=1 γj ϕj (xi ). The minimum of the expected quadratic loss L with respect to the expansion coefficient γk is given by ∂ ∂L = E e(x) ∂γk ∂γk 2 = −2E [ϕk (x)e(x)] = 0. (16) This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is given by the ∞ sum of the higher-order Wiener functionals e(x) = n=p Gn [x], so Gp [x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies ∞ n=p ∞ E [ϕk (x)Gn [x]] = 0 and n=p+1 E [ϕk (x)Gn [x]] = 0 (17) for any φk of order less than p. The difference of both equations yields E [ϕk (x)Gp [x]] = 0, so that Gp [x] must be orthogonal to any of the lower order basis functions, namely to all monomials with order smaller than p. 3 Experiments Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 × 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 × 5 neighbourhood, we predicted the central pixel using (11). original image 0th-order component/ reconstruction 1st-order reconstruction 1st-order component 2nd-order reconstruction 2nd-order component 3rd-order reconstruction mse = 583.7 mse = 0.006 mse = 0 mse = 1317 mse = 37.4 mse = 0.001 mse = 1845 mse = 334.9 3rd-order component mse = 19.0 Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings. Models of orders 0 . . . 3 were learned from the image by extracting the maximal training set of 76 × 106 patches of size 5 × 57 . The corresponding image components of order 0 to 3 were computed according to (14). Note the different components generated by the Wiener functionals can also be negative. In Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test image shown in the corresponding row in Fig. 1. 
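The toy experiment can be reproduced in outline with the machinery of Eqs. (10), (11) and (14). The sketch below is our own stand-in, not the authors' code: a small synthetic line image replaces the test images of Fig. 1, training and test patches coincide as in the paper, and the order-$n$ components $G_n$ are obtained as differences of successive reconstructions.

```python
import numpy as np

def poly_kernel(A, B, p):
    """Gram/cross matrix of the summed polynomial kernel k^(p), Eq. (7)."""
    G = A @ B.T
    return sum(G ** n for n in range(p + 1))

def expansion(X, y, p):
    """Degree-p expansion sum_{n<=p} G_n evaluated on the training inputs, Eqs. (10)-(11)."""
    K = poly_kernel(X, X, p)
    return K @ (np.linalg.pinv(K) @ y)

# synthetic 30x40 line image standing in for the first test image of Fig. 1
img = np.zeros((30, 40))
for r in range(30):
    img[r, (2 * r) % 40] = 1.0        # oblique line (wrapped)
img[15, :] = 1.0                       # horizontal line

# every 5x5 patch: the 24 neighbours are the input, the centre pixel is the target
d = 5
P = np.array([img[i:i+d, j:j+d].ravel()
              for i in range(img.shape[0] - d + 1)
              for j in range(img.shape[1] - d + 1)])
y = P[:, d * d // 2]
X = np.delete(P, d * d // 2, axis=1)

# order-p reconstructions; order-n components G_n are differences of reconstructions (Eq. 14)
recon = [np.full_like(y, y.mean())] + [expansion(X, y, p) for p in range(1, 4)]
for p, r in enumerate(recon):
    G_p = r - (recon[p - 1] if p > 0 else 0.0)
    print(f"order {p}: reconstruction mse = {np.mean((y - r) ** 2):.4f}, "
          f"mean |G_{p}| = {np.mean(np.abs(G_p)):.4f}")
```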
Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings which cannot be predicted to a higher accuracy than one pixel. Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 × 110 is depicted in Fig. 2). Our 7 In contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand. original image 0th-order component/ reconstruction 1st-order reconstruction mse = 1070 1st-order component 2nd-order reconstruction mse = 957.4 2nd-order component 3rd-order reconstruction mse = 414.6 3rd-order component 4th-order reconstruction mse = 98.5 4th-order component 5th-order reconstruction mse = 18.5 5th-order component 6th-order reconstruction mse = 4.98 6th-order component 7th-order reconstruction mse = 1.32 7th-order component 8th-order reconstruction mse = 0.41 8th-order component Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order. results on a set of 10 natural images of size 50 × 70 show an an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models. 4 Conclusion The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to the classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners. There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other 700 Figure 3: Mean square reconstruction error of 600 models of different order for a set of 10 natural images. mse 500 400 300 200 100 0 0 1 2 3 4 5 6 7 model order hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created. 
In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of a preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most currently used low-level feature detectors, such as edge or corner detectors, use at most third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g., [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions:
1. A representative image database has to be analyzed. The images must be carefully calibrated, since nonlinear statistics can be highly calibration-sensitive. In addition, the contribution of image noise has to be investigated.
2. Currently, only images of up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. (11). To accommodate larger images, our method has to be reformulated as an iterative algorithm.
3. So far, we have only considered 5 × 5 patches. To systematically investigate patch-size effects, the analysis has to be conducted in a multi-scale framework.
References
[1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-ROM Proc. 2002 IFAC World Congress, 2002.
[2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559-601, 1994.
[3] M. O. Franz and B. Schölkopf. Implicit Wiener series. Technical Report 114, Max-Planck-Institut für biologische Kybernetik, Tübingen, June 2003.
[4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989.
[6] B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819-825, 2001.
[8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt. Soc. Am. A, 16(7):1549-1553, 1999.
[9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Comput. Neural Syst., 12:271-287, 2001.
2 0.63369745 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
Author: Eizaburo Doi, Michael S. Lewicki
Abstract: It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce the high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only clue to reduce the effects of noise. Here we propose a model that best balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model of incorporating useful redundancy into efficient neural representations. 1
3 0.51477891 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan
Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 
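The two bottom-up signatures just mentioned, heavy-tailed marginals and the bowtie dependency, are easy to reproduce by sampling the model. The sketch below is our own illustration (the sample size and the Rayleigh mixer, which is the prior used in the next section, are choices on our part): it draws a shared mixer, multiplies two independent Gaussians by it, and reports the excess kurtosis of one coefficient and the growth of the conditional spread of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# GSM pair: l_i = v * g_i with a shared Rayleigh mixer v (the prior of Sect. 1, a = 1)
v = rng.rayleigh(scale=1.0, size=n)
g = rng.normal(size=(n, 2))
l = v[:, None] * g

# heavy-tailed marginal: excess kurtosis is far above the Gaussian value of 0
l1, l2 = l[:, 0], l[:, 1]
excess_kurtosis = np.mean(l1 ** 4) / np.mean(l1 ** 2) ** 2 - 3.0
print(f"excess kurtosis of l1: {excess_kurtosis:.2f}")

# 'bowtie' dependency: the spread of l2 grows with |l1|, although g1 and g2 are independent
edges = np.quantile(np.abs(l1), [0.0, 0.5, 0.9, 1.0])
for lo, hi in zip(edges[:-1], edges[1:]):
    sel = (np.abs(l1) >= lo) & (np.abs(l1) <= hi)
    print(f"|l1| in [{lo:5.2f}, {hi:5.2f}]:  std(l2) = {l2[sel].std():.2f}")
```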
1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. 
For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). 
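The conditional means tabulated in section 1 can be checked without Bessel functions by integrating the one-dimensional posterior over v on a grid. This is our own sketch (the grid range and sizes are arbitrary): given the responses assigned to one mixer, p(v | l) is proportional to p(v) times the product of N(l_i; 0, v^2 sigma^2), and E[g_i | l] follows from g_i = l_i / v.

```python
import numpy as np

def posterior_stats(l, sigma=1.0, a=1.0, vgrid=np.linspace(1e-3, 15.0, 4000)):
    """Grid-based E[v | l] and E[g | l] for a single-mixer GSM with the prior of Eq. (2).
    l holds the filter responses assigned to this mixer."""
    l = np.atleast_1d(np.asarray(l, dtype=float))
    n = len(l)
    log_prior = a * (np.log(vgrid) - vgrid ** 2 / 2.0)
    log_lik = -n * np.log(vgrid * sigma) - np.sum(l ** 2) / (2.0 * sigma ** 2 * vgrid ** 2)
    w = np.exp(log_prior + log_lik - np.max(log_prior + log_lik))
    w /= w.sum()
    E_v = np.sum(w * vgrid)
    E_g = np.array([np.sum(w * (li / vgrid)) for li in l])   # g_i = l_i / v
    return E_v, E_g

# estimates sharpen as more coefficients sharing the mixer are used (cf. Fig. 1 C-E)
rng = np.random.default_rng(0)
v_true = rng.rayleigh()
l = v_true * rng.normal(size=20)
for k in (1, 5, 20):
    E_v, E_g = posterior_stats(l[:k])
    print(f"{k:2d} responses: E[v|l] = {E_v:.2f}  (true v = {v_true:.2f})")
```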
Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. 
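A much-simplified version of the E-M procedure with a Gibbs-style E-step is sketched below. This is our own toy implementation, not the authors' code: mixer values are resampled on a grid, a single Gibbs state per patch feeds the M-step counts, and the run lengths are kept short, so a clean separation may require more patches or sweeps.

```python
import numpy as np

rng = np.random.default_rng(0)
VGRID = np.linspace(1e-3, 10.0, 400)
LOG_PRIOR_V = np.log(VGRID) - VGRID ** 2 / 2.0            # Rayleigh prior (a = 1), Eq. (2)

def sample_v(l_j):
    """Sample a mixer value from p(v | responses currently assigned to it), on a grid."""
    log_p = LOG_PRIOR_V - len(l_j) * np.log(VGRID) - np.sum(l_j ** 2) / (2 * VGRID ** 2)
    p = np.exp(log_p - log_p.max())
    return rng.choice(VGRID, p=p / p.sum())

def gibbs_em(L, n_mixers=5, sweeps=8, iters=8):
    """EM with a Gibbs E-step for the association probabilities p_ij of Eq. (3)."""
    n_patches, n_filters = L.shape
    P = rng.dirichlet(np.ones(n_mixers), size=n_filters)   # p_ij, random start
    for _ in range(iters):
        counts = np.zeros_like(P)
        for l in L:
            s = rng.integers(n_mixers, size=n_filters)     # assignments for this patch
            for _ in range(sweeps):
                v = np.array([sample_v(l[s == j]) for j in range(n_mixers)])
                for i in range(n_filters):                 # resample the assignment of l_i
                    loglik = -np.log(v) - l[i] ** 2 / (2 * v ** 2)
                    q = P[i] * np.exp(loglik - loglik.max())
                    s[i] = rng.choice(n_mixers, p=q / q.sum())
            counts[np.arange(n_filters), s] += 1.0
        P = (counts + 0.01) / (counts + 0.01).sum(axis=1, keepdims=True)   # M-step
    return P

# synthetic data as in Fig. 2: filters 0-19 share one mixer, filters 20-39 another
n_patches, n_filters = 200, 40
truth = np.repeat([0, 1], 20)
V = rng.rayleigh(size=(n_patches, 2))
L = V[:, truth] * rng.normal(size=(n_patches, n_filters))
P = gibbs_em(L)
print(P.argmax(axis=1))   # filters 0-19 and 20-39 should favour two different labels
```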
In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. 
D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. 
A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. 
Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l Conf on Image Proc, Kobe, Japan, October 1999. [13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998. [14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, 2000. [15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987. [16] T Bollersley, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994. [17] A Hyv¨ rinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of a ¨ ICA. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press. [18] P Hoyer and A Hyv¨ rinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, a 42(12):1593–1605, 2002. [19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003. [20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715– 770, 2002. [21] C Kayser, W Einh¨ user, O D¨ mmer, P K¨ nig, and K P K¨ rding. Extracting slow subspaces from natural videos leads to a u o o complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int’l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg. [22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996. [23] A J Bell and T J Sejnowski. The ’independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327– 3338, 1997. [24] U Grenander and A Srivastava. Probabibility models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002. [25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int’l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society. [26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003. [27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press. [28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.
4 0.51338083 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
Author: Erik G. Learned-miller, Parvez Ahammad
Abstract: The correction of bias in magnetic resonance images is an important problem in medical image processing. Most previous approaches have used a maximum likelihood method to increase the likelihood of the pixels in a single image by adaptively estimating a correction to the unknown image bias field. The pixel likelihoods are defined either in terms of a pre-existing tissue model, or non-parametrically in terms of the image’s own pixel values. In both cases, the specific location of a pixel in the image is not used to calculate the likelihoods. We suggest a new approach in which we simultaneously eliminate the bias from a set of images of the same anatomy, but from different patients. We use the statistics from the same location across different images, rather than within an image, to eliminate bias fields from all of the images simultaneously. The method builds a “multi-resolution” non-parametric tissue model conditioned on image location while eliminating the bias fields associated with the original image set. We present experiments on both synthetic and real MR data sets, and present comparisons with other methods. 1
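The idea in the abstract can be illustrated with a toy construction (entirely ours: the log-polynomial bias basis, the histogram entropy estimator and the coordinate-wise search are assumptions, not the authors' implementation). Each image receives a smooth multiplicative bias field, and all bias parameters are adjusted jointly to minimize the summed entropy of the pixel stacks taken across images at each location; the coefficients are only identifiable up to a component shared by all images.

```python
import numpy as np

rng = np.random.default_rng(0)
n_img, H, W = 16, 24, 24

# toy 'anatomy' shared by all subjects, corrupted by smooth multiplicative bias fields
yy, xx = np.mgrid[0:H, 0:W].astype(float)
xx = (xx - xx.mean()) / W
yy = (yy - yy.mean()) / H
anatomy = 1.0 + ((xx ** 2 + yy ** 2) < 0.08).astype(float)   # bright disc on a background
basis = np.stack([xx, yy, xx * yy])                           # zero-mean log-bias basis (assumed)
true_b = rng.normal(size=(n_img, 3))
bias = np.exp(np.einsum('kp,pij->kij', true_b, basis))
images = anatomy * bias * (1 + 0.02 * rng.normal(size=(n_img, H, W)))

def stack_entropy(imgs, bins=8):
    """Sum over pixel locations of the histogram entropy of the across-image stack."""
    lo, hi = imgs.min(), imgs.max()
    total = 0.0
    for i in range(H):
        for j in range(W):
            p, _ = np.histogram(imgs[:, i, j], bins=bins, range=(lo, hi))
            p = p / p.sum()
            p = p[p > 0]
            total -= np.sum(p * np.log(p))
    return total

def corrected(b):
    return images * np.exp(-np.einsum('kp,pij->kij', b, basis))

# coordinate-wise hill climbing over the per-image bias coefficients
b = np.zeros((n_img, 3))
best = stack_entropy(corrected(b))
step = 0.5
for _ in range(6):
    for k in range(n_img):
        for p in range(3):
            for delta in (step, -step):
                trial = b.copy()
                trial[k, p] += delta
                e = stack_entropy(corrected(trial))
                if e < best:
                    b, best = trial, e
    step *= 0.5

# bias parameters are identifiable only up to a component shared by all images
print(np.round((b - b.mean(0))[:3], 2))
print(np.round((true_b - true_b.mean(0))[:3], 2))
```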
5 0.48485386 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). 
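Equation (1) and the weight-function view can be illustrated directly before passing to the continuum limit. The sketch below is ours (the kernel, noise level and data are arbitrary): the prediction at x* is the dot product of the weight vector h(x*) = (K + sigma^2 I)^{-1} k(x*) with the training targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, sigma2 = 50, 0.1, 0.01

def k_se(a, b, ell):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + np.sqrt(sigma2) * rng.normal(size=n)

K = k_se(x, x, ell)
x_star = np.array([0.37])
k_star = k_se(x, x_star, ell)[:, 0]

h = np.linalg.solve(K + sigma2 * np.eye(n), k_star)   # weight function h(x*), cf. Eq. (1)
f_star = h @ y                                        # GP mean prediction is a linear smoother
print(f"prediction at x* = {x_star[0]}: {f_star:.3f}  (true sin: {np.sin(2*np.pi*x_star[0]):.3f})")
print(f"weights sum to {h.sum():.3f}; the largest weights sit on the nearest training points")
```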
Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. 
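In D = 1 the approximation of Proposition 1 is a scaled sinc, and the cutoff s_c follows directly from b = sigma^2 / (rho (2 pi ell^2)^{1/2}). A small numerical sketch (ours; the parameter values are those quoted later for Fig. 1) shows how slowly the EK width 1/s_c shrinks as rho grows, since s_c grows only like (log rho)^{1/2}.

```python
import numpy as np

def ek_sinc(x, ell2, sigma2, rho):
    """Proposition 1 in D = 1: h_SE(x) ~ 2 s_c sinc(2 pi s_c x)."""
    b = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))
    s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell2))
    return 2 * s_c * np.sinc(2 * s_c * x), s_c        # np.sinc(t) = sin(pi t)/(pi t)

ell2, sigma2 = 0.004, 0.1                             # values quoted for Fig. 1
for rho in (1e2, 1e4, 1e6):
    _, s_c = ek_sinc(0.0, ell2, sigma2, rho)
    print(f"rho = {rho:9.0f}:  s_c = {s_c:6.2f},  EK width ~ 1/s_c = {1/s_c:.3f}")
```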
If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. 
Note that in theory one should use a grid over the whole of R^D, but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest, as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects.

(Footnote 1: To understand this scaling of σ²_grid, consider the case ρ_grid > ρ, which means that the effective variance at each of the ρ_grid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there are ρ_grid/ρ independent Gaussian observations with variance σ²_grid at a single x-point; this would be equivalent to one Gaussian observation with variance σ². In effect the ρ observations per unit x-space have been smoothed out uniformly.)

Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 10⁴ (right).

Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above, and the analytical sinc approximation. These are computed for parameter values of ℓ² = 0.004 and σ² = 0.1, with ρ_grid/ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 10⁴ (a = 9.67, right). As expected, the addition of the 1/a² correction gives better agreement with the numerical result for z < a.

Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/log(ρ). This is larger than the naïve estimate (1/a²)² ∼ 1/(log(ρ))⁴ based on the first correction term from Proposition 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down.

2.2 Other kernels

Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff s_c on the range of s in the EK via σ²/ρ = S(s_c), so that h̃(s) = [1 + S(s_c)/S(s)]^{−1}. The EK will then have the limiting form given in Proposition 1 if h̃(s) approaches a step function Θ(s_c − s), i.e. if it becomes infinitely “steep” around the point s = s_c for s_c → ∞.
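To illustrate the cutoff construction for kernels other than the SE, here is a small sketch (assumed spectra, prefactors and parameter values; not code from the paper): s_c is defined by σ²/ρ = S(s_c), and the steepness measure −d log S/d log s at s_c, discussed next, should diverge as s_c → ∞.

```python
# Sketch: effective cutoff s_c from sigma^2/rho = S(s_c) for two isotropic spectra,
# plus the "steepness" measure -d log S / d log s evaluated at s_c.
import numpy as np
from scipy.optimize import brentq

l, sigma2 = 0.1, 0.1

def S_se(s):      # SE spectrum in D = 1: (2*pi*l^2)^{1/2} * exp(-2*pi^2*l^2*s^2)
    return np.sqrt(2 * np.pi * l**2) * np.exp(-2 * np.pi**2 * l**2 * s**2)

def S_exp(s):     # exponentially decaying spectrum, S(s) ∝ exp(-2*pi*l*s)
    return 2 * l * np.exp(-2 * np.pi * l * s)      # prefactor chosen for illustration only

def cutoff(S, rho):
    return brentq(lambda s: S(s) - sigma2 / rho, 1e-6, 1e4)

def steepness(S, sc, eps=1e-4):
    # -d log S / d log s at s_c, by a central finite difference in log s
    return -(np.log(S(sc * (1 + eps))) - np.log(S(sc * (1 - eps)))) / (2 * eps)

for rho in [1e2, 1e4, 1e6]:
    sc_se, sc_exp = cutoff(S_se, rho), cutoff(S_exp, rho)
    print(rho, sc_se, steepness(S_se, sc_se), sc_exp, steepness(S_exp, sc_exp))
# For the SE spectrum s_c grows like sqrt(log rho) and the steepness like log rho;
# for the exponentially decaying spectrum s_c grows like log rho, steepness likewise.
```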
A quantitative criterion for this limiting behaviour is that the slope |h̃′(s_c)| should become much larger than 1/s_c, the inverse of the range of the step function. Since h̃′(s) = S′(s) S(s_c) S^{−2}(s) [1 + S(s_c)/S(s)]^{−2}, this is equivalent to requiring that −s_c S′(s_c)/(4S(s_c)) ∝ −d log S(s_c)/d log s_c must diverge for s_c → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s.

A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales ℓ; the asymptotic behaviour of s_c is then governed by the smallest ℓ. A less obvious case is the “rational quadratic” k(r) = [1 + (r/ℓ)²]^{−(D+1)/2}, which has an exponentially decaying power spectrum S(s) ∝ exp(−2πℓs). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ℓ).) Proposition 1 then applies, with the width of the EK now scaling as 1/s_c ∝ 1/log(ρ).

The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p(ℓ) of lengthscales ℓ, k(r) = ∫ exp(−r²/2ℓ²) p(ℓ) dℓ. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D; see [9, §2.10]. Such a kernel has power spectrum

S(s) = (2π)^{D/2} ∫_0^∞ ℓ^D exp(−2π²ℓ²s²) p(ℓ) dℓ,    (6)

and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2πℓ₀s), is obtained for p(ℓ) ∝ ℓ^{−D−2} exp(−ℓ₀²/2ℓ²). More generally, because the exponential factor in equation 6 acts like a cutoff for ℓ > 1/s, one estimates S(s) ∼ ∫_0^{1/s} ℓ^D p(ℓ) dℓ for large s. This will decay more strongly than any power of 1/s for s → ∞ if p(ℓ) itself decreases more strongly than any power of ℓ for ℓ → 0. Any such choice of p(ℓ) will therefore yield a kernel to which Proposition 1 applies.

3 Understanding GP Learning Using the Equivalent Kernel

We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η, we obtain the average MSE ε as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves.

To understand the asymptotic behaviour of ε for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by f_EK(x) = ∫ h(x − x′) y(x′) dx′ in the continuum limit of “smoothed out” input locations. We assume as before that y = target + noise, i.e. y(x) = η(x) + ν(x), where E[ν(x)ν(x′)] = (σ*²/ρ) δ(x − x′). Here σ*² denotes the true noise variance, as opposed to the noise variance assumed in the EK; the scaling of σ*² with ρ is explained in footnote 1. For a fixed target η, the MSE is ε = (∫ dx)^{−1} ∫ [η(x) − f_EK(x)]² dx. Averaging over the noise process ν and target function η gives in Fourier space

ε = ∫ { S_η(s) [1 − h̃(s)]² + (σ*²/ρ) h̃²(s) } ds = (σ²/ρ) ∫ [ (σ²/ρ) S_η(s)/S²(s) + σ*²/σ² ] / [1 + σ²/(ρ S(s))]² ds,    (7)

where S_η(s) is the power spectrum of the prior over target functions. In the case S(s) = S_η(s) and σ² = σ*², where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error ε_B and simplifies to ε_B = (σ²/ρ) ∫ [1 + σ²/(ρS(s))]^{−1} ds (see also [5, eq. 14-16]).
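A small numerical check of the matched-case Bayes error (again a sketch with assumed parameter values, not from the paper): evaluate ε_B = (σ²/ρ) ∫ [1 + σ²/(ρS(s))]^{−1} ds for the SE spectrum in D = 1 and compare it with the (log ρ)^{1/2}/ρ scaling derived next.

```python
# Sketch: Bayes error eps_B = (sigma^2/rho) * int [1 + sigma^2/(rho*S(s))]^{-1} ds
# for an SE kernel in D = 1, compared with the predicted (log rho)^{1/2} / rho scaling.
import numpy as np
from scipy.integrate import quad

l, sigma2 = 0.1, 0.1

def S(s):
    return np.sqrt(2 * np.pi * l**2) * np.exp(-2 * np.pi**2 * l**2 * s**2)

def bayes_error(rho):
    s_c = np.sqrt(np.log(rho * np.sqrt(2 * np.pi * l**2) / sigma2) / (2 * np.pi**2 * l**2))
    # same integrand, written to avoid dividing by S(s) when it underflows to zero
    integrand = lambda s: rho * S(s) / (rho * S(s) + sigma2)
    val, _ = quad(integrand, 0, 10 * s_c, limit=200)
    return (sigma2 / rho) * 2 * val      # factor 2: integral over s in (-inf, inf)

for rho in [1e2, 1e3, 1e4, 1e5]:
    print(rho, bayes_error(rho), np.sqrt(np.log(rho)) / rho)
# The second and third columns should track each other up to a roughly constant factor.
```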
Interestingly, equation 7 is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression.

Figure 2: Left: plot of the asymptotic form of the EK, (s_c/r) J_1(2π s_c r), for D = 2 and ρ = 1225. Right: log-log plot of ε against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2, which are the predicted rates.

Treating the denominator in the expression for ε_B again as a hard cutoff at s = s_c, which is justified for large ρ, one obtains for an SE target and learner ε ≈ σ² s_c^D/ρ ∝ (log(ρ))^{D/2}/ρ.

To get analogous predictions for the mismatched case, one can write equation 7 as

ε = (σ*²/ρ) ∫ { [1 + σ²/(ρS(s))] − σ²/(ρS(s)) } / [1 + σ²/(ρS(s))]² ds + ∫ S_η(s) / [S(s)ρ/σ² + 1]² ds.

The first integral is smaller than (σ*²/σ²) ε_B and can be neglected as long as ε ≫ ε_B. In the second integral we can again make the cutoff approximation—though now with s having to be above s_c—to get the scaling ε ∝ ∫_{s_c}^∞ s^{D−1} S_η(s) ds. For target functions with a power-law decay S_η(s) ∝ s^{−α} of the power spectrum at large s, this predicts ε ∝ s_c^{D−α} ∝ (log(ρ))^{(D−α)/2}. So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ε ∼ (log(ρ))^{−1/2}, and for the Matern-class covariance function k(r) = (1 + r/ℓ) exp(−r/ℓ) (which has power spectrum ∝ (3/ℓ² + 4π²s²)^{−2}, so α = 4) we get ε ∼ (log(ρ))^{−3/2}. These predictions were tested experimentally using a GP learner with SE covariance function (ℓ = 0.1 and assumed noise level σ² = 0.1) against targets from the OU and Matern-class priors (with ℓ = 0.05) and with noise level σ*² = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of ε on log(ρ), in Figure 2 (right) we make a log-log plot of ε against log(ρ). The dashed lines show the gradients of −1/2 and −3/2, and we observe good agreement between experimental and theoretical results for large ρ.

3.1 Using the Equivalent Kernel in Kernel Regression

Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ^{−1} Σ_i h(x* − x_i) y_i at x*. Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting ⟨·⟩ denote averaging over noise, training input points and the test point, and setting f_η(x*) = ∫ h(x, x*) η(x) dx, the average MSE of the EK predictor is

ε_pred = ⟨ [η(x) − (1/ρ) Σ_i h(x, x_i) y_i]² ⟩
       = ⟨ [η(x) − f_η(x)]² ⟩ + (σ*²/ρ) ∫ h²(x, x′) dx′ + (1/ρ) ∫ h²(x, x′) η²(x′) dx′ − (1/ρ) ⟨f_η²(x)⟩
       = (σ²/ρ) ∫ [ (σ²/ρ) S_η(s)/S²(s) + σ*²/σ² ] / [1 + σ²/(ρS(s))]² ds + (η̄²/ρ) ∫ 1/[1 + σ²/(ρS(s))]² ds − (1/ρ) ⟨f_η²(x)⟩.

Here we have set η̄² = (∫ dx)^{−1} ∫ η²(x) dx = ∫ S_η(s) ds for the spatial average of the squared target amplitude.
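Before analyzing these terms, the EK predictor itself is easy to sketch for the SE kernel in D = 1 using the sinc approximation of Proposition 1 for h. This is a hedged illustration with assumed parameters and a fixed assumed target function (the paper's numerical comparison instead averaged over targets drawn from a GP prior).

```python
# Sketch: kernel-smoothing prediction with the (approximate) equivalent kernel,
# f_EK(x*) = (1/rho) * sum_i h(x* - x_i) y_i, h from Proposition 1 in D = 1,
# compared against standard GP regression on the same data.
import numpy as np

rng = np.random.default_rng(0)
l, sigma2, rho = 0.1, 0.1, 200.0
n = int(rho)                                  # density rho on the unit interval
X = rng.uniform(0.0, 1.0, n)
f_true = lambda x: np.sin(2 * np.pi * x)      # assumed target, not from the paper
y = f_true(X) + np.sqrt(0.01) * rng.normal(size=n)   # true noise variance 0.01

b = sigma2 / (rho * np.sqrt(2 * np.pi * l**2))
s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi**2 * l**2))
h = lambda r: 2 * s_c * np.sinc(2 * s_c * r)  # np.sinc(z) = sin(pi*z)/(pi*z)

x_test = np.linspace(0.2, 0.8, 50)            # stay away from the edges
f_ek = np.array([h(xs - X) @ y / rho for xs in x_test])

K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * l**2))
alpha = np.linalg.solve(K + sigma2 * np.eye(n), y)
k_star = np.exp(-(x_test[:, None] - X[None, :])**2 / (2 * l**2))
f_gp = k_star @ alpha

print("EK predictor MSE:", np.mean((f_ek - f_true(x_test))**2))
print("GP predictor MSE:", np.mean((f_gp - f_true(x_test))**2))
```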
Taking the matched case (S_η(s) = S(s) and σ*² = σ²) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ² s_c^D/ρ, while the second one is ∼ η̄² s_c^D/ρ. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal-to-noise ratio η̄²/σ², which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x*) and the EK that we saw in Figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets.

In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner’s covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression, as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves.

Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.

References
[1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984.
[2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998.
[3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
[4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995.
[5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, third edition, 1991.
[6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996.
[7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985.
[8] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999.
[10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999.
[11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002.
[12] P. Sollich. In NIPS 14, pages 519–526, 2002.
6 0.48144326 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
7 0.45546353 166 nips-2004-Semi-supervised Learning via Gaussian Processes
8 0.42535818 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
9 0.41270721 50 nips-2004-Dependent Gaussian Processes
10 0.38893843 168 nips-2004-Semigroup Kernels on Finite Sets
11 0.36859858 158 nips-2004-Sampling Methods for Unsupervised Learning
12 0.3683365 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
13 0.36661068 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
14 0.35697928 179 nips-2004-Surface Reconstruction using Learned Shape Models
15 0.34620553 192 nips-2004-The power of feature clustering: An application to object detection
16 0.34023011 68 nips-2004-Face Detection --- Efficient and Rank Deficient
17 0.33748111 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
18 0.33607325 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps
19 0.32849512 94 nips-2004-Kernels for Multi--task Learning
20 0.3161889 109 nips-2004-Mass Meta-analysis in Talairach Space
topicId topicWeight
[(13, 0.075), (15, 0.137), (26, 0.084), (31, 0.012), (33, 0.157), (35, 0.02), (39, 0.035), (50, 0.032), (66, 0.289), (87, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.80375963 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
Author: Matthias O. Franz, Bernhard Schölkopf
Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images. Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e. phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16×16 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements. 1 Modeling pixel interactions with Wiener functionals For our analysis, we adopt a prediction framework: Given a d × d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested to which extent interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours xi , i = 1, . . . 
, m = d² − 1 can be modeled as a series

y = H0[x] + H1[x] + H2[x] + · · · + Hn[x] + · · ·    (1)

of discrete Volterra functionals

H0[x] = h0 = const.   and   Hn[x] = Σ_{i1=1}^{m} · · · Σ_{in=1}^{m} h^(n)_{i1...in} x_{i1} . . . x_{in}.    (2)

Here, we have stacked the grayvalues of the neighbourhood into the vector x = (x1, . . . , xm)ᵀ ∈ R^m. The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x with m^n coefficients h^(n)_{i1...in}. Volterra functionals provide a controlled way of introducing multiplicative interactions of image points since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels.

Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally leads to additional lower-order interactions. As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. If one wants to estimate the zeroeth-order component of an image (i.e., the constant component created without pixel interactions), the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by decomposing it into a new series y = G0[x] + G1[x] + · · · + Gn[x] + · · · of functionals Gn[x] that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals Gn[x] are linear combinations of Volterra functionals up to order n. (Strictly speaking, the term Wiener functional is reserved for orthogonal Volterra functionals with respect to Gaussian input. Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions.) They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5].

So far, we have not gained anything compared to classical estimation of higher-order moments or spectra: an nth-order Volterra functional contains the same number of terms as the corresponding (n + 1)-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients, as it consists also of lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals.

2 Estimating Wiener series by regression in RKHS

Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. We can interpret the evaluation of this functional for a given input x as a map φn, defined for n = 0, 1, 2, . . . as φ0(x) = 1 and
φn(x) = (x1^n, x1^{n−1} x2, . . . , x1 xm^{n−1}, x2^n, . . . , xm^n),    (3)

such that φn maps the input x ∈ R^m into a vector φn(x) ∈ Fn = R^{m^n} containing all m^n ordered monomials of degree n. Using φn, we can write the nth-order Volterra functional in Eq. (2) as a scalar product in Fn,

Hn[x] = ηnᵀ φn(x),    (4)

with the coefficients stacked into the vector ηn = (h^(n)_{1,1,..,1}, h^(n)_{1,2,..,1}, h^(n)_{1,3,..,1}, . . .)ᵀ ∈ Fn. The same idea can be applied to the entire pth-order Volterra series. By stacking the maps φn into a single map φ^(p)(x) = (φ0(x), φ1(x), . . . , φp(x))ᵀ, one obtains a mapping from R^m into F^(p) = R × R^m × R^{m²} × . . . × R^{m^p} = R^M with dimensionality M = (1 − m^{p+1})/(1 − m). The entire pth-order Volterra series can be written as a scalar product in F^(p),

Σ_{n=0}^{p} Hn[x] = (η^(p))ᵀ φ^(p)(x),    (5)

with η^(p) ∈ F^(p). Below, we will show how we can express η^(p) as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate.

This procedure works because the space Fn of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in Fn can be computed by evaluating a positive definite kernel function kn(x1, x2). For monomials, one can easily show that (e.g., [6])

φn(x1)ᵀ φn(x2) = (x1ᵀ x2)^n =: kn(x1, x2).    (6)

Since F^(p) is generated as a direct sum of the single spaces Fn, the associated scalar product is simply the sum of the scalar products in the Fn:

φ^(p)(x1)ᵀ φ^(p)(x2) = Σ_{n=0}^{p} (x1ᵀ x2)^n = k^(p)(x1, x2).    (7)

Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in a RKHS. (A similar approach has been taken by [1] using the inhomogeneous polynomial kernel k^(p)_inh(x1, x2) = (1 + x1ᵀ x2)^p. This kernel implies a map φ_inh into the same space of monomials, but it weights the degrees of the monomials differently, as can be seen by applying the binomial theorem.)

Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so-called representer theorem (e.g., [6]). It states the following: suppose we are given N observations (x1, y1), . . . , (xN, yN) of the function (1) and an arbitrary cost function c, Ω is a nondecreasing function on R_{>0}, and ‖·‖_F is the norm of the RKHS associated with the kernel k. If we minimize an objective function

c((x1, y1, f(x1)), . . . , (xN, yN, f(xN))) + Ω(‖f‖_F)    (8)

over all functions in the RKHS (for conditions on uniqueness of the solution, see [6]), then an optimal solution can be expressed as

f(x) = Σ_{j=1}^{N} aj k(x, xj),   aj ∈ R.    (9)

In other words, although we optimized over the entire RKHS including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations xj only. Hence the optimization problem over the extremely large number of coefficients η^(p) in Eq. (5) is transformed into one over N variables aj.

Let us consider the special case where the cost function is the mean squared error, c((x1, y1, f(x1)), . . . , (xN, yN, f(xN))) = (1/N) Σ_{j=1}^{N} (f(xj) − yj)², and the regularizer Ω is zero. (Note that this is different from the regularized approach used by [1]: if Ω is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input.) The solution for a = (a1, . . . , aN)ᵀ is readily computed by setting the derivative of (8) with respect to the vector a equal to zero; it takes the form a = K^{−1} y with the Gram matrix defined as Kij = k(xi, xj) (if K is not invertible, K^{−1} denotes the pseudo-inverse of K), hence

ŷ = f(x) = aᵀ z(x) = yᵀ K^{−1} z(x),    (10)

where z(x) = (k(x, x1), k(x, x2), . . . , k(x, xN))ᵀ ∈ R^N.
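To make the implicit representation concrete, here is a minimal sketch (synthetic data and a toy target of my own choosing, not from the paper) of prediction with the summed polynomial kernel k^(p)(x1, x2) = Σ_{n=0}^{p} (x1ᵀx2)^n and the solution of equation 10.

```python
# Sketch: pth-order Volterra/Wiener prediction via the polynomial kernel
# k^(p)(x1, x2) = sum_{n=0}^{p} (x1.x2)^n and y_hat(x) = y^T K^{-1} z(x)  (Eq. 10).
import numpy as np

def k_p(X1, X2, p):
    """Gram matrix of the summed polynomial kernel between row-sets X1 and X2."""
    dots = X1 @ X2.T
    return sum(dots**n for n in range(p + 1))

rng = np.random.default_rng(0)
N, m, p = 200, 8, 3                      # training points, input dim, model order
X = rng.normal(size=(N, m))
# assumed toy system with interactions up to third order
y = 0.5 + X[:, 0] + X[:, 1] * X[:, 2] + 0.3 * X[:, 0] * X[:, 3] * X[:, 4]

K = k_p(X, X, p)
# least-squares solve plays the role of the (pseudo-)inverse in a = K^{-1} y
coef = np.linalg.lstsq(K, y, rcond=None)[0]

x_new = rng.normal(size=(5, m))
z = k_p(x_new, X, p)                     # rows are z(x) = (k(x, x_1), ..., k(x, x_N))
print(z @ coef)                          # predictions
print(0.5 + x_new[:, 0] + x_new[:, 1] * x_new[:, 2]
      + 0.3 * x_new[:, 0] * x_new[:, 3] * x_new[:, 4])   # ground truth for comparison
```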
Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property (assuming symmetrized Volterra kernels, which can be obtained from any Volterra expansion). From Eqn. (10), we obtain the following expressions for the implicit Wiener series

G0[x] = (1/N) yᵀ 1,    Σ_{n=0}^{p} Gn[x] = Σ_{n=0}^{p} Hn[x] = yᵀ Kp^{−1} z^(p)(x),    (11)

where the Gram matrix Kp and the coefficient vector z^(p)(x) are computed using the kernel from Eq. (7) and 1 = (1, 1, . . .)ᵀ ∈ R^N. Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the “curse of dimensionality”, i.e., there is no need to compute the possibly large number of coefficients explicitly.

The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by

Hn[x] = yᵀ Kp^{−1} zn(x),    (12)

with zn(x) = ((x1ᵀ x)^n, (x2ᵀ x)^n, . . . , (xNᵀ x)^n)ᵀ. For p = 0 the only term is the constant zero-order Volterra functional H0[x] = G0[x]. The coefficient vector ηn = (h^(n)_{1,1,...,1}, h^(n)_{1,2,...,1}, h^(n)_{1,3,...,1}, . . .)ᵀ of the explicit Volterra functional is obtained as

ηn = Φnᵀ Kp^{−1} y,    (13)

using the design matrix Φn = (φn(x1), φn(x2), . . . , φn(xN))ᵀ. The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solution for the kernels k^(n)(x1, x2) and k^(n−1)(x1, x2). The Wiener functional for n > 0 is then obtained from the difference of the two results as

Gn[x] = Σ_{i=0}^{n} Gi[x] − Σ_{i=0}^{n−1} Gi[x] = yᵀ ( Kn^{−1} z^(n)(x) − K_{n−1}^{−1} z^(n−1)(x) ).    (14)

The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. (12) and (13) [3].
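A sketch of recovering individual orders with equations (11)–(14) on synthetic data (assumed toy target; least-squares solves stand in for the explicit pseudo-inverses): the nth-degree Wiener functional is the difference between the degree-n and degree-(n−1) regression fits.

```python
# Sketch: implicit Wiener functionals via Eq. (14),
# G_n[x] = y^T ( K_n^{-1} z^(n)(x) - K_{n-1}^{-1} z^(n-1)(x) ).
import numpy as np

def k_p(X1, X2, p):
    dots = X1 @ X2.T
    return sum(dots**n for n in range(p + 1))

def fit(X, y, p):
    """Return a function Xq -> sum_{n<=p} G_n[x] (degree-p implicit Wiener model)."""
    K = k_p(X, X, p)
    coef = np.linalg.lstsq(K, y, rcond=None)[0]
    return lambda Xq: k_p(Xq, X, p) @ coef

rng = np.random.default_rng(1)
N, m = 300, 6
X = rng.normal(size=(N, m))
y = 1.0 + 2.0 * X[:, 0] + X[:, 1] * X[:, 2]          # assumed toy target (orders 0-2)

models = {p: fit(X, y, p) for p in range(0, 4)}
Xq = rng.normal(size=(4, m))
for n in range(1, 4):
    G_n = models[n](Xq) - models[n - 1](Xq)           # Eq. (14)
    print("order", n, G_n)
# For this target the order-3 contribution should be close to zero,
# while orders 1 and 2 carry the linear and pairwise terms.
```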
Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition, which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following

Theorem 1  The functionals obtained from Eq. (14) fulfill the orthogonality condition

E[m(x) Gp[x]] = 0,    (15)

where E denotes the expectation over the input distribution and m(x) an arbitrary ith-order monomial with i < p.

We will show that this is a consequence of the least-squares fit of any linear expansion in a set of M basis functions of the form y = Σ_{j=1}^{M} γj ϕj(x). In the case of the Wiener and Volterra expansions, the basis functions ϕj(x) are monomials of the components of x. We denote the error of the expansion as e(x) = y − Σ_{j=1}^{M} γj ϕj(x). The minimum of the expected quadratic loss L with respect to the expansion coefficient γk is given by

∂L/∂γk = (∂/∂γk) E[ e(x)² ] = −2 E[ϕk(x) e(x)] = 0.    (16)

This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is given by the sum of the higher-order Wiener functionals, e(x) = Σ_{n=p}^{∞} Gn[x], so Gp[x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies

Σ_{n=p}^{∞} E[ϕk(x) Gn[x]] = 0   and   Σ_{n=p+1}^{∞} E[ϕk(x) Gn[x]] = 0    (17)

for any ϕk of order less than p. The difference of both equations yields E[ϕk(x) Gp[x]] = 0, so that Gp[x] must be orthogonal to any of the lower-order basis functions, namely to all monomials with order smaller than p.

3 Experiments

Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 × 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 × 5 neighbourhood, we predicted the central pixel using (11).

Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings.

Models of orders 0 . . . 3 were learned from the image by extracting the maximal training set of 76 × 106 patches of size 5 × 5 (in contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand). The corresponding image components of order 0 to 3 were computed according to (14). Note that the different components generated by the Wiener functionals can also be negative; in Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image, thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test images shown in the corresponding row in Fig. 1.
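The toy experiment just described can be sketched as follows. This is a simplified, assumed setup — a small synthetic line image instead of the 80 × 110 test images — with training and test patches identical, as in the paper; the reconstruction error should drop sharply once second-order interactions are included, mirroring Fig. 1.

```python
# Sketch: predict the central pixel of 5x5 neighbourhoods with implicit Wiener
# models of order 0..3 and report the reconstruction mse for each order.
import numpy as np

def k_p(X1, X2, p):
    dots = X1 @ X2.T
    return sum(dots**n for n in range(p + 1))

# assumed toy image: a few oriented lines on a 40x40 canvas
img = np.zeros((40, 40))
for i in range(40):
    img[i, i] = 1.0                     # diagonal line
    img[20, i] = 1.0                    # horizontal line
    if 0 <= 35 - i < 40:
        img[35 - i, i] = 1.0            # anti-diagonal line

d = 5
patches, targets = [], []
for i in range(40 - d + 1):
    for j in range(40 - d + 1):
        patch = img[i:i + d, j:j + d].ravel()
        targets.append(patch[d * d // 2])                 # central pixel
        patches.append(np.delete(patch, d * d // 2))      # its 24 neighbours
X, y = np.array(patches), np.array(targets)

for p in range(0, 4):
    K = k_p(X, X, p)
    coef = np.linalg.lstsq(K, y, rcond=None)[0]
    y_hat = K @ coef                    # training set = test set, as in the paper
    print("order", p, "mse:", np.mean((y - y_hat)**2))
```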
Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings, which cannot be predicted to a higher accuracy than one pixel.

Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 × 110 is depicted in Fig. 2).

Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order.

Our results on a set of 10 natural images of size 50 × 70 show an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models.

Figure 3: Mean square reconstruction error of models of different order for a set of 10 natural images.

4 Conclusion

The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to the classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners.

There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created.
In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most of the currently used low-level feature detectors such as edge or corner detectors maximally use third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g. [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions: 1. A representative image database has to be analyzed. The images must be carefully calibrated since nonlinear statistics can be highly calibration-sensitive. In addition, the contribution of image noise has to be investigated. 2. Currently, only images up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. 11. To accommodate larger images, our method has to be reformulated as an iterative algorithm. 3. So far, we only considered 5 × 5 patches. To systematically investigate patch size effects, the analysis has to be conducted in a multi-scale framework.

References
[1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-ROM Proc. 2002 IFAC World Congress, 2002.
[2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559–601, 1994.
[3] M. O. Franz and B. Schölkopf. Implicit Wiener series. Technical Report 114, Max-Planck-Institut für biologische Kybernetik, Tübingen, June 2003.
[4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989.
[6] B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, 2001.
[8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt. Soc. Am. A, 16(7):1549–1553, 1999.
[9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Comput. Neural Syst., 12:271–287, 2001.
2 0.7965948 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
Author: Guy Shani, Ronen I. Brafman
Abstract: Agents learning to act in a partially observable domain may need to overcome the problem of perceptual aliasing – i.e., different states that appear similar but require different responses. This problem is exacerbated when the agent’s sensors are noisy, i.e., sensors may produce different observations in the same state. We show that many well-known reinforcement learning methods designed to deal with perceptual aliasing, such as Utile Suffix Memory, finite size history windows, eligibility traces, and memory bits, do not handle noisy sensors well. We suggest a new algorithm, Noisy Utile Suffix Memory (NUSM), based on USM, that uses a weighted classification of observed trajectories. We compare NUSM to the above methods and show it to be more robust to noise.
3 0.62798673 68 nips-2004-Face Detection --- Efficient and Rank Deficient
Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1
4 0.62404591 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.62328231 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng
Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1
6 0.62318581 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.62160581 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.62089843 131 nips-2004-Non-Local Manifold Tangent Learning
9 0.62063992 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
10 0.62040246 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
11 0.62030315 161 nips-2004-Self-Tuning Spectral Clustering
12 0.61973828 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
13 0.61967677 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
14 0.61845195 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
15 0.61827117 103 nips-2004-Limits of Spectral Clustering
16 0.61813307 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
17 0.61708784 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
18 0.61660105 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
19 0.61601853 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
20 0.61590362 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications