nips nips2009 nips2009-231 knowledge-graph by maker-knowledge-mining

231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing


Source: pdf

Author: Ruben Coen-Cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gabor–like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. [sent-8, score-0.169]

2 Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. [sent-9, score-0.283]

3 Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gabor–like receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. [sent-11, score-0.258]

4 The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. [sent-12, score-1.207]

5 1 Introduction That visual input at a given point is greatly influenced by its spatial context is manifest in a host of neural and perceptual effects (see, e. [sent-13, score-0.287]

6 ,[11–13]) and contour integration [14], and V1’s suggested role in computing salience has been realized in a large-scale dynamical model [1, 15]. [sent-20, score-0.404]

7 However, these have not substantially encompassed neurophysiological data or indeed made connections with the perceptual literature on contour integration and the tilt illusion. [sent-24, score-0.536]

8 Our aim is to build a principled model based on scene statistics that can ultimately account for, and therefore unify, the whole set of contextual effects above. [sent-25, score-0.279]

9 However, contextual effects emerge from the interactions among multiple filters; therefore here we address the much less well studied issue of the learned, statistical, basis of the coordination of the group of filters — the scene–dependent, linear and non–linear interactions among them. [sent-29, score-0.192]

10 As yet, the GSM (Gaussian Scale Mixture) model has not been applied to the wide range of contextual phenomena discussed above. [sent-34, score-0.191]

11 This is partly because linear correlations, which appear important to capture phenomena such as contour integration, have largely been ignored outside image processing (e. [sent-35, score-0.233]

12 Recent work has shown that incorporating a simple, predetermined, solution to the assignment problem in a GSM could capture the tilt illusion [34]. [sent-39, score-0.48]

13 Further, the implications of assignment for cortical V1 data and salience have not been explored. [sent-41, score-0.338]

14 We then apply the model to contextual neural V1 data, noting its link to the tilt illusion (section 3); and then to perceptual salience examples (section 4). [sent-43, score-0.842]

15 2 Methods A recent focus in natural image statistics has been the joint conditional histograms of the activations of pairs of oriented linear filters (throughout the paper, filters come from the first level of a steerable pyramid with 4 orientations [36]). [sent-47, score-0.278]
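
The bowtie statistics referred to here are easy to visualize directly. The sketch below (Python/numpy, not the authors' code) filters an image with two oriented kernels and builds the joint conditional histogram of their activations; plain Gabor kernels stand in for the steerable-pyramid filters used in the paper, and load_image is a hypothetical grayscale loader.

```python
import numpy as np

def gabor(size=15, theta=0.0, freq=0.15, sigma=4.0):
    """Odd-phase Gabor kernel at orientation theta (radians)."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    xr = X * np.cos(theta) + Y * np.sin(theta)
    yr = -X * np.sin(theta) + Y * np.cos(theta)
    g = np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.sin(2 * np.pi * freq * xr)
    return g - g.mean()

def filter_responses(img, kernel):
    """'valid' 2-D correlation of img with kernel."""
    k = kernel.shape[0]
    windows = np.lib.stride_tricks.sliding_window_view(img, (k, k))
    return np.einsum('ijkl,kl->ij', windows, kernel)

img = load_image('example.png')                      # hypothetical grayscale loader
x1 = filter_responses(img, gabor(theta=0.0)).ravel()
x2 = filter_responses(img, gabor(theta=np.pi / 4)).ravel()

# Joint histogram of the two activations; conditioning on x1 amounts to
# normalizing each x1-bin (row) separately, which produces the bowtie shape.
H, xedges, yedges = np.histogram2d(x1, x2, bins=41)
H_cond = H / np.maximum(H.sum(axis=1, keepdims=True), 1)
```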

16 First, in addition to the variance dependency, filters which are close enough in space and feature space are linearly dependent, as shown by the tilt of the bowtie in fig. [sent-51, score-0.273]

17 This matrix can be approximated by the sample covariance matrix of the filter activations or learned directly [23]; here, we learn it by maximizing the likelihood of the observed data. [sent-54, score-0.199]

18 The second issue is that filter dependencies differ across image patches, implying that there is no fixed relationship between mixers and filters [28]. [sent-55, score-0.155]

19 The general issue of learning multiple pools of filters, each assigned to a different mixer on a patch–dependent basis, has been addressed in recent work [30], but using a computationally and biologically impracticable scheme [37] which allowed for arbitrary pooling. [sent-56, score-0.203]

20 We consider an approximation to the assignment problem, by allowing a group of surround filters either to share or not to share the same mixer with a target filter. [sent-57, score-0.615]

21 1 The generative model The basic repeating unit of our simplified model involves center and surround groups of filters: we use nc to denote the number of center filters, and xc their activations; similarly, we use ns and xs . [sent-60, score-0.8]

22 We consider a single assignment choice as to whether the center group’s mixer variable vc is (case ξ1), or is not (case ξ2), shared with the surround, which in the latter case would have its own mixer variable vs. [sent-69, score-0.656]
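
As a concrete reading of this generative unit, the sketch below (an illustration, not the paper's exact parameterisation) draws a center+surround activation vector under the two configurations. A Rayleigh prior on the mixer and x = sqrt(v)·g are assumptions made here for definiteness; in the ξ2 case the center and surround blocks of a single covariance are simply reused, whereas the paper learns separate covariances per configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(Sigma_cs, nc, k=0.5):
    """Sample one (center, surround) activation vector x and its assignment."""
    n = Sigma_cs.shape[0]
    shared = rng.random() < k                            # xi_1 (shared mixer) vs xi_2
    if shared:
        g = rng.multivariate_normal(np.zeros(n), Sigma_cs)
        v = rng.rayleigh()                               # common mixer v_c = v_s
        x = np.sqrt(v) * g
    else:
        g_c = rng.multivariate_normal(np.zeros(nc), Sigma_cs[:nc, :nc])
        g_s = rng.multivariate_normal(np.zeros(n - nc), Sigma_cs[nc:, nc:])
        x = np.concatenate([np.sqrt(rng.rayleigh()) * g_c,   # center mixer v_c
                            np.sqrt(rng.rayleigh()) * g_s])  # independent surround mixer v_s
    return x, shared

nc, ns = 2, 8
Sigma = 0.5 * np.eye(nc + ns) + 0.5                      # toy covariance (positive definite)
x, shared = sample_unit(Sigma, nc)
```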

23 We show this from the perspective of the center group, since in the implementation we will be reporting model neuron responses in the center location given the contextual surround. [sent-75, score-0.504]

24 E-step: In the E-step we compute an estimate, Q, of the posterior distribution over the assignment variable, given the filter activations and the previous estimates of the parameters, namely k_old and Σ_old. [sent-88, score-0.258]
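
In code, the E-step is a responsibility computation over the two configurations. The sketch below is a rough stand-in for the paper's closed-form (Bessel-function) expressions: it marginalises the mixer numerically under an assumed Rayleigh prior, reusing the toy parameterisation of the previous sketch.

```python
import numpy as np

V_GRID = np.linspace(1e-3, 10.0, 400)          # grid for numeric marginalisation over v
DV = V_GRID[1] - V_GRID[0]

def gsm_loglik(x, Sigma):
    """log p(x) for x = sqrt(v)*g, g ~ N(0, Sigma), v ~ Rayleigh(1), by numeric integration."""
    n = len(x)
    _, logdet = np.linalg.slogdet(Sigma)
    lam = x @ np.linalg.solve(Sigma, x)                        # lambda = x^T Sigma^{-1} x
    log_pv = np.log(V_GRID) - V_GRID**2 / 2                    # Rayleigh(1) log-density
    log_px_v = -0.5 * (n * np.log(2 * np.pi * V_GRID) + logdet + lam / V_GRID)
    log_joint = log_px_v + log_pv
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum() * DV)

def q_shared(x, Sigma_cs, Sigma_c, Sigma_s, nc, k):
    """Q(xi_1): posterior probability that center and surround share one mixer."""
    ll1 = gsm_loglik(x, Sigma_cs)                                    # shared mixer
    ll2 = gsm_loglik(x[:nc], Sigma_c) + gsm_loglik(x[nc:], Sigma_s)  # separate mixers
    a1, a2 = np.log(k) + ll1, np.log(1.0 - k) + ll2
    return float(np.exp(a1 - np.logaddexp(a1, a2)))
```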

25 This requires an explicit form for the gradient: ∂f/∂Σ_cs^{-1} = Q(ξ1) [ Σ_cs/2 − (x x^T / (2 λ_cs)) · B(−n_cs/2; λ_cs) / B(1 − n_cs/2; λ_cs) ]   (9) Similar expressions hold for the other partial derivatives. [sent-96, score-0.771]

26 In practice, we add the constraint that the covariances of the surround filters are spatially symmetric. [sent-97, score-0.343]

27 3 Inference: patch–by–patch assignment and model neural unit Upon convergence of EM, the covariance matrices and prior k over the assignment are found. [sent-99, score-0.314]

28 Then, for a new image patch, the probability p(ξ1 | x) that the surround shares a common mixer with the center is inferred. [sent-100, score-0.677]

29 The output of the center group is taken to be the estimate (for the present, we consider just the mean) of the Gaussian component E [gc | x], which we take to be our model neural unit response. [sent-101, score-0.142]

30 To estimate the normalized response of the center filter, we need to compute the following expected value under the full model: E[gc | x] = ∫ dgc gc p(gc | x) = p(ξ1 | x) E[gc | x, ξ1] + p(ξ2 | x) E[gc | xc, ξ2]   (10) the r. [sent-102, score-0.381]
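
Read this way, eq. (10) turns into a small inference routine: the unit's output mixes the two configuration-conditional estimates, weighted by the assignment posterior. Under the toy parameterisation used in the sketches above (x = sqrt(v)·g, Rayleigh mixer), E[gc | x, ξ] is just xc times a posterior expectation of v^(−1/2), again computed numerically in place of the paper's Bessel-function forms; V_GRID is the grid from the E-step sketch.

```python
import numpy as np

def mean_inv_sqrt_v(x, Sigma):
    """E[v^{-1/2} | x] under the toy GSM of the previous sketches."""
    n = len(x)
    _, logdet = np.linalg.slogdet(Sigma)
    lam = x @ np.linalg.solve(Sigma, x)
    log_w = (np.log(V_GRID) - V_GRID**2 / 2                    # Rayleigh prior on v
             - 0.5 * (n * np.log(2 * np.pi * V_GRID) + logdet + lam / V_GRID))
    w = np.exp(log_w - log_w.max())                            # unnormalised posterior over v
    return float(np.sum(w / np.sqrt(V_GRID)) / np.sum(w))

def model_response(x, Sigma_cs, Sigma_c, nc, q1):
    """E[g_c | x] of eq. (10); q1 = Q(xi_1 | x) from the E-step sketch."""
    xc = x[:nc]
    e1 = xc * mean_inv_sqrt_v(x, Sigma_cs)       # xi_1: surround is in the center's gain pool
    e2 = xc * mean_inv_sqrt_v(xc, Sigma_c)       # xi_2: center is normalised on its own
    return q1 * e1 + (1.0 - q1) * e2
```

Dividing xc by an estimate of the mixer in this way is exactly the divisive-normalization reading of the model noted in the next sentence.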

31 Note that in either configuration, the mixer variable’s effect on this is a form of divisive normalization or gain control, through λ (including for stability, as in [30], an additive constant set to 1 for the λ values; we omit the formulæ to save space). [sent-109, score-0.38]

32 Note also that, due to the presence of the inverse covariance matrix in λ_cs = x^T Σ_cs^{-1} x, the gain control signal is reduced when there is strong covariance, which in turn enhances the neural unit response. [sent-111, score-0.411]
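
A two-filter toy calculation makes the point concrete; the numbers are invented purely for the demonstration.

```python
import numpy as np

x = np.array([1.0, 1.0])                        # two strongly co-active (e.g. collinear) filters
Sigma_full = np.array([[1.0, 0.8],
                       [0.8, 1.0]])             # learned covariance with strong dependence
Sigma_diag = np.eye(2)                          # reduced model: diagonal covariance

lam_full = x @ np.linalg.solve(Sigma_full, x)   # ~1.11
lam_diag = x @ np.linalg.solve(Sigma_diag, x)   # 2.00
# The smaller lambda under the full covariance means weaker divisive gain control,
# hence a relatively enhanced response for patterns the covariance "expects".
```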

33 We take as our model neuron the absolute value of the complex activation composed of the non–linear responses (eq. [sent-114, score-0.163]

34 First, we measure the so-called Area Summation curve, namely the response to gratings that are optimal in orientation and spatial frequency, as a function of size. [sent-117, score-0.257]
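
For reference, a minimal way to synthesise this stimulus set (with illustrative parameter values, not those used in the paper): a grating at the preferred orientation and spatial frequency, windowed by a circular aperture whose diameter is swept at each contrast, with each patch then fed to the model unit.

```python
import numpy as np

def grating_patch(size=64, freq=0.15, theta=0.0, contrast=1.0, diameter=32):
    """Circularly windowed sinusoidal grating, values in [-contrast, contrast]."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    carrier = np.sin(2 * np.pi * freq * (X * np.cos(theta) + Y * np.sin(theta)))
    aperture = (X**2 + Y**2) <= (diameter / 2) ** 2
    return contrast * carrier * aperture

# One stimulus per (contrast, diameter) pair, as in an area-summation protocol.
stimuli = {(c, d): grating_patch(contrast=c, diameter=d)
           for c in (0.1, 0.8) for d in (4, 8, 16, 32, 48)}
```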

35 Cat and monkey experiments have shown striking non–linearities, with the peak response at low contrasts occurring at significantly larger diameters than at high contrasts (Figure 2a). [sent-118, score-0.161]

36 This behavior is due to the assignment: for small grating sizes, center and surround have a higher posterior probability of sharing a mixer at high contrast than at low contrast, and therefore the surround exerts stronger gain control. [sent-120, score-0.908]

37 We then assess the modulatory effect of a surround grating on a fixed, optimally–oriented central grating, as a function of their relative orientations (Figure 3a). [sent-122, score-0.547]

38 As is common, we determine the spatial extent of the center and surround stimuli based on the area summation curves (see [4]). [sent-123, score-0.494]

39 The model simulations (Figure 3b), as in the data, exhibit the most reduced responses when the center and surround have similar orientation (but note the “blip” when they are exactly equal in Figure 3a,b, which arises in the model from the covariance of the Gaussian; see also [31]). [sent-124, score-0.84]

40 Figure 2: Area summation curves show the normalized firing rate of a neuron in response to optimal gratings of increasing size. [sent-126, score-0.167]

41 3; (c) a reduced model assuming that the surround filters are always in the gain pool of the center filter. [sent-128, score-0.559]

42 (a) and (b): normalized firing rate in response to a stimulus composed of an optimal central grating surrounded by an annular grating of varying orientation, for (a) a V1 neuron, after [3]; and (b) the model neuron described in Sec. [sent-130, score-0.428]

43 3 (c) Probability that the surround normalizes the center as a function of the relative orientation of the annular grating. [sent-131, score-0.619]

44 as the orientation difference between center and surround grows, the response increases and then decreases, an effect that arises from the assignments. [sent-132, score-0.662]

45 Figure 3c shows the posterior assignment probability for the same two contrasts as in figure 3b, as a function of the surround orientation. [sent-136, score-0.504]

46 Note that a previous GSM population model assumed (but did not learn) this form of fall off of the posterior weights of figure 3c, and showed that it is a basis for explaining the so-called direct and indirect biases in the tilt illusion; i.e. [sent-138, score-0.291]

47 repulsion and attraction in the perception of a center stimulus orientation in the presence of a surround stimulus [34]. [sent-140, score-0.702]

48 Figure 4 compares the GSM model of [34], designed with parameters matched to perceptual data, to the result of our learned model. [sent-141, score-0.201]

49 4 Salience popout and contour integration simulations To address perceptual salience effects, we need a population model of oriented units. [sent-143, score-0.653]

50 We compute the non–linear response of each model neuron as in Sec. [sent-146, score-0.163]

51 3, and take the maximum across the four orientations as the population output, as in standard population decoding. [sent-147, score-0.14]

52 This is performed at each pixel of the input image, and the result is interpreted as a saliency map. [sent-148, score-0.339]
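
The population read-out described in the last two sentences can be sketched compactly. Below, a crude divisively normalised Gabor energy stands in for the full GSM unit response (the gabor and filter_responses helpers are from the earlier histogram sketch, and load_image is again a hypothetical loader); the salience at each pixel is the maximum over the four orientation channels.

```python
import numpy as np

def orientation_response(img, theta):
    """Stand-in for the model neuron response map at one orientation."""
    e = np.abs(filter_responses(img, gabor(theta=theta)))
    pooled = np.sqrt(np.mean(e ** 2))                  # crude global gain-control signal
    return e / (1.0 + pooled)

def saliency_map(img, n_orient=4):
    responses = [orientation_response(img, t)
                 for t in np.pi * np.arange(n_orient) / n_orient]
    return np.max(np.stack(responses), axis=0)         # max across orientations per pixel

sal = saliency_map(load_image('popout_stimulus.png'))  # brighter = more salient
```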

53 Input image and output saliency map (the brighter, the more salient) are shown in Fig. [sent-151, score-0.393]

54 As in [1], the target pops out since it is less suppressed by its own, orthogonally-oriented neighbors than the surround bars are by their parallel ones; here, this emerges straight from normative inference. [sent-153, score-0.462]

55 [8] quantified relative target saliency above detection threshold. Figure 4: The tilt illusion. [sent-154, score-0.558]

56 (a) Comparison of the learned GSM model (black, solid line with filled squares), with the GSM model in [34] (blue, solid line; parameters set to account for the illusion data of [39]), and the model in [34] with parameters modified to match the learned model (blue, dashed line). [sent-155, score-0.357]

57 The response of each neuron in the population is plotted as a function of the difference between the surround stimulus orientation and the preferred center stimulus orientation. [sent-156, score-0.869]

58 We assume all oriented neurons have identical properties to the learned vertical neuron (i. [sent-157, score-0.155]

59 The learned model is as in the previous section, but with filters of narrower orientation tuning (because of denser sampling of 16 orientations in the pyramid), which results in an earlier point on the x axis of maximal response. [sent-161, score-0.265]

60 (b) Simulations of the tilt illusion using the model in [34], based on parameters matched to the learned model (dashed line) versus parameters matched to the data of [39] (solid line). [sent-163, score-0.538]

61 as a function of the difference in orientation between target and distractors using luminance (Fig. [sent-164, score-0.168]

62 5c plots saliency from the model; it exhibits non–linear saturation for large orientation contrast, an effect that not all saliency models capture (see [17] for discussion). [sent-167, score-0.85]

63 5b-c) data, in both experiment and model; for the latter, this arises from differences in stimuli (gratings versus bars, how the center and surround extents were determined). [sent-170, score-0.451]

64 The second class of saliency effects involves collinear facilitation. [sent-171, score-0.594]

65 One example is the so-called border effect, shown in figure 6a – one side of the border, whose individual bars are collinear, is more salient than the other (e. [sent-172, score-0.202]

66 The middle and right plots in figure 6a depict the saliency map for the full model and a reduced model that uses a diagonal covariance matrix. [sent-175, score-0.519]

67 Notice that the reduced model also shows an enhancement of the collinear side of the border vs the parallel, due to the partial overlap of the linear receptive fields; but, as explained in Sec. [sent-176, score-0.49]

68 3, the higher covariance between collinear filters in the full model strengthens the effect. [sent-178, score-0.268]

69 To quantify the difference, we also report the ratio between the salience values on the collinear and parallel sides of the border, after subtracting the saliency value of the homogeneous regions: the lower value for the reduced model (1. [sent-179, score-0.833]

70 74 for the full model) shows that the full model enhances the collinear relative to the parallel side. [sent-181, score-0.266]

71 6b provides another, stronger example of collinear facilitation. [sent-188, score-0.188]

72 5 Discussion We have extended a standard GSM generative model of scene statistics to encompass contextual effects. [sent-189, score-0.212]

73 Using parameters learned from natural scenes, we showed that this model provides a promising account of neurophysiological data on area summation and center–surround orientation contrast, and perceptual data on the saliency of image elements. [sent-191, score-0.773]

74 This form of model has previously been applied to the tilt illusion [34], but had just assumed the assignments of figure 3c, in order to account for the indirect tilt illusion. [sent-192, score-0.633]

75 This model therefore unifies a wealth of data and ideas about contextual visual processing. [sent-194, score-0.229]

76 Figure 5: (a) An example of the stimulus and saliency map computed by the model. [sent-195, score-0.396]

77 (b) Perceptual data, reproduced after [8], and (c) model output, showing the saliency of the central bar as a function of the orientation contrast between center and surround. [sent-196, score-0.651]

78 Figure 6: (a) Border effect: the collinear side of the border is more salient than the parallel one; the center plot is the saliency map for the full model, the right plot for a reduced model with a diagonal covariance matrix. [sent-197, score-0.992]

79 (b) Another example of collinear facilitation: the center row of bars is more salient, relative to the background, when the bars are collinear (left) than when they are parallel (right). [sent-198, score-0.598]

80 In both (a) and (b), Col/Par is the ratio between the salience values on the collinear and parallel sides of the border, after subtracting the saliency value of the homogeneous regions. [sent-199, score-0.767]
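
As I read it, the Col/Par statistic is computed as below: mean salience on the collinear and parallel sides of the border, each with the mean salience of the homogeneous background subtracted, taken as a ratio. The region masks are assumed to be drawn by hand for the stimulus in question.

```python
import numpy as np

def col_par_ratio(sal, collinear_mask, parallel_mask, background_mask):
    """Ratio of background-subtracted salience on the collinear vs parallel side."""
    baseline = sal[background_mask].mean()
    s_col = sal[collinear_mask].mean() - baseline
    s_par = sal[parallel_mask].mean() - baseline
    return s_col / s_par
```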

81 Previous bottom-up models of divisive normalization, which were the original inspiration for the application by [22] of the GSM, can account for some neural non–linearities by learning divisive weights instead of assignments (e. [sent-202, score-0.188]

82 non-linear ICA [41], have also been proposed, but have been applied only to the orientation masking nonlinearity, therefore not addressing spatial context. [sent-208, score-0.18]

83 However, [35] has been applied only to the image processing domain, and the model of [31] has not been tied to the perceptual phenomena we have considered, nor to contrast data. [sent-212, score-0.261]

84 We showed that our assignment process, and the normalization that results, is a good match for (and thus a normative justification of) at least some of the results that [1, 15] captured in a dynamical realization of the V1 saliency hypothesis. [sent-215, score-0.516]

85 The covariance between the Gaussian components captures some aspects of the long range excitatory effects in that model, which permit contour integration. [sent-217, score-0.26]

86 Note also that dynamical models have not previously been applied to the same range of data (such as the tilt illusion). [sent-219, score-0.219]

87 Open theoretical issues include quantifying carefully the effect of the rather coarse assignment approximation, as well as the differences between the learned model and the idealized population model of the tilt illusion [34]. [sent-220, score-0.651]

88 This is critical to characterize psychophysical results on contrast detection in the face of noise and on orientation acuity, and also raises the issues aired by [31] as to how neural responses convey uncertainties. [sent-222, score-0.176]

89 Open experimental issues include a range of other contextual effects as to salience, contour integration, and even perceptual crowding. [sent-223, score-0.412]

90 The “silent” surround of V1 receptive fields: theory and experiments. [sent-241, score-0.42]

91 Selectivity and spatial distribution of signals from the receptive field surround in macaque V1 neurons. [sent-259, score-0.463]

92 Edge co-occurrence in natural images predicts contour grouping performance. [sent-303, score-0.149]

93 The role of feedback in shaping the extraclassical receptive field of cortical neurons: a recurrent network model. [sent-311, score-0.155]

94 Extraclassical receptive field phenomena and short-range connectivity in V1. [sent-316, score-0.143]

95 Computational modeling and exploration of contour integration for visual saliency. [sent-329, score-0.244]

96 Visual segmentation by contextual influences via intracortical interactions in primary visual cortex. [sent-333, score-0.195]

97 SUN: A Bayesian framework for saliency using natural statistics. [sent-354, score-0.375]

98 A multi-layer sparse coding network learns contour coding from natural images. [sent-411, score-0.149]

99 Soft mixer assignment in a hierarchical generative model of natural scene statistics. [sent-436, score-0.395]

100 Testing a V1 model: perceptual biases and saliency effects. [sent-517, score-0.446]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('surround', 0.343), ('saliency', 0.339), ('gsm', 0.258), ('cs', 0.257), ('tilt', 0.219), ('salience', 0.196), ('gc', 0.188), ('collinear', 0.188), ('lters', 0.178), ('mixer', 0.172), ('illusion', 0.161), ('orientation', 0.137), ('contextual', 0.125), ('border', 0.125), ('non', 0.114), ('contour', 0.113), ('center', 0.108), ('perceptual', 0.107), ('assignment', 0.1), ('divisive', 0.094), ('neuron', 0.09), ('activations', 0.089), ('covariance', 0.08), ('receptive', 0.077), ('grating', 0.072), ('mixers', 0.071), ('ncs', 0.071), ('proc', 0.071), ('visual', 0.07), ('vc', 0.07), ('old', 0.069), ('effects', 0.067), ('phenomena', 0.066), ('orientations', 0.064), ('integration', 0.061), ('contrasts', 0.061), ('lter', 0.059), ('stimulus', 0.057), ('ns', 0.056), ('image', 0.054), ('bowtie', 0.054), ('linearities', 0.054), ('scene', 0.053), ('xc', 0.046), ('parallel', 0.044), ('spatial', 0.043), ('vision', 0.043), ('gain', 0.042), ('cortical', 0.042), ('nc', 0.042), ('salient', 0.042), ('schwartz', 0.041), ('normative', 0.04), ('responses', 0.039), ('response', 0.039), ('population', 0.038), ('gratings', 0.038), ('conf', 0.038), ('normalization', 0.037), ('suppression', 0.036), ('neurophysiological', 0.036), ('natural', 0.036), ('aecom', 0.036), ('bronx', 0.036), ('clifford', 0.036), ('dvc', 0.036), ('extraclassical', 0.036), ('neurobiologically', 0.036), ('popout', 0.036), ('solomon', 0.036), ('effect', 0.035), ('oriented', 0.035), ('sci', 0.035), ('hyv', 0.035), ('bars', 0.035), ('patch', 0.034), ('model', 0.034), ('vs', 0.034), ('simulations', 0.033), ('gurations', 0.033), ('central', 0.033), ('simoncelli', 0.032), ('int', 0.032), ('reduced', 0.032), ('plausibility', 0.031), ('annular', 0.031), ('pools', 0.031), ('hurri', 0.031), ('distractors', 0.031), ('eld', 0.031), ('learned', 0.03), ('dependencies', 0.03), ('det', 0.03), ('surrounding', 0.03), ('matched', 0.03), ('gaussian', 0.029), ('accounts', 0.029), ('correlations', 0.029), ('xs', 0.029), ('facilitation', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

2 0.19004665 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells. 1

3 0.13154335 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

Author: Mingyuan Zhou, Haojun Chen, Lu Ren, Guillermo Sapiro, Lawrence Carin, John W. Paisley

Abstract: Non-parametric Bayesian techniques are considered for learning dictionaries for sparse image representations, with applications in denoising, inpainting and compressive sensing (CS). The beta process is employed as a prior for learning the dictionary, and this non-parametric method naturally infers an appropriate dictionary size. The Dirichlet process and a probit stick-breaking process are also considered to exploit structure within an image. The proposed method can learn a sparse dictionary in situ; training images may be exploited if available, but they are not required. Further, the noise variance need not be known, and can be nonstationary. Another virtue of the proposed method is that sequential inference can be readily employed, thereby allowing scaling to large images. Several example results are presented, using both Gibbs and variational Bayesian inference, with comparisons to other state-of-the-art approaches. 1

4 0.11896416 43 nips-2009-Bayesian estimation of orientation preference maps

Author: Sebastian Gerwinn, Leonard White, Matthias Kaschube, Matthias Bethge, Jakob H. Macke

Abstract: Imaging techniques such as optical imaging of intrinsic signals, 2-photon calcium imaging and voltage sensitive dye imaging can be used to measure the functional organization of visual cortex across different spatial and temporal scales. Here, we present Bayesian methods based on Gaussian processes for extracting topographic maps from functional imaging data. In particular, we focus on the estimation of orientation preference maps (OPMs) from intrinsic signal imaging data. We model the underlying map as a bivariate Gaussian process, with a prior covariance function that reflects known properties of OPMs, and a noise covariance adjusted to the data. The posterior mean can be interpreted as an optimally smoothed estimate of the map, and can be used for model based interpolations of the map from sparse measurements. By sampling from the posterior distribution, we can get error bars on statistical properties such as preferred orientations, pinwheel locations or pinwheel counts. Finally, the use of an explicit probabilistic model facilitates interpretation of parameters and quantitative model comparisons. We demonstrate our model both on simulated data and on intrinsic signaling data from ferret visual cortex. 1

5 0.1129203 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

Author: Yoshua Bengio, James S. Bergstra

Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hiddenlayer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features. 1

6 0.11099252 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.0991605 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

8 0.096840434 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

9 0.095426477 88 nips-2009-Extending Phase Mechanism to Differential Motion Opponency for Motion Pop-out

10 0.09523163 152 nips-2009-Measuring model complexity with the prior predictive

11 0.092634752 200 nips-2009-Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)

12 0.086247228 164 nips-2009-No evidence for active sparsification in the visual cortex

13 0.081132412 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

14 0.078006193 163 nips-2009-Neurometric function analysis of population codes

15 0.074248418 211 nips-2009-Segmenting Scenes by Matching Image Composites

16 0.071640983 151 nips-2009-Measuring Invariances in Deep Networks

17 0.069431953 13 nips-2009-A Neural Implementation of the Kalman Filter

18 0.065062113 137 nips-2009-Learning transport operators for image manifolds

19 0.062008642 237 nips-2009-Subject independent EEG-based BCI decoding

20 0.060910765 201 nips-2009-Region-based Segmentation and Object Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.18), (1, -0.193), (2, 0.069), (3, 0.07), (4, 0.023), (5, 0.051), (6, 0.049), (7, 0.059), (8, 0.057), (9, -0.112), (10, -0.076), (11, 0.008), (12, -0.0), (13, 0.075), (14, 0.071), (15, 0.038), (16, -0.013), (17, 0.115), (18, -0.025), (19, 0.098), (20, -0.014), (21, -0.084), (22, 0.031), (23, -0.119), (24, 0.052), (25, 0.066), (26, -0.066), (27, 0.025), (28, -0.006), (29, 0.049), (30, 0.003), (31, -0.073), (32, -0.01), (33, 0.062), (34, 0.007), (35, -0.022), (36, -0.016), (37, -0.076), (38, -0.025), (39, -0.023), (40, 0.003), (41, -0.046), (42, 0.035), (43, -0.013), (44, -0.145), (45, -0.047), (46, -0.038), (47, 0.127), (48, -0.028), (49, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93238252 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

2 0.70118022 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells. 1 Introduction and related work Many models of natural image statistics have been proposed in recent years [1, 2, 3, 4]. A common goal of many of these models is finding a representation in which components or sub-components of the image are made as independent or as sparse as possible [5, 6, 2]. This has been found to be a difficult goal, as natural images have a highly intricate structure and removing dependencies between components is hard [7]. In this work we take a different approach, instead of minimizing dependence between components we try to maximize a simple form of dependence - tree dependence. It would be useful to place this model in context of previous works about natural image statistics. Many earlier models are described by the marginal statistics solely, obtaining a factorial form of the likelihood: p(x) = pi (xi ) (1) i The most notable model of this approach is Independent Component Analysis (ICA), where one seeks to find a linear transformation which maximizes independence between components (thus fitting well with the aforementioned factorization). This model has been applied to many scenarios, and proved to be one of the great successes of natural image statistics modeling with the emergence of edge-filters [5]. This approach has two problems. The first is that dependencies between components are still very strong, even with those learned transformation seeking to remove them. Second, it has been shown that ICA achieves, after the learned transformation, only marginal gains when measured quantitatively against simpler method like PCA [7] in terms of redundancy reduction. A different approach was taken recently in the form of radial Gaussianization [8], in which components which are distributed in a radially symmetric manner are made independent by transforming them non-linearly into a radial Gaussian, and thus, independent from one another. A more elaborate approach, related to ICA, is Independent Subspace Component Analysis or ISA. In this model, one looks for independent subspaces of the data, while allowing the sub-components 1 Figure 1: Our model with respect to marginal models such as ICA (a), and ISA like models (b). Our model, being a tree based model (c), allows components to belong to more than one subspace, and the subspaces are not required to be independent. of each subspace to be dependent: p(x) = pk (xi∈K ) (2) k This model has been applied to natural images as well and has been shown to produce the emergence of phase invariant edge detectors, akin to complex cells in V1 [2]. Independent models have several shortcoming, but by far the most notable one is the fact that the resulting components are, in fact, highly dependent. First, dependency between the responses of ICA filters has been reported many times [2, 7]. 
Also, dependencies between ISA components has also been observed [9]. Given these robust dependencies between filter outputs, it is somewhat peculiar that in order to get simple cell properties one needs to assume independence. In this work we ask whether it is possible to obtain V1 like filters in a model that assumes dependence. In our model we assume the filter distribution can be described by a tree graphical model [10] (see Figure 1). Degenerate cases of tree graphical models include ICA (in which no edges are present) and ISA (in which edges are only present within a subspace). But in its non-degenerate form, our model assumes any two filter outputs may be dependent. We allow components to belong to more than one subspace, and as a result, do not require independence between them. 2 Model and learning Our model is comprised of three main components. Given a set of patches, we look for the parameters which maximize the likelihood of a whitened natural image patch z: N p(yi |ypai ; β) p(z; W, β, T ) = p(y1 ) (3) i=1 Where y = Wz, T is the tree structure, pai denotes the parent of node i and β is a parameter of the density model (see below for the details). The three components we are trying to learn are: 1. The filter matrix W, where every row defines one of the filters. The response of these filters is assumed to be tree-dependent. We assume that W is orthogonal (and is a rotation of a whitening transform). 2. The tree structure T which specifies which components are dependent on each other. 3. The probability density function for connected nodes in the tree, which specify the exact form of dependency between nodes. All three together describe a complete model for whitened natural image patches, allowing likelihood estimation and exact inference [11]. We perform the learning in an iterative manner: we start by learning the tree structure and density model from the entire data set, then, keeping the structure and density constant, we learn the filters via gradient ascent in mini-batches. Going back to the tree structure we repeat the process many times iteratively. It is important to note that both the filter set and tree structure are learned from the data, and are continuously updated during learning. In the following sections we will provide details on the specifics of each part of the model. 2 β=0.0 β=0.5 β=1.0 β=0.0 β=0.5 β=1.0 2 1 1 1 1 1 1 0 −1 0 −1 −2 −1 −2 −3 0 x1 2 0 x1 2 0 x1 2 −2 −3 −2 0 x1 2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 −3 −2 0 x1 2 −2 0 x1 2 Figure 2: Shape of the conditional (Left three plots) and joint (Right three plots) density model in log scale for several values of β, from dependence to independence. 2.1 Learning tree structure In their seminal paper, Chow and Liu showed how to learn the optimal tree structure approximation for a multidimensional probability density function [12]. This algorithm is easy to apply to this scenario, and requires just a few simple steps. First, given the current estimate for the filter matrix W, we calculate the response of each of the filters with all the patches in the data set. Using these responses, we calculate the mutual information between each pair of filters (nodes) to obtain a fully connected weighted graph. The final step is to find a maximal spanning tree over this graph. The resulting unrooted tree is the optimal tree approximation of the joint distribution function over all nodes. 
We note that the tree is unrooted, and the root can be chosen arbitrarily; this means that no node, or filter, is more important than the others - the direction in the tree graph is arbitrary as long as it is chosen in a consistent way.

2.2 Joint probability density functions

Gabor-filter responses on natural images exhibit highly kurtotic marginal distributions, with heavy tails and sharp peaks [13, 3, 14]. Joint pairwise distributions also exhibit this same shape, with varying degrees of dependency between the components [13, 2]. The density model we use captures the highly kurtotic nature of the distributions while still allowing varying degrees of dependence through a mixing variable. We use a mix of two forms of finite, zero-mean Gaussian Scale Mixtures (GSMs). In one, the components are assumed to be independent of each other; in the other, they are assumed to be spherically distributed. The mixing variable linearly interpolates between the two, allowing us to capture the whole range of dependencies:

p(x_1, x_2; \beta) = \beta \, p_{dep}(x_1, x_2) + (1 - \beta) \, p_{ind}(x_1, x_2)    (4)

When \beta = 1 the two components are dependent (unless p is Gaussian), whereas when \beta = 0 the two components are independent. For the density functions themselves, we use a finite GSM. The dependent case is a scale mixture of bivariate Gaussians:

p_{dep}(x_1, x_2) = \sum_k \pi_k \, N(x_1, x_2; \sigma_k^2 I)    (5)

while the independent case is a product of two univariate Gaussian mixtures:

p_{ind}(x_1, x_2) = \Big( \sum_k \pi_k \, N(x_1; \sigma_k^2) \Big) \Big( \sum_k \pi_k \, N(x_2; \sigma_k^2) \Big)    (6)

Estimating the parameters \pi_k and \sigma_k^2 of the GSM is done directly from the data using Expectation Maximization. These parameters are the same for all edges and are estimated only once, on the first iteration. See Figure 2 for a visualization of the conditional distribution functions for varying values of \beta. Note that the marginal distributions of the two types of joint distributions above are the same. The mixing parameter \beta is also estimated using EM, but this is done for each edge in the tree separately, allowing our model in principle to capture the fully independent case (ICA) and other degenerate models such as ISA.

2.3 Learning tree-dependent components

Given the current tree structure and density model, we can now learn the matrix W via gradient ascent on the log likelihood of the model. All learning is performed on whitened, dimensionally reduced patches. This means that W is an N × N rotation (orthonormal) matrix, where N is the number of dimensions after dimensionality reduction (see details below). Given an image patch z, we multiply it by W to get the response vector y:

y = Wz    (7)

We can then calculate the log likelihood of the given patch using the tree model (which we assume is constant at the moment):

\log p(y) = \log p(y_{root}) + \sum_{i=1}^{N} \log p(y_i \mid y_{pa_i})    (8)

where pa_i denotes the parent of node i. Taking the derivative with respect to the r-th row of W:

\frac{\partial \log p(y)}{\partial W_r} = \frac{\partial \log p(y)}{\partial y_r} \, z^T    (9)

where z is the whitened natural image patch. Finally, we can calculate the derivative of the log likelihood with respect to the r-th element of y:

\frac{\partial \log p(y)}{\partial y_r} = \frac{\partial \log p(y_{pa_r}, y_r)}{\partial y_r} + \sum_{c \in C(r)} \left( \frac{\partial \log p(y_r, y_c)}{\partial y_r} - \frac{\partial \log p(y_r)}{\partial y_r} \right)    (10)

where C(r) denotes the children of node r. In summary, the gradient ascent rule for updating the rotation matrix W is given by:

W_r^{t+1} = W_r^t + \eta \, \frac{\partial \log p(y)}{\partial y_r} \, z^T    (11)

where \eta is the learning rate constant. After each update, the rows of W are orthonormalized.
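A minimal sketch of the edge density of Eqs. (4)-(6) follows; it is not the authors' code, and the mixture weights pi_k and variances sigma_k^2 are illustrative assumptions (in the paper they are fit by EM).

import numpy as np

def gauss1d(x, var):
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def gauss2d_iso(x1, x2, var):
    return np.exp(-(x1**2 + x2**2) / (2 * var)) / (2 * np.pi * var)

def edge_density(x1, x2, beta, pi, sigma2):
    # p(x1, x2; beta) = beta * p_dep + (1 - beta) * p_ind, Eqs. (4)-(6).
    p_dep = sum(p * gauss2d_iso(x1, x2, v) for p, v in zip(pi, sigma2))
    p_ind = (sum(p * gauss1d(x1, v) for p, v in zip(pi, sigma2)) *
             sum(p * gauss1d(x2, v) for p, v in zip(pi, sigma2)))
    return beta * p_dep + (1 - beta) * p_ind

pi = np.array([0.6, 0.3, 0.1])        # assumed mixture weights
sigma2 = np.array([0.2, 1.0, 5.0])    # assumed mixture variances
print(edge_density(0.5, -1.2, beta=1.0, pi=pi, sigma2=sigma2))   # dependent
print(edge_density(0.5, -1.2, beta=0.0, pi=pi, sigma2=sigma2))   # independent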
This gradient ascent rule is applied for several hundred patches (see details below), after which the tree structure is learned again as described in Section 2.1, using the new filter matrix W; this process is repeated for many iterations.

3 Results and analysis

3.1 Validation

Before running the full algorithm on natural image data, we wanted to validate that it produces sensible results with simple synthetic data. We generated noise from four different models. The first is 1/f independent Gaussian noise with 8 Discrete Cosine Transform (DCT) filters. The second is a simple ICA model with 8 DCT filters and highly kurtotic marginals. The third is a simple ISA model: 4 subspaces, each with two filters from the DCT filter set, where the distribution within each subspace is a circular, highly kurtotic GSM and the subspaces are sampled independently. Finally, we generated data from a simple synthetic tree of DCT filters, using the same joint distributions as for the ISA model. These four synthetic random data sets were given to the algorithm; results can be seen in Figure 3 for the ICA, ISA and tree samples. In all cases the model learned the filters and distributions correctly, reproducing both the filters (up to rotations within the subspace in ISA) and the dependency structure between the different filters. In the case of 1/f Gaussian noise, any whitening transformation is equally likely and any value of \beta is equally likely; in this case the algorithm cannot recover the tree or the filters.

Figure 3: Validation of the algorithm. Noise was generated from three models - top row is ICA, middle row is ISA and bottom row is a tree model. Samples were then given to the algorithm. On the right are the resulting learned tree models. Presented are the learned filters, the tree model (with white edges meaning \beta = 0, black meaning \beta = 1 and grays intermediate values) and an example of a marginal histogram for one of the filters. It can be seen that in all cases all parts of the model were correctly learned. Filters in the ISA case were learned up to rotation within the subspace, and all filters were learned up to sign. \beta values for the ICA case were always below 0.1, as were the values of \beta between subspaces in ISA.

3.2 Learning from natural image patches

We then ran experiments with a set of natural images [9] (available at http://www.cis.hut.fi/projects/ica/imageica/). These images contain natural scenes such as mountains, fields and lakes. The data set was 50,000 patches, each 16 × 16 pixels. The patches' DC was removed and they were then whitened using PCA, with the dimension reduced from 256 to 128. The GSM for the density model had 16 components. Several initial conditions for the matrix W were tried (random rotations, identity), but this had little effect on the results. Mini-batches of 10 patches each were used for the gradient ascent: the gradients of the 10 patches were summed and then normalized to have unit norm. The learning rate constant \eta was set to 0.1. Tree-structure learning and estimation of the mixing variable \beta were done every 500 mini-batches. All in all, 50 iterations were done over the data set.

3.3 Filters and tree structure

Figures 4 and 5 show the learned filters (WQ, where Q is the whitening matrix) and the tree structure T learned from natural images. Unlike the ISA toy data in Figure 3, here a full tree was learned and \beta is approximately one for all edges. The GSM learned for the marginals was highly kurtotic.
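A hedged sketch of the preprocessing and update loop outlined above: per-patch DC removal, PCA whitening from 256 to 128 dimensions, unit-norm mini-batch gradients with eta = 0.1 as in Eq. (11), and re-orthonormalization of W after each step. The callable grad_wrt_y is an assumed placeholder for the derivative of Eq. (10) under the current tree and density model; this is not the authors' code.

import numpy as np

def pca_whiten(patches, n_dims=128):
    # patches: (num_patches, 256). Returns whitened data and the transform Q.
    X = patches - patches.mean(axis=1, keepdims=True)   # remove per-patch DC
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_dims]
    Q = evecs[:, order] / np.sqrt(evals[order])          # whitening transform
    return X @ Q, Q

def orthonormalize(W):
    # Project W back onto the rotation (orthonormal) matrices via SVD.
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

def train_step(W, batch, grad_wrt_y, eta=0.1):
    # One mini-batch update of Eq. (11); grad_wrt_y(y) returns dlogp/dy.
    G = np.zeros_like(W)
    for z in batch:
        y = W @ z
        G += np.outer(grad_wrt_y(y), z)
    G /= np.linalg.norm(G)                               # unit-norm gradient
    return orthonormalize(W + eta * G)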
It can be seen that the resulting filters are edge filters at varying scales, positions and orientations. This is similar to the result one gets when applying ICA to natural images [5, 15]. More interesting is the tree graph structure learned along with the filters, which is shown in Figure 5. It can be seen that neighboring filters (nodes) in the tree tend to have similar position, frequency and orientation. Figure 6 shows the correlation of optimal frequency, orientation and position for neighboring filters in the tree; all three are highly correlated. Also apparent in this figure is the fact that the optimal phase for neighboring filters shows no significant correlation. It has been suggested that filters which have the same orientation, frequency and position but different phase can be related to complex cells in V1 [2, 16].

Figure 4: Left: Filter set learned from 16 × 16 natural image patches. Filters are ordered by PCA eigenvalues, largest to smallest. The resulting filters are edge filters with different orientations, positions, frequencies and phases. Right: The "feature" set learned, that is, the columns of the pseudo-inverse of the filter set.

Figure 5: The learned tree graph structure and feature set. It can be seen that neighboring features on the graph have similar orientation, position and frequency. See Figure 4 for a better view of the feature details, and see text for full detail and analysis. Note that the figure is rotated CW.

Figure 6: Correlation of optimal parameters (orientation, frequency, phase, position) in neighboring nodes of the tree graph. Orientation, frequency and position are highly correlated, while phase appears entirely uncorrelated. This property of correlation in frequency and orientation, with no correlation in phase, is related to the ubiquitous energy model of complex cells in V1. See text for further details.

Figure 7: Left: Comparison of the log likelihood values of our model with PCA, ICA and ISA. Our model gives the highest likelihood. Right: Samples taken at random from ICA, ISA and our model. Samples from our model appear to contain more long-range structure.

3.4 Comparison to other models

Since our model is a generalization of both ICA and ISA, we use it to learn both models. To learn ICA we used the exact same data set, but the tree had no edges and was not learned from the data (alternatively, we could simply have set \beta = 0). For ISA we used a forest architecture of 2-node trees, setting \beta = 1 for all edges (which means a spherically symmetric distribution); no tree structure was learned. Both models produce edge filters similar to those we learn (and to those in [5, 15, 6]). The ISA model produces neighboring nodes with similar frequency and orientation but different phase, as was reported in [2]. We also compare to a simple PCA whitening transform, using the same whitening transform and marginals as in the ICA case, but setting W = I. We compare the likelihood each model assigns to a test set of natural image patches, different from the one used in training. There were 50,000 patches in the test set, and we calculate the mean log likelihood over the entire set. The table in Figure 7 shows the result: our model performs better in likelihood terms than both ICA and ISA.
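The test-set comparison of Section 3.4 amounts to averaging Eq. (8) over held-out patches. A small sketch follows, assuming a rooted parent-array representation of T and placeholder callables for the root marginal and edge-conditional densities (in the paper these would be the GSM densities of Section 2.2); it is not the authors' code.

import numpy as np

def tree_log_likelihood(y, parent, log_marginal, log_conditional):
    # Eq. (8): log p(y) = log p(y_root) + sum_i log p(y_i | y_parent(i)).
    ll = 0.0
    for i, pa in enumerate(parent):
        if pa < 0:                       # root node
            ll += log_marginal(y[i])
        else:
            ll += log_conditional(y[i], y[pa])
    return ll

def mean_log_likelihood(test_patches, W, parent, log_marginal, log_conditional):
    # test_patches: (num_patches, dim) whitened, dimension-reduced patches.
    return float(np.mean([tree_log_likelihood(W @ z, parent, log_marginal,
                                              log_conditional)
                          for z in test_patches]))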
Using a tree model, as opposed to more complex graphical models, allows for easy sampling from the model. Figure 7 shows 20 random samples taken from our tree model along with samples from the ICA and ISA models. Note the elongated structures (e.g., in the bottom-left sample) in the samples from the tree model, and compare them to patches sampled from the ICA and ISA models.

Figure 8: Left: Interpretation of the model. Given a patch, the response of all edge filters is computed ("simple cells"); then, at each edge, the responses of the corresponding nodes are squared and summed to produce the response of the "complex cell" this edge represents. Both the responses of the complex cells and of the simple cells are summed to produce the likelihood of the patch. Right: Response of a "complex cell" in our model to changing phase, frequency and orientation; the response on the y-axis is the sum of squares of the two filters in this "complex cell". Note that while the cell is selective to orientation and frequency, it is rather invariant to phase.

3.5 Tree models and complex cells

One way to interpret the model is to look at the likelihood of a given patch under it. For the case \beta = 1, substituting Equation 4 into Equation 3 yields:

\log L(z) = \sum_i \left[ \rho\!\left(\sqrt{y_i^2 + y_{pa_i}^2}\right) - \rho(|y_{pa_i}|) \right]    (12)

where \rho(x) = \log \sum_k \pi_k N(x; \sigma_k^2). This form of likelihood has an interesting similarity to models of complex cells in V1 [2, 4]. In Figure 8 we draw a simple two-layer network that computes the likelihood. The first layer applies linear filters ("simple cells") to the image patch, while the second layer sums the squared outputs of similarly oriented filters from the first layer, having different phases, which are connected in the tree ("complex cells"). The output also depends on the actual response of the "simple cell" layer. The likelihood here is maximized when the responses of both the parent filter y_{pa_i} and the child y_i are zero; but, given that one filter has responded with a non-zero value, the likelihood is maximized when the other filter also fires (see the conditional density in Figure 2). Figure 8 also shows an example of the phase invariance which is present in the learned model.
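A small sketch of Eq. (12) follows, assuming illustrative GSM parameters and a toy parent array; the explicit root term rho(|y_root|) is an added assumption for completeness and is not part of Eq. (12) as written. This is not the authors' code.

import numpy as np

def rho(x, pi, sigma2):
    # rho(x) = log sum_k pi_k N(x; sigma_k^2) for scalar x.
    comps = pi * np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.log(np.sum(comps)))

def log_likelihood_beta1(y, parent, pi, sigma2):
    # Eq. (12); parent[i] < 0 marks the root, which contributes rho(|y_root|).
    ll = 0.0
    for i, pa in enumerate(parent):
        if pa < 0:
            ll += rho(abs(y[i]), pi, sigma2)
        else:
            energy = np.sqrt(y[i]**2 + y[pa]**2)     # "complex cell" energy
            ll += rho(energy, pi, sigma2) - rho(abs(y[pa]), pi, sigma2)
    return ll

pi = np.array([0.5, 0.3, 0.2])
sigma2 = np.array([0.1, 1.0, 4.0])
y = np.random.laplace(size=6)
parent = [-1, 0, 0, 1, 2, 2]             # assumed toy tree
print(log_likelihood_beta1(y, parent, pi, sigma2))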

3 0.6129114 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

Author: Yoshua Bengio, James S. Bergstra

Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind achieves 1.50% error on MNIST. We also adopt an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

4 0.56729609 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

Author: Samuel Gershman, Ed Vul, Joshua B. Tenenbaum

Abstract: While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of real-world tasks, and it remains unclear how the human mind approximates Bayesian computations algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision.

5 0.5347836 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information, enough to warrant rapid image identification.

6 0.52928454 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.5215413 164 nips-2009-No evidence for active sparsification in the visual cortex

8 0.52083218 88 nips-2009-Extending Phase Mechanism to Differential Motion Opponency for Motion Pop-out

9 0.50044483 111 nips-2009-Hierarchical Modeling of Local Image Features through $L_p$-Nested Symmetric Distributions

10 0.48797348 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

11 0.46997067 152 nips-2009-Measuring model complexity with the prior predictive

12 0.46708134 13 nips-2009-A Neural Implementation of the Kalman Filter

13 0.46491751 137 nips-2009-Learning transport operators for image manifolds

14 0.44506627 43 nips-2009-Bayesian estimation of orientation preference maps

15 0.44502962 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

16 0.43354374 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

17 0.43146661 237 nips-2009-Subject independent EEG-based BCI decoding

18 0.42054459 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

19 0.41944247 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors

20 0.39926812 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.038), (24, 0.035), (25, 0.088), (35, 0.067), (36, 0.055), (37, 0.265), (39, 0.057), (42, 0.011), (57, 0.011), (58, 0.097), (62, 0.013), (71, 0.041), (81, 0.028), (86, 0.085), (91, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92577517 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

Author: Ian Stevenson, Konrad Koerding

Abstract: In many domains, humans appear to combine perceptual cues in a near-optimal, probabilistic fashion: two noisy pieces of information tend to be combined linearly with weights proportional to the precision of each cue. Here we present a case where structural information plays an important role. The presence of a background cue gives rise to the possibility of occlusion, and places a soft constraint on the location of a target - in effect propelling it forward. We present an ideal observer model of depth estimation for this situation where structural or ordinal information is important, and then fit the model to human data from a stereo-matching task. To test whether subjects are truly using ordinal cues in a probabilistic manner, we then vary the uncertainty of the task. We find that the model accurately predicts shifts in subjects' behavior. Our results indicate that the nervous system estimates depth ordering in a probabilistic fashion and estimates the structure of the visual scene during depth perception.

same-paper 2 0.80556881 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

3 0.75392896 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation. This setting is applicable to unsupervised learning on massive data sets, or on resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.

4 0.7144081 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because they take a metric for the task into account. There are two candidate metrics, Kakade's Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura's FIM for the state-action joint distribution, but all RL algorithms with NG have followed Kakade's approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation of gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near-optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.

5 0.65489334 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while preserving the computational simplicity and scalability of gradient methods. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended to online learning, resulting in a simple algorithm with the best regret bounds currently known for these problems.

6 0.59559429 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.57263374 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

8 0.56896448 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

9 0.5671702 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

10 0.56690371 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

11 0.56581771 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

12 0.56343025 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

13 0.56137806 211 nips-2009-Segmenting Scenes by Matching Image Composites

14 0.55858934 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

15 0.55791003 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

16 0.55705363 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

17 0.55672127 137 nips-2009-Learning transport operators for image manifolds

18 0.55649024 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

19 0.55540496 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

20 0.55298352 104 nips-2009-Group Sparse Coding