nips nips2004 nips2004-89 knowledge-graph by maker-knowledge-mining

89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

Source: pdf

Author: Erik G. Learned-Miller, Parvez Ahammad

Abstract: The correction of bias in magnetic resonance images is an important problem in medical image processing. Most previous approaches have used a maximum likelihood method to increase the likelihood of the pixels in a single image by adaptively estimating a correction to the unknown image bias field. The pixel likelihoods are defined either in terms of a pre-existing tissue model, or non-parametrically in terms of the image’s own pixel values. In both cases, the specific location of a pixel in the image is not used to calculate the likelihoods. We suggest a new approach in which we simultaneously eliminate the bias from a set of images of the same anatomy, but from different patients. We use the statistics from the same location across different images, rather than within an image, to eliminate bias fields from all of the images simultaneously. The method builds a “multi-resolution” non-parametric tissue model conditioned on image location while eliminating the bias fields associated with the original image set. We present experiments on both synthetic and real MR data sets, and present comparisons with other methods.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Most previous approaches have used a maximum likelihood method to increase the likelihood of the pixels in a single image by adaptively estimating a correction to the unknown image bias ﬁeld. [sent-3, score-1.204]

2 The pixel likelihoods are deﬁned either in terms of a pre-existing tissue model, or non-parametrically in terms of the image’s own pixel values. [sent-4, score-0.506]

3 In both cases, the speciﬁc location of a pixel in the image is not used to calculate the likelihoods. [sent-5, score-0.37]

4 We suggest a new approach in which we simultaneously eliminate the bias from a set of images of the same anatomy, but from different patients. [sent-6, score-0.843]

5 We use the statistics from the same location across different images, rather than within an image, to eliminate bias ﬁelds from all of the images simultaneously. [sent-7, score-0.935]

6 The method builds a “multi-resolution” non-parametric tissue model conditioned on image location while eliminating the bias ﬁelds associated with the original image set. [sent-8, score-1.355]

7 1 Introduction The problem of bias ﬁelds in magnetic resonance (MR) images is an important problem in medical imaging. [sent-10, score-0.942]

8 When a patient is imaged in the MR scanner, the goal is to obtain an image which is a function solely of the underlying tissue (left of Figure 1). [sent-12, score-0.595]

9 However, typically the desired anatomical image is corrupted by a multiplicative bias ﬁeld (2nd image of Figure 1) that is caused by engineering issues such as imperfections in the radio frequency coils used to record the MR signal. [sent-13, score-1.103]

10 The result is a corrupted image (3rd image of Figure 1). [sent-14, score-0.51]

11 The goal of MR bias correction is to estimate the uncorrupted image from the corrupted image. [sent-16, score-0.94]

12 [7] developed a statistical model using a discrete set of tissues, with the brightness distribution for each tissue type (in a bias-free image) represented by a one-dimensional Gaussian distribution. [sent-19, score-0.502]

13 An expectation-maximization (EM) procedure was then used to simultaneously estimate the bias field, the tissue type, and the residual noise. [sent-20, score-0.919]

14 While this method works well in many cases, it has several drawbacks: (1) Models must be developed a priori for each type of acquisition (for each different setting of the MR scanner), for each Figure 1: On the left is an idealized mid-axial MR image of the human brain with little or no bias field. [sent-21, score-1.007]

15 The second image is a simulated low-frequency bias ﬁeld. [sent-22, score-0.802]

16 The third image is the result of pixelwise multiplication of the image by the bias ﬁeld. [sent-24, score-1.105]

17 The goal of MR bias correction is to recover the low-bias image on the left from the biased image on the right. [sent-25, score-1.126]

18 On the right is the sine/cosine basis, used to construct band-limited bias fields (see text). [sent-26, score-0.544]

19 In addition, a discrete tissue model does not handle so-called partial volume effects in which a pixel represents a combination of several tissue types. [sent-31, score-0.715]

20 This occurs frequently since many pixels occur at tissue boundaries. [sent-32, score-0.432]

21 In that work, a non-parametric model of the tissue was developed from a single image. [sent-34, score-0.345]

22 Using the observation that the entropy of the pixel brightness distribution for a single image is likely to increase when a bias ﬁeld is added, Viola’s method postulates a bias-correction ﬁeld by minimizing the entropy of the resulting pixel brightness distribution. [sent-35, score-1.659]

23 (2) There is no mechanism for distinguishing between certain low-frequency image components and a bias ﬁeld. [sent-37, score-0.776]

24 That is, the method may mistake signal for noise in certain cases when removal of the true signal reduces the entropy of the brightness distribution. [sent-38, score-0.421]

25 It models tissue brightness non-parametrically, but uses data from multiple images to provide improved distribution estimates and alleviate the need for bias-free images for making a model. [sent-43, score-0.993]

26 2 The Image Model and Problem Formulation We assume we are given a set I of observed images Ii with 1 ≤ i ≤ N, as shown on the left side of Figure 2. [sent-46, score-0.26]

27 Each of these images is assumed to be the product of some bias-free image Li and a smooth bias ﬁeld Bi ∈ B . [sent-47, score-1.068]

28 We shall refer to the bias-free images as latent images (also called intrinsic images by some authors). [sent-48, score-0.902]

29 The set of all latent images shall be denoted L and the set of unknown bias ﬁelds B. [sent-49, score-0.926]

30 Then each observed image can be written as the product $I_i(x, y) = L_i(x, y) \cdot B_i(x, y)$, where (x, y) gives the pixel coordinates of each point, with P pixels per image. [sent-50, score-0.409]
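
The multiplicative image model can be illustrated with a minimal NumPy sketch. The array sizes, the random latent image, and the ramp-shaped bias field are illustrative assumptions, not data from the paper; the point is only that an observed image is the pixelwise product of a latent image and a bias field, so a correct bias estimate recovers the latent image by pixelwise division.

```python
import numpy as np

def apply_bias(latent, bias):
    """Observed image I = L * B (pixelwise product)."""
    return latent * bias

def remove_bias(observed, bias_estimate):
    """Latent estimate L = I / B; assumes the bias estimate is nonzero everywhere."""
    return observed / bias_estimate

rng = np.random.default_rng(0)
latent = rng.uniform(0.2, 1.0, size=(4, 4))   # hypothetical bias-free image
_, xx = np.mgrid[0:4, 0:4]
bias = 1.0 + 0.1 * xx                          # hypothetical smooth, ramp-like bias field
observed = apply_bias(latent, bias)
recovered = remove_bias(observed, bias)
```

In practice the bias field is unknown, which is why the method estimates it jointly across the image set rather than dividing by a known field.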

31 A pixel-stack through each image set is shown as the set of pixels corresponding to a particular location in each image (not necessarily the same tissue type). [sent-52, score-0.889]

32 Our method relies on the principle that the pixel-stack values will have lower entropy when the bias ﬁelds have been removed. [sent-53, score-0.743]

33 Figure 3 shows the simulated effect, on the distribution of values in a pixel-stack, of adding different bias ﬁelds to each image. [sent-54, score-0.57]

34 The latent image generation model assumes that each pixel is drawn from a fixed distribution $p_{x,y}(\cdot)$ which gives the probability of each gray value at the location (x, y) in the image. [sent-55, score-0.482]

35 Furthermore, we assume that all pixels in the latent image are independent, given the distributions from which they are drawn. [sent-56, score-0.399]

36 It is also assumed that the bias ﬁelds for each image are chosen independently from some ﬁxed distribution over bias ﬁelds. [sent-57, score-1.32]

37 Unlike most models for this problem which rely on statistical regularities within an image, we take a completely orthogonal approach by assuming that pixel values are independent given their image locations, but that pixel-stacks in general have low entropy when bias ﬁelds are removed. [sent-58, score-1.074]

38 We formulate the problem as a maximum a posteriori (MAP) problem, searching for the most probable bias ﬁelds given the set of observed images. [sent-59, score-0.544]

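39 $\ldots\ \hat{H}_{\mathrm{Vasicek}}\left(\frac{I_1(x, y)}{B_1(x, y)}, \ldots, \frac{I_N(x, y)}{B_N(x, y)}\right)$ (8) Here $H$ is the Shannon entropy, $-E[\log P(x)]$, and $\hat{H}_{\mathrm{Vasicek}}$ is a sample-based entropy estimator (footnote 1). Step (a) is just an application of Bayes' rule. [sent-67, score-0.398]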

40 1 The entropy estimator used is similar to Vasicek’s estimator [6], given (up to minor details) by ˆ HVasicek (Z 1 , . [sent-71, score-0.283]

41 Figure 2: On the left are a set of mid-coronal brain images from eight different infants, showing clear signs of bias ﬁelds. [sent-77, score-0.94]

42 Although there are probably no more than two or three tissue types represented by the pixel-stack, the brightness distribution through the pixel-stack has high empirical entropy due to the presence of different bias ﬁelds in each image. [sent-79, score-1.184]

43 On the right are a set of images that have been corrected using our bias ﬁeld removal algorithm. [sent-80, score-0.893]

44 While the images are still far from identical, the pixel-stack entropies have been reduced by mapping similar tissues to similar values in an “unsupervised” fashion, i. [sent-81, score-0.515]

45 (c) expresses the fact that the probability of the observed image given a particular bias ﬁeld is the same as the probability of the latent image associated with that observed image and bias ﬁeld. [sent-84, score-1.873]

46 The approximation (d) replaces the empirical mean of the log probability at each pixel with the negative entropy of the underlying distribution at that pixel. [sent-85, score-0.298]

47 This entropy is in turn estimated (e) using the entropy estimator of Vasicek [6] directly from the samples in the pixel-stack, without ever estimating the distributions px,y explicitly. [sent-86, score-0.44]
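
A sample-based entropy estimate of this kind can be sketched as follows. The exact estimator is truncated in the text above, so the m-spacing form below, the scaling constant (n + 1)/m, the default choice of m, and the small `eps` guard are all assumptions in the spirit of Vasicek's estimator rather than the authors' formula. The usage lines compare a simulated pixel-stack with two tight tissue modes against the same stack after each value is scaled by a different random bias.

```python
import numpy as np

def m_spacing_entropy(samples, m=None, eps=1e-12):
    """Vasicek-style entropy estimate from the m-spacings of the sorted samples."""
    z = np.sort(np.asarray(samples, dtype=float))
    n = z.size
    if m is None:
        m = max(1, int(round(np.sqrt(n))))   # common heuristic, assumed here
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < number of samples")
    spacings = z[m:] - z[:-m]                # n - m differences between order statistics
    # eps only guards against log(0) when samples repeat exactly
    return float(np.mean(np.log((n + 1) / m * spacings + eps)))

rng = np.random.default_rng(0)
tissue = rng.choice([0.3, 0.8], size=200)                      # two hypothetical tissue brightnesses
clean_stack = tissue + rng.normal(0.0, 0.01, size=200)         # pixel-stack without bias
biased_stack = clean_stack * rng.uniform(0.7, 1.3, size=200)   # a different bias value per image
h_clean = m_spacing_entropy(clean_stack)
h_biased = m_spacing_entropy(biased_stack)
```

No density for the pixel-stack is ever fitted; the estimate depends only on the sorted sample values, which matches the description of working directly from the samples.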

48 3 The Algorithm Using these ideas, it is straightforward to construct algorithms for joint bias field removal. [sent-89, score-0.567]

49 As mentioned above, we chose to optimize Equation (8) over the set of band-limited bias ﬁelds. [sent-90, score-0.544]

50 To do this, we parameterize the set of bias fields using the sine/cosine basis images shown on the right of Figure 1: $B_i = \sum_{j=1}^{25} \alpha_j \phi_j(x, y)$. [sent-91, score-0.804]

51 We optimize Equation (8) by simultaneously updating the bias field estimates (taking a step along the numerical gradient) for each image to reduce the overall entropy. [sent-92, score-0.838]
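
The basis parameterization can be sketched as below. The text states only that 25 sine/cosine basis images are used; the specific construction here (products of the five lowest-frequency one-dimensional functions along each axis) is an assumption chosen because it yields exactly 25 smooth basis images, and the function names are hypothetical. The usage lines build the "neutral" field in which only the constant (DC) coefficient is nonzero.

```python
import numpy as np

def sincos_basis(height, width):
    """Return 25 low-frequency basis images with shape (25, height, width)."""
    y = np.linspace(0.0, 1.0, height, endpoint=False)
    x = np.linspace(0.0, 1.0, width, endpoint=False)

    def one_d(t):
        # constant term plus the two lowest sine/cosine frequencies (assumed choice)
        return [np.ones_like(t),
                np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)]

    return np.stack([np.outer(fy, fx) for fy in one_d(y) for fx in one_d(x)])

def bias_field(alpha, basis):
    """B = sum_j alpha_j * phi_j, evaluated at every pixel."""
    return np.tensordot(alpha, basis, axes=1)

basis = sincos_basis(32, 32)
alpha = np.zeros(25)
alpha[0] = 1.0                       # DC coefficient 1, all others 0
neutral = bias_field(alpha, basis)   # constant field of ones
```

Restricting the field to a few low-frequency coefficients keeps every candidate bias field smooth, so the optimization cannot explain away sharp anatomical detail as bias.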

52 That is, at time step t, the coefﬁcients α j for each bias ﬁeld are updated using the latent image estimates and entropy estimates from time step t − 1. [sent-93, score-1.128]

53 After all α’s have been updated, a new set of latent images and pixel-stack entropies are calculated, and another gradient step is taken. [sent-94, score-0.525]

54 Though it is possible to do a full gradient descent to convergence by optimizing one image at a time, the optimization landscape tends to have more local minima for the last few images in the process. [sent-95, score-0.652]

55 The two sharp peaks in the brightness distribution represent two tissues which are commonly found at that particular pixel location. [sent-97, score-0.328]

56 On the right is the result of adding an independent bias ﬁeld to each image. [sent-98, score-0.544]

57 In this work, we seek to remove bias ﬁelds by seeking to reduce the entropy of the pixel-stack distribution to its original state. [sent-100, score-0.822]

58 Initialize the bias ﬁeld coefﬁcients for each image to 0, with the exception of the coefﬁcient for the DC-offset (the constant bias ﬁeld component), which is initialized to 1. [sent-104, score-1.32]

59 Compute the summed pixelwise entropies for the set of images with initial “neutral” bias ﬁeld corrections. [sent-107, score-1.036]

60 Calculate the numerical gradient ∇α HVasicek of (8) with respect to the bias ﬁeld coefﬁcients (α j ’s) for the current image. [sent-112, score-0.585]
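
One simultaneous update of all coefficients by numerical gradient can be sketched as follows. This is a minimal sketch, not the authors' implementation: the forward-difference scheme, the step sizes `delta` and `step`, the spacing parameter `m`, and the m-spacing entropy form are all assumptions, and images are flattened to rows of length P. It also does not guard against the degenerate entropy reduction obtained by shrinking all brightness values towards zero.

```python
import numpy as np

def stack_entropy_sum(latents, m):
    """Sum over pixels of an m-spacing entropy estimate; latents has shape (N, P)."""
    z = np.sort(latents, axis=0)
    n = z.shape[0]
    spacings = z[m:] - z[:-m]
    return float(np.sum(np.mean(np.log((n + 1) / m * spacings + 1e-12), axis=0)))

def total_entropy(images, alphas, basis, m):
    """images: (N, P), alphas: (N, K), basis: (K, P). Latent estimate is I / B."""
    biases = alphas @ basis
    return stack_entropy_sum(images / biases, m)

def gradient_step(images, alphas, basis, m=2, delta=1e-3, step=1e-2):
    """Forward-difference gradient at the current coefficients, then one joint update."""
    base = total_entropy(images, alphas, basis, m)
    grad = np.zeros_like(alphas)
    for i in range(alphas.shape[0]):
        for j in range(alphas.shape[1]):
            probe = alphas.copy()
            probe[i, j] += delta
            grad[i, j] = (total_entropy(images, probe, basis, m) - base) / delta
    return alphas - step * grad      # every image's coefficients move together

rng = np.random.default_rng(0)
images = rng.uniform(0.5, 1.5, size=(6, 16))            # hypothetical flattened images
basis = np.vstack([np.ones(16), np.sin(2 * np.pi * np.arange(16) / 16)])
alphas = np.zeros((6, 2))
alphas[:, 0] = 1.0                                       # neutral start: DC coefficient only
new_alphas = gradient_step(images, alphas, basis)
```

All gradients are evaluated at the coefficients from the previous step before any coefficient changes, which mirrors the description of updating every bias field from the estimates at time step t − 1.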

61 Upon convergence, it is assumed that the entropy has been reduced as much as possible by changing the bias ﬁelds, unless one or more of the gradient descents is stuck in a local minimum. [sent-116, score-0.808]

62 Empirically, the likelihood of sticking in local minima is dramatically reduced by increasing the number of images (N) in the optimization. [sent-117, score-0.307]

63 In our experiments described below with only 21 real infant brains, the algorithm appears to have found a global minimum of all bias ﬁelds, at least to the extent that this can be discerned visually. [sent-118, score-0.714]

64 Note that for a set of identical images, the pixel-stack entropies are not increased by multiplying each image by the same bias ﬁeld (since all images will still be the same). [sent-119, score-1.198]

65 More generally, when images are approximately equivalent, their pixel-stack entropies are not significantly affected by a “common” bias field, i. [sent-120, score-0.939]

66 2 This means that the algorithm cannot, in general, eliminate all bias ﬁelds from a set of images, but can only set all of the bias ﬁelds to be equivalent. [sent-123, score-1.127]

67 We refer to any constant bias ﬁeld remaining in all of the images after convergence as the residual bias ﬁeld. [sent-124, score-1.415]

68 2 Actually, multiplying each image by a bias ﬁeld of small magnitude can artiﬁcially reduce the entropy of a pixel-stack, but this is only the result of the brightness values shrinking towards zero. [sent-125, score-1.165]

69 Fortunately, there is an effect that tends to minimize the impact of the residual bias ﬁeld in many test cases. [sent-127, score-0.641]

70 In particular, the residual bias ﬁeld tends to consist of components for each α j that approximate the mean of that component across images. [sent-128, score-0.718]

71 For example, if half of the observed images have a positive value for a particular component’s coefﬁcient, and half have a negative coefﬁcient for that component, the residual bias ﬁeld will tend to have a coefﬁcient near zero for that component. [sent-129, score-0.896]

72 Hence, the algorithm naturally eliminates bias ﬁeld effects that are non-systematic, i. [sent-130, score-0.544]

73 If the same type of bias ﬁeld component occurs in a majority of the images, then the algorithm will not remove it, as the component is indistinguishable, under our model, from the underlying anatomy. [sent-133, score-0.688]

74 4 Experiments To test our algorithm, we ran two sets of experiments, the ﬁrst on synthetic images for validation, and the second on real brain images. [sent-137, score-0.451]

75 We obtained synthetic brain images from the BrainWeb project [8, 9] such as the one shown on the left of Figure 1. [sent-138, score-0.427]

76 These images can be considered “idealized” MR images in the sense that the brightness values for each tissue are constant (up to a small amount of manually added isotropic noise). [sent-139, score-0.988]

77 The initial goal was to ensure that our algorithm could remove synthetically added bias ﬁelds, in which the bias ﬁeld coefﬁcients were known. [sent-141, score-1.185]

78 Using K copies of a single “latent” image, we added known but different bias ﬁelds to each one. [sent-142, score-0.571]

79 For as few as ﬁve images, we could reliably recover the known bias ﬁeld coefﬁcients, up to a ﬁxed offset for each image, to within 1% of the power of the original bias coefﬁcients. [sent-143, score-1.088]

80 More interesting are the results on real images, in which the latent images come from different patients. [sent-144, score-0.373]

81 We obtained 21 pre-registered infant brain images (top of Figure 4) from Brigham and Women’s Hospital in Boston, Massachusetts. [sent-145, score-0.542]

82 Large bias ﬁelds can be seen in many of the images. [sent-146, score-0.544]

83 Probably the most striking is a “ramp-like” bias ﬁeld in the sixth image of the second row. [sent-147, score-0.776]

84 Because the brain’s white matter is not fully developed in these infant scans, it is difficult to categorize tissues into a fixed number of classes as is typically done for adult brain images; hence, these images are not amenable to methods based on specific tissue models developed for adults (e. [sent-149, score-1.193]

85 The middle third of Figure 4 shows the results of our algorithm on the infant brain images. [sent-152, score-0.325]

86 It is interesting to compare these results to a method that reduces the entropy of each image individually, without using constraints between images. [sent-156, score-0.431]

87 Using the results of our algorithm as a starting point, we continued to reduce the entropy of the pixels within each image (using a method akin to Viola’s [10]), rather than across images. [sent-157, score-0.592]

88 The proposed MAP method works under very broad conditions, the main condition being that the bias ﬁelds do not span the same space as parts of the actual medical images. [sent-160, score-0.604]

89 It is true, however, that as the latent images become less registered or differ in other ways, that a much larger number of images is needed to get good estimates of the pixel-stack distributions. [sent-161, score-0.641]

90 This is most likely because the entropy of the pixels within a particular image can be reduced by increasing the bias ﬁeld “correction” in the central part of the image. [sent-163, score-1.077]

91 In other words, the algorithm strives to make the image more uniform by removing the bright part in the middle of the image. [sent-164, score-0.326]

92 5 Discussion The idea of minimizing pixelwise entropies to remove nuisance variables from a set of images is not new. [sent-167, score-0.562]

93 [4, 5] presented an approach they call congealing in which the sum of pixelwise entropies is minimized by separate afﬁne transforms applied to each image. [sent-169, score-0.281]

94 Combining such approaches to do registration and bias removal simultaneously, or registration and lighting rectification of faces, for example, is an obvious direction for future work. [sent-171, score-0.749]

95 For “easy” bias correction problems, such an approach may be overkill, but for difﬁcult problems in bias correction, where the bias ﬁeld is difﬁcult to separate from the underlying tissue, as discussed in [1], such an approach could produce critical extra leverage. [sent-177, score-1.75]

96 Simon Warfield for graciously providing the infant brain images for this work. [sent-180, score-0.542]

97 The images were obtained under NIH grant P41 RR13218. [sent-181, score-0.26]

98 A uniﬁed variational approach to denoising and bias correction in MR. [sent-192, score-0.662]

99 Figure 4: NOTE: This image must be viewed in color (preferably on a bright display) for full effect. [sent-271, score-0.283]

100 The same images after bias removal with our algorithm. [sent-275, score-0.893]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bias', 0.544), ('tissue', 0.308), ('images', 0.26), ('image', 0.232), ('eld', 0.211), ('entropy', 0.199), ('elds', 0.175), ('hvasicek', 0.146), ('infant', 0.146), ('mr', 0.143), ('brain', 0.136), ('entropies', 0.135), ('brightness', 0.133), ('correction', 0.118), ('pixel', 0.099), ('maxp', 0.097), ('pixelwise', 0.097), ('vasicek', 0.097), ('wells', 0.097), ('tissues', 0.096), ('latent', 0.089), ('removal', 0.089), ('pixels', 0.078), ('residual', 0.067), ('arg', 0.065), ('viola', 0.065), ('matter', 0.061), ('medical', 0.06), ('registration', 0.058), ('coef', 0.057), ('across', 0.053), ('bright', 0.051), ('remove', 0.049), ('adults', 0.049), ('butter', 0.049), ('congealing', 0.049), ('imperfections', 0.049), ('corrupted', 0.046), ('middle', 0.043), ('white', 0.042), ('infants', 0.042), ('magnetic', 0.042), ('estimator', 0.042), ('miller', 0.041), ('gradient', 0.041), ('location', 0.039), ('eliminate', 0.039), ('landscape', 0.039), ('amherst', 0.039), ('mri', 0.039), ('scanner', 0.039), ('developed', 0.037), ('resonance', 0.036), ('li', 0.035), ('massachusetts', 0.034), ('patient', 0.034), ('idealized', 0.034), ('shall', 0.033), ('estimates', 0.032), ('smooth', 0.032), ('cients', 0.032), ('drawbacks', 0.032), ('bi', 0.032), ('synthetic', 0.031), ('reduce', 0.03), ('tends', 0.03), ('descent', 0.027), ('multiplying', 0.027), ('added', 0.027), ('simulated', 0.026), ('near', 0.025), ('bottom', 0.025), ('reduced', 0.024), ('real', 0.024), ('imaging', 0.024), ('type', 0.024), ('component', 0.024), ('structures', 0.023), ('minima', 0.023), ('occur', 0.023), ('occurs', 0.023), ('assumes', 0.023), ('fisher', 0.023), ('joint', 0.023), ('berkeley', 0.022), ('synthetically', 0.021), ('simulataneously', 0.021), ('grimson', 0.021), ('brains', 0.021), ('categorize', 0.021), ('cetin', 0.021), ('exaggerated', 0.021), ('imaged', 0.021), ('neil', 0.021), ('normality', 0.021), ('nuisance', 0.021), ('postulates', 0.021), ('proceeding', 0.021), ('reductions', 0.021), ('women', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

Author: Erik G. Learned-Miller, Parvez Ahammad

2 0.1190272 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

Author: Elizaveta Levina, Peter J. Bickel

Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1

3 0.10953551 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

Author: Giorgio Giacinto, Fabio Roli

Abstract: High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. These mechanisms require that the user judges the quality of the results of the query by marking all the retrieved images as being either relevant or not. Then, the search engine exploits this information to adapt the search to better meet user’s needs. At present, the vast majority of proposed relevance feedback mechanisms are formulated in terms of search model that has to be optimized. Such an optimization involves the modification of some search parameters so that the nearest neighbor of the query vector contains the largest number of relevant images. In this paper, a different approach to relevance feedback is proposed. After the user provides the first feedback, following retrievals are not based on knn search, but on the computation of a relevance score for each image of the database. This score is computed as a function of two distances, namely the distance from the nearest non-relevant image and the distance from the nearest relevant one. Images are then ranked according to this score and the top k images are displayed. Reported results on three image data sets show that the proposed mechanism outperforms other state-of-the-art relevance feedback mechanisms.
1 Introduction
A large number of content-based image retrieval (CBIR) systems rely on the vector representation of images in a multidimensional feature space representing low-level image characteristics, e.g., color, texture, shape, etc. [1]. Content-based queries are often expressed by visual examples in order to retrieve from the database the images that are “similar” to the examples. This kind of retrieval is often referred to as K nearest-neighbor retrieval.
It is easy to see that the effectiveness of content-based image retrieval systems (CBIR) strongly depends on the choice of the set of visual features, on the choice of the “metric” used to model the user’s perception of image similarity, and on the choice of the image used to query the database [1]. Typically, if we allow different users to mark the images retrieved with a given query as relevant or non-relevant, different subsets of images will be marked as relevant. Accordingly, the need for mechanisms to adapt the CBIR system response based on some feedback from the user is widely recognized. It is interesting to note that while relevance feedback mechanisms have been first introduced in the information retrieval field [2], they are receiving more attention in the CBIR field (Huang). The vast majority of relevance feedback techniques proposed in the literature is based on modifying the values of the search parameters as to better represent the concept the user bears in mind. To this end, search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. As an example, relevance feedback is often formulated in terms of the modification of the query vector, and/or in terms of adaptive similarity metrics. [3]-[7]. Recently, pattern classification paradigms such as SVMs have been proposed [8]. Feedback is thus used to model the concept of relevant images and adjust the search consequently. Concept modeling may be difficult on account of the distribution of relevant images in the selected feature space. “Narrow domain” image databases allows extracting good features, so that images bearing similar concepts belong to compact clusters. On the other hand, “broad domain” databases, such as image collection used by graphic professionals, or those made up of images from the Internet, are more difficult to subdivide in cluster because of the high variability of concepts [1]. 
In these cases, it is worth extracting only low level, non-specialized features, and image retrieval is better formulated in terms of a search problem rather than concept modeling. The present paper aims at offering an original contribution in this direction. Rather than modeling the concept of “relevance” the user bears in mind, feedback is used to assign each image of the database a relevance score. Such a score depends only on two dissimilarities (distances) computed against the images already marked by the user: the dissimilarity from the set of relevant images, and the dissimilarity from the set of non-relevant images. Despite its computational simplicity, this mechanism allows outperforming state-of-the-art relevance feedback mechanisms both on “narrow domain” databases, and on “broad domain” databases. This paper is organized as follows. Section 2 illustrates the idea behind the proposed mechanism and provides the basic assumptions. Section 3 details the proposed relevance feedback mechanism. Results on three image data sets are presented in Section 4, where performances of other relevance feedback mechanisms are compared. Conclusions are drawn in Section 5.
2 Instance-based relevance estimation
The proposed mechanism has been inspired by classification techniques based on the “nearest case” [9]-[10]. Nearest-case theory provided the mechanism to compute the dissimilarity of each image from the sets of relevant and non–relevant images. The ratio between the nearest relevant image and the nearest non-relevant image has been used to compute the degree of relevance of each image of the database [11]. The present section illustrates the rationale behind the use of the nearest-case paradigm. Let us assume that each image of the database has been represented by a number of low-level features, and that a (dis)similarity measure has been defined so that the proximity between pairs of images represents some kind of “conceptual” similarity.
In other words, the chosen feature space and similarity metric is meaningful at least for a restricted number of users. A search in image databases is usually performed by retrieving the k most similar images with respect to a given query. The dimension of k is usually small, to avoid displaying a large number of images at a time. Typical values for k are between 10 and 20. However, as the “relevant” images that the user wishes to retrieve may not fit perfectly with the similarity metric designed for the search engine, the user may be interested in exploring other regions of the feature space. To this end, the user marks the subset of “relevant” images out of the k retrieved. Usually, such relevance feedback is used to perform a new k-nn search by modifying some search parameters, i.e., the position of the query point, the similarity metric, and other tuning parameters [1]-[7]. Recent works proposed the use of support vector machine to learn the distribution of relevant images [8]. These techniques require some assumption about the general form of the distribution of relevant images in the feature space. As it is difficult to make any assumption about such a distribution for broad domain databases, we propose to exploit the information about the relevance of the images retrieved so far in a nearest-neighbor fashion. Nearest-neighbor techniques, as used in statistical pattern recognition, case-based reasoning, or instance-based learning, are effective in all applications where it is difficult to produce a high-level generalization of a “class” of objects [9]-[10],[12][13]. Relevance learning in content base image retrieval may well fit into this definition, as it is difficult to provide a general model that can be adapted to represent different concepts of similarity. In addition, the number of available cases may be too small to estimate the optimal set of parameters for such a general model. 
On the other hand, it can be more effective to use each “relevant” image as well as each “non-relevant” image, as “cases” or “instances” against which the images of the database should be compared. Consequently, we assume that an image is relevant to the extent that its dissimilarity from the nearest relevant image is small. Analogously, an image is non-relevant to the extent that its dissimilarity from the nearest non-relevant image is small.
3 Relevance Score Computation
According to previous section, each image of the database can be thus characterized by a “degree of relevance” and a “degree of non-relevance” according to the dissimilarities from the nearest relevant image, and from the nearest non-relevant image, respectively. However, it should be noted that these degrees should be treated differently because only “relevant” images represent a “concept” in the user’s mind, while “non-relevant” images may represent a number of other concepts different from user’s interest. In other words, while it is meaningful to treat the degree of relevance as a degree of membership to the class of relevant images, the same does not apply to the degree of non-relevance. For this reason, we propose to use the “degree of non-relevance” to weight the “degree of relevance”. Let us denote with R the subset of indexes j ∈ {1,...,k} related to the set of relevant images retrieved so far and the original query (that is relevant by default), and with NR the subset of indexes j ∈ {1,...,k} related to the set of non-relevant images retrieved so far. For each image I of the database, according to the nearest neighbor rule, let us compute the dissimilarity from the nearest image in R and the dissimilarity from the nearest image in NR. Let us denote these dissimilarities as dR(I) and dNR(I), respectively. The value of dR(I) can be clearly used to measure the degree of relevance of image I, assuming that small values of dR(I) are related to very relevant images.
On the other hand, the hypothesis that image I is relevant to the user’s query can be supported by a high value of dNR(I). Accordingly, we defined the relevance score $\mathrm{relevance}(I) = \left(1 + \frac{d_R(I)}{d_{NR}(I)}\right)^{-1}$

4 0.1030404 192 nips-2004-The power of feature clustering: An application to object detection

Author: Shai Avidan, Moshe Butman

Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classiﬁers, such as neural-networks, Bayes classiﬁers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classiﬁer. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable. 1

5 0.097549498 125 nips-2004-Multiple Relational Embedding

Author: Roland Memisevic, Geoffrey E. Hinton

Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples.

6 0.095955528 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

7 0.092402175 164 nips-2004-Semi-supervised Learning by Entropy Minimization

8 0.088368088 44 nips-2004-Conditional Random Fields for Object Recognition

9 0.083234094 168 nips-2004-Semigroup Kernels on Finite Sets

10 0.076133832 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

11 0.075972155 99 nips-2004-Learning Hyper-Features for Visual Identification

12 0.075451277 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

13 0.068926983 40 nips-2004-Common-Frame Model for Object Recognition

14 0.067393862 163 nips-2004-Semi-parametric Exponential Family PCA

15 0.066626325 68 nips-2004-Face Detection --- Efficient and Rank Deficient

16 0.066450633 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

17 0.065903567 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

18 0.063953131 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

19 0.061145972 179 nips-2004-Surface Reconstruction using Learned Shape Models

20 0.059446886 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.187), (1, 0.03), (2, -0.061), (3, -0.146), (4, 0.083), (5, 0.04), (6, 0.021), (7, -0.046), (8, 0.006), (9, 0.037), (10, -0.007), (11, -0.015), (12, 0.101), (13, -0.122), (14, -0.021), (15, -0.032), (16, -0.01), (17, 0.063), (18, 0.042), (19, -0.131), (20, 0.131), (21, 0.016), (22, -0.036), (23, 0.02), (24, -0.02), (25, 0.03), (26, 0.062), (27, -0.009), (28, -0.22), (29, -0.069), (30, 0.111), (31, 0.076), (32, -0.185), (33, 0.038), (34, -0.11), (35, -0.057), (36, 0.023), (37, 0.059), (38, -0.036), (39, -0.065), (40, 0.013), (41, -0.043), (42, -0.004), (43, -0.106), (44, 0.074), (45, 0.118), (46, -0.092), (47, 0.001), (48, -0.075), (49, -0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9838571 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

Author: Erik G. Learned-miller, Parvez Ahammad

2 0.58662492 192 nips-2004-The power of feature clustering: An application to object detection

Author: Shai Avidan, Moshe Butman

3 0.58402067 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

Author: Matthias O. Franz, Bernhard Schölkopf

Abstract: The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative approach in which multiplicative pixel interactions are described by a series of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images.

Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: from knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on. In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e., phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 16×16 sized image contains roughly $10^{12}$ coefficients, about $10^{10}$ of which would have to be estimated independently if all symmetries in the spectrum are considered.
First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9]. Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order. In the next section, we introduce the Wiener expansion and discuss its capability of modeling higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements.

1 Modeling pixel interactions with Wiener functionals

For our analysis, we adopt a prediction framework: given a $d \times d$ neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested in the extent to which interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value $y$ on its neighbours $x_i$, $i = 1, \ldots, m = d^2 - 1$, can be modeled as a series

$$y = H_0[x] + H_1[x] + H_2[x] + \cdots + H_n[x] + \cdots \qquad (1)$$

of discrete Volterra functionals

$$H_0[x] = h_0 = \text{const.} \quad \text{and} \quad H_n[x] = \sum_{i_1=1}^{m} \cdots \sum_{i_n=1}^{m} h^{(n)}_{i_1 \ldots i_n}\, x_{i_1} \cdots x_{i_n}. \qquad (2)$$

Here, we have stacked the gray values of the neighbourhood into the vector $x = (x_1, \ldots, x_m)^\top \in \mathbb{R}^m$.
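To make Eq. (2) concrete, the following sketch evaluates a single discrete Volterra functional by brute force, summing over all ordered index tuples. The function name `volterra_functional` and the convention of storing the coefficients $h^{(n)}$ as an array with n axes are assumptions made for this illustration; the explicit loop visits $m^n$ terms and is only practical for small m and n.

```python
import itertools
import numpy as np

def volterra_functional(h, x):
    """Evaluate H_n[x] = sum_{i1..in} h[i1, ..., in] * x[i1] * ... * x[in].

    h: coefficient array with n axes of length m (a scalar gives H_0 = h_0)
    x: input vector of length m
    """
    h = np.asarray(h, dtype=float)
    n = h.ndim                      # order of the functional
    total = 0.0
    # One term per ordered index tuple (i1, ..., in): m**n terms in total.
    for idx in itertools.product(range(len(x)), repeat=n):
        total += h[idx] * np.prod([x[i] for i in idx])
    return total

x = np.array([1.0, 2.0, 3.0])
h0 = volterra_functional(5.0, x)               # constant term h_0
h1 = volterra_functional([1.0, 0.0, -1.0], x)  # linear term: a dot product
h2 = volterra_functional(np.eye(3), x)         # quadratic term: x^T H x
```

For n = 1 the functional is an ordinary linear filter, and for n = 2 it is the quadratic form $x^\top H x$, which is a quick sanity check on the indexing.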
The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of $x$ with $m^n$ coefficients $h^{(n)}_{i_1 \ldots i_n}$. Volterra functionals provide a controlled way of introducing multiplicative interactions of image points, since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used, since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels. Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally leads to additional lower-order interactions. As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. If one wants to estimate the zeroth-order component of an image (i.e., the constant component created without pixel interactions), the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by decomposing it into a new series $y = G_0[x] + G_1[x] + \cdots + G_n[x] + \cdots$ of functionals $G_n[x]$ that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals $G_n[x]$ (see footnote 1) are linear combinations of Volterra functionals up to order n. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt orthogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels.
In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5]. So far, we have not gained anything compared to classical estimation of higher-order moments or spectra: an nth-order Volterra functional contains the same number of terms as the corresponding (n + 1)-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients, as it also consists of lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals.

Footnote 1: Strictly speaking, the term Wiener functional is reserved for orthogonal Volterra functionals with respect to Gaussian input. Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions.

2 Estimating Wiener series by regression in RKHS

Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector $x$. We can interpret the evaluation of this functional for a given input $x$ as a map $\phi_n$ defined for $n = 0, 1, 2, \ldots$ as

$$\phi_0(x) = 1 \quad \text{and} \quad \phi_n(x) = (x_1^n,\; x_1^{n-1} x_2,\; \ldots,\; x_1 x_2^{n-1},\; x_2^n,\; \ldots,\; x_m^n)^\top \qquad (3)$$

such that $\phi_n$ maps the input $x \in \mathbb{R}^m$ into a vector $\phi_n(x) \in F_n = \mathbb{R}^{m^n}$ containing all $m^n$ ordered monomials of degree n. Using $\phi_n$, we can write the nth-order Volterra functional in Eq. (2) as a scalar product in $F_n$,

$$H_n[x] = \eta_n^\top \phi_n(x), \qquad (4)$$

with the coefficients stacked into the vector $\eta_n = (h^{(n)}_{1,1,..1}, h^{(n)}_{1,2,..1}, h^{(n)}_{1,3,..1}, \ldots)^\top \in F_n$. The same idea can be applied to the entire pth-order Volterra series. By stacking the maps $\phi_n$ into a single map $\phi^{(p)}(x) = (\phi_0(x), \phi_1(x)^\top, \ldots, \phi_p(x)^\top)^\top$, one obtains a mapping from $\mathbb{R}^m$ into $F^{(p)} = \mathbb{R} \times \mathbb{R}^m \times \mathbb{R}^{m^2} \times \cdots \times \mathbb{R}^{m^p} = \mathbb{R}^M$ with dimensionality $M = \frac{1 - m^{p+1}}{1 - m}$. The entire pth-order Volterra series can be written as a scalar product in $F^{(p)}$,

$$\sum_{n=0}^{p} H_n[x] = (\eta^{(p)})^\top \phi^{(p)}(x) \qquad (5)$$

with $\eta^{(p)} \in F^{(p)}$. Below, we will show how we can express $\eta^{(p)}$ as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate. This procedure works because the space $F_n$ of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a consequence, the dot product in $F_n$ can be computed by evaluating a positive definite kernel function $k_n(x_1, x_2)$. For monomials, one can easily show that (e.g., [6])

$$\phi_n(x_1)^\top \phi_n(x_2) = (x_1^\top x_2)^n =: k_n(x_1, x_2). \qquad (6)$$

Since $F^{(p)}$ is generated as a direct sum of the single spaces $F_n$, the associated scalar product is simply the sum of the scalar products in the $F_n$:

$$\phi^{(p)}(x_1)^\top \phi^{(p)}(x_2) = \sum_{n=0}^{p} (x_1^\top x_2)^n = k^{(p)}(x_1, x_2). \qquad (7)$$

Thus, we have shown that the discretized Volterra series can be expressed as a linear functional in an RKHS (see footnote 2).

Footnote 2: A similar approach has been taken by [1] using the inhomogeneous polynomial kernel $k^{(p)}_{\mathrm{inh}}(x_1, x_2) = (1 + x_1^\top x_2)^p$. This kernel implies a map $\phi_{\mathrm{inh}}$ into the same space of monomials, but it weights the degrees of the monomials differently, as can be seen by applying the binomial theorem.

Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so-called representer theorem (e.g., [6]). It states the following: suppose we are given N observations $(x_1, y_1), \ldots, (x_N, y_N)$ of the function (1) and an arbitrary cost function $c$, $\Omega$ is a nondecreasing function on $\mathbb{R}_{>0}$ and $\|\cdot\|_F$ is the norm of the RKHS associated with the kernel $k$. If we minimize an objective function

$$c((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))) + \Omega(\|f\|_F) \qquad (8)$$

over all functions in the RKHS (see footnote 3), then an optimal solution can be expressed as

$$f(x) = \sum_{j=1}^{N} a_j k(x, x_j), \quad a_j \in \mathbb{R}. \qquad (9)$$

In other words, although we optimized over the entire RKHS, including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations $x_j$ only. Hence the optimization problem over the extremely large number of coefficients $\eta^{(p)}$ in Eq. (5) is transformed into one over N variables $a_j$. Let us consider the special case where the cost function is the mean squared error, $c((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))) = \frac{1}{N} \sum_{j=1}^{N} (f(x_j) - y_j)^2$, and the regularizer $\Omega$ is zero (see footnote 4). The solution for $a = (a_1, \ldots, a_N)^\top$ is readily computed by setting the derivative of (8) with respect to the vector $a$ equal to zero; it takes the form $a = K^{-1} y$ with the Gram matrix defined as $K_{ij} = k(x_i, x_j)$, hence (see footnote 5)

$$y = f(x) = a^\top z(x) = y^\top K^{-1} z(x), \qquad (10)$$

where $z(x) = (k(x, x_1), k(x, x_2), \ldots, k(x, x_N))^\top \in \mathbb{R}^N$.

Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property (see footnote 6). From Eq. (10), we obtain the following expressions for the implicit Wiener series

$$G_0[x] = \frac{1}{N} y^\top \mathbf{1}, \qquad \sum_{n=0}^{p} G_n[x] = \sum_{n=0}^{p} H_n[x] = y^\top K_p^{-1} z^{(p)}(x) \qquad (11)$$

where the Gram matrix $K_p$ and the coefficient vector $z^{(p)}(x)$ are computed using the kernel from Eq. (7) and $\mathbf{1} = (1, 1, \ldots)^\top \in \mathbb{R}^N$. Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the “curse of dimensionality”, i.e., there is no need to compute the possibly large number of coefficients explicitly.
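The regression route through Eqs. (6), (7) and (10) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the helper names (`feature_dim`, `phi`, `volterra_kernel`, `fit`, `predict`) are hypothetical, and the cutoff `rcond=1e-10` for the pseudo-inverse is an arbitrary choice made because the Gram matrix is rank-deficient whenever N exceeds the number of distinct monomials.

```python
import itertools
import numpy as np

def feature_dim(m, p):
    """Number of ordered monomials of degree 0..p in m variables: sum_n m**n."""
    return sum(m ** n for n in range(p + 1))

def phi(x, n):
    """Explicit map phi_n: all m**n ordered monomials of degree n (Eq. 3)."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in itertools.product(range(len(x)), repeat=n)])

def volterra_kernel(A, B, p):
    """k^(p)(a, b) = sum_{n=0..p} (a . b)^n for all rows of A and B (Eq. 7)."""
    dots = A @ B.T
    return sum(dots ** n for n in range(p + 1))

def fit(X, y, p, rcond=1e-10):
    """Coefficients a = K^+ y; the pseudo-inverse replaces K^-1 if K is singular."""
    return np.linalg.pinv(volterra_kernel(X, X, p), rcond=rcond) @ y

def predict(X_train, a, X_new, p):
    """f(x) = a^T z(x) with z(x)_j = k^(p)(x, x_j) (Eq. 10)."""
    return volterra_kernel(X_new, X_train, p) @ a

# 5 x 5 patches give m = 24 neighbours; an explicit degree-5 model would
# need feature_dim(24, 5) coefficients, whereas the implicit form needs N.
explicit_size = feature_dim(24, 5)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))

def target(Z):
    # A degree-2 polynomial, so a p = 2 model can represent it exactly.
    return 1.0 + 2.0 * Z[:, 0] - Z[:, 0] * Z[:, 1]

a = fit(X, target(X), p=2)
X_new = rng.normal(size=(5, 2))
max_error = np.max(np.abs(predict(X, a, X_new, p=2) - target(X_new)))
```

The explicit map `phi` is included only to check the kernel identity of Eq. (6) on small inputs; the fit itself never forms the monomials, which is the point of the implicit representation.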
The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by

$$H_n[x] = y^\top K_p^{-1} z_n(x) \qquad (12)$$

with $z_n(x) = ((x_1^\top x)^n, (x_2^\top x)^n, \ldots, (x_N^\top x)^n)^\top$. For p = 0 the only term is the constant zero-order Volterra functional $H_0[x] = G_0[x]$. The coefficient vector $\eta_n = (h^{(n)}_{1,1,\ldots 1}, h^{(n)}_{1,2,\ldots 1}, h^{(n)}_{1,3,\ldots 1}, \ldots)^\top$ of the explicit Volterra functional is obtained as

$$\eta_n = \Phi_n^\top K_p^{-1} y \qquad (13)$$

using the design matrix $\Phi_n = (\phi_n(x_1)^\top, \phi_n(x_2)^\top, \ldots, \phi_n(x_N)^\top)^\top$. The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solution for the kernels $k^{(n)}(x_1, x_2)$ and $k^{(n-1)}(x_1, x_2)$. The Wiener functional for n > 0 is then obtained from the difference of the two results as

$$G_n[x] = \sum_{i=0}^{n} G_i[x] - \sum_{i=0}^{n-1} G_i[x] = y^\top \left[ K_n^{-1} z^{(n)}(x) - K_{n-1}^{-1} z^{(n-1)}(x) \right]. \qquad (14)$$

The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. (12) and (13) [3].

Footnote 3: For conditions on uniqueness of the solution, see [6].
Footnote 4: Note that this is different from the regularized approach used by [1]. If $\Omega$ is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input.
Footnote 5: If $K$ is not invertible, $K^{-1}$ denotes the pseudo-inverse of $K$.
Footnote 6: Assuming symmetrized Volterra kernels, which can be obtained from any Volterra expansion.

Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition, which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following.

Theorem 1. The functionals obtained from Eq. (14) fulfill the orthogonality condition

$$E[m(x)\, G_p[x]] = 0 \qquad (15)$$

where $E$ denotes the expectation over the input distribution and $m(x)$ an arbitrary ith-order monomial with $i < p$.

We will show that this is a consequence of the least squares fit of any linear expansion in a set of basis functions of the form $y = \sum_{j=1}^{M} \gamma_j \varphi_j(x)$. In the case of the Wiener and Volterra expansions, the basis functions $\varphi_j(x)$ are monomials of the components of $x$. We denote the error of the expansion as $e(x) = y - \sum_{j=1}^{M} \gamma_j \varphi_j(x)$. The minimum of the expected quadratic loss $L$ with respect to the expansion coefficient $\gamma_k$ is given by

$$\frac{\partial L}{\partial \gamma_k} = \frac{\partial}{\partial \gamma_k} E\left[e(x)^2\right] = -2 E[\varphi_k(x)\, e(x)] = 0. \qquad (16)$$

This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is given by the sum of the higher-order Wiener functionals $e(x) = \sum_{n=p}^{\infty} G_n[x]$, so $G_p[x]$ is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies

$$\sum_{n=p}^{\infty} E[\varphi_k(x)\, G_n[x]] = 0 \quad \text{and} \quad \sum_{n=p+1}^{\infty} E[\varphi_k(x)\, G_n[x]] = 0 \qquad (17)$$

for any $\varphi_k$ of order less than p. The difference of both equations yields $E[\varphi_k(x)\, G_p[x]] = 0$, so that $G_p[x]$ must be orthogonal to any of the lower-order basis functions, namely to all monomials with order smaller than p.

3 Experiments

Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly. Our first test image (size 80 × 110, upper row in Fig. 1) contains only lines of varying orientations.
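The image components of each order used in these experiments come from the difference of two fits in Eq. (14). The sketch below applies that construction to synthetic data and checks the sample version of the orthogonality condition (15). It is an illustration under assumptions: the names `wiener_sum` and `wiener_component` are hypothetical, the components are evaluated only at the training points, the pseudo-inverse cutoff `rcond=1e-10` is an arbitrary choice, and the expectation in Theorem 1 is replaced by an average over the training sample.

```python
import numpy as np

def volterra_kernel(A, B, p):
    """k^(p)(a, b) = sum_{n=0..p} (a . b)^n for all rows of A and B (Eq. 7)."""
    dots = A @ B.T
    return sum(dots ** n for n in range(p + 1))

def wiener_sum(X, y, p, rcond=1e-10):
    """G_0 + ... + G_p evaluated at the training points: K_p K_p^+ y."""
    K = volterra_kernel(X, X, p)
    return K @ (np.linalg.pinv(K, rcond=rcond) @ y)

def wiener_component(X, y, n):
    """G_n at the training points as the difference of two fits (Eq. 14)."""
    if n == 0:
        return wiener_sum(X, y, 0)   # constant component, the mean of y
    return wiener_sum(X, y, n) - wiener_sum(X, y, n - 1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 0] * X[:, 1] + 0.5 * X[:, 1] ** 2

G2 = wiener_component(X, y, 2)
# Sample averages of m(x) * G_2[x] for the monomials 1, x_1 and x_2;
# all three should vanish up to floating-point error.
checks = [np.mean(G2), np.mean(X[:, 0] * G2), np.mean(X[:, 1] * G2)]
```

Each call solves a dense N × N problem, so this form is only practical for small training sets.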
Choosing a 5 × 5 neighbourhood, we predicted the central pixel using (11).

[Figure 1 panels: original image; 0th-order component/reconstruction; 1st-, 2nd- and 3rd-order reconstructions; 1st-, 2nd- and 3rd-order components. Reconstruction errors, in the order extracted: mse = 583.7, 0.006, 0, 1317, 37.4, 0.001, 1845, 334.9, 19.0.]

Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings.

Models of orders 0 ... 3 were learned from the image by extracting the maximal training set of 76 × 106 patches of size 5 × 5 (see footnote 7). The corresponding image components of order 0 to 3 were computed according to (14). Note that the different components generated by the Wiener functionals can also be negative. In Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image, thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects. Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly.
This is confirmed by the second and third test image shown in the corresponding row in Fig. 1. Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings, which cannot be predicted to a higher accuracy than one pixel.

Footnote 7: In contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand.

Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 × 110 is depicted in Fig. 2). Our results on a set of 10 natural images of size 50 × 70 show an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. 3).

[Figure 2 panels: original image; 0th-order component/reconstruction; reconstructions and components of orders 1 to 8. Reconstruction errors: 1st order, mse = 1070; 2nd order, mse = 957.4; 3rd order, mse = 414.6; 4th order, mse = 98.5; 5th order, mse = 18.5; 6th order, mse = 4.98; 7th order, mse = 1.32; 8th order, mse = 0.41.]

Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order.
Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models.

[Figure 3 plot: mse (axis from 0 to 700) against model order (0 to 7).]

Figure 3: Mean square reconstruction error of models of different order for a set of 10 natural images.

4 Conclusion

The implicit estimation of Wiener functionals via polynomial kernels opens up new possibilities for the estimation of higher-order image statistics. Compared to the classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel interactions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners. There is no one-to-one mapping between the classical higher-order statistics and multiplicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created. In contrast, our method is able to isolate image components that are created by a single order of interaction. Although of preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order.
Most of the currently used low-level feature detectors such as edge or corner detectors maximally use third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures. As often observed in the literature (e.g. [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel interactions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions:

1. A representative image database has to be analyzed. The images must be carefully calibrated since nonlinear statistics can be highly calibration-sensitive. In addition, the contribution of image noise has to be investigated.
2. Currently, only images up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. 11. To accommodate larger images, our method has to be reformulated as an iterative algorithm.
3. So far, we only considered 5 × 5 patches. To systematically investigate patch size effects, the analysis has to be conducted in a multi-scale framework.

References

[1] T. J. Dodd and R. F. Harrison. A new solution to Volterra series estimation. In CD-Rom Proc. 2002 IFAC World Congress, 2002.
[2] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559–601, 1994.
[3] M. O. Franz and B. Schölkopf. Implicit Wiener series. Technical Report 114, Max-Planck-Institut für biologische Kybernetik, Tübingen, June 2003.
[4] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis. Prentice Hall, Englewood Cliffs, NJ, 1993.
[5] M. Schetzen. The Volterra and Wiener theories of nonlinear systems. Krieger, Malabar, 1989.
[6] B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[7] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neurosc., 4(8):819–825, 2001.
[8] M. G. A. Thomson. Higher-order structure in natural scenes. J. Opt. Soc. Am. A, 16(7):1549–1553, 1999.
[9] M. G. A. Thomson. Beats, kurtosis and visual coding. Network: Compt. Neural Syst., 12:271–287, 2001.

4 0.55150437 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

Author: Oliver Williams, Andrew Blake, Roberto Cipolla

Abstract: There has been substantial progress in the past decade in the development of object classifiers for images, for example of faces, humans and vehicles. Here we address the problem of contaminations (e.g. occlusion, shadows) in test images which have not explicitly been encountered in training data. The Variational Ising Classifier (VIC) algorithm models contamination as a mask (a field of binary variables) with a strong spatial coherence prior. Variational inference is used to marginalize over contamination and obtain robust classification. In this way the VIC approach can turn a kernel classifier for clean data into one that can tolerate contamination, without any specific training on contaminated positives.

5 0.54142547 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data.

Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a top-down perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM) [1–3]. GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination).
GSMs have wide implications in theories of cortical receptive field development, e.g. the comprehensive bubbles framework of Hyvärinen [4]. The mixer variables provide the top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency [5, 6], and the fact that the marginal distributions of receptive field-like filters have high kurtosis [7, 8]. In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework [9] and statistical models for divisive gain control [6]. Coordinated structure has also been addressed in other image work [10–14], and in other domains such as speech [15] and finance [16]. Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure [17–21]. The idea is to learn linear filters (e.g. modeling simple cells as in [22, 23]), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher-order filters (e.g. complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, i.e. which linear filters share which mixer variables.

∗ Mixer variables are also called multipliers, but are unrelated to the scales of a wavelet.

Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data.
1 GSM inference of Gaussian and mixer variables

In a simple, n-dimensional version of a GSM, filter responses $l$ are synthesized† by multiplying an n-dimensional Gaussian with values $g = \{g_1 \ldots g_n\}$ by a common mixer variable $v$:

$$l = v g \qquad (1)$$

We assume the $g$ are uncorrelated ($\sigma^2$ along the diagonal of the covariance matrix). For the analytical calculations, we assume that $v$ has a Rayleigh distribution:

$$p[v] \propto \left[v \exp(-v^2/2)\right]^a \qquad (2)$$

where $0 < a \le 1$ parameterizes the strength of the prior. For ease, we develop the theory for $a = 1$. As is well known [2], and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements $l_1$ and $l_2$ follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n + 1 variables $g_1 \ldots g_n, v$ from the n filter responses $l_1 \ldots l_n$. It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model:

rv; distribution; posterior mean

$p[v|l_1]$; $\dfrac{\sqrt{\sigma/|l_1|}}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \exp\left(-\dfrac{v^2}{2} - \dfrac{l_1^2}{2 v^2 \sigma^2}\right)$; $\sqrt{\dfrac{|l_1|}{\sigma}}\, \dfrac{B\left(1, \frac{|l_1|}{\sigma}\right)}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$

$p[v|l]$; $\dfrac{(l/\sigma)^{\frac{1}{2}(n-2)}}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}\, v^{-(n-1)} \exp\left(-\dfrac{v^2}{2} - \dfrac{l^2}{2 v^2 \sigma^2}\right)$; $\sqrt{\dfrac{l}{\sigma}}\, \dfrac{B\left(\frac{3}{2} - \frac{n}{2}, \frac{l}{\sigma}\right)}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}$

$p[|g_1|\,|\,l_1]$; $\dfrac{\sqrt{\sigma |l_1|}}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}\, \dfrac{1}{g_1^2} \exp\left(-\dfrac{g_1^2}{2\sigma^2} - \dfrac{l_1^2}{2 g_1^2}\right)$; $\sqrt{\sigma |l_1|}\, \dfrac{B\left(0, \frac{|l_1|}{\sigma}\right)}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$

$p[|g_1|\,|\,l]$; $\dfrac{|l_1|^{(2-n)} (l/\sigma)^{\frac{n}{2} - 1}}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}\, g_1^{(n-3)} \exp\left(-\dfrac{g_1^2}{2\sigma^2} \dfrac{l^2}{l_1^2} - \dfrac{l_1^2}{2 g_1^2}\right)$; $\sqrt{\dfrac{\sigma}{l}}\, |l_1|\, \dfrac{B\left(\frac{n}{2} - \frac{1}{2}, \frac{l}{\sigma}\right)}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}$

where $B(n, x)$ is the modified Bessel function of the second kind (see also [24]), $l = \sqrt{\sum_i l_i^2}$, and $g_i$ is forced to have the same sign as $l_i$, since the mixer variables are always positive.
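The generative model of Eq. (1) and the posterior over the mixer can be explored numerically with a short sketch. Everything here is an assumption-laden illustration: the function names are hypothetical, the prior uses $a = 1$, and instead of the closed-form Bessel expressions the posterior mean of $v$ is computed by brute-force integration on a grid, using only that $p[v|l] \propto v^{1-n} \exp(-v^2/2 - l^2/(2 v^2 \sigma^2))$, which is the Rayleigh prior times the Gaussian likelihood of the responses given $v$.

```python
import numpy as np

def sample_gsm(num_samples, n, sigma=1.0, seed=0):
    """Draw filter responses l = v * g with one shared Rayleigh mixer per sample."""
    rng = np.random.default_rng(seed)
    v = rng.rayleigh(scale=1.0, size=num_samples)        # p[v] ~ v exp(-v^2 / 2)
    g = rng.normal(scale=sigma, size=(num_samples, n))   # uncorrelated Gaussians
    return v, g, v[:, None] * g

def kurtosis(x):
    """Fourth standardized moment; equals 3 for a Gaussian."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

def posterior_mean_v(l, sigma=1.0, grid=np.linspace(0.02, 8.0, 2000)):
    """E[v | l_1..l_n] for each row of l (shape (num_samples, n)) on a grid of v."""
    n = l.shape[1]
    sq = np.sum(l ** 2, axis=1)[:, None]                 # squared norm of l
    log_post = ((1 - n) * np.log(grid) - grid ** 2 / 2
                - sq / (2 * grid ** 2 * sigma ** 2))
    # Subtract the row maximum before exponentiating for numerical stability.
    w = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    return (w * grid).sum(axis=1) / w.sum(axis=1)

v, g, l = sample_gsm(2000, n=20)
marginal_kurtosis = kurtosis(l[:, 0])        # well above the Gaussian value of 3
est_local = posterior_mean_v(l[:, :1])       # one filter response: too local
est_group = posterior_mean_v(l)              # all 20 responses sharing v
corr_local = np.corrcoef(v, est_local)[0, 1]
corr_group = np.corrcoef(v, est_group)[0, 1]
```

With a Rayleigh mixer the marginal kurtosis is 6 in expectation, and the estimate of $v$ from all responses that share the mixer tracks the true value much more closely than the single-response estimate, which is the "too local" failure discussed in the text.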
Note that p[v|l1] and p[g1|l1] (rows 1, 3) are local estimates, while p[v|l] and p[g|l] (rows 2, 4) are estimates according to filter outputs {l1 ... ln}. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al.25

The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarchical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li, it is advantageous to learn the hierarchical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 ... l20 and l21 ... l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data.

Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα|{l}) and one of the Gaussian variables E(g1|{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not match the actual distributions well. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each estimating their respective g1 and g2. Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images.

† We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images.

Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα, or mixer variable vβ. B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses (1 filter, too local; 20 filters; 40 filters, too global). C Distribution of estimate of the mixer variable vα. Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1. E Joint conditional statistics of the estimates of Gaussian variables g1 and g2.

Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, vα, vβ, or vγ, is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα, vβ, and vγ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα, vβ, and vγ. C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling.
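The two-group synthetic setup can be reproduced in a few lines. The sketch below assumes a Rayleigh prior with a = 1 and unit-variance Gaussians; the helper names and the summary statistics (kurtosis and the correlation of response magnitudes) are illustrative stand-ins for the marginal and bowtie plots of figure 1, not the authors' analysis code.

```python
import math
import random

def rayleigh(rng):
    """Inverse-CDF sample from p(v) = v exp(-v^2 / 2), equation (2) with a = 1."""
    return math.sqrt(-2.0 * math.log(1.0 - rng.random()))

def sample_two_group_gsm(num_samples, group_size=20, sigma=1.0, seed=0):
    """Each row holds group_size responses driven by v_alpha, then group_size by v_beta."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        v_alpha, v_beta = rayleigh(rng), rayleigh(rng)
        row = [v_alpha * rng.gauss(0.0, sigma) for _ in range(group_size)]
        row += [v_beta * rng.gauss(0.0, sigma) for _ in range(group_size)]
        samples.append(row)
    return samples

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def magnitude_correlation(samples, i, j):
    """Correlation of |l_i| and |l_j|: a scalar summary of the bowtie dependency."""
    return correlation([abs(r[i]) for r in samples], [abs(r[j]) for r in samples])

def kurtosis(xs):
    """Fourth standardized moment; a Gaussian gives 3."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / (m2 * m2)
```

Filters that share a mixer (for example indices 0 and 1) should show clearly positive magnitude correlation, while filters from different groups (for example indices 0 and 20) should show almost none, even though the raw responses are uncorrelated in both cases.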
Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26

2 Neighborhood inference: solving the assignment problem

The plots in figure 1 suggest that it should be possible to infer the assignments, i.e. work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images.

We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 ... vm. We assume that gi has association probability pij (satisfying \( \sum_j p_{ij} = 1, \forall i \)) of being assigned to mixer variable vj. The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, ... m} for the assignments:

\[ l_i = g_i v_{s_i} \qquad (3) \]

Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li, we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij.
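The two-stage procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not give the conditional updates, so the sketch alternates between sampling each mixer on a fixed log-spaced grid given the filters currently assigned to it, and sampling each assignment given the mixers. It uses a Rayleigh prior with a = 1, takes only the final Gibbs sweep of each patch for the M phase, and adds a small smoothing count. All names (V_GRID, draw, sample_mixer, gibbs_assignments, fit_associations) are hypothetical.

```python
import math
import random

# Log-spaced grid of candidate mixer values, roughly 0.05 to 10 (an assumption).
V_GRID = [0.05 * 1.08 ** k for k in range(70)]

def draw(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(weights) - 1

def sample_mixer(member_responses, rng, sigma=1.0):
    """Sample v from p(v | responses assigned to this mixer), restricted to V_GRID."""
    k = len(member_responses)
    l_sq = sum(x * x for x in member_responses)
    # Prior v exp(-v^2/2), likelihood v^-k exp(-l^2 / (2 v^2 sigma^2)), and one
    # extra factor v that turns a density in v into mass on a log-spaced grid.
    logw = [(2 - k) * math.log(v) - 0.5 * v * v - l_sq / (2.0 * v * v * sigma * sigma)
            for v in V_GRID]
    top = max(logw)
    return V_GRID[draw([math.exp(w - top) for w in logw], rng)]

def gibbs_assignments(responses, p, rng, sweeps=5, sigma=1.0):
    """Gibbs sweeps for one patch; returns the final assignment s_i of each filter."""
    n, m = len(responses), len(p[0])
    s = [draw(p[i], rng) for i in range(n)]
    for _ in range(sweeps):
        v = [sample_mixer([responses[i] for i in range(n) if s[i] == j], rng, sigma)
             for j in range(m)]
        for i in range(n):
            # p(s_i = j | l_i, v) is proportional to p_ij * N(l_i; 0, (v_j sigma)^2)
            w = [p[i][j] / v[j] * math.exp(-responses[i] ** 2 / (2.0 * (v[j] * sigma) ** 2))
                 for j in range(m)]
            s[i] = draw(w, rng)
    return s

def fit_associations(patches, num_mixers, em_iters=10, seed=0, smoothing=0.5):
    """Alternate Gibbs assignments (E phase) with count-based updates of p_ij (M phase)."""
    rng = random.Random(seed)
    n = len(patches[0])
    p = []
    for _ in range(n):  # slightly perturbed uniform start, to break symmetry
        row = [1.0 + 0.1 * rng.random() for _ in range(num_mixers)]
        p.append([x / sum(row) for x in row])
    for _ in range(em_iters):
        counts = [[smoothing] * num_mixers for _ in range(n)]
        for responses in patches:
            for i, j in enumerate(gibbs_assignments(responses, p, rng)):
                counts[i][j] += 1.0
        p = [[c / sum(row) for c in row] for row in counts]
    return p
```

Setting num_mixers larger than the expected number of groups mirrors the idea of allowing more possible neighborhoods than are needed; unused mixers can then end up with small association weights. Whether this particular sampler converges as reported in the paper is not something the sketch establishes.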
Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable.

We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2 (A, B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (i.e., more than the true 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors.

3 Image data

Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel.
The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain), with 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P(v|patch). The first neighborhood (in B) prefers vertical patterns across most of its "receptive field", while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters,25 and to simulations with a larger number of neighborhoods.

From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent.

Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown are four 38×38 pixel patches that had the maximum log likelihood of P(v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables.
We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches that yielded the maximum log likelihood of P(v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B).

Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P(v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P(v|patch).

4 Discussion

Work on the study of natural image statistics has recently evolved from issues about scale-space hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components.
This includes bottom-up (e.g. bowties, hierarchical representations such as complex cells) and top-down (e.g. GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing.

A most critical aspect of a hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components.

An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion.

References

[1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications.
[4] A Hyvärinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003.
[5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.
[6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997.
[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.
[10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999.
[12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

6 0.50648004 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

7 0.46409711 125 nips-2004-Multiple Relational Embedding

8 0.44828999 158 nips-2004-Sampling Methods for Unsupervised Learning

9 0.43960631 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

10 0.43546662 99 nips-2004-Learning Hyper-Features for Visual Identification

11 0.42971817 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

12 0.42744297 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

13 0.41528663 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

14 0.41510934 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

15 0.39462346 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.39009488 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

17 0.38083932 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

18 0.37618995 44 nips-2004-Conditional Random Fields for Object Recognition

19 0.36432138 40 nips-2004-Common-Frame Model for Object Recognition

20 0.34717479 179 nips-2004-Surface Reconstruction using Learned Shape Models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.12), (15, 0.103), (26, 0.071), (31, 0.035), (33, 0.202), (35, 0.03), (39, 0.028), (50, 0.027), (56, 0.285)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88730794 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a reﬁned probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Speciﬁcally, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric ﬁts to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits signiﬁcantly heavier tails than a Gaussian, and approximately follows a power-law function. Humans do not perceive visual motion veridically. Various psychophysical experiments have shown that the perceived speed of visual stimuli is affected by stimulus contrast, with low contrast stimuli being perceived to move slower than high contrast ones [1, 2]. Computational models have been suggested that can qualitatively explain these perceptual effects. Commonly, they assume the perception of visual motion to be optimal either within a deterministic framework with a regularization constraint that biases the solution toward zero motion [3, 4], or within a probabilistic framework of Bayesian estimation with a prior that favors slow velocities [5, 6]. The solutions resulting from these two frameworks are similar (and in some cases identical), but the probabilistic framework provides a more principled formulation of the problem in terms of meaningful probabilistic components. 
Specifically, Bayesian approaches rely on a likelihood function that expresses the relationship between the noisy measurements and the quantity to be estimated, and a prior distribution that expresses the probability of encountering any particular value of that quantity. A probabilistic model can also provide a richer description, by defining a full probability density over the set of possible "percepts", rather than just a single value. Numerous analyses of psychophysical experiments have made use of such distributions within the framework of signal detection theory in order to model perceptual behavior [7].

Previous work has shown that an ideal Bayesian observer model based on Gaussian forms for both likelihood and prior is sufficient to capture the basic qualitative features of global translational motion perception [5, 6]. But the behavior of the resulting model deviates systematically from human perceptual data, most importantly with regard to trial-to-trial variability and the precise form of interaction between contrast and perceived speed. A recent article achieved better fits for the model under the assumption that human contrast perception saturates [8]. In order to advance the theory of Bayesian perception and provide significant constraints on models of neural implementation, it seems essential to constrain quantitatively both the likelihood function and the prior probability distribution.

Figure 1: Bayesian model of visual speed perception. a) For a high contrast stimulus, the likelihood has a narrow width (a high signal-to-noise ratio) and the prior induces only a small shift µ of the mean v̂ of the posterior. b) For a low contrast stimulus, the measurement is noisy, leading to a wider likelihood. The shift µ is much larger and the perceived speed lower than under condition (a).
In previous work, the proposed likelihood functions were derived from the brightness constancy constraint [5, 6] or other generative principles [9]. Also, previous approaches defined the prior distribution based on general assumptions and computational convenience, typically choosing a Gaussian with zero mean, although a Laplacian prior has also been suggested [4]. In this paper, we develop a more general form of Bayesian model for speed perception that can account for trial-to-trial variability. We use psychophysical speed discrimination data in order to constrain both the likelihood and the prior function.

1 Probabilistic Model of Visual Speed Perception

1.1 Ideal Bayesian Observer

Assume that an observer wants to obtain an estimate for a variable v based on a measurement m that she/he performs. A Bayesian observer "knows" that the measurement device is not ideal and therefore, the measurement m is affected by noise. Hence, this observer combines the information gained by the measurement m with a priori knowledge about v. Doing so (and assuming that the prior knowledge is valid), the observer will, on average, perform better in estimating v than just trusting the measurements m. According to Bayes' rule

\[ p(v|m) = \frac{1}{\alpha}\, p(m|v)\, p(v) \qquad (1) \]

the probability of perceiving v given m (posterior) is the product of the likelihood of v for a particular measurement m and the a priori knowledge about the estimated variable v (prior). α is a normalization constant independent of v that ensures that the posterior is a proper probability distribution.

Figure 2: 2AFC speed discrimination experiment. a) Two patches of drifting gratings were displayed simultaneously (motion without movement). The subject was asked to fixate the center cross and decide after the presentation which of the two gratings was moving faster. b) A typical psychometric curve obtained under such paradigm.
The dots represent the empirical probability that the subject perceived stimulus2 moving faster than stimulus1. The speed of stimulus1 was ﬁxed while v2 is varied. The point of subjective equality, vmatch , is the value of v2 for which Pcum = 0.5. The threshold velocity vthresh is the velocity for which Pcum = 0.875. It is important to note that the measurement m is an internal variable of the observer and is not necessarily represented in the same space as v. The likelihood embodies both the mapping from v to m and the noise in this mapping. So far, we assume that there is a monotonic function f (v) : v → vm that maps v into the same space as m (m-space). Doing so allows us to analytically treat m and vm in the same space. We will later propose a suitable form of the mapping function f (v). An ideal Bayesian observer selects the estimate that minimizes the expected loss, given the posterior and a loss function. We assume a least-squares loss function. Then, the optimal estimate v is the mean of the posterior in Equation (1). It is easy to see why this model ˆ of a Bayesian observer is consistent with the fact that perceived speed decreases with contrast. The width of the likelihood varies inversely with the accuracy of the measurements performed by the observer, which presumably decreases with decreasing contrast due to a decreasing signal-to-noise ratio. As illustrated in Figure 1, the shift in perceived speed towards slow velocities grows with the width of the likelihood, and thus a Bayesian model can qualitatively explain the psychophysical results [1]. 1.2 Two Alternative Forced Choice Experiment We would like to examine perceived speeds under a wide range of conditions in order to constrain a Bayesian model. Unfortunately, perceived speed is an internal variable, and it is not obvious how to design an experiment that would allow subjects to express it directly 1 . 
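To make the contrast effect concrete, the sketch below works through the simplest special case mentioned in the text: a Gaussian likelihood centered on the measurement and a zero-mean Gaussian prior, for which the posterior mean has a closed form. The function name and the parameter values are illustrative assumptions; the paper argues that Gaussian forms are not adequate quantitatively, so this only shows the qualitative shift toward slow speeds.

```python
def perceived_speed(measurement, likelihood_sd, prior_sd):
    """Posterior mean for a Gaussian likelihood and a zero-mean Gaussian prior.

    The posterior is Gaussian, and its mean shrinks the measurement toward zero
    by the factor prior_var / (prior_var + likelihood_var).
    """
    prior_var = prior_sd ** 2
    likelihood_var = likelihood_sd ** 2
    return measurement * prior_var / (prior_var + likelihood_var)

# Same measured speed, two hypothetical contrast levels:
high_contrast = perceived_speed(4.0, likelihood_sd=0.5, prior_sd=2.0)  # narrow likelihood
low_contrast = perceived_speed(4.0, likelihood_sd=2.0, prior_sd=2.0)   # wide likelihood
```

With these values the low contrast estimate is pulled halfway to zero, while the high contrast estimate stays close to the measurement, matching the direction of the effect illustrated in Figure 1.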
Perceived speed can only be accessed indirectly by asking the subject to compare the speed of two stimuli. For a given trial, an ideal Bayesian observer in such a two-alternative forced choice (2AFC) experimental paradigm simply decides on the basis of the two trial estimates v̂1 (stimulus1) and v̂2 (stimulus2) which stimulus moves faster. Each estimate v̂ is based on a particular measurement m. For a given stimulus with speed v, an ideal Bayesian observer will produce a distribution of estimates p(v̂|v) because m is noisy. Over trials, the observer's behavior can be described by classical signal detection theory based on the distributions of the estimates, hence e.g. the probability of perceiving stimulus2 moving faster than stimulus1 is given as the cumulative probability

\[ P_{cum}(\hat v_2 > \hat v_1) = \int_0^{\infty} p(\hat v_2|v_2) \int_0^{\hat v_2} p(\hat v_1|v_1)\, d\hat v_1\, d\hat v_2 \qquad (2) \]

Pcum describes the full psychometric curve. Figure 2b illustrates the measured psychometric curve and its fit from such an experimental situation.

1 Although see [10] for an example of determining and even changing the prior of a Bayesian model for a sensorimotor task, where the estimates are more directly accessible.

2 Experimental Methods

We measured matching speeds (Pcum = 0.5) and thresholds (Pcum = 0.875) in a 2AFC speed discrimination task. Subjects were presented simultaneously with two circular patches of horizontally drifting sine-wave gratings for the duration of one second (Figure 2a). Patches were 3deg in diameter, and were displayed at 6deg eccentricity to either side of a fixation cross. The stimuli had an identical spatial frequency of 1.5 cycle/deg.
One stimulus was considered to be the reference stimulus, having one of two different contrast values ($c_1$ = [0.075 0.5]) and one of five different speed values ($u_1$ = [1 2 4 8 12] deg/sec), while the second stimulus (test) had one of five different contrast values ($c_2$ = [0.05 0.1 0.2 0.4 0.8]) and a varying speed that was determined by an interleaved staircase procedure. For each condition there were 96 trials. Conditions were randomly interleaved, including a random choice of stimulus identity (test vs. reference) and motion direction (right vs. left). Subjects were asked to fixate during stimulus presentation and select the faster moving stimulus. The threshold experiment differed only in that auditory feedback was given to indicate the correctness of the subject's decision. This did not change the outcome of the experiment but significantly increased the quality of the data and thus reduced the number of trials needed.

3 Analysis

With the data from the speed discrimination experiments we could in principle apply a parametric fit using Equation (2) to derive the prior and the likelihood, but the optimization is difficult, and the fit might not be well constrained given the amount of data we have obtained. The problem becomes much more tractable given the following weak assumptions:

• We consider the prior to be relatively smooth.
• We assume that the measurement $m$ is corrupted by additive Gaussian noise with a variance whose dependence on stimulus speed and contrast is separable.
• We assume that there is a mapping function $f(v): v \to v_m$ that maps $v$ into the space of $m$ (m-space). In that space, the likelihood is convolutional, i.e., the noise in the measurement directly defines the width of the likelihood.

These assumptions allow us to relate the psychophysical data to our probabilistic model in a simple way. The following analysis is in the m-space. The point of subjective equality ($P_{cum} = 0.5$) is defined as the point where the expected values of the speed estimates are equal.
We write

$$E\langle \hat{v}_{m,1} \rangle = E\langle \hat{v}_{m,2} \rangle$$
$$v_{m,1} - E\langle \mu_1 \rangle = v_{m,2} - E\langle \mu_2 \rangle \qquad (3)$$

where $E\langle \mu \rangle$ is the expected shift of the perceived speed compared to the veridical speed. For the discrimination threshold experiment, the above assumptions imply that the variance $\mathrm{var}\langle \hat{v}_m \rangle$ of the speed estimates $\hat{v}_m$ is equal for both stimuli. Then, (2) predicts that the discrimination threshold is proportional to the standard deviation, thus

$$v_{m,2} - v_{m,1} = \gamma \sqrt{\mathrm{var}\langle \hat{v}_m \rangle} \qquad (4)$$

where $\gamma$ is a constant that depends on the threshold criterion $P_{cum}$ and the exact shape of $p(\hat{v}_m|v_m)$.

[Figure 3: Piece-wise approximation. We perform a parametric fit by assuming the prior to be piece-wise linear and the likelihood to be LogNormal (Gaussian in the m-space). Plot labels: likelihood, prior, a, b, $v_m$.]

3.1 Estimating the prior and likelihood

In order to extract the prior and the likelihood of our model from the data, we have to find a generic local form of the prior and the likelihood and relate them to the mean and the variance of the speed estimates. As illustrated in Figure 3, we assume that the likelihood is Gaussian with a standard deviation $\sigma(c, v_m)$. Furthermore, the prior is assumed to be well approximated by a first-order Taylor series expansion over the velocity ranges covered by the likelihood. We parameterize this linear expansion of the prior as $p(v_m) = a v_m + b$. We can now derive a posterior for this local approximation of likelihood and prior and then define the perceived speed shift $\mu(m)$. The posterior can be written as

$$p(v_m|m) = \frac{1}{\alpha}\, p(m|v_m)\, p(v_m) = \frac{1}{\alpha} \left[ \exp\!\left( -\frac{v_m^2}{2\sigma(c, v_m)^2} \right) (a v_m + b) \right] \qquad (5)$$

where $\alpha$ is the normalization constant

$$\alpha = \int_{-\infty}^{\infty} p(m|v_m)\, p(v_m)\, dv_m = b \sqrt{2\pi\sigma(c, v_m)^2} \qquad (6)$$

We can compute $\mu(m)$ as the first-order moment of the posterior for a given $m$.
Exploiting the symmetries around the origin, we find

$$\mu(m) = \int_{-\infty}^{\infty} v_m\, p(v_m|m)\, dv_m \equiv \frac{a(m)}{b(m)}\, \sigma(c, v_m)^2 \qquad (7)$$

The expected value of $\mu(m)$ is equal to the value of $\mu$ at the expected value of the measurement $m$ (which is the stimulus velocity $v_m$), thus

$$E\langle \mu \rangle = \mu(m)\big|_{m=v_m} = \frac{a(v_m)}{b(v_m)}\, \sigma(c, v_m)^2 \qquad (8)$$

Similarly, we derive $\mathrm{var}\langle \hat{v}_m \rangle$. Because the estimator is deterministic, the variance of the estimate only depends on the variance of the measurement $m$. For a given stimulus, the variance of the estimate can be well approximated by

$$\mathrm{var}\langle \hat{v}_m \rangle = \mathrm{var}\langle m \rangle \left( \frac{\partial \hat{v}_m(m)}{\partial m}\bigg|_{m=v_m} \right)^2 = \mathrm{var}\langle m \rangle \left( 1 - \frac{\partial \mu(m)}{\partial m}\bigg|_{m=v_m} \right)^2 \approx \mathrm{var}\langle m \rangle \qquad (9)$$

Under the assumption of a locally smooth prior, the perceived velocity shift remains locally constant. The variance of the perceived speed $\hat{v}_m$ becomes equal to the variance of the measurement $m$, which is the variance of the likelihood (in the m-space), thus

$$\mathrm{var}\langle \hat{v}_m \rangle = \sigma(c, v_m)^2 \qquad (10)$$

Together with (3) and (4), the above derivations provide a simple dependency of the psychophysical data on the local parameters of the likelihood and the prior.

3.2 Choosing a Logarithmic speed representation

We now want to choose the appropriate mapping function $f(v)$ that maps $v$ to the m-space. We define the m-space as the space in which the likelihood is Gaussian with a speed-independent width. We have shown that the discrimination threshold is proportional to the width of the likelihood (4), (10). Also, we know from the psychophysics literature that visual speed discrimination approximately follows a Weber-Fechner law [11, 12]; that is, the discrimination threshold increases roughly in proportion to speed, and so would the width of the likelihood. A logarithmic speed representation would be compatible with the data and our choice of the likelihood. Hence, we transform the linear speed domain $v$ into a normalized logarithmic domain according to

$$v_m = f(v) = \ln\!\left( \frac{v + v_0}{v_0} \right) \qquad (11)$$

where $v_0$ is a small normalization constant.
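The mapping of Equation (11) and its inverse can be written down directly. The sketch below shows why a fixed increment in m-space corresponds to a linear-domain increment proportional to $v + v_0$, which is the Weber-like behavior described above. The values of `v0` and `delta` are hypothetical and chosen only for illustration; the source does not state a value for $v_0$ here.

```python
import math

def to_m_space(v, v0):
    """Equation (11): v_m = ln((v + v0) / v0)."""
    return math.log((v + v0) / v0)

def from_m_space(v_m, v0):
    """Inverse of Equation (11): v = v0 * (exp(v_m) - 1)."""
    return v0 * (math.exp(v_m) - 1.0)

v0 = 0.3      # hypothetical normalization constant [deg/sec]
delta = 0.1   # hypothetical fixed increment in m-space

for v in (1.0, 2.0, 4.0, 8.0, 12.0):
    # Linear-domain step that corresponds to the same m-space step delta.
    step = from_m_space(to_m_space(v, v0) + delta, v0) - v
    print(v, step, step / (v + v0))
```

The last printed column is the same for every speed, namely $e^{\delta} - 1$. For $v \gg v_0$ the step is nearly proportional to $v$ itself, while for small $v$ the constant $v_0$ keeps it from shrinking to zero.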
The normalization is chosen to account for the expected deviation from equal-variance behavior at the low end. Surprisingly, it has been found that neurons in the Medial Temporal area (Area MT) of macaque monkeys have speed-tuning curves that are very well approximated by Gaussians of constant width in the above normalized logarithmic space [13]. These neurons are known to play a central role in the representation of motion. It seems natural to assume that they are strongly involved in tasks such as the psychophysical experiments we performed.

4 Results

Figure 4 shows the contrast-dependent shift of speed perception and the speed discrimination threshold data for two subjects. Data points connected with a dashed line represent the relative matching speed ($v_2/v_1$) for a particular contrast value $c_2$ of the test stimulus as a function of the speed of the reference stimulus. Error bars are the empirical standard deviation of fits to bootstrapped samples of the data. Clearly, low-contrast stimuli are perceived to move slower. The effect, however, varies across the tested speed range and tends to become smaller for higher speeds. The relative discrimination thresholds for two different contrasts as a function of speed show that the Weber-Fechner law holds only approximately. The data are in good agreement with other data from the psychophysics literature [1, 11, 8].

For each subject, data from both experiments were used to compute a parametric least-squares fit according to (3), (4), (7), and (10). In order to test the assumption of a LogNormal likelihood we allowed the standard deviation to be dependent on contrast and speed, thus $\sigma(c, v_m) = g(c)h(v_m)$. We split the speed range into six bins (subject2: five) and parameterized $h(v_m)$ and the ratio $a/b$ accordingly. Similarly, we parameterized $g(c)$ for the seven contrast values. The resulting fits are superimposed as bold lines in Figure 4.
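The fit described above rests on the closed forms of Equations (6) and (7), which relate the local prior parameters $a$, $b$ and the likelihood width $\sigma$ to the normalization constant and the perceived speed shift. As a sanity check, the sketch below evaluates both integrals numerically for a Gaussian likelihood centered at the origin and a locally linear prior. The parameter values and the function name `local_posterior_moments` are hypothetical, and the integration is truncated at ten standard deviations.

```python
import math
import numpy as np

def local_posterior_moments(a, b, sigma, half_width=10.0, n=200_001):
    """Numerically integrate exp(-v^2 / (2 sigma^2)) * (a v + b).

    Returns (alpha, mu): the normalization constant of Equation (6)
    and the first moment of the normalized posterior, Equation (7).
    """
    v = np.linspace(-half_width * sigma, half_width * sigma, n)
    dv = v[1] - v[0]
    unnormalized = np.exp(-v ** 2 / (2.0 * sigma ** 2)) * (a * v + b)
    alpha = unnormalized.sum() * dv
    mu = (v * unnormalized).sum() * dv / alpha
    return float(alpha), float(mu)

# Hypothetical local parameters: a prior that decreases with speed (a < 0)
# and stays positive over the integration range.
a, b, sigma = -0.05, 1.0, 0.4

alpha, mu = local_posterior_moments(a, b, sigma)
print(alpha, b * math.sqrt(2.0 * math.pi * sigma ** 2))   # Equation (6)
print(mu, (a / b) * sigma ** 2)                           # Equation (7)
```

In this sketch the numerical and closed-form values agree closely. The first moment depends on the prior only through the ratio $a/b$, which is why that ratio, rather than $a$ and $b$ separately, is the quantity parameterized in the fit.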
Figure 5 shows the fitted parametric values for $g(c)$ and $h(v)$ (plotted in the linear domain), and the reconstructed prior distribution $p(v)$ transformed back to the linear domain. The approximately constant values for $h(v)$ provide evidence that a LogNormal distribution is an appropriate functional description of the likelihood. The resulting values for $g(c)$ suggest that the likelihood width has a roughly exponentially decaying dependency on contrast, with strong saturation for higher contrasts.

[Figure 4: Speed discrimination data for two subjects. a) The relative matching speed of a test stimulus with different contrast levels ($c_2$ = [0.05 0.1 0.2 0.4 0.8]) to achieve subjective equality with a reference stimulus (two different contrast values $c_1$). b) The relative discrimination threshold for two stimuli with equal contrast ($c_{1,2}$ = [0.075 0.5]). Axes: normalized matching speed vs. speed of reference stimulus [deg/sec]; relative discrimination threshold vs. stimulus speed [deg/sec].]

[Figure 5: Reconstructed prior distribution and parameters of the likelihood function. The reconstructed priors for both subjects show much heavier tails than a Gaussian (dashed fit), approximately following a power-law function with exponent n ≈ −1.4 (bold line). Axes: p(v) [unnormalized] vs. speed [deg/sec]; g(c) vs. contrast; h(v) vs. speed [deg/sec]. Plot annotations: n = −1.41 (subject 1), n = −1.35 (subject 2).]

5 Conclusions

We have proposed a probabilistic framework based on a Bayesian ideal observer and standard signal detection theory.
We have derived a likelihood function and prior distribution for the estimator, with a fairly conservative set of assumptions, constrained by psychophysical measurements of speed discrimination and matching. The width of the resulting likelihood is nearly constant in the logarithmic speed domain, and decreases approximately exponentially with contrast. The prior expresses a preference for slower speeds, and approximately follows a power-law distribution, and thus has much heavier tails than a Gaussian. It would be interesting to compare the prior distributions derived here with measured true distributions of local image velocities that impinge on the retina. Although a number of authors have measured the spatio-temporal structure of natural images [e.g., 14], it is clearly difficult to extract the true prior distribution from these measurements because of the feedback loop formed through movements of the body, head and eyes.

Acknowledgments

The authors thank all subjects for their participation in the psychophysical experiments.

References

[1] P. Thompson. Perceived rate of movement depends on contrast. Vision Research, 22:377–380, 1982.
[2] L.S. Stone and P. Thompson. Human speed perception is contrast dependent. Vision Research, 32(8):1535–1549, 1992.
[3] A. Yuille and N. Grzywacz. A computational theory for the perception of coherent visual motion. Nature, 333(5):71–74, May 1988.
[4] Alan Stocker. Constraint Optimization Networks for Visual Motion Perception - Analysis and Synthesis. PhD thesis, Dept. of Physics, Swiss Federal Institute of Technology, Zürich, Switzerland, March 2002.
[5] Eero Simoncelli. Distributed analysis and representation of visual motion. PhD thesis, MIT, Dept. of Electrical Engineering, Cambridge, MA, 1993.
[6] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002.
[7] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.
[8] F. Hürlimann, D. Kiper, and M. Carandini. Testing the Bayesian model of perceived speed. Vision Research, 2002.
[9] Y. Weiss and D.J. Fleet. Probabilistic Models of the Brain, chapter Velocity Likelihoods in Biological and Machine Vision, pages 77–96. Bradford, 2002.
[10] K. Koerding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[11] Leslie Welch. The perception of moving plaids reveals two motion-processing stages. Nature, 337:734–736, 1989.
[12] S. McKee, G. Silvermann, and K. Nakayama. Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Research, 26(4):609–619, 1986.
[13] C.H. Anderson, H. Nover, and G.C. DeAngelis. Modeling the velocity tuning of macaque MT neurons. Journal of Vision/VSS abstract, 2003.
[14] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.

same-paper 2 0.84780651 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

Author: Erik G. Learned-miller, Parvez Ahammad

3 0.83517867 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are speciﬁed. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semideﬁnite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and signiﬁcantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1

4 0.74658877 145 nips-2004-Parametric Embedding for Class Visualization

Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classiﬁer’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1

5 0.68286103 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

6 0.68183577 125 nips-2004-Multiple Relational Embedding

7 0.68058735 102 nips-2004-Learning first-order Markov models for control

8 0.67730469 163 nips-2004-Semi-parametric Exponential Family PCA

9 0.67722476 131 nips-2004-Non-Local Manifold Tangent Learning

10 0.67539841 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

11 0.67515343 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

12 0.67470157 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

13 0.67400259 124 nips-2004-Multiple Alignment of Continuous Time Series

14 0.67378837 69 nips-2004-Fast Rates to Bayes for Kernel Machines

15 0.67374533 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

16 0.67369235 70 nips-2004-Following Curved Regularized Optimization Solution Paths

17 0.67369127 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

18 0.67286396 116 nips-2004-Message Errors in Belief Propagation

19 0.67271304 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

20 0.67268646 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees