nips nips2010 nips2010-256 knowledge-graph by maker-knowledge-mining

256 nips-2010-Structural epitome: a way to summarize one’s visual experience


Source: pdf

Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino

Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the ”structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large ”all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. [sent-2, score-0.106]

2 Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. [sent-3, score-0.109]

3 Direct application of existing algorithms, such as panoramic stitching (e. [sent-4, score-0.081]

4 As a remedy to these problems, we introduce a novel image representation, the ”structural element (stel) epitome,” and an associated efficient learning algorithm. [sent-9, score-0.07]

5 In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large ”all-I-have-seen” epitome matrix. [sent-10, score-1.693]

6 The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. [sent-11, score-0.877]

7 However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. [sent-12, score-1.464]

8 As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. [sent-13, score-0.852]

9 1 Introduction: We develop a novel generative model which combines the powerful invariance properties achieved through the use of hidden variables in epitome [2] and stel (structural element) models [6, 8]. [sent-14, score-1.355]

10 The latter set of models have a hidden stel index si for each image pixel i. [sent-15, score-0.861]

11 The number of discrete states si can take is small, typically 4-10, as the stel indices point to a small palette of distributions over local measurements, e.g., pixel colors. [sent-16, score-0.884]

12 The local measurement (e.g., color) for pixel i is assumed to have been generated from the appropriate palette entry. [sent-21, score-0.151]

13 This constrains the pixels with the same stel index s to have similar colors or whatever local measurements xi represent. [sent-22, score-0.724]

14 The indexing scheme is further assumed to change little across different images of the same scene/object, while the palettes can vary significantly. [sent-23, score-0.129]

15 For example, two images of the same scene captured in different levels of overall illumination would still have very similar stel partitions, even though their palettes may be vastly different. [sent-24, score-0.853]

16 In this way, the image representation rises above a matrix of local measurements in favor of a matrix of stel indices which can survive remarkable non-structural image changes, as long as these can be explained away by a change in the (small) palette. [sent-25, score-0.804]

17 In Fig. 1B, images of pedestrians are captured by a model that has a prior distribution of stel assignments shown in the first row. [sent-27, score-0.732]

18 The prior on stel probabilities for each pixel adds up to one, and the 6 images showing these prior probabilities add up to a uniform image of ones. [sent-28, score-0.829]

19 In G) we show the original epitome model [2] trained on these four frames. [sent-33, score-0.69]

20 The frames are shown below with their posterior distributions over stel assignments, as well as the mean color of each stel. [sent-34, score-0.694]

21 This illustrates that the different parts of the pedestrian images are roughly matched. [sent-35, score-0.115]

22 Torso pixels, for instance, are consistently assigned to stel s = 3, despite the fact that different people wore shirts or coats of very different colors. [sent-36, score-0.653]

23 Such a consistent segmentation is possible because torso pixels tend to have similar colors within any given image and because the torso is roughly in the same position across images (though misalignment of up to half the size of the segments is largely tolerated). [sent-37, score-0.28]

24 While the figure shows the model with S=6 stels, larger numbers of stels were shown to lead to further segmentation of the head and even splitting of the left from the right leg [6]. [sent-38, score-0.063]

25 [7, 13, 14, 8], as the described addition of hidden variables s achieves the remarkable level of intensity invariance first demonstrated through the use of similarity templates [12], but at a much lower computational cost. [sent-41, score-0.063]

26 In this paper, we embed the stel image representation within a large stel epitome: a stel prior matrix, like the one shown in the top row of Fig. [sent-42, score-1.975]

27 This requires the additional transformation variables T for each image whose role is to align it with the epitome. [sent-44, score-0.086]

28 In that case, a large collection of images must naturally undergo an unsupervised clustering in order for this real estate to be used as well as possible (or as well as the local minimum obtained by the learning algorithm allows). [sent-46, score-0.145]

29 As in the original epitome models, the transformation variables play both the alignment and cluster indexing roles. [sent-48, score-0.771]

30 Different models over the typical scenes/objects have to compete over the positions in the epitome, with a panoramic version of each scene emerging in different parts of the epitome, finally providing a rich image indexing scheme. [sent-49, score-0.234]

31 Such a panoramic scene submodel within the stel epitome is illustrated in Fig. [sent-50, score-1.441]

32 A portion of the larger stel epitome is shown with 3 images that map into this region. [sent-52, score-1.406]

33 The three images shown, mapping to different parts of this region, have very different colors as they were taken at different times of day and across different days, and yet their alignment is not adversely affected, as is evident in their posterior stel segmentation aligned to the epitome. [sent-55, score-0.879]

34 To further illustrate the panoramic alignment, we used the epitome mapping to show for the 4 different images in Fig. [sent-56, score-0.852]

35 1D how they overlap with stel s=4 of another office image (Fig. [sent-57, score-0.721]

36 1E), as well as how multiple images of this scene, including these 4, look when they are aligned and overlapped as intensity images in Fig. [sent-58, score-0.201]

37 Fig. 1G shows the original epitome model [2] trained on images of this scene. [sent-61, score-0.771]

38 Without the invariances afforded by the stel representation, the standard color epitome has to split the images of the scene into two clusters, and so the laptop screen is doubled there. [sent-62, score-1.54]

39 Qualitatively quite different from both epitomes and previous stel models, the stel epitome is a model flexible enough to be applied to a very diverse set of images. [sent-63, score-2.057]

40 2 Stel epitome: The graphical model describing the dependencies in stel epitomes is provided in Fig. [sent-68, score-1.422]

41 We first consider the generation of a single image or an image patch (depending on which visual scale we are epitomizing), and, for brevity, temporarily omit the subscript t indexing different images. [sent-71, score-0.204]

42 The epitome is a matrix of multinomial distributions over S indices $s \in \{1, 2, \ldots, S\}$, associated with each two-dimensional epitome location i: $p(s_i = s) = e_i(s)$. (1) [sent-72, score-0.718] [sent-75, score-0.734]

44 Thus each location in the epitome contains S probabilities (adding to one) for different indices. [sent-76, score-0.727]

45 Indices for the image are assumed to be generated from these distributions. [sent-77, score-0.07]

46 As in [1, 2], the shifts are separated from other transformations such as scaling or rotation, T = (ℓ, r), with ℓ being a 2-dimensional shift and r being the index into the set of other transformations, e.g., rotations or scalings. [sent-81, score-0.074]

47 Λ is the palette associated with the image, and Λs is its s-th entry. [sent-84, score-0.121]

48 Various palette models for probabilistic index / structure element map models have been reviewed in [8]. [sent-85, score-0.144]

49 For brevity, in this paper we focus on the simplest case where the image measurements are simply pixel colors, and the palette entries are simply Gaussians with parameters Λs = (µs , φs ). [sent-86, score-0.222]
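For concreteness, here is a minimal Python sketch of the generative process just described: an epitome prior over stel indices, a shift-only mapping T, and Gaussian palette entries. The function name, the uniform prior over shifts, and the grayscale (scalar) palette are simplifying assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def generate_image(e, palette_mu, palette_phi, m, n, rng=np.random.default_rng(0)):
    """Minimal sketch of the stel-epitome generative process (shift-only T, scalar pixels).

    e                       : (M, N, S) epitome; e[i, j] is a multinomial over stels (eq. 1)
    palette_mu, palette_phi : (S,) Gaussian palette Lambda_s = (mu_s, phi_s) for this image
    m, n                    : size of the image to generate
    """
    M, N, S = e.shape
    # Draw a shift T = (dy, dx) aligning image coordinates with epitome coordinates
    # (a uniform prior over shifts is assumed here for simplicity).
    dy = rng.integers(0, M - m + 1)
    dx = rng.integers(0, N - n + 1)
    window = e[dy:dy + m, dx:dx + n]                  # epitome prior under this mapping
    # Sample a stel index s_i for every pixel from the local multinomial.
    s = np.array([rng.choice(S, p=p) for p in window.reshape(-1, S)]).reshape(m, n)
    # Sample each pixel from the palette entry selected by its stel index.
    x = rng.normal(palette_mu[s], np.sqrt(palette_phi[s]))
    return x, s, (dy, dx)
```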

50 To derive the inference and learning algorithms for the model, we start with a posterior distribution model Q and the appropriate variational free energy $F = \sum Q \log \frac{Q}{P}$, where P denotes the model's joint distribution. [sent-88, score-0.084]

51 To focus on these important issues, we further simplify the problem and omit both the non-shift part of the transformations (r) and palette priors p(Λ), and for consistency, we also omit these parts of the model in the experiments. [sent-90, score-0.17]

52 A large stel epitome is difficult to learn because decoupling of all hidden variables in the posterior leads to severe local minima, with all images either mapped to a single spot in the epitome, or mapped everywhere in the epitome so that the stel distribution is flat. [sent-92, score-2.809]

53 To resolve this, we either need a very high numerical precision (and considerable patience), or the severe variational approximations need to be avoided as much as possible. [sent-94, score-0.062]

54 Setting to zero the derivatives of this free energy with respect to the variational parameters – the probabilities q(si = s), q(ℓ), and the palette mean and variance estimates $\hat{\mu}_{s,\ell}$, $\hat{\phi}_{s,\ell}$ – we obtain a set of updates for iterative inference. [sent-97, score-0.183]

55 2.1 E STEP: The following steps are iterated for a single image x on an m × n grid and for given epitome distributions e(s) on an M × N grid. [sent-99, score-0.773]

56 Index i corresponds to the epitome coordinates and masks m are used to describe which of all M × N coordinates correspond to image coordinates. [sent-100, score-0.813]

57 In the variational EM learning on a collection of images indexed by t, these steps are done for each image, yielding posterior distributions indexed by t, and then the M step is performed as described below. [sent-101, score-0.152]

58 The consequence is that low alignment probabilities are rounded down to zero, as after exponentiation and normalization their values go below numerical precision. [sent-107, score-0.085]

59 To preserve the numerical precision needed for this, we set k thresholds τk and compute $\log \tilde{q}(\ell)_k$, the distributions at the k different precision levels: $\log \tilde{q}(\ell)_k = [\log q(\ell) \ge \tau_k] \cdot \tau_k + [\log q(\ell) < \tau_k] \cdot \log q(\ell)$, where [·] is the indicator function. [sent-109, score-0.135]
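A small sketch of this thresholding step, assuming q(ℓ) is stored as a numpy array of log-probabilities; the function name and array layout are illustrative only.

```python
import numpy as np

def threshold_log_q(log_q, taus):
    """Sketch of the multi-precision representation of log q(l): at level k, values at or
    above tau_k are clipped to tau_k and the remaining values are kept (cf. the equation
    above). Returns one clipped copy of log_q per threshold."""
    return [np.where(log_q >= tau, tau, log_q) for tau in taus]
```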

60 Masks $\tilde{m}_{i,k}$ provide the total weight of the image mapping at the appropriate epitome location at the different precision levels. [sent-113, score-0.851]

61 Posterior stel distribution q(s) update at multiple precision levels:
$$\log \tilde{q}(s_i = s)_k = \mathrm{const} - \sum_{\ell \,|\, i-\ell \in C} \tilde{q}(\ell)_k \frac{x_{i-\ell}^2}{2\hat{\phi}_{s,\ell}} + \sum_{\ell \,|\, i-\ell \in C} \tilde{q}(\ell)_k \frac{\hat{\mu}_{s,\ell}\, x_{i-\ell}}{\hat{\phi}_{s,\ell}} - \sum_{\ell \,|\, i-\ell \in C} \tilde{q}(\ell)_k \frac{\hat{\mu}_{s,\ell}^2}{2\hat{\phi}_{s,\ell}} + \tilde{m}_{i,k} \cdot \log e(s_i = s). \quad (9)$$
[sent-114, score-0.726]

62 To keep track of these different precision levels, we also define a mask M so that Mi = k indicates that the k-th level of detail should be used for epitome location i. [sent-115, score-0.767]

63 We now normalize $\log \tilde{q}(s_i = s)_k$ to compute the distributions at the k different precision levels, $\tilde{q}(s_i = s)_k$, and compute q(s) by integrating the results from the different numerical precision levels as $q(s_i = s) = \sum_k [M_i = k] \cdot \tilde{q}(s_i = s)_k$. [sent-118, score-0.112]
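To make the E step concrete, a simplified, single-precision sketch of the q(s) update follows. It drops the multi-level bookkeeping (the k index and masks M) and assumes one palette estimate per stel shared across shifts, rather than the per-shift estimates of eq. (9); the names and the dense loop over shifts are choices made for readability, not the paper's implementation.

```python
import numpy as np

def update_q_s(x, q_shift, e, mu, phi):
    """Simplified (single precision level) sketch of the q(s_i) update.

    x       : (m, n) image
    q_shift : (M - m + 1, N - n + 1) posterior over 2-D shifts, sums to 1
    e       : (M, N, S) epitome of stel priors e_i(s)
    mu, phi : (S,) palette mean / variance estimates; here shared across shifts,
              unlike the per-shift estimates mu_hat[s, l], phi_hat[s, l] in eq. (9)
    returns : (M, N, S) posterior q(s_i = s) on the epitome grid
    """
    M, N, S = e.shape
    m, n = x.shape
    # Accumulate, at every epitome location i, the shift-weighted image statistics
    # sum_l q(l), sum_l q(l) x_{i-l}, sum_l q(l) x_{i-l}^2 (w plays the role of the
    # mask m~_i, the total weight of the image mapping at location i).
    w = np.zeros((M, N)); sx = np.zeros((M, N)); sxx = np.zeros((M, N))
    for (dy, dx), ql in np.ndenumerate(q_shift):
        if ql == 0.0:
            continue
        w[dy:dy + m, dx:dx + n] += ql
        sx[dy:dy + m, dx:dx + n] += ql * x
        sxx[dy:dy + m, dx:dx + n] += ql * x ** 2
    # Gaussian log-likelihood per stel plus the weighted log prior from the epitome.
    log_q = np.empty((M, N, S))
    for s in range(S):
        log_q[:, :, s] = (-sxx / (2.0 * phi[s]) + mu[s] * sx / phi[s]
                          - w * mu[s] ** 2 / (2.0 * phi[s])
                          + w * np.log(e[:, :, s] + 1e-12))
    log_q -= log_q.max(axis=2, keepdims=True)   # normalize over stels at every location
    q = np.exp(log_q)
    return q / q.sum(axis=2, keepdims=True)
```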

64 2.2 M STEP: The highest k for each epitome location, $D_i = \max_t \{M_i^t\}$, is determined over all images xt in the dataset, so that we know the appropriate precision level at which to perform summation and normalization. [sent-120, score-0.857]

65 Then the epitome update consists of:
$$e(s_i = s) = \sum_k [D_i = k] \cdot \frac{\sum_t [M_i^t = k] \cdot q^t(s_i = s)}{\sum_t [M_i^t = k]}.$$
[sent-121, score-0.69]
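A correspondingly simplified sketch of the M step, again ignoring the per-location precision levels D_i and masks M^t: the epitome is re-estimated by averaging the per-image posteriors and renormalizing over stels at every location. All names are illustrative.

```python
import numpy as np

def m_step_epitome(q_list, eps=1e-12):
    """Simplified sketch of the epitome M-step: average the per-image posteriors
    q^t(s_i = s) over the collection and renormalize over stels at every epitome
    location (the per-location precision bookkeeping of the paper is omitted).

    q_list : list of (M, N, S) posteriors, one per image t
    """
    acc = np.zeros_like(q_list[0], dtype=float)
    for q_t in q_list:
        acc += q_t
    return acc / (acc.sum(axis=2, keepdims=True) + eps)
```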

66 Figure 2 (caption): Some examples from the dataset (www. ...); panel labels: Bike, Kitchen, Car, Work office, Dining room, Outside home, Home office, Tennis field, Laptop room, Living room. [sent-122, score-0.176]

67 3 Experiments: Using a SenseCam wearable camera, we have obtained two weeks' worth of images, taken at the rate of one frame every 20 seconds during all waking hours of a human subject. [sent-126, score-0.089]

68 The resulting image dataset captures the subject’s (summer) life rather completely in the following sense: the majority of images can be assigned to one of the emergent categories (Fig. [sent-127, score-0.198]

69 2) and the same categories represent the majority of images from any time period of a couple of days. [sent-128, score-0.098]

70 This dataset also proved to be fundamental for testing stel epitomes, as the illumination and viewing angle variations are significant across images and we found that the previous approaches to scene recognition provide only modest recognition rates. [sent-130, score-0.816]

71 For the purposes of evaluation, we have manually labeled a random collection of 320 images and compared our method with other approaches on supervised and unsupervised classification. [sent-131, score-0.105]

72 We divided this reduced dataset into 10 different recurrent scenes (32 images per class); some examples are depicted in Fig. [sent-132, score-0.138]

73 In all the experiments with the reduced dataset we used an epitome area 14 times larger than the image area and five stels (S=5). [sent-134, score-0.825]

74 In supervised learning the scene labels are available during the stel epitome learning. [sent-136, score-1.38]

75 We used this information to aid both the original epitome [9] and the stel epitome, modifying the models by the addition of an observed scene class variable c in two ways: i) by linking c in the Bayesian network with e, and so learning p(e|c), and ii) by linking c with T and inferring p(T|c). [sent-137, score-2.07]

76 In the latter strategy, where we model p(T |c), we learn a single epitome, but we assume that the epitome locations are linked with certain scenes, and this mapping is learned for each epitome pixel. [sent-138, score-1.4]

77 Then, the distribution p(c|ℓ) over scene labels can be used for inference of the scene label for the test data. [sent-139, score-0.11]
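One possible estimator of this location-to-scene mapping is to accumulate each training image's shift posterior into soft counts for its class and normalize per epitome location; this estimator and all names below are assumptions made for illustration, since the exact update is not spelled out in the sentences above.

```python
import numpy as np

def learn_p_c_given_shift(q_shifts, labels, n_classes, eps=1e-12):
    """Hypothetical estimator of the location-to-scene mapping p(c | l): accumulate
    each training image's shift posterior into soft counts for its class label,
    then normalize per epitome location.

    q_shifts  : (T, L) shift posteriors p(l | x_t), one row per labeled training image
    labels    : (T,) integer scene labels c_t
    n_classes : number of scene classes
    returns   : (L, C) matrix whose rows are p(c | l)
    """
    counts = np.zeros((q_shifts.shape[1], n_classes))
    for q_l, c in zip(q_shifts, labels):
        counts[:, c] += q_l
    return counts / (counts.sum(axis=1, keepdims=True) + eps)
```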

78 For a previously unseen test image xt, recognition is achieved by computing the label posterior p(ct|xt) using $p(c_t|x_t) = \sum_{\ell} p(c|\ell) \cdot p(\ell|x_t)$. [sent-140, score-0.107]
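A minimal sketch of this recognition rule, assuming the shift posterior has been flattened over the 2-D shift grid; the function and parameter names are illustrative.

```python
import numpy as np

def classify_scene(q_shift, p_c_given_shift):
    """Sketch of the recognition rule p(c_t | x_t) = sum_l p(c | l) p(l | x_t).

    q_shift         : (L,) shift posterior p(l | x_t) for a test image (flattened grid)
    p_c_given_shift : (L, C) location-to-scene mapping, e.g. from learn_p_c_given_shift
    """
    p_c = q_shift @ p_c_given_shift
    p_c = p_c / p_c.sum()
    return p_c, int(np.argmax(p_c))
```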

79 For the above techniques that are based on topic models, representing images as spatially disorganized bags of features, the codebook of SIFT features was based on 16x16 pixel patches computed over a grid spaced by 8 pixels. [sent-143, score-0.098]

80 Table of supervised classification accuracies: Stel epitome with p(T|c): 70,06%; Stel epitome with p(e|c): 88,67%; Epitome [9] with p(T|c): 74,36%; Epitome [9] with p(e|c): 69,80%. [10] Opt.

81 We also trained both the regular epitome and the stel epitome in an unsupervised way. [sent-163, score-2.039]

82 An illustration of the resulting stel epitome is provided in Fig. [sent-164, score-1.325]

83 Each of these panels is an image ei (s) for an appropriate s. [sent-170, score-0.103]

84 On the top of the stel epitome, four enlarged epitome regions are shown to highlight panoramic reconstructions of a few classes. [sent-171, score-1.386]

85 We also show the result of averaging all images according to their mapping to the stel epitome (Fig. [sent-172, score-1.426]

86 As opposed to the stel epitome, the learned color epitome [2] has to have multiple versions of the same scene in different illumination conditions. [sent-175, score-1.434]

87 Furthermore, many different scenes tend to overlap in the color epitome, especially indoor scenes which all look equally beige. [sent-176, score-0.145]

88 In Fig. 3B we show examples of some images of different scenes mapped onto the stel epitome, whose organization is illustrated by a rendering of all images averaged into the appropriate location (similarly to the original color epitomes). [sent-178, score-0.92]

89 Note that the model automatically clusters images using the structure, and not colors, even in the face of the variation of colors present in the exemplars of the "Car" or the "Work office" classes (see also the supplemental video that illustrates the mapping dynamically). [sent-179, score-0.132]

90 The regular epitome cannot capture these invariances, and it clusters images based on overall intensity more readily than based on the structure of the scene. [sent-180, score-0.79]

91 Using the two types of unsupervised epitomes, and the known labels for the images in the training set, we assigned labels to the test set using the same classification rule explained in the previous paragraph. [sent-182, score-0.105]

92 This semi-supervised test reveals how consistent the clustering induced by epitomes is with the human labeling. [sent-183, score-0.097]

93 The stel epitome accuracy, 73,06%, outperforms the standard epitome model [9], 69,42%, with statistical significance. [sent-184, score-2.015]

94 We have also trained both types of epitomes over a real estate 35 times larger than the original image size using different random sets of 5000 images taken from the dataset. [sent-185, score-0.288]

95 The stel epitomes trained in an unsupervised way are qualitatively equivalent, in that they consistently capture around six of the most prominent scenes from Fig. [sent-186, score-0.811]

96 2, whereas the traditional epitomes tended to capture only three. [sent-187, score-0.097]

97 As a step in this direction, we provide a new dataset that contains a mix of indoor and outdoor scenes as a result of two weeks of continuous image acquisition, as well as a simple algorithm that deals with some of the invariances that have to be incorporated in a model of such data. [sent-193, score-0.23]

98 Caspi, “Capturing image structure with probabilistic index maps,” IEEE CVPR 2004, pp. [sent-224, score-0.093]

99 Frey, “Stel component analysis: modeling spatial correlation in image class structure,” IEEE CVPR 2009. [sent-234, score-0.07]

100 Tieu, “Transform invariant image decomposition with similarity templates,” NIPS 2003. [sent-258, score-0.07]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('epitome', 0.69), ('stel', 0.635), ('palette', 0.121), ('si', 0.1), ('epitomes', 0.097), ('images', 0.081), ('image', 0.07), ('panoramic', 0.061), ('scene', 0.055), ('jojic', 0.051), ('stels', 0.05), ('weeks', 0.049), ('scenes', 0.042), ('estate', 0.04), ('office', 0.04), ('alignment', 0.037), ('precision', 0.034), ('xr', 0.033), ('mi', 0.033), ('colors', 0.031), ('home', 0.03), ('epitomic', 0.03), ('pim', 0.03), ('illumination', 0.03), ('transformations', 0.029), ('indexing', 0.028), ('unsupervised', 0.024), ('color', 0.024), ('location', 0.024), ('laptop', 0.023), ('torso', 0.023), ('index', 0.023), ('shifts', 0.022), ('posterior', 0.022), ('visual', 0.021), ('pixels', 0.021), ('frey', 0.021), ('indoor', 0.021), ('masks', 0.021), ('ei', 0.02), ('parts', 0.02), ('mappings', 0.02), ('aligned', 0.02), ('exponentiation', 0.02), ('murino', 0.02), ('palettes', 0.02), ('perina', 0.02), ('photosynth', 0.02), ('sensecam', 0.02), ('stitching', 0.02), ('waking', 0.02), ('wearable', 0.02), ('mapping', 0.02), ('mapped', 0.02), ('mask', 0.019), ('invariances', 0.019), ('intensity', 0.019), ('free', 0.018), ('energy', 0.018), ('wore', 0.018), ('misalignment', 0.018), ('shadows', 0.018), ('browsing', 0.018), ('verona', 0.018), ('pixel', 0.017), ('categories', 0.017), ('room', 0.017), ('overlap', 0.016), ('cvpr', 0.016), ('coordinates', 0.016), ('levels', 0.016), ('captured', 0.016), ('transformation', 0.016), ('sift', 0.016), ('hidden', 0.016), ('camera', 0.015), ('kitchen', 0.015), ('const', 0.015), ('life', 0.015), ('patch', 0.015), ('dataset', 0.015), ('numerical', 0.015), ('indices', 0.015), ('xt', 0.015), ('measurements', 0.014), ('object', 0.014), ('car', 0.014), ('invariance', 0.014), ('templates', 0.014), ('pedestrian', 0.014), ('outdoor', 0.014), ('foreground', 0.014), ('probabilities', 0.013), ('log', 0.013), ('appropriate', 0.013), ('qualitatively', 0.013), ('segmentation', 0.013), ('screen', 0.013), ('distributions', 0.013), ('variational', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 256 nips-2010-Structural epitome: a way to summarize one’s visual experience

Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino

Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the ”structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large ”all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. 1

2 0.1794489 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures

Author: Kamiya Motwani, Nagesh Adluru, Chris Hinrichs, Andrew Alexander, Vikas Singh

Abstract: We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many Neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group level differences as a function of disease in a set of images. Typically, interactive expert guided segmentation has been the method of choice for such applications, but this is tedious for large datasets common today. To address this problem, we endow an image segmentation algorithm with “advice” encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region – as a histogram over a bag of ‘words’ (e.g., suitable feature descriptors). Now, given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain specific constraints for Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in Neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm. 1

3 0.061578974 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models

Author: Jun Zhu, Li-jia Li, Li Fei-fei, Eric P. Xing

Abstract: Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.

4 0.052981608 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

Author: Li-jia Li, Hao Su, Li Fei-fei, Eric P. Xing

Abstract: Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meanings. For high level visual tasks, such low-level image representations are potentially not enough. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging on the Object Bank representation, superior performances on high level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.

5 0.052903742 103 nips-2010-Generating more realistic images using gated MRF's

Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton

Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks. 1 Introduction and Prior Work The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF’s) provide a very general framework for modelling natural images. In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, that we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family1 could be learned for small image patches and their work was extended to high-resolution images by Roth and Black [6] who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore, it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and 1 Product of Student’s t models (without pooling) may not appear to have latent variables but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable. 1 checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 
2 in [7] which is one of the best models of highresolution images reported in the literature so far). In fact, as our experiments demonstrate the generated samples from these models are more similar to random images than to natural images! When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13]. The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap. 2 Augmenting Gated MRF’s with Mean Hidden Units A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two-pixel have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have nonzero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: The posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: Modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. 
This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-proessed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the times nearby pixels are strongly correlated. Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering 2 Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can represent better structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, overall when each image can have non-zero mean. the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices. In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. 
mPoT (and similarly mcRBM [4]) is not only better at modelling zero mean images but it can also represent images that have non zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is: E PoT (x, hc ) = i 1 [hc (1 + (Ci T x)2 ) + (1 − γ) log hc ] i i 2 (1) where x is a vectorized image patch, hc is a vector of Gamma “covariance” latent variables, C is a filter bank matrix and γ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to exp(−E PoT (x, hc )). Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to: Σc = (Cdiag(hc )C T )−1 . (2) In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean. The overall energy function becomes: E mPoT (x, hc , hm ) = E PoT (x, hc ) + E m (x, hm ) 3 (3) Figure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away form each other. Best viewed in color. where hm is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is: E m (x, hm ) = 1 T x x− 2 hm Wj T x j (4) j yielding the following conditional distribution over the input pixels: p(x|hc , hm ) = N (Σ(W hm ), Σ), Σ = (Σc + I)−1 (5) with Σc defined in eq. 2. As desired, the conditional distribution has non-zero mean2 . Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different mean. Unfortunately, replicating potentials over the image ignoring variations of mean intensity has been the leading strategy to date [6]3 . This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM. 3 Weight-Sharing Schemes By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF. Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. 
This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size. Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while 2 The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction. 3 The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables. 4 sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weightsharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation. First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation has not only high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT causing the learned filters to become “random” looking (see fig 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: p(y) = i S(yi ), where yi = Ci T x and S is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6]. Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, overall when the number of potentials is limited (see also fig. 2-F). TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. 
We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme. 4 Experiments We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistence Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is: F mPoT (x) = − log(p(x))+const. = k,i 1 1 γ log(1+ (Cik T xk )2 )+ xT x− 2 2 T log(1+exp(Wjk xk )) (6) k,j where the index k runs over spatial locations and xk is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps. We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training. We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates decreasing the learning by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance 5 Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoTTConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar correspond to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv. filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 
4), and 0.001 for the other parameters (γ of eq. 1). During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training4 . Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters have usually non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1. We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images 4 The code used in the experiments can be found at the first author’s web-page. 6 MODEL F (R) − F (T ) (104 ) KL(R G) KL(T G) KL(R G) − KL(T PoT - Conv 2.9 0.3 0.6 PoT - TConv 2.8 0.4 1.0 -0.6 mPoT - TConv 5.2 1.0 0.2 0.8 mHPoT - TConv 4.9 1.7 0.8 0.9 mcRBM - TConv 3.5 1.5 1.0 G) -0.3 0.5 Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters. that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images. 
4.1 Discriminative Experiments on Weight-Sharing Schemes In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29]. Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time. The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global. 
To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving 7 6 2.4 error rate % 5 error rate % 2.6 Global Local TConv Conv 4 3 2 1 0 2.2 Global Local 2 Conv 1.8 1000 2000 3000 4000 5000 dimensionality 6000 7000 1.6 0 8000 2 4 6 8 # flops (relative to # flops per epoch of Global model) 10 Figure 4: Experiments on MNIST using RBM’s with different weight-sharing schemes. Left: Error rate as a function of the dimensionality of the latent representation. Right: Error rate as a function of the number of operations (normalized to those needed to perform one epoch in the Global model); all models have a twice overcomplete latent representation. similar performance in discrimination. Also, Local can easily scale to larger images while Global cannot. Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM’s are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then “fine-tuned” on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM’s. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2 F). The model trained with Conv weight-sharing scheme yields an accuracy equal to 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day. 5 Conclusions and Future Work This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme. By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF’s into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs. References [1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005. 8 [2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. 
John Wiley & Sons, 2001. [3] G.E. Hinton and R. R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order boltzmann machines. In CVPR, 2010. [5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of gaussians and the statistics of natural images. In NIPS, 2000. [6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005. [7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on mrfs in low-level vision. In CVPR, 2010. [8] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. PAMI, 6:721–741, 1984. [9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In NIPS, 2003. [10] S.C. Zhu and D. Mumford. Prior learning and gibbs reaction diffusion. PAMI, pages 1236–1250, 1997. [11] S. Osindero, M. Welling, and G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006. [12] S. Osindero and G. E. Hinton. Modeling image patches with a directed hierarchy of markov random fields. In NIPS, 2008. [13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009. [14] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37:3311–3325, 1997. [15] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003. [16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007. [17] S. Roth and M. J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009. [18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010. [19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010. [20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, 2010. [21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-ofexperts. In BMCV, 2009. [22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto. [23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989. [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009. [26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996. [27] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008. [28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/. [29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005. [30] http://yann.lecun.com/exdb/mnist/. [31] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. 
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009. [32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002. [33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008. 9

6 0.051603891 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence

7 0.05081911 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata

8 0.04839481 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

9 0.046868384 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

10 0.046722639 149 nips-2010-Learning To Count Objects in Images

11 0.045575209 133 nips-2010-Kernel Descriptors for Visual Recognition

12 0.042990174 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

13 0.04261459 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

14 0.042059142 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces

15 0.039037894 153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process

16 0.037036095 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition

17 0.034680806 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression

18 0.033245854 245 nips-2010-Space-Variant Single-Image Blind Deconvolution for Removing Camera Shake

19 0.032096244 234 nips-2010-Segmentation as Maximum-Weight Independent Set

20 0.028690238 101 nips-2010-Gaussian sampling by local perturbations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.087), (1, 0.046), (2, -0.079), (3, -0.074), (4, -0.022), (5, -0.031), (6, -0.003), (7, -0.01), (8, 0.008), (9, 0.026), (10, -0.006), (11, -0.007), (12, -0.034), (13, -0.005), (14, -0.037), (15, 0.008), (16, 0.044), (17, -0.018), (18, 0.081), (19, -0.029), (20, 0.03), (21, 0.031), (22, -0.023), (23, 0.02), (24, -0.037), (25, -0.047), (26, -0.017), (27, -0.007), (28, -0.065), (29, 0.005), (30, -0.034), (31, 0.071), (32, -0.021), (33, -0.006), (34, -0.035), (35, -0.079), (36, 0.013), (37, -0.042), (38, 0.053), (39, -0.057), (40, 0.056), (41, -0.055), (42, -0.021), (43, 0.077), (44, -0.057), (45, -0.024), (46, -0.039), (47, 0.106), (48, 0.108), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89346826 256 nips-2010-Structural epitome: a way to summarize one’s visual experience

Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino

Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the ”structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large ”all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. 1

2 0.71576113 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures

Author: Kamiya Motwani, Nagesh Adluru, Chris Hinrichs, Andrew Alexander, Vikas Singh

Abstract: We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many Neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group level differences as a function of disease in a set of images. Typically, interactive expert guided segmentation has been the method of choice for such applications, but this is tedious for large datasets common today. To address this problem, we endow an image segmentation algorithm with “advice” encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region – as a histogram over a bag of ‘words’ (e.g., suitable feature descriptors). Now, given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain specific constraints for Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in Neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm. 1

3 0.63655061 234 nips-2010-Segmentation as Maximum-Weight Independent Set

Author: William Brendel, Sinisa Todorovic

Abstract: Given an ensemble of distinct, low-level segmentations of an image, our goal is to identify visually “meaningful” segments in the ensemble. Knowledge about any specific objects and surfaces present in the image is not available. The selection of image regions occupied by objects is formalized as the maximum-weight independent set (MWIS) problem. MWIS is the heaviest subset of mutually non-adjacent nodes of an attributed graph. We construct such a graph from all segments in the ensemble. Then, MWIS selects maximally distinctive segments that together partition the image. A new MWIS algorithm is presented. The algorithm seeks a solution directly in the discrete domain, instead of relaxing MWIS to a continuous problem, as common in previous work. It iteratively finds a candidate discrete solution of the Taylor series expansion of the original MWIS objective function around the previous solution. The algorithm is shown to converge to an optimum. Our empirical evaluation on the benchmark Berkeley segmentation dataset shows that the new algorithm eliminates the need for hand-picking optimal input parameters of the state-of-the-art segmenters, and outperforms their best, manually optimized results.

4 0.60779101 245 nips-2010-Space-Variant Single-Image Blind Deconvolution for Removing Camera Shake

Author: Stefan Harmeling, Hirsch Michael, Bernhard Schölkopf

Abstract: Modelling camera shake as a space-invariant convolution simplifies the problem of removing camera shake, but often insufficiently models actual motion blur, such as blur due to camera rotation and movements outside the sensor plane, or when objects in the scene have different distances to the camera. In an effort to address these limitations, (i) we introduce a taxonomy of camera shakes, (ii) we build on a recently introduced framework for space-variant filtering by Hirsch et al. and a fast algorithm for single image blind deconvolution for space-invariant filters by Cho and Lee to construct a method for blind deconvolution in the case of space-variant blur, and (iii) we present an experimental setup for evaluation that allows us to take images with real camera shake while at the same time recording the space-variant point spread function corresponding to that blur. Finally, we demonstrate that our method is able to deblur images degraded by spatially-varying blur originating from real camera shake, even without using additional motion sensor information. 1

5 0.60307282 149 nips-2010-Learning To Count Objects in Images

Author: Victor Lempitsky, Andrew Zisserman

Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1

6 0.54830664 103 nips-2010-Generating more realistic images using gated MRF's

7 0.50361454 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence

8 0.46327105 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata

9 0.42680576 267 nips-2010-The Multidimensional Wisdom of Crowds

10 0.41010481 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

11 0.40033156 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters

12 0.38343185 101 nips-2010-Gaussian sampling by local perturbations

13 0.38204494 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces

14 0.38098818 224 nips-2010-Regularized estimation of image statistics by Score Matching

15 0.36099178 1 nips-2010-(RF)^2 -- Random Forest Random Field

16 0.36050034 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

17 0.35587388 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics

18 0.35564381 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

19 0.35127586 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach

20 0.33998007 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.024), (17, 0.018), (27, 0.078), (30, 0.032), (35, 0.025), (45, 0.176), (50, 0.051), (52, 0.027), (60, 0.021), (77, 0.062), (78, 0.016), (90, 0.042), (95, 0.298)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7412473 256 nips-2010-Structural epitome: a way to summarize one’s visual experience

Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino

Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the ”structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large ”all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part. 1

2 0.71221972 168 nips-2010-Monte-Carlo Planning in Large POMDPs

Author: David Silver, Joel Veness

Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs. 1

3 0.67532659 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms

Author: Margareta Ackerman, Shai Ben-David, David Loker

Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1

4 0.58285499 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

Author: Surya Ganguli, Haim Sompolinsky

Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1

5 0.57885009 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models

Author: Ryan Kelly, Matthew Smith, Robert Kass, Tai S. Lee

Abstract: Activity of a neuron, even in the early sensory areas, is not simply a function of its local receptive field or tuning properties, but depends on global context of the stimulus, as well as the neural context. This suggests the activity of the surrounding neurons and global brain states can exert considerable influence on the activity of a neuron. In this paper we implemented an L1 regularized point process model to assess the contribution of multiple factors to the firing rate of many individual units recorded simultaneously from V1 with a 96-electrode “Utah” array. We found that the spikes of surrounding neurons indeed provide strong predictions of a neuron’s response, in addition to the neuron’s receptive field transfer function. We also found that the same spikes could be accounted for with the local field potentials, a surrogate measure of global network states. This work shows that accounting for network fluctuations can improve estimates of single trial firing rate and stimulus-response transfer functions. 1

6 0.57773161 17 nips-2010-A biologically plausible network for the computation of orientation dominance

7 0.5770148 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts

8 0.57389218 98 nips-2010-Functional form of motion priors in human motion perception

9 0.57363033 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

10 0.57350141 96 nips-2010-Fractionally Predictive Spiking Neurons

11 0.573497 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

12 0.57269311 158 nips-2010-Learning via Gaussian Herding

13 0.57228976 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

14 0.57115984 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior

15 0.57080656 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

16 0.5699718 282 nips-2010-Variable margin losses for classifier design

17 0.56950223 117 nips-2010-Identifying graph-structured activation patterns in networks

18 0.56926173 268 nips-2010-The Neural Costs of Optimal Control

19 0.5686391 155 nips-2010-Learning the context of a category

20 0.56804579 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data