nips nips2009 nips2009-6 knowledge-graph by maker-knowledge-mining

6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification


Source: pdf

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Contrast statistics of the majority of natural images conform to a Weibull distribution. [sent-15, score-0.297]

2 This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. [sent-16, score-0.342]

3 Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. [sent-17, score-0.872]

4 In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. [sent-18, score-0.323]

5 From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. [sent-19, score-1.064]

6 We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. [sent-20, score-0.475]

7 Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. [sent-22, score-0.297]

8 Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. [sent-23, score-0.226]

9 Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification. [sent-24, score-0.815]

10 There is a strong correlation between adjacent image points in terms of local features such as luminance [1]. [sent-26, score-0.248]

11 These second-order correlations decrease with distance between image points, giving rise to the typical 1/f^2 power spectra of natural images. [sent-27, score-0.359]

12 On account of this power-law characteristic, natural images comprise a very small and distinguishable subset of the space of all possible images, with specific scene categories occupying different parts of this subspace. [sent-28, score-0.448]

13 For example, white noise images can be distinguished from natural images because of their deviation from the power law statistics, while street scenes and beach scenes can be separated from each other on the basis of differences in ensemble power spectra [2]. [sent-29, score-0.688]

14 Thus, the power spectra of natural images contain an indeterminate amount of the visual gist of these images. [sent-30, score-0.459]
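
As a side note on this power-spectrum argument, the 1/f^2 behaviour can be checked directly on any grayscale image. The sketch below is a generic illustration, not taken from the paper; it assumes a 2-D float array img and estimates the radially averaged power spectrum and its power-law exponent.

    import numpy as np

    def radial_power_spectrum(img):
        """Radially averaged power spectrum of a 2-D grayscale image."""
        f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
        power = np.abs(f) ** 2
        h, w = img.shape
        y, x = np.indices((h, w))
        r = np.hypot(y - h // 2, x - w // 2).astype(int)
        # Mean power in integer-radius bins; the DC bin is skipped below.
        counts = np.bincount(r.ravel())
        radial = np.bincount(r.ravel(), weights=power.ravel()) / np.maximum(counts, 1)
        freqs = np.arange(1, min(h, w) // 2)
        return freqs, radial[1:min(h, w) // 2]

    def power_law_exponent(img):
        """Slope of log power vs. log frequency (around -2 for natural images)."""
        freqs, spec = radial_power_spectrum(img)
        slope, _intercept = np.polyfit(np.log(freqs), np.log(spec), 1)
        return slope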

15 The similarity structure among nearby image points, however, represents only part of the statistical structure in natural images. [sent-31, score-0.307]

16 There are also higher-order correlations, which introduce structure in the phase spectra of natural images. [sent-32, score-0.158]

17 This structure is assumed to carry perceptually important image features such as edges and has been measured in terms of kurtosis in the contrast distribution of natural images [3, 4, 5]. [sent-33, score-0.829]

18 Geusebroek and Smeulders [6] showed that the two-parameter Weibull distribution adequately captures the variance and kurtosis in the contrast distribution of the majority of natural images. [sent-34, score-0.376]

19 In fact, the two parameters of the Weibull contrast distribution turn out to organize the space of all possible natural scenes in a perceptually meaningful manner [7] and thus are likely to provide additional information about a scene's visual gist. [sent-35, score-0.486]

20 [7] have further shown that the two parameters of the Weibull contrast distribution match biologically realistic computations of Lateral Geniculate Nucleus (LGN) cells. [sent-37, score-0.297]

21 Specifically, they simulated X-cell responses by filtering images with a difference of Gaussians (DoG), rectifying the filtered images and transforming the pixel values of the resulting images with a contrast gain function adequate for P-cells. [sent-38, score-1.012]

22 To simulate Y-cell responses, the rectified images were passed through a Gaussian smoothing function and resulting pixel values were subsequently transformed with a contrast gain function adequate for M-cells. [sent-39, score-0.594]
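
A rough sketch of this X-/Y-cell simulation is given below. It is a generic reconstruction from the description above, not the authors' code; the DoG sizes, the pooling width and the P-/M-cell semi-saturation constants are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def contrast_gain(response, semi_saturation):
        # Divisive gain: response / (response + semi-saturation constant).
        return response / (response + semi_saturation)

    def simulate_xy_sums(img, sigma_center=1.0, surround_ratio=1.6,
                         sigma_pool=3.0, s_p=0.1, s_m=0.02):
        """Approximate summed X- and Y-cell responses for a grayscale float image."""
        # Difference-of-Gaussians filtering followed by rectification.
        dog = gaussian_filter(img, sigma_center) - gaussian_filter(img, sigma_center * surround_ratio)
        rectified = np.abs(dog)
        # X-cell pathway: rectified DoG output through a P-cell-like gain (constant s_p).
        x_resp = contrast_gain(rectified, s_p)
        # Y-cell pathway: extra Gaussian pooling, then an M-cell-like gain (constant s_m).
        y_resp = contrast_gain(gaussian_filter(rectified, sigma_pool), s_m)
        return x_resp.sum(), y_resp.sum()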

23 The sum of the resulting X-cell responses turned out to correlate highly with one Weibull parameter (r=0. [sent-40, score-0.518]

24 95), whereas the sum of the resulting Y-cell responses correlated highly with the other Weibull parameter (r=0. [sent-41, score-0.141]

25 Moreover, the two Weibull parameters correlated highly with EEG activity (r^2 = 0. [sent-43, score-0.496]

26 [7] show that our brain is capable of approximating the Weibull contrast distribution of an image on the basis of filters that are biologically realistic in shape, sensitivity, and size. [sent-46, score-1.066]

27 Here we hypothesized that if Weibull contrast distributions of natural images carry perceptually important information, a neural response model based on the Weibull contrast distribution will predict brain responses to brief flashes of natural images. [sent-47, score-1.641]

28 We tested this hypothesis with two experiments in which we rapidly presented a large set of natural or artificial images to multiple subjects while measuring EEG activity across the entire cortex. [sent-48, score-0.498]

29 In each experiment, we constructed a neural response model from the Weibull statistics of the presented images and corresponding EEG data, which we then applied to predict EEG responses to a new collection of natural or artificial images. [sent-49, score-0.644]

30 To validate the constructed neural response models, we used the approach of Kay et al. [sent-50, score-0.16]

31 [8]: predicted and measured EEG responses were compared to determine whether the observed image was correctly identified. [sent-51, score-0.41]

32 2 Methods We first describe how we filter images locally with a set of biologically-realistic filters. [sent-52, score-0.277]

33 Then we address a local contrast response selection mechanism with which we construct a contrast magnitude map for a given input image (a detailed description is in submission [12]). [sent-53, score-0.909]

34 Subsequently, Weibull contrast statistics are estimated from such maps and the relation between image statistics and neural activity modeled. [sent-54, score-0.485]

35 The section ends with an explanation of a performance measure for EEG-based image identification. [sent-55, score-0.175]

36 1 Local contrast values in natural images As in [7], we use contrast filters whose spatial characteristics and contrast response properties closely mirror the well-known receptive-field characteristics of LGN neurons [9]. [sent-57, score-1.216]

37 Specifically, we use a bank of second-order Gaussian derivative filters that span multiple octaves in spatial scale, that have peak sensitivity approximately inverse to filter size and that have contrast gain properties independent of size. [sent-58, score-0.875]

38 We represent contrast gain using an established non-linear response model that divides input contrast by the sum of the input and a semi-saturation constant [10]. [sent-59, score-0.68]

39 In this model a low value of the semi-saturation parameter indicates high non-linear contrast gain whereas higher values result in a linear mapping and thus will not lead to saturation. [sent-60, score-0.332]

40 Given an image, we process each image location with a bank of 5 contrast filters covering 5 octaves in spatial scale and, subsequently, subject the output of each scale-tuned filter to 5 different gain controls (5 semi-saturation values). [sent-61, score-1.097]

41 This results, for each image location, in 25 contrast response values. [sent-62, score-0.523]
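
A sketch of this 25-channel response stack is shown below, assuming a grayscale float image. The scale and semi-saturation values are placeholders, and combining the two second-order Gaussian derivatives into a single contrast value is a choice made here, not a detail given in the text.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def contrast_response_stack(img, scales=(1.5, 3, 6, 12, 24),
                                semi_saturations=(0.05, 0.1, 0.2, 0.4, 0.8)):
        """Return a (n_scales, n_gains, H, W) array of gain-controlled contrast responses."""
        responses = np.empty((len(scales), len(semi_saturations)) + img.shape)
        for i, sigma in enumerate(scales):
            # Second-order Gaussian derivative filtering at this spatial scale.
            dxx = gaussian_filter(img, sigma, order=(0, 2))
            dyy = gaussian_filter(img, sigma, order=(2, 0))
            contrast = np.sqrt(dxx ** 2 + dyy ** 2)
            for j, s in enumerate(semi_saturations):
                # Divisive contrast gain with semi-saturation constant s.
                responses[i, j] = contrast / (contrast + s)
        return responses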

42 We applied each of the 5 scale-specific filters, combined with each of the 5 contrast gain controls, to 800 natural images. [sent-63, score-0.421]

43 Figure 1 shows average responses over all image locations. [sent-64, score-0.293]

44 The average response decreases exponentially with scale owing to the peak sensitivity of the filters, which is inversely related to spatial scale. [sent-66, score-0.181]

45 That contrast also decreases with semi-saturation is explained by the fact that the amount of contrast suppression is proportional to the semi-saturation value. [sent-67, score-0.454]

46 From these summary statistics it follows that, although natural image contrast varies considerably within and across scale and contrast gain, the vast majority of natural image contrasts falls above a lower threshold. [sent-68, score-1.204]

47 It is reasonable to assume that the LGN considers contrast below this statistical threshold as noise and only processes contrasts above it, i. [sent-69, score-0.303]

48 Figure 1: Approximation of the typical range of contrasts generated by LGN neurons tuned to different spatial frequencies (5 octave scales) and with different contrast gain properties (5 semi-saturation constants). [sent-85, score-0.567]

49 Shown are the average local contrast (dark gray), plus and minus two standard deviations, in the gray-level (left), blue-yellow (middle) and red-green (right) color components of 800 natural images. [sent-86, score-0.413]

50 2 Natural image statistics-based selection of unique local image contrast values What spatial scale and contrast gain does the LGN use to process local image contrast? [sent-88, score-1.307]

51 It is unlikely that the LGN (linearly) integrates the output of a population of spatially overlapping filters to determine local image contrast, as this would make it sensitive to receptive field clutter [11]. [sent-89, score-0.735]

52 Here we start from the view that the LGN aims to minimize receptive-field clutter by selecting a single output from a population of scale- and gain-specific contrast filters [12]. [sent-90, score-0.696]

53 Specifically, in order to determine contrast at an image location, we apply the smallest filter whose contrast output is boosted above what can be expected to be noise for that particular filter. [sent-91, score-0.726]

54 We define local contrast as the amount of contrast exceeding the noise threshold, which for a given scale and gain is set here to half the standard deviation of contrasts in 800 natural images (see figure 1). [sent-92, score-1.047]
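
A minimal sketch of this selection rule follows, assuming the (scales x gains x H x W) response stack from the previous sketch and per-filter noise thresholds precomputed over an image ensemble. How the five gain channels are combined at a given scale is an assumption made here.

    import numpy as np

    def contrast_magnitude_map(responses, noise_thresholds):
        """Select, per pixel, the smallest scale whose response exceeds its noise threshold.

        responses:        (n_scales, n_gains, H, W) contrast responses, smallest scale first.
        noise_thresholds: (n_scales, n_gains) thresholds (e.g. half the std over 800 images).
        """
        n_scales, n_gains, h, w = responses.shape
        # Contrast above threshold; non-positive values are treated as noise.
        above = responses - noise_thresholds[:, :, None, None]
        magnitude = np.zeros((h, w))
        assigned = np.zeros((h, w), dtype=bool)
        for i in range(n_scales):              # walk from the smallest filter upwards
            # Strongest supra-threshold response over the gain channels at this scale.
            candidate = above[i].max(axis=0)
            select = (~assigned) & (candidate > 0)
            magnitude[select] = candidate[select]
            assigned |= select
        return magnitude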

55 This contrast response selection mechanism produces a contrast magnitude map in ways similar to the scale selection model in [13]. [sent-93, score-0.769]

56 We apply the local contrast selection mechanism separately to the individual color components of an image. [sent-94, score-0.403]

57 From a single color image, the three color components are extracted using the Gaussian color model [14], resulting in gray-scale, blue-yellow and red-green image representations. [sent-95, score-0.376]

58 Each of these representations is convolved with the 25 scale- and gain-specific contrast filters and subsequently subjected to our local contrast selection mechanism. [sent-96, score-0.966]

59 For each color component a dedicated scale- and gain-dependent noise threshold is used (see figure 1). [sent-97, score-0.237]

60 As a result, for each color image we get three contrast magnitude maps, which we linearly sum to arrive at a single contrast magnitude map. [sent-98, score-0.764]
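
The colour pipeline can be sketched as below. The linear RGB-to-opponent matrix uses coefficients commonly quoted for the Gaussian colour model; treat the exact values, and the reuse of contrast_response_stack and contrast_magnitude_map from the sketches above, as assumptions rather than the paper's implementation.

    import numpy as np

    # Approximate Gaussian colour model transform: rows give the intensity,
    # blue-yellow and red-green components (coefficients are approximate).
    GAUSSIAN_COLOR = np.array([[0.06, 0.63, 0.27],
                               [0.30, 0.04, -0.35],
                               [0.34, -0.60, 0.17]])

    def combined_contrast_map(rgb, thresholds_per_component, scales, semi_saturations):
        """Sum the contrast magnitude maps of the three opponent colour components."""
        # rgb: (H, W, 3) float image; components: (3, H, W).
        components = np.tensordot(rgb, GAUSSIAN_COLOR, axes=([2], [1])).transpose(2, 0, 1)
        total = np.zeros(rgb.shape[:2])
        for comp, thresholds in zip(components, thresholds_per_component):
            stack = contrast_response_stack(comp, scales, semi_saturations)   # sketch above
            total += contrast_magnitude_map(stack, thresholds)                # sketch above
        return total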

61 3 Weibull statistics of local image contrast The contrast magnitude map of an image is summarized in a histogram, representing the distribution of local contrast values of that image. [sent-100, score-1.15]

62 Note that the histogram does not preserve information about spatial structure in the contrast magnitude map: scrambling the contrast magnitude map will not affect the histogram. [sent-101, score-0.661]

63 We subsequently fit a three-parameter Weibull distribution to the contrast histogram. [sent-102, score-0.307]

64 The three-parameter Weibull distribution is given by f(x) = c exp(-((x - μ)/β)^γ) (1), with origin μ, scale parameter β and shape parameter γ. The parameters of this distribution are indicative of the spatial structure in a natural scene (see figure 2) and can be put in a biologically plausible framework [7]. [sent-103, score-0.719]

65 Hence, it varies roughly with the variation in local image contrasts. [sent-105, score-0.253]

66 Figure 2: Two arbitrary natural images from the Corel Photo Library with varying degrees of detail and varying degrees of scene clutter. [sent-123, score-0.365]

67 The image gradient at each image location shows the contrast strength. [sent-127, score-0.59]

68 The scale and shape parameters of the Weibull distribution are estimated from the fit to the histogram by maximum likelihood estimation. [sent-129, score-0.196]
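
A simple way to obtain such estimates is scipy's maximum-likelihood Weibull fit, sketched below. Note that scipy's weibull_min is the standard three-parameter Weibull rather than the exact form of equation (1), so the fitted scale and shape are only analogues of the paper's β and γ.

    import numpy as np
    from scipy.stats import weibull_min

    def weibull_parameters(contrast_map):
        """ML fit of a three-parameter Weibull to the local-contrast values."""
        values = contrast_map.ravel()
        values = values[values > 0]                    # non-zero local contrasts only
        shape, loc, scale = weibull_min.fit(values)    # gamma-like, origin, beta-like
        return scale, shape                            # (beta, gamma) analogues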

69 4 Model Estimation We use EEG response signals from C channels (electrodes) covering the entire cortex to develop a Weibull response model that predicts neuronal responses to natural images. [sent-131, score-0.938]

70 EEG signals are measured for S subjects watching N natural images. [sent-132, score-0.235]

71 We average these signals across subjects to obtain a more robust response signal per channel and per image. [sent-133, score-0.329]

72 This results in an N x C matrix F(t) of response signals fnc(t). [sent-134, score-0.173]

73 We construct a linear Weibull response model for each channel separately. [sent-135, score-0.602]

74 X = [β1, ..., βN; γ1, ..., γN]^T and the EEG response fc(t) = [f1c(t), ..., fNc(t)]^T. [sent-141, score-0.135]

75 The values βn, γn are the Weibull parameters of image n and fnc is the across-subject average response to that image at channel c. [sent-145, score-1.004]

76 Weibull response model estimation for channel c then reduces to solving fc(t) = X wc(t) + ε(t) (2), where wc(t) is a 2 x 1 vector of regression functions and ε(t) = [ε1(t), ..., εN(t)]^T. [sent-146, score-0.602]

77 The estimated regression function provides the best estimate of fc(t) in the least-squares sense: ŵc(t) = (X^T X)^-1 X^T fc(t) (4). We use ŵc(t) to predict the EEG responses to a new set of M images represented by their Weibull distribution. [sent-153, score-0.349]

78 The EEG responses to these new images are predicted using the Weibull response model: ĝc(t) = Y ŵc(t) (5), where the M x 2 data matrix Y contains the two Weibull parameters for each of the new images and the M-vector of functions ĝc(t) denotes the predicted neural responses for channel c. [sent-154, score-1.294]
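
Equations (2)-(5) amount to ordinary least squares per channel and time point, as sketched below. The array layout (images x channels x time) and the absence of an intercept or standardisation step are assumptions made for brevity.

    import numpy as np

    def fit_weibull_response_model(X, F):
        """Least-squares regression weights per channel and time point.

        X: (N, 2) Weibull parameters (beta, gamma) of the training images.
        F: (N, C, T) across-subject average EEG responses.
        Returns W of shape (2, C, T) with w_c(t) in W[:, c, :].
        """
        n, c, t = F.shape
        W, *_ = np.linalg.lstsq(X, F.reshape(n, c * t), rcond=None)
        return W.reshape(2, c, t)

    def predict_eeg(Y, W):
        """Predicted responses for M new images: Y w_c(t), returned as shape (M, C, T)."""
        two, c, t = W.shape
        return (Y @ W.reshape(2, c * t)).reshape(-1, c, t)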

79 5 Image Identification How well does the Weibull response model predict EEG responses to natural images? [sent-156, score-0.772]

80 Given a set of M new images and their Weibull parameters Y, the Weibull response model provides the EEG prediction ĝc(t). [sent-158, score-0.312]

81 The match between the prediction ĝc(t) and the true, measured EEG activity gc(t) = [gc1(t), ..., gcM(t)] is quantified as follows. [sent-159, score-0.137]

82 More specifically, an M x M similarity matrix S is constructed, where each element contains the Pearson correlation coefficient r between the measured gcm(t) and the predicted ĝcm(t) response. [sent-163, score-0.202]

83 The image whose predicted activity pattern is most correlated with the measured activity pattern is selected. [sent-165, score-0.436]

84 Hence, the square of the correlation coefficient, r^2, rather than r itself is used as a measure of similarity between true and predicted responses. [sent-168, score-0.161]
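
The identification step can be sketched as follows, assuming measured and predicted response arrays of shape (M, C, T). Summing the squared correlation over channels is an assumption about how channels are pooled; the text does not spell this out.

    import numpy as np

    def identify_images(measured, predicted):
        """Return, for each measured response, the index of the best-matching prediction."""
        m = measured.shape[0]
        similarity = np.zeros((m, m))
        for i in range(m):                     # measured image i
            for j in range(m):                 # candidate (predicted) image j
                r2 = 0.0
                for c in range(measured.shape[1]):
                    r = np.corrcoef(measured[i, c], predicted[j, c])[0, 1]
                    r2 += r ** 2               # squared Pearson correlation per channel
                similarity[i, j] = r2
        return similarity.argmax(axis=1)       # identified image per measured response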

85 1 Experiments and Results Stimulus and EEG Data In our experiments we used 800 color images with a resolution of 345 x 217 pixels and a bit-depth of 24. [sent-170, score-0.271]

86 Of these, 400 were pictures of animals in their natural habitat and 400 pictures of natural landscapes, city scenes, indoor scenes and man-made objects. [sent-171, score-0.304]

87 These images were taken from a larger set of images used in Fabre-Thorpe [16]. [sent-172, score-0.354]

88 This subset of images was reasonably balanced in terms of Michelson contrast, spatial frequency and orientation properties. [sent-173, score-0.247]

89 The Weibull properties of these images nevertheless covered a wide range of real-world images. [sent-174, score-0.177]

90 The images were presented to 32 subjects on a 19" Iiyama monitor with a resolution of 1024 x 768 pixels and a frame rate of 100 Hz. [sent-176, score-0.271]

91 1 Hz (12 dB/octave) and the pre-stimulus baseline activity was taken between -100 and 0 ms relative to stimulus onset. [sent-183, score-0.159]

92 Trials were averaged across subjects per individual stimulus, resulting in 800 averages, each based on 20 to 32 single-subject responses per image. [sent-184, score-0.182]
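
As a sketch of this preprocessing step: given an epochs array of shape (trials, channels, time), a sample-time vector in seconds and a stimulus label per trial (all names are illustrative), baseline correction and per-stimulus averaging can be written as below.

    import numpy as np

    def baseline_correct_and_average(epochs, times, stimulus_ids):
        """Subtract the mean of the -100 to 0 ms window, then average trials per stimulus."""
        baseline = (times >= -0.1) & (times <= 0.0)
        corrected = epochs - epochs[:, :, baseline].mean(axis=2, keepdims=True)
        averages = {}
        for stim in np.unique(stimulus_ids):
            averages[stim] = corrected[stimulus_ids == stim].mean(axis=0)
        return averages   # stimulus id -> (channels, time) average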

93 Two banks of Gaussian second-order derivative filters were used to determine image contrast for each image location. [sent-187, score-0.796]

94 The first set consisted of filters with octave spatial scales 1. [sent-188, score-0.307]

95 This set was used to determine the Weibull scale parameter β. [sent-191, score-0.476]

96 The other filter bank, with scales 3, 6, 12, 24, 48, was used for the estimation of the Weibull shape parameter γ. [sent-192, score-0.537]

97 The spatial properties of the two sets were determined experimentally and roughly correspond to receptive field sizes of small X and large Y Ganglion cells in the early visual system of the human brain [18]. [sent-193, score-0.201]

98 6 to cover the spectrum from linear to non-linear contrast gain control in the LGN. [sent-196, score-0.332]

99 We repeated the same experiment 50 times, each time randomly selecting 700 images for model estimation and 100 images for image identification. [sent-198, score-0.529]

100 Performance was measured in terms of the percentage of correctly identified images for each of the 50 experiments. [sent-199, score-0.218]
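
The repeated random-split evaluation can be sketched as below, reusing the fitting, prediction and identification sketches given earlier; the array names and the fixed random seed are assumptions.

    import numpy as np

    def evaluate(params, eeg, n_runs=50, n_test=100, seed=0):
        """Percentage of correctly identified images over repeated random 700/100 splits.

        params: (800, 2) Weibull parameters per image; eeg: (800, C, T) averaged responses.
        """
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_runs):
            order = rng.permutation(len(params))
            test, train = order[:n_test], order[n_test:]
            W = fit_weibull_response_model(params[train], eeg[train])   # sketch above
            predicted = predict_eeg(params[test], W)                    # sketch above
            identified = identify_images(eeg[test], predicted)          # sketch above
            scores.append(100.0 * np.mean(identified == np.arange(n_test)))
        return np.mean(scores)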


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('weibull', 0.428), ('bull', 0.4), ('eeg', 0.323), ('lgn', 0.24), ('wei', 0.216), ('contrast', 0.213), ('filters', 0.184), ('images', 0.177), ('image', 0.175), ('response', 0.135), ('gain', 0.119), ('responses', 0.118), ('filter', 0.1), ('scene', 0.099), ('contrasts', 0.09), ('natural', 0.089), ('scholte', 0.08), ('ge', 0.076), ('activity', 0.073), ('spatial', 0.07), ('artificial', 0.07), ('identification', 0.07), ('spectra', 0.069), ('subjects', 0.067), ('channel', 0.067), ('specifically', 0.067), ('color', 0.067), ('perceptually', 0.064), ('scenes', 0.062), ('biologically', 0.061), ('amsterdam', 0.06), ('specific', 0.06), ('netherlands', 0.057), ('subsequently', 0.054), ('biosemi', 0.053), ('fect', 0.053), ('flashes', 0.053), ('octave', 0.053), ('octaves', 0.053), ('smeulders', 0.053), ('bank', 0.052), ('predicted', 0.051), ('scale', 0.051), ('magnitude', 0.048), ('stimulus', 0.048), ('gj', 0.044), ('histogram', 0.044), ('local', 0.044), ('similarity', 0.043), ('kurtosis', 0.043), ('rapid', 0.041), ('measured', 0.041), ('fit', 0.04), ('ms', 0.038), ('receptive', 0.038), ('signals', 0.038), ('coefficient', 0.038), ('shape', 0.037), ('visual', 0.035), ('gist', 0.035), ('varies', 0.034), ('sensitivity', 0.034), ('brain', 0.033), ('pictures', 0.032), ('adequate', 0.031), ('clutter', 0.031), ('majority', 0.031), ('predict', 0.03), ('subject', 0.03), ('averages', 0.029), ('correlation', 0.029), ('mechanism', 0.028), ('selection', 0.028), ('amount', 0.028), ('carry', 0.027), ('pixels', 0.027), ('location', 0.027), ('power', 0.026), ('peak', 0.026), ('constructed', 0.025), ('determine', 0.025), ('map', 0.025), ('field', 0.025), ('estimated', 0.024), ('derivative', 0.024), ('correlated', 0.023), ('occupying', 0.023), ('define', 0.023), ('eagle', 0.023), ('geusebroek', 0.023), ('mirroring', 0.023), ('nucleus', 0.023), ('roughness', 0.023), ('tum', 0.023), ('individual', 0.023), ('covering', 0.023), ('match', 0.023), ('across', 0.022), ('considerable', 0.022), ('neurons', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

2 0.17885807 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.

3 0.14594319 211 nips-2009-Segmenting Scenes by Matching Image Composites

Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.

4 0.1067018 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time-consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes of calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject-independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

5 0.088316537 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells.

6 0.087708622 137 nips-2009-Learning transport operators for image manifolds

7 0.085359275 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

8 0.084039129 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

9 0.08337678 246 nips-2009-Time-Varying Dynamic Bayesian Networks

10 0.081243753 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

11 0.081132412 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

12 0.080348477 201 nips-2009-Region-based Segmentation and Object Detection

13 0.079381786 96 nips-2009-Filtering Abstract Senses From Image Search Results

14 0.073813662 164 nips-2009-No evidence for active sparsification in the visual cortex

15 0.068961717 133 nips-2009-Learning models of object structure

16 0.068667106 175 nips-2009-Occlusive Components Analysis

17 0.067889601 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

18 0.067219816 88 nips-2009-Extending Phase Mechanism to Differential Motion Opponency for Motion Pop-out

19 0.067172088 43 nips-2009-Bayesian estimation of orientation preference maps

20 0.0655251 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.174), (1, -0.169), (2, -0.031), (3, 0.069), (4, -0.012), (5, 0.16), (6, 0.022), (7, -0.02), (8, 0.088), (9, -0.033), (10, -0.019), (11, 0.025), (12, 0.018), (13, 0.05), (14, 0.042), (15, -0.007), (16, -0.025), (17, 0.075), (18, -0.052), (19, 0.085), (20, 0.165), (21, 0.043), (22, -0.031), (23, -0.113), (24, 0.068), (25, 0.097), (26, -0.124), (27, 0.086), (28, -0.109), (29, -0.006), (30, 0.043), (31, 0.022), (32, 0.027), (33, -0.089), (34, 0.021), (35, 0.04), (36, -0.023), (37, 0.035), (38, -0.101), (39, 0.134), (40, 0.001), (41, 0.036), (42, -0.038), (43, -0.031), (44, -0.018), (45, -0.029), (46, -0.121), (47, -0.119), (48, -0.05), (49, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94136578 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

2 0.59850001 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.

3 0.5911001 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time-consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes of calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject-independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

4 0.54782426 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

Author: Ruben Coen-cagli, Peter Dayan, Odelia Schwartz

Abstract: A central hypothesis about early visual processing is that it represents inputs in a coordinate system matched to the statistics of natural scenes. Simple versions of this lead to Gabor–like receptive fields and divisive gain modulation from local surrounds; these have led to influential neural and psychological models of visual processing. However, these accounts are based on an incomplete view of the visual context surrounding each point. Here, we consider an approximate model of linear and non–linear correlations between the responses of spatially distributed Gaborlike receptive fields, which, when trained on an ensemble of natural scenes, unifies a range of spatial context effects. The full model accounts for neural surround data in primary visual cortex (V1), provides a statistical foundation for perceptual phenomena associated with Li’s (2002) hypothesis that V1 builds a saliency map, and fits data on the tilt illusion. 1

5 0.52832025 211 nips-2009-Segmenting Scenes by Matching Image Composites

Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.

6 0.48500994 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors

7 0.4727037 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis

8 0.4720304 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

9 0.47126967 175 nips-2009-Occlusive Components Analysis

10 0.46593499 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

11 0.4476352 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

12 0.44656622 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

13 0.44516316 137 nips-2009-Learning transport operators for image manifolds

14 0.43250817 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

15 0.423902 164 nips-2009-No evidence for active sparsification in the visual cortex

16 0.41694543 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

17 0.41043317 138 nips-2009-Learning with Compressible Priors

18 0.4030804 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

19 0.40128389 96 nips-2009-Filtering Abstract Senses From Image Search Results

20 0.39196706 246 nips-2009-Time-Varying Dynamic Bayesian Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.017), (25, 0.067), (31, 0.011), (35, 0.025), (36, 0.038), (39, 0.04), (58, 0.06), (71, 0.027), (81, 0.014), (86, 0.602)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98096204 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

2 0.98056769 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject is learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V, the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important.
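The random-walk kernels the abstract refers to are built from the normalized graph Laplacian as C ∝ (I − L/a)^p. A minimal sketch of this p-step kernel on a toy graph is given below; the graph and the values of p and a are illustrative, not taken from the paper's experiments.

    import numpy as np
    from numpy.linalg import matrix_power

    def p_step_kernel(A, p=10, a=2.0):
        """A: symmetric 0/1 adjacency matrix with no self-loops."""
        d = A.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian
        C = matrix_power(np.eye(len(A)) - L / a, p)        # p-step random-walk kernel
        return C / np.mean(np.diag(C))                     # average prior variance = 1

    # Example: a 6-node ring graph.
    A = np.zeros((6, 6))
    for i in range(6):
        A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1
    print(np.round(p_step_kernel(A, p=10, a=2.0), 3))

For a ≥ 2 the matrix I − L/a is positive semidefinite, since the eigenvalues of the normalized Laplacian lie in [0, 2].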

3 0.95720619 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

Author: Xiao-ming Wu, Anthony M. So, Zhenguo Li, Shuo-yen R. Li

Abstract: Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order m^2.5, where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.
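To make the abstract concrete, here is a minimal sketch of the kind of kernel-learning SDP it refers to: the plain Maximum Variance Unfolding program, written directly as an SDP and solved with cvxpy. This is the baseline formulation whose cost the SQLP reformulation is meant to reduce, not the paper's own method; the toy data, neighbourhood size, and choice of cvxpy are assumptions.

    import numpy as np
    import cvxpy as cp

    # Toy data: points along a short helix, so the 2-nearest-neighbour graph is connected.
    t = np.linspace(0, 3, 12)
    X = np.column_stack([np.cos(t), np.sin(t), t])
    n = len(X)

    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    pairs = {(i, int(j)) for i in range(n) for j in np.argsort(D2[i])[1:3]}  # 2-NN pairs

    K = cp.Variable((n, n), PSD=True)
    constraints = [cp.sum(K) == 0]                         # centering constraint
    for i, j in pairs:                                     # preserve local squared distances
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == D2[i, j])

    prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
    prob.solve()
    print("maximized variance (trace of K):", prob.value)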

4 0.95513767 190 nips-2009-Polynomial Semantic Indexing

Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.
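A minimal sketch of the scoring function such a model computes is given below: a degree-2 polynomial score q^T W d over word features with a low-rank, diagonal-preserving parameterization. The specific form W = UᵀV + I, the dimensions, and the random parameters are assumptions for illustration; in the paper the low-rank factors are trained discriminatively on ranking data.

    import numpy as np

    vocab_size, rank = 10_000, 50
    rng = np.random.default_rng(0)
    U = 0.01 * rng.standard_normal((rank, vocab_size))   # learned in practice
    V = 0.01 * rng.standard_normal((rank, vocab_size))   # learned in practice

    def ranking_score(q, d):
        """q, d: tfidf/count vectors of length vocab_size."""
        return float(q @ d + (U @ q) @ (V @ d))   # diagonal (identity) term + low-rank term

    # Toy query/document bag-of-words vectors.
    q = np.zeros(vocab_size); q[[3, 17, 256]] = 1.0
    d = np.zeros(vocab_size); d[[17, 999]] = 1.0
    print(ranking_score(q, d))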

5 0.93581551 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.

6 0.92654961 176 nips-2009-On Invariance in Hierarchical Models

7 0.80239022 151 nips-2009-Measuring Invariances in Deep Networks

8 0.7502476 119 nips-2009-Kernel Methods for Deep Learning

9 0.71838164 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

10 0.69802243 137 nips-2009-Learning transport operators for image manifolds

11 0.69330394 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

12 0.68825114 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

13 0.68576229 196 nips-2009-Quantification and the language of thought

14 0.68363595 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

15 0.67679 95 nips-2009-Fast subtree kernels on graphs

16 0.66466576 104 nips-2009-Group Sparse Coding

17 0.66087151 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

18 0.65542942 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

19 0.6526525 2 nips-2009-3D Object Recognition with Deep Belief Nets

20 0.64176631 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank