
253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

Source: pdf

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Unsupervised feature learning for audio classification using convolutional deep belief networks. Honglak Lee, Yan Largman, Peter Pham, Andrew Y. Ng. Computer Science Department, Stanford University, Stanford, CA 94305. [sent-1, score-0.592]

2 Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. [sent-2, score-0.306]

3 However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. [sent-3, score-0.249]

4 In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. [sent-4, score-0.726]

5 In the case of speech data, we show that the learned features correspond to phones/phonemes. [sent-5, score-0.191]

6 In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. [sent-6, score-0.476]

7 We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks. [sent-7, score-0.394]

8 Previous work [1, 2] revealed that learning a sparse representation of auditory signals leads to filters that closely correspond to those of neurons in early audio processing in mammals. [sent-9, score-0.214]

9 [3] proposed an efficient sparse coding algorithm for auditory signals and demonstrated its usefulness in audio classification tasks. [sent-12, score-0.214]

10 These “deep learning” algorithms try to learn simple features in the lower layers and more complex features in the higher layers. [sent-16, score-0.197]

11 The deep belief network [4] is a generative probabilistic model composed of one visible (observed) layer and many hidden layers. [sent-18, score-0.453]

12 Each hidden layer unit learns a statistical relationship between the units in the lower layer; the higher layer representations tend to become more complex. [sent-19, score-0.336]

13 The deep belief network can be efficiently trained using greedy layerwise training, in which the hidden layers are trained one at a time in a bottom-up fashion [4]. [sent-20, score-0.431]

14 Recently, convolutional deep belief networks [9] have been developed to scale up the algorithm to high-dimensional data. [sent-21, score-0.42]

15 Similar to deep belief networks, convolutional deep belief networks can be trained in a greedy, bottom-up fashion. [sent-22, score-0.722]

16 In this paper, we will apply convolutional deep belief networks to unlabeled auditory data (such as speech and music) and evaluate the learned feature representations on several audio classification tasks. [sent-25, score-0.875]

17 In addition, our feature representations outperform other baseline features (spectrogram and MFCC) for multiple audio classification tasks. [sent-27, score-0.338]

18 For the phone classification task, MFCC features can be augmented with our features to improve accuracy. [sent-29, score-0.319]

19 We also show for certain tasks that the second-layer features produce higher accuracy than the first-layer features, which justifies the use of deep learning approaches for audio classification. [sent-30, score-0.49]

20 Finally, we show that our features give better performance in comparison to other baseline features for music classification tasks. [sent-31, score-0.27]

21 In our experiments, the learned features often performed much better than other baseline features when there was only a small number of labeled training examples. [sent-32, score-0.295]

22 To the best of our knowledge, we are the first to apply deep learning algorithms to a range of audio classification tasks. [sent-33, score-0.341]

23 We hope that this paper will inspire more research on deep learning approaches applied to audio recognition tasks. [sent-34, score-0.394]

24 Algorithms: Convolutional deep belief networks. We first briefly review convolutional restricted Boltzmann machines (CRBMs) [9, 10, 11] as building blocks for convolutional deep belief networks (CDBNs). [sent-36, score-0.84]

25 The CRBM is an extension of the “regular” RBM [4] to a convolutional setting, in which the weights between the hidden units and the visible units are shared among all locations in the hidden layer. [sent-39, score-0.386]

26 The CRBM consists of two layers: an input (visible) layer V and a hidden layer H. [sent-40, score-0.232]

27 The hidden units are binary-valued, and the visible units are binary-valued or real-valued. [sent-41, score-0.21]

28 The hidden layer consists of K “groups” of $n_H$-dimensional arrays (where $n_H = n_V - n_W + 1$), with units in group k sharing the weights $W^k$. [sent-44, score-0.197]

29 Similarly, the energy function of the CRBM with real-valued visible units can be defined as: $E(v, h) = \frac{1}{2} \sum_{i=1}^{n_V} v_i^2 - \sum_{k=1}^{K} \sum_{j=1}^{n_H} \sum_{r=1}^{n_W} h_j^k W_r^k v_{j+r-1} - \sum_{k=1}^{K} b_k \sum_{j=1}^{n_H} h_j^k - c \sum_{i=1}^{n_V} v_i$. [sent-47, score-0.271]
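
The energy function for real-valued visible units can be sketched in code for a one-dimensional, single-channel input. This is a minimal illustration, not the authors' implementation: the function names, the list-based data layout, and the restriction to one input channel are assumptions made for the example. The second function gives the hidden-unit activation probabilities that follow from this energy when no pooling is applied.

```python
import math


def crbm_energy_real(v, h, W, b, c):
    """Energy of a 1-D, single-channel CRBM with real-valued visible units.

    v: list of n_V floats (visible layer)
    h: K lists of n_H binary values (hidden groups)
    W: K lists of n_W filter weights; b: K hidden biases; c: shared visible bias
    """
    n_w = len(W[0])
    n_h = len(v) - n_w + 1  # n_H = n_V - n_W + 1
    energy = 0.5 * sum(x * x for x in v)
    for k in range(len(W)):
        if len(h[k]) != n_h:
            raise ValueError("hidden group has the wrong length")
        for j in range(n_h):
            # 0-based v[j + r] plays the role of v_{j+r-1} in the 1-based formula
            energy -= h[k][j] * sum(W[k][r] * v[j + r] for r in range(n_w))
        energy -= b[k] * sum(h[k])
    return energy - c * sum(v)


def hidden_probabilities(v, W, b):
    """P(h_j^k = 1 | v) without pooling: sigmoid of filter response plus bias."""
    n_w = len(W[0])
    n_h = len(v) - n_w + 1
    probs = []
    for k in range(len(W)):
        group = []
        for j in range(n_h):
            response = sum(W[k][r] * v[j + r] for r in range(n_w)) + b[k]
            group.append(1.0 / (1.0 + math.exp(-response)))
        probs.append(group)
    return probs
```

Because the weights of a group are shared across all hidden locations, the filter response is a "valid" sliding inner product, which is why each hidden group has $n_V - n_W + 1$ units.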

30 [9] further developed a convolutional RBM with “probabilistic max-pooling,” where the maxima over small neighborhoods of hidden units are computed in a probabilistically sound way. [sent-51, score-0.264]
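
A small sketch may help make "probabilistic max-pooling" concrete. It assumes the softmax-style formulation described in [9], in which at most one hidden unit in each pooling block may be on and the pooling unit is off only when all of them are off; the function name and the treatment of leftover units are choices made for this example only.

```python
import math


def probabilistic_max_pool(inputs, ratio=3):
    """Return (hidden_probs, off_probs) for non-overlapping pooling blocks.

    inputs: bottom-up inputs to the hidden units of one group (list of floats)
    ratio:  pooling block size; trailing units that do not fill a block are dropped
    """
    hidden_probs, off_probs = [], []
    for start in range(0, len(inputs) - ratio + 1, ratio):
        block = inputs[start:start + ratio]
        # Shift by the largest exponent (including 0 for the "off" state)
        # so that math.exp never overflows.
        shift = max(0.0, max(block))
        weights = [math.exp(x - shift) for x in block]
        off_weight = math.exp(-shift)
        total = off_weight + sum(weights)
        hidden_probs.extend(w / total for w in weights)
        off_probs.append(off_weight / total)
    return hidden_probs, off_probs
```

Within each block the hidden-unit probabilities and the "off" probability sum to one, so the pooling unit is on with probability one minus the "off" probability.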

31 In this paper, we use CRBMs with probabilistic max-pooling as building blocks for convolutional deep belief networks. [sent-53, score-0.39]

32 Once the parameters for all the layers are trained, we stack the CRBMs to form a convolutional deep belief network. [sent-60, score-0.413]

33 Application to audio data. For the application of CDBNs to audio data, we first convert time-domain signals into spectrograms. [sent-63, score-0.306]

34 Similarly, the first-layer bases consist of $n_c$ channels of one-dimensional filters of length $n_W$. [sent-69, score-0.281]

35 Training on unlabeled TIMIT data. We trained the first and second-layer CDBN representations using a large, unlabeled speech dataset. [sent-71, score-0.315]

36 First, we extracted the spectrogram from each utterance of the TIMIT training data [13]. [sent-72, score-0.252]

37 The spectrogram had a 20 ms window size with 10 ms overlaps. [sent-73, score-0.197]
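
The windowing described above can be sketched as follows. This is an illustrative example under stated assumptions: "10 ms overlaps" is read as a 10 ms hop between 20 ms frames, the Hann taper and the naive DFT are choices made here for a dependency-free example, and the sampling rate is a parameter that the caller must supply.

```python
import cmath
import math


def frame_signal(signal, sample_rate, window_ms=20.0, overlap_ms=10.0):
    """Split a signal into overlapping frames of window_ms milliseconds."""
    window = int(round(sample_rate * window_ms / 1000.0))
    hop = window - int(round(sample_rate * overlap_ms / 1000.0))
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    return [signal[s:s + window]
            for s in range(0, len(signal) - window + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one Hann-tapered frame (non-negative bins)."""
    n = len(frame)
    tapered = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
               for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(tapered)))
            for k in range(n // 2 + 1)]
```

A spectrogram is then the list of `magnitude_spectrum(f)` over all frames; in practice an FFT routine would replace the quadratic-time DFT used here.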

38 We then trained 300 first-layer bases with a filter length ($n_W$) of 6 and a max-pooling ratio (local neighborhood size) of 3. [sent-75, score-0.287]

39 We further trained 300 second-layer bases using the max-pooled first-layer activations as input, again with a filter length of 6 and a max-pooling ratio of 3. [sent-76, score-0.308]

40 We visualize the first-layer bases by applying the inverse of the PCA whitening to each first-layer basis (Figure 1). [sent-79, score-0.266]

41 Figure 1: Visualization of randomly selected first-layer CDBN bases trained on the TIMIT data. [sent-83, score-0.325]

42 Phonemes and the CDBN features. In Figure 2, we show how our bases relate to phonemes by comparing visualizations of each phoneme with the bases that are most activated by that phoneme. [sent-89, score-0.736]

43 For each phoneme, we show five spectrograms of sound clips of that phoneme (top five columns in each phoneme group), and the five first-layer bases with the highest average activations on the given phoneme (bottom five columns in each phoneme group). [sent-90, score-0.942]

44 Many of the first-layer bases closely match the shapes of phonemes. [sent-91, score-0.237]

45 There are prominent horizontal bands in the lower frequencies of the first-layer bases that respond most to vowels (for example, “ah” and “oy”). [sent-92, score-0.26]

46 For each phoneme: (top) the spectrograms of the five randomly selected phones; (bottom) five first-layer bases with the highest average activations on the given phoneme. [sent-94, score-0.41]

47 Closer inspection of the bases provides slight evidence that the first-layer bases also capture more fine-grained details. [sent-97, score-0.474]

48 For example, the first and third “oy” bases reflect the upward-slanting pattern in the phoneme spectrograms. [sent-98, score-0.373]

49 The top “el” bases mirror the intensity patterns of the corresponding phoneme spectrograms: a high intensity region appears in the lowest frequencies, and another region of lesser intensity appears a bit higher up. [sent-99, score-0.46]

50 Speaker gender information and the CDBN features. In Figure 3, we show an analysis of two-layer CDBN feature representations with respect to the gender classification task (Section 4. [sent-102, score-0.362]

51 Note that the network was trained on unlabeled data; therefore, no information about speaker gender was given during training. [sent-104, score-0.431]

52 Figure 3: (Left) Five spectrogram samples of the “ae” phoneme from female (top) and male (bottom) speakers. Panel titles: Example phones (female), First layer bases ("female"), Second layer bases ("female"), Example phones (male), First layer bases ("male"), Second layer bases ("male"). [sent-105, score-1.867]

53 (Middle) Visualization of the five first-layer bases that most differentially activate for female/male speakers. [sent-106, score-0.286]

54 (Right) Visualization of the five second-layer bases that most differentially activate for female/male speakers. [sent-107, score-0.286]

55 For comparison with the CDBN features, randomly selected spectrograms of female (top left five columns) and male (bottom left five columns) pronunciations of the “ae” phoneme from the TIMIT dataset are shown. [sent-108, score-0.464]

56 Spectrograms for the female pronunciations are qualitatively distinguishable by a finer horizontal banding pattern in low frequencies, whereas male pronunciations have more blurred patterns. [sent-109, score-0.238]

57 Only the bases that are most biased to activate on either male or female speech are shown. [sent-111, score-0.496]

58 The bases that are most active on female speech encode the horizontal band pattern that is prominent in the spectrograms of female pronunciations. [sent-112, score-0.603]

59 On the other hand, the male-biased bases have more blurred patterns, which again visually matches the corresponding spectrograms. [sent-113, score-0.237]

60 Application to speech recognition tasks. In this section, we demonstrate that the CDBN feature representations learned from the unlabeled speech corpus can be useful for multiple speech recognition tasks, such as speaker identification, gender classification, and phone classification. [sent-114, score-0.884]

61 Speaker identification. We evaluated the usefulness of the learned CDBN representations for the speaker identification task. [sent-120, score-0.276]

62 The subset of the TIMIT corpus that we used for speaker identification has 168 speakers and 10 utterances (sentences) per speaker, resulting in a total of 1680 utterances. [sent-121, score-0.309]

63 For each number of utterances per speaker, we randomly selected training utterances and testing utterances and measured the classification accuracy; we report the results averaged over 10 random trials. [sent-123, score-0.468]

64 To construct training and test data for the classification task, we extracted a spectrogram from each utterance in the TIMIT corpus. [sent-124, score-0.27]

65 We computed the first and second-layer CDBN features using the spectrogram as input. [sent-126, score-0.234]

66 We drew unlabeled data from the larger of the two for unsupervised feature learning, and we drew labeled data from the other data set to create our training and test set for the classification tasks. [sent-139, score-0.22]

67 Footnote 5: There were some exceptions to this; for the case of eight training utterances, we followed Reynolds (1995) [16]; more specifically, we used eight training utterances (2 sa sentences, 3 si sentences and the first 3 sx sentences); the two testing utterances were the remaining 2 sx sentences. [sent-143, score-0.413]

68 Table 1: Test classification accuracy for speaker identification using summary statistics. Columns: #training utterances per speaker (1, 2, 3, 5, 8). First row: RAW 46. [sent-149, score-0.526]

69 Table 2: Test classification accuracy for speaker identification using all frames. Columns: #training utterances per speaker (1, 2, 3, 5, 8). First row: MFCC ([16]’s method) 40. [sent-174, score-0.546]

70 As shown in Table 2, the CDBN features consistently outperformed MFCC features when the number of training examples was small. [sent-193, score-0.217]

71 The resulting combined classifier performed the best, achieving 100% accuracy for the case of 8 training utterances per speaker. [sent-195, score-0.209]

72 Speaker gender classification. We also evaluated the same CDBN features which were learned for the speaker identification task on the gender classification task. [sent-197, score-0.535]

73 This suggests that the second-layer representation learned more invariant features that are relevant for speaker gender classiﬁcation, justifying the use of “deep” architectures. [sent-202, score-0.407]

74 Phone classification. Finally, we evaluated our learned representation on phone classification tasks. [sent-204, score-0.199]

75 For this experiment, we treated each phone segment as an individual example and computed the spectrogram (RAW) and MFCC features for each phone segment. [sent-205, score-0.542]

76 Following the standard protocol [15], we report the 39-way phone classification accuracy on the test data (TIMIT core test set) for various numbers of training sentences. [sent-207, score-0.261]

77 Footnote 7: In [16], MFCC features (with multiple frames) were computed for each utterance; then a Gaussian mixture model was trained for each speaker (treating each individual MFCC frame as an input example to the GMM). [sent-209, score-0.335]

78 Overall, the highest scoring speaker was selected for the prediction. [sent-214, score-0.223]
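
The per-speaker GMM baseline can be illustrated with a short scoring sketch. Everything here is hypothetical scaffolding rather than the method of [16]: the models are assumed to be already trained, the covariances are assumed diagonal, and an utterance is scored by summing per-frame log-likelihoods, which is one common way to pick the "highest scoring speaker".

```python
import math


def gmm_frame_log_likelihood(frame, weights, means, variances):
    """log p(frame) under a diagonal-covariance Gaussian mixture model."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, s2 in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * s2) + (x - m) ** 2 / s2)
        log_terms.append(log_p)
    # log-sum-exp over mixture components for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def identify_speaker(frames, speaker_models):
    """Return the speaker whose GMM gives the highest total log-likelihood.

    speaker_models: dict mapping name -> (weights, means, variances)
    """
    scores = {
        name: sum(gmm_frame_log_likelihood(f, *params) for f in frames)
        for name, params in speaker_models.items()
    }
    return max(scores, key=scores.get)
```

With real data, each frame would be an MFCC vector and each model would be fitted to the training frames of one speaker.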

79 Table 3: Test accuracy for gender classification problem. Columns: #training utterances per gender (1, 2, 3, 5, 7, 10). First row: RAW 68. [sent-216, score-0.38]

80 Table 4: Test accuracy for phone classification problem. Columns: #training utterances (100, 200, 500, 1000, 2000, 3696). First row: RAW 36. [sent-246, score-0.311]

81 In this experiment, the first-layer CDBN features performed better than spectrogram features, but they did not outperform the MFCC features. [sent-277, score-0.252]

82 [17, 18, 19, 20] This suggests that the first-layer CDBN features learned somewhat informative features for phone classification tasks in an unsupervised way. [sent-281, score-0.377]

83 In contrast to the gender classification task, the second-layer CDBN features did not offer much improvement over the first-layer CDBN features. [sent-282, score-0.194]

84 Application to music classification tasks. In this section, we assess the applicability of CDBN features to various music classification tasks. [sent-284, score-0.266]

85 Music genre classification. For the task of music genre classification, we trained the first and second-layer CDBN representations on an unlabeled collection of music data. [sent-306, score-0.426]

86 First, we computed the spectrogram (20 ms window size with 10 ms overlaps) representation for individual songs. [sent-307, score-0.215]

87 We trained 300 first-layer bases with a filter length of 10 and a max-pooling ratio of 3. [sent-309, score-0.287]

88 In addition, we trained 300 second-layer bases with a filter length of 10 and a max-pooling ratio of 3. [sent-310, score-0.287]

89 The training and test songs for the classification tasks were randomly sampled from 5 genres (classical, electric, jazz, pop, and rock) and did not overlap with the unlabeled data. [sent-312, score-0.211]

90 1, we trained the first and second-layer CDBN representations from an unlabeled collection of classical music data. [sent-325, score-0.245]

91 The results show that both the first and second-layer CDBN features performed better than the baseline features, and that either using the second-layer features only or combining the first and the second-layer features yielded the best results. [sent-334, score-0.28]

92 Figure 4: Visualization of randomly selected first-layer CDBN bases trained on classical music data. [sent-338, score-0.402]

93 In this paper, we applied convolutional deep belief networks to audio data and evaluated on various audio classification tasks. [sent-362, score-0.747]

94 By leveraging a large amount of unlabeled data, our learned features often equaled or surpassed MFCC features, which are hand-tailored to audio data. [sent-363, score-0.349]

95 Also, our results show that a single CDBN feature representation can achieve high performance on multiple audio recognition tasks. [sent-365, score-0.194]

96 We hope that our approach will inspire more research on automatically learning deep feature hierarchies for audio data. [sent-366, score-0.391]

97 In our experiments, we found that the artist identification task was more difficult than the speaker identification task because the local sound patterns can be highly variable even for the same artist. [sent-370, score-0.266]

98 Sparse deep belief network model for visual area V2. [sent-428, score-0.27]

99 Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. [sent-436, score-0.282]

100 Regularization, adaptation, and nonindependent features improve hidden conditional random fields for phone classification. [sent-493, score-0.27]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cdbn', 0.691), ('mfcc', 0.243), ('bases', 0.237), ('deep', 0.188), ('speaker', 0.18), ('audio', 0.153), ('spectrogram', 0.147), ('phone', 0.145), ('convolutional', 0.138), ('phoneme', 0.136), ('timit', 0.135), ('utterances', 0.129), ('gender', 0.107), ('female', 0.102), ('layer', 0.097), ('spectrograms', 0.091), ('features', 0.087), ('classi', 0.083), ('music', 0.077), ('unlabeled', 0.076), ('phones', 0.073), ('speech', 0.071), ('nv', 0.07), ('nh', 0.068), ('crbm', 0.065), ('belief', 0.064), ('utterance', 0.062), ('units', 0.062), ('auditory', 0.061), ('male', 0.058), ('cation', 0.054), ('nw', 0.053), ('genre', 0.052), ('trained', 0.05), ('visible', 0.048), ('raw', 0.044), ('hk', 0.043), ('training', 0.043), ('representations', 0.042), ('artist', 0.042), ('phonemes', 0.039), ('crbms', 0.039), ('pronunciations', 0.039), ('hidden', 0.038), ('visualization', 0.037), ('accuracy', 0.037), ('grosse', 0.037), ('identi', 0.035), ('bk', 0.035), ('oy', 0.034), ('learned', 0.033), ('sentences', 0.031), ('songs', 0.031), ('inspire', 0.031), ('ve', 0.031), ('networks', 0.03), ('whitening', 0.029), ('activate', 0.028), ('sound', 0.026), ('labeled', 0.026), ('channels', 0.026), ('llikelihood', 0.026), ('lsparsity', 0.026), ('rstlayer', 0.026), ('sung', 0.026), ('lee', 0.026), ('tasks', 0.025), ('ms', 0.025), ('frequencies', 0.023), ('highest', 0.023), ('layers', 0.023), ('convolution', 0.023), ('intensity', 0.023), ('ah', 0.023), ('cdbns', 0.023), ('recognition', 0.022), ('lter', 0.022), ('activations', 0.021), ('evaluated', 0.021), ('phonetic', 0.021), ('differentially', 0.021), ('selected', 0.02), ('vi', 0.02), ('frames', 0.02), ('rbms', 0.019), ('drew', 0.019), ('reynolds', 0.019), ('sx', 0.019), ('baseline', 0.019), ('feature', 0.019), ('ae', 0.018), ('wr', 0.018), ('nc', 0.018), ('network', 0.018), ('test', 0.018), ('pca', 0.018), ('outperform', 0.018), ('individual', 0.018), ('randomly', 0.018), ('patterns', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

2 0.19288452 151 nips-2009-Measuring Invariances in Deep Networks

Author: Ian Goodfellow, Honglak Lee, Quoc V. Le, Andrew Saxe, Andrew Y. Ng

Abstract: For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of “deep” vs. “shallower” representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms.

3 0.17611705 2 nips-2009-3D Object Recognition with Deep Belief Nets

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error.

4 0.14319289 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

Author: Zahi Karam, Douglas Sturim, William M. Campbell

Abstract: Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications: speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison: support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.

5 0.13494855 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

Author: Paris Smaragdis, Madhusudana Shashanka, Bhiksha Raj

Abstract: In this paper we present an algorithm for separating mixed sounds from a monophonic recording. Our approach makes use of training data which allows us to learn representations of the types of sounds that compose the mixture. In contrast to popular methods that attempt to extract compact generalizable models for each sound from training data, we employ the training data itself as a representation of the sources in the mixture. We show that mixtures of known sounds can be described as sparse combinations of the training data itself, and in doing so produce significantly better separation results as compared to similar systems based on compact statistical models. Keywords: Example-Based Representation, Signal Separation, Sparse Models.

6 0.12813474 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

7 0.12253842 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

8 0.11471449 119 nips-2009-Kernel Methods for Deep Learning

9 0.088125803 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

10 0.076518461 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

11 0.075129725 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

12 0.048237029 260 nips-2009-Zero-shot Learning with Semantic Output Codes

13 0.048121985 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

14 0.047573216 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

15 0.045916475 97 nips-2009-Free energy score space

16 0.045804884 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

17 0.045714371 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

18 0.042602845 204 nips-2009-Replicated Softmax: an Undirected Topic Model

19 0.042552549 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

20 0.03889491 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, -0.072), (2, -0.052), (3, 0.05), (4, -0.047), (5, 0.028), (6, -0.007), (7, 0.108), (8, -0.077), (9, 0.082), (10, 0.029), (11, 0.073), (12, -0.338), (13, 0.065), (14, 0.102), (15, 0.006), (16, 0.157), (17, -0.029), (18, -0.056), (19, -0.04), (20, -0.073), (21, -0.024), (22, -0.026), (23, 0.061), (24, -0.054), (25, -0.074), (26, 0.129), (27, 0.016), (28, -0.036), (29, 0.002), (30, -0.014), (31, 0.017), (32, 0.05), (33, -0.04), (34, 0.04), (35, -0.047), (36, 0.027), (37, 0.146), (38, 0.073), (39, 0.021), (40, 0.001), (41, -0.142), (42, 0.08), (43, -0.13), (44, 0.014), (45, -0.163), (46, -0.005), (47, -0.063), (48, 0.022), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9270578 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

2 0.67190349 2 nips-2009-3D Object Recognition with Deep Belief Nets

Author: Vinod Nair, Geoffrey E. Hinton

3 0.61445647 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

Author: Zahi Karam, Douglas Sturim, William M. Campbell

4 0.53901631 151 nips-2009-Measuring Invariances in Deep Networks

Author: Ian Goodfellow, Honglak Lee, Quoc V. Le, Andrew Saxe, Andrew Y. Ng

5 0.51279676 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

Author: Paris Smaragdis, Madhusudana Shashanka, Bhiksha Raj

6 0.4558219 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

7 0.44181368 119 nips-2009-Kernel Methods for Deep Learning

8 0.42780647 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

9 0.37770444 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

10 0.33921933 97 nips-2009-Free energy score space

11 0.32679221 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

12 0.30158955 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

13 0.29818431 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

14 0.29793099 56 nips-2009-Conditional Neural Fields

15 0.29706705 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

16 0.28384438 39 nips-2009-Bayesian Belief Polarization

17 0.2788513 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

18 0.27798808 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

19 0.27693719 47 nips-2009-Boosting with Spatial Regularization

20 0.26255858 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(12, 0.022), (21, 0.015), (24, 0.029), (25, 0.047), (35, 0.028), (36, 0.076), (39, 0.032), (42, 0.015), (58, 0.035), (71, 0.031), (81, 0.012), (86, 0.531), (91, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98188001 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V, the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important.

1 Motivation and Outline

Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f(x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically $\mathbb{R}^n$ [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law $\propto n^{-\alpha}$ with an exponent $\alpha \le 1$ that depends on the dimensionality n of the space as well as the smoothness properties of the function f(x) as encoded in the covariance function.
But there are many interesting application domains that involve discrete input spaces, where $x$ could be a string, an amino acid sequence (with $f(x)$ some measure of secondary structure or biological function), a research paper (with $f(x)$ related to impact), a web page (with $f(x)$ giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity.

Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels $\exp[-(x - x')^2/(2\sigma^2)]$, but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for $\sigma \to \infty$ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place.

In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. 
We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions.

2 Kernels on graphs and trees

We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by $i = 1 \ldots V$, instead of the generic input label $x$ we used in the introduction, and the associated function values are denoted $f_i \in \mathbb{R}$. By taking the prior $P(f)$ over these functions $f = (f_1, \ldots, f_V)$ as a (zero mean) Gaussian process we are saying that $P(f) \propto \exp(-\frac{1}{2} f^T C^{-1} f)$. The covariance function or kernel $C$ is then, in our graph setting, just a positive definite $V \times V$ matrix.

The graph structure is characterized by a $V \times V$ adjacency matrix, with $A_{ij} = 1$ if nodes $i$ and $j$ are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that $A_{ij} = A_{ji}$, and there are no self-loops ($A_{ii} = 0$). The degree of each node is then defined as $d_i = \sum_{j=1}^{V} A_{ij}$.

The covariance kernels we discuss in this paper are the natural generalizations of the squared-exponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as $L = \mathbf{1} - D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix with entries $d_1, \ldots, d_V$ and $\mathbf{1}$ is the $V \times V$ identity matrix. An advantage of $L$ over the unnormalized Laplacian $D - A$, which was used in the earlier paper [13], is that the eigenvalues of $L$ (again a $V \times V$ matrix) lie in the interval [0, 2] (see e.g. [14]).

From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for $a \ge 2$)

$$C \propto (\mathbf{1} - a^{-1} L)^p = \left[(1 - a^{-1})\mathbf{1} + a^{-1} D^{-1/2} A D^{-1/2}\right]^p \tag{1}$$

while the diffusion kernel is given by

$$C \propto \exp\left(-\tfrac{1}{2}\sigma^2 L\right) \propto \exp\left(\tfrac{1}{2}\sigma^2 D^{-1/2} A D^{-1/2}\right) \tag{2}$$

We will always normalize these so that $(1/V)\sum_i C_{ii} = 1$, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity.
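As a rough illustration of Eqs. (1) and (2) and the normalization convention, the following sketch builds both kernels with NumPy. The toy graph (a six-vertex ring with one chord) and all function names are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def normalized_laplacian(A):
    """L = 1 - D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix A."""
    d = A.sum(axis=1)                      # vertex degrees d_i (must be nonzero)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - inv_sqrt_d[:, None] * A * inv_sqrt_d[None, :]

def normalize(C):
    """Rescale so that the average prior variance (1/V) sum_i C_ii equals 1."""
    return C / np.mean(np.diag(C))

def random_walk_kernel(A, p, a=2.0):
    """p-step random walk kernel, Eq. (1): C proportional to (1 - L/a)^p, a >= 2."""
    M = np.eye(len(A)) - normalized_laplacian(A) / a
    return normalize(np.linalg.matrix_power(M, p))

def diffusion_kernel(A, sigma):
    """Diffusion kernel, Eq. (2): C proportional to exp(-sigma^2 L / 2)."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))
    return normalize((U * np.exp(-0.5 * sigma**2 * lam)) @ U.T)

# Hypothetical toy graph: ring of 6 vertices plus one chord (degrees 2 and 3).
V = 6
A = np.zeros((V, V))
for i in range(V):
    A[i, (i + 1) % V] = A[(i + 1) % V, i] = 1.0
A[0, 3] = A[3, 0] = 1.0

C_rw = random_walk_kernel(A, p=4, a=2.0)
C_diff = diffusion_kernel(A, sigma=2.0)
print(np.round(C_rw[0], 3))
print(np.round(C_diff[0], 3))
```

Because the eigenvalues of L lie in [0, 2], the matrix 1 - L/a has non-negative eigenvalues for a ≥ 2, so the p-step kernel stays positive semi-definite for every p.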
To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex $j$ to $i$ is then $A_{ij}/d_j$. The transition matrix after $s$ steps follows as $(AD^{-1})^s$: its $ij$-element gives the probability of being on vertex $i$, having started at $j$. We can now compare this with the p-step kernel by expanding the p-th power in (1):

$$C \propto \sum_{s=0}^{p} \binom{p}{s} a^{-s} (1 - a^{-1})^{p-s} (D^{-1/2} A D^{-1/2})^s = D^{-1/2} \left[\sum_{s=0}^{p} \binom{p}{s} a^{-s} (1 - a^{-1})^{p-s} (AD^{-1})^s\right] D^{1/2} \tag{3}$$

Thus $C$ is essentially a random walk transition matrix, averaged over the number of steps $s$ with

$$s \sim \mathrm{Binomial}(p, 1/a) \tag{4}$$

This shows that $1/a$ can be interpreted as the probability of actually taking a step at each of $p$ "attempts". To obtain the actual $C$ the resulting averaged transition matrix is premultiplied by $D^{-1/2}$ and postmultiplied by $D^{1/2}$, which ensures that the kernel $C$ is symmetric. For the diffusion kernel, one finds an analogous result but the number of random walk steps is now distributed as $s \sim \mathrm{Poisson}(\sigma^2/2)$. This implies in particular that the diffusion kernel is the limit of the p-step kernel for $p, a \to \infty$ at constant $p/a = \sigma^2/2$. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases.

Figure 1: (Left) Random walk kernel $C_{\ell,p}$ plotted vs distance $\ell$ along graph, for increasing number of steps $p$ and $a = 2$, $d = 3$. Note the convergence to a limiting shape for large $p$ that is not the naive fully correlated limit $C_{\ell,p\to\infty} = 1$. (Right) Numerical results for average covariance $K_1$ between neighbouring nodes, plotted vs $p/a$ for $d = 3$, averaged over neighbours and over randomly generated regular graphs.
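The statement that the diffusion kernel is the limit of the p-step kernel rests on Binomial(p, 1/a) approaching Poisson(p/a) when p and a grow at fixed ratio. A minimal sketch of that convergence, using only the standard library, is given below; the function names and the choice σ² = 4 are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(s, p, a):
    """P(s) for s ~ Binomial(p, 1/a): number of steps actually taken in p attempts."""
    q = 1.0 / a
    return comb(p, s) * q**s * (1.0 - q)**(p - s)   # comb(p, s) is 0 for s > p

def poisson_pmf(s, mean):
    """P(s) for s ~ Poisson(mean)."""
    return exp(-mean) * mean**s / factorial(s)

def total_variation(p, a, s_max=60):
    """Half the L1 distance between the two step-number distributions.

    The sum is truncated at s_max, which is harmless while p/a is of order 1."""
    mean = p / a
    return 0.5 * sum(abs(binomial_pmf(s, p, a) - poisson_pmf(s, mean))
                     for s in range(s_max + 1))

sigma2 = 4.0                       # assumed lengthscale, so p/a = sigma^2/2 = 2
for p in (4, 40, 400, 4000):
    a = p / (sigma2 / 2)
    print(p, a, total_variation(p, a))
```

The printed distance shrinks as p grows, which is the sense in which results for the diffusion kernel follow from those for the p-step kernel.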
In the limit of a large number of steps $s$, the random walk on a graph will reach its stationary distribution $p_\infty \propto De$ where $e = (1, \ldots, 1)$. (This form of $p_\infty$ can be verified by checking that it remains unchanged after multiplication with the transition matrix $AD^{-1}$.) The $s$-step transition matrix for large $s$ is then $p_\infty e^T \propto Dee^T$ because we converge from any starting vertex to the stationary distribution. It follows that for large $p$ or $\sigma^2$ the covariance kernel becomes $C \propto D^{1/2} ee^T D^{1/2}$, i.e. $C_{ij} \propto (d_i d_j)^{1/2}$. This is consistent with the interpretation of $\sigma$ or $(p/a)^{1/2}$ as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel $C_{ij}$ is essentially independent of the distance (along the graph) between the vertices $i$ and $j$, and the function $f$ becomes fully correlated across the graph. (Explicitly $f = vD^{1/2}e$ under the prior, with $v$ a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as $p$ or $\sigma$ is increased is non-trivial.

We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices $A$ which are regular in the sense that they give for each vertex the same degree, $d_i = d$. A uniform probability distribution is then taken across all $A$ that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order $\ln(V)$ or larger if $V$ is large. Their local structure is then that of a regular tree of degree $d$, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance $\ell$ between two nodes $i$ and $j$.
Denoting this kernel value $C_{\ell,p}$ for a p-step random walk kernel, one has then $C_{\ell,p=0} = \delta_{\ell,0}$ and

$$\gamma_{p+1} C_{0,p+1} = \left(1 - \frac{1}{a}\right) C_{0,p} + \frac{1}{a} C_{1,p} \tag{5}$$

$$\gamma_{p+1} C_{\ell,p+1} = \frac{1}{ad} C_{\ell-1,p} + \left(1 - \frac{1}{a}\right) C_{\ell,p} + \frac{d-1}{ad} C_{\ell+1,p} \quad \text{for } \ell \ge 1 \tag{6}$$

where $\gamma_p$ is chosen to achieve the desired normalization $C_{0,p} = 1$ of the prior variance for every $p$. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree $d = 3$, and $a = 2$. As expected the kernel becomes more long-ranged initially as $p$ increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as

$$C_{\ell,p\to\infty} = \left[1 + \ell(d - 2)/d\right](d - 1)^{-\ell/2} \tag{7}$$

and is also plotted in the figure, showing good agreement with the numerical iteration. (One can check directly that this form is a fixed point of (5) and (6) with $C_0 = 1$ and $\gamma_{p\to\infty} = 1 - a^{-1} + 2(d-1)^{1/2}/(ad)$.) There are (at least) two ways of obtaining the result (7). One is to take the limit $\sigma \to \infty$ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor $(d - 1)^{-\ell/2}$). Another route is to find the steady state of the recursion for $C_{\ell,p}$. This is easy to do but requires as input the unknown steady state value of $\gamma_p$. To determine this, one can map from $C_{\ell,p}$ to the total random walk probability $S_{\ell,p}$ in each "shell" of vertices at distance $\ell$ from the starting vertex, changing variables to $S_{0,p} = C_{0,p}$ and $S_{\ell,p} = d(d - 1)^{\ell-1} C_{\ell,p}$ ($\ell \ge 1$). Omitting the factors $\gamma_p$, this results in a recursion for $S_{\ell,p}$ that simply describes a biased random walk on $\ell = 0, 1, 2, \ldots$, with a probability of $1 - 1/a$ of remaining at the current $\ell$, probability $1/(ad)$ of moving to the left and probability $(d - 1)/(ad)$ of moving to the right. The point $\ell = 0$ is a reflecting barrier where only moves to the right are allowed, with probability $1/a$. The time evolution of this random walk starting from $\ell = 0$ can now be analysed as in [17].
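The recursion for the tree kernel is easy to iterate directly. The sketch below does so in plain Python and compares the result with the limiting form; the coefficient (d - 2)/d used for the limit is the one that makes the expression a fixed point of the recursion with C_0 = 1, and the function names are assumptions of this example.

```python
def tree_kernel(p, a=2.0, d=3):
    """Iterate the tree recursions (5)-(6); returns [C_{0,p}, ..., C_{p,p}], C_{0,p} = 1."""
    C = [1.0] + [0.0] * (p + 1)         # C_{l,0} = delta_{l,0}, plus one spare slot
    stay, left, right = 1 - 1 / a, 1 / (a * d), (d - 1) / (a * d)
    for step in range(p):
        new = [0.0] * len(C)
        new[0] = stay * C[0] + C[1] / a
        for l in range(1, step + 2):    # after step+1 steps the kernel reaches l = step+1
            new[l] = left * C[l - 1] + stay * C[l] + right * C[l + 1]
        gamma = new[0]                  # gamma_{p+1} restores the normalization C_0 = 1
        C = [c / gamma for c in new]
    return C[:p + 1]

def limiting_kernel(l, d=3):
    """Steady state of the recursion: [1 + l (d-2)/d] (d-1)^(-l/2)."""
    return (1 + l * (d - 2) / d) * (d - 1) ** (-l / 2)

C = tree_kernel(p=800)
for l in range(5):
    print(l, round(C[l], 4), round(limiting_kernel(l), 4))
```

The approach to the limit is slow (corrections decay only as a power of p), so the iterated values sit slightly below the limiting ones even at p = 800.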
As expected from the balance of moves to the left and right, $S_{\ell,p}$ for large $p$ is peaked around the average position of the walk, $\ell = p(d - 2)/(ad)$. For $\ell$ smaller than this $S_{\ell,p}$ has a tail behaving as $\propto (d - 1)^{\ell/2}$, and converting back to $C_{\ell,p}$ gives the large-$\ell$ scaling of $C_{\ell,p\to\infty} \propto (d - 1)^{-\ell/2}$; this in turn fixes the value of $\gamma_{p\to\infty}$ and so eventually gives (7).

The above analysis shows that for large $p$ the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to $C_{\ell,p\to\infty} = 1$. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as $p$ is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth $\ell$ has $V = 1 + d(d - 1)^{\ell-1}$ nodes. So a regular graph can be tree-like at most out to $\ell \approx \ln(V)/\ln(d - 1)$. Comparing with the typical number of steps our random walk takes, which is $p/a$ from (4), we then expect loop effects to appear in the covariance kernel when

$$p/a \approx \ln(V)/\ln(d - 1) \tag{8}$$

To check this prediction, we measure the analogue of $C_{1,p}$ on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be $C_{1,p}$ on a tree is $K_1 = C_{ij}/\sqrt{C_{ii}C_{jj}}$ for neighbouring nodes $i$ and $j$. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we find the results in Fig. 1(right). The results for $K_1$ (symbols) accurately track the tree predictions (lines) for small $p/a$, and start to deviate just around the values of $p/a$ expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of $K_1$, which eventually – now that $p/a$ is large enough for the kernel to "notice" the loops – approach the fully correlated limit $K_1 = 1$.
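For orientation, the crossover estimate (8) can be evaluated for the graph sizes used in the figures. This is plain arithmetic on the formula in the text; the helper name is an assumption.

```python
from math import log

def loop_crossover(V, d):
    """Estimate (8): the value of p/a at which loops start to affect the kernel."""
    return log(V) / log(d - 1)

for V in (500, 1000):
    ratio = loop_crossover(V, d=3)
    # With a = 2 the corresponding number of kernel steps is p = 2 * ratio.
    print(V, round(ratio, 2), round(2 * ratio, 1))
```

The logarithmic dependence on V means that even a doubling of the graph size shifts the crossover by only one unit of p/a when d = 3.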
3 Learning curves

We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function $f^*$ is drawn from a GP prior with a p-step random walk covariance kernel $C$. Training examples are input-output pairs $(i_\mu, f^*_{i_\mu} + \xi_\mu)$ where $\xi_\mu$ is i.i.d. Gaussian noise of variance $\sigma^2$ (from here on $\sigma^2$ denotes this noise level); the distribution of training inputs $i_\mu$ is taken to be uniform across vertices. Inference from a data set $D$ of $n$ such examples $\mu = 1, \ldots, n$ takes place using the prior defined by $C$ and a Gaussian likelihood with noise variance $\sigma^2$. We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs. We emphasize that as $n$ is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow.

The generalization error $\epsilon$ is the squared difference between the estimated function $\hat{f}_i$ and the target $f^*_i$, averaged across the (uniform) input distribution, the posterior distribution of $f^*$ given $D$, the distribution of datasets $D$, and finally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1]

$$\epsilon(n) = \left\langle V^{-1} \sum_i \left[ C_{ii} - k(i)^T K^{-1} k(i) \right] \right\rangle_{D,\mathrm{graphs}} \tag{9}$$

where the average is over data sets and over graphs, $K$ is an $n \times n$ matrix with elements $K_{\mu\mu'} = C_{i_\mu,i_{\mu'}} + \sigma^2 \delta_{\mu\mu'}$ and $k(i)$ is a vector with entries $k_\mu(i) = C_{i,i_\mu}$. The resulting learning curve depends, in addition to $n$, on the graph structure as determined by $V$ and $d$, and the kernel and noise level as specified by $p$, $a$ and $\sigma^2$. We fix $d = 3$ throughout to avoid having too many parameters to vary, although similar results are obtained for larger $d$.
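The expression inside the average in (9) can be evaluated directly for a single graph and a single data set. The sketch below does this with NumPy and then averages over random data sets only; the fixed 3-regular graph (a ring of 10 vertices with opposite vertices joined), the parameter values and the function name are assumptions chosen for illustration, and no average over a graph ensemble is attempted.

```python
import numpy as np

def bayes_error(C, train_idx, noise_var):
    """Average posterior variance V^{-1} sum_i [C_ii - k(i)^T K^{-1} k(i)] for one data set."""
    idx = np.asarray(train_idx, dtype=int)
    if idx.size == 0:                       # no data: posterior equals prior
        return float(np.mean(np.diag(C)))
    K = C[np.ix_(idx, idx)] + noise_var * np.eye(idx.size)
    k = C[:, idx]                           # row i holds the vector k(i)
    reduction = np.einsum('ij,ji->i', k, np.linalg.solve(K, k.T))
    return float(np.mean(np.diag(C) - reduction))

# Hypothetical fixed 3-regular graph: ring of V vertices plus links i <-> i + V/2.
V, d, a, p = 10, 3, 2.0, 3
A = np.zeros((V, V))
for i in range(V):
    for j in ((i + 1) % V, (i + V // 2) % V):
        A[i, j] = A[j, i] = 1.0
M = (1 - 1 / a) * np.eye(V) + A / (a * d)   # 1 - L/a on a regular graph
C = np.linalg.matrix_power(M, p)
C /= np.mean(np.diag(C))

rng = np.random.default_rng(0)
for n in (0, 5, 20, 80):
    errs = [bayes_error(C, rng.integers(0, V, size=n), 0.1) for _ in range(200)]
    print(n, np.mean(errs))
```

Repeated inputs are allowed, as in the text, because training inputs are drawn uniformly with replacement; the noise term keeps K invertible in that case.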
Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix $K$ and vector $k$ in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6]

$$\epsilon = g\left(\frac{n}{\epsilon + \sigma^2}\right), \qquad g(h) = \sum_{\alpha=1}^{V} (\lambda_\alpha^{-1} + h)^{-1} \tag{10}$$

This equation for $\epsilon$ has to be solved self-consistently because $\epsilon$ also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues $\lambda_\alpha$ appearing in the function $g(h)$ are defined appropriately, by the eigenfunction condition $\langle C_{ij}\phi_j \rangle = \lambda\phi_i$; the average here is over the input distribution, i.e. $\langle \ldots \rangle = V^{-1}\sum_j \ldots$ From the definition (1) of the p-step kernel, we see that then $\lambda_\alpha = \kappa V^{-1}(1 - \lambda^L_\alpha/a)^p$ in terms of the corresponding eigenvalue of the graph Laplacian $L$. The constant $\kappa$ has to be chosen to enforce our normalization convention $\sum_\alpha \lambda_\alpha = \langle C_{jj} \rangle = 1$.

Fortunately, for large $V$ the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14]

$$\rho(\lambda^L) = \frac{d\,\sqrt{\frac{4(d-1)}{d^2} - (\lambda^L - 1)^2}}{2\pi \lambda^L (2 - \lambda^L)} \tag{11}$$

in the range $\lambda^L \in [\lambda^L_-, \lambda^L_+]$, $\lambda^L_\pm = 1 \pm 2d^{-1}(d - 1)^{1/2}$, where the term under the square root is positive. The prefactor is written here so that $\rho$ integrates to one and agrees with the edge coefficient $r$ used in the Appendix. (There are also two isolated eigenvalues $\lambda^L = 0, 2$ but these have weight $1/V$ each and so can be ignored for large $V$.) Rewriting (10) as $\epsilon = V^{-1}\sum_\alpha [(V\lambda_\alpha)^{-1} + (n/V)(\epsilon + \sigma^2)^{-1}]^{-1}$ and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve:

$$\epsilon = \int d\lambda^L \, \rho(\lambda^L) \left[\kappa^{-1}(1 - \lambda^L/a)^{-p} + \nu/(\epsilon + \sigma^2)\right]^{-1} \tag{12}$$

with $\kappa$ determined from $\kappa \int d\lambda^L \, \rho(\lambda^L)(1 - \lambda^L/a)^p = 1$.
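One possible way to solve the self-consistent prediction (12) numerically is sketched below in plain Python. It assumes the tree spectral density in the normalized form ρ(λ) = d·sqrt(4(d-1)/d² - (λ-1)²) / (2πλ(2-λ)), integrates with the substitution λ = 1 - R cos θ to remove the square-root edges, and finds ε by bisection. The function names, grid size and bisection depth are choices of this sketch, not of the paper.

```python
from math import cos, sin, pi, sqrt

def tree_average(f, d=3, n_grid=2000):
    """Integral of rho(lam) * f(lam) over the tree spectrum (midpoint rule in theta)."""
    R = 2 * sqrt(d - 1) / d                 # half-width of the spectrum around 1
    h = pi / n_grid
    total = 0.0
    for k in range(n_grid):
        theta = (k + 0.5) * h
        lam = 1 - R * cos(theta)
        # rho(lam) dlam = d (R sin theta)^2 / (2 pi lam (2 - lam)) dtheta
        total += d * (R * sin(theta)) ** 2 / (2 * pi * lam * (2 - lam)) * f(lam) * h
    return total

def learning_curve(nu, noise_var, p=10, a=2.0, d=3):
    """Solve eps = int rho [ (1 - lam/a)^(-p) / kappa + nu / (eps + noise_var) ]^(-1)."""
    kappa = 1.0 / tree_average(lambda lam: (1 - lam / a) ** p, d)

    def rhs(eps):
        h = nu / (eps + noise_var)
        return tree_average(lambda lam: 1.0 / ((1 - lam / a) ** (-p) / kappa + h), d)

    lo, hi = 0.0, 1.0                       # eps - rhs(eps) changes sign on [0, 1]
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if mid < rhs(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for nu in (0.1, 1.0, 10.0):
    print(nu, learning_curve(nu, noise_var=0.1))
```

Only the ratio ν = n/V enters the calculation, which is the numerical counterpart of the statement that curves for different V collapse when plotted against ν.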
A general consequence of the form of this result is that the learning curve depends on $n$ and $V$ only through the ratio $\nu = n/V$, i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small $\nu$ where $\epsilon \gg \sigma^2$ and the generalization error will be essentially independent of $\sigma^2$; and another for large $\nu$ where $\epsilon \ll \sigma^2$ so that $\epsilon$ can be neglected on the r.h.s. and one has a fully explicit expression for $\epsilon$.

We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-$n$ behaviour exactly; this is in contrast to the Euclidean case, where generally only the power law of the $n$-dependence but not its prefactor comes out accurately. To see why, we exploit that for large $n$ (where $\epsilon \ll \sigma^2$) the approximation (9) effectively neglects fluctuations in the training input "density" of a randomly drawn set of training inputs [3, 6]. This is justified in the graph case for large $\nu = n/V$, because the number of training inputs each vertex receives, Binomial($n$, $1/V$), has negligible relative fluctuations away from its mean $\nu$. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large $n$.

Fig. 2(right) illustrates that for larger $a$ the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps $p/a$ of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values ($a \to \infty$, $C_{ij} = \delta_{ij}$).
Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree $d = 3$ and $V = 500$ (small filled circles) and $V = 1000$ (empty circles) vertices. Plotting generalization error $\epsilon$ versus $\nu = n/V$ superimposes the results for both values of $V$, as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown ($\sigma^2$ = 0.1, 0.01, 0.001, 0.0001, 0), kernel parameters $a = 2$, $p = 10$. (Right) As on the left but with $V = 500$ only and for larger $a = 4$.

Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree $d = 3$ and $V = 500$, and kernel parameters $a = 2$, $p = 20$; noise level $\sigma^2$ as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger $p = 200$ and for a single random graph, with $\sigma^2 = 0.1$, plotted versus $n$. Dotted line: naive estimate $\epsilon = 1/(1 + n/\sigma^2)$. Dashed line: approximation (10) using the tree spectrum and the large $p$-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues $\lambda^L_\alpha$ as input.

In that limit and for low $\sigma^2$ and large $V$ the true learning curve is $\epsilon = \exp(-\nu)$, reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict $\epsilon = \max\{1 - \nu, 0\}$, i.e. a decay of the error to zero at $\nu = 1$. Plotting these two curves (not displayed here) indeed shows the same "shape" of disagreement as in Fig. 
2(right), with the approximation underestimating the true generalization error.

Increasing $p$ has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing $a$. In line with this, larger values of $p$ improve the accuracy of the approximation (12): see Fig. 3(left).

One may ask about the shape of the learning curves for large number of training examples (per vertex) $\nu$. The roughly straight lines on the right of the log-log plots discussed so far suggest that $\epsilon \propto 1/\nu$ in this regime. This is correct in the mathematical limit $\nu \to \infty$ because the graph kernel has a nonzero minimal eigenvalue $\lambda_- = \kappa V^{-1}(1 - \lambda^L_+/a)^p$: for $\nu \gg \sigma^2/(V\lambda_-)$, the square bracket in (12) can then be approximated by $\nu/(\epsilon + \sigma^2)$ and one gets (because also $\epsilon \ll \sigma^2$ in the asymptotic regime) $\epsilon \approx \sigma^2/\nu$.

However, once $p$ becomes reasonably large, $V\lambda_-$ can be shown – by analysing the scaling of $\kappa$, see Appendix – to be extremely (exponentially in $p$) small; for the parameter values in Fig. 3(left) it is around $4 \times 10^{-30}$. The "terminal" asymptotic regime $\epsilon \approx \sigma^2/\nu$ is then essentially unreachable. A more detailed analysis of (12) for large $p$ and large (but not exponentially large) $\nu$, as sketched in the Appendix, yields

$$\epsilon \propto (c\sigma^2/\nu) \ln^{3/2}(\nu/(c\sigma^2)), \qquad c \propto p^{-3/2} \tag{13}$$

This shows that there are logarithmic corrections to the naive $\sigma^2/\nu$ scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient $c$ with $p$, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order $\nu \propto c\sigma^2 \propto p^{-3/2}\sigma^2$. Even though the covariance kernel $C_{\ell,p}$ – in the same tree approximation that also went into (12) – approaches a limiting form for large $p$ as discussed in Sec. 2, generalization performance thus continues to improve with increasing $p$. The explanation for this must presumably be that $C_{\ell,p}$ converges to the limit (7) only at fixed $\ell$, while in the tail $\ell \propto p$, it continues to change.
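The size of the smallest scaled kernel eigenvalue can be checked numerically from the tree spectrum. The sketch below computes κ from its normalization condition and then evaluates κ(1 - λ_+/a)^p for a = 2, p = 20, d = 3, the parameters of Fig. 3(left). It assumes the normalized tree density ρ(λ) = d·sqrt(4(d-1)/d² - (λ-1)²) / (2πλ(2-λ)); the function names are illustrative.

```python
from math import cos, sin, pi, sqrt

def kappa_tree(p, a=2.0, d=3, n_grid=4000):
    """kappa from the condition kappa * int rho(lam) (1 - lam/a)^p dlam = 1."""
    R = 2 * sqrt(d - 1) / d
    h = pi / n_grid
    integral = 0.0
    for k in range(n_grid):
        theta = (k + 0.5) * h
        lam = 1 - R * cos(theta)            # substitution removes the sqrt edges
        rho_dlam = d * (R * sin(theta)) ** 2 / (2 * pi * lam * (2 - lam)) * h
        integral += rho_dlam * (1 - lam / a) ** p
    return 1.0 / integral

def smallest_scaled_eigenvalue(p, a=2.0, d=3):
    """V * lambda_minus = kappa * (1 - lam_plus/a)^p, evaluated at the upper Laplacian edge."""
    lam_plus = 1 + 2 * sqrt(d - 1) / d
    return kappa_tree(p, a, d) * (1 - lam_plus / a) ** p

print(smallest_scaled_eigenvalue(p=20))     # compare with the ~4e-30 quoted in the text
```

Since the terminal regime requires ν much larger than σ² divided by this number, it lies far beyond any practically reachable number of training examples.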
For finite graph sizes $V$ we know of course that loops will eventually become important as $p$ increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, $C_{ij} \propto (d_i d_j)^{1/2}$ which in the regular case simplifies to $C_{ij} = 1$. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue $\lambda_{\alpha=1} = 1$, one would predict $\epsilon = 1/(1 + n/\sigma^2)$. Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph $\lambda_{\alpha=1} = 0.994$ is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive $\epsilon \propto 1/n$ scaling. Using the tree spectrum as an approximation and exploiting the large-$p$ limit, one finds indeed (see Appendix, Eq. (17)) that $\epsilon \propto (c'\sigma^2/n) \ln^{3/2}(n/(c'\sigma^2))$ where now $n$ enters rather than $\nu = n/V$, $c'$ being a constant dependent only on $p$ and $a$: informally, the function to be learned only has a finite (rather than $\propto V$) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data in Fig. 3(right), as the dashed line in the figure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned.
4 Summary and Outlook

We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree $d$. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when $p$ – or the corresponding lengthscale $\sigma$ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of $p$ around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result.

In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is $\nu = n/V$, the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit $\nu \gg 1$. Quantitatively, we found that the learning curves decay as $\epsilon \propto \sigma^2/\nu$ with non-trivial logarithmic correction terms. Slower power laws $\propto \nu^{-\alpha}$ with $\alpha < 1$, as in the Euclidean case, do not appear. We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points.
Finally we looked at the learning curves for larger $p$, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as $\sigma^2/n$ with again logarithmic corrections.

In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees $d$, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region $\epsilon \approx \sigma^2$, which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters.

Appendix

We sketch here how to derive (13) from (12) for large $p$. Eq. (12) writes $\epsilon = g(\nu V/(\epsilon + \sigma^2))$ with

$$g(h) = \int_{\lambda^L_-}^{\lambda^L_+} d\lambda^L \, \rho(\lambda^L) \left[\kappa^{-1}(1 - \lambda^L/a)^{-p} + hV^{-1}\right]^{-1} \tag{14}$$

and $\kappa$ determined from the condition $g(0) = 1$. (This $g(h)$ is the tree spectrum approximation to the $g(h)$ of (10).) Turning first to $g(0)$, the factor $(1 - \lambda^L/a)^p$ decays quickly to zero as $\lambda^L$ increases above $\lambda^L_-$. One can then approximate this factor according to $(1 - \lambda^L/a)^p = (1 - \lambda^L_-/a)^p [(a - \lambda^L)/(a - \lambda^L_-)]^p \approx (1 - \lambda^L_-/a)^p \exp[-(\lambda^L - \lambda^L_-)p/(a - \lambda^L_-)]$. In the regime near $\lambda^L_-$ one can also approximate the spectral density (11) by its leading square-root increase, $\rho(\lambda^L) = r(\lambda^L - \lambda^L_-)^{1/2}$, with $r = (d - 1)^{1/4} d^{5/2}/[\pi(d - 2)^2]$.
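The square-root edge approximation of the spectral density, with the coefficient r quoted above, can be checked numerically. The sketch assumes the normalized form of the tree density, ρ(λ) = d·sqrt(4(d-1)/d² - (λ-1)²) / (2πλ(2-λ)); function names are illustrative.

```python
from math import sqrt, pi

def rho(lam, d=3):
    """Tree spectral density of the normalized Laplacian, normalized to integrate to 1."""
    inside = 4 * (d - 1) / d**2 - (lam - 1) ** 2
    return d * sqrt(max(inside, 0.0)) / (2 * pi * lam * (2 - lam))

def rho_edge(lam, d=3):
    """Leading behaviour r * (lam - lam_minus)^(1/2) near the lower spectral edge."""
    lam_minus = 1 - 2 * sqrt(d - 1) / d
    r = (d - 1) ** 0.25 * d**2.5 / (pi * (d - 2) ** 2)
    return r * sqrt(lam - lam_minus)

lam_minus = 1 - 2 * sqrt(2) / 3             # lower edge for d = 3
for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    lam = lam_minus + delta
    print(delta, rho(lam) / rho_edge(lam))
```

The ratio tends to one as the distance from the edge shrinks, which is what justifies replacing ρ by its square-root form once the factor (1 - λ/a)^p has confined the integral to a narrow region of width of order 1/p.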
Switching then to a new integration variable $y = (\lambda^L - \lambda^L_-)p/(a - \lambda^L_-)$ and extending the integration limit to $\infty$ gives

$$1 = g(0) = \kappa r (1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} \int_0^\infty dy \, \sqrt{y}\, e^{-y} \tag{15}$$

and this fixes $\kappa$. Proceeding similarly for $h > 0$ gives

$$g(h) = \kappa r (1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F\left(h\kappa V^{-1}(1 - \lambda^L_-/a)^p\right), \qquad F(z) = \int_0^\infty dy \, \sqrt{y}\, (e^y + z)^{-1} \tag{16}$$

Dividing by $g(0) = 1$ shows that simply $g(h) = F(hV^{-1}c^{-1})/F(0)$, where $c = 1/[\kappa(1 - \lambda^L_-/a)^p] = rF(0)[p/(a - \lambda^L_-)]^{-3/2}$ which scales as $p^{-3/2}$. In the asymptotic regime $\epsilon \ll \sigma^2$ we then have $\epsilon = g(\nu V/\sigma^2) = F(\nu/(c\sigma^2))/F(0)$ and the desired result (13) follows from the large-$z$ behaviour of $F(z) \approx z^{-1} \ln^{3/2}(z)$.

One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight $1/V$ then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get

$$g(h) = \frac{(V + h\kappa)^{-1} + r(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F\left(h\kappa V^{-1}(1 - \lambda^L_-/a)^p\right)}{V^{-1} + r(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F(0)} \tag{17}$$

The denominator here is $\kappa^{-1}$ and the two terms are proportional respectively to the covariance kernel eigenvalue $\lambda_1$, corresponding to $\lambda^L = 0$ and the constant eigenfunction, and to $1 - \lambda_1$. Dropping the first terms in the numerator and denominator of (17) by taking $V \to \infty$ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where $\lambda_1$ is close to unity, we have $\kappa \approx V$ and so

$$g(h) \approx (1 + h)^{-1} + rV(1 - \lambda^L_-/a)^p [p/(a - \lambda^L_-)]^{-3/2} F\left(h(1 - \lambda^L_-/a)^p\right) \tag{18}$$

The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: $\epsilon = g(n/\sigma^2) \propto F(n/(c'\sigma^2))$ with $c' = (1 - \lambda^L_-/a)^{-p}$ independent of $V$. The large-$n$ tail of the learning curve in Fig. 
3(right) is consistent with this form.

References

[1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.
[2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997.
[3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000.
[6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[10] P Sollich. 
Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005.
[11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM.
[12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Schölkopf, editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003.
[13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann.
[14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999.
[16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999.
[17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic brownian motion. J. Phys. A, 29(10):2399–2409, 1996.
[18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008.

2 0.97723639 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

3 0.96690029 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

Author: Xiao-ming Wu, Anthony M. So, Zhenguo Li, Shuo-yen R. Li

Abstract: Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order m^2.5, where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.

4 0.96153468 190 nips-2009-Polynomial Semantic Indexing

Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods.

same-paper 5 0.95496863 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

6 0.94142729 176 nips-2009-On Invariance in Hierarchical Models

7 0.82323658 151 nips-2009-Measuring Invariances in Deep Networks

8 0.77821445 119 nips-2009-Kernel Methods for Deep Learning

9 0.74255657 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

10 0.71890473 137 nips-2009-Learning transport operators for image manifolds

11 0.71323419 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

12 0.71076536 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

13 0.70816863 196 nips-2009-Quantification and the language of thought

14 0.70777714 95 nips-2009-Fast subtree kernels on graphs

15 0.70003581 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

16 0.69799942 2 nips-2009-3D Object Recognition with Deep Belief Nets

17 0.68679309 104 nips-2009-Group Sparse Coding

18 0.68535066 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

19 0.67558748 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

20 0.66808069 87 nips-2009-Exponential Family Graph Matching and Ranking