nips nips2011 nips2011-93 knowledge-graph by maker-knowledge-mining

93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network


Source: pdf

Author: Ke Chen, Ahmad Salman

Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. [sent-5, score-0.217]

2 Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. [sent-6, score-0.774]

3 In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. [sent-7, score-0.164]

4 With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. [sent-9, score-0.892]

5 1 Introduction It is well known that speech conveys various yet mixed information, comprising linguistic information, the major component, and non-verbal information such as speaker-specific and emotional components [1]. [sent-11, score-0.237]

6 For human communication, all the information components in speech turn out to be very useful and exclusively used for different tasks. [sent-12, score-0.19]

7 For example, one often recognizes a speaker regardless of what is spoken in speaker recognition, while it is effortless for him/her to understand exactly what is spoken by different speakers in speech recognition. [sent-13, score-1.366]

8 In general, however, there is no effective way to automatically extract an information component of interest from speech signals so that the same representation has to be used in different speech information tasks. [sent-14, score-0.463]

9 The interference of different yet entangled speech information components in most existing acoustic representations hinders a speech or speaker recognition system from achieving better performance [1]. [sent-15, score-0.931]

10 On the other hand, the Siamese architecture originally proposed in [10] uses supervised yet contrastive learning. (Figure 1: Regularized Siamese deep network (RSDN) architecture.) [sent-23, score-0.171]

11 Inspired by the aforementioned work, we present a regularized Siamese deep network (RSDN) to extract speaker-specific information from a spectral representation, Mel Frequency Cepstral Coefficients (MFCCs), commonly used in both speech and speaker recognition. [sent-26, score-0.705]

12 Training proceeds by greedy layer-wise unsupervised learning for initializing its component deep neural networks, followed by global supervised learning based on the proposed loss function. [sent-30, score-0.114]

13 With LDC benchmark corpora [14] and a Chinese corpus [15], we demonstrate that a generic speaker-specific representation learned by our RSDN is insensitive to text and languages spoken and, moreover, applicable to speech corpora unseen during learning. [sent-31, score-0.586]

14 Experimental results in speaker recognition suggest that a representation learned by the RSDN outperforms both MFCCs and the representation learned by the CDBN [9], which learns a generic speech representation without speaker-specific information extraction. [sent-32, score-0.736]

15 1 Architecture As illustrated in Figure 1, our RSDN architecture consists of two subnets, and each subnet is a fully connected multi-layered perceptron of 2K+1 layers, i.e. [sent-43, score-0.181]

16 an input layer, 2K-1 hidden layers and a visible layer at the top. [sent-45, score-0.222]

17 If we stipulate that layer 0 is the input layer, layers k and 2K-k have the same number of neurons for k = 0, 1, · · · , K. [sent-46, score-0.279]

18 In particular, the Kth hidden layer is used as code layer, and neurons in this layer are further divided into two subsets. [sent-47, score-0.374]

19 As depicted in Figure 1, those neurons in the box named CS and colored in red constitute one subset for encoding speaker-specific information and all remaining neurons in the code layer form the other subset expected to accommodate non-speaker related information. [sent-48, score-0.33]

20 The input to each subnet is the MFCC representation of a frame obtained by short-term analysis, in which a speech segment is divided into a number of frames and an MFCC representation is computed for each frame. [sent-49, score-0.481]

21 As depicted in Figure 1, xit is the MFCC feature vector of frame t in Xi, input to subnet i (i=1,2), where Xi = {xit : t = 1, . . . , TB} collectively denotes the MFCC feature vectors for a speech segment of TB frames. [sent-50, score-0.802]

22 During learning, the two identical subnets are coupled at their coding layers via neurons in CS with an incompatibility measure defined on two speech segments of equal length, X1 and X2, input to the two subnets, which will be presented in 2. [sent-51, score-0.446]

23 After learning, we achieve two identical subnets and hence can use either of them to produce a new representation for a speech frame. [sent-53, score-0.35]

24 For input x to a subnet, only the bottom K layers of the subnet are used and the output of neurons in CS at the code layer, or layer K, denoted by CS(x), is its new representation, as illustrated by the dashed box in Figure 1. [sent-54, score-0.608]
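
A minimal NumPy sketch of how the speaker-specific feature CS(x) would be read off the bottom K layers of one trained subnet; the layer sizes, random weights, and the extract_cs helper are hypothetical placeholders rather than the authors' code (sentence 59 reports the selected structure as K=4 with 100, 100, 100 and 200 neurons in layers 1-4 and |CS|=100).

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, matching the sigma[uk(xit)] notation of sentence 37.
    return 1.0 / (1.0 + np.exp(-z))

def extract_cs(x, weights, biases, cs_size):
    """Hypothetical helper: run one MFCC frame x through the bottom K layers
    of a trained subnet and return the output of the CS neurons at layer K."""
    h = x
    for W, b in zip(weights, biases):   # layers 1..K only (the encoder half)
        h = sigmoid(W @ h + b)          # h_k = sigma(W_k h_{k-1} + b_k)
    return h[:cs_size]                  # CS(x): the first |CS| units of the code layer

# Toy example with made-up sizes: a 39-dim MFCC frame, K=4, |CS|=100.
rng = np.random.default_rng(0)
sizes = [39, 100, 100, 100, 200]        # input plus layers 1-4 (code layer: 200 units)
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
frame = rng.normal(size=sizes[0])       # one MFCC frame (dimension is illustrative)
print(extract_cs(frame, weights, biases, cs_size=100).shape)   # (100,)
```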

25 2 Loss Function Let CS(xit) be the output of all neurons in CS of subnet i (i=1,2) for input xit ∈ Xi and CS(Xi) = {CS(xit) : t = 1, . . . , TB}, which pools the output of neurons in CS for the TB frames in Xi, as illustrated in Figure 1. [sent-56, score-0.787]

26 Intuitively, two speech segments belonging to different speakers lead to different statistics and hence their incompatibility score measured by (1) should be large after learning. [sent-61, score-0.449]

27 For a corpus of multiple speakers, we can construct a training set in which an example is of the form (X1, X2; I), where I is a label defined as I = 1 if the two speech segments, X1 and X2, are spoken by the same speaker and I = 0 otherwise. [sent-63, score-0.762]
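
A minimal sketch, under assumed data structures, of how such labeled pairs (X1, X2; I) could be assembled by exhaustively combining segments (cf. sentence 58); the dictionary layout and function name are illustrative, not the authors' pipeline.

```python
import itertools
import numpy as np

def make_training_pairs(segments_by_speaker):
    """segments_by_speaker: dict mapping a speaker id to a list of segments,
    each segment an array of MFCC frames. Returns (X1, X2, I) triples with
    I = 1 for same-speaker pairs and I = 0 otherwise."""
    items = [(spk, seg) for spk, segs in segments_by_speaker.items() for seg in segs]
    return [(a, b, int(sa == sb))
            for (sa, a), (sb, b) in itertools.combinations(items, 2)]

# Toy usage: two speakers, three segments in total -> 3 exhaustive pairs.
toy = {"spk1": [np.zeros((50, 39)), np.ones((50, 39))],
       "spk2": [np.full((50, 39), 2.0)]}
print([label for _, _, label in make_training_pairs(toy)])   # [1, 0, 0]
```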

28 Using such training examples, we apply the energy-based model principle [16] to define a loss function as L(X1, X2; Θ) = α[LR(X1; Θ) + LR(X2; Θ)] + (1 − α)LD(X1, X2; Θ) (2), where LR(Xi; Θ) = (1/TB) Σ_{t=1}^{TB} ||x̂it − xit||² (i = 1, 2) and LD(X1, X2; Θ) = I·D + (1 − I)(e^(−Dm/λm) + e^(−DS/λS)). [sent-64, score-0.467]
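
The loss in eq. (2) can be sketched directly; since eq. (1) is not reproduced in this summary, the forms used below for D, Dm and DS (squared distances between the segments' CS means and covariances, with D = Dm + DS) are assumptions made for illustration, as are the function names and default hyperparameters.

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    # L_R(X; theta) = (1/T_B) sum_t ||x_hat_t - x_t||^2
    return float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))

def segment_stats(CS_X):
    # First- and second-order statistics of the CS outputs over the T_B frames.
    return CS_X.mean(axis=0), np.cov(CS_X, rowvar=False)

def rsdn_loss(X1, X1_hat, X2, X2_hat, CS_X1, CS_X2, I,
              alpha=0.5, lam_m=1.0, lam_s=1.0):
    """Eq. (2): alpha*(L_R(X1)+L_R(X2)) + (1-alpha)*L_D(X1, X2).
    D_m, D_S and D are *assumed* squared distances between segment-level
    CS statistics; the paper's eq. (1) may differ in detail."""
    mu1, S1 = segment_stats(CS_X1)
    mu2, S2 = segment_stats(CS_X2)
    D_m = float(np.sum((mu1 - mu2) ** 2))
    D_S = float(np.sum((S1 - S2) ** 2))
    D = D_m + D_S
    L_R = reconstruction_loss(X1, X1_hat) + reconstruction_loss(X2, X2_hat)
    L_D = I * D + (1 - I) * (np.exp(-D_m / lam_m) + np.exp(-D_S / lam_s))
    return alpha * L_R + (1 - alpha) * L_D
```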

29 By nature, both speaker-specific and non-speaker related information components are entangled over speech [1],[5]. [sent-71, score-0.213]

30 By minimizing reconstruction errors in two subnets, the code layer leads to a speaker-specific representation with the output of neurons in CS while the remaining neurons are used to regularize various interference by capturing some invariant properties underlying them for good generalization. [sent-76, score-0.459]

31 Learning consists of two stages: pre-training for initializing the subnets and discriminative learning for learning a speaker-specific representation. [sent-81, score-0.14]

32 Let hkj(xit) denote the output of the jth neuron in layer k for k = 0, 1, · · · , K, · · · , 2K. [sent-83, score-0.245]

33 hk(xit) = {hkj(xit)}_{j=1}^{|hk|} is a collective notation for the output of all neurons in layer k of subnet i (i=1,2), where |hk| is the number of neurons in layer k. [sent-84, score-0.73]

34 By this notation, k=0 refers to the input layer with h0(xit) = xit, and k=2K refers to the top layer producing the reconstruction x̂it. [sent-85, score-1.112]

35 For the code layer, i.e. layer K, CS(xit) = {hKj(xit)}_{j=1}^{|CS|} is a simplified notation for the output of neurons in CS. [sent-88, score-0.221]

36 Let Wk and bk denote the connection weight matrix between layers k-1 and k and the bias vector of layer k in subnet i (i=1,2), respectively, for k=1,· · · ,2K. [sent-89, score-0.36]

37 Then the output of layer k is hk(xit) = σ[uk(xit)] for k = 1, · · · , 2K−1, where uk(xit) = Wk^(i) hk−1(xit) + bk^(i) and σ(z) = {(1 + e^(−zj))^(−1)}_{j=1}^{|z|}. [sent-90, score-0.315]

38 A denoising autoencoder is a three-layered perceptron where the input, x̃, is a distorted version of the target output, x. [sent-96, score-0.118]

39 Since the MFCCs fed to the first hidden layer and the intermediate representations input to all other hidden layers are continuous-valued, we always distort the input, x, by adding Gaussian noise to form a distorted version, x̃. [sent-98, score-0.368]

40 Finally, the second subnet is created by simply duplicating the pre-trained one. [sent-104, score-0.127]
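
As a concrete illustration of this pre-training stage, a single denoising-autoencoder update with Gaussian input corruption might look as follows; the untied weights, linear reconstruction layer, learning rate and function name are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(X, W1, b1, W2, b2, noise_std=0.1, lr=0.01, rng=None):
    """One gradient step of a denoising autoencoder on the T_B frames of a
    segment: corrupt the input with Gaussian noise, reconstruct the clean
    frames, and descend the mean squared reconstruction error.
    Shapes: X is (T_B, d), W1 is (d, h), W2 is (h, d)."""
    rng = rng or np.random.default_rng(0)
    T_B = X.shape[0]
    X_tilde = X + rng.normal(0.0, noise_std, X.shape)   # distorted version x~
    H = sigmoid(X_tilde @ W1 + b1)                      # hidden code
    X_hat = H @ W2 + b2                                 # reconstruction of the clean x
    err = 2.0 * (X_hat - X) / T_B                       # grad of mean ||x_hat - x||^2
    dH = (err @ W2.T) * H * (1.0 - H)                   # backprop through the sigmoid
    W2 -= lr * H.T @ err
    b2 -= lr * err.sum(axis=0)
    W1 -= lr * X_tilde.T @ dH
    b1 -= lr * dH.sum(axis=0)
    return float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))
```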

41 Given that our loss function is defined on statistics of the TB frames in a speech segment, we cannot update parameters until we have the output of neurons in CS at the code layer for all TB frames. [sent-108, score-0.375]

42 Fortunately, the SBP algorithm perfectly meets our requirement: in the SBP algorithm, we always set the batch size to the number of frames in a speech segment. [sent-109, score-0.212]

43 For layer k = 2K, ∂LR/∂u2K(xit) = 2(x̂it − xit), i = 1, 2. (3) [sent-112, score-0.556]

44 For all hidden layers, k = 2K−1, · · · , 1, applying the chain rule and (3) leads to ∂LR/∂uk(xit) = {(∂LR/∂hkj(xit)) hkj(xit)[1 − hkj(xit)]}_{j=1}^{|hk|} and ∂LR/∂hk(xit) = (Wk+1^(i))^T ∂LR/∂uk+1(xit). (4) [sent-113, score-0.127]

45 As the contrastive loss LD(X1, X2; Θ) is defined on neurons in CS at the code layers of the two subnets, its gradients are determined only by parameters related to the K hidden layers in the two subnets, as depicted by the dashed boxes in Figure 1. [sent-114, score-0.375]

46 For layer k = K and subnet i = 1, 2, after a derivation (see the appendix for details), we obtain ∂LD/∂uK(xit), whose jth component is [I − λm^(−1)(1 − I)e^(−Dm/λm)]ψj(xit) + [I − λS^(−1)(1 − I)e^(−DS/λS)]ξj(xit) for j = 1, . . . , |CS| and 0 for j = |CS|+1, . . . , |hK|. [sent-115, score-0.247]

47 Here ψj(xit) and ξj(xit) arise from the derivatives of Dm and DS with respect to CS(xit); in particular, ξj(xit) involves (Σ(1) − Σ(2))[CS(xit) − µ(i)] scaled by TB, and CS(xit)j is the output of the jth neuron in CS for input xit. [sent-118, score-0.457]

48 For layers k = K−1, · · · , 1, we have ∂LD/∂uk(xit) = {(∂LD/∂hkj(xit)) hkj(xit)[1 − hkj(xit)]}_{j=1}^{|hk|} and ∂LD/∂hk(xit) = (Wk+1^(i))^T ∂LD/∂uk+1(xit). [sent-119, score-0.183]

49 For layers k = K+1, · · · , 2K, the parameters are updated by Wk^(i) ← Wk^(i) − (α/TB) Σ_{r=1}^{TB} (∂LR/∂uk(xrt)) [hk−1(xrt)]^T and bk^(i) ← bk^(i) − (α/TB) Σ_{r=1}^{TB} ∂LR/∂uk(xrt). [sent-122, score-0.147]
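
Procedurally, these updates are plain gradient steps in which one 'batch' is exactly the TB frames of a speech segment, as required by the segment-level statistics in the loss (the SBP setting of sentence 42); a minimal sketch, with grad_u standing in for the per-frame gradients ∂L/∂uk(xrt) and all names assumed for illustration:

```python
import numpy as np

def sbp_update(W, b, grad_u, h_prev, lr):
    """Segment-level parameter update: average the per-frame gradients of one
    speech segment (the whole 'batch' of T_B frames) before stepping.
    grad_u: (T_B, m) gradients w.r.t. the layer's pre-activations u_k(x_rt).
    h_prev: (T_B, n) outputs of the previous layer for the same frames."""
    T_B = grad_u.shape[0]
    W_new = W - (lr / T_B) * grad_u.T @ h_prev   # W_k <- W_k - (alpha/T_B) sum_r dL/du_k(x_rt) h_{k-1}(x_rt)^T
    b_new = b - (lr / T_B) * grad_u.sum(axis=0)
    return W_new, b_new
```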

50 4 Experiment In this section, we describe our experimental methodology and report experimental results on visualization of vowel distributions, speaker comparison and speaker segmentation. [sent-127, score-0.942]

51 We employ two LDC benchmark corpora [14], KING and TIMIT, and a Chinese speech corpus [15], CHN, in our experiments. [sent-128, score-0.331]

52 KING, including wide-band and narrow-band sets, consists of 51 speakers whose utterances were recorded in 10 sessions. [sent-129, score-0.302]

53 There are 630 speakers in TIMIT and 59 speakers in CHN, the latter recorded over three sessions. [sent-131, score-0.324]

54 All corpora were collected especially for evaluating a speaker recognition system. [sent-132, score-0.51]

55 For the RSDN learning, we use utterances of all 49 speakers recorded in sessions 1 and 2 in KING. [sent-137, score-0.376]

56 Furthermore, we distort all the utterances with an additive white noise channel at an SNR of 10 dB and a Rayleigh fading channel with a 5 Hz Doppler shift [19] to simulate channel effects. [sent-138, score-0.16]

57 Thus our training set consists of clean utterances and their corrupted versions. [sent-139, score-0.14]

58 We randomly divide all utterances into speech segments of a length TB (1 sec ≤ TB ≤ 2 sec) and then exhaustively combine them to form training examples as described in Sect. [sent-140, score-0.434]

59 With a validation set of all the utterances recorded in session 3 in KING, we select a structure of K=4 (100, 100, 100 and 200 neurons in layers 1-4 and |CS|=100 in the code layer, layer 4) from candidate models with K between 2 and 5 and 50-1000 neurons per hidden layer. [sent-143, score-0.698]

60 For any speaker recognition task, speaker modeling (SM) is inevitable. [sent-152, score-0.839]

61 In our experiments, we use the 1st- and 2nd-order statistics of a speech segment based on a representation, SM = {µ, Σ}, for SM. [sent-153, score-0.22]

62 Furthermore, we employ a speaker distance metric: d(SM1, SM2) = tr[(Σ1^(−1) + Σ2^(−1))(µ1 − µ2)(µ1 − µ2)^T], where SMi = {µi, Σi} (i = 1, 2) are two speaker models (SMs). [sent-154, score-0.81]
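
A direct NumPy rendering of the speaker model SM = {µ, Σ} and the distance d(SM1, SM2) above; the small ridge term added before inversion is an assumption for numerical stability, not part of the stated metric.

```python
import numpy as np

def speaker_model(features):
    """Build SM = {mu, Sigma} from the frame-level features of a segment
    (rows are frames, e.g. CS outputs or MFCCs)."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def speaker_distance(sm1, sm2, ridge=1e-6):
    """d(SM1, SM2) = tr[(Sigma1^-1 + Sigma2^-1)(mu1 - mu2)(mu1 - mu2)^T]."""
    mu1, S1 = sm1
    mu2, S2 = sm2
    eye = ridge * np.eye(len(mu1))                    # assumed regularizer for invertibility
    P = np.linalg.inv(S1 + eye) + np.linalg.inv(S2 + eye)
    d = mu1 - mu2
    return float(d @ P @ d)                           # tr[P d d^T] = d^T P d

# Toy usage: two segments from the same distribution score lower than segments
# drawn from a clearly different one.
rng = np.random.default_rng(1)
a1, a2 = rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5))
b = rng.normal(3, 1, (200, 5))
print(speaker_distance(speaker_model(a1), speaker_model(a2)))
print(speaker_distance(speaker_model(a1), speaker_model(b)))
```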

63 TIMIT [14] provides phonetic transcription of all 10 utterances containing all 20 vowels in English for every speaker. [sent-163, score-0.267]

64 As all the vowels may appear in 10 different utterances, up to 200 vowel segments of length 0.5 sec are available for each speaker. [sent-164, score-0.229]

65 This enables us to investigate vowel distributions in a representation space for different speakers. [sent-166, score-0.138]

66 Here, we merely visualize the mean feature vectors of up to 200 segments per speaker, under a specific representation, by projecting them onto a two-dimensional plane with the t-SNE method [21], which is likely to reflect intrinsic manifolds. [sent-167, score-0.542]

67 In the code layer of our RSDN, the output of neurons 1-100 forms a speaker-specific representation, CS, and that of the remaining 100 neurons becomes a non-speaker related representation, dubbed CS-bar. [sent-168, score-0.332]

68 For a noticeable effect, we randomly choose only five speakers (four females and one male) and visualize their vowel distributions in Figure 2 in terms of the CS, CS-bar and MFCC representations, respectively, where a marker/color corresponds to a speaker. [sent-169, score-0.202]

69 It is evident from Figure 2(a) that, by using the CS representation, most vowels spoken by a speaker are tightly grouped together while vowels spoken by different speakers are well separated. [sent-170, score-1.025]

70 For the CS-bar representation, close inspection of Figure 2(b) reveals that the same vowels spoken by different speakers are, to a great extent, co-located. [sent-171, score-0.391]

71 Moreover, most phonetically correlated vowels, as circled and labeled, are closely located in dense regions independent of speakers and genders. [sent-172, score-0.182]

72 In particular, most vowels spoken by the male, marked and colored in green, are grouped tightly but isolated from those of all females. [sent-174, score-0.229]

73 Thus, the visualization in Figure 2 demonstrates how our RSDN learning works and could lend evidence to justify why MFCCs can be used in both speech and speaker recognition [1]. [sent-175, score-0.648]

74 During data collection, there was a “great divide” between sessions 1-5 and 6-10; both recording device and environments changed, which alters spectral features of 26 speakers and leads to 10dB SNR reduction on average. [sent-179, score-0.236]

75 As suggested in [18], we conduct two experiments: within-divide where SMs built on utterances in session 1 are compared to SMs on those in sessions 2-5 and cross-divide where SMs built on utterances in session 1 are compared with those in sessions 6-10. [sent-180, score-0.478]

76 As short utterances pose a greater challenge for speaker recognition [4],[18],[20], utterances are partitioned into short segments of a certain length and SMs built on segments of the same length are always used for SC. [sent-181, score-0.838]

77 Figure 3: Performance of speaker comparison (DET) in the within-divide (upper row) and the cross-divide (lower row) experiments for different segment lengths. [sent-247, score-0.435]

78 Table 1: Performance (mean±std)% of speaker segmentation on TIMIT and CHN audio streams. [sent-251, score-0.469]

79 3 Speaker Segmentation Speaker segmentation (SS) is a task of detecting speaker change points in an audio stream to split it into acoustically homogeneous segments so that every segment contains only one speaker [23]. [sent-259, score-0.987]
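
As a rough, generic illustration of the SS task (not the specific detectors evaluated in the paper), one can slide two adjacent windows over frame-level features and flag local maxima of a speaker distance as candidate change points; window length, hop, threshold and function names below are assumptions.

```python
import numpy as np

def change_point_scores(features, distance_fn, win=100, hop=10):
    """Score each candidate position by the distance between models built on
    the two adjacent windows around it; distance_fn maps two (frames x dim)
    arrays to a scalar dissimilarity."""
    positions, scores = [], []
    for c in range(win, len(features) - win, hop):
        positions.append(c)
        scores.append(distance_fn(features[c - win:c], features[c:c + win]))
    return np.array(positions), np.array(scores)

def detect_changes(positions, scores, threshold):
    # Keep local maxima above the threshold as candidate speaker change points.
    return [positions[i] for i in range(1, len(scores) - 1)
            if scores[i] > threshold and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]]

# Example distance: the segment-statistics metric of sentence 62 applied to the two windows.
def stats_distance(A, B, ridge=1e-6):
    mu_a, mu_b = A.mean(0), B.mean(0)
    Sa = np.cov(A, rowvar=False) + ridge * np.eye(A.shape[1])
    Sb = np.cov(B, rowvar=False) + ridge * np.eye(B.shape[1])
    d = mu_a - mu_b
    return float(d @ (np.linalg.inv(Sa) + np.linalg.inv(Sb)) @ d)
```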

80 Following the same protocol used in previous work [23], we utilize utterances in TIMIT and CHN corpora to simulate audio conversations. [sent-260, score-0.28]

81 As a result, we randomly select 250 speakers from TIMIT to create 25 audio streams where the duration of speakers ranges from 1. [sent-261, score-0.413]

82 0 sec and 50 speakers from CHN to create 15 audio streams where the duration of speakers is from 3. [sent-263, score-0.455]

83 Note that the BIC method is inapplicable to our representation: it uses only covariance information, and the high dimensionality of our representation together with the small sliding window used in the BIC results in unstable performance, as pointed out earlier in this section. [sent-271, score-0.112]

84 Table 1 tabulates SS performance; as boldfaced, the results with our representation are superior to those with MFCCs regardless of the SS method and the corpus used to create the audio streams in our simulations. [sent-276, score-0.221]

85 In summary, the visualization of vowels and the results in SC and SS suggest that our RSDN successfully extracts speaker-specific information; its resultant representation generalizes to corpora unseen during learning and is insensitive to text and languages spoken and to environmental changes. [sent-277, score-0.428]

86 5 Discussion As pointed out earlier, speech carries different yet mixed information, and speaker-specific information is minor in comparison to the predominant linguistic information. [sent-278, score-0.307]

87 In particular, the use of data regularization in discriminative learning and distorted data in two learning phases plays a critical role in capturing intrinsic speaker-specific characteristics and variations caused by miscellaneous mismatches. [sent-280, score-0.125]

88 Our results not reported here show that such an architecture learns a representation often overfitting to the training corpus due to interference of predominant non-speaker related information, which is not a problem in predominant information extraction. [sent-289, score-0.386]

89 The DA in [12] uses the RBM [13] as a building block to construct a deep belief subnet in their Siamese DA and the NCA [25] as their contrastive loss function to minimize the intra-class variability. [sent-295, score-0.275]

90 On the other hand, intrinsic topological structures of a handwritten digit convey predominant information given the fact that without using the NCA loss a deep belief autoencoder already yields a good representation [7],[12],[13],[26]. [sent-298, score-0.3]

91 In our work, however, speaker-specific information is non-predominant in speech and hence a large amount of labeled data reflecting miscellaneous variabilities is required during discriminative learning despite the pre-training. [sent-300, score-0.279]

92 Finally, our code layer yields an overcomplete representation to facilitate non-predominant information extraction. [sent-301, score-0.207]

93 In contrast, a parsimonious representation seems more suitable for extracting predominant information since dimensionality reduction is likely to discover “principal” components that often associate with predominant information, as is evident in [11],[12]. [sent-302, score-0.215]

94 To conclude, we propose a deep neural architecture for speaker-specific information extraction and demonstrate that its resultant speaker-specific representation outperforms the state-of-the-art techniques. [sent-303, score-0.215]

95 It should also be stated that our work presented here is limited to speech corpora available at present. [sent-304, score-0.266]

96 Our work demonstrates that speech ICA is feasible via learning. [sent-307, score-0.19]

97 Moreover, deep learning could be a promising methodology for speech ICA. [sent-308, score-0.273]

98 Wang for offering their SIAT Chinese speech corpus [15] to us; both of which were used in our experiments. [sent-311, score-0.255]

99 (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. [sent-362, score-0.147]

100 (1995) Speaker Identification and verification using Gaussian mixture speaker models. [sent-411, score-0.405]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xit', 0.436), ('speaker', 0.405), ('rsdn', 0.276), ('cs', 0.262), ('tb', 0.214), ('speech', 0.19), ('speakers', 0.162), ('mfcc', 0.148), ('utterances', 0.14), ('subnet', 0.127), ('vowels', 0.127), ('sm', 0.126), ('mfccs', 0.121), ('lr', 0.121), ('layer', 0.12), ('cdbn', 0.112), ('siamese', 0.111), ('ld', 0.108), ('hkj', 0.104), ('subnets', 0.104), ('spoken', 0.102), ('xrt', 0.092), ('deep', 0.083), ('neurons', 0.08), ('layers', 0.079), ('hk', 0.078), ('corpora', 0.076), ('sessions', 0.074), ('interference', 0.071), ('predominant', 0.07), ('chn', 0.069), ('corpus', 0.065), ('audio', 0.064), ('segments', 0.062), ('uk', 0.062), ('wk', 0.061), ('alarm', 0.061), ('miss', 0.061), ('timit', 0.061), ('sms', 0.058), ('representation', 0.056), ('gmm', 0.055), ('architecture', 0.054), ('distorted', 0.047), ('linguistic', 0.047), ('ldc', 0.046), ('nking', 0.046), ('da', 0.044), ('sec', 0.042), ('autoencoder', 0.041), ('vowel', 0.04), ('nca', 0.04), ('bic', 0.037), ('ss', 0.037), ('discriminative', 0.036), ('incompatibility', 0.035), ('mdr', 0.035), ('speakerspeci', 0.035), ('contrastive', 0.034), ('bk', 0.034), ('chinese', 0.033), ('king', 0.033), ('code', 0.031), ('loss', 0.031), ('variabilities', 0.03), ('dash', 0.03), ('sbp', 0.03), ('denoising', 0.03), ('hinton', 0.03), ('false', 0.03), ('segment', 0.03), ('recognition', 0.029), ('sc', 0.029), ('das', 0.028), ('ica', 0.028), ('extract', 0.027), ('campbell', 0.026), ('streams', 0.025), ('session', 0.025), ('visualization', 0.024), ('dm', 0.024), ('entangled', 0.023), ('hinders', 0.023), ('miscellaneous', 0.023), ('lecun', 0.023), ('hidden', 0.023), ('resultant', 0.022), ('frames', 0.022), ('issues', 0.022), ('insensitive', 0.021), ('output', 0.021), ('stream', 0.021), ('distort', 0.02), ('hadsell', 0.02), ('mismatches', 0.02), ('phonetically', 0.02), ('depicted', 0.019), ('intrinsic', 0.019), ('extracting', 0.019), ('biases', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

Author: Ke Chen, Ahmad Salman

Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1

2 0.11021894 261 nips-2011-Sparse Filtering

Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng

Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of 2 -normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1

3 0.087318815 244 nips-2011-Selecting Receptive Fields in Deep Networks

Author: Adam Coates, Andrew Y. Ng

Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1

4 0.08688727 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1

5 0.082090877 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation

Author: David Sontag, Dan Roy

Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
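To make the first of these problems concrete, the sketch below brute-forces the MAP assignment of topics to words with the document's topic distribution integrated out, for toy sizes only. It illustrates the objective being maximized, not the paper's polynomial-time algorithm or hardness constructions; the symmetric Dirichlet parameter alpha and the topic-word matrix phi are made-up toy values.

import itertools
import math
import numpy as np

def map_topic_assignment(words, phi, alpha):
    """Brute-force z* = argmax_z p(w, z | phi, alpha) with theta integrated out.
    Up to constants, log p(w, z) = sum_n log phi[z_n, w_n]
                                   + sum_k [lgamma(c_k + alpha) - lgamma(alpha)],
    where c_k counts the words assigned to topic k."""
    K = phi.shape[0]
    best_z, best_score = None, -np.inf
    for z in itertools.product(range(K), repeat=len(words)):
        counts = np.bincount(z, minlength=K)
        score = sum(math.log(phi[zn, wn]) for zn, wn in zip(z, words))
        score += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score

# Toy example: 3 topics, vocabulary of 4 word types, a 5-word document.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(4), size=3)   # toy topic-word distributions
words = [0, 2, 2, 3, 1]
print(map_topic_assignment(words, phi, alpha=0.5))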

6 0.065067485 186 nips-2011-Noise Thresholds for Spectral Clustering

7 0.064493157 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements

8 0.064020887 249 nips-2011-Sequence learning with hidden units in spiking neural networks

9 0.061933447 217 nips-2011-Practical Variational Inference for Neural Networks

10 0.056843158 156 nips-2011-Learning to Learn with Compound HD Models

11 0.055552103 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

12 0.055451363 302 nips-2011-Variational Learning for Recurrent Spiking Networks

13 0.054758284 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

14 0.05277418 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

15 0.050960261 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

16 0.049345143 287 nips-2011-The Manifold Tangent Classifier

17 0.045552742 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

18 0.043111507 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations

19 0.042850781 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

20 0.042681515 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.114), (1, 0.071), (2, 0.027), (3, 0.011), (4, 0.008), (5, -0.026), (6, 0.051), (7, 0.077), (8, -0.016), (9, -0.096), (10, -0.003), (11, 0.009), (12, 0.041), (13, -0.05), (14, -0.039), (15, -0.067), (16, -0.027), (17, -0.01), (18, -0.021), (19, 0.051), (20, 0.019), (21, 0.015), (22, -0.045), (23, -0.029), (24, -0.001), (25, 0.01), (26, -0.018), (27, -0.036), (28, 0.048), (29, -0.047), (30, 0.02), (31, -0.04), (32, 0.04), (33, -0.014), (34, 0.091), (35, 0.01), (36, 0.001), (37, 0.044), (38, 0.026), (39, -0.07), (40, 0.062), (41, 0.051), (42, 0.016), (43, -0.026), (44, 0.073), (45, 0.032), (46, 0.075), (47, 0.109), (48, 0.031), (49, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93061346 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

Author: Ke Chen, Ahmad Salman

Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1
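As a rough, generic illustration of the Siamese idea of training on pairs, the sketch below computes a standard contrastive loss on two encoded frame vectors. It is an illustrative stand-in only, not the multi-objective, regularized loss described in the abstract; the encodings are random placeholders for the outputs of the twin sub-networks.

import numpy as np

def contrastive_loss(h1, h2, same_speaker, margin=1.0):
    """Generic Siamese-style objective on a pair of encodings h1, h2:
    pull same-speaker pairs together, push different-speaker pairs
    apart up to a margin."""
    d = np.linalg.norm(h1 - h2)
    if same_speaker:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Toy usage with random stand-ins for encoded MFCC frames.
rng = np.random.default_rng(1)
h_a, h_b = rng.normal(size=30), rng.normal(size=30)
print(contrastive_loss(h_a, h_b, same_speaker=True))
print(contrastive_loss(h_a, h_b, same_speaker=False))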

2 0.66170716 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. 
It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). 
In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, Håstad and Goldmann [12] extended some results to the case of networks of linear threshold units with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuits in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sum-product networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials.

Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation.

2 Sum-product networks

Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights (this condition is required by some of the proofs presented here), while others are called “negative-weight” networks.

Definition 2. A “negative-weight” sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero).

Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper.

Definition 3. A “shallow” sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two).

Definition 4. A “deep” sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three).
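As a concrete reading aid for Definitions 1–4, the short Python sketch below evaluates a sum-product network given as a DAG whose nodes are inputs, products, or positively weighted sums. It is a minimal illustration written for this page, not code from the paper.

import numpy as np

def eval_spn(nodes, x):
    """Evaluate a sum-product network given in topological order.
    Each node is ('input', i), ('prod', [children]) or
    ('sum', [children], [positive weights]); returns every node's value."""
    vals = []
    for node in nodes:
        kind = node[0]
        if kind == 'input':
            vals.append(x[node[1]])
        elif kind == 'prod':
            vals.append(np.prod([vals[c] for c in node[1]]))
        else:  # 'sum'
            vals.append(sum(w * vals[c] for c, w in zip(node[1], node[2])))
    return vals

# The depth-2 building block of Figure 1: f(x) = x1*x2 + x3*x4.
nodes = [('input', 0), ('input', 1), ('input', 2), ('input', 3),
         ('prod', [0, 1]), ('prod', [2, 3]),
         ('sum', [4, 5], [1.0, 1.0])]
print(eval_spn(nodes, [1.0, 2.0, 3.0, 4.0])[-1])  # 1*2 + 3*4 = 14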
3 The family F

3.1 Definition

The first family of functions we study, denoted by F, is made of functions built from deep sum-product networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows.

[Figure 1: Sum-product network computing the function f ∈ F such that i = 1 and λ11 = µ11 = 1: two product units compute ℓ^1_1 = x1x2 and ℓ^1_2 = x3x4, and the sum unit outputs ℓ^2_1 = λ11 ℓ^1_1 + µ11 ℓ^1_2 = x1x2 + x3x4 = f(x1, x2, x3, x4).]

Let n = 4^i, with i a positive integer value. Denote by ℓ^0 the input layer containing scalar variables {x1, ..., xn}, such that ℓ^0_j = x_j for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a sum-product network (deep for i ≥ 2) composed of alternating product and sum layers:

• ℓ^{2k+1}_j = ℓ^{2k}_{2j−1} · ℓ^{2k}_{2j} for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 2^{2(i−k)−1}
• ℓ^{2k}_j = λ_{jk} ℓ^{2k−1}_{2j−1} + µ_{jk} ℓ^{2k−1}_{2j} for 1 ≤ k ≤ i and 1 ≤ j ≤ 2^{2(i−k)}

where the weights λ_{jk} and µ_{jk} of the summation units are strictly positive. The output of the network is given by f(x1, ..., xn) = ℓ^{2i}_1 ∈ R, the unique unit in the last layer. The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4^i and the network’s depth 2i increase with parameter i.

3.2 Theoretical results

The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step (1) (Lemma 6 and Corollary 1).

Lemma 1. Any element ℓ^k_j can be written as a (positively) weighted sum of products of input variables, such that each input variable x_t is used in exactly one unit of ℓ^k. Moreover, the number m_k of products found in the sum computed by ℓ^k_j does not depend on j and obeys the following recurrence rule for k ≥ 0: if k + 1 is odd, then m_{k+1} = m_k^2, otherwise m_{k+1} = 2 m_k.

Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ^0_j = x_j. Assuming this is true for some k ≥ 0, we consider two cases:

• If k + 1 is odd, then ℓ^{k+1}_j = ℓ^k_{2j−1} · ℓ^k_{2j}. By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can appear in both ℓ^k_{2j−1} and ℓ^k_{2j}, so the result is also a (positively) weighted sum of products of input variables.
Additionally, if the number of products in ℓ^k_{2j−1} and ℓ^k_{2j} is m_k, then m_{k+1} = m_k^2, since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓ^k. This unit is an input to a single unit of ℓ^{k+1}, that will thus be the only unit of ℓ^{k+1} where this input variable appears.

• If k + 1 is even, then ℓ^{k+1}_j = λ_{jk} ℓ^k_{2j−1} + µ_{jk} ℓ^k_{2j}. Again, from the induction assumption, it must be a (positively) weighted sum of products of input variables, but with m_{k+1} = 2 m_k such products. As in the previous case, an input variable will appear in the single unit of ℓ^{k+1} that has as input the single unit of ℓ^k in which this variable must appear.

Proposition 1. The number of products in the sum computed in the output unit ℓ^{2i}_1 of a network computing a function in F is m_{2i} = 2^{√n − 1}.

Proof. We first prove by induction on k ≥ 1 that for odd k, m_k = 2^{2^{(k+1)/2} − 2}, and for even k, m_k = 2^{2^{k/2} − 1}. This is obviously true for k = 1 since 2^{2^1 − 2} = 2^0 = 1, and all units in ℓ^1 are single products of the form x_r x_s. Assuming this is true for some k ≥ 1, then:

• if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: m_{k+1} = m_k^2 = (2^{2^{k/2} − 1})^2 = 2^{2^{k/2 + 1} − 2} = 2^{2^{((k+1)+1)/2} − 2}
• if k + 1 is even, then instead we have: m_{k+1} = 2 m_k = 2 · 2^{2^{(k+1)/2} − 2} = 2^{2^{(k+1)/2} − 1}

which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields m_{2i} = 2^{2^i − 1} = 2^{√(4^i) − 1} = 2^{√n − 1}.

Lemma 2. The products computed in the output unit ℓ^{2i}_1 can be split in two groups, one with products containing only variables x_1, ..., x_{n/2} and one containing only variables x_{n/2+1}, ..., x_n.

Proof. This is obvious since the last unit is a “sum” unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1).

Lemma 3. The products computed in the output unit ℓ^{2i}_1 involve more than one input variable.

Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by ℓ^k_j all involve more than one input variable, thus it is true in particular for the output layer (k = 2i).

Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output.

Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables x_t (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x_1 and x_n must be present in the final output, and thus they must appear in at least one of these two factors.
Without loss of generality, assume x_1 appears in the first factor. Variables x_{n/2+1}, ..., x_n then cannot be present in the second factor, since otherwise one product in the output would contain both x_1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result x_n must appear in the first factor, variables x_1, ..., x_{n/2} cannot be present in the second factor either. Consequently, no input variable can be present in the second factor, leading to the desired contradiction.

Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer.

Proof. By contradiction, suppose there exists a “sum” unit in the hidden layer, written s = Σ_{t∈S} α_t x_t with S the set of input indices appearing in this sum, and α_t > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form β_t x_t for t ∈ S, with β_t > 0. This violates Lemma 3, establishing the contradiction.

Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F must have at least 2^{√n − 1} hidden units, if its output unit is a sum and its hidden units are products.

Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a product of input variables, i.e. its output can be written as Σ_j w_j Π_t x_t^{γ_{jt}} with w_j ∈ R and γ_{jt} ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this number is equal to 2^{√n − 1}.

Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2^{√n − 1} hidden units.

Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights).

3.3 Discussion

Corollary 1 above shows that in order to compute some function in F with n inputs, the number of units in a shallow network has to be at least 2^{√n − 1} (i.e. grows exponentially in √n). On the other hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + ... + 2^{2i−1} (since all units are binary), which is also equal to 2^{2i} − 1 = n − 1 (i.e. grows only quadratically in √n). It shows that some deep sum-product network with n inputs and depth O(log n) can represent with O(n) units what would require O(2^{√n}) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). (A small numerical illustration of these unit counts is given just before Section 4.1 below.)

4 The family G

In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset.
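Before moving to family G, here is the promised numerical illustration of the family-F counts from Section 3.3: the recurrence of Lemma 1 gives the shallow lower bound 2^{√n − 1} of Corollary 1, while the deep network uses only n − 1 binary units. This is a small checking script written for this page, not material from the paper.

def shallow_lower_bound(i):
    """Number of distinct products m_{2i} via the recurrence of Lemma 1:
    m_{k+1} = m_k**2 if k+1 is odd, else 2*m_k, starting from m_1 = 1."""
    m = 1
    for k in range(1, 2 * i):
        m = m * m if (k + 1) % 2 == 1 else 2 * m
    return m

for i in range(1, 6):
    n = 4 ** i
    deep_units = n - 1                      # binary sum/product units of the deep network
    shallow_units = shallow_lower_bound(i)  # equals 2**(sqrt(n) - 1)
    assert shallow_units == 2 ** (int(n ** 0.5) - 1)
    print(f"i={i}, n={n}: deep units = {deep_units}, shallow lower bound = {shallow_units}")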
4.1 Definition

Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪_{n≥2, i≥0} G_{i,n} of functions represented by sum-product networks, where the sub-family G_{i,n} is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ^0), such that:

1. ℓ^1 contains summation units; further layers alternate multiplicative and summation units.
2. Summation units have positive weights.
3. All layers are of size n, except the last layer ℓ^{2i+1} that contains a single sum unit that sums all units in the previous layer ℓ^{2i}.
4. In each layer ℓ^k for 1 ≤ k ≤ 2i, each unit ℓ^k_j takes as inputs {ℓ^{k−1}_m | m ≠ j}.

An example of a network belonging to G_{1,3} (i.e. with three layers and three input variables) is shown in Figure 2.

[Figure 2: Sum-product network computing a function of G_{1,3} (summation units’ weights are all 1’s), with output ℓ^3 = x_1^2 + x_2^2 + x_3^2 + 3(x_1x_2 + x_1x_3 + x_2x_3) = g(x_1, x_2, x_3), first-layer sums ℓ^1_1 = x_2 + x_3, ℓ^1_2 = x_1 + x_3, ℓ^1_3 = x_1 + x_2, and second-layer products such as ℓ^2_1 = x_1^2 + x_1x_2 + x_1x_3 + x_2x_3.]

4.2 Theoretical results

The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows:

1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2).
2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3).

We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items.

Lemma 7. Let S_1, ..., S_n be n sets (n ≥ 2) containing elements of {P_1, ..., P_{n−1}}, such that for any q, |{r | P_q ∈ S_r}| ≥ n − 1 (i.e. each element P_q belongs to at least n − 1 sets). Then there exist r_1, ..., r_{n−1} different indices such that P_q ∈ S_{r_q} for 1 ≤ q ≤ n − 1.

Proof. Omitted due to lack of space (very easy to prove by construction).

Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Π_{t=1}^n x_t^{α_t} such that α_t ∈ N and Σ_t α_t = (n − 1)^j, there exists a unit in ℓ^{2j} whose computed value, when expanded as a weighted sum of products, contains P among these products.

Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable x_t, that appears in ℓ^0_t = x_t.

Suppose now the proposition is true for some j < i. Consider a product P = Π_{t=1}^n x_t^{α_t} such that α_t ∈ N and Σ_t α_t = (n − 1)^{j+1}. P can be factored in n − 1 sub-products of degree (n − 1)^j, i.e. written P = P_1 ... P_{n−1} with P_q = Π_{t=1}^n x_t^{β_{qt}}, β_{qt} ∈ N and Σ_t β_{qt} = (n − 1)^j for all q. By the induction hypothesis, each P_q can be found in at least one unit ℓ^{2j}_{k_q}. As a result, by property 4 (in the definition of family G), each P_q will also appear in the additive layer ℓ^{2j+1}, in at least n − 1 different units (the only sum unit that may not contain P_q is the one that does not have ℓ^{2j}_{k_q} as input). By Lemma 7, we can thus find a set of units ℓ^{2j+1}_{r_q} such that for any 1 ≤ q ≤ n − 1, the product P_q appears in ℓ^{2j+1}_{r_q}, with indices r_q being different from each other. Let 1 ≤ s ≤ n be such that s ≠ r_q for all q.
Then, from property 4 of family G, the multiplicative unit ℓ^{2(j+1)}_s computes the product Π_{q=1}^{n−1} ℓ^{2j+1}_{r_q}, and as a result, when expanded as a sum of products, it contains in particular P_1 ... P_{n−1} = P. The proposition is thus true for j + 1, and by induction, is true for all j ≤ i.

Corollary 2. The output g_{i,n} of a sum-product network in G_{i,n}, when expanded as a sum of products, contains all products of variables of the form Π_{t=1}^n x_t^{α_t} such that α_t ∈ N and Σ_t α_t = (n − 1)^i.

Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ^{2i}. Since the output unit ℓ^{2i+1}_1 computes a sum of these multiplicative units (weighted with positive weights), those products are also present in the output.

Proposition 3. A shallow negative-weight sum-product network computing g_{i,n} ∈ G_{i,n} must have at least (n − 1)^i hidden units.

Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute g_{i,n}, assuming we allow multiplicative units in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of g_{i,n}. In particular, from Corollary 2, it will require at least the number of unique tuples of the form (α_1, ..., α_n) such that α_t ∈ N and Σ_{t=1}^n α_t = (n − 1)^i. Denoting d_{ni} = (n − 1)^i, this number is known to be equal to the binomial coefficient C(n + d_{ni} − 1, d_{ni}), and it is easy to verify it is higher than (or equal to) d_{ni} for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable x_t in the computed function output: this is not possible since by Corollary 2, for any variable x_t there exist products in the output function that do not involve x_t. So all hidden units must be additive, and since the computed function contains products of degree d_{ni}, there must be at least d_{ni} such hidden units.

4.3 Discussion

Proposition 3 shows that in order to compute the same function as g_{i,n} ∈ G_{i,n}, the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)^i) units for a depth-2 network. (A small numerical illustration of these counts follows the reference list below.) Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and product units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions.

5 Conclusion

We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions.
This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ε) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite).

Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs.

References [1] Ajtai, M. (1983). Σ¹₁-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book, Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). [11] Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. 
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.
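As noted at the end of Section 4.3 above, here is a small numerical illustration of the family-G construction and of the size gap in Proposition 3. The script evaluates the G_{1,3} network of Figure 2 and compares the deep network's roughly 2in + 1 units with the shallow lower bound (n − 1)^i and the number of distinct monomials. It is an illustration written for this page, not code from the paper.

from math import comb, prod

def eval_g(x, i):
    """Evaluate the family-G network on inputs x = (x_1, ..., x_n):
    layer 1 sums, later layers alternate products and sums, and every
    unit j takes all units of the previous layer except unit j
    (all summation weights set to 1). Returns the final output sum."""
    layer = list(x)
    for k in range(1, 2 * i + 1):
        op = prod if k % 2 == 0 else sum
        layer = [op([v for m, v in enumerate(layer) if m != j])
                 for j in range(len(layer))]
    return sum(layer)

# Figure 2 example: g(x1,x2,x3) = x1^2 + x2^2 + x3^2 + 3(x1x2 + x1x3 + x2x3).
x1, x2, x3 = 2.0, 3.0, 5.0
expected = x1**2 + x2**2 + x3**2 + 3 * (x1*x2 + x1*x3 + x2*x3)
print(eval_g((x1, x2, x3), i=1), expected)   # both 131.0

# Size comparison behind Proposition 3 / Section 4.3.
for n, i in [(3, 1), (5, 2), (10, 3)]:
    d = (n - 1) ** i
    print(f"n={n}, i={i}: deep units ~ {2*i*n + 1}, "
          f"shallow lower bound (n-1)^i = {d}, distinct monomials = {comb(n + d - 1, d)}")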

3 0.61271304 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng

Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. 1
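As a rough illustration of the variable-size-to-fixed-size step described above, the sketch below pools a word-pair similarity matrix onto a fixed grid. The pooling operator (min over roughly equal chunks) and the random phrase vectors are simplifying assumptions, not the paper's exact procedure.

import numpy as np

def dynamic_pool(sim, out_size=4):
    """Min-pool a variable-sized similarity matrix onto a fixed
    out_size x out_size grid by splitting rows and columns into
    roughly equal contiguous chunks."""
    rows = np.array_split(np.arange(sim.shape[0]), out_size)
    cols = np.array_split(np.arange(sim.shape[1]), out_size)
    pooled = np.empty((out_size, out_size))
    for a, r in enumerate(rows):
        for b, c in enumerate(cols):
            pooled[a, b] = sim[np.ix_(r, c)].min()
    return pooled

# Two "sentences" of different lengths, represented by random phrase vectors.
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=(7, 50)), rng.normal(size=(11, 50))
sim = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)  # pairwise distances
print(dynamic_pool(sim).shape)  # (4, 4), regardless of sentence lengths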

4 0.54315144 244 nips-2011-Selecting Receptive Fields in Deep Networks

Author: Adam Coates, Andrew Y. Ng

Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1
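A minimal sketch of the grouping idea follows: pick a few seed features and form each receptive field from the features most similar to the seed under a pairwise metric. Plain correlation and random seeds are used here as stand-ins for the paper's similarity measure and selection scheme.

import numpy as np

def select_receptive_fields(features, n_fields=3, field_size=4, seed=0):
    """Group low-level features by pairwise similarity: for each randomly
    chosen seed feature, keep the field_size features whose responses are
    most correlated with it. `features` is (n_examples, n_features)."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(features, rowvar=False))
    fields = []
    for s in rng.choice(features.shape[1], size=n_fields, replace=False):
        fields.append(np.argsort(-corr[s])[:field_size])
    return fields

X = np.random.default_rng(1).normal(size=(200, 16))  # toy feature responses
for f in select_receptive_fields(X):
    print(sorted(f.tolist()))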

5 0.54290807 156 nips-2011-Learning to Learn with Compound HD Models

Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1

6 0.47392997 217 nips-2011-Practical Variational Inference for Neural Networks

7 0.46924484 261 nips-2011-Sparse Filtering

8 0.4655776 287 nips-2011-The Manifold Tangent Classifier

9 0.46346873 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

10 0.40609035 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

11 0.38896793 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

12 0.38445073 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

13 0.38132402 75 nips-2011-Dynamical segmentation of single trials from population neural data

14 0.380023 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

15 0.35869175 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

16 0.35693309 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

17 0.35657123 249 nips-2011-Sequence learning with hidden units in spiking neural networks

18 0.35372767 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

19 0.34510785 60 nips-2011-Confidence Sets for Network Structure

20 0.3425833 260 nips-2011-Sparse Features for PCA-Like Linear Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (4, 0.035), (20, 0.044), (26, 0.011), (31, 0.071), (33, 0.016), (43, 0.031), (45, 0.063), (57, 0.021), (65, 0.469), (74, 0.031), (83, 0.052), (84, 0.012), (99, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86654747 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity

Author: Maneesh Bhand, Ritvik Mudur, Bipin Suresh, Andrew Saxe, Andrew Y. Ng

Abstract: The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organism’s lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organism’s lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle [1], that a qualitatively similar learning algorithm acts throughout primary sensory cortices. 1

same-paper 2 0.86055362 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

Author: Ke Chen, Ahmad Salman

Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1

3 0.602552 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

Author: Vikas Sindhwani, Aurelie C. Lozano

Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
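To illustrate the greedy strand mentioned above, here is a simplified Group-OMP-style loop on linear feature groups (rather than RKHS blocks) as a stand-in: at each step it picks the group whose features best correlate with the residual, then refits least squares on all selected groups. The kernelized version and the paper's stopping criteria are not reproduced.

import numpy as np

def group_omp(X, y, groups, n_select=2):
    """Greedy group selection: `groups` is a list of column-index arrays.
    Returns the selected group indices and the final residual."""
    selected, residual = [], y.copy()
    for _ in range(n_select):
        scores = [np.linalg.norm(X[:, g].T @ residual) for g in groups]
        best = int(np.argmax(scores))
        if best in selected:
            break
        selected.append(best)
        cols = np.concatenate([groups[j] for j in selected])
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        residual = y - X[:, cols] @ coef
    return selected, residual

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
y = X[:, 4:8] @ rng.normal(size=4) + 0.01 * rng.normal(size=100)
print(group_omp(X, y, groups)[0])  # picks group 1 first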

4 0.59854937 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

Author: Quoc V. Le, Alexandre Karpenko, Jiquan Ngiam, Andrew Y. Ng

Abstract: Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. 1
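A compact sketch of a reconstruction-ICA style objective of this kind is given below: a soft reconstruction penalty ||W^T W x - x||^2 plus a smooth sparsity term on the features W x, minimized with a generic off-the-shelf optimizer. The weighting lam, the smoothing constant eps, the data and the optimizer settings are illustrative guesses, not values from the paper.

import numpy as np
from scipy.optimize import minimize

def rica_objective(w_flat, X, n_features, lam=0.5, eps=1e-2):
    """Soft-reconstruction ICA cost on data X (n_examples x n_inputs):
    mean ||W^T W x - x||^2 + lam * mean smooth-L1(W x)."""
    n_inputs = X.shape[1]
    W = w_flat.reshape(n_features, n_inputs)
    H = X @ W.T                      # feature activations W x
    R = H @ W - X                    # reconstructions W^T W x minus x
    recon = np.mean(np.sum(R ** 2, axis=1))
    sparsity = np.mean(np.sum(np.sqrt(H ** 2 + eps), axis=1))
    return recon + lam * sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
n_features = 32                      # overcomplete: more features than inputs
w0 = 0.1 * rng.normal(size=n_features * 16)
res = minimize(rica_objective, w0, args=(X, n_features), method="L-BFGS-B",
               options={"maxiter": 25})
print(res.fun)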

5 0.56868804 261 nips-2011-Sparse Filtering

Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng

Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of ℓ2-normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
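The cost described above is simple enough to sketch directly: soft-absolute feature values are normalized per feature and then per example, and the ℓ1 norms are summed. The soft-absolute function and the small constants below are the usual choices but should be read as assumptions rather than the reference implementation.

import numpy as np

def sparse_filtering_cost(W, X, eps=1e-8):
    """Sparse filtering objective for weight matrix W (n_features x n_inputs)
    on data X (n_examples x n_inputs)."""
    F = np.sqrt((X @ W.T) ** 2 + eps)          # soft absolute feature values
    F = F / (np.linalg.norm(F, axis=0) + eps)  # normalize each feature (column)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)  # normalize each example (row)
    return F.sum()                              # sum of L1 norms of the rows

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
W = rng.normal(size=(64, 20))
print(sparse_filtering_cost(W, X))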

6 0.56238383 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

7 0.50290275 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

8 0.46863768 244 nips-2011-Selecting Receptive Fields in Deep Networks

9 0.39656892 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

10 0.37708923 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

11 0.3748371 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

12 0.36905965 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

13 0.3633357 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

14 0.33582875 35 nips-2011-An ideal observer model for identifying the reference frame of objects

15 0.33514148 287 nips-2011-The Manifold Tangent Classifier

16 0.33304557 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

17 0.32736072 219 nips-2011-Predicting response time and error rates in visual search

18 0.32402241 276 nips-2011-Structured sparse coding via lateral inhibition

19 0.3207438 156 nips-2011-Learning to Learn with Compound HD Models

20 0.3201029 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability