nips nips2010 nips2010-140 knowledge-graph by maker-knowledge-mining

140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels


Source: pdf

Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun

Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Layer-wise analysis of deep networks with Gaussian kernels Grégoire Montavon Machine Learning Group TU Berlin Mikio L. [sent-1, score-0.723]

2 Abstract Deep networks can potentially express a learning problem more efficiently than local learning machines. [sent-8, score-0.155]

3 While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. [sent-9, score-0.879]

4 We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. [sent-10, score-1.286]

5 We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. [sent-11, score-1.648]

6 1 Introduction Local learning machines such as nearest-neighbor classifiers, radial basis function (RBF) kernel machines or linear classifiers predict the class of new data points from their neighbors in the input space. [sent-12, score-0.169]

7 A limitation of local learning machines is that they cannot generalize beyond the notion of continuity in the input space. [sent-13, score-0.154]

8 This limitation motivates the creation of learning machines that can map the input space into a higher-level representation where regularities of higher order than simple continuity in the input space can be expressed. [sent-17, score-0.173]

9 Engineered feature extractors, nonlocal kernel machines (Zien et al. [sent-18, score-0.149]

10 Deep networks implement them by distorting the input space so that initially distant points in the input space appear closer. [sent-24, score-0.171]

11 Also, their multilayered nature acts as a regularizer, allowing them to share at a given layer features computed at the previous layer (Bengio, 2009). [sent-25, score-0.428]

12 Understanding how the representation is built in a deep network and how to train it efficiently has received a lot of attention (Goodfellow et al. [sent-26, score-0.864]

13 However, it is still unclear how their nice representation emerges from their complex structure, in particular, how the representation evolves from layer to layer. [sent-30, score-0.321]

14 The main contribution of this paper is to introduce an analysis based on RBF kernels and on kernel principal component analysis (kPCA, Schölkopf et al. [sent-31, score-0.133]

15 , 1998) that can capture and quantify the layer-wise evolution of the representation in a deep network. [sent-32, score-0.692]

16 In practice, for each layer 1 ≤ l ≤ L of the deep network, we take a small labeled dataset D, compute its image D(l) at layer l, and measure what dimensionality the local model built on top of D(l) must have in order to solve the learning problem with a certain accuracy. [sent-33, score-1.903]

17 This analysis relates the prediction error e(d) to the dimensionality d of a local model built at each layer of the deep network. [sent-37, score-1.005]
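As an illustration only, here is a minimal sketch of how this layer-wise analysis could be organized; the helper names (layerwise_error_curves, error_curve) are hypothetical, not from the paper, and error_curve stands for the RBF/kPCA procedure described in Section 2 and sketched further below.

```python
def layerwise_error_curves(layers, X, Y, dims, error_curve):
    """Return, for the raw input and for each layer output, the error e(d)
    of a local model restricted to d leading kPCA components.
    layers: per-layer callables of the trained deep network;
    error_curve(H, Y, d): the RBF analysis (see the sketch further below)."""
    curves = []
    H = X                                                    # D(0): raw input
    curves.append({d: error_curve(H, Y, d) for d in dims})
    for f in layers:                                         # forward path of the network
        H = f(H)                                             # D(l): image at layer l
        curves.append({d: error_curve(H, Y, d) for d in dims})
    return curves
```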

18 As the data is propagated through the deep network, lower errors are obtained with lower-dimensional local models. [sent-38, score-0.65]

19 The plots on the right illustrate this dynamic, where the thick gray arrows indicate the forward path of the deep network and where d0 is a fixed number of dimensions. [sent-39, score-0.73]

20 We apply this novel analysis to a multilayer perceptron (MLP), a pretrained multilayer perceptron (PMLP) and a convolutional neural network (CNN). [sent-40, score-0.606]

21 We observe in each case that the error and the dimensionality of the local model decrease as we propagate the dataset through the deep network. [sent-41, score-0.808]

22 This reveals that the deep network improves the representation of the learning problem layer after layer. [sent-42, score-0.965]

23 In addition, we observe that the CNN and the PMLP tend to postpone the discrimination to the last layers, leading to more transferable features and better-generalizing representations than for the simple MLP. [sent-44, score-0.305]

24 This result suggests that the structure of a deep network, by enforcing a separation of concerns between low-level generic features and high-level task-specific features, has an important role to play in building good representations. [sent-45, score-0.869]

25 A simple way to do it is to measure how many degrees of freedom (or dimensionality d) a local model must have in order to solve the learning problem with a certain error e. [sent-47, score-0.145]

26 This analysis relates the dimensionality d of the local model to its prediction error e(d). [sent-48, score-0.173]

27 In practice, there are many ways to define the dimensionality of a model, for example, (1) the number of samples given to the learning machine, (2) the number of required hidden nodes of a neural network (Murata et al. [sent-49, score-0.28]

28 , 1994), (3) the number of support vectors of an SVM or (4) the number of leading kPCA components of the input distribution p(x) used in the model. [sent-50, score-0.111]

29 Second, the leading kPCA components obtained with a finite and typically small number of samples n are similar to those that would be obtained in the asymptotic case where p(x, y) is fully observed (n → ∞). [sent-52, score-0.153]

30 The analysis presented here is strongly inspired by the relevant dimensionality estimation (RDE) method of Braun et al. [sent-56, score-0.109]

31 Note that with four leading kPCA components out of the 12 kPCA components, all the samples are already classified perfectly. [sent-62, score-0.113]

32 . . . , un are obtained by performing an eigendecomposition of K where eigenvectors u1 , . . . [sent-80, score-0.107]

33 We fit a linear model β⋆ that maps the projection on the d leading components of the training data to the log-likelihood of the classes, β⋆ = argmin_β ‖ exp(Û Û⊤ β) − Y ‖²_F , where β is a matrix of the same size as Y and where the exponential function is applied element-wise. [sent-102, score-0.153]

34 The test error is defined as e(d) = Pr(argmax ŷ ≠ argmax y). The training and test error can be used as an approximation bound for the asymptotic case n → ∞ where the data would be projected on the real eigenvectors of the input distribution. [sent-104, score-0.222]
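A minimal sketch of how e(d) could be estimated is given below, assuming 2500 samples for the eigendecomposition and the rest for error estimation (as stated later in the summary). Kernel centering and eigenvector normalization are omitted, and a plain least-squares readout replaces the paper's exponential log-likelihood fit; all names are illustrative.

```python
import numpy as np

def error_curve(H, labels, d, sigma=1.0, n_train=2500):
    """Estimate e(d): misclassification rate of a local model restricted to
    the d leading kernel-PCA components of a Gaussian kernel with scale sigma."""
    n = len(labels)
    Y = np.eye(labels.max() + 1)[labels]                     # one-hot class targets
    sq = (H**2).sum(1)[:, None] + (H**2).sum(1)[None, :] - 2 * H @ H.T
    K = np.exp(-sq / (2 * sigma**2))                         # Gaussian (RBF) kernel
    tr, te = slice(0, n_train), slice(n_train, n)
    w, U = np.linalg.eigh(K[tr, tr])                         # eigendecomposition of K
    U_d = U[:, ::-1][:, :d]                                  # d leading eigenvectors
    P = K[:, tr] @ U_d                                       # kernel features, train + test
    beta, *_ = np.linalg.lstsq(P[tr], Y[tr], rcond=None)     # simplified linear readout
    pred = (P @ beta).argmax(axis=1)
    return float((pred[te] != labels[te]).mean())            # test error e(d)
```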

35 In the next sections, the training and test error are depicted respectively as dotted and solid lines in Figure 3 and as the bottom and the top of error bars in Figure 4. [sent-105, score-0.115]

36 For each dimension, the kernel scale parameter σ that minimizes e(d) is retained, leading to a different kernel for each dimensionality. [sent-106, score-0.143]

37 The rationale for taking a different kernel for each model is that the optimal scale parameter typically shrinks as more leading components of the input distribution are observed. [sent-107, score-0.161]
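As a small follow-up sketch (grid values are illustrative), the per-dimensionality scale selection can reuse the error_curve function sketched above:

```python
import numpy as np

def best_error_per_dim(H, labels, dims, sigmas=np.logspace(-1, 2, 10)):
    """For each dimensionality d, retain the kernel scale sigma that minimizes e(d)."""
    return {d: min(error_curve(H, labels, d, sigma=s) for s in sigmas)
            for d in dims}
```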

38 These three deep networks are chosen in order to evaluate how the two types of regularizers implemented respectively by the CNN and the PMLP impact the evolution of the representation layer after layer. [sent-109, score-0.99]

39 The multilayer perceptron (MLP) is a deep network obtained by alternating linear transformations and element-wise nonlinearities. [sent-111, score-0.889]

40 Each layer maps an input vector of size m into an output vector of size n and consists of (1) a linear transformation linear_{m→n}(x) = w · x + b, where w is a weight matrix of size n × m learned from the data, and (2) a non-linearity applied element-wise to the output of the linear transformation. [sent-112, score-0.33]
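A hedged one-line sketch of such a layer (the choice of tanh as the element-wise nonlinearity is an assumption here):

```python
import numpy as np

def mlp_layer(x, W, b):
    """One MLP layer mapping a size-m vector to a size-n vector:
    linear transformation with an n x m weight matrix W and bias b,
    followed by an element-wise nonlinearity (tanh assumed)."""
    return np.tanh(W @ x + b)
```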

41 , 2006) that we abbreviate as PMLP in this paper is a variant of the MLP where weights are initialized with a deep belief network (DBN, Hinton et al. [sent-114, score-0.787]

42 , 1998) is a deep network obtained by alternating convolution filters y = convolve_{a×b}^{m→n}(x) transforming a set of m input feature maps {x1 , . [sent-118, score-0.889]

43 , xm } into a set of n output feature maps {yi = Σ_{j=1}^{m} wij ⋆ xj + bi , i = 1 . [sent-121, score-0.103]
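A minimal sketch of this convolution step; the 'valid' border mode and the absence of subsampling/pooling are assumptions, since the summary does not specify them:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer(x_maps, w, b):
    """Map m input feature maps to n output feature maps:
    y_i = sum_j w_ij * x_j + b_i, with '*' a 2D convolution.
    x_maps: list of m 2D arrays; w: array of shape (n, m, a, b); b: length n."""
    n, m = w.shape[0], w.shape[1]
    return [sum(convolve2d(x_maps[j], w[i, j], mode='valid') for j in range(m)) + b[i]
            for i in range(n)]
```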

44 1 Training the deep networks Each deep network is trained on the MNIST handwritten digit recognition dataset (LeCun et al. [sent-129, score-1.64]

45 The MNIST task consists of predicting the digit 0–9 from scanned handwritten digits of 28 × 28 pixels. [sent-131, score-0.126]

46 We randomly partition the MNIST training set into three subsets of 45000, 5000 and 10000 samples that are used respectively for training the deep network, selecting the parameters of the deep network and performing the RBF analysis. [sent-132, score-1.437]
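A minimal sketch of this split, assuming the standard 60000-sample MNIST training set and an arbitrary random seed:

```python
import numpy as np

def partition_mnist_training_set(X, Y, seed=0):
    """Randomly split the MNIST training set into 45000 samples for training the
    deep network, 5000 for selecting its parameters and 10000 for the RBF analysis."""
    idx = np.random.default_rng(seed).permutation(len(X))
    a, b, c = idx[:45000], idx[45000:50000], idx[50000:60000]
    return (X[a], Y[a]), (X[b], Y[b]), (X[c], Y[c])
```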

47 No training: the weights of the deep network are left at their initial value. [sent-134, score-0.73]

48 If the deep network hasn’t received unsupervised pretraining, the weights are set randomly according to a normal distribution N(0, γ⁻¹), where γ denotes, for a given layer, the number of input nodes connected to a single output node. [sent-135, score-1.023]
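A sketch of this initialization for a fully connected layer, where the fan-in γ is simply the input size (for a convolutional layer it would be the filter size times the number of input maps):

```python
import numpy as np

def init_weights(n_out, n_in, rng=None):
    """Random initialization from N(0, gamma^-1), where gamma is the number of
    input nodes connected to a single output node (here the fan-in n_in)."""
    rng = rng or np.random.default_rng()
    gamma = n_in
    return rng.normal(0.0, np.sqrt(1.0 / gamma), size=(n_out, n_in))
```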

49 The goal of training a deep network on an alternate task is to learn features on a problem where labeled samples are abundant and then reuse these features to learn the target task, which typically has few labels. [sent-139, score-1.112]

50 In the alternate task described earlier, negative examples form a cloud around the manifold of positive examples and learning this manifold potentially allows the deep network to learn features that can be transferred to the digit recognition task. [sent-140, score-0.955]

51 Training on the target task: the deep network is trained on the digit recognition task using the 45000 labeled training samples. [sent-142, score-0.992]

52 2 Applying the RBF analysis to deep networks In this section, we explain how the RBF analysis described in Section 2 is applied to analyze, layer by layer, the deep networks presented in Section 3. [sent-145, score-1.394]

53 Let f = fL ◦ · · · ◦ f1 be the trained deep network of depth L. [sent-146, score-0.79]

54 Let D be the analysis dataset containing the 10000 samples of the MNIST dataset on which the deep network hasn’t been trained. [sent-147, score-0.843]

55 For each layer, we build a new dataset D(l) corresponding to the mapping of the original dataset D through the first l layers of the deep network. [sent-148, score-0.843]
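Consistent with the earlier sketch, the construction of D(l) amounts to propagating D through the first l layers, here given as callables (illustrative names):

```python
def propagate(layers, D, l):
    """D(l): image of the analysis dataset D under the first l layers of the
    trained deep network f = f_L o ... o f_1."""
    H = D
    for f in layers[:l]:
        H = f(H)
    return H
```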

56 We use n = 2500 samples for computing the eigenvectors and the remaining 7500 samples to estimate the prediction error of the model. [sent-154, score-0.145]

57 This analysis yields, for each dataset D(l), the error e(d) as a function of the dimensionality d of the model. [sent-155, score-0.129]

58 The interest of using a local model to solve the learning problem is that local models are blind to possibly better representations that could be obtained in previous or subsequent layers. [sent-158, score-0.179]

59 This local scoping property allows for fine isolation of the representations in the deep network. [sent-159, score-0.761]

60 The need for local scoping also arises when “debugging” deep architectures. [sent-160, score-0.69]

61 Sometimes, deep architectures perform reasonably well even when the first layers do something wrong. [sent-161, score-0.738]

62 The size n of the dataset is selected so that it is large enough to approximate well the asymptotic case (n → ∞) but small enough that computing the eigendecomposition of the kernel matrix of size n × n is fast. [sent-163, score-0.157]

63 4 Results Layer-wise evolution of the error e(d) is plotted in Figure 3 in the supervised training case. [sent-174, score-0.134]

64 The layer-wise evolution of the error when d is fixed to 16 dimensions is plotted in Figure 4. [sent-175, score-0.125]

65 Both figures capture the simultaneous reduction of error and dimensionality performed by the deep network when trained on the target task. [sent-176, score-0.943]

66 In particular, they illustrate that in the last layers, a small number of dimensions is sufficient to build a good model of the target task. [sent-177, score-0.144]

67 Figure 3: Layer-wise evolution of the error e(d) when the deep network has been trained on the target task. [sent-178, score-0.949]

68 The solid line and the dotted line represent respectively the test error and the training error. [sent-179, score-0.076]

69 From these results, we first demonstrate some properties of deep networks trained on an “asymptotically” large number of samples. [sent-181, score-0.757]

70 Then, we demonstrate the important role of structure in deep networks. [sent-182, score-0.625]

71 This asymptotic property of deep networks should not be thought of as a statistical superiority of deep networks over local models. [sent-185, score-1.488]

72 Indeed, it is still possible that a higher-dimensional local model applied directly on the raw data performs as well as a local model applied at the output of the deep network. [sent-186, score-0.733]

73 Instead, this asymptotic property has the following consequence: Despite the internal complexity of deep networks, a local interpretation of the representation is possible at each stage of the processing. [sent-187, score-0.829]

74 2 Role of the structure of deep networks We can observe in Figure 4 (left) that even when the convolutional neural network (CNN) and the pretrained MLP (PMLP) have not received supervised training, the first layers slightly improve the representation with respect to the target task. [sent-190, score-1.256]

75 On the other hand, the representation built by a simple MLP with random weights degrades layer after layer. [sent-191, score-0.274]

76 This observation closely relates to results obtained in (Ranzato et al. [sent-193, score-0.085]

77 , 2009) where it is observed that training the deep network while keeping random weights in the first layers still allows for good predictions by the subsequent layers. [sent-195, score-0.909]

78 In the case of the PMLP, the successive layers progressively disentangle the factors of variation (Hinton and Salakhutdinov, 2006; Bengio, 2009) and simplify the learning problem. [sent-196, score-0.167]

79 We can observe in Figure 4 (middle) that the phenomenon is even clearer when the CNN and the PMLP are trained on an alternate task: they are able to create generic features in the first layers that transfer well to the target task. [sent-197, score-0.477]

80 The top and the bottom of the error bars represent respectively the test error and the training error of the local model. [sent-199, score-0.208]

81 Figure 5: Leading components of the weights (receptive fields) obtained in the first layer of each architecture (panels: MLP, PMLP and CNN, each on the alternate task and on the target task). [sent-200, score-0.95]

82 The filters learned by the CNN and the pretrained MLP are richer than the filters learned by the MLP. [sent-201, score-0.113]

83 The first component of the MLP trained on the alternate task dominates all other components and prevents good transfer on the target task. [sent-202, score-0.326]

84 On the other hand, the standard MLP trained on the alternate task leads to a degradation of representations. [sent-204, score-0.22]

85 This degradation is even higher than in the case of random weights, despite all the prior knowledge on pixel neighborhood contained implicitly in the alternate task. [sent-205, score-0.115]

86 The fact that receptive fields are different for each task indicates that the MLP tries to discriminate already in the first layers. [sent-207, score-0.08]

87 The absence of a built-in separation of concerns between low-level and high-level feature extractors seems to be a reason for the inability to learn transferable features. [sent-208, score-0.187]

88 It indicates that end-to-end transfer learning on unstructured learning machines is in general not appropriate and supports the recent success of transfer learning on restricted portions of the deep network (Collobert and Weston, 2008; Weston et al. [sent-209, score-0.905]

89 , 2008) or on structured deep networks (Mobahi et al. [sent-210, score-0.754]

90 When the deep networks are trained on the target task, the CNN and the PMLP solve the problem differently than the MLP. [sent-212, score-0.819]

91 In Figure 4 (right), we can observe that the CNN and the PMLP tend to postpone the discrimination to the last layers while the MLP starts to discriminate already in the first layers. [sent-213, score-0.264]

92 This result suggests that again, the structure contained in the CNN and the PMLP enforces a separation of concerns between the first layers encoding low-level generic features and the last layers encoding high-level task-specific features. [sent-214, score-0.489]

93 This separation of concerns might explain the better generalization of the CNN and PMLP observed respectively in (LeCun et al. [sent-215, score-0.179]

94 (2009) showing that the pretraining of the PMLP must be unsupervised and not supervised in order to build well-generalizing representations. [sent-219, score-0.102]

95 5 Conclusion We present a layer-wise analysis of deep networks based on RBF kernels. [sent-220, score-0.697]

96 This analysis estimates, for each layer of the deep network, the number of dimensions necessary to model the learning problem well based on the representation obtained at the output of that layer. [sent-221, score-1.022]

97 We observe that a properly trained deep network creates representations layer after layer in which a more accurate and lower-dimensional local model of the learning problem can be built. [sent-222, score-1.338]

98 This observation emphasizes the limitations of black box transfer learning and, more generally, of black box training of deep architectures. [sent-224, score-0.671]

99 A unified architecture for natural language processing: Deep neural networks with multitask learning. [sent-247, score-0.137]

100 Network information criterion - determining the number of hidden units for an artificial neural network model. [sent-301, score-0.134]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('deep', 0.596), ('pmlp', 0.319), ('mlp', 0.305), ('cnn', 0.286), ('layer', 0.197), ('layers', 0.142), ('rbf', 0.139), ('network', 0.134), ('kpca', 0.129), ('pretrained', 0.113), ('networks', 0.101), ('multilayer', 0.09), ('alternate', 0.088), ('concerns', 0.079), ('tanh', 0.076), ('braun', 0.075), ('representations', 0.071), ('mikio', 0.07), ('perceptron', 0.069), ('hinton', 0.069), ('target', 0.062), ('trained', 0.06), ('evolution', 0.058), ('lecun', 0.058), ('digit', 0.058), ('et', 0.057), ('bengio', 0.054), ('local', 0.054), ('dimensionality', 0.052), ('convolution', 0.05), ('kernel', 0.05), ('un', 0.046), ('task', 0.045), ('separation', 0.043), ('progressive', 0.043), ('leading', 0.043), ('machines', 0.042), ('larochelle', 0.041), ('pretraining', 0.041), ('yoshua', 0.041), ('convolutional', 0.041), ('maps', 0.04), ('asymptotic', 0.04), ('scoping', 0.04), ('discrimination', 0.039), ('mnist', 0.039), ('error', 0.039), ('built', 0.039), ('transfer', 0.038), ('representation', 0.038), ('dataset', 0.038), ('samples', 0.037), ('lters', 0.037), ('training', 0.037), ('tu', 0.037), ('architecture', 0.036), ('collobert', 0.036), ('input', 0.035), ('receptive', 0.035), ('hubel', 0.035), ('transferable', 0.035), ('postponing', 0.035), ('lowlevel', 0.035), ('mobahi', 0.035), ('rumelhart', 0.035), ('features', 0.034), ('weston', 0.034), ('components', 0.033), ('unsupervised', 0.032), ('hasn', 0.032), ('goodfellow', 0.032), ('murata', 0.032), ('ranzato', 0.032), ('eigenvectors', 0.032), ('extractors', 0.03), ('erhan', 0.03), ('ronan', 0.03), ('digits', 0.03), ('output', 0.029), ('role', 0.029), ('build', 0.029), ('observe', 0.029), ('postpone', 0.029), ('eigendecomposition', 0.029), ('dimensions', 0.028), ('berlin', 0.028), ('relates', 0.028), ('jarrett', 0.027), ('degradation', 0.027), ('builds', 0.027), ('kernels', 0.026), ('zien', 0.026), ('fl', 0.025), ('emerges', 0.025), ('progressively', 0.025), ('last', 0.025), ('subsampling', 0.024), ('generic', 0.024), ('limitation', 0.023), ('unclear', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels

Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun

Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. 1

2 0.18919028 99 nips-2010-Gated Softmax Classification

Author: Roland Memisevic, Christopher Zach, Marc Pollefeys, Geoffrey E. Hinton

Abstract: We describe a ”log-bilinear” model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets. 1

3 0.17771298 271 nips-2010-Tiled convolutional neural networks

Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng

Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. 1

4 0.15931979 59 nips-2010-Deep Coding Network

Author: Yuanqing Lin, Zhang Tong, Shenghuo Zhu, Kai Yu

Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.

5 0.15588924 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton

Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonalcovariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1

6 0.135299 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

7 0.13025762 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition

8 0.12044776 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

9 0.11856277 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering

10 0.11203608 133 nips-2010-Kernel Descriptors for Visual Recognition

11 0.10386864 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression

12 0.081371255 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model

13 0.072444655 103 nips-2010-Generating more realistic images using gated MRF's

14 0.066151083 280 nips-2010-Unsupervised Kernel Dimension Reduction

15 0.065139592 224 nips-2010-Regularized estimation of image statistics by Score Matching

16 0.061125007 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

17 0.058632057 94 nips-2010-Feature Set Embedding for Incomplete Data

18 0.055599749 101 nips-2010-Gaussian sampling by local perturbations

19 0.05526362 235 nips-2010-Self-Paced Learning for Latent Variable Models

20 0.052422289 138 nips-2010-Large Margin Multi-Task Metric Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.078), (2, -0.12), (3, -0.081), (4, 0.082), (5, -0.019), (6, 0.055), (7, 0.061), (8, -0.104), (9, 0.033), (10, 0.017), (11, -0.097), (12, 0.074), (13, -0.174), (14, -0.13), (15, -0.069), (16, -0.014), (17, -0.076), (18, -0.164), (19, -0.151), (20, 0.019), (21, 0.102), (22, -0.093), (23, 0.046), (24, 0.066), (25, 0.212), (26, 0.007), (27, 0.014), (28, 0.107), (29, 0.003), (30, -0.08), (31, -0.11), (32, 0.094), (33, 0.063), (34, 0.042), (35, -0.024), (36, -0.002), (37, 0.166), (38, -0.049), (39, -0.028), (40, -0.018), (41, 0.017), (42, -0.061), (43, 0.013), (44, 0.044), (45, -0.1), (46, 0.025), (47, -0.061), (48, -0.055), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95801425 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels

Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun

Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. 1

2 0.85692084 271 nips-2010-Tiled convolutional neural networks

Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng

Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. 1

3 0.74644673 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton

Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonalcovariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1

4 0.71096241 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model

Author: Peggy Series, David P. Reichert, Amos J. Storkey

Abstract: The Charles Bonnet Syndrome (CBS) is characterized by complex vivid visual hallucinations in people with, primarily, eye diseases and no other neurological pathology. We present a Deep Boltzmann Machine model of CBS, exploring two core hypotheses: First, that the visual cortex learns a generative or predictive model of sensory input, thus explaining its capability to generate internal imagery. And second, that homeostatic mechanisms stabilize neuronal activity levels, leading to hallucinations being formed when input is lacking. We reproduce a variety of qualitative findings in CBS. We also introduce a modification to the DBM that allows us to model a possible role of acetylcholine in CBS as mediating the balance of feed-forward and feed-back processing. Our model might provide new insights into CBS and also demonstrates that generative frameworks are promising as hypothetical models of cortical learning and perception. 1

5 0.62368131 99 nips-2010-Gated Softmax Classification

Author: Roland Memisevic, Christopher Zach, Marc Pollefeys, Geoffrey E. Hinton

Abstract: We describe a ”log-bilinear” model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets. 1

6 0.58971107 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

7 0.58770227 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

8 0.56545913 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition

9 0.50463736 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine

10 0.48838693 59 nips-2010-Deep Coding Network

11 0.45728493 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression

12 0.39542994 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

13 0.37233093 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering

14 0.35758623 188 nips-2010-On Herding and the Perceptron Cycling Theorem

15 0.35079652 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts

16 0.34737912 103 nips-2010-Generating more realistic images using gated MRF's

17 0.32710207 224 nips-2010-Regularized estimation of image statistics by Score Matching

18 0.32640496 133 nips-2010-Kernel Descriptors for Visual Recognition

19 0.32371867 94 nips-2010-Feature Set Embedding for Incomplete Data

20 0.31714863 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.035), (17, 0.039), (27, 0.053), (30, 0.034), (35, 0.023), (45, 0.167), (50, 0.023), (52, 0.428), (60, 0.034), (70, 0.015), (77, 0.041), (90, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90009421 231 nips-2010-Robust PCA via Outlier Pursuit

Author: Huan Xu, Constantine Caramanis, Sujay Sanghavi

Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, that under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace, and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation, is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization, however, our results, setup, and approach, necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself.

2 0.88908863 227 nips-2010-Rescaling, thinning or complementing? On goodness-of-fit procedures for point process models and Generalized Linear Models

Author: Felipe Gerhard, Wulfram Gerstner

Abstract: Generalized Linear Models (GLMs) are an increasingly popular framework for modeling neural spike trains. They have been linked to the theory of stochastic point processes and researchers have used this relation to assess goodness-of-fit using methods from point-process theory, e.g. the time-rescaling theorem. However, high neural firing rates or coarse discretization lead to a breakdown of the assumptions necessary for this connection. Here, we show how goodness-of-fit tests from point-process theory can still be applied to GLMs by constructing equivalent surrogate point processes out of time-series observations. Furthermore, two additional tests based on thinning and complementing point processes are introduced. They augment the instruments available for checking model adequacy of point processes as well as discretized models. 1

3 0.88664776 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

Author: Andreas Christmann, Ingo Steinwart

Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1

4 0.8550368 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform

Author: Siwei Lyu

Abstract: Divisive normalization (DN) has been advocated as an effective nonlinear efficient coding transform for natural sensory signals with applications in biology and engineering. In this work, we aim to establish a connection between the DN transform and the statistical properties of natural sensory signals. Our analysis is based on the use of multivariate t model to capture some important statistical properties of natural sensory signals. The multivariate t model justifies DN as an approximation to the transform that completely eliminates its statistical dependency. Furthermore, using the multivariate t model and measuring statistical dependency with multi-information, we can precisely quantify the statistical dependency that is reduced by the DN transform. We compare this with the actual performance of the DN transform in reducing statistical dependencies of natural sensory signals. Our theoretical analysis and quantitative evaluations confirm DN as an effective efficient coding transform for natural sensory signals. On the other hand, we also observe a previously unreported phenomenon that DN may increase statistical dependencies when the size of pooling is small. 1

same-paper 5 0.82627416 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels

Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun

Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. 1

6 0.59500623 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes

7 0.57961029 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data

8 0.56442237 96 nips-2010-Fractionally Predictive Spiking Neurons

9 0.561818 64 nips-2010-Distributionally Robust Markov Decision Processes

10 0.55732584 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents

11 0.55671978 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

12 0.55509233 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

13 0.55346823 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

14 0.55139178 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

15 0.54453397 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

16 0.54450464 115 nips-2010-Identifying Dendritic Processing

17 0.53692394 117 nips-2010-Identifying graph-structured activation patterns in networks

18 0.53546399 265 nips-2010-The LASSO risk: asymptotic results and real world examples

19 0.53511208 271 nips-2010-Tiled convolutional neural networks

20 0.53506732 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification