nips nips2013 nips2013-30 knowledge-graph by maker-knowledge-mining

30 nips-2013-Adaptive dropout for training deep neural networks


Source: pdf

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Adaptive dropout for training deep neural networks. Lei Jimmy Ba, Brendan Frey, Department of Electrical and Computer Engineering, University of Toronto; jimmy, frey@psi. [sent-1, score-0.759]

2 Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. [sent-3, score-0.623]

3 We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. [sent-5, score-0.645]

4 This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. [sent-6, score-1.15]

5 Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. [sent-7, score-1.381]

6 When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. [sent-8, score-0.073]

7 For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. [sent-11, score-0.064]

8 1 Introduction: For decades, deep networks with broad hidden layers and full connectivity could not be trained to produce useful results, because of overfitting, slow convergence and other issues. [sent-12, score-0.487]

9 One approach that has proven to be successful for unsupervised learning of both probabilistic generative models and auto-encoders is to train a deep network layer by layer in a greedy fashion [7]. [sent-13, score-0.387]

10 Each layer of connections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [6] or backpropagation through a one-layer auto-encoder [1], and then the hidden activities are used to train the next layer. [sent-14, score-0.321]

11 When the parameters of a deep network are initialized in this way, further fine tuning can be used to improve the model. [sent-15, score-0.307]

12 The authors of [4] have achieved improved classification rates by using different unsupervised learning algorithms. [sent-21, score-0.076]

13 Recently, a technique called dropout was shown to significantly improve the performance of deep neural networks on various tasks [8], including vision problems [10]. [sent-22, score-0.754]

14 Dropout randomly sets hidden unit activities to zero with a probability of 0.5. [sent-23, score-0.232]
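
A minimal numpy sketch of this fixed-probability dropout rule (my own illustration, not the authors' code); the array shapes and the `rng` seed are arbitrary, and the test-time scaling follows the expectation convention discussed later in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, train=True):
    """Standard dropout: zero each hidden activity with probability p_drop.

    At test time activities are scaled by (1 - p_drop), i.e. replaced by
    their expected value under the dropout noise.
    """
    if train:
        mask = rng.random(h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return h * mask
    return h * (1.0 - p_drop)

h = np.maximum(0.0, rng.standard_normal((4, 8)))   # e.g. ReLU hidden activities
h_train = dropout_forward(h, train=True)
h_test = dropout_forward(h, train=False)
```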

15 Each training example can thus be viewed as providing gradients for a different, randomly sampled architecture, so that the final neural network efficiently represents a huge ensemble of neural networks, with good generalization capability. [sent-25, score-0.199]

16 Experimental results on several tasks show that dropout frequently and significantly improves the classification performance of deep architectures. [sent-26, score-0.646]

17 Injecting noise for the purpose of regularization has been studied previously, but in the context of adding noise to the inputs [3],[21] and to network components [16]. [sent-27, score-0.113]
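
For contrast with those earlier schemes, a small sketch of noise injected at the inputs, in the style of a denoising auto-encoder's corruption step; the `kind` and `level` parameters are illustrative choices, not values from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_inputs(x, kind="masking", level=0.25):
    """Inject noise at the inputs before feeding them to the network."""
    if kind == "gaussian":
        return x + level * rng.standard_normal(x.shape)
    if kind == "masking":
        return x * (rng.random(x.shape) >= level)   # zero a random fraction of inputs
    raise ValueError(f"unknown noise kind: {kind}")

x = rng.random((4, 784))                  # e.g. a mini-batch of MNIST-sized inputs
x_noisy = corrupt_inputs(x, kind="masking", level=0.25)
```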

18 Unfortunately, when dropout is used to discriminatively train a deep fully connected neural network on input with high variation, e.g., the NORB dataset, it tends to perform poorly. [sent-28, score-0.777]

19 In this paper, we describe a generalization of dropout, where the dropout probability for each hidden variable is computed using a binary belief network that shares parameters with the deep network. [sent-32, score-0.91]

20 Our method works well both for unsupervised and supervised learning of deep networks. [sent-33, score-0.225]

21 We present results on the MNIST and NORB datasets showing that our ‘standout’ technique can learn better feature detectors for handwritten digit and object recognition tasks. [sent-34, score-0.102]

22 Interestingly, we also find that our method enables the successful training of deep auto-encoders from scratch, i.e., without layer-by-layer pre-training. [sent-35, score-0.197]

23 2 The model: The original dropout technique [8] uses a constant probability for omitting a unit, so a natural question we considered is whether it may help to let this probability be different for different hidden units. [sent-38, score-0.662]

24 In particular, there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [sent-39, score-0.35]

25 Viewed another way, suppose after dropout is applied, it is found that several hidden units are highly correlated in the pre-dropout activities. [sent-41, score-0.787]

26 They could be combined into a single hidden unit with a lower dropout probability, freeing up hidden units for other purposes. [sent-42, score-0.969]

27 We denote the activity of unit j in a deep neural network by aj and assume that its inputs are {ai : i < j}. [sent-43, score-0.434]

28 In dropout, aj is randomly set to zero with probability 0.5. [sent-44, score-0.072]

29 Let mj be a binary variable used to mask the activity aj, so that its value is $a_j = m_j \, g\big(\sum_{i:i<j} w_{j,i} a_i\big)$ (1), where the mask is drawn from the overlaid belief network according to $P(m_j = 1 \mid \{a_i : i < j\}) = f\big(\sum_{i:i<j} \pi_{j,i} a_i\big)$ (2). [sent-46, score-0.305]
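
A minimal numpy sketch of equations 1 and 2 for a single standout layer at training time; the sigmoid choice for f, the ReLU choice for g, the layer sizes, and the variable names are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def standout_layer_train(a_in, W, pi, f=sigmoid, g=relu):
    """One standout layer at training time.

    Equation 2: p_j = f(sum_i pi_{j,i} a_i)  -- keep probability from the
                overlaid binary belief network.
    Equation 1: a_j = m_j * g(sum_i w_{j,i} a_i), with m_j ~ Bernoulli(p_j).
    """
    pre_activation = a_in @ W.T            # sum_i w_{j,i} a_i for every unit j
    p_keep = f(a_in @ pi.T)                # sum_i pi_{j,i} a_i passed through f
    m = (rng.random(p_keep.shape) < p_keep).astype(pre_activation.dtype)
    return m * g(pre_activation), p_keep

a_in = rng.random((4, 784))                    # mini-batch of inputs
W = 0.01 * rng.standard_normal((1000, 784))    # neural network weights
pi = 0.01 * rng.standard_normal((1000, 784))   # standout (belief network) weights
a_out, p_keep = standout_layer_train(a_in, W, pi)
```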

30 Another interesting setting is obtained by clamping πj,i = 0 for 1 ≤ i < j, but learning the input-independent dropout parameter πj,0 for each unit aj . [sent-47, score-0.594]

31 As in standard dropout, to process an input at test time, the stochastic feedforward process is replaced by taking the expectation of equation 1: $E[a_j] = f\big(\sum_{i:i<j} \pi_{j,i} a_i\big)\, g\big(\sum_{i:i<j} w_{j,i} a_i\big)$ (3). [sent-48, score-0.06]
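
A corresponding sketch of the test-time rule in equation 3, with the same assumed f (sigmoid) and g (ReLU) as in the training sketch; the 'deterministic standout' baseline discussed later in this summary uses this same expression as the activation function during training as well.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_layer_test(a_in, W, pi):
    """Equation 3: E[a_j] = f(sum_i pi_{j,i} a_i) * g(sum_i w_{j,i} a_i),
    with f assumed sigmoid and g assumed ReLU."""
    return sigmoid(a_in @ pi.T) * np.maximum(0.0, a_in @ W.T)
```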

32 In practice, the standout network parameters can be tied to the neural network weights through a scale α and a bias β, giving $P(m_j = 1 \mid \{a_i : i < j\}) = f\big(\alpha \sum_{i:i<j} w_{j,i} a_i + \beta\big)$ with $a_j = m_j\, g\big(\sum_{i:i<j} w_{j,i} a_i\big)$ as before; in the stochastic gradient updates, γ is the momentum coefficient and η is the learning rate. [sent-49, score-0.172]
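
Assuming the reading of sentence 32 above is correct (keep probabilities computed from the shared network weights, scaled by α and shifted by β, which is consistent with the 'α = 1 and β = 0' setting reported later in this summary), a minimal sketch of the tied keep-probability; this is my reconstruction of the garbled fragment, not verbatim from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_keep_prob_tied(a_in, W, alpha=1.0, beta=0.0):
    """Keep probability when the standout network reuses the neural network
    weights: P(m_j = 1 | a) = f(alpha * sum_i w_{j,i} a_i + beta)."""
    return sigmoid(alpha * (a_in @ W.T) + beta)
```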

33 Nesterov momentum takes into account the velocity in parameter space when computing updates. [sent-50, score-0.064]

34 We schedule the momentum coefficient γ to further speed up the learning process. [sent-52, score-0.064]
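
A sketch of a Nesterov momentum update with a ramped momentum coefficient γ and learning rate η; the linear ramp and its endpoint values are illustrative stand-ins, not the schedule used in the paper.

```python
import numpy as np

def nesterov_step(theta, velocity, grad_fn, gamma, eta):
    """One Nesterov momentum update: the gradient is evaluated at the
    look-ahead point theta + gamma * velocity before updating."""
    grad = grad_fn(theta + gamma * velocity)
    velocity = gamma * velocity - eta * grad
    return theta + velocity, velocity

def momentum_schedule(epoch, gamma_start=0.5, gamma_end=0.95, ramp_epochs=50):
    """Linearly ramp the momentum coefficient over the first ramp_epochs epochs."""
    t = min(epoch / ramp_epochs, 1.0)
    return gamma_start + t * (gamma_end - gamma_start)

# Toy usage: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, velocity = np.ones(5), np.zeros(5)
for epoch in range(100):
    gamma = momentum_schedule(epoch)
    theta, velocity = nesterov_step(theta, velocity, lambda th: th, gamma, eta=0.1)
```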

35 4 Computation time: We used the publicly available gnumpy library [20] to implement our models. [sent-60, score-0.036]

36 The models mentioned in this work are trained on a single Nvidia GTX 580 GPU. [sent-61, score-0.071]

37 As in pseudocode (1), the first algorithm is relatively slow, since the number of computations is O(n^2), where n is the number of hidden units. [sent-62, score-0.152]

38 In particular, for a 784-1000-784 autoencoder model with mini-batches of size 100 and 50,000 training cases on a GTX 580 GPU, learning takes 1. [sent-64, score-0.072]

39 5 Unsupervised feature learning: Having good features is crucial for obtaining competitive performance in classification and other high-level tasks. [sent-70, score-0.062]

40 Learning algorithms that can take advantage of unlabeled data are appealing due to the increasing amount of unlabeled data. [sent-71, score-0.034]

41 Furthermore, on more challenging datasets, such as NORB, a fully connected discriminative neural network trained from scratch tends to perform poorly, even with the help of dropout. [sent-72, score-0.28]

42 (We trained a two-hidden-layer neural network on NORB, obtaining a 13% error rate, and saw no improvement from using dropout.) [sent-73, score-0.412]

43 Such disappointing performance motivated us to investigate unsupervised feature learning and pre-training strategies with our new method. [sent-74, score-0.09]

44 The features extracted using our method not only outperform other common feature learning methods, but our method is also quite computationally efficient compared to techniques like sparse coding. [sent-76, score-0.114]

45 We first extract the features using one of the unsupervised learning algorithms in figure 4. [sent-78, score-0.078]

46 The usefulness of the extracted features is then evaluated by training a linear classifier to predict the object class from the extracted features. [sent-79, score-0.137]

47 This process is similar to that employed in other feature learning research [14]. [sent-80, score-0.037]
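
A sketch of this evaluation protocol, with scikit-learn's logistic regression standing in for the linear classifier and a placeholder `encode` function standing in for whichever trained feature extractor (auto-encoder, RBM, or standout AE) is being evaluated; none of these names come from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(x, W_enc, b_enc):
    """Stand-in feature extractor: the encoder half of a trained model."""
    return np.maximum(0.0, x @ W_enc + b_enc)     # e.g. ReLU hidden activities

def evaluate_features(x_train, y_train, x_test, y_test, W_enc, b_enc):
    """Train a linear classifier on the extracted features; return test accuracy."""
    f_train = encode(x_train, W_enc, b_enc)
    f_test = encode(x_test, W_enc, b_enc)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(f_train, y_train)
    return clf.score(f_test, y_test)
```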

48 We trained a number of architectures on MNIST, including standard auto-encoders, dropout auto-encoders, and standout auto-encoders. [sent-81, score-1.259]

49 [Table 4(a), MNIST: architectures compared are raw pixels (784), a 784-1000 RBM with weight decay, a 784-1000-784 DAE, a 784-1000-784 dropout AE (50% hidden dropout), and a 784-1000-784 standout AE (standout activation).] [sent-83, score-2.457]

50 [Table 4(b), NORB: architectures compared are raw pixels (8976), a 2048-4000 RBM with weight decay, a 2048-4000-2048 DAE, a 2048-4000-2048 dropout AE (50% hidden dropout), a 2048-4000-2048 dropout AE* (22% hidden dropout), and a 2048-4000-2048 standout AE (standout activation).] [sent-86, score-3.54]

51 [Figure 4: Performance of unsupervised feature learning methods; panel (a) MNIST, panel (b) NORB.] [sent-101, score-0.09]

52 The dropout probability in the DAE* was optimized using [18]. We compute each hidden activity and use that as the feature when training a classifier. [sent-102, score-0.717]

53 We also examined RBMs, where we use the soft probability for each hidden unit as a feature. [sent-103, score-0.182]

54 Results for the different architectures and learning methods are compared in table 4(a). [sent-106, score-0.033]

55 The autoencoder trained using our proposed technique with α = 1 and β = 0 performed the best on MNIST. [sent-107, score-0.138]

56 Our standout method consistently performs better than other methods, as shown in table 4(b). [sent-111, score-0.657]

57 6 Discussion: The proposed standout method was able to outperform other feature learning methods on both datasets by a noticeable margin. [sent-113, score-0.694]

58 The stochasticity introduced by the standout network successfully removes hidden units that are unnecessary for good performance and that hinder performance. [sent-114, score-1.087]

59 By inspecting the weights from auto-encoders regularized by dropout and standout, we find that the standout auto-encoder weights are sharper than those learnt using dropout, which may be consistent with the improved performance on classification tasks. [sent-115, score-1.233]

60 [Figure 5 legend: DAE, Dropout AE, Deterministic standout AE, Standout AE; y-axis: test error rate (%).] The effect of the number of hidden units was studied using networks with sizes 500, 1000, 1500, and up to 4500. [sent-116, score-1.37]

61 Figure 5 shows that all algorithms generally perform better as the number of hidden units increases. [sent-117, score-0.152]

62 One notable trend for dropout regularization is that it achieves significantly better performance with large numbers of hidden units, since all units have an equal chance of being omitted. [sent-118, score-0.973]

63 In comparison, standout can achieve similar performance with only half as many hidden units, because highly useful hidden units will be kept more often while only the less effective units will be dropped. [sent-119, score-1.283]

64 Figure 5: Classification error rate as a function of the number of hidden units (500 to 4500) on NORB. [sent-120, score-0.664]

65 One question is whether it is the stochasticity of the standout network that helps, or just a different nonlinearity obtained by the expected activity in equation 3. [sent-121, score-0.803]

66 To address this, we trained a deterministic auto-encoder with hidden activation functions given by equation 3. [sent-122, score-0.223]

67 The result of this ‘deterministic standout method’ is shown in figure 5 and it performs quite poorly. [sent-123, score-0.657]

68 It is believed that sparse features can help improve the performance of linear classifiers. [sent-124, score-0.046]

69 We found that auto-encoders trained using ReLU units and standout produce sparse features. [sent-125, score-0.91]

70 We wondered whether training a sparse auto-encoder with a sparsity level matching the one obtained by our method would yield similar performance. [sent-126, score-0.046]

71 We applied an L1 penalty on the hidden units and trained an auto-encoder to match the sparsity obtained by our method (figure 4). [sent-127, score-0.384]
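
A sketch of the kind of L1 sparsity penalty referred to here, added to a squared-error reconstruction loss; the penalty weight `lam` is an arbitrary illustrative value, not one from the paper.

```python
import numpy as np

def sparse_autoencoder_loss(x, h, x_recon, lam=1e-3):
    """Squared-error reconstruction loss plus an L1 penalty on the hidden
    activities h, which pushes the average activation toward sparsity."""
    reconstruction = 0.5 * np.mean(np.sum((x_recon - x) ** 2, axis=1))
    l1_penalty = lam * np.mean(np.sum(np.abs(h), axis=1))
    return reconstruction + l1_penalty
```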

72 The final features extracted using the sparse auto-encoder achieved 10. [sent-128, score-0.1]

73 Further gains can be achieved by tuning hyper-parameters, but the hyper-parameters for our method are easier to tune and, as shown above, have little effect on the final performance. [sent-130, score-0.068]

74 Moreover, the sparse features learnt using standout are also computationally efficient compared to techniques like sparse coding. [sent-131, score-0.782]

75 [Figure 6 residue: fine-tuned error rates for (a) MNIST and (b) NORB, comparing DBN [15], DBM [15], a third-order RBM [12], dropout shallow and deep AE + FT, and standout shallow and deep AE + FT.] [sent-137, score-2.801]

76 Surprisingly, a shallow network with standout regularization (table 4(b)) outperforms some of the much larger and deeper networks shown. [sent-148, score-0.921]

77 Some of those deeper models have three or four times more parameters than the shallow network we trained here. [sent-149, score-0.265]

78 This particular result shows that a simpler model trained using our regularization technique can achieve higher performance than other, more complicated methods. [sent-150, score-0.116]

79 7 Discriminative learning: In deep learning, a common practice is to use the encoder weights learnt by an unsupervised learning method to initialize the early layers of a multilayer discriminative model. [sent-152, score-0.406]

80 The backpropagation algorithm is then used to learn the weights for the last hidden layer and also fine tune the weights in the layers before. [sent-153, score-0.321]

81 This procedure is often referred to as discriminative fine tuning. [sent-154, score-0.054]

82 We initialized neural networks using the models described above. [sent-155, score-0.107]

83 The regularization method that we used for unsupervised learning (RBM, dropout, standout) is also used for corresponding discriminative fine tuning. [sent-156, score-0.132]

84 For example, if a neural network is initialized using an auto-encoder trained with standout, the neural network will also be fine tuned using standout for all its hidden units, with the same standout function and hyper-parameters as the auto-encoder. [sent-157, score-1.818]

85 During discriminative fine tuning, we hold the weights fixed for all layers except the last one for the first 10 epochs, and then the weights are updated jointly after that. [sent-158, score-0.145]
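
A schematic sketch of this two-stage schedule (only the last layer updated for the first 10 epochs, then all layers jointly); the parameter dictionary, the "output" layer name, and `sgd_update` are hypothetical stand-ins, not the paper's implementation.

```python
def fine_tune(params, grad_fn, sgd_update, num_epochs=100, freeze_epochs=10):
    """Discriminative fine-tuning: update only the last layer for the first
    `freeze_epochs` epochs, then update all layers jointly."""
    for epoch in range(num_epochs):
        grads = grad_fn(params)                    # dict: layer name -> gradient
        for name in params:
            is_last_layer = (name == "output")     # hypothetical layer naming
            if epoch < freeze_epochs and not is_last_layer:
                continue                           # earlier layers stay frozen
            params[name] = sgd_update(params[name], grads[name])
    return params
```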

86 As found by previous authors, we find that classification performance is usually improved by the use of discriminative fine tuning. [sent-159, score-0.054]

87 Impressively, we found that a two-hidden-layer neural network with 1000 ReLU units in its first and second hidden layers trained with standout is able to achieve 80 errors on MNIST data after fine tuning (an error rate of 0.80%). [sent-160, score-1.268]

88 2% error rate by fine tuning the simple shallow auto-encoder from table 4(b). [sent-164, score-0.136]

89 Furthermore, a two-hidden-layer neural network with 4000 ReLU units in both hidden layers that is pre-trained using standout achieved 5.8% error on the NORB test set. [sent-165, score-1.171]

90 A weight decay of 0.0005 is applied to this network during fine-tuning to further prevent overfitting. [sent-168, score-0.088]

91 It even outperforms carefully designed convolutional neural networks found in [9]. [sent-171, score-0.115]

92 Figure 6 reports the classification accuracy obtained by different models, including state-of-the-art deep networks. [sent-172, score-0.172]

93 6 Conclusions: Our results demonstrate that the proposed use of standout networks can significantly improve the performance of feature-learning methods. [sent-173, score-0.702]

94 Further, our results provide additional support for the ‘regularization by noise’ hypothesis that has been used to regularize other deep architectures, including RBMs and denoising auto-encoders, and in dropout. [sent-174, score-0.231]

95 An obvious missing piece in this research is a good theoretical understanding of why the standout network provides better regularization compared to the fixed dropout probability of 0.5. [sent-175, score-1.244]

96 The importance of encoding versus training with sparse coding and vector quantization. [sent-197, score-0.046]

97 Improving neural networks by preventing co-adaptation of feature detectors. [sent-225, score-0.125]

98 Learning methods for generic object recognition with invariance to pose and lighting. [sent-239, score-0.063]

99 On the importance of initialization and momentum in deep learning. [sent-283, score-0.236]

100 Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. [sent-294, score-0.332]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('standout', 0.657), ('dropout', 0.474), ('ae', 0.212), ('norb', 0.181), ('relu', 0.181), ('deep', 0.172), ('units', 0.161), ('hidden', 0.152), ('ft', 0.097), ('network', 0.088), ('shallow', 0.087), ('rbm', 0.083), ('dae', 0.075), ('aj', 0.072), ('trained', 0.071), ('momentum', 0.064), ('mnist', 0.06), ('learnt', 0.058), ('discriminative', 0.054), ('unsupervised', 0.053), ('activities', 0.05), ('classi', 0.048), ('autoencoder', 0.047), ('layers', 0.047), ('networks', 0.045), ('neural', 0.043), ('hugo', 0.043), ('jimmy', 0.041), ('yann', 0.04), ('feature', 0.037), ('layer', 0.037), ('gnumpy', 0.036), ('gtx', 0.036), ('denoising', 0.036), ('mj', 0.036), ('boltzmann', 0.033), ('architectures', 0.033), ('jasper', 0.031), ('extracted', 0.031), ('snoek', 0.03), ('ai', 0.03), ('unit', 0.03), ('activity', 0.029), ('stochasticity', 0.029), ('frey', 0.029), ('nair', 0.029), ('tuning', 0.028), ('sutskever', 0.027), ('convolutional', 0.027), ('decay', 0.026), ('training', 0.025), ('regularization', 0.025), ('hinton', 0.025), ('features', 0.025), ('object', 0.025), ('gpu', 0.024), ('backpropagation', 0.024), ('coates', 0.024), ('belief', 0.024), ('autoencoders', 0.024), ('scratch', 0.024), ('epoch', 0.023), ('achieved', 0.023), ('ryan', 0.023), ('regularize', 0.023), ('weights', 0.022), ('larochelle', 0.022), ('sparse', 0.021), ('rate', 0.021), ('recognition', 0.02), ('technique', 0.02), ('krizhevsky', 0.02), ('initialized', 0.019), ('toronto', 0.019), ('vincent', 0.019), ('deeper', 0.019), ('clamping', 0.018), ('impressively', 0.018), ('marcaurelio', 0.018), ('ramped', 0.018), ('recapitulate', 0.018), ('invariance', 0.018), ('ers', 0.018), ('raw', 0.017), ('tune', 0.017), ('cation', 0.017), ('lajoie', 0.017), ('boards', 0.017), ('jarrett', 0.017), ('koray', 0.017), ('martens', 0.017), ('numer', 0.017), ('nvidia', 0.017), ('tijmen', 0.017), ('tikhonov', 0.017), ('unlabeled', 0.017), ('injecting', 0.016), ('jie', 0.016), ('omitting', 0.016), ('selectively', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures.

2 0.37050515 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

3 0.36893341 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

4 0.21971199 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.18377578 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

6 0.15702772 331 nips-2013-Top-Down Regularization of Deep Belief Networks

7 0.15353245 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

8 0.10206158 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

9 0.097391389 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

10 0.088583879 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

11 0.08690197 160 nips-2013-Learning Stochastic Feedforward Neural Networks

12 0.082025126 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

13 0.079148628 75 nips-2013-Convex Two-Layer Modeling

14 0.078918457 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

15 0.070950605 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

16 0.070344247 5 nips-2013-A Deep Architecture for Matching Short Texts

17 0.06998761 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

18 0.066636145 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

19 0.061919823 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

20 0.061841443 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.129), (1, 0.071), (2, -0.128), (3, -0.139), (4, 0.087), (5, -0.164), (6, -0.12), (7, 0.074), (8, 0.021), (9, -0.232), (10, 0.371), (11, 0.002), (12, -0.076), (13, 0.066), (14, 0.098), (15, 0.019), (16, -0.078), (17, 0.08), (18, 0.103), (19, -0.023), (20, 0.027), (21, 0.14), (22, -0.267), (23, -0.08), (24, -0.085), (25, -0.075), (26, 0.159), (27, -0.022), (28, -0.028), (29, 0.009), (30, 0.029), (31, -0.055), (32, 0.041), (33, 0.011), (34, -0.056), (35, -0.044), (36, 0.007), (37, 0.019), (38, 0.012), (39, 0.001), (40, 0.029), (41, 0.019), (42, 0.022), (43, -0.016), (44, 0.031), (45, 0.013), (46, 0.027), (47, 0.011), (48, 0.02), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94268996 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

same-paper 2 0.93968564 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures.

3 0.8457678 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

4 0.63537145 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.50976765 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

6 0.47336918 251 nips-2013-Predicting Parameters in Deep Learning

7 0.41335499 331 nips-2013-Top-Down Regularization of Deep Belief Networks

8 0.39502269 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

9 0.37453038 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

10 0.35598448 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

11 0.35032138 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

12 0.34284839 160 nips-2013-Learning Stochastic Feedforward Neural Networks

13 0.3424511 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

14 0.30291584 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

15 0.28794581 85 nips-2013-Deep content-based music recommendation

16 0.28548923 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

17 0.25816405 5 nips-2013-A Deep Architecture for Matching Short Texts

18 0.2533606 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

19 0.24921027 65 nips-2013-Compressive Feature Learning

20 0.24350421 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.02), (19, 0.011), (31, 0.01), (33, 0.171), (34, 0.094), (41, 0.027), (49, 0.063), (56, 0.105), (70, 0.043), (85, 0.023), (89, 0.029), (92, 0.16), (93, 0.14)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87397027 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

same-paper 2 0.87313807 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

3 0.85590839 280 nips-2013-Robust Data-Driven Dynamic Programming

Author: Grani Adiwena Hanasusanto, Daniel Kuhn

Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1

4 0.84118611 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

Author: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Muse, Guillermo Sapiro

Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsityrelated techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available. 1

5 0.83517075 215 nips-2013-On Decomposing the Proximal Map

Author: Yao-Liang Yu

Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1

6 0.82915246 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

7 0.82457203 99 nips-2013-Dropout Training as Adaptive Regularization

8 0.82175779 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

9 0.82163393 65 nips-2013-Compressive Feature Learning

10 0.81865376 251 nips-2013-Predicting Parameters in Deep Learning

11 0.81804198 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

12 0.81484532 64 nips-2013-Compete to Compute

13 0.81477302 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

14 0.81460297 211 nips-2013-Non-Linear Domain Adaptation with Boosting

15 0.80894834 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

16 0.80820584 5 nips-2013-A Deep Architecture for Matching Short Texts

17 0.80757064 331 nips-2013-Top-Down Regularization of Deep Belief Networks

18 0.80225122 183 nips-2013-Mapping paradigm ontologies to and from the brain

19 0.7993114 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

20 0.79866982 301 nips-2013-Sparse Additive Text Models with Low Rank Background