nips2013-339: knowledge graph by maker-knowledge-mining

339 nips-2013-Understanding Dropout


Source: pdf

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. [sent-4, score-0.296]

2 We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. [sent-5, score-1.529]

3 For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. [sent-6, score-0.983]

4 We provide estimates and bounds for these approximations and corroborate the results with simulations. [sent-7, score-0.045]

5 Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. [sent-8, score-0.661]
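
As a hedged illustration of the regularization claim above (the simplest special case, not the paper's general derivation), consider a single linear unit trained with squared error under dropout, with independent Bernoulli gating variables δ_j of mean p_j applied to the inputs. Taking the expectation of the dropout error over the gating variables gives

    E_δ[ (1/2) (t − Σ_j δ_j w_j I_j)^2 ]
        = (1/2) (t − Σ_j p_j w_j I_j)^2 + (1/2) Σ_j p_j (1 − p_j) w_j^2 I_j^2,

so, in expectation, gradient descent on the dropout error is gradient descent on the error of the deterministic network with weights scaled by p_j, plus an adaptive, ridge-like weight-decay penalty.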

6 1 Introduction. Dropout is an algorithm for training neural networks that was described at NIPS 2012 [7]. [sent-9, score-0.188]

7 In its simplest form, during training, at each example presentation, feature detectors are deleted with probability q = 1 − p = 0.5. [sent-10, score-0.107]
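
A minimal NumPy sketch of this training-time mechanism (the helper name dropout_forward, the toy layer, and the chosen values are illustrative assumptions, not code from the paper):

    import numpy as np

    def dropout_forward(x, p_keep=0.5, rng=np.random.default_rng(0)):
        # Each unit is kept with probability p_keep and dropped
        # (set to zero) with probability q = 1 - p_keep.
        mask = rng.random(x.shape) < p_keep
        return x * mask, mask

    # Toy layer of 8 feature detectors; on average half survive each presentation.
    h = np.tanh(np.linspace(-1.0, 1.0, 8))
    h_dropped, mask = dropout_forward(h, p_keep=0.5)

At test time the full network is used, and the averaging results summarized below explain why scaling by the keep probabilities approximates the ensemble of subnetworks.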

8 The main motivation behind the algorithm is to prevent the co-adaptation of feature detectors, or overfitting, by forcing neurons to be robust and rely on population behavior, rather than on the activity of other specific units. [sent-14, score-0.113]

9 In [7], dropout is reported to achieve state-of-the-art performance on several benchmark datasets. [sent-15, score-0.661]

10 In spite of the impressive results that have been reported, little is known about dropout from a theoretical standpoint, in particular about its averaging, regularization, and convergence properties. [sent-17, score-0.729]

11 Open questions include whether q must be fixed at 0.5, whether different values of q can be used, including different values for different layers or different units, and whether dropout can be applied to the connections rather than the units. [sent-19, score-0.696]

12 2 Dropout in Linear Networks. It is instructive to first look at some of the properties of dropout in linear networks, since these can be studied exactly in the most general setting of a multilayer feedforward network described by an underlying acyclic graph. [sent-21, score-0.85]

13 The activity in unit i of layer h can be expressed as S_i^h(I) = Σ_{l<h} Σ_j w_{ij}^{hl} S_j^l (4), with E(S_j^0) = I_j in the input layer. [sent-22, score-0.402]

14 In short, the ensemble average can easily be computed by feedforward propagation in the original network, simply replacing each weight w_{ij}^{hl} by w_{ij}^{hl} p_j^l. [sent-23, score-0.788]
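
A small sketch of that property on a toy two-layer linear network, comparing a Monte Carlo average over dropout subnetworks with a single deterministic pass through scaled weights (all shapes, values, and the uniform keep probability p are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    I  = rng.normal(size=3)          # input activities
    W1 = rng.normal(size=(4, 3))     # layer-1 weights
    W2 = rng.normal(size=(2, 4))     # layer-2 weights
    p  = 0.5                         # probability of keeping a unit

    # Monte Carlo estimate of the ensemble average over dropout subnetworks,
    # with dropout applied to the input units and to the hidden units.
    total, n_samples = np.zeros(2), 20000
    for _ in range(n_samples):
        m0 = rng.random(3) < p
        m1 = rng.random(4) < p
        total += W2 @ ((W1 @ (I * m0)) * m1)
    mc_mean = total / n_samples

    # Deterministic pass with every weight scaled by the keep probability
    # of its source unit, i.e. replacing w by p * w.
    det = W2 @ (p * (W1 @ (p * I)))

    print(mc_mean)   # approaches det as n_samples grows
    print(det)

Because the network is linear and the gating variables are independent, the two quantities agree exactly in expectation; in non-linear networks the same weight scaling is only an approximation, which is what the NWGM analysis below quantifies.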

15 Dropout in Neural Networks: Dropout in Shallow Neural Networks. Consider first a single logistic unit with n inputs, O = σ(S) = 1/(1 + c e^{−λS}) and S = Σ_{j=1}^{n} w_j I_j. [sent-25, score-0.172]

16 To achieve the greatest level of generality, we assume that the unit produces different outputs O1, ..., Om, associated with different sums S1, ..., Sm, occurring with probabilities P1, ..., Pm. [sent-26, score-0.11]

17 In the most relevant case, these outputs and these sums are associated with the m = 2^n possible subnetworks of the unit. [sent-36, score-0.162]

18 The probabilities P1, ..., Pm could be generated, for instance, by using Bernoulli gating variables, although this is not necessary for this derivation. [sent-40, score-0.098]

19 It is useful to define the following four quantities: the mean E = Σ_i P_i O_i; the mean of the complements E' = Σ_i P_i (1 − O_i) = 1 − E; the weighted geometric mean (WGM) G = Π_i O_i^{P_i}; and the weighted geometric mean of the complements G' = Π_i (1 − O_i)^{P_i}. [sent-41, score-0.382]

20 We also define the normalized weighted geometric mean NWGM = G/(G + G'). [sent-42, score-0.153]

21 We can now prove the key averaging theorem for logistic functions: NWGM(O1, ..., Om) = σ(E(S)). [sent-43, score-0.159]

22 A similar result also holds for normalized exponential (softmax) transfer functions. [sent-50, score-0.034]

23 Finally, one can also show that the only functions f satisfying NWGM(f) = f(E) are the constant functions and the logistic functions [1]. [sent-51, score-0.078]
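
A numerical check of these quantities for a single logistic unit with dropout on its n inputs, enumerating all m = 2^n subnetworks exactly (n, c, λ, the weights, and the uniform keep probability are illustrative assumptions):

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)
    n, c, lam = 4, 1.0, 1.0
    w = rng.normal(size=n)           # unit weights
    I = rng.normal(size=n)           # inputs
    p = 0.5                          # keep probability of each input

    def sigma(s):
        # Logistic transfer function O = 1 / (1 + c * exp(-lam * s))
        return 1.0 / (1.0 + c * np.exp(-lam * s))

    # Enumerate the m = 2^n subnetworks produced by Bernoulli gating of the inputs.
    S, P = [], []
    for gate in product([0, 1], repeat=n):
        gate = np.array(gate)
        S.append(np.dot(w, I * gate))
        P.append(p ** gate.sum() * (1 - p) ** (n - gate.sum()))
    S, P = np.array(S), np.array(P)
    O = sigma(S)

    E_mean = np.sum(P * O)                 # mean output E
    G  = np.prod(O ** P)                   # weighted geometric mean
    Gc = np.prod((1.0 - O) ** P)           # WGM of the complements
    nwgm = G / (G + Gc)

    print(nwgm, sigma(np.sum(P * S)))      # identical: NWGM = sigma(E(S))
    print(E_mean)                          # the ensemble mean that the NWGM approximates

The first print line illustrates the averaging theorem stated above; the second shows the true expectation E that the NWGM is used to approximate.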

24 Figure 3: In every hidden layer of a dropout-trained network, the distribution of neuron activations O∗ is sparse and not symmetric. [sent-57, score-0.751]

25 Improving neural networks by preventing co-adaptation of feature detectors. [sent-109, score-0.18]

26 On the Ky Fan inequality and related inequalities I. [sent-116, score-0.065]

27 On the Ky Fan inequality and related inequalities II. [sent-121, score-0.065]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dropout', 0.661), ('gm', 0.28), ('oi', 0.245), ('om', 0.182), ('pi', 0.158), ('hl', 0.155), ('irvine', 0.144), ('networks', 0.114), ('neuman', 0.112), ('wij', 0.101), ('si', 0.1), ('gating', 0.098), ('ce', 0.095), ('feedforward', 0.092), ('baldi', 0.091), ('subnetworks', 0.091), ('oj', 0.091), ('pm', 0.085), ('ky', 0.081), ('averaging', 0.081), ('ensemble', 0.081), ('logistic', 0.078), ('desjardins', 0.078), ('geometric', 0.078), ('complements', 0.072), ('bastien', 0.066), ('lamblin', 0.066), ('inequalities', 0.065), ('multilayer', 0.063), ('units', 0.062), ('detectors', 0.062), ('fan', 0.06), ('layer', 0.06), ('ij', 0.058), ('toronto', 0.052), ('deep', 0.052), ('economical', 0.049), ('oral', 0.049), ('lnai', 0.049), ('wj', 0.048), ('unit', 0.046), ('sj', 0.046), ('berlin', 0.046), ('saad', 0.045), ('corroborate', 0.045), ('deleted', 0.045), ('conjectured', 0.043), ('weighted', 0.041), ('neurons', 0.041), ('breuleux', 0.041), ('theano', 0.041), ('sums', 0.04), ('activity', 0.04), ('california', 0.039), ('bernoulli', 0.039), ('scipy', 0.039), ('sigmoidal', 0.039), ('ca', 0.039), ('training', 0.038), ('propagation', 0.037), ('standpoint', 0.037), ('spite', 0.037), ('pascanu', 0.037), ('neural', 0.036), ('pierre', 0.036), ('goodfellow', 0.036), ('turian', 0.036), ('connections', 0.035), ('python', 0.035), ('luxburg', 0.035), ('cal', 0.035), ('weights', 0.035), ('normalized', 0.034), ('shallow', 0.034), ('math', 0.034), ('nement', 0.034), ('instructive', 0.034), ('arithmetic', 0.034), ('bergstra', 0.034), ('robbins', 0.033), ('gpu', 0.033), ('regularizing', 0.033), ('bulletin', 0.033), ('tx', 0.033), ('greatest', 0.033), ('forcing', 0.032), ('sm', 0.032), ('australian', 0.032), ('improving', 0.032), ('verlag', 0.031), ('editor', 0.031), ('pj', 0.031), ('outputs', 0.031), ('formalism', 0.031), ('impressive', 0.031), ('pl', 0.031), ('activations', 0.03), ('preventing', 0.03), ('stochastically', 0.029), ('srivastava', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

2 0.5153383 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

3 0.37050515 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

4 0.16676927 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.10687508 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

6 0.10611805 251 nips-2013-Predicting Parameters in Deep Learning

7 0.10300399 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles

8 0.081487879 232 nips-2013-Online PCA for Contaminated Data

9 0.073837399 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

10 0.06401404 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

11 0.063206889 331 nips-2013-Top-Down Regularization of Deep Belief Networks

12 0.062231749 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

13 0.057520017 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

14 0.050595809 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

15 0.046491124 5 nips-2013-A Deep Architecture for Matching Short Texts

16 0.046032496 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

17 0.045760073 75 nips-2013-Convex Two-Layer Modeling

18 0.045009766 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

19 0.044881005 160 nips-2013-Learning Stochastic Feedforward Neural Networks

20 0.039259773 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.054), (2, -0.094), (3, -0.101), (4, 0.037), (5, -0.092), (6, -0.116), (7, 0.043), (8, 0.004), (9, -0.184), (10, 0.376), (11, 0.009), (12, -0.027), (13, 0.071), (14, 0.118), (15, -0.03), (16, -0.095), (17, 0.126), (18, 0.144), (19, -0.003), (20, -0.016), (21, 0.17), (22, -0.386), (23, -0.062), (24, -0.139), (25, -0.092), (26, 0.276), (27, 0.003), (28, -0.087), (29, 0.002), (30, -0.021), (31, -0.043), (32, 0.068), (33, 0.083), (34, -0.091), (35, -0.073), (36, -0.036), (37, -0.001), (38, -0.02), (39, -0.01), (40, 0.009), (41, 0.018), (42, 0.017), (43, -0.039), (44, 0.046), (45, 0.018), (46, 0.048), (47, -0.006), (48, -0.007), (49, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97550112 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

2 0.88270575 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

3 0.8163926 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

4 0.49202082 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.33662412 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

6 0.26119909 251 nips-2013-Predicting Parameters in Deep Learning

7 0.2479697 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

8 0.2283341 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes

9 0.22421557 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

10 0.22369952 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

11 0.21486729 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles

12 0.2084679 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

13 0.19725263 160 nips-2013-Learning Stochastic Feedforward Neural Networks

14 0.1968784 85 nips-2013-Deep content-based music recommendation

15 0.19286482 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

16 0.19254453 65 nips-2013-Compressive Feature Learning

17 0.19191718 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars

18 0.18477254 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

19 0.18134935 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

20 0.17991769 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.025), (33, 0.09), (34, 0.085), (41, 0.024), (49, 0.025), (56, 0.043), (70, 0.024), (89, 0.023), (93, 0.564)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91331857 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

2 0.84384245 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

3 0.84062028 65 nips-2013-Compressive Feature Learning

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1

4 0.81103688 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

Author: Andriy Mnih, Koray Kavukcuoglu

Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1

5 0.80442476 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon

Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1

6 0.63883412 215 nips-2013-On Decomposing the Proximal Map

7 0.60930395 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

8 0.55498165 99 nips-2013-Dropout Training as Adaptive Regularization

9 0.54118747 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

10 0.53351277 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality

11 0.53338408 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

12 0.51353687 30 nips-2013-Adaptive dropout for training deep neural networks

13 0.5132302 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

14 0.49148723 251 nips-2013-Predicting Parameters in Deep Learning

15 0.48412067 5 nips-2013-A Deep Architecture for Matching Short Texts

16 0.47576112 69 nips-2013-Context-sensitive active sensing in humans

17 0.4660041 64 nips-2013-Compete to Compute

18 0.46230453 183 nips-2013-Mapping paradigm ontologies to and from the brain

19 0.46075138 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion

20 0.45546949 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization