nips nips2013 nips2013-99 knowledge-graph by maker-knowledge-mining

99 nips-2013-Dropout Training as Adaptive Regularization


Source: pdf

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. [sent-4, score-0.575]

2 For generalized linear models, dropout performs a form of adaptive regularization. [sent-5, score-0.76]

3 Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. [sent-6, score-0.931]

4 By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. [sent-8, score-0.938]

5 We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. [sent-9, score-0.756]

6 Although dropout has proved to be a very successful technique, the reasons for its success are not yet well understood at a theoretical level. [sent-12, score-0.718]

7 For example, Bishop [9] showed that the effect of training with features that have been corrupted with additive Gaussian noise is equivalent to a form of L2-type regularization in the low noise limit. [sent-15, score-0.314]

8 In this paper, we take a step towards understanding how dropout training works by analyzing it as a regularizer. [sent-16, score-0.782]

9 We focus on generalized linear models (GLMs), a class of models for which feature dropout reduces to a form of adaptive model regularization. [sent-17, score-0.818]

10 Using this framework, we show that dropout training is first-order equivalent to L2-regularization after transforming the input by $\mathrm{diag}(\hat{I})^{-1/2}$, where $\hat{I}$ is an estimate of the Fisher information matrix. [sent-18, score-0.801]

11 In the case of logistic regression, dropout can be interpreted as a form of adaptive L2-regularization that favors rare but useful features. [sent-20, score-0.98]

12 [1] introduced dropout training in the context of neural networks specifically, and also advocated omitting random hidden layers during training. [sent-31, score-0.804]

13 In this paper, we follow [2, 3] and study feature dropout as a generic training method that can be applied to any learning algorithm. [sent-32, score-0.84]

14 AdaGrad and dropout training have an intimate connection: just as SGD progresses by repeatedly solving linearized L2-regularized problems, a close relative of AdaGrad advances by solving linearized dropout-regularized problems. [sent-33, score-0.888]

15 Our formulation of dropout training as adaptive regularization also leads to a simple semi-supervised learning scheme, where we use unlabeled data to learn a better dropout regularizer. [sent-34, score-1.772]

16 We apply this idea to several document classification problems, and find that it consistently improves the performance of dropout training. [sent-36, score-0.736]

17 On the benchmark IMDB reviews dataset introduced by [12], dropout logistic regression with a regularizer tuned on unlabeled data outperforms the previous state of the art. [sent-37, score-1.165]

18 2 Artificial Feature Noising as Regularization: We begin by discussing the general connections between feature noising and regularization in generalized linear models (GLMs). [sent-39, score-0.563]

19 We will apply the machinery developed here to dropout training in Section 4. [sent-40, score-0.782]

20 A GLM defines a conditional distribution over a response $y \in \mathcal{Y}$ given an input feature vector $x \in \mathbb{R}^d$: $p_\beta(y \mid x) \overset{\mathrm{def}}{=} h(y)\exp\{y\, x \cdot \beta - A(x \cdot \beta)\}$, with loss $\ell_{x,y}(\beta) \overset{\mathrm{def}}{=} -\log p_\beta(y \mid x)$. [sent-41, score-0.186]

21 Given $n$ training examples $(x_i, y_i)$, the standard maximum likelihood estimate $\hat\beta \in \mathbb{R}^d$ minimizes the empirical loss over the training examples: $\hat\beta \overset{\mathrm{def}}{=} \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \ell_{x_i, y_i}(\beta)$ (2). [sent-49, score-0.312]
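
As a concrete instance of this loss, a minimal Python sketch for the logistic case (our own naming; assumes labels y_i in {0, 1} and a dense feature matrix):

```python
import numpy as np

def log_partition(z):
    # A(z) = log(1 + e^z), the log-partition function for logistic regression
    return np.logaddexp(0.0, z)

def glm_loss(beta, X, y):
    # Negative log-likelihood sum_i [ A(x_i . beta) - y_i * (x_i . beta) ],
    # dropping the h(y) term, which does not depend on beta.
    z = X @ beta
    return np.sum(log_partition(z) - y * z)
```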

22 With artificial feature noising, we replace the observed feature vectors $x_i$ with noisy versions $\tilde{x}_i = \nu(x_i, \xi_i)$, where $\nu$ is our noising function and $\xi_i$ is an independent random variable. [sent-50, score-0.675]

23 In other words, dropout noise corresponds to setting $\tilde{x}_{ij}$ to 0 with probability $\delta$ and to $x_{ij}/(1-\delta)$ otherwise. [sent-55, score-0.849]
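
A minimal sketch of this noising function for dense numpy arrays (function and argument names are ours, not from the paper):

```python
import numpy as np

def dropout_noising(X, delta, rng=None):
    # Set each entry of X to 0 with probability delta, otherwise scale it by
    # 1/(1 - delta), so that E[X_noisy] = X (the noise is unbiased).
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(X.shape) >= delta
    return X * keep / (1.0 - delta)
```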

24 Integrating over the feature noise gives us a noised maximum likelihood parameter estimate: $\hat\beta \overset{\mathrm{def}}{=} \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \mathbb{E}_\xi\big[\ell_{\tilde{x}_i, y_i}(\beta)\big]$ (3), where $\mathbb{E}_\xi[Z] = \mathbb{E}[Z \mid \{x_i, y_i\}]$ is the expectation taken with respect to the artificial feature noise $\xi = (\xi_1, \ldots, \xi_n)$. [sent-56, score-0.417]

25 For GLMs, the noised empirical loss takes on a simpler form: $\sum_{i=1}^n \mathbb{E}_\xi\big[\ell_{\tilde{x}_i, y_i}(\beta)\big] = \sum_{i=1}^n \ell_{x_i, y_i}(\beta) + R(\beta)$, where $R(\beta) \overset{\mathrm{def}}{=} \sum_{i=1}^n \big(\mathbb{E}_\xi[A(\tilde{x}_i \cdot \beta)] - A(x_i \cdot \beta)\big)$ (5) is the noising penalty. (Dropout noise here is as defined by [1].) [sent-61, score-0.986]
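
For illustration, a Monte-Carlo estimate of this penalty for the logistic case under dropout noise is sketched below; the draw count and names are arbitrary choices of ours:

```python
import numpy as np

def noising_penalty_mc(beta, X, delta, n_draws=200, seed=0):
    # Monte-Carlo estimate of R(beta) = sum_i ( E_xi[A(x~_i . beta)] - A(x_i . beta) )
    # for logistic regression, where A(z) = log(1 + e^z), under dropout noise.
    rng = np.random.default_rng(seed)
    base = np.logaddexp(0.0, X @ beta)                # A(x_i . beta)
    noisy = np.zeros_like(base)
    for _ in range(n_draws):
        keep = rng.random(X.shape) >= delta
        noisy += np.logaddexp(0.0, (X * keep / (1.0 - delta)) @ beta)
    return float(np.sum(noisy / n_draws - base))
```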

26 The key observation here is that the effect of artificial feature noising reduces to a penalty $R(\beta)$ that does not depend on the labels $\{y_i\}$. [sent-68, score-0.619]

27 Because of this, artificial feature noising penalizes the complexity of a classifier in a way that does not depend on its accuracy. [sent-69, score-0.550]

28 Thus, for GLMs, artificial feature noising is a regularization scheme on the model itself that can be compared with other forms of regularization such as ridge (L2 ) or lasso (L1 ) penalization. [sent-70, score-0.647]

29 In Section 6, we exploit the label-independence of the noising penalty and use unlabeled data to tune our estimate of $R(\beta)$. [sent-71, score-0.739]

30 It is easy to verify that these expressions are in general not equivalent, but they are equivalent when the effect of feature noising reduces to a label-independent penalty on the likelihood. [sent-75, score-0.663]

31 A Quadratic Approximation to the Noising Penalty: Although the noising penalty R yields an explicit regularizer that does not depend on the labels $\{y_i\}$, the form of R can be difficult to interpret. [sent-78, score-0.635]

32 Applying this quadratic approximation to (5) yields the following quadratic noising regularizer, which will play a pivotal role in the rest of the paper: $R^q(\beta) \overset{\mathrm{def}}{=} \frac{1}{2}\sum_{i=1}^n A''(x_i \cdot \beta)\,\mathrm{Var}_\xi[\tilde{x}_i \cdot \beta]$ (6). [sent-82, score-0.665]

33 This regularizer penalizes two types of variance over the training examples: (i) $A''(x_i \cdot \beta)$, which corresponds to the variance of the response $y_i$ in the GLM, and (ii) $\mathrm{Var}_\xi[\tilde{x}_i \cdot \beta]$, the variance of the estimated GLM parameter $\tilde{x}_i \cdot \beta$ due to noising. [sent-83, score-0.193]

34 Accuracy of approximation: Figure 1a compares the noising penalties R and $R^q$ for logistic regression in the case that $\tilde{x} \cdot \beta$ is Gaussian; we vary the mean parameter $p = (1 + e^{-x \cdot \beta})^{-1}$ and the noise level $\sigma$. [sent-84, score-0.637]

35 Although $R^q$ is not convex, we were still able (using an L-BFGS algorithm) to train logistic regression with $R^q$ as a surrogate for the dropout regularizer without running into any major issues with local optima. [sent-91, score-1.011]
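
A sketch of what such a fit might look like with scipy's L-BFGS, using the quadratic surrogate from (12) for logistic regression; this is our reconstruction under those assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def dropout_quadratic_penalty(beta, X, delta):
    # Quadratic dropout surrogate for logistic regression (eq. 12):
    # 0.5 * delta/(1-delta) * sum_i p_i(1-p_i) * sum_j x_ij^2 beta_j^2
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return 0.5 * delta / (1.0 - delta) * np.dot(p * (1.0 - p), (X ** 2) @ (beta ** 2))

def objective(beta, X, y, delta):
    # Logistic negative log-likelihood plus the quadratic dropout penalty.
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z) + dropout_quadratic_penalty(beta, X, delta)

def fit_dropout_logreg(X, y, delta=0.5):
    beta0 = np.zeros(X.shape[1])
    # L-BFGS with finite-difference gradients; adequate for a small illustrative problem.
    res = minimize(objective, beta0, args=(X, y, delta), method="L-BFGS-B")
    return res.x
```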

36 This assumption holds a priori for additive Gaussian noise, and can be reasonable for dropout by the central limit theorem. [sent-92, score-0.765]

37 [Figure 1; axes: Sigma, Training Iteration] (a) Comparison of noising penalties R and $R^q$ for logistic regression with Gaussian perturbations (i.e., $\tilde{x} \cdot \beta$ Gaussian). [sent-109, score-0.649]

38 The solid line indicates the true penalty and the dashed one is our quadratic approximation thereof; $p = (1 + e^{-x \cdot \beta})^{-1}$ is the mean parameter for the logistic model. [sent-112, score-0.293]

39 (b) Comparing the evolution of the exact dropout penalty R and our quadratic approximation $R^q$ for logistic regression on the AthR classification task in [15] with 22K features and n = 1000 examples. [sent-113, score-1.130]

40 In practice, we have found that fitting logistic regression with the quadratic surrogate $R^q$ gives similar results to actual dropout-regularized logistic regression. [sent-120, score-0.368]

41 3 Regularization based on Additive Noise: Having established the general quadratic noising regularizer $R^q$, we now turn to studying the effects of $R^q$ for various likelihoods (linear and logistic regression) and noising models (additive and dropout). [sent-122, score-1.147]

42 In this section, we warm up with additive noise; in Section 4 we turn to our main target of interest, namely dropout noise. [sent-123, score-0.765]

43 Applying these facts to (6) yields a simplified form for the quadratic noising penalty: $R^q(\beta) = \frac{1}{2} n \sigma^2 \|\beta\|_2^2$ (7). [sent-126, score-0.518]

44 Thus, we recover the well-known result that linear regression with additive feature noising is equivalent to ridge regression [2, 9]. [sent-127, score-0.755]

45 For logistic regression, $A''(x_i \cdot \beta) = p_i(1 - p_i)$, where $p_i = (1 + \exp(-x_i \cdot \beta))^{-1}$ is the predicted probability of $y_i = 1$. [sent-130, score-0.518]

46 The quadratic noising penalty is then $R^q(\beta) = \frac{1}{2}\sigma^2 \|\beta\|_2^2 \sum_{i=1}^n p_i(1 - p_i)$ (8). [sent-131, score-0.842]

47 In other words, the noising penalty now simultaneously encourages parsimonious modeling as before (by encouraging $\|\beta\|_2$ to be small) as well as confident predictions (by encouraging the $p_i$'s to move away from $\frac{1}{2}$). [sent-132, score-0.691]
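
A one-function sketch of (8), assuming additive Gaussian noise of variance sigma^2 on each feature (names are ours):

```python
import numpy as np

def additive_noise_penalty(beta, X, sigma):
    # Quadratic penalty for logistic regression with additive Gaussian noise (eq. 8):
    # 0.5 * sigma^2 * ||beta||_2^2 * sum_i p_i (1 - p_i)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return 0.5 * sigma ** 2 * np.dot(beta, beta) * np.sum(p * (1.0 - p))
```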

48 The $p_i = (1 + e^{-x_i \cdot \beta})^{-1}$ are mean parameters for the logistic model. [sent-137, score-0.290]

49 We can check that $\mathrm{Var}_\xi[\tilde{x}_i \cdot \beta] = \frac{\delta}{1-\delta} \sum_{j=1}^d x_{ij}^2 \beta_j^2$ (9), and so the quadratic dropout penalty is $R^q(\beta) = \frac{1}{2}\,\frac{\delta}{1-\delta} \sum_{i=1}^n A''(x_i \cdot \beta) \sum_{j=1}^d x_{ij}^2 \beta_j^2$ (10). [sent-139, score-0.945]

50 Thus, dropout can be seen as an attempt to apply an L2 penalty after normalizing the feature vector by $\mathrm{diag}(\hat{I})^{-1/2}$. [sent-143, score-0.905]
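
A sketch of this normalization view for the logistic case, assuming the diagonal Fisher estimate $\hat{I}_{jj} = \sum_i p_i(1-p_i)x_{ij}^2$ and rewriting (12) as an L2 penalty weighted by $\mathrm{diag}(\hat{I})$ (our naming):

```python
import numpy as np

def diag_fisher_logistic(beta, X):
    # Diagonal of an estimate of the Fisher information for logistic regression:
    # I_jj = sum_i p_i (1 - p_i) x_ij^2
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (p * (1.0 - p)) @ (X ** 2)

def dropout_penalty_as_scaled_l2(beta, X, delta):
    # Eq. (12) rearranged: an L2 penalty on beta weighted by diag(I), i.e.
    # an ordinary L2 penalty after rescaling the features by diag(I)^(-1/2).
    return 0.5 * delta / (1.0 - delta) * np.dot(diag_fisher_logistic(beta, X), beta ** 2)
```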

51 Linear Regression: For linear regression, V is the identity matrix, so the dropout objective is equivalent to a form of ridge regression where each column of the design matrix is normalized before applying the L2 penalty. [sent-149, score-0.859]
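
A sketch of that equivalence for linear regression: normalize each column of the design matrix, apply an ordinary ridge penalty, and map the coefficients back (the penalty strength lam is an arbitrary placeholder; assumes no all-zero columns):

```python
import numpy as np

def ridge_on_normalized_columns(X, y, lam=1.0):
    # Scale each column of X to unit norm, solve ridge regression in the
    # normalized coordinates, then return coefficients on the original scale.
    scale = np.sqrt((X ** 2).sum(axis=0))
    Xn = X / scale
    beta_n = np.linalg.solve(Xn.T @ Xn + lam * np.eye(X.shape[1]), Xn.T @ y)
    return beta_n / scale
```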

52 Logistic Regression: The form of dropout penalties becomes much more intriguing once we move beyond the realm of linear regression. [sent-151, score-0.739]

53 Here, we can write the quadratic dropout penalty from (10) as $R^q(\beta) = \frac{1}{2}\,\frac{\delta}{1-\delta} \sum_{i=1}^n \sum_{j=1}^d p_i(1 - p_i)\, x_{ij}^2\, \beta_j^2$ (12). [sent-153, score-1.107]

54 Thus, just like additive noising, dropout generally gives an advantage to confident predictions and small $\beta$. [sent-154, score-0.814]

55 However, unlike all the other methods considered so far, dropout may allow for some large $p_i(1 - p_i)$ and some large $\beta_j^2$, provided that the corresponding cross-term $x_{ij}^2$ is small. [sent-155, score-0.934]

56 Our analysis shows that dropout regularization should be better than L2-regularization for learning weights for features that are rare (i.e. [sent-156, score-0.943]

57 often 0) but highly discriminative, because dropout effectively does not penalize $\beta_j$ over observations for which $x_{ij} = 0$. [sent-158, score-0.762]

58 Thus, in order for a feature to earn a large $\beta_j$, it suffices for it to contribute to a confident prediction with small $p_i(1 - p_i)$ each time that it is active. [sent-159, score-0.274]

59 To be precise, dropout does not reward all rare but discriminative features. [sent-161, score-0.862]

60 Rather, dropout rewards those features that are rare and positively co-adapted with other features in a way that enables the model to make confident predictions whenever the feature of interest is active. [sent-162, score-0.99]

61 Table 3: Accuracy of L2 and dropout regularized logistic regression on a simulated example. [sent-163, score-0.893]

62 classification where rare but discriminative features are prevalent [3]. [sent-175, score-0.190]

63 We summarize the relationship between L2-penalization, additive noising, and dropout in Table 2. [sent-177, score-1.218]

64 Additive noising introduces a product-form penalty depending on both $\beta$ and $A''$. [sent-178, score-0.561]

65 However, the full potential of artificial feature noising only emerges with dropout, which allows the penalty terms due to $\beta$ and $A''$ to interact in a non-trivial way through the design matrix X (except for linear regression, in which all the noising schemes we consider collapse to ridge regression). [sent-179, score-1.121]

66 A Simulation Example: The above discussion suggests that dropout logistic regression should perform well with rare but useful features. [sent-181, score-0.993]

67 To make sure that our experiment was picking up the effect of dropout training specifically and not just normalization of X, we ensured that the columns of X were normalized in expectation. [sent-184, score-0.782]

68 The dropout penalty for logistic regression can be written as a matrix product: $R^q(\beta) = \frac{1}{2}\,\frac{\delta}{1-\delta}\,\big(\cdots\; p_i(1-p_i)\;\cdots\big)\begin{pmatrix} & \vdots & \\ \cdots & x_{ij}^2 & \cdots \\ & \vdots & \end{pmatrix}\begin{pmatrix} \vdots \\ \beta_j^2 \\ \vdots \end{pmatrix}$ (13). [sent-185, score-1.217]

69 We designed the simulation study in such a way that, at the optimal $\beta$, the dropout penalty (13) should have a structure (14) in which the entries of the $p_i(1-p_i)$ vector split into small values (confident predictions) and big values (weak predictions), with the big values matched against zero entries of $x_{ij}^2$. [sent-186, score-0.873]

70 A dropout penalty with such a structure should be small. [sent-188, score-0.826]

71 Although there are some uncertain predictions with large $p_i(1-p_i)$ and some big weights $\beta_j$, these terms cannot interact because the corresponding terms $x_{ij}^2$ are all 0 (these are examples without any of the rare discriminative features and thus have no signal). [sent-189, score-0.475]

72 Our simulation results, given in Table 3, confirm that dropout training outperforms L2-regularization here as expected. [sent-191, score-0.802]

73 In SGD, the weight vector $\hat\beta$ is updated with $\hat\beta_{t+1} = \hat\beta_t - \eta_t g_t$, where $g_t = \nabla \ell_{x_t, y_t}(\hat\beta_t)$ is the gradient of the loss due to the t-th training example. [sent-195, score-0.207]

74 Left: 10000 labeled training examples, and up to 40000 unlabeled examples. [sent-211, score-0.278]

75 Right: 3000-15000 labeled training examples, and 25000 unlabeled examples. [sent-212, score-0.278]

76 At least superficially, AdaGrad and dropout seem to have similar goals: for logistic regression, they can both be understood as adaptive alternatives to methods based on L2-regularization that favor learning rare, useful features. [sent-220, score-0.862]

77 This implies that dropout descent is first-order equivalent to an adaptive SGD procedure with $A_t = \mathrm{diag}(H_t)$ (17). [sent-223, score-0.796]

78 Thus, by using dropout instead of L2-regularization to solve linearized problems in online learning, we end up with an AdaGrad-like algorithm. [sent-226, score-0.774]
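
For comparison, a sketch of one standard diagonal AdaGrad step, the kind of update the text contrasts with dropout descent (this is generic AdaGrad, not necessarily the paper's exact variant):

```python
import numpy as np

def adagrad_step(beta, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    # One diagonal AdaGrad update:
    # beta <- beta - eta * grad / sqrt(sum of squared past gradients)
    grad_sq_sum = grad_sq_sum + grad ** 2
    beta = beta - eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return beta, grad_sq_sum
```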

79 Of course, the connection between AdaGrad and dropout is not perfect. [sent-227, score-0.751]

80 But, at a high level, AdaGrad and dropout both appear to be aiming for the same goal: scaling the features by the Fisher information to make the level-curves of the objective more circular. [sent-229, score-0.764]

81 In the case of logistic regression, AROW also favors learning rare features, but unlike dropout and AdaGrad does not privilege confident predictions. [sent-231, score-0.938]

82 Table 4: Performance of semi-supervised dropout training for document classification. [sent-233, score-0.8]

83 (a) Test accuracy with and without unlabeled data on different datasets. [sent-234, score-0.201]

84 L2: logistic regression with L2 regularization; Dropout: dropout trained with the quadratic surrogate; +Unlabeled: using unlabeled data. [sent-238, score-1.136]

85 Labeled: using just labeled data from each paper/method, +Unlabeled: using additional unlabeled data. [sent-240, score-0.214]

86 Drop: dropout with $R^q$, MNB: multinomial naive Bayes with the semi-supervised frequency estimate from [19], -Uni: unigram features, -Bi: bigram features. [sent-241, score-0.739]

87 As a result, we can use additional unlabeled training examples to estimate it more accurately. [sent-275, score-0.267]

88 Suppose we have an unlabeled dataset $\{z_i\}$ of size m, and let $\alpha \in (0, 1]$ be a discount factor for the unlabeled data. [sent-276, score-0.356]

89 Then we can define a semi-supervised penalty estimate $R^*(\beta) \overset{\mathrm{def}}{=} \frac{n}{n + \alpha m}\big(R(\beta) + \alpha\, R_{\mathrm{Unlabeled}}(\beta)\big)$ (19), where $R(\beta)$ is the original penalty estimate and $R_{\mathrm{Unlabeled}}(\beta) = \sum_i \big(\mathbb{E}_\xi[A(\tilde{z}_i \cdot \beta)] - A(z_i \cdot \beta)\big)$ is computed using (5) over the unlabeled examples $z_i$. [sent-277, score-0.502]
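
A sketch of (19), here substituting the quadratic surrogate $R^q$ for the exact penalty in both terms; the default value of alpha is an arbitrary placeholder, and all names are ours:

```python
import numpy as np

def dropout_quadratic_penalty(beta, X, delta):
    # Quadratic dropout surrogate R^q for logistic regression (eq. 12).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return 0.5 * delta / (1.0 - delta) * np.dot(p * (1.0 - p), (X ** 2) @ (beta ** 2))

def semi_supervised_penalty(beta, X_labeled, Z_unlabeled, delta, alpha=0.4):
    # R*(beta) = n/(n + alpha*m) * ( R(beta) + alpha * R_unlabeled(beta) )   (eq. 19)
    n, m = X_labeled.shape[0], Z_unlabeled.shape[0]
    r_lab = dropout_quadratic_penalty(beta, X_labeled, delta)
    r_unl = dropout_quadratic_penalty(beta, Z_unlabeled, delta)
    return n / (n + alpha * m) * (r_lab + alpha * r_unl)
```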

90 Our semi-supervised approach is based on a different intuition: we’d like to set weights to make confident predictions on unlabeled data as well as the labeled data, an intuition shared by entropy regularization [24] and transductive SVMs [25]. [sent-284, score-0.306]

91 Overall, we see that using unlabeled data to learn a better regularizer $R^*(\beta)$ consistently improves the performance of dropout training. [sent-287, score-0.970]

92 The dataset contains 50,000 unlabeled examples in addition to the labeled train and test sets of size 25,000 each. [sent-289, score-0.257]

93 Whereas the train and test examples are either positive or negative, the unlabeled examples contain neutral reviews as well. [sent-290, score-0.266]

94 We train a dropout-regularized logistic regression classifier on unigram/bigram features, and use the unlabeled data to tune our regularizer. [sent-291, score-0.371]

95 Our method benefits from unlabeled data even in the presence of a large amount of labeled data, and achieves state-of-the-art accuracy on this dataset. [sent-292, score-0.237]

96 7 Conclusion: We analyzed dropout training as a form of adaptive regularization. [sent-293, score-0.824]

97 This framework enabled us to uncover close connections between dropout training, adaptively balanced L2-regularization, and AdaGrad; and led to a simple yet effective method for semi-supervised training. [sent-294, score-0.718]

98 There seem to be multiple opportunities for digging deeper into the connection between dropout training and adaptive regularization. [sent-295, score-0.857]

99 In particular, it would be interesting to see whether the dropout regularizer takes on a tractable and/or interpretable form in neural networks, and whether similar semi-supervised schemes could be used to improve on the results presented in [1]. [sent-296, score-0.792]

100 Text classification from labeled and unlabeled documents using EM. [sent-363, score-0.214]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dropout', 0.718), ('noising', 0.453), ('unlabeled', 0.178), ('rq', 0.177), ('adagrad', 0.132), ('pi', 0.108), ('penalty', 0.108), ('logistic', 0.102), ('rare', 0.1), ('glms', 0.081), ('imdb', 0.077), ('regularizer', 0.074), ('noised', 0.073), ('regression', 0.073), ('sgd', 0.067), ('dent', 0.066), ('quadratic', 0.065), ('training', 0.064), ('def', 0.064), ('gt', 0.063), ('feature', 0.058), ('sida', 0.058), ('xi', 0.053), ('regularization', 0.052), ('additive', 0.047), ('features', 0.046), ('diag', 0.045), ('fisher', 0.044), ('xij', 0.044), ('discriminative', 0.044), ('noise', 0.043), ('adaptive', 0.042), ('yi', 0.039), ('var', 0.038), ('glm', 0.037), ('labeled', 0.036), ('classi', 0.036), ('linearized', 0.035), ('connection', 0.033), ('cially', 0.032), ('ridge', 0.032), ('ht', 0.031), ('surfaces', 0.029), ('argminy', 0.029), ('blankout', 0.029), ('mnb', 0.029), ('percy', 0.029), ('runlabeled', 0.029), ('salah', 0.029), ('linguistics', 0.029), ('duchi', 0.028), ('arti', 0.027), ('ij', 0.027), ('christopher', 0.026), ('arow', 0.026), ('surrogate', 0.026), ('expressions', 0.025), ('examples', 0.025), ('spherical', 0.025), ('yoshua', 0.024), ('accuracy', 0.023), ('omitting', 0.022), ('nuisance', 0.022), ('wager', 0.022), ('predictions', 0.022), ('unigram', 0.021), ('rifai', 0.021), ('normalizing', 0.021), ('online', 0.021), ('penalties', 0.021), ('progresses', 0.02), ('xavier', 0.02), ('simulation', 0.02), ('reviews', 0.02), ('wang', 0.019), ('zi', 0.019), ('stefan', 0.019), ('yann', 0.019), ('equivalent', 0.019), ('tangent', 0.018), ('cial', 0.018), ('document', 0.018), ('train', 0.018), ('transductive', 0.018), ('favors', 0.018), ('approximation', 0.018), ('hinton', 0.018), ('sentiment', 0.017), ('liang', 0.017), ('design', 0.017), ('descent', 0.017), ('con', 0.017), ('loss', 0.017), ('penalizes', 0.016), ('chris', 0.016), ('repeatedly', 0.016), ('stanford', 0.016), ('generative', 0.016), ('table', 0.016), ('tting', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

2 0.5153383 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

3 0.36893341 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

4 0.10881405 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.073037207 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

6 0.073012121 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

7 0.069585063 149 nips-2013-Latent Structured Active Learning

8 0.065019079 318 nips-2013-Structured Learning via Logistic Regression

9 0.064018436 251 nips-2013-Predicting Parameters in Deep Learning

10 0.06322702 171 nips-2013-Learning with Noisy Labels

11 0.061412785 65 nips-2013-Compressive Feature Learning

12 0.060515296 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

13 0.058571614 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

14 0.057170007 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

15 0.056259986 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

16 0.053739183 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

17 0.053179968 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

18 0.049659867 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

19 0.048337065 309 nips-2013-Statistical Active Learning Algorithms

20 0.047763862 91 nips-2013-Dirty Statistical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, 0.045), (2, -0.044), (3, -0.089), (4, 0.073), (5, -0.075), (6, -0.119), (7, 0.066), (8, -0.005), (9, -0.146), (10, 0.318), (11, -0.001), (12, -0.075), (13, 0.034), (14, 0.105), (15, -0.082), (16, -0.079), (17, 0.137), (18, 0.152), (19, 0.007), (20, -0.052), (21, 0.197), (22, -0.394), (23, -0.067), (24, -0.128), (25, -0.093), (26, 0.3), (27, -0.032), (28, -0.099), (29, -0.02), (30, -0.007), (31, -0.024), (32, 0.069), (33, 0.051), (34, -0.069), (35, -0.074), (36, -0.028), (37, -0.037), (38, 0.018), (39, 0.002), (40, 0.033), (41, 0.013), (42, 0.006), (43, -0.03), (44, -0.002), (45, 0.034), (46, 0.036), (47, -0.004), (48, 0.002), (49, -0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94363338 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

same-paper 2 0.93967748 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

3 0.78761172 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

4 0.41579044 64 nips-2013-Compete to Compute

Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber

Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1

5 0.32444772 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

6 0.27348474 76 nips-2013-Correlated random features for fast semi-supervised learning

7 0.27243236 65 nips-2013-Compressive Feature Learning

8 0.2394163 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

9 0.23560417 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

10 0.23023695 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

11 0.22154753 251 nips-2013-Predicting Parameters in Deep Learning

12 0.21990885 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

13 0.21789338 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

14 0.21620716 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

15 0.21378085 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

16 0.21157758 225 nips-2013-One-shot learning and big data with n=2

17 0.20755017 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

18 0.20629254 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

19 0.20423341 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

20 0.20248243 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.019), (16, 0.028), (33, 0.123), (34, 0.095), (41, 0.031), (49, 0.031), (50, 0.13), (56, 0.099), (70, 0.019), (85, 0.063), (89, 0.045), (93, 0.181), (95, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87633163 65 nips-2013-Compressive Feature Learning

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1

2 0.8714506 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon

Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1

3 0.86561543 339 nips-2013-Understanding Dropout

Author: Pierre Baldi, Peter J. Sadowski

Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1

4 0.86542332 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

same-paper 5 0.85926759 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

6 0.85602105 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

7 0.84915024 215 nips-2013-On Decomposing the Proximal Map

8 0.80170673 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

9 0.79633135 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

10 0.79496944 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

11 0.78218853 69 nips-2013-Context-sensitive active sensing in humans

12 0.78098166 5 nips-2013-A Deep Architecture for Matching Short Texts

13 0.78075838 251 nips-2013-Predicting Parameters in Deep Learning

14 0.78032458 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

15 0.77746069 30 nips-2013-Adaptive dropout for training deep neural networks

16 0.77134907 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

17 0.77040213 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

18 0.76856405 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

19 0.76765156 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

20 0.7610622 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents