jmlr jmlr2010 jmlr2010-117 knowledge-graph by maker-knowledge-mining

117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?


Source: pdf

Author: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training. Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. [sent-14, score-0.744]

2 Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. [sent-15, score-0.478]

3 Answering this question is important if learning in deep architectures is to be further improved. [sent-17, score-0.496]

4 The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training. [sent-21, score-0.753]

5 Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization. [sent-22, score-1.254]

6 They include learning methods for a wide array of deep architectures (Bengio, 2009 provides a survey), including neural networks with many hidden layers (Bengio et al. [sent-24, score-0.982]

7 These recent demonstrations of the potential of deep learning algorithms were achieved despite the serious challenge of training models with many layers of adaptive parameters. [sent-48, score-0.816]

8 The breakthrough to effective training strategies for deep architectures came in 2006 with the algorithms for training deep belief networks (DBN) (Hinton et al. [sent-51, score-1.169]

9 Each layer is pretrained with an unsupervised learning algorithm, learning a nonlinear transformation of its input (the output of the previous layer) that captures the main variations in its input. [sent-55, score-0.609]

10 This unsupervised pre-training sets the stage for a final training phase where the deep architecture is fine-tuned with respect to a supervised training criterion with gradient-based optimization. [sent-56, score-1.041]
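
To make the two-phase procedure concrete, here is a minimal schematic sketch in Python. The helper names (`unsupervised_pretrain`, `supervised_grad`) and the plain SGD update are illustrative assumptions, not the paper's actual implementation.

```python
def train_deep_network(X, y, layer_sizes, unsupervised_pretrain, supervised_grad,
                       lr=0.01, n_finetune_steps=10000):
    """Two-phase training: unsupervised pre-training, then supervised fine-tuning."""
    # Phase 1: layer-wise unsupervised pre-training provides the initialization.
    params = unsupervised_pretrain(X, layer_sizes)

    # Phase 2: fine-tune all layers by gradient-based optimization of a
    # supervised criterion (e.g., negative log-likelihood of the labels).
    for _ in range(n_finetune_steps):
        grads = supervised_grad(params, X, y)   # one (gW, gb) pair per layer
        params = [(W - lr * gW, b - lr * gb)
                  for (W, b), (gW, gb) in zip(params, grads)]
    return params
```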

11 The objective of this paper is to explore, through extensive experimentation, how unsupervised pre-training works to render learning in deep architectures more effective and why these methods appear to work so much better than traditional neural network training methods. [sent-58, score-0.933]

12 One possibility is that unsupervised pre-training acts as a kind of network pre-conditioner, putting the parameter values in the appropriate range for further supervised training. [sent-60, score-0.479]

13 Here, we argue that our experiments support a view of unsupervised pre-training as an unusual form of regularization: minimizing variance and introducing bias towards configurations of the parameter space that are useful for unsupervised learning. [sent-63, score-0.658]

14 We suggest that, in the highly non-convex situation of training a deep architecture, defining a particular initialization point implicitly imposes constraints on the parameters in that it specifies which minima (out of a very large number of possible minima) of the cost function are allowed. [sent-71, score-0.593]

15 Another important and distinct property of the unsupervised pre-training strategy is that in the standard situation of training using stochastic gradient descent, the beneficial generalization effects due to pre-training do not appear to diminish as the number of labeled examples grows very large. [sent-74, score-0.537]

16 In particular, unsupervised pre-training sets the parameters in a region from which better basins of attraction can be reached, in terms of generalization. [sent-77, score-0.458]

17 Hence, although unsupervised pre-training is a regularizer, it can have a positive effect on the training objective when the number of training examples is large. [sent-78, score-0.585]

18 As previously stated, this paper is concerned with an experimental assessment of the various competing hypotheses regarding the role of unsupervised pre-training in the recent success of deep learning methods. [sent-79, score-0.724]

19 In the first set of experiments (in Section 6), we establish the effect of unsupervised pre-training on improving the generalization error of trained deep architectures. [sent-81, score-0.812]

20 In the second set of experiments (in Section 7), we directly compare the two alternative hypotheses (pre-training as a pre-conditioner; and pre-training as an optimization scheme) against the hypothesis that unsupervised pre-training is a regularization strategy. [sent-83, score-0.46]

21 In the final set of experiments (in Section 8), we explore the role of unsupervised pre-training in the online learning setting, where the number of available training examples grows very large. [sent-84, score-0.464]

22 In these experiments, we test key aspects of our hypothesis relating to the topology of the cost function and the role of unsupervised pre-training in manipulating the region of parameter space from which supervised training is initiated. [sent-85, score-0.611]

23 Before delving into the experiments, we begin with a more in-depth view of the challenges in training deep architectures and how we believe unsupervised pre-training works towards overcoming these challenges. [sent-86, score-0.933]

24 The Challenges of Deep Learning: In this section, we present a perspective on why standard training of deep models through gradient backpropagation appears to be so difficult. [sent-88, score-0.519]

25 We believe the central challenge in training deep architectures is dealing with the strong dependencies that exist during training between the parameters across layers. [sent-90, score-0.712]

26 Adapt the lower layers in order to provide adequate input to the final (end-of-training) setting of the upper layers. [sent-92, score-0.676]

27 Furthermore, because with enough capacity the top two layers can easily overfit the training set, training error does not necessarily reveal the difficulty in optimizing the lower layers. [sent-98, score-0.582]

28 A separate but related issue appears if we focus our consideration of traditional training methods for deep architectures on stochastic gradient descent. [sent-100, score-0.678]

29 An important consequence of this phenomenon is that even in the presence of a very large (effectively infinite) amount of supervised data, stochastic gradient descent is subject to a degree of overfitting to the training data presented early in the training process. [sent-107, score-0.48]

30 In that sense, unsupervised pre-training interacts intimately with the optimization process, and when the number of training examples becomes large, its positive effect is seen not only on generalization error but also on training error. [sent-108, score-0.611]

31 Unsupervised Pre-training Acts as a Regularizer: As stated in the introduction, we believe that greedy layer-wise unsupervised pre-training overcomes the challenges of deep learning by introducing a useful prior into the supervised fine-tuning training procedure. [sent-113, score-0.893]

32 We believe the credit for its success can be attributed to the unsupervised training criteria optimized during unsupervised pre-training. [sent-122, score-0.766]

33 During each phase of the greedy unsupervised training strategy, layers are trained to represent the dominant factors of variation extant in the data. [sent-123, score-0.822]

34 In the context of deep learning, the greedy unsupervised strategy may also have a special function. [sent-128, score-0.699]

35 In these models the data is first transformed into a new representation using unsupervised learning, and a supervised classifier is stacked on top, learning to map the data in this new representation into class predictions. [sent-165, score-0.464]

36 In the context of deep architectures, a very interesting application of these ideas involves adding an unsupervised embedding criterion at each layer (or only one intermediate layer) to a traditional supervised criterion (Weston et al. [sent-171, score-1.065]

37 In the context of scarcity of labelled data (and abundance of unlabelled data), deep architectures have shown promise as well. [sent-174, score-0.496]

38 The section includes a description of the deep architectures used, the data sets and the details necessary to reproduce our results. [sent-192, score-0.496]

39 The methods proposed in the literature for training deep architectures have something in common: they rely on an unsupervised learning algorithm that provides a training signal at the level of a single layer. [sent-201, score-1.041]

40 In a first phase, unsupervised pre-training, all layers are initialized using this layer-wise unsupervised learning signal. [sent-203, score-0.996]

41 The unsupervised pre-training is done in a greedy layer-wise fashion: at stage k, the k-th layer is trained (with respect to an unsupervised criterion) using as input the output of the previous layer, while the previous layers are kept fixed. [sent-208, score-1.323]
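
A sketch of this greedy stage-wise loop, assuming a hypothetical `train_one_layer` routine that fits a single layer to its input under some unsupervised criterion (an RBM or denoising auto-encoder in the paper); the sigmoid propagation step mirrors the unit definition used by the DBN layers.

```python
import numpy as np

def greedy_pretrain(X, layer_sizes, train_one_layer):
    """Greedy layer-wise unsupervised pre-training.

    At stage k, layer k is trained on the representation produced by the
    already trained (and now fixed) layers 1..k-1; the raw data plays the
    role of h^0.
    """
    params = []
    H = X                                            # h^0: the training data
    for n_hidden in layer_sizes:
        W, b = train_one_layer(H, n_hidden)          # unsupervised criterion only
        params.append((W, b))
        H = 1.0 / (1.0 + np.exp(-(H @ W + b)))       # fixed layer's output feeds stage k+1
    return params
```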

42 We shall consider two deep architectures as representatives of two families of models encountered in the deep learning literature. [sent-209, score-0.866]

43 Consider the first layer of the DBN trained as an RBM P_1 with hidden layer h^1 and visible layer v^1. [sent-229, score-0.985]

44 The number of layers can be increased greedily, with the newly added top layer trained as an RBM to model the samples produced by chaining the posteriors P(h^k | h^{k-1}) of the lower layers (starting from h^0 taken from the training data set). [sent-232, score-1.111]
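
For illustration, one contrastive-divergence (CD-1) update for a binary RBM might look as follows; this is a textbook-style sketch rather than the authors' code, and the learning rate and sampling details are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V0, W, b, c, lr=0.01, rng=np.random):
    """One CD-1 step for a binary RBM.

    V0 : (batch, n_vis) visible batch, e.g., samples propagated up from lower layers
    W  : (n_vis, n_hid) weights; b : (n_vis,) visible biases; c : (n_hid,) hidden biases
    """
    ph0 = sigmoid(V0 @ W + c)                          # P(h=1 | v0)
    h0 = (rng.random_sample(ph0.shape) < ph0) * 1.0    # sample a hidden configuration
    pv1 = sigmoid(h0 @ W.T + b)                        # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + c)                         # P(h=1 | v1)
    W += lr * (V0.T @ ph0 - pv1.T @ ph1) / V0.shape[0]
    b += lr * (V0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```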

45 The i-th unit of the k-th layer of the neural network outputs ĥ_{k,i} = sigmoid(c_{k,i} + Σ_j W_{k,ij} ĥ_{k-1,j}), using the parameters c_k and W_k of the k-th layer of the DBN. [sent-234, score-0.56]
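
In plain numpy, that unit computation is simply (a sketch; the variable names are ours, not the paper's):

```python
import numpy as np

def layer_output(h_prev, W_k, c_k):
    """h_k = sigmoid(c_k + W_k h_{k-1}), elementwise over the layer's units.

    h_prev : (n_{k-1},) output of layer k-1 (the input x when k = 1)
    W_k    : (n_k, n_{k-1}) weights of layer k;  c_k : (n_k,) biases
    """
    return 1.0 / (1.0 + np.exp(-(c_k + W_k @ h_prev)))
```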

46 It has been shown on an array of data sets to perform significantly better than ordinary auto-encoders and similarly to or better than RBMs when stacked into a deep supervised architecture (Vincent et al. [sent-252, score-0.545]
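
As a rough sketch of the denoising criterion (zero-masking corruption, tied weights, and a cross-entropy reconstruction loss are assumptions consistent with the general setup, not necessarily the exact configuration used):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_step(X, W, b, c, corruption=0.25, lr=0.01, rng=np.random):
    """One gradient step of a tied-weight denoising auto-encoder.

    The input is corrupted, encoded, decoded, and the reconstruction is
    compared against the *uncorrupted* input with a cross-entropy loss.
    """
    X_tilde = X * ((rng.random_sample(X.shape) > corruption) * 1.0)  # corrupt input
    H = sigmoid(X_tilde @ W + c)          # encoder: hidden representation
    Z = sigmoid(H @ W.T + b)              # decoder: reconstruction of X
    dZ = Z - X                            # grad of cross-entropy wrt decoder pre-activation
    dH = (dZ @ W) * H * (1.0 - H)         # backprop through the encoder
    gW = X_tilde.T @ dH + dZ.T @ H        # tied weights: encoder + decoder contributions
    W -= lr * gW / X.shape[0]
    b -= lr * dZ.mean(axis=0)
    c -= lr * dH.mean(axis=0)
    return W, b, c
```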

47 With either DBN or SDAE, an output logistic regression layer is added after unsupervised training. [sent-272, score-0.609]

48 This layer uses softmax (multinomial logistic regression) units to estimate P(class|x) = softmax_class(a), where a_i is a linear combination of outputs from the top hidden layer. [sent-273, score-0.465]
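
A minimal sketch of that output layer (`V`, `b_out`, and `h_top` are assumed names for the output weights, output biases, and top hidden layer activations):

```python
import numpy as np

def class_posteriors(h_top, V, b_out):
    """P(class = i | x) = softmax_i(a),  with a_i = b_out_i + V[i] . h_top."""
    a = b_out + V @ h_top          # linear combination of top hidden layer outputs
    a = a - a.max()                # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()
```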

49 Each hidden layer contains the same number of hidden units, which is a hyperparameter. [sent-297, score-0.476]

50 The experiments involve the training of deep architectures with a variable number of layers with and without unsupervised pre-training. [sent-307, score-1.271]

51 Note also that when we write of the number of layers, it is to be understood as the number of hidden layers in the network. [sent-316, score-0.774]

52 Figure 1: Effect of depth on performance for a model trained (left) without unsupervised pre-training and (right) with unsupervised pre-training, for 1 to 5 hidden layers (networks with 5 layers failed to converge to a solution without the use of unsupervised pre-training). [sent-354, score-1.91]

53 It is also interesting to note the low variance and small spread of errors obtained with 400 seeds with unsupervised pre-training: it suggests that unsupervised pre-training is robust with respect to the random initialization seed (the one used to initialize parameters before pre-training). [sent-370, score-0.82]

54 While the first layer filters do seem to correspond to localized features, the 2nd and 3rd layers are no longer as interpretable. [sent-385, score-0.654]

55 Second, different layers change differently: the first layer changes least, while supervised training has more effect when performed on the 3rd layer. [sent-391, score-0.852]

56 First layer weights seem to encode basic stroke-like detectors, second layer weights seem to detect digit parts, while top layer weights detect entire digits. [sent-394, score-1.038]

57 Visualization of Model Trajectories During Learning: Visualizing the learned features allows for a qualitative comparison of the training strategies for deep architectures. [sent-404, score-0.478]

58 This is consistent with the formalization of pre-training from Section 3, in which we described a theoretical justification for viewing unsupervised pre-training as a regularizer; there, the probability of the pre-training parameters landing in a basin of attraction is small. [sent-453, score-0.486]

59 Compare also with Figure 8, where 1-layer networks with unsupervised pre-training obtain higher training errors. [sent-482, score-0.487]

60 From this perspective, the advantage of unsupervised pre-training could be that it puts us in a region of parameter space where basins of attraction run deeper than when picking starting parameters at random. [sent-497, score-0.487]

61 Now it might also be the case that unsupervised pre-training puts us in a region of parameter space in which training error is not necessarily better than when starting at random (or possibly worse), but which systematically yields better generalization (test error). [sent-499, score-0.493]

62 Typically gradient descent training of the deep model is initialized with randomly assigned weights, small enough to be in the linear region of the parameter space (close to zero for most neural network and DBN models). [sent-506, score-0.59]
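
One conventional way to produce such a small, near-zero initialization is a uniform draw scaled by the fan-in; this is a common recipe offered for illustration, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def small_random_init(n_input, layer_sizes, seed=0):
    """Initialize weights close to zero so sigmoid units start in their linear regime."""
    rng = np.random.RandomState(seed)
    params, n_in = [], n_input
    for n_out in layer_sizes:
        W = rng.uniform(-1.0, 1.0, size=(n_in, n_out)) / np.sqrt(n_in)  # small weights
        b = np.zeros(n_out)                                             # zero biases
        params.append((W, b))
        n_in = n_out
    return params
```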

63 As can be seen in Figure 8, while for 1 hidden layer, unsupervised pre-training reaches lower training cost than no pre-training, hinting towards a better optimization, this is not necessarily the case for the deeper networks. [sent-542, score-0.564]

64 This brings us to the following result: unsupervised pre-training appears to have a similar effect to that of a good regularizer or a good “prior” on the parameters, even though no explicit regularization term is apparent in the cost being optimized. [sent-545, score-0.518]

65 It is conceivable that sampling from such a distribution in order to initialize a deep architecture might actually hurt its performance (compared to random initialization from a uniform distribution). [sent-548, score-0.896]

66 Like regularizers in general, unsupervised pre-training (in this case, with denoising auto-encoders) might thus be seen as decreasing the variance and introducing a bias (towards parameter configurations suitable for performing denoising). [sent-558, score-0.456]

67 In this experiment we explore the relationship between the number of units per layer and the effectiveness of unsupervised pre-training. [sent-564, score-0.696]

68 The hypothesis that unsupervised pre-training acts as a regularizer would suggest that we should see a trend of increasing effectiveness of unsupervised pre-training as the number of units per layer is increased. [sent-565, score-1.147]

69 What we observe is a more systematic effect: while unsupervised pre-training helps for larger layers and deeper networks, it also appears to hurt for too small networks. [sent-571, score-0.696]

70 Figure 9 also shows that DBNs behave qualitatively like SDAEs, in the sense that unsupervised pre-training of architectures with smaller layers hurts performance. [sent-572, score-0.829]

71 The effect can be explained in terms of the role of unsupervised pre-training as promoting input transformations (in the hidden layers) that are useful for capturing the main variations in the input distribution P(X). [sent-583, score-0.467]

72 When the hidden layers are small, it is less likely that the transformations needed for predicting Y are among those learned by unsupervised pre-training. [sent-585, score-0.765]

73 (2007) constrained the top layer of a deep network to have 20 units and measured the training error of networks with and without pre-training. [sent-590, score-0.895]

74 Experiment 5 (Comparing pre-training to L1 and L2 regularization): An alternative hypothesis would be that classical ways of regularizing could perhaps achieve the same effect as unsupervised pre-training. [sent-608, score-0.475]
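
For reference, the classical regularizers being compared against amount to adding a weight penalty to the supervised loss; a generic sketch (the penalty strengths are placeholders):

```python
import numpy as np

def penalized_loss(supervised_loss, weights, lam_l1=0.0, lam_l2=0.0):
    """Supervised loss plus classical L1 / L2 penalties on the weight matrices."""
    l1 = sum(np.abs(W).sum() for W in weights)        # L1: sum of absolute weights
    l2 = sum((W ** 2).sum() for W in weights)         # L2: sum of squared weights
    return supervised_loss + lam_l1 * l1 + lam_l2 * l2
```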

75 Summary of Findings (Experiments 1-5): So far, the results obtained from the previous experiments point towards a clear explanation of the effect of unsupervised pre-training: namely, that its effect is a regularization effect. [sent-618, score-0.457]

76 This is because the effectiveness of a canonical regularizer decreases as the data set grows, whereas the effectiveness of unsupervised pre-training as a regularizer is maintained as the data set grows. [sent-629, score-0.457]

77 This confirms the hypothesis that, even in an online setting, the optimization of deep networks is harder than that of shallow ones. [sent-641, score-0.505]

78 However it is entirely consistent with our interpretation, stated in our hypothesis, of the role of unsupervised pre-training in the online setting with stochastic gradient descent training on a non-convex objective function. [sent-666, score-0.579]

79 Experiment 8 (Pre-training only k layers): From Figure 11 we can see that unsupervised pre-training makes quite a difference for 3 layers, on InfiniteMNIST. [sent-699, score-0.464]

80 The setup is as follows: for both MNIST and InfiniteMNIST we pre-train only the bottom k layers and randomly initialize the top n − k layers in the usual way. [sent-701, score-0.676]

81 We pre-train the first layer, the first two layers and all three layers using RBMs and randomly initialize the other layers; we also compare with the network whose layers are all randomly initialized. [sent-709, score-1.014]

82 We pre-train the first layer, the first two layers or all three layers using denoising auto-encoders and leave the rest of the network randomly initialized. [sent-711, score-0.775]
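
Schematically, the setup of this experiment can be sketched as below; `pretrain` stands in for either the RBM or the denoising auto-encoder routine, and the random-initialization scale is a placeholder.

```python
import numpy as np

def init_bottom_k(X, layer_sizes, k, pretrain, seed=0):
    """Pre-train the bottom k layers; randomly initialize the top n - k layers."""
    rng = np.random.RandomState(seed)
    params, H, n_in = [], X, X.shape[1]
    for i, n_hidden in enumerate(layer_sizes):
        if i < k:                                      # unsupervised pre-training
            W, b = pretrain(H, n_hidden)
            H = 1.0 / (1.0 + np.exp(-(H @ W + b)))     # feed the next pre-trained stage
        else:                                          # usual small random initialization
            W = rng.uniform(-1.0, 1.0, (n_in, n_hidden)) / np.sqrt(n_in)
            b = np.zeros(n_hidden)
        params.append((W, b))
        n_in = n_hidden
    return params
```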

83 Discussion and Conclusions: We have shown that unsupervised pre-training adds robustness to a deep architecture. [sent-723, score-0.699]

84 One of our findings is that deep networks with unsupervised pre-training seem to exhibit some properties of a regularizer: with small enough layers, pre-trained deep architectures are systematically worse than randomly initialized deep architectures. [sent-729, score-1.651]

85 Moreover, when the layers are big enough, the pre-trained models obtain worse training errors, but better generalization performance. [sent-730, score-0.472]

86 Additionally, we have re-done an experiment which purportedly showed that unsupervised pre-training can be explained with an optimization hypothesis and observed a regularization effect instead. [sent-731, score-0.475]

87 Finally, the other important set of results shows that unsupervised pre-training acts like a variance reduction technique, yet a network with pre-training has a lower training error on a very large data set, which supports an optimization interpretation of the effect of pre-training. [sent-735, score-0.477]

88 One of these consequences is that early examples have a big influence on the outcome of training and this is one of the reasons why in a large-scale setting the influence of unsupervised pre-training is still present. [sent-740, score-0.5]

89 Throughout this paper, we have dwelt on the idea that the basin of attraction induced by the early examples (in conjunction with unsupervised pre-training) is, for all practical purposes, a basin from which supervised training does not escape. [sent-741, score-0.834]

90 Basically, unsupervised pre-training favors hidden units that compute features of the input X that correspond to major factors of variation in the true P(X). [sent-744, score-0.514]

91 We hypothesize that the presence of the bottleneck is a crucial element that distinguishes the deep auto-encoders from the deep classifiers studied here. [sent-754, score-0.74]

92 Our results suggest that optimization in deep networks is a complicated problem that is influenced in great part by the early examples during training. [sent-759, score-0.483]

93 If this disentangling hypothesis is correct, it would help to explain how unsupervised pre-training can address the chicken-and-egg issue explained in Section 2: the lower layers of a supervised deep architecture need the upper layers to define what they should extract, and vice-versa. [sent-771, score-1.559]

94 Instead, the lower layers can extract robust and disentangled representations of the factors of variation and the upper layers select and combine the appropriate factors (sometimes not all at the top hidden layer). [sent-772, score-0.774]

95 Instead, in the case of a single hidden layer, less information about Y would have been dropped (if at all), making the job of the supervised output layer easier. [sent-784, score-0.464]

96 This is consistent with reported results showing that, for several data sets, supervised fine-tuning significantly reduces classification error when the output layer only takes input from the top hidden layer. [sent-786, score-0.464]

97 This hypothesis is also consistent with the observation made here (Figure 1) that unsupervised pre-training actually does not help (and can hurt) for too deep networks. [sent-787, score-0.757]

98 Our conviction is that devising improved strategies for learning in deep architectures requires a more profound understanding of the difficulties that we face with them. [sent-796, score-0.496]

99 An empirical evaluation of deep architectures on problems with many factors of variation. [sent-923, score-0.496]

100 Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. [sent-956, score-0.786]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('deep', 0.37), ('layers', 0.338), ('unsupervised', 0.329), ('layer', 0.28), ('engio', 0.229), ('eep', 0.149), ('ourville', 0.149), ('rhan', 0.149), ('elp', 0.141), ('anzagol', 0.127), ('incent', 0.127), ('architectures', 0.126), ('nsupervised', 0.12), ('oes', 0.12), ('training', 0.108), ('infinitemnist', 0.108), ('hy', 0.1), ('denoising', 0.099), ('hidden', 0.098), ('bengio', 0.097), ('basin', 0.091), ('rbm', 0.089), ('units', 0.087), ('supervised', 0.086), ('yoshua', 0.082), ('ranzato', 0.076), ('initialization', 0.076), ('mnist', 0.073), ('trajectories', 0.071), ('larochelle', 0.071), ('rbms', 0.071), ('vincent', 0.069), ('dbn', 0.067), ('nll', 0.066), ('attraction', 0.066), ('regularizer', 0.064), ('erhan', 0.064), ('pretraining', 0.064), ('sdae', 0.064), ('early', 0.063), ('hypothesis', 0.058), ('yann', 0.058), ('seeds', 0.057), ('visualizations', 0.057), ('hinton', 0.057), ('networks', 0.05), ('dumitru', 0.05), ('stacked', 0.049), ('geoffrey', 0.049), ('earning', 0.048), ('regularization', 0.048), ('trained', 0.047), ('rk', 0.045), ('million', 0.044), ('lters', 0.043), ('weights', 0.042), ('descent', 0.041), ('gradient', 0.041), ('hi', 0.041), ('effect', 0.04), ('salakhutdinov', 0.04), ('architecture', 0.04), ('minima', 0.039), ('depth', 0.038), ('weston', 0.038), ('belief', 0.037), ('apparent', 0.037), ('seem', 0.036), ('andrew', 0.036), ('qualitatively', 0.036), ('aaron', 0.035), ('courville', 0.035), ('umontreal', 0.035), ('generative', 0.033), ('basins', 0.033), ('stochastic', 0.033), ('lasserre', 0.032), ('uenced', 0.031), ('dbns', 0.031), ('region', 0.03), ('pascal', 0.03), ('seed', 0.029), ('visualizing', 0.029), ('hugo', 0.029), ('nips', 0.029), ('deeper', 0.029), ('regularizers', 0.028), ('capacity', 0.028), ('pre', 0.027), ('biases', 0.027), ('regions', 0.027), ('online', 0.027), ('editors', 0.026), ('generalization', 0.026), ('contrastive', 0.026), ('jason', 0.026), ('hk', 0.026), ('trajectory', 0.026), ('lecun', 0.026), ('hypotheses', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

Author: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training. Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization

2 0.40339333 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol

Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising

3 0.09773618 93 jmlr-2010-PyBrain

Author: Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber

Abstract: PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easyto-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms. Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks. Keywords: Python, neural networks, reinforcement learning, optimization

4 0.08162722 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

Author: Gideon S. Mann, Andrew McCallum

Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields

5 0.071331382 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley

Abstract: The principle of parsimony also known as “Ockham’s razor” has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length

6 0.067881547 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

7 0.057745602 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

8 0.052725602 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

9 0.05026044 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

10 0.049823202 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

11 0.048656166 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

12 0.048479725 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

13 0.047392018 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

14 0.047183972 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

15 0.045874886 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

16 0.043948416 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

17 0.043408524 53 jmlr-2010-Inducing Tree-Substitution Grammars

18 0.042794883 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks

19 0.042458594 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

20 0.041144632 22 jmlr-2010-Classification Using Geometric Level Sets


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.213), (1, 0.037), (2, -0.16), (3, 0.091), (4, 0.065), (5, 0.188), (6, 0.203), (7, -0.257), (8, 0.542), (9, 0.11), (10, -0.078), (11, -0.153), (12, -0.077), (13, -0.064), (14, -0.078), (15, -0.045), (16, -0.074), (17, -0.063), (18, 0.053), (19, 0.118), (20, -0.049), (21, -0.048), (22, 0.028), (23, 0.03), (24, -0.029), (25, 0.029), (26, 0.109), (27, -0.035), (28, 0.023), (29, -0.052), (30, -0.065), (31, -0.093), (32, 0.032), (33, -0.003), (34, 0.048), (35, -0.04), (36, 0.041), (37, -0.04), (38, -0.039), (39, -0.028), (40, 0.004), (41, 0.035), (42, -0.024), (43, 0.02), (44, -0.015), (45, 0.009), (46, -0.018), (47, 0.0), (48, 0.02), (49, -0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94273901 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

Author: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training. Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization

2 0.92453963 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol

Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising

3 0.47455034 93 jmlr-2010-PyBrain

Author: Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber

Abstract: PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easyto-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms. Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks. Keywords: Python, neural networks, reinforcement learning, optimization

4 0.31089795 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations

Author: Valero Laparra, Juan Gutiérrez, Gustavo Camps-Valls, Jesús Malo

Abstract: A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding its straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. In the proposed scheme, training considers minimizing the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions (or histograms) of signal and noise in order to enforce similarity up to the higher (computationally estimable) order. Due to its non-parametric nature, the method can eventually cope with different noise sources without the need of an explicit re-formulation, as it is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence, (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian, and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered. Therefore, the proposed machine learning approach can be seen as a mor

5 0.25352922 37 jmlr-2010-Evolving Static Representations for Task Transfer

Author: Phillip Verbancsics, Kenneth O. Stanley

Abstract: An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.

6 0.25242853 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

7 0.23247859 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

8 0.23072642 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.22608078 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

10 0.21610495 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

11 0.21411365 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

12 0.21287203 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

13 0.19770019 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

14 0.1872647 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

15 0.18498507 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

16 0.18099734 53 jmlr-2010-Inducing Tree-Substitution Grammars

17 0.17897548 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

18 0.17445908 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

19 0.17408147 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

20 0.172681 22 jmlr-2010-Classification Using Geometric Level Sets


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.352), (3, 0.015), (4, 0.013), (8, 0.019), (15, 0.016), (21, 0.029), (24, 0.014), (32, 0.061), (33, 0.012), (36, 0.041), (37, 0.051), (39, 0.035), (75, 0.147), (81, 0.021), (85, 0.069), (96, 0.023), (97, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73144084 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

Author: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training. Keywords: deep architectures, unsupervised pre-training, deep belief networks, stacked denoising auto-encoders, non-convex optimization

2 0.56769228 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol

Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising

3 0.45943698 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

4 0.45871276 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

5 0.45837989 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

6 0.45803633 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

7 0.45653379 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

8 0.45420122 63 jmlr-2010-Learning Instance-Specific Predictive Models

9 0.45388645 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

10 0.45228815 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

11 0.45212838 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

12 0.45109668 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

13 0.44948635 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

14 0.44932848 109 jmlr-2010-Stochastic Composite Likelihood

15 0.44929248 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

16 0.44764656 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

17 0.44756925 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

18 0.44704974 102 jmlr-2010-Semi-Supervised Novelty Detection

19 0.44556174 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis

20 0.44487786 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices