jmlr jmlr2012 jmlr2012-78 knowledge-graph by maker-knowledge-mining

78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information


Source: pdf

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. [sent-10, score-0.934]

2 However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. [sent-11, score-0.298]

3 It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. [sent-12, score-0.627]

4 Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. [sent-13, score-0.302]

5 Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. [sent-14, score-0.869]

6 We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. [sent-16, score-0.356]

7 Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning 1. [sent-18, score-0.369]

8 In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. [sent-26, score-0.319]

9 Neural networks have proven to be an effective way to perform such processing, and autoencoder neural networks, specifically, have been used to find representations for a variety of downstream machine learning tasks, for example, image classification (Vincent et al. [sent-30, score-0.675]

10 The critical insight of the autoencoder neural network is the idea of using a constrained (typically either sparse or low-dimensional) representation within a feedforward neural network. [sent-34, score-0.749]

11 (2007) introduced weak supervision into the autoencoder training objective by adding label-specific output units in addition to the reconstruction. [sent-41, score-0.719]

12 The difficulty of this approach is that it complicates the task of learning the autoencoder representation. [sent-43, score-0.627]

13 Here we propose a different take on the issue of introducing supervised guidance into autoencoder representations. [sent-47, score-0.869]

14 We consider Gaussian process priors on the discriminative function that maps the latent codes into labels. [sent-48, score-0.297]

15 We are then able to combine the efficient parametric feed-forward aspects of the autoencoder with a flexible Bayesian nonparametric model for the labels. [sent-51, score-0.762]

16 This also leads to an interesting interpretation of the back-constrained GPLVM itself as a limiting case of an autoencoder in which the decoder has been marginalized out. [sent-52, score-0.743]

17 We also examine a data set that highlights the value of our approach, in which we can not only use guidance from desired labels, but also introduce guidance away from irrelevant representations. [sent-54, score-0.44]

18 Unsupervised Learning of Latent Representations The nonparametrically-guided autoencoder presented in this paper is motivated largely by the relationship between two different approaches to latent variable modeling. [sent-56, score-0.817]

19 In this section, we review these two approaches, the GPLVM and autoencoder neural network, and examine precisely how they are related. [sent-57, score-0.653]

20 , 1987) is a neural network architecture that is designed to create a latent representation that is informative of the input data. [sent-60, score-0.286]

21 Through training the model to reproduce the input data at its output, a latent embedding must arise within the hidden layer of the model. [sent-61, score-0.324]

22 However, these are difficult to learn because a trivial minimum of the autoencoder reconstruction objective is reached when the autoencoder learns the identity transformation. [sent-77, score-1.254]

23 The denoising autoencoder forces the model to learn more interesting structure from the data by providing as input a corrupted training example, while evaluating reconstruction on the noiseless original. [sent-78, score-0.699]
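
To make the reconstruction objective and the denoising corruption described here concrete, below is a minimal numpy sketch of a single-hidden-layer denoising autoencoder with tied encoder/decoder weights. The layer sizes, masking-noise level, learning rate, and sigmoid activations are illustrative assumptions rather than the configuration used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_step(X, W, b_enc, b_dec, corruption=0.2):
        # Corrupt the input, but measure reconstruction against the clean X.
        mask = rng.random(X.shape) > corruption
        X_tilde = X * mask
        H = sigmoid(X_tilde @ W + b_enc)          # constrained latent codes
        R = H @ W.T + b_dec                       # reconstruction (tied weights)
        diff = R - X
        loss = 0.5 * np.mean(np.sum(diff ** 2, axis=1))
        dR = diff / X.shape[0]                    # grad of loss w.r.t. R
        dA = (dR @ W) * H * (1.0 - H)             # grad w.r.t. encoder pre-activation
        dW = X_tilde.T @ dA + dR.T @ H            # tied weights get both contributions
        return loss, dW, dA.sum(axis=0), dR.sum(axis=0)

    # Toy run on random data; real inputs would be pixels, patches, etc.
    X = rng.random((64, 100))
    W = 0.01 * rng.standard_normal((100, 25))
    b_enc, b_dec = np.zeros(25), np.zeros(100)
    for _ in range(200):
        loss, dW, db_enc, db_dec = dae_step(X, W, b_enc, b_dec)
        W -= 0.1 * dW
        b_enc -= 0.1 * db_enc
        b_dec -= 0.1 * db_dec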

24 Such a manifold is difficult to define a priori, however, and thus the problem is often framed as learning the latent embedding under an assumed smooth functional mapping between the visible and latent spaces. [sent-83, score-0.46]

25 Using a Gaussian process prior, the GPLVM marginalizes over the infinite possible mappings from the latent to visible spaces and optimizes the latent embedding over a distribution of mappings. [sent-86, score-0.438]

26 Not only does this introduce arbitrary gaps in the latent manifold, but it also complicates the encoding of novel data points into the latent space as there is no direct mapping. [sent-136, score-0.408]

27 The latent representations of out-of-sample data must thus be optimized, conditioned on the latent embedding of the training examples. [sent-137, score-0.46]

28 The NeuroScale algorithm is a radial basis function network that creates a one-way mapping from data to a latent space using a heuristic loss that attempts to preserve pairwise distances between data cases. [sent-143, score-0.277]

29 An interesting and overlooked consequence of this relationship is that it establishes a connection between autoencoders and the back-constrained Gaussian process latent variable model. [sent-148, score-0.34]

30 A GPLVM with the covariance function of Williams (1998), although it does not impose a density over the data, is similar to a density network (MacKay, 1994) with an infinite number of hidden units in the single hidden layer. [sent-149, score-0.403]

31 We can transform this density network into a semiparametric autoencoder by applying a neural network as the backconstraint network of the GPLVM. [sent-150, score-0.773]

32 The encoder of the resulting model is a parametric neural network and the decoder a Gaussian process. [sent-151, score-0.372]
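
Read this way, the back-constrained GPLVM becomes a semiparametric autoencoder: a parametric feed-forward encoder produces latent codes, and the "decoder" is a Gaussian process whose mapping back to the visible space is marginalized out, leaving only a GP marginal-likelihood term in the codes. The sketch below illustrates that pairing; the one-layer tanh encoder, RBF covariance, and noise level are placeholder choices, not the paper's exact covariance function.

    import numpy as np

    rng = np.random.default_rng(2)

    def encode(Y, W, b):
        """Parametric back-constraint / encoder: data -> latent codes."""
        return np.tanh(Y @ W + b)

    def gp_decoder_nll(X, Y, lengthscale=1.0, noise=0.1):
        """GP 'decoder': negative log marginal likelihood of the data Y given
        latent codes X, with the latent-to-visible mapping integrated out."""
        N, D = Y.shape
        sq = np.sum(X ** 2, axis=1)
        K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T) / lengthscale ** 2)
        K += noise * np.eye(N)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
        return 0.5 * (D * 2 * np.sum(np.log(np.diag(L)))
                      + np.sum(Y * alpha) + N * D * np.log(2 * np.pi))

    Y = rng.standard_normal((40, 20))          # toy high-dimensional data
    W = 0.1 * rng.standard_normal((20, 3))     # encoder weights (learned in practice)
    b = np.zeros(3)
    X = encode(Y, W, b)                        # latent codes from the encoder
    print(gp_decoder_nll(X, Y))                # objective to minimize w.r.t. W, b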

33 After training, the decoder network of an autoencoder is generally superfluous. [sent-157, score-0.783]

34 Thus, for very high dimensional data, a standard autoencoder may be more desirable. [sent-162, score-0.627]

35 Supervised Guidance of Latent Representations Unsupervised learning has proven to be effective for learning latent representations that excel in discriminative tasks. [sent-164, score-0.319]

36 (2007) demonstrated, for example, that while a purely supervised signal can lead to overfitting, mild supervised guidance can be beneficial when initializing a discriminative deep neural network. [sent-167, score-0.411]

37 (2007) proposed a hybrid approach under which the unsupervised model’s latent representation is also trained to predict the label information, by adding a parametric mapping c(x ; Λ) : X → Z from the latent space X to the labels Z and backpropagating error gradients from the output. [sent-169, score-0.66]
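
One way to picture this hybrid objective is a shared latent layer feeding both a reconstruction head and a parametric label map c(x ; Λ), with gradients from the label head backpropagated into the encoder. The sketch below only evaluates such a blended loss (softmax label head, squared-error reconstruction, arbitrary weighting); the exact parametric form and weighting used in the referenced work may differ.

    import numpy as np

    rng = np.random.default_rng(3)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        e = np.exp(a - a.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def hybrid_loss(Xin, labels, W, b, V, c, Lmb, d, alpha=0.5):
        """Reconstruction loss plus a parametric label head on the latent codes;
        gradients from both terms would reach the encoder parameters W, b."""
        H = sigmoid(Xin @ W + b)              # shared latent codes
        R = H @ V + c                         # reconstruction head
        P = softmax(H @ Lmb + d)              # parametric map latent -> labels
        recon = 0.5 * np.mean(np.sum((R - Xin) ** 2, axis=1))
        nll = -np.mean(np.log(P[np.arange(len(labels)), labels] + 1e-12))
        return alpha * recon + (1 - alpha) * nll

    Xin = rng.random((32, 50))
    labels = rng.integers(0, 3, size=32)
    W = 0.01 * rng.standard_normal((50, 10)); b = np.zeros(10)
    V = 0.01 * rng.standard_normal((10, 50)); c = np.zeros(50)
    Lmb = 0.01 * rng.standard_normal((10, 3)); d = np.zeros(3)
    print(hybrid_loss(Xin, labels, W, b, V, c, Lmb, d))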

38 This “partial supervision” thus encourages the model to encode statistics within the latent representation that are useful for a specific (but learned) parameterization of such a linear mapping. [sent-172, score-0.28]

39 The assumption of a specific parametric form for the mapping c(x ; Λ) restricts the supervised guidance to classifiers within that family of mappings. [sent-175, score-0.372]

40 At every iteration t of descent (with current state φt , ψt , Λt ), the gradient from supervised guidance encourages the latent representation (currently parametrized by φt , ψt ) to become more predictive of the labels under the current label map c(x ; Λt ). [sent-178, score-0.564]

41 That is, rather than learning a latent representation that is tied to a specific parameterized mapping to the labels, we would instead prefer to find a latent representation that is consistent with an entire class of mappings. [sent-183, score-0.487]
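
The "entire class of mappings" idea can be written as a GP marginal likelihood: place a GP prior on functions from the latent codes to the labels and score the codes by how well some smooth function could explain the labels, without ever instantiating one. The sketch below treats one-hot class labels as GP regression targets with an RBF covariance over the codes; that treatment and the hyperparameter values are simplifying assumptions, not the paper's exact construction.

    import numpy as np

    def gp_guidance_nll(H, Z, lengthscale=1.0, noise=0.1):
        """Negative log marginal likelihood of label targets Z under a GP prior
        on functions from latent codes H to labels; no specific label map is
        ever instantiated, only its marginal likelihood guides the codes."""
        N, C = Z.shape
        sq = np.sum(H ** 2, axis=1)
        K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * H @ H.T) / lengthscale ** 2)
        K += noise * np.eye(N)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, Z))
        return 0.5 * (C * 2 * np.sum(np.log(np.diag(L)))
                      + np.sum(Z * alpha) + N * C * np.log(2 * np.pi))

    rng = np.random.default_rng(4)
    H = rng.standard_normal((30, 5))          # latent codes from the encoder
    y = rng.integers(0, 3, size=30)
    Z = np.eye(3)[y]                          # one-hot label targets (a simplification)
    print(gp_guidance_nll(H, Z))              # analogue of the L_GP guidance term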

42 The result is a hybrid of the autoencoder and back-constrained GPLVM, where the encoder is shared across models. [sent-189, score-0.734]

43 For notation, we will refer to this approach to guided latent representation as a nonparametrically guided autoencoder, or NPGA. [sent-190, score-0.295]

44 As is common for autoencoders and to reduce the number of free parameters in the model, the encoder and decoder weights are tied. [sent-196, score-0.348]

45 denoising autoencoder variant of Vincent et al. [sent-199, score-0.674]

46 That is, we update the denoising autoencoder noise every three iterations of conjugate gradient descent optimization. [sent-201, score-0.674]

47 Note also that when the salient variations of the data are not relevant to a given discriminative task, the initial RBM training will not encourage the encoding of the discriminative information in the latent representation. [sent-210, score-0.415]

48 The NPGA circumvents these issues by applying a GP to small mini-batches during the learning of the latent representation and using the GP to learn a representation that is better even for a linear discriminative model. [sent-211, score-0.332]

49 Salakhutdinov and Hinton (2007) combined autoencoder training with neighborhood component analysis (Goldberger et al. [sent-213, score-0.627]

50 , 2004), which encouraged the model to encode similar latent representations for inputs belonging to the same class. [sent-214, score-0.296]

51 Thus, the GPLVM enforces that examples close in label space will be closer in the latent representation than examples that are distant in label space. [sent-224, score-0.3]

52 Encoding such periodic signals in a parametric neural network and blending this with unsupervised learning can be challenging (Zemel et al. [sent-229, score-0.279]

53 In all experiments, the discriminative value of the learned representation is evaluated by training a linear (logistic) classifier, a standard practice for evaluating latent representations. [sent-244, score-0.324]

54 We use these data primarily to explore two questions: • To what extent does the nonparametric guidance of an unsupervised parametric autoencoder improve the learned feature representation with respect to the classification objective? [sent-249, score-1.078]

55 • What additional benefit is gained through using nonparametric guidance over simply incorporating a parametric mapping to the labels? [sent-250, score-0.402]

56 In order to address these concerns, we linearly blend our nonparametric guidance cost LGP (φ, Γ) with the one Bengio et al. [sent-251, score-0.272]

57 Thus, α allows us to adjust the relative contribution of the unsupervised guidance while β weighs the relative contributions of the parametric and nonparametric supervised guidance. [sent-253, score-0.421]
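
A minimal sketch of how such a linear blend can behave, consistent with the limiting cases quoted later in the text (α = 0 recovers the plain autoencoder; α = 1, β = 1 recovers a purely parametric classification network). The exact form of Equation 5 is given in the paper, so this is an illustrative reading rather than the paper's definition.

    def blended_objective(L_auto, L_param, L_gp, alpha, beta):
        """One reading of the blend: alpha moves from purely unsupervised
        (alpha = 0, plain autoencoder) to purely supervised (alpha = 1), and
        beta splits the supervised part between the parametric label head
        (beta = 1) and the nonparametric GP guidance (beta = 0)."""
        supervised = beta * L_param + (1.0 - beta) * L_gp
        return (1.0 - alpha) * L_auto + alpha * supervised

    # Sanity checks against the limiting cases quoted in the text.
    assert blended_objective(2.0, 5.0, 7.0, alpha=0.0, beta=0.5) == 2.0   # autoencoder only
    assert blended_objective(2.0, 5.0, 7.0, alpha=1.0, beta=1.0) == 5.0   # classification net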

58 05 was added to the inputs of the denoising autoencoder cost. [sent-274, score-0.699]

59 However, in Figure 1a we can see that some parametric guidance can be beneficial, presumably because it is from the same discriminative family as the final classifier. [sent-284, score-0.385]

60 We observe also that using a GP with a linear covariance function within the NPGA outperforms the parametric guidance (see Fig. [sent-287, score-0.372]

61 Full images: A one-layer autoencoder with 2400 NReLU units was trained on the raw data (which was reduced from 32×32×3 = 3072 to 400 dimensions using PCA). [sent-336, score-0.719]

62 28 × 28 patches: An autoencoder with 1500 logistic hidden units was trained on 28×28×3 patches subsampled from the full images, then reduced to 400 dimensions using PCA. [sent-339, score-0.898]

63 1600 NReLU units were used in the autoencoder but the GP was applied to only 400 of them. [sent-350, score-0.719]

64 When PCA preprocessing was used for autoencoder training, the inputs were corrupted with zero-mean Gaussian noise with standard deviation 0. [sent-353, score-0.677]

65 After training, a logistic regression classifier was applied to the features resulting from the hidden layer of each autoencoder to evaluate their quality with respect to the classification objective. [sent-360, score-0.766]
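
This evaluation protocol (freeze the encoder, fit a linear classifier on its hidden-layer features) is easy to reproduce; the sketch below uses scikit-learn's logistic regression on a toy encoder and random data purely for illustration, not the paper's actual data or trained weights.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Stand-ins for a trained encoder and labelled data.
    W, b = 0.1 * rng.standard_normal((100, 20)), np.zeros(20)
    X_train, y_train = rng.random((200, 100)), rng.integers(0, 5, size=200)
    X_test, y_test = rng.random((50, 100)), rng.integers(0, 5, size=50)

    # Encode, then fit a linear (logistic) classifier on the frozen features.
    H_train, H_test = sigmoid(X_train @ W + b), sigmoid(X_test @ W + b)
    clf = LogisticRegression(max_iter=1000).fit(H_train, y_train)
    print("feature quality (test accuracy):", clf.score(H_test, y_test))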

66 The use of different architectures, methodologies and hidden unit activations demonstrates that the nonparametric guidance can be beneficial for a wide variety of formulations. [sent-362, score-0.394]

67 Certainly, the squared pixel difference objective of the autoencoder will be affected more by significant lighting changes than by object categories. [sent-379, score-0.691]

68 In our empirical analysis we examine the following: As the autoencoder attempts to coalesce the various sources of structure into its hidden layer, can the NPGA guide the learning in such a way as to separate the class-invariant transformations of the data from the class-relevant information? [sent-382, score-0.728]

69 In order to separate the latent embedding of the salient information related to each label, the GPs were applied to disjoint subsets of the hidden units of the autoencoder. [sent-384, score-0.449]

70 Thus a GP mapping from a four dimensional latent space, H = 4, to class labels was applied to 1200 hidden units. [sent-386, score-0.374]

71 Finally, because the azimuth is a periodic signal, a periodic kernel was used for the azimuth GP. [sent-390, score-0.276]
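
Two details of this setup can be sketched jointly: the autoencoder's hidden units are split into disjoint groups, each guided by its own GP, and the GP for the periodic azimuth signal uses a periodic (exp-sine-squared) covariance so that angles a full period apart are treated as identical. The split sizes, kernel hyperparameters, and the choice to apply the periodic kernel to a one-dimensional code are illustrative assumptions, not the paper's exact configuration.

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0):
        d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def periodic_kernel(a, b, period=2 * np.pi, lengthscale=1.0):
        """Standard periodic (exp-sine-squared) kernel for angle-like signals
        such as azimuth, so that 0 and 2*pi are treated as the same pose."""
        d = np.abs(a[:, None] - b[None, :])
        return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

    rng = np.random.default_rng(6)
    H = rng.standard_normal((20, 16))         # hidden units for a mini-batch
    H_class, H_pose = H[:, :12], H[:, 12:]    # disjoint subsets guided by different GPs
    K_class = rbf_kernel(H_class, H_class)                # covariance for the class GP
    azimuth_code = H_pose[:, 0]                           # 1-D latent summary (assumption)
    K_azimuth = periodic_kernel(azimuth_code, azimuth_code)
    print(K_class.shape, K_azimuth.shape)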

72 Figure 3: Visualisations of the NORB training (top) and test (bottom) data latent space representations in the NPGA, corresponding to class (first column), elevation (second column), and lighting (third column). [sent-392, score-0.334]

73 To validate this configuration, we empirically compared it to a standard autoencoder (i. [sent-395, score-0.627]

74 , α = 0), an autoencoder with parametric logistic regression guidance and an NPGA with a single GP applied to all hidden units mapping to the class labels. [sent-397, score-1.208]

75 For denoising autoencoder training, the raw pixels were corrupted by setting 20% of pixels to zero in the inputs. [sent-402, score-0.699]

76 This implies that the half of the latent representation that encodes the information to which the model should be invariant can be discarded with virtually no discriminative penalty. [sent-439, score-0.302]

77 Given the significant difference in accuracy between this formulation and the other models, it appears to be very important to separate the encoding of different sources of variation within the autoencoder hidden layer. [sent-440, score-0.756]

78 An autoencoder with parametric guidance to all four labels, mimicking the configuration of the NPGA, achieved the poorest performance of the models tested, with 86% accuracy. [sent-446, score-0.93]

79 Rehabilitation patients benefit from performing repetitive rehabilitation exercises as frequently as possible but are limited due to a shortage of rehabilitation therapists. [sent-454, score-0.293]

80 (2012) developed a system to automate the role of a therapist guiding rehabilitation patients through repetitive upper limb rehabilitation exercises. [sent-459, score-0.298]

81 In our analysis of this problem we use an NPGA to encode a latent embedding of postures that facilitates better discrimination between different posture types. [sent-482, score-0.307]

82 We interpolate between a standard autoencoder (α = 0), a classification neural net (α = 1, β = 1), and a nonparametrically guided autoencoder by linear blending of their objectives according to Equation 5. [sent-485, score-1.332]

83 (2012), to search over α ∈ [0, 1], β ∈ [0, 1], 10–1000 hidden units in the autoencoder and the GP latent dimensionality H ∈ {1. [sent-497, score-1.01]

84 Thus, in Figure 5 we explore how the relationship between validation error and the amount of nonparametric guidance α, and parametric guidance β is expected to change as the number of autoencoder hidden units is varied. [sent-509, score-1.415]
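
The hyperparameter study described here used Bayesian optimization; as a simple stand-in, the sketch below does a random search over α, β, and the number of hidden units and records validation error. The train_and_validate function is a hypothetical placeholder for training an NPGA and measuring validation error, not an API from the paper.

    import numpy as np

    rng = np.random.default_rng(7)

    def train_and_validate(alpha, beta, n_hidden):
        """Hypothetical stand-in: train an NPGA with these hyperparameters and
        return validation error; here a synthetic function for illustration."""
        return (alpha - 0.6) ** 2 + (beta - 0.3) ** 2 + 0.1 / np.sqrt(n_hidden)

    best = None
    for _ in range(50):                                # random search over the blend
        alpha, beta = rng.uniform(0, 1), rng.uniform(0, 1)
        n_hidden = rng.integers(10, 1001)              # 10-1000 hidden units
        err = train_and_validate(alpha, beta, n_hidden)
        if best is None or err < best[0]:
            best = (err, alpha, beta, n_hidden)
    print("best validation error and hyperparameters:", best)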

85 1, it seems clear that the best region in hyperparameter space is a combination of all three objectives, the parametric [sent-512, score-0.662]

86 Figure 5: The posterior mean learned by Bayesian optimization over the validation set classification error (in percent) for α and β with H fixed at 2 and three different settings of autoencoder hidden units: (a) 10, (b) 500, and (c) 1000. [sent-536, score-0.77]

87 This shows how the relationship between validation error and the amount of nonparametric guidance, α, and parametric guidance, β, is expected to change as the number of autoencoder hidden units is increased. [sent-537, score-0.975]

88 Also, as we increase the number of hidden units in the autoencoder, the amount of guidance required appears to decrease. [sent-541, score-0.413]

89 As the capacity of the autoencoder is increased, it is likely that the autoencoder encodes increasingly subtle statistical structure in the data. [sent-542, score-1.254]

90 When there are fewer hidden units, this structure is not encoded unless the autoencoder objective is augmented to reflect a preference for it. [sent-543, score-0.728]

91 With the best performing NPGA reported above, a nearest neighbors classifier applied to the hidden units of the autoencoder achieved an accuracy of 85. [sent-546, score-0.82]

92 This likely reflects the fact that the autoencoder must still encode information that is useful for reconstruction but not discrimination. [sent-549, score-0.661]

93 Conclusion In this paper we present an interesting theoretical link between the autoencoder neural network and the back-constrained Gaussian process latent variable model. [sent-554, score-0.908]

94 A particular formulation of the back-constrained GPLVM can be interpreted as an autoencoder in which the decoder has an infinite number of hidden units. [sent-555, score-0.844]

95 This formulation exhibits some attractive properties as it allows one to learn the encoder half of the autoencoder while marginalizing over decoders. [sent-556, score-0.734]

96 We examine the use of this model to guide the latent representation of an autoencoder to encode auxiliary label information without instantiating a parametric mapping to the labels. [sent-557, score-1.051]

97 The resulting nonparametric guidance encourages the autoencoder to encode a latent representation that captures salient structure within the input data that is harmonious with the labels. [sent-558, score-1.212]

98 Conceptually, this approach enforces simply that a smooth mapping exists from the latent representation to the labels rather than choosing or learning a specific parameterization. [sent-559, score-0.303]

99 The approach is empirically validated on four data sets, demonstrating that the nonparametrically guided autoencoder encourages latent representations that are better with respect to a discriminative task. [sent-560, score-1.024]

100 We demonstrate on the NORB data that this model can also be used to discourage latent representations that capture statistical structure that is known to be irrelevant through guiding the autoencoder to separate multiple sources of variation. [sent-567, score-0.864]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('autoencoder', 0.627), ('npga', 0.343), ('gplvm', 0.3), ('guidance', 0.22), ('latent', 0.19), ('gp', 0.151), ('rehabilitation', 0.136), ('autoencoders', 0.125), ('decoder', 0.116), ('encoder', 0.107), ('hidden', 0.101), ('dams', 0.092), ('noek', 0.092), ('units', 0.092), ('periodic', 0.086), ('onparametric', 0.084), ('uidance', 0.084), ('parametric', 0.083), ('discriminative', 0.082), ('arochelle', 0.079), ('gps', 0.071), ('covariance', 0.069), ('norb', 0.067), ('taati', 0.067), ('lighting', 0.064), ('larochelle', 0.053), ('nonparametric', 0.052), ('hugo', 0.051), ('cifar', 0.05), ('coates', 0.05), ('posture', 0.05), ('denoising', 0.047), ('mapping', 0.047), ('representations', 0.047), ('salakhutdinov', 0.044), ('unsupervised', 0.044), ('azimuth', 0.042), ('huq', 0.042), ('jasper', 0.042), ('label', 0.04), ('gaussian', 0.04), ('network', 0.04), ('patches', 0.04), ('deep', 0.039), ('logistic', 0.038), ('hyperparameters', 0.037), ('labels', 0.036), ('robotic', 0.036), ('snoek', 0.036), ('encode', 0.034), ('elevation', 0.033), ('lauto', 0.033), ('lgp', 0.033), ('nair', 0.033), ('neuroscale', 0.033), ('nrelu', 0.033), ('rajibul', 0.033), ('sgplvm', 0.033), ('yk', 0.033), ('exponentiated', 0.033), ('salient', 0.033), ('embedding', 0.033), ('alpha', 0.032), ('lawrence', 0.03), ('bayesian', 0.03), ('aaron', 0.03), ('representation', 0.03), ('oil', 0.029), ('nonparametrically', 0.029), ('encoding', 0.028), ('bengio', 0.026), ('neural', 0.026), ('geoffrey', 0.026), ('limb', 0.026), ('urtasun', 0.026), ('recti', 0.026), ('ruslan', 0.026), ('convolutional', 0.026), ('encourages', 0.026), ('inputs', 0.025), ('corrupted', 0.025), ('process', 0.025), ('regressor', 0.025), ('guided', 0.023), ('supervised', 0.022), ('adams', 0.022), ('image', 0.022), ('learned', 0.022), ('stroke', 0.021), ('activations', 0.021), ('exercises', 0.021), ('kan', 0.021), ('qui', 0.021), ('skeletal', 0.021), ('ryan', 0.021), ('alex', 0.021), ('classi', 0.021), ('validation', 0.02), ('kernel', 0.02), ('neil', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

2 0.080089681 59 jmlr-2012-Linear Regression With Random Projections

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction

3 0.079384588 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio

Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning

4 0.075541802 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression

5 0.061898638 95 jmlr-2012-Random Search for Hyper-Parameter Optimization

Author: James Bergstra, Yoshua Bengio

Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword

6 0.051795695 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

7 0.048061702 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

8 0.045756567 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

9 0.041865967 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

10 0.038239803 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

11 0.037901945 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

12 0.028488681 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

13 0.02804153 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

14 0.027455948 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

15 0.027073052 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

16 0.026972992 106 jmlr-2012-Sign Language Recognition using Sub-Units

17 0.02658492 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso

18 0.024767034 82 jmlr-2012-On the Necessity of Irrelevant Variables

19 0.024263076 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

20 0.024254628 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.123), (1, 0.046), (2, 0.151), (3, -0.04), (4, 0.077), (5, -0.008), (6, 0.059), (7, 0.041), (8, -0.175), (9, 0.006), (10, 0.012), (11, 0.033), (12, -0.115), (13, 0.136), (14, -0.066), (15, 0.062), (16, -0.052), (17, 0.118), (18, 0.023), (19, 0.057), (20, 0.027), (21, -0.097), (22, 0.186), (23, -0.002), (24, 0.09), (25, 0.074), (26, 0.179), (27, -0.046), (28, -0.161), (29, -0.105), (30, 0.043), (31, -0.087), (32, -0.08), (33, 0.071), (34, 0.032), (35, 0.084), (36, -0.004), (37, -0.073), (38, 0.143), (39, -0.118), (40, 0.013), (41, -0.176), (42, 0.105), (43, 0.117), (44, 0.099), (45, -0.081), (46, 0.017), (47, 0.099), (48, -0.109), (49, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93451816 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

2 0.58296043 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio

Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning

3 0.5405829 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression

4 0.39927804 95 jmlr-2012-Random Search for Hyper-Parameter Optimization

Author: James Bergstra, Yoshua Bengio

Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword

5 0.3716473 59 jmlr-2012-Linear Regression With Random Projections

Author: Odalric-Ambrym Maillard, Rémi Munos

Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction

6 0.36947805 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

7 0.30978253 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

8 0.28984734 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

9 0.23566681 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

10 0.21767254 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

11 0.21723466 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

12 0.20308758 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces

13 0.19213045 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

14 0.16924892 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

15 0.16707709 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

16 0.16622548 101 jmlr-2012-SVDFeature: A Toolkit for Feature-based Collaborative Filtering

17 0.16488241 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

18 0.1616271 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

19 0.16103594 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory

20 0.15836088 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.015), (14, 0.034), (21, 0.02), (26, 0.047), (29, 0.028), (35, 0.042), (49, 0.03), (56, 0.014), (57, 0.013), (69, 0.023), (75, 0.05), (77, 0.012), (79, 0.013), (81, 0.445), (92, 0.049), (96, 0.079)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79686421 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

2 0.78522688 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

Author: Gil Tahan, Lior Rokach, Yuval Shahar

Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning

3 0.67145675 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning

4 0.35716951 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

5 0.34807757 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio

Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning

6 0.32626927 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

7 0.30761379 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

8 0.29566556 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

9 0.29190555 106 jmlr-2012-Sign Language Recognition using Sub-Units

10 0.28940073 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

11 0.2878685 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

12 0.28319514 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

13 0.28093201 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

14 0.27654833 100 jmlr-2012-Robust Kernel Density Estimation

15 0.27357763 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

16 0.27319825 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

17 0.27213565 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

18 0.26677671 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

19 0.26250201 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

20 0.25900695 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis