jmlr jmlr2012 jmlr2012-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
Reference: text
sentIndex sentText sentNum sentScore
1 This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. [sent-6, score-0.491] [sent-7, score-0.545]
3 Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. [sent-8, score-0.905]
4 Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. [sent-14, score-0.532]
5 Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. [sent-18, score-0.968]
6 This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. [sent-21, score-0.551]
7 The critical step in hyper-parameter optimization is to choose the set of trials {λ(1) , . . . , λ(S) }. [sent-67, score-0.545]
8 The most widely used strategy is a combination of grid search and manual search (e.g. [sent-71, score-0.905]
9 ), then grid search requires that we choose a set of values for each variable (L(1) . . . L(K) ). [sent-81, score-0.551]
10 In grid search the set of trials is formed by assembling every possible combination of values, so the number of trials in a grid search is S = ∏_{k=1}^{K} |L(k)|. [sent-85, score-2.084]
11 This product over K sets makes grid search suffer from the curse of dimensionality because the number of joint values grows exponentially with the number of hyper-parameters (Bellman, 1961). [sent-87, score-0.588]
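The product over per-dimension value sets in sentence 10 can be sketched numerically (an illustrative computation, not code from the paper; the name `grid_size` is ours):

```python
import math

def grid_size(resolutions):
    """Number of trials in a full factorial grid: S = prod_k |L(k)|."""
    return math.prod(resolutions)

# Even a coarse grid of 4 values per hyper-parameter explodes with the
# number of hyper-parameters K:
for k in (2, 4, 8, 16):
    print(k, grid_size([4] * k))  # 16, 256, 65536, 4294967296
```

At just 2 values for each of 32 hyper-parameters, `grid_size([2] * 32)` already exceeds 10^9, which is why a full grid is ruled out for the 32-dimensional DBN search discussed later.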
12 On the other hand, grid search alone does very poorly in practice (as discussed here). [sent-98, score-0.551]
13 efficient (roughly equivalent to or better than combining manual search and grid search, in our experiments) and keeping the advantages of implementation simplicity and reproducibility of pure grid search. [sent-100, score-1.004]
14 Random search is actually more practical than grid search because it can be applied even when using a cluster of computers that can fail, and allows the experimenter to change the "resolution" of the search on the fly: [sent-101, score-0.777]
15 adding new trials to the set or ignoring failed trials are both feasible because the trials are i.i.d. [sent-103, score-1.473]
16 Of course, random search can probably be improved by automating what manual search does, i.e. [sent-107, score-0.62]
17 There are several reasons why manual search and grid search prevail as the state of the art despite decades of research into global optimization (e.g. [sent-110, score-0.959]
18 We show that random search has all the practical advantages of grid search (conceptual simplicity, ease of implementation, trivial parallelism) and trades a small reduction in efficiency [sent-126, score-0.817]
19 efficient than grid search in high-dimensional spaces because functions Ψ of interest have a low effective dimensionality; essentially, Ψ of interest are more sensitive to changes in some dimensions than others (Caflisch [sent-130, score-0.653]
20 With grid search, nine trials only test g(x) in three distinct places. [sent-142, score-0.85]
21 With random search, all nine trials explore distinct values of g. [sent-143, score-0.531]
22 This failure of grid search is the rule rather than the exception in high dimensional hyper-parameter optimization. [sent-144, score-0.551]
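The nine-trial comparison in sentences 20 and 21 can be reproduced in a small sketch (hypothetical coordinates, seeded for determinism; not the paper's code):

```python
import itertools
import random

# A 3x3 grid places 9 trials in 2-D, but projects onto only 3 distinct
# values of any single dimension; 9 random trials give 9 distinct values.
grid_trials = list(itertools.product([0.25, 0.5, 0.75], repeat=2))

rng = random.Random(0)  # seeded so the run is reproducible
random_trials = [(rng.random(), rng.random()) for _ in range(9)]

grid_x = {x for x, _ in grid_trials}      # distinct x values under grid
random_x = {x for x, _ in random_trials}  # distinct x values under random
print(len(grid_trials), len(grid_x))      # 9 3
print(len(random_trials), len(random_x))  # 9 9
```

If only the x dimension matters for g, grid search has effectively spent nine trials to learn about three settings, while random search learns about nine.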
23 inefficient for each individual data set because of the curse of dimensionality: the number of wasted grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for a particular data set. [sent-148, score-1.311]
24 grid search as a method for optimizing neural network hyper-parameters. [sent-155, score-0.659]
25 We take the grid search experiments of Larochelle et al. [sent-156, score-0.598]
26 We conclude in Section 7 that random search is generally superior to grid search for optimizing hyperparameters. [sent-165, score-0.852]
27 However, when different trials have nearly optimal validation means, then it is not clear which test score to report, and a slightly different choice of λ could have yielded a different test error. [sent-192, score-0.652]
28 Efficiency Curve: Figure 2 illustrates the results of a random experiment: an experiment of 256 trials training neural networks to classify the rectangles data set. [sent-214, score-1.023]
29 Since the trials of a random experiment are independently identically distributed (i.i.d.), [sent-215, score-0.654]
30 trials can also be interpreted as N independent experiments of s trials, as long as sN ≤ S. [sent-221, score-0.538]
31 The sharp upward slope occurs because when we take the maximum over larger subsets of the S trials, trials with poor performance are rarely the best within their subset. [sent-225, score-0.519]
32 The gentle downward slope occurs because as we take the maximum over larger subsets of trials (in Equation 6), we are less sure about which trial is actually the best. [sent-230, score-0.612]
33 Large experiments average together good validation trials with unusually high test scores with other good validation trials with unusually low test scores to arrive at a more accurate estimate of generalization. [sent-231, score-1.291]
34 i.i.d., so an experiment of many trials (here, 256 trials optimizing a neural network to classify the rectangles basic data set, Section 2. [sent-244, score-1.454]
35 For example, at horizontal axis position 8, we consider our 256 trials to be 32 experiments of 8 trials each. [sent-246, score-1.029]
36 When there are not enough experiments to support a box plot, as occurs here for experiments of 32 trials or more, the best generalization score of each experiment is shown by a scatter plot. [sent-255, score-0.736]
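The subdivision of S i.i.d. trials into N experiments of s trials each (sentences 29 and 30), and the resulting best-score statistics, can be sketched with synthetic scores (uniform stand-ins, not the paper's results):

```python
import random

rng = random.Random(42)
S = 256
scores = [rng.random() for _ in range(S)]  # stand-in validation accuracies

def best_per_experiment(scores, s):
    """Interpret S i.i.d. trials as N = S // s disjoint experiments of
    s trials each and return the best score within each experiment."""
    shuffled = scores[:]
    random.Random(0).shuffle(shuffled)  # fixed seed: same partition each call
    n = len(shuffled) // s
    return [max(shuffled[i * s:(i + 1) * s]) for i in range(n)]

for s in (1, 2, 4, 8, 16, 32):
    bests = best_per_experiment(scores, s)
    mean_best = sum(bests) / len(bests)
    print(f"s={s:2d}  N={len(bests):3d}  mean best={mean_best:.3f}")
```

The mean best score rises steeply for small s and flattens as s grows, the same shape as the efficiency curves of Figure 2.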
37 In the bar plot for trials of size 1, we would see the top performer scoring 80%. [sent-260, score-0.491]
38 Figure 3: From top to bottom, samples from the mnist rotated, mnist background random, mnist background images, mnist rotated background images data sets. [sent-267, score-1.84]
39 The mnist background images data set is a variation on mnist basic in which the white foreground digit has been composited on top of a 28x28 natural image patch. [sent-283, score-1.06]
40 The mnist background random data set is a similar variation on mnist basic in which the white foreground digit has been composited on top of random uniform (0,1) pixel values. [sent-287, score-0.997]
41 The mnist rotated data set is a variation on mnist basic in which the images have been rotated by an amount chosen randomly between 0 and 2π radians. [sent-289, score-1.151]
42 The mnist rotated background images data set is a variation on mnist rotated in which the images have been rotated by an amount chosen randomly between 0 and 2π radians, and then subsequently composited onto natural image patch backgrounds. [sent-303, score-1.569]
43 The rectangles images data set (Figure 4, middle) is a variation on rectangles in which the foreground rectangles were filled [sent-313, score-0.895]
44 (2007), the hyper-parameters of the neural network were optimized by search over a grid of trials. [sent-335, score-0.624]
45 4 We formed experiments for each data set by drawing S = 256 trials from this distribution. [sent-378, score-0.538]
46 Random sampling of trials is surprisingly effective in these settings. [sent-380, score-0.55]
47 Figure 5 shows that even among the fraction of jobs (71/256) that used no preprocessing, the random search with 8 trials is better than the grid search employed in Larochelle et al. [sent-381, score-1.308]
48 Typically, the extent of a grid search is determined by a computational budget. [sent-383, score-0.551]
49 Figure 6 shows what is possible if we use random search in a larger space that requires more trials to explore. [sent-384, score-0.757]
50 In the larger space, 32 trials were necessary to consistently outperform grid search rather than 8, indicating that there are many harmful ways to preprocess the data. [sent-386, score-1.042]
51 However, when we allowed larger experiments of 64 trials or more, random search found superior results to those found more quickly within the more restricted search. [sent-387, score-0.804]
52 The fact that experiments of just 4 or 8 trials often have the same maximum as much larger experiments indicates that the region of Λ that gives rise to the best performance is approximately a quarter or an eighth respectively of the entire configuration space. [sent-392, score-0.585]
53 In contrast the mnist rotated background images and convex curves show that even with 16 or 32 random trials, there is considerable variation in the generalization of the reportedly best model. [sent-397, score-0.812]
54 In this section we show that indeed Ψ has a low effective dimension, which explains why randomly sampled trials found better values. [sent-403, score-0.55]
55 [Figure residue: box plots of best generalization score vs. experiment size (# trials: 1, 2, 4, 8, 16, 32) for the mnist rotated and mnist rotated background images data sets.] [sent-436, score-1.556]
56 (2007), looking only at trials with no preprocessing (7 hyper-parameters to optimize). [sent-479, score-0.528]
57 The dashed blue line represents grid search accuracy for neural network models based on a selection by grids averaging 100 trials (Larochelle et al. [sent-481, score-1.226]
58 Random searches of 8 trials match or outperform grid searches of (on average) 100 trials. [sent-483, score-0.816]
59 [Figure residue: box plots of best generalization score vs. experiment size (# trials: 1, 2, 4, 8, 16, 32, 64) for the mnist rotated and mnist rotated background images data sets.] [sent-518, score-1.556]
60 Dashed blue line represents grid search accuracy using (on average) 100 trials (Larochelle et al. [sent-559, score-1.072]
61 Often the extent of a search is determined by a computational budget, and with random search 64 trials are enough to find [sent-561, score-0.983]
62 Exploring just four PCA variance levels by grid search would have required 5 times as many (average 500) trials per data set. [sent-563, score-1.042]
63 Figure 7 reveals two important properties of Ψ for neural networks that suggest why grid search performs so poorly relative to random experiments: 1. [sent-569, score-0.651]
64 a small fraction of hyper-parameters matter for any one data set, but different hyper-parameters are important on different data sets. [sent-570, score-2.582]
65 [Figure 7 residue: per-data-set panels of hyper-parameter relevance (1 / length scale) for mnist basic, mnist background images, mnist background random, mnist rotated, mnist rotated background images, convex, rectangles, and rectangles images.] [sent-571, score-1.127]
66 While random search optimized these Ψ functions with 8 to 16 trials, a grid with, say, four values in each of these axes would already require 256 trials, and yet provide no guarantee that Ψ for a new data set would be well optimized. [sent-629, score-0.591]
67 The data sets for which the effective dimensionality was lowest (1 or 2) were mnist basic, mnist background images, mnist background random, and rectangles images. [sent-631, score-1.496]
68 Grid Search and Sets with Low Effective Dimensionality It is an interesting mathematical challenge to choose a set of trials for sampling functions of unknown, but low effective dimensionality. [sent-642, score-0.55]
69 We would like it to be true that no matter which dimensions turn out to be important, our trials sample the important dimensions evenly. [sent-643, score-0.604]
70 There are many possible grid experiments of any size in multiple dimensions (at least for non-prime experiment sizes). [sent-674, score-0.538]
71 five grids with resolutions (1, 1, 1, 1, 16), (1, 1, 1, 2, 8), (1, 1, 2, 2, 4), (1, 1, 1, 4, 4), (1, 2, 2, 2, 2); for experiments of some prime size P in 3 dimensions we tried one grid with resolution (1, 1, P). [sent-677, score-0.496]
72 1.0 minus the probability of missing the target with every single one of T trials in the experiment. [sent-683, score-0.518]
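The hit probability described in sentence 72, P(hit) = 1.0 - (1 - v)^T, can be written out directly (an illustrative sketch; the target volume fraction v is an assumed quantity, not a value from the paper):

```python
import math

def p_hit(v, T):
    """Probability that at least one of T i.i.d. uniform trials lands in a
    target region occupying a fraction v of the search volume."""
    return 1.0 - (1.0 - v) ** T

def trials_needed(v, p=0.95):
    """Smallest T such that p_hit(v, T) >= p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - v))

# A target occupying 5% of the volume is hit with >= 95% probability
# within 59 trials, independent of the dimensionality of the space:
print(trials_needed(0.05))  # 59
```

The count depends only on the target's volume fraction, not on how many dimensions the search space has, which is the intuition behind random search's robustness to irrelevant dimensions.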
73 sufficiently many trials for the structure in the Sobol 5. [sent-700, score-0.491]
74 sufficiently many trials for random search to succeed with high probability. [sent-724, score-0.757]
75 A thought experiment gives some intuition for why grid search fails in the case of rectangles. [sent-725, score-0.674]
76 More generally, if the target manifold were not systematically aligned with subsets of trial points, then grid search would be as efÄ? [sent-730, score-0.644]
77 Sequential Manual Optimization To see how random search compares with a careful combination of grid search and hand-tuning in the context of a model with many hyper-parameters, we performed experiments with the Deep Belief Network (DBN) model (Hinton et al. [sent-734, score-0.864]
78 A grid search is not practical for the 32-dimensional search problem of DBN model selection, because even just 2 possible values for each of 32 hyper-parameters would yield more trials than we could conduct (2^32 > 10^9 trials, and each can take hours). [sent-774, score-1.759]
79 (2007) was a combination of manual search, multi-resolution grid search and coordinate descent. [sent-777, score-0.679]
80 first trials would simply give us a trend on the validation set error for these parameters (is a change in the hyper-parameter making things worse or better) and we would then consider that information in selecting appropriate additional trials. [sent-801, score-0.556]
81 There was large variation in the number of trials used in Larochelle et al. [sent-806, score-0.52]
82 Results obtained by grid-assisted manual search using an average of 41 trials are marked in fi [sent-813, score-0.845]
83 Random experiments of 128 trials found an inferior best model for three data sets, a competitive model in four, and a superior model in one (convex). [sent-815, score-0.578]
84 In considering the number of trials per data set, it is important to bear in mind that the experiments on different data sets were not performed independently. [sent-818, score-0.538]
85 Although grid search was part of the optimization loop, the manual intervention turns the overall optimization process into something with more resemblance to an adaptive sequential algorithm. [sent-821, score-0.824]
86 efficiency curves we see that even experiments with larger numbers of trials (64 and larger) feature significant [sent-832, score-0.633]
87 With so many sophisticated algorithms to draw on, it may seem strange that grid search is still widely used, and, with straight faces, we now suggest using random search instead. [sent-860, score-0.817]
88 Manual optimization followed by grid search is easy to implement: grid search requires very little code infrastructure beyond access to a cluster of computers. [sent-862, score-1.156]
89 They require client-server architectures in which a master process keeps track of the trials that have completed, the trials that are in progress, the trials that were started but failed to complete. [sent-867, score-1.473]
90 It is also common to perform multistage, multi-resolution grid experiments that are more or less automated, because a grid experiment with a fine [sent-875, score-0.82]
91 Grid search experiments allocate too many trials to the exploration of dimensions that do not matter and suffer from poor coverage in dimensions that are important. [sent-881, score-0.905]
92 Compared with the grid search experiments of Larochelle et al. [sent-882, score-0.598]
93 • The experiment can be stopped any time and the trials form a complete experiment. [sent-885, score-0.614]
94 • If extra computers become available, new trials can be added to an experiment without having to adjust the grid and commit to a much larger experiment. [sent-886, score-0.939]
95 To investigate the effect of one hyper-parameter of interest X, we recommend random search (instead of grid search) for optimizing over other hyper-parameters. [sent-890, score-0.626]
96 Random experiments with large numbers of trials also bring attention to the question of how to measure test error of an experiment when many trials have some claim to being best. [sent-892, score-1.186]
97 However, the trials of a low-discrepancy experiment are not i.i.d. [sent-901, score-0.614]
98 Random search was not generally as good as the sequential combination of manual and grid search from an expert (Larochelle et al. [sent-910, score-0.942]
99 efficiency of the grid search employed at each step of the procedure. [sent-913, score-0.619]
100 We hope that future work in that direction will consider random search of the form studied here as a baseline for performance, rather than grid search. [sent-915, score-0.591]
wordName wordTfidf (topN-words)
[('trials', 0.491), ('mnist', 0.343), ('grid', 0.325), ('rectangles', 0.241), ('search', 0.226), ('larochelle', 0.191), ('rotated', 0.163), ('ergstra', 0.14), ('yper', 0.14), ('manual', 0.128), ('dbn', 0.126), ('experiment', 0.123), ('engio', 0.12), ('sobol', 0.117), ('images', 0.11), ('andom', 0.1), ('earch', 0.093), ('grids', 0.081), ('ptimization', 0.079), ('deep', 0.073), ('ciency', 0.068), ('units', 0.067), ('trial', 0.066), ('background', 0.065), ('validation', 0.065), ('effective', 0.059), ('gx', 0.058), ('netuning', 0.058), ('optimization', 0.054), ('hidden', 0.052), ('bergstra', 0.05), ('valid', 0.048), ('experiments', 0.047), ('relevance', 0.047), ('composited', 0.047), ('guration', 0.047), ('layer', 0.047), ('rectangle', 0.047), ('network', 0.047), ('pretraining', 0.045), ('ws', 0.044), ('dimensions', 0.043), ('pixels', 0.042), ('random', 0.04), ('latin', 0.04), ('lecun', 0.039), ('preprocessing', 0.037), ('dimensionality', 0.037), ('sequential', 0.037), ('bengio', 0.037), ('hutter', 0.036), ('unimportant', 0.036), ('tall', 0.036), ('convex', 0.035), ('gpr', 0.035), ('halton', 0.035), ('kirkpatrick', 0.035), ('optimizing', 0.035), ('networks', 0.034), ('test', 0.034), ('foreground', 0.033), ('contrastive', 0.033), ('image', 0.033), ('scores', 0.032), ('length', 0.031), ('belief', 0.031), ('accuracy', 0.03), ('chose', 0.03), ('draws', 0.029), ('digit', 0.029), ('variation', 0.029), ('score', 0.028), ('white', 0.028), ('slope', 0.028), ('coverage', 0.028), ('downward', 0.027), ('nelder', 0.027), ('curves', 0.027), ('target', 0.027), ('matter', 0.027), ('geometrically', 0.026), ('hypercube', 0.026), ('neural', 0.026), ('surface', 0.025), ('doi', 0.025), ('erhan', 0.025), ('train', 0.025), ('drawn', 0.024), ('uniformly', 0.024), ('response', 0.024), ('antonov', 0.023), ('boxed', 0.023), ('bratley', 0.023), ('czogiel', 0.023), ('galassi', 0.023), ('gured', 0.023), ('holes', 0.023), ('isch', 0.023), ('kleinman', 0.023), ('meanx', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
2 0.11824343 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
3 0.11573008 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
4 0.076567374 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
5 0.061898638 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
6 0.05657132 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
7 0.048537534 106 jmlr-2012-Sign Language Recognition using Sub-Units
8 0.044900674 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
9 0.044364382 103 jmlr-2012-Sampling Methods for the Nyström Method
10 0.042969823 59 jmlr-2012-Linear Regression With Random Projections
11 0.041965745 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
12 0.039145079 13 jmlr-2012-Active Learning via Perfect Selective Classification
13 0.039024204 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
14 0.036225371 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
15 0.035507098 20 jmlr-2012-Analysis of a Random Forests Model
16 0.034983981 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
17 0.034119129 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
18 0.031887501 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
19 0.031802546 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
20 0.03087404 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
topicId topicWeight
[(0, -0.17), (1, 0.018), (2, 0.163), (3, -0.036), (4, 0.045), (5, 0.006), (6, 0.105), (7, 0.032), (8, -0.076), (9, 0.06), (10, -0.001), (11, -0.01), (12, -0.003), (13, 0.129), (14, -0.048), (15, 0.164), (16, 0.097), (17, 0.185), (18, 0.167), (19, -0.018), (20, -0.046), (21, -0.131), (22, 0.173), (23, 0.16), (24, 0.001), (25, -0.098), (26, 0.137), (27, 0.035), (28, -0.136), (29, -0.267), (30, -0.036), (31, -0.025), (32, 0.078), (33, -0.043), (34, -0.107), (35, -0.095), (36, -0.152), (37, -0.052), (38, 0.065), (39, -0.007), (40, -0.064), (41, 0.202), (42, 0.049), (43, -0.145), (44, 0.032), (45, -0.059), (46, 0.107), (47, 0.002), (48, 0.032), (49, 0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.97538966 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
2 0.57822359 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
3 0.52181339 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
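The third desideratum, reporting balanced accuracy rather than raw accuracy, guards against the inflated scores a majority-class predictor earns on imbalanced data. A minimal point-estimate sketch is below; note that the paper itself performs full Bayesian mixed-effects inference over this quantity, which this fragment does not attempt.

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recalls: chance level is 1/n_classes
    # regardless of how imbalanced the label distribution is.
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        members = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in members) / len(members))
    return sum(recalls) / len(recalls)
```

On 9 negatives and 1 positive, always predicting the negative class scores 0.9 plain accuracy but only 0.5 balanced accuracy, which is exactly the inflation the abstract warns about.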
4 0.45053256 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
5 0.4406088 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
6 0.2550036 103 jmlr-2012-Sampling Methods for the Nyström Method
8 0.23808169 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
10 0.22185522 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
11 0.21749541 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
12 0.20503679 106 jmlr-2012-Sign Language Recognition using Sub-Units
13 0.20421641 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
14 0.19396654 20 jmlr-2012-Analysis of a Random Forests Model
15 0.19097506 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
16 0.18995766 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
17 0.17767279 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
18 0.17654727 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
19 0.17083959 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
20 0.17071258 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
topicId topicWeight
[(7, 0.013), (14, 0.017), (21, 0.042), (26, 0.054), (27, 0.013), (29, 0.062), (35, 0.455), (49, 0.012), (56, 0.024), (57, 0.013), (69, 0.018), (75, 0.042), (77, 0.014), (81, 0.014), (92, 0.058), (96, 0.08)]
simIndex simValue paperId paperTitle
same-paper 1 0.88838333 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
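The procedure the abstract describes, sampling each hyper-parameter independently from its prior and keeping the best trial, takes only a few lines. In the sketch below the objective is a synthetic stand-in for validation error in which only the learning rate matters, mimicking the low-effective-dimensionality finding; the parameter names and priors are illustrative, not the paper's experimental setup.

```python
import math
import random

def validation_error(lr, n_hidden):
    # Synthetic stand-in for a model's validation error: only the
    # learning rate matters, so most of the search space is irrelevant.
    return (math.log10(lr) + 2.0) ** 2 + 1e-6 * abs(n_hidden - 256)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_err, best_cfg = float("inf"), None
    for _ in range(n_trials):
        # Draw each hyper-parameter independently from its prior.
        lr = 10.0 ** rng.uniform(-5.0, 0.0)   # log-uniform learning rate
        n_hidden = rng.randrange(16, 1025)    # uniform integer layer width
        err = validation_error(lr, n_hidden)
        if err < best_err:
            best_err, best_cfg = err, {"lr": lr, "n_hidden": n_hidden}
    return best_err, best_cfg
```

Unlike a grid, adding trials never forces a coarser resolution on the parameters that turn out to matter, which is the mechanism behind random search's advantage when the effective dimensionality is low.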
2 0.88638991 12 jmlr-2012-Active Clustering of Biological Sequences
Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia
Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University. © 2012 Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng and Yu Xia.
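The landmark idea can be illustrated with a farthest-first traversal, one common active selection strategy; the abstract does not specify which strategy the paper uses, so the sketch below is an assumption-laden simplification with a plain symmetric distance function, not the paper's algorithm or its approximation-stability guarantees.

```python
def farthest_first_landmarks(points, k, dist):
    # Greedy farthest-first traversal: each new landmark is the point
    # farthest from the landmarks chosen so far. Each round uses one
    # "one-versus-all" query (distances from the new landmark to all points).
    landmarks = [0]  # start from an arbitrary point (index 0)
    d_min = [dist(points[0], p) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=lambda i: d_min[i])
        landmarks.append(nxt)
        for i, p in enumerate(points):
            d_min[i] = min(d_min[i], dist(points[nxt], p))
    return landmarks

def cluster_by_landmarks(points, landmarks, dist):
    # Assign every point to its nearest landmark; only landmark-to-point
    # distances are ever consulted, a small fraction of all pairs.
    return [min(landmarks, key=lambda l: dist(points[l], p)) for p in points]
```

Choosing k landmarks this way costs O(k) one-versus-all queries, matching the query budget the abstract highlights.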
3 0.41839814 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
4 0.39911127 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
Author: Daniel L. Ly, Hod Lipson
Abstract: A hybrid dynamical system is a mathematical model suitable for describing an extensive spectrum of multi-modal, time-series behaviors, ranging from bouncing balls to air traffic controllers. This paper describes multi-modal symbolic regression (MMSR): a learning algorithm to construct non-linear symbolic representations of discrete dynamical systems with continuous mappings from unlabeled, time-series data. MMSR consists of two subalgorithms—clustered symbolic regression, a method to simultaneously identify distinct behaviors while formulating their mathematical expressions, and transition modeling, an algorithm to infer symbolic inequalities that describe binary classification boundaries. These subalgorithms are combined to infer hybrid dynamical systems as a collection of apt, mathematical expressions. MMSR is evaluated on a collection of four synthetic data sets and outperforms other multi-modal machine learning approaches in both accuracy and interpretability, even in the presence of noise. Furthermore, the versatility of MMSR is demonstrated by identifying and inferring classical expressions of transistor modes from recorded measurements. Keywords: hybrid dynamical systems, evolutionary computation, symbolic piecewise functions, symbolic binary classification
5 0.39033347 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
6 0.38870031 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
7 0.38466948 109 jmlr-2012-Stability of Density-Based Clustering
8 0.38298401 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
9 0.3829267 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
10 0.3778711 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
11 0.37159103 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
12 0.37054744 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
13 0.36705658 103 jmlr-2012-Sampling Methods for the Nyström Method
14 0.36600822 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
15 0.36344087 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
16 0.36078548 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
17 0.35557947 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
18 0.35331562 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
19 0.35002115 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
20 0.34880245 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression