nips nips2011 nips2011-30 knowledge-graph by maker-knowledge-mining

30 nips-2011-Algorithms for Hyper-Parameter Optimization


Source: pdf

Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl

Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. [sent-10, score-0.257]

2 Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. [sent-11, score-0.175]

3 We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). [sent-12, score-0.214]

4 We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. [sent-13, score-0.329]

5 Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. [sent-14, score-0.197]

6 The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. [sent-15, score-0.141]

7 The difficulty of tuning these models makes published results difficult to reproduce and extend, and makes even the original investigation of such methods more of an art than a science. [sent-18, score-0.068]

8 Recent results such as [5], [6], and [7] demonstrate that the challenge of hyper-parameter optimization in large and multilayer models is a direct impediment to scientific progress. [sent-19, score-0.082]

9 These works have advanced state of the art performance on image classification problems by more concerted hyper-parameter optimization in simple algorithms, rather than by innovative modeling or machine learning strategies. [sent-20, score-0.116]

10 Instead, hyper-parameter optimization should be regarded as a formal outer loop in the learning process. [sent-22, score-0.082]

11 A learning algorithm, as a functional from data to classifier (taking classification problems as an example), includes a budgeting choice of how many CPU cycles are to be spent on hyper-parameter exploration, and how many CPU cycles are to be spent evaluating each hyperparameter choice (i. [sent-23, score-0.089]

12 The results of [5] and [7] suggest that with current generation hardware such as large computer clusters and GPUs, the optimal allocation of CPU cycles includes more hyper-parameter exploration than has been typical in the machine learning literature. [sent-26, score-0.069]

13 Hyper-parameter optimization is the problem of optimizing a loss function over a graph-structured configuration space. [sent-27, score-0.12]

14 the number of hidden units in the 2nd layer of a DBN) are only well-defined when node variables (e. [sent-31, score-0.072]

15 a discrete choice of how many layers to use) take particular values. [sent-33, score-0.079]

16 Not only must a hyper-parameter optimization algorithm optimize over variables which are discrete, ordinal, and continuous, but it must simultaneously choose which variables to optimize. [sent-34, score-0.128]

17 Random search is the algorithm of drawing hyper-parameter assignments from that process and evaluating them. [sent-36, score-0.17]

18 This paper makes two contributions: 1) Random search is competitive with the manual optimization of DBNs in [1], and 2) Automatic sequential optimization outperforms both manual and random search. [sent-38, score-0.872]

19 Section 2 covers sequential model-based optimization, and the expected improvement criterion. [sent-39, score-0.127]

20 Section 3 introduces a Gaussian Process based hyper-parameter optimization algorithm. [sent-40, score-0.082]

21 Section 6 shows the efficiency of sequential optimization on the two hardest datasets according to random search. [sent-43, score-0.233]

22 In an application where the true fitness function f : X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. [sent-46, score-0.082]

23 Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. [sent-47, score-0.082]

24 The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. [sent-48, score-0.094]
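The SMBO loop summarized in the three sentences above can be written as a short skeleton. This is a hedged sketch of the generic procedure, not the paper's implementation; `true_fitness`, `fit_surrogate`, and `maximize_criterion` are hypothetical placeholders for the costly objective f, the surrogate-fitting step, and the inner criterion optimization.

```python
# Minimal SMBO loop sketch (illustrative, not the paper's code).
# `true_fitness` is the costly objective f; `fit_surrogate` builds a cheap
# model M_t from the history; `maximize_criterion` returns the point that
# maximizes the chosen criterion (e.g. EI) under that model.
def smbo(true_fitness, fit_surrogate, maximize_criterion, init_points, n_iter):
    history = [(x, true_fitness(x)) for x in init_points]   # D = {(x, y)}
    for _ in range(n_iter):
        model = fit_surrogate(history)                 # fit surrogate M_t on history
        x_star = maximize_criterion(model, history)    # cheap inner optimization
        y = true_fitness(x_star)                       # expensive evaluation of f
        history.append((x_star, y))
    return min(history, key=lambda pair: pair[1])      # best (x, y) found so far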

25 We chose to use the EI criterion in our work because it is intuitive, and has been shown to work well in a variety of settings. [sent-55, score-0.06]

26 We leave the systematic exploration of improvement criteria for future work. [sent-56, score-0.071]
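When the surrogate yields a Gaussian predictive distribution with mean μ(x) and standard deviation σ(x), the expected improvement below a threshold y* has a well-known closed form. The helper below is a sketch of that standard formula for minimization; it is not tied to the paper's specific GP or TPE models.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI below y_star for a Gaussian predictive distribution N(mu, sigma^2):
    EI(x) = (y_star - mu) * Phi(z) + sigma * phi(z), with z = (y_star - mu) / sigma.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)  # avoid division by zero
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```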

27 3 The Gaussian Process Approach (GP) Gaussian Processes have long been recognized as a good method for modeling loss functions in model-based optimization literature [13]. [sent-60, score-0.082]

28 The above mentioned closedness property, along with the fact that GPs provide an assessment of prediction uncertainty incorporating the effect of data scarcity, make the GP an elegant candidate for both finding candidate x∗ (Figure 1, step 3) and fitting a model Mt (Figure 1, step 6). [sent-67, score-0.06]

29 EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin Hypercube search in higher dimensions. [sent-73, score-0.34]

30 The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget in EI evaluations. [sent-75, score-0.278]

31 CMA-ES is a state-of-the-art gradient-free evolutionary algorithm for optimization on continuous domains, which has been shown to outperform the Gaussian search EDA. [sent-78, score-0.296]

32 The use of tessellations suggested by [16] is prohibitive here, as our task often means working in more than 10 dimensions; we therefore start each local search at the center of mass of a simplex with vertices randomly picked among the training points. [sent-81, score-0.17]
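A simplified stand-in for the EI-maximization strategy described above: restart a local optimizer of −EI from the centre of mass of a random simplex of training points. It assumes a hypothetical `predict(x) -> (mu, sigma)` surrogate and box bounds, and uses an off-the-shelf local optimizer rather than CMA-ES; it is only a sketch of the idea.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def propose_by_ei(predict, y_star, bounds, train_X, n_restarts=3, seed=None):
    """Propose x* by maximizing EI with a few restarted local searches.
    `predict(x)` is assumed to return (mu, sigma) at a single point x."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    train_X = np.asarray(train_X, dtype=float)

    def neg_ei(x):
        mu, sigma = predict(np.clip(x, lo, hi))
        sigma = max(float(sigma), 1e-12)
        z = (y_star - mu) / sigma
        return -((y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z))

    best_x, best_val = None, np.inf
    for _ in range(n_restarts):
        # start at the centre of mass of a simplex of randomly picked training points
        idx = rng.choice(len(train_X), size=min(len(train_X), len(lo) + 1), replace=False)
        x0 = train_X[idx].mean(axis=0)
        res = minimize(neg_ei, x0, method="L-BFGS-B", bounds=list(zip(lo, hi)))
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x
```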

33 3 4 Tree-structured Parzen Estimator Approach (TPE) Anticipating that our hyper-parameter optimization tasks will mean high dimensions and small fitness evaluation budgets, we now turn to another modeling strategy and EI optimization scheme for the SMBO algorithm. [sent-87, score-0.191]

34 In the experimental section, we will see that the configuration space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. [sent-93, score-0.067]

35 Whereas the GP-based approach favoured quite an aggressive y∗ (typically less than the best observed loss), the TPE algorithm depends on a y∗ that is larger than the best observed f(x) so that some points can be used to form ℓ(x). [sent-100, score-0.073]

36 By maintaining sorted lists of observed variables in H, the runtime of each iteration of the TPE algorithm can scale linearly in |H| and linearly in the number of variables (dimensions) being optimized. [sent-102, score-0.073]
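A tiny sketch of the bookkeeping behind that runtime claim: keeping each variable's observed values in a sorted list means each insertion costs O(|H|) list operations rather than a full re-sort. The `history` layout used here is an assumption for illustration only.

```python
import bisect

def record_observation(history, config):
    """Insert one evaluated configuration into per-variable sorted lists.
    `history` maps each hyper-parameter name to the sorted list of values
    observed for it so far; bisect.insort keeps each list sorted in O(|H|)."""
    for name, value in config.items():
        bisect.insort(history.setdefault(name, []), value)
```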

37 1 Optimizing EI in the TPE algorithm The parametrization of p(x, y) as p(y)p(x|y) in the TPE algorithm was chosen to facilitate the optimization of EI. [sent-104, score-0.082]

38 $\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y|x)\, dy = \int_{-\infty}^{y^*} (y^* - y)\, \frac{p(x|y)\, p(y)}{p(x)}\, dy$. By construction, $\gamma = p(y < y^*)$ and $p(x) = \int_{\mathbb{R}} p(x|y)\, p(y)\, dy = \gamma\, \ell(x) + (1 - \gamma)\, g(x)$. [sent-105, score-0.082]

39 This last expression shows that to maximize improvement we would like points x with high probability under ℓ(x) and low probability under g(x). [sent-107, score-0.06]
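A one-dimensional, continuous-only sketch of the TPE proposal step implied by this derivation: split the history at the γ-quantile y*, fit Parzen (kernel density) estimators ℓ(x) and g(x), and return the candidate maximizing ℓ(x)/g(x). The defaults (γ = 0.15, 100 candidates) follow the fragment in sentence 64 below; the real algorithm handles tree-structured, mixed discrete/continuous spaces, so this version is only illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_propose(X, y, gamma=0.15, n_candidates=100):
    """Sketch of a TPE proposal for a single continuous hyper-parameter.
    Assumes enough observations fall on each side of the gamma-quantile
    split for the kernel density estimates to be well defined."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    y_star = np.quantile(y, gamma)                   # threshold separating the "good" points
    good, bad = X[y <= y_star], X[y > y_star]
    l_kde, g_kde = gaussian_kde(good), gaussian_kde(bad)   # Parzen estimators l(x), g(x)
    candidates = l_kde.resample(n_candidates).ravel()      # draw candidates from l(x)
    ratio = l_kde(candidates) / np.maximum(g_kde(candidates), 1e-12)
    return candidates[np.argmax(ratio)]              # maximizing l/g maximizes EI
```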

40 [Fragment of Table 1 (hyper-parameter priors): W init U(−a, a) or N(0, a²); random seed, 5 choices; algo A or B (see text); classifier learn rate, log-uniform.] [sent-128, score-0.095]

41 [Fragment of Table 1, continued: layers 1 to 3; CD anneal start log U(10, 10⁴); batch size 20 or 100; CD sample data yes or no.] (Section 5: Random Search for Hyper-Parameter Optimization in DBNs.) One simple but recent step toward formalizing hyper-parameter optimization is the use of random search [5]. [sent-131, score-0.379]

42 [19] showed that random search was much more efficient than grid search for optimizing the parameters of one-layer neural network classifiers. [sent-132, score-0.41]

43 In this section, we evaluate random search for DBN optimization, compared with the sequential grid-assisted manual search carried out in [1]. [sent-133, score-0.672]

44 We chose the prior listed in Table 1 to define the search space over DBN configurations. [sent-134, score-0.245]

45 These changes expand the hyper-parameter search problem, while keeping the original hyper-parameter search space as a subset of the expanded one. [sent-137, score-0.533]
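A hedged sketch of drawing one configuration from a Table-1-style prior for random search: log-uniform and categorical choices at the top level, plus layer-wise parameters that only exist once the number of layers has been drawn, which is what makes the space tree-structured. The variable names and ranges are illustrative stand-ins, not the paper's exact Table 1.

```python
import numpy as np

def sample_dbn_config(seed=None):
    """Draw one DBN configuration from a Table-1-style prior (illustrative values).
    Layer-wise entries exist only for the sampled number of layers, which is
    what makes the configuration space tree-structured / conditional."""
    rng = np.random.default_rng(seed)

    def log_uniform(lo, hi):
        return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))

    cfg = {
        "n_layers": int(rng.integers(1, 4)),               # discrete: 1 to 3 layers
        "batch_size": int(rng.choice([20, 100])),           # categorical
        "classifier_learn_rate": log_uniform(1e-3, 1.0),    # log-uniform
        "cd_sample_data": bool(rng.choice([True, False])),
        "random_seed": int(rng.choice([1, 2, 3, 4, 5])),    # 5 choices
    }
    # Conditional part: per-layer hyper-parameters are only defined for the
    # layers that actually exist in this configuration.
    for layer in range(cfg["n_layers"]):
        cfg[f"layer{layer}_n_hidden"] = int(rng.integers(128, 4097))
        cfg[f"layer{layer}_cd_learn_rate"] = log_uniform(1e-4, 1.0)
    return cfg
```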

46 The results of this preliminary random search are in Figure 2. [sent-138, score-0.202]

47 Perhaps surprisingly, the result of manual search can be reliably matched with 32 random trials for several datasets. [sent-139, score-0.583]

48 The efficiency of random search in this setting is explored further in [21]. [sent-140, score-0.202]

49 Where random search results match human performance, it is not clear from Figure 2 whether the reason is that it searched the original space as efficiently, or that it searched a larger space where good performance is easier to find. [sent-141, score-0.302]

50 The results in Figure 2 indicate that hyper-parameter optimization is harder for some datasets. [sent-144, score-0.082]

51 For example, in the case of the “MNIST rotated background images” dataset (MRBI), random sampling appears to converge to a maximum relatively quickly (best models among experiments of 32 trials show little variance in performance), but this plateau is lower than what was found by manual search. [sent-145, score-0.491]

52 In another dataset (convex), the random sampling procedure exceeds the performance of manual search, but is slow to converge to any sort of plateau. [sent-146, score-0.238]

53 This slow convergence indicates that better performance is probably available, but we need to search the configuration space more efficiently to find it. [sent-148, score-0.193]

54 The remainder of this paper explores sequential optimization strategies for hyper-parameter optimization for these two datasets: convex and MRBI. [sent-149, score-0.291]

55 [Fragment] by comparing with random sampling on the Boston Housing dataset, a regression task with 506 points made of 13 scaled input variables and a scalar target. [Figure 2 panel residue removed: "mnist basic".] [sent-151, score-0.147]

56 Random search is used to explore up to 32 hyper-parameters (see Table 1). [sent-199, score-0.17]

57 Results found using a grid-search-assisted manual search over a similar domain, with an average of 41 trials, are given in green (1-layer DBN) and red (3-layer DBN). [sent-200, score-0.551]

58 The boxplots show the distribution of test set performance when the best model among N random trials is selected. [sent-204, score-0.23]
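The "best model among N random trials" distributions in Figure 2 can be estimated from a single pool of random trials by resampling. The sketch below selects by validation error and reports the corresponding test error; it assumes per-trial (validation, test) scores from a set of random-search experiments are available.

```python
import numpy as np

def best_of_n_test_error(valid_err, test_err, n, n_resamples=1000, seed=None):
    """Estimate the 'best model among N random trials' distribution:
    repeatedly draw N trials, pick the one with the lowest validation error,
    and record its test error."""
    rng = np.random.default_rng(seed)
    valid_err = np.asarray(valid_err, dtype=float)
    test_err = np.asarray(test_err, dtype=float)
    out = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, len(valid_err), size=n)        # N trials, with replacement
        out[i] = test_err[idx[np.argmin(valid_err[idx])]]    # test error of best-on-validation
    return out
```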

59 The datasets “convex” and “mnist rotated background images” are used for more thorough hyper-parameter optimization. [sent-205, score-0.103]

60 The first 30 iterations were made using random sampling, while from the 30th on, we differentiated the random samples from the GP approach trained on the updated history. [sent-209, score-0.064]

61 Although the number of points is particularly small compared to the dimensionality, the surrogate modelling approach finds noticeably better points than random, which supports the application of SMBO approaches to more ambitious tasks and datasets. [sent-211, score-0.14]

62 Applying the GP to the problem of optimizing DBN performance, we allowed 3 random restarts of the CMA-ES algorithm per proposal x∗, and up to 500 iterations of the conjugate gradient method in fitting the length scales of the GP. [sent-212, score-0.143]

63 In the GP approach, the CMA-ES step dealt with boundaries using a penalty method, while the binomial sampling part handled them naturally. [sent-214, score-0.075]

64 15 and picked the best among 100 candidates drawn from ℓ(x) on each iteration as the proposal x∗. [sent-218, score-0.09]

65 TPE was allowed to grow past the initial bounds used for random sampling in the course of optimization, whereas the GP and random search were restricted to stay within those initial bounds throughout. [sent-220, score-0.271]

66 The TPE algorithm was also initialized with the same 30 randomly sampled points as were used to seed the GP. [sent-221, score-0.064]

67 For the GP approach, the so-called constant liar approach was used: each time a candidate point x∗ was proposed, a fake fitness evaluation equal to the mean of the y’s within the training set D was assigned temporarily, until the evaluation completed and reported the actual loss f (x∗ ). [sent-224, score-0.084]

68 This makes search less efficient, though faster in terms of wall time. [sent-227, score-0.194]
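A short sketch of the constant-liar batching described in sentence 67: while evaluations are pending, each proposed point is temporarily assigned a fake loss equal to the mean observed y, so successive proposals do not collapse onto the same point. `fit_surrogate` and `propose` are hypothetical placeholders, as in the SMBO sketch above.

```python
import numpy as np

def propose_batch_constant_liar(history, fit_surrogate, propose, batch_size=5):
    """Propose `batch_size` points for parallel evaluation with the constant liar.
    `history` is a list of (x, y) pairs; `propose(model, history)` is assumed
    to return the criterion-maximizing candidate under the given surrogate."""
    lie = float(np.mean([y for _, y in history]))   # the "constant liar" value
    augmented = list(history)
    batch = []
    for _ in range(batch_size):
        model = fit_surrogate(augmented)
        x_star = propose(model, augmented)          # e.g. EI-maximizing candidate
        batch.append(x_star)
        augmented.append((x_star, lie))             # fake evaluation until real y arrives
    return batch
```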

69 Table 2: The test set classification error of the best model found by each search algorithm on each problem. [sent-244, score-0.193]

70 Each search algorithm was allowed up to 200 trials. [sent-245, score-0.207]

71 The manual searches used 82 trials for convex and 27 trials for MRBI. [sent-246, score-0.589]

72 With the parallel evaluation of up to five proposals from the GP and TPE algorithms, each experiment took about 24 hours of wall time using five GPUs. [sent-253, score-0.101]

73 7 Discussion The trajectories (H) constructed by each algorithm up to 200 steps are illustrated in Figure 4, and compared with random search and the manual search carried out in [1]. [sent-254, score-0.578]

74 On the convex dataset (2-way classification), both algorithms converged to a validation score of 13% error. [sent-256, score-0.086]

75 TPE’s best was significantly better than both manual search (19%) and random search with 200 trials (17%). [sent-260, score-0.776]

76 On the MRBI dataset (10-way classification), random search was the worst performer (50% error), the GP approach and manual search approximately tied (47% error), while the TPE algorithm found a new best result (44% error). [sent-261, score-0.601]

77 The GP and TPE algorithms were slightly less efficient than manual search: GP and EI identified performance on par with manual search within 80 trials, whereas the manual search of [1] used 82 trials for convex and 27 trials for MRBI. [sent-263, score-1.365]

78 Perhaps, conversely, the exploration induced by the TPE’s lack of accuracy turned out to be a good heuristic for search. [sent-266, score-0.066]

79 Critically though, all four SMBO runs matched or exceeded both random search and a careful human-guided search, which are currently the state of the art methods for hyper-parameter optimization. [sent-269, score-0.236]

80 Sequential optimization algorithms work by leveraging structure in observed (x, y) pairs. [sent-271, score-0.106]

81 8 Conclusion This paper has introduced two sequential hyper-parameter optimization algorithms, and shown them to meet or exceed human performance and the performance of a brute-force random search in two difficult hyper-parameter optimization tasks involving DBNs. [sent-274, score-0.46]

82 equal layer sizes at all layers) on the search space, and fall back on a more natural hyperparameter space of 32 variables (including both discrete and continuous variables), in which many [sentence truncated; Figure 4 panel titles removed: "Dataset: convex", "Dataset: mnist rotated background images"]. [sent-277, score-0.515]

83 The solid coloured lines are the validation set accuracy of the best trial found before each point in time. [sent-296, score-0.104]

84 Both the TPE and GP algorithms make significant advances from their random initial conditions, and substantially outperform the manual and random search methods. [sent-297, score-0.464]

85 A 95% confidence interval about the best validation means on the convex task extends 0. [sent-298, score-0.085]

86 The solid black line is the test set accuracy obtained by domain experts using a combination of grid search and manual search [1]. [sent-301, score-0.574]

87 5% quantile of validation performance found among trials sampled from our prior distribution (see Table 1), estimated from 457 and 361 random trials on the two datasets respectively. [sent-303, score-0.491]

88 In this 32-dimensional search problem, the TPE algorithm presented here has uncovered new best results on both of these datasets that are significantly better than what DBNs were previously believed to achieve. [sent-307, score-0.242]

89 Moreover, the GP and TPE algorithms are practical: the optimization for each dataset was done in just 24 hours using five GPU processors. [sent-308, score-0.106]

90 Although our results are only for DBNs, our methods are quite general, and extend naturally to any hyper-parameter optimization problem in which the hyper-parameters are drawn from a measurable set. [sent-309, score-0.082]

91 We hope that our work may spur researchers in the machine learning community to treat the hyperparameter optimization strategy as an interesting and important component of all learning algorithms. [sent-310, score-0.109]

92 ” is not a fully specified, empirically answerable question – different approaches to hyper-parameter optimization will give different answers. [sent-312, score-0.082]

93 Algorithmic approaches to hyper-parameter optimization make machine learning results easier to disseminate, reproduce, and transfer to other domains. [sent-313, score-0.082]

94 Finally, powerful hyper-parameter optimization algorithms broaden the horizon of models that can realistically be studied; researchers need not restrict themselves to systems of a few variables that can readily be tuned by hand. [sent-315, score-0.129]

95 An empirical evaluation of deep architectures on problems with many factors of variation. [sent-326, score-0.082]

96 Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. [sent-342, score-0.117]

97 A taxonomy of global optimization methods based on response surfaces. [sent-382, score-0.108]

98 An informational approach to the global optimization of expensive-to-evaluate functions. [sent-388, score-0.108]

99 Gaussian process optimization in the bandit setting: No regret and experimental design. [sent-395, score-0.082]

100 Surrogating the surrogate: accelerating Gaussian Process optimization with mixtures. [sent-425, score-0.082]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tpe', 0.626), ('gp', 0.348), ('dbn', 0.224), ('smbo', 0.215), ('manual', 0.206), ('trials', 0.175), ('search', 0.17), ('dbns', 0.133), ('ei', 0.106), ('mrbi', 0.098), ('parzen', 0.095), ('sequential', 0.094), ('guration', 0.093), ('bergstra', 0.089), ('cd', 0.088), ('optimization', 0.082), ('dy', 0.082), ('gps', 0.082), ('gpu', 0.079), ('tness', 0.069), ('zca', 0.069), ('mnist', 0.065), ('bardenet', 0.059), ('eiy', 0.059), ('surrogate', 0.058), ('layers', 0.056), ('deep', 0.055), ('rotated', 0.052), ('layer', 0.049), ('evolutionary', 0.044), ('categorical', 0.044), ('con', 0.043), ('anneal', 0.039), ('cma', 0.039), ('exploration', 0.038), ('cpu', 0.038), ('optimizing', 0.038), ('seed', 0.037), ('allowed', 0.037), ('proposal', 0.036), ('experimenter', 0.034), ('mlp', 0.034), ('theano', 0.034), ('art', 0.034), ('reproduce', 0.034), ('convex', 0.033), ('improvement', 0.033), ('criterion', 0.033), ('universit', 0.032), ('random', 0.032), ('mt', 0.032), ('recherche', 0.032), ('candidates', 0.031), ('cycles', 0.031), ('denoising', 0.031), ('perhaps', 0.03), ('candidate', 0.03), ('gaussian', 0.03), ('quantile', 0.03), ('housing', 0.03), ('lozano', 0.03), ('validation', 0.029), ('noticeably', 0.028), ('informatique', 0.028), ('accuracy', 0.028), ('chose', 0.027), ('networks', 0.027), ('hyperparameter', 0.027), ('densities', 0.027), ('coates', 0.027), ('searched', 0.027), ('classi', 0.027), ('evaluation', 0.027), ('points', 0.027), ('runtime', 0.027), ('background', 0.026), ('global', 0.026), ('landscape', 0.026), ('boston', 0.026), ('algo', 0.026), ('autoencoders', 0.026), ('rectangles', 0.026), ('experiment', 0.025), ('took', 0.025), ('binomial', 0.025), ('dealt', 0.025), ('datasets', 0.025), ('prior', 0.025), ('algorithms', 0.024), ('wall', 0.024), ('believed', 0.024), ('images', 0.024), ('trial', 0.024), ('ciency', 0.024), ('stacked', 0.023), ('discrete', 0.023), ('belief', 0.023), ('space', 0.023), ('best', 0.023), ('variables', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 30 nips-2011-Algorithms for Hyper-Parameter Optimization

Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl

Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1

2 0.20701273 100 nips-2011-Gaussian Process Training with Input Noise

Author: Andrew Mchutchon, Carl E. Rasmussen

Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1

3 0.18009998 26 nips-2011-Additive Gaussian Processes

Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen

Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1

4 0.10626417 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1

5 0.083362319 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

6 0.078619555 157 nips-2011-Learning to Search Efficiently in High Dimensions

7 0.078298844 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

8 0.075605266 24 nips-2011-Active learning of neural response functions with Gaussian processes

9 0.071542129 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

10 0.06984967 250 nips-2011-Shallow vs. Deep Sum-Product Networks

11 0.069505267 244 nips-2011-Selecting Receptive Fields in Deep Networks

12 0.068432949 301 nips-2011-Variational Gaussian Process Dynamical Systems

13 0.053126901 274 nips-2011-Structure Learning for Optimization

14 0.0527073 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

15 0.052654319 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

16 0.052134078 287 nips-2011-The Manifold Tangent Classifier

17 0.052100301 156 nips-2011-Learning to Learn with Compound HD Models

18 0.050081693 261 nips-2011-Sparse Filtering

19 0.046494678 258 nips-2011-Sparse Bayesian Multi-Task Learning

20 0.046353724 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.17), (1, 0.007), (2, 0.026), (3, 0.018), (4, -0.028), (5, -0.059), (6, 0.058), (7, 0.006), (8, 0.036), (9, 0.022), (10, -0.15), (11, -0.047), (12, 0.053), (13, 0.012), (14, -0.038), (15, 0.165), (16, 0.052), (17, 0.128), (18, -0.16), (19, -0.157), (20, 0.008), (21, 0.1), (22, -0.114), (23, -0.169), (24, 0.02), (25, 0.081), (26, 0.011), (27, -0.026), (28, 0.003), (29, -0.147), (30, -0.017), (31, -0.038), (32, 0.109), (33, 0.089), (34, 0.086), (35, -0.008), (36, 0.028), (37, -0.07), (38, -0.066), (39, 0.015), (40, -0.01), (41, 0.025), (42, 0.112), (43, -0.017), (44, 0.002), (45, -0.031), (46, -0.044), (47, 0.086), (48, -0.077), (49, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91643524 30 nips-2011-Algorithms for Hyper-Parameter Optimization

Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl

Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1

2 0.82619077 100 nips-2011-Gaussian Process Training with Input Noise

Author: Andrew Mchutchon, Carl E. Rasmussen

Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1

3 0.80963367 26 nips-2011-Additive Gaussian Processes

Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen

Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1

4 0.55658871 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1

5 0.53005308 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

6 0.45231926 24 nips-2011-Active learning of neural response functions with Gaussian processes

7 0.44733062 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

8 0.40214679 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

9 0.37438396 250 nips-2011-Shallow vs. Deep Sum-Product Networks

10 0.3621273 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments

11 0.3607901 139 nips-2011-Kernel Bayes' Rule

12 0.35382065 157 nips-2011-Learning to Search Efficiently in High Dimensions

13 0.34541681 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

14 0.34467724 111 nips-2011-Hashing Algorithms for Large-Scale Learning

15 0.34317723 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

16 0.34193489 225 nips-2011-Probabilistic amplitude and frequency demodulation

17 0.34028727 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning

18 0.33959827 4 nips-2011-A Convergence Analysis of Log-Linear Training

19 0.33941257 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

20 0.33792424 217 nips-2011-Practical Variational Inference for Neural Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (4, 0.042), (20, 0.034), (26, 0.018), (31, 0.097), (33, 0.021), (40, 0.236), (43, 0.082), (45, 0.111), (57, 0.076), (65, 0.031), (74, 0.046), (83, 0.037), (99, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86578971 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent

Author: Shinichi Nakajima, Masashi Sugiyama, S. D. Babacan

Abstract: Variational Bayesian matrix factorization (VBMF) efficiently approximates the posterior distribution of factorized matrices by assuming matrix-wise independence of the two factors. A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis. 1

same-paper 2 0.77431911 30 nips-2011-Algorithms for Hyper-Parameter Optimization

Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl

Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1

3 0.74393064 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

Author: Shie Mannor, Ohad Shamir

Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1

4 0.72149611 180 nips-2011-Multiple Instance Filtering

Author: Kamil A. Wnuk, Stefano Soatto

Abstract: We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set). 1

5 0.6661976 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

Author: Nobuyuki Morioka, Shin'ichi Satoh

Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1

6 0.64335078 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

7 0.63753253 258 nips-2011-Sparse Bayesian Multi-Task Learning

8 0.63689035 219 nips-2011-Predicting response time and error rates in visual search

9 0.63421339 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

10 0.63316125 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

11 0.63084286 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

12 0.63016754 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

13 0.62951225 276 nips-2011-Structured sparse coding via lateral inhibition

14 0.62904406 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

15 0.62893599 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

16 0.62888259 24 nips-2011-Active learning of neural response functions with Gaussian processes

17 0.62879938 156 nips-2011-Learning to Learn with Compound HD Models

18 0.62769729 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

19 0.62624371 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

20 0.62511808 301 nips-2011-Variational Gaussian Process Dynamical Systems