nips nips2013 nips2013-201 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. [sent-8, score-0.262]
2 In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. [sent-9, score-0.341]
3 We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. [sent-12, score-0.173]
4 Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. [sent-13, score-0.243]
5 We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. [sent-14, score-0.381]
6 Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. [sent-15, score-0.242]
7 The difference between poor settings and good settings of hyperparameters can be the difference between a useless model and state-of-the-art performance. [sent-18, score-0.268]
8 As the space of hyperparameters grows, the task of tuning them can become daunting, as well-established techniques such as grid search become either too slow or too coarse, leading to poor results in both performance and training time. [sent-21, score-0.431]
9 Recent work in machine learning has revisited the idea of Bayesian optimization [1, 2, 3, 4, 5, 6, 7], a framework for global optimization that provides an appealing approach to the difficult exploration-exploitation tradeoff. [sent-22, score-0.25]
10 The optimization must be carried out from scratch each time a model is applied to new data. [sent-25, score-0.212]
11 This could manifest itself in many ways, including establishing the values for a grid search, or simply taking certain hyperparameters as fixed with some commonly accepted value. [sent-29, score-0.194]
12 Furthermore, for large datasets one could imagine exploring a wide range of hyperparameters on a small subset of data, and then using this knowledge to quickly find an effective setting on the full dataset with just a few function evaluations. [sent-33, score-0.198]
13 The predictive mean and covariance under a GP can be respectively expressed as: $\mu(x; \{x_n, y_n\}, \theta) = K(X, x)^\top K(X, X)^{-1} (y - m(X))$ (1), $\Sigma(x, x'; \{x_n, y_n\}, \theta) = K(x, x') - K(X, x)^\top K(X, X)^{-1} K(X, x')$ (2). [sent-47, score-0.290]
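As a rough illustration of Equations (1) and (2), here is a minimal Python sketch of the GP predictive equations; the kernel matrices, mean function, and noise level are assumed inputs for illustration, not the authors' implementation.

```python
import numpy as np

def gp_predict(K_XX, K_Xs, K_ss, y, mean_X=0.0, noise=1e-6):
    """Sketch of the GP predictive mean and covariance in Eqs. (1)-(2).

    K_XX : (N, N) kernel matrix over the training inputs X
    K_Xs : (N, M) cross-kernel between training inputs and test points
    K_ss : (M, M) kernel matrix over the test points
    y    : (N,)  observed targets
    """
    # Cholesky of the (jittered) training kernel keeps the solves stable
    # and avoids forming K(X, X)^{-1} explicitly.
    L = np.linalg.cholesky(K_XX + noise * np.eye(K_XX.shape[0]))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean_X))
    V = np.linalg.solve(L, K_Xs)

    mu = K_Xs.T @ alpha        # K(X, x)^T K(X, X)^{-1} (y - m(X))
    Sigma = K_ss - V.T @ V     # K(x, x') - K(X, x)^T K(X, X)^{-1} K(X, x')
    return mu, Sigma
```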
14 Bayesian Optimization for a Single Task: Bayesian optimization is a general framework for the global optimization of noisy, expensive, black-box functions [15]. [sent-67, score-0.25]
15 The strategy is based on the notion that one can query a relatively cheap probabilistic model as a surrogate for the financially, computationally, or physically expensive function that is subject to the optimization. [sent-68, score-0.207]
16 That is, given observation pairs of the form $\{x_n, y_n\}_{n=1}^{N}$, where $x_n \in \mathcal{X}$ and $y_n \in \mathbb{R}$, we assume that the function $f(x)$ is drawn from a Gaussian process prior with $y_n \sim \mathcal{N}(f(x_n), \nu)$, where $\nu$ is the observation noise variance. [sent-71, score-0.535]
17 A standard approach is to select the next point to query by finding the maximum of an acquisition function $a(x; \{x_n, y_n\}, \theta)$ over a bounded domain in $\mathcal{X}$. [sent-72, score-0.353]
18 We will use the expected improvement criterion (EI) [15, 17]: $a_{\mathrm{EI}}(x; \{x_n, y_n\}, \theta) = \sqrt{\Sigma(x, x; \{x_n, y_n\}, \theta)}\,\big(\gamma(x)\,\Phi(\gamma(x)) + \mathcal{N}(\gamma(x); 0, 1)\big)$ (4), where $\gamma(x) = \frac{y_{\mathrm{best}} - \mu(x; \{x_n, y_n\}, \theta)}{\sqrt{\Sigma(x, x; \{x_n, y_n\}, \theta)}}$. [sent-75, score-0.435]
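A minimal sketch of the EI criterion in Equation (4) for minimization, reusing the predictive mean and variance from the GP sketch above; y_best is the best objective value observed so far, and this is an illustrative implementation rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma2, y_best, jitter=1e-9):
    """EI for minimization: sqrt(Sigma) * (gamma * Phi(gamma) + N(gamma; 0, 1))."""
    sigma = np.sqrt(np.maximum(sigma2, jitter))
    gamma = (y_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```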
19 An alternative to heuristic acquisition functions such as EI is to consider a distribution over the minimum of the function and to iteratively evaluate points that will most decrease the entropy of this distribution. [sent-78, score-0.421]
20 This entropy search strategy [18] has the appealing interpretation of decreasing the uncertainty over the location of the minimum at each optimization step. [sent-79, score-0.52]
21 Here, we formulate the entropy search problem as that of selecting the next point from a pre-specified candidate set. [sent-80, score-0.236]
22 The entropy search procedure relies on an estimate of the reduction in uncertainty over this distribution if the value $y$ at a candidate point $\tilde{x}$ is revealed. [sent-82, score-0.235]
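One simple way to make this concrete is a Monte Carlo estimate of Pmin over the pre-specified candidate set: sample the function jointly at the candidates from the GP posterior and count how often each candidate attains the minimum; comparing the entropy of this histogram before and after fantasizing an observation at a candidate gives the reduction the procedure relies on. The sketch below is one common estimator under those assumptions, not necessarily the exact approximation used in [18].

```python
import numpy as np

def estimate_pmin(mu, Sigma, n_samples=1000, rng=None):
    """Monte Carlo estimate of P(candidate c is the minimizer).

    mu    : (C,)   posterior mean at the C candidate points
    Sigma : (C, C) posterior covariance at the candidates
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(mu, Sigma, size=n_samples)
    counts = np.bincount(samples.argmin(axis=1), minlength=len(mu))
    return counts / n_samples

def entropy(p, eps=1e-12):
    """Entropy of a discrete distribution, used to score H(Pmin)."""
    return -np.sum(p * np.log(p + eps))
```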
23 Multi-Task Bayesian Optimization (Transferring Bayesian Optimization to a New Task): Under the framework of multi-task GPs, performing optimization on a related task is fairly straightforward. [sent-87, score-0.265]
24 We simply restrict our future observations to the task of interest and proceed as normal. [sent-88, score-0.192]
25 Once we have enough observations on the task of interest to properly estimate $K_t$, the other tasks act as additional observations without requiring any further function evaluations. [sent-89, score-0.317]
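For reference, a common construction of the joint covariance in multi-task GPs is a Kronecker product of a task covariance K_t and an input covariance K_x; the Cholesky parameterization of K_t in the sketch below is an illustrative assumption rather than necessarily the authors' exact parameterization.

```python
import numpy as np

def multitask_kernel(K_x, L_t):
    """K_multi((x, t), (x', t')) = K_t[t, t'] * K_x[x, x'].

    K_x : (N, N) input covariance from a standard kernel on x
    L_t : (T, T) lower-triangular factor so that K_t = L_t L_t^T
          stays positive semi-definite while being learned
    """
    K_t = L_t @ L_t.T
    # (T*N, T*N) covariance over all (task, input) pairs
    return np.kron(K_t, K_x)
```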
26 We wish to optimize the average performance over all k folds, but it may not be necessary to actually evaluate all of them in order to identify the quality of the hyperparameters under consideration. [sent-104, score-0.188]
27 The predictive mean and variance of the average objective are given by: $\bar{\mu}(x) = \frac{1}{k}\sum_{t=1}^{k} \mu(x, t; \{x_n, y_n\}, \theta)$, $\bar{\sigma}(x)^2 = \frac{1}{k^2}\sum_{t=1}^{k}\sum_{t'=1}^{k} \Sigma(x, x, t, t'; \{x_n, y_n\}, \theta)$ (8). [sent-105, score-0.368]
28 If we are willing to spend one function evaluation on each task for every point x that we query, then the optimization of this objective can proceed using standard approaches. [sent-106, score-0.309]
29 As an extreme case, if we have two perfectly correlated tasks then spending two function evaluations per query provides no additional information, at twice the cost of a single-task optimization. [sent-108, score-0.527]
30 The more interesting case, then, is to jointly choose both x and the task t, spending only one function evaluation per query. [sent-109, score-0.196]
31 Conditioned on x, we then choose the task that yields the highest single-task expected improvement. [sent-113, score-0.157]
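Taken together with the averaged objective in Equation (8), one per-iteration decision rule for the cross-validation case might look like the sketch below: choose x by EI on the averaged objective, then spend the single evaluation on the fold with the highest single-task EI. The predict callable, the candidate set, and the per-fold incumbents are assumed placeholders, and expected_improvement is the helper from the earlier EI sketch.

```python
import numpy as np

def select_point_and_fold(candidates, predict, y_best_per_fold, k):
    """Pick x via EI on the averaged k-fold objective, then pick one fold.

    predict(x) -> (mu_t, Sigma_tt): per-fold means (k,) and covariance (k, k).
    y_best_per_fold : (k,) best observed error on each fold so far.
    """
    best_x, best_ei = None, -np.inf
    for x in candidates:
        mu_t, Sigma_tt = predict(x)
        mu_bar = mu_t.mean()              # Eq. (8): averaged predictive mean
        var_bar = Sigma_tt.sum() / k**2   # Eq. (8): averaged predictive variance
        # Incumbent for the averaged objective, simplified here as the
        # mean of the per-fold bests.
        ei = expected_improvement(mu_bar, var_bar, y_best_per_fold.mean())
        if ei > best_ei:
            best_x, best_ei = x, ei

    # Conditioned on the chosen x, evaluate only the fold with highest EI.
    mu_t, Sigma_tt = predict(best_x)
    fold_ei = expected_improvement(mu_t, np.diag(Sigma_tt), y_best_per_fold)
    return best_x, int(np.argmax(fold_ei))
```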
32 The problem of minimizing the average error over multiple tasks has been considered in [20], where Bayesian optimization was applied to tune a single model on multiple datasets. [sent-114, score-0.281]
33 A Principled Multi-Task Acquisition Function: Rather than transferring knowledge from an already completed search on a related task to bootstrap a new one, a more desirable strategy would have the optimization routine dynamically query the related, possibly significantly cheaper task. [sent-118, score-0.795]
34 Intuitively, if two tasks are closely related, then evaluating a cheaper one can reveal information and reduce uncertainty about the location of the minimum on the more expensive task. [sent-119, score-0.506]
35 A clever strategy may, for example, perform low-cost exploration of a promising location on the cheaper task before risking an evaluation of the expensive task. [sent-120, score-0.477]
36 In this section we develop an acquisition function for such a dynamic multi-task strategy, based on entropy search, that explicitly takes noisy estimates of cost into account. [sent-121, score-0.409]
37 Although the EI criterion is intuitive and effective in the single task case, it does not directly generalize to the multi-task case. [sent-122, score-0.157]
38 However, entropy search does translate naturally to the multi-task problem. [sent-123, score-0.197]
39 At the bottom of each figure are lines indicating the expected information gain with regard to the primary objective function. [sent-127, score-0.195]
40 The green dashed line shows the information gain about the primary objective that results from evaluating the auxiliary objective function. [sent-128, score-0.444]
41 Evaluating the primary objective gains information, but evaluating the auxiliary does not. [sent-130, score-0.35]
42 In Figure 2b we see that with two strongly correlated functions, not only do observations on either task reduce uncertainty about the other, but observations from the auxiliary task acquire information about the primary task. [sent-131, score-0.669]
43 Finally, in Figure 2c we assume that the primary objective is three times more expensive than the auxiliary task, so evaluating the related task yields more information gain per unit cost. [sent-132, score-0.826]
44 The goal is to pick the candidate $x_t$ that maximally reduces the entropy of $P_{\min}$ for the primary task, which we take to be t = 1. [sent-133, score-0.295]
45 However, we can evaluate $P^{y}_{\min}$ for $t > 1$, and if the auxiliary task is related to the primary task, $P^{y}_{\min}$ will change from the base distribution and $H(P_{\min}) - H(P^{y}_{\min})$ will be positive. [sent-135, score-0.414]
46 Through reducing uncertainty about $f$, evaluating an observation on a related auxiliary task can reduce the entropy of $P_{\min}$ on the primary task of interest. [sent-136, score-0.824]
47 However, observe that evaluating a point on a related task can never reveal more information than evaluating the same point on the task of interest. [sent-137, score-0.51]
48 Nevertheless, when cost is taken into account, the auxiliary task may convey more information per unit cost. [sent-139, score-0.375]
49 Thus we translate the objective from Equation (7) to instead reflect the information gain per unit cost of evaluating a candidate point: $a_{\mathrm{IG}}(x_t) = \int\!\!\int \frac{H[P_{\min}] - H[P^{y}_{\min}]}{c_t(x)}\, p(y \mid f)\, p(f \mid x_t)\, dy\, df$ (9), [sent-140, score-0.547]
50 where $c_t(x)$, with $c_t: \mathcal{X} \to \mathbb{R}^{+}$, is the real-valued cost of evaluating task t at x. [sent-141, score-0.289]
51 Although we do not know this cost function in advance, we can estimate it similarly to the task functions $f(x_t)$, using the same multi-task GP machinery to model $\log c_t(x)$. [sent-142, score-0.243]
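A hedged sketch of how Equation (9) could be evaluated in practice: fantasize outcomes y at a candidate (x, t), re-estimate Pmin for the primary task, and divide the expected entropy reduction by the predicted cost from the GP over log c_t(x). The gp and cost_gp interfaces below are illustrative assumptions, and estimate_pmin and entropy refer to the earlier entropy-search sketch.

```python
import numpy as np

def info_gain_per_cost(x, t, candidates, gp, cost_gp, n_fantasies=10, rng=None):
    """Approximate a_IG(x, t) = E_y[H[Pmin] - H[Pmin^y]] / c_t(x)."""
    rng = np.random.default_rng() if rng is None else rng

    # Base distribution over the minimizer of the primary task (t = 1).
    mu, Sigma = gp.predict(candidates, task=1)
    h_base = entropy(estimate_pmin(mu, Sigma, rng=rng))

    # Average the entropy reduction over fantasized outcomes y at (x, t).
    gains = []
    for _ in range(n_fantasies):
        y_fantasy = gp.sample_observation(x, task=t, rng=rng)
        mu_y, Sigma_y = gp.predict(candidates, task=1,
                                   extra_obs=(x, t, y_fantasy))
        gains.append(h_base - entropy(estimate_pmin(mu_y, Sigma_y, rng=rng)))

    # The cost GP models log c_t(x), so exponentiate its predictive mean.
    cost = np.exp(cost_gp.predict_mean(x, task=t))
    return np.mean(gains) / cost
```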
52 Figure 2 provides a visualization of this acquisition function, using a two-task example. [sent-143, score-0.283]
53 It shows how selecting a point on a related auxiliary task can reduce uncertainty about the location of the minimum on the primary task of interest (blue solid line). [sent-144, score-0.651]
54 Following [18], we pick these candidates by taking the top C points according to the EI criterion on the primary task of interest. [sent-146, score-0.258]
55 Empirical Analyses (Addressing the Cold Start Problem): Here we compare Bayesian optimization with no initial information to the case where we can leverage results from an already completed optimization on a related task. [sent-148, score-0.267]
56 In each classification experiment the target of Bayesian optimization is the error on a held out validation set. [sent-149, score-0.218]
57 As a related task we consider a shifted Branin-Hoo where the function is translated by 10% along either axis. [sent-152, score-0.211]
58 We used Bayesian optimization to find the minimum of the original function and then added the shifted function as an additional task. [sent-153, score-0.208]
59 Logistic regression: We optimize four hyperparameters of logistic regression (LR) on the MNIST dataset using 10,000 validation examples. [sent-154, score-0.259]
60 We assume that we have already completed 50 iterations. [sent-155, score-0.257]
[Figure 3 residue. Recoverable caption: Figure 3: (a)-(d) Validation error per function evaluation. Panel titles: (c) CNN on STL-10, (d) LR on MNIST, (e) SVHN ACE, (f) STL-10 ACE. Axes: Min Function Value (Error Rate) and Average Cumulative Error vs. function evaluations. Legend: Branin-Hoo from scratch, Branin-Hoo from shifted, Baseline, STL-10 from scratch, STL-10 from CIFAR-10.]
65 Convolutional neural networks on pixels: We applied convolutional neural networks (CNNs) to the Street View House Numbers (SVHN) [21] dataset and bootstrapped from a previous run of Bayesian optimization using the same model trained on CIFAR-10 [22, 6]. [sent-189, score-0.239]
66 During the optimization, we used the first fold for training, and the remaining 4000 points from the other folds for validation. [sent-202, score-0.252]
67 We then trained separate networks on each fold using the best hyperparameter settings found by Bayesian optimization. [sent-203, score-0.405]
68 The single-task method wastes many more evaluations exploring poor hyperparameter settings. [sent-214, score-0.36]
69 In the multi-task case, this exploration has already been performed and more evaluations are spent on exploitation. [sent-215, score-0.206]
70 As a baseline (the dashed black line), we took the best model from the first task and applied it directly to the task of interest. [sent-216, score-0.351]
71 In general, we have found that the best settings for one task are usually not optimal for the other. [sent-219, score-0.214]
[Figure 4 residue. Axes: Min Function Value (RMSE) and Avg Crossval Error vs. function evaluations. Legend: Estimated Avg Crossval Error, Full Crossvalidation.]
73 Figure 4: (a) PMF cross-validation error per function evaluation on Movielens-100k. [sent-233, score-0.294]
74 (b) Lowest error observed for each fold per function evaluation for a single run. [sent-234, score-0.282]
75 Since the data is randomly partitioned among folds, we expect that the errors for each fold will be highly correlated. [sent-235, score-0.235]
76 With a good GP model, we can very likely obtain a high quality estimate by evaluating just one fold per setting. [sent-237, score-0.331]
77 We demonstrate this procedure on the task of training probabilistic matrix factorization (PMF) models for recommender systems [28]. [sent-240, score-0.157]
78 In Figure 4(a) we show the best error obtained after a given number of function evaluations as measured by the number of folds queried, averaged over 50 optimization runs. [sent-243, score-0.421]
79 In Figure 4(b), we show the best observed error after a given number of function evaluations on a randomly selected run. [sent-247, score-0.255]
80 For a particular fold, the error cannot improve unless that fold is directly queried. [sent-248, score-0.243]
81 The algorithm makes nontrivial decisions in terms of which fold to query, steadily reducing the average error. [sent-249, score-0.228]
82 Using Small Datasets to Quickly Optimize for Large Datasets: As a final empirical analysis, we evaluate the dynamic multi-task entropy search strategy developed in Section 3. [sent-251, score-0.249]
83 We treat the cost, $c_t(x)$, of a function evaluation as the real running time of training and evaluating the machine learning algorithm with hyperparameter settings x on task t. [sent-253, score-0.518]
84 In both tasks we compare using our multi-task entropy search strategy (MTBO) to optimizing the task of interest independently (STBO). [sent-255, score-0.546]
85 We revisit the logistic regression experiment (Figure 3(d)) using the same experimental protocol, but rather than assuming that there is a completed optimization of the USPS data, the Bayesian optimization routine can instead dynamically query USPS as needed. [sent-257, score-0.426]
86 Figures 5(b) and 5(c) show that MTBO reaches better values significantly faster by spending more function evaluations on the related, but relatively cheaper task. [sent-260, score-0.359]
87 Finally we evaluate the very expensive problem of optimizing the hyperparameters of online Latent Dirichlet Allocation [30] on a large corpus of 200,000 documents. [sent-261, score-0.277]
88 [6] demonstrated that on this problem, Bayesian optimization could find better hyperparameters in significantly less time than the grid search conducted by the authors. [sent-263, score-0.382]
89 We repeat this experiment here using the exact same grid as [6] and [30] but provide an auxiliary task involving a subset of 50,000 documents and 25 topics on the same grid. [sent-264, score-0.304]
90 We performed our multi-task Bayesian optimization restricted to the same grid and compared the results to the standard Bayesian optimization of [6] (the GP EI MCMC algorithm). [sent-268, score-0.256]
91 In Figure 5d, we see that our MTBO strategy finds the minimum in approximately 6 days of computation while the STBO strategy takes 10 days. [sent-269, score-0.218]
92 Our algorithm saves almost 4 days of computation by being able to dynamically explore the cheaper alternative task. [sent-270, score-0.261]
93 We see in Figure 5(f) that, particularly early in the optimization, the algorithm explores the cheaper task to gather information about the expensive one. [sent-271, score-0.346]
94 Conclusion: As datasets grow larger and models become more expensive, it has become necessary to develop new search strategies in order to find optimal hyperparameter settings as quickly as possible. [sent-280, score-0.291]
95 There is a plethora of information that can be carried over from related tasks, and taking advantage of this can result in substantial cost-savings by allowing the search to focus on regions of the hyperparameter space that are already known to be promising. [sent-283, score-0.234]
96 Our fast cross-validation procedure obviates the need to evaluate each fold per hyperparameter query and therefore eliminates redundant and costly function evaluations. [sent-289, score-0.469]
97 The next application we considered employed a cost-sensitive version of the entropy search acquisition function in order to utilize a cheap auxiliary task in the minimization of an expensive primary task. [sent-290, score-0.761]
98 Our algorithm dynamically chooses which task to evaluate, and we showed that it can substantially reduce the amount of time required to find good hyperparameter settings. [sent-291, score-0.388]
99 This provides another avenue for utilizing one task to bootstrap another. [sent-295, score-0.229]
100 Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. [sent-326, score-0.262]
wordName wordTfidf (topN-words)
[('svhn', 0.334), ('mtbo', 0.257), ('gp', 0.24), ('stbo', 0.232), ('evaluations', 0.206), ('fold', 0.194), ('task', 0.157), ('hyperparameter', 0.154), ('hyperparameters', 0.154), ('yn', 0.145), ('pmin', 0.138), ('acquisition', 0.126), ('bayesian', 0.12), ('entropy', 0.117), ('cheaper', 0.116), ('optimization', 0.108), ('auxiliary', 0.107), ('scratch', 0.104), ('primary', 0.101), ('xn', 0.1), ('evaluating', 0.098), ('ei', 0.093), ('stl', 0.091), ('tasks', 0.09), ('py', 0.087), ('ace', 0.084), ('query', 0.082), ('usps', 0.081), ('search', 0.08), ('dynamically', 0.077), ('lr', 0.076), ('snoek', 0.075), ('cnn', 0.075), ('expensive', 0.073), ('cifar', 0.072), ('bootstrap', 0.072), ('days', 0.068), ('brochu', 0.063), ('kt', 0.062), ('validation', 0.061), ('jasper', 0.059), ('folds', 0.058), ('gps', 0.058), ('settings', 0.057), ('pmf', 0.056), ('shifted', 0.054), ('mnist', 0.054), ('ct', 0.052), ('strategy', 0.052), ('aig', 0.051), ('bardenet', 0.051), ('crossval', 0.051), ('deepnet', 0.051), ('hoo', 0.051), ('kmulti', 0.051), ('completed', 0.051), ('optimizing', 0.05), ('gain', 0.05), ('error', 0.049), ('min', 0.049), ('perplexity', 0.048), ('gaussian', 0.047), ('minimum', 0.046), ('df', 0.046), ('gens', 0.045), ('branin', 0.045), ('geostatistics', 0.045), ('convolutional', 0.045), ('location', 0.045), ('dataset', 0.044), ('objective', 0.044), ('processes', 0.044), ('ryan', 0.043), ('animation', 0.042), ('bootstrapped', 0.042), ('cnns', 0.042), ('grid', 0.04), ('transfer', 0.04), ('candidate', 0.039), ('cold', 0.039), ('per', 0.039), ('correlated', 0.039), ('slice', 0.038), ('xt', 0.038), ('uncertainty', 0.038), ('taken', 0.038), ('spending', 0.037), ('nando', 0.037), ('baseline', 0.037), ('lda', 0.037), ('avg', 0.036), ('nitish', 0.036), ('observations', 0.035), ('mins', 0.034), ('cost', 0.034), ('functions', 0.034), ('appealing', 0.034), ('average', 0.034), ('package', 0.033), ('pierre', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
2 0.19699278 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
3 0.18799256 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
4 0.14298202 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
5 0.12611765 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
6 0.1056959 137 nips-2013-High-Dimensional Gaussian Process Bandits
7 0.10490905 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
8 0.095366836 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
9 0.090463474 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
10 0.089298695 211 nips-2013-Non-Linear Domain Adaptation with Boosting
11 0.083563723 75 nips-2013-Convex Two-Layer Modeling
12 0.078377746 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.075604856 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
14 0.072002761 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
15 0.071588442 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
16 0.071208611 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
17 0.070585348 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
18 0.07045076 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
19 0.069803648 251 nips-2013-Predicting Parameters in Deep Learning
20 0.068383038 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
topicId topicWeight
[(0, 0.228), (1, 0.048), (2, -0.045), (3, -0.045), (4, 0.02), (5, 0.035), (6, 0.069), (7, 0.059), (8, 0.031), (9, -0.045), (10, -0.1), (11, -0.045), (12, -0.193), (13, 0.024), (14, -0.027), (15, 0.025), (16, -0.121), (17, 0.095), (18, -0.016), (19, -0.122), (20, 0.014), (21, -0.006), (22, -0.17), (23, 0.031), (24, 0.008), (25, 0.005), (26, -0.046), (27, -0.031), (28, -0.027), (29, 0.073), (30, -0.132), (31, 0.104), (32, 0.0), (33, -0.0), (34, 0.091), (35, 0.071), (36, 0.032), (37, -0.012), (38, 0.011), (39, -0.049), (40, 0.008), (41, -0.094), (42, 0.097), (43, -0.019), (44, 0.076), (45, 0.031), (46, -0.025), (47, 0.004), (48, 0.003), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.93412238 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
2 0.85897946 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
3 0.84132224 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
4 0.79859972 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1
5 0.64029735 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
6 0.62453651 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
7 0.59246939 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
8 0.50088704 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
9 0.47906008 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
10 0.46874893 160 nips-2013-Learning Stochastic Feedforward Neural Networks
11 0.45226678 76 nips-2013-Correlated random features for fast semi-supervised learning
12 0.44364449 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
13 0.43753797 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
14 0.43129519 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
15 0.4230825 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
16 0.4178701 137 nips-2013-High-Dimensional Gaussian Process Bandits
17 0.41721988 69 nips-2013-Context-sensitive active sensing in humans
18 0.39934096 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.39920387 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
20 0.39365202 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
topicId topicWeight
[(2, 0.027), (16, 0.068), (19, 0.152), (33, 0.133), (34, 0.145), (36, 0.014), (41, 0.041), (49, 0.034), (53, 0.012), (56, 0.091), (70, 0.044), (85, 0.033), (89, 0.032), (93, 0.075), (95, 0.024)]
simIndex simValue paperId paperTitle
1 0.924106 119 nips-2013-Fast Template Evaluation with Vector Quantization
Author: Mohammad Amin Sadeghi, David Forsyth
Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1
2 0.90895933 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari
Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement. 1
3 0.89694643 9 nips-2013-A Kernel Test for Three-Variable Interactions
Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma
Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.
same-paper 4 0.88040566 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.83819497 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.81844211 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
7 0.81550801 5 nips-2013-A Deep Architecture for Matching Short Texts
8 0.80567104 350 nips-2013-Wavelets on Graphs via Deep Learning
9 0.80521315 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
10 0.80515921 86 nips-2013-Demixing odors - fast inference in olfaction
11 0.80486745 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
12 0.80479932 99 nips-2013-Dropout Training as Adaptive Regularization
13 0.8046931 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
14 0.80421281 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
15 0.80401379 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
16 0.80372071 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
17 0.80360329 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.80326515 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.80191177 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
20 0.80111367 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent