jmlr jmlr2012 jmlr2012-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
Reference: text
sentIndex sentText sentNum sentScore
1 The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. [sent-10, score-0.203]
2 In the first one, relatively cheap function evaluations take place on a numerical machine and the goal is to find a “good” region of low or high function values. [sent-16, score-0.201]
3 This induces a measure pmin (x) ≡ p[x = arg min f (x)] = f :I R p( f ) ∏ θ[ f (x) − f (x)] d f , ˜ (1) x∈I ˜ x=x ˜ where θ is Heaviside’s step function. [sent-35, score-0.731]
4 Finally, let L(x∗ , xmin ) be a loss function describing the cost of naming x∗ as the result of optimization if the true minimum is at xmin . [sent-39, score-0.538]
5 This loss function induces a loss functional L (pmin ) assigning utility to the uncertain knowledge about xmin , as L (pmin ) = [min L(x∗ , xmin )]pmin (xmin ) dxmin . [sent-40, score-0.513]
6 ∗ I x The goal of global optimization is to decrease the expected loss after H function evaluations at locations x = {x1 , . [sent-41, score-0.372]
7 The expected loss is L H = p(y | x)L (pmin (x | y, x)) dy = p(y | f (x))p( f (x) | x)L (pmin (x | y, x)) dy d f , (2) where L (pmin (x | y, x)) should be understood as the cost assigned to the measure pmin (x) induced by the posterior belief over f after observations y = {y1 , . [sent-45, score-0.958]
8 If such formulations feature the global minimum xmin at all, then only in statements about the limit behavior of the algorithm after many evaluations. [sent-63, score-0.297]
9 In probabilistic optimization, the only quantity that counts is the quality of the belief on pmin under L , after H evaluations, not the sum of the function values returned during those H steps. [sent-87, score-0.848]
10 3 R ELATIONSHIP TO H EURISTIC G AUSSIAN P ROCESS O PTIMIZATION S URFACE O PTIMIZATION AND R ESPONSE There are also a number of works employing Gaussian process measures to construct heuristics for search, also known as “Gaussian process global optimization” (Jones et al. [sent-90, score-0.24]
11 Two popular heuristics are the probability of improvement (Lizotte, 2008) η uPI (x) = p[ f (x) < η] = −∞ N ( f (x); µ(x), σ(x)2 ) d f (x) = Φ η − µ(x) , σ(x) and expected improvement (Jones et al. [sent-95, score-0.204]
12 These two heuristics have different units of measure: probability of improvement is a probability, expected improvement has the units of f . [sent-97, score-0.204]
13 Both utilities differ markedly from Equation (1), pmin , which is a probability measure and as such a global quantity. [sent-98, score-0.813]
14 If the goal is to infer the location of the minimum (more generally: minimize loss at the horizon), the optimal strategy is to evaluate where we expect to learn most about the minimum (reduce loss toward the horizon), rather then where we think the minimum is (recall Section 1. [sent-103, score-0.259]
15 discretizing pmin We discretize the problem of calculating pmin , to a finite set of representer points chosen from a non-uniform measure, which deals gracefully with the curse of dimensionality. [sent-120, score-1.427]
16 approximating pmin We construct an efficient approximation to pmin , which is required because Equation (1), even for finite-dimensional Gaussian measures, is not analytically tractable, (Section 2. [sent-123, score-1.416]
17 Five sampled functions from the current belief as dashed red lines. [sent-128, score-0.158]
18 predicting change to pmin The Gaussian process measure affords a straightforward but rarely used analytic probabilistic formulation for the change of p( f ) as a function of the next evaluation point (Section 2. [sent-130, score-0.945]
19 choosing loss function We commit to relative entropy from a uniform distribution as the loss function, as this can be interpreted as a utility on gained information about the location of the minimum (Section 2. [sent-132, score-0.298]
20 predicting expected information gain From the predicted change, we construct a first-order expansion on L from future evaluations and, again, compare to the asymptotically exact Monte Carlo answer (Section 2. [sent-134, score-0.158]
21 Gaussian process beliefs are parametrized by a mean function m : I R and a covariance function k : I × I R. [sent-150, score-0.158]
22 A number of such kernel functions are known in the literature, and different kernel functions induce different kinds of Gaussian process measures over the space of functions. [sent-153, score-0.181]
23 In this work, we side-step this issue by assuming that the chosen kernel is sufficiently regular to induce a well-defined belief pmin as defined by Equation (8). [sent-159, score-0.919]
24 Finally, for what follows it is important to know that 1814 E NTROPY S EARCH f −5 −4 −3 −2 −1 0 x 1 2 3 4 5 Figure 2: pmin induced by p( f ) from Figure 1. [sent-173, score-0.694]
25 Blue solid line: Asymptotically exact representation of pmin gained from exact sampling of functions on a regular grid (artifacts due to finite sample size). [sent-175, score-0.776]
26 For comparison, the plot also shows the local utilities probability of improvement (dashed magenta) and expected improvement (solid magenta) often used for Gaussian process global optimization. [sent-176, score-0.303]
27 Stochastic error on sampled values, due to only asymptotically correct assignment of mass to samples, and varying density of points, focusing on relevant areas of pmin . [sent-178, score-0.778]
28 This plot uses arbitrary scales for each object: The two heuristics have different units of measure, differing from that of pmin . [sent-179, score-0.774]
29 Notice the interesting features of pmin at the boundaries of the domain: The prior belief encodes that f is smooth, and puts finite probability mass on the hypothesis that f has negative (positive) derivative at the right (left) boundary of the domain. [sent-180, score-0.94]
30 2 Discrete Representations for Continuous Distributions Having established a probability measure p( f ) on the function, we turn to constructing the belief pmin (x) over its minimum. [sent-188, score-0.858]
31 1815 H ENNIG AND S CHULER It may seem daunting that pmin involves an infinite-dimensional integral. [sent-192, score-0.694]
32 If the stochastic process representing the belief over f is sufficiently regular, then Equation (1) can be approximated arbitrarily well with finitely many representer points. [sent-194, score-0.256]
33 The discretization grid need not be regular—it may be sampled from any distribution which puts non-zero measure on every open neighborhood of I. [sent-195, score-0.215]
34 This is obviously not efficient: Just like in other numerical quadrature problems, any given resolution can be achieved with fewer representer points if they are chosen irregularly, with higher resolution in regions of greater influence on the result of integration. [sent-197, score-0.186]
35 A non-uniform quadrature measure u(x) ˜ for N representer locations {xi }i=1,. [sent-200, score-0.177]
36 3 will construct a discretized qmin (xi ) that is an approximation to the probability that fmin ˆ ˜ occurs within the step at xi . [sent-206, score-0.275]
37 So the approximate pmin on this step is proportional to qmin (xi )u(xi ), ˜ ˆ ˆ ˜ ˜ and can be easily normalized numerically, to become an approximation to pmin . [sent-207, score-1.537]
38 Intuitively, a good proposal measure for discretization points should put high resolution on regions of I where the shape of pmin has strong influence on the loss, and on its change. [sent-211, score-0.86]
39 5), it is a good idea to choose u such that it puts high mass on regions of high value for pmin . [sent-213, score-0.792]
40 We use these functions for a similar reason as their original authors: Because they tend to have high value in regions where pmin is also large. [sent-218, score-0.722]
41 To avoid confusion, however, note that we use these functions as unnormalized measures to sample discretization points for our calculation of pmin , not as an approximation for pmin itself, as was done in previous work by other authors. [sent-219, score-1.515]
42 Defects in these heuristics have weaker effect on our algorithm than in the cited works: in our case, if u is not a good proposal measure, we simply need more samples to construct a good representation of pmin . [sent-220, score-0.735]
43 3 Approximating pmin with Expectation Propagation The previous Section 2. [sent-223, score-0.694]
44 The restriction of the Gaussian process belief to these locations is a multivariate ˜ ˜ Gaussian density with mean µ ∈ RN and covariance Σ ∈ RN×N . [sent-228, score-0.317]
45 So Equation (1) reduces to a discrete ˜ probability distribution (as opposed to a density) N pmin (xi ) = ˆ f ∈RN N ( f ; µ, Σ) ∏ θ( f (x j ) − f (xi )) d f . [sent-229, score-0.694]
46 ˜ ˜ i= j 1816 E NTROPY S EARCH N ( f ; µ, Σ) ˜ ˜ θ[ f (x j ) − f (xi )] ˜ ˜ f i= j Figure 3: Graphical model providing motivation for EP approximation on pmin . [sent-230, score-0.722]
47 However, it is possible to construct an effective approximation qmin based on Expectation Propagation (EP) (Minka, 2001): Consider the belief ˆ p( f (x)) as a “prior message” on f (x), and each of the terms in the product as one factor providing ˜ ˜ another message. [sent-234, score-0.276]
48 Running EP on this graph provides an approximate Gaussian marginal, whose normalization constant qmin (xi ), which EP also ˆ provides, approximates p( f | xmin = xi ). [sent-236, score-0.38]
49 An important advantage of the EP approximation over both numerical integration and Monte Carlo integration (see next Section) is that it allows analytic differentiation of qmin with respect to the ˆ ˜ parameters µ and Σ (Cunningham et al. [sent-241, score-0.325]
50 1 A N ALTERNATIVE : S AMPLING An alternative to EP is Monte Carlo integration: sample S functions exactly from the Gaussian belief on p( f ), at cost O(N 2 ) per sample, then find the minimum for each sample in O (N) time. [sent-252, score-0.162]
51 But for relatively 1817 H ENNIG AND S CHULER f −5 −4 −3 −2 −1 0 x 1 2 3 4 5 Figure 4: EP-approximation to pmin (dashed green). [sent-260, score-0.694]
52 EP achieves good agreement with the asymptotically exact Monte Carlo approximation to pmin , including the point masses at the boundaries of the domain. [sent-262, score-0.722]
53 Samples from the belief over possible beliefs after observations at x in blue. [sent-265, score-0.225]
54 For each sampled innovation, the plot also shows the induced innovated pmin (lower sampling resolution as previous plots). [sent-266, score-0.79]
55 4 Predicting Innovation from Future Observations As detailed in Equation (2), the optimal choice of the next H evaluations is such that the expected change in the loss L x is extremal, that is, it effects the biggest possible expected drop in loss. [sent-272, score-0.221]
56 The loss is a function of pmin , which in turn is a function of p( f ). [sent-273, score-0.73]
57 It is another convenient aspect of Gaussian processes that they allow such predictions in analytic form (Hennig, 2011): Let previous observations at X 0 have yielded observations Y 0 . [sent-275, score-0.161]
58 One use of this result is to sample L X by sampling innovations, then evaluating the innovated pmin for each innovation in an inner loop, as described in Section 2. [sent-283, score-0.773]
59 Thus, we require a loss functional that evaluates the information content of innovated beliefs pmin . [sent-291, score-0.827]
60 Under a nonlinear transformation of I, the distribution would not remain 1820 E NTROPY S EARCH uniform belief over the minimum, and diverges toward negative infinity if p approaches a Dirac point distribution. [sent-313, score-0.188]
61 The reader may wonder: What about the alternative idea of maximizing, at each evaluation, entropy relative to the current pmin ? [sent-315, score-0.805]
62 For example, if the current belief puts very low mass on a certain region, an evaluation that has even a small chance of increasing pmin in this region could appear more favorable than an alternative evaluation predicted to have a large effect on regions where the current pmin has larger values. [sent-317, score-1.74]
63 The point is not to just change pmin , but to change it such that it moves away from the base measure. [sent-318, score-0.694]
64 In other words, we can approximate pmin (x) as ˜ pmin (x) ≈ p(xi )N u(xi ) ˆ ˜ ; Zu Zu = u(x) dx; ˜ xi = arg min x − x j . [sent-320, score-1.441]
65 6 First-Order Approximation to L Since EP provides analytic derivatives of pmin with respect to mean and covariance of the Gaussian measure over f , we can construct a first order expansion of the expected change in loss from evaluations. [sent-325, score-0.951]
66 To do so, we consider, in turn, the effect of evaluations at X on the measure on f , the induced change in pmin , and finally the change in L . [sent-326, score-0.862]
67 ∆L X = L p0 + min ∂2 pmin ∂pmin ∆ΣX (xi , x j ) + ˜ ˜ ∆µX,1 (xi )∆µX,1 (x j ) ˜ ˜ ∂Σ(xi , x j ) ˜ ˜ ∂µi ∂µ j ∂pmin + ∆X,Ω µ(xi ) + O ((∆µ)2 , (∆Σ)2 ) N (Ω; 0, 1) dΩ − L [p0 ]. [sent-329, score-0.694]
68 So, while b here does not represent prior knowledge on xmin per se, it does provide a unit of measure to information and as such is nontrivial. [sent-332, score-0.292]
69 5) from two additional evaluations to the three old evaluations shown in previous plots. [sent-334, score-0.262]
70 7 Greedy Planning, and its Defects The previous sections constructed a means to predict, approximately, the expected drop in loss from H new evaluations at locations X = {xi }i=1,. [sent-349, score-0.261]
71 It may seem pointless to construct an optimization algorithm which itself contains an optimization problem, but note that this new optimization problem is quite different from the initial one. [sent-354, score-0.165]
72 It is a numerical optimization problem, of the form described in Section 1: We can evaluate the utility 1822 E NTROPY S EARCH function numerically, without noise, with derivatives, and at hopefully relatively low cost compared to the physical process we are ultimately trying to optimize. [sent-355, score-0.169]
73 Nevertheless, one issue remains: Optimizing evaluations over the entire horizon H is a dynamic programming problem, which, in general, has cost exponential in H. [sent-356, score-0.188]
74 However, this problem has a particular structure: Apart from the fact that evaluations drawn from Gaussian process measures are exchangeable, there is also other evidence that optimization problems are benign from the point of view of planning. [sent-357, score-0.271]
75 1 D ERIVATIVE O BSERVATIONS Gaussian process inference remains analytically tractable if instead of, or in addition to direct observations of f , we observe the result of any linear operator acting on f . [sent-372, score-0.162]
76 Covariances between the function and its derivatives are simply given by cov ∂n f (x) ∂m f (x′ ) , ∏i ∂xi ∏ j ∂x′j = ∂n+m k(x, x′ ) , ∏i ∂xi ∏ j ∂x′j so kernel evaluations simply have to be replaced with derivatives (or integrals) of the kernel where required. [sent-376, score-0.289]
77 Hence, all results derived in previous sections for optimization from function evaluations can trivially be extended to optimization from function and derivative observations, or from only derivative observations. [sent-378, score-0.241]
78 To still be able to use the algorithm derived so far, we approximate this belief with a single Gaussian process by calculating expected values of mean and covariance function. [sent-395, score-0.251]
79 1) 2q ˆ ˆmin ˆmin 4: [qmin (x), ∂qmin , ∂∂µ∂µ , ∂q∂Σ x ] EP(µ, Σ) ˆ ˜ ∂µ ⊲ approximate pmin (2. [sent-455, score-0.694]
80 To choose where to evaluate next, we first sample discretization points from u, then calculate the current Gaussian belief over f on the discretized domain, along with its derivatives. [sent-457, score-0.199]
81 We construct an approximation to the belief over the minimum using Expectation Propagation, again with derivatives. [sent-458, score-0.19]
82 In this section, we compare the resulting algorithm to two Gaussian process global optimization heuristics: Expected Improvement, Probability of Improvement (Section 1. [sent-466, score-0.169]
83 of improvement expected improvement entropy search ˆ |fmin − fmin | 100 10−1 10−2 10−3 10−4 10−5 0 10 20 30 40 50 60 # of evaluations 70 80 90 100 Figure 10: Distance of function value at optimizers’ best guess for xmin from true global minimum. [sent-491, score-0.817]
84 After each function evaluation, the algorithms were asked to return a best guess for the minimum xmin . [sent-508, score-0.276]
85 The Gaussian process based methods returned the global minimum of the mean belief over the function (found by local optimization with random restarts). [sent-510, score-0.331]
86 of improvement expected improvement entropy search |xmin − xmin | ˆ 10−1 10−2 10−3 10−4 0 10 20 30 40 50 60 # of evaluations 70 80 90 100 Figure 11: Euclidean distance of optimizers’ best guess for xmin from truth. [sent-513, score-0.894]
87 The flattening out of the error of all three global optimizers toward the right is due to evaluation noise (recall that evaluations include Gaussian noise of standard deviation 10−3 ). [sent-518, score-0.394]
88 of improvement expected improvement entropy search 250 regret 200 150 100 50 0 0 10 20 30 40 50 60 # of evaluations 70 80 90 100 Figure 12: Regret as a function of number of evaluations. [sent-531, score-0.505]
89 2 that the bandit setting differs considerably from the kind of optimization discussed in this paper, because bandit algorithms try to minimize regret, rather than improve an estimate of the function’s optimum. [sent-534, score-0.276]
90 The intuition here is that this heuristic focuses evaluations on regions known to give low function values. [sent-537, score-0.159]
91 Conclusion This paper presented a new probabilistic paradigm for global optimization, as an inference problem on the minimum of the function, rather than the problem of collecting iteratively lower and lower function values. [sent-580, score-0.158]
92 We argue that this description is closer to practitioners’ requirements than classic response surface optimization, bandit algorithms, or other, heuristic, global optimization al1831 H ENNIG AND S CHULER 101 GP UCB prob. [sent-581, score-0.235]
93 of improvement expected improvement entropy search ˆ |fmin − fmin | 100 10−1 10−2 10−3 10−4 0 10 20 30 40 50 60 # of evaluations 70 80 90 100 Figure 14: Function value error, off-model tasks. [sent-582, score-0.52]
94 of improvement expected improvement entropy search 10−1 10−2 10−3 0 10 20 30 40 50 60 # of evaluations Figure 15: Error on xmin , off-model tasks. [sent-584, score-0.653]
95 of improvement expected improvement entropy search 200 regret 150 100 50 0 0 10 20 30 40 50 60 # of evaluations 70 80 90 100 Figure 16: Regret, off-model tasks. [sent-586, score-0.505]
96 The Gaussian belief allows analytic probabilistic predictions of the effect of future data points, from which we constructed a first-order approximation of the expected change in relative entropy of pmin to a base measure. [sent-589, score-1.101]
97 For completeness, we also pointed out some already known analytic properties of Gaussian process measures that can be used to generalize this algorithm. [sent-590, score-0.172]
98 pmin (x) = ˜ p( f ) ∏ θ( f (x) − f (x)) d f x=x ˜ N ≡ lim N ∞ |xi −xi−1 |·N m(x) p[ f ({xi }i=1,. [sent-611, score-0.694]
99 Gaussian process optimization in the bandit setting: No regret and experimental design. [sent-879, score-0.269]
100 An interior algorithm for nonlinear optimization that combines line search and trust region steps. [sent-916, score-0.238]
wordName wordTfidf (topN-words)
[('pmin', 0.694), ('xmin', 0.206), ('chuler', 0.17), ('ennig', 0.17), ('ntropy', 0.166), ('evaluations', 0.131), ('earch', 0.129), ('belief', 0.127), ('qmin', 0.121), ('kx', 0.114), ('entropy', 0.111), ('bandit', 0.098), ('ep', 0.098), ('gaussian', 0.098), ('gp', 0.096), ('analytic', 0.087), ('optimizers', 0.08), ('fmin', 0.073), ('discretization', 0.072), ('improvement', 0.068), ('locations', 0.067), ('srinivas', 0.062), ('ucb', 0.062), ('beliefs', 0.061), ('hennig', 0.061), ('process', 0.058), ('regret', 0.058), ('horizon', 0.057), ('global', 0.056), ('optimization', 0.055), ('xi', 0.053), ('bfgs', 0.052), ('zu', 0.052), ('location', 0.051), ('monte', 0.051), ('regular', 0.05), ('prior', 0.049), ('carlo', 0.048), ('kernel', 0.048), ('region', 0.043), ('puts', 0.043), ('innovation', 0.043), ('search', 0.042), ('evaluation', 0.042), ('innovations', 0.041), ('ective', 0.041), ('mat', 0.041), ('propagation', 0.041), ('heuristics', 0.041), ('inference', 0.04), ('trust', 0.04), ('representer', 0.039), ('covariance', 0.039), ('scales', 0.039), ('rational', 0.038), ('observations', 0.037), ('measure', 0.037), ('adler', 0.036), ('coleman', 0.036), ('cunningham', 0.036), ('innovated', 0.036), ('lik', 0.036), ('lkl', 0.036), ('thode', 0.036), ('rasmussen', 0.036), ('integrals', 0.036), ('loss', 0.036), ('guess', 0.035), ('minimum', 0.035), ('quadrature', 0.034), ('stochastic', 0.032), ('grid', 0.032), ('sampled', 0.031), ('ordinate', 0.031), ('derivatives', 0.031), ('integration', 0.031), ('toward', 0.031), ('nonlinear', 0.03), ('minima', 0.03), ('kernels', 0.03), ('resolution', 0.029), ('utility', 0.029), ('interior', 0.028), ('byrd', 0.028), ('regions', 0.028), ('approximation', 0.028), ('expected', 0.027), ('ptimization', 0.027), ('noise', 0.027), ('numerical', 0.027), ('probabilistic', 0.027), ('tractable', 0.027), ('mass', 0.027), ('measures', 0.027), ('density', 0.026), ('readers', 0.026), ('classic', 0.026), ('planning', 0.026), ('utilities', 0.026), ('considerably', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
2 0.082664102 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
3 0.076567374 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efÄ?Ĺš cient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to conÄ?Ĺš gure neural networks and deep belief networks. Compared with neural networks conÄ?Ĺš gured by a pure grid search, we Ä?Ĺš nd that random search over the same domain is able to Ä?Ĺš nd models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search Ä?Ĺš nds better models by effectively searching a larger, less promising conÄ?Ĺš guration space. Compared with deep belief networks conÄ?Ĺš gured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional conÄ?Ĺš guration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for conÄ?Ĺš guring algorithms for new data sets. Our analysis casts some light on why recent “High Throughputâ€? methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
4 0.076324902 59 jmlr-2012-Linear Regression With Random Projections
Author: Odalric-Ambrym Maillard, Rémi Munos
Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
5 0.063145719 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding
Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression
6 0.054732706 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
7 0.052285619 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
8 0.050613903 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
9 0.050202318 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
10 0.050182886 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
11 0.049546145 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
12 0.049048774 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
13 0.046656035 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
14 0.045756567 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
15 0.043675758 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
16 0.04357091 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
17 0.042200316 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
18 0.042121582 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
19 0.040218368 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
20 0.039138824 97 jmlr-2012-Regularization Techniques for Learning with Matrices
topicId topicWeight
[(0, -0.199), (1, 0.017), (2, 0.095), (3, -0.06), (4, 0.093), (5, -0.0), (6, -0.002), (7, 0.019), (8, -0.203), (9, 0.066), (10, -0.037), (11, -0.027), (12, -0.1), (13, 0.142), (14, -0.039), (15, 0.078), (16, -0.013), (17, 0.019), (18, -0.033), (19, 0.039), (20, 0.072), (21, -0.038), (22, -0.02), (23, 0.107), (24, -0.022), (25, 0.032), (26, -0.025), (27, 0.018), (28, -0.124), (29, 0.022), (30, 0.011), (31, 0.041), (32, 0.129), (33, 0.097), (34, 0.016), (35, 0.009), (36, -0.064), (37, -0.048), (38, 0.117), (39, 0.106), (40, 0.056), (41, 0.138), (42, -0.042), (43, -0.121), (44, -0.206), (45, -0.11), (46, -0.059), (47, 0.072), (48, 0.135), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.91295326 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
2 0.55293834 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding
Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression
3 0.49051857 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efÄ?Ĺš cient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to conÄ?Ĺš gure neural networks and deep belief networks. Compared with neural networks conÄ?Ĺš gured by a pure grid search, we Ä?Ĺš nd that random search over the same domain is able to Ä?Ĺš nd models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search Ä?Ĺš nds better models by effectively searching a larger, less promising conÄ?Ĺš guration space. Compared with deep belief networks conÄ?Ĺš gured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional conÄ?Ĺš guration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for conÄ?Ĺš guring algorithms for new data sets. Our analysis casts some light on why recent “High Throughputâ€? methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
4 0.42274368 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
5 0.41886318 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.37573254 59 jmlr-2012-Linear Regression With Random Projections
7 0.36416218 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
8 0.34615085 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
9 0.34245121 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
10 0.32043394 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
11 0.31367737 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
12 0.31132248 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
13 0.29926264 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
14 0.29397249 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
15 0.28920603 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
16 0.27755639 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
17 0.27751783 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
18 0.27725059 103 jmlr-2012-Sampling Methods for the Nyström Method
19 0.26530442 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
20 0.26028034 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
topicId topicWeight
[(7, 0.013), (21, 0.042), (26, 0.045), (29, 0.496), (35, 0.023), (49, 0.028), (56, 0.015), (57, 0.015), (75, 0.04), (77, 0.014), (79, 0.011), (92, 0.087), (96, 0.076)]
simIndex simValue paperId paperTitle
1 0.9213782 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010). Keywords: causality, graphical models, randomized experiments, structural equation models, latent variables, latent confounders, cycles
same-paper 2 0.9104867 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
3 0.844585 4 jmlr-2012-A Kernel Two-Sample Test
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distributionfree tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests. ∗. †. ‡. §. Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK. This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universit¨ t M¨ nchen. a u This work was carried out while M.J.R. was with the Graz University of Technology. Also at The Australian National University, Canberra, ACT 0200, Australia. c 2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch¨ lkopf and Alexander Smola. o ¨ G RETTON , B ORGWARDT, R ASCH , S CH OLKOPF AND S MOLA Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, integral probability metric, hypothesis testing
4 0.54110348 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data ; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causallyinspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
5 0.51732314 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
6 0.50606847 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
7 0.47719601 100 jmlr-2012-Robust Kernel Density Estimation
8 0.47396073 109 jmlr-2012-Stability of Density-Based Clustering
9 0.46310532 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
10 0.46058637 82 jmlr-2012-On the Necessity of Irrelevant Variables
11 0.45353311 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
12 0.4494426 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
13 0.44534367 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
14 0.4415803 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
15 0.44058847 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
16 0.4309729 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
17 0.42459935 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
18 0.41728577 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
19 0.41455257 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
20 0.41284561 72 jmlr-2012-Multi-Target Regression with Rule Ensembles