jmlr jmlr2011 jmlr2011-18 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuum-armed bandit, Bayesian optimization
Reference: text
1 Statistical Laboratory, University of Cambridge, Cambridge CB3 0WB, UK. Editor: Manfred Opper.
6 We therefore need a global optimization algorithm, one which attempts to find a global minimum.
7 At time n, we choose a design point x_n ∈ X, make an observation z_n = f(x_n), and then report a point x*_n where we believe f(x*_n) will be low.
8 Our goal is to find a strategy for choosing the x_n and x*_n, in terms of previous observations, so as to minimize f(x*_n).
9 We would like to find a strategy which can guarantee convergence: for functions f in some smoothness class, f(x*_n) should tend to min f, preferably at some fast rate.
10 A naive strategy would be to fix a sequence of x_n in advance, and set x*_n = arg min f̂_n, for some approximation f̂_n to f.
11 However, while this strategy gives a good worst-case bound, on average it is clearly a poor method of optimization: the design points x_n are completely independent of the observations z_n.
12 (…, 1993), but perhaps the best-known strategy is expected improvement.
13 Vazquez and Bect (2010) show that when π is a fixed Gaussian-process prior of finite smoothness, expected improvement converges on the minimum of any f ∈ H, and almost surely for f drawn from π.
14 Grünewälder et al. (2010) bound the convergence rate of a computationally infeasible version of expected improvement: for priors π of smoothness ν, they show convergence at a rate O*(n^(−(ν∧0.5)/d)).
15 We begin by bounding the convergence rate of the feasible algorithm, and show convergence at a rate O*(n^(−(ν∧1)/d)) on all f ∈ H.
16 We go on to show that a modification of expected improvement converges at the near-optimal rate O*(n^(−ν/d)).
17 …2) shows that, for a Brownian motion prior with estimated parameters, expected improvement may not converge at all.
18 We extend this result to more general settings, showing that for standard priors with estimated parameters, there exist smooth functions f on which expected improvement does not converge.
19 Expected Improvement. Suppose we wish to minimize an unknown function f, choosing design points x_n and estimated minima x*_n as in the introduction.
20 The strategy u is valid if x_n is conditionally independent of f given F_{n−1}, and likewise x*_n given F_n.
21 When taking probabilities and expectations we will write P^u_π and E^u_π, denoting the dependence on both the prior π and the strategy u.
22 The average-case performance at some future time N is then given by the expected loss, E^u_π[f(x*_N) − min f], and our goal, given π, is to choose the strategy u to minimize this quantity.
23 First, we restrict the choice of x*_n to the previous design points x_1, …, x_n.
24 (In practice this is reasonable, as choosing an x*_n we have not observed can be unreliable.)
25 Secondly, rather than finding an optimal strategy for the problem, we derive the myopic strategy: the strategy which is optimal if we always assume we will stop after the next observation.
26 In this setting, given F_n, if we are to stop at time n we should choose x*_n := x_{i*_n}, where i*_n := arg min_{i≤n} z_i.
27 Were we to observe at x_{n+1} before stopping, the expected loss would be E^u_π[z*_{n+1} − min f | F_n], so the myopic strategy should choose x_{n+1} to minimize this quantity.
28 Equivalently, it should maximize the expected improvement over the current loss, EI_n(x_{n+1}; π) := E^u_π[z*_n − z*_{n+1} | F_n] = E^u_π[(z*_n − z_{n+1})+ | F_n], (1) where x+ = max(x, 0).
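Under a Gaussian posterior with mean f̂_n(x) and standard deviation s_n(x), the expectation in (1) has a standard closed form, EI = s·τ(u) with τ(u) = uΦ(u) + φ(u) and u = (z*_n − f̂_n(x))/s_n(x). A minimal sketch (the function names are ours):

```python
import math

def norm_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(z_best, mean, sd):
    """EI = sd * tau(u), tau(u) = u*Phi(u) + phi(u), u = (z_best - mean)/sd,
    where z_best is z*_n, the best observation so far."""
    if sd <= 0.0:
        # degenerate posterior: improvement is deterministic
        return max(z_best - mean, 0.0)
    u = (z_best - mean) / sd
    return sd * (u * norm_cdf(u) + norm_pdf(u))
```

The candidate x_{n+1} is then whichever point maximizes this quantity over X.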
Section 1:
f: unknown function X → R to be minimized
X: compact subset of R^d to minimize over
d: number of dimensions to minimize over
x_n: points in X at which f is observed
z_n: observations z_n = f(x_n) of f
x*_n: estimated minimum of f, given z_1, …, z_n

Section 2.1:
π: prior distribution for f
u: strategy for choosing x_n, x*_n
F_n: filtration F_n = σ(x_i, z_i : i ≤ n)
z*_n: best observation z*_n = min_{i=1,…,n} z_i
EI_n: expected improvement given F_n

Section 2.2:
μ, σ²: global mean and variance of the Gaussian-process prior π
K: underlying correlation kernel for π
K_θ: correlation kernel for π with length-scales θ
ν, α: smoothness parameters of K
μ̂_n, f̂_n, s²_n, R²_n: quantities describing the posterior distribution of f given F_n

Section 2.3:
EI(π): expected improvement strategy with fixed prior
σ̂²_n, θ̂_n: estimates of prior parameters σ², θ
c_n: rate of decay of σ̂²_n
θ^L, θ^U: bounds on θ̂_n
EI(π̂): expected improvement strategy with estimated prior

Section 3.3:
EI(π̃): expected improvement strategy with robust estimated prior

Section 3.4:
EI(·, ε): ε-greedy expected improvement strategies

Table 1: Notation
35 For a prior π as above, expected improvement chooses x_{n+1} to maximize (8), but this does not fully define the strategy.
36 Firstly, we must describe how the strategy breaks ties, when more than one x ∈ X maximizes EI_n.
37 …2) find that expected improvement can be unreliable given few data points, and recommend that several initial design points be chosen in a random quasi-uniform arrangement.
38 An EI(π) strategy chooses: (i) initial design points x_1, …
39 While these can be fixed in advance, doing so requires us to specify characteristic scales of the unknown function f, and causes expected improvement to behave differently on a rescaling of the same function.
40 Let π̂_n be a sequence of priors, with parameters σ̂_n, θ̂_n satisfying: (i) σ̂²_n = c_n R²_n(θ̂_n) for constants c_n > 0, c_n = o(1/log n); and (ii) θ^L ≤ θ̂_n ≤ θ^U for constants θ^L, θ^U ∈ R^d_+.
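These constraints are mechanical to impose on top of any estimator. A schematic sketch, where R2 is an opaque callable standing in for the paper's R²_n, and the particular choice c_n = c0/(log n)², which is o(1/log n), is ours:

```python
import math

def robust_parameters(theta_hat, R2, n, theta_L, theta_U, c0=1.0):
    """Clamp each length-scale estimate into [theta_L, theta_U], then set
    sigma2_n = c_n * R2_n(theta_n) with c_n = c0 / (log n)^2 = o(1/log n).
    The choice of c_n (and c0) is illustrative, not from the paper."""
    theta = [min(max(t, lo), hi) for t, lo, hi in zip(theta_hat, theta_L, theta_U)]
    c_n = c0 / math.log(n) ** 2 if n > 1 else c0
    return c_n * R2(theta), theta
```

Any maximum-likelihood estimate θ̂_n can be fed through this unchanged; only out-of-range length-scales and the variance are adjusted.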
41 We define the loss suffered over the ball B_R in H_θ(X) after n steps by a strategy u, L_n(u, H_θ(X), R) := sup_{‖f‖_{H_θ(X)} ≤ R} E^u[f(x*_n) − min f].
42 Note that we do not allow u to vary with R; the strategy must achieve this rate without prior knowledge of ‖f‖_{H_θ(X)}.
43 If ν < ∞, then for any θ ∈ R^d_+, R > 0, inf_u L_n(u, H_θ(X), R) = Θ(n^(−ν/d)), and this rate can be achieved by a strategy u not depending on R.
44 The upper bound is provided by a naive strategy as in the introduction: we fix a quasi-uniform sequence x_n in advance, and take x*_n to minimize a radial basis function interpolant of the data.
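A minimal sketch of that naive strategy in one dimension; the Gaussian kernel, its scale, the jitter term, and the test function are our illustrative choices, not the paper's:

```python
import numpy as np

def rbf_interpolant(X, z, scale=0.2):
    """Radial basis function interpolant f_hat of the data (X, z)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    w = np.linalg.solve(np.exp(-sq / (2 * scale**2)) + 1e-10 * np.eye(len(X)), z)
    return lambda x: np.exp(-((X - x) ** 2).sum(-1) / (2 * scale**2)) @ w

f = lambda x: (x - 0.3) ** 2            # "unknown" function, for illustration
X = np.linspace(0.0, 1.0, 9)[:, None]   # fixed quasi-uniform design, independent of z
z = np.array([f(x[0]) for x in X])
f_hat = rbf_interpolant(X, z)
grid = np.linspace(0.0, 1.0, 101)[:, None]
x_star = grid[np.argmin([f_hat(x) for x in grid])]  # reported minimum x*_n
```

The design never reacts to the observations; all adaptivity is in the final minimization of the interpolant.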
45 As remarked previously, however, this naive strategy is not very satisfying; in practice it will be outperformed by any good strategy varying with the data.
46 One such strategy is the EI(π) strategy of Definition 1.
47 We can show this strategy converges at least at rate n^(−(ν∧1)/d), up to log factors.
48 A good optimization strategy should be able to minimize such functions, and we must ask why expected improvement fails.
49 To ensure we fully explore f, we will therefore require that when our strategy is applied to a constant function f(x) = c, it produces a sequence x_n dense in X.
50 An EI(π̃) strategy satisfies Definition 2, except: (i) we instead set σ̂²_n = R²_n(θ̂_n); and (ii) we require the choice of x_{n+1} maximizing (8) to be such that, if f is constant, the design points are almost surely dense in X.
51 Conclusions. We have shown that expected improvement can converge near-optimally, but a naive implementation may not converge at all.
52 Algorithms controlling the cumulative regret at rate r_n also solve the optimization problem, at rate r_n/n (Bubeck et al.).
53 Firstly, the cumulative regret is necessarily increasing, so cannot establish rates of optimization faster than n^(−1).
54 It may be that convergence rates alone are not sufficient to capture the performance of a global optimization algorithm, and the time taken to find a basin of attraction is more relevant.
55 …subject to μ̂ + g(x_i) = z_i, 1 ≤ i ≤ n; (ii) the prediction error satisfies |f(x) − f̂_n(x; θ)| ≤ s_n(x; θ) ‖g‖_{H_θ(S)}, with equality for some g ∈ H_θ(S).
56 We will argue that, given n observations, we cannot distinguish between all the ψ_m, and thus cannot accurately pick a minimum x*_n.
57 Suppose f = 0, and let x_n and x*_n be chosen by any valid strategy u.
58 As the minimax loss is non-increasing in n, for (2(k−1))^d/2 ≤ n < (2k)^d/2 we conclude inf_u L_n(u, H_θ(X), R) ≥ inf_u L_{(2k)^d/2 − 1}(u, H_θ(X), R) ≥ inf_u sup_m E^u_{ψ_m}[f(x*_{(2k)^d/2 − 1}) − min f] ≥ (1/2) C (2k)^(−ν) = Ω(n^(−ν/d)).
59 For the upper bound, consider a strategy u choosing a fixed sequence x_n, independent of the z_n.
60 Fit a radial basis function interpolant f̂_n to the data, and pick x*_n to minimize f̂_n.
61 …5), for suitable radial basis functions the error is uniformly bounded by sup_{‖f‖_{H_θ(X)} ≤ R} ‖f̂_n − f‖_∞ = O(h_n^ν), where the mesh norm h_n := sup_{x∈X} min_{i≤n} ‖x − x_i‖.
62 As X is bounded, we may choose the x_n so that h_n = O(n^(−1/d)), giving L_n(u, H_θ(X), R) = O(n^(−ν/d)).
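The mesh norm h_n is straightforward to evaluate for a finite design by approximating the sup over a dense sample of X; a small sketch:

```python
import numpy as np

def mesh_norm(design, domain_samples):
    """h_n = sup_{x in X} min_{i<=n} ||x - x_i||, with the sup taken over a
    dense finite sample of the domain X."""
    d = np.linalg.norm(domain_samples[:, None, :] - design[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

# a uniform grid of 11 points in [0, 1] has mesh norm half the spacing 0.1
design = np.linspace(0.0, 1.0, 11)[:, None]
domain = np.linspace(0.0, 1.0, 1001)[:, None]
h = mesh_norm(design, domain)  # 0.05
```

For a uniform grid of n points in [0, 1]^d, the spacing scales as n^(−1/d), which is exactly the h_n = O(n^(−1/d)) rate used above.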
63 For any k ∈ N, and sequences x_n ∈ X, θ_n ≥ θ, the inequality s_n(x_{n+1}; θ_n) ≥ C′ k^(−(ν∧1)/d) (log k)^β holds for at most k distinct n.
64 We may thus conclude |1 − K(x)| = |K(x) − K(0)| = O(‖x‖^(2(ν∧1)) (−log ‖x‖)^(2β)), and s²_n(x; θ_n) ≤ C² ‖x − x_i‖^(2(ν∧1)) (−log ‖x − x_i‖)^(2β), for a constant C > 0 depending only on X, K and θ.
65 Hence as there are k balls, at most k points x_{n+1} can satisfy s_n(x_{n+1}; θ_n) ≥ C′ k^(−(ν∧1)/d) (log k)^β.
66 Next, we provide bounds on the expected improvement when f lies in the RKHS.
67 For x ∈ X, n ∈ N, set I = (f(x_n) − f(x))+, and s = s_n(x; θ).
68 We will use the above bounds to show that there must be times n_k when the expected improvement is low, and thus f(x*_{n_k}) is close to min f.
69 From Lemma 7 there exists C > 0, depending on X, K and θ, such that for any sequence x_n ∈ X and k ∈ N, the inequality s_n(x_{n+1}; θ) > C k^(−(ν∧1)/d) (log k)^β holds at most k times.
70 Thus there is a time n_k, k ≤ n_k ≤ 3k, for which s_{n_k}(x_{n_k+1}; θ) ≤ C k^(−(ν∧1)/d) (log k)^β and z*_{n_k} − f(x_{n_k+1}) ≤ 2Rk^(−1).
71 For k large, x_{n_k+1} will have been chosen by expected improvement (rather than being an initial design point, chosen at random).
72 Given θ^L, θ^U ∈ R^d_+, pick sequences x_n ∈ X, θ^L ≤ θ_n ≤ θ^U.
73 Then for open S ⊂ X, sup_{x∈S} s_n(x; θ_n) = Ω(n^(−ν/d)), uniformly in the sequences x_n, θ_n.
74 …, x_n, there must be some ψ_m such that ψ_m(x_i) = 0, 1 ≤ i ≤ n.
75 Thus for x ∈ T minimizing ψ_m, s_n(x; θ_n) ≥ C^(−1) s_n(x; θ_n) ‖ψ_m‖_{H_{θ_n}(X)} ≥ C^(−1) |ψ_m(x) − 0| = Ω(k^(−ν)).
76 As s_n(x; θ) is non-increasing in n, for (1/2)(2(k−1))^d < n ≤ (1/2)(2k)^d we obtain sup_{x∈S} s_n(x; θ_n) ≥ sup_{x∈S} s_{(1/2)(2k)^d}(x; θ_n) = Ω(k^(−ν)) = Ω(n^(−ν/d)).
77 Next, we bound the expected improvement when prior parameters are estimated by maximum likelihood.
78 Set I_n(x) = z*_n − f(x), s_n(x) = s_n(x; θ̂_n), and t_n(x) = I_n(x)/s_n(x).
79 Suppose: (i) for some i < j, z_i ≠ z_j; (ii) for some T_n → −∞, t_n(x_{n+1}) ≤ T_n whenever s_n(x_{n+1}) > 0; (iii) I_n(y_{n+1}) ≥ 0; and (iv) for some C > 0, s_n(y_{n+1}) ≥ e^(−C/c_n).
80 Then if s_n(x) > 0, for some u_n(x) with |u_n(x) − t_n(x)| ≤ S, EI_n(x; π̂_n) = σ̂_n s_n(x) τ(u_n(x)/σ̂_n), as in the proof of Lemma 8.
81 …, x_n}, so EI_n(x_{n+1}; π̂_n) = 0 < EI_n(y_{n+1}; π̂_n).
82 When s_n(x_{n+1}) > 0, as τ is increasing we may upper bound EI_n(x_{n+1}; π̂_n) using u_n(x_{n+1}) ≤ T_n + S, and lower bound EI_n(y_{n+1}; π̂_n) using u_n(y_{n+1}) ≥ −S.
83 Since s_n(x_{n+1}) ≤ 1, and τ(x) = Θ(x^(−2) e^(−x²/2)) as x → −∞ (Abramowitz and Stegun, 1965, §7…).
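The asymptotic rate of τ quoted here is easy to check numerically, taking τ(x) = xΦ(x) + φ(x) as in the closed form for expected improvement; the numerical check itself is ours:

```python
import math

def tau(x):
    """tau(x) = x * Phi(x) + phi(x), as in the EI closed form."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * Phi + phi

# tau(x) / (x^-2 * exp(-x^2/2)) should stay bounded away from 0 and infinity
# as x -> -infinity; the limiting constant is 1/sqrt(2*pi) ~ 0.399
ratios = [tau(x) / (x**-2 * math.exp(-0.5 * x * x)) for x in (-4.0, -5.0, -6.0)]
```

This is the Mills-ratio expansion of Φ at work: the leading terms of xΦ(x) and φ(x) cancel, leaving τ(x) ∼ φ(x)/x².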
84 Let the EI(π) strategy choose initial design points x_1, …
85 Suppose x_n ∈ V1 infinitely often, so the z_n are not all equal.
86 By Lemma 7, s_n(x_{n+1}; θ̂_n) → 0, so on a subsequence with x_{n+1} ∈ V1 we have t_n = (z*_n − f(x_{n+1}))/s_n(x_{n+1}; θ̂_n) = −s_n(x_{n+1}; θ̂_n)^(−1) → −∞ whenever s_n(x_{n+1}; θ̂_n) > 0.
87 However, by Lemma 9, there are points y_n ∈ V0 with z*_n − f(y_{n+1}) = 0, and s_n(y_{n+1}; θ̂_n) = Ω(n^(−ν/d)).
88 Hence, on A, there is a random variable T taking values in N, for which n > T ⟹ x_n ∈ V1.
89 As in the proof of Theorem 2, we will show there are times n_k when the expected improvement is small, so f(x_{n_k}) must be close to the minimum.
90 If the z_n are all equal, then by assumption the x_n are dense in X, so f is constant, and the result is trivial.
91 As in the proof of Theorem 2, we have a constant C > 0, and some n_k, k ≤ n_k ≤ 3k, for which z*_{n_k} − f(x_{n_k+1}) ≤ 2Rk^(−1) and s_{n_k}(x_{n_k+1}; θ̂_{n_k}) ≤ C k^(−α) (log k)^β.
92 …random variables, distributed uniformly over X, and define their mesh norm, h_n := sup_{x∈X} min_{i≤n} ‖x − x_i‖.
93 Let I_m be the indicator function of the event {x_i ∉ X_m : 1 ≤ i ≤ ⌊γn log n⌋}, and define μ_n = E[∑_m I_m] = n E[I_0] = n(1 − 1/n)^⌊γn log n⌋ ∼ n e^(−γ log n) = n^(−(γ−1)).
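The asymptotic step μ_n ∼ n^(−(γ−1)) can be sanity-checked numerically; a small sketch (the values of n and γ are arbitrary):

```python
import math

def mu_n(n, gamma):
    """mu_n = n * (1 - 1/n)^floor(gamma * n * log n): the expected number of
    cells X_m containing none of the first floor(gamma * n * log n) points."""
    return n * (1.0 - 1.0 / n) ** math.floor(gamma * n * math.log(n))

n, gamma = 10_000, 2.0
exact = mu_n(n, gamma)
asymptotic = n ** (-(gamma - 1.0))  # the claimed rate n^{-(gamma-1)}
```

Already at n = 10,000 the two quantities agree to within about one percent.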
94 We will show that the points x_n must be quasi-uniform in X, so posterior variances must be small.
95 …, x_n are chosen uniformly at random, so by a Chernoff bound, P_{EI(·,ε)}(A^c_n) ≤ e^(−εn/16).
96 Finally, let C_n be the event that A_n and B_n occur, and further the mesh norm h_n ≤ C(n/log n)^(−1/d), for the constant C from Lemma 12.
97 …(2003, §6), sup_{x∈X} s_n(x; θ) = O(M(θ) h_n^ν), uniformly in θ, for M(θ) a continuous function of θ.
98 Hence on the event C_n, sup_{x∈X} s_n(x; θ_n) ≤ sup_{x∈X} sup_{θ^L ≤ θ ≤ θ^U} s_n(x; θ) ≤ C″ r_n, for a constant C″ > 0 depending only on X, K, C, θ^L and θ^U.
99 On the event C_n, we have some x_m chosen by expected improvement, n < m ≤ 2n.
100 Gaussian process optimization in the bandit setting: no regret and experimental design.
wordName wordTfidf (topN-words)
[('ein', 0.403), ('xn', 0.286), ('ei', 0.226), ('ates', 0.219), ('lobal', 0.219), ('rd', 0.197), ('sn', 0.17), ('ull', 0.166), ('ptimization', 0.143), ('fficient', 0.143), ('strategy', 0.132), ('xnk', 0.128), ('fn', 0.123), ('onvergence', 0.116), ('eu', 0.112), ('cn', 0.111), ('zn', 0.107), ('improvement', 0.106), ('sobolev', 0.098), ('kx', 0.095), ('tn', 0.091), ('grunewalder', 0.091), ('narcowich', 0.091), ('jones', 0.091), ('event', 0.081), ('hilbert', 0.08), ('nk', 0.079), ('lemma', 0.078), ('regret', 0.076), ('abramowitz', 0.073), ('stegun', 0.073), ('fourier', 0.071), ('pu', 0.067), ('minimize', 0.065), ('rates', 0.065), ('bubeck', 0.064), ('yn', 0.064), ('smoothness', 0.064), ('wendland', 0.062), ('global', 0.062), ('ck', 0.059), ('hn', 0.059), ('xm', 0.059), ('pick', 0.058), ('mesh', 0.056), ('trn', 0.055), ('expected', 0.054), ('prior', 0.049), ('ln', 0.049), ('sup', 0.049), ('rn', 0.049), ('design', 0.047), ('kxi', 0.047), ('pei', 0.047), ('vazquez', 0.047), ('optimization', 0.046), ('radial', 0.046), ('naive', 0.045), ('inf', 0.045), ('bandit', 0.045), ('vaart', 0.045), ('convergence', 0.044), ('zi', 0.043), ('rkhs', 0.043), ('srinivas', 0.042), ('subsequence', 0.042), ('rate', 0.041), ('xi', 0.041), ('im', 0.041), ('log', 0.04), ('ask', 0.039), ('lipschitz', 0.039), ('suppose', 0.037), ('spaces', 0.037), ('bull', 0.036), ('diaconis', 0.036), ('faith', 0.036), ('ginsbourger', 0.036), ('hermitian', 0.036), ('holger', 0.036), ('interpolant', 0.036), ('myopic', 0.036), ('panconesi', 0.036), ('radially', 0.036), ('romeijn', 0.036), ('santner', 0.036), ('snk', 0.036), ('van', 0.036), ('norm', 0.036), ('william', 0.036), ('behaviour', 0.035), ('posterior', 0.035), ('reproducing', 0.035), ('theorem', 0.033), ('estimated', 0.033), ('must', 0.032), ('sutton', 0.032), ('arxiv', 0.032), ('advance', 0.032), ('points', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
2 0.17558026 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
Author: Aad van der Vaart, Harry van Zanten
Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel
3 0.10756797 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
4 0.094506584 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n−1/3 ), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
5 0.08576782 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
6 0.081031732 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
7 0.078224659 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
8 0.072119772 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
9 0.070286557 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
10 0.067777358 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
11 0.064111173 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
12 0.063984051 55 jmlr-2011-Learning Multi-modal Similarity
13 0.06223968 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
14 0.060480427 91 jmlr-2011-The Sample Complexity of Dictionary Learning
15 0.058324516 61 jmlr-2011-Logistic Stick-Breaking Process
16 0.057772331 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
17 0.057114679 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
18 0.056199875 14 jmlr-2011-Better Algorithms for Benign Bandits
19 0.052406542 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
20 0.051297255 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
topicId topicWeight
[(0, 0.278), (1, 0.037), (2, -0.021), (3, 0.046), (4, 0.145), (5, 0.041), (6, -0.047), (7, 0.013), (8, -0.3), (9, 0.085), (10, 0.059), (11, -0.045), (12, -0.042), (13, 0.203), (14, 0.138), (15, 0.019), (16, 0.184), (17, -0.059), (18, 0.15), (19, 0.018), (20, -0.049), (21, -0.202), (22, -0.028), (23, 0.073), (24, -0.021), (25, 0.059), (26, 0.026), (27, 0.161), (28, 0.007), (29, 0.04), (30, -0.118), (31, -0.028), (32, -0.12), (33, 0.047), (34, 0.122), (35, 0.071), (36, 0.079), (37, -0.061), (38, 0.027), (39, 0.081), (40, -0.016), (41, -0.002), (42, 0.061), (43, -0.018), (44, -0.013), (45, 0.056), (46, 0.037), (47, 0.021), (48, 0.093), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.96705765 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
2 0.7638433 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
3 0.54780418 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.52125788 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
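A minimal HOO-style sketch on X = [0, 1], assuming deterministic rewards and binary splitting of intervals. The B-value recursion, confidence width, and smoothness constants `rho` and `nu` follow the description above, but this is a simplification for illustration, not the authors' implementation.

```python
import math

class _Node:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count, self.total = 0, 0.0
        self.children = None
        self.B = float("inf")  # unvisited nodes are maximally optimistic

def hoo_maximize(f, n, rho=0.5, nu=1.0):
    """HOO-style optimistic optimization of f on [0, 1] with n evaluations."""
    root = _Node(0.0, 1.0, 0)
    best_x, best_r = None, -float("inf")
    for t in range(1, n + 1):
        # descend the tree by following the larger B-value
        path, node = [root], root
        while node.children is not None:
            node = max(node.children, key=lambda c: c.B)
            path.append(node)
        x = 0.5 * (node.lo + node.hi)
        r = f(x)
        if r > best_r:
            best_x, best_r = x, r
        node.children = [_Node(node.lo, x, node.depth + 1),
                         _Node(x, node.hi, node.depth + 1)]
        # update counts, means, and B-values back up the selected path
        for nd in reversed(path):
            nd.count += 1
            nd.total += r
            u = (nd.total / nd.count
                 + math.sqrt(2.0 * math.log(t) / nd.count)
                 + nu * rho ** nd.depth)
            nd.B = min(u, max(c.B for c in nd.children))
    return best_x, best_r
```

With `rho = 0.5` the bias term `nu * rho**depth` matches the diameter of each interval, the assumption under which HOO's analysis applies.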
5 0.47479078 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n−1/3 ), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
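The generalization from Voronoï to Laguerre (power) diagrams amounts to subtracting a per-site weight from the squared distance, so that cell boundaries shift toward low-weight sites. A small illustrative helper (names are ours, not the paper's):

```python
def laguerre_cell(x, sites, weights):
    """Index of the Laguerre (power-diagram) cell containing x:
    argmin_i ||x - p_i||^2 - w_i.
    With all weights equal, this reduces to the ordinary Voronoi cell."""
    best_i, best_v = None, float("inf")
    for i, (p, w) in enumerate(zip(sites, weights)):
        v = sum((a - b) ** 2 for a, b in zip(x, p)) - w
        if v < best_v:
            best_i, best_v = i, v
    return best_i
```

Raising a site's weight enlarges its cell, which is the extra degree of freedom the calibration construction exploits.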
6 0.40589982 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
7 0.40329206 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
8 0.39862168 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
9 0.37063262 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.34315732 91 jmlr-2011-The Sample Complexity of Dictionary Learning
11 0.32535127 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
12 0.32428935 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
13 0.32386279 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
14 0.30665872 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
15 0.29462793 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
16 0.28002974 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
17 0.26680306 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
18 0.26585987 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
19 0.26342288 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
20 0.25910425 61 jmlr-2011-Logistic Stick-Breaking Process
topicId topicWeight
[(4, 0.044), (6, 0.022), (9, 0.025), (10, 0.025), (24, 0.029), (31, 0.069), (32, 0.025), (41, 0.046), (60, 0.014), (65, 0.486), (67, 0.011), (70, 0.01), (73, 0.027), (78, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.83621138 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
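The expected-improvement acquisition has a standard closed form under a Gaussian posterior. A minimal sketch for minimization; the `mu`, `sigma`, and `best` arguments are posterior quantities one would obtain from a GP fit, which is not shown here.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - f(x), 0)] with f(x) ~ N(mu, sigma^2).
    mu, sigma: GP posterior mean and sd at x; best: minimum observed so far."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)  # no posterior uncertainty left at x
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)
```

The next evaluation point is chosen by maximizing this quantity over x; the paper's results concern how that choice interacts with the (fixed or estimated) prior.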
2 0.363469 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
Author: Aad van der Vaart, Harry van Zanten
Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel
3 0.34573489 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
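For independent normal rewards, the knowledge-gradient factor of each alternative has a closed form. A sketch under the standard independent-normal model (variable names are ours, and the paper's hierarchical aggregation is not reproduced here):

```python
import math

def _phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def knowledge_gradient(mu, sigma, noise_sd):
    """KG factor for each of k independent normal alternatives.
    mu[i], sigma[i]: posterior mean / sd of alternative i;
    noise_sd: sd of the observation noise."""
    k = len(mu)
    kg = []
    for i in range(k):
        # sd of the change in posterior mean after one sample of i
        sigma_tilde = sigma[i] ** 2 / math.sqrt(sigma[i] ** 2 + noise_sd ** 2)
        if sigma_tilde <= 0.0:
            kg.append(0.0)
            continue
        best_other = max(mu[j] for j in range(k) if j != i)
        zeta = -abs(mu[i] - best_other) / sigma_tilde
        kg.append(sigma_tilde * (zeta * _Phi(zeta) + _phi(zeta)))
    return kg
```

The policy samples the alternative with the largest factor, i.e., the one whose single measurement most improves the expected value of the final selection.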
4 0.34516078 104 jmlr-2011-X-Armed Bandits
Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári
Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates
5 0.33233738 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
6 0.3231633 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
7 0.31867474 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.30891845 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
9 0.30761787 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.30663145 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
11 0.30419621 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
12 0.30089656 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
13 0.29612032 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
14 0.29309952 16 jmlr-2011-Clustering Algorithms for Chains
15 0.29289824 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
17 0.29133388 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
18 0.2907629 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
19 0.29042375 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
20 0.28970084 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes