jmlr jmlr2011 jmlr2011-44 knowledge-graph by maker-knowledge-mining

44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods


Source: pdf

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Journal of Machine Learning Research 12 (2011) 2095-2119 Submitted 12/10; Revised 5/11; Published 6/11 Information Rates of Nonparametric Gaussian Process Methods Aad van der Vaart AAD @ FEW. [sent-1, score-0.596]

2 NL Department of Mathematics VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam The Netherlands Harry van Zanten J . [sent-3, score-0.318]

3 Box 513 5600 MB Eindhoven The Netherlands Editor: Manfred Opper Abstract We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. [sent-9, score-0.429]

4 The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. [sent-11, score-0.209]

5 We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. [sent-12, score-0.194]

6 For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. [sent-13, score-0.206]

7 However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. [sent-14, score-0.32]

8 In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. [sent-15, score-0.485]

9 Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel 1. [sent-16, score-0.22]

10 They are widely used as prior distributions in nonparametric Bayesian learning to predict a response Y ∈ Y from a covariate X ∈ X . [sent-20, score-0.329]
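To make the setup concrete, here is a minimal sketch (not from the paper) of nonparametric GP regression with scikit-learn, contrasting a finite-smoothness Matérn prior with a squared exponential (RBF) prior on the response function. The true function f0, length scales, noise level and sample size are all illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, RBF, WhiteKernel

rng = np.random.default_rng(0)

# Illustrative "true" response function of moderate regularity (not from the paper).
def f0(x):
    return np.abs(x - 0.5) ** 1.5

X = rng.uniform(0.0, 1.0, size=(50, 1))
y = f0(X).ravel() + 0.1 * rng.standard_normal(50)

# Two GP priors on the response function: finite-smoothness Matern vs squared exponential.
priors = {
    "Matern(nu=1.5)":      Matern(length_scale=0.2, nu=1.5) + WhiteKernel(noise_level=0.01),
    "squared exponential": RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01),
}

X_test = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
for name, kernel in priors.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    post_mean = gp.predict(X_test)
    err = np.sqrt(np.mean((post_mean - f0(X_test).ravel()) ** 2))
    print(f"{name:20s}  empirical L2 error of posterior mean: {err:.3f}")
```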

11 , f(x_n)). © 2011 Aad van der Vaart and Harry van Zanten. [sent-26, score-0.914]

12 , Wahba, 1978; Van der Vaart and Van Zanten, 2008a). [sent-32, score-0.278]

13 By Bochner’s theorem the stationary covariance functions on X = Rd correspond one-to-one to spectral distributions (see below for the examples of the squared-exponential and Matérn kernels, or see Rasmussen and Williams, 2006). [sent-34, score-0.182]

14 However, the risks measure the concentration of the full posterior (both location and spread) near the truth, whereas the prediction errors concern the location of the posterior only. [sent-67, score-0.326]

15 If the risk (3) is bounded by ε_n² for some sequence ε_n → 0, then by another application of Jensen’s inequality the posterior mean E(f | X_{1:n}, Y_{1:n}) = ∫ f dΠ_n(f | X_{1:n}, Y_{1:n}) satisfies E_{f0} ‖E(f | X_{1:n}, Y_{1:n}) − f0‖_2² ≤ ε_n². [sent-70, score-0.179]

16 It follows that a bound ε_n² on the posterior risk (3) must satisfy the same fundamental lower bound as the (quadratic) risk of general nonparametric estimators for the regression function f0. [sent-72, score-0.272]

17 Such bounds are usually formulated as minimax results: for a given point estimator (for example the posterior mean) one takes the maximum (quadratic) risk over all f0 in a given “a-priori class” of response functions, and shows that this cannot be smaller than some lower bound (see, e. [sent-73, score-0.288]

18 It is known that if f0 is defined on a compact subset of Rd and has regularity β > 0, then the optimal, minimax rate εn is given by (see, e. [sent-78, score-0.209]

19 Recent findings in the statistics literature show that for GP priors, it is typically true that this optimal rate can only be attained if the regularity of the GP that is used matches the regularity of f0 (see Van der Vaart and Van Zanten, 2008a). [sent-82, score-0.623]

20 (2008) require the true response function f0 to be in the reproducing kernel Hilbert space (RKHS) of the GP prior. [sent-91, score-0.197]

21 , the paper Van der Vaart and Van Zanten, 2008b, which reviews theory on RKHSs that is relevant for Bayesian learning. [sent-98, score-0.278]

22 ) This means that prior draws can approximate any given continuous function arbitrarily closely, suggesting that the posterior distribution should be able to learn any such function f0 , not just the functions in the RKHS. [sent-105, score-0.281]

23 Example 1 (Integrated Brownian motion and Matérn kernels) It is well known that the sample paths x → f (x) of Brownian motion f have regularity 1/2. [sent-106, score-0.329]

24 More generally, it can be shown that draws from a k times integrated Brownian motion have regularity k + 1/2, while elements from its RKHS have regularity k + 1 (cf. [sent-114, score-0.349]
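A quick way to see this regularity hierarchy is to simulate the processes. The sketch below is an assumption-laden discretisation, not code from the paper: it builds Brownian motion on a grid and integrates it k times with Riemann sums, and the maximal one-step increment of the path shrinks as k grows, reflecting the increasing smoothness k + 1/2.

```python
import numpy as np

def integrated_brownian_motion(k, n_grid=10_000, rng=None):
    """Approximate a k-times integrated Brownian motion on [0, 1] on a grid.

    k = 0 is Brownian motion itself (sample paths of regularity 1/2);
    every integration adds one degree of smoothness (regularity k + 1/2).
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / n_grid
    path = np.cumsum(rng.standard_normal(n_grid) * np.sqrt(dt))  # Brownian motion
    for _ in range(k):
        path = np.cumsum(path) * dt                              # Riemann-sum integration
    return path

# Rough proxy for path roughness: the largest increment over one grid step.
for k in (0, 1, 2):
    path = integrated_brownian_motion(k, rng=np.random.default_rng(1))
    print(f"k = {k}: max |increment| = {np.abs(np.diff(path)).max():.2e}")
```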

25 We show in this paper that if the true response function f0 on a compact X ⊂ Rd has regularity β, then for the Matérn kernel with smoothness parameter α the (square) risk (3) decays at the rate n^{−2 min(α,β)/(2α+d)}. [sent-121, score-0.587]
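The rate formula is easy to explore numerically. The snippet below simply evaluates n^{−2 min(α,β)/(2α+d)} for a few prior smoothness values α around a fixed truth smoothness β; the particular numbers are arbitrary.

```python
def matern_risk_rate(n, alpha, beta, d):
    """Rate n**(-2*min(alpha, beta)/(2*alpha + d)) for the Matern(alpha) prior."""
    return n ** (-2.0 * min(alpha, beta) / (2.0 * alpha + d))

n, beta, d = 10_000, 2.0, 1
for alpha in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"alpha = {alpha:4.1f}   risk bound ~ {matern_risk_rate(n, alpha, beta, d):.2e}")
# Undersmoothing (alpha < beta) loses through min(alpha, beta) in the numerator;
# oversmoothing (alpha > beta) loses through 2*alpha + d in the denominator.
# The bound is smallest when alpha equals beta.
```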

26 Because the RKHS of the Matérn(α) prior consists of functions of regularity α + 1/2, it contains functions of regularity β only if β ≥ α + 1/2, and this excludes the case α = β in which the Matérn prior is optimal. [sent-123, score-0.638]

27 For instance, Bayesian learning with a Matérn(α) prior is consistent for any continuous true function f0, not only for f0 of regularity α + 1/2 or higher. [sent-126, score-0.318]

28 Example 2 (Squared exponential kernel) For the squared exponential GP on a compact subset of Rd, every function h in the RKHS has a Fourier transform ĥ that satisfies ∫ |ĥ(λ)|² e^{c‖λ‖²} dλ < ∞ for some c > 0 (see Van der Vaart and Van Zanten, 2009 and Section 3. [sent-128, score-0.522]

29 For instance, the squared exponential process gives very fast learning rates for response functions in its RKHS, but as this is a tiny set of analytic functions, this gives a misleading idea of its performance in genuinely nonparametric situations. [sent-136, score-0.404]

30 We illustrate the general results for the Matérn and squared exponential priors. [sent-145, score-0.194]

31 A third contribution is that although the concrete GP examples that we consider (Matérn and squared exponential) are stationary, our general results are not limited to stationary processes. [sent-147, score-0.185]

32 Underlying our approach are the so-called small deviations behaviour of the Gaussian prior and entropy calculations, following the same basic approach as in our earlier work (Van der Vaart and Van Zanten, 2008a). [sent-150, score-0.378]

33 Last but not least, the particular cases of the Matérn and squared exponential kernels that we investigate illustrate that the performance of Bayesian learning methods using GP priors is very sensitive to the fine properties of the priors used. [sent-153, score-0.32]

34 In particular, the relation between the regularity of the response function and the GP used is crucial. [sent-154, score-0.308]

35 Optimal performance is only guaranteed if the regularity of the prior matches the regularity of the unknown function of interest. [sent-155, score-0.37]

36 For instance, we show that using the squared-exponential prior, in a situation where a Matérn prior would be appropriate, slows the learning rate from polynomial to logarithmic in n. [sent-157, score-0.222]

37 (A function f is Lipschitz of order η if |f(x) − f(y)| ≤ C|x − y|^η, for every x, y; see for instance Van der Vaart and Wellner (1996), Section 2. [sent-177, score-0.306]

38 The next section treats the special cases of the Matérn and squared exponential kernels. [sent-211, score-0.194]

39 The bound depends on the “true” response function f0 and the GP prior Π and its RKHS H through the so-called concentration function φ_{f0}(ε) = inf_{h∈H: ‖h−f0‖_∞<ε} ‖h‖²_H − log Π(f : ‖f‖_∞ < ε) (7) and the associated function ψ_{f0}(ε) = φ_{f0}(ε)/ε². [sent-228, score-0.399]

40 The concentration function φ f0 for a general response function consists of two parts. [sent-230, score-0.303]

41 The second is the small ball exponent φ_0(ε) = − log Π(f : ‖f‖_∞ < ε), which measures the amount of prior mass in a ball of radius ε around the zero function. [sent-231, score-0.309]
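The small ball exponent can be approximated by brute force for a concrete prior. The sketch below estimates Π(‖f‖_∞ < ε) by Monte Carlo for a Matérn-type GP restricted to a finite grid; the Bessel-function kernel parameterisation, length scale, grid size and jitter are illustrative assumptions, and the finite grid only approximates the sup-norm ball.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern_cov(X, nu=1.5, ell=0.3):
    """Matern covariance matrix (standard Bessel-function form, illustrative parameters)."""
    r = cdist(X, X) / ell
    r = np.where(r == 0.0, 1e-12, r)
    arg = np.sqrt(2.0 * nu) * r
    C = (2.0 ** (1.0 - nu) / gamma(nu)) * arg ** nu * kv(nu, arg)
    np.fill_diagonal(C, 1.0)
    return C

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
L = np.linalg.cholesky(matern_cov(grid) + 1e-8 * np.eye(50))

draws = L @ rng.standard_normal((50, 20_000))   # GP sample paths on the grid
sup_norms = np.abs(draws).max(axis=0)           # grid approximation of the sup norm

for eps in (1.0, 0.5, 0.25):
    p = sup_norms[sup_norms < eps].size / sup_norms.size
    print(f"eps = {eps:4.2f}   Pi(||f||_inf < eps) ~ {p:.4f}"
          f"   phi_0(eps) ~ {-np.log(max(p, 1e-12)):.2f}")
```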

42 The first part of the definition of φ f0 (ε), the infimum, measures the decrease in prior mass if the (small) ball is shifted from the origin to the true parameter f0 . [sent-236, score-0.204]

43 This is not immediately clear from the definition (7), but it can be shown that up to constants, φ_{f0}(ε) equals − log Π(f : ‖f − f0‖_∞ < ε) (see for instance Van der Vaart and Van Zanten, 2008b, Lemma 5. [sent-237, score-0.278]

44 The latter closure is the support of the prior (Van der Vaart and Van Zanten, 2008b, Lemma 5. [sent-241, score-0.37]

45 Our general upper bound for the posterior risk in the fixed design case takes the following form. [sent-243, score-0.175]

46 f0 For ψ−1 (n) → 0 as n → ∞, which is the typical situation, the theorem shows that the posterior f0 distribution contracts at the rate ψ−1 (n) around the true response function f0 . [sent-245, score-0.412]

47 (2008), we have expressed the contraction using the quadratic risk, but the concentration is actually exponential. [sent-247, score-0.172]

48 2 we show how to obtain bounds for the concentration function, and hence a risk bound, for two classes of specific priors: the Matérn class and the squared exponential. [sent-259, score-0.307]

49 Other examples, including non-stationary ones like (multiply) integrated Brownian motion, were considered in Van der Vaart and Van Zanten (2008a), Van der Vaart and Van Zanten (2007) and Van der Vaart and Van Zanten (2009). [sent-260, score-0.834]

50 2 The theorem assumes that for some α > 0, draws from the prior are α-regular in Hölder sense. [sent-269, score-0.197]

51 Unlike in the case of fixed design treated in Theorem 1, this theorem makes assumptions on the regularity of the prior. [sent-276, score-0.201]

52 In the next section we shall see that a typical rate for estimating a β-smooth response function f0 is given by ψ_{f0}^{-1}(n) ∼ n^{−(β∧α)/(2α+d)}. [sent-278, score-0.216]

53 In other words, upper bounds for fixed and random design have exactly the same form if prior and true response are not too rough. [sent-281, score-0.241]

54 For very rough priors and true response functions, the rate given by the preceding theorem is slower than the rate for deterministic design, and for very rough response functions the theorem may not give a rate at all. [sent-282, score-0.72]

55 Results for Concrete Priors In this section we specialize to two concrete classes of Gaussian process priors, the Matérn class and the squared exponential process. [sent-285, score-0.194]

56 1 Matérn Priors In this section we compute the risk bounds given by Theorems 1 and 2 for the case of the Matérn kernel. [sent-287, score-0.222]

57 In particular, we show that optimal rates are attained if the smoothness of the prior matches the smoothness of the unknown response function. [sent-288, score-0.349]

58 The Matérn priors correspond to the mean-zero Gaussian processes W = (W_t : t ∈ [0, 1]^d) with covariance function EW_sW_t = ∫_{Rd} e^{iλᵀ(s−t)} m(λ) dλ, defined through the spectral densities m: Rd → R given by, for α > 0, m(λ) = (1 + ‖λ‖²)^{−(α+d/2)}. [sent-289, score-0.176]
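Bochner's correspondence can be checked numerically in d = 1: integrating the spectral density against a cosine gives the covariance at lag s − t. The sketch below is illustrative only: it ignores normalisation constants and uses arbitrary parameter values.

```python
import numpy as np
from scipy.integrate import quad

def matern_spectral_density(lam, alpha, d=1):
    """m(lambda) = (1 + ||lambda||^2)^(-(alpha + d/2)), normalization ignored."""
    return (1.0 + lam ** 2) ** (-(alpha + d / 2.0))

def covariance(h, alpha):
    """E[W_s W_t] at lag h = s - t in d = 1, via the cosine (Bochner) transform of m."""
    val, _ = quad(lambda lam: np.cos(lam * h) * matern_spectral_density(lam, alpha), 0.0, np.inf)
    return 2.0 * val   # m is even, so the integral over R is twice that over (0, inf)

alpha = 1.5
for h in (0.0, 0.1, 0.5, 1.0):
    print(f"|s - t| = {h:3.1f}   E[W_s W_t] ~ {covariance(h, alpha):.4f}")
# Larger alpha puts less spectral mass at high frequencies, giving smoother sample paths.
```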

59 1 of Van der Vaart and Van Zanten (2009) the RKHS H of the process W is the space of all (real parts of) functions of the form h_ψ(t) = ∫ e^{iλᵀt} ψ(λ) m(λ) dλ, (11) for ψ ∈ L²(m), with squared RKHS-norm given by ‖h_ψ‖²_H = min_{φ: h_φ = h_ψ} ∫ |φ(λ)|² m(λ) dλ. [sent-300, score-0.373]

60 In the following two lemmas we describe the concentration function (7) of the Matérn prior. [sent-303, score-0.246]

61 The small ball probability can be obtained from the preceding characterization of the RKHS, estimates of metric entropy, and general results on Gaussian processes. [sent-304, score-0.227]

62 To estimate the infimum in the definition of the concentration function φ f0 for a nonzero response function f0 , we approximate f0 by elements of the RKHS. [sent-308, score-0.303]

63 A natural a-priori condition on the true response function f0 : [0, 1]d → R is that this function is contained in a Sobolev space of order β. [sent-313, score-0.235]

64 Theorems 1 and 2 imply that the rate of contraction of the posterior distribution is of this order in the case of fixed design, and of this order if β > d/2 in the case of random design. [sent-320, score-0.206]

65 Then in the fixed design case the posterior contracts at the rate n^{−(α∧β)/(2α+d)}. [sent-323, score-0.237]
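The same rate can be recovered by numerically inverting ψ_{f0}. The sketch below uses a stylised stand-in for the Matérn concentration function, φ(ε) = ε^{−d/α} + ε^{−(2α+d−2b)/b} with b = α∧β and all constants set to one — an assumption for illustration, not the exact bound from the lemmas — and solves ψ(ε) = φ(ε)/ε² = n with a root finder; the solution tracks n^{−(α∧β)/(2α+d)} up to constants.

```python
import numpy as np
from scipy.optimize import brentq

def psi(eps, alpha, beta, d):
    """Stylized psi(eps) = phi(eps)/eps^2 with phi(eps) = eps^(-d/a) + eps^(-(2a+d-2b)/b)."""
    b = min(alpha, beta)
    phi = eps ** (-d / alpha) + eps ** (-(2.0 * alpha + d - 2.0 * b) / b)
    return phi / eps ** 2

def contraction_rate(n, alpha, beta, d):
    """Numerically solve psi(eps) = n, i.e., eps_n = psi^{-1}(n)."""
    return brentq(lambda e: psi(e, alpha, beta, d) - n, 1e-8, 1.0)

alpha, beta, d = 2.0, 1.0, 1
for n in (1e3, 1e5, 1e7):
    eps_n = contraction_rate(n, alpha, beta, d)
    benchmark = n ** (-min(alpha, beta) / (2.0 * alpha + d))
    print(f"n = {n:8.0e}   psi^-1(n) = {eps_n:.4f}   n^(-(a^b)/(2a+d)) = {benchmark:.4f}")
# The numerically inverted rate tracks the closed-form benchmark up to constants.
```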

66 This is in accordance with the findings for other GP priors in Van der Vaart and Van Zanten (2008a). [sent-327, score-0.341]

67 Like the Matérn process the squared exponential process is stationary. [sent-333, score-0.194]

68 This process was studied already in Van der Vaart and Van Zanten (2007) and Van der Vaart and Van Zanten (2009). [sent-336, score-0.556]

69 The following lemma concerns the infimum part of the concentration function in the case that the function f0 belongs to a Sobolev space with regularity β (see Section 1. [sent-342, score-0.353]

70 Combination of the preceding two lemmas shows that for a β-regular response function f0 (in Sobolev sense) φ_{f0}(ε) ≲ exp(Cε^{−2/(β−d/2)}) + (log(1/ε))^{1+d}. [sent-345, score-0.289]

71 Thus the extreme smoothness of the prior relative to the smoothness of the response function leads to very slow contraction rates for such functions. [sent-348, score-0.419]

72 4, is based on the fact that balls of this type also receive very little prior mass, essentially because the inequality of the preceding lemma can be reversed. [sent-356, score-0.213]

73 As the prior puts all of its mass on analytic functions, perhaps it is not fair to study its performance only for β-regular functions, and it makes sense to study the concentration function for “supersmooth”, analytic response functions as well. [sent-358, score-0.507]

74 The functions in the RKHS of the squared exponential process are examples of supersmooth functions, and for those functions we obtain the rate ψ−1 (n) determined by the (centered) small ball probability only. [sent-359, score-0.345]

75 The following lemma deals with the infimum part of the concentration function in the case that the function f0 is supersmooth. [sent-361, score-0.216]

76 Theorem 10 Suppose that we use a squared exponential prior and f0 is the restriction to [0, 1]d of an element of A γ,r (Rd ), for r ≥ 1 and γ > 0. [sent-367, score-0.177]

77 Then both in the fixed and the random design cases the posterior contracts at the rate (log n)^{1/r}/√n. [sent-368, score-0.237]
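A small numeric comparison makes the gap vivid: the near-parametric rate (log n)^{1/r}/√n for supersmooth truths against a typical nonparametric benchmark rate. The parameter values below are arbitrary illustrations.

```python
import numpy as np

def supersmooth_rate(n, r):
    """Near-parametric rate (log n)^(1/r) / sqrt(n) for analytic truths."""
    return np.log(n) ** (1.0 / r) / np.sqrt(n)

def nonparametric_rate(n, beta, d):
    """Typical nonparametric benchmark n^(-beta/(2*beta + d))."""
    return n ** (-beta / (2.0 * beta + d))

for n in (1e3, 1e6, 1e9):
    print(f"n = {n:7.0e}   supersmooth (r=1): {supersmooth_rate(n, 1):.2e}"
          f"   nonparametric (beta=2, d=1): {nonparametric_rate(n, 2, 1):.2e}")
```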

78 Observe that the rate that we get in the last theorem is up to a logarithmic factor equal to the rate 1/√n at which the posterior typically contracts for parametric models (cf. [sent-369, score-0.31]

79 , the Bernstein-von Mises theorem, for example, Van der Vaart, 1998). [sent-370, score-0.278]

80 Together, Theorems 8 and 10 give the same general message for the squared exponential kernel as Theorem 5 does for the Matérn kernel: fast convergence rates are only attained if the smoothness of the prior matches the smoothness of the response function f0. [sent-372, score-0.553]

81 However, generally the assumption of existence of infinitely many derivatives of a true response function (f0 ∈ A γ,r (Rd)) is considered too strong to define a test case for nonparametric learning. [sent-373, score-0.218]

82 If this assumption holds, then the response function f0 can be recovered at a very fast rate, but this is poor evidence of good performance, as only few functions satisfy the assumption. [sent-374, score-0.2]

83 1 Proof of Theorem 1 The proof of Theorem 1 is based on estimates of the prior mass near the true parameter f0 and on the metric entropy of the support of the prior. [sent-379, score-0.197]

84 We place in each of these shells a maximal collection Θ j of points that are jr/2-separated, and next construct a test φ j as the maximum of all the tests as in the preceding lemma attached to one of these points. [sent-396, score-0.173]

85 The details are the same as in Van der Vaart and Van Zanten (2008a). [sent-428, score-0.278]

86 3 in Van der Vaart and Van Zanten, 2008b) that the concentration function φ_{f0} determines the small ball probabilities around f0, in the sense that, for the given ε, Π(f : ‖f − f0‖_∞ < ε) ≥ e^{−nε²}. [sent-432, score-0.515]

87 1 in Van der Vaart and Van Zanten, 2008b) these sets have prior probability Π(F_r) bounded below by 1 − Φ(α + M_r), for Φ the standard normal distribution function and α the solution to the equation Φ(α) = Π(f : ‖f‖_∞ < ε) = e^{−φ_0(ε)}. [sent-436, score-0.408]

88 1 of Van der Vaart and Van Zanten (2008a) that the sets F_r also satisfy the entropy bound (15), for the norm ‖·‖_∞, and hence certainly for ‖·‖_n. [sent-440, score-0.364]

89 2 Proof of Theorem 2 For a function f : [0, 1]^d → R and α > 0 let ‖f‖_{α|∞} be the Besov norm of regularity α measured using the L∞ − L∞ norms (see (19) below). [sent-442, score-0.219]

90 By the same argument as used to obtain (17) in the proof of Proposition 11, we see that III ≤ 1 + ∫_1^∞ r e^{nε²(r^{2γ}+1)} Π(f : ‖f‖_{α|∞} > τ√n η r^γ) dr ≤ 1 + ∫_1^∞ r e^{nε²(r^{2γ}+1)} e^{−2nη²r^{2γ}} dr. [sent-472, score-0.195]

91 2.9 in Van der Vaart and Wellner, 1996) to see that P(‖f − f_ε‖_2 ≥ 2‖f − f_ε‖_n) ≤ e^{−(n/5)‖f − f_ε‖_2²/‖f − f_ε‖_∞²}. [sent-479, score-0.278]

92 The unit ball of the RKHS of a GP f is always contained in c times the unit ball of the Banach space on which it is supported, for c² = E‖f‖², where ‖·‖ is the norm of the Banach space (see, e. [sent-480, score-0.3]

93 , Van der Vaart and Van Zanten, 2008b), formula (2. [sent-482, score-0.278]

94 Because Π is concentrated on C^α[0, 1]^d, we can apply this general fact with ‖·‖ the α-Hölder norm, and conclude that the α-Hölder norm of an element of the RKHS is bounded above by τ/4 times its RKHS-norm, for τ/4 the second moment of the prior norm defined previously. [sent-485, score-0.33]

95 Therefore, for f in the set F of functions with ‖f‖_{α|∞} ≤ τ√n ε r^γ, we have ‖f − f_ε‖_{α|∞} ≤ 2τ√n ε r^γ, whence by Lemma 15 for f ∈ F we can replace ‖f − f_ε‖_∞ in the preceding display by c(2τ√n ε r^γ)^{d/(2α+d)} ‖f − f_ε‖_2^{2α/(2α+d)}, for a constant c depending on the covariate density. [sent-487, score-0.217]

96 In other words, the unit ball H1 of the RKHS is contained in a Sobolev ball of order α + d/2. [sent-499, score-0.246]

97 ) The metric entropy relative to the uniform norm of such a Sobolev ball is bounded by a constant times (1/ε)^{d/(α+d/2)} (see Theorem 3. [sent-502, score-0.258]

98 The lemma next follows from the results of Kuelbs and Li (1993) and Li and Linde (1998) that characterize the small ball probability in terms of the entropy of the RKHS-unit ball. [sent-506, score-0.193]

99 Proof [Proof of 9] The first assertion is proved in Van der Vaart and Van Zanten (2009), Lemma 4. [sent-523, score-0.331]

100 The prior mass of a ball of radius ε around f0 is bounded below by e^{−φ_{f0}(ε/2)} and bounded above by e^{−φ_{f0}(ε)}, where we can use any norm. [sent-540, score-0.272]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vaart', 0.406), ('van', 0.318), ('zanten', 0.316), ('der', 0.278), ('pn', 0.23), ('mat', 0.229), ('gp', 0.2), ('rkhs', 0.172), ('anten', 0.165), ('onparametric', 0.165), ('response', 0.143), ('regularity', 0.137), ('ball', 0.105), ('concentration', 0.104), ('rocess', 0.096), ('sobolev', 0.095), ('posterior', 0.093), ('aussian', 0.091), ('preceding', 0.089), ('ethods', 0.087), ('fourier', 0.087), ('rn', 0.085), ('brownian', 0.084), ('dr', 0.081), ('rd', 0.076), ('covariate', 0.071), ('contracts', 0.069), ('kuelbs', 0.069), ('prior', 0.068), ('contraction', 0.068), ('squared', 0.066), ('priors', 0.063), ('mum', 0.061), ('seeger', 0.061), ('lder', 0.06), ('lemma', 0.056), ('norm', 0.054), ('assertion', 0.053), ('risk', 0.052), ('analytic', 0.052), ('cb', 0.051), ('fr', 0.049), ('nonparametric', 0.047), ('rate', 0.045), ('smoothness', 0.044), ('exponential', 0.043), ('borell', 0.041), ('motion', 0.04), ('mr', 0.039), ('gaussian', 0.038), ('transform', 0.037), ('contained', 0.036), ('risks', 0.036), ('dt', 0.035), ('aad', 0.035), ('draws', 0.035), ('stationary', 0.034), ('tails', 0.034), ('theorem', 0.034), ('bounded', 0.034), ('metric', 0.033), ('proof', 0.033), ('entropy', 0.032), ('wellner', 0.032), ('iv', 0.031), ('mass', 0.031), ('ei', 0.031), ('covariates', 0.03), ('design', 0.03), ('functions', 0.029), ('lemmas', 0.029), ('ft', 0.029), ('function', 0.028), ('rasmussen', 0.028), ('supremum', 0.028), ('display', 0.028), ('castillo', 0.028), ('edmunds', 0.028), ('eindhoven', 0.028), ('ewswt', 0.028), ('frc', 0.028), ('karatzas', 0.028), ('lipshitz', 0.028), ('lrl', 0.028), ('ro', 0.028), ('shells', 0.028), ('supersmooth', 0.028), ('compact', 0.027), ('en', 0.027), ('kl', 0.027), ('paths', 0.027), ('event', 0.027), ('attained', 0.026), ('kernel', 0.026), ('rough', 0.025), ('rates', 0.024), ('fj', 0.024), ('closure', 0.024), ('logarithmic', 0.024), ('sup', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

2 0.17558026 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

3 0.13913652 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

Author: Lauren A. Hannah, David M. Blei, Warren B. Powell

Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency

4 0.11566535 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

5 0.075411461 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

6 0.074273847 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

7 0.063579097 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

8 0.059832621 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

9 0.059682902 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

10 0.058252182 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

11 0.055157959 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

12 0.054816917 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

13 0.054538984 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

14 0.051543977 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

15 0.04783399 12 jmlr-2011-Bayesian Co-Training

16 0.047569394 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

17 0.047307111 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

18 0.043462895 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

19 0.042144116 35 jmlr-2011-Forest Density Estimation

20 0.042074475 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.25), (1, -0.043), (2, -0.07), (3, 0.066), (4, 0.111), (5, -0.084), (6, -0.095), (7, 0.001), (8, -0.283), (9, 0.081), (10, 0.068), (11, 0.044), (12, -0.096), (13, 0.132), (14, 0.127), (15, 0.099), (16, 0.134), (17, -0.155), (18, 0.22), (19, 0.189), (20, -0.086), (21, -0.168), (22, 0.071), (23, -0.013), (24, 0.119), (25, -0.033), (26, 0.003), (27, 0.039), (28, -0.106), (29, -0.03), (30, 0.07), (31, -0.03), (32, -0.087), (33, 0.038), (34, 0.025), (35, 0.038), (36, 0.111), (37, 0.044), (38, -0.076), (39, 0.033), (40, 0.015), (41, -0.012), (42, 0.081), (43, 0.014), (44, 0.035), (45, 0.034), (46, -0.15), (47, 0.016), (48, 0.083), (49, -0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96581429 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

2 0.71232522 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

3 0.58784688 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

Author: Lauren A. Hannah, David M. Blei, Warren B. Powell

Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency

4 0.47404933 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

5 0.39729175 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

6 0.35898805 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

7 0.34800586 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

8 0.30336431 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

9 0.28443921 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

10 0.27370688 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

11 0.27130759 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

12 0.25869435 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

13 0.25597975 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

14 0.24699828 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

15 0.24615906 12 jmlr-2011-Bayesian Co-Training

16 0.23407045 35 jmlr-2011-Forest Density Estimation

17 0.23329026 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

18 0.22631142 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.2260436 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

20 0.21889709 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.04), (6, 0.39), (9, 0.027), (10, 0.021), (24, 0.04), (31, 0.076), (32, 0.027), (41, 0.027), (60, 0.011), (65, 0.056), (67, 0.026), (70, 0.016), (73, 0.049), (78, 0.099), (90, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77847695 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

2 0.7679441 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

3 0.41429996 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

4 0.41186884 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

5 0.38347077 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

6 0.38151175 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

7 0.38079718 91 jmlr-2011-The Sample Complexity of Dictionary Learning

8 0.37841985 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

9 0.37840638 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

10 0.37137282 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

11 0.3690798 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

12 0.36592522 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

13 0.36367735 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

14 0.36299363 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

15 0.35626405 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

16 0.35259598 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

17 0.34627959 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

18 0.34466046 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

19 0.34137353 104 jmlr-2011-X-Armed Bandits

20 0.34047729 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination