jmlr jmlr2007 jmlr2007-18 knowledge-graph by maker-knowledge-mining

18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models


Source: pdf

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. [sent-14, score-0.457]

2 Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. [sent-15, score-0.278]

3 We study several classes of signed measures and their image mapped by the integral operator. [sent-16, score-0.225]

4 In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. [sent-17, score-0.164]

5 A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. [sent-18, score-0.188]

6 We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. [sent-19, score-0.567]

7 Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. [sent-21, score-0.539]

8 Another approach, the Bayesian kernel model, is to study the class of functions expressible as kernel integrals $\mathcal{G} = \{ f \mid f(x) = \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$ (3), for some space $\Gamma \subseteq \mathcal{B}(X)$ of signed Borel measures. [sent-48, score-0.294]
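As a concrete illustration of this class (a minimal sketch of my own, not taken from the paper: the Gaussian kernel, weights, and atom locations are all illustrative assumptions), when γ is a discrete signed measure γ = Σ_j w_j δ_{u_j} the kernel integral reduces to the finite expansion f(x) = Σ_j w_j K(x, u_j):

```python
import numpy as np

def gaussian_kernel(x, u, sigma=0.2):
    # A bounded kernel K(x, u); the Gaussian kernel is one admissible choice.
    return np.exp(-(x - u) ** 2 / (2.0 * sigma ** 2))

def f_from_measure(x, weights, locations, kernel=gaussian_kernel):
    # f(x) = int_X K(x, u) gamma(du) for gamma = sum_j w_j * delta_{u_j}:
    # the integral against a discrete signed measure is a finite sum.
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # (n_x, 1)
    u = np.asarray(locations, dtype=float)[None, :]          # (1, n_atoms)
    return kernel(x, u) @ np.asarray(weights, dtype=float)

# Three signed atoms on X = [0, 1] (illustrative values)
w = np.array([1.5, -0.7, 0.9])
u = np.array([0.2, 0.5, 0.8])
print(f_from_measure(np.linspace(0.0, 1.0, 5), w, u))
```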

9 The natural question that arises in this Bayesian approach is: For what spaces $\Gamma$ of signed measures is the RKHS $\mathcal{H}_K$ identical to the linear space $\mathrm{span}(\mathcal{G})$ spanned by the Bayesian kernel model? [sent-50, score-0.244]

10 The proposition asserts that the Bayesian kernel model and the penalized loss model both operate in the same function space when Γ includes all signed measures. [sent-54, score-0.256]

11 This result lays a theoretical foundation from a function analytic perspective for the use of two commonly used prior specifications: Dirichlet process priors (Ferguson, 1973; West, 1992; Escobar and West, 1995; MacEachern and Müller, 1998; Müller et al. [sent-55, score-0.157]

, 2004) and Lévy process priors (Wolpert et al. [sent-56, score-0.349]

13 Prior distributions are placed on the space of signed measures in Section 4 using Lévy, Dirichlet, and Gaussian processes. [sent-60, score-0.16]

14 This illustrates the use of these process priors for posterior inference. [sent-62, score-0.209]

15 A Bayesian method using Lévy process priors to address numerically ill-posed problems was developed by Wolpert and Ickstadt (2004). [sent-68, score-0.349]

16 However, functions in the RKHS having a posterior probability very close to that of the MAP estimator need not have a finite representation, so building a prior only on the finite representation is problematic if one wants to estimate the full posterior on the entire RKHS. [sent-71, score-0.263]

17 Let $\{\lambda_j\}$ and $\{\varphi_j\}$ be the non-increasing eigenvalues and corresponding complete orthonormal set of eigenvectors of the operator $L_K$ of Equation (4), restricted to the Hilbert space $L^2(X, du)$ of measures $\gamma(du) = \gamma(u)\,du$ with square-integrable density functions $\gamma \in L^2(X, du)$. [sent-82, score-1.041]

18 2 Bayesian Kernel Models and Integral Operators: Recall the Bayesian kernel model was defined by $\mathcal{G} = \{ f \mid f(x) = L_K[\gamma](x) := \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$, where $\Gamma$ is a space of signed Borel measures on $X$. [sent-85, score-0.244]

19 Since $X$ is compact and $K$ bounded, $L_K$ is a positive compact operator on $L^2(X, du)$ with a complete orthonormal system (CONS) $\{\varphi_j\}$ of eigenfunctions with non-increasing eigenvalues $\{\lambda_j\} \subset \mathbb{R}_+$ satisfying Equation (5). [sent-91, score-0.177]

20 The image under $L_K$ of the measure $\gamma(du) := \gamma(u)\,du$ with Lebesgue density function $\gamma$ may be expressed as the $L^2$-convergent sum $L_K[\gamma](x) = \sum_j \lambda_j a_j \varphi_j(x)$. [sent-93, score-0.501]

21 This class will arise naturally when we examine Lévy and Dirichlet processes in Section 4. [sent-105, score-0.344]

22 Let $\mathcal{B}^+(X)$ denote the cone of all finite nonnegative Borel measures on $X$ and $\mathcal{B}(X)$ the set of signed Borel measures. [sent-111, score-0.16]

23 Proof: We construct an infinite signed measure $\gamma$ satisfying $L_K[\gamma] \in \mathcal{H}_K$. [sent-124, score-0.156]

24 Consider the improper Be(0, 0) distribution $\gamma(du) = \frac{du}{u(1-u)}$, with image under the integral operator $f(x) := L_K[\gamma](x) = -x \log(x) - (1-x)\log(1-x)$. [sent-126, score-0.574]
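This closed form can be checked numerically (a sketch of my own; the kernel is an assumption, since the excerpt does not restate it, and the first-order spline / Brownian-bridge kernel K(x,u) = min(x,u) − xu on [0,1] is used here): midpoint quadrature of ∫ K(x,u)/(u(1−u)) du matches −x log(x) − (1−x) log(1−x):

```python
import numpy as np

def K(x, u):
    # Assumed kernel for this check: K(x, u) = min(x, u) - x*u on [0, 1].
    return np.minimum(x, u) - x * u

# Midpoint rule on (0, 1); the integrand K(x, u)/(u(1-u)) stays bounded
# even though the Be(0, 0) density 1/(u(1-u)) is not integrable on its own.
n = 200_000
u = (np.arange(n) + 0.5) / n
du = 1.0 / n

for x in (0.1, 0.3, 0.7):
    numeric = np.sum(K(x, u) / (u * (1.0 - u))) * du
    closed = -x * np.log(x) - (1.0 - x) * np.log(1.0 - x)
    print(f"x={x}: quadrature {numeric:.5f}  vs  closed form {closed:.5f}")
```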

25 Thus the infinite signed measure $\gamma(ds)$ is in $L_K^{-1}[\mathcal{H}_K]$ but not in $\mathcal{B}(X)$, so $L_K^{-1}[\mathcal{H}_K]$ is larger than the space of finite signed measures. [sent-128, score-0.282]

26 The eigenfunctions and eigenvalues of Equation (2) for Lebesgue measure $\mu(du) = du$ are $\lambda_j = \frac{1}{j^2 \pi^2}$ and $\varphi_j(x) = \sqrt{2}\,\sin(j\pi x)$. [sent-133, score-0.544]
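These eigenpairs can be verified numerically (my own sketch; the Brownian-bridge kernel K(x,u) = min(x,u) − xu is assumed here as a kernel whose Mercer expansion has exactly these eigenfunctions): discretize L_K on a grid and compare its leading eigenvalues with 1/(jπ)²:

```python
import numpy as np

# Assumed kernel consistent with Example 1's eigenpairs:
# K(x, u) = min(x, u) - x*u on [0, 1].
n = 400
x = (np.arange(n) + 0.5) / n
K = np.minimum.outer(x, x) - np.outer(x, x)

# Discretized integral operator: L_K[phi](x_i) ~ (1/n) * sum_k K(x_i, x_k) phi(x_k)
evals = np.sort(np.linalg.eigvalsh(K / n))[::-1]

for j in range(1, 5):
    print(f"lambda_{j}: numerical {evals[j - 1]:.6f}, 1/(j*pi)^2 = {1.0 / (j * np.pi) ** 2:.6f}")
```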

27 Example 2 (Splines on a circle): The kernel function for first-order splines on the real line is $K(x,u) := |x-u|$, $x, u \in \mathbb{R}$, and the corresponding RKHS norm is $\|f\|_K^2 = \int_{-\infty}^{\infty} f'(x)^2\, dx$. [sent-138, score-0.137]

28 However, since the domain is not compact, the spectrum of the associated integral operator on $L^2(\mathbb{R}, du)$ is continuous rather than discrete, and the approach of Section 2 does not apply. [sent-139, score-0.151]

29 Therefore, we assume that $f(x) = \int_X K(x,u)\,Z(du)$ (8), where $Z(du) \in \mathcal{M}(X)$ is a signed measure on $X$. [sent-153, score-0.156]

30 With a prior distribution on $Z$, $\pi(Z)$, we can obtain the posterior density function given data, $\pi(Z|D) \propto L(D|Z)\,\pi(Z)$, which implies a posterior distribution for $f$ via the integral operator (8). [sent-156, score-0.414]

31 Priors on $\mathcal{M}$: A random signed measure $Z(du)$ on $X$ can be viewed as a stochastic process on $X$. [sent-158, score-0.197]

32 Gaussian processes and Dirichlet processes are two commonly used stochastic processes to generate random measures. [sent-160, score-0.297]

33 We first apply the results of Section 2 to Gaussian process priors (Rasmussen and Williams, 2006, Section 6) and then to Lévy process priors (Wolpert et al. [sent-161, score-0.453]

34 We also remark that Dirichlet processes can be constructed from Lévy process priors. [sent-164, score-0.385]

35 Gaussian Processes: Gaussian processes are canonical examples of stochastic processes used for generating random measures. [sent-166, score-0.198]

36 Model I: Placing a prior directly on the space of functions $f(x)$ by sampling from paths of the Gaussian process with its covariance structure defined via a kernel $K$. [sent-170, score-0.238]

37 Model II: Placing a prior on the random signed measures $Z(du)$ on $X$ by using a Gaussian process prior for $Z(du)$, which implies a prior on the function space defined by the kernel model in Equation (8). [sent-171, score-0.444]
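A minimal sketch of the two constructions (my own illustration, with an assumed Gaussian kernel and a crude grid discretization of the integral, not the authors' code): Model I draws f directly as a Gaussian-process path with covariance K, while Model II draws a random measure Z(du) and maps it through the kernel integral of Equation (8):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2 * 0.1 ** 2))  # assumed Gaussian kernel

# Model I: f is drawn directly as a Gaussian-process path with covariance K.
f_model1 = rng.multivariate_normal(np.zeros(grid.size), K + 1e-8 * np.eye(grid.size))

# Model II: draw a (discretized) Gaussian random signed measure Z(du) and push
# it through the integral operator: f(x) = int K(x, u) Z(du) ~ sum_i K(x, u_i) z_i du.
du = grid[1] - grid[0]
z_density = rng.normal(size=grid.size)      # assumed white-noise density for Z
f_model2 = K @ (z_density * du)

print(f_model1[:3], f_model2[:3])
```

The two constructions are different priors over functions; the surrounding discussion analyzes which RKHS actually contains the resulting sample paths.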

38 The first approach is the more standard approach for non-parametric Bayesian inference using Gaussian processes, while the latter is an example of our Bayesian kernel model. [sent-173, score-0.183]

39 Sample Paths of Gaussian Processes: Consider a Gaussian process $\{Z_u, u \in X\}$ on a probability space $(\Omega, \mathcal{A}, P)$ having covariance function determined by a kernel function $K$. [sent-181, score-0.151]

40 Conversely, let $L : \mathcal{H}_R \to \mathcal{H}_R$ be a positive, continuous, self-adjoint operator; then $K(s,t) = \langle L R_s, R_t \rangle_R$, $s, t \in X$, defines a reproducing kernel on $X$ such that $K \le R$. [sent-195, score-0.192]

41 L is the dominance operator of HR over HK and this dominance is called nuclear if L is a nuclear or trace class operator (a compact operator for which a trace may be defined that is finite and independent of the choice of basis). [sent-196, score-0.36]

42 Implications for the Function Spaces of the Models: Model I placed a prior directly on the space of functions using sample paths from the Gaussian process with covariance structure defined by the kernel $K$. [sent-200, score-0.238]

43 However, there exists another RKHS HR with kernel R which does contain the sample path if R has nuclear dominance over K. [sent-202, score-0.159]

44 Model II places a prior on random signed measures Z(du) on X by using a Gaussian process prior for Z(du). [sent-207, score-0.307]

45 Lévy Processes: Lévy processes offer an alternative to Gaussian processes in non-parametric Bayesian modeling. [sent-212, score-0.688]

46 Dirichlet processes and Gaussian processes with a particular covariance structure can be formulated from the framework of Lévy processes. [sent-213, score-0.469]

47 For the sake of simplicity in exposition, we will use the univariate setting $X = [0, 1]$ to illustrate the construction of random signed measures using Lévy processes. [sent-214, score-0.16]

48 A stochastic process $Z := \{Z_u \in \mathbb{R} : u \in X\}$ is called a Lévy process if it satisfies the following conditions: 1. [sent-216, score-0.327]

49 Familiar examples of Lévy processes include Brownian motion, Poisson processes, and gamma processes. [sent-227, score-0.368]

50 This follows from Theorem 16, which asserts that every Lévy process can be decomposed into the sum of two independent components: a "continuous process" (Brownian motion with drift) and a (possibly compensated) "pure jump" process. [sent-233, score-0.286]

51 The three parameters $(a, \sigma^2, \nu)$ in (11) uniquely determine a Lévy process, where $a$ denotes the drift term, $\sigma^2$ denotes the variance (diffusion coefficient) of the Brownian motion, and $\nu(dw)$ denotes the intensity of the jump process. [sent-234, score-0.409]

52 The so-called "Lévy measure" $\nu$ need not be finite, but (12) implies that $\nu[(-\varepsilon, \varepsilon)^c] < \infty$ for each $\varepsilon > 0$, and so $\nu$ is at least $\sigma$-finite. [sent-235, score-0.245]

53 Pure Jump Lévy Processes: Pure jump Lévy processes are used extensively in non-parametric Bayesian statistics due to their computational amenability. [sent-238, score-0.467]

54 Poisson Random Fields Interpretation: Any pure jump Lévy process $Z$ has a nice representation via a Poisson random field. [sent-243, score-0.44]

55 The measure $N$ defined above in Equation (13) turns out to be a Poisson random measure on $\Gamma$, with mean measure $\nu(dw)\,du$, where $du$ is the uniform reference measure on $X$ (for instance the Lebesgue measure when $X = [0, 1]$). [sent-247, score-0.597]

56 When $Z$ has a density with respect to the Lévy random field $M$ with Lévy measure $m$, $Z_u$ has finite total variation and determines a finite measure $Z(du) = dZ_u$. [sent-250, score-0.574]

57 The pairs $(w_j, u_j)$ are draws from $\nu(dw)\,du$, representing the jump size and the jump location, respectively. [sent-254, score-0.246]
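The Poisson-random-field construction can be sketched directly (my own illustration with an assumed finite, truncated Lévy measure and illustrative parameters): draw the number of jumps from a Poisson law with mean equal to the total mass of ν(dw)du, then i.i.d. jump sizes and locations, and recover the pure-jump path as a running sum of the jumps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed finite (truncated) Levy measure on jump sizes: total mass c with an
# exponential jump-size law; X = [0, 1] carries the uniform reference measure du.
c, rate = 20.0, 5.0
J = rng.poisson(c)                      # number of jumps ~ Poisson(nu(R) * |X|)
w = rng.exponential(1.0 / rate, J)      # jump sizes, i.i.d. from nu(dw) / nu(R)
u = rng.uniform(0.0, 1.0, J)            # jump locations, i.i.d. from du

def Z(t):
    # Pure-jump sample path: Z_t is the sum of all jumps located at or below t.
    return np.sum(w[u <= t])

print([round(Z(t), 3) for t in np.linspace(0.0, 1.0, 6)])
```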

58 If the measure $\nu(dw)\,du$ has a density function $\nu(w, u)$ with respect to some finite reference measure $m(dw\,du)$, then the prior density function for $Z$ with respect to a Lévy($m$) process is $\pi(Z) = \Big[\prod_{j=1}^{J} \nu(w_j, u_j)\Big]\, e^{\,m(\Gamma) - \nu(\Gamma)}$. [sent-257, score-0.202]

59 However, if the Lévy measure satisfies $\int_{\mathbb{R}} (1 \wedge |w|)\,\nu(dw) < \infty$ (16), then the sequence $\{w_j\}$ is almost surely absolutely summable (i. [sent-260, score-0.275]

60 This allows for the existence of Lévy processes with jumps that are not absolutely summable. [sent-265, score-0.41]

61 When passing from prior to posterior computations, it has been shown that the Dirichlet process is the only conjugate member of the whole class of normalized random measures with independent increments (James et al. [sent-269, score-0.233]

62 Though Dirichlet processes are often defined via Dirichlet distributions, they can also be defined as a normalized Gamma process as noted by Ferguson (1973). [sent-273, score-0.14]

63 A Gamma process is a pure jump Lévy process with Lévy measure $\nu(dw) = a w^{-1} \exp\{-b w\}\,dw$, $w > 0$, so at each location $u$, $Z_u \sim \mathrm{Gamma}(a u, b)$. [sent-274, score-0.715]
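The normalized-gamma construction of the Dirichlet process mentioned above can be illustrated numerically (a sketch of my own with assumed parameters): draw independent Gamma increments of a gamma process over a partition of [0, 1] and normalize them; the resulting random probability vector is a discretized draw from the corresponding Dirichlet process:

```python
import numpy as np

rng = np.random.default_rng(2)

a, b = 5.0, 1.0          # assumed gamma-process parameters (illustrative)
n_bins = 50
du = 1.0 / n_bins

# Independent increments of a gamma process on [0, 1]: each bin receives a
# Gamma(shape = a * du, rate = b) increment.
increments = rng.gamma(shape=a * du, scale=1.0 / b, size=n_bins)

# Normalizing the gamma process yields a (discretized) Dirichlet process draw.
weights = increments / increments.sum()
print(weights.round(3), "sum =", weights.sum())
```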

64 (2006) consider a variation of the integral (8), $f(x) = \int_X K(x,u)\,Z(du) = \int_X w(u)\,K(x,u)\,F(du)$ (17), where the random signed measure $Z(du)$ is modeled by a random probability distribution function $F(du)$ and random coefficients $w(u)$. [sent-280, score-0.221]

65 A Dirichlet process prior is specified for F and a Gaussian prior distribution is specified for w. [sent-281, score-0.147]

66 Symmetric α-Stable Processes: Symmetric α-stable processes are another class of Lévy processes, arising from symmetric α-stable distributions. [sent-284, score-0.344]

67 It has the following Lévy measure: $\nu(dw) = \frac{\Gamma(\alpha+1)}{\pi} \sin\!\big(\tfrac{\pi\alpha}{2}\big)\, |w|^{-1-\alpha}\, dw$, $\alpha \in (0, 2]$. [sent-287, score-0.442]

68 Thus SαS processes allow us to model heavy or light tail processes by varying α. [sent-289, score-0.198]

69 One can verify that the Lévy measure is infinite for $0 < \alpha \le 2$, since $\nu(\mathbb{R}) = \int_{\mathbb{R}} \nu(dw) = 2 \int_{(0,\infty]} \alpha w^{-1-\alpha}\, dw = \infty$. [sent-290, score-0.4]

70 Given the jump sizes $\{w_j\}$, jump locations $\{u_j\}$, and the number of jumps $J$, the prior probability density function (15) is $\pi(Z) = \alpha^J \prod_{j=1}^{J} |w_j|^{-1-\alpha}\, e^{2(\varepsilon^{-1} - \varepsilon^{-\alpha})}$, $|w_j| \ge \varepsilon$ (18), with respect to a Cauchy random field. [sent-294, score-0.332]

71 For pure jump processes discretization is not the bottleneck. [sent-315, score-0.253]

72 The nature of the pure jump process ensures that the kernel model will have discrete knots. [sent-316, score-0.279]

73 The key issue in using pure jump processes to model multivariate data is that the knots of the model should be representative of samples drawn from the marginal distribution of the data $\rho_X$. [sent-317, score-0.282]

74 Due to the extensive literature on Gaussian process models from theoretical as well as practical perspectives (Rasmussen and Williams, 2006; Ghosal and Roy, 2006) our simulations will focus on two pure jump process models. [sent-322, score-0.236]

75 This information is used to update the prior and obtain the posterior density π(Z|D). [sent-328, score-0.182]

76 For pure jump measures Z(du) and most non-parametric models this update is computationally difficult because there is no closed-form expression for the posterior distribution. [sent-329, score-0.293]

77 We will apply a Dirichlet process model to a high-dimensional binary regression problem and illustrate the use of Lévy process models on a univariate regression problem. [sent-331, score-0.327]

78 Lévy Process Model: Posterior inference for Lévy random measures has been less explored than for Dirichlet and Gaussian processes. [sent-333, score-0.524]

79 The random measure $Z(du)$ is given by $Z(du) \sim \text{Lévy}(\nu(dw)\,du)$, where $\nu(dw) = \frac{\Gamma(\alpha+1)}{\pi} \sin\!\big(\tfrac{\pi\alpha}{2}\big)\, |w|^{-1-\alpha}\, \mathbf{1}_{\{w : |w| > \varepsilon\}}\, dw$, $\alpha \in (0, 2]$, is the (truncated) Lévy measure for the SαS process. [sent-337, score-0.472]
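Sampling from this truncated prior can be sketched as follows (my own illustration; the values of α and ε and the restriction to X = [0, 1] are assumptions): the truncation at ε makes the total mass ν(Γ) finite, so the number of jumps is Poisson and the jump magnitudes follow a Pareto(ε, α) law obtained by inverting the tail of |w|^(−1−α):

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(3)

alpha, eps = 1.2, 0.05        # assumed stability index and truncation level
C = gamma(alpha + 1) * sin(pi * alpha / 2) / pi

# Total mass of the truncated Levy measure over X = [0, 1]:
# nu(Gamma) = int_{|w| > eps} C |w|^{-1-alpha} dw = 2 * C * eps^{-alpha} / alpha
nu_total = 2.0 * C * eps ** (-alpha) / alpha

J = rng.poisson(nu_total)                                   # number of jumps
magnitudes = eps * rng.uniform(size=J) ** (-1.0 / alpha)    # Pareto(eps, alpha) via inverse CDF
w = rng.choice([-1.0, 1.0], size=J) * magnitudes            # symmetric signs
u = rng.uniform(0.0, 1.0, size=J)                           # jump locations
print(f"J = {J} jumps, largest |w| = {np.abs(w).max(initial=0.0):.3f}")
```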

80 input: $0 < p_b, p_d < 1$, $\tau > 0$, current state $\theta \in \Theta$; return: proposed new state $\theta^*$ and its weighted transition probability $Q(\theta^* | \theta)\,\pi(\theta)$. Draw $t \sim U[0, 1]$; if $t < 1 - p_b$ then draw uniformly $j \in \{1, \ldots$ [sent-351, score-0.207]

81 This is done by Metropolis-Hastings sampling using the weighted transition probability algorithm above to generate a Markov chain whose equilibrium density is the posterior density. [sent-361, score-0.161]
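A compact sketch of the birth/death/update proposal hinted at by the algorithm above (my own simplification; the proposal probabilities and scales are assumptions, and the Metropolis-Hastings accept/reject step built from the posterior density is deliberately omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

def propose(state, pb=0.2, pd=0.2, tau=0.1):
    # Birth / death / update proposal for a pure-jump measure stored as parallel
    # arrays of jump sizes w_j and locations u_j.  Only the proposal is shown;
    # a full sampler would accept or reject theta* via the MH ratio using pi(Z|D).
    w, u = list(state[0]), list(state[1])
    t = rng.uniform()
    if t < pb or len(w) == 0:                 # birth: create a new jump
        w.append(rng.normal(scale=tau))
        u.append(rng.uniform())
    elif t < pb + pd:                         # death: delete a randomly chosen jump
        j = int(rng.integers(len(w)))
        del w[j]; del u[j]
    else:                                     # update: perturb a randomly chosen jump size
        j = int(rng.integers(len(w)))
        w[j] += rng.normal(scale=tau)
    return np.array(w), np.array(u)

state = (np.array([0.5, -0.3]), np.array([0.2, 0.7]))
for _ in range(100):
    state = propose(state)
print(len(state[0]), "jumps after 100 proposals")
```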

82 Figure 1: Plots of the target sinusoid (solid line), the function realized at an iteration $t$ of the Markov chain (dashed line), and the jump locations and magnitudes of the measure (spikes) for (a) $t = 1$, (b) $t = 10$, (c) $t = 5 \times 10^3$, and (d) $t = 10^4$. [sent-446, score-0.191]

83 Classification of Gene Expression Data: For Dirichlet processes there is extensive literature on exact posterior inference using MCMC methods (West, 1992; Escobar and West, 1995; MacEachern and Müller, 1998; Müller et al. [sent-450, score-0.204]

84 Recently Dirichlet process priors have been applied to a Bayesian kernel model for high dimensional data. [sent-452, score-0.188]

85 The model is based upon the integral operator given in Equation (17), $f(x) = \int_X K(x,u)\,Z(du) = \int_X w(u)\,K(x,u)\,F(du)$, where the random signed measure $Z(du)$ is modeled by a random probability distribution function $F(du)$ and a random weight function $w(u)$. [sent-459, score-0.283]

86 This simple incorporation of unlabeled data into the model further illustrates the advantage of placing the prior over random measures in the Bayesian kernel model. [sent-483, score-0.197]

87 We examined the function class defined by the Bayesian kernel model, the integral of a kernel with respect to a signed Borel measure: $\mathcal{G} = \{ f \mid f(x) = \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$, where $\Gamma \subseteq \mathcal{B}(X)$. [sent-520, score-0.389]

88 Posterior consistency: It is natural to expect the posterior distribution to concentrate around the true function since the posterior distribution is a probability measure on the RKHS. [sent-527, score-0.24]

89 A natural idea is to use the equivalence between the RKHS and our Bayesian model to exploit the well understood theory of RKHS in proving posterior consistency of the Bayesian kernel model. [sent-528, score-0.189]

90 An obvious question is whether we can use the same ideas to relate priors on measures and the kernel to specific classes of functions, such as Sobolev spaces. [sent-532, score-0.181]

91 A study of the relation between integral operators and priors could lead to interesting and useful results for putting priors over specific function classes using the kernel model. [sent-533, score-0.275]

92 Comparison of process priors for modeling: A theoretical and empirical comparison of the accuracy of the various process priors on a variety of function classes and data sets would be of great practical importance and interest, especially for high dimensional problems. [sent-535, score-0.208]

93 Further developing this relation is a very interesting area of research and may be of importance for the posterior consistency of the Bayesian kernel model. [sent-541, score-0.189]

94 It follows from Proposition 4 that $L_K[\gamma_n] \in \mathcal{H}_K$ and $\|L_K[\gamma_n]\|_K^2 = \int_X \int_X K(u,v)\,\gamma_n(u)\,du\,\gamma_n(v)\,dv \le \kappa^2 \int_X |\gamma_n(u)|\,du \int_X |\gamma_n(v)|\,dv = \kappa^2 \|\gamma_n\|_1^2 < \infty$. [sent-564, score-0.447]

95 In addition, for every $x \in X$, we have $\lim_{n\to\infty} |L_K[\gamma_n](x) - L_K[\gamma](x)| \le \lim_{n\to\infty} \int_X |K(x,u)\,(\gamma_n(u) - \gamma(u))|\,du \le \lim_{n\to\infty} \kappa^2\, \|\gamma_n - \gamma\|_1 = 0$, which implies that $L_K[\gamma_n](x)$ also converges to $L_K[\gamma](x)$. [sent-569, score-0.447]

96 We denote the corresponding integral operator as $L_{K,\mu}$ and the function spaces of integrable and square-integrable functions as $L^1_\mu(X)$ and $L^2_\mu(X)$, respectively. [sent-577, score-0.193]

97 Theorem 16 (Lévy-Khintchine): Let $X$ be a $d$-dimensional Lévy process with characteristic function $\varphi_t(u) := E\big(e^{i \langle u, X_t \rangle}\big)$, $u \in \mathbb{R}^d$. [sent-585, score-0.286]

98 Sur une nouvelle méthode pour la résolution du problème de Dirichlet. [sent-684, score-0.447]

99 Stochastic processes with sample paths in reproducing kernel Hilbert spaces. [sent-748, score-0.263]

100 Reflecting uncertainty in inverse problems: A Bayesian solution using Lévy processes. [sent-885, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('du', 0.447), ('lk', 0.405), ('hk', 0.308), ('vy', 0.245), ('dirichlet', 0.157), ('rkhs', 0.148), ('ayesian', 0.133), ('haracterizing', 0.133), ('iang', 0.133), ('illai', 0.133), ('olpert', 0.133), ('ukherjee', 0.133), ('signed', 0.126), ('dw', 0.125), ('jump', 0.123), ('pace', 0.113), ('unction', 0.108), ('posterior', 0.105), ('liang', 0.101), ('processes', 0.099), ('zu', 0.094), ('duke', 0.087), ('ernel', 0.086), ('kernel', 0.084), ('odels', 0.081), ('hr', 0.08), ('borel', 0.079), ('bayesian', 0.077), ('jumps', 0.066), ('integral', 0.065), ('sayan', 0.065), ('priors', 0.063), ('operator', 0.062), ('pd', 0.059), ('wolpert', 0.059), ('mcmc', 0.058), ('pb', 0.058), ('prior', 0.053), ('splines', 0.053), ('mukherjee', 0.052), ('jx', 0.048), ('mike', 0.048), ('nuclear', 0.048), ('proposition', 0.046), ('md', 0.046), ('reproducing', 0.046), ('poisson', 0.046), ('lebesgue', 0.045), ('durham', 0.043), ('gaussian', 0.042), ('sin', 0.042), ('process', 0.041), ('brownian', 0.04), ('eigenfunctions', 0.04), ('credible', 0.038), ('sinusoid', 0.038), ('zs', 0.038), ('coherent', 0.036), ('paths', 0.034), ('wahba', 0.034), ('measures', 0.034), ('cauchy', 0.033), ('feng', 0.033), ('integrable', 0.033), ('realization', 0.033), ('birth', 0.032), ('tomaso', 0.032), ('transition', 0.032), ('mercer', 0.031), ('pure', 0.031), ('measure', 0.03), ('periodic', 0.029), ('wj', 0.029), ('hilbert', 0.029), ('chakraborty', 0.029), ('dwdu', 0.029), ('ghosal', 0.029), ('ickstadt', 0.029), ('isds', 0.029), ('knots', 0.029), ('luki', 0.029), ('maceachern', 0.029), ('qiang', 0.029), ('poggio', 0.028), ('eigenvalues', 0.027), ('dominance', 0.027), ('nc', 0.027), ('covariance', 0.026), ('placing', 0.026), ('grace', 0.025), ('kimeldorf', 0.025), ('gamma', 0.024), ('west', 0.024), ('nite', 0.024), ('density', 0.024), ('dms', 0.024), ('escobar', 0.024), ('fredholm', 0.024), ('compact', 0.024), ('cancer', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.23551781 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

3 0.20223817 71 jmlr-2007-Refinable Kernels

Author: Yuesheng Xu, Haizhang Zhang

Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases

4 0.094470613 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton

Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space

5 0.067505121 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

Author: Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram

Abstract: Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP. Keywords: classification, hierarchical Bayesian models, Dirichlet process

6 0.062402304 90 jmlr-2007-Value Regularization and Fenchel Duality

7 0.062173534 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

8 0.058862325 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

9 0.053727228 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

10 0.05186177 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

11 0.046095669 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

12 0.045808721 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

13 0.041415431 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

14 0.039894983 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

15 0.037880749 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

16 0.036651779 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

17 0.033349268 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

18 0.031389829 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

19 0.030923985 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

20 0.03077635 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.229), (1, -0.3), (2, 0.318), (3, 0.181), (4, 0.011), (5, 0.074), (6, 0.245), (7, -0.125), (8, -0.215), (9, -0.042), (10, -0.141), (11, -0.083), (12, 0.082), (13, 0.043), (14, -0.075), (15, -0.009), (16, -0.038), (17, -0.006), (18, -0.009), (19, -0.067), (20, -0.004), (21, -0.054), (22, -0.041), (23, 0.038), (24, -0.021), (25, 0.001), (26, 0.085), (27, 0.01), (28, -0.019), (29, 0.034), (30, -0.051), (31, -0.023), (32, 0.066), (33, 0.016), (34, -0.041), (35, -0.04), (36, -0.034), (37, -0.043), (38, -0.029), (39, 0.036), (40, 0.039), (41, -0.048), (42, -0.041), (43, 0.028), (44, 0.005), (45, -0.006), (46, -0.062), (47, 0.035), (48, -0.063), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96510166 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.84400731 71 jmlr-2007-Refinable Kernels

Author: Yuesheng Xu, Haizhang Zhang

Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases

3 0.69614828 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

4 0.37250084 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton

Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space

5 0.30543387 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

Author: Gavin C. Cawley, Nicola L. C. Talbot

Abstract: While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach. Keywords: model selection, kernel methods, Bayesian regularisation

6 0.29881716 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

7 0.26051491 90 jmlr-2007-Value Regularization and Fenchel Duality

8 0.24843186 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

9 0.23280577 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

10 0.23148763 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

11 0.22629812 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

12 0.21700212 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

13 0.21350856 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

14 0.18351987 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

15 0.1769443 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

16 0.17430966 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

17 0.16892228 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

18 0.16541766 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

19 0.15397301 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

20 0.15042108 70 jmlr-2007-Ranking the Best Instances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.012), (8, 0.519), (10, 0.017), (12, 0.04), (15, 0.022), (28, 0.078), (40, 0.055), (45, 0.011), (48, 0.021), (60, 0.028), (85, 0.047), (98, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90085191 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.87783623 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

Author: Sanjoy Dasgupta, Leonard Schulman

Abstract: We show that, given data from a mixture of k well-separated spherical Gaussians in R^d, a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to near-optimal precision, if the dimension is high (d ≫ ln k). We relate this to previous theoretical and empirical work on the EM algorithm. Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis

3 0.49386269 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

4 0.41999424 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen

Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation

5 0.398655 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik

Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis

6 0.37665349 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

7 0.36627012 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

8 0.36168975 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

9 0.35739166 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

10 0.32999498 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

11 0.31688541 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

12 0.31334329 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

13 0.31074017 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

14 0.30851927 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

15 0.30801427 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

16 0.3035793 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

17 0.30160302 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

18 0.30020165 71 jmlr-2007-Refinable Kernels

19 0.29616153 39 jmlr-2007-Handling Missing Values when Applying Classification Models

20 0.29567814 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption