jmlr jmlr2007 jmlr2007-18 knowledge-graph by maker-knowledge-mining

18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models


Source: pdf

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. [sent-14, score-0.457]

2 Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. [sent-15, score-0.278]

3 We study several classes of signed measures and their image mapped by the integral operator. [sent-16, score-0.225]

4 In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. [sent-17, score-0.164]

5 A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. [sent-18, score-0.188]

6 We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. [sent-19, score-0.567]

7 Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. [sent-21, score-0.539]

8 Another approach, the Bayesian kernel model, is to study the class of functions expressible as kernel integrals $\mathcal{G} = \{ f \mid f(x) = \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$ (3), for some space $\Gamma \subseteq \mathcal{B}(X)$ of signed Borel measures. [sent-48, score-0.294]
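As a concrete illustration of this class (a minimal sketch of my own, not taken from the paper: the Gaussian kernel, weights, and atom locations are all illustrative assumptions), when γ is a discrete signed measure γ = Σ_j w_j δ_{u_j} the kernel integral reduces to the finite expansion f(x) = Σ_j w_j K(x, u_j):

```python
import numpy as np

def gaussian_kernel(x, u, sigma=0.2):
    # A bounded kernel K(x, u); the Gaussian kernel is one admissible choice.
    return np.exp(-(x - u) ** 2 / (2.0 * sigma ** 2))

def f_from_measure(x, weights, locations, kernel=gaussian_kernel):
    # f(x) = int_X K(x, u) gamma(du) for gamma = sum_j w_j * delta_{u_j}:
    # the integral against a discrete signed measure is a finite sum.
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # (n_x, 1)
    u = np.asarray(locations, dtype=float)[None, :]          # (1, n_atoms)
    return kernel(x, u) @ np.asarray(weights, dtype=float)

# Three signed atoms on X = [0, 1] (illustrative values)
w = np.array([1.5, -0.7, 0.9])
u = np.array([0.2, 0.5, 0.8])
print(f_from_measure(np.linspace(0.0, 1.0, 5), w, u))
```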

9 The natural question that arises in this Bayesian approach is: For what spaces $\Gamma$ of signed measures is the RKHS $\mathcal{H}_K$ identical to the linear space $\mathrm{span}(\mathcal{G})$ spanned by the Bayesian kernel model? [sent-50, score-0.244]

10 The proposition asserts that the Bayesian kernel model and the penalized loss model both operate in the same function space when Γ includes all signed measures. [sent-54, score-0.256]

11 This result lays a theoretical foundation from a function analytic perspective for the use of two commonly used prior specifications: Dirichlet process priors (Ferguson, 1973; West, 1992; Escobar and West, 1995; MacEachern and Müller, 1998; Müller et al. [sent-55, score-0.157]

, 2004) and Lévy process priors (Wolpert et al. [sent-56, score-0.349]

13 Prior distributions are placed on the space of signed measures in Section 4 using Lévy, Dirichlet, and Gaussian processes. [sent-60, score-0.16]

14 This illustrates the use of these process priors for posterior inference. [sent-62, score-0.209]

15 A Bayesian method using Lévy process priors to address numerically ill-posed problems was developed by Wolpert and Ickstadt (2004). [sent-68, score-0.349]

16 However, functions in the RKHS having a posterior probability very close to that of the MAP estimator need not have a finite representation, so building a prior only on the finite representation is problematic if one wants to estimate the full posterior on the entire RKHS. [sent-71, score-0.263]

17 Let $\{\lambda_j\}$ and $\{\varphi_j\}$ be the non-increasing eigenvalues and corresponding complete orthonormal set of eigenvectors of the operator $L_K$ of Equation (4), restricted to the Hilbert space $L^2(X, du)$ of measures $\gamma(du) = \gamma(u)\,du$ with square-integrable density functions $\gamma \in L^2(X, du)$. [sent-82, score-1.041]

18 2 Bayesian Kernel Models and Integral Operators: Recall the Bayesian kernel model was defined by $\mathcal{G} = \{ f \mid f(x) = L_K[\gamma](x) := \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$, where $\Gamma$ is a space of signed Borel measures on $X$. [sent-85, score-0.244]

19 Since $X$ is compact and $K$ bounded, $L_K$ is a positive compact operator on $L^2(X, du)$ with a complete orthonormal system (CONS) $\{\varphi_j\}$ of eigenfunctions with non-increasing eigenvalues $\{\lambda_j\} \subset \mathbb{R}_+$ satisfying Equation (5). [sent-91, score-0.177]

20 The image under $L_K$ of the measure $\gamma(du) := \gamma(u)\,du$ with Lebesgue density function $\gamma$ may be expressed as the $L^2$-convergent sum $L_K[\gamma](x) = \sum_j \lambda_j a_j \varphi_j(x)$. [sent-93, score-0.501]

21 This class will arise naturally when we examine Lévy and Dirichlet processes in Section 4. [sent-105, score-0.344]

22 Let $\mathcal{B}^+(X)$ denote the cone of all finite nonnegative Borel measures on $X$ and $\mathcal{B}(X)$ the set of signed Borel measures. [sent-111, score-0.16]

23 Proof: We construct an infinite signed measure $\gamma$ satisfying $L_K[\gamma] \in \mathcal{H}_K$. [sent-124, score-0.156]

24 Consider the improper Be(0, 0) distribution $\gamma(du) = \frac{du}{u(1-u)}$, with image under the integral operator $f(x) := L_K[\gamma](x) = -x \log(x) - (1-x)\log(1-x)$. [sent-126, score-0.574]
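This closed form can be checked numerically (a sketch of my own; the kernel is an assumption, since the excerpt does not restate it, and the first-order spline / Brownian-bridge kernel K(x,u) = min(x,u) − xu on [0,1] is used here): midpoint quadrature of ∫ K(x,u)/(u(1−u)) du matches −x log(x) − (1−x) log(1−x):

```python
import numpy as np

def K(x, u):
    # Assumed kernel for this check: K(x, u) = min(x, u) - x*u on [0, 1].
    return np.minimum(x, u) - x * u

# Midpoint rule on (0, 1); the integrand K(x, u)/(u(1-u)) stays bounded
# even though the Be(0, 0) density 1/(u(1-u)) is not integrable on its own.
n = 200_000
u = (np.arange(n) + 0.5) / n
du = 1.0 / n

for x in (0.1, 0.3, 0.7):
    numeric = np.sum(K(x, u) / (u * (1.0 - u))) * du
    closed = -x * np.log(x) - (1.0 - x) * np.log(1.0 - x)
    print(f"x={x}: quadrature {numeric:.5f}  vs  closed form {closed:.5f}")
```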

25 Thus the infinite signed measure $\gamma(ds)$ is in $L_K^{-1}[\mathcal{H}_K]$ but not in $\mathcal{B}(X)$, so $L_K^{-1}[\mathcal{H}_K]$ is larger than the space of finite signed measures. [sent-128, score-0.282]

26 The eigenfunctions and eigenvalues of Equation (2) for Lebesgue measure $\mu(du) = du$ are $\lambda_j = \frac{1}{j^2 \pi^2}$ and $\varphi_j(x) = \sqrt{2}\,\sin(j\pi x)$. [sent-133, score-0.544]
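These eigenpairs can be verified numerically (my own sketch; the Brownian-bridge kernel K(x,u) = min(x,u) − xu is assumed here as a kernel whose Mercer expansion has exactly these eigenfunctions): discretize L_K on a grid and compare its leading eigenvalues with 1/(jπ)²:

```python
import numpy as np

# Assumed kernel consistent with Example 1's eigenpairs:
# K(x, u) = min(x, u) - x*u on [0, 1].
n = 400
x = (np.arange(n) + 0.5) / n
K = np.minimum.outer(x, x) - np.outer(x, x)

# Discretized integral operator: L_K[phi](x_i) ~ (1/n) * sum_k K(x_i, x_k) phi(x_k)
evals = np.sort(np.linalg.eigvalsh(K / n))[::-1]

for j in range(1, 5):
    print(f"lambda_{j}: numerical {evals[j - 1]:.6f}, 1/(j*pi)^2 = {1.0 / (j * np.pi) ** 2:.6f}")
```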

27 Example 2 (Splines on a circle): The kernel function for first-order splines on the real line is $K(x,u) := |x-u|$, $x, u \in \mathbb{R}$, and the corresponding RKHS norm is $\|f\|_K^2 = \int_{-\infty}^{\infty} f'(x)^2\, dx$. [sent-138, score-0.137]

28 However, since the domain is not compact, the spectrum of the associated integral operator on $L^2(\mathbb{R}, du)$ is continuous rather than discrete, and the approach of Section 2 does not apply. [sent-139, score-0.151]

29 Therefore, we assume that $f(x) = \int_X K(x,u)\,Z(du)$ (8), where $Z(du) \in \mathcal{M}(X)$ is a signed measure on $X$. [sent-153, score-0.156]

30 With a prior distribution on $Z$, $\pi(Z)$, we can obtain the posterior density function given data, $\pi(Z|D) \propto L(D|Z)\,\pi(Z)$, which implies a posterior distribution for $f$ via the integral operator (8). [sent-156, score-0.414]

31 Priors on $\mathcal{M}$: A random signed measure $Z(du)$ on $X$ can be viewed as a stochastic process on $X$. [sent-158, score-0.197]

32 Gaussian processes and Dirichlet processes are two commonly used stochastic processes to generate random measures. [sent-160, score-0.297]

33 We first apply the results of Section 2 to Gaussian process priors (Rasmussen and Williams, 2006, Section 6) and then to Lévy process priors (Wolpert et al. [sent-161, score-0.453]

34 We also remark that Dirichlet processes can be constructed from Lévy process priors. [sent-164, score-0.385]

35 Gaussian Processes: Gaussian processes are canonical examples of stochastic processes used for generating random measures. [sent-166, score-0.198]

36 Model I: Placing a prior directly on the space of functions $f(x)$ by sampling from paths of the Gaussian process with its covariance structure defined via a kernel $K$. [sent-170, score-0.238]

37 Model II: Placing a prior on the random signed measures $Z(du)$ on $X$ by using a Gaussian process prior for $Z(du)$, which implies a prior on the function space defined by the kernel model in Equation (8). [sent-171, score-0.444]
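A minimal sketch of the two constructions (my own illustration, with an assumed Gaussian kernel and a crude grid discretization of the integral, not the authors' code): Model I draws f directly as a Gaussian-process path with covariance K, while Model II draws a random measure Z(du) and maps it through the kernel integral of Equation (8):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2 * 0.1 ** 2))  # assumed Gaussian kernel

# Model I: f is drawn directly as a Gaussian-process path with covariance K.
f_model1 = rng.multivariate_normal(np.zeros(grid.size), K + 1e-8 * np.eye(grid.size))

# Model II: draw a (discretized) Gaussian random signed measure Z(du) and push
# it through the integral operator: f(x) = int K(x, u) Z(du) ~ sum_i K(x, u_i) z_i du.
du = grid[1] - grid[0]
z_density = rng.normal(size=grid.size)      # assumed white-noise density for Z
f_model2 = K @ (z_density * du)

print(f_model1[:3], f_model2[:3])
```

The two constructions are different priors over functions; the surrounding discussion analyzes which RKHS actually contains the resulting sample paths.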

38 The first approach is the more standard approach for non-parametric Bayesian inference using Gaussian processes, while the latter is an example of our Bayesian kernel model. [sent-173, score-0.183]

39 Sample Paths of Gaussian Processes: Consider a Gaussian process $\{Z_u, u \in X\}$ on a probability space $(\Omega, \mathcal{A}, P)$ having covariance function determined by a kernel function $K$. [sent-181, score-0.151]

40 Conversely, let $L : \mathcal{H}_R \to \mathcal{H}_R$ be a positive, continuous, self-adjoint operator; then $K(s,t) = \langle L R_s, R_t \rangle_R$, $s, t \in X$, defines a reproducing kernel on $X$ such that $K \le R$. [sent-195, score-0.192]

41 L is the dominance operator of HR over HK and this dominance is called nuclear if L is a nuclear or trace class operator (a compact operator for which a trace may be defined that is finite and independent of the choice of basis). [sent-196, score-0.36]

42 Implications for the Function Spaces of the Models: Model I placed a prior directly on the space of functions using sample paths from the Gaussian process with covariance structure defined by the kernel $K$. [sent-200, score-0.238]

43 However, there exists another RKHS HR with kernel R which does contain the sample path if R has nuclear dominance over K. [sent-202, score-0.159]

44 Model II places a prior on random signed measures Z(du) on X by using a Gaussian process prior for Z(du). [sent-207, score-0.307]

45 Lévy Processes: Lévy processes offer an alternative to Gaussian processes in non-parametric Bayesian modeling. [sent-212, score-0.688]

46 Dirichlet processes and Gaussian processes with a particular covariance structure can be formulated from the framework of Lévy processes. [sent-213, score-0.469]

47 For the sake of simplicity in exposition, we will use the univariate setting $X = [0, 1]$ to illustrate the construction of random signed measures using Lévy processes. [sent-214, score-0.16]

48 A stochastic process $Z := \{Z_u \in \mathbb{R} : u \in X\}$ is called a Lévy process if it satisfies the following conditions: 1. [sent-216, score-0.327]

49 Familiar examples of Lévy processes include Brownian motion, Poisson processes, and gamma processes. [sent-227, score-0.368]

50 This follows from Theorem 16, which asserts that every Lévy process can be decomposed into the sum of two independent components: a "continuous process" (Brownian motion with drift) and a (possibly compensated) "pure jump" process. [sent-233, score-0.286]

51 The three parameters $(a, \sigma^2, \nu)$ in (11) uniquely determine a Lévy process, where $a$ denotes the drift term, $\sigma^2$ denotes the variance (diffusion coefficient) of the Brownian motion, and $\nu(dw)$ denotes the intensity of the jump process. [sent-234, score-0.409]

52 The so-called "Lévy measure" $\nu$ need not be finite, but (12) implies that $\nu[(-\varepsilon, \varepsilon)^c] < \infty$ for each $\varepsilon > 0$, and so $\nu$ is at least $\sigma$-finite. [sent-235, score-0.245]

53 Pure Jump Lévy Processes: Pure jump Lévy processes are used extensively in non-parametric Bayesian statistics due to their computational amenability. [sent-238, score-0.467]

54 Poisson Random Fields Interpretation: Any pure jump Lévy process $Z$ has a nice representation via a Poisson random field. [sent-243, score-0.44]

55 The measure $N$ defined above in Equation (13) turns out to be a Poisson random measure on $\Gamma$, with mean measure $\nu(dw)\,du$, where $du$ is the uniform reference measure on $X$ (for instance the Lebesgue measure when $X = [0, 1]$). [sent-247, score-0.597]

56 When $Z$ has a density with respect to the Lévy random field $M$ with Lévy measure $m$, $Z_u$ has finite total variation and determines a finite measure $Z(du) = dZ_u$. [sent-250, score-0.574]

57 The pairs $(w_j, u_j)$ are draws from $\nu(dw)\,du$, representing the jump size and the jump location, respectively. [sent-254, score-0.246]
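The Poisson-random-field construction can be sketched directly (my own illustration with an assumed finite, truncated Lévy measure and illustrative parameters): draw the number of jumps from a Poisson law with mean equal to the total mass of ν(dw)du, then i.i.d. jump sizes and locations, and recover the pure-jump path as a running sum of the jumps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed finite (truncated) Levy measure on jump sizes: total mass c with an
# exponential jump-size law; X = [0, 1] carries the uniform reference measure du.
c, rate = 20.0, 5.0
J = rng.poisson(c)                      # number of jumps ~ Poisson(nu(R) * |X|)
w = rng.exponential(1.0 / rate, J)      # jump sizes, i.i.d. from nu(dw) / nu(R)
u = rng.uniform(0.0, 1.0, J)            # jump locations, i.i.d. from du

def Z(t):
    # Pure-jump sample path: Z_t is the sum of all jumps located at or below t.
    return np.sum(w[u <= t])

print([round(Z(t), 3) for t in np.linspace(0.0, 1.0, 6)])
```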

58 If the measure $\nu(dw)\,du$ has a density function $\nu(w, u)$ with respect to some finite reference measure $m(dw\,du)$, then the prior density function for $Z$ with respect to a Lévy($m$) process is $\pi(Z) = \Big[\prod_{j=1}^{J} \nu(w_j, u_j)\Big]\, e^{\,m(\Gamma) - \nu(\Gamma)}$. [sent-257, score-0.202]

59 However, if the Lévy measure satisfies $\int_{\mathbb{R}} (1 \wedge |w|)\,\nu(dw) < \infty$ (16), then the sequence $\{w_j\}$ is almost surely absolutely summable (i. [sent-260, score-0.275]

60 This allows for the existence of Lévy processes with jumps that are not absolutely summable. [sent-265, score-0.41]

61 When passing from prior to posterior computations, it has been shown that the Dirichlet process is the only conjugate member of the whole class of normalized random measures with independent increments (James et al. [sent-269, score-0.233]

62 Though Dirichlet processes are often defined via Dirichlet distributions, they can also be defined as a normalized Gamma process as noted by Ferguson (1973). [sent-273, score-0.14]

63 A Gamma process is a pure jump Lévy process with Lévy measure $\nu(dw) = a w^{-1} \exp\{-b w\}\,dw$, $w > 0$, so at each location $u$, $Z_u \sim \mathrm{Gamma}(a u, b)$. [sent-274, score-0.715]
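The normalized-gamma construction of the Dirichlet process mentioned above can be illustrated numerically (a sketch of my own with assumed parameters): draw independent Gamma increments of a gamma process over a partition of [0, 1] and normalize them; the resulting random probability vector is a discretized draw from the corresponding Dirichlet process:

```python
import numpy as np

rng = np.random.default_rng(2)

a, b = 5.0, 1.0          # assumed gamma-process parameters (illustrative)
n_bins = 50
du = 1.0 / n_bins

# Independent increments of a gamma process on [0, 1]: each bin receives a
# Gamma(shape = a * du, rate = b) increment.
increments = rng.gamma(shape=a * du, scale=1.0 / b, size=n_bins)

# Normalizing the gamma process yields a (discretized) Dirichlet process draw.
weights = increments / increments.sum()
print(weights.round(3), "sum =", weights.sum())
```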

64 (2006) consider a variation of the integral (8), $f(x) = \int_X K(x,u)\,Z(du) = \int_X w(u)\,K(x,u)\,F(du)$ (17), where the random signed measure $Z(du)$ is modeled by a random probability distribution function $F(du)$ and random coefficients $w(u)$. [sent-280, score-0.221]

65 A Dirichlet process prior is specified for F and a Gaussian prior distribution is specified for w. [sent-281, score-0.147]

66 Symmetric α-Stable Processes: Symmetric α-stable processes are another class of Lévy processes, arising from symmetric α-stable distributions. [sent-284, score-0.344]

67 It has the following Lévy measure: $\nu(dw) = \frac{\Gamma(\alpha+1)}{\pi} \sin\!\big(\tfrac{\pi\alpha}{2}\big)\, |w|^{-1-\alpha}\, dw$, $\alpha \in (0, 2]$. [sent-287, score-0.442]

68 Thus SαS processes allow us to model heavy or light tail processes by varying α. [sent-289, score-0.198]

69 One can verify that the Lévy measure is infinite for $0 < \alpha \le 2$, since $\nu(\mathbb{R}) = \int_{\mathbb{R}} \nu(dw) = 2 \int_{(0,\infty]} \alpha w^{-1-\alpha}\, dw = \infty$. [sent-290, score-0.4]

70 Given the jump sizes $\{w_j\}$, jump locations $\{u_j\}$, and the number of jumps $J$, the prior probability density function (15) is $\pi(Z) = \alpha^J \prod_{j=1}^{J} |w_j|^{-1-\alpha}\, e^{2(\varepsilon^{-1} - \varepsilon^{-\alpha})}$, $|w_j| \ge \varepsilon$ (18), with respect to a Cauchy random field. [sent-294, score-0.332]

71 For pure jump processes discretization is not the bottleneck. [sent-315, score-0.253]

72 The nature of the pure jump process ensures that the kernel model will have discrete knots. [sent-316, score-0.279]

73 The key issue in using pure jump processes to model multivariate data is that the knots of the model should be representative of samples drawn from the marginal distribution of the data $\rho_X$. [sent-317, score-0.282]

74 Due to the extensive literature on Gaussian process models from theoretical as well as practical perspectives (Rasmussen and Williams, 2006; Ghosal and Roy, 2006) our simulations will focus on two pure jump process models. [sent-322, score-0.236]

75 This information is used to update the prior and obtain the posterior density π(Z|D). [sent-328, score-0.182]

76 For pure jump measures Z(du) and most non-parametric models this update is computationally difficult because there is no closed-form expression for the posterior distribution. [sent-329, score-0.293]

77 We will apply a Dirichlet process model to a high-dimensional binary regression problem and illustrate the use of Lévy process models on a univariate regression problem. [sent-331, score-0.327]

78 Lévy Process Model: Posterior inference for Lévy random measures has been less explored than for Dirichlet and Gaussian processes. [sent-333, score-0.524]

79 The random measure $Z(du)$ is given by $Z(du) \sim \text{Lévy}(\nu(dw)\,du)$, where $\nu(dw) = \frac{\Gamma(\alpha+1)}{\pi} \sin\!\big(\tfrac{\pi\alpha}{2}\big)\, |w|^{-1-\alpha}\, \mathbf{1}_{\{w : |w| > \varepsilon\}}\, dw$, $\alpha \in (0, 2]$, is the (truncated) Lévy measure for the SαS process. [sent-337, score-0.472]
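Sampling from this truncated prior can be sketched as follows (my own illustration; the values of α and ε and the restriction to X = [0, 1] are assumptions): the truncation at ε makes the total mass ν(Γ) finite, so the number of jumps is Poisson and the jump magnitudes follow a Pareto(ε, α) law obtained by inverting the tail of |w|^(−1−α):

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(3)

alpha, eps = 1.2, 0.05        # assumed stability index and truncation level
C = gamma(alpha + 1) * sin(pi * alpha / 2) / pi

# Total mass of the truncated Levy measure over X = [0, 1]:
# nu(Gamma) = int_{|w| > eps} C |w|^{-1-alpha} dw = 2 * C * eps^{-alpha} / alpha
nu_total = 2.0 * C * eps ** (-alpha) / alpha

J = rng.poisson(nu_total)                                   # number of jumps
magnitudes = eps * rng.uniform(size=J) ** (-1.0 / alpha)    # Pareto(eps, alpha) via inverse CDF
w = rng.choice([-1.0, 1.0], size=J) * magnitudes            # symmetric signs
u = rng.uniform(0.0, 1.0, size=J)                           # jump locations
print(f"J = {J} jumps, largest |w| = {np.abs(w).max(initial=0.0):.3f}")
```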

80 input: $0 < p_b, p_d < 1$, $\tau > 0$, current state $\theta \in \Theta$; return: proposed new state $\theta^*$ and its weighted transition probability $Q(\theta^* | \theta)\,\pi(\theta)$. Draw $t \sim U[0, 1]$; if $t < 1 - p_b$ then draw uniformly $j \in \{1, \ldots$ [sent-351, score-0.207]

81 This is done by Metropolis-Hastings sampling using the weighted transition probability algorithm above to generate a Markov chain whose equilibrium density is the posterior density. [sent-361, score-0.161]
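A compact sketch of the birth/death/update proposal hinted at by the algorithm above (my own simplification; the proposal probabilities and scales are assumptions, and the Metropolis-Hastings accept/reject step built from the posterior density is deliberately omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

def propose(state, pb=0.2, pd=0.2, tau=0.1):
    # Birth / death / update proposal for a pure-jump measure stored as parallel
    # arrays of jump sizes w_j and locations u_j.  Only the proposal is shown;
    # a full sampler would accept or reject theta* via the MH ratio using pi(Z|D).
    w, u = list(state[0]), list(state[1])
    t = rng.uniform()
    if t < pb or len(w) == 0:                 # birth: create a new jump
        w.append(rng.normal(scale=tau))
        u.append(rng.uniform())
    elif t < pb + pd:                         # death: delete a randomly chosen jump
        j = int(rng.integers(len(w)))
        del w[j]; del u[j]
    else:                                     # update: perturb a randomly chosen jump size
        j = int(rng.integers(len(w)))
        w[j] += rng.normal(scale=tau)
    return np.array(w), np.array(u)

state = (np.array([0.5, -0.3]), np.array([0.2, 0.7]))
for _ in range(100):
    state = propose(state)
print(len(state[0]), "jumps after 100 proposals")
```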

82 Figure 1: Plots of the target sinusoid (solid line), the function realized at an iteration $t$ of the Markov chain (dashed line), and the jump locations and magnitudes of the measure (spikes) for (a) $t = 1$, (b) $t = 10$, (c) $t = 5 \times 10^3$, and (d) $t = 10^4$. [sent-446, score-0.191]

83 Classification of Gene Expression Data: For Dirichlet processes there is extensive literature on exact posterior inference using MCMC methods (West, 1992; Escobar and West, 1995; MacEachern and Müller, 1998; Müller et al. [sent-450, score-0.204]

84 Recently Dirichlet process priors have been applied to a Bayesian kernel model for high dimensional data. [sent-452, score-0.188]

85 The model is based upon the integral operator given in Equation (17), $f(x) = \int_X K(x,u)\,Z(du) = \int_X w(u)\,K(x,u)\,F(du)$, where the random signed measure $Z(du)$ is modeled by a random probability distribution function $F(du)$ and a random weight function $w(u)$. [sent-459, score-0.283]

86 This simple incorporation of unlabeled data into the model further illustrates the advantage of placing the prior over random measures in the Bayesian kernel model. [sent-483, score-0.197]

87 We examined the function class defined by the Bayesian kernel model, the integral of a kernel with respect to a signed Borel measure: $\mathcal{G} = \{ f \mid f(x) = \int_X K(x,u)\,\gamma(du),\ \gamma \in \Gamma \}$, where $\Gamma \subseteq \mathcal{B}(X)$. [sent-520, score-0.389]

88 Posterior consistency: It is natural to expect the posterior distribution to concentrate around the true function since the posterior distribution is a probability measure on the RKHS. [sent-527, score-0.24]

89 A natural idea is to use the equivalence between the RKHS and our Bayesian model to exploit the well understood theory of RKHS in proving posterior consistency of the Bayesian kernel model. [sent-528, score-0.189]

90 An obvious question is whether we can use the same ideas to relate priors on measures and the kernel to specific classes of functions, such as Sobolev spaces. [sent-532, score-0.181]

91 A study of the relation between integral operators and priors could lead to interesting and useful results for putting priors over specific function classes using the kernel model. [sent-533, score-0.275]

92 Comparison of process priors for modeling: A theoretical and empirical comparison of the accuracy of the various process priors on a variety of function classes and data sets would be of great practical importance and interest, especially for high dimensional problems. [sent-535, score-0.208]

93 Further developing this relation is a very interesting area of research and may be of importance for the posterior consistency of the Bayesian kernel model. [sent-541, score-0.189]

94 It follows from Proposition 4 that $L_K[\gamma_n] \in \mathcal{H}_K$ and $\|L_K[\gamma_n]\|_K^2 = \int_X \int_X K(u,v)\,\gamma_n(u)\,du\,\gamma_n(v)\,dv \le \kappa^2 \int_X |\gamma_n(u)|\,du \int_X |\gamma_n(v)|\,dv = \kappa^2 \|\gamma_n\|_1^2 < \infty$. [sent-564, score-0.447]

95 In addition, for every $x \in X$, we have $\lim_{n\to\infty} |L_K[\gamma_n](x) - L_K[\gamma](x)| \le \lim_{n\to\infty} \int_X |K(x,u)\,(\gamma_n(u) - \gamma(u))|\,du \le \lim_{n\to\infty} \kappa^2\, \|\gamma_n - \gamma\|_1 = 0$, which implies that $L_K[\gamma_n](x)$ also converges to $L_K[\gamma](x)$. [sent-569, score-0.447]

96 We denote the corresponding integral operator as $L_{K,\mu}$ and the function spaces of integrable and square-integrable functions as $L^1_\mu(X)$ and $L^2_\mu(X)$, respectively. [sent-577, score-0.193]

97 Theorem 16 (Lévy-Khintchine): Let $X$ be a $d$-dimensional Lévy process with characteristic function $\varphi_t(u) := E\big(e^{i \langle u, X_t \rangle}\big)$, $u \in \mathbb{R}^d$. [sent-585, score-0.286]

98 Sur une nouvelle méthode pour la résolution du problème de Dirichlet. [sent-684, score-0.447]

99 Stochastic processes with sample paths in reproducing kernel Hilbert spaces. [sent-748, score-0.263]

100 Reflecting uncertainty in inverse problems: A Bayesian solution using Lévy processes. [sent-885, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('du', 0.447), ('lk', 0.405), ('hk', 0.308), ('vy', 0.245), ('dirichlet', 0.157), ('rkhs', 0.148), ('ayesian', 0.133), ('haracterizing', 0.133), ('iang', 0.133), ('illai', 0.133), ('olpert', 0.133), ('ukherjee', 0.133), ('signed', 0.126), ('dw', 0.125), ('jump', 0.123), ('pace', 0.113), ('unction', 0.108), ('posterior', 0.105), ('liang', 0.101), ('processes', 0.099), ('zu', 0.094), ('duke', 0.087), ('ernel', 0.086), ('kernel', 0.084), ('odels', 0.081), ('hr', 0.08), ('borel', 0.079), ('bayesian', 0.077), ('jumps', 0.066), ('integral', 0.065), ('sayan', 0.065), ('priors', 0.063), ('operator', 0.062), ('pd', 0.059), ('wolpert', 0.059), ('mcmc', 0.058), ('pb', 0.058), ('prior', 0.053), ('splines', 0.053), ('mukherjee', 0.052), ('jx', 0.048), ('mike', 0.048), ('nuclear', 0.048), ('proposition', 0.046), ('md', 0.046), ('reproducing', 0.046), ('poisson', 0.046), ('lebesgue', 0.045), ('durham', 0.043), ('gaussian', 0.042), ('sin', 0.042), ('process', 0.041), ('brownian', 0.04), ('eigenfunctions', 0.04), ('credible', 0.038), ('sinusoid', 0.038), ('zs', 0.038), ('coherent', 0.036), ('paths', 0.034), ('wahba', 0.034), ('measures', 0.034), ('cauchy', 0.033), ('feng', 0.033), ('integrable', 0.033), ('realization', 0.033), ('birth', 0.032), ('tomaso', 0.032), ('transition', 0.032), ('mercer', 0.031), ('pure', 0.031), ('measure', 0.03), ('periodic', 0.029), ('wj', 0.029), ('hilbert', 0.029), ('chakraborty', 0.029), ('dwdu', 0.029), ('ghosal', 0.029), ('ickstadt', 0.029), ('isds', 0.029), ('knots', 0.029), ('luki', 0.029), ('maceachern', 0.029), ('qiang', 0.029), ('poggio', 0.028), ('eigenvalues', 0.027), ('dominance', 0.027), ('nc', 0.027), ('covariance', 0.026), ('placing', 0.026), ('grace', 0.025), ('kimeldorf', 0.025), ('gamma', 0.024), ('west', 0.024), ('nite', 0.024), ('density', 0.024), ('dms', 0.024), ('escobar', 0.024), ('fredholm', 0.024), ('compact', 0.024), ('cancer', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.23551781 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

3 0.20223817 71 jmlr-2007-Refinable Kernels

Author: Yuesheng Xu, Haizhang Zhang

Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases

4 0.094470613 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton

Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space

5 0.067505121 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

Author: Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram

Abstract: Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP. Keywords: classification, hierarchical Bayesian models, Dirichlet process

6 0.062402304 90 jmlr-2007-Value Regularization and Fenchel Duality

7 0.062173534 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

8 0.058862325 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

9 0.053727228 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

10 0.05186177 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

11 0.046095669 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

12 0.045808721 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

13 0.041415431 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

14 0.039894983 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

15 0.037880749 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

16 0.036651779 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

17 0.033349268 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

18 0.031389829 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

19 0.030923985 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

20 0.03077635 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.229), (1, -0.3), (2, 0.318), (3, 0.181), (4, 0.011), (5, 0.074), (6, 0.245), (7, -0.125), (8, -0.215), (9, -0.042), (10, -0.141), (11, -0.083), (12, 0.082), (13, 0.043), (14, -0.075), (15, -0.009), (16, -0.038), (17, -0.006), (18, -0.009), (19, -0.067), (20, -0.004), (21, -0.054), (22, -0.041), (23, 0.038), (24, -0.021), (25, 0.001), (26, 0.085), (27, 0.01), (28, -0.019), (29, 0.034), (30, -0.051), (31, -0.023), (32, 0.066), (33, 0.016), (34, -0.041), (35, -0.04), (36, -0.034), (37, -0.043), (38, -0.029), (39, 0.036), (40, 0.039), (41, -0.048), (42, -0.041), (43, 0.028), (44, 0.005), (45, -0.006), (46, -0.062), (47, 0.035), (48, -0.063), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96510166 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.84400731 71 jmlr-2007-Refinable Kernels

Author: Yuesheng Xu, Haizhang Zhang

Abstract: Motivated by mathematical learning from training data, we introduce the notion of refinable kernels. Various characterizations of refinable kernels are presented. The concept of refinable kernels leads to the introduction of wavelet-like reproducing kernels. We also investigate a refinable kernel that forms a Riesz basis. In particular, we characterize refinable translation invariant kernels, and refinable kernels defined by refinable functions. This study leads to multiresolution analysis of reproducing kernel Hilbert spaces. Keywords: refinable kernels, refinable feature maps, wavelet-like reproducing kernels, dual kernels, learning with kernels, reproducing kernel Hilbert spaces, Riesz bases

3 0.69614828 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

4 0.37250084 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton

Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n^{-1/3}, where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space

5 0.30543387 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

Author: Gavin C. Cawley, Nicola L. C. Talbot

Abstract: While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach. Keywords: model selection, kernel methods, Bayesian regularisation

6 0.29881716 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

7 0.26051491 90 jmlr-2007-Value Regularization and Fenchel Duality

8 0.24843186 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

9 0.23280577 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

10 0.23148763 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

11 0.22629812 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

12 0.21700212 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

13 0.21350856 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

14 0.18351987 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

15 0.1769443 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

16 0.17430966 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

17 0.16892228 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

18 0.16541766 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

19 0.15397301 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

20 0.15042108 70 jmlr-2007-Ranking the Best Instances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.012), (8, 0.519), (10, 0.017), (12, 0.04), (15, 0.022), (28, 0.078), (40, 0.055), (45, 0.011), (48, 0.021), (60, 0.028), (85, 0.047), (98, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90085191 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models

Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert

Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.

2 0.87783623 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

Author: Sanjoy Dasgupta, Leonard Schulman

Abstract: We show that, given data from a mixture of k well-separated spherical Gaussians in R^d, a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to near-optimal precision, if the dimension is high (d ≫ ln k). We relate this to previous theoretical and empirical work on the EM algorithm. Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis

3 0.49386269 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

4 0.41999424 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen

Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation

5 0.398655 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik

Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis

6 0.37665349 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

7 0.36627012 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

8 0.36168975 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

9 0.35739166 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

10 0.32999498 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

11 0.31688541 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

12 0.31334329 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

13 0.31074017 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

14 0.30851927 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

15 0.30801427 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

16 0.3035793 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

17 0.30160302 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

18 0.30020165 71 jmlr-2007-Refinable Kernels

19 0.29616153 39 jmlr-2007-Handling Missing Values when Applying Classification Models

20 0.29567814 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption