jmlr jmlr2012 jmlr2012-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Haizhang Zhang, Yuesheng Xu, Qinghui Zhang
Abstract: This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated from the need of updating the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need. Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant and finite Hilbert-Schmidt operator-valued reproducing kernels are provided. Other examples include refinement of Hessian of scalar-valued translation-invariant kernels and transformation kernels. Existence and properties of operator-valued reproducing kernels preserved during the refinement process are also investigated. Keywords: vector-valued reproducing kernel Hilbert spaces, operator-valued reproducing kernels, refinement, embedding, translation invariant kernels, Hessian of Gaussian kernels, Hilbert-Schmidt kernels, numerical experiments
Reference: text
sentIndex sentText sentNum sentScore
1 This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. [sent-11, score-1.116]
2 The study is motivated by the need to update the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. [sent-12, score-0.402]
3 Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. [sent-14, score-0.395]
4 Concrete examples of refining translation invariant and finite Hilbert-Schmidt operator-valued reproducing kernels are provided. [sent-15, score-0.538]
5 Existence and properties of operator-valued reproducing kernels preserved during the refinement process are also investigated. [sent-17, score-0.461]
6 Keywords: vector-valued reproducing kernel Hilbert spaces, operator-valued reproducing kernels, refinement, embedding, translation invariant kernels, Hessian of Gaussian kernels, Hilbert-Schmidt kernels, numerical experiments 1. [sent-18, score-0.777]
7 Built upon the theory of scalar-valued reproducing kernels (Aronszajn, 1950), kernel methods have proven useful in single task learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004; Vapnik, 1998). [sent-21, score-0.565]
8 These two considerations lead immediately to the notion of reproducing kernel Hilbert spaces (RKHS). [sent-32, score-0.402]
9 It is well-known that a bivariate function is a scalar-valued reproducing kernel if and only if it is representable as some inner product of the feature of inputs (Schölkopf and Smola, 2002). [sent-35, score-0.402]
10 Finally, finding a feature map and taking the inner product of the features of two inputs are equivalent to choosing a scalar-valued reproducing kernel and performing function evaluations of it. [sent-36, score-0.425]
11 We shall briefly review existing work on vector-valued RKHS and the associated operator-valued reproducing kernels. [sent-45, score-0.395]
12 The notion of matrix-valued or operator-valued reproducing kernels was also obtained in Burbea and Masani (1984). [sent-47, score-0.461]
13 Various characterizations and examples of universal operator-valued reproducing kernels were provided in Caponnetto et al. [sent-51, score-0.515]
14 , 2010) also examined basic operations of operator-valued reproducing kernels and extended the Bochner characterization of translation invariant reproducing kernels to the operator-valued case. [sent-55, score-1.018]
15 The purpose of this paper is to study the refinement relationship of two vector-valued reproducing kernels. [sent-56, score-0.298]
16 We say that a vector-valued reproducing kernel is a refinement of another kernel of the same type if the RKHS of the first kernel contains that of the second as a linear subspace and their norms coincide on the smaller space. [sent-57, score-0.61]
17 The study is motivated by the need to update a vector-valued reproducing kernel for multi-task machine learning when underfitting or overfitting occurs. [sent-59, score-0.402]
18 3 about the refinement of Hessian kernels and transformation kernels are unique, and the numerical experiments in Section 7 are novel. [sent-70, score-0.326]
19 By contrast, the discussion of general characterizations and finite-dimensional RKHS in Section 3, refinement of kernels defined by the integral of operator-valued kernels with respect to a scalar-valued measure in Section 4. [sent-71, score-0.443]
20 In Section 3, we shall present three general characterizations of the refinement relationship by examining the difference of two given kernels, the feature map representation of kernels, and the associated kernels on the extended input space. [sent-80, score-0.337]
21 Recall that most scalar-valued reproducing kernels are represented by integrals. [sent-81, score-0.461]
22 In the operator-valued case, we have two types of integral representations: the integral of operator-valued reproducing kernels with respect to a scalar-valued measure, and the integral of scalar-valued reproducing kernels with respect to an operator-valued measure. [sent-82, score-1.051]
23 As a key part of this paper, we shall investigate in Section 4 specifications of the general characterizations when the operator-valued reproducing kernels are given by such integrals. [sent-83, score-0.612]
24 Section 6 specifically treats the existence of nontrivial refinements and desirable properties of operator-valued reproducing kernels that can be preserved during the refinement process. [sent-85, score-0.535]
25 In Section 7, we perform three numerical simulations to show the effect of the refinement kernel method in updating operator-valued reproducing kernels for multi-task learning. [sent-86, score-0.565]
26 Kernel Refinement. To explain our motivation from multi-task learning in detail, we first recall the definition of operator-valued reproducing kernels. [sent-89, score-0.298]
27 An L (Λ)-valued reproducing kernel on X is a function K : X × X → L (Λ) such that K(x, y) = K(y, x)∗ for all x, y ∈ X, and such that for all xj ∈ X, ξj ∈ Λ, j ∈ Nn := {1, 2, ..., n}, ∑_{j=1}^n ∑_{k=1}^n (K(xj, xk)ξj, ξk)Λ ≥ 0. (1) [sent-96, score-0.402]
28 For each L (Λ)-valued reproducing kernel K on X, there exists a unique Hilbert space, denoted by HK, consisting of Λ-valued functions on X such that K(x, ·)ξ ∈ HK for all x ∈ X and ξ ∈ Λ, (2) and (f(x), ξ)Λ = (f, K(x, ·)ξ)HK for all f ∈ HK, x ∈ X, and ξ ∈ Λ. (3) [sent-100, score-0.402]
29 Conversely, for each Λ-valued RKHS on X, there exists a unique L (Λ)-valued reproducing kernel K on X that satisfies (2) and (3). [sent-104, score-0.402]
30 For this reason, we also call K the reproducing kernel (or kernel for short) of HK . [sent-105, score-0.506]
31 The bijective correspondence between L (Λ)-valued reproducing kernels and Λ-valued RKHS is central to the theory of vector-valued RKHS. [sent-106, score-0.461]
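A minimal numerical sketch of this definition, assuming the separable Gaussian kernel K(x, y) = S exp(−(x − y)²/2) with a strictly positive definite matrix S (the kernel form that reappears in the numerical experiments) and a few arbitrary sample points; condition (1) then amounts to positive semi-definiteness of the block Gram matrix [K(xj, xk)]:

```python
import numpy as np

def gaussian_operator_kernel(x, y, S):
    """Separable matrix-valued kernel K(x, y) = S * exp(-(x - y)^2 / 2)."""
    return S * np.exp(-0.5 * (x - y) ** 2)

def block_gram(xs, S):
    """Assemble the block Gram matrix [K(xj, xk)]_{j,k} as one large matrix."""
    n, m = S.shape[0], len(xs)
    G = np.zeros((n * m, n * m))
    for j, xj in enumerate(xs):
        for k, xk in enumerate(xs):
            G[j * n:(j + 1) * n, k * n:(k + 1) * n] = gaussian_operator_kernel(xj, xk, S)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + 0.1 * np.eye(3)          # strictly positive definite, illustrative choice
xs = np.array([-1.0, 0.0, 0.5, 2.0])   # arbitrary inputs

# Condition (1): sum_{j,k} (K(xj, xk) xi_j, xi_k)_Lambda >= 0 for all xi_j,
# equivalently the block Gram matrix is positive semi-definite.
print("smallest eigenvalue:", np.linalg.eigvalsh(block_gram(xs, S)).min())
```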
32 Given two L (Λ)-valued reproducing kernels K, G on X, we shall investigate in this paper the fundamental embedding relationship HK ⪯ HG, meaning that HK ⊆ HG and, for all f ∈ HK, ‖f‖HK = ‖f‖HG. [sent-107, score-0.558]
33 When one of the above situations happens, a remedy is to modify the reproducing kernel. [sent-126, score-0.298]
34 Specifically, one might want to find another L (Λ)-valued reproducing kernel G such that HK ⪯ HG when there is underfitting, or such that HG ⪯ HK when there is overfitting. [sent-127, score-0.402]
35 We shall verify in the last section through extensive numerical simulations that the refinement kernel method is indeed able to provide an appropriate update of an operator-valued reproducing kernel when underfitting or overfitting occurs. [sent-129, score-0.603]
36 Before moving on to the characterization of refinement of operator-valued reproducing kernels, we collect here notations that will be frequently used in the rest of the paper. [sent-130, score-0.317]
37 General Characterizations. The relationship between the RKHS of the sum of two operator-valued reproducing kernels and those of the summand kernels has been made clear in Theorem 1 on page 44 of Pedrick (1957). [sent-134, score-0.624]
38 Proposition 1 Let K, G be two L (Λ)-valued reproducing kernels on X. [sent-136, score-0.461]
39 Then HK ⪯ HG if and only if G − K is an L (Λ)-valued reproducing kernel on X and HK ∩ HG−K = {0}; and if HK ⪯ HG then HG−K ⪯ HG. [sent-137, score-0.402]
40 Every reproducing kernel has a feature map representation. [sent-139, score-0.425]
41 The following lemma is useful in identifying the RKHS of a reproducing kernel given by a feature map representation (5). [sent-142, score-0.425]
42 Lemma 2 If K is an L (Λ)-valued reproducing kernel on X given by (5) then HK = {Φ(·)∗ u : u ∈ W } with inner product (Φ(·)∗ u, Φ(·)∗ v)HK := (PΦ u, PΦ v)W , u, v ∈ W , where PΦ is the orthogonal projection of W onto WΦ := span{Φ(x)ξ : x ∈ X, ξ ∈ Λ}. [sent-143, score-0.424]
43 Theorem 3 Suppose that L (Λ)-valued reproducing kernels K and G are given by the feature maps Φ : X → L (Λ, W ) and Φ′ : X → L (Λ, W ′ ), respectively. [sent-145, score-0.461]
44 To illustrate the above useful results, we shall present a concrete example aiming at refining L (Λ)-valued reproducing kernels K with a finite-dimensional RKHS. [sent-149, score-0.58]
45 Thus, we see that K(x, y) = Φ(y)∗ Φ(x), x, y ∈ X, implying that K is an L (Λ)-valued reproducing kernel. [sent-169, score-0.298]
46 Introduce for each L (Λ)-valued reproducing kernel K on X a scalar-valued reproducing kernel K̃ on the extended input space X̃ := X × Λ by setting K̃((x, ξ), (y, η)) := (K(x, y)ξ, η)Λ , x, y ∈ X, ξ, η ∈ Λ. [sent-191, score-0.804]
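A small illustrative snippet of this scalarization, assuming the same separable matrix-valued Gaussian kernel as in the sketch above (an arbitrary choice, used only to evaluate the displayed formula):

```python
import numpy as np

def scalarized_kernel(K, x, xi, y, eta):
    """K~((x, xi), (y, eta)) = (K(x, y) xi, eta)_Lambda for an operator-valued K."""
    return eta @ (K(x, y) @ xi)

# Illustrative separable matrix-valued kernel on X = R with Lambda = R^2.
S = np.array([[2.0, 0.5], [0.5, 1.0]])
K = lambda x, y: S * np.exp(-0.5 * (x - y) ** 2)

xi, eta = np.array([1.0, -1.0]), np.array([0.0, 2.0])
print(scalarized_kernel(K, 0.3, xi, 1.2, eta))
```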
47 It appears by Proposition 6 that we do not have to bother studying refinement of operator-valued reproducing kernels. [sent-205, score-0.298]
48 Typical examples include the important translation-invariant operator-valued kernels and Hessian kernels to be considered in the next section. [sent-213, score-0.348]
49 We introduce another notation before returning to reproducing kernels. [sent-241, score-0.298]
50 To introduce our L (Λ)-valued reproducing kernels, we also let W be a Hilbert space and φ a mapping from X × Ω to L (Λ, W ) such that for each x ∈ X, φ(x, ·) belongs to both L2 (Ω, L (Λ, W ), dµ) and L2 (Ω, L (Λ, W ), dν). [sent-247, score-0.298]
51 We verify that (20) indeed defines an L (Λ)-valued reproducing kernel. [sent-282, score-0.298]
52 Proposition 8 With the above assumptions on Ψ and µ, the function K defined by (20) is an L (Λ)-valued reproducing kernel on X. [sent-283, score-0.402]
53 Since Ψ(·, ·, tl) is a scalar-valued reproducing kernel on X, [Ψ(xj, xk, tl) : j, k ∈ Nn] is a positive semi-definite matrix for each l ∈ Nm. [sent-291, score-0.43]
54 To investigate the refinement relationship, we shall consider a simplified version of (20) that covers a large class of operator-valued reproducing kernels. [sent-295, score-0.395]
55 By Proposition 8, K, G are L (Λ)-valued reproducing kernels on X. [sent-299, score-0.461]
56 Examples. We present in this section several concrete examples of refinement of operator-valued reproducing kernels. [sent-405, score-0.32]
57 1 Translation Invariant Reproducing Kernels. Let d ∈ N and K be an L (Λ)-valued reproducing kernel on Rd. [sent-408, score-0.402]
58 A celebrated characterization due to Bochner (1959) states that every continuous scalar-valued translation invariant reproducing kernel on Rd must be the Fourier transform of a finite nonnegative Borel measure on Rd , and vice versa. [sent-410, score-0.573]
59 To this end, we first investigate the structure of the RKHS of a translation invariant L (Λ)-valued reproducing kernel. [sent-416, score-0.375]
60 Let γ be an arbitrary measure in B (Rd , Λ) and L the associated translation invariant reproducing kernel defined by L(x, y) = ∫_{Rd} e^{i(x−y)·t} dγ(t), x, y ∈ Rd. [sent-417, score-0.499]
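For intuition, when γ is a discrete operator-valued measure γ = Σl Bl δ_{tl} with positive semi-definite atoms Bl (an assumed special case chosen purely for illustration), the integral reduces to a finite sum that can be evaluated directly:

```python
import numpy as np

def translation_invariant_kernel(x, y, ts, Bs):
    """L(x, y) = sum_l exp(i (x - y) . t_l) B_l, a discrete instance of the
    integral of exp(i (x - y) . t) with respect to an operator-valued measure."""
    diff = x - y
    return sum(np.exp(1j * diff @ t) * B for t, B in zip(ts, Bs))

# Illustrative atoms t_l in R^2 and positive semi-definite matrices B_l.
ts = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 0.0])]
Bs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]]), np.eye(2)]

x, y = np.array([0.2, -0.4]), np.array([1.0, 0.3])
print(translation_invariant_kernel(x, y, ts, Bs))
```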
61 Proof. We introduce for each ξ ∈ Λ two scalar-valued translation invariant reproducing kernels on Rd by setting Aa (x, y) := (La (x, y)ξ, ξ)Λ , x, y ∈ Rd , a ∈ {c, s}. [sent-429, score-0.538]
62 In this subsection, we shall consider special translation invariant reproducing kernels and establish the characterization of refinement using Theorem 7. [sent-498, score-0.654]
63 Let k be a continuously differentiable translation invariant reproducing kernel on Rd . [sent-499, score-0.479]
64 By the general integral representation (17) of operator-valued reproducing kernels, K defined by (44) is an L (Cd )-valued reproducing kernel on Rd. [sent-506, score-0.743]
65 Matrix-valued translation invariant reproducing kernels of the form (44) are useful for the development of divergence-free kernel methods for solving some special partial differential equations (see, for example, Lowitzsch, 2003; Wendland, 2009, and the references therein). [sent-507, score-0.642]
66 Another class of kernels constructed from the Hessian of a scalar-valued translation invariant reproducing kernel is widely applied to the learning of a multivariate function together with its gradient simultaneously (Mukherjee and Wu, 2006; Mukherjee and Zhou, 2006; Ying and Campbell, 2008). [sent-508, score-0.642]
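The explicit form (48) does not appear in the text above; as an assumed stand-in, the sketch below uses the standard gradient-learning construction K(x, y) = ∇x∇yᵀ k(x, y) for a Gaussian k, the kind of Hessian kernel employed in the references just cited:

```python
import numpy as np

def hessian_gaussian_kernel(x, y, sigma=1.0):
    """Matrix-valued kernel K(x, y) = grad_x grad_y^T k(x, y) for the Gaussian
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), commonly used to learn a
    multivariate function together with its gradient."""
    d = x - y
    k = np.exp(-(d @ d) / (2.0 * sigma ** 2))
    return (np.eye(len(x)) / sigma ** 2 - np.outer(d, d) / sigma ** 4) * k

x, y = np.array([0.5, -1.0, 0.0]), np.array([0.0, 0.3, 0.2])
print(hessian_gaussian_kernel(x, y))   # a 3 x 3 matrix-valued kernel evaluation
```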
67 We aim at refining matrix-valued reproducing kernels of the forms (44) and (48) in this subsection. [sent-514, score-0.461]
68 Then K, G, K, G are matrix-valued translation invariant reproducing kernels on Rd . [sent-518, score-0.538]
69 3 Transformation Reproducing Kernels. Let us consider a particular class of matrix-valued reproducing kernels whose universality was studied in Caponnetto et al. [sent-530, score-0.481]
70 To this end, we let k, g be two scalar-valued reproducing kernels on another input space Y and Tp be mappings from X to Y , p ∈ Nn . [sent-533, score-0.461]
71 It is known that K, G defined above in (51) are indeed L (Cn )-valued reproducing kernels (Caponnetto et al. [sent-535, score-0.461]
72 A more general case of refinement of transformation reproducing kernels is discussed below. [sent-557, score-0.461]
73 Proposition 20 Let Tp , S p be mappings from X to Y and k, g be scalar-valued reproducing kernels on Y . [sent-559, score-0.461]
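The displayed definition (51) is not quoted here; assuming the transformation-kernel form in which the (p, q) entry is k(Tp(x), Tq(y)) (one natural reading, not a verbatim reproduction), a minimal sketch with arbitrary illustrative transformations reads:

```python
import numpy as np

def transformation_kernel(x, y, transforms, k):
    """n x n matrix-valued kernel whose (p, q) entry is k(T_p(x), T_q(y))
    (an assumed form of the transformation kernels discussed above)."""
    n = len(transforms)
    return np.array([[k(transforms[p](x), transforms[q](y)) for q in range(n)]
                     for p in range(n)])

# Illustrative scalar kernel on Y = R and simple mappings T_p : X = R -> Y = R.
k = lambda u, v: np.exp(-0.5 * (u - v) ** 2)
transforms = [lambda x: x, lambda x: 2.0 * x + 1.0, lambda x: np.sin(x)]

print(transformation_kernel(0.4, -0.2, transforms, k))
```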
74 4 Finite Hilbert-Schmidt Reproducing Kernels. We consider refinement of finite Hilbert-Schmidt reproducing kernels in this subsection. [sent-564, score-0.461]
75 Let B j ,C j be invertible operators in L+ (Λ), n ≤ m ∈ N, and Ψ j , j ∈ Nm , be scalar-valued reproducing kernels on the input space X. [sent-565, score-0.504]
76 By the general integral representation (20) and Proposition 8, K, G defined in (57) above are L (Λ)-valued reproducing kernels on X. [sent-567, score-0.504]
77 Theorem 21 Let K, G be defined by (57), where Bj, Cj ∈ L+ (Λ) are invertible and Ψj, j ∈ Nm, are scalar-valued reproducing kernels on X satisfying (58). [sent-569, score-0.461]
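A concrete illustrative instance of such finite Hilbert-Schmidt kernels, assuming the form K(x, y) = Σj Ψj(x, y) Bj suggested by the text (the particular Ψj and Bj below are arbitrary choices, and G simply carries one more summand as a candidate refinement):

```python
import numpy as np

def hs_kernel(x, y, psis, Bs):
    """Finite Hilbert-Schmidt kernel K(x, y) = sum_j Psi_j(x, y) B_j with
    scalar-valued kernels Psi_j and positive operators B_j (assumed form)."""
    return sum(psi(x, y) * B for psi, B in zip(psis, Bs))

# Illustrative scalar kernels and positive-definite coefficient matrices.
psis = [lambda x, y: np.exp(-0.5 * (x - y) ** 2),
        lambda x, y: (1.0 + x * y) ** 2,
        lambda x, y: np.exp(-abs(x - y))]
Bs = [np.eye(2), np.array([[2.0, 1.0], [1.0, 2.0]]), 0.5 * np.eye(2)]

K = lambda x, y: hs_kernel(x, y, psis[:2], Bs[:2])   # n = 2 summands
G = lambda x, y: hs_kernel(x, y, psis, Bs)           # m = 3 summands, candidate refinement
print(K(0.1, 0.7))
print(G(0.1, 0.7))
```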
78 The above equation can be equivalently formulated as ∑_{k∈Jj} ∑_{i∈I} (wj, ei ⊗ fk)Λ⊗Wj (ei, Bj ξ)Λ (fk, φj(x))Wj = 0. By the denseness of φj(X) in Wj, ∑_{i∈I} (wj, ei ⊗ fk)Λ⊗Wj (ei, Bj ξ)Λ = 0 for all j ∈ Nn, k ∈ Jj and ξ ∈ Λ. [sent-587, score-0.343]
79 We thus have for all j ∈ Nn and k ∈ Jj that ∑_{i∈I} (wj, ei ⊗ fk)Λ⊗Wj ei = 0, which implies (wj, ei ⊗ fk)Λ⊗Wj = 0 for all j ∈ Nn, k ∈ Jj, i ∈ I. [sent-588, score-0.34]
80 Existence. This section is devoted to the existence of nontrivial refinement of operator-valued reproducing kernels. [sent-620, score-0.372]
81 Proposition 25 There does not exist a nontrivial refinement of an L (Λ)-valued reproducing kernel K on X if and only if HK = Λ^X, the set of all the functions from X to Λ. [sent-624, score-0.476]
82 If the cardinality of X is infinite then every L (Λ)-valued reproducing kernel on X has a nontrivial refinement. [sent-625, score-0.476]
83 A necessary condition for an L (Λ)-valued reproducing kernel on X to have no nontrivial refinements is that ∑_{j=1}^n ∑_{k=1}^n (K(xj, xk)ξj, ξk)Λ > 0 for all ξj ∈ Λ, j ∈ Nn, with ∑_{j=1}^n ‖ξj‖Λ > 0. [sent-628, score-0.504]
84 In the process of refining an operator-valued reproducing kernel, it is usually desirable to preserve favorable properties of the original kernel. [sent-653, score-0.298]
85 We shall show that this is feasible as far as continuity and universality of operator-valued reproducing kernels are concerned. [sent-654, score-0.578]
86 Let X be a metric space and K an L (Λ)-valued reproducing kernel that is continuous from X × X to L (Λ) when the latter is equipped with the operator norm. [sent-655, score-0.448]
87 Then every continuous L (Λ)-valued reproducing kernel on X has a nontrivial continuous refinement. [sent-664, score-0.52]
88 Lemma 28 Let K be a continuous L (Λ)-valued reproducing kernel on X with the feature map representation (5), where Φ : X → L (Λ, W ) is continuous. [sent-669, score-0.447]
89 Proposition 29 Let X be a metric space and K a continuous L (Λ)-valued reproducing kernel on X. [sent-682, score-0.424]
90 Numerical Experiments. We present in this final section three numerical experiments on the application of refinement of operator-valued reproducing kernels to multi-task learning. [sent-685, score-0.461]
91 To deal with the noise and have an acceptable generalization error, we use the following regularization network: min_{f ∈ HK} (1/m) ∑_{j=1}^m ‖f(xj) − ξj‖²Λ + σ‖f‖²HK, (68) where K is a chosen Λ-valued reproducing kernel on X. [sent-688, score-0.402]
92 On the other hand, when overfitting appears in the second experiment, we shall then find a Λ-valued reproducing kernel L on X such that HL ⪯ HK with the same purpose. [sent-691, score-0.499]
93 The L+ (Rn )-valued reproducing kernel that we shall use in the regularization network (68) is a Gaussian kernel K(x, y) := S exp(−(x − y)²/2), x, y ∈ R, where S ∈ L+ (Rn ) is strictly positive-definite. [sent-699, score-0.603]
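A minimal sketch of fitting the regularization network (68) with this matrix-valued Gaussian kernel, assuming the standard representer-theorem solution f(x) = Σj K(x, xj) cj with stacked coefficients solving (G + mσI)c = ξ, where G is the block Gram matrix; the training data below are synthetic placeholders rather than the paper's experimental data:

```python
import numpy as np

def K(x, y, S):
    """Matrix-valued Gaussian kernel K(x, y) = S * exp(-(x - y)^2 / 2)."""
    return S * np.exp(-0.5 * (x - y) ** 2)

def fit_regularization_network(xs, targets, S, sigma):
    """Solve min_f (1/m) sum_j ||f(xj) - xi_j||^2 + sigma ||f||_HK^2 via the
    representer theorem: f(x) = sum_j K(x, xj) c_j with (G + m*sigma*I) c = xi."""
    m, n = len(xs), S.shape[0]
    G = np.zeros((m * n, m * n))
    for j in range(m):
        for k in range(m):
            G[j * n:(j + 1) * n, k * n:(k + 1) * n] = K(xs[j], xs[k], S)
    c = np.linalg.solve(G + m * sigma * np.eye(m * n), targets.reshape(-1)).reshape(m, n)
    return lambda x: sum(K(x, xs[j], S) @ c[j] for j in range(m))

# Synthetic multi-task data: noisy samples of two related scalar tasks.
rng = np.random.default_rng(1)
xs = np.linspace(0.0, 3.0, 20)
targets = np.stack([np.sin(xs), np.cos(xs)], axis=1) + 0.1 * rng.standard_normal((20, 2))

S = np.array([[1.0, 0.3], [0.3, 1.0]])   # strictly positive definite, illustrative
f = fit_regularization_network(xs, targets, S, sigma=1e-2)
print(f(1.5))   # predicted vector-valued output at a new input
```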
94 More specifically, we plan to apply the refinement kernel method (table comparing kernel K and kernel L at n = 8 omitted). [sent-1185, score-0.312]
95 Conclusion and Discussion. The refinement relationship between two operator-valued reproducing kernels provides a promising way of updating kernels for multi-task machine learning when overfitting or underfitting occurs. [sent-1299, score-0.624]
96 Particular attention has been paid to the case when the kernels under investigation have a vector-valued integral representation, the most general form of operator-valued reproducing kernels. [sent-1301, score-0.504]
97 By the characterizations, we present concrete examples of refining translation invariant operator-valued reproducing kernels, the Hessian of the scalar-valued Gaussian kernel, and finite Hilbert-Schmidt operator-valued reproducing kernels. [sent-1302, score-0.695]
98 Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. [sent-1370, score-0.517]
99 Theory of reproducing kernels for Hilbert spaces of vector valued functions. [sent-1483, score-0.536]
100 On the inclusion relation of reproducing kernel Hilbert spaces. [sent-1536, score-0.402]
wordName wordTfidf (topN-words)
[('hk', 0.43), ('hg', 0.41), ('nement', 0.327), ('reproducing', 0.298), ('nn', 0.24), ('hang', 0.184), ('kernels', 0.163), ('efinement', 0.141), ('eproducing', 0.141), ('perator', 0.141), ('ek', 0.128), ('ernels', 0.121), ('nm', 0.119), ('rd', 0.112), ('kernel', 0.104), ('eg', 0.102), ('rkhs', 0.098), ('shall', 0.097), ('re', 0.093), ('tp', 0.084), ('ei', 0.082), ('hkc', 0.077), ('outliers', 0.077), ('valued', 0.075), ('nontrivial', 0.074), ('span', 0.069), ('hilbert', 0.065), ('el', 0.063), ('xu', 0.063), ('hgc', 0.058), ('hgs', 0.058), ('characterizations', 0.054), ('banach', 0.053), ('diestel', 0.051), ('kz', 0.051), ('uhl', 0.051), ('borel', 0.049), ('translation', 0.047), ('fk', 0.047), ('ran', 0.045), ('hks', 0.045), ('hkz', 0.045), ('proposition', 0.044), ('operators', 0.043), ('integral', 0.043), ('absolutely', 0.041), ('integrable', 0.04), ('carmeli', 0.038), ('denseness', 0.038), ('hlc', 0.038), ('hls', 0.038), ('tting', 0.037), ('gc', 0.036), ('zhang', 0.036), ('bk', 0.035), ('cn', 0.035), ('lebesgue', 0.035), ('nonnegative', 0.033), ('caponnetto', 0.033), ('bochner', 0.032), ('brought', 0.03), ('tq', 0.03), ('invariant', 0.03), ('xk', 0.028), ('instances', 0.028), ('xy', 0.028), ('kc', 0.027), ('hl', 0.027), ('china', 0.027), ('fn', 0.026), ('lim', 0.026), ('guangdong', 0.026), ('sees', 0.025), ('adjoint', 0.025), ('micchelli', 0.025), ('operator', 0.024), ('listed', 0.024), ('map', 0.023), ('minimizer', 0.023), ('orthogonal', 0.022), ('continuous', 0.022), ('fz', 0.022), ('cucker', 0.022), ('schatten', 0.022), ('mukherjee', 0.022), ('tk', 0.022), ('concrete', 0.022), ('hessian', 0.022), ('errors', 0.02), ('universality', 0.02), ('lc', 0.02), ('nements', 0.02), ('measure', 0.02), ('characterization', 0.019), ('guangzhou', 0.019), ('haizhang', 0.019), ('hla', 0.019), ('pedrick', 0.019), ('qinghui', 0.019), ('runge', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
Author: Haizhang Zhang, Yuesheng Xu, Qinghui Zhang
Abstract: This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated from the need of updating the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need. Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant and finite Hilbert-Schmidt operator-valued reproducing kernels are provided. Other examples include refinement of Hessian of scalar-valued translation-invariant kernels and transformation kernels. Existence and properties of operator-valued reproducing kernels preserved during the refinement process are also investigated. Keywords: vector-valued reproducing kernel Hilbert spaces, operator-valued reproducing kernels, refinement, embedding, translation invariant kernels, Hessian of Gaussian kernels, Hilbert-Schmidt kernels, numerical experiments
2 0.098515257 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan
Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs can not directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C -learning,” and we present efficient coordinate descent algorithms for the training of regularized C -learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C learning
3 0.095710211 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
4 0.090212353 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
5 0.086499676 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
6 0.0748519 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.071891159 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
8 0.065026872 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
9 0.061258055 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
10 0.054976501 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.053449422 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
12 0.053090803 4 jmlr-2012-A Kernel Two-Sample Test
13 0.053039961 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
14 0.049816646 44 jmlr-2012-Feature Selection via Dependence Maximization
15 0.049219977 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
16 0.047357142 59 jmlr-2012-Linear Regression With Random Projections
17 0.045301113 111 jmlr-2012-Structured Sparsity and Generalization
18 0.042274289 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
19 0.041960247 20 jmlr-2012-Analysis of a Random Forests Model
20 0.041226786 97 jmlr-2012-Regularization Techniques for Learning with Matrices
topicId topicWeight
[(0, -0.184), (1, 0.093), (2, -0.052), (3, 0.149), (4, 0.1), (5, -0.041), (6, -0.169), (7, -0.017), (8, 0.023), (9, -0.009), (10, 0.069), (11, -0.079), (12, 0.025), (13, -0.007), (14, 0.155), (15, 0.188), (16, -0.044), (17, -0.055), (18, -0.11), (19, 0.065), (20, 0.08), (21, 0.006), (22, -0.085), (23, 0.062), (24, 0.15), (25, -0.01), (26, 0.144), (27, -0.105), (28, -0.136), (29, 0.162), (30, -0.107), (31, -0.203), (32, 0.052), (33, -0.068), (34, -0.166), (35, 0.188), (36, 0.032), (37, 0.042), (38, -0.061), (39, -0.074), (40, 0.087), (41, 0.097), (42, -0.187), (43, 0.05), (44, 0.0), (45, -0.127), (46, 0.092), (47, -0.028), (48, 0.013), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96559453 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
Author: Haizhang Zhang, Yuesheng Xu, Qinghui Zhang
Abstract: This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated from the need of updating the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need. Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant and finite Hilbert-Schmidt operator-valued reproducing kernels are provided. Other examples include refinement of Hessian of scalar-valued translation-invariant kernels and transformation kernels. Existence and properties of operator-valued reproducing kernels preserved during the refinement process are also investigated. Keywords: vector-valued reproducing kernel Hilbert spaces, operator-valued reproducing kernels, refinement, embedding, translation invariant kernels, Hessian of Gaussian kernels, Hilbert-Schmidt kernels, numerical experiments
2 0.54331756 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
3 0.48151481 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
4 0.38343185 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan
Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs can not directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C -learning,” and we present efficient coordinate descent algorithms for the training of regularized C -learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C learning
5 0.37719682 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
Author: Tim van Erven, Mark D. Reid, Robert C. Williamson
Abstract: Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses. Keywords: mixability, multiclass, prediction with expert advice, proper loss, learning rates
6 0.32483622 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
7 0.31987941 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
8 0.30728906 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
9 0.30579454 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
10 0.29576191 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
11 0.27090096 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
12 0.24932085 44 jmlr-2012-Feature Selection via Dependence Maximization
13 0.24051072 4 jmlr-2012-A Kernel Two-Sample Test
14 0.23509187 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
15 0.23315485 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
16 0.23036566 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
17 0.22721906 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
18 0.22422482 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
19 0.21480434 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
20 0.21392529 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
topicId topicWeight
[(7, 0.026), (21, 0.05), (26, 0.031), (29, 0.07), (32, 0.338), (35, 0.012), (49, 0.011), (56, 0.013), (57, 0.013), (67, 0.011), (69, 0.021), (75, 0.084), (77, 0.011), (79, 0.015), (92, 0.133), (96, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.77850568 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
Author: Haizhang Zhang, Yuesheng Xu, Qinghui Zhang
Abstract: This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated from the need of updating the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need. Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant and finite Hilbert-Schmidt operator-valued reproducing kernels are provided. Other examples include refinement of Hessian of scalar-valued translation-invariant kernels and transformation kernels. Existence and properties of operator-valued reproducing kernels preserved during the refinement process are also investigated. Keywords: vector-valued reproducing kernel Hilbert spaces, operator-valued reproducing kernels, refinement, embedding, translation invariant kernels, Hessian of Gaussian kernels, Hilbert-Schmidt kernels, numerical experiments
2 0.44687349 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
3 0.4468146 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
4 0.44532841 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
5 0.43977824 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
6 0.43956318 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.43905339 4 jmlr-2012-A Kernel Two-Sample Test
8 0.43766305 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
9 0.4366855 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
10 0.4351877 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
11 0.43477166 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.43464261 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
13 0.43313015 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
14 0.43279368 34 jmlr-2012-Dynamic Policy Programming
15 0.43234122 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
16 0.43078372 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
17 0.4301855 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
18 0.42934394 13 jmlr-2012-Active Learning via Perfect Selective Classification
19 0.42920828 80 jmlr-2012-On Ranking and Generalization Bounds
20 0.42431945 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity