jmlr jmlr2005 jmlr2005-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
Reference: text
sentIndex sentText sentNum sentScore
1 This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. [sent-13, score-0.278]
2 We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. [sent-15, score-0.193]
3 Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming 1. [sent-18, score-0.222]
4 This paper provides an inference framework for learning the kernel from training data using an approach akin to the regularized quality functional. [sent-25, score-0.402]
5 This difference in scale creates problems for the Gaussian RBF kernel, since it is unable to find a kernel width suitable for both directions. [sent-35, score-0.207]
6 We ‘learn’ this kernel by defining a quantity analogous to the risk functional, called the quality functional, which measures the ‘badness’ of the kernel function. [sent-43, score-0.5]
7 2 Related Work We analyze some recent approaches to learning the kernel by looking at the objective function that is being optimized and the class of kernels being considered. [sent-50, score-0.22]
8 , 2002) uses the objective function tr(Kyy⊤ ) where y are the training labels, and K is from the class of kernels spanned by the eigenvectors of the kernel matrix of the combined training and test data. [sent-60, score-0.252]
9 Similar to this, Bousquet and Herrmann (2002) further restrict the class of kernels to the convex hull of the kernel matrices normalized by their trace. [sent-64, score-0.22]
10 In principle, one can learn the kernel using Bayesian methods by defining a suitable prior, and learning the hyperparameters by optimizing the marginal likelihood (Williams and Barber, 1998, Williams and Rasmussen, 1996). [sent-69, score-0.205]
11 As an example of this, when other information is available, an auxiliary matrix can be used with the EM algorithm for learning the kernel (Tsuda et al. [sent-70, score-0.208]
12 (2003), we show (Section 2) that for most kernel-based learning methods there exists a functional, the quality functional, which plays a similar role to the empirical risk functional. [sent-81, score-0.206]
13 We introduce a kernel on the space of kernels itself, a hyperkernel (Section 3), and its regularization on the associated hyper reproducing kernel Hilbert space (Hyper-RKHS). [sent-82, score-0.947]
14 Using this general framework, we consider the specific example of using the regularized risk functional in the rest of the paper. [sent-86, score-0.329]
15 The positive definiteness of the kernel function is ensured using the positive definiteness of the kernel matrix (Section 5), and the resulting optimization problem is a semidefinite program. [sent-87, score-0.371]
16 We denote by K the kernel matrix given by K_ij := k(x_i, x_j) where x_i, x_j ∈ Xtrain and k is a positive semidefinite kernel function. [sent-103, score-0.36]
17 We begin by introducing a new class of functionals Q on data which we will call quality functionals. [sent-105, score-0.209]
18 Their purpose is to indicate, given a kernel k and the training data, how suitable the kernel is for explaining the training data, or in other words, the quality of the kernel for the estimation problem at hand. [sent-107, score-0.611]
19 Such quality functionals may be the Kernel Target Alignment, the negative log posterior, the minimum of the regularized risk functional, or any luckiness function for kernel methods. [sent-108, score-0.573]
20 We will discuss those functionals after a formal definition of the quality functional itself. [sent-109, score-0.357]
21 1 Empirical and Expected Quality Definition 1 (Empirical Quality Functional) Given a kernel k, and data X, Y, we define Q_emp(k, X, Y) to be an empirical quality functional if it depends on k only via k(x_i, x_j) where x_i, x_j ∈ X for 1 ≤ i, j ≤ m. [sent-111, score-0.469]
22 Provided a sufficiently rich class of kernels K it is in general possible to find a kernel k∗ ∈ K that attains the minimum of any such Qemp regardless of the data. [sent-114, score-0.22]
23 To measure the overall quality of k we therefore introduce the following definition: Definition 2 (Expected Quality Functional) Denote by Q_emp(k, X, Y) an empirical quality functional, then Q(k) := E_{X,Y}[Q_emp(k, X, Y)] is defined to be the expected quality functional. [sent-116, score-0.42]
24 However, while in the case of the empirical risk we can interpret Remp as the empirical estimate of the expected loss R( f ) = Ex,y [l(x, y, f (x))], due to the general form of Qemp , no such analogy is available for quality functionals. [sent-120, score-0.206]
25 In other words, we assume that a) we are given a concentration inequality on quality functionals, such as Pr{|Q_emp(k, X, Y) − Q(k)| ≥ ε_Q} < δ_Q, and b) we have a bound on the deviation of the empirical risk in terms of the quality functional, Pr{|R_emp(f, X, Y) − R(f)| ≥ ε_R} < δ(Q_emp). [sent-125, score-0.533]
26 This means that the bound now becomes independent of the particular value of the quality functional obtained on the data, rather than the expected value of the quality functional. [sent-127, score-0.428]
27 2 Examples of Qemp Before we continue with the derivations of a regularized quality functional and introduce a corresponding reproducing kernel Hilbert space, we give some examples of quality functionals and present their exact minimizers, whenever possible. [sent-131, score-0.817]
28 This demonstrates that given a rich enough feature space, we can arbitrarily minimize the empirical quality functional Qemp . [sent-132, score-0.362]
29 The difference from traditional kernel methods is that we allow the kernel itself to change. [sent-133, score-0.322]
30 That is, if we use the training labels as the kernel matrix, we arbitrarily minimize the quality functional. [sent-136, score-0.361]
31 They take on the general form R_reg(f, Xtrain, Ytrain) := (1/m) ∑_{i=1}^m l(x_i, y_i, f(x_i)) + (λ/2)‖f‖²_H (1), where ‖f‖²_H is the RKHS norm of f and l is a loss function such that for f(x_i) = y_i, l(x_i, y_i, y_i) = 0. [sent-139, score-0.239]
32 By virtue of the representer theorem (see Section 3) we know that the minimizer of (1) can be written as a kernel expansion. [sent-140, score-0.259]
33 This leads to the following definition of a quality functional, for a particular loss function l: Q_emp^regrisk(k, Xtrain, Ytrain) := min_{α∈R^m} (1/m) ∑_{i=1}^m l(x_i, y_i, [Kα]_i) + (λ/2) α⊤Kα. [sent-141, score-0.373]
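As a concrete illustration, here is a minimal Python sketch of this quantity for the squared loss, where the inner minimization over α has the closed form α = (K + (mλ/2)I)⁻¹y; the function name and default λ are illustrative, not taken from the paper.

```python
import numpy as np

def regrisk_quality_squared_loss(K, y, lam=1.0):
    """Empirical quality functional Q_emp^regrisk for the squared loss.

    Minimizes (1/m) * ||y - K @ alpha||^2 + (lam/2) * alpha @ K @ alpha over alpha,
    using the closed-form minimizer alpha = (K + (m*lam/2) * I)^{-1} y.
    """
    m = len(y)
    alpha = np.linalg.solve(K + 0.5 * m * lam * np.eye(m), y)
    residual = y - K @ alpha
    return residual @ residual / m + 0.5 * lam * alpha @ K @ alpha
```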
34 Since (3) tends to zero and the regularized risk is lower bounded by zero, we can still arbitrarily minimize Q_emp^regrisk. [sent-150, score-0.315]
35 The quality functional for classification using kernel methods is given by Q_emp^loo(k, Xtrain, Ytrain) := min_{α∈R^m} (1/m) ∑_{i=1}^m −y_i sign([Kα_i]_i), which is optimized in Duan et al. [sent-154, score-0.467]
36 Example 3 (Kernel Target Alignment) This quality functional was introduced by Cristianini et al. [sent-162, score-0.288]
37 (2002) to assess the alignment of a kernel with training labels. [sent-163, score-0.244]
38 This quality functional was optimized in Lanckriet et al. [sent-168, score-0.288]
39 The above examples illustrate how existing methods for assessing the quality of a kernel fit within the quality functional framework. [sent-174, score-0.575]
40 Hyper Reproducing Kernel Hilbert Spaces We now propose a conceptually simple method to optimize quality functionals over classes of kernels by introducing a reproducing kernel Hilbert space on the kernel k itself, so to speak, a Hyper-RKHS. [sent-178, score-0.634]
41 H is called a reproducing kernel Hilbert space endowed with the dot product ⟨·, ·⟩ (and the norm ‖f‖ := √⟨f, f⟩) if there exists a function k : X × X → R with the following properties. [sent-181, score-0.232]
42 Then each minimizer f ∈ H of the general regularized risk l((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) + Ω(‖f‖_H), with Ω strictly monotonically increasing, admits a representation of the form f(x) = ∑_{i=1}^m α_i k(x_i, x). [sent-193, score-0.218]
43 The Hilbert space H of functions k : X → R, endowed with a dot product ⟨·, ·⟩ (and the norm ‖k‖ := √⟨k, k⟩), is called a hyper reproducing kernel Hilbert space if there exists a hyperkernel k : X × X → R with the following properties: 1. [sent-202, score-0.721]
44 By analogy with the definition of the regularized risk functional (1), we proceed to define the regularized quality functional. [sent-214, score-0.584]
45 For a positive semidefinite kernel matrix K on X, the regularized quality functional is defined as Q_reg(k, X, Y) := Q_emp(k, X, Y) + (λ_Q/2)‖k‖²_H (6), where λ_Q > 0 is a regularization constant and ‖k‖²_H denotes the RKHS norm in H. [sent-216, score-0.642]
46 Note that although we have possibly non-positive kernels in H, we define the regularized quality functional only on positive semidefinite kernel matrices. [sent-217, score-0.623]
47 Note too that the kernel matrix K also only depends on k via its values k(x_i, x_j) at the data points. [sent-229, score-0.213]
48 H Lemma 7 implies that the solution of the regularized quality functional is a linear combination of hyperkernels on the input data. [sent-231, score-0.534]
49 While the latter is almost impossible to enforce directly, the former could be verified directly, hence imposing a constraint only on the values of the kernel matrix k(xi , x j ) rather than on the kernel function k itself. [sent-236, score-0.326]
50 This latter requirement can be formally stated as follows: for any fixed x̄ the hyperkernel k is a kernel in its second argument; that is, for any fixed x̄, the function k(x, x′) := k(x̄, (x, x′)), with x, x′ ∈ X, is a positive semidefinite kernel. [sent-239, score-0.588]
51 Given β_ij ≥ 0 for all i, j = 1, …, m, the kernel k(x_p, x_q) := ∑_{i,j=1}^m β_ij k((x_i, x_j), (x_p, x_q)) is positive semidefinite. [sent-243, score-0.247]
52 There exists a GP analog to the support vector machine (for example Opper and Winther (2000), Seeger (1999)), which is essentially obtained (ignoring normalizing terms) by exponentiating the regularized risk functional used in SVMs. [sent-254, score-0.365]
53 Given the regularized quality functional (Equation 6), with the Qemp set to the SVM with squared loss, we obtain the following equation. [sent-257, score-0.403]
54 Substituting the expressions for p(Y|X, f), p(f|k) and p(k|k_0, k) into the posterior (Equation 9), we get Equation (10), which is of the same form as the exponentiated negative quality (Equation 8): exp(−(γ_ε/2) ∑_{i=1}^m (y_i − f(x_i))²) · exp(−(γ_f/2) ⟨f, f⟩_{H_k}) · exp(−(1/2) ⟨k, k_0⟩_H). [sent-293, score-0.239]
55 Note that the maximum likelihood (ML-II) estimator (MacKay, 1994, Williams and Barber, 1998, Williams and Rasmussen, 1996) in Bayesian estimation leads to the same optimization problems as those arising from minimizing the regularized quality functional. [sent-297, score-0.3]
56 1 Power Series Construction Suppose k is a kernel such that k(x, x′) ≥ 0 for all x, x′ ∈ X, and suppose g : R → R is a function with positive Taylor expansion coefficients, that is g(ξ) = ∑_{i=0}^∞ c_i ξ^i with c_i ≥ 0 for all i = 0, 1, …; then, provided the series converges, k((x, x′), (x′′, x′′′)) := g(k(x, x′) k(x′′, x′′′)) = ∑_{i=0}^∞ c_i (k(x, x′) k(x′′, x′′′))^i is a hyperkernel. [sent-302, score-0.27]
57 To see this, observe that for any fixed x̄, k(x̄, (x, x′)) is a sum of kernel functions, hence it is a kernel itself (since k^p(x, x′) is a kernel if k is, for p ∈ N). [sent-309, score-0.441]
58 The Gaussian RBF kernel satisfies this condition, but polynomial kernels of odd degree are not always pointwise positive. [sent-315, score-0.22]
59 Example 4 (Harmonic Hyperkernel) Suppose k is a kernel with range [0, 1] (RBF kernels satisfy this property), and set c_i := (1 − λ_h)λ_h^i, i ∈ N, for some 0 < λ_h < 1. [sent-317, score-0.263]
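With these coefficients the power series construction above is a geometric series, so it can be summed in closed form: k((x, x′), (x′′, x′′′)) = (1 − λ_h)/(1 − λ_h k(x, x′)k(x′′, x′′′)). A minimal Python sketch, assuming a Gaussian RBF base kernel with a single width σ (names and default values are illustrative):

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF base kernel; takes values in (0, 1]."""
    d2 = float(np.sum((np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def harmonic_hyperkernel(pair_a, pair_b, lam_h=0.6, sigma=1.0):
    """Closed form of the power series with c_i = (1 - lam_h) * lam_h**i:
    k((x, x'), (x'', x''')) = (1 - lam_h) / (1 - lam_h * k(x, x') * k(x'', x''')).
    """
    (x, xp), (xpp, xppp) = pair_a, pair_b
    a = rbf(x, xp, sigma) * rbf(xpp, xppp, sigma)
    return (1.0 - lam_h) / (1.0 - lam_h * a)
```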
60 However, if we want the kernel to adapt automatically to different widths for each dimension, we need to perform the summation that led to (12) for each dimension in its arguments separately. [sent-365, score-0.203]
61 Such a hyperkernel corresponds to ideas developed in automatic relevance determination (ARD) (MacKay, 1994, Neal, 1996). [sent-366, score-0.441]
62 transform with respect to (x − x′). The function k is a hyperkernel if k(τ, τ′) is a kernel in τ, τ′ and F_1 k(ω, (x′′ − x′′′)) ≥ 0 for all (x′′ − x′′′) and ω. [sent-377, score-0.588]
63 This means that we can check whether a radial basis function (for example Gaussian RBF, exponential RBF, damped harmonic oscillator, generalized B_n spline) can be used to construct a hyperkernel by checking whether its Fourier transform is positive. [sent-385, score-0.474]
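As a rough numerical sanity check of that condition for a one-dimensional radial profile (not a substitute for the analytic Fourier transform; the grid parameters below are arbitrary):

```python
import numpy as np

def fourier_transform_nonnegative(profile, half_width=20.0, n=4096, tol=1e-9):
    """Sample a symmetric profile r(t) on [-half_width, half_width) and check that its
    discrete Fourier transform has no significantly negative component."""
    t = np.linspace(-half_width, half_width, n, endpoint=False)
    vals = profile(t)
    spectrum = np.fft.fft(np.fft.ifftshift(vals)).real  # symmetric real input -> (nearly) real spectrum
    return bool(spectrum.min() >= -tol * np.abs(spectrum).max())

print(fourier_transform_nonnegative(lambda t: np.exp(-t ** 2 / 2)))  # Gaussian RBF profile: True
```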
64 3 Explicit Expansion If we have a finite set of kernels that we want to choose from, we can generate a hyperkernel which is a finite sum of possible kernel functions. [sent-387, score-0.661]
65 Suppose ki (x, x′ ) is a kernel for each i = 1, . [sent-390, score-0.198]
66 , n (for example the RBF kernel or the polynomial kernel), then k(x, x′) := ∑_{i=1}^n c_i k_i(x) k_i(x′), with k_i(x) ≥ 0 for all x, (15) is a hyperkernel, as can be seen by an argument similar to that of section 4. [sent-393, score-0.292]
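A minimal sketch of (15), assuming each base kernel k_i is pointwise non-negative on the data; the RBF widths and uniform coefficients below are placeholder choices, not values from the paper.

```python
import numpy as np

def gaussian(x, xp, sigma):
    d2 = float(np.sum((np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

# A small dictionary of base kernels to choose among (illustrative widths).
widths = [0.1, 1.0, 10.0]
base_kernels = [lambda x, xp, s=s: gaussian(x, xp, s) for s in widths]

def explicit_hyperkernel(pair_a, pair_b, coeffs=None):
    """Equation (15): k((x, x'), (x'', x''')) = sum_i c_i * k_i(x, x') * k_i(x'', x''')."""
    (x, xp), (xpp, xppp) = pair_a, pair_b
    if coeffs is None:
        coeffs = [1.0] * len(base_kernels)
    return sum(c * k(x, xp) * k(xpp, xppp) for c, k in zip(coeffs, base_kernels))
```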
67 Optimization Problems for Regularized Risk based Quality Functionals We will now consider the optimization of the quality functionals utilizing hyperkernels. [sent-402, score-0.254]
68 We choose the regularized risk functional as the empirical quality functional; that is we set Qemp (k, X,Y ) := Rreg ( f , X,Y ). [sent-403, score-0.469]
69 For a particular loss function l(xi , yi , f (xi )), we obtain the regularized quality functional. [sent-407, score-0.308]
70 Using the soft margin loss, we obtain min_β min_α (1/m) ∑_{i=1}^m max(0, 1 − y_i f(x_i)) + (λ/2) α⊤Kα + (λ_Q/2) β⊤Kβ subject to β ≥ 0 (18), where α ∈ R^m are the coefficients of the kernel expansion (5), and β ∈ R^{m²} are the coefficients of the hyperkernel expansion (7). [sent-410, score-0.771]
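The paper solves (18) as a semidefinite program; purely to illustrate the structure of the objective, the sketch below runs plain projected (sub)gradient descent on (α, β) with β kept non-negative. This is a heuristic stand-in, not the SDP of the paper; the step size and iteration count are arbitrary, and hyper_gram stands for the m² × m² hyperkernel Gram matrix defined later in this section.

```python
import numpy as np

def soft_margin_joint_descent(hyper_gram, y, lam=1.0, lam_q=1.0, steps=2000, lr=1e-3):
    """Heuristic (sub)gradient descent on the objective of (18).

    hyper_gram: (m*m, m*m) hyperkernel Gram matrix; the kernel matrix induced by the
    hyperkernel coefficients beta is K = reshape(hyper_gram @ beta, (m, m)).
    """
    m = len(y)
    alpha = np.zeros(m)
    beta = np.full(m * m, 1.0 / (m * m))                 # start feasible: beta >= 0
    for _ in range(steps):
        K = (hyper_gram @ beta).reshape(m, m)
        K = 0.5 * (K + K.T)                              # guard against numerical asymmetry
        active = ((y * (K @ alpha)) < 1).astype(float)   # margin violators of the hinge loss
        # (sub)gradient w.r.t. alpha of the hinge term plus (lam/2) alpha^T K alpha
        g_alpha = -(K @ (active * y)) / m + lam * (K @ alpha)
        # gradient w.r.t. K, pushed through K = reshape(hyper_gram @ beta) by the chain rule
        dK = -np.outer(active * y, alpha) / m + 0.5 * lam * np.outer(alpha, alpha)
        g_beta = hyper_gram.T @ dK.ravel() + lam_q * (hyper_gram @ beta)
        alpha = alpha - lr * g_alpha
        beta = np.maximum(beta - lr * g_beta, 0.0)       # project back onto beta >= 0
    return alpha, beta
```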
71 Specifically, when we have the regularized quality functional, d(ξ) is quadratic, and hence we obtain an optimization problem which has a mix of linear, quadratic and semidefinite constraints. [sent-441, score-0.328]
72 Since ξ is the coefficient in the hyperkernel expansion, this implies that we have a set of possible kernels which are just scalar multiples of each other. [sent-447, score-0.514]
73 In our setting, the regularizer for controlling the complexity of the kernel is taken to be the squared norm of the kernel in the Hyper-RKHS. [sent-452, score-0.361]
74 We define the hyperkernel Gram matrix K by putting together m² of these vectors; that is, we set K = [K_pq] for p, q = 1, …, m. [sent-472, score-0.473]
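As a rough sketch (function names are illustrative), the m² × m² hyperkernel Gram matrix can be assembled directly from the training inputs, and a candidate kernel matrix is recovered from hyperkernel coefficients β by the reshape used throughout this section; the O(m⁴) storage is what motivates the low-rank approximation mentioned later.

```python
import numpy as np

def hyperkernel_gram(X, hyper_k):
    """K_[(i,j),(p,q)] = hyper_k((X[i], X[j]), (X[p], X[q])), stored as an (m^2 x m^2) array."""
    m = len(X)
    pairs = [(X[i], X[j]) for i in range(m) for j in range(m)]
    G = np.empty((m * m, m * m))
    for a, pa in enumerate(pairs):
        for b, pb in enumerate(pairs):
            G[a, b] = hyper_k(pa, pb)
    return G

def kernel_from_beta(G, beta, m):
    """Candidate kernel matrix K = reshape(G @ beta, (m, m)) for coefficients beta >= 0."""
    return (G @ beta).reshape(m, m)
```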
75 Let w be the weight vector and b_offset the bias term in feature space; that is, the hypothesis function in feature space is defined as g(x) = w⊤φ(x) + b_offset, where φ(·) is the feature mapping defined by the kernel function k. [sent-474, score-0.361]
76 Where appropriate, γ and χ are Lagrange multipliers, while η and ξ are vectors of Lagrange multipliers from the derivation of the Wolfe dual for the SDP, β are the hyperkernel coefficients, t1 and t2 are the auxiliary variables. [sent-482, score-0.47]
77 The empirical quality functional derived from this is Q_emp(k, X, Y) = min_{f∈H} (1/m) ∑_{i=1}^m ζ_i² + (1/2)(‖w‖²_H + b_offset²) subject to y_i f(x_i) ≥ 1 − ζ_i and ζ_i ≥ 0 for all i = 1, …, m. [sent-514, score-0.397]
78 A suitable quality functional is Q_emp(k, X, Y) = min_{f∈H} (1/(νm)) ∑_{i=1}^m ζ_i + (1/2)‖w‖²_H − ρ subject to f(x_i) ≥ ρ − ζ_i and ζ_i ≥ 0 for all i = 1, …, m. [sent-523, score-0.374]
79 The score to be used for novelty detection is given by f = Kα − b_offset, which reduces to f = η − ξ by substituting α = K†(γ1 + η − ξ), b_offset = γ1 and K = reshape(Kβ). [sent-528, score-0.362]
80 Using the ε-insensitive loss, l(x_i, y_i, f(x_i)) = max(0, |y_i − f(x_i)| − ε), and the ν-parameterized quality functional, Q_emp(k, X, Y) = min_{f∈H} C νε + (1/m) ∑_{i=1}^m (ζ_i + ζ_i*) subject to f(x_i) − y_i ≤ ε + ζ_i, y_i − f(x_i) ≤ ε + ζ_i*, and ζ_i^(*) ≥ 0 for all i = 1, …, m. [sent-532, score-0.355]
81 , 2002), Qemp = tr(Kyy⊤ ) = y⊤ Ky, we directly minimize the regularized quality functional, obtaining the following optimization problem (Lanckriet et al. [sent-538, score-0.339]
82 The optimization problems in Section 6 were solved with an approximate hyperkernel matrix as described in Section 7. [sent-549, score-0.518]
83 We used the hyperkernel for automatic relevance determination defined by (14) for the hyperkernel optimization problems. [sent-552, score-0.927]
84 • The Harmonic Hyperkernel parameter was set to λ_h = 0.6, giving adequate coverage over various kernel widths in (12) (small λ_h emphasizes wide kernels almost exclusively, λ_h close to 1 will treat all widths equally). [sent-568, score-0.284]
85 • The hyperkernel regularization constant was set to λQ = 1. [sent-569, score-0.474]
86 However, for regression, c := stddev(Ytrain) (the standard deviation of the training labels) so that the hyperkernel coefficients are of the same scale as the output (the constant offset b_offset takes care of the mean). [sent-571, score-0.587]
87 In certain cases, numerical problems in the SDP optimizer or in the pseudo-inverse may prevent this hypothesis from optimizing the regularized risk for the particular kernel matrix. [sent-573, score-0.356]
88 For an explicit expansion of type (15) one can optimize in the expansion coefficients ki (x)ki (x′ ) directly, which leads to a quality functional with an ℓ2 penalty on the expansion coefficients. [sent-577, score-0.45]
89 In the following experiments, the hyperkernel matrix was approximated to δ = 10⁻⁶ using the incomplete Cholesky factorization method (Bach and Jordan, 2002). [sent-583, score-0.473]
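A generic pivoted incomplete Cholesky sketch with a trace-based stopping tolerance is given below; δ is treated here as that tolerance, which may differ in detail from the Bach and Jordan (2002) implementation used in the experiments.

```python
import numpy as np

def incomplete_cholesky(G, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky: returns L of shape (n, r) with G ~= L @ L.T,
    stopping once the trace of the residual drops below tol."""
    n = G.shape[0]
    max_rank = n if max_rank is None else max_rank
    d = np.diag(G).astype(float).copy()          # residual diagonal
    L = np.zeros((n, max_rank))
    used = np.zeros(n, dtype=bool)
    k = 0
    while k < max_rank and d.sum() > tol:
        i = int(np.argmax(d))                    # pivot on the largest residual diagonal entry
        if d[i] <= 0.0:
            break
        used[i] = True
        L[i, k] = np.sqrt(d[i])
        rest = ~used
        L[rest, k] = (G[rest, i] - L[rest, :k] @ L[i, :k]) / L[i, k]
        d[rest] -= L[rest, k] ** 2
        d[i] = 0.0
        k += 1
    return L[:, :k]
```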
90 The second, third and fourth columns show the results of the hyperkernel optimizations of C-SVM (Example 7), ν-SVM (Example 8) and Lagrangian SVM (Example 9) respectively. [sent-675, score-0.441]
91 A Gaussian RBF kernel was tuned using 10-fold cross validation on the training data, with the best value of C shown in brackets. [sent-680, score-0.249]
92 3 Effect of λQ and λh on Classification Error In order to investigate the effect of varying the hyperkernel regularization constant, λQ , and the Harmonic Hyperkernel parameter, λh , we performed experiments using the C-SVM hyperkernel optimization (Example 7). [sent-689, score-0.96]
93 From Table 4, we observe that the variation in classification accuracy over the whole range of the hyperkernel regularization constant λ_Q is less than the standard deviation of the classification accuracies of the various datasets (compare with Table 3). [sent-692, score-0.562]
94 The method shows a higher sensitivity to the harmonic hyperkernel parameter. [sent-694, score-0.474]
95 Comparing the second and fourth columns, we observe that the hyperkernel optimization problem performs better than an ε-SVR tuned using cross validation for all the datasets except the servo dataset. [sent-780, score-0.673]
96 The second column shows the results from the hyperkernel optimization of the ν-regression (Example 11). [sent-820, score-0.486]
97 The rightmost column shows an ε-SVR with a Gaussian kernel tuned using 10-fold cross validation on the training data. [sent-823, score-0.28]
98 Summary and Outlook The regularized quality functional allows the systematic solution of problems associated with the choice of a kernel. [sent-838, score-0.403]
99 We have shown that when the empirical quality functional is the regularized risk functional, the resulting optimization problem is convex, and in fact is an SDP. [sent-841, score-0.514]
100 The EM algorithm for kernel matrix completion with auxiliary data. [sent-1119, score-0.208]
wordName wordTfidf (topN-words)
[('qemp', 0.501), ('hyperkernel', 0.441), ('sdp', 0.22), ('illiamson', 0.179), ('semide', 0.175), ('yperkernels', 0.167), ('mola', 0.15), ('functional', 0.148), ('kernel', 0.147), ('xtrain', 0.143), ('quality', 0.14), ('hyperkernels', 0.131), ('regularized', 0.115), ('boffset', 0.107), ('alignment', 0.097), ('novelty', 0.097), ('lanckriet', 0.082), ('rkhs', 0.081), ('ng', 0.079), ('representer', 0.075), ('kernels', 0.073), ('meyer', 0.072), ('qreg', 0.072), ('functionals', 0.069), ('hilbert', 0.067), ('risk', 0.066), ('regrisk', 0.06), ('ytrain', 0.06), ('reproducing', 0.058), ('subject', 0.056), ('yi', 0.053), ('ki', 0.051), ('detection', 0.051), ('xq', 0.05), ('remp', 0.05), ('datasets', 0.049), ('ong', 0.048), ('duan', 0.048), ('hyper', 0.048), ('hyperprior', 0.048), ('ax', 0.047), ('rm', 0.047), ('rbf', 0.046), ('cristianini', 0.046), ('svm', 0.046), ('optimization', 0.045), ('yy', 0.044), ('tuned', 0.044), ('ci', 0.043), ('gp', 0.042), ('recipes', 0.04), ('regularizer', 0.04), ('lagrange', 0.04), ('minimize', 0.039), ('deviation', 0.039), ('cients', 0.038), ('minimizer', 0.037), ('expansion', 0.037), ('exponentiating', 0.036), ('luckiness', 0.036), ('quantiles', 0.036), ('servo', 0.036), ('syndata', 0.036), ('wishart', 0.036), ('arbitrarily', 0.035), ('xi', 0.034), ('exp', 0.033), ('coef', 0.033), ('regularization', 0.033), ('cross', 0.033), ('harmonic', 0.033), ('emp', 0.032), ('widths', 0.032), ('matrix', 0.032), ('pr', 0.032), ('lagrangian', 0.032), ('williams', 0.032), ('gaussian', 0.031), ('suitable', 0.03), ('width', 0.03), ('kyy', 0.03), ('credit', 0.03), ('optimizes', 0.03), ('freedom', 0.029), ('auxiliary', 0.029), ('optimizing', 0.028), ('tr', 0.028), ('quadratic', 0.028), ('traditional', 0.028), ('bayesian', 0.028), ('smola', 0.028), ('glass', 0.027), ('ky', 0.027), ('norm', 0.027), ('covariance', 0.026), ('validation', 0.025), ('australia', 0.025), ('arguments', 0.024), ('cv', 0.024), ('ghaoui', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
2 0.078908183 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
3 0.075471699 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
4 0.070394039 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
5 0.067853935 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
6 0.065001853 41 jmlr-2005-Kernel Methods for Measuring Independence
7 0.063800134 47 jmlr-2005-Learning from Examples as an Inverse Problem
8 0.058930628 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
9 0.058929358 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.05763571 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.051725969 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
12 0.051207706 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
13 0.051000174 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
14 0.05021745 48 jmlr-2005-Learning the Kernel Function via Regularization
15 0.04818086 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
16 0.045209214 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
17 0.044634659 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
18 0.044443782 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
19 0.042172089 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
20 0.042082001 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
topicId topicWeight
[(0, 0.248), (1, 0.122), (2, 0.142), (3, 0.157), (4, 0.123), (5, 0.009), (6, 0.022), (7, -0.073), (8, -0.097), (9, 0.106), (10, -0.014), (11, 0.082), (12, 0.092), (13, -0.02), (14, -0.149), (15, -0.093), (16, -0.103), (17, 0.224), (18, -0.11), (19, 0.017), (20, 0.095), (21, -0.06), (22, 0.054), (23, -0.134), (24, -0.02), (25, -0.032), (26, 0.021), (27, 0.084), (28, 0.152), (29, -0.091), (30, -0.126), (31, 0.003), (32, -0.004), (33, 0.01), (34, -0.1), (35, 0.084), (36, -0.016), (37, -0.027), (38, -0.106), (39, -0.178), (40, 0.176), (41, 0.02), (42, -0.153), (43, -0.223), (44, 0.06), (45, -0.244), (46, -0.382), (47, 0.002), (48, 0.144), (49, 0.243)]
simIndex simValue paperId paperTitle
same-paper 1 0.91569048 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
2 0.36587033 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis c 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf . G RETTON , H ERBRICH , S MOLA , B OUSQUET AND S CHÖLKOPF
3 0.30576596 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
4 0.30306557 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
5 0.284529 3 jmlr-2005-A Classification Framework for Anomaly Detection
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs
6 0.25352326 64 jmlr-2005-Semigroup Kernels on Measures
7 0.24890985 47 jmlr-2005-Learning from Examples as an Inverse Problem
9 0.23776326 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.23671129 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
11 0.23316792 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
12 0.23244317 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
13 0.20756893 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
14 0.20375873 48 jmlr-2005-Learning the Kernel Function via Regularization
15 0.20335527 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
16 0.19918358 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
17 0.18367845 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
18 0.18354906 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.18015023 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
20 0.17600301 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
topicId topicWeight
[(13, 0.028), (17, 0.022), (19, 0.049), (36, 0.044), (37, 0.039), (42, 0.013), (43, 0.041), (44, 0.363), (47, 0.025), (52, 0.084), (59, 0.021), (70, 0.032), (80, 0.011), (88, 0.086), (90, 0.036), (94, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.67902005 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
2 0.39485249 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.38270867 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.37688911 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
5 0.36703613 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
6 0.36575496 36 jmlr-2005-Gaussian Processes for Ordinal Regression
7 0.36438769 39 jmlr-2005-Information Bottleneck for Gaussian Variables
8 0.36153558 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
9 0.35981578 3 jmlr-2005-A Classification Framework for Anomaly Detection
10 0.35878962 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
11 0.35450804 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
12 0.35393658 44 jmlr-2005-Learning Module Networks
13 0.35384092 20 jmlr-2005-Clustering with Bregman Divergences
14 0.35365933 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
15 0.35306957 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
16 0.3524242 11 jmlr-2005-Algorithmic Stability and Meta-Learning
17 0.35213608 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.35175315 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
19 0.35158372 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
20 0.34860429 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial