jmlr jmlr2008 jmlr2008-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
Reference: text
sentIndex sentText sentNum sentScore
1 38 Tübingen, Germany Editor: Zoubin Ghahramani Abstract We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. [sent-5, score-0.275]
2 Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. [sent-7, score-0.748]
3 We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. [sent-9, score-0.175]
4 Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization 1. [sent-11, score-0.39]
5 However, if n and C denote data set size and number of classes respectively, nonparametric kernel methods like support vector machines (SVMs) or Gaussian processes (GPs) typically scale super-linearly in nC, if dependencies between the latent class functions are represented properly. [sent-14, score-0.26]
6 Furthermore, most large scale kernel methods proposed so far refrain from solving the problem of learning hyperparameters (kernel or loss function parameters), also known as “learning the kernels”. [sent-15, score-0.321]
7 However, many models for modern applications come with a large number of hyperparameters (for example to represent dependencies through “mixing” as in independent components analysis), and adjusting them through optimization must make use of gradients. [sent-17, score-0.289]
8 For example, our framework applied to hierarchical classification with hundreds of classes and thousands of data points requires a few minutes for fitting. [sent-21, score-0.175]
9 This is essentially Newton’s method, and one aspect of our work is to find approximate Newton directions very efficiently, making use of model structure and linear conjugate gradients in order to reduce the computation to standard linear algebra primitives on large contiguous chunks of memory. [sent-28, score-0.406]
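As an illustration of this reduction to matrix-vector products, here is a minimal sketch (not the paper's implementation) of computing an approximate Newton direction with linear conjugate gradients; the callable hess_vec, standing in for the model-specific Hessian-vector product built from kernel MVMs, is hypothetical.

```python
import numpy as np

def newton_direction(g, hess_vec, max_iter=50, tol=1e-4):
    """Approximately solve H d = -g by linear conjugate gradients,
    touching the Hessian H only through the product hess_vec(v) = H v."""
    d = np.zeros_like(g)
    r = -g.copy()                      # residual of H d = -g at d = 0
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rs / float(p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        if np.sqrt(rs_new) <= tol * np.sqrt(float(g @ g)):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```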
10 Interestingly, such global approaches are generally favoured in the optimization community for problems (such as kernel methods fitting) which cannot be decomposed naturally into parts. [sent-29, score-0.215]
11 For multi-way classification, our primary fitting method scales linearly in C, and depends on n mainly via a fixed number of matrix-vector multiplications (MVM) with n × n kernel matrices. [sent-31, score-0.286]
12 In many situations, these MVM primitives can be computed very efficiently, often without having to store the kernel matrices themselves. [sent-32, score-0.438]
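A sketch of how such an MVM primitive might avoid storing the n × n kernel matrix: the matrix is generated block-row by block-row and each slab is discarded after use. The helper kernel_block (returning the kernel block between two sets of points) is a placeholder, not part of the paper's code.

```python
import numpy as np

def kernel_mvm(X, v, kernel_block, block_size=1024):
    """Compute K v for K[i, j] = k(x_i, x_j) without materialising K."""
    n = X.shape[0]
    out = np.zeros_like(v, dtype=float)
    for start in range(0, n, block_size):
        rows = slice(start, min(start + block_size, n))
        K_rows = kernel_block(X[rows], X)   # (block, n) slab, freed after the product
        out[rows] = K_rows @ v
    return out
```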
13 We also show how to choose hyperparameters automatically by maximizing the cross-validation log likelihood, making use of our primary fitting technology as inner loop in order to compute the CV criterion and its gradient. [sent-33, score-0.355]
14 It is important to note that our hyperparameter learning method works by gradient-based optimization, where the dominating part of the gradient computation does not scale with the number of hyperparameters at all. [sent-34, score-0.375]
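The outer loop could look like the following sketch, where cv_neg_loglik_and_grad is a hypothetical callable that re-fits the α[J] on each CV fold (the inner loop) and returns the negative CV log likelihood together with its gradient with respect to the (log) hyperparameters h; this is an assumed interface, not the paper's code.

```python
from scipy.optimize import minimize

def learn_hyperparameters(h0, cv_neg_loglik_and_grad, max_outer=30):
    """Gradient-based maximisation of the CV log likelihood over hyperparameters."""
    res = minimize(cv_neg_loglik_and_grad, h0, jac=True,
                   method="L-BFGS-B", options={"maxiter": max_outer})
    return res.x
```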
15 The gradient computation also requires a number of MVMs with derivatives of kernel matrices, which can be reduced to kernel MVMs for many frequently used kernels (see Section 7. [sent-35, score-0.448]
16 We apply our framework to hierarchical classification with many classes. [sent-38, score-0.175]
17 An extension to hierarchical classification is provided in Section 3, and in Section 4 we give our automatic hyperparameter learning procedure. [sent-52, score-0.341]
18 Experimental results on a very large hierarchical text classification task and on several standard machine learning problems are given in Section 6. [sent-54, score-0.175]
19 It is fairly simple to include new kernels or (approximate) kernel MVM implementations. [sent-65, score-0.261]
20 Figure 1: Structure of penalized likelihood optimization: latent dependent functions u1, . . . , uC are coupled through the likelihood and penalized, the overall objective being log P(y | u) + log P(u). [sent-75, score-0.338]
21 A set of latent (unobserved) functions uc (·) is fitted to observed data by penalized likelihood maximization. [sent-77, score-0.257]
22 This primary fitting step corresponds to a convex optimization problem over finitely many variables. [sent-79, score-0.212]
23 (our main example of such mixing is hierarchical classification, developed in Section 3). [sent-90, score-0.246]
24 We employ the multiple logistic regression model, consisting of C latent class functions uc(·) feeding into the multiple logistic (or softmax) likelihood P(yic = 1 | xi, ui(·)) = exp(uc(xi)) / ∑c′ exp(uc′(xi)). [sent-103, score-0.273]
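A small numerically stable sketch of this likelihood and its negative log value (array names are illustrative, not from the paper):

```python
import numpy as np

def softmax_probs(U):
    """pi_ic = exp(u_c(x_i)) / sum_c' exp(u_c'(x_i)) for an (n, C) array U."""
    U = U - U.max(axis=1, keepdims=True)     # stabilise against overflow
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_multiple_logistic(U, Y):
    """Negative log multiple logistic likelihood; Y is an (n, C) 0/1 indicator."""
    m = U.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(U - m).sum(axis=1))   # log-sum-exp
    return float(log_norm.sum() - (Y * U).sum())
```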
25 For example, for the Gaussian kernel (7), non-smooth fc are penalized, and for the linear kernel (Appendix D. [sent-107, score-0.332]
26 Details on penalized likelihood kernel methods and RKHS penalisation can be found in Green and Silverman (1994) and Schölkopf and Smola (2002). [sent-109, score-0.343]
27 Since the likelihood depends on the fc(·) only through the values fc(xi) at the data points, every minimizer of Φ must be a kernel expansion: fc(·) = ∑i αic K(c)(·, xi). [sent-113, score-0.232]
28 The corresponding kernel expansions are ûc(·) = ∑i αic (K(c)(·, xi) + σ2). [sent-127, score-0.256]
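Evaluating these expansions at test points is a plain kernel-matrix product; the following sketch assumes a list of per-class kernel callables and is not tied to the paper's code.

```python
import numpy as np

def predict_latent(X_test, X_train, alpha, kernels, sigma2=0.0):
    """u_c(x*) = sum_i alpha_ic (K^(c)(x*, x_i) + sigma2) for all classes c.

    alpha   : (n, C) coefficients from primary fitting
    kernels : list of C callables, kernels[c](A, B) -> kernel matrix between A and B"""
    n_test, C = X_test.shape[0], alpha.shape[1]
    U = np.empty((n_test, C))
    for c in range(C):
        K_star = kernels[c](X_test, X_train) + sigma2       # (n_test, n) block
        U[:, c] = K_star @ alpha[:, c]
    return U
```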
29 The negative log multiple logistic likelihood has similar properties, but is smooth as a function of u, and the primary fitting of α does not require constrained convex optimization. [sent-132, score-0.265]
30 We think that a proper SCG implementation can be at least as efficient as NR, but needs fine-tuning to the specific problem, which in the case of hierarchical classification (discussed next) is already quite difficult. [sent-150, score-0.175]
31 Hierarchical Classification: So far we have dealt with flat classification, the classes being independent a priori, with block-diagonal kernel matrix K. [sent-155, score-0.216]
32 Here we focus on hierarchical classification, the label set {1, . [sent-157, score-0.175]
33 In flat classification, the latent class functions uc(·) are modelled as a priori independent, in that the penaliser (or the log prior in the GP view) is a sum of individual terms for each c, without coupling between the classes. [sent-163, score-0.184]
34 However, it would be wrong to claim that hierarchical classification in general comes as cheap as flat classification. [sent-192, score-0.175]
35 In the hierarchical case, this “near-decomposition” does not hold, and both LCG and NR need more iterations to attain the same accuracy, although each LCG step comes at about the same cost as in the flat case. [sent-196, score-0.175]
36 However, in all our experiments so far the fitting of the hierarchical model took less than twice the time required for the flat model on the same task. [sent-200, score-0.175]
37 Hyperparameter Learning: Our framework comes with an automatic method for setting free hyperparameters h, by gradient-based maximization of the cross-validation (CV) log likelihood. [sent-202, score-0.191]
38 The gradient computation decomposes into two parts: accumulation, and kernel derivative MVMs. [sent-237, score-0.22]
39 The accumulation part requires solving q systems of size ((q − 1)/q) nC, thus q k3 kernel MVMs with the K̃Jk if linear conjugate gradients (LCG) is used, k3 being the number of LCG steps. [sent-238, score-0.364]
40 This second part is much simpler than the accumulation one, consisting entirely of large matrix operations, which can be run very efficiently using specialized numerical linear algebra code. [sent-242, score-0.187]
41 3, the extension of hyperparameter learning to the hierarchical case of Section 3 is done by wrapping the accumulation part with Φ MVMs, the coding and additional memory effort being minimal. [sent-245, score-0.468]
42 In this context, it is important to regard the {α [Jk ] } as an inner state alongside the hyperparameter vector h. [sent-257, score-0.21]
43 This can be seen by noting that the parameterization of our likelihood in terms of ui is overcomplete, in that ui + κ1 gives the same likelihood values for all κ. [sent-323, score-0.25]
44 We could fix one of the ui components, which would however lead to subtle dependencies between the remaining C − 1 functions uc (·). [sent-324, score-0.185]
45 The gradient computation then requires computing (6) for each component, using kernel derivative MVMs. [sent-438, score-0.22]
46 The dominating part of the accumulation is the re-optimisation of the α[J], which is done by calling the optimized code for primary fitting (Section 5. [sent-445, score-0.255]
47 This means that linear algebra primitives with K Jk can be run without mapping matrix coordinates through an index, which would be many times slower. [sent-453, score-0.313]
48 3 Details for Hierarchical Classification: In this section, we provide details for the hierarchical classification method introduced in Section 3. [sent-463, score-0.175]
49 The induced covariance matrix K over uL is given by (2), and hierarchical classification differs from the flat variant only in that this non-block-diagonal matrix is used. [sent-466, score-0.491]
50 The diagonal preconditioning of LCG (see Appendix B) requires the computation of diag K ∈ R^nC. [sent-470, score-0.178]
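In code, diagonal preconditioning amounts to no more than an elementwise division by the system-matrix diagonal; a minimal sketch, assuming that diagonal has already been accumulated into system_diag:

```python
import numpy as np

def jacobi_preconditioner(system_diag):
    """Diagonal (Jacobi) preconditioner: approximate the system matrix by its
    diagonal, so applying M^{-1} to a residual is an elementwise division."""
    inv_diag = 1.0 / np.asarray(system_diag, dtype=float)
    return lambda r: inv_diag * r      # pass this to the LCG solver
```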
51 2) is easily extended to the hierarchical case, recalling (2) and the fact that Φ does not depend on hyperparameters. [sent-480, score-0.175]
52 The step from flat to hierarchical classification requires only minor modifications of existing code. [sent-486, score-0.175]
53 Wrappers for MVM and the other primitives essentially pre- and post-multiply their input with Φ and Φ^T respectively, calling the existing "flat" primitives for K in between (block-diagonal with P rather than C blocks). [sent-487, score-0.426]
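A minimal sketch of this wrapper pattern, assuming a (C, P) coupling matrix B that stands in for the paper's Φ (up to layout conventions) and an existing flat_mvm applying the block-diagonal kernel matrix:

```python
import numpy as np

def hierarchical_mvm(V, B, flat_mvm):
    """MVM with the hierarchical kernel matrix, K = (coupling) K_tilde (coupling)^T.

    V        : (n, C) input in per-class column layout
    B        : (C, P) class-to-node coupling (class c selects its ancestor nodes)
    flat_mvm : callable applying the block-diagonal K_tilde column-wise to an (n, P) array"""
    W = V @ B            # pre-multiply: class space -> node space, per data point
    W = flat_mvm(W)      # call the existing "flat" block-diagonal MVM (P blocks)
    return W @ B.T       # post-multiply back to class space
```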
54 However, this problem is very ill-conditioned, the Hessian being K̃ W K̃ + K̃ (large kernel matrices are typically very ill-conditioned, and here we deal with K̃^2), and SCG runs exceedingly slowly to the point of being essentially useless (as we determined in experiments). [sent-499, score-0.225]
55 In preliminary direct comparisons, the NR method still works more efficiently, meaning that SCG would require additional preconditioning specific to the problem at hand, which would likely be different for flat and hierarchical classification. [sent-506, score-0.262]
56 H1 and H2 do not perform differently: choosing many different vc in the linear kernel seems to give no advantage here (but see Section 6. [sent-564, score-0.239]
57 In Table 2, we present running times for the final fitting and for a single fold during hyperparameter optimization (5 of these are required for Ψ, ∇h Ψ). [sent-570, score-0.27]
58 It is precisely this high efficiency of primary fitting that allows us to use it as an inner loop for automatic hyperparameter learning (Cai and Hofmann, 2004, do not adjust hyperparameters to the data). [sent-572, score-0.485]
59 CV Fold: Re-optimization of α[J] and gradient accumulation for single fold J. [sent-792, score-0.196]
60 We use the isotropic Gaussian (RBF) covariance function K(c)(x, x′) = vc exp(−(wc/2) ‖x − x′‖^2), vc, wc > 0. (7) [sent-802, score-0.238]
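A sketch of a block evaluation of this covariance function, which also indicates why derivative MVMs reduce to the same machinery (elementwise operations on the block); variable names are illustrative.

```python
import numpy as np

def rbf_block(X1, X2, v, w):
    """Block of K(x, x') = v * exp(-(w/2) * ||x - x'||^2), v, w > 0.

    Note: dK/d(log v) = K and dK/d(log w) = -(w/2) * ||x - x'||^2 * K, so kernel
    derivative blocks are elementwise products with quantities computed here."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    np.maximum(sq, 0.0, out=sq)        # clip tiny negatives from round-off
    return v * np.exp(-0.5 * w * sq)
```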
61 We compare the methods mc-sep (ours with separate kernels for each class; 12 hyperparameters), mc-tied (ours with a single shared kernel; 2 hyperparameters), mc-semi (ours with single kernel M(1), but different vc; 7 hyperparameters), 1rest (one-against-rest; 12 hyperparameters). [sent-803, score-0.301]
62 Even if run sequentially, 1rest typically requires less memory by a factor of C than a joint multi-class method, although this is not true if the kernel matrices are dominating the memory requirements and they are shared between classes in a multi-class method (as in mc-tied and mc-semi here). [sent-807, score-0.305]
63 These steps are reduced to matrix-vector multiplication (MVM) primitives with kernel matrices. [sent-940, score-0.379]
64 For general kernels, these MVM primitives can be reduced to large numerical linear algebra primitives, which can be computed very efficiently on modern computer architectures. [sent-941, score-0.312]
65 This means that within our framework, all code optimization efforts can be concentrated on these essential primitives (see also Section 7. [sent-947, score-0.31]
66 1 Related Work Our primary fitting optimization for flat multi-way classification appeared in Williams and Barber (1998), although some fairly essential features are novel here. [sent-950, score-0.202]
67 Empirical Bayesian criteria such as the marginal likelihood are routinely used for hyperparameter learning in Gaussian process models (Williams and Barber, 1998; Williams and Rasmussen, 1996). [sent-952, score-0.232]
68 Finally, none of these papers propose (or achieve) a complete reduction to kernel MVM primitives only, nor do they deal with representing class structures or work on problems of the scale considered here. [sent-970, score-0.379]
69 SVM methods typically do not come with efficient automatic kernel parameter learning schemes, and they do not provide estimates of predictive probabilities which are asymptotically correct. [sent-984, score-0.197]
70 2 Global versus Decomposition Methods In most kernel methods proposed so far in machine learning, the primary fitting to data (for fixed hyperparameters) translates to a convex minimization problem, where the penalisation terms correspond to quadratic expressions with kernel matrices. [sent-1003, score-0.563]
71 While kernel matrices may show a rapidly decaying eigenvalue spectrum, they certainly do couple the optimization variables strongly. [sent-1004, score-0.274]
72 In machine learning, the most dominant technique for large scale (structured label) kernel classification is what optimization researchers call block coordinate descent methods (BCD), see Bertsekas (1999, Sect. [sent-1007, score-0.252]
73 An almost low-rank kernel matrix translates into a coupling of a simple structure, but the dominant couplings are typically strong and not sparse. [sent-1028, score-0.306]
74 In kernel methods, such a decomposable structure is not present, because the penalisation terms couple all variables strongly via the kernel matrices. [sent-1034, score-0.4]
75 It occurs whenever variables not in the same block are significantly coupled, a situation which is to be expected for kernel machines in general. [sent-1040, score-0.203]
76 In our case here, the bottleneck operations are numerical linear algebra primitives from the basic linear algebra subroutines (BLAS), a standardized interface for high-performance dense linear algebra code. [sent-1056, score-0.363]
77 In this paper, we advocate taking a step back and using global direction methods as an approximation to Newton's method for kernel machine fitting. [sent-1058, score-0.205]
78 3 Matrix-Vector Multiplication The computational load in our framework is determined by applications of the MVM primitives v → K (c) v and v → (∂K (c) /∂h p )v. [sent-1068, score-0.213]
79 If the cost for a direct evaluation of these primitives is prohibitive, several well-known approximations may be applied. [sent-1073, score-0.213]
80 Our public code could fairly easily be extended, given that specialized approximate kernel MVM code is available. [sent-1080, score-0.295]
81 We also plan to address hierarchical multi-label problems, which differ from hierarchical multi-class in that each instance can have multiple associated labels. [sent-1086, score-0.35]
82 This is the same modification which led to the hierarchical extension, only that the fixed coupling matrix Φ is replaced by the variable B. [sent-1095, score-0.258]
83 Note that the log likelihood is no longer a concave function whenever |L̃yi| > 1, and in the presence of such factors, primary fitting is not a convex problem. [sent-1120, score-0.301]
84 3 Low Rank Approximations: Our generic kernel matrix representation is described in Appendix D. [sent-1129, score-0.216]
85 We also use this for matrix-valued vectors, an example is the diagonal kernel matrix K = diag(K (c) )c in flat classification. [sent-1178, score-0.249]
86 For kernel matrices (for example, K (c) ), we do not list the kernel functions (here: K (c) ), and for evaluation vectors (for example, u), we do not list the underlying functions (here: u (c) (·)). [sent-1199, score-0.391]
87 The criterion we minimize is strictly convex, even if the kernel matrix K is singular (or nearly so). [sent-1210, score-0.216]
88 We make use of a threshold κ < 0 and define I = {(i, c) | log πic < κ, yic > 0} , I0 = {(i, c) | log πic < κ, yic = 0} . [sent-1213, score-0.246]
89 The indices in I can be problematic due to the corresponding component g̃ic ≈ yic/πic^(1/2) becoming large. [sent-1214, score-0.193]
90 If the joint kernel matrix K is not block-diagonal (as in hierarchical classification, see Section 3), diag K is not sufficient for computing the system matrix diagonal. [sent-1232, score-0.499]
91 Then, the system matrix diagonal has elements 1 + πic (Kic + σ2) − 2wic + πi^T wi, where w = v + σ2 π. [sent-1234, score-0.189]
92 A Generic Kernel Matrix Representation: A kernel matrix representation is some data structure that allows us to compute kernel matrix MVMs v → K(c) v efficiently, these being the principal primitives of our primary fitting method. [sent-1257, score-0.765]
93 An efficient representation depends strongly on the covariance function used, and also on whether kernel matrix MVMs are approximated rather than computed exactly. [sent-1261, score-0.216]
94 It requires kernel matrices to be stored explicitly, which may not be possible for very large n. [sent-1266, score-0.262]
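One possible shape for such a representation, sketched as a small interface (explicit storage, so only feasible for moderate n; the class and method names are illustrative, not the paper's API):

```python
import numpy as np

class ExplicitKernelRep:
    """Minimal kernel matrix representation: stores each K^(c) explicitly and
    exposes the MVM primitive v -> K^(c) v plus the diagonal needed for LCG
    preconditioning."""

    def __init__(self, K_blocks):
        self.K_blocks = K_blocks              # list of C dense (n, n) arrays

    def mvm(self, c, v):
        return self.K_blocks[c] @ v

    def diag(self, c):
        return np.diag(self.K_blocks[c])
```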
95 In general, we allow for different covariance functions K (c) for each class c, although sharing of kernels is supported, in that K (c) (·, ·) = vc M (lc ) (·, ·) and vc > 0. [sent-1267, score-0.208]
96 In practice, this turns out to be significantly slower (by a factor), the reason being that the optimized BLAS primitives are many times more efficient than applying a non-linear function f(l) point-wise to a matrix, even if the matrix is stored contiguously. [sent-1277, score-0.3]
97 It is required during hyperparameter optimization, because the MVM primitives for sub-matrices K Jk have to be driven by a single representation of the complete K (note that each K Jk is of size n(q − 1)/q, thus almost as large as K ). [sent-1282, score-0.379]
98 1, we simply permute the kernel matrices K (c) using the index (Jk , Ik ). [sent-1288, score-0.225]
99 1 uses the linear kernel K(c)(x, x′) = vc x^T x′, where x is very high-dimensional (word counts over a dictionary), but also extremely sparse (by far most entries are zero). [sent-1292, score-0.239]
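For this case the MVM never needs the n × n matrix at all; a sketch with scipy.sparse, assuming X holds the sparse word-count rows:

```python
import numpy as np
import scipy.sparse as sp

def linear_kernel_mvm(X, v_c, vec):
    """MVM with K = v_c * X X^T for a sparse data matrix X, computed as two
    sparse matrix-vector products instead of forming the dense kernel matrix."""
    X = sp.csr_matrix(X)
    return v_c * (X @ (X.T @ vec))
```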
100 Cross-validation optimization for large scale hierarchical classification kernel methods. [sent-1482, score-0.39]
wordName wordTfidf (topN-words)
[('mvm', 0.31), ('primitives', 0.213), ('newton', 0.195), ('lcg', 0.193), ('nr', 0.187), ('jk', 0.176), ('hierarchical', 0.175), ('kernel', 0.166), ('hyperparameter', 0.166), ('ranw', 0.16), ('hyperparameters', 0.155), ('eeger', 0.155), ('cale', 0.135), ('cai', 0.127), ('arge', 0.121), ('tructured', 0.121), ('primary', 0.12), ('cv', 0.119), ('tting', 0.115), ('ic', 0.106), ('mvms', 0.106), ('lassification', 0.104), ('ethods', 0.103), ('ernel', 0.103), ('blas', 0.103), ('hofmann', 0.101), ('bcd', 0.091), ('uc', 0.09), ('accumulation', 0.087), ('preconditioning', 0.087), ('yic', 0.087), ('kv', 0.08), ('barber', 0.078), ('scg', 0.077), ('vc', 0.073), ('williams', 0.073), ('mixing', 0.071), ('pcls', 0.068), ('pdata', 0.068), ('penalisation', 0.068), ('shuf', 0.068), ('likelihood', 0.066), ('gradients', 0.065), ('kernels', 0.062), ('ik', 0.062), ('ui', 0.059), ('matrices', 0.059), ('diag', 0.058), ('latent', 0.058), ('couplings', 0.057), ('dk', 0.056), ('fold', 0.055), ('gradient', 0.054), ('lh', 0.052), ('optimizations', 0.052), ('algebra', 0.05), ('matrix', 0.05), ('modern', 0.049), ('optimization', 0.049), ('code', 0.048), ('ul', 0.048), ('hessian', 0.046), ('conjugate', 0.046), ('seeger', 0.046), ('pacc', 0.046), ('rnc', 0.046), ('taxo', 0.046), ('uic', 0.046), ('wc', 0.046), ('svm', 0.044), ('inner', 0.044), ('penalized', 0.043), ('convex', 0.043), ('hierarchy', 0.042), ('classi', 0.041), ('memory', 0.04), ('nc', 0.039), ('direction', 0.039), ('leaf', 0.038), ('stored', 0.037), ('dl', 0.037), ('folds', 0.037), ('block', 0.037), ('log', 0.036), ('yi', 0.036), ('dependencies', 0.036), ('gaussian', 0.035), ('hsu', 0.035), ('anova', 0.034), ('argmaxc', 0.034), ('dpcls', 0.034), ('leafs', 0.034), ('qic', 0.034), ('silverman', 0.034), ('fairly', 0.033), ('diagonal', 0.033), ('coupling', 0.033), ('cache', 0.032), ('contiguous', 0.032), ('predictive', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.15626225 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
3 0.13720095 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.13387015 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
5 0.085335955 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.082137108 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
7 0.081664413 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
8 0.08114399 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
9 0.080545776 86 jmlr-2008-SimpleMKL
10 0.080512822 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.078332029 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
12 0.077178776 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.073824897 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
14 0.073614746 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.06798742 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
16 0.067713514 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.065049231 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
18 0.062412459 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.060681768 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
20 0.058722902 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
topicId topicWeight
[(0, 0.321), (1, -0.129), (2, 0.036), (3, 0.237), (4, 0.081), (5, 0.035), (6, -0.028), (7, 0.047), (8, 0.015), (9, 0.121), (10, -0.099), (11, 0.073), (12, 0.032), (13, -0.02), (14, -0.021), (15, -0.169), (16, 0.071), (17, 0.013), (18, 0.027), (19, -0.058), (20, -0.088), (21, -0.059), (22, -0.038), (23, 0.074), (24, 0.03), (25, 0.031), (26, -0.015), (27, -0.206), (28, -0.158), (29, -0.059), (30, -0.096), (31, -0.043), (32, -0.084), (33, -0.088), (34, 0.147), (35, -0.114), (36, 0.045), (37, 0.058), (38, -0.06), (39, 0.07), (40, 0.018), (41, -0.032), (42, 0.086), (43, 0.05), (44, 0.158), (45, -0.054), (46, 0.041), (47, 0.054), (48, 0.017), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.92675072 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.49745631 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
3 0.49636617 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
4 0.48290536 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.4237799 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.41981801 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
7 0.40571204 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
8 0.40322322 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
9 0.38696146 86 jmlr-2008-SimpleMKL
10 0.38262129 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
11 0.3787995 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.36999077 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
13 0.36986855 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
14 0.34456673 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
15 0.33684713 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
16 0.33325818 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
17 0.32076117 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
18 0.31984952 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
19 0.29369113 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
topicId topicWeight
[(0, 0.019), (5, 0.023), (31, 0.015), (40, 0.047), (54, 0.034), (58, 0.036), (66, 0.058), (76, 0.023), (78, 0.012), (88, 0.065), (92, 0.033), (94, 0.089), (99, 0.461)]
simIndex simValue paperId paperTitle
same-paper 1 0.91484058 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.90562129 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
3 0.56038344 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.45300344 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
5 0.41693974 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent
7 0.41359591 83 jmlr-2008-Robust Submodular Observation Selection
8 0.41171736 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
9 0.41059306 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
10 0.40356338 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.4017548 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.39672807 56 jmlr-2008-Magic Moments for Structured Output Prediction
13 0.39655024 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
14 0.39403272 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.39338607 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
16 0.39273891 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.39253163 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
18 0.39041618 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
19 0.3851684 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
20 0.3842476 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration