jmlr jmlr2008 jmlr2008-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
Reference: text
sentIndex sentText sentNum sentScore
1 38 Tübingen, Germany Editor: Zoubin Ghahramani Abstract We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. [sent-5, score-0.275]
2 Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. [sent-7, score-0.748]
3 We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. [sent-9, score-0.175]
4 Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization 1. [sent-11, score-0.39]
5 However, if n and C denote data set size and number of classes respectively, nonparametric kernel methods like support vector machines (SVMs) or Gaussian processes (GPs) typically scale super-linearly in nC, if dependencies between the latent class functions are represented properly. [sent-14, score-0.26]
6 Furthermore, most large scale kernel methods proposed so far refrain from solving the problem of learning hyperparameters (kernel or loss function parameters), also known as “learning the kernels”. [sent-15, score-0.321]
7 However, many models for modern applications come with a large number of hyperparameters (for example to represent dependencies through “mixing” as in independent components analysis), and adjusting them through optimization must make use of gradients. [sent-17, score-0.289]
8 For example, our framework applied to hierarchical classification with hundreds of classes and thousands of data points requires a few minutes for fitting. [sent-21, score-0.175]
9 This is essentially Newton’s method, and one aspect of our work is to find approximate Newton directions very efficiently, making use of model structure and linear conjugate gradients in order to reduce the computation to standard linear algebra primitives on large contiguous chunks of memory. [sent-28, score-0.406]
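As an illustration of this reduction to matrix-vector products, here is a minimal sketch (not the paper's implementation) of computing an approximate Newton direction with linear conjugate gradients; the callable hess_vec, standing in for the model-specific Hessian-vector product built from kernel MVMs, is hypothetical.

```python
import numpy as np

def newton_direction(g, hess_vec, max_iter=50, tol=1e-4):
    """Approximately solve H d = -g by linear conjugate gradients,
    touching the Hessian H only through the product hess_vec(v) = H v."""
    d = np.zeros_like(g)
    r = -g.copy()                      # residual of H d = -g at d = 0
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rs / float(p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        if np.sqrt(rs_new) <= tol * np.sqrt(float(g @ g)):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```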
10 Interestingly, such global approaches are generally favoured in the optimization community for problems (such as kernel methods fitting) which cannot be decomposed naturally into parts. [sent-29, score-0.215]
11 For multi-way classification, our primary fitting method scales linearly in C, and depends on n mainly via a fixed number of matrix-vector multiplications (MVM) with n × n kernel matrices. [sent-31, score-0.286]
12 In many situations, these MVM primitives can be computed very efficiently, often without having to store the kernel matrices themselves. [sent-32, score-0.438]
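A sketch of how such an MVM primitive might avoid storing the n × n kernel matrix: the matrix is generated block-row by block-row and each slab is discarded after use. The helper kernel_block (returning the kernel block between two sets of points) is a placeholder, not part of the paper's code.

```python
import numpy as np

def kernel_mvm(X, v, kernel_block, block_size=1024):
    """Compute K v for K[i, j] = k(x_i, x_j) without materialising K."""
    n = X.shape[0]
    out = np.zeros_like(v, dtype=float)
    for start in range(0, n, block_size):
        rows = slice(start, min(start + block_size, n))
        K_rows = kernel_block(X[rows], X)   # (block, n) slab, freed after the product
        out[rows] = K_rows @ v
    return out
```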
13 We also show how to choose hyperparameters automatically by maximizing the cross-validation log likelihood, making use of our primary fitting technology as inner loop in order to compute the CV criterion and its gradient. [sent-33, score-0.355]
14 It is important to note that our hyperparameter learning method works by gradient-based optimization, where the dominating part of the gradient computation does not scale with the number of hyperparameters at all. [sent-34, score-0.375]
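The outer loop could look like the following sketch, where cv_neg_loglik_and_grad is a hypothetical callable that re-fits the α[J] on each CV fold (the inner loop) and returns the negative CV log likelihood together with its gradient with respect to the (log) hyperparameters h; this is an assumed interface, not the paper's code.

```python
from scipy.optimize import minimize

def learn_hyperparameters(h0, cv_neg_loglik_and_grad, max_outer=30):
    """Gradient-based maximisation of the CV log likelihood over hyperparameters."""
    res = minimize(cv_neg_loglik_and_grad, h0, jac=True,
                   method="L-BFGS-B", options={"maxiter": max_outer})
    return res.x
```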
15 The gradient computation also requires a number of MVMs with derivatives of kernel matrices, which can be reduced to kernel MVMs for many frequently used kernels (see Section 7. [sent-35, score-0.448]
16 We apply our framework to hierarchical classification with many classes. [sent-38, score-0.175]
17 An extension to hierarchical classification is provided in Section 3, and in Section 4 we give our automatic hyperparameter learning procedure. [sent-52, score-0.341]
18 Experimental results on a very large hierarchical text classification task and on several standard machine learning problems are given in Section 6. [sent-54, score-0.175]
19 It is fairly simple to include new kernels or (approximate) kernel MVM implementations. [sent-65, score-0.261]
20 Figure 1: Structure of penalized likelihood optimization: latent dependent functions u1, . . . , uC are coupled through the likelihood and penalized, the overall objective being log P(y | u) + log P(u). [sent-75, score-0.338]
21 A set of latent (unobserved) functions uc (·) is fitted to observed data by penalized likelihood maximization. [sent-77, score-0.257]
22 This primary fitting step corresponds to a convex optimization problem over finitely many variables. [sent-79, score-0.212]
23 (our main example of such mixing is hierarchical classification, developed in Section 3). [sent-90, score-0.246]
24 We employ the multiple logistic regression model, consisting of C latent class functions uc(·) feeding into the multiple logistic (or softmax) likelihood P(yic = 1 | xi, ui(·)) = exp(uc(xi)) / ∑c′ exp(uc′(xi)). [sent-103, score-0.273]
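A small numerically stable sketch of this likelihood and its negative log value (array names are illustrative, not from the paper):

```python
import numpy as np

def softmax_probs(U):
    """pi_ic = exp(u_c(x_i)) / sum_c' exp(u_c'(x_i)) for an (n, C) array U."""
    U = U - U.max(axis=1, keepdims=True)     # stabilise against overflow
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_multiple_logistic(U, Y):
    """Negative log multiple logistic likelihood; Y is an (n, C) 0/1 indicator."""
    m = U.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(U - m).sum(axis=1))   # log-sum-exp
    return float(log_norm.sum() - (Y * U).sum())
```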
25 For example, for the Gaussian kernel (7), non-smooth fc are penalized, and for the linear kernel (Appendix D. [sent-107, score-0.332]
26 Details on penalized likelihood kernel methods and RKHS penalisation can be found in Green and Silverman (1994) and Schölkopf and Smola (2002). [sent-109, score-0.343]
27 Since the likelihood depends on the fc(·) only through the values fc(xi) at the data points, every minimizer of Φ must be a kernel expansion: fc(·) = ∑i αic K(c)(·, xi). [sent-113, score-0.232]
28 The corresponding kernel expansions are ûc(·) = ∑i αic (K(c)(·, xi) + σ2). [sent-127, score-0.256]
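Evaluating these expansions at test points is a plain kernel-matrix product; the following sketch assumes a list of per-class kernel callables and is not tied to the paper's code.

```python
import numpy as np

def predict_latent(X_test, X_train, alpha, kernels, sigma2=0.0):
    """u_c(x*) = sum_i alpha_ic (K^(c)(x*, x_i) + sigma2) for all classes c.

    alpha   : (n, C) coefficients from primary fitting
    kernels : list of C callables, kernels[c](A, B) -> kernel matrix between A and B"""
    n_test, C = X_test.shape[0], alpha.shape[1]
    U = np.empty((n_test, C))
    for c in range(C):
        K_star = kernels[c](X_test, X_train) + sigma2       # (n_test, n) block
        U[:, c] = K_star @ alpha[:, c]
    return U
```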
29 The negative log multiple logistic likelihood has similar properties, but is smooth as a function of u, and the primary fitting of α does not require constrained convex optimization. [sent-132, score-0.265]
30 We think that a proper SCG implementation can be at least as efficient as NR, but needs fine-tuning to the specific problem, which in the case of hierarchical classification (discussed next) is already quite difficult. [sent-150, score-0.175]
31 Hierarchical Classification: So far we have dealt with flat classification, the classes being independent a priori, with block-diagonal kernel matrix K. [sent-155, score-0.216]
32 Here we focus on hierarchical classification, the label set {1, . [sent-157, score-0.175]
33 In flat classification, the latent class functions uc(·) are modelled as a priori independent, in that the penaliser (or the log prior in the GP view) is a sum of individual terms for each c, without coupling between the classes. [sent-163, score-0.184]
34 However, it would be wrong to claim that hierarchical classification in general comes as cheap as flat classification. [sent-192, score-0.175]
35 In the hierarchical case, this “near-decomposition” does not hold, and both LCG and NR need more iterations to attain the same accuracy, although each LCG step comes at about the same cost as in the flat case. [sent-196, score-0.175]
36 However, in all our experiments so far the fitting of the hierarchical model took less than twice the time required for the flat model on the same task. [sent-200, score-0.175]
37 Hyperparameter Learning: Our framework comes with an automatic method for setting free hyperparameters h, by gradient-based maximization of the cross-validation (CV) log likelihood. [sent-202, score-0.191]
38 The gradient computation decomposes into two parts: accumulation, and kernel derivative MVMs. [sent-237, score-0.22]
39 The accumulation part requires solving q systems of size ((q − 1)/q) nC, thus q k3 kernel MVMs with the K̃Jk if linear conjugate gradients (LCG) is used, k3 being the number of LCG steps. [sent-238, score-0.364]
40 This second part is much simpler than the accumulation one, consisting entirely of large matrix operations, which can be run very efficiently using specialized numerical linear algebra code. [sent-242, score-0.187]
41 3, the extension of hyperparameter learning to the hierarchical case of Section 3 is done by wrapping the accumulation part with Φ MVMs, the coding and additional memory effort being minimal. [sent-245, score-0.468]
42 In this context, it is important to regard the {α [Jk ] } as an inner state alongside the hyperparameter vector h. [sent-257, score-0.21]
43 This can be seen by noting that the parameterization of our likelihood in terms of ui is overcomplete, in that ui + κ1 gives the same likelihood values for all κ. [sent-323, score-0.25]
44 We could fix one of the ui components, which would however lead to subtle dependencies between the remaining C − 1 functions uc (·). [sent-324, score-0.185]
45 The gradient computation then requires computing (6) for each component, using kernel derivative MVMs. [sent-438, score-0.22]
46 The dominating part of the accumulation is the re-optimisation of the α[J], which is done by calling the optimized code for primary fitting (Section 5. [sent-445, score-0.255]
47 This means that linear algebra primitives with K Jk can be run without mapping matrix coordinates through an index, which would be many times slower. [sent-453, score-0.313]
48 3 Details for Hierarchical Classification: In this section, we provide details for the hierarchical classification method introduced in Section 3. [sent-463, score-0.175]
49 The induced covariance matrix K over uL is given by (2), and hierarchical classification differs from the flat variant only in that this non-block-diagonal matrix is used. [sent-466, score-0.491]
50 The diagonal preconditioning of LCG (see Appendix B) requires the computation of diag K ∈ R^nC. [sent-470, score-0.178]
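In code, diagonal preconditioning amounts to no more than an elementwise division by the system-matrix diagonal; a minimal sketch, assuming that diagonal has already been accumulated into system_diag:

```python
import numpy as np

def jacobi_preconditioner(system_diag):
    """Diagonal (Jacobi) preconditioner: approximate the system matrix by its
    diagonal, so applying M^{-1} to a residual is an elementwise division."""
    inv_diag = 1.0 / np.asarray(system_diag, dtype=float)
    return lambda r: inv_diag * r      # pass this to the LCG solver
```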
51 2) is easily extended to the hierarchical case, recalling (2) and the fact that Φ does not depend on hyperparameters. [sent-480, score-0.175]
52 The step from flat to hierarchical classification requires only minor modifications of existing code. [sent-486, score-0.175]
53 Wrappers for MVM and the other primitives essentially pre- and post-multiply their input with Φ and Φ^T respectively, calling the existing "flat" primitives for K in between (block-diagonal with P rather than C blocks). [sent-487, score-0.426]
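A minimal sketch of this wrapper pattern, assuming a (C, P) coupling matrix B that stands in for the paper's Φ (up to layout conventions) and an existing flat_mvm applying the block-diagonal kernel matrix:

```python
import numpy as np

def hierarchical_mvm(V, B, flat_mvm):
    """MVM with the hierarchical kernel matrix, K = (coupling) K_tilde (coupling)^T.

    V        : (n, C) input in per-class column layout
    B        : (C, P) class-to-node coupling (class c selects its ancestor nodes)
    flat_mvm : callable applying the block-diagonal K_tilde column-wise to an (n, P) array"""
    W = V @ B            # pre-multiply: class space -> node space, per data point
    W = flat_mvm(W)      # call the existing "flat" block-diagonal MVM (P blocks)
    return W @ B.T       # post-multiply back to class space
```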
54 However, this problem is very ill-conditioned, the Hessian being K̃ W K̃ + K̃ (large kernel matrices are typically very ill-conditioned, and here we deal with K̃^2), and SCG runs exceedingly slowly to the point of being essentially useless (as we determined in experiments). [sent-499, score-0.225]
55 In preliminary direct comparisons, the NR method still works more efficiently, meaning that SCG would require additional preconditioning specific to the problem at hand, which would likely be different for flat and hierarchical classification. [sent-506, score-0.262]
56 H1 and H2 do not perform differently: choosing many different vc in the linear kernel seems to give no advantage here (but see Section 6. [sent-564, score-0.239]
57 In Table 2, we present running times for the final fitting and for a single fold during hyperparameter optimization (5 of these are required for Ψ, ∇h Ψ). [sent-570, score-0.27]
58 It is precisely this high efficiency of primary fitting that allows us to use it as an inner loop for automatic hyperparameter learning (Cai and Hofmann, 2004, do not adjust hyperparameters to the data). [sent-572, score-0.485]
59 CV Fold: Re-optimization of α[J] and gradient accumulation for single fold J. [sent-792, score-0.196]
60 We use the isotropic Gaussian (RBF) covariance function K(c)(x, x′) = vc exp(−(wc/2) ‖x − x′‖^2), vc, wc > 0. (7) [sent-802, score-0.238]
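A sketch of a block evaluation of this covariance function, which also indicates why derivative MVMs reduce to the same machinery (elementwise operations on the block); variable names are illustrative.

```python
import numpy as np

def rbf_block(X1, X2, v, w):
    """Block of K(x, x') = v * exp(-(w/2) * ||x - x'||^2), v, w > 0.

    Note: dK/d(log v) = K and dK/d(log w) = -(w/2) * ||x - x'||^2 * K, so kernel
    derivative blocks are elementwise products with quantities computed here."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    np.maximum(sq, 0.0, out=sq)        # clip tiny negatives from round-off
    return v * np.exp(-0.5 * w * sq)
```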
61 We compare the methods mc-sep (ours with separate kernels for each class; 12 hyperparameters), mc-tied (ours with a single shared kernel; 2 hyperparameters), mc-semi (ours with single kernel M(1), but different vc; 7 hyperparameters), 1rest (one-against-rest; 12 hyperparameters). [sent-803, score-0.301]
62 Even if run sequentially, 1rest typically requires less memory by a factor of C than a joint multi-class method, although this is not true if the kernel matrices are dominating the memory requirements and they are shared between classes in a multi-class method (as in mc-tied and mc-semi here). [sent-807, score-0.305]
63 These steps are reduced to matrix-vector multiplication (MVM) primitives with kernel matrices. [sent-940, score-0.379]
64 For general kernels, these MVM primitives can be reduced to large numerical linear algebra primitives, which can be computed very efficiently on modern computer architectures. [sent-941, score-0.312]
65 This means that within our framework, all code optimization efforts can be concentrated on these essential primitives (see also Section 7. [sent-947, score-0.31]
66 1 Related Work Our primary fitting optimization for flat multi-way classification appeared in Williams and Barber (1998), although some fairly essential features are novel here. [sent-950, score-0.202]
67 Empirical Bayesian criteria such as the marginal likelihood are routinely used for hyperparameter learning in Gaussian process models (Williams and Barber, 1998; Williams and Rasmussen, 1996). [sent-952, score-0.232]
68 Finally, none of these papers propose (or achieve) a complete reduction to kernel MVM primitives only, nor do they deal with representing class structures or work on problems of the scale considered here. [sent-970, score-0.379]
69 SVM methods typically do not come with efficient automatic kernel parameter learning schemes, and they do not provide estimates of predictive probabilities which are asymptotically correct. [sent-984, score-0.197]
70 2 Global versus Decomposition Methods In most kernel methods proposed so far in machine learning, the primary fitting to data (for fixed hyperparameters) translates to a convex minimization problem, where the penalisation terms correspond to quadratic expressions with kernel matrices. [sent-1003, score-0.563]
71 While kernel matrices may show a rapidly decaying eigenvalue spectrum, they certainly do couple the optimization variables strongly. [sent-1004, score-0.274]
72 In machine learning, the most dominant technique for large scale (structured label) kernel classification is what optimization researchers call block coordinate descent methods (BCD), see Bertsekas (1999, Sect. [sent-1007, score-0.252]
73 An almost low-rank kernel matrix translates into a coupling of a simple structure, but the dominant couplings are typically strong and not sparse. [sent-1028, score-0.306]
74 In kernel methods, such a decomposable structure is not present, because the penalisation terms couple all variables strongly via the kernel matrices. [sent-1034, score-0.4]
75 It occurs whenever variables not in the same block are significantly coupled, a situation which is to be expected for kernel machines in general. [sent-1040, score-0.203]
76 In our case here, the bottleneck operations are numerical linear algebra primitives from the basic linear algebra subroutines (BLAS), a standardized interface for high-performance dense linear algebra code. [sent-1056, score-0.363]
77 In this paper, we advocate taking a step back and using global direction methods as an approximation to Newton's method for kernel machine fitting. [sent-1058, score-0.205]
78 3 Matrix-Vector Multiplication The computational load in our framework is determined by applications of the MVM primitives v → K (c) v and v → (∂K (c) /∂h p )v. [sent-1068, score-0.213]
79 If the cost for a direct evaluation of these primitives is prohibitive, several well-known approximations may be applied. [sent-1073, score-0.213]
80 Our public code could fairly easily be extended, given that specialized approximate kernel MVM code is available. [sent-1080, score-0.295]
81 We also plan to address hierarchical multi-label problems, which differ from hierarchical multi-class in that each instance can have multiple associated labels. [sent-1086, score-0.35]
82 This is the same modification which led to the hierarchical extension, only that the fixed coupling matrix Φ is replaced by the variable B. [sent-1095, score-0.258]
83 Note that the log likelihood is no longer a concave function whenever |L̃yi| > 1, and in the presence of such factors, primary fitting is not a convex problem. [sent-1120, score-0.301]
84 3 Low Rank Approximations: Our generic kernel matrix representation is described in Appendix D. [sent-1129, score-0.216]
85 We also use this for matrix-valued vectors, an example is the diagonal kernel matrix K = diag(K (c) )c in flat classification. [sent-1178, score-0.249]
86 For kernel matrices (for example, K (c) ), we do not list the kernel functions (here: K (c) ), and for evaluation vectors (for example, u), we do not list the underlying functions (here: u (c) (·)). [sent-1199, score-0.391]
87 The criterion we minimize is strictly convex, even if the kernel matrix K is singular (or nearly so). [sent-1210, score-0.216]
88 We make use of a threshold κ < 0 and define I = {(i, c) | log πic < κ, yic > 0} , I0 = {(i, c) | log πic < κ, yic = 0} . [sent-1213, score-0.246]
89 The indices in I can be problematic due to the corresponding component g̃ic ≈ yic/πic^(1/2) becoming large. [sent-1214, score-0.193]
90 If the joint kernel matrix K is not block-diagonal (as in hierarchical classification, see Section 3), diag K is not sufficient for computing the system matrix diagonal. [sent-1232, score-0.499]
91 Then, the system matrix diagonal has elements 1 + πic (Kic + σ2) − 2wic + πi^T wi, where w = v + σ2 π. [sent-1234, score-0.189]
92 A Generic Kernel Matrix Representation: A kernel matrix representation is some data structure that allows us to compute kernel matrix MVMs v → K(c) v efficiently, these being the principal primitives of our primary fitting method. [sent-1257, score-0.765]
93 An efficient representation depends strongly on the covariance function used, and also on whether kernel matrix MVMs are approximated rather than computed exactly. [sent-1261, score-0.216]
94 It requires kernel matrices to be stored explicitly, which may not be possible for very large n. [sent-1266, score-0.262]
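One possible shape for such a representation, sketched as a small interface (explicit storage, so only feasible for moderate n; the class and method names are illustrative, not the paper's API):

```python
import numpy as np

class ExplicitKernelRep:
    """Minimal kernel matrix representation: stores each K^(c) explicitly and
    exposes the MVM primitive v -> K^(c) v plus the diagonal needed for LCG
    preconditioning."""

    def __init__(self, K_blocks):
        self.K_blocks = K_blocks              # list of C dense (n, n) arrays

    def mvm(self, c, v):
        return self.K_blocks[c] @ v

    def diag(self, c):
        return np.diag(self.K_blocks[c])
```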
95 In general, we allow for different covariance functions K (c) for each class c, although sharing of kernels is supported, in that K (c) (·, ·) = vc M (lc ) (·, ·) and vc > 0. [sent-1267, score-0.208]
96 In practice, this turns out to be significantly slower (by a factor), the reason being that the optimized BLAS primitives are many times more efficient than applying a non-linear function f(l) point-wise to a matrix, even if the matrix is stored contiguously. [sent-1277, score-0.3]
97 It is required during hyperparameter optimization, because the MVM primitives for sub-matrices K Jk have to be driven by a single representation of the complete K (note that each K Jk is of size n(q − 1)/q, thus almost as large as K ). [sent-1282, score-0.379]
98 1, we simply permute the kernel matrices K (c) using the index (Jk , Ik ). [sent-1288, score-0.225]
99 1 uses the linear kernel K(c)(x, x′) = vc x^T x′, where x is very high-dimensional (word counts over a dictionary), but also extremely sparse (by far most entries are zero). [sent-1292, score-0.239]
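For this case the MVM never needs the n × n matrix at all; a sketch with scipy.sparse, assuming X holds the sparse word-count rows:

```python
import numpy as np
import scipy.sparse as sp

def linear_kernel_mvm(X, v_c, vec):
    """MVM with K = v_c * X X^T for a sparse data matrix X, computed as two
    sparse matrix-vector products instead of forming the dense kernel matrix."""
    X = sp.csr_matrix(X)
    return v_c * (X @ (X.T @ vec))
```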
100 Cross-validation optimization for large scale hierarchical classification kernel methods. [sent-1482, score-0.39]
wordName wordTfidf (topN-words)
[('mvm', 0.31), ('primitives', 0.213), ('newton', 0.195), ('lcg', 0.193), ('nr', 0.187), ('jk', 0.176), ('hierarchical', 0.175), ('kernel', 0.166), ('hyperparameter', 0.166), ('ranw', 0.16), ('hyperparameters', 0.155), ('eeger', 0.155), ('cale', 0.135), ('cai', 0.127), ('arge', 0.121), ('tructured', 0.121), ('primary', 0.12), ('cv', 0.119), ('tting', 0.115), ('ic', 0.106), ('mvms', 0.106), ('lassification', 0.104), ('ethods', 0.103), ('ernel', 0.103), ('blas', 0.103), ('hofmann', 0.101), ('bcd', 0.091), ('uc', 0.09), ('accumulation', 0.087), ('preconditioning', 0.087), ('yic', 0.087), ('kv', 0.08), ('barber', 0.078), ('scg', 0.077), ('vc', 0.073), ('williams', 0.073), ('mixing', 0.071), ('pcls', 0.068), ('pdata', 0.068), ('penalisation', 0.068), ('shuf', 0.068), ('likelihood', 0.066), ('gradients', 0.065), ('kernels', 0.062), ('ik', 0.062), ('ui', 0.059), ('matrices', 0.059), ('diag', 0.058), ('latent', 0.058), ('couplings', 0.057), ('dk', 0.056), ('fold', 0.055), ('gradient', 0.054), ('lh', 0.052), ('optimizations', 0.052), ('algebra', 0.05), ('matrix', 0.05), ('modern', 0.049), ('optimization', 0.049), ('code', 0.048), ('ul', 0.048), ('hessian', 0.046), ('conjugate', 0.046), ('seeger', 0.046), ('pacc', 0.046), ('rnc', 0.046), ('taxo', 0.046), ('uic', 0.046), ('wc', 0.046), ('svm', 0.044), ('inner', 0.044), ('penalized', 0.043), ('convex', 0.043), ('hierarchy', 0.042), ('classi', 0.041), ('memory', 0.04), ('nc', 0.039), ('direction', 0.039), ('leaf', 0.038), ('stored', 0.037), ('dl', 0.037), ('folds', 0.037), ('block', 0.037), ('log', 0.036), ('yi', 0.036), ('dependencies', 0.036), ('gaussian', 0.035), ('hsu', 0.035), ('anova', 0.034), ('argmaxc', 0.034), ('dpcls', 0.034), ('leafs', 0.034), ('qic', 0.034), ('silverman', 0.034), ('fairly', 0.033), ('diagonal', 0.033), ('coupling', 0.033), ('cache', 0.032), ('contiguous', 0.032), ('predictive', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.15626225 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
3 0.13720095 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.13387015 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
5 0.085335955 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.082137108 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
7 0.081664413 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
8 0.08114399 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
9 0.080545776 86 jmlr-2008-SimpleMKL
10 0.080512822 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.078332029 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
12 0.077178776 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.073824897 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
14 0.073614746 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.06798742 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
16 0.067713514 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.065049231 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
18 0.062412459 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.060681768 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
20 0.058722902 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
topicId topicWeight
[(0, 0.321), (1, -0.129), (2, 0.036), (3, 0.237), (4, 0.081), (5, 0.035), (6, -0.028), (7, 0.047), (8, 0.015), (9, 0.121), (10, -0.099), (11, 0.073), (12, 0.032), (13, -0.02), (14, -0.021), (15, -0.169), (16, 0.071), (17, 0.013), (18, 0.027), (19, -0.058), (20, -0.088), (21, -0.059), (22, -0.038), (23, 0.074), (24, 0.03), (25, 0.031), (26, -0.015), (27, -0.206), (28, -0.158), (29, -0.059), (30, -0.096), (31, -0.043), (32, -0.084), (33, -0.088), (34, 0.147), (35, -0.114), (36, 0.045), (37, 0.058), (38, -0.06), (39, 0.07), (40, 0.018), (41, -0.032), (42, 0.086), (43, 0.05), (44, 0.158), (45, -0.054), (46, 0.041), (47, 0.054), (48, 0.017), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.92675072 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.49745631 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
3 0.49636617 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
4 0.48290536 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.4237799 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.41981801 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
7 0.40571204 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
8 0.40322322 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
9 0.38696146 86 jmlr-2008-SimpleMKL
10 0.38262129 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
11 0.3787995 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.36999077 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
13 0.36986855 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
14 0.34456673 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
15 0.33684713 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
16 0.33325818 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
17 0.32076117 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
18 0.31984952 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
19 0.29369113 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
topicId topicWeight
[(0, 0.019), (5, 0.023), (31, 0.015), (40, 0.047), (54, 0.034), (58, 0.036), (66, 0.058), (76, 0.023), (78, 0.012), (88, 0.065), (92, 0.033), (94, 0.089), (99, 0.461)]
simIndex simValue paperId paperTitle
same-paper 1 0.91484058 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
2 0.90562129 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
3 0.56038344 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.45300344 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
5 0.41693974 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent
7 0.41359591 83 jmlr-2008-Robust Submodular Observation Selection
8 0.41171736 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
9 0.41059306 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
10 0.40356338 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.4017548 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.39672807 56 jmlr-2008-Magic Moments for Structured Output Prediction
13 0.39655024 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
14 0.39403272 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.39338607 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
16 0.39273891 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.39253163 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
18 0.39041618 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
19 0.3851684 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
20 0.3842476 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration