nips nips2006 nips2006-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. [sent-6, score-0.179]
2 Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. [sent-7, score-0.089]
3 We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. [sent-8, score-0.234]
4 For example, in multi-way classification problems with a structured label space, modern applications demand predictions on thousands of classes, and very large datasets become available. [sent-10, score-0.094]
5 If n and C denote dataset size and number of classes respectively, nonparametric kernel methods like SVMs or Gaussian processes typically scale superlinearly in n C, if dependencies between the latent class functions are properly represented. [sent-11, score-0.257]
6 Furthermore, most large scale kernel methods proposed so far refrain from solving the problem of learning hyperparameters (kernel or loss function parameters). [sent-12, score-0.22]
7 In this paper, we propose a general framework for learning in probabilistic kernel classification models. [sent-14, score-0.138]
8 While the basic model is standard, a major feature of our approach is the high computational efficiency with which the primary fitting (for fixed hyperparameters) is done, allowing us to deal with hundreds of classes and thousands of datapoints within a few minutes. [sent-15, score-0.199]
9 The primary fitting scales linearly in C, and depends on n mainly via a fixed number of matrix-vector multiplications (MVM) with n × n kernel matrices. [sent-16, score-0.33]
10 Furthermore, we optimize hyperparameters automatically by minimizing the cross-validation log likelihood, making use of our primary fitting technology as inner loop in order to compute the CV criterion and its gradient. [sent-18, score-0.354]
11 Our approach can be used to learn a large number of hyperparameters and does not need user interaction. [sent-19, score-0.082]
12 Our framework is generally applicable to structured label spaces, which we demonstrate here for hierarchical classification of text documents. [sent-20, score-0.294]
13 While the C latent class functions are fully dependent a priori, the scaling of our method stays within a factor of two compared to unstructured classification. [sent-22, score-0.075]
14 The primary fitting method is given in Section 2, the extension to hierarchical classification in Section 3. [sent-28, score-0.31]
15 We employ the multiple logistic regression model, consisting of C latent (unobserved) class functions $u_c$ feeding into the multiple logistic (or softmax) likelihood $P(y_{i,c}=1 \mid x_i, u_i) = e^{u_c(x_i)} / \sum_{c'} e^{u_{c'}(x_i)}$. [sent-46, score-0.447]
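As a quick illustration of the likelihood above, here is a minimal sketch of the softmax computation for a matrix of latent class function values; the function name and the (n, C) array layout are choices of this sketch, not taken from the paper.

```python
import numpy as np

def softmax_likelihood(U):
    """P(y_i = c | x_i, u_i) = exp(u_c(x_i)) / sum_{c'} exp(u_{c'}(x_i)).
    U is an (n, C) array holding u_c(x_i) for every case i and class c."""
    U = U - U.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)
```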
16 Since the likelihood depends on the $f_c$ only through $f_c(x_i)$, every minimizer of $\Phi$ must be a kernel expansion: $f_c = \sum_i \alpha_{i,c} K^{(c)}(\cdot, x_i)$ (representer theorem, see [4]). [sent-60, score-0.695]
17 The corresponding kernel expansions are $\hat{u}_c = \sum_i \hat{\alpha}_{i,c}\,(K^{(c)}(\cdot, x_i) + \sigma^2)$. [sent-67, score-0.286]
18 Estimates of the conditional probability on test points $x_*$ are obtained by plugging $\hat{u}_c(x_*)$ into the likelihood. [sent-68, score-0.133]
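A hedged sketch of this prediction step: evaluate the kernel expansion at the test points and plug the result into the softmax likelihood. The (m, n, C) layout for the test/train kernel evaluations and the function name are assumptions of this sketch.

```python
import numpy as np

def predict_proba(K_star, alpha_hat, sigma2=0.0):
    """u_c(x_*) = sum_i alpha_hat[i, c] * (K^{(c)}(x_*, x_i) + sigma^2),
    then softmax over classes.  K_star: (m, n, C) kernel evaluations,
    alpha_hat: (n, C) fitted coefficients."""
    U = np.einsum('mnc,nc->mc', K_star + sigma2, alpha_hat)
    U = U - U.max(axis=1, keepdims=True)   # stabilized softmax
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)
```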
19 We note that this setup can also be seen as a MAP approximation to a Bayesian model, where the $f_c$ are given independent Gaussian process priors, e.g. [sent-69, score-0.195]
20 The negative log multiple logistic likelihood has similar properties, but is smooth as a function of u, and the primary fitting of α does not require constrained convex optimization. [sent-73, score-0.272]
21 The complexity of our fitting algorithm is dominated by k1 (k2 + 2) matrix-vector multiplications with K , where k1 is the number of NR iterations, k2 the number of linear conjugate gradient (LCG) steps for computing each Newton direction. [sent-75, score-0.11]
22 3 Hierarchical Classification So far we have dealt with flat classification, the classes being independent a priori, with block-diagonal kernel matrix K. [sent-79, score-0.226]
23 Here we focus on hierarchical classification, where the label set {1, . . . , C} carries a known hierarchical (tree) structure. [sent-81, score-0.208]
24 In this Section, we propose a model for this setup and show how it can be dealt with in our framework with minor modifications and minor extra cost. [sent-86, score-0.153]
25 In flat classification, the latent class functions $u_c$ are modelled as a priori independent, in that the regularizer (which plays the role of a log prior) is a sum of individual terms for each $u_c$, without any coupling between the classes. [sent-87, score-0.331]
26 Assign a pair of latent functions $u_p$, $\breve{u}_p$ to each node, except the root. [sent-108, score-0.075]
27 The class functions to be fed into the likelihood are the $u_{L(c)}$ of the leaves. [sent-111, score-0.078]
28 For example, if leaves L(c), L(c') have the common parent p, then $u_{L(c)} = u_p + \breve{u}_{L(c)}$, $u_{L(c')} = u_p + \breve{u}_{L(c')}$, so the class functions share the effect $u_p$. [sent-113, score-0.181]
29 Under the hierarchical model, the class functions uL(c) are strongly dependent a priori. [sent-119, score-0.187]
30 We may represent this prior coupling in our framework by simply plugging in the implied kernel matrix K: $K = (\Phi_{L,\cdot} \otimes I)\,\breve{K}\,(\Phi_{L,\cdot}^T \otimes I)$ (2), where the inner $\breve{K}$ is block-diagonal. [sent-120, score-0.219]
31 K is not sparse and certainly not block-diagonal, but the important point is that we are still able to do kernel MVMs efficiently: pre- and postmultiplying by $\Phi$ is cheap, and $\breve{K}$ is block-diagonal just as in the flat case. [sent-121, score-0.138]
32 We note that the step from flat to hierarchical classification requires minor modifications of existing code only. [sent-122, score-0.231]
[Algorithm 1 (matrix-vector multiplication in the hierarchical case) is interleaved here in the original layout; its recoverable fragments are: "if $n_p > 0$ (p not a leaf node) then", "$y \leftarrow (y^T, y_p 1^T + x^T)^T$", "end for". A sketch follows this passage.]
33 This simplicity carries through to the hyperparameter learning case (see Section 4). [sent-130, score-0.109]
34 The cost of a kernel MVM is increased by at most a factor of two. [sent-135, score-0.138]
35 However, it would be wrong to claim that hierarchical classification in general comes as cheap as flat classification. [sent-138, score-0.19]
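The following is a minimal sketch of the hierarchical kernel MVM $Kx = (\Phi_{L,\cdot} \otimes I)\,\breve{K}\,(\Phi_{L,\cdot}^T \otimes I)\,x$ described above, written as a tree traversal. The parent-array encoding of the hierarchy, the per-node n × n kernel matrices, and all names are assumptions of this sketch, not the paper's actual Algorithm 1.

```python
import numpy as np

def hierarchical_kernel_mvm(x_leaf, parent, leaf_nodes, node_kernels):
    """x_leaf: (C, n) input, one n-vector per leaf class.
    parent[p]: parent of non-root node p, or None if its parent is the root.
    leaf_nodes[c]: node index of leaf L(c).  node_kernels[p]: (n, n) kernel matrix."""
    P, n = len(parent), x_leaf.shape[1]
    # (Phi^T (x) I) x: accumulate each leaf vector onto every node on its root path.
    z = np.zeros((P, n))
    for c, leaf in enumerate(leaf_nodes):
        p = leaf
        while p is not None:
            z[p] += x_leaf[c]
            p = parent[p]
    # Block-diagonal Kbreve: one independent kernel MVM per node.
    z = np.stack([node_kernels[p] @ z[p] for p in range(P)])
    # (Phi (x) I) z: each leaf class sums the contributions along its path.
    y = np.zeros_like(x_leaf)
    for c, leaf in enumerate(leaf_nodes):
        p = leaf
        while p is not None:
            y[c] += z[p]
            p = parent[p]
    return y
```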
36 The subtle issue is that the primary fitting becomes more costly, precisely because there is more coupling between the variables. [sent-139, score-0.155]
37 In the hierarchical case, both LCG and NR need more iterations to attain the same accuracy. [sent-142, score-0.155]
38 An important point for future research is to find a good preconditioning strategy for the system of Eq. [sent-145, score-0.084]
39 However, in all our experiments so far the fitting of the hierarchical model took less than twice the time required for the flat model on the same task. [sent-147, score-0.155]
40 4 Hyperparameter Learning In any model of interest, there will be free hyperparameters h, for example parameters of the kernels K (c) . [sent-149, score-0.13]
41 These were assumed to be fixed in the primary fitting method introduced in Section 2. [sent-150, score-0.155]
42 In this Section, we describe a scheme for learning h which makes use of the primary fitting algorithm as inner loop. [sent-151, score-0.196]
43 Recall that primary fitting consists of minimizing Φ of Eq. [sent-153, score-0.155]
44 A common remedy is to minimize the negative cross-validation log likelihood Ψ instead. [sent-162, score-0.118]
45 Here $J_k = \{1, \ldots, n\} \setminus I_k$; let $\Phi_{J_k} = u_{[J_k]}^T\big((1/2)\alpha_{[J_k]} - y_{J_k}\big) + 1^T l_{[J_k]}$ be the primary criterion on the subset $J_k$ of the data. [sent-169, score-0.198]
46 To this end, we keep the set $\{\alpha_{[J_k]}\}$ of primary variables, and iterate between re-fitting those for each fold $I_k$, and computing Ψ and $\nabla_h \Psi$. [sent-178, score-0.222]
47 The latter can be determined analytically, requiring us to solve a linear system with the Hessian matrix $I + V_{[J_k]}^T \tilde{K}_{J_k} V_{[J_k]}$ already encountered during primary fitting (see Section 5). [sent-179, score-0.228]
48 This means that the same LCG code used to compute Newton directions there can be applied here in order to compute the gradient of Ψ. [sent-180, score-0.113]
49 The update of the α[Jk ] requires q primary fitting applications, but since they are initialized with the previous values α[Jk ] , they do converge very rapidly, especially during later outer iterations. [sent-183, score-0.155]
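A sketch of the outer CV loop just described, with warm-started re-fits per fold; `fit_primary` and `neg_log_lik` stand in for the primary fitting routine and the held-out negative log likelihood, and both are assumptions of this sketch.

```python
def cv_criterion(h, folds, fit_primary, neg_log_lik, alpha_store):
    """Evaluate Psi(h): for each fold, re-fit alpha on J_k = {1..n} \ I_k,
    warm-starting from the previous alpha_[J_k], and score the held-out I_k."""
    psi = 0.0
    for k, (I_k, J_k) in enumerate(folds):
        alpha_store[k] = fit_primary(h, J_k, warm_start=alpha_store[k])  # q re-fits in total
        psi += neg_log_lik(h, alpha_store[k], I_k)                       # held-out score
    return psi
```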
50 The gradient computation decomposes into two parts: accumulation, and kernel derivative MVMs. [sent-185, score-0.218]
51 The accumulation part requires solving q systems of size $((q-1)/q)\,n\,C$, thus $q\,k_3$ kernel MVMs on the $\tilde{K}_{J_k}$ if linear conjugate gradients (LCG) is used, $k_3$ being the number of LCG steps. [sent-186, score-0.324]
52 Note that the accumulation step is independent of the number of hyperparameters. [sent-188, score-0.116]
53 The kernel derivative MVM part consists of q derivative MVM calls for each independent component of h, see Section 5. [sent-189, score-0.226]
54 As opposed to the accumulation part, this part consists of a simple large matrix operation and can be run very efficiently using specialized numerical linear algebra code. [sent-191, score-0.182]
55 As shown in Section 5, the extension of hyperparameter learning to the hierarchical case of Section 3 is simply done by wrapping the accumulation part, the coding and additional memory effort being minimal. [sent-192, score-0.437]
56 Since kernel MVMs are significantly more expensive than these O(n C) operations, the line searches basically come for free! [sent-204, score-0.138]
57 We choose search directions by Newton-Raphson (NR), since the Hessian of Φ is required anyway for hyperparameter learning. [sent-205, score-0.15]
58 The NR direction is obtained by solving this system approximately by the linear conjugate gradients (LCG) method, requiring an MVM with the system matrix in each iteration, thus a single MVM with K. [sent-214, score-0.177]
59 Our implementation includes diagonal preconditioning and numerical stability safeguards [8]. [sent-215, score-0.085]
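A minimal sketch of computing the NR direction with linear CG, touching the kernel only through MVMs. The diagonal stand-in for V and all names are assumptions of this sketch; the paper's system involves the full likelihood Hessian.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction(kernel_mvm, grad, v_diag, k2=20):
    """Approximately solve (I + V^T K V) d = -grad with k2 LCG steps,
    where V = diag(v_diag) is a simplified stand-in and kernel_mvm(w)
    returns K w without ever forming K explicitly."""
    N = grad.shape[0]
    matvec = lambda d: d + v_diag * kernel_mvm(v_diag * d)   # one kernel MVM per CG step
    A = LinearOperator((N, N), matvec=matvec, dtype=np.float64)
    d, _ = cg(A, -grad, maxiter=k2)
    return d
```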
60 We now show how to compute the gradient $\nabla_h \Psi$ for the CV criterion Ψ (Eq. [sent-219, score-0.079]
61 We finally mention some of the computational “tricks”, without which we could not have dealt with the largest tasks in Section 6. [sent-230, score-0.078]
62 For hyperparameter learning, we work on subsets $J_k$ and need MVMs with $\tilde{K}_{J_k}$. [sent-238, score-0.116]
63 We also share memory blocks of size n C between LCG, gradient accumulation, line searches, keeping the overall memory requirements at r n C for a small constant r, and avoiding frequent reallocations. [sent-242, score-0.136]
64 With many classes C, we may share kernels: K (c) = vc M (lc ) , vc > 0 variance parameters, M (l) independent correlation functions. [sent-246, score-0.488]
65 The linear kernel $K^{(c)}(x, x') = v_c\, x^T x'$ is frequently used for text classification (see Section 6. [sent-248, score-0.405]
66 If the data matrix X is sparse, kernel MVM can be done in much less than the generic $O(C\,n^2)$, typically in $O(C\,n)$, requiring O(n) storage for X only, even if the dimension of x is way beyond n. [sent-250, score-0.177]
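For the sparse linear-kernel case just mentioned, a short sketch of the MVM with cost proportional to the number of nonzeros of X; the variance parameter and names are assumptions of this sketch.

```python
import numpy as np
import scipy.sparse as sp

def linear_kernel_mvm(X, w, v_c=1.0):
    """MVM with K = v_c * X X^T without forming the n x n kernel matrix.
    X: sparse (n, d) data matrix, w: length-n vector; cost ~ O(nnz(X))."""
    return v_c * (X @ (X.T @ w))

# usage: X = sp.random(5000, 40000, density=1e-3, format="csr"); w = np.ones(5000)
# k_w = linear_kernel_mvm(X, w)
```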
67 If the $K^{(c)}$ are isotropic kernels (depending on $\|x - x'\|$ only) and the x are low-dimensional, MVM with $K^{(c)}$ can be approximated using specialized nearest neighbour data structures such as KD trees [12, 9]. [sent-251, score-0.091]
68 For general kernels whose kernel matrices have a rapidly decaying eigenspectrum, one can approximate MVM by using low-rank matrices instead of the K (c) [10], whence MVM is O(C n d), d the rank. [sent-253, score-0.244]
69 Note that (∂/∂ log vc )K (c) = K (c) , reducing to kernel MVM. [sent-255, score-0.393]
70 For isotropic kernels, $K^{(c)} = f(A)$, $a_{i,j} = \|x_i - x_j\|$, so $(\partial/\partial h_j) K^{(c)} = g_j(A)$. [sent-256, score-0.149]
71 The innermost vector operations work on contiguous chunks of memory, rather than strided ones, thus supporting caching or vector functions of the processor. [sent-258, score-0.082]
72 6 Experiments In this Section, we provide experimental results for our framework on data from remote sensing, and on a set of large text classification tasks with very many classes; the latter are hierarchical. [sent-259, score-0.144]
73 6.1 Flat Classification: Remote Sensing We use the satimage remote sensing task from the statlog repository. [sent-261, score-0.117]
74 We use the isotropic Gaussian (RBF) kernel $K^{(c)}(x, x') = v_c \exp\big(-\tfrac{w_c}{2}\|x - x'\|^2\big)$, $v_c, w_c > 0$, $x, x' \in \mathbb{R}^d$. [sent-264, score-0.699]
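A sketch of the RBF kernel in the parameterization assumed above (the exact placement of $w_c$ is a reconstruction, so treat it as an assumption):

```python
import numpy as np

def rbf_kernel(X, Z, v_c=1.0, w_c=1.0):
    """K^{(c)}(x, x') = v_c * exp(-(w_c / 2) * ||x - x'||^2) for all pairs.
    X: (m, d), Z: (n, d); returns the (m, n) kernel matrix."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return v_c * np.exp(-0.5 * w_c * sq)
```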
75 Note that while 1rest may also choose 12 independent kernel parameters, it does not make good use of this possibility, as opposed to mc-sep. [sent-277, score-0.169]
76 Our method outputs a predictive probability vector $p_j \in \mathbb{R}^C$ for each test case $x_j$. [sent-296, score-0.107]
77 The standard prediction $y(x_j) = \mathrm{argmax}_c\, p_{j,c}$ maximizes expected accuracy; classes are ranked as $r_j(c) \le r_j(c')$ iff $p_{j,c} \ge p_{j,c'}$. [sent-297, score-0.18]
78 The test scores are the same as in [1]: accuracy (acc) $m^{-1}\sum_j I_{\{y(x_j)=y_j\}}$, precision (prec) $m^{-1}\sum_j r_j(y_j)^{-1}$, parent accuracy (pacc) $m^{-1}\sum_j I_{\{\mathrm{par}(y(x_j))=\mathrm{par}(y_j)\}}$, par(c) being the parent of L(c). [sent-298, score-0.306]
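A sketch of these three test scores computed from the predictive probabilities; the leaf-to-parent encoding and the tie handling in the ranking are assumptions of this sketch.

```python
import numpy as np

def test_scores(p, y, parent_of_leaf):
    """p: (m, C) predictive probabilities, y: (m,) true class indices,
    parent_of_leaf: (C,) parent node of each leaf class."""
    m = len(y)
    y_hat = p.argmax(axis=1)
    acc = (y_hat == y).mean()
    # r_j(y_j): rank of the true class when classes are sorted by p_j (1 = best)
    ranks = (p >= p[np.arange(m), y][:, None]).sum(axis=1)
    prec = (1.0 / ranks).mean()
    pacc = (parent_of_leaf[y_hat] == parent_of_leaf[y]).mean()
    return acc, prec, pacc
```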
79 For taxo-loss and parent accuracy, it is better to choose $y(x_j)$ to minimize expected loss, different from the standard prediction. [sent-302, score-0.113]
80 F1: all vc shared (1); H1: vc shared across each level of the tree (3). [sent-304, score-0.544]
81 F2, H2: vc shared across each subtree rooted at root’s children (A: 15, B: 34, C: 17, D: 7, E: 7, F: 17, G: 12, H: 5). [sent-305, score-0.335]
82 For hyperparameter learning: $k_1 = 8$, $k_2 = 4$, $k_3 = 15$ (F1, F2); $k_1 = 10$, $k_2 = 4$, $k_3 = 25$ (H1, H2). [sent-307, score-0.116]
83 For parent accuracy, let p(j) be the node with maximal mass (under $p_j$) of its children which are leaves; then $y(x_j)$ must be a child of p(j). [sent-325, score-0.124]
84 taxo[0-1], pacc[0-1]: scores under the $\mathrm{argmax}_c\, p_{j,c}$ rule, rather than minimizing expected loss. [sent-521, score-0.089]
85 CV Fold: Re-optimization of α[J] , gradient accumulation for single fold. [sent-533, score-0.152]
86 The optimization is started from vc = 5 for all methods. [sent-535, score-0.222]
87 While the differences in accuracy and precision are hardly significant (as also found in [1]), they (partly) are in taxo-loss and parent accuracy. [sent-538, score-0.112]
88 H1 and H2 do not perform differently: choosing many different $v_c$ in the linear kernel seems to offer no advantage here (but see Section 6. [sent-540, score-0.36]
89 However, for our method, the recommendation in [1] to use $v_c = 1$ leads to significantly worse results in all scores; the $v_c$ chosen by our methods are generally larger. [sent-543, score-0.444]
90 In Table 2, we present running times for the final fitting and for a single fold during hyperparameter optimization (5 of them are required for Ψ, $\nabla_h \Psi$). [sent-544, score-0.183]
91 It is precisely this high efficiency of primary fitting which allows us to use it as inner loop for hyperparameter learning. [sent-546, score-0.312]
92 7 Discussion We presented a general framework for very efficient large scale kernel multi-way classification with structured label spaces and demonstrated its features on hierarchical text classification tasks with many classes. [sent-547, score-0.466]
93 As shown for the hierarchical case, the framework is easily extended to novel structured label spaces (footnote 10, truncated: "Processor time on 64bit 2."). [sent-548, score-0.155]
94 We solve the kernel parameter learning problem by optimizing the CV log likelihood, whose gradient can be computed within the framework. [sent-551, score-0.207]
95 An application to label sequence learning is work in progress, which may even be combined with a hierarchical prior. [sent-556, score-0.208]
96 Inferring a hierarchy from data is possible in principle, using expectation maximization techniques (note that the primary fitting can deal with target distributions $y_i$), as well as incorporating uncertain data. [sent-557, score-0.216]
97 Empirical Bayesian methods or approximate CV scores for hyperparameter learning have been proposed in [11, 3, 6], but they are orders of magnitude more expensive than our proposal here, and do not apply to a massive number of classes. [sent-558, score-0.155]
98 SVM methods typically do not come with efficient automatic kernel parameter learning schemes, and they do not provide estimates of predictive probabilities which are asymptotically correct. [sent-562, score-0.194]
99 Efficient kernel machines using the improved fast Gauss transform. [sent-640, score-0.138]
100 The latter kind, if feasible at all, often makes better use of hardware features such as caching and vector operations, and therefore is the preferred approach in numerical optimization. [sent-643, score-0.085]
wordName wordTfidf (topN-words)
[('mvm', 0.475), ('jk', 0.349), ('vc', 0.222), ('lcg', 0.2), ('nr', 0.199), ('hierarchical', 0.155), ('primary', 0.155), ('fc', 0.152), ('cv', 0.143), ('tting', 0.142), ('kernel', 0.138), ('hyperparameter', 0.116), ('accumulation', 0.116), ('ul', 0.105), ('dk', 0.105), ('mvms', 0.1), ('pacc', 0.1), ('taxo', 0.1), ('ik', 0.098), ('uc', 0.093), ('lh', 0.087), ('classi', 0.085), ('hyperparameters', 0.082), ('leafs', 0.075), ('parent', 0.074), ('newton', 0.069), ('fold', 0.067), ('remote', 0.065), ('cai', 0.065), ('np', 0.063), ('hierarchy', 0.061), ('ui', 0.057), ('predictive', 0.056), ('bc', 0.055), ('xi', 0.055), ('hessian', 0.053), ('label', 0.053), ('sensing', 0.052), ('xj', 0.051), ('children', 0.05), ('acc', 0.05), ('argmaxc', 0.05), ('cacheing', 0.05), ('euc', 0.05), ('prec', 0.05), ('preconditioning', 0.05), ('memory', 0.05), ('kernels', 0.048), ('likelihood', 0.046), ('leaf', 0.046), ('par', 0.046), ('text', 0.045), ('classes', 0.044), ('derivative', 0.044), ('dealt', 0.044), ('yj', 0.044), ('latent', 0.043), ('code', 0.043), ('setup', 0.043), ('criterion', 0.043), ('isotropic', 0.043), ('rj', 0.043), ('inner', 0.041), ('structured', 0.041), ('plugging', 0.04), ('svm', 0.04), ('requiring', 0.039), ('minimize', 0.039), ('scores', 0.039), ('cation', 0.039), ('accuracy', 0.038), ('logistic', 0.038), ('anova', 0.037), ('kd', 0.037), ('multiplications', 0.037), ('wc', 0.037), ('conjugate', 0.037), ('priori', 0.037), ('gradient', 0.036), ('numerical', 0.035), ('root', 0.035), ('cheap', 0.035), ('seeger', 0.035), ('tree', 0.034), ('directions', 0.034), ('tasks', 0.034), ('system', 0.034), ('minor', 0.033), ('shared', 0.033), ('section', 0.033), ('gradients', 0.033), ('log', 0.033), ('functions', 0.032), ('removal', 0.032), ('hj', 0.032), ('opposed', 0.031), ('final', 0.03), ('rooted', 0.03), ('matrices', 0.029), ('details', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
2 0.15989268 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
Author: Yee W. Teh, David Newman, Max Welling
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
3 0.14980027 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.12924239 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
5 0.12792408 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
6 0.09489242 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
7 0.090283722 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
8 0.082443103 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
9 0.082248725 186 nips-2006-Support Vector Machines on a Budget
10 0.082207836 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
11 0.079678431 115 nips-2006-Learning annotated hierarchies from relational data
12 0.074201465 116 nips-2006-Learning from Multiple Sources
13 0.071183547 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
14 0.069881633 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
15 0.068072192 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
16 0.067516588 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
17 0.065456547 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects
18 0.064272992 138 nips-2006-Multi-Task Feature Learning
19 0.063229837 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
20 0.063117623 146 nips-2006-No-regret Algorithms for Online Convex Programs
topicId topicWeight
[(0, -0.242), (1, 0.069), (2, 0.016), (3, -0.018), (4, -0.0), (5, 0.115), (6, 0.059), (7, 0.002), (8, 0.001), (9, -0.015), (10, -0.135), (11, 0.036), (12, 0.043), (13, -0.196), (14, 0.012), (15, 0.092), (16, -0.026), (17, 0.038), (18, -0.063), (19, -0.002), (20, -0.074), (21, 0.068), (22, -0.074), (23, 0.044), (24, -0.057), (25, 0.002), (26, 0.028), (27, -0.074), (28, 0.014), (29, 0.075), (30, 0.087), (31, 0.071), (32, 0.039), (33, -0.131), (34, -0.008), (35, -0.002), (36, 0.12), (37, 0.053), (38, 0.001), (39, -0.079), (40, 0.106), (41, 0.057), (42, -0.092), (43, 0.003), (44, 0.052), (45, 0.059), (46, 0.054), (47, -0.081), (48, 0.1), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.92262733 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
2 0.62242085 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
Author: Jerónimo Arenas-garcía, Kaare B. Petersen, Lars K. Hansen
Abstract: In this paper we are presenting a novel multivariate analysis method. Our scheme is based on a novel kernel orthonormalized partial least squares (PLS) variant for feature extraction, imposing sparsity constrains in the solution to improve scalability. The algorithm is tested on a benchmark of UCI data sets, and on the analysis of integrated short-time music features for genre prediction. The upshot is that the method has strong expressive power even with rather few features, is clearly outperforming the ordinary kernel PLS, and therefore is an appealing method for feature extraction of labelled data. 1
3 0.60184765 82 nips-2006-Gaussian and Wishart Hyperkernels
Author: Risi Kondor, Tony Jebara
Abstract: We propose a new method for constructing hyperkenels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning. 1
4 0.5999642 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
5 0.58744138 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
6 0.58066928 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
7 0.55595976 169 nips-2006-Relational Learning with Gaussian Processes
8 0.51399976 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
9 0.50759351 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
10 0.5028007 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
11 0.49781632 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
12 0.49132228 65 nips-2006-Denoising and Dimension Reduction in Feature Space
13 0.48762155 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
14 0.48682645 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
15 0.46664083 186 nips-2006-Support Vector Machines on a Budget
16 0.46060839 156 nips-2006-Ordinal Regression by Extended Binary Classification
17 0.45838028 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
18 0.43568623 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
19 0.42654958 129 nips-2006-Map-Reduce for Machine Learning on Multicore
20 0.42461646 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
topicId topicWeight
[(1, 0.568), (3, 0.02), (7, 0.068), (9, 0.047), (20, 0.013), (22, 0.052), (44, 0.038), (57, 0.043), (65, 0.047), (69, 0.025)]
simIndex simValue paperId paperTitle
1 0.99650091 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
2 0.99253005 46 nips-2006-Blind source separation for over-determined delayed mixtures
Author: Lars Omlor, Martin Giese
Abstract: Blind source separation, i.e. the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and specifically for the case of dimension reduction almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a timefrequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, and also a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms. 1
same-paper 3 0.98767591 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
4 0.98088688 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
Author: Tobias Sing, Niko Beerenwinkel
Abstract: Starting with the work of Jaakkola and Haussler, a variety of approaches have been proposed for coupling domain-specific generative models with statistical learning methods. The link is established by a kernel function which provides a similarity measure based inherently on the underlying model. In computational biology, the full promise of this framework has rarely ever been exploited, as most kernels are derived from very generic models, such as sequence profiles or hidden Markov models. Here, we introduce the MTreeMix kernel, which is based on a generative model tailored to the underlying biological mechanism. Specifically, the kernel quantifies the similarity of evolutionary escape from antiviral drug pressure between two viral sequence samples. We compare this novel kernel to a standard, evolution-agnostic amino acid encoding in the prediction of HIV drug resistance from genotype, using support vector regression. The results show significant improvements in predictive performance across 17 anti-HIV drugs. Thus, in our study, the generative-discriminative paradigm is key to bridging the gap between population genetic modeling and clinical decision making. 1
5 0.96261919 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
Author: Shai Avidan, Moshe Butman
Abstract: Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure MultiParty computations use cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute and we introduce a couple of machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first method is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are enough for classifying an image patch, but not enough to actually reconstruct it. The second machine learning technique is active learning that allows Alice to construct an online classifier, based on a small number of calls to Bob’s face detector. She can then use her online classifier as a fast rejector before using a cryptographically secure classifier on the remaining image patches. 1
6 0.85009539 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
7 0.84807837 179 nips-2006-Sparse Representation for Signal Classification
8 0.84697437 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
9 0.84288985 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
10 0.81162316 138 nips-2006-Multi-Task Feature Learning
11 0.80686742 82 nips-2006-Gaussian and Wishart Hyperkernels
12 0.80055773 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
13 0.79607296 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
14 0.7908228 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
15 0.78455687 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
16 0.77966857 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.77872157 75 nips-2006-Efficient sparse coding algorithms
18 0.7777226 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
19 0.77023518 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
20 0.76973748 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning