nips nips2006 nips2006-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. [sent-6, score-0.179]
2 Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. [sent-7, score-0.089]
3 We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. [sent-8, score-0.234]
4 For example, in multi-way classification problems with a structured label space, modern applications demand predictions on thousands of classes, and very large datasets become available. [sent-10, score-0.094]
5 If n and C denote dataset size and number of classes respectively, nonparametric kernel methods like SVMs or Gaussian processes typically scale superlinearly in n C, if dependencies between the latent class functions are properly represented. [sent-11, score-0.257]
6 Furthermore, most large scale kernel methods proposed so far refrain from solving the problem of learning hyperparameters (kernel or loss function parameters). [sent-12, score-0.22]
7 In this paper, we propose a general framework for learning in probabilistic kernel classification models. [sent-14, score-0.138]
8 While the basic model is standard, a major feature of our approach is the high computational efficiency with which the primary fitting (for fixed hyperparameters) is done, allowing us to deal with hundreds of classes and thousands of datapoints within a few minutes. [sent-15, score-0.199]
9 The primary fitting scales linearly in C, and depends on n mainly via a fixed number of matrix-vector multiplications (MVM) with n × n kernel matrices. [sent-16, score-0.33]
10 Furthermore, we optimize hyperparameters automatically by minimizing the cross-validation log likelihood, making use of our primary fitting technology as inner loop in order to compute the CV criterion and its gradient. [sent-18, score-0.354]
11 Our approach can be used to learn a large number of hyperparameters and does not need user interaction. [sent-19, score-0.082]
12 Our framework is generally applicable to structured label spaces, which we demonstrate here for hierarchical classification of text documents. [sent-20, score-0.294]
13 While the C latent class functions are fully dependent a priori, the scaling of our method stays within a factor of two compared to unstructured classification. [sent-22, score-0.075]
14 The primary fitting method is given in Section 2, the extension to hierarchical classification in Section 3. [sent-28, score-0.31]
15 We employ the multiple logistic regression model, consisting of C latent (unobserved) class functions $u_c$ feeding into the multiple logistic (or softmax) likelihood $P(y_{i,c}=1 \mid x_i, u_i) = e^{u_c(x_i)} / \sum_{c'} e^{u_{c'}(x_i)}$. [sent-46, score-0.447]
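As a quick illustration of the likelihood above, here is a minimal sketch of the softmax computation for a matrix of latent class function values; the function name and the (n, C) array layout are choices of this sketch, not taken from the paper.

```python
import numpy as np

def softmax_likelihood(U):
    """P(y_i = c | x_i, u_i) = exp(u_c(x_i)) / sum_{c'} exp(u_{c'}(x_i)).
    U is an (n, C) array holding u_c(x_i) for every case i and class c."""
    U = U - U.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)
```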
16 Since the likelihood depends on the $f_c$ only through $f_c(x_i)$, every minimizer of $\Phi$ must be a kernel expansion: $f_c = \sum_i \alpha_{i,c} K^{(c)}(\cdot, x_i)$ (representer theorem, see [4]). [sent-60, score-0.695]
17 The corresponding kernel expansions are $\hat{u}_c = \sum_i \hat{\alpha}_{i,c}\,(K^{(c)}(\cdot, x_i) + \sigma^2)$. [sent-67, score-0.286]
18 Estimates of the conditional probability on test points $x_*$ are obtained by plugging $\hat{u}_c(x_*)$ into the likelihood. [sent-68, score-0.133]
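A hedged sketch of this prediction step: evaluate the kernel expansion at the test points and plug the result into the softmax likelihood. The (m, n, C) layout for the test/train kernel evaluations and the function name are assumptions of this sketch.

```python
import numpy as np

def predict_proba(K_star, alpha_hat, sigma2=0.0):
    """u_c(x_*) = sum_i alpha_hat[i, c] * (K^{(c)}(x_*, x_i) + sigma^2),
    then softmax over classes.  K_star: (m, n, C) kernel evaluations,
    alpha_hat: (n, C) fitted coefficients."""
    U = np.einsum('mnc,nc->mc', K_star + sigma2, alpha_hat)
    U = U - U.max(axis=1, keepdims=True)   # stabilized softmax
    E = np.exp(U)
    return E / E.sum(axis=1, keepdims=True)
```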
19 We note that this setup can also be seen as a MAP approximation to a Bayesian model, where the $f_c$ are given independent Gaussian process priors, e.g. [sent-69, score-0.195]
20 The negative log multiple logistic likelihood has similar properties, but is smooth as a function of u, and the primary fitting of α does not require constrained convex optimization. [sent-73, score-0.272]
21 The complexity of our fitting algorithm is dominated by k1 (k2 + 2) matrix-vector multiplications with K , where k1 is the number of NR iterations, k2 the number of linear conjugate gradient (LCG) steps for computing each Newton direction. [sent-75, score-0.11]
22 3 Hierarchical Classification So far we have dealt with flat classification, the classes being independent a priori, with block-diagonal kernel matrix K. [sent-79, score-0.226]
23 Here we focus on hierarchical classification, where the label set {1, . . . , C} carries a known hierarchical (tree) structure. [sent-81, score-0.208]
24 In this Section, we propose a model for this setup and show how it can be dealt with in our framework with minor modifications and minor extra cost. [sent-86, score-0.153]
25 In flat classification, the latent class functions $u_c$ are modelled as a priori independent, in that the regularizer (which plays the role of a log prior) is a sum of individual terms for each $u_c$, without any coupling between the classes. [sent-87, score-0.331]
26 Assign a pair of latent functions $u_p$, $\breve{u}_p$ to each node, except the root. [sent-108, score-0.075]
27 The class functions to be fed into the likelihood are the $u_{L(c)}$ of the leaves. [sent-111, score-0.078]
28 For example, if leaves L(c), L(c') have the common parent p, then $u_{L(c)} = u_p + \breve{u}_{L(c)}$, $u_{L(c')} = u_p + \breve{u}_{L(c')}$, so the class functions share the effect $u_p$. [sent-113, score-0.181]
29 Under the hierarchical model, the class functions uL(c) are strongly dependent a priori. [sent-119, score-0.187]
30 We may represent this prior coupling in our framework by simply plugging in the implied kernel matrix K: $K = (\Phi_{L,\cdot} \otimes I)\,\breve{K}\,(\Phi_{L,\cdot}^T \otimes I)$ (2), where the inner $\breve{K}$ is block-diagonal. [sent-120, score-0.219]
31 K is not sparse and certainly not block-diagonal, but the important point is that we are still able to do kernel MVMs efficiently: pre- and postmultiplying by $\Phi$ is cheap, and $\breve{K}$ is block-diagonal just as in the flat case. [sent-121, score-0.138]
32 We note that the step from flat to hierarchical classification requires minor modifications of existing code only. [sent-122, score-0.231]
[Algorithm 1 (matrix-vector multiplication in the hierarchical case) is interleaved here in the original layout; its recoverable fragments are: "if $n_p > 0$ (p not a leaf node) then", "$y \leftarrow (y^T, y_p 1^T + x^T)^T$", "end for". A sketch follows this passage.]
33 This simplicity carries through to the hyperparameter learning case (see Section 4). [sent-130, score-0.109]
34 The cost of a kernel MVM is increased by at most a factor of two. [sent-135, score-0.138]
35 However, it would be wrong to claim that hierarchical classification in general comes as cheap as flat classification. [sent-138, score-0.19]
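The following is a minimal sketch of the hierarchical kernel MVM $Kx = (\Phi_{L,\cdot} \otimes I)\,\breve{K}\,(\Phi_{L,\cdot}^T \otimes I)\,x$ described above, written as a tree traversal. The parent-array encoding of the hierarchy, the per-node n × n kernel matrices, and all names are assumptions of this sketch, not the paper's actual Algorithm 1.

```python
import numpy as np

def hierarchical_kernel_mvm(x_leaf, parent, leaf_nodes, node_kernels):
    """x_leaf: (C, n) input, one n-vector per leaf class.
    parent[p]: parent of non-root node p, or None if its parent is the root.
    leaf_nodes[c]: node index of leaf L(c).  node_kernels[p]: (n, n) kernel matrix."""
    P, n = len(parent), x_leaf.shape[1]
    # (Phi^T (x) I) x: accumulate each leaf vector onto every node on its root path.
    z = np.zeros((P, n))
    for c, leaf in enumerate(leaf_nodes):
        p = leaf
        while p is not None:
            z[p] += x_leaf[c]
            p = parent[p]
    # Block-diagonal Kbreve: one independent kernel MVM per node.
    z = np.stack([node_kernels[p] @ z[p] for p in range(P)])
    # (Phi (x) I) z: each leaf class sums the contributions along its path.
    y = np.zeros_like(x_leaf)
    for c, leaf in enumerate(leaf_nodes):
        p = leaf
        while p is not None:
            y[c] += z[p]
            p = parent[p]
    return y
```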
36 The subtle issue is that the primary fitting becomes more costly, precisely because there is more coupling between the variables. [sent-139, score-0.155]
37 In the hierarchical case, both LCG and NR need more iterations to attain the same accuracy. [sent-142, score-0.155]
38 An important point for future research is to find a good preconditioning strategy for the system of Eq. [sent-145, score-0.084]
39 However, in all our experiments so far the fitting of the hierarchical model took less than twice the time required for the flat model on the same task. [sent-147, score-0.155]
40 4 Hyperparameter Learning In any model of interest, there will be free hyperparameters h, for example parameters of the kernels K (c) . [sent-149, score-0.13]
41 These were assumed to be fixed in the primary fitting method introduced in Section 2. [sent-150, score-0.155]
42 In this Section, we describe a scheme for learning h which makes use of the primary fitting algorithm as inner loop. [sent-151, score-0.196]
43 Recall that primary fitting consists of minimizing Φ of Eq. [sent-153, score-0.155]
44 A common remedy is to minimize the negative cross-validation log likelihood Ψ instead. [sent-162, score-0.118]
45 Here $J_k = \{1, \ldots, n\} \setminus I_k$; let $\Phi_{J_k} = u_{[J_k]}^T\big((1/2)\alpha_{[J_k]} - y_{J_k}\big) + 1^T l_{[J_k]}$ be the primary criterion on the subset $J_k$ of the data. [sent-169, score-0.198]
46 To this end, we keep the set $\{\alpha_{[J_k]}\}$ of primary variables, and iterate between re-fitting those for each fold $I_k$, and computing Ψ and $\nabla_h \Psi$. [sent-178, score-0.222]
47 The latter can be determined analytically, requiring us to solve a linear system with the Hessian matrix $I + V_{[J_k]}^T \tilde{K}_{J_k} V_{[J_k]}$ already encountered during primary fitting (see Section 5). [sent-179, score-0.228]
48 This means that the same LCG code used to compute Newton directions there can be applied here in order to compute the gradient of Ψ. [sent-180, score-0.113]
49 The update of the α[Jk ] requires q primary fitting applications, but since they are initialized with the previous values α[Jk ] , they do converge very rapidly, especially during later outer iterations. [sent-183, score-0.155]
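A sketch of the outer CV loop just described, with warm-started re-fits per fold; `fit_primary` and `neg_log_lik` stand in for the primary fitting routine and the held-out negative log likelihood, and both are assumptions of this sketch.

```python
def cv_criterion(h, folds, fit_primary, neg_log_lik, alpha_store):
    """Evaluate Psi(h): for each fold, re-fit alpha on J_k = {1..n} \ I_k,
    warm-starting from the previous alpha_[J_k], and score the held-out I_k."""
    psi = 0.0
    for k, (I_k, J_k) in enumerate(folds):
        alpha_store[k] = fit_primary(h, J_k, warm_start=alpha_store[k])  # q re-fits in total
        psi += neg_log_lik(h, alpha_store[k], I_k)                       # held-out score
    return psi
```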
50 The gradient computation decomposes into two parts: accumulation, and kernel derivative MVMs. [sent-185, score-0.218]
51 The accumulation part requires solving q systems of size $((q-1)/q)\,n\,C$, thus $q\,k_3$ kernel MVMs on the $\tilde{K}_{J_k}$ if linear conjugate gradients (LCG) is used, $k_3$ being the number of LCG steps. [sent-186, score-0.324]
52 Note that the accumulation step is independent of the number of hyperparameters. [sent-188, score-0.116]
53 The kernel derivative MVM part consists of q derivative MVM calls for each independent component of h, see Section 5. [sent-189, score-0.226]
54 As opposed to the accumulation part, this part consists of a simple large matrix operation and can be run very efficiently using specialized numerical linear algebra code. [sent-191, score-0.182]
55 As shown in Section 5, the extension of hyperparameter learning to the hierarchical case of Section 3 is simply done by wrapping the accumulation part, the coding and additional memory effort being minimal. [sent-192, score-0.437]
56 Since kernel MVMs are significantly more expensive than these O(n C) operations, the line searches basically come for free! [sent-204, score-0.138]
57 We choose search directions by Newton-Raphson (NR), since the Hessian of Φ is required anyway for hyperparameter learning. [sent-205, score-0.15]
58 The NR direction is obtained by solving this system approximately by the linear conjugate gradients (LCG) method, requiring an MVM with the system matrix in each iteration, thus a single MVM with K. [sent-214, score-0.177]
59 Our implementation includes diagonal preconditioning and numerical stability safeguards [8]. [sent-215, score-0.085]
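A minimal sketch of computing the NR direction with linear CG, touching the kernel only through MVMs. The diagonal stand-in for V and all names are assumptions of this sketch; the paper's system involves the full likelihood Hessian.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction(kernel_mvm, grad, v_diag, k2=20):
    """Approximately solve (I + V^T K V) d = -grad with k2 LCG steps,
    where V = diag(v_diag) is a simplified stand-in and kernel_mvm(w)
    returns K w without ever forming K explicitly."""
    N = grad.shape[0]
    matvec = lambda d: d + v_diag * kernel_mvm(v_diag * d)   # one kernel MVM per CG step
    A = LinearOperator((N, N), matvec=matvec, dtype=np.float64)
    d, _ = cg(A, -grad, maxiter=k2)
    return d
```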
60 We now show how to compute the gradient $\nabla_h \Psi$ for the CV criterion Ψ (Eq. [sent-219, score-0.079]
61 We finally mention some of the computational “tricks”, without which we could not have dealt with the largest tasks in Section 6. [sent-230, score-0.078]
62 For hyperparameter learning, we work on subsets $J_k$ and need MVMs with $\tilde{K}_{J_k}$. [sent-238, score-0.116]
63 We also share memory blocks of size n C between LCG, gradient accumulation, line searches, keeping the overall memory requirements at r n C for a small constant r, and avoiding frequent reallocations. [sent-242, score-0.136]
64 With many classes C, we may share kernels: K (c) = vc M (lc ) , vc > 0 variance parameters, M (l) independent correlation functions. [sent-246, score-0.488]
65 The linear kernel $K^{(c)}(x, x') = v_c\, x^T x'$ is frequently used for text classification (see Section 6. [sent-248, score-0.405]
66 If the data matrix X is sparse, kernel MVM can be done in much less than the generic $O(C\,n^2)$, typically in $O(C\,n)$, requiring O(n) storage for X only, even if the dimension of x is way beyond n. [sent-250, score-0.177]
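For the sparse linear-kernel case just mentioned, a short sketch of the MVM with cost proportional to the number of nonzeros of X; the variance parameter and names are assumptions of this sketch.

```python
import numpy as np
import scipy.sparse as sp

def linear_kernel_mvm(X, w, v_c=1.0):
    """MVM with K = v_c * X X^T without forming the n x n kernel matrix.
    X: sparse (n, d) data matrix, w: length-n vector; cost ~ O(nnz(X))."""
    return v_c * (X @ (X.T @ w))

# usage: X = sp.random(5000, 40000, density=1e-3, format="csr"); w = np.ones(5000)
# k_w = linear_kernel_mvm(X, w)
```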
67 If the $K^{(c)}$ are isotropic kernels (depending on $\|x - x'\|$ only) and the x are low-dimensional, MVM with $K^{(c)}$ can be approximated using specialized nearest neighbour data structures such as KD trees [12, 9]. [sent-251, score-0.091]
68 For general kernels whose kernel matrices have a rapidly decaying eigenspectrum, one can approximate MVM by using low-rank matrices instead of the K (c) [10], whence MVM is O(C n d), d the rank. [sent-253, score-0.244]
69 Note that (∂/∂ log vc )K (c) = K (c) , reducing to kernel MVM. [sent-255, score-0.393]
70 For isotropic kernels, $K^{(c)} = f(A)$, $a_{i,j} = \|x_i - x_j\|$, so $(\partial/\partial h_j) K^{(c)} = g_j(A)$. [sent-256, score-0.149]
71 The innermost vector operations work on contiguous chunks of memory, rather than strided ones, thus supporting caching or vector functions of the processor. [sent-258, score-0.082]
72 6 Experiments In this Section, we provide experimental results for our framework on data from remote sensing, and on a set of large text classification tasks with very many classes; the latter are hierarchical. [sent-259, score-0.144]
73 6.1 Flat Classification: Remote Sensing We use the satimage remote sensing task from the statlog repository. [sent-261, score-0.117]
74 We use the isotropic Gaussian (RBF) kernel $K^{(c)}(x, x') = v_c \exp\big(-\tfrac{w_c}{2}\|x - x'\|^2\big)$, $v_c, w_c > 0$, $x, x' \in \mathbb{R}^d$. [sent-264, score-0.699]
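A sketch of the RBF kernel in the parameterization assumed above (the exact placement of $w_c$ is a reconstruction, so treat it as an assumption):

```python
import numpy as np

def rbf_kernel(X, Z, v_c=1.0, w_c=1.0):
    """K^{(c)}(x, x') = v_c * exp(-(w_c / 2) * ||x - x'||^2) for all pairs.
    X: (m, d), Z: (n, d); returns the (m, n) kernel matrix."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return v_c * np.exp(-0.5 * w_c * sq)
```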
75 Note that while 1rest may also choose 12 independent kernel parameters, it does not make good use of this possibility, as opposed to mc-sep. [sent-277, score-0.169]
76 Our method outputs a predictive probability vector $p_j \in \mathbb{R}^C$ for each test case $x_j$. [sent-296, score-0.107]
77 The standard prediction $y(x_j) = \mathrm{argmax}_c\, p_{j,c}$ maximizes expected accuracy; classes are ranked as $r_j(c) \le r_j(c')$ iff $p_{j,c} \ge p_{j,c'}$. [sent-297, score-0.18]
78 The test scores are the same as in [1]: accuracy (acc) $m^{-1}\sum_j I_{\{y(x_j)=y_j\}}$, precision (prec) $m^{-1}\sum_j r_j(y_j)^{-1}$, parent accuracy (pacc) $m^{-1}\sum_j I_{\{\mathrm{par}(y(x_j))=\mathrm{par}(y_j)\}}$, par(c) being the parent of L(c). [sent-298, score-0.306]
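A sketch of these three test scores computed from the predictive probabilities; the leaf-to-parent encoding and the tie handling in the ranking are assumptions of this sketch.

```python
import numpy as np

def test_scores(p, y, parent_of_leaf):
    """p: (m, C) predictive probabilities, y: (m,) true class indices,
    parent_of_leaf: (C,) parent node of each leaf class."""
    m = len(y)
    y_hat = p.argmax(axis=1)
    acc = (y_hat == y).mean()
    # r_j(y_j): rank of the true class when classes are sorted by p_j (1 = best)
    ranks = (p >= p[np.arange(m), y][:, None]).sum(axis=1)
    prec = (1.0 / ranks).mean()
    pacc = (parent_of_leaf[y_hat] == parent_of_leaf[y]).mean()
    return acc, prec, pacc
```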
79 For taxo-loss and parent accuracy, it is better to choose $y(x_j)$ to minimize expected loss, different from the standard prediction. [sent-302, score-0.113]
80 F1: all vc shared (1); H1: vc shared across each level of the tree (3). [sent-304, score-0.544]
81 F2, H2: vc shared across each subtree rooted at root’s children (A: 15, B: 34, C: 17, D: 7, E: 7, F: 17, G: 12, H: 5). [sent-305, score-0.335]
82 For hyperparameter learning: $k_1 = 8$, $k_2 = 4$, $k_3 = 15$ (F1, F2); $k_1 = 10$, $k_2 = 4$, $k_3 = 25$ (H1, H2). [sent-307, score-0.116]
83 For parent accuracy, let p(j) be the node with maximal mass (under $p_j$) of its children which are leaves; then $y(x_j)$ must be a child of p(j). [sent-325, score-0.124]
84 taxo[0-1], pacc[0-1]: scores under the $\mathrm{argmax}_c\, p_{j,c}$ rule, rather than minimizing expected loss. [sent-521, score-0.089]
85 CV Fold: Re-optimization of α[J] , gradient accumulation for single fold. [sent-533, score-0.152]
86 The optimization is started from vc = 5 for all methods. [sent-535, score-0.222]
87 While the differences in accuracy and precision are hardly significant (as also found in [1]), they (partly) are in taxo-loss and parent accuracy. [sent-538, score-0.112]
88 H1 and H2 do not perform differently: choosing many different $v_c$ in the linear kernel seems to offer no advantage here (but see Section 6. [sent-540, score-0.36]
89 However, for our method, the recommendation in [1] to use $v_c = 1$ leads to significantly worse results in all scores; the $v_c$ chosen by our methods are generally larger. [sent-543, score-0.444]
90 In Table 2, we present running times for the final fitting and for a single fold during hyperparameter optimization (5 of them are required for Ψ, $\nabla_h \Psi$). [sent-544, score-0.183]
91 It is precisely this high efficiency of primary fitting which allows us to use it as inner loop for hyperparameter learning. [sent-546, score-0.312]
92 7 Discussion We presented a general framework for very efficient large scale kernel multi-way classification with structured label spaces and demonstrated its features on hierarchical text classification tasks with many classes. [sent-547, score-0.466]
93 As shown for the hierarchical case, the framework is easily extended to novel structured label spaces (footnote 10, truncated: "Processor time on 64bit 2."). [sent-548, score-0.155]
94 We solve the kernel parameter learning problem by optimizing the CV log likelihood, whose gradient can be computed within the framework. [sent-551, score-0.207]
95 An application to label sequence learning is work in progress, which may even be combined with a hierarchical prior. [sent-556, score-0.208]
96 Inferring a hierarchy from data is possible in principle, using expectation maximization techniques (note that the primary fitting can deal with target distributions $y_i$), as well as incorporating uncertain data. [sent-557, score-0.216]
97 Empirical Bayesian methods or approximate CV scores for hyperparameter learning have been proposed in [11, 3, 6], but they are orders of magnitude more expensive than our proposal here, and do not apply to a massive number of classes. [sent-558, score-0.155]
98 SVM methods typically do not come with efficient automatic kernel parameter learning schemes, and they do not provide estimates of predictive probabilities which are asymptotically correct. [sent-562, score-0.194]
99 Efficient kernel machines using the improved fast Gauss transform. [sent-640, score-0.138]
100 The latter kind, if feasible at all, often makes better use of hardware features such as caching and vector operations, and therefore is the preferred approach in numerical optimization. [sent-643, score-0.085]
wordName wordTfidf (topN-words)
[('mvm', 0.475), ('jk', 0.349), ('vc', 0.222), ('lcg', 0.2), ('nr', 0.199), ('hierarchical', 0.155), ('primary', 0.155), ('fc', 0.152), ('cv', 0.143), ('tting', 0.142), ('kernel', 0.138), ('hyperparameter', 0.116), ('accumulation', 0.116), ('ul', 0.105), ('dk', 0.105), ('mvms', 0.1), ('pacc', 0.1), ('taxo', 0.1), ('ik', 0.098), ('uc', 0.093), ('lh', 0.087), ('classi', 0.085), ('hyperparameters', 0.082), ('leafs', 0.075), ('parent', 0.074), ('newton', 0.069), ('fold', 0.067), ('remote', 0.065), ('cai', 0.065), ('np', 0.063), ('hierarchy', 0.061), ('ui', 0.057), ('predictive', 0.056), ('bc', 0.055), ('xi', 0.055), ('hessian', 0.053), ('label', 0.053), ('sensing', 0.052), ('xj', 0.051), ('children', 0.05), ('acc', 0.05), ('argmaxc', 0.05), ('cacheing', 0.05), ('euc', 0.05), ('prec', 0.05), ('preconditioning', 0.05), ('memory', 0.05), ('kernels', 0.048), ('likelihood', 0.046), ('leaf', 0.046), ('par', 0.046), ('text', 0.045), ('classes', 0.044), ('derivative', 0.044), ('dealt', 0.044), ('yj', 0.044), ('latent', 0.043), ('code', 0.043), ('setup', 0.043), ('criterion', 0.043), ('isotropic', 0.043), ('rj', 0.043), ('inner', 0.041), ('structured', 0.041), ('plugging', 0.04), ('svm', 0.04), ('requiring', 0.039), ('minimize', 0.039), ('scores', 0.039), ('cation', 0.039), ('accuracy', 0.038), ('logistic', 0.038), ('anova', 0.037), ('kd', 0.037), ('multiplications', 0.037), ('wc', 0.037), ('conjugate', 0.037), ('priori', 0.037), ('gradient', 0.036), ('numerical', 0.035), ('root', 0.035), ('cheap', 0.035), ('seeger', 0.035), ('tree', 0.034), ('directions', 0.034), ('tasks', 0.034), ('system', 0.034), ('minor', 0.033), ('shared', 0.033), ('section', 0.033), ('gradients', 0.033), ('log', 0.033), ('functions', 0.032), ('removal', 0.032), ('hj', 0.032), ('opposed', 0.031), ('final', 0.03), ('rooted', 0.03), ('matrices', 0.029), ('details', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
2 0.15989268 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
Author: Yee W. Teh, David Newman, Max Welling
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
3 0.14980027 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.12924239 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
5 0.12792408 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
6 0.09489242 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
7 0.090283722 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
8 0.082443103 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
9 0.082248725 186 nips-2006-Support Vector Machines on a Budget
10 0.082207836 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
11 0.079678431 115 nips-2006-Learning annotated hierarchies from relational data
12 0.074201465 116 nips-2006-Learning from Multiple Sources
13 0.071183547 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
14 0.069881633 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
15 0.068072192 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
16 0.067516588 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
17 0.065456547 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects
18 0.064272992 138 nips-2006-Multi-Task Feature Learning
19 0.063229837 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
20 0.063117623 146 nips-2006-No-regret Algorithms for Online Convex Programs
topicId topicWeight
[(0, -0.242), (1, 0.069), (2, 0.016), (3, -0.018), (4, -0.0), (5, 0.115), (6, 0.059), (7, 0.002), (8, 0.001), (9, -0.015), (10, -0.135), (11, 0.036), (12, 0.043), (13, -0.196), (14, 0.012), (15, 0.092), (16, -0.026), (17, 0.038), (18, -0.063), (19, -0.002), (20, -0.074), (21, 0.068), (22, -0.074), (23, 0.044), (24, -0.057), (25, 0.002), (26, 0.028), (27, -0.074), (28, 0.014), (29, 0.075), (30, 0.087), (31, 0.071), (32, 0.039), (33, -0.131), (34, -0.008), (35, -0.002), (36, 0.12), (37, 0.053), (38, 0.001), (39, -0.079), (40, 0.106), (41, 0.057), (42, -0.092), (43, 0.003), (44, 0.052), (45, 0.059), (46, 0.054), (47, -0.081), (48, 0.1), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.92262733 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
2 0.62242085 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
Author: Jerónimo Arenas-garcía, Kaare B. Petersen, Lars K. Hansen
Abstract: In this paper we are presenting a novel multivariate analysis method. Our scheme is based on a novel kernel orthonormalized partial least squares (PLS) variant for feature extraction, imposing sparsity constrains in the solution to improve scalability. The algorithm is tested on a benchmark of UCI data sets, and on the analysis of integrated short-time music features for genre prediction. The upshot is that the method has strong expressive power even with rather few features, is clearly outperforming the ordinary kernel PLS, and therefore is an appealing method for feature extraction of labelled data. 1
3 0.60184765 82 nips-2006-Gaussian and Wishart Hyperkernels
Author: Risi Kondor, Tony Jebara
Abstract: We propose a new method for constructing hyperkenels and define two promising special cases that can be computed in closed form. These we call the Gaussian and Wishart hyperkernels. The former is especially attractive in that it has an interpretable regularization scheme reminiscent of that of the Gaussian RBF kernel. We discuss how kernel learning can be used not just for improving the performance of classification and regression methods, but also as a stand-alone algorithm for dimensionality reduction and relational or metric learning. 1
4 0.5999642 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
5 0.58744138 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
6 0.58066928 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
7 0.55595976 169 nips-2006-Relational Learning with Gaussian Processes
8 0.51399976 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
9 0.50759351 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
10 0.5028007 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
11 0.49781632 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
12 0.49132228 65 nips-2006-Denoising and Dimension Reduction in Feature Space
13 0.48762155 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
14 0.48682645 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
15 0.46664083 186 nips-2006-Support Vector Machines on a Budget
16 0.46060839 156 nips-2006-Ordinal Regression by Extended Binary Classification
17 0.45838028 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
18 0.43568623 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
19 0.42654958 129 nips-2006-Map-Reduce for Machine Learning on Multicore
20 0.42461646 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
topicId topicWeight
[(1, 0.568), (3, 0.02), (7, 0.068), (9, 0.047), (20, 0.013), (22, 0.052), (44, 0.038), (57, 0.043), (65, 0.047), (69, 0.025)]
simIndex simValue paperId paperTitle
1 0.99650091 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
2 0.99253005 46 nips-2006-Blind source separation for over-determined delayed mixtures
Author: Lars Omlor, Martin Giese
Abstract: Blind source separation, i.e. the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and specifically for the case of dimension reduction almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a timefrequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, and also a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms. 1
same-paper 3 0.98767591 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Author: Matthias Seeger
Abstract: We propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work. 1
4 0.98088688 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
Author: Tobias Sing, Niko Beerenwinkel
Abstract: Starting with the work of Jaakkola and Haussler, a variety of approaches have been proposed for coupling domain-specific generative models with statistical learning methods. The link is established by a kernel function which provides a similarity measure based inherently on the underlying model. In computational biology, the full promise of this framework has rarely ever been exploited, as most kernels are derived from very generic models, such as sequence profiles or hidden Markov models. Here, we introduce the MTreeMix kernel, which is based on a generative model tailored to the underlying biological mechanism. Specifically, the kernel quantifies the similarity of evolutionary escape from antiviral drug pressure between two viral sequence samples. We compare this novel kernel to a standard, evolution-agnostic amino acid encoding in the prediction of HIV drug resistance from genotype, using support vector regression. The results show significant improvements in predictive performance across 17 anti-HIV drugs. Thus, in our study, the generative-discriminative paradigm is key to bridging the gap between population genetic modeling and clinical decision making. 1
5 0.96261919 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
Author: Shai Avidan, Moshe Butman
Abstract: Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure MultiParty computations use cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute and we introduce a couple of machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first method is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are enough for classifying an image patch, but not enough to actually reconstruct it. The second machine learning technique is active learning that allows Alice to construct an online classifier, based on a small number of calls to Bob’s face detector. She can then use her online classifier as a fast rejector before using a cryptographically secure classifier on the remaining image patches. 1
6 0.85009539 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
7 0.84807837 179 nips-2006-Sparse Representation for Signal Classification
8 0.84697437 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
9 0.84288985 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
10 0.81162316 138 nips-2006-Multi-Task Feature Learning
11 0.80686742 82 nips-2006-Gaussian and Wishart Hyperkernels
12 0.80055773 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
13 0.79607296 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
14 0.7908228 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
15 0.78455687 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
16 0.77966857 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.77872157 75 nips-2006-Efficient sparse coding algorithms
18 0.7777226 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
19 0.77023518 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
20 0.76973748 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning