nips nips2012 nips2012-284 knowledge-graph by maker-knowledge-mining

284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging


Source: pdf

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q ⪰ 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values < 10−3). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. [sent-7, score-0.174]

2 Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. [sent-8, score-0.382]

3 We show that by substituting the norm penalty with an arbitrary quadratic function Q ⪰ 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. [sent-11, score-0.167]

4 This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. [sent-12, score-0.045]

5 We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. [sent-13, score-0.057]

6 The design of a kernel (feature pre-processing) may involve using different sets of extracted features, dimensionality reductions, or parameterizations of the kernel functions. [sent-19, score-0.33]
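
The following is an illustrative sketch (not from the paper) of how such design alternatives (feature subsets, a dimensionality reduction, and different kernel functions) each yield a distinct base kernel matrix; the data shapes, subset boundaries, and kernel parameters are hypothetical.

```python
# Illustrative only: each preprocessing choice -- feature subset, dimensionality
# reduction, kernel function/parameter -- produces a distinct base kernel for MKL.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((114, 500))          # e.g., 114 subjects x 500 voxel features

def normalize_trace(K):
    """Scale a kernel matrix so that Tr(K) = 1 (one common normalization)."""
    return K / np.trace(K)

kernels = []
for feats in (slice(0, 250), slice(250, 500)):           # two feature subsets
    Xs = X[:, feats]
    Xp = PCA(n_components=20).fit_transform(Xs)          # a dimensionality reduction
    for K in (linear_kernel(Xs),
              polynomial_kernel(Xs, degree=2),
              rbf_kernel(Xp, gamma=0.05)):                # different kernel functions
        kernels.append(normalize_trace(K))

print(len(kernels), "base kernels, each of shape", kernels[0].shape)
```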

7 Each of these alternatives produces a distinct kernel matrix. [sent-20, score-0.165]

8 Rather than selecting a single kernel, MKL offers the flexibility of specifying a large set of kernels corresponding to the many options (i.e., the design alternatives above). [sent-25, score-0.321]

9 In the context of our primary motivating application, the current state of the art in multi-modality neuroimaging-based Alzheimer’s Disease (AD) prediction [10] is achieved by multi-kernel methods [3, 4], where each imaging modality spawns a kernel, or set of kernels. [sent-31, score-0.118]

10 In allowing the user to specify an arbitrary number of base kernels for combination, MKL provides more expressive power, but this comes with the responsibility to regularize the kernel mixing coefficients so that the classifier generalizes well. [sent-32, score-0.689]

11 While the importance of this regularization cannot be overstated, it is also a fact that commonly used p-norm regularizers operate on kernels separately, without explicitly acknowledging dependencies and interactions among them. [sent-33, score-0.447]

12 To see how such dependencies can arise in practice, consider our neuroimaging learning problem of interest: the task of learning to predict the onset of AD. [sent-34, score-0.148]

13 The kernels K1 , . . . , KM are derived from several different medical imaging modalities (MRI; PET), image processing methods (morphometric; anatomical modelling), and kernel functions (linear; RBF). [sent-38, score-0.275]

14 Some features may be shared between kernels, or kernel functions may use similar parameters. [sent-39, score-0.165]

15 As a result we expect the kernels’ behaviors to exhibit some correlational, or other cluster structure according to how they were constructed. [sent-40, score-0.059]

16 Ideally, the regularizer should reflect these dependencies encoded by Q, as they can significantly impact the learning characteristics of a linearly combined kernel. [sent-44, score-0.094]

17 Instead, rather than penalizing covariances or inducing sparsity among groups of kernels, it may be beneficial to reward such covariances, so as to better reflect a latent cluster structure between kernels. [sent-48, score-0.053]

18 The development of MKL methods began with [5], which showed that the problem of learning the right kernel for an input problem instance could be formulated as a Semi-Definite Program (SDP). [sent-53, score-0.165]

19 To this end, the model in [5] was shown to be solvable as a Second Order Cone Program [12], a Semi-Infinite Linear Program [6], and via gradient descent methods in the dual and primal [7, 13]. [sent-55, score-0.062]

20 A non-linear “hyperkernel” method was proposed in [15], which implicitly maps the kernels themselves to an RKHS; however, this method is computationally very demanding (it has 4th-order interactions among training examples). [sent-58, score-0.345]

21 The authors of [16] proposed to first select the sub-kernel weights by minimizing an objective function derived from Normalized Cuts, and subsequently train an SVM on the combined kernel. [sent-59, score-0.052]

22 Contemporary to these results, [18] showed that if a large number of kernels had a desirable shared structure (e. [sent-61, score-0.321]

23 Recently in [8], a set of base classifiers were first trained using each kernel and were then boosted to produce a strong multi-class classifier. [sent-64, score-0.263]

24 Adding kernels corresponds to taking a direct sum of Reproducing Kernel Hilbert spaces (RKHSs), and scaling a kernel by a constant c scales the axes of its RKHS by √c. [sent-72, score-0.486]
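
A quick numeric check of this scaling fact for a linear kernel (illustrative only): multiplying $K = XX^T$ by c is the same as scaling the feature axes by $\sqrt{c}$, so squared RKHS norms $K_{ii}$ scale by c.

```python
# Scaling a linear kernel by c <=> scaling the features by sqrt(c).
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((5, 3))
c = 4.0
K = X @ X.T
X_scaled = np.sqrt(c) * X
print(np.allclose(c * K, X_scaled @ X_scaled.T))      # True: same Gram matrix
print(np.allclose(np.diag(c * K), c * np.diag(K)))    # squared norms scale by c
```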

25 In the MKL setting, the SVM margin regularizer $\frac{1}{2}\|w\|^2$ becomes a weighted sum $\frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{\mathcal{H}_m}^2}{\beta_m}$ of contributions from the RKHSs $\mathcal{H}_1, \ldots$ [sent-73, score-0.274]

26 A norm penalty on β ensures that the units in which the margin is measured are meaningful (provided the base kernels are normalized). [sent-77, score-0.561]

27 The MKL primal problem is given as $\min_{w,b,\beta\ge 0,\xi\ge 0}\; \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{\mathcal{H}_m}^2}{\beta_m} + C\sum_{i=1}^{n}\xi_i + \|\beta\|_p^2$ s.t. [sent-78, score-0.213]

28 $y_i\left(\sum_m \langle w_m, \phi_m(x_i)\rangle_{\mathcal{H}_m} + b\right) \ge 1 - \xi_i$,  (1)  where $\phi_m(x)$ is the (potentially unknown) transformation from the original data space to the $m$th RKHS $\mathcal{H}_m$. [sent-80, score-0.216]

29 As in SVMs, we turn to the dual problem to see the role of kernels: $\max_{0\le\alpha\le C}\; \alpha^T\mathbf{1} - \frac{1}{2}\|G\|_q$, with $G \in \mathbb{R}^M$, $G_m = (\alpha \circ y)^T K_m (\alpha \circ y)$,  (2)  where $\circ$ denotes element-wise multiplication and the dual q-norm follows the identity $\frac{1}{p} + \frac{1}{q} = 1$. [sent-81, score-0.07]
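
As a concrete, hedged illustration of the dual quantities in (2), the sketch below computes $G_m = (\alpha \circ y)^T K_m (\alpha \circ y)$ and the dual objective for randomly generated PSD kernels and a stand-in $\alpha$; in practice $\alpha$ comes from the SVM solver, and all sizes here are arbitrary.

```python
# Sketch of the dual quantities in Eq. (2); random alpha is for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n, M, C, p = 60, 5, 1.0, 2.0
q = p / (p - 1.0)                                   # dual exponent: 1/p + 1/q = 1
y = np.sign(rng.standard_normal(n))
Ks = [(lambda A: A @ A.T)(rng.standard_normal((n, 8))) for _ in range(M)]  # PSD kernels
alpha = rng.uniform(0.0, C, size=n)                 # stand-in for SVM dual variables

ay = alpha * y                                      # element-wise product (alpha o y)
G = np.array([ay @ K @ ay for K in Ks])             # one entry per base kernel
dual_obj = alpha.sum() - 0.5 * np.linalg.norm(G, ord=q)
print("G =", np.round(G, 3), " dual objective =", round(dual_obj, 3))
```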

30 Note that the primal norm penalty $\|\beta\|_p^2$ becomes a dual norm on the vector $G$. [sent-82, score-0.121]

31 At optimality, $w_m = \beta_m(\alpha \circ y)^T \phi_m(X)$, and so the vector $G$, with entries $G_m = (\alpha \circ y)^T K_m (\alpha \circ y) = \frac{\|w_m\|_{\mathcal{H}_m}^2}{\beta_m^2}$, collects the scaled classifier norms. [sent-83, score-0.372]

32 This shows that the dual norm is tied to how MKL measures the margin in each RKHS. [sent-84, score-0.132]

33 The key characteristic of Q-MKL is that the standard (squared) p-norm penalty on β, along with the corresponding dual-norm penalty in (2), is substituted with a more general class of quadratic penalty functions, expressed as $\beta^T Q \beta = \|\beta\|_Q^2$. [sent-86, score-0.179]

34 $\|\beta\|_Q^2 = \beta^T Q \beta$ is a (squared) Mahalanobis (matrix-induced) norm so long as $Q \succeq 0$. [sent-87, score-0.049]
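
A minimal numeric sketch of the matrix-induced penalty, assuming a generic PSD Q (all values here are illustrative); it also checks that Q = I recovers the familiar squared 2-norm.

```python
# Minimal check of the matrix-induced penalty ||beta||_Q^2 = beta^T Q beta.
import numpy as np

rng = np.random.default_rng(2)
M = 6
A = rng.standard_normal((M, M))
Q = A @ A.T + 1e-6 * np.eye(M)         # a generic PSD (here PD) matrix
beta = np.abs(rng.standard_normal(M))  # mixing weights, beta >= 0

assert np.all(np.linalg.eigvalsh(Q) >= 0), "Q must be PSD"
penalty = beta @ Q @ beta              # reduces to ||beta||_2^2 when Q = I
print("beta^T Q beta =", round(penalty, 4))
print("2-norm special case:", np.isclose(beta @ np.eye(M) @ beta, np.sum(beta**2)))
```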

35 In this framework, the burden of choosing a kernel is deferred to a choice of Q-function. [sent-88, score-0.19]

36 $y_i\left(\sum_m \langle w_m, \phi_m(x_i)\rangle_{\mathcal{H}_m} + b\right) \ge 1 - \xi_i$,  (3)  where the last objective term provides a bias relative to $\beta^T Q \beta$. [sent-92, score-0.186]

37 [19] derived a theoretical generalization error bound on kernel combinations which depends on the degree of redundancy between support vectors in SVMs trained on base kernels individually. [sent-100, score-0.584]

38 Using this type of correlational structure, we can derive a Q function between kernels to automatically select a combination of kernels which will maximize this bound. [sent-101, score-0.711]

39 We first show that as $Q^{-1}$'s eigenvalues decay, so do the traces of the virtual kernels. [sent-106, score-0.335]

40 Assuming $Q^{-1}$ has a bounded, non-uniform spectrum, this property can then be used to analyze (and bound) Q-MKL's Rademacher complexity, which has been shown to depend on the traces of the base kernels. [sent-107, score-0.204]

41 We then offer a few observations on how Q−1 ’s Renyi entropy [20] relates to these learning theoretic bounds. [sent-108, score-0.125]

42 First, observe that because Q's eigenvectors provide an orthonormal basis of $\mathbb{R}^M$, any $\beta \in \mathbb{R}^M$ can be expressed as a linear combination in this basis with $\gamma$ as its coefficients: $\beta = \sum_i \gamma_i v_i = V\gamma$. [sent-111, score-0.091]

43 Each eigenvector $v_i$ of Q can be used to define a linear combination of kernels, which we will refer to as the virtual kernel $K_i = \sum_m v_i(m) K_m$. [sent-113, score-0.545]

44 If $K_i \succeq 0,\ \forall i$, then Q-MKL is equivalent to 2-norm MKL using virtual kernels instead of base kernels. [sent-117, score-0.648]

45 (4) and $K^* = \sum_m \beta_m K_m = \sum_m \sum_i \gamma_i v_i(m) K_m = \sum_i \mu_i \lambda_i^{-1/2} \sum_m v_i(m) K_m = \sum_i \mu_i K_i$, where $K_i = \lambda_i^{-1/2} \sum_m v_i(m) K_m$ is the $i$th virtual kernel. [sent-121, score-0.409]
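
The sketch below numerically verifies this construction under one consistent convention (assumed here, not quoted from the paper): with $Q = V\,\mathrm{diag}(\lambda)\,V^T$, set $\mu_i = \sqrt{\lambda_i}\,(V^T\beta)_i$ and $K_i = \lambda_i^{-1/2}\sum_m v_i(m) K_m$; then $\sum_m \beta_m K_m = \sum_i \mu_i K_i$ and $\beta^T Q \beta = \|\mu\|_2^2$.

```python
# Verify the virtual-kernel identity and the Q-penalty / 2-norm equivalence.
import numpy as np

rng = np.random.default_rng(3)
n, M = 40, 5
Ks = np.stack([(lambda A: A @ A.T)(rng.standard_normal((n, 6))) for _ in range(M)])
beta = np.abs(rng.standard_normal(M))

A = rng.standard_normal((M, M))
Q = A @ A.T + 0.1 * np.eye(M)                        # a PD matrix Q
lam, V = np.linalg.eigh(Q)                           # Q = V diag(lam) V^T, v_i = V[:, i]

gamma = V.T @ beta                                   # coefficients of beta in Q's eigenbasis
mu = np.sqrt(lam) * gamma                            # weights on the virtual kernels
K_virtual = np.einsum('mi,mab->iab', V, Ks) / np.sqrt(lam)[:, None, None]

K_star_base = np.einsum('m,mab->ab', beta, Ks)       # sum_m beta_m K_m
K_star_virt = np.einsum('i,iab->ab', mu, K_virtual)  # sum_i mu_i K_i
print("combined kernels match:", np.allclose(K_star_base, K_star_virt))
print("beta^T Q beta == ||mu||^2:", np.isclose(beta @ Q @ beta, np.sum(mu**2)))
```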

46 The learned kernel K ∗ is a weighted combination of virtual kernels, and the coefficients are regularized under a squared 2-norm. [sent-122, score-0.425]

47 We first state a theorem from [21], which relates the Rademacher complexity of MKL to the traces of its base kernels. [sent-125, score-0.286]

48 ([21]) The empirical Rademacher complexity on a sample set S of size n, with M base kernels, is given as follows (with $\eta_0 = \frac{23}{22}$): $R_S(H_p^M) \le \sqrt{\frac{\eta_0\, q\, \|u\|_q}{n}}$,  (5)  where $u = [\mathrm{Tr}(K_1), \cdots, \mathrm{Tr}(K_M)]^T$ and $\frac{1}{p} + \frac{1}{q} = 1$. [sent-127, score-0.46]

49 The bound in (5) shows that the Rademacher complexity $R_S(\cdot)$ depends on $\|u\|_q$, a norm on the base kernels' traces. [sent-128, score-0.188]

50 However, in Q-MKL the virtual kernel traces are not equal, and are in fact given by $\mathrm{Tr}(K_i) = \frac{\mathbf{1}^T v_i}{\sqrt{\lambda_i}}$. [sent-130, score-0.656]

51 With this expression for the traces of the virtual kernels, we can now prove that the bound given in (5) is strictly decreased as long as the eigenvalues $\psi_i$ of $Q^{-1}$ are in the range (0, 1]. [sent-131, score-0.335]

52 By Lemma 1, we have that the bound in (5) will decrease if $\|u\|_2$, the norm on the virtual kernel traces, decreases. [sent-136, score-0.443]

53 As shown above, the virtual kernel traces are given as $\mathrm{Tr}(K_i) = \sqrt{\psi_i}\,\mathbf{1}^T v_i$, meaning that $\|u\|_2^2 = \sum_i^{N} \psi_i (\mathbf{1}^T v_i)^2 = \sum_i^{N} \psi_i\, \mathbf{1}^T v_i v_i^T \mathbf{1} = \mathbf{1}^T Q^{-1} \mathbf{1}$. [sent-137, score-0.74]
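
These identities can be checked numerically. The sketch below assumes each base kernel is normalized so that $\mathrm{Tr}(K_m) = 1$, which is what makes $\mathrm{Tr}(K_i) = \mathbf{1}^T v_i / \sqrt{\lambda_i}$ hold exactly; the matrices are random stand-ins.

```python
# Check Tr(K_i) = sqrt(psi_i) * 1^T v_i and sum_i psi_i (1^T v_i)^2 = 1^T Q^{-1} 1,
# assuming trace-normalized base kernels (Tr(K_m) = 1) and psi_i = 1 / lambda_i.
import numpy as np

rng = np.random.default_rng(4)
n, M = 30, 5
Ks = []
for _ in range(M):
    A = rng.standard_normal((n, 6))
    K = A @ A.T
    Ks.append(K / np.trace(K))                       # Tr(K_m) = 1

B = rng.standard_normal((M, M))
Q = B @ B.T + 0.1 * np.eye(M)
lam, V = np.linalg.eigh(Q)
psi = 1.0 / lam                                      # eigenvalues of Q^{-1}
ones = np.ones(M)

traces = np.array([np.trace(sum(V[m, i] * Ks[m] for m in range(M))) / np.sqrt(lam[i])
                   for i in range(M)])
predicted = np.sqrt(psi) * (ones @ V)                # sqrt(psi_i) * 1^T v_i
print("virtual traces match:", np.allclose(traces, predicted))
print("sum_i psi_i (1^T v_i)^2 == 1^T Q^{-1} 1:",
      np.isclose(np.sum(psi * (ones @ V)**2), ones @ np.linalg.inv(Q) @ ones))
```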

54 Q-MKL is equivalent to the following model: $\min_{w,b,\mu,\xi\ge 0}\; \frac{1}{2}\sum_{m=1}^{M}\frac{\|w_m\|_{\mathcal{V}_m}^2}{\mu_m} + C\sum_{i=1}^{n}\xi_i + \|\mu\|_2^2$ s.t. [sent-147, score-0.186]

55 $y_i\left(\sum_m \langle w_m, \phi_m(x_i)\rangle_{\mathcal{V}_m} + b\right) \ge 1 - \xi_i$, $\; Q^{-\frac{1}{2}}\mu \ge 0$,  (6)  where $\phi_m(\cdot)$ is the feature transform mapping the data space to the $m$th virtual kernel's RKHS, denoted $\mathcal{V}_m$. [sent-149, score-0.445]

56 While the virtual kernels themselves may be indefinite, recall that $\mu = Q^{\frac{1}{2}}\beta$, and so the constraint $Q^{-\frac{1}{2}}\mu \ge 0$ is equivalent to $\beta \ge 0$, guaranteeing that the combined kernel will be p.s.d. [sent-150, score-0.74]

57 Renyi entropy [20] significantly generalizes the usual notion of Shannon entropy [22, 23, 24], has applications in Statistics and many other fields, and has recently been proposed as an alternative to PCA [22]. [sent-155, score-0.213]

58 Thm. 2 points to an intuitive explanation of where the benefit of a Q regularizer comes from as well: analyzing the Renyi entropy of the distribution over kernels defined by $Q^{-1}$, treating $Q^{-1}$ as a kernel density estimator. [sent-157, score-0.61]

59 The quadratic Renyi entropy of a probability measure is given as $H(p) = -\log \int p^2(x)\,dx$. [sent-158, score-0.128]

60 If we treat $Q^{-1}$ this way (as in [15]), then with some normalization we can derive an estimate $\hat{p}$ of the underlying probability, which is a distribution over base kernels. [sent-164, score-0.098]
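
A small, hedged sketch of how the quadratic Renyi entropy can be estimated from samples with a Gaussian Parzen window; the identity $\int \hat{p}^2(x)\,dx = \frac{1}{N^2}\sum_{ij} k(x_i, x_j)$ used here is the same one that lets $Q^{-1}$ play the role of a Gram matrix below. The bandwidth, sample size, and distribution are arbitrary illustrations, not values from the paper.

```python
# Parzen-window estimate of H2(p) = -log int p^2(x) dx, using the Gaussian identity
# int N(x; x_i, h^2) N(x; x_j, h^2) dx = N(x_i; x_j, 2 h^2).
import numpy as np

def renyi2_parzen(x, h):
    d2 = (x[:, None] - x[None, :])**2
    k = np.exp(-d2 / (4 * h**2)) / np.sqrt(4 * np.pi * h**2)   # N(x_i; x_j, 2h^2)
    return -np.log(k.sum() / len(x)**2)

rng = np.random.default_rng(7)
sigma = 1.5
x = rng.normal(0.0, sigma, size=2000)
est = renyi2_parzen(x, h=0.3)
exact = np.log(2 * sigma * np.sqrt(np.pi))     # closed-form H2 of N(0, sigma^2)
print(round(est, 3), "vs exact", round(exact, 3))
```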

61 We can then interpret its Renyi entropy as a complexity measure on the space of combined kernels. [sent-165, score-0.15]

62 A result in [23] relates the virtual kernel traces to the Renyi entropy estimator of $Q^{-1}$ as $\int \hat{p}^2(x)\,dx = \frac{1}{N^2}\mathbf{1}^T Q^{-1} \mathbf{1}$,¹ which leads to a nice connection to Thm. [sent-168, score-0.625]

63 An unstructured $Q^{-1}$ (i.e., 2-norm MKL) has maximal Renyi entropy because it is maximally uninformative; adding structure to $Q^{-1}$ concentrates $\hat{p}$, reducing both its Renyi entropy and Rademacher complexity together. [sent-172, score-0.125]

64 Thm. 2 relies on decreasing a norm on the virtual kernel traces, which we now see relates directly to the Renyi entropy of $Q^{-1}$, as well as to decreasing the Rademacher complexity of the search space of combined kernels. [sent-175, score-0.634]

65 It is even possible that by directly analyzing Renyi entropy in a multi-kernel setting, this conjecture may be useful in deriving analogous bounds in, e.g. [sent-176, score-0.084]

66 Indefinite Kernel Learning [25], because the virtual kernels are indefinite in general. [sent-178, score-0.059]

67 2 Special Cases: Q-SVM and relative margin. Before describing our optimization strategy, we discuss several variations on the Q-MKL model. [sent-180, score-0.048]

68 If each base kernel operates on a single feature (i.e., singleton features), then each coefficient $\beta_m$ effectively becomes a feature weight, and a 2-norm penalty on β is a penalty on weights. [sent-185, score-0.09]

69 Several interesting extensions to the SVM and MKL frameworks focus on relative margin methods [28, 29], which maximize the margin relative to the spread of the data. [sent-190, score-0.12]

70 Q-MKL generalizes $\|\beta\|_p^2$ to arbitrary convex quadratic functions, while the feasible set is the same as for MKL. [sent-196, score-0.116]

71 We may consider this process as a composition of two modules: one which solves for the SVM dual parameters (α) with fixed β, and the other which solves for β with fixed α. (Footnote 1: note that this involves a Gaussian assumption, but [24] provides extensions to non-Gaussian kernels.) [sent-199, score-0.059]
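
A hedged sketch of such a two-module alternation is given below. It is not the paper's exact solver: the β step is a simple projected-gradient-style update that clips at zero and renormalizes onto $\beta^T Q \beta = 1$ (a normalization heuristic, not an exact projection), and the data, C, step size, and iteration count are all illustrative.

```python
# Two-module alternation (sketch): module 1 fits an SVM on K(beta) = sum_m beta_m K_m;
# module 2 updates beta for fixed alpha using the SimpleMKL-style gradient.
import numpy as np
from sklearn.svm import SVC

def qmkl_alternate(Ks, y, Q, C=1.0, step=0.05, iters=50):
    M, n, _ = Ks.shape
    beta = np.ones(M)
    beta /= np.sqrt(beta @ Q @ beta)                 # start on the Q-ball
    for _ in range(iters):
        # Module 1: SVM dual variables for the current combined kernel.
        K = np.einsum('m,mab->ab', beta, Ks)
        clf = SVC(C=C, kernel='precomputed').fit(K, y)
        ay = np.zeros(n)
        ay[clf.support_] = clf.dual_coef_[0]         # entries are alpha_i * y_i
        # Module 2: gradient of the MKL objective w.r.t. beta_m is -0.5 * ay^T K_m ay.
        grad = -0.5 * np.array([ay @ Km @ ay for Km in Ks])
        beta = np.maximum(beta - step * grad, 0.0)   # gradient step, then clip at zero
        beta /= np.sqrt(beta @ Q @ beta)             # renormalize onto beta^T Q beta = 1
    return beta

# Toy usage with made-up data.
rng = np.random.default_rng(5)
n, M = 80, 4
X = rng.standard_normal((n, 10))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(n))
Ks = np.stack([np.outer(X[:, j], X[:, j]) + 1e-6 * np.eye(n) for j in range(M)])
Q = np.eye(M)
print(np.round(qmkl_alternate(Ks, y, Q), 3))
```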

72 Notice that (8) has a sum of ratios with optimization variables in the denominator, while the constraint is quadratic – this means that standard convex optimization toolkits may not be able to solve this problem without significant reformulation from its canonical form in (8). [sent-205, score-0.071]

73 Writing the gradient in terms of the Lagrange multiplier δ, and setting it equal to 0, gives: $\frac{\|w_m\|_{\mathcal{H}_m}^2}{\beta_m^2} - \delta(Q\beta)_m = 0,\; \forall m \in \{1, \cdots, M\}$. [sent-207, score-0.186]

74 Left (a-b): weights given by a standard SVM; Right (c-d): weights given by Q-SVM . [sent-232, score-0.054]

75 Experiments. We performed extensive experiments to validate Q-MKL, examine the effect it has on β, and assess its advantages in the context of our motivating neuroimaging application. [sent-236, score-0.143]

76 Our focus on a practical application is intended as a demonstration of how domain knowledge can be seamlessly incorporated into a learning model, giving significant gains in accuracy. [sent-238, score-0.051]

77 In our main experiments we used brain scans of AD patients and Cognitively Normal healthy controls (CN) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [30] in a set of cross-validation experiments. [sent-242, score-0.088]

78 ADNI is a landmark study sponsored by the NIH, major pharmaceuticals, and others to determine the extent to which multi-modal brain imaging can help predict the onset of, and monitor the progression of, AD. [sent-243, score-0.116]

79 For our experiments, 48 AD subjects and 66 controls were chosen who had both T1 -weighted MR scans and Fluoro-Deoxy-Glucose PET (FDG-PET) scans at two time-points two years apart. [sent-245, score-0.118]

80 uk/spm/) were used to register scans to a common template and calculate Gray Matter (GM) densities at each voxel in the MR scans. [sent-251, score-0.059]

81 Feature selection was performed separately in each set of images by sorting voxels by t-statistic (calculated using training data), and choosing the highest 2000, 5000, 10000,. [sent-253, score-0.046]

82 We used linear, quadratic, and Gaussian kernels: a total of 24 kernels per set (GM maps, TBM maps, baseline FDG-PET, FDG-PET at 2-year follow-up), for a total of 96 kernels. [sent-257, score-0.321]

83 For the Q matrix we used the Laplacian of the covariance between single-kernel α parameters (recall the motivation from [19] in Section 3), plus a block-diagonal term representing clusters of kernels derived from the same imaging modalities. [sent-258, score-0.378]
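
A hedged sketch of one plausible construction in this spirit (illustrative only; the block weighting, the use of absolute covariances, and the ridge term are assumptions, not the paper's exact recipe): train a single-kernel SVM per base kernel, take the graph Laplacian of the absolute covariance between the resulting dual-variable vectors, and add a block term grouping kernels from the same modality.

```python
# Illustrative Q-matrix construction: Laplacian of |cov| of single-kernel alphas
# plus a modality block term, with a small ridge so Q is PSD.
import numpy as np
from sklearn.svm import SVC

def build_Q(Ks, y, modality_of, C=1.0, block_weight=1.0, ridge=1e-3):
    M, n, _ = Ks.shape
    alphas = np.zeros((M, n))
    for m in range(M):                               # one single-kernel SVM per base kernel
        clf = SVC(C=C, kernel='precomputed').fit(Ks[m], y)
        alphas[m, clf.support_] = np.abs(clf.dual_coef_[0])  # |y_i alpha_i| = alpha_i
    W = np.abs(np.cov(alphas))                       # M x M similarity between kernels
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian of the similarity
    B = (modality_of[:, None] == modality_of[None, :]).astype(float)  # modality blocks
    return L + block_weight * B + ridge * np.eye(M)

# Toy usage with two "modalities" of three kernels each.
rng = np.random.default_rng(6)
n, M = 60, 6
X = rng.standard_normal((n, 12))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(n))
Ks = np.stack([X[:, 2*m:2*m+2] @ X[:, 2*m:2*m+2].T for m in range(M)])
modality_of = np.array([0, 0, 0, 1, 1, 1])
Q = build_Q(Ks, y, modality_of)
print("Q is PSD:", np.all(np.linalg.eigvalsh(Q) >= -1e-8))
```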

84 To demonstrate that Q-regularizers indeed influence the learned classifier, we performed classification experiments with the Laplacian of the inverse distance between voxels as a Q matrix, and voxel-wise GM density (VBM) as features. [sent-262, score-0.046]

85 2 Multi-modality Alzheimer’s disease (AD) prediction. Next, we performed multi-modality AD prediction experiments using all 96 kernels across two modalities. [sent-272, score-0.405]

86 Multi-kernel methods have emerged as the state of the art in this domain [3, 4], and have performed favorably in extensive evaluations against various baselines such as single-kernel methods and naïve combinations; we therefore focus our analysis on comparison with existing MKL methods. [sent-308, score-0.064]

87 The primary benefit of current sparse MKL methods is that they effectively filter out uninformative or noisy kernels; however, the kernels used in these experiments are all derived from clinically relevant neuroimaging data, and are thus highly reliable. [sent-316, score-0.464]

88 We next turn to an analysis of the covariance structures found in the data empirically as a concrete demonstration of the type of patterns towards which the Q-MKL regularizer biases β. [sent-319, score-0.064]

89 The FDG-PET kernels (last 48 kernels) are much more strongly interrelated. [sent-324, score-0.321]

90 Interestingly, the first eigenvector is almost entirely devoted to two large blocks of kernels – those which come from MRI data, and those which come from FDG-PET data. [sent-325, score-0.373]

91 Q-MKL extends this framework to exploit higher order interactions between kernels using supervised, unsupervised, or domain-knowledge driven measures. [sent-331, score-0.345]

92 This flexibility can impart greater control over how the model utilizes cluster structure among kernels, and effectively encourage cancellation of errors wherever possible. [sent-332, score-0.053]

93 We have presented a convex optimization strategy to efficiently solve the resultant model, and shown experiments on a challenging problem of identifying AD based on multi-modal brain imaging data (obtaining statistically significant improvements). [sent-333, score-0.139]

94 Note the block structure in (a–d) relating to the imaging modalities and kernel functions. [sent-339, score-0.247]

95 Let the kernel figure it out; principled learning of pre-processing for kernel classifiers. [sent-349, score-0.33]

96 Alzheimer’s disease diagnosis in individual subjects using structural MR images: validation studies. [sent-416, score-0.084]

97 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-430, score-0.165]

98 Exploring large feature spaces with hierarchical multiple kernel learning. [sent-468, score-0.165]

99 Orthogonal series density estimation and the kernel eigenvalue problem. [sent-493, score-0.165]

100 Spatial and anatomical regularization of SVM for brain image analysis. [sent-519, score-0.057]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mkl', 0.573), ('kernels', 0.321), ('virtual', 0.229), ('renyi', 0.225), ('wm', 0.186), ('km', 0.165), ('kernel', 0.165), ('rademacher', 0.158), ('alzheimer', 0.124), ('neuroimaging', 0.119), ('ad', 0.111), ('tbm', 0.106), ('traces', 0.106), ('module', 0.104), ('base', 0.098), ('hm', 0.094), ('svm', 0.091), ('vbm', 0.087), ('rkhs', 0.084), ('entropy', 0.084), ('disease', 0.084), ('gm', 0.08), ('ki', 0.074), ('inde', 0.065), ('adni', 0.065), ('vi', 0.06), ('scans', 0.059), ('imaging', 0.057), ('im', 0.056), ('mr', 0.053), ('hinrichs', 0.052), ('sdp', 0.051), ('norm', 0.049), ('margin', 0.048), ('voxels', 0.046), ('laplacian', 0.046), ('penalty', 0.045), ('generalizes', 0.045), ('gehler', 0.044), ('quadratic', 0.044), ('jmlr', 0.044), ('rmm', 0.043), ('shogun', 0.043), ('szafranski', 0.043), ('complexity', 0.041), ('relates', 0.041), ('regularizer', 0.04), ('tr', 0.038), ('neuroimage', 0.038), ('qz', 0.038), ('pet', 0.038), ('correlational', 0.038), ('sonnenburg', 0.038), ('uw', 0.038), ('art', 0.037), ('dual', 0.035), ('behaviors', 0.035), ('supplements', 0.035), ('morphometry', 0.035), ('singh', 0.034), ('rm', 0.032), ('optimized', 0.032), ('svms', 0.032), ('vm', 0.031), ('combination', 0.031), ('mth', 0.03), ('progression', 0.03), ('brain', 0.029), ('encourage', 0.029), ('covariances', 0.029), ('mixing', 0.029), ('dependencies', 0.029), ('cristianini', 0.029), ('brie', 0.028), ('program', 0.028), ('anatomical', 0.028), ('mri', 0.028), ('madison', 0.028), ('weights', 0.027), ('domain', 0.027), ('primal', 0.027), ('convex', 0.027), ('come', 0.026), ('psd', 0.026), ('multi', 0.026), ('lanckriet', 0.026), ('peng', 0.025), ('modalities', 0.025), ('burden', 0.025), ('combined', 0.025), ('wisconsin', 0.025), ('coef', 0.024), ('toolbox', 0.024), ('uninformative', 0.024), ('extensions', 0.024), ('interactions', 0.024), ('cluster', 0.024), ('regularizers', 0.024), ('demonstration', 0.024), ('motivating', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

2 0.21166502 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

3 0.19076546 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

Author: Sung J. Hwang, Kristen Grauman, Fei Sha

Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.

4 0.1717836 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

5 0.14541832 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

6 0.13783975 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

7 0.11314036 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

8 0.10872429 197 nips-2012-Learning with Recursive Perceptual Representations

9 0.092912547 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

10 0.092724457 330 nips-2012-Supervised Learning with Similarity Functions

11 0.085303135 150 nips-2012-Hierarchical spike coding of sound

12 0.083590977 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease

13 0.08271753 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

14 0.082389683 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

15 0.081138767 168 nips-2012-Kernel Latent SVM for Visual Recognition

16 0.07614512 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

17 0.073390096 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

18 0.073161863 167 nips-2012-Kernel Hyperalignment

19 0.071947865 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

20 0.070970342 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, 0.047), (2, -0.028), (3, -0.059), (4, 0.108), (5, 0.009), (6, -0.012), (7, 0.108), (8, -0.062), (9, -0.095), (10, 0.005), (11, 0.006), (12, 0.119), (13, -0.01), (14, 0.098), (15, -0.149), (16, -0.013), (17, 0.069), (18, 0.022), (19, -0.141), (20, 0.065), (21, -0.053), (22, 0.028), (23, -0.276), (24, -0.003), (25, 0.059), (26, -0.019), (27, -0.051), (28, -0.016), (29, 0.057), (30, -0.003), (31, -0.115), (32, -0.043), (33, -0.003), (34, 0.1), (35, -0.058), (36, 0.03), (37, -0.081), (38, -0.078), (39, -0.017), (40, 0.054), (41, 0.072), (42, 0.087), (43, 0.041), (44, 0.054), (45, -0.101), (46, 0.074), (47, -0.022), (48, 0.006), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95134282 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

2 0.81363082 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

3 0.79424065 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

4 0.76482755 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

5 0.73523337 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

6 0.72447497 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

7 0.67722642 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

8 0.59972721 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

9 0.55312449 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

10 0.5267185 168 nips-2012-Kernel Latent SVM for Visual Recognition

11 0.50513601 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

12 0.48700255 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

13 0.4568283 330 nips-2012-Supervised Learning with Similarity Functions

14 0.44552571 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

15 0.44289407 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

16 0.43388486 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

17 0.42840105 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

18 0.4279609 95 nips-2012-Density-Difference Estimation

19 0.4231644 197 nips-2012-Learning with Recursive Perceptual Representations

20 0.41902635 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.052), (11, 0.031), (21, 0.031), (36, 0.011), (38, 0.115), (39, 0.011), (42, 0.031), (54, 0.027), (55, 0.046), (64, 0.219), (74, 0.058), (76, 0.135), (80, 0.075), (92, 0.045), (94, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81520569 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

2 0.78049791 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

3 0.76306206 147 nips-2012-Graphical Models via Generalized Linear Models

Author: Eunho Yang, Genevera Allen, Zhandong Liu, Pradeep K. Ravikumar

Abstract: Undirected graphical models, also known as Markov networks, enjoy popularity in a variety of applications. The popular instances of these models such as Gaussian Markov Random Fields (GMRFs), Ising models, and multinomial discrete models, however do not capture the characteristics of data in many settings. We introduce a new class of graphical models based on generalized linear models (GLMs) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate multivariate Markov networks given any univariate exponential distribution, such as Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We also provide examples of non-Gaussian high-throughput genomic networks learned via our GLM graphical models. 1

4 0.74154592 27 nips-2012-A quasi-Newton proximal splitting method

Author: Stephen Becker, Jalal Fadili

Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1

5 0.73040056 221 nips-2012-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1

6 0.68427235 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.68378854 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.68319398 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.68198699 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

10 0.68151784 188 nips-2012-Learning from Distributions via Support Measure Machines

11 0.68118423 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

12 0.68092918 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

13 0.68091702 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

14 0.67884791 69 nips-2012-Clustering Sparse Graphs

15 0.6786378 215 nips-2012-Minimizing Uncertainty in Pipelines

16 0.67724675 260 nips-2012-Online Sum-Product Computation Over Trees

17 0.67651457 225 nips-2012-Multi-task Vector Field Learning

18 0.6763339 197 nips-2012-Learning with Recursive Perceptual Representations

19 0.67616451 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

20 0.67609882 210 nips-2012-Memorability of Image Regions