nips nips2012 nips2012-231 knowledge-graph by maker-knowledge-mining

231 nips-2012-Multiple Operator-valued Kernel Learning


Source: pdf

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. [sent-10, score-0.278]

2 This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. [sent-11, score-0.431]

3 We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). [sent-12, score-0.687]

4 The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. [sent-13, score-0.436]

5 We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. [sent-14, score-0.5]

6 We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. [sent-15, score-0.503]

7 Typical learning situations include multi-task learning [11], functional regression [12], and structured output prediction [4]. [sent-21, score-0.348]

8 In this paper, we are interested in the problem of functional regression with functional responses in the context of brain-computer interface (BCI) design. [sent-22, score-0.605]

9 More precisely, we are interested in finger movement prediction from electrocorticographic signals [23]. [sent-23, score-0.298]

10 Indeed, from a set of signals measuring brain surface electrical activity on d channels during a given period of time, we want to predict, for any instant of that period whether a finger is moving or not and the amplitude of the finger flexion. [sent-24, score-0.165]

11 Formally, the problem consists in learning a functional dependency between a set of d signals and a sequence of labels (a step function indicating whether a finger is moving or not) and between the same set of signals and vector of real values (the amplitude function). [sent-25, score-0.488]

12 While it is clear that this problem can be formalized as a functional regression problem, from our point of view such a problem can benefit from the multiple operator-valued kernel learning framework. [sent-26, score-0.551]

13 Indeed, for these problems, one of the difficulties arises from the unknown latency between the signal related to the finger movement and the actual movement [23]. [sent-27, score-0.385]

14 Hence, instead of fixing in advance some value for this latency in the regression model, our framework allows learning it from the data by means of several operator-valued kernels. [sent-28, score-0.092]

15 If we wish to address the functional regression problem in the principled framework of reproducing kernel Hilbert spaces (RKHS), we have to consider RKHSs whose elements are operators that map a function to another function space, with the source and target function spaces possibly being different. [sent-29, score-0.803]

16 Such a functional RKHS framework and associated operator-valued kernels have been introduced recently [12, 13]. [sent-31, score-0.392]

17 A basic question with reproducing kernels is how to build these kernels and what the optimal kernel choice is for a given application. [sent-32, score-0.648]

18 In order to overcome the need for choosing a kernel before the learning process, several works have tried to address the problem of learning the scalar-valued kernel jointly with the decision function [18, 29]. [sent-33, score-0.416]

19 Since these seminal works, many efforts have been carried out in order to theoretically analyze the kernel learning framework [9, 3] or in order to provide efficient algorithms [24, 1, 15]. [sent-34, score-0.208]

20 While many works have been devoted to multiple scalar-valued kernel learning, this problem of kernel learning has been barely investigated for operator-valued kernels. [sent-35, score-0.46]

21 One motivation of this work is to bridge the gap between multiple kernel learning (MKL) and operator-valued kernels by proposing a framework and an algorithm for learning a finite linear combination of operator-valued kernels. [sent-36, score-0.487]

22 It should be pointed out that in a recent work [10], the problem of learning the output kernel was formulated as an optimization problem over the cone of positive semidefinite matrices, and a block-coordinate descent method was proposed to solve it. [sent-38, score-0.305]

23 In contrast, our multiple operator-valued kernel learning formulation can be seen as a way of learning simultaneously input and output kernels, although we consider a linear combination of kernels that are fixed in advance. [sent-40, score-0.479]

24 We denote by Gx and Gy the separable real Hilbert spaces of input and output functional data, respectively. [sent-46, score-0.33]

25 In the functional data analysis domain, continuous data are generally assumed to belong to the space of square-integrable functions L2. [sent-47, score-0.266]

26 We consider the problem of estimating a function f such that f(xi) = yi when observed functional data (xi, yi), i = 1, . . . , n, are given. [sent-51, score-0.3]

27 Since Gx and Gy are spaces of functions, the problem can be thought of as an operator estimation problem, where the desired operator maps a Hilbert space of factors to a Hilbert space of targets. [sent-55, score-0.314]

28 We can define the regularized operator estimate of f ∈ F as: fλ = arg min_{f ∈ F} (1/n) Σ_{i=1}^n ‖yi − f(xi)‖²_Gy + λ‖f‖²_F. (1) [sent-56, score-0.166]

29 In this work, we are looking for a solution to this minimization problem in a function-valued reproducing kernel Hilbert space F. [sent-57, score-0.324]

30 The continuity property is obtained by considering a special class of reproducing kernels called Mercer kernels [7, Proposition 2. [sent-59, score-0.44]

31 We now introduce (function) Gy -valued reproducing kernel Hilbert spaces and show the correspondence between such spaces and positive definite (operator) L(G y )-valued kernels. [sent-63, score-0.428]

32 Definition 1 (function-valued RKHS) A Hilbert space F of functions from Gx to Gy is called a reproducing kernel Hilbert space if there is a positive definite L(G y )-valued kernel KF (w, z) on Gx × Gx such that: i. [sent-65, score-0.532]

33 Definition 2 (operator-valued kernel) An L(G y )-valued kernel KF (w, z) on Gx is a function KF (·, ·) : Gx × Gx −→ L(G y ); furthermore: i. [sent-68, score-0.208]

34 Theorem 1 (bijection between function-valued RKHS and operator-valued kernel) An L(G y )-valued kernel KF (w, z) on Gx is the reproducing kernel of some Hilbert space F, if and only if it is positive definite. [sent-74, score-0.532]

35 For further reading on operator-valued kernels and their associated RKHSs, see, e. [sent-76, score-0.162]

36 2 Functional Response Ridge Regression in Dual Variables. We can write the ridge regression with functional responses optimization problem (1) as follows: min_{f ∈ F} (1/2)‖f‖²_F + (1/(2nλ)) Σ_{i=1}^n ‖ξi‖²_Gy, with ξi = yi − f(xi). [sent-80, score-0.497]

37 for i = 1, . . . , n, which are functional variables since the output space is the space of functions Gy. [sent-84, score-0.256]

38 The method of Lagrange multipliers on Banach spaces, which is a generalization of the classical (finite-dimensional) Lagrange multipliers method suitable to solve certain infinite-dimensional constrained optimization problems, is applied here. [sent-86, score-0.158]

39 f(·) = Σ_{i=1}^n K(xi, ·)αi, (4) where K : Gx × Gx −→ L(Gy) is the operator-valued kernel of F. [sent-106, score-0.208]

40 Substituting this into (3) and minimizing with respect to ξ, we obtain the dual of the functional response kernel ridge regression (KRR) problem: max_{α ∈ Gy^n} −(nλ/2)‖α‖²_{Gy^n} − (1/2)⟨Kα, α⟩_{Gy^n} + ⟨α, y⟩_{Gy^n}, (5) where K = [K(xi, xj)]_{i,j=1}^n is the block operator kernel matrix. [sent-107, score-1.108]

41 The computational details regarding the dual formulation of functional KRR are derived in Appendix B of [14]. [sent-108, score-0.263]

42 MovKL in Dual Variables. Let us now consider that the function f(·) is a sum of M functions {fk(·)}_{k=1}^M, where each fk belongs to a Gy-valued RKHS with kernel Kk(·, ·). [sent-110, score-0.323]

43 Similarly to scalar-valued multiple kernel learning, we can cast the problem of learning these functions fk as min_{d ∈ D} min_{fk ∈ Fk} Σ_{k=1}^M ‖fk‖²_{Fk}/(2dk) + (1/(2nλ)) Σ_{i=1}^n ‖ξi‖²_Gy, (6) with ξi = yi − Σ_{k=1}^M fk(xi), where d = [d1, · · · , dM], D = {d : ∀k, dk ≥ 0 and Σ_k dk^r ≤ 1}, and 1 ≤ r ≤ ∞. [sent-111, score-0.854]

44 The KKT conditions also state that at optimality we have fk(·) = dk Σ_{i=1}^n Kk(xi, ·)αi. 3 Solving the MovKL Problem. After having presented the framework, we now devise an algorithm for solving this MovKL problem. [sent-117, score-0.115]
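
As an illustration of how such a kernel expansion can be evaluated, the sketch below assumes each output function has been discretized on a grid of p points, so that every coefficient αi is a vector in R^p and each operator-valued kernel evaluation Kk(xi, x) is a p × p matrix; the function name, argument layout, and this finite-dimensional discretization are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def predict(x, X_train, alpha, kernels, d):
    """Evaluate f(x) = sum_k d_k sum_i K_k(x_i, x) @ alpha_i in a
    discretized setting where output functions live on a grid of p points.

    x       : a new input (whatever object the kernels accept)
    X_train : list of n training inputs
    alpha   : (n, p) array, one discretized coefficient function per sample
    kernels : list of M callables; kernels[k](xi, x) returns a (p, p) matrix
    d       : (M,) array of combination weights
    """
    n, p = alpha.shape
    f = np.zeros(p)
    for k, Kk in enumerate(kernels):
        for i in range(n):
            f += d[k] * (Kk(X_train[i], x) @ alpha[i])
    return f
```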

45 1 Block-coordinate descent algorithm Since the optimization problem (6) has the same structure as a multiple scalar-valued kernel learning problem, we can build our MovKL algorithm upon the MKL literature. [sent-119, score-0.303]

46 The convergence of a block coordinate descent algorithm, which is closely related to the Gauss-Seidel method, was studied in the works of [30] and others. [sent-121, score-0.089]

47 The difference here is that we have operators and block operator matrices rather than matrices and block matrices, but this does not increase the complexity if the inverses of the operators are computable (typically analytically or by spectral decomposition). [sent-122, score-0.445]

48 With {dk} fixed, the resulting optimization problem with respect to α has the following form: (K + λI)α = y, (8) where K = Σ_{k=1}^M dk Kk. [sent-125, score-0.129]
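
A minimal numerical sketch of this step, assuming the block operator kernel matrices have already been discretized into ordinary N × N arrays (the storage layout and function name are assumptions made for illustration, not the authors' implementation):

```python
import numpy as np

def solve_combined_system(K_blocks, d, y, lam):
    """Solve (K + lam*I) alpha = y for K = sum_k d_k K_k (eq. 8).

    K_blocks : (M, N, N) array; K_blocks[k] is the k-th block operator kernel
               matrix [K_k(x_i, x_j)]_{i,j}, flattened to N x N with N = n * p
    d        : (M,) combination weights
    y        : (N,) stacked discretized output functions
    lam      : regularization parameter lambda
    """
    K = np.tensordot(d, K_blocks, axes=1)            # weighted sum of the M matrices
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
```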

49 While the form of the solution is rather simple, solving this linear system is still challenging in the operator setting, and we propose below an algorithm for its resolution. [sent-126, score-0.162]

50 With {fk} fixed, according to problem (6), we can rewrite the problem as min_{d ∈ D} Σ_{k=1}^M ‖fk‖²_{Fk}/dk, (9) which has a closed-form solution for which optimality occurs at [20]: dk = ‖fk‖_{Fk}^{2/(r+1)} / (Σ_k ‖fk‖_{Fk}^{2r/(r+1)})^{1/r}. (10) [sent-128, score-0.559]
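
The update (10) is straightforward to implement once the RKHS norms ‖fk‖_{Fk} are available (in a discretized setting one would typically use ‖fk‖²_{Fk} = dk² ⟨α, Kk α⟩). The following sketch covers only the weight update itself and treats those norms as given; the function name is an illustrative choice:

```python
import numpy as np

def update_weights(f_norms, r):
    """Closed-form l_r-norm weight update of eq. (10).

    f_norms : (M,) array of RKHS norms ||f_k||_{F_k}
    r       : norm parameter, 1 <= r < infinity
    """
    num = f_norms ** (2.0 / (r + 1))
    den = np.sum(f_norms ** (2.0 * r / (r + 1))) ** (1.0 / r)
    return num / den
```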

51 The difference here is that we have to solve a linear system involving a block-operator kernel matrix which is a combination of basic kernel matrices associated to M operator-valued kernels. [sent-130, score-0.527]

52 2 Solving a linear system involving multiple operator-valued kernel matrices One common way to construct operator-valued kernels is to build scalar-valued ones which are carried over to the vector-valued (resp. [sent-134, score-0.466]

53 In this setting an operator-valued kernel has the following form: K(w, z) = G(w, z)T, where G is a scalar-valued kernel and T is a positive operator in L(G y ). [sent-137, score-0.547]

54 More recently and for supervised functional output learning problems, T is chosen to be a multiplication or an integral operator [12, 13]. [sent-139, score-0.466]

55 This choice is motivated by the fact that functional linear models for functional responses [25] are based on these operators, and such kernels thus provide an interesting alternative for extending these models to nonlinear contexts. [sent-140, score-0.737]

56 In addition, some works on functional regression and structured-output learning consider operator-valued kernels constructed from the identity operator as in [19] and [4]. [sent-141, score-0.617]

57 In this work we adopt a functional data analysis point of view and are therefore interested in a finite combination of operator-valued kernels constructed from the identity, multiplication and integral operators. [sent-142, score-0.51]

58 A problem encountered when working with operator-valued kernels in infinite-dimensional spaces is that of solving the system of linear operator equations (8). [sent-143, score-0.402]

59 In the following we show how to solve this problem for two cases of operator-valued kernel combinations. [sent-144, score-0.228]

60 This is the simpler case where the combination of operator-valued kernels has the following form: K(w, z) = Σ_{k=1}^M dk Gk(w, z)T, (11) where Gk is a scalar-valued kernel, T is a positive operator in L(Gy), and dk are the combination coefficients. [sent-146, score-0.585]

61 In this setting, the block operator kernel matrix K can be expressed as a Kronecker product between the multiple scalar-valued kernel matrix G = Σ_{k=1}^M dk Gk, where Gk = [Gk(xi, xj)]_{i,j=1}^n, and the operator T. [sent-147, score-0.911]
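
When all kernels share the same output operator T, this Kronecker structure makes the system (8) inexpensive to solve: one eigendecomposition of G and one of the discretized T suffice. The sketch below assumes symmetric positive semidefinite G and T and outputs stored one per row; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def solve_kronecker_system(G, T, Y, lam):
    """Solve the block system sum_j G[i, j] * T @ alpha_j + lam * alpha_i = y_i
    (i.e. (G kron T + lam*I) vec(alpha) = vec(Y)) via eigendecompositions.

    G   : (n, n) combined scalar-valued kernel matrix, symmetric PSD
    T   : (p, p) discretized output operator, symmetric PSD
    Y   : (n, p) array, row i is the discretized output function y_i
    lam : regularization parameter lambda
    """
    eg, U = np.linalg.eigh(G)                # G = U diag(eg) U^T
    et, V = np.linalg.eigh(T)                # T = V diag(et) V^T
    S = U.T @ Y @ V                          # rotate into the joint eigenbasis
    S = S / (np.outer(eg, et) + lam)         # divide by eigenvalues of the system
    return U @ S @ V.T                       # rotate back; row i is alpha_i
```

Under these assumptions the cost is dominated by the two eigendecompositions and a few matrix products, rather than a dense solve of size np × np.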

62 This is the general case where multiple operator-valued kernels are combined as follows: K(w, z) = Σ_{k=1}^M dk Gk(w, z)Tk, (12) where Gk is a scalar-valued kernel, Tk is a positive operator in L(Gy), and dk are the combination coefficients. [sent-150, score-0.313] [sent-166, score-0.392]

63 Algorithm 2 (Gauss-Seidel method): choose an initial vector of functions α(0); repeat, for i = 1, 2, . . . , n, solve (13): [K(xi, xi) + λI]αi^(t) = si; until convergence. Algorithm 1 (ℓr-norm MovKL): alternate α ←− solution of (K + λI)α = y with dk^(t+1) ←− ‖fk‖^{2/(r+1)} / (Σ_k ‖fk‖^{2r/(r+1)})^{1/r} for k = 1, . . . , M, and break if ‖α − α_prev‖ < ε. [sent-163, score-0.204]

65 Inverting the associated block operator kernel matrix K is not feasible in this case, which is why we propose a Gauss-Seidel iterative procedure (see Algorithm 2) to solve the system of linear operator equations (8). [sent-167, score-0.607]
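
A hedged sketch of such a block Gauss-Seidel sweep, assuming the operator blocks Kij = Σ_k dk Gk(xi, xj)Tk have been discretized as p × p matrices stored in a four-dimensional array; the storage scheme, iteration cap, and stopping rule are illustrative choices, not taken from the paper.

```python
import numpy as np

def block_gauss_seidel(K, y, lam, n_iter=100, tol=1e-6):
    """Block Gauss-Seidel iterations for (K + lam*I) alpha = y (eq. 8).

    K   : (n, n, p, p) array; K[i, j] is the p x p block
          sum_k d_k G_k(x_i, x_j) T_k
    y   : (n, p) array of discretized output functions
    lam : regularization parameter lambda
    """
    n, _, p, _ = K.shape
    alpha = np.zeros((n, p))
    for _ in range(n_iter):
        max_change = 0.0
        for i in range(n):
            # right-hand side using the most recently updated alpha values
            s = y[i] - sum(K[i, j] @ alpha[j] for j in range(n) if j != i)
            new_i = np.linalg.solve(K[i, i] + lam * np.eye(p), s)
            max_change = max(max_change, np.linalg.norm(new_i - alpha[i]))
            alpha[i] = new_i
        if max_change < tol:
            break
    return alpha
```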

66 This problem is still challenging because the kernel K(·, ·) still involves a positive combination of operator-valued kernels. [sent-169, score-0.247]

67 for k = 1, . . . , M + 1, where K̂(xi, xi) is the (M + 1) × (M + 1) diagonal matrix [Kk(xi, xi)]_{k=1}^{M+1}. [sent-178, score-0.11]

68 Writing down the Lagrangian of this optimization problem and then deriving its first-order optimality conditions leads us to the following set of linear equations: K1(xi, xi)αi,1 − si + Σ_{k=1}^M γk = 0; Kk(xi, xi)αi,k − γk = 0; αi,1 − αi,k = 0, (14) where k = 2, . . . , M + 1. [sent-187, score-0.192]

69 4 Experiments In order to highlight the benefit of our multiple operator-valued kernel learning approach, we have considered a series of experiments on a real dataset, involving functional output prediction in a brain-computer interface framework. [sent-205, score-0.568]

70 The problem we addressed is a sub-problem related to finger movement decoding from Electrocorticographic (ECoG) signals. [sent-206, score-0.181]

71 We focus on the problem of estimating if a finger is moving or not and also on the direct estimation of the finger movement amplitude from the ECoG signals. [sent-207, score-0.278]

72 The development of the full BCI application is beyond the scope of this paper and our objective here is to prove that this problem of predicting finger movement can benefit from multiple kernel learning. [sent-208, score-0.433]

73 The finger flexion of the subject was recorded at 25 Hz and up-sampled to 1 kHz by means of a data glove which measures the finger movement amplitude. [sent-215, score-0.181]

74 Due to the acquisition process, a delay appears between the finger movement and the measured ECoG signal [22]. [sent-216, score-0.181]

75 Features from the ECoG signals are built by computing band-specific amplitude modulation features, which are defined as the sum of the squares of the band-specific filtered ECoG signal samples during a fixed time window. [sent-218, score-0.132]
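
A possible implementation of such a feature, assuming a Butterworth band-pass filter and non-overlapping windows (the filter order, band edges, and windowing scheme are assumptions; the paper's exact preprocessing may differ):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_am_features(signal, fs, band, window):
    """Band-specific amplitude modulation feature: sum of the squared
    band-pass filtered samples in each non-overlapping time window.

    signal : 1-D array, one ECoG channel
    fs     : sampling frequency in Hz
    band   : (low, high) cut-off frequencies of the band in Hz
    window : window length in samples
    """
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = filtfilt(b, a, signal)
    n_win = len(filtered) // window
    squared = filtered[: n_win * window] ** 2
    return squared.reshape(n_win, window).sum(axis=1)
```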

76 For our finger movement prediction task, we have kept 5 channels that were manually selected and split the ECoG signals into portions of 200 samples. [sent-219, score-0.272]

77 For each of these time segments, we have, at each time sample, the label of whether the finger is moving or not, as well as the real movement amplitude. [sent-220, score-0.214]

78 The dataset is composed of 487 pairs of input-output signals, the output signals being either the binary movement labels or the real movement amplitude. [sent-221, score-0.364]

79 In a nutshell, the problem boils down to a functional regression task with functional responses. [sent-223, score-0.551]

80 The inverses of these operators can be computed analytically. [sent-226, score-0.098]

81 Table 1: (Left) Results for the movement state prediction. [sent-227, score-0.181]

82 Residual Sum of Squares Error (RSSE) and the percentage of Labels Correctly Recognized (LCR) for: (1) baseline KRR with the Gaussian kernel, (2) functional response KRR with the integral operator-valued kernel, (3) MovKL with ℓ∞-, ℓ1- and ℓ2-norm constraints. [sent-228, score-0.314]

83 (Right) Residual Sum of Squares Error (RSSE) results for finger movement prediction. [sent-229, score-0.181]

84 Table 1 columns: Algorithm, RSSE, LCR(%) (left) and Algorithm, RSSE (right); rows: KRR - scalar-valued, KRR - functional response, MovKL - ℓ∞ norm, MovKL - ℓ1 norm, MovKL - ℓ2 norm. [sent-230, score-0.344]

86 While the inverses of the identity and the multiplication operators are easily and directly computable from the analytic expressions of the operators, the inverse of the integral operator is computed from its spectral decomposition as in [13]. [sent-245, score-0.333]
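
A small sketch of how the inverse of a discretized integral operator can be obtained from its spectral decomposition, assuming a symmetric positive definite kernel and a uniform quadrature grid; the eigenvalue floor eps is an illustrative safeguard, not part of the referenced procedure.

```python
import numpy as np

def integral_operator_inverse(k, grid, eps=1e-10):
    """Approximate the inverse of (Tf)(s) = int k(s, t) f(t) dt from the
    spectral decomposition of its discretization on a grid.

    k    : callable k(s, t), symmetric positive definite kernel of the operator
    grid : (p,) array of equally spaced quadrature points
    eps  : floor on the eigenvalues, to keep the inverse numerically stable
    """
    h = grid[1] - grid[0]                                    # quadrature weight
    T = np.array([[k(s, t) for t in grid] for s in grid]) * h
    evals, evecs = np.linalg.eigh(T)
    evals = np.maximum(evals, eps)                           # regularize tiny eigenvalues
    return (evecs / evals) @ evecs.T                         # V diag(1/lambda) V^T
```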

87 As in the scalar case, using multiple operator-valued kernels leads to better results. [sent-250, score-0.206]

88 By directly combining kernels constructed from identity, multiplication and integral operators we could reduce the residual sum of squares error and enhance the label classification accuracy. [sent-251, score-0.384]

89 RSSE and LCR of the baseline kernel ridge regression are significantly outperformed by the operator-valued kernel-based functional response regression. [sent-253, score-0.862]

90 This is due to the fact that an operator-valued kernel induces a similarity measure between input/output pairs. [sent-255, score-0.242]

91 5 Conclusion In this paper we have presented a new method for learning simultaneously an operator and a finite linear combination of operator-valued kernels. [sent-256, score-0.17]

92 We have extended the MKL framework to deal with functional response kernel ridge regression and we have proposed a block coordinate descent algorithm to solve the resulting optimization problem. [sent-257, score-0.806]

93 The method is applied to a BCI dataset to predict finger movement in a functional regression setting. [sent-258, score-0.48]

94 It would be interesting for future work to thoroughly compare the proposed MKL method for operator estimation with previous related methods for multi-class and multi-label MKL in the contexts of structured-output learning and collaborative filtering. [sent-260, score-0.131]

95 Consistency of the group Lasso and multiple kernel learning. [sent-281, score-0.252]

96 Semi-supervised penalized output kernel regression for e link prediction. [sent-287, score-0.303]

97 Vector valued reproducing kernel Hilbert spaces of integrable functions and mercer theorem. [sent-302, score-0.437]

98 On the existence and nonexistence of lagrange multipliers in Banach spaces. [sent-373, score-0.106]

99 Nonlinear functional models for functional responses in reproducing kernel Hilbert spaces. [sent-390, score-0.823]

100 Prediction of arm movement trajectories from ECoG-recordings in humans. [sent-416, score-0.181]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gy', 0.485), ('movkl', 0.31), ('gx', 0.303), ('nger', 0.277), ('functional', 0.23), ('kernel', 0.208), ('movement', 0.181), ('kernels', 0.162), ('kk', 0.15), ('krr', 0.138), ('ecog', 0.137), ('kf', 0.132), ('operator', 0.131), ('rsse', 0.121), ('reproducing', 0.116), ('fk', 0.115), ('dk', 0.107), ('hilbert', 0.104), ('ridge', 0.102), ('bci', 0.1), ('gk', 0.08), ('operators', 0.076), ('mkl', 0.075), ('regression', 0.069), ('kadri', 0.069), ('lcr', 0.069), ('rkhs', 0.068), ('signals', 0.068), ('amplitude', 0.064), ('rkhss', 0.061), ('block', 0.06), ('multipliers', 0.058), ('xi', 0.055), ('spaces', 0.052), ('lagrange', 0.048), ('jmlr', 0.046), ('preux', 0.046), ('response', 0.045), ('multiple', 0.044), ('rakotomamonjy', 0.042), ('micchelli', 0.042), ('multiplication', 0.04), ('combination', 0.039), ('responses', 0.039), ('integral', 0.039), ('interface', 0.037), ('integrable', 0.036), ('yi', 0.035), ('inria', 0.035), ('carmeli', 0.034), ('operatorvalued', 0.034), ('si', 0.034), ('residual', 0.034), ('lille', 0.033), ('dual', 0.033), ('squares', 0.033), ('moving', 0.033), ('france', 0.031), ('system', 0.031), ('lagrangian', 0.03), ('finger', 0.03), ('sonnenburg', 0.03), ('vito', 0.03), ('recognized', 0.03), ('descent', 0.029), ('inverting', 0.029), ('exion', 0.028), ('sierra', 0.028), ('equations', 0.026), ('wr', 0.026), ('electrocorticographic', 0.026), ('banach', 0.026), ('os', 0.026), ('output', 0.026), ('bach', 0.026), ('identity', 0.025), ('labels', 0.025), ('cnrs', 0.025), ('hermitian', 0.025), ('mercer', 0.025), ('universit', 0.025), ('tk', 0.024), ('prediction', 0.023), ('latency', 0.023), ('norm', 0.023), ('du', 0.023), ('francis', 0.022), ('inverses', 0.022), ('gn', 0.022), ('boils', 0.022), ('separable', 0.022), ('nite', 0.022), ('xj', 0.022), ('optimization', 0.022), ('cortes', 0.021), ('matrices', 0.021), ('deal', 0.021), ('mohri', 0.02), ('project', 0.02), ('solve', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

2 0.1717836 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

3 0.15705034 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1

4 0.14905517 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.14061581 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

6 0.1269466 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

7 0.12660109 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

8 0.11025884 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

9 0.091748193 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

10 0.080824643 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

11 0.07808686 330 nips-2012-Supervised Learning with Similarity Functions

12 0.075226754 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

13 0.07352002 16 nips-2012-A Polynomial-time Form of Robust Regression

14 0.061737463 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

15 0.060193419 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

16 0.058060914 200 nips-2012-Local Supervised Learning through Space Partitioning

17 0.055836156 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

18 0.055634372 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

19 0.055024207 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

20 0.053298578 167 nips-2012-Kernel Hyperalignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, 0.025), (2, 0.022), (3, -0.037), (4, 0.083), (5, 0.062), (6, 0.004), (7, 0.108), (8, -0.059), (9, -0.125), (10, -0.024), (11, 0.01), (12, 0.099), (13, -0.05), (14, 0.08), (15, -0.153), (16, -0.014), (17, 0.074), (18, -0.027), (19, -0.063), (20, 0.103), (21, -0.027), (22, 0.056), (23, -0.199), (24, 0.043), (25, -0.004), (26, 0.017), (27, -0.029), (28, 0.017), (29, 0.1), (30, -0.034), (31, -0.039), (32, -0.002), (33, 0.062), (34, 0.047), (35, -0.097), (36, -0.046), (37, -0.144), (38, -0.04), (39, 0.082), (40, 0.069), (41, -0.094), (42, 0.06), (43, 0.081), (44, -0.116), (45, 0.023), (46, 0.105), (47, -0.004), (48, 0.028), (49, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95453876 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

2 0.77133441 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

3 0.76502562 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

4 0.73453748 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nystr¨ m method have been successfully o applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nystr¨ m method are randomly sampled from the training examples and are o therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nystr¨ m method can yield impressively better generalization error bound than o random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets. 1

5 0.72587198 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

6 0.72436041 188 nips-2012-Learning from Distributions via Support Measure Machines

7 0.65523309 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

8 0.56306225 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

9 0.56086957 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

10 0.55567938 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

11 0.53830528 330 nips-2012-Supervised Learning with Similarity Functions

12 0.49822411 198 nips-2012-Learning with Target Prior

13 0.48872802 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

14 0.45941812 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

15 0.44412673 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

16 0.41171032 168 nips-2012-Kernel Latent SVM for Visual Recognition

17 0.40946642 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

18 0.40856531 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

19 0.406533 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study

20 0.40310803 95 nips-2012-Density-Difference Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (21, 0.018), (36, 0.017), (38, 0.151), (42, 0.015), (49, 0.284), (53, 0.011), (54, 0.029), (55, 0.08), (64, 0.02), (74, 0.044), (76, 0.091), (80, 0.066), (92, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7474516 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

2 0.67310852 145 nips-2012-Gradient Weights help Nonparametric Regressors

Author: Samory Kpotufe, Abdeslam Boularias

Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1

3 0.66467714 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery

Author: Ehsan Elhamifar, Guillermo Sapiro, René Vidal

Abstract: Given pairwise dissimilarities between data points, we consider the problem of finding a subset of data points, called representatives or exemplars, that can efficiently describe the data collection. We formulate the problem as a row-sparsity regularized trace minimization problem that can be solved efficiently using convex programming. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated with each one of the representatives. We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative for all data points to selecting all data points as representatives. When data points are distributed around multiple clusters according to the dissimilarities, we show that the data points in each cluster select representatives only from that cluster. Unlike metric-based methods, our algorithm can be applied to dissimilarities that are asymmetric or violate the triangle inequality, i.e., it does not require that the pairwise dissimilarities come from a metric. We demonstrate the effectiveness of the proposed algorithm on synthetic data as well as real-world image and text data. 1

4 0.62846541 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

5 0.58248872 155 nips-2012-Human memory search as a random walk in a semantic network

Author: Joseph L. Austerweil, Joshua T. Abbott, Thomas L. Griffiths

Abstract: The human mind has a remarkable ability to store a vast amount of information in memory, and an even more remarkable ability to retrieve these experiences when needed. Understanding the representations and algorithms that underlie human memory search could potentially be useful in other information retrieval settings, including internet search. Psychological studies have revealed clear regularities in how people search their memory, with clusters of semantically related items tending to be retrieved together. These findings have recently been taken as evidence that human memory search is similar to animals foraging for food in patchy environments, with people making a rational decision to switch away from a cluster of related information as it becomes depleted. We demonstrate that the results that were taken as evidence for this account also emerge from a random walk on a semantic network, much like the random web surfer model used in internet search engines. This offers a simpler and more unified account of how people search their memory, postulating a single process rather than one process for exploring a cluster and one process for switching between clusters. 1

6 0.57479995 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

7 0.57468325 211 nips-2012-Meta-Gaussian Information Bottleneck

8 0.5732801 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

9 0.57034713 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

10 0.56664395 215 nips-2012-Minimizing Uncertainty in Pipelines

11 0.56554127 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

12 0.56139839 227 nips-2012-Multiclass Learning with Simplex Coding

13 0.56095403 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.56037641 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

15 0.55873317 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

16 0.55869645 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

17 0.55848521 86 nips-2012-Convex Multi-view Subspace Learning

18 0.55821854 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

19 0.55779898 319 nips-2012-Sparse Prediction with the $k$-Support Norm

20 0.55772293 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound