nips nips2012 nips2012-188 knowledge-graph by maker-knowledge-mining

188 nips-2012-Learning from Distributions via Support Measure Machines


Source: pdf

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. [sent-10, score-0.286]

2 By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. [sent-11, score-0.469]

3 Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. [sent-14, score-0.303]

4 There are, in fact, multiple reasons why probability distributions may be preferable. [sent-18, score-0.142]

5 Probability distributions may be equally appropriate given an abundance of training data. [sent-24, score-0.156]

6 In [2], the probability product kernel (PPK) was proposed as a generalized inner product between two input objects, which is in fact closely related to well-known kernels such as the Bhattacharyya kernel [3] and the exponential symmetrized Kullback-Leibler (KL) divergence [4]. [sent-31, score-0.748]

7 In [5], an extension of a two-parameter family of Hilbertian metrics of Topsøe was used to define Hilbertian kernels on probability measures. [sent-32, score-0.316]

8 In [6], the semi-group kernels were designed for objects with additive semi-group structure such as positive measures. [sent-33, score-0.26]

9 Recently, [7] introduced nonextensive information-theoretic kernels on probability measures based on new Jensen-Shannon-type divergences. [sent-34, score-0.332]

10 Although these kernels have proven successful in many applications, they are designed specifically for certain properties of distributions and application domains. [sent-35, score-0.372]

11 Moreover, there has been no attempt to make a connection to the kernels on the corresponding input spaces. [sent-36, score-0.26]

12 First, we prove the representer theorem for a regularization framework over the space of probability distributions, which is a generalization of regularization over the input space on which the distributions are defined (Section 2). [sent-38, score-0.211]

13 Second, a family of positive definite kernels on distributions is introduced (Section 3). [sent-39, score-0.372]

14 If the distributions depend only on the locations in the input space, the SMM particularly reduces to a more flexible SVM that places different kernels on each data point. [sent-43, score-0.402]

15 2 Regularization on probability distributions. Given a non-empty set X, let P denote the set of all probability measures P on a measurable space (X, A), where A is a σ-algebra of subsets of X. [sent-44, score-0.172]

16 That is, we adopt a Hilbert space embedding to represent the distribution as a mean function in an RKHS [8, 9]. [sent-51, score-0.156]

17 Formally, let H denote an RKHS of functions f : X → R, endowed with a reproducing kernel k : X × X → R. [sent-52, score-0.31]

18 That is, we can see the mean embedding µP as a feature map associated with the kernel K : P × P → R, defined as $K(P, Q) = \langle \mu_P, \mu_Q \rangle_{\mathcal{H}}$. [sent-59, score-0.397]

19 Roughly speaking, the coefficients $\alpha_i$ control the contribution of the distributions through the mean embeddings $\mu_{P_i}$. [sent-77, score-0.155]

20 Furthermore, if we restrict P to a class of Dirac measures δx on X and consider the training set $\{(\delta_{x_i}, y_i)\}_{i=1}^m$, the functional (3) reduces to the usual regularization functional [11] and the solution reduces to $f = \sum_{i=1}^m \alpha_i k(x_i, \cdot)$. [sent-78, score-0.231]
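
Written out explicitly (a reconstruction from the two sentences above using the paper's notation, not a verbatim quote), the minimizer of the regularized functional (3) is a weighted combination of mean embeddings, and the Dirac case recovers the classical kernel expansion:

```latex
f \;=\; \sum_{i=1}^{m} \alpha_i \,\mu_{P_i}
  \;=\; \sum_{i=1}^{m} \alpha_i \,\mathbb{E}_{x \sim P_i}[k(x,\cdot)],
\qquad
P_i = \delta_{x_i}
\;\Longrightarrow\;
\mu_{P_i} = k(x_i,\cdot)
\;\Longrightarrow\;
f = \sum_{i=1}^{m} \alpha_i \, k(x_i,\cdot).
```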

21 Therefore, the standard representer theorem is recovered as a particular case (see also [12] for more general results on representer theorem). [sent-79, score-0.138]

22 3 Kernels on probability distributions. As the map (1) is linear in P, optimizing the functional (3) amounts to finding a function in H that approximates well the functions from P to R in the function class $\mathcal{F} = \{P \mapsto \int_X g \, \mathrm{d}P \mid P \in \mathcal{P},\, g \in C(X)\}$, where C(X) is a class of bounded continuous functions on X. [sent-96, score-0.248]

23 The following lemma states the relation between the RKHS H induced by the kernel k and the function class F. [sent-98, score-0.229]

24 Assuming that X is compact, the RKHS H induced by a kernel k is dense in F if k is universal, i. [sent-100, score-0.229]

25 Nonlinear kernels on P can be defined in an analogous way to nonlinear kernels on X , by treating mean embeddings µP of P ∈ P as its feature representation. [sent-107, score-0.619]

26 Then, the nonlinear kernels on P can be defined as $K(P, Q) = \kappa(\mu_P, \mu_Q) = \langle \psi(\mu_P), \psi(\mu_Q) \rangle_{\mathcal{H}_\kappa}$, where κ is a p.d. kernel. [sent-111, score-0.316]

27 As a result, many standard nonlinear kernels on X can be used to define nonlinear kernels on P as long as the kernel evaluation depends entirely on the inner product $\langle \mu_P, \mu_Q \rangle_{\mathcal{H}}$, e. [sent-114, score-0.861]

28 The kernels on distributions proposed in this work are so generic that standard kernel functions can be reused to derive kernels on distributions, unlike many other kernel functions proposed specifically for certain distributions. [sent-120, score-1.202]

29 It has been recently proved that the Gaussian RBF kernel given by $K(P, Q) = \exp\!\big(-\tfrac{\gamma}{2}\,\|\mu_P - \mu_Q\|_{\mathcal{H}}^2\big)$, ∀P, Q ∈ P, is universal w. [sent-121, score-0.274]
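
Since the squared RKHS distance between mean embeddings expands into inner products of mean embeddings, this level-2 Gaussian RBF kernel (like the other nonlinear kernels above that depend only on such inner products) can be evaluated purely from linear-kernel values K(·,·); a short worked expansion, not quoted from the paper:

```latex
\|\mu_P - \mu_Q\|_{\mathcal{H}}^2
  = \langle \mu_P,\mu_P \rangle_{\mathcal{H}}
    - 2\,\langle \mu_P,\mu_Q \rangle_{\mathcal{H}}
    + \langle \mu_Q,\mu_Q \rangle_{\mathcal{H}}
  = K(P,P) - 2K(P,Q) + K(Q,Q),
\qquad
K_{\mathrm{RBF}}(P,Q)
  = \exp\!\Big(-\tfrac{\gamma}{2}\big(K(P,P) - 2K(P,Q) + K(Q,Q)\big)\Big).
```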

30 It is therefore of theoretical interest to consider more general classes of universal kernels on probability distributions. [sent-125, score-0.335]

31 In its general form, an SMM amounts to solving an SVM problem with the expected kernel $K(P, Q) = \mathbb{E}_{x \sim P,\, z \sim Q}[k(x, z)]$. [sent-128, score-0.229]

32 This kernel can be computed in closed-form for certain classes of distributions and kernels k. [sent-129, score-0.601]

33 Alternatively, one can approximate the kernel K(P, Q) by the empirical estimate $K_{\mathrm{emp}}(P_n, Q_m) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, z_j)$ (4), where $P_n$ and $Q_m$ are empirical distributions of P and Q given random samples $\{x_i\}_{i=1}^{n}$ and $\{z_j\}_{j=1}^{m}$, respectively. [sent-131, score-0.341]
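
A minimal sketch of the empirical estimate (4), assuming only that the samples from P and Q are available as rows of NumPy arrays; the Gaussian RBF base kernel and its bandwidth below are illustrative placeholders, not choices prescribed by the text:

```python
import numpy as np

def k_emp(X, Z, k):
    """Empirical expected kernel (4): the average of k(x_i, z_j) over all sample pairs."""
    return float(np.mean([[k(x, z) for z in Z] for x in X]))

# illustrative base kernel: Gaussian RBF with bandwidth parameter gamma (assumed value)
gamma = 0.5
rbf = lambda x, z: np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
Pn = rng.normal(0.0, 1.0, size=(100, 3))   # random samples {x_i} standing in for P
Qm = rng.normal(0.5, 1.0, size=(80, 3))    # random samples {z_j} standing in for Q
print(k_emp(Pn, Qm, rbf))                  # approximates E_{x~P, z~Q}[k(x, z)]
```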

34 A finite sample of size m from a distribution P suffices (with high probability). Table 1: the analytic forms of expected kernels for different choices of kernels and distributions. [sent-132, score-0.52]

35 For example, one can model each distribution by a mixture of Gaussians and choose a kernel k whose expected value admits an analytic form. [sent-136, score-0.229]

36 Thus, for an SMM, the first level kernel k is used to obtain a vectorial representation of the measures, and the second level kernel K allows for a nonlinear algorithm on distributions. [sent-139, score-0.57]

37 Given a training sample $\{(P_i, y_i)\}_{i=1}^m$ drawn i.i.d. from some unknown probability distribution $\mathcal{P}$ on P × Y, a loss function ℓ : R × R → R, and a function class Λ, the goal of statistical learning is to find the function f ∈ Λ that minimizes the expected risk functional $R(f) = \int_{\mathcal{P}} \int_{X} \ell(y, f(x)) \, \mathrm{d}P(x) \, \mathrm{d}\mathcal{P}(P, y)$. [sent-145, score-0.144]

38 Furthermore, the risk functional can be simplified further by considering $\frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{x_{ij} \sim P_i} \ell(y_i, f(x_{ij}))$ based on n samples $x_{ij}$ drawn from each $P_i$. [sent-147, score-0.176]

39 Our framework, on the other hand, alleviates the problem by minimizing the risk functional $R_\mu(f) = \int_{\mathcal{P}} \ell(y, \mathbb{E}_P[f(x)]) \, \mathrm{d}\mathcal{P}(P, y)$ for f ∈ H, with corresponding empirical risk functional $\hat{R}_\mu(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, \mathbb{E}_{P_i}[f(x)])$ (cf. [sent-148, score-0.228]

40 As a result, if this holds for any distribution $P_i$ in the training set $\{(P_i, y_i)\}_{i=1}^m$, the true risk deviation $|R - R_\mu|$ is also expected to be small. [sent-167, score-0.141]

41 2 Flexible support vector machines. It turns out that, for certain choices of distributions P, the linear SMM trained using $\{(P_i, y_i)\}_{i=1}^m$ is equivalent to an SVM trained using some samples $\{(x_i, y_i)\}_{i=1}^m$ with an appropriate choice of kernel function. [sent-169, score-0.511]

42 Let k be a kernel on a measure space such that $\int\!\!\int k(x, z)^2 \, \mathrm{d}x \, \mathrm{d}z < \infty$, and let $g(x, \tilde{x})$ be a square integrable function such that $\int g(x, \tilde{x}) \, \mathrm{d}\tilde{x} < \infty$ for all x. [sent-173, score-0.229]

43 Given a sample $\{(P_i, y_i)\}_{i=1}^m$ where each $P_i$ is assumed to have a density given by $g(x_i, \tilde{x})$, the linear SMM is equivalent to the SVM on the training sample $\{(x_i, y_i)\}_{i=1}^m$ with kernel $K_g(x, z) = \int\!\!\int k(\tilde{x}, \tilde{z}) \, g(x, \tilde{x}) \, g(z, \tilde{z}) \, \mathrm{d}\tilde{x} \, \mathrm{d}\tilde{z}$. [sent-174, score-0.375]

44 Note that the important assumption for this equivalence is that the distributions $P_i$ differ only in their location in the parameter space. [sent-175, score-0.152]

45 Thus, it is clear that the feature map of x depends not only on the kernel k, but also on the density $g(x, \tilde{x})$. [sent-178, score-0.267]

46 Consequently, by virtue of Lemma 4, the kernel $K_g$ allows the SVM to place different kernels at each data point. [sent-179, score-0.489]

47 Consider Gaussian distributions $N(x_1; \sigma_1^2 \cdot I), \ldots, N(x_m; \sigma_m^2 \cdot I)$ and the Gaussian RBF kernel $k_{\sigma^2}$ with bandwidth parameter σ. [sent-184, score-0.261]

48 The convolution theorem of Gaussian distributions implies that this SMM is equivalent to a flexible SVM that places a data-dependent kernel $k_{\sigma^2 + 2\sigma_i^2}(x_i, \cdot)$ on training example $x_i$, i. [sent-185, score-0.415]
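
To make the convolution argument concrete, here is a hedged sketch (a derivation check for illustration, not code or constants quoted from the paper): under the base kernel k(x, z) = exp(−‖x−z‖²/(2σ²)) and isotropic Gaussians N(x_i; σ_i²·I), N(x_j; σ_j²·I), the difference x − z is Gaussian with mean x_i − x_j and variance (σ_i² + σ_j²)·I, so the expected kernel is, up to a constant scale factor, a Gaussian kernel of bandwidth σ² + σ_i² + σ_j²; setting σ_i = σ_j recovers the bandwidth σ² + 2σ_i² mentioned above.

```python
import numpy as np

def expected_rbf_gauss(mi, mj, si2, sj2, sigma2):
    """Closed form of E[exp(-||x - z||^2 / (2*sigma2))] for x ~ N(mi, si2*I), z ~ N(mj, sj2*I).

    Follows from the Gaussian convolution theorem: x - z ~ N(mi - mj, (si2 + sj2)*I).
    """
    d = len(mi)
    s2 = sigma2 + si2 + sj2
    scale = (sigma2 / s2) ** (d / 2.0)           # constant prefactor
    return scale * np.exp(-np.sum((mi - mj) ** 2) / (2.0 * s2))

# Monte Carlo sanity check of the closed form on toy parameters
rng = np.random.default_rng(0)
mi, mj = np.array([0.0, 0.0]), np.array([1.0, -0.5])
si2, sj2, sigma2 = 0.3, 0.3, 1.0
X = mi + np.sqrt(si2) * rng.standard_normal((20000, 2))
Z = mj + np.sqrt(sj2) * rng.standard_normal((20000, 2))
mc = np.mean(np.exp(-np.sum((X - Z) ** 2, axis=1) / (2.0 * sigma2)))
print(expected_rbf_gauss(mi, mj, si2, sj2, sigma2), mc)  # the two values should be close
```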

49 5 Related works. The kernel $K(P, Q) = \langle \mu_P, \mu_Q \rangle_{\mathcal{H}}$ is in fact a special case of the Hilbertian metric [5], with the associated kernel $K(P, Q) = \mathbb{E}_{P,Q}[k(x, \tilde{x})]$, and of a generative mean map kernel (GMMK) proposed by [15]. [sent-188, score-0.725]

50 In the GMMK, the kernel between two objects x and y is defined via $\hat{p}_x$ and $\hat{p}_y$, which are estimated probabilistic models of x and y, respectively. [sent-189, score-0.229]

51 That is, a probabilistic model $\hat{p}_x$ is learned for each example and used as a surrogate to construct the kernel between those examples. [sent-190, score-0.229]

52 The idea of surrogate kernels has also been adopted by the Probability Product Kernel (PPK) [2]. [sent-191, score-0.26]

53 Consequently, GMMK, PPK with ρ = 1, and our linear kernels are equivalent when the embedding kernel is k(x, x′ ) = δ(x − x′ ). [sent-193, score-0.619]
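
A one-line check of this equivalence (a sketch, assuming P and Q admit densities p and q): with the Dirac embedding kernel the linear kernel reduces to the expected-likelihood form of the PPK,

```latex
\langle \mu_P, \mu_Q \rangle_{\mathcal{H}}
  = \int\!\!\int k(x, x')\, \mathrm{d}P(x)\, \mathrm{d}Q(x')
  = \int\!\!\int \delta(x - x')\, p(x)\, q(x')\, \mathrm{d}x\, \mathrm{d}x'
  = \int p(x)\, q(x)\, \mathrm{d}x .
```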

54 More recently, the empirical kernel (4) was employed in an unsupervised way for multi-task learning to generalize to a previously unseen task [16]. [sent-194, score-0.229]

55 In contrast, we treat the probability distributions in a supervised way (cf. [sent-195, score-0.142]

56 the regularized functional (3)) and the kernel is not restricted to only the empirical kernel. [sent-196, score-0.297]

57 The use of expected kernels in dealing with the uncertainty in the input data has a connection to robust SVMs. [sent-197, score-0.26]

58 [18] showed the equivalence between SVMs using expected kernels and SOCP when τi = 0. [sent-202, score-0.3]

59 When τi > 0, the mean and covariance of missing kernel entries have to be estimated explicitly, making the SOCP more involved for nonlinear kernels. [sent-203, score-0.31]

60 ii) Augmented SVM (ASVM) is an SVM trained on augmented samples drawn according to the distributions $\{P_i\}_{i=1}^m$. [sent-206, score-0.146]

61 Figure 1: (a) the decision boundaries of SVM, ASVM, and SMM. [sent-211, score-0.229]

62 (b) the heatmap plots of average accuracies of SMM over 30 experiments using POLY-RBF (center) and RBF-RBF (right) kernel combinations with the plots of average accuracies at different parameter values (left). [sent-212, score-0.355]

63 Table 2: accuracies (%) of SMM on synthetic data with different combinations of embedding and level-2 kernels. [sent-213, score-0.235]

64 Table 2 headers: embedding kernels (POLY3, RBF, ...) versus level-2 kernels (LIN, ...). [sent-214, score-0.78]

65 We trained the SVM using only the means of the distributions, ASVM with 30 virtual examples generated from each distribution, and SMM using distributions as training examples. [sent-247, score-0.376]

66 Nevertheless, this becomes more difficult if the training distributions are, for example, nonisotropic and have different covariance matrices. [sent-258, score-0.156]

67 Secondly, we evaluate the performance of the SMM for different combinations of embedding and level-2 kernels. [sent-259, score-0.164]

68 The training set consists of 500 distributions from the positive class and 500 distributions from the negative class. [sent-272, score-0.268]

69 The kernels used in the experiment include linear kernel (LIN), polynomial kernel of degree 2 (POLY2), polynomial kernel of degree 3 (POLY3), unnormalized Gaussian RBF kernel (RBF), and normalized Gaussian RBF kernel (NRBF). [sent-274, score-1.545]

70 To fix parameter values of both kernel functions and SMM, 10-fold cross-validation (10-CV) is performed on a parameter grid, $C \in \{2^{-3}, 2^{-2}, \ldots\}$. [sent-275, score-0.229]
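
As a rough illustration of how such an experiment can be wired up (a sketch with assumed toy data and parameters, not the authors' code): precompute the m × m Gram matrix of empirical expected kernels over the training distributions and pass it to an SVM solver that accepts precomputed kernels, here scikit-learn's SVC(kernel="precomputed") with a small GridSearchCV over C (3-fold CV instead of the paper's 10-CV, only because the toy set is tiny):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def rbf_matrix(X, Z, gamma):
    """Pairwise Gaussian RBF base kernel between two sample sets."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def smm_gram(dists_a, dists_b, gamma):
    """Gram matrix of empirical expected kernels (4) between two lists of sample sets."""
    G = np.zeros((len(dists_a), len(dists_b)))
    for i, X in enumerate(dists_a):
        for j, Z in enumerate(dists_b):
            G[i, j] = rbf_matrix(X, Z, gamma).mean()
    return G

# toy data: each training example is a bag of samples from one distribution
rng = np.random.default_rng(0)
dists = [rng.normal(loc, 1.0, size=(30, 2)) for loc in (0, 0, 0, 2, 2, 2)]
y = np.array([-1, -1, -1, 1, 1, 1])

K = smm_gram(dists, dists, gamma=0.5)
grid = GridSearchCV(SVC(kernel="precomputed"),
                    {"C": [2.0 ** p for p in range(-3, 4)]}, cv=3)
grid.fit(K, y)
print(grid.best_params_, grid.score(K, y))
```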

71 The average accuracy and ±1 standard deviation for all kernel combinations over 30 repetitions are reported in Table 2. [sent-282, score-0.263]

72 Moreover, we also investigate the sensitivity to kernel parameters for two kernel combinations: RBF-RBF and POLY-RBF. [sent-283, score-0.458]

73 Figure 1b depicts the accuracy values and average accuracies for considered kernel functions. [sent-295, score-0.275]

74 Table 2 indicates that both embedding and level-2 kernels are important for the performance of the classifier. [sent-296, score-0.39]

75 The embedding kernels tend to have more impact on the predictive performance compared to the level-2 kernels. [sent-297, score-0.39]

76 2 Handwritten digit recognition. In this section, the proposed framework is applied to distributions over equivalence classes of images that are invariant to basic transformations, namely, scaling, translation, and rotation. [sent-300, score-0.299]

77 For each image, the virtual examples are obtained by sampling parameter values from the distribution and applying the transformation accordingly. [sent-307, score-0.186]

78 The former consists of classifying digit 1 against digit 8 and digit 3 against digit 4. [sent-309, score-0.423]

79 The latter considers classifying digit 3 against digit 8 and digit 6 against digit 9. [sent-310, score-0.423]

80 Then, for each example in the initial dataset, we generate 10, 20, and 30 virtual examples using the aforementioned transformations to construct virtual data sets consisting of 2,000, 4,000, and 6,000 examples, respectively. [sent-312, score-0.33]

81 The original examples are excluded from the virtual datasets. [sent-314, score-0.186]

82 The virtual examples are normalized such that their feature values are in [0, 1]. [sent-315, score-0.186]

83 We compare the SVM on the initial dataset, the ASVM on the virtual datasets, and the SMM. [sent-317, score-0.144]

84 For SVM and ASVM, the Gaussian RBF kernel is used. [sent-318, score-0.229]

85 For SMM, we employ the empirical kernel (4) with Gaussian RBF kernel as a base kernel. [sent-319, score-0.458]

86 In most cases, the SMM outperforms both the SVM and the ASVM as the number of virtual examples increases. [sent-327, score-0.186]

87 While the reported results were obtained using virtual examples with Gaussian parameter distributions (Sec. [sent-329, score-0.298]

88 3 Natural scene categorization. This section illustrates the benefits of the nonlinear kernels between distributions for learning natural scene categories, in which the bag-of-word (BoW) representation is used to represent images in the dataset. [sent-335, score-0.555]

89 Standard BoW representations encode each image as a histogram that enumerates the occurrence probability of local patches detected in the image w. [sent-337, score-0.172]

90 Then, each patch in an image is mapped to a codeword and the image can be represented by the histogram of the codewords. [sent-354, score-0.162]

91 For SMM, we use the empirical embedding kernel with a Gaussian RBF base kernel k: $K(h_i, h_j) = \sum_{r=1}^{M} \sum_{s=1}^{M} h_i(c_r) \, h_j(c_s) \, k(c_r, c_s)$, where $h_i$ is the histogram of the i-th image and $c_r$ is the r-th SIFT vector. [sent-357, score-0.904]
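
A compact way to compute this embedding kernel for all image pairs at once (a sketch assuming histograms are stored row-wise in H and codewords row-wise in C; the SIFT-like dimensions and the bandwidth are placeholders) is the quadratic form H·Kc·Hᵀ, where Kc is the base-kernel matrix over codewords:

```python
import numpy as np

def codeword_rbf(C, gamma):
    """Base kernel matrix k(c_r, c_s) over the M codewords (rows of C)."""
    d2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def bow_embedding_kernel(H, C, gamma):
    """K(h_i, h_j) = sum_{r,s} h_i(c_r) h_j(c_s) k(c_r, c_s), for all image pairs."""
    Kc = codeword_rbf(C, gamma)
    return H @ Kc @ H.T

# toy usage: 4 images, M = 16 codewords of dimension 128 (SIFT-like)
rng = np.random.default_rng(0)
C = rng.standard_normal((16, 128))
H = rng.random((4, 16))
H /= H.sum(axis=1, keepdims=True)                    # normalize rows to histograms
print(bow_embedding_kernel(H, C, gamma=0.01).shape)  # -> (4, 4) Gram matrix
```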

92 A Gaussian RBF kernel is also used as the level-2 kernel for nonlinear SMM. [sent-358, score-0.514]

93 For the SVM, we adopt a Gaussian RBF kernel with χ2 -distance between the histograms [21], i. [sent-359, score-0.255]

94 $K(h_i, h_j) = \exp(-\gamma \chi^2(h_i, h_j))$, where $\chi^2(h_i, h_j) = \sum_{r=1}^{M} \frac{(h_i(c_r) - h_j(c_r))^2}{h_i(c_r) + h_j(c_r)}$. [sent-361, score-0.245]
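
For the SVM baseline, a minimal sketch of the χ²-distance RBF kernel above (the small eps guards against empty histogram bins and is an added safeguard, not part of the stated formula):

```python
import numpy as np

def chi2_rbf_kernel(H, gamma, eps=1e-12):
    """K(h_i, h_j) = exp(-gamma * chi2(h_i, h_j)) for all pairs of rows of H."""
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + eps
    chi2 = (num / den).sum(-1)
    return np.exp(-gamma * chi2)

# usage: K = chi2_rbf_kernel(H, gamma=0.5) for an (n_images, M) histogram matrix H
```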

95 For NLSMM, we use the best γ of LSMM in the base kernel and perform 10-CV to choose γ parameter only for the level-2 kernel. [sent-368, score-0.229]

96 A family of linear and nonlinear kernels on distributions allows one to flexibly choose the kernel function that is suitable for the problems at hand. [sent-375, score-0.657]

97 Our analyses provide insights into the relations between distribution-based methods and traditional sample-based methods, particularly the flexible SVM that allows the SVM to place different kernels on each training example. [sent-376, score-0.359]

98 A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. [sent-409, score-0.229]

99 On the influence of the kernel on the consistency of support vector machines. [sent-473, score-0.229]

100 Expected kernel for missing features in support vector machines. [sent-509, score-0.254]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('smm', 0.595), ('kernels', 0.26), ('asvm', 0.255), ('rbf', 0.249), ('kernel', 0.229), ('svm', 0.188), ('pi', 0.162), ('virtual', 0.144), ('embedding', 0.13), ('distributions', 0.112), ('digit', 0.098), ('mj', 0.098), ('socp', 0.098), ('gmmk', 0.085), ('hilbertian', 0.085), ('dp', 0.084), ('hi', 0.08), ('rkhs', 0.072), ('representer', 0.069), ('functional', 0.068), ('ep', 0.066), ('lsmm', 0.064), ('nlsmm', 0.064), ('fukumizu', 0.059), ('mi', 0.058), ('vectorial', 0.056), ('ppk', 0.056), ('nonlinear', 0.056), ('hj', 0.055), ('reproducing', 0.055), ('injective', 0.052), ('yi', 0.051), ('hilbert', 0.051), ('mt', 0.051), ('images', 0.049), ('gaussian', 0.049), ('bow', 0.049), ('risk', 0.046), ('accuracies', 0.046), ('universal', 0.045), ('bingen', 0.044), ('mpi', 0.044), ('kg', 0.044), ('training', 0.044), ('embeddings', 0.043), ('crr', 0.042), ('epm', 0.042), ('krikamol', 0.042), ('nonextensive', 0.042), ('remp', 0.042), ('smms', 0.042), ('patches', 0.042), ('examples', 0.042), ('ex', 0.042), ('equivalence', 0.04), ('microarray', 0.04), ('scene', 0.039), ('poly', 0.039), ('map', 0.038), ('plsa', 0.038), ('gretton', 0.037), ('polynomial', 0.037), ('cr', 0.036), ('ym', 0.035), ('image', 0.035), ('sch', 0.035), ('codeword', 0.035), ('dinuzzo', 0.035), ('trained', 0.034), ('combinations', 0.034), ('exible', 0.034), ('degree', 0.033), ('vs', 0.032), ('bandwidth', 0.032), ('digits', 0.032), ('firstly', 0.031), ('xij', 0.031), ('classifying', 0.031), ('histogram', 0.03), ('probability', 0.03), ('translation', 0.03), ('xm', 0.03), ('places', 0.03), ('codewords', 0.03), ('dq', 0.03), ('handwritten', 0.029), ('sift', 0.029), ('bhattacharyya', 0.028), ('supx', 0.028), ('insights', 0.028), ('qm', 0.028), ('patch', 0.027), ('analyses', 0.027), ('adopt', 0.026), ('endowed', 0.026), ('lkopf', 0.026), ('metrics', 0.026), ('lipschitz', 0.025), ('synthetic', 0.025), ('missing', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

2 0.21166502 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function $Q \succeq 0$, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values $10^{-3}$). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

3 0.19106048 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

4 0.18488675 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

5 0.14061581 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an $\ell_r$-norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

6 0.13024627 168 nips-2012-Kernel Latent SVM for Visual Recognition

7 0.12184626 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

8 0.10370784 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

9 0.10284473 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

10 0.09243083 227 nips-2012-Multiclass Learning with Simplex Coding

11 0.092394054 361 nips-2012-Volume Regularization for Binary Classification

12 0.085468367 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

13 0.084707782 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

14 0.082539208 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

15 0.082309805 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

16 0.081649467 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

17 0.076546095 330 nips-2012-Supervised Learning with Similarity Functions

18 0.073275536 167 nips-2012-Kernel Hyperalignment

19 0.07299424 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

20 0.071517758 247 nips-2012-Nonparametric Reduced Rank Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, 0.054), (2, -0.059), (3, -0.06), (4, 0.104), (5, -0.036), (6, 0.002), (7, 0.149), (8, -0.032), (9, -0.129), (10, 0.023), (11, 0.04), (12, 0.139), (13, -0.033), (14, 0.078), (15, -0.176), (16, -0.011), (17, 0.035), (18, -0.033), (19, -0.076), (20, 0.011), (21, -0.106), (22, 0.054), (23, -0.199), (24, -0.012), (25, 0.112), (26, -0.046), (27, -0.093), (28, -0.075), (29, 0.045), (30, -0.107), (31, -0.086), (32, -0.054), (33, 0.033), (34, 0.102), (35, -0.043), (36, -0.015), (37, -0.064), (38, 0.068), (39, 0.026), (40, 0.039), (41, 0.079), (42, 0.011), (43, 0.047), (44, -0.051), (45, -0.073), (46, 0.079), (47, -0.101), (48, -0.007), (49, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95753217 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

2 0.86541522 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson

Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function $Q \succeq 0$, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values $10^{-3}$). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1

3 0.80756027 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

4 0.78514946 231 nips-2012-Multiple Operator-valued Kernel Learning

Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach

Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an $\ell_r$-norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1

5 0.72451478 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

6 0.69823229 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

7 0.6864115 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

8 0.65720743 168 nips-2012-Kernel Latent SVM for Visual Recognition

9 0.65138304 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

10 0.60942662 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

11 0.57487464 197 nips-2012-Learning with Recursive Perceptual Representations

12 0.5743379 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

13 0.56033742 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

14 0.54211968 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

15 0.52711827 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

16 0.52690607 330 nips-2012-Supervised Learning with Similarity Functions

17 0.50734192 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

18 0.48298284 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

19 0.45996821 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

20 0.45071182 361 nips-2012-Volume Regularization for Binary Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.047), (1, 0.181), (21, 0.028), (38, 0.103), (42, 0.053), (54, 0.027), (55, 0.072), (74, 0.075), (76, 0.15), (80, 0.098), (84, 0.012), (92, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90829921 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

Author: Jenna Wiens, Eric Horvitz, John V. Guttag

Abstract: A patient’s risk for adverse events is affected by temporal processes including the nature and timing of diagnostic and therapeutic activities, and the overall evolution of the patient’s pathophysiology over time. Yet many investigators ignore this temporal aspect when modeling patient outcomes, considering only the patient’s current or aggregate state. In this paper, we represent patient risk as a time series. In doing so, patient risk stratification becomes a time-series classification task. The task differs from most applications of time-series analysis, like speech processing, since the time series itself must first be extracted. Thus, we begin by defining and extracting approximate risk processes, the evolving approximate daily risk of a patient. Once obtained, we use these signals to explore different approaches to time-series classification with the goal of identifying high-risk patterns. We apply the classification to the specific task of identifying patients at risk of testing positive for hospital acquired Clostridium difficile. We achieve an area under the receiver operating characteristic curve of 0.79 on a held-out set of several hundred patients. Our two-stage approach to risk stratification outperforms classifiers that consider only a patient’s current state (p<0.05). 1

2 0.88291848 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

Author: Ko-jen Hsiao, Kevin Xu, Jeff Calder, Alfred O. Hero

Abstract: We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria using a linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria. 1

same-paper 3 0.8451649 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

4 0.83621347 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

Author: Ulugbek Kamilov, Sundeep Rangan, Michael Unser, Alyson K. Fletcher

Abstract: We consider the estimation of an i.i.d. vector x ∈ Rn from measurements y ∈ Rm obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (Adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector x. Our method can be applied to a large class of learning problems including the learning of sparse priors in compressed sensing or identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that for large i.i.d. Gaussian transform matrices the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees. 1

5 0.80282301 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Author: Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric

Abstract: We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms. 1

6 0.79548103 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

7 0.78039545 34 nips-2012-Active Learning of Multi-Index Function Models

8 0.77148116 215 nips-2012-Minimizing Uncertainty in Pipelines

9 0.7697736 197 nips-2012-Learning with Recursive Perceptual Representations

10 0.76785612 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

11 0.76653314 210 nips-2012-Memorability of Image Regions

12 0.76597214 168 nips-2012-Kernel Latent SVM for Visual Recognition

13 0.76532888 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

14 0.76370937 193 nips-2012-Learning to Align from Scratch

15 0.76104218 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

16 0.75972533 298 nips-2012-Scalable Inference of Overlapping Communities

17 0.75748992 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.75652832 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

19 0.75532955 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

20 0.75525779 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model