nips nips2008 nips2008-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. [sent-3, score-0.505]
2 In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. [sent-5, score-0.149]
3 This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance. [sent-7, score-0.492]
4 1 Introduction In the last two decades, kernel methods have been a prolific theoretical and algorithmic machine learning framework. [sent-8, score-0.26]
5 By using appropriate regularization by Hilbertian norms, representer theorems make it possible to consider large and potentially infinite-dimensional feature spaces while working within an implicit feature space no larger than the number of observations. [sent-9, score-0.266]
6 This has led to numerous works on kernel design adapted to specific data types and generic kernel-based algorithms for many learning tasks (see, e. [sent-10, score-0.303]
7 While early work has focused on efficient algorithms to solve the convex optimization problems, recent research has looked at the model selection properties and predictive performance of such methods, in the linear case [3] or within the multiple kernel learning framework (see, e. [sent-14, score-0.381]
8 Indeed, feature spaces are large and we expect the estimated predictor function to require only a small number of features, which is exactly the situation where ℓ1 -norms have proven advantageous. [sent-18, score-0.242]
9 This leads to two natural questions that we try to answer in this paper: (1) Is it feasible to perform optimization in this very large feature space with a cost that is polynomial in the size of the input space? [sent-19, score-0.298]
10 More precisely, we consider a positive definite kernel that can be expressed as a large sum of positive definite basis or local kernels. [sent-21, score-0.298]
11 This exactly corresponds to the situation where a large feature space is the concatenation of smaller feature spaces, and we aim to do selection among these many kernels, which may be done through multiple kernel learning. [sent-22, score-0.546]
12 One major difficulty however is that the number of these smaller kernels is usually exponential in the dimension of the input space and applying multiple kernel learning directly in this decomposition would be intractable. [sent-23, score-0.715]
13 In order to perform selection efficiently, we make the extra assumption that these small kernels can be embedded in a directed acyclic graph (DAG). [sent-24, score-0.598]
14 Finally, we extend in Section 4 some of the known consistency results of the Lasso and multiple kernel learning [3, 4], and give a partial answer to the model selection capabilities of our regularization framework by giving necessary and sufficient conditions for model consistency. [sent-28, score-0.476]
15 In particular, we show that our framework is adapted to estimating consistently only the hull of the relevant variables. [sent-29, score-0.43]
16 2 Hierarchical multiple kernel learning (HKL) We consider the problem of predicting a random variable Y ∈ Y ⊂ R from a random variable X ∈ X , where X and Y may be quite general spaces. [sent-31, score-0.305]
17 1 Graph-structured positive definite kernels We assume that we are given a positive definite kernel k : X × X → R, and that this kernel can be expressed as the sum, over an index set V , of basis kernels kv , v ∈ V , i. [sent-45, score-1.405]
18 e., for all x, x′ ∈ X, k(x, x′) = Σ_{v∈V} k_v(x, x′). [sent-46, score-0.161]
19 For each v ∈ V , we denote by Fv and Φv the feature space and feature map of kv , i. [sent-47, score-0.291]
20 e., for all x, x′ ∈ X, k_v(x, x′) = ⟨Φ_v(x), Φ_v(x′)⟩. [sent-49, score-0.161]
21 Our sum assumption corresponds to a situation where the feature map Φ(x) and the feature space F for k are the concatenation of the feature maps Φ_v(x) for each kernel k_v, i. [sent-51, score-0.721]
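As an illustration of the sum assumption (my own sketch, not code from the paper), the snippet below builds the Gram matrix of k as the sum of the Gram matrices of the basis kernels k_v; the two toy basis kernels are hypothetical choices used only for the example.

import numpy as np

def gram_matrix(kernel, X):
    # Gram matrix K[i, j] = kernel(X[i], X[j]) for a single basis kernel k_v.
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def summed_gram(basis_kernels, X):
    # Gram matrix of the full kernel k(x, x') = sum over v of k_v(x, x').
    return sum(gram_matrix(kv, X) for kv in basis_kernels)

# Toy example with two hypothetical basis kernels on R^2.
X = np.random.randn(5, 2)
basis = [lambda a, b: 1.0,               # constant kernel
         lambda a, b: float(a @ b)]      # linear kernel
K = summed_gram(basis, X)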
22 As mentioned earlier, we make the assumption that the set V can be embedded into a directed acyclic graph. [sent-54, score-0.163]
23 Given a subset of nodes W ⊂ V , we can define the hull of W as the union of all ancestors of w ∈ W , i. [sent-63, score-0.407]
24 The goal of this paper is to perform kernel selection among the kernels kv , v ∈ V . [sent-68, score-0.827]
25 Namely, instead of considering all possible subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these relevant vertices; in Section 2. [sent-70, score-0.378]
26 2, we design a specific sparsity-inducing norm adapted to hulls. [sent-71, score-0.15]
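A minimal sketch (my own illustration; the DAG encoding as a parent map is an assumption) of the hull of a set W, computed as the union of the ancestors of its elements, with each node counted among its own ancestors as in the paper's A(w):

def hull(W, parents):
    # parents[v] lists the parents of node v in the DAG; the hull of W is the
    # union over w in W of all ancestors of w (including w itself).
    result, stack = set(), list(W)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(parents.get(v, ()))
    return result

# Toy chain 0 -> 1 -> 2 (edges point from ancestor to descendant).
parents = {0: [], 1: [0], 2: [1]}
print(hull({2}, parents))  # {0, 1, 2}: selecting node 2 pulls in all its ancestors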
27 In this paper, we primarily focus on kernels that can be expressed as “products of sums”, and on the associated p-dimensional directed grids, while noting that our framework is applicable to many other kernels. [sent-72, score-0.473]
28 Namely, we assume that the input space X factorizes into p components X = X_1 × · · · × X_p and that we are given p sequences of length q + 1 of kernels k_{ij}(x_i, x_i′), i ∈ {1, . . . , p}, j ∈ {0, . . . , q}. [sent-73, score-0.362]
29 (Middle) Example of sparsity pattern (× in light blue) and the complement of its hull (+ in light red). [sent-78, score-0.426]
30 2, this DAG will correspond to the constraint of selecting a given product of kernels only after all the subproducts are selected. [sent-105, score-0.362]
31 Those DAGs are especially suited to nonlinear variable selection, in particular with the polynomial and Gaussian kernels. [sent-106, score-0.16]
32 In this context, products of kernels correspond to interactions between certain variables, and our DAG implies that we select an interaction only after all sub-interactions were already selected. [sent-107, score-0.362]
33 Polynomial kernels We consider X_i = R, k_{ij}(x_i, x_i′) = (q choose j) (x_i x_i′)^j; the full kernel is then equal to k(x, x′) = ∏_{i=1}^p Σ_{j=0}^q (q choose j) (x_i x_i′)^j = ∏_{i=1}^p (1 + x_i x_i′)^q. [sent-108, score-0.676]
34 Note that this is not exactly the usual polynomial kernel (whose feature space is the space of multivariate polynomials of total degree less than q), since our kernel considers polynomials of maximal degree q. [sent-109, score-0.892]
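The identity behind this decomposition is the binomial theorem; the sketch below (an illustration, not the paper's code) checks numerically that the product of binomial sums equals ∏_i (1 + x_i x_i′)^q:

import numpy as np
from math import comb

def poly_kernel_decomposed(x, xp, q):
    # prod_i sum_{j=0}^{q} C(q, j) (x_i x_i')^j
    return np.prod([sum(comb(q, j) * (xi * xpi) ** j for j in range(q + 1))
                    for xi, xpi in zip(x, xp)])

def poly_kernel_closed_form(x, xp, q):
    # prod_i (1 + x_i x_i')^q
    return float(np.prod((1.0 + x * xp) ** q))

x, xp, q = np.random.randn(3), np.random.randn(3), 4
assert np.isclose(poly_kernel_decomposed(x, xp, q), poly_kernel_closed_form(x, xp, q))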
35 Gaussian kernels We also consider X_i = R, and the Gaussian-RBF kernel e^{−b(x−x′)^2}. [sent-110, score-0.622]
36 The following decomposition is the eigendecomposition of the non-centered covariance operator for a normal distribution with variance 1/4a (see, e. [sent-111, score-0.137]
37 The decomposition ends up being close to a polynomial kernel of infinite degree, modulated by an exponential [2]. [sent-117, score-0.468]
38 One may also use an adaptive decomposition using kernel PCA (see, e. [sent-118, score-0.308]
39 ANOVA kernels When q = 1, the directed grid is isomorphic to the power set (i. [sent-122, score-0.44]
40 In this setting, we can decompose the ANOVA kernel [2] as ∏_{i=1}^p (1 + e^{−b(x_i − x_i′)^2}) = Σ_{J⊂{1,...,p}} ∏_{i∈J} e^{−b(x_i − x_i′)^2} = Σ_{J⊂{1,...,p}} e^{−b ||x_J − x_J′||_2^2}, and our framework performs the selection among these subsets J. [sent-125, score-0.26]
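A small numerical check of this q = 1 decomposition (an illustration under the reconstruction above, not the paper's code): the product over dimensions equals the sum over all subsets J of Gaussian kernels restricted to the variables in J, with the empty set contributing 1.

import numpy as np
from itertools import combinations

def anova_closed_form(x, xp, b):
    # prod_i (1 + exp(-b (x_i - x_i')^2))
    return float(np.prod(1.0 + np.exp(-b * (x - xp) ** 2)))

def anova_subset_sum(x, xp, b):
    # sum over subsets J of exp(-b ||x_J - x_J'||_2^2)
    p = len(x)
    total = 0.0
    for r in range(p + 1):
        for J in combinations(range(p), r):
            idx = list(J)
            total += float(np.exp(-b * np.sum((x[idx] - xp[idx]) ** 2)))
    return total

x, xp, b = np.random.randn(4), np.random.randn(4), 0.5
assert np.isclose(anova_closed_form(x, xp, b), anova_subset_sum(x, xp, b))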
41 , we are given a kernel (and thus a feature space) and we explore it using ℓ1 -norms. [sent-135, score-0.325]
42 , we have a large structured set of features that we try to select from; however, the techniques developed in this paper assume that (a) each feature might be infinite-dimensional and (b) that we can sum all the local kernels efficiently (see in particular Section 3. [sent-138, score-0.506]
43 Following the kernel view thus seems slightly more natural. [sent-140, score-0.26]
44 Penalizing with this norm is efficient because summing all kernels k_v is assumed feasible in polynomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse solutions, in which many β_v would be exactly equal to zero. [sent-143, score-1.111]
45 As said earlier, we are only interested in the hull of the selected elements βv ∈ Fv , v ∈ V ; the hull of a set I is characterized by the set of v, such that D(v) ⊂ I c , i. [sent-144, score-0.714]
46 We thus consider the following structured block ℓ1-norm, defined as Σ_{v∈V} d_v ||β_{D(v)}|| = Σ_{v∈V} d_v (Σ_{w∈D(v)} ||β_w||^2)^{1/2}, where (d_v)_{v∈V} are positive weights. [sent-149, score-0.474]
47 We thus consider the following minimization problem: min_{β ∈ ∏_{v∈V} F_v} (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{v∈V} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) (Σ_{v∈V} d_v ||β_{D(v)}||)^2. (1) [sent-151, score-0.237]
48 Our Hilbertian norm is a Hilbert space instantiation of the hierarchical norms recently introduced by [5] and also considered by [7] in the MKL setting. [sent-152, score-0.203]
49 If all Hilbert spaces are finite dimensional, our particular choice of norms corresponds to an “ℓ1 -norm of ℓ2 -norms”. [sent-153, score-0.185]
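A minimal sketch (assumed data layout, not the paper's code) of this "ℓ1-norm of ℓ2-norms" in the finite-dimensional case, where descendants[v] is assumed to contain v itself, matching the paper's D(v):

import numpy as np

def block_l1_norm(beta, d, descendants):
    # Omega(beta) = sum_v d_v * sqrt( sum_{w in D(v)} ||beta_w||^2 )
    return sum(d[v] * np.sqrt(sum(float(np.dot(beta[w], beta[w])) for w in descendants[v]))
               for v in beta)

# Toy chain 0 -> 1 -> 2 with hypothetical weights, for illustration only.
beta = {0: np.array([1.0]), 1: np.array([0.0]), 2: np.array([0.5])}
d = {0: 1.0, 1: 2.0, 2: 4.0}
descendants = {0: [0, 1, 2], 1: [1, 2], 2: [2]}
print(block_l1_norm(beta, d, descendants))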
50 In Section 3, we propose a novel algorithm to solve the associated optimization problem in time polynomial in the number of selected groups/kernels, for all group sizes, DAGs and losses. [sent-155, score-0.261]
51 (1) consistently estimates the hull of the sparsity pattern. [sent-157, score-0.432]
52 Finally, note that in certain settings (finite dimensional Hilbert spaces and distributions with absolutely continuous densities), these norms have the effect of selecting a given kernel only after all of its ancestors [5]. [sent-158, score-0.513]
53 (1), as well as optimization algorithms with polynomial time complexity in the number of selected kernels. [sent-161, score-0.26]
54 In simulations we consider total numbers of kernels larger than 10^30, and thus such efficient algorithms are essential to the success of hierarchical multiple kernel learning (HKL). [sent-162, score-0.767]
55 1 Reformulation in terms of multiple kernel learning Following [8, 9], we can simply derive an equivalent formulation of Eq. [sent-164, score-0.305]
56 Using the Cauchy-Schwarz inequality, we have that for all η ∈ R^V such that η ≥ 0 and Σ_{v∈V} d_v^2 η_v ≤ 1, (Σ_{v∈V} d_v ||β_{D(v)}||)^2 ≤ Σ_{v∈V} η_v^{-1} ||β_{D(v)}||^2 = Σ_{w∈V} (Σ_{v∈A(w)} η_v^{-1}) ||β_w||^2, with equality if and only if η_v = d_v^{-1} ||β_{D(v)}|| (Σ_{v∈V} d_v ||β_{D(v)}||)^{-1}. [sent-166, score-0.474]
57 , weights of descendant kernels are smaller, which is consistent with the known fact that kernels should always be selected after all their ancestors. [sent-173, score-0.798]
58 Moreover, the solution is then β_w = ζ_w Σ_{i=1}^n α_i Φ_w(x_i), where α ∈ R^n are the dual parameters associated with the single kernel learning problem. [sent-179, score-0.323]
59 (1), with ∀w, β_w = ζ_w Σ_{i=1}^n α_i Φ_w(x_i), if and only if (a) given η, α is optimal for the single kernel learning problem with kernel matrix K = Σ_{w∈V} ζ_w(η) K_w, and (b) given α, η ∈ H maximizes Σ_{w∈V} (Σ_{v∈A(w)} η_v^{-1})^{-1} α^⊤ K_w α. [sent-182, score-0.52]
60 Moreover, the total duality gap can be upper bounded as the sum of the two separate duality gaps for the two optimization problems, which will be useful in Section 3. [sent-183, score-0.235]
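A minimal sketch (my own notation, under the reconstruction of the optimality conditions above) of the kernel weights ζ_w(η) = (Σ_{v∈A(w)} η_v^{-1})^{-1} and of the effective kernel matrix K = Σ_w ζ_w(η) K_w used in condition (a); the toy Gram matrices are placeholders for illustration.

import numpy as np

def zeta(eta, ancestors):
    # zeta_w(eta) = 1 / sum_{v in A(w)} 1/eta_v, with A(w) assumed to contain w.
    return {w: 1.0 / sum(1.0 / eta[v] for v in A) for w, A in ancestors.items()}

def effective_kernel(eta, gram, ancestors):
    # K(eta) = sum_w zeta_w(eta) * K_w
    z = zeta(eta, ancestors)
    return sum(z[w] * gram[w] for w in gram)

# Toy two-node chain 0 -> 1 with simple Gram matrices.
n = 4
gram = {0: np.eye(n), 1: np.ones((n, n))}
ancestors = {0: [0], 1: [0, 1]}
eta = {0: 0.7, 1: 0.3}
K = effective_kernel(eta, gram, ancestors)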
61 Note that in the case of “flat” regular multiple kernel learning, where the DAG has no edges, we recover the usual optimality conditions [8, 9]. [sent-185, score-0.412]
62 In the next section, we show that this can be done in polynomial time although the number of kernels to consider leaving out is exponential (Section 3. [sent-189, score-0.522]
63 2 Conditions for global optimality of the reduced problem We denote by J the complement of the set of norms which are set to zero. [sent-192, score-0.218]
64 We thus consider the optimal solution β of the reduced problem (on J), namely min_{β_J ∈ ∏_{v∈J} F_v} (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{v∈J} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) (Σ_{v∈V} d_v ||β_{D(v)∩J}||)^2, (3) with optimal primal variables β_J, dual variables α and optimal pair (η_J, ζ_J). [sent-193, score-0.302]
65 We now consider necessary conditions and sufficient conditions for this solution (augmented with zeros for non-active variables, i. [sent-194, score-0.19]
66 We denote by δ = Σ_{v∈J} d_v ||β_{D(v)∩J}|| the optimal value of the norm for the reduced problem. [sent-198, score-0.311]
67 (1) and all kernels in the extreme points of J are active, then we have max_{t∈sources(J^c)} α^⊤ K_t α / d_t^2 ≤ δ^2. [sent-200, score-0.362]
68 Proposition 3 (S_{J,ε}) If max_{t∈sources(J^c)} Σ_{w∈D(t)} α^⊤ K_w α / (Σ_{v∈A(w)∩D(t)} d_v)^2 ≤ δ^2 + ε/λ, then the total duality gap is less than ε. [sent-201, score-0.348]
69 The proof is fairly technical and can be found in [10]; this result constitutes the main technical contribution of the paper: it essentially makes it possible to solve a very large optimization problem over exponentially many dimensions in polynomial time. [sent-202, score-0.192]
70 However, the sufficient condition (S_{J,ε}) requires summing over all descendants of the active kernels, which is impossible in practice (as shown in Section 5, we consider V of cardinality often greater than 10^30). [sent-204, score-0.133]
71 Here, we need to bring to bear the specific structure of the kernel k. [sent-205, score-0.323]
72 In the context of the directed grids we consider in this paper, if d_v can also be decomposed as a product, then Σ_{v∈A(w)∩D(t)} d_v also factorizes, and we can compute the sum over all v ∈ D(t) in linear time in p. [sent-206, score-0.638]
73 Moreover, we can cache the sums Σ_{w∈D(t)} K_w / (Σ_{v∈A(w)∩D(t)} d_v)^2 in order to save running time. [sent-207, score-0.237]
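The sketch below is my own illustration of how such a factorization could be organized for a p-dimensional grid, under two assumptions not spelled out in this extract: K_w is the elementwise (Hadamard) product of per-dimension matrices K_{i,w_i}, and d_v = ∏_i a^{v_i}; the quantity needed in Proposition 3 is then α^⊤ M α with M computed in time linear in p.

import numpy as np
from functools import reduce

def descendant_sum(base_grams, t, a):
    # base_grams[i][j] is the (n x n) matrix K_{i,j} for j = 0..q; t gives the grid
    # coordinates of the source node. Returns
    # M = sum_{w >= t} K_w / (sum_{t <= v <= w} d_v)^2, exploiting the factorization
    # over dimensions of both the Hadamard product K_w and the weight sum.
    n = base_grams[0][0].shape[0]
    factors = []
    for i, ti in enumerate(t):
        acc, weight = np.zeros((n, n)), 0.0
        for j in range(ti, len(base_grams[i])):
            weight += a ** j              # running sum_{l=t_i}^{j} a^l
            acc += base_grams[i][j] / weight ** 2
        factors.append(acc)
    return reduce(np.multiply, factors)   # elementwise product over the p dimensions

# Usage: the condition value for source t is alpha @ descendant_sum(base_grams, t, a) @ alpha.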
74 3 Dual optimization for reduced or small problems When kernels kv , v ∈ V have low-dimensional feature spaces, we may use a primal representation and solve the problem in Eq. [sent-209, score-0.655]
75 Namely, we use the same technique as [8]: we consider, for ζ ∈ Z, the function B(ζ) = min_{β ∈ ∏_{v∈V} F_v} (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{v∈V} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) Σ_{w∈V} ζ_w^{-1} ||β_w||^2, which is the optimal value of the single kernel learning problem with kernel matrix Σ_{w∈V} ζ_w K_w. [sent-214, score-0.52]
76 , positive diagonal) is added to the kernel matrices, the function B is differentiable [8]. [sent-219, score-0.298]
77 The overall complexity of the algorithm is then proportional to O(|V|n^2)—to form the kernel matrices—plus the complexity of solving a single kernel learning problem—typically between O(n^2) and O(n^3). [sent-223, score-0.584]
78 Note that the kernel matrices are never all needed explicitly, i. [sent-227, score-0.26]
79 • Input: kernel matrices K_v ∈ R^{n×n}, v ∈ V, maximal gap ε, maximal number of kernels Q • Algorithm: active-set loop on the reduced problem of Eq. (3) (steps abridged in this extract) • Output: J, α, η [sent-231, score-0.781]
80 The previous algorithm will stop either when the duality gap is less than ε or when the maximal number of kernels Q has been reached. [sent-236, score-0.524]
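A high-level sketch of the active-set strategy summarized by this input/output description; the four callables are placeholders for the paper's subroutines (listing the DAG sources, listing the sources of the complement of J, solving the reduced problem of Eq. (3), and checking the sufficient condition (S_{J,ε})), none of which are defined in this extract.

def hkl_active_set(sources, sources_of_complement, solve_reduced, condition_holds, Q):
    # Start from the sources of the DAG (the smallest possible hull) and grow J
    # by adding violating kernels from sources(J^c) until the sufficient condition
    # holds everywhere or the kernel budget Q is reached.
    J = set(sources)
    while True:
        alpha, eta = solve_reduced(J)                    # reduced problem, Eq. (3)
        violators = [t for t in sources_of_complement(J)
                     if not condition_holds(t, alpha)]   # check (S_{J, eps}) per source
        if not violators or len(J) >= Q:
            return J, alpha, eta
        J.update(violators)                              # grow the active set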
81 In practice, when the weights dv increase with the depth of v in the DAG (which we use in simulations), the small duality gap generally occurs before we reach a problem larger than Q. [sent-237, score-0.348]
82 In order to obtain a polynomial complexity, the maximal out-degree of the DAG (i. [sent-239, score-0.211]
83 , the maximal number of children of any given node) should be polynomial as well. [sent-241, score-0.211]
84 (1) will be equal to its hull, and thus we can only hope to obtain consistency of the hull of the pattern, which we consider in this section. [sent-244, score-0.377]
85 Following [4], we make the following assumptions on the underlying joint distribution of (X, Y): (a) the joint covariance matrix Σ of (Φ_v(x))_{v∈V} (defined with appropriate blocks of size f_v × f_w) is invertible, (b) E(Y |X) = Σ_{w∈W} ⟨β_w, Φ_w(x)⟩ with W ⊂ V and var(Y |X) = σ^2 > 0 almost surely. [sent-251, score-0.268]
86 With these simple assumptions, we obtain (see proof in [10]): Proposition 4 (Sufficient condition) If max_{t∈sources(W^c)} Σ_{w∈D(t)} ||Σ_{wW} Σ_{WW}^{-1} Diag(d_v ||β_{D(v)}||^{-1})_{v∈W} β_W||^2 / (Σ_{v∈A(w)∩D(t)} d_v)^2 < 1, then β and the hull of W are consistently estimated when λ_n n^{1/2} → ∞ and λ_n → 0. [sent-252, score-0.624]
87 Proposition 5 (Necessary condition) If β and the hull of W are consistently estimated for some sequence λ_n, then max_{t∈sources(W^c)} ||Σ_{tW} Σ_{WW}^{-1} Diag(d_v / ||β_{D(v)}||)_{v∈W} β_W||^2 / d_t^2 ≤ 1. [sent-253, score-0.387]
88 Moreover, the last propositions show that we indeed can estimate the correct hull of the sparsity pattern if the sufficient condition is satisfied. [sent-255, score-0.42]
89 Figure 2 (curves: HKL, greedy, L2; x-axis: log2(p)): Comparison on synthetic examples: mean squared error over 40 replications (with halved standard deviations). [sent-258, score-0.129]
90 Finally, it is worth noting that if the ratios d_w / max_{v∈A(w)} d_v tend to infinity slowly with n, then we always consistently estimate the depth of the hull, i. [sent-464, score-0.285]
91 We are currently investigating extensions to the nonparametric case [4], in terms of pattern selection and universal consistency. [sent-467, score-0.133]
92 , after the input variables were transformed by a random rotation (in this situation, the generating polynomial is not sparse anymore). [sent-474, score-0.197]
93 We can see that in situations where the underlying predictor function is sparse (left), HKL outperforms the two other methods when the total number of variables p increases, while in the other situation where the best predictor is not sparse (right), it performs only slightly better: i. [sent-475, score-0.235]
94 , in non-sparse problems, ℓ1-norms do not really help, but do help a lot when sparsity is expected. [sent-477, score-0.171]
95 For all methods, the kernels were held fixed, while in Table 1, we report the performance for the best regularization parameters obtained by 10 random half splits. [sent-541, score-0.42]
96 We can see from Table 1 that HKL outperforms other methods, in particular for the datasets bank-32nm, bank-32nh, pumadyn-32nm, and pumadyn-32nh, which are dedicated to nonlinear regression. [sent-542, score-0.215]
97 6 Conclusion We have shown how to perform hierarchical multiple kernel learning (HKL) in polynomial time in the number of selected kernels. [sent-548, score-0.558]
98 This framework may be applied to many positive definite kernels and we have focused on polynomial and Gaussian kernels used for nonlinear variable selection. [sent-549, score-0.884]
99 We are currently investigating applications to string and graph kernels [2]. [sent-551, score-0.391]
100 Consistency of the group Lasso and multiple kernel learning. [sent-573, score-0.305]
wordName wordTfidf (topN-words)
[('kernels', 0.362), ('hull', 0.339), ('hkl', 0.311), ('kernel', 0.26), ('dv', 0.237), ('fv', 0.23), ('dag', 0.194), ('rbf', 0.175), ('kv', 0.161), ('polynomial', 0.16), ('kw', 0.153), ('hilbertian', 0.146), ('norms', 0.107), ('non', 0.089), ('sources', 0.085), ('rv', 0.085), ('jp', 0.084), ('spaces', 0.078), ('directed', 0.078), ('qv', 0.072), ('mkl', 0.072), ('dags', 0.072), ('ancestors', 0.068), ('feature', 0.065), ('ringnorm', 0.063), ('twonorm', 0.063), ('spambase', 0.063), ('datasets', 0.063), ('predictor', 0.062), ('greedy', 0.059), ('regularization', 0.058), ('maxt', 0.058), ('sj', 0.057), ('hierarchical', 0.057), ('gap', 0.057), ('descendants', 0.056), ('uci', 0.055), ('xi', 0.054), ('acyclic', 0.054), ('duality', 0.054), ('maximal', 0.051), ('polynomials', 0.048), ('grids', 0.048), ('proposition', 0.048), ('consistently', 0.048), ('kiji', 0.048), ('mushrooms', 0.048), ('decomposition', 0.048), ('rotated', 0.046), ('sparsity', 0.045), ('multiple', 0.045), ('selection', 0.044), ('simulations', 0.043), ('adapted', 0.043), ('jmlr', 0.042), ('regular', 0.042), ('ww', 0.042), ('penalizing', 0.042), ('complement', 0.042), ('try', 0.041), ('active', 0.039), ('vertices', 0.039), ('norm', 0.039), ('descendant', 0.038), ('fw', 0.038), ('replications', 0.038), ('sum', 0.038), ('differentiable', 0.038), ('consistency', 0.038), ('hilbert', 0.037), ('sparse', 0.037), ('situation', 0.037), ('selected', 0.036), ('caching', 0.036), ('hermite', 0.036), ('propositions', 0.036), ('hk', 0.036), ('reduced', 0.035), ('optimality', 0.034), ('anova', 0.034), ('abalone', 0.034), ('nj', 0.034), ('associated', 0.033), ('moreover', 0.033), ('anymore', 0.032), ('bring', 0.032), ('looking', 0.032), ('complexity', 0.032), ('optimization', 0.032), ('synthetic', 0.032), ('conditions', 0.031), ('embedded', 0.031), ('violating', 0.031), ('bear', 0.031), ('nite', 0.031), ('dual', 0.03), ('exploring', 0.03), ('concatenation', 0.03), ('summing', 0.029), ('graph', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
2 0.1831255 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
3 0.15658732 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh
Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.
4 0.15056784 44 nips-2008-Characteristic Kernels on Groups and Semigroups
Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1
5 0.14380032 143 nips-2008-Multi-label Multiple Kernel Learning
Author: Shuiwang Ji, Liang Sun, Rong Jin, Jieping Ye
Abstract: We present a multi-label multiple kernel learning (MKL) formulation in which the data are embedded into a low-dimensional space directed by the instancelabel correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, which can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained convex optimization problem. In addition, we show that the objective function of the approximate formulation is differentiable with Lipschitz continuous gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.
6 0.13582766 225 nips-2008-Supervised Bipartite Graph Inference
7 0.1324098 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
8 0.12645495 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
9 0.11283199 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
10 0.11096804 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
11 0.11059289 113 nips-2008-Kernelized Sorting
12 0.10596494 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
13 0.1046053 138 nips-2008-Modeling human function learning with Gaussian processes
14 0.087831236 112 nips-2008-Kernel Measures of Independence for non-iid Data
15 0.086910747 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
16 0.085501194 108 nips-2008-Integrating Locally Learned Causal Structures with Overlapping Variables
17 0.084545925 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
18 0.083847187 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
19 0.083794639 194 nips-2008-Regularized Learning with Networks of Features
20 0.082386382 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
topicId topicWeight
[(0, -0.241), (1, -0.081), (2, -0.118), (3, 0.116), (4, 0.12), (5, -0.04), (6, 0.008), (7, -0.042), (8, 0.032), (9, -0.048), (10, 0.242), (11, -0.131), (12, -0.081), (13, 0.003), (14, 0.154), (15, 0.04), (16, 0.088), (17, -0.145), (18, -0.159), (19, 0.011), (20, -0.002), (21, 0.059), (22, 0.102), (23, -0.043), (24, 0.033), (25, -0.01), (26, 0.09), (27, -0.027), (28, 0.043), (29, 0.016), (30, -0.051), (31, 0.056), (32, -0.048), (33, 0.037), (34, -0.029), (35, 0.095), (36, 0.014), (37, -0.023), (38, 0.015), (39, 0.005), (40, 0.027), (41, 0.056), (42, -0.086), (43, -0.042), (44, 0.004), (45, 0.122), (46, 0.044), (47, 0.017), (48, 0.057), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.96655017 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
2 0.86161327 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
3 0.79696333 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh
Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.
4 0.76732105 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic
Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1
5 0.75590616 44 nips-2008-Characteristic Kernels on Groups and Semigroups
Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1
6 0.66590911 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
7 0.66384536 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
8 0.66221368 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
9 0.65201008 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
10 0.64839727 143 nips-2008-Multi-label Multiple Kernel Learning
11 0.61402887 225 nips-2008-Supervised Bipartite Graph Inference
12 0.55769438 113 nips-2008-Kernelized Sorting
13 0.50683564 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
14 0.48204336 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
15 0.47073251 138 nips-2008-Modeling human function learning with Gaussian processes
16 0.46223 196 nips-2008-Relative Margin Machines
17 0.44547379 178 nips-2008-Performance analysis for L\ 2 kernel classification
18 0.43067971 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
19 0.42305505 238 nips-2008-Theory of matching pursuit
20 0.42210704 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
topicId topicWeight
[(6, 0.087), (7, 0.098), (12, 0.043), (28, 0.158), (57, 0.071), (59, 0.017), (63, 0.033), (64, 0.024), (71, 0.026), (72, 0.153), (77, 0.07), (78, 0.022), (83, 0.093)]
simIndex simValue paperId paperTitle
1 0.89885348 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
same-paper 2 0.88318884 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
3 0.82243943 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
4 0.81825817 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
5 0.81794912 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
6 0.81596249 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
7 0.80929554 202 nips-2008-Robust Regression and Lasso
8 0.80923778 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
9 0.80763054 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
10 0.80633694 143 nips-2008-Multi-label Multiple Kernel Learning
11 0.80551147 75 nips-2008-Estimating vector fields using sparse basis field expansions
12 0.8053214 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
13 0.80318737 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
14 0.80031407 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
15 0.80008847 196 nips-2008-Relative Margin Machines
16 0.80006111 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
17 0.79966652 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
18 0.7995559 151 nips-2008-Non-parametric Regression Between Manifolds
19 0.79861921 99 nips-2008-High-dimensional support union recovery in multivariate regression
20 0.7983464 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification