nips nips2010 nips2010-124 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we consider the problem of semi-supervised kernel function learning. [sent-7, score-0.374]
2 We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. [sent-8, score-1.072]
3 Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. [sent-9, score-1.021]
4 We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. [sent-12, score-0.347]
5 To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. [sent-13, score-0.462]
6 We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. [sent-14, score-0.463]
7 Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. [sent-15, score-0.383]
8 1 Introduction Learning kernel functions is an ongoing research topic in machine learning that focuses on learning an appropriate kernel function for a given task. [sent-16, score-0.8]
9 In this paper, we propose and analyze a general kernel matrix learning problem using provided side-information over the training data. [sent-25, score-0.497]
10 Our learning problem regularizes the desired kernel matrix via a convex regularizer chosen from a broad class, subject to convex constraints on the kernel. [sent-26, score-0.589]
11 While the learned kernel matrix should be able to capture the provided side-information well, it is not clear how the information can be propagated to new data points. [sent-27, score-0.505]
12 Our first main result demonstrates that our kernel matrix learning problem is equivalent to learning a linear transformation (LT) kernel function (a kernel of the form ϕ(x)T W ϕ(y) for some matrix W ≽ 0) with a specific regularizer. [sent-28, score-1.363]
13 With the appropriate representation of W , this result implies that the learned LT kernel function can be naturally applied to new data. [sent-29, score-0.464]
14 Additionally, we demonstrate that a large class of Mahalanobis metric learning methods can be seen as learning an LT kernel function and so our result provides a 1 constructive method for kernelizing these methods. [sent-30, score-0.594]
15 Our analysis recovers some recent kernelization results for metric learning, but also implies several new results. [sent-31, score-0.311]
16 As our proposed kernel learning formulation learns a kernel matrix over the training points, the memory requirements scale quadratically in the number of training points, a common issue arising in kernel methods. [sent-32, score-1.289]
17 We prove that the equivalence to LT kernel function learning still holds with the addition of this constraint, and that the resulting formulation can be scaled to very large data sets. [sent-34, score-0.446]
18 We then focus on a novel application of our framework to the problem of inductive semi-supervised kernel dimensionality reduction. [sent-35, score-0.638]
19 Our method is a special case of our kernel function learning framework with trace-norm as the regularization function. [sent-36, score-0.492]
20 Related Work: Most of the existing kernel learning methods can be classified into two broad categories. [sent-44, score-0.4]
21 The first category includes parametric approaches, where the learned kernel function is restricted to be of a specific form and then the relevant parameters are learned according to the provided data. [sent-45, score-0.58]
22 Prominent methods include multiple kernel learning [5], hyperkernels [4], infinite kernel learning [7], and hyper-parameter cross-validation [8]. [sent-46, score-0.833]
23 Examples include spectral kernel learning [6], manifold-based kernel learning [9], and kernel target alignment [3]. [sent-49, score-1.264]
24 We propose a general non-parametric kernel matrix learning framework, similar to methods of the second category. [sent-52, score-0.441]
25 However, we show that our learned kernel matrix corresponds to a linear transformation kernel function parameterized by a PSD matrix. [sent-53, score-0.986]
26 POLA [13] and ITML [12] provide specialized kernelization techniques for their respective metric learning formulations. [sent-57, score-0.313]
27 Recently, [17] showed kernelization for a class of metric learning algorithms including LMNN and NCA [15]; as we will see, our result is more general and we can prove kernelization over a larger class of problems and can also reduce the number of parameters to be learned. [sent-59, score-0.522]
28 Kernel dimensionality reduction methods can generally be divided into two categories: 1) semi-supervised dimensionality reduction in the transductive setting, and 2) supervised dimensionality reduction in the inductive setting. [sent-62, score-0.768]
29 Methods in the first category include the incomplete Cholesky decomposition [19], colored maximum variance unfolding [20], and manifold-preserving semi-supervised dimensionality reduction [21]. [sent-63, score-0.273]
30 Methods in the second category include the kernel dimensionality reduction method [22] and Gaussian Process latent variable models [23]. [sent-64, score-0.599]
31 Kernel PCA [24] reduces the dimensionality in the inductive unsupervised setting, while various manifold learning methods can reduce the dimensionality but only in the unsupervised transductive setting. [sent-65, score-0.487]
32 In contrast, our dimensionality reduction method, which is an instantiation of our general kernel learning framework, can perform kernel dimensionality reduction simultaneously in both the semi-supervised as well as the inductive setting. [sent-66, score-1.318]
33 Additionally, it can capture the manifold structure using an appropriate baseline kernel function such as the one proposed by [25]. [sent-67, score-0.447]
34 2 Learning Framework Given an input kernel function κ : R^d × R^d → R, and some side-information over a set of points X = {x1 , x2 , . . . [sent-68, score-0.396]
35 . . . , xn }, the goal is to learn a new kernel function κW that is regularized against κ but incorporates the provided side-information (the use of the subscript W will become clear later). [sent-71, score-0.405]
36 The initial kernel function κ is of the form κ(x, y) = ϕ(x)T ϕ(y) for some mapping ϕ. [sent-72, score-0.374]
37 Learning a kernel function from the provided side-information is an ill-posed problem since infinitely many such kernels can satisfy the provided supervision. [sent-80, score-0.428]
38 A common approach is to formulate a transductive learning problem to learn a new kernel matrix over the training data. [sent-81, score-0.559]
39 Denoting the input kernel matrix K as K = ΦT Φ, we aim to learn a new kernel matrix KW that is regularized against K while satisfying the available side-information. [sent-82, score-0.861]
40 The resulting optimization problem is: min f (KW ) over KW ≽ 0, s.t. gi (KW ) ≤ bi , 1 ≤ i ≤ m, (1) where f and gi are functions from R^{n×n} → R. [sent-85, score-0.338]
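To make (1) concrete, here is a minimal sketch of how such a problem could be prototyped, assuming a squared-Frobenius regularizer f and linear trace constraints gi (both illustrative choices, not prescribed by (1)) and using the off-the-shelf cvxpy solver:

```python
# A minimal prototype of problem (1), assuming f(K_W) = ||K_W - K||_F^2 and linear
# constraints Tr(C_i K_W) <= b_i (both illustrative choices). cvxpy is used as a
# generic convex solver; the formulation itself does not prescribe one.
import cvxpy as cp

def learn_kernel_matrix(K, constraints):
    """K: n x n input kernel matrix; constraints: list of (C_i, b_i) pairs."""
    n = K.shape[0]
    K_W = cp.Variable((n, n), PSD=True)                 # enforce K_W >= 0
    objective = cp.Minimize(cp.sum_squares(K_W - K))    # regularize toward the input kernel
    cons = [cp.trace(C @ K_W) <= b for (C, b) in constraints]
    cp.Problem(objective, cons).solve()
    return K_W.value
```

Such a transductive prototype only scales to modest n, since KW has n² entries; the reduced-parameter formulation discussed later is the remedy for exactly this issue.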
41 In general, such learning formulations are limited in that the learned kernel cannot readily be applied to new data points. [sent-89, score-0.49]
42 However, we will show that the above proposed problem is equivalent to learning linear transformation (LT) kernel functions. [sent-90, score-0.507]
43 Formally, an LT kernel function κW is a kernel function of the form κW (x, y) = ϕ(x)^T W ϕ(y), where W is a positive semi-definite (PSD) matrix; we can think of the LT kernel as describing the linear transformation ϕi → W^{1/2} ϕi . [sent-91, score-1.229]
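As a small illustration of this definition (with ϕ taken to be the identity map, an assumption made purely for the sketch), the following checks that κW (x, y) = ϕ(x)^T W ϕ(y) agrees with a plain inner product after mapping both points through W^{1/2}:

```python
# Sanity check of the LT kernel definition, with phi taken to be the identity map
# (an assumption for this sketch): kappa_W(x, y) = x^T W y equals the inner product
# of the transformed points W^{1/2} x and W^{1/2} y.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
W = A @ A.T                                   # an arbitrary PSD parameterization matrix
x, y = rng.standard_normal(5), rng.standard_normal(5)

kappa_W = x @ W @ y                           # phi(x)^T W phi(y)
W_half = np.real(sqrtm(W))                    # W^{1/2}
assert np.isclose(kappa_W, (W_half @ x) @ (W_half @ y))
```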
44 A natural way to learn an LT kernel function would be to learn the parameterization matrix W using the provided side-information. [sent-92, score-0.477]
45 The corresponding optimization problem is: min f (W ) over W ≽ 0, s.t. gi (Φ^T W Φ) ≤ bi , 1 ≤ i ≤ m, (2) where, as before, the function f is the regularizer and the functions gi are the constraints that encode the side information. [sent-95, score-0.44]
46 The constraints gi are assumed to be a function of the matrix ΦT W Φ of learned kernel values over the training data. [sent-96, score-0.713]
47 We make two observations about this problem: first, for data mapped to high-dimensional spaces via kernel functions, this problem is seemingly impossible to optimize since the size of W grows quadratically with the dimensionality. [sent-97, score-0.427]
48 We will show that (2) need not explicitly be solved for learning an LT kernel function. [sent-98, score-0.4]
49 2.1 Examples of Regularizers and Constraints To make the kernel learning optimization problem concrete, we discuss a few examples of possible regularizers and constraints. [sent-101, score-0.469]
50 For the regularizer f (A) = (1/2) ∥A − I∥_F^2, the resulting kernel learning objective can be equivalently expressed as minimizing (1/2) ∥K^{-1} KW − I∥_F^2. [sent-102, score-0.456]
51 Thus, the goal is to keep the learned kernel close to the input kernel subject to the constraints in gi . [sent-103, score-1.023]
52 Relative distance constraints over a triplet (ϕi , ϕj , ϕk ) specify that ϕi should be closer to ϕj than to ϕk , and are often used in metric learning formulations and ranking problems; such constraints can be easily formulated within our framework (see the sketch below). [sent-110, score-0.265]
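One illustrative way (not necessarily the exact encoding used here) to express such a triplet constraint as a linear function of the learned kernel matrix uses dK (i, j) = Kii + Kjj − 2Kij , so that dK (i, j) ≤ dK (i, k) becomes Tr(C KW ) ≤ 0 for a fixed sparse matrix C:

```python
# Encode the triplet constraint d(i,j) <= d(i,k) as Tr(C K_W) <= 0, where
# d_K(i,j) = K_ii + K_jj - 2 K_ij. An illustrative construction.
import numpy as np

def triplet_constraint_matrix(i, j, k, n):
    C = np.zeros((n, n))
    C[j, j] += 1.0                  # + K_jj
    C[k, k] -= 1.0                  # - K_kk
    C[i, j] -= 1.0; C[j, i] -= 1.0  # - 2 K_ij
    C[i, k] += 1.0; C[k, i] += 1.0  # + 2 K_ik
    return C                        # Tr(C @ K_W) = d(i,j) - d(i,k)
```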
53 More importantly, this result will yield insight into the type of kernel that is learned by the kernel learning problem. [sent-115, score-0.864]
54 Given that KW = Φ^T W Φ, we can see that the learned kernel function is a linear transformation kernel; that is, κW (ϕi , ϕj ) = ϕi^T W ϕj . [sent-138, score-0.571]
55 Given a pair of new data points ϕn1 and ϕn2 , we use the fact that the learned kernel is a linear transformation kernel, along with the first result of the theorem (W = αI + ΦSΦ^T), to compute the learned kernel as: κW (xn1 , xn2 ) = ϕn1^T W ϕn2 = ακ(xn1 , xn2 ) + ∑_{i,j} Sij κ(xn1 , xi ) κ(xj , xn2 ). (3) [sent-139, score-1.084]
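A direct sketch of this inductive rule: once α and S are available (their computation is omitted here), evaluating the learned kernel on new points only requires input-kernel values against the n training points.

```python
# Evaluate the learned kernel between two new points via (3), given alpha, the n x n
# matrix S, the training points, and the input kernel function kappa.
import numpy as np

def learned_kernel(x1, x2, X_train, kappa, alpha, S):
    k1 = np.array([kappa(x1, xi) for xi in X_train])  # kappa(x_n1, x_i), i = 1..n
    k2 = np.array([kappa(xj, x2) for xj in X_train])  # kappa(x_j, x_n2), j = 1..n
    return alpha * kappa(x1, x2) + k1 @ S @ k2        # alpha*kappa + sum_ij S_ij * k1_i * k2_j
```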
56 Therefore, a corollary of Theorem 1 is that we can constructively apply these metric learning methods in kernel space by solving their corresponding kernel learning problem, and then compute the learned metrics via (3). [sent-141, score-1.006]
57 ITML is an instantiation of our framework with regularizer f (A) = tr(A) − log det(A) and pairwise distance constraints encoded as the gi functions. [sent-148, score-0.374]
58 The corresponding kernel learning optimization problem simplifies to: min Dℓd (KW , K) over KW , s.t. gi (KW ) ≤ bi , 1 ≤ i ≤ m. [sent-150, score-0.599]
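For reference, a minimal implementation of the LogDet (Burg) divergence Dℓd (KW , K) = Tr(KW K^{-1}) − log det(KW K^{-1}) − n used above, assuming both matrices are full-rank PSD:

```python
# LogDet (Burg) matrix divergence, assuming K_W and K are full-rank PSD matrices:
# D_ld(K_W, K) = Tr(K_W K^{-1}) - log det(K_W K^{-1}) - n.
import numpy as np

def logdet_divergence(K_W, K):
    n = K.shape[0]
    M = K_W @ np.linalg.inv(K)
    _, logabsdet = np.linalg.slogdet(M)   # stable log-determinant
    return np.trace(M) - logabsdet - n
```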
59 This recovers the kernelized metric learning problem analyzed in [12], where kernelization for this special case was established and an iterative projection algorithm for optimization was developed. [sent-153, score-0.401]
60 Note that, in the analysis of [12], the gi were limited to similarity and dissimilarity constraints; our result is therefore more general than the existing kernelization result, even for this special case. [sent-154, score-0.333]
61 In this case, f is once again a convex spectral function, and its global minimum is at α = 0, so we can use (1) to solve for the learned kernel KW as minimizing ∥KW K^{-1}∥^2 over KW ≽ 0, subject to the constraints gi (KW ) ≤ bi . [sent-161, score-0.608]
62 Other Examples: The above two examples show that our analysis recovers two well-known kernelization results for Mahalanobis metric learning. [sent-166, score-0.33]
63 In particular, each of these methods may be run in kernel space, and our analysis yields new insights into these methods; for example, kernelization of LMNN [11] using Theorem 1 avoids the convex perturbation analysis in [16] that leads to suboptimal solutions in some cases. [sent-170, score-0.568]
64 Then, we will explicitly enforce that the learned kernel is of this form. [sent-176, score-0.464]
65 Below, we show that for any spectral function f and linear constraints gi (KW ) = Tr(Ci KW ), (6) reduces to a problem that applies f and the gi to r × r matrices only, which provides significant scalability. [sent-181, score-0.478]
66 Note that the resulting problem (7) is over r × r matrices (after initial pre-processing) and is in fact similar to the kernel learning problem (1), but with a kernel KJ of smaller size r × r, r ≪ n. [sent-189, score-0.793]
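The following sketch shows one plausible way such a reduction to r × r matrices can arise; the specific parameterization KW = KR L KR^T over r chosen points is an illustrative assumption rather than the exact construction used here, but it makes each linear constraint Tr(Ci KW ) ≤ bi collapse to an r × r constraint on L:

```python
# Illustrative reduced parameterization (an assumption, not necessarily the exact
# construction used in (6)/(7)): restrict K_W = K_R L K_R^T, where K_R is the n x r
# block of kernel values against r chosen points and L is an r x r PSD matrix. Then
# Tr(C_i K_W) = Tr((K_R^T C_i K_R) L), so every linear constraint becomes r x r.
import numpy as np

def compress_constraint(C_i, K_R):
    return K_R.T @ C_i @ K_R   # r x r matrix that plays the role of C_i for L
```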
67 As with (1), we can show that (6) is also equivalent to linear transformation kernel function learning. [sent-191, score-0.481]
68 This enables us to naturally apply the above kernel learning problem in the inductive setting. [sent-192, score-0.507]
69 Then, (6) and (7) are equivalent to the following linear transformation kernel learning problem (analogous to the connection between (1) and (2)): min f (W ) over W ≽ 0 and L, subject to the side-information constraints and an additional constraint on the form of W . (8) [sent-196, score-0.507]
70 Note that, in contrast to (2), where the last constraint over W is achieved automatically, (8) requires that this constraint be satisfied during the optimization process, which leads to a reduced number of parameters for our kernel learning problem. [sent-199, score-0.458]
71 The above theorem shows that our reduced-parameter kernel learning method (6) also implicitly learns a linear transformation kernel function; hence, we can generalize the learned kernel to unseen data points using an expression similar to (3). [sent-200, score-1.394]
72 Also note that, using this parameter reduction technique, we can scale the optimization to kernel learning problems with millions of points or more. [sent-203, score-0.503]
73 4 Trace-norm based Inductive Semi-supervised Kernel Dimensionality Reduction (Trace-SSIKDR) We now consider applying our framework to the scenario of semi-supervised kernel dimensionality reduction, which provides a novel and practical application of our framework. [sent-205, score-0.531]
74 While there exists a variety of methods for kernel dimensionality reduction, most of these methods are unsupervised (e.g., kernel PCA [24]). [sent-206, score-0.512]
75 In contrast, we can use our kernel learning framework to implicitly learn a low-rank transformation of the feature vectors, which in turn provides a low-dimensional embedding of the dataset. [sent-209, score-0.63]
76 Even when the dimensionality of ϕi is very large, if the rank of W is low enough, then the mapped embedding will have small dimensionality. [sent-213, score-0.317]
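A small sketch of why a low-rank W yields a low-dimensional embedding (ϕ taken as the identity map, an assumption for illustration): factoring W ≈ L^T L with an r × d matrix L turns the LT kernel into an explicit r-dimensional feature map x → Lx.

```python
# Factor a rank-r PSD matrix W as L^T L with L of shape r x d; the LT kernel then
# reduces to an ordinary inner product of the r-dimensional embeddings L @ x.
# (phi is taken to be the identity map in this sketch.)
import numpy as np

def low_rank_embedding_map(W, r):
    vals, vecs = np.linalg.eigh(W)                  # W is symmetric PSD
    top = np.argsort(vals)[::-1][:r]                # indices of the r largest eigenvalues
    L = np.sqrt(np.clip(vals[top], 0.0, None))[:, None] * vecs[:, top].T
    return L                                        # kappa_W(x, y) ~= (L @ x) @ (L @ y)
```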
77 Then, using Theorem 1, the resulting relaxed kernel learning problem is: min τ Tr(K^{-1/2} KW K^{-1/2}) + ∥K^{-1/2} KW K^{-1/2}∥_F^2 over KW ≽ 0, subject to the side-information constraints gi (KW ) ≤ bi . [sent-222, score-0.4]
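A building block one would likely use inside an iterative solver for this relaxed problem (a sketch only, not the full Trace-SSIKDR algorithm described below): the proximal map of S ↦ τ Tr(S) + ∥S − M∥_F^2 over PSD matrices shrinks the eigenvalues of M by τ/2 and clips them at zero, which is what drives the solution toward low rank.

```python
# Proximal step for the trace-norm term: argmin over S >= 0 of
# tau*Tr(S) + ||S - M||_F^2 shrinks M's eigenvalues by tau/2 and clips them at zero.
import numpy as np

def psd_trace_prox(M, tau):
    M = (M + M.T) / 2.0                       # symmetrize for numerical safety
    vals, vecs = np.linalg.eigh(M)
    shrunk = np.maximum(vals - tau / 2.0, 0.0)
    return (vecs * shrunk) @ vecs.T           # eigenvalue shrinkage => low rank
```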
78 However, using elementary linear algebra, we can show that K̃ and the learned kernel function can be computed efficiently without computing K^{1/2}, by maintaining S = K^{-1/2} K̃ K^{-1/2} from step to step. [sent-306, score-0.484]
79 Note that the learned embedding ϕi → K̃^{1/2} K^{-1/2} ki , where ki is a vector of input kernel function values between ϕi and the training data, can be computed efficiently as ϕi → Σk^{1/2} Dk Vk ki , which does not require K^{1/2} explicitly. [sent-310, score-0.638]
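A sketch of that embedding step for a new point, assuming Σk is diagonal (so only its diagonal entries are passed in) and that Dk and Vk are whatever factors the algorithm maintains; their construction is omitted here.

```python
# Embed a new point from its kernel values against the training set, following the
# expression above. sigma_vals is assumed to hold the diagonal of Sigma_k; how the
# algorithm builds Sigma_k, D_k, V_k is not reproduced in this sketch.
import numpy as np

def embed_new_point(k_new, sigma_vals, D_k, V_k):
    """k_new: vector of kappa(x_new, x_j) over the training points x_j."""
    return np.sqrt(sigma_vals) * (D_k @ V_k @ k_new)   # Sigma_k^{1/2} D_k V_k k_new
```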
80 (Final steps of the algorithm: S^t = Vk Dk Σk Dk Vk ; repeat until convergence; return Σk , Dk , Vk .) 5 Experimental Results We now present an empirical evaluation of our kernel learning framework and our semi-supervised kernel dimensionality reduction approach when applied in conjunction with k-nearest neighbor classification. [sent-313, score-0.931]
81 Additionally, we show that our semi-supervised kernel dimensionality reduction approach achieves comparable accuracy while significantly reducing the dimensionality of the linear mapping. [sent-315, score-0.741]
82 UCI Datasets: First, we evaluate the performance of our kernel learning framework on standard UCI datasets. [sent-316, score-0.439]
83 Figure 1(b): Rank of the learned kernel functions obtained by various methods. [sent-356, score-0.464]
84 The rank of the learned kernel function is the same as the reduced dimensionality of the dataset. [sent-357, score-0.683]
85 Table 4 shows the 5-NN classification accuracies achieved by our kernel learning framework with different regularization functions. [sent-361, score-0.469]
86 Note that for almost all the datasets (except Iris and Diabetes), both Frob and ITML improve upon the baseline Gaussian kernel significantly. [sent-364, score-0.418]
87 We also compare our semi-supervised dimensionality reduction method Trace-SSIKDR (see Section 4) with baseline kernel dimensionality reduction methods Frob LR, ITML LR-pre, and ITML LR-post. [sent-365, score-0.816]
88 Frob LR reduces the rank of the learned matrix W (equivalently, it reduces the dimensionality) using Frobenius norm regularization by taking the top eigenvectors. [sent-366, score-0.312]
89 Similarly, ITML LR-post reduces the rank of the learned kernel matrix obtained using ITML by taking its top eigenvectors. [sent-367, score-0.631]
90 ITML LR-pre reduces the rank of the kernel function by reducing the rank of the training kernel matrix. [sent-368, score-0.998]
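For concreteness, the "top eigenvectors" truncation used by the Frob LR and ITML LR-post baselines can be sketched as a projection of a learned PSD matrix onto its best rank-r approximation:

```python
# Rank reduction by keeping the top eigenvectors: project a learned PSD matrix
# onto its best rank-r approximation.
import numpy as np

def truncate_to_rank(M, r):
    M = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:r]                     # r largest eigenvalues
    return (vecs[:, top] * np.maximum(vals[top], 0.0)) @ vecs[:, top].T
```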
91 The learned linear transformation W (or equivalently, the learned kernel function) should have the same rank as that of the training kernel matrix, since the LogDet divergence preserves the range space of the input kernel. [sent-369, score-1.2]
92 We fix the rank of the learned W for Frob LR, ITML LR-pre, and ITML LR-post to the rank of the transformation W obtained by our Trace-SSIKDR method. [sent-370, score-0.379]
93 Caltech-101: Next, we evaluate our kernel learning framework on the Caltech-101 dataset, a benchmark object recognition dataset containing over 3000 images. [sent-373, score-0.439]
94 We use a pool of 30 images per class for our experiments, out of which a varying number of random images are selected for training and the remaining are used for testing the learned kernel function. [sent-375, score-0.506]
95 The baseline kernel function is selected to be the sum of four different kernel functions: PMK [32], SPMK [33], Geoblur-1 and Geoblur-2 [34]. [sent-376, score-0.792]
96 Clearly, ITML and Frob (which are specific instances of our framework) are able to learn significantly more accurate kernel functions than the baseline kernel function. [sent-378, score-0.823]
97 Furthermore, our Trace-SSIKDR method is able to achieve reasonable accuracy while reducing the rank of the kernel function significantly (Figure 1 (b)). [sent-379, score-0.505]
98 For the baseline kernel, we use the data-dependent kernel function proposed by [25] that also takes data’s manifold structure into account. [sent-385, score-0.447]
99 Let the kernel figure it out; principled learning of pre-processing for kernel classifiers. [sent-439, score-0.774]
100 Cross-validation optimization for large scale hierarchical classification kernel methods. [sent-442, score-0.374]
wordName wordTfidf (topN-words)
[('kw', 0.606), ('kernel', 0.374), ('itml', 0.33), ('frob', 0.229), ('kernelization', 0.171), ('gi', 0.139), ('mahalanobis', 0.129), ('dimensionality', 0.118), ('metric', 0.116), ('inductive', 0.107), ('rank', 0.101), ('tr', 0.096), ('spectral', 0.09), ('learned', 0.09), ('lt', 0.089), ('transformation', 0.087), ('logdet', 0.082), ('reduction', 0.081), ('lr', 0.073), ('embedding', 0.073), ('fs', 0.067), ('pola', 0.065), ('transductive', 0.064), ('kulis', 0.062), ('dk', 0.06), ('bi', 0.06), ('inductively', 0.057), ('regularizer', 0.056), ('kernels', 0.054), ('vk', 0.053), ('zi', 0.053), ('lmnn', 0.053), ('regularizers', 0.05), ('digits', 0.049), ('jlj', 0.049), ('usps', 0.048), ('constraints', 0.046), ('dw', 0.045), ('baseline', 0.044), ('inderjit', 0.043), ('kernelized', 0.041), ('matrix', 0.041), ('ci', 0.04), ('det', 0.04), ('framework', 0.039), ('instantiation', 0.039), ('jmlr', 0.039), ('hyperkernels', 0.033), ('kci', 0.033), ('kernelizing', 0.033), ('sideinformation', 0.033), ('ssikdr', 0.033), ('minima', 0.031), ('distance', 0.031), ('learn', 0.031), ('accuracy', 0.03), ('yij', 0.03), ('regularization', 0.03), ('frobenius', 0.03), ('constraint', 0.029), ('manifold', 0.029), ('collapsing', 0.029), ('quadratically', 0.028), ('nips', 0.027), ('theorem', 0.027), ('bij', 0.026), ('iris', 0.026), ('formulation', 0.026), ('category', 0.026), ('learning', 0.026), ('ki', 0.026), ('reduces', 0.025), ('mapped', 0.025), ('cantly', 0.025), ('jain', 0.025), ('bangalore', 0.025), ('recovers', 0.024), ('classi', 0.024), ('pairwise', 0.024), ('diabetes', 0.023), ('training', 0.023), ('convex', 0.023), ('special', 0.023), ('sra', 0.022), ('icml', 0.022), ('furthermore', 0.022), ('points', 0.022), ('cvpr', 0.021), ('brian', 0.021), ('variety', 0.02), ('uci', 0.02), ('linear', 0.02), ('equivalence', 0.02), ('coded', 0.019), ('psd', 0.019), ('smola', 0.019), ('matrices', 0.019), ('examples', 0.019), ('class', 0.019), ('colored', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
2 0.20818555 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
3 0.19525932 133 nips-2010-Kernel Descriptors for Visual Recognition
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
4 0.18208697 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
Author: Serhat Bucak, Rong Jin, Anil K. Jain
Abstract: Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is O(m^{1/3} √(ln m)), where m is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL. 1
6 0.17710802 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
7 0.15954235 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
8 0.15644631 250 nips-2010-Spectral Regularization for Support Estimation
9 0.15127422 280 nips-2010-Unsupervised Kernel Dimension Reduction
10 0.11830612 138 nips-2010-Large Margin Multi-Task Metric Learning
11 0.10698269 287 nips-2010-Worst-Case Linear Discriminant Analysis
12 0.092675515 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
13 0.088697784 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
14 0.084513038 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
15 0.079055965 63 nips-2010-Distributed Dual Averaging In Networks
16 0.079006203 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
17 0.075235583 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
18 0.07417652 94 nips-2010-Feature Set Embedding for Incomplete Data
19 0.073134206 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
20 0.072998747 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
topicId topicWeight
[(0, 0.197), (1, 0.086), (2, 0.079), (3, -0.09), (4, 0.175), (5, 0.074), (6, 0.355), (7, -0.049), (8, 0.025), (9, 0.051), (10, 0.032), (11, 0.006), (12, 0.008), (13, 0.009), (14, 0.09), (15, 0.026), (16, -0.059), (17, -0.034), (18, -0.048), (19, -0.008), (20, -0.022), (21, -0.075), (22, -0.057), (23, -0.069), (24, 0.004), (25, 0.028), (26, 0.054), (27, -0.025), (28, -0.08), (29, -0.132), (30, -0.012), (31, -0.051), (32, 0.015), (33, -0.081), (34, 0.079), (35, 0.007), (36, -0.104), (37, -0.006), (38, -0.007), (39, -0.034), (40, -0.073), (41, 0.005), (42, 0.056), (43, 0.036), (44, 0.02), (45, 0.01), (46, -0.027), (47, 0.051), (48, -0.038), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95153719 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
2 0.81314552 280 nips-2010-Unsupervised Kernel Dimension Reduction
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
3 0.75327116 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
4 0.72031826 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
5 0.71363014 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
Author: Kun Gai, Guangyun Chen, Chang-shui Zhang
Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1
6 0.68679476 133 nips-2010-Kernel Descriptors for Visual Recognition
9 0.63102102 287 nips-2010-Worst-Case Linear Discriminant Analysis
10 0.59597188 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
11 0.59393191 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
12 0.49788606 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
13 0.47765699 138 nips-2010-Large Margin Multi-Task Metric Learning
14 0.44496855 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
15 0.44256431 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
16 0.42729959 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
17 0.37011838 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
18 0.36388889 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
19 0.36217904 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
20 0.35742152 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
topicId topicWeight
[(13, 0.076), (17, 0.015), (26, 0.011), (27, 0.034), (30, 0.05), (35, 0.034), (45, 0.206), (48, 0.217), (50, 0.056), (52, 0.04), (60, 0.07), (77, 0.031), (78, 0.022), (90, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.80259788 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
2 0.7482177 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
3 0.74317539 161 nips-2010-Linear readout from a neural population with partial correlation data
Author: Adrien Wohrer, Ranulfo Romo, Christian K. Machens
Abstract: How much information does a neural population convey about a stimulus? Answers to this question are known to strongly depend on the correlation of response variability in neural populations. These noise correlations, however, are essentially immeasurable as the number of parameters in a noise correlation matrix grows quadratically with population size. Here, we suggest to bypass this problem by imposing a parametric model on a noise correlation matrix. Our basic assumption is that noise correlations arise due to common inputs between neurons. On average, noise correlations will therefore reflect signal correlations, which can be measured in neural populations. We suggest an explicit parametric dependency between signal and noise correlations. We show how this dependency can be used to ”fill the gaps” in noise correlations matrices using an iterative application of the Wishart distribution over positive definitive matrices. We apply our method to data from the primary somatosensory cortex of monkeys performing a two-alternativeforced choice task. We compare the discrimination thresholds read out from the population of recorded neurons with the discrimination threshold of the monkey and show that our method predicts different results than simpler, average schemes of noise correlations. 1
4 0.74114716 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
5 0.73959827 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
6 0.73893082 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
7 0.73874694 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
8 0.73762256 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
9 0.73538613 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.7341913 117 nips-2010-Identifying graph-structured activation patterns in networks
11 0.7332446 63 nips-2010-Distributed Dual Averaging In Networks
12 0.73207694 138 nips-2010-Large Margin Multi-Task Metric Learning
13 0.73198372 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
14 0.73153889 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
15 0.73013479 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
16 0.72983408 258 nips-2010-Structured sparsity-inducing norms through submodular functions
17 0.72966725 287 nips-2010-Worst-Case Linear Discriminant Analysis
18 0.72948271 148 nips-2010-Learning Networks of Stochastic Differential Equations
19 0.72944927 36 nips-2010-Avoiding False Positive in Multi-Instance Learning
20 0.72939026 243 nips-2010-Smoothness, Low Noise and Fast Rates