nips nips2011 nips2011-171 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Metric learning has become a very active research field. [sent-6, score-0.12]
2 The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. [sent-7, score-0.732]
3 Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. [sent-8, score-0.476]
4 However, the problem then becomes finding the appropriate kernel function. [sent-9, score-0.187]
5 Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. [sent-10, score-0.3]
6 Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. [sent-11, score-0.85]
7 In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. [sent-12, score-0.597]
8 Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. [sent-13, score-0.414]
9 Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. [sent-14, score-1.241]
10 Its computation can be seen as a two-step process; in the first step we perform a linear projection of the instances and in the second step we compute their Euclidean metric in the projected space. [sent-17, score-0.421]
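To make the two-step view concrete, here is a minimal numpy sketch (our own illustration, not code from the paper); the projection matrix L and the identity M = LᵀL are assumptions of the example:

```python
import numpy as np

def mahalanobis_as_projection(x, y, L):
    """Two-step view of the Mahalanobis metric: (1) project linearly with L,
    (2) take the squared Euclidean distance in the projected space.
    This equals the direct Mahalanobis form with M = L^T L."""
    d_proj = float(np.sum((L @ x - L @ y) ** 2))   # step 1 + step 2
    M = L.T @ L
    d_maha = float((x - y) @ M @ (x - y))          # direct quadratic form
    assert np.isclose(d_proj, d_maha)
    return d_maha

# tiny usage example with random data
rng = np.random.default_rng(0)
x, y, L = rng.normal(size=4), rng.normal(size=4), rng.normal(size=(2, 4))
print(mahalanobis_as_projection(x, y, L))
```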
11 However, we are now faced with a new problem, namely that of finding the appropriate kernel function and the associated feature space matching the requirements of the learning problem. [sent-20, score-0.284]
12 The simplest approach to address this problem is to select the best kernel from a predefined kernel set using internal cross-validation. [sent-21, score-0.411]
13 The main drawback of this approach is that only one kernel is selected, which limits the expressiveness of the resulting method. [sent-22, score-0.187]
14 Multiple Kernel Learning (MKL) [10, 17] lifts the above limitations by learning a linear combination of a number of predefined kernels. [sent-24, score-0.065]
15 In [11, 13] the authors propose a method that learns a distance metric for multiple-source problems within a multiple-kernel scenario. [sent-26, score-0.39]
16 The proposed method defines the distance of two instances as the sum of their distances in the feature spaces induced by the different kernels. [sent-27, score-0.206]
17 To the best of our knowledge, most of the work on MKL has been confined to the framework of SVMs and, despite the recent popularity of ML, there exists so far no work that performs MKL in the ML framework by learning a distance metric in the weighted linear combination of feature spaces. [sent-30, score-0.502]
18 Since the learned metric matrix has a regularized form (i. [sent-35, score-0.395]
19 The experimental results suggest that LMNN-MKLP outperforms LMNN with an unweighted kernel combination and the single best kernel selected by internal cross-validation. [sent-39, score-0.542]
20 2 Preliminaries In the different flavors of metric learning we are given a matrix of learning instances X : n × d, the i-th row of which is the instance xiT ∈ Rd, and a vector of class labels y = (y1, . . . , yn). [sent-40, score-0.498]
21 Consider a mapping Φl (x) of instances x to some feature space Hl , i. [sent-47, score-0.128]
22 The corresponding kernel function kl (xi , xj ) computes the inner product of two instances in the Hl feature space, i. [sent-50, score-0.54]
23 The squared Mahalanobis distance of two instances in the Hl space is given by d²_{Ml}(Φl(xi), Φl(xj)) = (Φl(xi) − Φl(xj))T Ml (Φl(xi) − Φl(xj)), where Ml is a Positive Semi-Definite (PSD) metric matrix in the Hl space (Ml ⪰ 0). [sent-54, score-0.575]
24 For some given ML method h we optimize (most often minimize) some cost function Fh with respect to the metric matrix Ml, under the PSD constraint for Ml and an additional set of pairwise distance constraints Ch({d²_{Ml}(Φl(xi), Φl(xj)) | i, j = 1, . . . , n}). [sent-55, score-0.478]
25 similarity and dissimilarity pairwise constraints [3] and relative comparison constraints [22]. [sent-60, score-0.086]
26 The kernelized ML optimization problem can now be written as: min_{Ml} Fh(Ml) s.t. Ch(d²_{Ml}(Φl(xi), Φl(xj))), Ml ⪰ 0 (1). [sent-62, score-0.107]
27 Kernelized ML methods do not require learning the explicit form of the Mahalanobis metric Ml. [sent-64, score-0.369]
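Under the parametrization used in kernelized ML (Ml = Φl(X)ᵀ A Φl(X) for some n × n PSD matrix A, cf. Lemma 1 below), all distances reduce to kernel values; a small numpy sketch under that assumption, with our own function name:

```python
import numpy as np

def kernelized_sq_mahalanobis(K, A, i, j):
    """Squared Mahalanobis distance between training instances i and j in H_l,
    assuming M_l = Phi_l(X)^T A Phi_l(X), so that
    d^2 = (K_i - K_j)^T A (K_i - K_j) with K_i the i-th column of K."""
    d = K[:, i] - K[:, j]
    return float(d @ A @ d)
```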
28 However, to keep the notation uncluttered we parametrize the optimization problem only over Ml . [sent-71, score-0.074]
29 In MKL we are given a set of kernel functions Z = {kl(xi, xj) | l = 1, . . . , m}, [sent-72, score-0.377]
30 and the goal is to learn an appropriate kernel function kµ(xi, xj) parametrized by µ under a cost function Q. [sent-75, score-0.433]
31 The cost function Q is determined by the cost function of the learning method that is coupled with multiple kernel learning, e. [sent-76, score-0.253]
32 As in [10, 17] we parametrize kµ(xi, xj) by a linear combination of the form: kµ(xi, xj) = Σ_{l=1}^m µl kl(xi, xj), µl ≥ 0, Σ_l µl = 1 (4). We denote the feature space induced by the kµ kernel by Hµ; this feature space is given by the mapping x → Φµ(x) = (√µ1 Φ1(x)T, . . . , √µm Φm(x)T)T. [sent-79, score-1.073]
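On the kernel-matrix level, the combination in (4) can be sketched as follows (our own helper; `kernels` is a list of precomputed matrices K_1, ..., K_m):

```python
import numpy as np

def combined_kernel(kernels, mu):
    """K_mu = sum_l mu_l * K_l, with the weights mu on the probability simplex."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0), "mu must lie on the simplex"
    K_mu = np.zeros_like(kernels[0], dtype=float)
    for w, K in zip(mu, kernels):
        K_mu += w * K
    return K_mu
```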
33 Finally, we denote by H the feature space that we get by the unweighted concatenation of the m feature spaces, i.e. the space given by the mapping x → Φ(x) = (Φ1(x)T, . . . , Φm(x)T)T. [sent-84, score-0.21]
34 3 Metric Learning with Multiple Kernel Learning The goal is to learn a metric matrix M in the feature space Hµ induced by the mapping Φµ as well as the kernel weight µ; we denote this metric by d²_{M,µ}. [sent-90, score-1.061]
35 Based on the optimal form of the Mahalanobis metric M for a metric learning method learning with a single kernel function [9], we have the following lemma: Lemma 1. [sent-91, score-0.923]
36 Assume that for a metric learning method h the optimal parameterization of its Mahalanobis metric M∗ is Φl(X)T A∗ Φl(X), for some A∗, when learning with a single kernel function kl(x, x′). [sent-92, score-0.984]
37 Then, for h with multiple kernel learning the optimal parametrization of its Mahalanobis metric M ∗∗ is given by Φµ (X)T A∗∗ Φµ (X), for some A∗∗ . [sent-93, score-0.722]
38 The constraints are Ch(d²_{A,µ}(Φµ(xi), Φµ(xj))), A ⪰ 0, µl ≥ 0, Σ_l µl = 1 (6). We denote the resulting optimization problem and the learning method by MLh-MKLµ; clearly this is not fully specified until we choose a specific ML method h. [sent-97, score-0.067]
39 As a result, instead of optimizing over µ we can also use the parametrization over P; the new optimization problem can now be written as: min_{A,P} Fh(ΦP(X)T A ΦP(X)) s.t. [sent-104, score-0.168]
40 Ch(d²_{A,P}(ΦP(xi), ΦP(xj))), A ⪰ 0, Σ_ij Pij = 1, Pij ≥ 0, Rank(P) = 1, P = PT (8), where the constraints Σ_ij Pij = 1, Pij ≥ 0, Rank(P) = 1, and P = PT are added so that P = µµT. [sent-106, score-0.067]
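For illustration, the rank-one reparametrization P = µµᵀ and one way to map back to µ can be sketched as follows (the eigenvector-based recovery is our own choice, not prescribed by the paper):

```python
import numpy as np

def mu_to_P(mu):
    """Rank-one matrix P = mu mu^T; if sum(mu) = 1 then the entries of P sum to one."""
    mu = np.asarray(mu, dtype=float)
    return np.outer(mu, mu)

def P_to_mu(P):
    """Recover mu from a symmetric, rank-one, entrywise non-negative P."""
    w, V = np.linalg.eigh(P)
    mu = np.sqrt(max(w[-1], 0.0)) * np.abs(V[:, -1])  # leading eigenpair of P
    return mu / mu.sum()                              # renormalize onto the simplex
```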
41 We call the optimization problem and learning method (8) MLh-MKLP; as before, in order to fully instantiate it we need to choose a specific metric learning method h. [sent-107, score-0.461]
42 Now, we derive an alternative parametrization of (5). [sent-108, score-0.125]
43 Let Φ′(X) denote the block-diagonal matrix diag(Φ1(X), . . . , Φm(X)). We have: d²_{A,µ}(Φµ(xi), Φµ(xj)) = (Φ(xi) − Φ(xj))T M′ (Φ(xi) − Φ(xj)) (9), where M′ = Φ′(X)T A′ Φ′(X) (10), and A′ is the mn × mn block matrix whose (l, l′) block is µl µl′ A (11). [sent-124, score-0.064]
44 From (9) we see that the Mahalanobis metric, parametrized by the M or A matrix, in the feature space Hµ induced by the kernel kµ, is equivalent to the Mahalanobis metric in the feature space H, which is parametrized by M′ or A′. [sent-140, score-0.776]
45 As we can see from (11), MLh-MKLµ and MLh-MKLP learn a regularized matrix A′ (i.e. [sent-141, score-0.076]
46 a matrix with internal structure) that corresponds to a parametrization of the Mahalanobis metric M′ in the feature space H. [sent-143, score-0.63]
47 1 Non-Regularized Metric Learning with Multiple Kernel Learning We present here a more general formulation of the optimization problem (6) in which we lift the regularization of the matrix A′ from (11) and learn instead a full PSD mn × mn matrix A′ with unconstrained blocks A_{ll′}, l, l′ = 1, . . . , m (12). [sent-145, score-0.17]
48 The respective Mahalanobis matrix, which we denote by M′, still has the same parametrization form as in (10), i. [sent-161, score-0.125]
49 As a result, by using A′ instead of A, the squared Mahalanobis distance can now be written as: d²_{A′}(Φ(xi), Φ(xj)) = (Φ(xi) − Φ(xj))T M′ (Φ(xi) − Φ(xj)) = ((K1i − K1j)T, . . . , (Kmi − Kmj)T) A′ ((K1i − K1j)T, . . . , (Kmi − Kmj)T)T (13), where Kli denotes the i-th column of the l-th kernel matrix. [sent-164, score-0.073]
50 What we see here is that under the M′ parametrization, computing the Mahalanobis metric in H is equivalent to computing the Mahalanobis metric in the HZ space. [sent-174, score-0.813]
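In practice this means the non-regularized variant can work directly on stacked kernel columns; a sketch of this empirical H_Z representation (our own helper names):

```python
import numpy as np

def hz_representation(kernels, i):
    """Representation of training instance i in H_Z: the concatenation
    (K_1[:, i], ..., K_m[:, i]), a vector of length m * n."""
    return np.concatenate([K[:, i] for K in kernels])

def nr_sq_distance(kernels, A_full, i, j):
    """Squared distance under a full (mn x mn) PSD matrix A_full, as in the
    non-regularized parametrization (13)."""
    d = hz_representation(kernels, i) - hz_representation(kernels, j)
    return float(d @ A_full @ d)
```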
51 Under the parametrization of the Mahalanobis distance given by (13), the optimization problem of metric learning with multiple kernel learning is the following: min_{A′} Fh(Φ′(X)T A′ Φ′(X)) s.t. [sent-175, score-0.835]
52 This is also valid for the subproblems of learning matrix A in MLh -MKLµ and MLh -MKLP given the weight vector µ. [sent-181, score-0.075]
53 Given the PSD matrix A, we have the following two lemmas for the optimization problems MLh-MKL{µ|P}: Lemma 2. [sent-182, score-0.094]
54 Given the PSD matrix A, the MLh-MKLµ optimization problem is convex in µ if the metric learning algorithm h is convex in µ. [sent-183, score-0.554]
55 The last two constraints on µ of the optimization problem (6) are linear; thus this problem is convex if the metric learning algorithm h is convex in µ. [sent-185, score-0.526]
56 Since d²_{A,µ}(Φµ(xi), Φµ(xj)) is a convex quadratic function of µ, which can easily be proved from the PSD property of the matrix BABT in (7), many well-known metric learning algorithms, such as Pairwise SVM [21], POLA [19] and Xing’s method [23], satisfy the conditions of Lemma 2. [sent-186, score-0.465]
57 The MLh -MKLP optimization problem (8) is not convex given a PSD matrix A because the rank constraint is not convex. [sent-187, score-0.186]
58 However, when the number of kernels m is small, e. [sent-188, score-0.128]
59 Given the PSD matrix A, the MLh-MKLP optimization problem (8) can be formulated as an equivalent convex problem with respect to P if the ML algorithm h is linear in P and the number of kernels m is small. [sent-192, score-0.327]
60 Given the PSD matrix A, if h is linear in P, we can formulate the rank-constrained problem with the help of the following two convex problems [2]: min Fh(ΦP(X)T A ΦP(X)) + w · tr(PT W) s.t. [sent-194, score-0.143]
61 global convergence defined in (17), and the direction matrix W is an optimal solution of the following problem: min tr(P∗ T W) s. [sent-198, score-0.076]
62 At global convergence the convex problem (15) is not a relaxation of the original problem; instead, it is an equivalent convex problem [2]. [sent-204, score-0.117]
63 For a number of known metric learning algorithms, such as LMNN [22], POLA [19], MLSVM [14] and Xing’s method [23], linearity with respect to P holds given A ⪰ 0. [sent-215, score-0.368]
64 2 Optimization Algorithms The NR-MLh -MKL optimization problem can be directly solved by any metric learning algorithm h on the space HZ when the optimization problem of the latter only involves the squared pairwise Mahalanobis distance, e. [sent-220, score-0.551]
65 When the metric learning algorithm h has a regularization term on M, e. [sent-223, score-0.368]
66 Based on Lemmas 2 and 3 we propose for both methods a two-step iterative algorithm, Algorithm 1: in the first step we learn the kernel weighting, and in the second the metric under the kernel weighting learned in the first step. [sent-227, score-0.793]
67 At the first step of the i-th iteration we learn the µ(i) kernel weight vector under fixed PSD matrices A(i−1) , learned at the preceding iteration (i − 1). [sent-228, score-0.237]
68 At the second step we apply the metric learning algorithm h and learn the PSD matrices A(i) with the kernel matrix Kµ(i) = Σ_l µl(i) Kl, using the A(i−1) as the initial metric matrices. [sent-230, score-0.237]
69 We should make clear that the optimization problem we are solving is only individually convex with respect to µ given the PSD matrix A and vice-versa. [sent-231, score-0.14]
70 As a result, the convergence of the two-step algorithm (possibly to a local optimum) is guaranteed [6] and is checked via the variation of µ and the objective value of the metric learning method h. [sent-232, score-0.424]
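A schematic numpy version of this two-step scheme (Algorithm 1); `learn_weights` and `metric_learn_h` are placeholders for the weight subproblem solver and the base metric learning method h, neither of which is specified here:

```python
import numpy as np

def alternating_ml_mkl(kernels, learn_weights, metric_learn_h,
                       max_iter=50, tol=1e-4):
    """Alternate between (1) learning the kernel weights mu given the PSD
    matrix A and (2) learning A on the combined kernel K_mu given mu.

    learn_weights(kernels, A)    -> mu  (solves the convex weight subproblem)
    metric_learn_h(K_mu, A_init) -> A   (base metric learning method h)
    """
    n = kernels[0].shape[0]
    m = len(kernels)
    A = np.eye(n)                 # A^(0) initialized to the identity
    mu = np.full(m, 1.0 / m)      # equal initial kernel weights
    for _ in range(max_iter):
        mu_new = learn_weights(kernels, A)                  # step 1: weights given A
        K_mu = sum(w * K for w, K in zip(mu_new, kernels))  # K_mu = sum_l mu_l K_l
        A = metric_learn_h(K_mu, A)                         # step 2: metric given mu
        if np.linalg.norm(mu_new - mu) < tol:               # convergence check on mu
            mu = mu_new
            break
        mu = mu_new
    return A, mu
```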
71 5 LMNN-Based Instantiation We have presented two basic approaches to metric learning with multiple kernel learning: MLh-MKLµ (MLh-MKLP) and NR-MLh-MKL. [sent-234, score-0.597]
72 min_{A,P,ξ} { Σ_ij Sij d²_{A,P}(ΦP(xi), ΦP(xj)) + γ Σ_ijk Sij(1 − Yik)ξijk } (18) s.t. d²_{A,P}(ΦP(xi), ΦP(xk)) − d²_{A,P}(ΦP(xi), ΦP(xj)) ≥ 1 − ξijk, ξijk > 0, A ⪰ 0, Σ_kl Pkl = 1, Pkl ≥ 0, Rank(P) = 1, P = PT, where the matrix Y, Yij ∈ {0, 1}, indicates whether the class labels yi and yj are the same (Yij = 1) or different (Yij = 0). [sent-241, score-0.134]
73 The matrix S is a binary matrix whose Sij entry is non-zero if instance xj is one of the k same-class nearest neighbors of instance xi. [sent-242, score-0.421]
74 The objective is to minimize the sum of the distances of all instances to their k same-class nearest neighbors while allowing for some errors, the trade-off being controlled by the γ parameter. [sent-243, score-0.14]
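To make these ingredients concrete, a small numpy sketch (our own helper names; the squared distances are assumed precomputed in D2) of the target-neighbor matrix S and of an LMNN-style cost Σ_ij Sij d²(i,j) + γ Σ_ijk Sij (1 − Yik)[1 + d²(i,j) − d²(i,k)]_+:

```python
import numpy as np

def target_neighbors(D2, y, k=3):
    """Binary matrix S with S[i, j] = 1 iff x_j is among the k nearest
    same-class neighbors of x_i according to the squared distances D2."""
    y = np.asarray(y)
    n = len(y)
    S = np.zeros((n, n), dtype=float)
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        S[i, same[np.argsort(D2[i, same])[:k]]] = 1.0
    return S

def lmnn_style_objective(D2, S, y, gamma=1.0):
    """Pull term plus hinge-type push term of an LMNN-style cost."""
    y = np.asarray(y)
    Y = (y[:, None] == y[None, :]).astype(float)       # same-class indicator
    pull = np.sum(S * D2)
    margins = 1.0 + D2[:, :, None] - D2[:, None, :]    # 1 + d^2(i,j) - d^2(i,k)
    hinge = np.maximum(margins, 0.0)
    push = np.sum(S[:, :, None] * (1.0 - Y)[:, None, :] * hinge)
    return pull + gamma * push
```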
75 As the objective function of LMNN only involves the squared pairwise Mahalanobis distances, the instantiation of NR-MLh -MKL is straightforward and it consists simply of the application of LMNN on the space HZ in order to learn the metric. [sent-244, score-0.193]
76 We compare these methods against two baselines: LMNN-MKLCV in which a kernel is selected from a set of kernels using 2-fold inner cross-validation (CV), and LMNN with the unweighted sum of kernels, which induces the H feature space, denoted by LMNNH . [sent-377, score-0.452]
77 Additionally, we report the performance of 1-Nearest-Neighbor, denoted as 1-NN, with no metric learning. [sent-378, score-0.344]
78 The PSD matrix A and the weight vector µ in LMNN-MKLP were initialized to I and to equal weights (1 divided by the number of kernels), respectively. [sent-379, score-0.076]
79 After learning the metric and the multiple kernel combination we used 1-NN for classification. [sent-387, score-0.638]
80 The attributes of all the datasets are standardized in the preprocessing step. [sent-396, score-0.074]
81 The set Z of kernels that we use consists of the following 20 kernels: 10 polynomial kernels with degree from one to ten, and 10 Gaussian kernels with bandwidth σ ∈ {0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20} [sent-397, score-0.152]
82 (the same set of kernels was used in [4]). [sent-398, score-0.128]
83 Each basic kernel Kk was normalized by the average of its diag(Kk ). [sent-399, score-0.187]
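This normalization can be sketched in one line (our own naming):

```python
import numpy as np

def normalize_by_avg_diag(K):
    """Scale a kernel matrix by the average of its diagonal entries."""
    return K / np.mean(np.diag(K))
```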
84 For NR-LMNN-MKL, due to its scaling limitations, we could only use a small subset of Z consisting of the linear, the second-order polynomial, and the Gaussian kernel with the kernel width of 0. [sent-401, score-0.374]
85 First, we observe that by learning the kernel inside LMNN-MKLP we improve performance over LMNNH, which uses the unweighted kernel combination. [sent-409, score-0.488]
86 If we now compare LMNN-MKLP with LMNN-MKLCV, the other baseline method where we select the best kernel with CV, we can see that LMNN-MKLP also performs better, being statistically significant. (Table 2: Accuracy results on the multiple source datasets.) [sent-411, score-0.229]
87 2 Multiple Source Datasets To evaluate the proposed method on problems with multiple sources of information we also perform experiments on the Multiple Features and the Oxford flowers datasets [16]. [sent-434, score-0.089]
88 In the experiment, seven distance matrices from the website are used; these matrices are precomputed from seven different features, whose details are described in [16, 15]. [sent-438, score-0.14]
89 For both datasets, Gaussian kernels are constructed from the different feature representations of the instances with kernel width σ0, where σ0 is the mean of all pairwise distances. [sent-439, score-0.508]
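A sketch of this construction from a precomputed distance matrix; the exact functional form of the Gaussian kernel (squared distances and the 2σ² factor) is our assumption, since only the width σ0 is specified:

```python
import numpy as np

def gaussian_kernel_from_distances(D):
    """Gaussian kernel from a precomputed pairwise distance matrix D,
    with the width sigma_0 set to the mean of all pairwise distances."""
    sigma0 = np.mean(D)
    return np.exp(-(D ** 2) / (2.0 * sigma0 ** 2))
```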
90 We can see that by learning a linear combination of different feature representations, LMNN-MKLP achieves the best predictive performance on both datasets, being significantly better than the two baselines, LMNNH and LMNN-MKLCV. [sent-443, score-0.185]
91 The poor performance of LMNN-MKLCV on the Oxford flowers dataset could be explained by the fact that the different Gaussian kernels are complementary for the given problem, whereas in LMNN-MKLCV only one kernel is selected. [sent-444, score-0.315]
92 7 Conclusions In this paper we combine two recent developments in the field of machine learning, namely metric learning and multiple kernel learning, and propose a general framework for learning a metric in a feature space induced by a weighted combination of a number of individual kernels. [sent-445, score-1.116]
93 This is in contrast with the existing kernelized metric learning techniques which consider only one kernel function (or possibly an unweighted combination of a number of kernels) and hence are sensitive to the selection of the associated feature space. [sent-446, score-0.797]
94 The proposed framework is general as it can be coupled with many existing metric learning techniques. [sent-447, score-0.368]
95 In this work, to practically demonstrate the effectiveness of the proposed approach, we instantiate it with the well-known LMNN metric learning method. [sent-448, score-0.394]
96 The experimental results confirm that the adaptively induced feature space does bring an advantage in terms of predictive performance with respect to feature spaces induced by an unweighted combination of kernels and the single best kernel selected by internal CV. [sent-449, score-0.703]
97 On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. [sent-491, score-0.071]
98 A new pairwise kernel for biological network inference with support vector machines. [sent-595, score-0.231]
99 Distance metric learning for large margin nearest neighbor classification. [sent-602, score-0.401]
100 Distance metric learning with application to clustering with side-information. [sent-612, score-0.368]
wordName wordTfidf (topN-words)
[('mlh', 0.423), ('metric', 0.344), ('mahalanobis', 0.319), ('ml', 0.303), ('lmnn', 0.23), ('psd', 0.214), ('xj', 0.19), ('kernel', 0.187), ('lmnnh', 0.181), ('ki', 0.157), ('mkl', 0.154), ('kernels', 0.128), ('parametrization', 0.125), ('kj', 0.116), ('fh', 0.101), ('hl', 0.097), ('ch', 0.096), ('xi', 0.096), ('unweighted', 0.09), ('pij', 0.077), ('kernelized', 0.064), ('kl', 0.061), ('ijk', 0.06), ('mcnemar', 0.06), ('instances', 0.055), ('mcfee', 0.053), ('tr', 0.052), ('matrix', 0.051), ('al', 0.05), ('pt', 0.048), ('datasets', 0.047), ('feature', 0.047), ('lemma', 0.046), ('distance', 0.046), ('instantiated', 0.046), ('owers', 0.046), ('convex', 0.046), ('pairwise', 0.044), ('optimization', 0.043), ('hz', 0.042), ('multiple', 0.042), ('combination', 0.041), ('babt', 0.04), ('centralnervous', 0.04), ('lmnnmklp', 0.04), ('malefemale', 0.04), ('pkl', 0.04), ('pola', 0.04), ('wdbc', 0.04), ('instantiation', 0.04), ('ower', 0.04), ('kk', 0.04), ('cv', 0.039), ('dl', 0.039), ('yij', 0.039), ('induced', 0.037), ('internal', 0.037), ('oxford', 0.036), ('ionosphere', 0.035), ('iris', 0.035), ('prostate', 0.035), ('sonar', 0.035), ('km', 0.034), ('nearest', 0.033), ('colon', 0.033), ('stroke', 0.033), ('nilsback', 0.033), ('mn', 0.032), ('objective', 0.031), ('parametrize', 0.031), ('wine', 0.031), ('leukemia', 0.031), ('parametrized', 0.031), ('prede', 0.03), ('squared', 0.027), ('standardized', 0.027), ('predictive', 0.026), ('space', 0.026), ('instantiate', 0.026), ('mum', 0.026), ('sij', 0.026), ('convergence', 0.025), ('learn', 0.025), ('matrices', 0.025), ('weighting', 0.025), ('svms', 0.025), ('kulis', 0.025), ('rank', 0.025), ('learning', 0.024), ('ten', 0.024), ('constraints', 0.023), ('ij', 0.022), ('seven', 0.022), ('accuracies', 0.022), ('projection', 0.022), ('euclidean', 0.021), ('jain', 0.021), ('constraint', 0.021), ('distances', 0.021), ('transformation', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
2 0.2655035 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
Author: Kristen Grauman, Fei Sha, Sung J. Hwang
Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1
3 0.18347271 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
4 0.13624161 150 nips-2011-Learning a Distance Metric from a Network
Author: Blake Shaw, Bert Huang, Tony Jebara
Abstract: Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges. 1
5 0.11951327 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
6 0.10455104 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
7 0.10280144 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
8 0.094325483 168 nips-2011-Maximum Margin Multi-Instance Learning
9 0.08529941 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
10 0.082289517 157 nips-2011-Learning to Search Efficiently in High Dimensions
11 0.078018121 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
12 0.075511232 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
13 0.075041212 263 nips-2011-Sparse Manifold Clustering and Embedding
14 0.073948331 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
15 0.073687784 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
16 0.071341261 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
17 0.070761457 54 nips-2011-Co-regularized Multi-view Spectral Clustering
18 0.068690248 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
19 0.068484075 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
20 0.060839556 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
topicId topicWeight
[(0, 0.182), (1, 0.033), (2, -0.073), (3, -0.07), (4, -0.024), (5, 0.051), (6, 0.044), (7, -0.055), (8, 0.045), (9, -0.052), (10, -0.161), (11, 0.143), (12, 0.027), (13, 0.144), (14, -0.067), (15, 0.083), (16, -0.042), (17, 0.202), (18, -0.052), (19, 0.037), (20, -0.009), (21, -0.079), (22, 0.102), (23, 0.042), (24, 0.018), (25, -0.013), (26, 0.016), (27, -0.08), (28, 0.018), (29, -0.036), (30, -0.185), (31, -0.052), (32, -0.103), (33, 0.105), (34, 0.083), (35, 0.028), (36, -0.085), (37, 0.08), (38, 0.236), (39, -0.053), (40, -0.045), (41, -0.104), (42, -0.067), (43, 0.074), (44, 0.008), (45, 0.073), (46, -0.067), (47, -0.187), (48, 0.037), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.95418102 171 nips-2011-Metric Learning with Multiple Kernels
Author: Jun Wang, Huyen T. Do, Adam Woznica, Alexandros Kalousis
Abstract: Metric learning has become a very active research field. The most popular representative–Mahalanobis metric learning–can be seen as learning a linear transformation and then computing the Euclidean metric in the transformed space. Since a linear transformation might not always be appropriate for a given learning problem, kernelized versions of various metric learning algorithms exist. However, the problem then becomes finding the appropriate kernel function. Multiple kernel learning addresses this limitation by learning a linear combination of a number of predefined kernels; this approach can be also readily used in the context of multiple-source learning to fuse different data sources. Surprisingly, and despite the extensive work on multiple kernel learning for SVMs, there has been no work in the area of metric learning with multiple kernel learning. In this paper we fill this gap and present a general approach for metric learning with multiple kernel learning. Our approach can be instantiated with different metric learning algorithms provided that they satisfy some constraints. Experimental evidence suggests that our approach outperforms metric learning with an unweighted kernel combination and metric learning with cross-validation based kernel selection. 1
2 0.66800344 150 nips-2011-Learning a Distance Metric from a Network
Author: Blake Shaw, Bert Huang, Tony Jebara
Abstract: Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges. 1
3 0.65533561 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
Author: Kristen Grauman, Fei Sha, Sung J. Hwang
Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1
4 0.61871994 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
5 0.50865322 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^(−α/(1+α))), where α is the minimum eigenvalue decay rate of the individual kernels. 1
6 0.48481214 139 nips-2011-Kernel Bayes' Rule
7 0.43562624 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
8 0.43519512 26 nips-2011-Additive Gaussian Processes
9 0.4316864 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
10 0.42863736 168 nips-2011-Maximum Margin Multi-Instance Learning
11 0.40063816 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
12 0.38625118 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
13 0.38591999 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
14 0.35900953 157 nips-2011-Learning to Search Efficiently in High Dimensions
15 0.35221791 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
16 0.33689567 263 nips-2011-Sparse Manifold Clustering and Embedding
17 0.33648244 274 nips-2011-Structure Learning for Optimization
18 0.33164561 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
19 0.33110937 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
20 0.32021123 177 nips-2011-Multi-armed bandits on implicit metric spaces
topicId topicWeight
[(0, 0.032), (4, 0.078), (20, 0.023), (26, 0.027), (31, 0.047), (33, 0.022), (39, 0.024), (43, 0.073), (45, 0.106), (57, 0.33), (65, 0.039), (74, 0.021), (83, 0.041), (99, 0.038)]
simIndex simValue paperId paperTitle
1 0.96124947 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations
Author: Flavio Chierichetti, David Liben-nowell, Jon M. Kleinberg
Abstract: Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T ? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet. 1
2 0.95887554 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
Author: Emin Orhan, Robert A. Jacobs
Abstract: Extensive evidence suggests that items are not encoded independently in visual short-term memory (VSTM). However, previous research has not quantitatively considered how the encoding of an item influences the encoding of other items. Here, we model the dependencies among VSTM representations using a multivariate Gaussian distribution with a stimulus-dependent mean and covariance matrix. We report the results of an experiment designed to determine the specific form of the stimulus-dependence of the mean and the covariance matrix. We find that the magnitude of the covariance between the representations of two items is a monotonically decreasing function of the difference between the items’ feature values, similar to a Gaussian process with a distance-dependent, stationary kernel function. We further show that this type of covariance function can be explained as a natural consequence of encoding multiple stimuli in a population of neurons with correlated responses. 1
3 0.93201798 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
4 0.91261739 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
5 0.89757067 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
same-paper 6 0.8270154 171 nips-2011-Metric Learning with Multiple Kernels
7 0.74620098 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
8 0.72227877 157 nips-2011-Learning to Search Efficiently in High Dimensions
9 0.71011382 226 nips-2011-Projection onto A Nonnegative Max-Heap
10 0.69050181 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
11 0.67544842 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
12 0.67058343 177 nips-2011-Multi-armed bandits on implicit metric spaces
13 0.66153306 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
14 0.66149187 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
15 0.65493608 219 nips-2011-Predicting response time and error rates in visual search
16 0.64929295 150 nips-2011-Learning a Distance Metric from a Network
17 0.64769572 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
18 0.64581913 30 nips-2011-Algorithms for Hyper-Parameter Optimization
19 0.64186746 274 nips-2011-Structure Learning for Optimization
20 0.63695586 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron