jmlr jmlr2011 jmlr2011-55 knowledge-graph by maker-knowledge-mining

55 jmlr-2011-Learning Multi-modal Similarity


Source: pdf

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. [sent-8, score-0.508]

2 Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. [sent-9, score-0.426]

3 Keywords: multiple kernel learning, metric learning, similarity. [sent-11, score-0.489]

4 In some cases, a natural notion of similarity may emerge from domain knowledge, for example, cosine similarity for bag-of-words models of text. [sent-13, score-0.572]

5 For example, if the corpus consists of musical data, each song or artist may be represented simultaneously by acoustic features (such as rhythm and timbre), semantic features (tags, lyrics), or social features (collaborative filtering, artist reviews and biographies, etc). [sent-16, score-0.686]

6 In such cases, there is generally no obvious way to combine representations to form a unified similarity space which optimally integrates heterogeneous data. [sent-18, score-0.415]

7 , Euclidean distance in Rd ), and the problem of optimizing similarity reduces to finding a suitable embedding of the data under a specific choice of the distance metric. [sent-25, score-0.799]

8 Once a suitable metric has been learned, similarity to new, unseen data can be computed either directly (if the metric takes a certain parametric form, for example, a linear projection matrix), or via out-of-sample extensions (Bengio et al. [sent-36, score-0.477]

9 To guide the construction of a similarity space for multi-modal data, we adopt the idea of using similarity measurements, provided by human labelers, as side-information. [sent-38, score-0.572]

10 Thus, rather than rely on quantitative similarity or hard binary labels of pairwise similarity, it is now becoming increasingly common to collect similarity information in the form of triadic or relative comparisons (Schultz and Joachims, 2004; Agarwal et al. [sent-42, score-0.732]

11 ” Although this form of similarity measurement has been observed to be more stable than quantitative similarity (Kendall and Gibbons, 1990), and clearly provides a richer representation than binary pairwise similarities, it is still subject to problems of consistency and inter-labeler agreement. [sent-44, score-0.572]

12 In the present work, our goal is to develop a framework for integrating multi-modal data so as to optimally conform to perceptual similarity encoded by relative comparisons. [sent-46, score-0.536]

13 The algorithm must be able to integrate multi-modal data in an optimal way, that is, the distances between embedded points should conform to perceptual similarity measurements. [sent-50, score-0.567]

14 Humans supply perceptual similarity measurements in the form of relative pairwise comparisons, which are in turn filtered by graph processing algorithms, and then used as constraints to optimize the multiple kernel embedding. [sent-55, score-0.759]

15 This type of embedding problem has been previously studied by Agarwal et al. [sent-57, score-0.427]

16 (2007) provide no out-of-sample extension, and neither support heterogeneous feature integration, nor do they address the problem of noisy similarity measurements. [sent-60, score-0.412]

17 However, none of these methods specifically address the problem of learning a unified data representation which conforms to perceptual similarity measurements. [sent-69, score-0.451]

18 First, we develop the partial order embedding (POE) framework (McFee and Lanckriet, 2009b), which allows us to use graph-theoretic algorithms to filter a collection of subjective similarity measurements for consistency and redundancy. [sent-72, score-0.906]

19 We then formulate a novel multiple kernel learning (MKL) algorithm which learns an ensemble of feature space projections to produce a unified similarity space. [sent-73, score-0.468]

20 Our method is able to produce non-linear embedding functions which generalize to unseen, out-of-sample data. [sent-74, score-0.427]

21 In Section 3, we derive an embedding algorithm which learns an optimal transformation of a single feature space. [sent-78, score-0.47]

22 In Section 4, we develop a novel multiple-kernel learning formulation for embedding problems, and derive an algorithm to learn an optimal space from heterogeneous data. [sent-79, score-0.51]

23 Section 5 provides experimental results illustrating the effects of graph-processing on noisy similarity data, and the effectiveness of the multiple-kernel embedding algorithm on a music similarity task with human perception measurements. [sent-80, score-1.045]

24 A Euclidean embedding is a function g : X → Rd which maps X into a d-dimensional space equipped with the Euclidean (ℓ2) metric: ‖x − y‖₂ = √((x − y)^T (x − y)). [sent-92, score-0.427]

25 A Graphical View of Similarity: Before we can construct an embedding algorithm for multi-modal data, we must first establish the form of side-information that will drive the algorithm, that is, the similarity measurements that will be collected from human labelers. [sent-105, score-0.796]

26 Unlike classical or metric MDS techniques, which seek to preserve quantitative distances, NMDS seeks an embedding in which the rank-ordering of distances is preserved. [sent-113, score-0.566]

27 Since NMDS only needs the rank-ordering of distances, and not the distances themselves, the task of collecting similarity measurements can be simplified by asking test subjects to order pairs of points by similarity: “Are i and j more similar than k and ℓ? [sent-114, score-0.444]

28 ” Based on this kind of relative comparison data, the embedding problem can be formulated as follows. [sent-116, score-0.473]

29 The goal is to find an embedding function g : X → Rd such that ∀(i, j, k, ℓ) ∈ C : ‖g(i) − g(j)‖² + 1 < ‖g(k) − g(ℓ)‖². [sent-119, score-0.427]
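As an illustration of this constraint set, the following sketch counts how many relative comparisons satisfy the unit-margin inequality above. The function name, and the assumption that the embedding is given as a coordinate matrix G with one row per object, are illustrative rather than taken from the paper.

```python
import numpy as np

def satisfied_fraction(G, C):
    """Fraction of comparisons (i, j, k, l) in C for which
    ||G[i] - G[j]||^2 + 1 < ||G[k] - G[l]||^2 holds.
    G : (n, d) array of embedded points; C : list of index 4-tuples."""
    satisfied = 0
    for i, j, k, l in C:
        d_ij = np.sum((G[i] - G[j]) ** 2)   # squared distance of the "close" pair
        d_kl = np.sum((G[k] - G[l]) ** 2)   # squared distance of the "far" pair
        if d_ij + 1.0 < d_kl:
            satisfied += 1
    return satisfied / max(len(C), 1)
```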

30 (2007) work with this kind of relative comparison data and describe a generalized NMDS algorithm (GNMDS), which formulates the embedding problem as a semi-definite program. [sent-122, score-0.473]

31 1, we develop a graphical framework to infer and interpret the global structure exhibited by the constraints of the embedding problem. [sent-128, score-0.475]

32 The final, reduced set of relative comparison constraints defines a partial order, making for a more robust and efficient embedding problem. [sent-131, score-0.593]

33 If C contains cycles, then there exists no embedding which can satisfy C . [sent-138, score-0.427]

34 If C is acyclic, any embedding that satisfies the transitive reduction C_min also satisfies C. [sent-140, score-0.506]

35 The first fact implies that no algorithm can produce an embedding which satisfies all measurements if the graph is cyclic. [sent-141, score-0.552]

36 In fact, the converse of this statement is also true: if C is acyclic, then an embedding exists in which all similarity measurements are preserved (see Appendix A). [sent-142, score-0.796]

37 These two observations allude to two desirable properties in C for embedding methods: transitivity and anti-symmetry. [sent-147, score-0.427]

38 Graph Simplification: Because a set of similarity measurements C containing cycles cannot be embedded in any Euclidean space, C is inherently inconsistent. [sent-152, score-0.417]

39 In the graph view of similarity measurements, redundancies can be easily removed by computing the transitive reduction of the graph (Aho et al. [sent-163, score-0.449]

40 Additionally, pruning the constraint set by transitive reduction focuses embedding algorithms on the most important core set of constraints while reducing overhead due to redundant information. [sent-167, score-0.554]
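A minimal sketch of this filtering step is below, using the networkx library purely as a convenient stand-in (the paper does not prescribe an implementation). Each comparison (i, j, k, ℓ) is treated as a directed edge from the "closer" pair to the "farther" pair; a cycle signals inconsistency, and the transitive reduction drops redundant comparisons.

```python
import networkx as nx

def simplify_comparisons(C):
    """View relative comparisons as a directed graph over pairs:
    node (i, j) -> node (k, l) means d(i, j) < d(k, l)."""
    g = nx.DiGraph()
    g.add_edges_from(((i, j), (k, l)) for i, j, k, l in C)

    if not nx.is_directed_acyclic_graph(g):
        # Inconsistent measurements: no embedding can satisfy all of them.
        # (The paper resolves this with graph-based filtering; this sketch
        # only reports the problem rather than choosing which edges to drop.)
        raise ValueError("comparison graph contains cycles")

    reduced = nx.transitive_reduction(g)    # remove redundant comparisons
    return [(a[0], a[1], b[0], b[1]) for a, b in reduced.edges()]
```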

41 Partial Order Embedding: Now that we have developed a language for expressing similarity between items, we are ready to formulate the embedding problem. [sent-169, score-0.713]

42 In this section, we develop an algorithm that learns a representation of data consistent with a collection of relative similarity measurements, and allows us to map unseen data into the learned similarity space after learning. [sent-170, score-0.654]

43 By parameterizing the embedding function g in terms of the feature representation, we will be able to apply g to any point in the feature space, thereby generalizing to data outside of the training set. [sent-172, score-0.552]

44 There are of course many ways to define an embedding function g : RD → Rd . [sent-175, score-0.427]

45 After learning M by solving this optimization problem, the embedding can be extended to out-of-sample points x′ by applying the projection: x′ → Mx′. [sent-187, score-0.427]

46 Finally, after the optimal W is found, the embedding function g : RD → RD can be recovered from the spectral decomposition of W: W = V Λ V^T ⇒ g(x) = Λ^{1/2} V^T x, and a d-dimensional approximation can be recovered by taking the leading d eigenvectors of W. [sent-193, score-0.427]
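A small numpy sketch of this recovery step follows; it assumes W is the learned positive semidefinite matrix, and the function name is illustrative only.

```python
import numpy as np

def embedding_from_W(W, d=None):
    """Recover g(x) = Lambda^{1/2} V^T x from W = V Lambda V^T,
    optionally truncated to the leading d eigenvectors."""
    lam, V = np.linalg.eigh(W)            # eigenvalues in ascending order
    lam = np.clip(lam, 0.0, None)         # guard against small numerical negatives
    order = np.argsort(lam)[::-1]         # sort by decreasing eigenvalue
    lam, V = lam[order], V[:, order]
    if d is not None:                     # optional d-dimensional approximation
        lam, V = lam[:d], V[:, :d]
    P = np.sqrt(lam)[:, None] * V.T       # the projection Lambda^{1/2} V^T
    return lambda x: P @ x                # g(x) = Lambda^{1/2} V^T x
```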

47 Algorithm 2 (Linear partial order embedding, LPOE). Input: n objects X, partial order C, data matrix X ∈ R^{D×n}, β > 0. Output: mapping g : X → R^d. Objective: min_{W,ξ} tr(W) + (β/|C|) Σ_{(i,j,k,ℓ)∈C} ξ_{ijkℓ}. [sent-194, score-0.675]

48 The embedding function g : X → Rd is therefore the composition of the projection M with φ: g(x) = M(φ(x)). [sent-197, score-0.49]

49 Because φ may be non-linear, this allows us to learn a non-linear embedding g. [sent-198, score-0.427]

50 The embedding g can thus be expressed as g(x) = (⟨ω_p, φ(x)⟩_H)_{p=1}^{d}, where (·)_{p=1}^{d} denotes concatenation. [sent-203, score-0.427]

51 We can now reformulate the embedding problem as an optimization over N rather than M. [sent-215, score-0.427]

52 The embedding problem thus amounts to solving the following optimization problem in N and ξ: min_{N, ξ ≥ 0} … s.t. … [sent-219, score-0.427]

53 The resulting embedding problem is listed as Algorithm 3, again a semi-definite programming problem (SDP), with an objective function and constraints that are linear in W . [sent-225, score-0.475]

54 The resulting embedding function takes the form: g(x) = Λ^{1/2} V^T K_x. [sent-227, score-0.427]

55 Algorithm 3 (Kernel partial order embedding, KPOE). Input: n objects X, partial order C, kernel matrix K, β > 0. Output: mapping g : X → R^n. Objective: min_{W,ξ} tr(WK) + (β/|C|) Σ_{(i,j,k,ℓ)∈C} ξ_{ijkℓ}. [sent-235, score-0.814]
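The following cvxpy sketch illustrates the shape of this semidefinite program. It assumes distances in the learned space take the form (K_i − K_j)^T W (K_i − K_j), consistent with g(x) = Λ^{1/2} V^T K_x above, and uses unit-margin slack constraints as in the relative-comparison formulation earlier; cvxpy itself is a stand-in solver not used in the paper, and the function name is illustrative. For large n a general-purpose SDP solver will be slow, which is part of the motivation for the paper's constraint filtering.

```python
import cvxpy as cp

def kpoe_sketch(K, C, beta=1.0):
    """Rough sketch of the KPOE SDP: minimize tr(WK) + (beta/|C|)*sum(xi)
    subject to unit-margin distance constraints, W PSD, xi >= 0.
    K : (n, n) kernel matrix (numpy array); C : list of (i, j, k, l)."""
    n = K.shape[0]
    W = cp.Variable((n, n), PSD=True)
    xi = cp.Variable(len(C), nonneg=True)

    def d2(a, b):
        v = K[a] - K[b]                    # constant vector K_a - K_b
        return cp.quad_form(v, W)          # (K_a - K_b)^T W (K_a - K_b), affine in W

    constraints = [d2(i, j) + 1 <= d2(k, l) + xi[t]
                   for t, (i, j, k, l) in enumerate(C)]
    objective = cp.Minimize(cp.trace(W @ K) + beta / len(C) * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value
```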

56 Thus, if C defines a DAG, we could exploit the fact that a partial order over distances always allows an embedding which satisfies all constraints in C (see Appendix A) to eliminate the slack variables from the program entirely. [sent-243, score-0.622]

57 Multiple Kernel Embedding: In the previous section, we derived an algorithm to learn an optimal projection from a kernel space H to Rd such that Euclidean distance between embedded points conforms to perceptual similarity. [sent-245, score-0.458]

58 —we can formulate a multiple kernel learning algorithm to optimally combine all feature spaces into a single, unified embedding space. [sent-254, score-0.655]

59 , those which can be transformed to best fit C ) from bad features, and as a result, the quality of the resulting embedding may be degraded. [sent-276, score-0.427]

60 Applying this reasoning to learning an embedding that conforms to perceptual similarity, one might consider a two-stage approach to parameterizing the embedding (Figure 3(a)): first construct a weighted kernel combination, and then project from the combined kernel space. [sent-283, score-1.297]

61 Constraining the embedding to a single metric W with a single weight µ p for each kernel may be too restrictive to take advantage of this phenomenon. [sent-291, score-0.63]

62 Intuitively, by defining the embedding as the concatenation of m different projections, we allow the algorithm to learn an ensemble of projections, each tailored to its corresponding domain space and jointly optimized to produce an optimal space. [sent-298, score-0.465]

63 Mathematically, an embedding function of this form can be expressed as the concatenation g(x) = (M_p(φ_p(x)))_{p=1}^{m}. [sent-300, score-0.427]

64 Now, given this characterization of the embedding function, we can adapt Algorithm 3 to optimize over multiple kernels. [sent-301, score-0.427]

65 Again, by invoking the representer theorem for each M_p, it follows that M_p = N_p(Φ_p)^T for some matrix N_p, which allows us to reformulate the embedding problem as a joint optimization over N_p, p = 1, . . . [sent-303, score-0.427]

66 The embedding function can now be rewritten as g(x) = (M_p(φ_p(x)))_{p=1}^{m} = (N_p K_x^p)_{p=1}^{m}, and the inner products between embedded points take the form A_{ij} = ⟨g(x_i), g(x_j)⟩ = Σ_{p=1}^{m} (N_p K_i^p)^T (N_p K_j^p) = Σ_{p=1}^{m} (K_i^p)^T (N_p)^T (N_p) (K_j^p). [sent-311, score-0.581]
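To make the inner-product expression above concrete, here is a small numpy sketch. The names are illustrative: Ks is assumed to hold the m training kernel matrices K^p and Ns the corresponding learned matrices N_p.

```python
import numpy as np

def multi_kernel_gram(Ns, Ks):
    """Inner products A[i, j] = sum_p (N_p K_i^p)^T (N_p K_j^p), where
    Ks[p] is the p-th (n, n) kernel matrix and Ns[p] the learned N_p."""
    n = Ks[0].shape[0]
    A = np.zeros((n, n))
    for N_p, K_p in zip(Ns, Ks):
        E_p = N_p @ K_p        # column i is the p-th block N_p K_i^p of g(x_i)
        A += E_p.T @ E_p       # accumulate (N_p K_i^p)^T (N_p K_j^p) for all i, j
    return A
```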

67 , µm to form a weighted feature space H ∗ , which is subsequently projected to the embedding space via M. [sent-393, score-0.47]

68 The projections are jointly optimized to produce the embedding space. [sent-395, score-0.465]

69 Algorithm 4 (Multiple kernel partial order embedding, MKPOE). Input: n objects X, partial order C, m kernel matrices K^1, K^2, . . .

70 Experiments: To evaluate our framework for learning multi-modal similarity, we first test the multiple kernel learning formulation on a simple toy taxonomy data set, and then on a real-world data set of musical perceptual similarity measurements. [sent-412, score-0.64]

71 For each fold, we learned a diagonally-constrained embedding with Algorithm 4, using the subset of relative comparisons (i, j, k, ℓ) with i, j, k and ℓ restricted to the training set. [sent-433, score-0.624]

72 After learning the embedding, the held out data (validation or test) was mapped into the space, and the accuracy of the embedding was determined by counting the fraction of correctly predicted relative comparisons. [sent-434, score-0.473]

73 We repeat this experiment for each base kernel individually (that is, optimizing over K1 with a single base kernel), as well as the unweighted sum kernel (K1 with all base kernels), and finally MKPOE (K4 with all base kernels). [sent-436, score-0.559]

74 The artist similarity problem is motivated by several real-world applications, including recommendation and playlist-generation for online radio. [sent-451, score-0.537]

75 , tags, acoustic features, social data), such applications can benefit greatly from an optimally integrated similarity metric. [sent-454, score-0.453]

76 Relative comparisons were acquired from human test subjects through a web survey; subjects were presented with a query artist (i), and asked to choose what they believe to be the most similar artist ( j) from a list of 10 candidates. [sent-457, score-0.578]

77 The MKL embedding results are presented in Section 5. [sent-466, score-0.427]

78 Finally, the kernel between artists i and j is the probability product kernel (Jebara et al. [sent-492, score-0.467]

79 Last.fm, a social music website which allows users to label songs or artists with arbitrary tag words or social tags. [sent-502, score-0.496]

80 The kernel between artists for social tags is the cosine similarity (linear kernel) between TF-IDF vectors. [sent-505, score-0.753]

81 As with social tags, the kernel between artists is the cosine similarity between TF-IDF bag-of-words vectors. [sent-509, score-0.694]

82 The kernel between artists in this view is the cosine of the angle between corresponding rows in the matrix, which can be interpreted as counting the amount of overlap between the sets of users listening to each artist and normalizing for overall artist popularity. [sent-515, score-0.83]
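Both the tag kernel and this collaborative-filter kernel reduce to cosine similarity between rows of a feature matrix. A minimal numpy sketch is below; the feature matrix X is assumed to hold TF-IDF weights or binary user-listening indicators, one row per artist, and the function name is illustrative.

```python
import numpy as np

def cosine_kernel(X):
    """Cosine similarity between rows of X:
    K[i, j] = <x_i, x_j> / (||x_i|| * ||x_j||)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    Xn = X / norms                   # row-normalize
    return Xn @ Xn.T                 # Gram matrix of normalized rows
```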

83 ) After training, test artists were mapped into the learned space (by Equation 9), and accuracy was measured by counting the number of measurements (i, j, i, k) correctly predicted by distance in the learned space, where i belongs to the test set, and j, k belong to the training set. [sent-526, score-0.426]

84 Figure 6: aset400 embedding results for each of the base kernels. [sent-545, score-0.478]

85 ) After finding the best-performing β, the embedding is trained on the full (processed) training set. [sent-551, score-0.466]

86 Figure 7: aset400 embedding results with multiple kernel learning: the learned metrics are optimized over K4 by Algorithm 4. [sent-586, score-0.64]

87 The embedding captures a great deal of high-level genre structure: for example, the classic rock and metal genres lie at the opposite end of the space from pop and hiphop. [sent-590, score-0.427]

88 Using each of these variants of the training set, we test the embedding algorithm with both diagonal and full-matrix formulations. [sent-599, score-0.515]

89 Table 2: aset400 embedding results (Biography kernel) for three possible refinements of the constraint set. [sent-628, score-0.427]

90 The embedding is constructed by optimizing over K4 with all five base kernels. [sent-641, score-0.478]

91 Figure 10: Accuracy of the learned embedding for each training/test split, averaged over ten trials with random maximal acyclic constraint subgraphs. [sent-653, score-0.534]

92 It remains to show that the R1 partial order embedding problem (hereafter referred to as 1-POE) is NP-Hard. [sent-665, score-0.499]

93 Since 1-POE can be reduced to the general optimization problem of finding an embedding of minimal dimensionality, we can conclude that dimensionality reduction subject to partial order constraints is also NP-Hard. [sent-674, score-0.547]

94 Table 1 provides a unified perspective of multiple kernel learning formulations for embedding problems, but it is clearly not complete. [sent-684, score-0.566]

95 , a chain graph), the problem reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying embedding can always be found by the constant-shift embedding algorithm of Roth et al. [sent-691, score-0.91]

96 An embedding can be found by first applying classical multidimensional scaling (MDS) (Cox and Cox, 1994) to ∆: A = −(1/2) H∆H, where H = I − (1/n) 1 1^T is the n × n centering matrix, and 1 is a vector of 1s. [sent-698, score-0.483]

97 The embedding g can be found by decomposing A = V ΛV^T, so that g(x_i) is the ith column of Λ^{1/2} V^T; this is the solution constructed by the constant-shift embedding non-metric MDS algorithm of Roth et al. [sent-700, score-0.854]
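A compact numpy sketch of this construction follows; it assumes ∆ is a symmetric matrix of squared dissimilarities, as in classical MDS, and keeps only the non-negative part of the spectrum.

```python
import numpy as np

def classical_mds(Delta, d=None):
    """Double-center Delta, A = -0.5 * H Delta H with H = I - (1/n) 11^T,
    then embed via the eigendecomposition A = V Lambda V^T."""
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    A = -0.5 * H @ Delta @ H
    lam, V = np.linalg.eigh(A)
    lam = np.clip(lam, 0.0, None)         # drop negative eigenvalues
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    if d is not None:                     # optional d-dimensional embedding
        lam, V = lam[:d], V[:, :d]
    return np.sqrt(lam)[:, None] * V.T    # g(x_i) is the i-th column of Lambda^{1/2} V^T
```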

98 Thus, for any instance (X , C ), an embedding can be found in Rn−1 . [sent-704, score-0.427]

99 As noted by Schultz and Joachims (2004), the fraction of relative comparisons satisfied by an embedding g is closely related to the area under the receiver operating characteristic curve (AUC). [sent-726, score-0.549]

100 An embedding which satisfies a complete set of constraints of this form will receive an AUC score of 1, since every relevant point must be closer to xi than every irrelevant point. [sent-736, score-0.534]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('embedding', 0.427), ('similarity', 0.286), ('artist', 0.251), ('artists', 0.189), ('modal', 0.171), ('anckriet', 0.171), ('imilarity', 0.171), ('kernel', 0.139), ('pw', 0.118), ('ulti', 0.117), ('perceptual', 0.115), ('kip', 0.113), ('ee', 0.106), ('native', 0.106), ('mfcc', 0.106), ('jk', 0.104), ('mkpoe', 0.101), ('mds', 0.096), ('ei', 0.096), ('bio', 0.088), ('schultz', 0.088), ('measurements', 0.083), ('heterogeneous', 0.083), ('social', 0.08), ('kernels', 0.08), ('transitive', 0.079), ('unweighted', 0.077), ('comparisons', 0.076), ('lemon', 0.075), ('distances', 0.075), ('partial', 0.072), ('acyclic', 0.071), ('gert', 0.071), ('tr', 0.068), ('ki', 0.067), ('metric', 0.064), ('mcfee', 0.064), ('biography', 0.063), ('musical', 0.063), ('projection', 0.063), ('psd', 0.062), ('xk', 0.061), ('xi', 0.059), ('tags', 0.059), ('agarwal', 0.058), ('multidimensional', 0.056), ('auc', 0.056), ('cox', 0.054), ('jp', 0.053), ('tag', 0.051), ('base', 0.051), ('betweenness', 0.05), ('conforms', 0.05), ('gnmds', 0.05), ('songs', 0.05), ('earning', 0.05), ('brian', 0.049), ('rd', 0.049), ('diagonal', 0.049), ('cf', 0.048), ('constraints', 0.048), ('embedded', 0.048), ('ek', 0.047), ('optimally', 0.046), ('music', 0.046), ('relative', 0.046), ('tw', 0.044), ('distance', 0.043), ('conform', 0.043), ('kei', 0.043), ('feature', 0.043), ('graph', 0.042), ('kk', 0.041), ('sdp', 0.041), ('acoustic', 0.041), ('globerson', 0.041), ('bernhard', 0.04), ('km', 0.04), ('lanckriet', 0.04), ('processed', 0.04), ('training', 0.039), ('hs', 0.038), ('subjective', 0.038), ('optimized', 0.038), ('angelova', 0.038), ('ellis', 0.038), ('gauc', 0.038), ('kkp', 0.038), ('labelers', 0.038), ('lgauc', 0.038), ('nmds', 0.038), ('pear', 0.038), ('subjectivity', 0.038), ('triadic', 0.038), ('wiip', 0.038), ('subgraph', 0.037), ('taxonomy', 0.037), ('euclidean', 0.037), ('learned', 0.036), ('objects', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999756 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

2 0.12806416 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

3 0.11869048 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet

Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing

4 0.11518816 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls, s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of O(m² n_tot log(n_max)/ε²), where m is the number of training data points, n_max and n_tot are the maximum number of kernels in any group and the total number of kernels respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability

5 0.10982375 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

6 0.10440829 105 jmlr-2011-lp-Norm Multiple Kernel Learning

7 0.099392742 66 jmlr-2011-Multiple Kernel Learning Algorithms

8 0.095864289 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

9 0.090524085 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

10 0.089073956 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

11 0.073428497 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

12 0.063984051 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

13 0.062669456 48 jmlr-2011-Kernel Analysis of Deep Networks

14 0.056793302 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

15 0.053427052 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

16 0.049190026 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

17 0.048928704 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

18 0.046044726 54 jmlr-2011-Learning Latent Tree Graphical Models

19 0.045771465 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

20 0.045709755 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.281), (1, -0.082), (2, 0.131), (3, -0.303), (4, 0.054), (5, 0.049), (6, -0.054), (7, 0.021), (8, -0.042), (9, -0.071), (10, 0.032), (11, 0.033), (12, -0.037), (13, -0.003), (14, -0.008), (15, -0.07), (16, 0.223), (17, 0.046), (18, -0.012), (19, -0.123), (20, -0.066), (21, 0.051), (22, -0.248), (23, 0.023), (24, -0.026), (25, -0.061), (26, 0.08), (27, -0.111), (28, -0.049), (29, 0.074), (30, 0.064), (31, -0.055), (32, -0.102), (33, -0.037), (34, -0.062), (35, -0.058), (36, 0.004), (37, 0.081), (38, -0.043), (39, -0.074), (40, 0.029), (41, 0.009), (42, -0.068), (43, 0.038), (44, 0.013), (45, 0.069), (46, 0.03), (47, 0.008), (48, 0.02), (49, -0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95742935 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

2 0.82139701 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

3 0.59678555 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet

Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing

4 0.56484079 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT

5 0.55609149 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

6 0.52198982 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

7 0.44206759 66 jmlr-2011-Multiple Kernel Learning Algorithms

8 0.43990195 101 jmlr-2011-Variable Sparsity Kernel Learning

9 0.42816159 105 jmlr-2011-lp-Norm Multiple Kernel Learning

10 0.39565569 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

11 0.38182551 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

12 0.32247636 48 jmlr-2011-Kernel Analysis of Deep Networks

13 0.30774119 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

14 0.30564129 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

15 0.30433387 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

16 0.30044752 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

17 0.26203662 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

18 0.25312385 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

19 0.24910814 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

20 0.24298818 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.035), (9, 0.032), (10, 0.027), (11, 0.011), (24, 0.041), (31, 0.074), (32, 0.024), (41, 0.021), (60, 0.012), (71, 0.012), (73, 0.556), (78, 0.034), (90, 0.025), (99, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94220173 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

2 0.93750244 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

3 0.49411544 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

4 0.47872308 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

5 0.47412136 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

6 0.47010404 48 jmlr-2011-Kernel Analysis of Deep Networks

7 0.469028 66 jmlr-2011-Multiple Kernel Learning Algorithms

8 0.46312729 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

9 0.42361674 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

10 0.41476005 35 jmlr-2011-Forest Density Estimation

11 0.40405279 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

12 0.39803833 91 jmlr-2011-The Sample Complexity of Dictionary Learning

13 0.39523944 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

14 0.38924199 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

15 0.37688693 105 jmlr-2011-lp-Norm Multiple Kernel Learning

16 0.36627039 12 jmlr-2011-Bayesian Co-Training

17 0.35743597 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

18 0.35728395 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

19 0.3546519 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

20 0.35188687 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking