jmlr jmlr2011 jmlr2011-4 knowledge-graph by maker-knowledge-mining

4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms


Source: pdf

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by general-purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interior-point SDP solver could be as high as O(N^6.5), which prevents NPKL methods from being applied to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is the SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. The other is the SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, and square loss), which can be re-formulated as a saddle-point optimization problem and further solved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, while maintaining the same accuracy, is significantly more efficient and scalable. Finally, we demonstrate that the proposed technique is also applicable to speeding up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. [sent-16, score-0.218]

2 Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. [sent-18, score-0.165]

3 Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints 1. [sent-19, score-0.19]

4 Empirical evidence shows that the generalization performance of kernel methods is often dominated by the chosen kernel function. [sent-21, score-0.228]

5 Therefore, the choice of an effective kernel plays a crucial role in many kernel based machine learning methods. [sent-23, score-0.228]

6 Typically, traditional kernel methods, for example, Support Vector Machines (SVMs), often adopt a predefined kernel function that is empirically chosen from a pool of parametric kernel functions, such as polynomial and Gaussian kernels. [sent-24, score-0.36]

7 , 2004), which aims at learning a convex combination of several predefined parametric kernels in order to identify a good target kernel for the applications. [sent-34, score-0.155]

8 Despite some encouraging results reported, these techniques often assume the target kernel function is of some parametric forms, which limits their capacity of fitting diverse patterns in real complex applications. [sent-40, score-0.159]

9 Instead of assuming some parametric forms for the target kernel, an emerging group of kernel learning studies are devoted to Non-Parametric Kernel Learning (NPKL) methods, which aim to learn a Positive Semi-Definite (PSD) kernel matrix directly from the data. [sent-41, score-0.248]

10 NPKL provides a flexible learning scheme of incorporating prior/side information into the known similarity measures such that the learned kernel can exhibit better ability to characterize the data similarity. [sent-50, score-0.141]

11 To achieve more robust performance, we propose the second SimpleNPKL algorithm that has other loss functions (including square hinge loss, hinge loss and square loss), which can be re-formulated as a mini-max optimization problem. [sent-63, score-0.246]

12 We extend the proposed SimpleNPKL scheme to resolve other non-parametric kernel learning problems, including colored maximum variance unfolding (Song et al. [sent-72, score-0.14]

13 , 2008), minimum volume embedding (Shaw and Jebara, 2007), and structure preserving embedding (Shaw and Jebara, 2009). [sent-73, score-0.237]

14 Section 2 presents some background of kernel learning, briefly reviews some representative work on kernel learning research, and indicates the motivations of our work. [sent-76, score-0.228]

15 Background Review and Related Work In this section, we review some background of kernel methods and related work on kernel learning research. [sent-84, score-0.228]

16 ‖K‖_F = √(∑_{ij} K_{ij}^2) = √(tr KK′) denotes the Frobenius norm of a matrix K; A ◦ B denotes the element-wise multiplication between two matrices A and B. [sent-90, score-0.182]

17 2 Kernel Methods In general, kernel methods work by embedding data in some Hilbert spaces, and searching for linear relations in the Hilbert spaces. [sent-92, score-0.22]

18 Given the kernel function k, a matrix K ∈ R^{n×n} is called a kernel matrix, also known as a Gram matrix, if K_{ij} = k(x_i, x_j) for a collection of examples x_1, …, x_n. [sent-97, score-0.228]
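
As a quick illustration of this definition, the following minimal numpy sketch (the Gaussian kernel and the gamma value are illustrative choices, not prescribed by the paper) builds a Gram matrix and checks the symmetry and positive semi-definiteness that the NPKL formulations rely on.

```python
import numpy as np

def gram_matrix(X, gamma=1.0):
    """Gram (kernel) matrix K with K[i, j] = k(x_i, x_j) for a Gaussian kernel."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.randn(5, 3)                      # 5 toy examples in R^3
K = gram_matrix(X)
assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10    # PSD up to numerical error
```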

19 Note that the choice of kernel plays a central role for the success of kernel methods. [sent-101, score-0.228]

20 3 Kernel Learning We use the term kernel learning to refer to the problem of learning a kernel function or a kernel matrix from given data, corresponding to the inductive and transductive learning settings, respectively. [sent-106, score-0.342]

21 One basic motivation of kernel learning is to further relax the optimization domain K such that the learned kernel can be as flexible as possible to fit the complex data. [sent-136, score-0.228]

22 The kernel learning formulation discussed above aims to optimize both the classifier and the kernel matrix simultaneously. [sent-144, score-0.228]

23 Another line of kernel learning research mainly focuses on optimizing the kernel only with respect to some criteria under some prior constraints or heuristics. [sent-147, score-0.264]

24 An important technique is the kernel target alignment criterion proposed in Cristianini et al. [sent-148, score-0.16]

25 Thus the optimization variables are reduced from the entire kernel matrix K to the kernel spectrum λ. [sent-156, score-0.228]

26 (2007) proposed an NPKL technique that aims to learn a fully non-parametric kernel matrix from pairwise constraints. [sent-158, score-0.154]

27 The target kernel is maximally aligned to the constraint matrix T and minimally aligned to the graph Laplacian. [sent-159, score-0.153]

28 The objective can be deemed as a form of kernel target alignment without normalization. [sent-160, score-0.16]

29 (2006) employed the Bregman divergence to measure distance between K and a known kernel K_0: min_{K ⪰ 0} D_φ(K, K_0) = tr(K K_0^{-1}) − log det(K K_0^{-1}) − N, (4) where D_φ is a Bregman divergence (Kulis et al. [sent-165, score-0.296]

30 Most of the existing constraints over the entries of K could be expressed by tr KT ≤ b. [sent-169, score-0.218]

31 For example, fixing the trace tr K = 1 is rather common in SDP solvers. [sent-173, score-0.182]

32 , Bregman divergence) to some prior similarity information; • to enforce constraints on the kernel K from prior heuristics, such as the distance constraint K_{ii} + K_{jj} − 2K_{ij} = d_{ij}^2, or side information, etc.; and • to include regularization terms over K to control capacity, such as tr K = 1. [sent-178, score-0.378]

33 Due to the non-parametric nature, the solution space K is capable of fitting diverse empirical data such that the learned kernel K can be more effective and powerful to achieve better empirical performance than traditional parametric kernel functions. [sent-180, score-0.228]

34 Moreover, we also show that the proposed algorithms are rather general, which can be easily extended to solving other kernel learning applications, including dimensionality reduction and data embedding applications. [sent-196, score-0.22]

35 In general, kernel learning with labeled data can be viewed as a special case of kernel learning with side information (Kwok and Tsang, 2003; Kulis et al. [sent-212, score-0.228]

36 A straightforward and intuitive principle for kernel learning is that the kernel entry Ki j should be aligned with the side information Ti j as much as possible (Cristianini et al. [sent-221, score-0.228]

37 The regularizer of the kernel matrix K, which captures the local dependency between the embedding of vi and v j (i. [sent-237, score-0.22]

38 the similarity S_{ij}), can be defined as: Ω(V, S) = (1/2) ∑_{i,j=1}^{N} S_{ij} ‖v_i/√D_i − v_j/√D_j‖_2^2 = tr(VLV′) = tr(LK), (8) where L is the graph Laplacian matrix defined as L = I − D^{−1/2} S D^{−1/2}, (9) where D = diag(D_1, D_2, …). [sent-242, score-0.391]
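
To make the regularizer concrete, here is a small numpy sketch of the normalized graph Laplacian in (9); the 3-node similarity matrix S below is a toy example, not data from the paper.

```python
import numpy as np

def normalized_laplacian(S):
    """Graph Laplacian L = I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix S."""
    d = S.sum(axis=1)                                  # degrees D_1, ..., D_N
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against isolated nodes
    return np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]

S = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
L = normalized_laplacian(S)
# For K = V'V, tr(LK) then plays the role of the regularizer Omega(V, S) in (8).
```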

39 , 2007) as follows: min_K tr(LK) + C ∑_{(i,j)∈(S∪D)} ℓ(T_{ij} K_{ij}), s.t. K ⪰ 0, (10) which in general is a Semi-Definite Programming (SDP) problem (Boyd and Vandenberghe, 2004). [sent-248, score-0.182]

40 Here, C > 0 is a tradeoff parameter between the empirical loss ℓ(·) on the alignment T_{ij} K_{ij} of the target kernel and the dependency among data examples with respect to the intrinsic data structure. [sent-249, score-0.16]

41 To alleviate this problem, we introduce a regularization term tr(K^p). [sent-256, score-0.182]

42 The common choice of the loss function ℓ(·) can be hinge loss, square hinge loss or linear loss. [sent-257, score-0.218]

43 (2002) used ‖K‖_F^2 = ⟨K, K⟩ = tr(K^2) in the objective to penalize the complexity of K; while Lanckriet et al. [sent-261, score-0.182]

44 (2004) proposed to adopt a hard constraint tr(K) ≤ B, where B > 0 is a constant, to control the capacity of K. [sent-262, score-0.244]

45 Proof By introducing a dual variable γ ≥ 0 for the constraint tr(K^p) ≤ B, and Z ∈ S_+ (S_+ is self-dual) for the constraint K ⪰ 0, we have the Lagrangian of (13): L(K; γ, Z) = tr(AK) + γ(B − tr(K^p)) + tr(KZ). [sent-270, score-0.766]

46 By the Karush-Kuhn-Tucker (KKT) conditions, we have: A − γ p K^{p−1} + Z = 0 and tr(KZ) = 0. [sent-271, score-0.182]

47 First, we show that tr(KZ) = 0 is equivalent to KZ = ZK = 0. [sent-272, score-0.182]

48 Since K ⪰ 0 and Z ⪰ 0, we have tr(KZ) = tr(K^{1/2} K^{1/2} Z^{1/2} Z^{1/2}) = ‖K^{1/2} Z^{1/2}‖_F^2. [sent-273, score-0.364]

49 Therefore, we have tr(K^p) = ‖λ‖_p^p ≤ B, tr(AK) = λ′σ, σ = γ p λ^{p−1}, and λ′µ = 0. [sent-282, score-0.364]

50 Alternatively, we add tr(K^p) directly into the objective, and arrive at the following formulation: min_K tr((L − C ∑_{(i,j)∈(S∪D)} T_{ij}) K) + (G/p) tr(K^p), s.t. K ⪰ 0, where G > 0 is a tradeoff parameter. [sent-295, score-0.182]

51 If we set G = ((1/B) tr(A_+^{p/(p−1)}))^{(p−1)/p}, these two formulations result in exactly the same solution. [sent-299, score-0.182]

52 Moreover, if we set B = tr(A_+^{p/(p−1)}), it means we just use the projection A_+ as K. [sent-300, score-0.182]
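
For p = 2, computing A_+ is just an eigendecomposition that keeps the positive part of the spectrum. The sketch below uses a dense eigendecomposition on a random symmetric matrix standing in for A; for large N the paper instead relies on a sparse Lanczos eigendecomposition, which is not shown here.

```python
import numpy as np

def psd_projection(A):
    """Project a symmetric matrix A onto the PSD cone: A_+ keeps only positive eigenvalues."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = 0.5 * (M + M.T)      # toy stand-in for A (e.g. A = C * sum_ij T_ij - L in the linear-loss case)
K = psd_projection(A)    # with p = 2 and B = tr(A_+^2), K is simply A_+
assert np.linalg.eigvalsh(K).min() >= -1e-10
```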

53 In the sequel, we consider the regularization tr K p with p = 2 for its simplicity and smoothness. [sent-302, score-0.182]

54 To address this issue, in this section, we present another NPKL formulation that uses (square) hinge loss ℓ(f) = (max(0, 1 − f))^d / d, which sometimes can be more robust, where d = 1 (hinge loss) or d = 2 (square hinge loss). [sent-305, score-0.162]
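
For reference, this loss is simple enough to write down directly as a one-line sketch, where f stands for the alignment T_ij K_ij:

```python
def alignment_loss(f, d=2):
    """(Square) hinge loss l(f) = max(0, 1 - f)**d / d, with d = 1 (hinge) or d = 2 (square hinge)."""
    return max(0.0, 1.0 - f) ** d / d
```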

55 We first focus on the NPKL formulation with square hinge loss, which can be written into the following constrained optimization: min_{K, ε_{ij}} tr(LK) + (C/2) ∑_{(i,j)∈(S∪D)} ε_{ij}^2, (21) [sent-306, score-0.277]

56 s.t. ∀(i, j) ∈ (S ∪ D): T_{ij} K_{ij} ≥ 1 − ε_{ij}, (22) K ⪰ 0, tr(K^p) ≤ B. [sent-308, score-0.182]

57 Dual Formulation: The Saddle-Point Minimax Problem By Lagrangian theory, we introduce dual variables α_{ij} (α_{ij} ≥ 0) for the constraints in (22), and derive a partial Lagrangian of (21): tr(LK) + (C/2) ∑ ε_{ij}^2 − ∑ α_{ij} (T_{ij} K_{ij} − 1 + ε_{ij}). [sent-313, score-0.218]

58 Setting the derivatives w.r.t. the primal variables ε_{ij} to zero, we have ∀(i, j) ∈ (S ∪ D): C ε_{ij} = α_{ij} ≥ 0; substituting them back into (23), we arrive at the following saddle-point minimax problem J(K, α): [sent-318, score-0.182]

59 max_α min_K tr((L − ∑_{(i,j)} α_{ij} T_{ij}) K) − (1/2C) ∑_{(i,j)} α_{ij}^2 + ∑_{(i,j)} α_{ij}, s.t. K ⪰ 0, tr(K^p) ≤ B, α_{ij} ≥ 0 ∀(i, j) ∈ S ∪ D, (24) where α = [α_{ij}] denotes a matrix of dual variables α_{ij} for (i, j) ∈ S ∪ D, and other entries are zeros. [sent-320, score-0.182]

60 Together with the above lemma, we compute the gradient at the point α by: ∇J_{ij} = 1 − tr(T_{ij} K) − (1/C) α_{ij}, (25) where K = (B / tr(A_+^{p/(p−1)}))^{1/p} A_+^{1/(p−1)} and A = ∑_{(i,j)} α_{ij}^t T_{ij} − L. [sent-333, score-0.182]
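
Reading (25) as the update direction, one iteration of the resulting scheme for p = 2 might be sketched as follows. This is a rough dense-matrix illustration: the step size eta and the simple projection alpha >= 0 are assumptions on my part, and the paper's Algorithm 1 should be consulted for the exact update and stopping rule.

```python
import numpy as np

def sq_hinge_dual_step(alpha, T_list, L, B, C, eta):
    """One projected-gradient ascent step on the duals alpha (square hinge loss, p = 2).

    alpha: 1-D numpy array of current duals, aligned with T_list (the list of T_ij matrices).
    """
    # A = sum_ij alpha_ij T_ij - L; K is the PSD projection A_+ rescaled so that tr(K^2) = B.
    A = sum(a * T for a, T in zip(alpha, T_list)) - L
    w, V = np.linalg.eigh(A)
    A_plus = (V * np.maximum(w, 0.0)) @ V.T
    K = np.sqrt(B / max(np.trace(A_plus @ A_plus), 1e-12)) * A_plus
    # Gradient (25): grad_ij = 1 - tr(T_ij K) - alpha_ij / C, then keep alpha_ij >= 0.
    grad = np.array([1.0 - np.trace(T @ K) - a / C for T, a in zip(T_list, alpha)])
    return np.maximum(alpha + eta * grad, 0.0), K
```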

61 Similarly, for the other formulation: min_{K, ε_{ij}} tr(LK) + (C/2) ∑_{(i,j)∈(S∪D)} ε_{ij}^2 + (G/p) tr(K^p), s.t. ∀(i, j) ∈ (S ∪ D): T_{ij} K_{ij} ≥ 1 − ε_{ij}, (26) [sent-334, score-0.364]

62 we can derive the corresponding saddle-point minimax problem of (26): max_α min_K tr((L − ∑_{(i,j)} α_{ij} T_{ij}) K) − (1/2C) ∑_{(i,j)} α_{ij}^2 + ∑_{(i,j)} α_{ij} + (G/p) tr(K^p), s.t. K ⪰ 0, α_{ij} ≥ 0 ∀(i, j) ∈ S ∪ D. [sent-336, score-0.546]

63 min_{K ⪰ 0} tr(A_0 K) with m linear constraints. Moreover, from the empirical study in Alizadeh et al. [sent-351, score-0.218]

64 Our formulation (21) has an additional constraint: tr(K^2) ≤ B for p = 2. [sent-354, score-0.182]

65 This condition equivalently constrains tr(K), which is a common assumption in SDP problems (Krishnan and Mitchell, 2006). [sent-355, score-0.218]

66 To show this, we have B ≥ tr(KK) = ∑_i λ_i^2 ≥ (1/N)(∑_i λ_i · 1)^2 = (1/N)(tr K)^2, where the second inequality follows from the Cauchy inequality. [sent-356, score-0.182]

67 Consequently, we have ‖∇J(α^{(1)}) − ∇J(α^{(2)})‖_F = ‖[(1 − tr(T_{ij} K_1) − α_{ij}^{(1)}/C) − (1 − tr(T_{ij} K_2) − α_{ij}^{(2)}/C)]_{(i,j)}‖_F = ‖[tr(T_{ij}(K_2 − K_1)) + (α_{ij}^{(2)} − α_{ij}^{(1)})/C]_{(i,j)}‖_F ≤ (∑_{(i,j)} ‖T_{ij}‖_F^2)^{1/2} ‖K_1 − K_2‖_F + (1/C) ‖α^{(1)} − α^{(2)}‖_F. [sent-375, score-0.546]

68 4 SimpleNPKL with Square Loss In this subsection, we consider square alignment loss for the SimpleNPKL framework: min_{K, ε_{ij}} tr(LK) + (C/2) ∑_{(i,j)∈(S∪D)} ε_{ij}^2, s.t. ∀(i, j) ∈ (S ∪ D): T_{ij} K_{ij} = 1 − ε_{ij}. [sent-397, score-0.264]

69 3, we derive the following min-max problem: max_α min_K tr((L − ∑_{ij} α_{ij} T_{ij}) K) + ∑_{ij} α_{ij} − (1/2C) ∑_{ij} α_{ij}^2, s.t. K ⪰ 0, tr(K^p) ≤ B. [sent-402, score-0.44]

70 5 SimpleNPKL with Hinge Loss In this subsection, we consider hinge loss for the SimpleNPKL framework: min_{K, ε_{ij}} tr(LK) + C ∑_{(i,j)∈(S∪D)} ε_{ij}, s.t. [sent-412, score-0.277]

71 ∀(i, j) ∈ (S ∪ D): T_{ij} K_{ij} ≥ 1 − ε_{ij}, ε_{ij} ≥ 0, K ⪰ 0, tr(K^p) ≤ B. [sent-414, score-0.182]

72 Following the standard techniques of Lagrangian duality, we arrive at the min-max problem: max_α min_K tr((L − ∑ α_{ij} T_{ij}) K) + ∑ α_{ij}, s.t. K ⪰ 0, tr(K^p) ≤ B, 0 ≤ α_{ij} ≤ C. [sent-415, score-0.364]

73 The gradient w.r.t. α is ∇J_{ij} = 1 − tr(T_{ij} K); the whole analysis of Section 4. [sent-419, score-0.182]
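
Relative to the square-hinge sketch earlier, the only changes under hinge loss would be dropping the -alpha/C term from the gradient and clipping the duals into the box [0, C]; a minimal illustrative step, using the same hypothetical names as before:

```python
import numpy as np

def hinge_dual_step(alpha, T_list, K, C, eta):
    """Projected-gradient step for hinge loss: gradient 1 - tr(T_ij K), box constraint 0 <= alpha <= C."""
    grad = np.array([1.0 - np.trace(T @ K) for T in T_list])
    return np.clip(alpha + eta * grad, 0.0, C)
```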

74 Moreover, from Proposition 4, the rank r of the kernel matrix K is upper bounded by the number of active constraints. [sent-446, score-0.16]

75 However, some less informative constraints often do not contribute much to the learning of the kernel matrix K, and fitting some noisy pairwise constraints may also lead to poor generalization. [sent-456, score-0.226]

76 Instead of acquiring class label information for kernel learning, here we consider another simple active constraint selection scheme. [sent-462, score-0.159]

77 In particular, we consider the data embedding problems, where the goal is to find a new data representation that preserves some similarity/distance constraints between pairs of data points. [sent-479, score-0.142]

78 These problems typically can be implemented by constraining the alignment of the target kernel matrix to some prior affinity or distance structures. [sent-480, score-0.16]

79 As a result, the kernel matrix K = V′ V implies a data embedding with a natural interpretation, in which the column vector of V corresponds to the new data representation. [sent-481, score-0.22]
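
A standard way to read off such an embedding from a learned K is a truncated eigendecomposition; a small numpy sketch, where the target dimension d is whatever the downstream task needs:

```python
import numpy as np

def embedding_from_kernel(K, d):
    """Return a d x N matrix V with K ~ V' V; each column of V is an embedded data point."""
    w, U = np.linalg.eigh(K)
    top = np.argsort(w)[::-1][:d]                    # indices of the d largest eigenvalues
    return np.sqrt(np.maximum(w[top], 0.0))[:, None] * U[:, top].T
```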

80 where tr(K T_{ij}) = K_{ii} + K_{jj} − 2K_{ij} is the squared distance between x_i and x_j. [sent-487, score-0.248]

81 Consequently, the optimization task of colored MVU is reformulated as: min_{K, ξ} −tr(HKHY) + (C/2) ∑_{ij} ξ_{ij}^2, s.t. K ⪰ 0, tr(K T_{ij}) = D_{ij} − ξ_{ij}, ∀(i, j) ∈ N, where H_{ij} = δ_{ij} − N^{−1} such that HKH centers K, and Y = yy′ is the kernel matrix over labels. [sent-493, score-0.36]

82 Following the SimpleNPKL algorithms, we derive the minimax optimization problem by introducing dual variables for the inequality constraints: max_α min_K tr((−HYH − ∑_{ij} α_{ij} T_{ij}) K) + ∑_{ij} α_{ij} D_{ij} − (1/2C) ∑_{ij} α_{ij}^2, s.t. K ⪰ 0, tr(KK) ≤ B. (31) [sent-495, score-0.44]

83 By substituting the results A = HYH + ∑_{ij} α_{ij} T_{ij} and ∇J_{ij}^t = D_{ij} − tr(T_{ij} K) − (1/C) α_{ij}^t back into Algorithm 1, the problem of (31) can be solved immediately. [sent-496, score-0.258]
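
To illustrate how CMVU slots into the same machinery, here is a rough sketch of building the centering matrix H and the matrix A = HYH + sum_ij alpha_ij T_ij from a toy label vector y; the names and shapes are my own illustration, not code from the paper.

```python
import numpy as np

def cmvu_A(alpha, T_list, y):
    """A = H Y H + sum_ij alpha_ij T_ij with H_ij = delta_ij - 1/N and Y = y y'."""
    N = len(y)
    H = np.eye(N) - np.ones((N, N)) / N     # HKH centers the kernel matrix K
    Y = np.outer(y, y)
    return H @ Y @ H + sum(a * T for a, T in zip(alpha, T_list))
```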

84 When the intrinsic dimensionality d is available, MVE formulates the data embedding problem as follows: min_K −tr(K K_0), s.t. the same set of constraints of MVU. [sent-505, score-0.142]

85 Specifically, we make the following modifications: A = K_0 + ∑_{ij} α_{ij} T_{ij} and ∇J_{ij}^t = D_{ij} − tr(T_{ij} K) − (1/C) α_{ij}^t. By substituting the above results back into Algorithm 1, we can solve the MVE problem efficiently. [sent-510, score-0.22]

86 SPE learns a kernel matrix K such that the similarity tr KW is maximized while the global topological properties of the input graph are preserved. [sent-514, score-0.323]

87 More formally, the SPE problem is formulated into the following SDP optimization: min_K −tr(KW) + Cξ, s.t. D_{ij} > (1 − W_{ij}) max_m(W_{im} D_{im}) − ξ, ξ ≥ 0, where D_{ij} = K_{ii} + K_{jj} − 2K_{ij} = tr(K T_{ij}) is the squared distance between x_i and x_j. [sent-515, score-0.182]

88 Then for each point x_i, SPE essentially generates (n − |N_i|) × |N_i| constraints: tr(K T_{ij}) > tr(K T_{ik}) − ξ, ∀i ∈ [n], j ∈ [n] − N_i, k ∈ N_i. [sent-520, score-0.364]
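
The constraint set can be enumerated straight from the k-nearest-neighbor graph; a small sketch, where neighbors[i] is assumed to hold the index set N_i and, following the paper's count, j ranges over all indices outside N_i:

```python
def spe_triplets(neighbors, n):
    """Yield (i, j, k) with j in [n] - N_i and k in N_i, encoding tr(K T_ij) > tr(K T_ik) - xi."""
    for i in range(n):
        Ni = set(neighbors[i])
        for j in (j for j in range(n) if j not in Ni):
            for k in Ni:
                yield (i, j, k)

# Example: 4 points, each with a single neighbor, gives (4 - 1) * 1 = 3 triplets per point.
triplets = list(spe_triplets([[1], [2], [3], [0]], 4))
assert len(triplets) == 12
```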

89 In order to speed up the SPE algorithm, we apply the SimpleNPKL technique to turn the SPE optimization into the following minimax optimization problem: max_α min_K tr((∑_i ∑_{k∈N_i} ∑_{j∉N_i} α_{ijk} (T_{ik} − T_{ij}) − W) K), s.t. K ⪰ 0, tr(KK) ≤ B, ∑ α_{ijk} ∈ [0, C]. [sent-521, score-0.364]

90 Similarly, we can derive the following results: A = W − ∑ α_{ijk} (T_{ik} − T_{ij}) and ∇J_{ijk}^t = tr(K(T_{ik} − T_{ij})). [sent-522, score-0.182]

91 1 Experimental Setup We examine both efficacy and efficiency of the proposed SimpleNPKL using side information to learn a kernel matrix for kernel k-means clustering. [sent-527, score-0.228]

92 (2007), the learned kernel matrix of the Non-Parametric Kernel Learning (NPKL) outperforms other kernel learning methods in the task of clustering using side information. [sent-529, score-0.272]

93 The proposed SimpleNPKL with square hinge loss produces very competitive clustering performance to the results of NPKL with hinge loss (as reported in Hoi et al. [sent-578, score-0.262]

94 SimpleNPKL with square hinge loss and NPKL with hinge loss often perform better than the NPKL methods using linear loss. [sent-580, score-0.218]

95 First of all, we can see that by learning better kernels from pairwise constraints, both SimpleNPKL algorithms produce better clustering performance than that of k-means clustering and constrained k-means clustering. [sent-803, score-0.149]

96 To examine the embedding results quantitatively, we follow the previous studies (Shaw and Jebara, 2007, 2009) and evaluate the classification performance on the embedded data by performing k-nearest neighbor classification. [sent-1011, score-0.212]
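
A typical way to run this kind of evaluation is a simple train/test split plus a k-NN classifier on the embedded points; the scikit-learn sketch below is illustrative only (the split ratio, the value of k, and the use of scikit-learn are my choices, not the paper's exact protocol).

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(V, labels, k=5, seed=0):
    """k-nearest-neighbor accuracy on an embedding; rows of V are the embedded points."""
    X_tr, X_te, y_tr, y_te = train_test_split(V, labels, test_size=0.3, random_state=seed)
    return KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
```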

97 Our goal is to apply the proposed MVE+NPKL and SPE+NPKL algorithms on these Flickr users in order to draw the 2D embedding results of the Flickr users exclusively belonging to two different interest groups: B&W and Catchy Colors, as shown in Figure 7. [sent-1206, score-0.174]

98 This may also benefit spam user detection and important user identification applications; 2) Friend suggestion: Given a Flickr user Ui , we can rank the other users U j according to their similarity to Ui computed by the learned non-parametric kernel Ki j . [sent-1235, score-0.195]

99 By applying the proposed kernel learning techniques to find similarity between Flickr users, it is possible for us to develop some recommendation scheme that suggests a Flickr user some interest groups that received the highest numbers of votes from its neighbors. [sent-1237, score-0.141]

100 We also explore some active constraint selection scheme to reduce the pairwise constraints in SimpleNPKL, which can further improve both computational efficiency and the clustering performance. [sent-1243, score-0.165]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('npkl', 0.567), ('simplenpkl', 0.549), ('cmvu', 0.238), ('tr', 0.182), ('mve', 0.158), ('spe', 0.158), ('sdp', 0.148), ('ti', 0.134), ('kernel', 0.114), ('sang', 0.11), ('embedding', 0.106), ('hoi', 0.098), ('flickr', 0.098), ('mink', 0.091), ('shaw', 0.085), ('imple', 0.079), ('oi', 0.079), ('hinge', 0.067), ('lib', 0.062), ('jebara', 0.06), ('ernel', 0.059), ('cpu', 0.056), ('kz', 0.049), ('npk', 0.049), ('pdiag', 0.049), ('iris', 0.047), ('zhuang', 0.047), ('clustering', 0.044), ('lgorithms', 0.043), ('shl', 0.043), ('photos', 0.042), ('pairwise', 0.04), ('ij', 0.038), ('kulis', 0.037), ('kti', 0.037), ('lanczos', 0.037), ('constraints', 0.036), ('huang', 0.036), ('kt', 0.035), ('users', 0.034), ('ind', 0.033), ('mvu', 0.033), ('tsang', 0.032), ('ki', 0.03), ('inde', 0.03), ('adult', 0.03), ('earning', 0.029), ('square', 0.028), ('loss', 0.028), ('laplacian', 0.027), ('similarity', 0.027), ('active', 0.026), ('kii', 0.026), ('spiral', 0.026), ('alignment', 0.026), ('colored', 0.026), ('capacity', 0.025), ('protein', 0.025), ('family', 0.025), ('mkl', 0.025), ('preserving', 0.025), ('eigenvectors', 0.025), ('catchy', 0.024), ('clp', 0.024), ('dlp', 0.024), ('ffp', 0.024), ('grn', 0.024), ('gwa', 0.024), ('interational', 0.024), ('knl', 0.024), ('phon', 0.024), ('kk', 0.024), ('lk', 0.024), ('glass', 0.023), ('wine', 0.023), ('kernels', 0.021), ('alp', 0.021), ('nat', 0.021), ('ncp', 0.021), ('target', 0.02), ('song', 0.02), ('sonar', 0.02), ('ni', 0.02), ('speedup', 0.02), ('rank', 0.02), ('ll', 0.02), ('di', 0.02), ('proposition', 0.019), ('constraint', 0.019), ('boyd', 0.018), ('alizadeh', 0.018), ('arpack', 0.018), ('chessboard', 0.018), ('eigen', 0.018), ('hl', 0.018), ('ndp', 0.018), ('tik', 0.018), ('cost', 0.018), ('neighbors', 0.018), ('adopt', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

2 0.12806416 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

3 0.069871671 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

4 0.065070771 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

5 0.051474869 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

6 0.046589345 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

7 0.045299791 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

8 0.044815172 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

9 0.043125536 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

10 0.042726509 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

11 0.040343657 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

12 0.039225958 101 jmlr-2011-Variable Sparsity Kernel Learning

13 0.033481888 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

14 0.029280463 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

15 0.028695857 48 jmlr-2011-Kernel Analysis of Deep Networks

16 0.026996307 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

17 0.026483756 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

18 0.026275197 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

19 0.025737081 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

20 0.025727738 54 jmlr-2011-Learning Latent Tree Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.156), (1, -0.02), (2, 0.087), (3, -0.18), (4, 0.05), (5, 0.033), (6, -0.027), (7, 0.01), (8, 0.019), (9, -0.094), (10, 0.003), (11, 0.041), (12, -0.013), (13, -0.05), (14, -0.004), (15, -0.074), (16, 0.176), (17, 0.022), (18, -0.087), (19, -0.127), (20, -0.021), (21, 0.088), (22, -0.126), (23, -0.004), (24, -0.056), (25, -0.115), (26, 0.118), (27, -0.044), (28, 0.037), (29, 0.201), (30, 0.26), (31, 0.115), (32, -0.167), (33, -0.017), (34, -0.165), (35, -0.078), (36, 0.028), (37, 0.067), (38, 0.027), (39, -0.138), (40, -0.013), (41, 0.002), (42, 0.035), (43, -0.014), (44, -0.061), (45, 0.015), (46, -0.019), (47, 0.203), (48, -0.042), (49, -0.133)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91485447 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

2 0.64802814 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

3 0.42834273 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

Author: Gilles Meyer, Silvère Bonnabel, Rodolphe Sepulchre

Abstract: The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixedrank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. Keywords: linear regression, positive semidefinite matrices, low-rank approximation, Riemannian geometry, gradient-based learning

4 0.34089515 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Lo` ve expansion of a Gaussian process (GP). Thus, our sparse e modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

5 0.30918878 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

6 0.30815935 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

7 0.30129513 66 jmlr-2011-Multiple Kernel Learning Algorithms

8 0.28672439 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

9 0.27620965 105 jmlr-2011-lp-Norm Multiple Kernel Learning

10 0.25631535 6 jmlr-2011-A Simpler Approach to Matrix Completion

11 0.23205452 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

12 0.21632351 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

13 0.19112366 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

14 0.18691881 12 jmlr-2011-Bayesian Co-Training

15 0.16628331 102 jmlr-2011-Waffles: A Machine Learning Toolkit

16 0.15751027 16 jmlr-2011-Clustering Algorithms for Chains

17 0.15543661 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

18 0.15092868 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

19 0.14901128 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

20 0.14895201 101 jmlr-2011-Variable Sparsity Kernel Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.048), (9, 0.05), (10, 0.041), (11, 0.015), (24, 0.043), (31, 0.079), (32, 0.024), (41, 0.021), (71, 0.019), (73, 0.081), (78, 0.069), (86, 0.012), (90, 0.021), (99, 0.356)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82284617 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks

same-paper 2 0.68940336 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

3 0.42931896 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

4 0.38515601 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

5 0.37647119 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

6 0.36825663 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

7 0.36824471 48 jmlr-2011-Kernel Analysis of Deep Networks

8 0.36487928 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

9 0.36425841 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

10 0.36315188 105 jmlr-2011-lp-Norm Multiple Kernel Learning

11 0.36279413 12 jmlr-2011-Bayesian Co-Training

12 0.36238483 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

13 0.35992333 55 jmlr-2011-Learning Multi-modal Similarity

14 0.35880375 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.35568374 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.35401025 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

17 0.35383007 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

18 0.35319498 59 jmlr-2011-Learning with Structured Sparsity

19 0.35264552 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

20 0.35097721 16 jmlr-2011-Clustering Algorithms for Chains