jmlr jmlr2011 jmlr2011-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ $\ell_1$ norm regularization for promoting sparsity within the RKHS norms of each group and $\ell_s$, $s \geq 2$, norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$, where $m$ is the number of training data points, $n_{max}$ and $n_{tot}$ are the maximum number of kernels in any group and the total number of kernels respectively, and $\epsilon$ is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order.
Reference: text
sentIndex sentText sentNum sentScore
1 The formulations assume that the given kernels are grouped and employ $\ell_1$ norm regularization for promoting sparsity within the RKHS norms of each group and $\ell_s$, $s \geq 2$, norm regularization for promoting non-sparse combinations across groups. [sent-18, score-0.532]
2 Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. [sent-19, score-0.274]
3 The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$, where $m$ is the number of [sent-21, score-0.304]
4 training data points, and $n_{max}$, $n_{tot}$ are the maximum number of kernels in any group and the total number of kernels respectively. [sent-22, score-0.277]
5 Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. [sent-28, score-0.218]
6 (2008) extended the framework of MKL to the case where kernels are partitioned into groups and introduced a generic mixed-norm (that is, $(r,s)$-norm; $r, s \geq 0$) regularization based MKL formulation (see (11) in Szafranski et al. [sent-41, score-0.275]
7 The idea is to employ an $r$-norm regularization over RKHS norms for kernels belonging to the same group and an $s$-norm regularization across groups. [sent-43, score-0.212]
8 (2008) was on applications where it is known that most of the groups of kernels are noisy/redundant and hence only those mixed-norms promoting sparsity among kernels within and across groups were employed, for example, 0 < r, s < 2 (following the terminology of Szafranski et al. [sent-45, score-0.444]
9 Needless to say, all the groups of kernels need not be “equally” important and not all kernels belonging to a group may be important. [sent-48, score-0.327]
10 Here, p = 1 is employed for promoting sparsity among kernels belonging to the same group and s ≥ 2 for promoting non-sparse combinations of kernels across groups. [sent-50, score-0.42]
11 Since, by varying the value of $s$ and the grouping of the kernels, various levels of sparsity in combining the given kernels can be achieved, the formulations studied here are henceforth called "Variable Sparsity Kernel Learning" (VSKL) formulations. [sent-52, score-0.461]
12 The VSKL formulations are motivated by multi-modal learning applications like object categorization where multiple feature representations need to be employed simultaneously for achieving good generalization. [sent-54, score-0.326]
13 For instance, in the case of flower categorization feature descriptors for shape, color and texture need to be employed in order to achieve good visual discrimination as well as significant within-class variation (Nilsback and Zisserman, 2006). [sent-55, score-0.296]
14 Combining feature descriptors for object categorization using the framework of MKL has been a topic of interest for many recent studies (Varma and Ray, 2007; Nilsback and Zisserman, 2008) and has been shown to achieve state-of-the-art performance. [sent-56, score-0.499]
15 A key finding of Nilsback and Zisserman (2006) is the following: in object categorization tasks, employing few of the feature descriptors or employing a canonical combination of them often leads to sub-optimal solutions. [sent-57, score-0.41]
16 object categorization where the kernels are grouped based on the feature descriptor generating them. [sent-63, score-0.341]
17 The $\ell_s$ ($s \geq 2$)-norm regularization leads to non-sparse combinations of kernels generated from different feature descriptors, and the $\ell_1$ norm leads to a sparse selection that discards redundant/noisy kernels generated from a given feature descriptor. [sent-64, score-0.544]
18 (2008) cannot be employed for solving the VSKL formulations (that is, with $\ell_s$, $s \geq 2$, regularization across groups) efficiently as it solves a non-convex variant of the original convex formulation! [sent-70, score-0.264]
19 Let the feature-space mapping induced by the $k$th kernel of the $j$th component be $\phi_{jk}(\cdot)$ and the corresponding Gram matrix of training data points be $K_{jk}$. [sent-110, score-1.17]
20 Consider the problem of learning a linear discriminant function of the form $f(x) = \sum_{j=1}^{n} \sum_{k=1}^{n_j} w_{jk}^\top \phi_{jk}(x) - b$. [sent-120, score-0.653]
21 Given a training set, the idea is to learn a $w \equiv [\,w_{11}^\top\ w_{12}^\top \ldots$ [sent-121, score-0.563]
22 $0 \leq r < 2$, $0 \leq s < 2$, $\Big(\sum_{j=1}^{n}\big(\sum_{k=1}^{n_j} \|w_{jk}\|_2^r\big)^{s/r}\Big)^{1/s}$. [sent-134, score-0.653]
23 (2008) is to achieve sparsity, the focus was only on the cases $0 \leq r < 2$, $0 \leq s < 2$, making most of the individual norms $\|w_{jk}\|$ zero at optimality. [sent-136, score-0.563]
24 In view of this we begin by defining $\Omega_{(p,q)}(w) = \frac{1}{2}\Big(\sum_{j=1}^{n}\big(\sum_{k=1}^{n_j} \|w_{jk}\|_2^{2p}\big)^{q/p}\Big)^{1/q}$. [sent-139, score-0.653]
25 This can be interpreted as a mixed norm operating on the $\|w_{jk}\|$: $\Omega_{(p,q)}(w) = \frac{1}{2}\|w\|_{r,s}^2$, and the following relationship holds: $r = 2p$, $s = 2q$. [sent-140, score-0.596]
26 In this paper we analyze the case $p = \frac{1}{2}$ and $q \geq 1$, which is equivalent to considering an $\ell_1$ (sparse) norm regularization within kernels of each group and an $\ell_s$, $s \geq 2$ (non-sparse), norm across groups. [sent-141, score-0.273]
27 In other words, we consider the following regularization: $\Omega(w) = \frac{1}{2}\Big(\sum_{j=1}^{n}\big(\sum_{k=1}^{n_j} \|w_{jk}\|_2\big)^{2q}\Big)^{1/q}$, where $q \geq 1$. [sent-142, score-0.653]
28 Since this formulation allows for flexibility from sparsity to non-sparsity, it is called the Variable Sparsity Kernel Learning (VSKL) formulation and denoted by VSKL$_q$, where $q \geq 1$: $\min_{w_{jk}, b, \xi_i}\ \frac{1}{2}\Big(\sum_{j=1}^{n}\big(\sum_{k=1}^{n_j} \|w_{jk}\|_2\big)^{2q}\Big)^{1/q} + C \sum_i \xi_i$ s.t. [sent-148, score-1.299]
29 $y_i\big(\sum_{j=1}^{n}\sum_{k=1}^{n_j} w_{jk}^\top \phi_{jk}(x_i) - b\big) \geq 1 - \xi_i$, $\xi_i \geq 0\ \forall i$. (1) [sent-150, score-0.586]
30 In the extreme case $q \to \infty$, the regularization term is to be written as $\frac{1}{2}\max_j \big(\sum_{k=1}^{n_j} \|w_{jk}\|_2\big)^2$. [sent-151, score-1.161]
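As a concrete illustration of the regularizer in (1), the sketch below evaluates $\Omega(w)$ from a list of per-group weight vectors; the function name, data layout and example values are our own and not from the paper.

    import numpy as np

    def vskl_regularizer(w_groups, q):
        """Evaluate Omega(w) = 0.5 * ( sum_j ( sum_k ||w_jk||_2 )^(2q) )^(1/q)."""
        group_norms = np.array([sum(np.linalg.norm(w_jk) for w_jk in group)
                                for group in w_groups])
        return 0.5 * np.sum(group_norms ** (2.0 * q)) ** (1.0 / q)

    # Example: two groups holding two and three kernel-wise weight vectors.
    w = [[np.ones(3), np.zeros(3)], [np.ones(3), 2 * np.ones(3), np.zeros(3)]]
    print(vskl_regularizer(w, q=1.0))   # 0.5 * (3 + 27) = 15.0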
31 1, the objective in (1), for any $q \geq 1$, becomes: $\frac{1}{2}\max_{\gamma \in \Delta_{n,q^*}} \sum_{j=1}^{n} \gamma_j \big(\sum_{k=1}^{n_j} \|w_{jk}\|_2\big)^2 + C \sum_i \xi_i$. (2) [sent-196, score-0.653]
32 2 (with $d = n$, $r = 1$): $\big(\sum_{i=1}^{n} \sqrt{a_i}\,\big)^2 = \min_{\lambda \in \Delta_n} \sum_{i=1}^{n} \frac{a_i}{\lambda_i}$, so (2) can be written equivalently as $\max_{\gamma \in \Delta_{n,q^*}} \min_{\lambda_j \in \Delta_{n_j}} \underbrace{\frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n_j} \frac{\gamma_j}{\lambda_{jk}} \|w_{jk}\|^2 + C \sum_i \xi_i}_{f(w,\lambda,\gamma,\xi)}$. The equivalent primal formulation we arrive at is finally [sent-199, score-1.367]
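The variational identity above attains its minimum at $\lambda_i \propto \sqrt{a_i}$; the following small numeric check (illustrative values, not from the paper) confirms both the equality and that any other point on the simplex does worse.

    import numpy as np

    a = np.array([0.5, 2.0, 4.5])            # arbitrary non-negative numbers
    lhs = np.sum(np.sqrt(a)) ** 2            # (sum_i sqrt(a_i))^2

    lam = np.sqrt(a) / np.sum(np.sqrt(a))    # candidate minimizer on the simplex
    rhs = np.sum(a / lam)                    # sum_i a_i / lambda_i at that point

    other = np.full(3, 1.0 / 3.0)            # any other feasible lambda
    assert np.isclose(lhs, rhs)
    assert np.sum(a / other) >= rhs - 1e-12
    print(lhs, rhs, np.sum(a / other))       # 18.0 18.0 21.0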
33 Problem (P): $\max_{\gamma \in \Delta_{n,q^*}}\ \min_{\lambda_j \in \Delta_{n_j}}\ \min_{\xi_i, b, w_{jk}}\ f(w,\lambda,\gamma,\xi)$ s.t. [sent-200, score-0.607]
34 $y_i\big(\sum_{j=1}^{n} \sum_{k=1}^{n_j} w_{jk}^\top \phi_{jk}(x_i) - b\big) \geq 1 - \xi_i\ \forall i$, $\xi_i \geq 0\ \forall i$. [sent-202, score-1.239]
35 (3) (4) Note that at optimality the following relations hold: $\lambda_{jk} = 0 \Rightarrow w_{jk} = 0$; if $q = \infty$, then $\gamma_j = 0 \Leftrightarrow w_{jk} = 0\ \forall k$. [sent-203, score-1.689]
36 In case $q = \infty$, $w_{jk} = 0\ \forall k \Rightarrow \gamma_j = 0$, unless $w_{jk} = 0\ \forall j,k$, which is an uninteresting case. [sent-204, score-1.126]
37 Hence, by the Sion-Kakutani minmax theorem (Sion, 1958), the max-min can be interchanged, and when this is done, problem (P) becomes $\min_{\xi_i, b, w_{jk}}\ \min_{\lambda \in \times_j \Delta_{n_j}}\ \max_{\gamma \in \Delta_{n,q^*}} f(w,\lambda,\gamma,\xi)$, s.t. [sent-207, score-0.634]
38 (3), (4), or similarly $\min_{\lambda \in \times_j \Delta_{n_j}}\ \min_{\xi_i, b, w_{jk}}\ \max_{\gamma \in \Delta_{n,q^*}} f(w,\lambda,\gamma,\xi)$, s.t. [sent-209, score-0.607]
39 (6) Replacing the convex problem in the curly brackets in (6) by its dual, the following theorem is immediate: Theorem 2. [sent-217, score-0.651]
40 3 Let $Q_{jk}$ be the $m \times m$ matrix with $(Q_{jk})_{ih} = y_h y_i\, \phi_{jk}(x_i)^\top \phi_{jk}(x_h)$, $i, h = 1, \ldots$ [sent-218, score-2.275]
41 the variables $(w, b, \xi)$ is the following:4 Problem (D): $\min_{\lambda_j \in \Delta_{n_j}}\ \max_{\alpha \in S_m,\ \gamma \in \Delta_{n,q^*}}\ \sum_i \alpha_i - \frac{1}{2} \alpha^\top \Big(\sum_{j=1}^{n} \sum_{k=1}^{n_j} \frac{\lambda_{jk} Q_{jk}}{\gamma_j}\Big) \alpha$, where $S_m = \{\alpha \in \mathbb{R}^m \mid \sum_{i=1}^{m} \alpha_i y_i = 0,\ 0 \leq \alpha_i \leq C,\ i = 1, \ldots\}$. [sent-225, score-1.261]
42 The relation between the primal and dual variables is given by $\frac{\gamma_j\, w_{jk}}{\lambda_{jk}} = \sum_{i=1}^{m} \alpha_i y_i \phi_{jk}(x_i)$. [sent-229, score-1.763]
43 Let $K_{jk}$ be positive kernel functions defined over the same input space $\mathcal{X}$. [sent-237, score-0.607]
44 Each $K_{jk}$ defines a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_{jk}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_{jk}}$. [sent-238, score-1.126]
45 An element $h \in \mathcal{H}_{jk}$ has the norm $\|h\|_{\mathcal{H}_{jk}} = \sqrt{\langle h, h \rangle_{\mathcal{H}_{jk}}}$. [sent-241, score-1.722]
46 Now for any non-negative $\lambda_{jk}$, define a new Hilbert space $\mathcal{H}'_{jk} = \{h \mid h \in \mathcal{H}_{jk},\ \tfrac{\|h\|_{\mathcal{H}_{jk}}}{\lambda_{jk}} < \infty\}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}'_{jk}} = \tfrac{1}{\lambda_{jk}} \langle \cdot, \cdot \rangle_{\mathcal{H}_{jk}}$. [sent-242, score-2.815]
47 We use the convention that if $\lambda_{jk} = 0$ then the only member of $\mathcal{H}'_{jk}$ is $h = 0$. [sent-247, score-1.689]
48 It is easy to see that $\mathcal{H}'_{jk}$ is an RKHS with kernel $\lambda_{jk} K_{jk}$ (see Rakotomamonjy et al. [sent-248, score-1.733]
49 A direct sum of such RKHSs, $\mathcal{H}_j = \bigoplus_k \mathcal{H}'_{jk}$, is also an RKHS with kernel $K_j = \sum_k \lambda_{jk} K_{jk}$. [sent-250, score-1.733]
50 Again, the (suitably scaled) $\mathcal{H}_j$ are RKHSs with kernels $\frac{1}{\gamma_j} \sum_k \lambda_{jk} K_{jk}$, and their direct sum is in turn an RKHS $\mathcal{H}$ with kernel $K = \sum_{j=1}^{n} \frac{1}{\gamma_j} \sum_{k=1}^{n_j} \lambda_{jk} K_{jk}$. [sent-256, score-2.386]
51 With this functional framework in mind, we now let $w_{jk}$ be an element of $\mathcal{H}_{jk}$ with the norm $\|w_{jk}\|_{\mathcal{H}_{jk}} = \sqrt{\langle w_{jk}, w_{jk} \rangle_{\mathcal{H}_{jk}}}$, and let $w \in \mathcal{H}$ where $\mathcal{H}$ is as defined above. [sent-257, score-3.974]
52 $\max_{\gamma \in \Delta_{n,q^*}}\ \min_{\lambda_j \in \Delta_{n_j}}\ \min_{\xi_i, b, w_{jk} \in \mathcal{H}_{jk}} f(w,\lambda,\gamma,\xi)$ (7) s.t. [sent-260, score-1.17]
53 $y_i(\langle w, x_i \rangle_{\mathcal{H}} - b) \geq 1 - \xi_i$, $\xi_i \geq 0$, where $f(w,\lambda,\gamma,\xi) = \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n_j} \frac{\gamma_j}{\lambda_{jk}} \|w_{jk}\|^2_{\mathcal{H}_{jk}} + C \sum_i \xi_i$. [sent-262, score-1.712]
54 4 Let $Q_{jk}$ be the $m \times m$ matrix with $(Q_{jk})_{ih} = y_h y_i K_{jk}(x_i, x_h)$, $i, h = 1, \ldots$ [sent-265, score-1.712]
55 The dual problem of (7) with respect to $\{w, b, \xi\}$ is the following optimization problem: (D) $\min_{\lambda_j \in \Delta_{n_j}} \underbrace{\max_{\alpha \in S_m,\ \gamma \in \Delta_{n,q^*}} \underbrace{\mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top \Big(\sum_{j=1}^{n} \sum_{k=1}^{n_j} \frac{\lambda_{jk} Q_{jk}}{\gamma_j}\Big) \alpha}_{f_\lambda(\alpha,\gamma)}}_{G(\lambda)}$, where $S_m = \{\alpha \in \mathbb{R}^m \mid 0 \leq \alpha \leq C,\ y^\top \alpha = 0\}$. [sent-269, score-1.289]
56 The dual (D) problem provides more insight into the formulation: $\lambda_{jk}$ can be viewed as a weight given to the kernel $K_{jk}$, and $\frac{1}{\gamma_j}$ can be thought of as an additional weight factor for the entire $j$th group/descriptor. [sent-275, score-1.221]
57 Since $\lambda_j \in \Delta_{n_j}$ (that is, the $\lambda_j$ are $\ell_1$ regularized), most of the $\lambda_{jk}$ will be zero at optimality, and since $\gamma \in \Delta_{n,q^*}$, it amounts to combining kernels across descriptors in a non-trivial (and, in case $q^* \geq 2$, non-sparse) fashion. [sent-276, score-0.254]
58 Indeed, this is in sync with the findings of Nilsback and Zisserman (2006): kernels from different feature descriptors (components) are combined using non-trivial weights (that is, $\frac{1}{\gamma_j}$); moreover, only the "best" kernels from each feature descriptor (component) are employed by the model. [sent-277, score-0.54]
59 Note that in case the optimal weights $(\lambda, \gamma)$ are known/fixed, the problem is equivalent to solving an SVM with an effective kernel $K_{eff} \equiv \sum_{j=1}^{n} \frac{1}{\gamma_j} \sum_{k=1}^{n_j} \lambda_{jk} K_{jk}$. [sent-279, score-1.258]
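For concreteness, the effective kernel is just a weighted combination of the base Gram matrices; the sketch below (our own helper names and array layout, not the paper's code) builds it from per-group kernel lists.

    import numpy as np

    def effective_kernel(K_groups, lam_groups, gamma):
        """K_eff = sum_j (1/gamma_j) * sum_k lam_jk * K_jk.

        K_groups:   list over groups; each entry is a list of (m x m) Gram matrices K_jk.
        lam_groups: list over groups; each entry is a 1-D array lam_j on the simplex.
        gamma:      1-D array of positive group weights gamma_j.
        """
        m = K_groups[0][0].shape[0]
        K_eff = np.zeros((m, m))
        for Ks, lam_j, g_j in zip(K_groups, lam_groups, gamma):
            K_eff += sum(l * K for l, K in zip(lam_j, Ks)) / g_j
        return K_eff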
60 Algorithm for Solving the Dual Problem This section presents the mirror descent based algorithm for efficiently solving the dual (D). [sent-282, score-0.247]
61 $G(\lambda_1, \ldots, \lambda_n) = \max_{\gamma \in \Delta_{n,q^*},\ \alpha \in S_m}\ \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top \Big(\sum_{j=1}^{n} \sum_k \frac{\lambda_{jk} Q_{jk}}{\gamma_j}\Big) \alpha$. [sent-296, score-1.126]
62 If $\alpha^*, \gamma^*$ represent the variables maximizing $f_\lambda$ for a given $\lambda$, then the $jk$th component of the sub-gradient $G'(\lambda)$ is $-\frac{1}{2}\, \frac{\alpha^{*\top} Q_{jk}\, \alpha^*}{\gamma^*_j}$. [sent-321, score-0.563]
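Given oracle outputs, the sub-gradient is cheap to assemble. A hedged sketch, assuming the per-kernel matrices $Q_{jk}$ are precomputed and using the form of $G'(\lambda)$ as reconstructed above (the helper name is ours):

    import numpy as np

    def subgradient_G(Q_groups, alpha_star, gamma_star):
        """g[j][k] = -0.5 * alpha*^T Q_jk alpha* / gamma*_j, the jk-th component of G'(lambda)."""
        return [np.array([-0.5 * alpha_star @ Q @ alpha_star / gamma_star[j] for Q in Qs])
                for j, Qs in enumerate(Q_groups)]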
63 1 If there exist scalars $0 < \tau < 1$, $\mu > 0$ such that all eigenvalues of each $Q_{jk}$ matrix lie within the interval $(\tau\mu, \mu)$, then the function $G$ given by $G(\lambda_1, \cdots, \lambda_n) = \max_{\alpha \in S_m,\ \gamma \in \Delta_{n,q^*}}\ \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top \Big(\sum_{j=1}^{n} \frac{\sum_{k=1}^{n_j} \lambda_{jk} Q_{jk}}{\gamma_j}\Big) \alpha$ is convex and Lipschitz continuous w. [sent-350, score-1.816]
64 2 Let $\Phi_j(\lambda_j) = \sum_{k=1}^{n_j} \lambda_{jk} \ln(\lambda_{jk})$, $\lambda_j \in \Delta_{n_j}\ \forall j = 1, \ldots$ [sent-356, score-1.216]
65 The function $\Phi(\lambda) = \sum_{j=1}^{n} \Phi_j(\lambda_j) = \sum_{j=1}^{n} \sum_{k=1}^{n_j} \lambda_{jk} \ln(\lambda_{jk})$ is strongly convex (with parameter $\sigma$) with respect to the $\ell_1$ norm. [sent-360, score-1.163]
66 The corresponding distance generating function is given by $B_\Phi(\lambda^*, \lambda^1) = \sum_{j=1}^{n} \sum_{k=1}^{n_j} \lambda^*_{jk} \ln \frac{\lambda^*_{jk}}{\lambda^1_{jk}}$. [sent-361, score-1.779]
67 If one chooses $\lambda^1_{jk} = \frac{1}{n_j}$ then one can obtain an estimate of $\Gamma(\lambda^1)$ as follows: $B_\Phi(\lambda^*, \lambda^1) \leq \sum_{j=1}^{n} \log n_j \leq n \log n_{max}$, where $n_{max} = \max_j n_j$. [sent-375, score-0.883]
68 The first inequality follows from the fact that $\sum_k \lambda_{jk} \log \lambda_{jk} \leq 0\ \forall \lambda \in \times_j \Delta_{n_j}$, and the second inequality follows from the definition of $n_{max}$. [sent-376, score-1.286]
69 1) now writes as $\frac{\sqrt{2 \cdot \frac{1}{n} \cdot n \log n_{max}}}{L_G \sqrt{t}} = \frac{\sqrt{2 \log n_{max}}}{L_G \sqrt{t}}$, where $L_G$ is the Lipschitz constant of $G$. [sent-379, score-0.32]
70 A more pragmatic choice could be $s_t = A\, \frac{\sqrt{\Gamma(\lambda^1)\,\sigma}}{\|\nabla_\lambda G(\lambda^t)\|_\infty \sqrt{t}} = A\, \frac{\sqrt{\log n_{max}}}{\|\nabla_\lambda G(\lambda^t)\|_\infty \sqrt{t}}$, where $A$ is a constant. [sent-381, score-0.219]
71 Owing to the clever choice of prox-function, the projection step in our case is very easy to calculate and has an analytical expression given by $\nabla\Phi(\lambda)_{jk} = \ln(\lambda_{jk}) + 1$ and $\big(\nabla\Phi^*(\tilde\lambda)\big)_{jk} = \frac{e^{\tilde\lambda_{jk}}}{\sum_{k'=1}^{n_j} e^{\tilde\lambda_{jk'}}}$. [sent-389, score-2.283]
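In code, the descent direction plus this analytical projection reduces to a per-group softmax. A minimal sketch of one mirror-descent update (function and variable names are our own, assuming the sub-gradient helper above):

    import numpy as np

    def mirror_descent_step(lam_groups, grad_groups, step):
        """One entropic MD update: tilde = grad(Phi)(lam) - s_t * G'(lam), then per-group softmax."""
        new_lam = []
        for lam_j, g_j in zip(lam_groups, grad_groups):
            tilde = np.log(lam_j) + 1.0 - step * g_j      # descent direction in the dual space
            tilde -= tilde.max()                          # stabilize the exponentials
            w = np.exp(tilde)
            new_lam.append(w / w.sum())                   # analytical projection back onto the simplex
        return new_lam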
72 iterations is $O(\log n_{max} / \epsilon^2)$ and $n_{max} = n_{tot}$, where $n_{tot}$ is the total number of kernels. [sent-401, score-0.579]
73 Assuming the SVM problem can be solved in $O(m^2)$ time, we have the following complexity bound in case $n = 1$: $O(m^2 n_{tot} \log n_{tot} / \epsilon^2)$. [sent-403, score-0.234]
74 Also, in the case $q = 1$, the optimal value of $\gamma_j$ is 1 for all $j$, and hence maximizing $f$ again corresponds to solving an SVM with the effective kernel as the canonical (equal-weight) sum of all the active kernels in each group. [sent-404, score-0.228]
75 Again, in this case, the overall complexity is $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$. [sent-405, score-0.277]
76 3 Computing the Oracle The joint maximization in $(\alpha, \gamma)$ of $f_\lambda$ in the case $q = \infty$ can be posed as a Quadratically Constrained Quadratic Program (QCQP): $\max_{\alpha \in S_m,\ \gamma \in \Delta_n} f_\lambda(\gamma, \alpha) = \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top \Big(\sum_{j=1}^{n} \frac{\sum_{k=1}^{n_j} \lambda_{jk} Q_{jk}}{\gamma_j}\Big) \alpha = \max_{\alpha \in S_m,\ \gamma \in \Delta_n,\ v}\ \mathbf{1}^\top \alpha - \sum_{j=1}^{n} v_j$ s.t. [sent-408, score-1.216]
77 $2 \gamma_j v_j \geq \alpha^\top \Big(\sum_{k=1}^{n_j} \lambda_{jk} Q_{jk}\Big) \alpha\ \forall j$. (13) [sent-410, score-1.126]
78 Using the identity $2 \gamma_j v_j = \frac{1}{2}(\gamma_j + v_j)^2 - \frac{1}{2}(\gamma_j - v_j)^2$, the constraint in problem (13) becomes $\alpha^\top \Big(\sum_k \lambda_{jk} Q_{jk}\Big) \alpha + \frac{1}{2}(\gamma_j - v_j)^2 \leq \frac{1}{2}(\gamma_j + v_j)^2$, and consequently problem (13) is a conic quadratic (CQ) problem. [sent-411, score-1.126]
79 If $q = 1$ (that is, $q^* = \infty$), optimality is achieved at $\gamma_j = 1$ iff $D_j > 0$, where $D_j = \sum_{k=1}^{n_j} \lambda_{jk}\, \alpha^\top Q_{jk}\, \alpha$. [sent-427, score-1.126]
80 Proof Recall that $\max_{\alpha \in S_m,\ \gamma \in \Delta_{n,q^*}} f_\lambda(\alpha, \gamma) = \max_{\alpha \in S_m}\ \alpha^\top e - \frac{1}{2} \min_{\gamma \in \Delta_{n,q^*}} \sum_{j=1}^{n} \frac{D_j}{\gamma_j}$, where $D_j = \sum_{k=1}^{n_j} \lambda_{jk}\, \alpha^\top Q_{jk}\, \alpha$. [sent-428, score-1.148]
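For finite $q^*$ the inner minimization over $\gamma$ has a closed form: the KKT conditions of $\min\{\sum_j D_j/\gamma_j : \gamma \geq 0,\ \|\gamma\|_{q^*} \leq 1\}$ give $\gamma_j \propto D_j^{1/(q^*+1)}$. This is our own derivation (consistent with the sub-gradient expressions quoted later in this dump), sketched and brute-force checked below:

    import numpy as np

    def optimal_gamma(D, q_star):
        """Minimizer of sum_j D_j / gamma_j over gamma >= 0 with ||gamma||_{q*} <= 1."""
        D = np.asarray(D, dtype=float)
        if np.isinf(q_star):                       # q = 1 case: gamma_j = 1 wherever D_j > 0
            return (D > 0).astype(float)
        num = D ** (1.0 / (q_star + 1.0))
        return num / np.sum(D ** (q_star / (q_star + 1.0))) ** (1.0 / q_star)

    # Brute-force sanity check for n = 2 and q* = 2.
    D = np.array([1.0, 4.0])
    g = optimal_gamma(D, 2.0)
    grid = np.linspace(1e-3, 1.0, 400)
    best = min(D[0] / a + D[1] / b for a in grid for b in grid if a**2 + b**2 <= 1.0)
    assert np.sum(D / g) <= best + 1e-6
    print(g, np.sum(D / g), best)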
81 Proof We begin by arguing that $f_\lambda$ is bounded when the $Q_{jk}$ are p. [sent-453, score-0.563]
82 γ Then, ∀t ∈ (0, 1) g(t) ≡ ˙ Bj dg = B0 + ∑ = 0, ˜ γ1 dt j (t + 2 j 1 )2 ˜ γ γ −˜ j where Bj = 1 3 ˜j ˜j γ2 − γ1 ˜j ˜ ˜j ˜ γ2 α1 − γ1 α2 T j ˜j ˜ ˜j ˜ Q j γ2 α1 − γ1 α2 , nj Qj = ∑ λ jk Q jk , k=1 and n Qj 1 ˜ ˜ ˜ ˜ ˜ ˜ (α1 − α2 ). [sent-465, score-1.216]
83 $(\alpha^*, \gamma^*) \leftarrow \arg\max_{\alpha \in S_m,\ \gamma \in \Delta_{n,q^*}} f_\lambda(\alpha, \gamma)$; $\ \tilde\lambda^{t+1}_{jk} \leftarrow \big(\nabla\Phi(\lambda^t) - s_t\, G'(\lambda^t)\big)_{jk} = \ln(\lambda^t_{jk}) + 1 + s_t\, \frac{\alpha^{*\top} Q_{jk}\, \alpha^*}{2\,\gamma^*_j}$ (Descent Direction); $\ \lambda^{t+1}_{jk} \leftarrow \big(\nabla\Phi^*(\tilde\lambda^{t+1})\big)_{jk} = e^{\tilde\lambda^{t+1}_{jk}} \big/ \sum_{k=1}^{n_j} e^{\tilde\lambda^{t+1}_{jk}}$ (Projection step); until convergence. The algorithm converges to the optimal of (D) for arbitrary $q \geq 1$. [sent-505, score-3.023]
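Putting the pieces together, one pass of the method alternates an oracle call (here an SVM trained on the current effective kernel, shown for the simplest q = 1 case where $\gamma_j = 1$) with the entropic mirror-descent step. The outline below is our own assembly of the steps quoted above, using scikit-learn's SVC as a stand-in oracle; it is illustrative, not the authors' implementation.

    import numpy as np
    from sklearn.svm import SVC

    def mirror_vskl_q1(K_groups, y, C=1.0, T=50, A=1.0):
        """Simplified mirrorVSKL outline for q = 1; K_groups[j][k] is the Gram matrix K_jk,
        y holds labels in {-1, +1}."""
        m = len(y)
        n_j = [len(Ks) for Ks in K_groups]
        lam = [np.full(nj, 1.0 / nj) for nj in n_j]            # uniform start on each simplex
        Q_groups = [[np.outer(y, y) * K for K in Ks] for Ks in K_groups]

        for t in range(1, T + 1):
            # Oracle: SVM with the current effective kernel (gamma_j = 1 when q = 1).
            K_eff = sum(l * K for Ks, lam_j in zip(K_groups, lam) for l, K in zip(lam_j, Ks))
            svc = SVC(C=C, kernel='precomputed').fit(K_eff, y)
            alpha = np.zeros(m)
            alpha[svc.support_] = np.abs(svc.dual_coef_[0])    # recover 0 <= alpha_i <= C

            # Sub-gradient of G and the pragmatic step size from the text.
            grad = [np.array([-0.5 * alpha @ Q @ alpha for Q in Qs]) for Qs in Q_groups]
            g_inf = max(np.max(np.abs(g)) for g in grad) + 1e-12
            s_t = A * np.sqrt(np.log(max(n_j))) / (g_inf * np.sqrt(t))

            # Entropic mirror-descent update with the per-group softmax projection.
            new_lam = []
            for lam_j, g_j in zip(lam, grad):
                tilde = np.log(lam_j) + 1.0 - s_t * g_j
                tilde -= tilde.max()
                w = np.exp(tilde)
                new_lam.append(w / w.sum())
            lam = new_lam
        return lam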
84 With this assumption, even in the general case ($n > 1$, $q > 1$), the computational complexity of mirrorVSKL remains $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$. [sent-508, score-0.277]
85 Numerical Experiments This section presents results of simulations which demonstrate the suitability of employing the proposed VSKL formulations for multi-modal tasks like object categorization. [sent-514, score-0.273]
86 The experimental results summarized in this section aim at demonstrating the suitability of employing the proposed VSKL formulations for tasks like object categorization. [sent-517, score-0.244]
87 As mentioned previously, it was observed in the literature (see Nilsback and Zisserman, 2006) that employing feature values obtained from various descriptors simultaneously is beneficial for object categorization. [sent-544, score-0.257]
88 The state-of-the-art performance on these data sets is achieved by a methodology which generates kernels using each of the feature descriptors and then chooses the best among them using the framework of MKL (Varma and Ray, 2007; Nilsback and Zisserman, 2008). [sent-577, score-0.279]
89 For the Caltech-101, Caltech-256 and Oxford flowers data sets we have used 15, 25, 60 images per object category as training images and 15, 15, 20 images per object category as testing images respectively. [sent-586, score-0.324]
90 Note that the CKL formulations were not previously applied to object categorization, and we wish to compare them here with VSKL in order to stress the need for solving (1) for the cases $q \geq 1$. [sent-596, score-0.314]
91 Secondly, the number of iterations in solving the formulation is nearly independent of the number of kernels in the case of the proposed MD based algorithm. [sent-700, score-0.264]
92 Hence the number of iterations required by the BCA algorithm can be assumed to be a constant and the computational complexity bound $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$ is indeed valid. [sent-710, score-0.302]
93 Conclusions This paper makes two important contributions to the MKL literature: a) a specific mixed-norm regularization based MKL formulation which is well-suited for object categorization and other multi-modal tasks is studied. [sent-712, score-0.271]
94 Empirical results show that the new formulation achieves far better generalization than state-of-the-art object categorization techniques. [sent-716, score-0.236]
95 Define $D_{jk} = \alpha^{*\top} Q_{jk}\, \alpha^*$, where $\alpha^*$ and $\gamma^*$ denote the optimal values that maximize $f_\lambda(\alpha, \gamma)$ for a given $\lambda$. [sent-746, score-1.126]
96 From the definition of $\tau$ and $\mu$ we immediately have the following bound: $\tau\mu \|\alpha^*\|^2 \leq D_{jk} \leq \mu \|\alpha^*\|^2$. [sent-747, score-0.563]
97 The strategy would be to exploit the above limits on $D_{jk}$ to bound the norm of the sub-gradient. [sent-749, score-0.563]
98 3) and then examine the sub-gradient. Case $q > 1$: $\frac{\partial G}{\partial \lambda_{jk}} = -\frac{1}{2}\, D_{jk}\, \frac{\Big(\sum_{j'} \big(\sum_{k'} \lambda_{j'k'} D_{j'k'}\big)^{q^*/(q^*+1)}\Big)^{1/q^*}}{\big(\sum_{k'} \lambda_{jk'} D_{jk'}\big)^{1/(q^*+1)}}$ if $\sum_{k'} \lambda_{jk'} D_{jk'} > 0$, and $0$ otherwise. [sent-751, score-1.126]
99 Case $q = 1$: $\frac{\partial G}{\partial \lambda_{jk}} = -\frac{1}{2} D_{jk}$ if $\sum_{k'} \lambda_{jk'} D_{jk'} > 0$, and $0$ otherwise. [sent-752, score-1.126]
100 From these equations, it is easy to see that $\Big|\frac{\partial G}{\partial \lambda_{jk}}\Big| \leq \frac{1}{2} \big(\tfrac{n}{\tau}\big)^{1/q^*} \big(\mu \|\alpha^*\|_2^2\big)^{\frac{q^*(q^*+1)-1}{q^*(q^*+1)}}$. [sent-753, score-0.563]
wordName wordTfidf (topN-words)
[('jk', 0.563), ('mirrorvskl', 0.319), ('vskl', 0.298), ('mkl', 0.186), ('nmax', 0.16), ('simplemkl', 0.149), ('aman', 0.149), ('flalo', 0.149), ('hattacharyya', 0.149), ('kernels', 0.142), ('nilsback', 0.138), ('nath', 0.135), ('md', 0.124), ('ntot', 0.117), ('descriptors', 0.112), ('categorization', 0.107), ('bca', 0.106), ('szafranski', 0.106), ('sm', 0.095), ('formulations', 0.091), ('parsity', 0.091), ('nj', 0.09), ('zisserman', 0.089), ('mirror', 0.077), ('object', 0.074), ('svm', 0.074), ('owers', 0.072), ('ernel', 0.07), ('descriptor', 0.065), ('xt', 0.062), ('st', 0.059), ('bj', 0.057), ('formulation', 0.055), ('hessianmkl', 0.054), ('dual', 0.051), ('calls', 0.049), ('lg', 0.049), ('en', 0.049), ('descent', 0.048), ('employing', 0.046), ('henceforth', 0.045), ('kernel', 0.044), ('groups', 0.043), ('bangalore', 0.043), ('vsklq', 0.043), ('solving', 0.042), ('sparsity', 0.041), ('varma', 0.039), ('rkhs', 0.037), ('convex', 0.037), ('caltech', 0.036), ('oxford', 0.036), ('regularization', 0.035), ('earning', 0.034), ('norm', 0.033), ('promoting', 0.033), ('wrapper', 0.033), ('suitability', 0.033), ('hemivariate', 0.032), ('iisc', 0.032), ('testset', 0.032), ('wpbc', 0.032), ('block', 0.031), ('projection', 0.031), ('category', 0.03), ('ls', 0.03), ('berg', 0.03), ('lipschitz', 0.03), ('beck', 0.029), ('presents', 0.029), ('employed', 0.029), ('images', 0.029), ('rakotomamonjy', 0.028), ('technion', 0.028), ('minmax', 0.027), ('ckl', 0.027), ('simplices', 0.027), ('liver', 0.027), ('vision', 0.027), ('interior', 0.027), ('grouped', 0.026), ('oracle', 0.026), ('accuracies', 0.026), ('concave', 0.026), ('ai', 0.026), ('feature', 0.025), ('teboulle', 0.025), ('ray', 0.025), ('indian', 0.025), ('iterations', 0.025), ('blake', 0.024), ('passing', 0.024), ('sonnenburg', 0.024), ('visual', 0.023), ('yi', 0.023), ('scalability', 0.023), ('min', 0.022), ('ionosphere', 0.022), ('ower', 0.022), ('bhattacharyya', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ $\ell_1$ norm regularization for promoting sparsity within the RKHS norms of each group and $\ell_s$, $s \geq 2$, norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$, where $m$ is the number of training data points, $n_{max}$ and $n_{tot}$ are the maximum number of kernels in any group and the total number of kernels respectively, and $\epsilon$ is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order.
2 0.28550395 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.21372634 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
4 0.13221107 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
5 0.11518816 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
6 0.1104015 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
7 0.094713986 54 jmlr-2011-Learning Latent Tree Graphical Models
8 0.077464037 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
9 0.070056267 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
10 0.059160285 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.056734324 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
12 0.054497652 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.052602787 14 jmlr-2011-Better Algorithms for Benign Bandits
14 0.048634503 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
15 0.047149595 59 jmlr-2011-Learning with Structured Sparsity
16 0.045216743 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
17 0.040256854 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.039225958 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
19 0.037118051 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
20 0.036488529 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
topicId topicWeight
[(0, 0.255), (1, -0.024), (2, 0.229), (3, -0.357), (4, 0.147), (5, 0.139), (6, -0.027), (7, 0.084), (8, 0.259), (9, 0.024), (10, 0.213), (11, -0.121), (12, -0.09), (13, 0.089), (14, -0.024), (15, 0.132), (16, -0.084), (17, -0.031), (18, -0.074), (19, 0.04), (20, 0.052), (21, -0.146), (22, -0.005), (23, -0.096), (24, 0.038), (25, 0.047), (26, 0.032), (27, 0.088), (28, -0.023), (29, -0.089), (30, 0.082), (31, -0.195), (32, 0.108), (33, 0.047), (34, 0.043), (35, -0.065), (36, -0.047), (37, -0.03), (38, 0.025), (39, 0.043), (40, -0.025), (41, 0.024), (42, -0.069), (43, -0.041), (44, 0.03), (45, 0.049), (46, 0.03), (47, -0.008), (48, 0.047), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96869403 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ $\ell_1$ norm regularization for promoting sparsity within the RKHS norms of each group and $\ell_s$, $s \geq 2$, norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$, where $m$ is the number of training data points, $n_{max}$ and $n_{tot}$ are the maximum number of kernels in any group and the total number of kernels respectively, and $\epsilon$ is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order.
2 0.70283544 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.50805455 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
4 0.50060183 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding
Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction
5 0.42576194 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
6 0.35716054 55 jmlr-2011-Learning Multi-modal Similarity
7 0.29738405 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
8 0.28889561 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.23539041 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
10 0.21415938 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
11 0.21392708 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
12 0.20236084 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
13 0.19716747 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
14 0.18644069 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.18386589 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
16 0.17690448 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
17 0.16932164 59 jmlr-2011-Learning with Structured Sparsity
18 0.16919884 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
19 0.16872057 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
20 0.161961 28 jmlr-2011-Double Updating Online Learning
topicId topicWeight
[(4, 0.033), (9, 0.071), (10, 0.022), (24, 0.034), (31, 0.58), (32, 0.02), (41, 0.016), (71, 0.026), (73, 0.029), (78, 0.063), (90, 0.01)]
simIndex simValue paperId paperTitle
1 0.98733628 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
same-paper 2 0.97712708 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ $\ell_1$ norm regularization for promoting sparsity within the RKHS norms of each group and $\ell_s$, $s \geq 2$, norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of $O(m^2 n_{tot} \log n_{max} / \epsilon^2)$, where $m$ is the number of training data points, $n_{max}$ and $n_{tot}$ are the maximum number of kernels in any group and the total number of kernels respectively, and $\epsilon$ is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order.
3 0.97183001 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
Author: David M. Blei, Peter I. Frazier
Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings. We study its empirical performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data. We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. Keywords: Chinese restaurant processes, Bayesian nonparametrics
4 0.96912456 95 jmlr-2011-Training SVMs Without Offset
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out, that the new algorithm needs significantly less iterations and also runs substantially faster than standard training algorithms for SVMs with offset. Keywords: support vector machines, decomposition algorithms
5 0.9400878 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
6 0.78619796 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
7 0.7550953 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
8 0.72520202 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.72435856 12 jmlr-2011-Bayesian Co-Training
11 0.72341585 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.72166675 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
13 0.71891403 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
14 0.70411259 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.70356119 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
16 0.70149344 36 jmlr-2011-Generalized TD Learning
17 0.69915104 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
18 0.69531292 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
19 0.69138122 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
20 0.69089943 16 jmlr-2011-Clustering Algorithms for Chains