jmlr jmlr2011 jmlr2011-66 knowledge-graph by maker-knowledge-mining

66 jmlr-2011-Multiple Kernel Learning Algorithms


Source: pdf

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: support vector machines, kernel machines, multiple kernel learning 1. [sent-11, score-0.335]

2 Generally, a cross-validation procedure is used to choose the best performing kernel function among a set of kernel functions on a separate validation set different from the training set. [sent-26, score-0.312]

3 η parameterizes the combination function and the more common implementation is kη(xi, xj) = fη({km(xi^m, xj^m)}_{m=1}^P | η) where the parameters are used to combine a set of predefined kernels (i. [sent-29, score-0.26]

4 , we know the kernel functions and corresponding kernel parameters before training). [sent-31, score-0.312]

5 It is also possible to view this as kη(xi, xj) = fη({km(xi^m, xj^m | η)}_{m=1}^P) where the parameters integrated into the kernel functions are optimized during training. [sent-32, score-0.253]

6 In the weighted sum case, we can linearly parameterize the combination function: kη(xi, xj) = fη({km(xi^m, xj^m)}_{m=1}^P | η) = ∑_{m=1}^P ηm km(xi^m, xj^m) where η denotes the kernel weights. [sent-75, score-0.508]
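
As a concrete illustration of this linear parameterization, the following numpy sketch forms the weighted sum of precomputed kernel matrices; the data, feature subsets, weights, and the function name combine_kernels are all made up for the example.

```python
import numpy as np

def combine_kernels(kernel_matrices, eta):
    """Weighted sum k_eta = sum_m eta_m * K_m of precomputed kernel matrices."""
    kernel_matrices = np.asarray(kernel_matrices)    # shape (P, N, N)
    eta = np.asarray(eta, dtype=float)               # shape (P,)
    return np.tensordot(eta, kernel_matrices, axes=1)

# Illustrative use with P = 3 linear kernels on different feature subsets.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 9))
subsets = [slice(0, 3), slice(3, 6), slice(6, 9)]
Ks = [X[:, s] @ X[:, s].T for s in subsets]
eta = np.array([0.5, 0.3, 0.2])                      # convex weights (sum to 1)
K_eta = combine_kernels(Ks, eta)
```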

7 As can be seen, the conic sum is a special case of the linear sum and the convex sum is a special case of the conic sum. [sent-83, score-0.299]

8 First, when we have positive kernel weights, we can extract the relative importance of the combined kernels by looking at them. [sent-85, score-0.31]

9 Second, when we restrict the kernel weights to be nonnegative, this corresponds to scaling the feature spaces and using the concatenation of them as the combined feature representation: Φη(x) = [√η1 Φ1(x^1); √η2 Φ2(x^2); … ; √ηP ΦP(x^P)]. [sent-86, score-0.274]

10 The dot product in the combined feature space then gives the combined kernel: ⟨Φη(xi), Φη(xj)⟩ = ∑_{m=1}^P ηm km(xi^m, xj^m). [sent-89, score-0.303]
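
This identity is easy to verify numerically for explicit linear feature maps Φm(x^m) = x^m. The sketch below builds the concatenation of the scaled feature maps and checks that its Gram matrix equals the weighted sum of the individual kernels; the toy data and weights are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
Xs = [rng.normal(size=(N, d)) for d in (2, 3, 5)]        # one representation per kernel
eta = np.array([0.6, 0.3, 0.1])

# Left-hand side: concatenate sqrt(eta_m) * Phi_m(x^m) and take dot products.
Phi = np.hstack([np.sqrt(w) * X for w, X in zip(eta, Xs)])
K_concat = Phi @ Phi.T

# Right-hand side: weighted sum of the individual linear kernels.
K_sum = sum(w * (X @ X.T) for w, X in zip(eta, Xs))

assert np.allclose(K_concat, K_sum)
```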

11 The combination parameters can also be restricted using extra constraints, such as the ℓp-norm on the kernel weights or a trace restriction on the combined kernel matrix, in addition to their domain definitions. [sent-96, score-0.424]

12 For example, the ℓ1 -norm promotes sparsity on the kernel level, which can be interpreted as feature selection when the kernels use different feature subsets. [sent-97, score-0.325]

13 Similarity-based functions calculate a similarity metric between the combined kernel matrix and an optimum kernel matrix calculated from the training data and select the combination function parameters that maximize the similarity. [sent-106, score-0.443]

14 The similarity between two kernel matrices can be calculated using kernel alignment, Euclidean distance, Kullback-Leibler (KL) divergence, or any other similarity measure. [sent-107, score-0.392]

15 For example, the structural risk function can use the ℓ1-norm, the ℓ2-norm, or a mixed-norm on the kernel weights or feature spaces to pick the model parameters. [sent-111, score-0.255]

16 Bayesian functions measure the quality of the resulting kernel function constructed from candidate kernels using a Bayesian formulation. [sent-113, score-0.277]

17 Kernel Fisher discriminant analysis (KFDA), regularized kernel discriminant analysis (RKDA), and kernel ridge regression (KRR) are three other popular methods used in MKL. [sent-127, score-0.372]

18 Fixed rules obtain kη(·, ·) using fη(·) and then train a canonical kernel machine with the kernel matrix calculated using kη(·, ·). [sent-145, score-0.344]

19 For example, we can obtain a valid kernel by taking the summation or multiplication of two valid kernels (Cristianini and Shawe-Taylor, 2000): kη(xi, xj) = k1(xi^1, xj^1) + k2(xi^2, xj^2) and kη(xi, xj) = k1(xi^1, xj^1) k2(xi^2, xj^2). [sent-146, score-0.277]

20 For example, the summation or multiplication of P kernels is also a valid kernel: kη(xi, xj) = ∑_{m=1}^P km(xi^m, xj^m) and kη(xi, xj) = ∏_{m=1}^P km(xi^m, xj^m). [sent-150, score-0.547]
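
A minimal numpy sketch of these two fixed rules over precomputed kernel matrices; the toy linear kernels are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
Ks = [X @ X.T for X in (rng.normal(size=(5, 2)),
                        rng.normal(size=(5, 3)),
                        rng.normal(size=(5, 4)))]

K_sum = np.sum(Ks, axis=0)     # summation rule: entrywise sum of the P kernels
K_prod = np.prod(Ks, axis=0)   # multiplication rule: entrywise product of the P kernels
```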

21 (2001) report that on a gene functional classification task, training an SVM with an unweighted sum of heterogeneous kernels gives better results than the combination of multiple SVMs each trained with one of these kernels. [sent-152, score-0.237]

22 This approach can be encoded as a pairwise kernel using a kernel function between individual objects, called the genomic kernel (Ben-Hur and Noble, 2005), as follows: kP({xi^a, xj^a}, {xi^b, xj^b}) = k(xi^a, xi^b) k(xj^a, xj^b) + k(xi^a, xj^b) k(xj^a, xi^b). [sent-156, score-0.653]
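
A small sketch of this pairwise construction, assuming a kernel matrix K over individual objects has already been computed; the symmetrized form k(a1, b1) k(a2, b2) + k(a1, b2) k(a2, b1) follows the pairwise kernel cited above, and the toy inputs are made up.

```python
import numpy as np

def pairwise_kernel(K, pair_a, pair_b):
    """k_P({a1, a2}, {b1, b2}) = k(a1, b1) k(a2, b2) + k(a1, b2) k(a2, b1),
    where K is a precomputed kernel matrix over individual objects."""
    a1, a2 = pair_a
    b1, b2 = pair_b
    return K[a1, b1] * K[a2, b2] + K[a1, b2] * K[a2, b1]

# Toy example: a linear kernel over five objects, then the kernel between two pairs.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 4))
K = X @ X.T
value = pairwise_kernel(K, (0, 1), (2, 3))
```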

23 The combined pairwise kernels, obtained by summing the pairwise construction over the P kernels km(·, ·), improve the classification performance for the protein-protein interaction prediction task. [sent-158, score-0.448]

24 However, each kernel function corresponds to a different neighborhood and ηm (·, ·) is calculated on the neighborhood induced by km (·, ·). [sent-419, score-0.304]

25 We can also use a linear combination instead of a data-dependent combination and formulate the combined kernel function as follows: kη(xi, xj) = ∑_{m=1}^P ηm km(xi^m, xj^m) where we select the kernel weights by looking at the performance values obtained by each kernel separately. [sent-426, score-0.835]

26 (2002) define a notion of similarity between two kernels called kernel alignment. [sent-432, score-0.301]

27 yy⊤ can be defined as the ideal kernel for a binary classification task, and the alignment between a kernel and the ideal kernel becomes A(K, yy⊤) = ⟨K, yy⊤⟩_F / √(⟨K, K⟩_F ⟨yy⊤, yy⊤⟩_F) = ⟨K, yy⊤⟩_F / (N √⟨K, K⟩_F). [sent-435, score-0.569]

28 Qiu and Lane (2009) propose the following simple heuristic for classification problems to select the kernel weights using kernel alignment: ηm = A(Km, yy⊤) / ∑_{h=1}^P A(Kh, yy⊤) ∀m, where we obtain the combined kernel as a convex combination of the input kernels. [sent-439, score-0.615]
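
A small sketch of the alignment measure and this heuristic; it assumes binary labels in {−1, +1} and precomputed kernel matrices, and the toy linear kernels are only there to show the shapes involved.

```python
import numpy as np

def alignment(K, y):
    """A(K, yy^T) = <K, yy^T>_F / sqrt(<K, K>_F <yy^T, yy^T>_F)."""
    yyT = np.outer(y, y)
    return np.sum(K * yyT) / np.sqrt(np.sum(K * K) * np.sum(yyT * yyT))

def alignment_weights(kernel_matrices, y):
    """Heuristic kernel weights eta_m proportional to each kernel's alignment."""
    scores = np.array([alignment(K, y) for K in kernel_matrices])
    return scores / scores.sum()

# Illustrative use with two linear kernels on random representations.
rng = np.random.default_rng(3)
X1, X2 = rng.normal(size=(6, 3)), rng.normal(size=(6, 4))
y = np.sign(X1[:, 0] + X2[:, 0])
Ks = [X1 @ X1.T, X2 @ X2.T]
eta = alignment_weights(Ks, y)
K_eta = sum(w * K for w, K in zip(eta, Ks))
```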

29 (2004a) propose to optimize the kernel alignment as follows: maximize A(Kη^{tra}, yy⊤) with respect to Kη ∈ S^N subject to tr(Kη) = 1 and Kη ⪰ 0, where the trace of the combined kernel matrix is arbitrarily set to 1. [sent-442, score-0.402]

30 In a transcription initiation site detection task for bacterial genes, they obtain better results by optimizing the kernel weights of the combined kernel function that is composed of six sequence kernels, using the gradient above. [sent-447, score-0.382]

31 (2004a) restrict the kernel weights to be nonnegative and their SDP formulation reduces to the following QCQP problem: maximize ∑_{m=1}^P ηm ⟨Km^{tra}, yy⊤⟩_F with respect to η ∈ R_+^P subject to ∑_{m=1}^P ∑_{h=1}^P ηm ηh ⟨Km, Kh⟩_F ≤ 1. [sent-464, score-0.247]

32 (2010a) also restrict the kernel weights to be nonnegative by changing the definition of M in (3) to {η : ‖η‖2 = 1, η ∈ R_+^P} and obtain the following QP: minimize v⊤Mv − 2v⊤a with respect to v ∈ R_+^P (6), where the kernel weights are given by η = v / ‖v‖2. [sent-466, score-0.386]

33 (2008) choose to optimize the distance between the combined kernel matrix and the ideal kernel, instead of optimizing the kernel alignment measure, using the following optimization problem: minimize ⟨Kη − yy⊤, Kη − yy⊤⟩_F with respect to η ∈ R_+^P subject to ∑_{m=1}^P ηm = 1. [sent-469, score-0.428]

34 (2008) optimize the kernel weights for the convex combination of kernels by minimizing this measure: minimize FSM(Kη, y) with respect to η ∈ R_+^P subject to ∑_{m=1}^P ηm = 1. [sent-473, score-0.391]

35 (2009) follow an information-theoretic approach based on the KL divergence between the combined kernel matrix and the optimal kernel matrix: minimize KL(N(0, Kη) ‖ N(0, yy⊤)) with respect to η ∈ R_+^P subject to ∑_{m=1}^P ηm = 1, where 0 is the vector of zeros with proper dimension. [sent-477, score-0.369]

36 The combined kernel matrix is selected from the following set: KL = {K : K = ∑_{m=1}^P ηm Km, K ⪰ 0, tr(K) ≤ c}, where the selected kernel matrix is forced to be positive semidefinite. [sent-483, score-0.345]

37 Conforti and Guido (2010) propose another SDP formulation that removes the trace restriction on the combined kernel matrix and introduces constraints on the kernel weights for an inductive setting. [sent-491, score-0.412]

38 This optimization problem is also developed for a transductive setting, but we can simply take the number of test instances as zero and find the kernel combination weights for an inductive setting. [sent-495, score-0.261]

39 (2004b) use a QCQP formulation to integrate multiple kernel functions calculated on heterogeneous views of the genome data obtained through different experimental procedures. [sent-506, score-0.267]

40 This approach assigns near zero weights to random kernels added to the candidate set of kernels before training. [sent-511, score-0.279]

41 (b) When some of the kernels or the data sources are noisy or irrelevant, we should optimize the kernel weights. [sent-518, score-0.277]

42 (2004) propose an iterative algorithm using the kernel Fisher discriminant analysis as the base learner to combine heterogeneous kernels in a linear manner with nonnegative weights. [sent-520, score-0.362]

43 On a colorectal cancer diagnosis task, this method obtains similar results using much less computation time compared to selecting a kernel for standard kernel Fisher discriminant analysis. [sent-522, score-0.342]

44 (2004) learn the kernel combination weights by minimizing an approximation of the cross-validation error for kernel Fisher discriminant analysis. [sent-524, score-0.421]

45 They use the sigmoid function for error approximation and derive the update rules of the kernel weights. [sent-526, score-0.269]

46 Fisher discriminant analysis with the combined kernel matrix, optimized using the cross-validation error approximation, gives significantly better results than single kernels for both tasks. [sent-529, score-0.34]

47 The corresponding dual formulation is derived as the following SOCP problem: maximize ∑_{i=1}^N αi − p⊤δ with respect to α ∈ R_+^N, δ ∈ R_+^P, subject to σm − δ⊤A(:, k) ≥ (1/2) ∑_{i=1}^N ∑_{j=1}^N αi αj yi yj km(xi^m, xj^m) ∀m, ∑_{i=1}^N αi yi = 0, and C ≥ αi ≥ 0 ∀i. [sent-536, score-0.333]

48 They combine kernels with ridge regression using the ℓ2 -norm regularization over the kernel weights. [sent-543, score-0.277]

49 Both studies formulate an alternating optimization method that solves an SVM at each iteration and updates the kernel weights as follows: ηm = ‖wm‖2^{2/(p+1)} / (∑_{h=1}^P ‖wh‖2^{2p/(p+1)})^{1/p} (8), where ‖wm‖2^2 = ηm^2 ∑_{i=1}^N ∑_{j=1}^N αi αj yi yj km(xi^m, xj^m) from the duality conditions. [sent-560, score-0.573]
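
A sketch of one weight update of this alternating scheme, assuming the dual variables α for the SVM trained with the current combined kernel are given; in the full algorithm this update alternates with re-solving the SVM, and the function name is our own.

```python
import numpy as np

def update_weights(kernel_matrices, alpha, y, eta, p=1.0):
    """One eta-update of the alternating l_p-norm MKL scheme:
    ||w_m||_2^2 = eta_m^2 * sum_ij alpha_i alpha_j y_i y_j k_m(x_i, x_j),
    eta_m proportional to ||w_m||_2^(2/(p+1)), with the normalization in (8)."""
    ay = alpha * y
    w_norm = np.sqrt(np.array([eta_m ** 2 * ay @ K @ ay
                               for eta_m, K in zip(eta, kernel_matrices)]))
    numer = w_norm ** (2.0 / (p + 1.0))
    denom = np.sum(w_norm ** (2.0 * p / (p + 1.0))) ** (1.0 / p)
    return numer / denom
```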

50 We get rid of kernels whose ηm = 0 and use the kernels whose ηm = 1. [sent-570, score-0.242]

51 (2009b) define a combined kernel over the set of kernels calculated on each feature independently and perform feature selection using this definition. [sent-572, score-0.39]

52 Structural Risk Optimizing Linear Approaches with Kernel Weights on a Simplex: We can think of kernel combination as a weighted average of kernels and consider η ∈ R_+^P and ∑_{m=1}^P ηm = 1. [sent-580, score-0.319]

53 This strategy is also a multiple kernel learning approach, because the optimized parameters can be interpreted as the kernel parameters and we combine these kernel values over all features. [sent-594, score-0.491]

54 The modified primal formulation is: minimize (1/2) ∑_{m=1}^P dm ‖wm‖2^2 + C ∑_{i=1}^N ξi with respect to wm ∈ R^{Sm}, ξ ∈ R_+^N, b ∈ R, subject to yi (∑_{m=1}^P ⟨wm, Φm(xi^m)⟩ + b) ≥ 1 − ξi ∀i, where the feature space constructed using Φm(·) has the dimensionality Sm and the weight dm. [sent-604, score-0.317]

55 These allow us to use the SILP formulation to learn the kernel combination weights for hundreds of kernels on hundreds of thousands of training instances efficiently. [sent-625, score-0.386]

56 (2006) show that selecting the optimal kernel from the set of convex combinations over the candidate kernels can be formulated as a convex optimization problem. [sent-629, score-0.373]

57 (2006) for learning an optimal kernel over a convex set of candidate kernels for RKDA. [sent-634, score-0.312]

58 In order to prevent the combined kernel from overfitting, they also propose a modified mathematical model that defines lower limits for the kernel weights. [sent-641, score-0.345]

59 Hence, each kernel in the set of candidate kernels is used in the combined kernel and we obtain a more regularized solution. [sent-642, score-0.466]

60 The proposed multiclass formulation is tested on different bioinformatics applications such as bacterial protein location prediction (Zien and Ong, 2007) and protein subcellular location prediction (Zien and Ong, 2007, 2008), and outperforms individual kernels and unweighted sum of kernels. [sent-647, score-0.236]

61 Due to strong duality, one can also calculate J(η) using the dual formulation: maximize J(η) = ∑_{i=1}^N αi − (1/2) ∑_{i=1}^N ∑_{j=1}^N αi αj yi yj ∑_{m=1}^P ηm km(xi^m, xj^m) with respect to α ∈ R_+^N, subject to ∑_{i=1}^N αi yi = 0 and C ≥ αi ≥ 0 ∀i, where the inner sum ∑_{m=1}^P ηm km(xi^m, xj^m) is kη(xi, xj). [sent-655, score-0.303]
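
Because J(η) is just the canonical SVM dual objective for the combined kernel, it can be evaluated with any single-kernel SVM solver. The sketch below uses scikit-learn's precomputed-kernel interface on made-up data and recovers the dual objective from the fitted dual coefficients; it is an illustrative wrapper, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def objective_J(kernel_matrices, y, eta, C=1.0):
    """Solve the canonical SVM for the combined kernel and return
    J(eta) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j k_eta(x_i, x_j)."""
    K_eta = sum(w * K for w, K in zip(eta, kernel_matrices))
    svm = SVC(C=C, kernel="precomputed").fit(K_eta, y)
    alpha_y = np.zeros(len(y))                  # alpha_i * y_i, nonzero on support vectors
    alpha_y[svm.support_] = svm.dual_coef_.ravel()
    alpha = np.abs(alpha_y)
    return alpha.sum() - 0.5 * alpha_y @ K_eta @ alpha_y

# Toy data: two linear kernels on random representations and uniform weights.
rng = np.random.default_rng(4)
X1, X2 = rng.normal(size=(20, 3)), rng.normal(size=(20, 5))
y = np.where(X1[:, 0] + X2[:, 0] > 0, 1, -1)
J = objective_J([X1 @ X1.T, X2 @ X2.T], y, eta=[0.5, 0.5])
```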

62 The SILP formulation does not regularize the kernel weights obtained from the cutting plane method and SimpleMKL uses the gradient calculated only in the last iteration. [sent-675, score-0.255]

63 (2010a) learns a convex combination of kernels when we use the ℓ1 -norm for regularizing the kernel weights. [sent-679, score-0.354]

64 Micchelli and Pontil (2005) try to learn the optimal kernel over the convex hull of predefined basic kernels by minimizing a regularization functional. [sent-687, score-0.312]

65 (2005, 2006) build practical algorithms for learning a suboptimal kernel when the basic kernels are continuously parameterized by a compact set. [sent-690, score-0.277]

66 Instead of selecting kernels from a predefined finite set, we can increase the number of candidate kernels in an iterative manner. [sent-692, score-0.242]

67 We can basically select kernels from an uncountably infinite set constructed by considering base kernels with different kernel parameters (Özöğür-Akyüz and Weber, 2008; Gehler and Nowozin, 2008). [sent-693, score-0.398]

68 Gehler and Nowozin (2008) propose a forward selection algorithm that finds the kernel weights for a fixed-size set of candidate kernels using one of the methods described above, then adds a new kernel to the set of candidate kernels, until convergence. [sent-694, score-0.47]

69 (2010) propose another MKL method that considers the group structure among the kernels; this method assumes that every kernel group carries important information. [sent-702, score-0.277]

70 Subrahmanya and Shin (2010) generalize group-feature selection to kernel selection by introducing a log-based concave penalty term for obtaining extra sparsity; this is called sparse multiple kernel learning (SMKL). [sent-706, score-0.335]

71 (2003) propose to learn a kernel function instead of a kernel matrix. [sent-730, score-0.312]

72 They define a kernel function in the space of kernels called a hyperkernel. [sent-731, score-0.277]

73 This formulation regularizes both the hyperplane weights and the kernel combination weights. [sent-738, score-0.265]

74 The gradient with respect to the kernel weights is calculated as ∂J(η)/∂ηm = ∂r(η)/∂ηm − (1/2) ∑_{i=1}^N ∑_{j=1}^N αi αj yi yj ∂kη(xi, xj)/∂ηm ∀m. [sent-744, score-0.258]
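
A sketch of this gradient for the linear combination kη = ∑m ηm Km, for which ∂kη/∂ηm = Km; the regularizer term ∂r(η)/∂ηm is left as a user-supplied vector, and α is assumed to come from solving the SVM at the current η.

```python
import numpy as np

def kernel_weight_gradient(kernel_matrices, alpha, y, d_regularizer=None):
    """dJ/deta_m = dr/deta_m - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_m[i, j]
    for a linear combination, where dk_eta/deta_m = K_m."""
    ay = alpha * y
    data_term = np.array([-0.5 * ay @ K @ ay for K in kernel_matrices])
    if d_regularizer is None:
        return data_term
    return np.asarray(d_regularizer) + data_term
```

In reduced-gradient methods such as SimpleMKL, a step along this gradient is then projected so that the updated weights stay nonnegative and on the simplex.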

75 Varma and Babu (2009) perform gender identification experiments on a face image data set by combining kernels calculated on each individual feature; hence, they also perform feature selection, since features whose kernel weight ηm goes to 0 are eliminated. [sent-745, score-0.298]

76 We see that using kη(·, ·) as the combined kernel function is equivalent to using different scaling parameters on each feature and using an RBF kernel over these scaled features with unit radius, as done by Grandvalet and Canu (2003). [sent-748, score-0.369]

77 (2010b) develop a nonlinear kernel combination method based on KRR and polynomial combination of kernels. [sent-750, score-0.261]

78 For example, when d = 2, the combined kernel function becomes kη(xi, xj) = ∑_{m=1}^P ∑_{h=1}^P ηm ηh km(xi^m, xj^m) kh(xi^h, xj^h). [sent-768, score-0.402]
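
A numpy sketch of the d = 2 case: because the double sum is bilinear in η, it equals the elementwise square of the weighted kernel sum. The Gaussian toy kernels and weights are invented.

```python
import numpy as np

def quadratic_combination(kernel_matrices, eta):
    """k_eta = sum_m sum_h eta_m eta_h (K_m * K_h), with * the elementwise product."""
    Ks = np.asarray(kernel_matrices)            # shape (P, N, N)
    eta = np.asarray(eta, dtype=float)
    weighted = np.tensordot(eta, Ks, axes=1)    # sum_m eta_m K_m
    return weighted * weighted                  # equals sum_{m,h} eta_m eta_h K_m * K_h

# Toy example with two Gaussian kernels at different widths.
rng = np.random.default_rng(5)
X = rng.normal(size=(6, 4))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Ks = [np.exp(-sq / (2.0 * s ** 2)) for s in (0.5, 2.0)]
K_eta = quadratic_combination(Ks, [0.7, 0.3])
```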

79 (2007) follow a different approach and combine kernels using a compositional method that constructs a (P × N) × (P × N) compositional kernel matrix. [sent-774, score-0.277]

80 Gönen and Alpaydın (2008) propose a data-dependent formulation called localized multiple kernel learning (LMKL) that combines kernels using weights calculated from a gating model. [sent-780, score-0.47]

81 The softmax gating model uses kernels in a competitive manner and generally a single kernel is active for each input. [sent-785, score-0.478]

82 We may also use the sigmoid function instead of softmax and thereby allow multiple kernels to be used in a cooperative manner: ηm(x|V) = 1 / (1 + exp(−⟨vm, x^G⟩ − vm0)) ∀m. [sent-786, score-0.394]
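
A sketch of the two gating models and the resulting data-dependent combination: the softmax form mirrors the sigmoid above with a normalizing denominator, and the locally combined kernel kη(xi, xj) = ∑m ηm(xi) km(xi^m, xj^m) ηm(xj) follows the LMKL construction. The shared gating representation x^G, the parameters V and v0, and the toy kernels are illustrative assumptions.

```python
import numpy as np

def softmax_gating(XG, V, v0):
    """eta_m(x) = exp(<v_m, x_G> + v_m0) / sum_h exp(<v_h, x_G> + v_h0)  (competitive)."""
    scores = XG @ V.T + v0                       # shape (N, P)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid_gating(XG, V, v0):
    """eta_m(x) = 1 / (1 + exp(-<v_m, x_G> - v_m0))  (cooperative)."""
    return 1.0 / (1.0 + np.exp(-(XG @ V.T + v0)))

def locally_combined_kernel(kernel_matrices, eta):
    """k_eta(x_i, x_j) = sum_m eta_m(x_i) k_m(x_i, x_j) eta_m(x_j)."""
    return sum(np.outer(eta[:, m], eta[:, m]) * K
               for m, K in enumerate(kernel_matrices))

# Toy usage: two kernels gated on a shared 3-dimensional representation.
rng = np.random.default_rng(6)
XG = rng.normal(size=(8, 3))
V, v0 = rng.normal(size=(2, 3)), np.zeros(2)
Ks = [XG @ XG.T, np.exp(-((XG[:, None] - XG[None]) ** 2).sum(-1))]
K_local = locally_combined_kernel(Ks, softmax_gating(XG, V, v0))
```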

83 The combined kernel function can be written as kη(xi, xj) = ∑_{m=1}^P η_{ci}^m km(xi^m, xj^m) η_{cj}^m, where η_{ci}^m corresponds to the weight of kernel km(·, ·) in the cluster that xi belongs to. [sent-792, score-0.303]

84 The corresponding combined kernel function is kη(xi, xj) = ∑_{m=1}^P ηi^m km(xi^m, xj^m) ηj^m, where ηi^m corresponds to the weight of kernel km(·, ·) for xi, and these instance-specific weights are optimized using alternating optimization over the training set. [sent-796, score-0.508]

85 (2002) modify the decision function in order to use multiple kernels: f(x) = ∑_{i=1}^N ∑_{m=1}^P αi^m km(xi^m, x^m) + b. [sent-814, score-0.236]
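
A small sketch of evaluating this modified decision function, assuming the per-kernel coefficients αi^m have already been trained; the array layout, names, and toy numbers are our own.

```python
import numpy as np

def decision_function(kernel_columns, alpha, b):
    """f(x) = sum_i sum_m alpha[i, m] * k_m(x_i^m, x^m) + b, where kernel_columns[m]
    holds k_m(x_i^m, x^m) for all training points i at the test point x."""
    return sum(alpha[:, m] @ k_col for m, k_col in enumerate(kernel_columns)) + b

# Toy usage with two kernels, three training points, and made-up coefficients.
alpha = np.array([[0.5, 0.0], [0.0, 0.3], [0.2, 0.1]])
cols = [np.array([1.0, 0.2, 0.4]), np.array([0.9, 0.1, 0.3])]
f_x = decision_function(cols, alpha, b=-0.1)
```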

86 We use both the linear kernel and the Gaussian kernel in our experiments; we will give our results with the linear kernel first and then compare them with the results of the Gaussian kernel. [sent-825, score-0.468]

87 LMKL (softmax) uses the softmax gating model in (15), whereas LMKL (sigmoid) uses the sigmoid gating model in (16). [sent-880, score-0.334]

88 The active kernel count and the number of calls to the optimization toolbox for SVM (best) are taken as 1 and P, respectively, because it uses only one of the feature representations but needs to train the individual SVMs on all feature representations before choosing the best. [sent-890, score-0.325]

89 Similarly, the active kernel count and the number of calls to the optimization toolbox for SVM (all) are taken as P and 1, respectively, because it uses all of the feature representations but trains a single SVM. [sent-891, score-0.278]

90 ABMKL (conic) and CABMKL (conic) are the two MKL algorithms that perform kernel selection and use fewer than five kernels on the average, while the others use all six kernels, except CABMKL (linear) which uses five kernels in one of 30 folds. [sent-919, score-0.398]

91 When the number of kernels combined becomes large as in this experiment, as a result of multiplication, RBMKL (product) starts to have very small kernel values at the off-diagonal entries of the combined kernel matrix. [sent-926, score-0.499]

92 ABMKL (conic), ABMKL (convex), CABMKL (linear), CABMKL (conic), MKL, SimpleMKL, and GMKL are the seven MKL algorithms that perform kernel selection and use fewer than 10 kernels on the average, while others use all 10 kernels. [sent-929, score-0.299]

93 ABMKL (conic), ABMKL (convex), CABMKL (linear), CABMKL (conic), MKL, SimpleMKL, and GMKL are the seven MKL algorithms that perform kernel selection and use fewer than 12 kernels on the average, while others use all 12 kernels, except GLMKL ( p = 1) which uses 11 kernels in one of 30 folds. [sent-936, score-0.42]

94 All combination algorithms except ABMKL (convex) use four kernels in all folds, whereas the latter uses exactly three kernels in all folds by eliminating the STA8 representation. [sent-1395, score-0.284]

95 When we look at the number of active kernels, ABMKL (convex) selects only one kernel and this is the same kernel that SVM (best) picks. [sent-1856, score-0.334]

96 ABMKL (conic), ABMKL (convex), and CABMKL (conic) are the three MKL algorithms that perform kernel selection and use fewer than five kernels on the average, while others use all of the kernels. [sent-1870, score-0.299]

97 ABMKL (ratio), GLMKL ( p = 2), NLMKL ( p = 1), NLMKL ( p = 2), and LMKL (sigmoid) do not eliminate any of the base kernels even though we have three different kernels for each feature representation. [sent-2502, score-0.266]

98 In such a case, a good procedure for kernel combination implies a good combination of inputs from those multiple sources. [sent-2517, score-0.263]

99 We also perform 10 experiments on four real data sets with simple linear kernels and eight experiments on three real data sets with complex Gaussian kernels comparing 16 MKL algorithms in practice. [sent-2521, score-0.242]

100 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-2561, score-0.288]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nlmkl', 0.377), ('lmkl', 0.369), ('abmkl', 0.317), ('rbmkl', 0.306), ('mkl', 0.298), ('glmkl', 0.231), ('cabmkl', 0.213), ('kernel', 0.156), ('svm', 0.15), ('softmax', 0.137), ('bc', 0.132), ('conic', 0.132), ('gmkl', 0.131), ('kernels', 0.121), ('km', 0.116), ('sigmoid', 0.113), ('lpaydin', 0.108), ('onen', 0.108), ('qp', 0.107), ('xm', 0.097), ('rp', 0.081), ('ultiple', 0.073), ('simplemkl', 0.071), ('yy', 0.068), ('qcqp', 0.068), ('statistically', 0.066), ('sdp', 0.056), ('wm', 0.054), ('ernel', 0.053), ('endigits', 0.045), ('lgorithms', 0.044), ('combination', 0.042), ('gating', 0.042), ('silp', 0.041), ('cent', 0.041), ('risk', 0.038), ('qiu', 0.038), ('weights', 0.037), ('alpayd', 0.037), ('krr', 0.037), ('kloft', 0.037), ('convex', 0.035), ('alignment', 0.033), ('combined', 0.033), ('yi', 0.033), ('cortes', 0.033), ('calculated', 0.032), ('lanckriet', 0.032), ('lane', 0.032), ('imple', 0.031), ('socp', 0.031), ('representative', 0.031), ('xb', 0.031), ('formulation', 0.03), ('xa', 0.03), ('rotein', 0.03), ('discriminant', 0.03), ('learner', 0.029), ('nen', 0.029), ('kc', 0.029), ('ulti', 0.028), ('xu', 0.028), ('rakotomamonjy', 0.028), ('calls', 0.027), ('heterogeneous', 0.026), ('conforti', 0.026), ('ktra', 0.026), ('optimization', 0.026), ('earning', 0.026), ('nello', 0.026), ('varma', 0.025), ('sonnenburg', 0.025), ('unweighted', 0.025), ('ong', 0.025), ('subject', 0.024), ('feature', 0.024), ('similarity', 0.024), ('ye', 0.024), ('cantly', 0.024), ('multiple', 0.023), ('solver', 0.023), ('representations', 0.023), ('rkda', 0.022), ('stafford', 0.022), ('yves', 0.022), ('active', 0.022), ('dm', 0.022), ('fewer', 0.022), ('nonlinear', 0.021), ('girolami', 0.021), ('kh', 0.021), ('zien', 0.021), ('store', 0.02), ('protein', 0.02), ('xi', 0.02), ('bach', 0.02), ('bioinformatics', 0.02), ('boosting', 0.019), ('xp', 0.019), ('percentages', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

2 0.33254707 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

3 0.13221107 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN

4 0.099392742 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

5 0.070719987 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT

6 0.065082327 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

7 0.065070771 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

8 0.063028812 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

9 0.05336025 48 jmlr-2011-Kernel Analysis of Deep Networks

10 0.051233109 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

11 0.04353033 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

12 0.041638393 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

13 0.041061562 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

14 0.036675312 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

15 0.035649478 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

16 0.033634055 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

17 0.033296425 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

18 0.033058736 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

19 0.031495225 28 jmlr-2011-Double Updating Online Learning

20 0.031196574 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.2), (1, -0.078), (2, 0.217), (3, -0.386), (4, 0.12), (5, 0.054), (6, -0.113), (7, -0.043), (8, -0.024), (9, -0.304), (10, 0.045), (11, 0.061), (12, -0.124), (13, -0.047), (14, 0.105), (15, -0.084), (16, -0.212), (17, 0.102), (18, 0.037), (19, 0.103), (20, 0.137), (21, 0.008), (22, 0.201), (23, -0.06), (24, 0.009), (25, -0.042), (26, -0.003), (27, 0.03), (28, 0.05), (29, -0.002), (30, -0.047), (31, 0.139), (32, 0.076), (33, 0.004), (34, 0.031), (35, 0.015), (36, 0.151), (37, -0.021), (38, 0.002), (39, 0.002), (40, 0.007), (41, -0.021), (42, 0.11), (43, -0.014), (44, 0.007), (45, 0.022), (46, -0.038), (47, 0.026), (48, 0.003), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93061298 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

2 0.9052211 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

3 0.40329707 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: paper1 This presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O m2 ntot log nmax /ε2 where m is no. training data points, nmax , ntot are the maximum no. kernels in any group, total no. kernels respectively and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability 1. All authors contributed equally. The author names appear in alphabetical order. c 2011 Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath and Sankaran Raman. A FLALO , B EN -TAL , B HATTACHARYYA , NATH AND R AMAN

4 0.34665909 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

5 0.33331585 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT

6 0.32099903 55 jmlr-2011-Learning Multi-modal Similarity

7 0.27317154 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

8 0.26907781 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

9 0.26666155 48 jmlr-2011-Kernel Analysis of Deep Networks

10 0.23185849 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

11 0.22275084 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

12 0.21558006 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

13 0.20519362 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

14 0.18036522 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

15 0.17943697 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

16 0.17190821 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

17 0.15843533 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

18 0.15221967 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

19 0.14350221 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

20 0.14253576 12 jmlr-2011-Bayesian Co-Training


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.027), (9, 0.113), (10, 0.038), (24, 0.056), (31, 0.1), (32, 0.026), (41, 0.011), (71, 0.013), (73, 0.064), (78, 0.043), (86, 0.371), (90, 0.017), (99, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.79538882 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

Author: Umut Ozertem, Deniz Erdogmus

Abstract: Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method can be used with any density estimator that gives continuous first and second derivatives. Therefore, we first present our principal curve/surface definition without assuming any particular density estimation method. Afterwards, we develop practical algorithms for the commonly used kernel density estimation (KDE) and Gaussian mixture models (GMM). Results of these algorithms are presented in notional data sets as well as real applications with comparisons to other approaches in the principal curve literature. All in all, we present a novel theoretical understanding of principal curves and surfaces, practical algorithms as general purpose machine learning tools, and applications of these algorithms to several practical problems. Keywords: unsupervised learning, dimensionality reduction, principal curves, principal surfaces, subspace constrained mean-shift

same-paper 2 0.74993002 66 jmlr-2011-Multiple Kernel Learning Algorithms

Author: Mehmet Gönen, Ethem Alpaydın

Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning

3 0.43235949 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

4 0.4145346 48 jmlr-2011-Kernel Analysis of Deep Networks

Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller

Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations

5 0.40650424 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

6 0.40131703 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

7 0.38672456 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

8 0.37193015 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

9 0.35425499 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis

10 0.35351962 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

11 0.35342064 91 jmlr-2011-The Sample Complexity of Dictionary Learning

12 0.35133111 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

13 0.35120067 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

14 0.34810773 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

15 0.34775907 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

16 0.34686583 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

17 0.34521359 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

18 0.34515846 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

19 0.34446999 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

20 0.34384653 101 jmlr-2011-Variable Sparsity Kernel Learning