jmlr jmlr2010 jmlr2010-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features, and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It is proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem, which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of the QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the cosine kernels, instead of the support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant kernels, capacity control, support vector machines, classification, semi-infinite programming
Reference: text
sentIndex sentText sentNum sentScore
1 Faculty of Mathematics and Computer Science, Amirkabir University of Technology, Tehran 15914, Iran. Editor: John Shawe-Taylor. Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. [sent-7, score-0.479]
2 In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. [sent-8, score-0.905]
3 We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. [sent-10, score-0.779]
4 It will be proven that the optimal kernel is a finite mixture of cosine functions. [sent-14, score-0.409]
5 The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. [sent-15, score-0.315]
6 Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. [sent-16, score-0.59]
7 We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. [sent-17, score-0.407]
8 As a by-product, we also generalize the kernel trick to complex-valued kernel functions. [sent-19, score-0.49]
9 Keywords: kernel learning, translation invariant kernels, capacity control, support vector machines, classification, semi-infinite programming 1. [sent-21, score-0.628]
10 The earliest method for learning a kernel function is cross-validation which is very slow and is only applicable to kernels with a small number of parameters. [sent-29, score-0.576]
11 Glasmachers and Igel (2005) proposed a gradient-based method for learning the covariance matrix of Gaussian kernels (Note that since the covariance matrix of a Gaussian kernel is constrained to be positive semi-definite, the method of Chapelle et al. [sent-35, score-0.576]
12 A milestone in the kernel learning literature is the introduction of the multiple kernel learning (MKL) framework by Lanckriet et al. [sent-40, score-0.45]
13 They considered the problem of finding the optimal convex combination of multiple kernels and formulated it as a quadratically constrained quadratic programming (QCQP) problem. [sent-42, score-0.424]
14 In their seminal work, Micchelli and Pontil (2005) generalized the class of admissible kernels to convex combinations of an infinite number of kernels indexed by a compact set and applied their method to the problem of learning radial kernels (Argyriou et al. [sent-59, score-1.223]
15 They used a classical result proved by Schoenberg (1938) which states that every continuous radial kernel belongs to the convex hull of radial Gaussian kernels. [sent-61, score-0.369]
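For reference, the Schoenberg representation invoked here can be stated as follows (a standard form of the classical result, restated for convenience rather than quoted from the paper):
```latex
% Schoenberg (1938): a continuous kernel that is radial and positive
% semi-definite on R^n for every n admits the representation
k(x, z) \;=\; \int_{0}^{\infty} e^{-\omega \|x - z\|^{2}} \, dp(\omega),
\qquad p \ \text{a finite nonnegative Borel measure on } [0, \infty),
% i.e., the kernel lies in the closed convex hull of radial Gaussian kernels.
```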
16 They also proposed an efficient DC programming algorithm for numerically learning radial kernels in Argyriou et al. [sent-62, score-0.454]
17 In this work, we consider the class of translation invariant kernel functions which encompasses the class of radial kernels as well as the class of anisotropic Gaussian kernel functions. [sent-66, score-1.218]
18 This class contains exactly those kernels which can be defined solely based on the difference of the kernel arguments; that is, the kernel functions with the property k(x, z) = k̃(x − z). [sent-67, score-0.801]
19 The general form of continuous translation invariant kernels on Rn was discovered by Bochner (1933). [sent-68, score-0.68]
20 He also proved that, conversely, every continuous translation invariant positive semi-definite kernel function can be represented in the above form. [sent-71, score-0.554]
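The "above form" referred to here is Bochner's representation; it can be written out as follows (standard statements of the theorem, restated for convenience rather than quoted from the paper):
```latex
% Bochner (1933): a continuous function \tilde{k} on R^n is positive
% semi-definite iff it is the Fourier transform of a finite nonnegative
% Borel measure p on R^n,
\tilde{k}(x - z) \;=\; \int_{\mathbb{R}^{n}} e^{\, j\, \gamma^{T} (x - z)} \, dp(\gamma),
% and, in the real-valued form mentioned later in the text (Bochner, 1955),
\tilde{k}(x - z) \;=\; \int_{\mathbb{R}^{n}} \cos\!\big(\gamma^{T} (x - z)\big) \, dp(\gamma).
```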
21 In statistics, the translation invariance property is referred to as the stationarity of the kernel function. [sent-72, score-0.422]
22 Although the kernel learning formulation of Micchelli and Pontil (2005) contains a regularization term for controlling the complexity of the RKHS associated with the kernel function, there is no mechanism for controlling the capacity of the class of admissible kernels. [sent-76, score-0.647]
23 In our formulation, we have provisioned a mechanism for controlling the complexity of the class of admissible kernels which is described in Section 2. [sent-77, score-0.469]
24 The problem of finding an optimal kernel which minimizes this criterion over the class of translation invariant kernels leads to (4) which is our main variational problem. [sent-83, score-0.933]
25 In this paper we will represent translation invariant kernels both as k : Rd × Rd → C with two arguments and as k̃ : Rd → C with only one argument. [sent-88, score-0.68]
26 In addition, it will be shown that the optimal kernel is a finite mixture of the basic kernels of the form kγ (x, z) = exp( jγT (x − z)). [sent-91, score-0.62]
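To make the rank-two property mentioned in the abstract concrete, the sketch below (illustrative numpy code, not taken from the paper) builds the real part of the basic kernel kγ(x, z) = exp(jγ^T(x − z)), namely cos(γ^T(x − z)), and checks numerically that its Gram matrix has rank two, since cos(γ^T x − γ^T z) = cos(γ^T x) cos(γ^T z) + sin(γ^T x) sin(γ^T z).
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                                # 50 points in R^5
X = rng.standard_normal((n, d))             # data matrix, one row per point
gamma = rng.standard_normal(d)              # frequency vector of the basic kernel

# Gram matrix of the cosine kernel k(x, z) = cos(gamma^T (x - z))
proj = X @ gamma                            # gamma^T x_i for all i
G = np.cos(proj[:, None] - proj[None, :])   # n x n Gram matrix

# Rank-two factorization G = u u^T + v v^T with u_i = cos(gamma^T x_i),
# v_i = sin(gamma^T x_i), by the cosine difference identity.
u, v = np.cos(proj), np.sin(proj)
print(np.linalg.matrix_rank(G))                         # 2
print(np.allclose(G, np.outer(u, u) + np.outer(v, v)))  # True
```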
27 5, using a topological argument, the issue of including other classes of kernels in the learning process will be addressed. [sent-97, score-0.351]
28 It is well known that the regularization parameter of the 2-norm SVM, usually denoted by τ, can be regarded as the weight of a Kronecker delta kernel function, which is, incidentally, a translation invariant kernel as well. [sent-98, score-0.554]
29 4 we introduce the semi-infinite programming problem (19) which is our main numerical optimization problem and corresponds to the simultaneous learning of the optimal translation invariant kernel as well as the parameter τ. [sent-100, score-0.599]
30 7 we will introduce a method for learning the best combination of stabilized translation invariant kernels and isotropic Gaussian kernels. [sent-103, score-0.794]
31 An important feature of the proposed optimization algorithm is that it does not require loading the kernel matrices into memory and so it is applicable to large-scale problems. [sent-106, score-0.288]
32 Since algorithms are usually designed for real Euclidean spaces, this complicates the application of the kernel trick to the optimal kernel (consider for example an algorithm that checks the sign of a dot product). [sent-109, score-0.49]
33 In Section 6, we show that the feature space induced by the real part of a complex-valued kernel is essentially equivalent to the original complex-valued kernel and deduce that the optimal real-valued kernel is a mixture of cosines. [sent-110, score-0.748]
34 Usually, multiple kernel learning methods yield a model whose evaluation time is in the order of the number of kernels times the number of support vectors. [sent-112, score-0.576]
35 In Section 7, we show that the evaluation time of the optimal translation invariant kernel is proportional to the number of cosine kernels, regardless of the number of support vectors. [sent-113, score-0.694]
36 In Section 8, using a learning theory discussion, we show the necessity of controlling the complexity of the class of translation invariant kernels. [sent-114, score-0.379]
37 To obtain the best tradeoff between approximation and estimation errors, we introduce a hierarchy of classes of translation invariant kernels. [sent-131, score-0.329]
38 It must be mentioned that the idea of controlling the complexity of the class of admissible kernels has been previously used by Ong et al. [sent-133, score-0.437]
39 To restrict the class of translation invariant kernels, we propose to limit the high frequency components of the kernel function. [sent-136, score-0.554]
40 In addition, to ensure that these classes of kernels are nested, we require that Gβ1 (r) ≥ Gβ2 (r) for every r > 0 and β1 ≤ β2 . [sent-140, score-0.351]
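Purely as an illustration of this monotonicity requirement (a hypothetical choice; the paper's actual stabilizer Gβ is not reproduced in this extract), a Gaussian-shaped stabilizer satisfies it:
```latex
% Hypothetical example: G_\beta(r) = e^{-\beta r^{2}}.  For \beta_1 \le \beta_2
% and every r > 0,
G_{\beta_1}(r) \;=\; e^{-\beta_1 r^{2}} \;\ge\; e^{-\beta_2 r^{2}} \;=\; G_{\beta_2}(r),
% so the induced sub-classes are nested and larger \beta penalizes
% high-frequency components of the kernel more strongly.
```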
41 Kernel Selection Criterion In the process of learning the kernel function, one needs a criterion for choosing a kernel (or equivalently, feature space) from the class of admissible kernels. [sent-146, score-0.543]
42 For translation invariant kernels we have R^2 = ‖Φ(x)‖^2 = k(x, x) = k̃(0). [sent-168, score-0.68]
43 Proof: For each α ∈ R^l we have max_{γ∈R^n} α^T H(γ)α = (1 − τ) max_{γ∈R^n} |∑_{u=1}^{l} α_u y_u exp(jγ^T x_u)|^2 G_β(‖γ‖) + τ α^T α. [sent-262, score-0.303]
44 5 Including Other Kernels in the Learning Process Although the focus of this paper is on the task of learning translation invariant kernels, it is easy to furnish the set of admissible kernels with other kernel functions. [sent-306, score-0.941]
45 For example, one may want to find the best convex combination of stabilized translation invariant kernels along with isotropic/nonisotropic Gaussian kernels, and polynomial kernels with degrees one to five. [sent-307, score-1.117]
46 6 In general, assume that we have M classes of kernels Ki := kγ (x, z) : γ ∈ Γi i = 1, . [sent-308, score-0.351]
47 The problem of learning the best convex combination of kernels from these classes for classification with support vector machines can be stated as sup_{p∈P(Γ0)} min_{α∈A} ∫_{Γ0} α^T H0(γ)α dp(γ) (16) where Γ0 := Γ1 ∪ · · · ∪ ΓM. [sent-319, score-0.467]
48 Here H0(γ) := Hi(γ) if γ ∈ Γi, for i = 1, . . . , M. The results of the previous sections will hold for this combined class of kernels if we prove that Γ0 is a compact Hausdorff space and that h0(γ, α) := α^T H0(γ)α is continuous with respect to γ. [sent-326, score-0.399]
49 Although the class of translation invariant kernels includes the set of isotropic/non-isotropic Gaussian kernels, this is not the case for the stabilized class Kβ. [sent-346, score-0.738]
50 So, the problem of learning the parameter τ is equivalent to choosing the best convex combination of the set of translation invariant kernels augmented with the delta kernel δ(x, z). [sent-458, score-0.933]
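Concretely, on the training sample this augmentation acts as a diagonal ridge (a standard identity; the symbol Kτ is introduced here only for illustration):
```latex
% Convex combination of a translation invariant kernel k (weight 1-\tau)
% and the Kronecker delta (weight \tau), evaluated on distinct training points:
(K_\tau)_{uv} \;=\; (1-\tau)\, k(x_u, x_v) \;+\; \tau\, \delta(x_u, x_v)
\quad\Longrightarrow\quad K_\tau \;=\; (1-\tau)\, K \;+\; \tau\, I ,
% which is exactly the diagonal regularization added by the 2-norm soft-margin SVM.
```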
51 7 Furnishing the Class of Admissible Kernels with Isotropic Gaussian Kernels Although the class of translation invariant kernels encompasses the class of Gaussian kernels with arbitrary covariance matrices, the stabilized class of translation invariant kernels Kβ does not. [sent-481, score-1.799]
52 In this section, we consider learning the best convex combination of kernels of the classes Kβ and the stabilized class of isotropic Gaussian kernels Kη := { k(x, z) = ∫_R e^{−ω‖x−z‖^2} e^{−η ω^2} dp(ω) : p is a probability measure on R }, where η > 0. [sent-483, score-0.998]
53 By using Theorem 12, it follows that the expansion of the optimal kernel along with the weight of each kernel can be obtained by solving the SIP problem Pτ(Γβ, Ωη), where Pτ(Γ, Ω) is defined as: min_{α,t} t s.t. [sent-487, score-0.45]
54 Let us first write the distance between the means of the two classes in the feature space of some kernel k′: ‖m1 − m2‖^2 = ‖(1/l1) ∑_{u∈C+} Φ(xu) − (1/l2) ∑_{v∈C−} Φ(xv)‖^2 = (1/l1^2) ∑_{u∈C+} ∑_{v∈C+} k′(xu, xv) − (2/(l1 l2)) ∑_{u∈C+} ∑_{v∈C−} k′(xu, xv) + (1/l2^2) ∑_{u∈C−} ∑_{v∈C−} k′(xu, xv). [sent-506, score-0.491]
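A small generic sketch of this identity (illustrative code, not from the paper; the sign of the cross term follows the usual expansion of the squared norm):
```python
import numpy as np

def mean_distance_sq(K, y):
    """Squared distance ||m1 - m2||^2 between the two class means in the
    feature space whose Gram matrix on the training points is K,
    for labels y in {+1, -1}."""
    pos, neg = (y == +1), (y == -1)
    l1, l2 = pos.sum(), neg.sum()
    return (K[np.ix_(pos, pos)].sum() / l1**2
            - 2.0 * K[np.ix_(pos, neg)].sum() / (l1 * l2)
            + K[np.ix_(neg, neg)].sum() / l2**2)
```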
55 terminate the algorithm with the kernel k(x, z) := ∑_{j=1}^{m} µ_j cos((x − z)^T γ_j) {we assume that Γ^(i) = {γ_1^(i), . [sent-537, score-0.337]
56 . . . , yl sin(γ^T xl), where ℜ{z} and ℑ{z} are the real and imaginary parts of z, respectively, show that the kernel matrices G(γ1), . [sent-583, score-0.289]
57 Now, there is no need to load the kernel matrices into memory and so general-purpose QCQP solvers such as Mosek (Andersen and Andersen, 2000) can be used to solve (21) even when the training set size is huge. [sent-599, score-0.303]
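A minimal sketch of why no kernel matrix needs to be formed (illustrative numpy code, under the assumption that the sub-problem accesses the labeled cosine kernel matrix only through quadratic forms; all names are ours, not the paper's):
```python
import numpy as np

def quad_form_cosine(alpha, y, X, gamma):
    """Evaluate alpha^T G(gamma) alpha, where
    G(gamma)_{uv} = y_u y_v cos(gamma^T (x_u - x_v)),
    in O(l n) time and O(l) memory without forming the l x l matrix."""
    proj = X @ gamma                 # gamma^T x_u for every training point
    a = alpha * y
    c = a @ np.cos(proj)             # sum_u alpha_u y_u cos(gamma^T x_u)
    s = a @ np.sin(proj)             # sum_u alpha_u y_u sin(gamma^T x_u)
    return c**2 + s**2               # rank-two identity for the cosine kernel
```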
58 Although the traditional algorithms for solving quadratic programming problems, such as active set methods, are fast, they need to store the kernel matrix in memory, which prevents their application to large-scale problems. [sent-672, score-0.304]
59 Again, since the effective rank of the kernels G(γ1 ), . [sent-674, score-0.351]
60 cj = ∑_{u=1}^{nsv} α̃u yu cos(γ_j^T xu) and sj = ∑_{u=1}^{nsv} α̃u yu sin(γ_j^T xu), for j = 1, . . . , m, [sent-709, score-0.515]
61 c′j = − ∑_{u=1}^{nsv} α̃u yu xu sin(γ_j^T xu) and s′j = ∑_{u=1}^{nsv} α̃u yu xu cos(γ_j^T xu), for j = 1, . . . , m. [sent-715, score-0.377] [sent-722, score-0.876]
63 For complex-valued kernels, the dot-product of any two vectors may be complex-valued which makes the application of these kernels to machine learning algorithms more tricky. [sent-805, score-0.351]
64 i∈I i∈I After fully developing the paper based on the complex-valued form of translation invariant kernels, one of the reviewers introduced us to the real-valued form of these kernels as was discovered by Bochner (1955). [sent-819, score-0.68]
65 He proved that every continuous real-valued translation invariant positive definite kernel in Rn has the general form 13. [sent-820, score-0.554]
66 The fact that the real part of a complex kernel is a real kernel is not new (see Schölkopf and Smola, 2002, page 31). [sent-821, score-0.45]
67 It is interesting that after applying the appropriate kernel trick to both real-valued and complexvalued forms of translation invariant kernels, the optimal kernel is found to be a mixture of cosines. [sent-824, score-0.863]
68 But ∑_{u=1}^{nsv} αu yu cos(γ_i^T xu) and ∑_{u=1}^{nsv} αu yu sin(γ_i^T xu) are constant values. [sent-834, score-0.446]
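A brief sketch of the resulting O(m × n) evaluation (illustrative code; the bias term b, any (1 − τ) scaling, and all variable names are assumptions rather than the paper's notation):
```python
import numpy as np

def precompute_constants(alpha, y, X_sv, Gammas):
    """Per-kernel constants from the support vectors; Gammas holds one
    frequency vector gamma_j per row."""
    P = X_sv @ Gammas.T              # (n_sv, m) matrix of gamma_j^T x_u
    a = alpha * y
    c = a @ np.cos(P)                # c_j = sum_u alpha_u y_u cos(gamma_j^T x_u)
    s = a @ np.sin(P)                # s_j = sum_u alpha_u y_u sin(gamma_j^T x_u)
    return c, s

def decision_value(x, mu, Gammas, c, s, b=0.0):
    """f(x) = sum_j mu_j sum_u alpha_u y_u cos(gamma_j^T (x_u - x)) + b,
    evaluated in O(m n) regardless of the number of support vectors."""
    p = Gammas @ x                   # gamma_j^T x for every cosine kernel
    return float(mu @ (c * np.cos(p) + s * np.sin(p))) + b
```
Once the constants cj and sj are stored, the per-example cost is dominated by the single matrix-vector product Gammas @ x, which matches the run-time advantage claimed for the optimal classifier.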
69 In addition, by Theorem 2, the number of kernels is limited to l + 1. [sent-837, score-0.351]
70 Furthermore, since the deletion of non-support-vector samples from the training set has no effect on the optimal classifier, it follows that m ≤ nsv + 1. [sent-838, score-0.336]
71 Although, theoretically, the number of kernels can reach the number of support vectors, our experiments show that the number of kernels is usually a fraction of the number of support vectors. [sent-839, score-0.702]
72 (2005b) used this feature along with a result from Yiming and Zhou (2007) to obtain a probably approximately correct (PAC) upper bound on the generalization error of their kernel learning framework over the class of radial kernel functions. [sent-843, score-0.537]
73 But, the situation for translation invariant kernels is completely different. [sent-845, score-0.68]
74 So, controlling the complexity of the class of translation invariant kernels is a necessary ingredient of our framework. [sent-849, score-0.73]
75 The variance parameter σ of the isotropic Gaussian kernel and the parameter C of the 1-norm soft-margin SVM are optimized by performing 5-fold cross-validation on the first five instances of the training set. [sent-876, score-0.325]
76 In fact, the classes of radial/translation invariant kernel functions contain kernels with arbitrary positive values for k(x, x). [sent-878, score-0.708]
77 The number of Gaussian kernels m, their parameters, and the parameter τ are learnt automatically. [sent-881, score-0.379]
78 The number of cosine kernels m, their parameters, and the parameter τ are learnt automatically. [sent-886, score-0.519]
79 The number of cosine kernels m, the number of Gaussian kernels m̄, the parameters of the cosine and Gaussian kernels, and the parameter τ are learnt automatically. [sent-890, score-1.01]
80 The only benefit of the GM over SG is that while the latter requires specifying the kernel function by hand, GM learns the kernel function automatically. [sent-901, score-0.45]
81 So, it seems that the Gaussian kernel is inherently much more suitable than the cosine kernel for the Ringnorm data set. [sent-912, score-0.365]
82 In fact, this is exactly why combining several kernels is important. [sent-913, score-0.351]
83 [Table: per-data-set results on Solar, German, Heart, Image, Ringnorm, Splice, Thyroid, Titanic, Twonorm, and Waveform, reporting testing time (ms), the number of support vectors nsv, the number of cosine kernels m, and the number of Gaussian kernels m̄ for the Single Gaussian, Gaussian Mixture, and Cosine & Gaussian Mixture classifiers.] [sent-1084, score-0.292]
86 After two hours of training, the algorithm produced a model with 244 cosine kernels and 790 support vectors. [sent-1241, score-0.491]
87 3 Experiments on the MNIST Data Set While many algorithms for kernel learning consider the combination of a finite number of kernels, learning translation invariant kernels corresponds to combining an infinite number of kernels. [sent-1249, score-0.905]
88 They considered the problem of finding an optimal kernel over the whole class of radial kernels which is equivalent to the problem of learning the best convex combination of Gaussian kernels with isotropic covariance matrices. [sent-1254, score-1.069]
89 4 Assessing the Effect of the Proposed Complexity Control Mechanism In Section 8 we provided theoretical support for the necessity of controlling the complexity of the class of translation invariant kernels. [sent-1291, score-0.379]
90 This experiment confirms the usefulness of controlling the complexity of the class of translation invariant kernels, as was claimed in Section 8. [sent-1328, score-0.379]
91 Conclusions In this paper we addressed the problem of learning a translation invariant kernel function for the task of binary classification with SVM. [sent-1330, score-0.554]
92 We proposed a mechanism for controlling the complexity of the class of translation invariant kernels which was found to be very useful in practice. [sent-1331, score-0.762]
93 We have also shown how other classes of kernels can be included in the learning process. [sent-1336, score-0.351]
94 Since the optimal translation invariant kernel is complex-valued, we then introduced a method for applying the kernel trick to complex-valued kernels. [sent-1338, score-0.819]
95 This revealed that the optimal translation invariant kernel is a mixture of cosine kernels. [sent-1339, score-0.738]
96 While an ordinary MKL algorithm with m kernels requires O (m × nsv × n) steps for computing the classifier, the optimal classifier of the proposed method can be computed in O (m × n) steps. [sent-1341, score-0.643]
97 First, we intend to generalize the proposed kernel learning method from binary classification to other learning problems, including regression, multiclass classification, clustering, and kernel PCA. [sent-1343, score-0.45]
98 Figure 3: Comparison between the parameters β and C for controlling the capacity of the class of translation invariant kernels on the Heart data set. [sent-1348, score-0.759]
99 Our long-term plan is to investigate the use of other low-rank kernels and to make this approach a competitive, widely used technology. [sent-1354, score-0.351]
100 Classes of kernels for machine learning: A statistics perspective. [sent-1458, score-0.351]
wordName wordTfidf (topN-words)
[('kernels', 0.351), ('nsv', 0.292), ('kernel', 0.225), ('afabakhsh', 0.213), ('hamsi', 0.213), ('hiasi', 0.213), ('hirazi', 0.213), ('ranslation', 0.202), ('translation', 0.197), ('nvariant', 0.173), ('xu', 0.154), ('cosine', 0.14), ('qcqp', 0.134), ('ernels', 0.134), ('rn', 0.134), ('invariant', 0.132), ('cos', 0.112), ('micchelli', 0.104), ('cm', 0.099), ('lassification', 0.097), ('sip', 0.09), ('rl', 0.086), ('mkl', 0.086), ('xv', 0.079), ('andersen', 0.079), ('reemtsen', 0.079), ('dc', 0.078), ('rakotomamonjy', 0.078), ('argyriou', 0.074), ('lanckriet', 0.074), ('sin', 0.07), ('yu', 0.069), ('earning', 0.069), ('pontil', 0.066), ('tsch', 0.064), ('sg', 0.064), ('yv', 0.063), ('ringnorm', 0.06), ('radial', 0.058), ('stabilized', 0.058), ('cgm', 0.056), ('cic', 0.056), ('silp', 0.056), ('isotropic', 0.056), ('min', 0.055), ('controlling', 0.05), ('rner', 0.048), ('tm', 0.048), ('twonorm', 0.048), ('compact', 0.048), ('gaussian', 0.047), ('chapelle', 0.047), ('programming', 0.045), ('gehler', 0.045), ('kortanek', 0.045), ('xire', 0.045), ('zre', 0.045), ('oi', 0.045), ('gm', 0.045), ('bfgs', 0.044), ('training', 0.044), ('mixture', 0.044), ('hettich', 0.043), ('vs', 0.043), ('integration', 0.043), ('sec', 0.042), ('svm', 0.042), ('max', 0.04), ('trick', 0.04), ('lagrange', 0.04), ('mosek', 0.038), ('usps', 0.038), ('heart', 0.038), ('imaginary', 0.037), ('mnist', 0.036), ('admissible', 0.036), ('memory', 0.034), ('hi', 0.033), ('nocedal', 0.033), ('sup', 0.033), ('theorem', 0.032), ('mechanism', 0.032), ('bi', 0.031), ('dp', 0.03), ('multipliers', 0.03), ('digit', 0.03), ('encompasses', 0.03), ('capacity', 0.029), ('cristianini', 0.029), ('feature', 0.029), ('crisp', 0.029), ('stabilizer', 0.029), ('whilst', 0.029), ('slater', 0.029), ('topology', 0.028), ('criterion', 0.028), ('convex', 0.028), ('learnt', 0.028), ('open', 0.027), ('xl', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features, and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It is proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem, which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of the QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the cosine kernels, instead of the support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant kernels, capacity control, support vector machines, classification, semi-infinite programming
2 0.19742571 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
3 0.1751657 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random a walk (G¨ rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah´ et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time e complexity of kernel computation between unlabeled graphs with n vertices from O(n6 ) to O(n3 ). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3 ) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4 ) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2 ) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment o kernel of Fr¨ hlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
4 0.16057082 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
5 0.10011916 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
6 0.08762233 84 jmlr-2010-On Spectral Learning
7 0.080128863 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
8 0.075651243 40 jmlr-2010-Fast and Scalable Local Kernel Machines
9 0.070100315 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
10 0.068368465 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
11 0.068319529 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
12 0.064356081 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
13 0.061994188 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
14 0.059769504 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
15 0.058215879 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
16 0.053335462 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
17 0.053173479 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.051185716 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
19 0.050215069 27 jmlr-2010-Consistent Nonparametric Tests of Independence
20 0.04907332 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
topicId topicWeight
[(0, -0.261), (1, -0.083), (2, 0.019), (3, 0.159), (4, -0.038), (5, 0.071), (6, -0.378), (7, -0.194), (8, -0.056), (9, -0.1), (10, 0.008), (11, -0.139), (12, 0.051), (13, -0.102), (14, -0.033), (15, 0.061), (16, 0.083), (17, -0.018), (18, 0.11), (19, 0.035), (20, -0.136), (21, -0.051), (22, -0.139), (23, -0.02), (24, 0.014), (25, 0.062), (26, -0.114), (27, 0.016), (28, 0.003), (29, -0.061), (30, 0.148), (31, -0.032), (32, -0.035), (33, 0.002), (34, 0.019), (35, -0.003), (36, -0.037), (37, -0.01), (38, 0.01), (39, 0.002), (40, 0.007), (41, -0.143), (42, 0.063), (43, -0.038), (44, -0.028), (45, -0.023), (46, -0.002), (47, -0.049), (48, -0.038), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.94095045 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features, and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It is proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem, which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of the QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the cosine kernels, instead of the support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant kernels, capacity control, support vector machines, classification, semi-infinite programming
2 0.84270072 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
3 0.67897677 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
4 0.61456555 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random a walk (G¨ rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah´ et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time e complexity of kernel computation between unlabeled graphs with n vertices from O(n6 ) to O(n3 ). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3 ) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4 ) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2 ) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment o kernel of Fr¨ hlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
5 0.54788655 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is , implemented in C++ and interfaces to MATLABTM R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
6 0.42173302 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
7 0.34370086 40 jmlr-2010-Fast and Scalable Local Kernel Machines
8 0.33721638 84 jmlr-2010-On Spectral Learning
9 0.31459025 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
10 0.30871817 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
11 0.2921176 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
12 0.29077736 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
13 0.26757932 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
14 0.26556662 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
15 0.26296845 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.26249772 27 jmlr-2010-Consistent Nonparametric Tests of Independence
18 0.24878262 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
19 0.24550544 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
20 0.24387814 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
topicId topicWeight
[(3, 0.024), (8, 0.013), (21, 0.014), (32, 0.04), (36, 0.02), (37, 0.04), (75, 0.687), (81, 0.011), (85, 0.043), (96, 0.015)]
simIndex simValue paperId paperTitle
1 0.99790663 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees 1. Overview The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features ξ(Y |X = x) = f (x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f (x) = ∑ p f j (x). Each model component f j (x) might take only j=1 c 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner. H OTHORN , B ÜHLMANN , K NEIB , S CHMID AND H OFNER a subset of the features into account. Special cases are linear models f (x) = x⊤ β, additive models f (x) = ∑ p f j (x( j) ), where f j is a function of the jth feature x( j) only (smooth functions or j=1 stumps, for example) or a more complex function where f (x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., limited number of boosting iterations). The model is then fitted using n p ( fˆ1 , . . . , fˆp ) = argmin ∑ wi ρ yi , ∑ f j (x) . ( f1 ,..., f p ) i=1 j=1 Here (yi , xi ), i = 1, . . . , n, are n training samples with responses yi and potentially high-dimensional feature vectors xi , and wi are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). 
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007). As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses, that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,1 quantile1 and expectile1 regression, censored regression (including Cox, Weibull1 , log-logistic1 or lognormal1 models) as well as Poisson and negative binomial regression1 for count data can be performed. Because the structure of the regression function f (x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009). 2. Design and Implementation The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f j , and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust and regression tests have been set-up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available. 1. Model family is new in version 2.0 or was added after the release of mboost 1.0. 2110 M ODEL - BASED B OOSTING 2.0 Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.2 3. User Interface by Example We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. 
Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference) and iv) using a robust median regression approach instead of L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function are defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:3 R> library(
same-paper 2 0.99615222 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features, and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It is proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem, which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of the QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the cosine kernels, instead of the support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant kernels, capacity control, support vector machines, classification, semi-infinite programming
3 0.99521017 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
4 0.99442071 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuoustime Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
5 0.98879069 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
6 0.98822361 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.93075818 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
8 0.93000263 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
9 0.92279422 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
10 0.92056489 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.90919435 84 jmlr-2010-On Spectral Learning
12 0.90853345 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.90479028 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
14 0.90399617 40 jmlr-2010-Fast and Scalable Local Kernel Machines
15 0.90261018 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
16 0.90102208 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
17 0.89985096 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
18 0.89887291 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
19 0.89799076 63 jmlr-2010-Learning Instance-Specific Predictive Models
20 0.89621955 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning