jmlr jmlr2009 jmlr2009-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, Jensen-Shannon divergence, string kernels. ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal. © 2009 André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and Mário A. T. Figueiredo.
Reference: text
sentIndex sentText sentNum sentScore
1 Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal. Editor: Francis Bach. Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. [sent-21, score-0.205]
2 Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. [sent-22, score-0.264]
3 Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. [sent-23, score-0.275]
4 This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. [sent-24, score-0.479]
5 Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. [sent-27, score-0.275]
6 We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. [sent-28, score-0.534]
7 Nonextensive string kernels are also defined that generalize the p-spectrum kernel. [sent-29, score-0.26]
8 We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. [sent-30, score-0.228]
9 Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, Jensen-Shannon divergence, string kernels. [sent-31, score-0.535]
10 Some of these kernels have a natural information theoretic interpretation, establishing a bridge between kernel methods and information theory (Cuturi et al. [sent-58, score-0.278]
11 The main goal of this paper is to widen that bridge; we do that by introducing a new class of kernels rooted in nonextensive information theory, which contains previous information theoretic kernels as particular elements. [sent-60, score-0.656]
12 The Shannon and Rényi entropies (Shannon, 1948; Rényi, 1961) share the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. [sent-61, score-0.245]
13 The so-called Tsallis entropies (Havrda and Charvát, 1967; Tsallis, 1988) form a parametric family of nonextensive entropies that includes the Shannon-Boltzmann-Gibbs entropy as a particular case. [sent-65, score-0.502]
14 These divergences are then used to define new informationtheoretic kernels between probability distributions. [sent-70, score-0.222]
15 Based on these concepts, we introduce the Jensen-Tsallis (JT) q-difference, a nonextensive generalization of the JS divergence, which is also a “mutual information” in the sense of Furuichi (2006). [sent-73, score-0.275]
16 • A broad family of (nonextensive information theoretic) positive definite kernels, interpretable as nonextensive mutual information kernels, ranging from the Boolean to the linear kernels, and including the JS kernel proposed by Hein and Bousquet (2005). [sent-76, score-0.431]
17 • A family of (nonextensive information theoretic) positive definite kernels between stochastic processes, subsuming well-known string kernels (e. [sent-77, score-0.459]
18 Section 2 reviews nonextensive entropies, with emphasis on the Tsallis case. [sent-85, score-0.275]
19 The new family of entropic kernels is introduced and characterized in Section 6, which also introduces nonextensive kernels between stochastic processes. [sent-89, score-0.725]
20 Nonextensive Entropies and Tsallis Statistics. In this section, we start with a brief overview of nonextensive entropies. [sent-93, score-0.275]
21 Inspired by the axiomatic formulation of Shannon’s entropy (Khinchin, 1957; Shannon and Weaver, 1949), Suyari (2004) proposed an axiomatic framework for nonextensive entropies and a uniqueness theorem. [sent-100, score-0.407]
22 $S_q(p_1, \ldots, p_n) = -\sum_{x \in X} p(x)^q \ln_q p(x)$ (3), where $\ln_q(x) \triangleq (x^{1-q} - 1)/(1 - q)$ is the q-logarithm function, which satisfies $\ln_q(xy) = \ln_q(x) + x^{1-q} \ln_q(y)$ and $\ln_q(1/x) = -x^{q-1} \ln_q(x)$. [sent-126, score-0.589]
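As a concrete illustration of Equation (3), the short sketch below (not code from the paper; it fixes k = 1 and uses helper names of my choosing) computes the q-logarithm and the Tsallis entropy of a discrete distribution, checks the non-additivity of the q-logarithm, and verifies that Shannon's entropy is recovered as q approaches 1.

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); ln_1 is the natural log."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if np.isclose(q, 1.0) else (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """S_q(p) = -sum_x p(x)^q ln_q p(x), with k = 1; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p ** q * ln_q(p, q))

p = np.array([0.5, 0.25, 0.25])
print(tsallis_entropy(p, q=2.0))      # Tsallis entropy at q = 2
print(tsallis_entropy(p, q=1.001))    # close to the Shannon entropy ...
print(-np.sum(p * np.log(p)))         # ... which is recovered as q -> 1

# the q-logarithm is not additive: ln_q(xy) = ln_q(x) + x^(1-q) * ln_q(y)
x, y, q = 2.0, 3.0, 1.7
print(np.isclose(ln_q(x * y, q), ln_q(x, q) + x ** (1.0 - q) * ln_q(y, q)))  # True
```

The same two helpers are reused, inlined, in the later sketches.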
23 Tsallis joint and conditional entropies are defined, respectively, as $S_q(X,Y) \triangleq -\sum_{x,y} p(x,y)^q \ln_q p(x,y)$ and $S_q(X|Y) \triangleq -\sum_{x,y} p(x,y)^q \ln_q p(x|y) = \sum_y p(y)^q S_q(X|y)$ (4), and the chain rule $S_q(X,Y) = S_q(X) + S_q(Y|X)$ holds. [sent-130, score-0.286]
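A minimal numerical check of the chain rule above, again with k = 1 and with hypothetical helper names, is sketched below for a small 2 x 2 joint distribution.

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm, with ln_1 = ln."""
    return np.log(x) if np.isclose(q, 1.0) else (x ** (1.0 - q) - 1.0) / (1.0 - q)

def S_q(p, q):
    """Tsallis entropy of an array of probabilities (any shape), with k = 1."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p ** q * ln_q(p, q))

def S_q_conditional(pxy, q):
    """S_q(Y|X) = sum_x p(x)^q S_q(Y|X=x), with rows of pxy indexed by x."""
    px = pxy.sum(axis=1)
    return sum(px[x] ** q * S_q(pxy[x] / px[x], q) for x in range(len(px)) if px[x] > 0)

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])                              # joint distribution p(x, y)
q = 1.5
lhs = S_q(pxy, q)                                         # S_q(X, Y)
rhs = S_q(pxy.sum(axis=1), q) + S_q_conditional(pxy, q)   # S_q(X) + S_q(Y|X)
print(lhs, rhs)                                           # the chain rule makes these coincide
```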
24 In Section 5, we establish a relationship between Tsallis mutual entropy and a quantity called Jensen-Tsallis q-difference, generalizing the one between mutual information and the JS divergence (shown, e. [sent-133, score-0.316]
25 In Section 5, we show that this alternative definition also leads to a nonextensive analogue of the JS divergence. [sent-140, score-0.275]
26 Although, as shown below, these functionals are completely characterized by their restriction to the normalized probability distributions, the denormalization expressions will play an important role in Section 6 to derive novel positive definite kernels inspired by mutual informations. [sent-143, score-0.266]
27 Let us now proceed similarly with the nonextensive entropies. [sent-157, score-0.275]
28 The nonextensive counterpart $S_q$ of (8), defined on $M_+(X)$, is $S_q(f) = \int \varphi_q \circ f$ (10), where $\varphi_q : \mathbb{R}_+ \to \mathbb{R}$ is given by $\varphi_q(y) = \varphi_H(y)$ if $q = 1$ and $\varphi_q(y) = \frac{k}{\phi(q)}\,(y - y^q)$ if $q \neq 1$ (11), and $\phi : \mathbb{R}_+ \to \mathbb{R}$ satisfies conditions (i)-(iii) stated following Equation (1). [sent-159, score-0.275]
29 Similarly, a nonextensive generalization of the generalized KL divergence (9) is $D_q(f, g) = -\frac{k}{\phi(q)} \int \left( q f + (1-q)\, g - f^q g^{1-q} \right)$, for $q \neq 1$, and $D_1(f, g) \triangleq \lim_{q \to 1} D_q(f, g) = D(f, g)$. [sent-161, score-0.412]
30 If $|f| = |g| = 1$, we recover the pseudo-additivity property of nonextensive entropies: $S_q(f \otimes g) = S_q(f) + S_q(g) - \frac{\phi(q)}{k}\, S_q(f)\, S_q(g)$. [sent-167, score-0.275]
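With the standard Tsallis choices k = 1 and φ(q) = q - 1 (an assumption on my part; the paper keeps φ general), pseudo-additivity can be checked numerically for the product of two normalized distributions, as in this small sketch.

```python
import numpy as np

def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum p^q) / (q - 1), which equals -sum p^q ln_q p, with k = 1."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if np.isclose(q, 1.0) else (1.0 - np.sum(p ** q)) / (q - 1.0)

p = np.array([0.6, 0.4])
r = np.array([0.2, 0.3, 0.5])
q = 1.7

joint = np.outer(p, r)       # product measure of p and r, i.e., independent variables
lhs = tsallis_entropy(joint, q)
rhs = (tsallis_entropy(p, q) + tsallis_entropy(r, q)
       - (q - 1.0) * tsallis_entropy(p, q) * tsallis_entropy(r, q))
print(lhs, rhs)              # pseudo-additivity: both values coincide
```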
31 (This relationship between JS divergence and mutual information was pointed out by Grosse et al. [sent-205, score-0.197]
32 Unlike in the JS divergence case, there is no counterpart of equality (18) based on the Rényi q-divergence $D_{R_q}(p_1 \,\|\, p_2) = \frac{1}{q-1} \ln \int p_1^q \, p_2^{1-q}$. [sent-234, score-0.357]
33 Unlike the JS divergence, the JT divergence lacks an interpretation as a mutual information. [sent-248, score-0.197]
34 In the next section, we propose an alternative to the JT divergence which, among other features, is interpretable as a nonextensive mutual information (in the sense of Furuichi 2006) and is jointly convex, for q ∈ [0, 1]. [sent-250, score-0.472]
35 Later (in Section 5), we will use these functions to define the Jensen-Tsallis qdifference, which we will propose as an alternative nonextensive generalization of the JS divergence, instead of the JT divergence discussed in Section 3. [sent-253, score-0.412]
36 The q-expectation is a convenient concept in nonextensive information theory; for example, it yields a very compact form for the Tsallis entropy: $S_q(X) = -E_q[\ln_q p(X)]$. [sent-261, score-0.275]
37 The Jensen-Tsallis q-Difference. This section introduces the Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen-Shannon divergence. [sent-297, score-0.275]
38 Observe that (25) is a nonextensive analogue of (17). [sent-307, score-0.275]
39 , Furuichi 2006) only consider values of q ≥ 1, when looking for nonextensive analogues of Shannon’s information theory. [sent-336, score-0.275]
40 (Interestingly, this "complements" the joint convexity of the JT divergence (20), for q ∈ [1, 2], proved by Burbea and Rao 1982.) [sent-379, score-0.2]
41 Consider also: for each $t \in T$, the marginals $p_t(Y) \in M_+^1(Y)$; for each $t \in T$ and $y \in Y$, the conditionals $p_t(X|Y=y) \in M_+^1(X)$; the mixture $r(X,Y) \triangleq \int_T \pi(t)\, p_t(X,Y) \in M_+^1(X \times Y)$; the marginal $r(Y) \in M_+^1(Y)$; and, for each $y \in Y$, the conditionals $r(X|Y=y) \in M_+^1(X)$. [sent-441, score-0.291]
42 Notice that, at the minimum, we have $\lambda D(\hat{p}_s \,\|\, \hat{r}) + (1-\lambda) D(\hat{p}_t \,\|\, \hat{r}) = JS_k^{\mathrm{cond},(\lambda,1-\lambda)}(\hat{p}_s, \hat{p}_t)$. [sent-477, score-0.246]
43 It is tempting to investigate the asymptotic behavior of the conditional and joint JS divergences when $k \to \infty$; however, unlike other asymptotic information theoretic quantities, like the entropy or cross entropy rates, this behavior fails to characterize the sources s and t. [sent-478, score-0.243]
44 Nonextensive Mutual Information Kernels. In this section we consider the application of extensive and nonextensive entropies to define kernels on measures; since kernels involve pairs of measures, throughout this section $|T| = 2$. [sent-483, score-0.702]
45 3, we devise novel kernels related to the JS divergence and the JT q-difference; these kernels allow setting a weight for each argument, thus will be called weighted Jensen-Tsallis kernels. [sent-485, score-0.491]
46 We also introduce kernels related to the JR divergence (Section 3. [sent-486, score-0.314]
47 4), and establish a connection between the Tsallis kernels and a family of kernels investigated by Hein et al. [sent-488, score-0.376]
48 3, we devise k-th order Jensen-Tsallis kernels between stochastic processes that subsume the well-known p-spectrum kernel of Leslie et al. [sent-491, score-0.251]
49 The sets of pd and nd kernels are both closed under pointwise sums/integrations, the former being also closed under pointwise products; moreover, both sets are closed under pointwise convergence. [sent-515, score-0.299]
50 While pd kernels "correspond" to inner products via embedding in a Hilbert space, nd kernels that vanish on the diagonal and are positive elsewhere "correspond" to squared Hilbertian distances. [sent-516, score-0.476]
51 A function $\varphi : X \to \mathbb{R}$ is called pd (in the semigroup sense) if $k : X \times X \to \mathbb{R}$, defined as $k(x, y) = \varphi(x \oplus y)$, is a pd kernel. [sent-529, score-0.273]
52 6.2 Jensen-Shannon and Tsallis Kernels. The basic result that allows deriving pd kernels based on the JS divergence and, more generally, on the JT q-difference, is the fact that the denormalized Tsallis q-entropies (10) are nd functions on the semigroup $(M_+(X), +)$, for $q \in [0, 2]$. [sent-533, score-0.436]
53 The kernel $k_q : (M_+(X) \setminus \{0\})^2 \to \mathbb{R}$ is defined as $k_q(\mu_1, \mu_2) = k_q(\omega_1 p_1, \omega_2 p_2) = S_q(\pi) - T_q^{\pi}(p_1, p_2)$. [sent-553, score-0.38]
54 Hence, kq can be seen (in Bayesian classification terms) as a nonextensive expected measure of uncertainty in correctly identifying the class, given the prior π = (π1 , π2 ), and a sample from the mixture π1 p1 + π2 p2 . [sent-555, score-0.377]
55 Observe now that $k_q(\mu_1, \mu_2) = \tilde{k}_q(\mu_1, \mu_2)\,(\omega_1 + \omega_2)^{-q}$. [sent-561, score-0.204]
56 Since the product of two pd kernels is a pd kernel and (Proposition 25) (ω1 + ω2 )−q is a pd kernel, for q ∈ [0, 1], we conclude that kq is pd. [sent-562, score-0.719]
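To make the construction concrete, here is a rough sketch of how $k_q$ could be evaluated on two unnormalized count vectors. It assumes the usual reading of the JT q-difference, $T_q^{\pi}(p_1, p_2) = S_q(\pi_1 p_1 + \pi_2 p_2) - \pi_1^q S_q(p_1) - \pi_2^q S_q(p_2)$, with $\pi_i = \omega_i/(\omega_1+\omega_2)$ and k = 1; the function names are mine, not the paper's.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy with k = 1; Shannon entropy at q = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if np.isclose(q, 1.0) else (1.0 - np.sum(p ** q)) / (q - 1.0)

def jt_q_difference(p1, p2, pi, q):
    """T_q^pi(p1, p2) = S_q(pi1*p1 + pi2*p2) - pi1^q S_q(p1) - pi2^q S_q(p2)."""
    mix = pi[0] * p1 + pi[1] * p2
    return (tsallis_entropy(mix, q)
            - pi[0] ** q * tsallis_entropy(p1, q)
            - pi[1] ** q * tsallis_entropy(p2, q))

def weighted_jt_kernel(mu1, mu2, q):
    """k_q(omega1*p1, omega2*p2) = S_q(pi) - T_q^pi(p1, p2), with pi from the total masses."""
    w1, w2 = mu1.sum(), mu2.sum()              # masses omega1, omega2
    p1, p2 = mu1 / w1, mu2 / w2                # normalized measures
    pi = np.array([w1, w2]) / (w1 + w2)
    return tsallis_entropy(pi, q) - jt_q_difference(p1, p2, pi, q)

# unnormalized term counts for two documents over a 4-word vocabulary
mu1 = np.array([3.0, 0.0, 1.0, 2.0])
mu2 = np.array([1.0, 4.0, 1.0, 0.0])
print(weighted_jt_kernel(mu1, mu2, q=1.0))     # q = 1 is the Jensen-Shannon kernel setting
print(weighted_jt_kernel(mu1, mu2, q=0.5))     # another value inside the pd range of k_q
```

The q values shown stay inside [0, 1], the range for which the text guarantees that $k_q$ is pd; the weighted variant with the extra $(\omega_1+\omega_2)^q$ factor covers the wider range [0, 2].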
57 As we can see, the weighted Jensen-Tsallis kernels have two inherent properties: they are parameterized by the entropic index q and they allow their arguments to be unbalanced, that is, to have different weights ωi . [sent-563, score-0.251]
58 We now keep $q \in [0, 2]$ but consider the weighted JT kernel family restricted to normalized measures, $k_q|_{(M_+^1(X))^2}$. [sent-577, score-0.198]
59 This corresponds to setting uniform weights ($\omega_1 = \omega_2 = 1/2$); note that in this case $k_q$ and $\tilde{k}_q$ collapse into the same kernel, $k_q(p_1, p_2) = \tilde{k}_q(p_1, p_2) = \ln_q(2) - T_q(p_1, p_2)$. [sent-578, score-0.488]
60 Proposition 27 guarantees that these kernels are pd for q ∈ [0, 2]. [sent-579, score-0.299]
61 Corollary 37 The kernels $k_{\mathrm{Bool}}$ and $k_{\mathrm{lin}}$ are pd. [sent-591, score-0.2]
62 One of the key features of our generalization is that the kernels are defined on unnormalized measures, with arbitrary mass. [sent-596, score-0.213]
63 This is relevant, for example, in applications of kernels on empirical measures (e. [sent-597, score-0.205]
64 Proposition 38 (Jensen-Rényi and Jensen-Tsallis kernels) For any $q \in [0, 2]$, the kernel $(p_1, p_2) \mapsto S_q\!\left(\frac{p_1 + p_2}{2}\right)$ and the (unweighted) Jensen-Tsallis divergence $JS_q$ (20) are nd kernels on $M_+^1(X) \times M_+^1(X)$. [sent-606, score-0.471]
65 Also, for any $q \in [0, 1]$, the kernel $(p_1, p_2) \mapsto R_q\!\left(\frac{p_1 + p_2}{2}\right)$ and the (unweighted) Jensen-Rényi divergence $JR_q$ (19) are nd kernels on $M_+^1(X) \times M_+^1(X)$. [sent-607, score-0.471]
66 4) and a family of difference kernels introduced by Fuglede (2005), $\psi_{\alpha,\beta}(x, y) = \left(\frac{x^\alpha + y^\alpha}{2}\right)^{1/\alpha} - \left(\frac{x^\beta + y^\beta}{2}\right)^{1/\beta}$. [sent-618, score-0.199]
67 Fuglede (2005) derived the negative definiteness of the above family of kernels provided 1 ≤ α ≤ ∞ and 1/2 ≤ β ≤ α; he went further by providing representations for these kernels. [sent-619, score-0.199]
68 (2004) used the fact that the integral $\int \psi_{\alpha,\beta}(x(t), y(t))\, d\tau(t)$ is also nd to derive a family of pd kernels for probability measures that included the Jensen-Shannon kernel (see Def. [sent-621, score-0.423]
69 From (38) we have that $\int \psi_{\alpha,\beta}(x, y) = (\alpha - 1)\,\widetilde{JS}_\alpha(x, y) - (\beta - 1)\,\widetilde{JS}_\beta(x, y)$, so the family of probabilistic kernels studied in Hein et al. [sent-626, score-0.199]
70 6.4 k-th Order Jensen-Tsallis String Kernels. This subsection introduces a new class of string kernels inspired by the k-th order JT q-difference introduced in Section 5. [sent-629, score-0.26]
71 We will now see how Jensen-Tsallis kernels may be used as string kernels. [sent-652, score-0.26]
72 Let the kernels $\tilde{k}_{q,k}, k_{q,k} : (\mathbb{R}_{++} \times S(A))^2 \to \mathbb{R}$ be defined as $\tilde{k}_{q,k}((\omega_1, s_1), (\omega_2, s_2)) \triangleq \tilde{k}_q(\omega_1 p_{s_1,k}, \omega_2 p_{s_2,k})$ and $k_{q,k}((\omega_1, s_1), (\omega_2, s_2)) \triangleq k_q(\omega_1 p_{s_1,k}, \omega_2 p_{s_2,k}) = S_q(\pi) - T_{q,k}^{\mathrm{joint},\pi}(s_1, s_2)$ (42). Proposition 40 The kernel $\tilde{k}_{q,k}$ is pd, for $q \in [0, 2]$. [sent-660, score-0.25]
73 From Proposition 27, the kernel $\tilde{k}_q(g(\omega_1, s_1), g(\omega_2, s_2))$ is pd and therefore so is $\tilde{k}_{q,k}((\omega_1, s_1), (\omega_2, s_2))$; proceed analogously for $k_{q,k}$. [sent-663, score-0.298]
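As a simplified illustration of the string case, the sketch below maps each string to its empirical length-k substring counts (the p-spectrum representation this section relates to) and feeds the two count measures to the weighted JT kernel from the earlier sketch; this is my own shortcut, not the paper's full k-th order construction over stochastic processes.

```python
import numpy as np
from collections import Counter

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if np.isclose(q, 1.0) else (1.0 - np.sum(p ** q)) / (q - 1.0)

def weighted_jt_kernel(mu1, mu2, q):
    """k_q(omega1 p1, omega2 p2) = S_q(pi) - T_q^pi(p1, p2), as in the earlier sketch."""
    w1, w2 = mu1.sum(), mu2.sum()
    p1, p2 = mu1 / w1, mu2 / w2
    pi = np.array([w1, w2]) / (w1 + w2)
    jt = (tsallis_entropy(pi[0] * p1 + pi[1] * p2, q)
          - pi[0] ** q * tsallis_entropy(p1, q)
          - pi[1] ** q * tsallis_entropy(p2, q))
    return tsallis_entropy(pi, q) - jt

def spectrum_counts(s, k):
    """Counts of all length-k substrings of s (the p-spectrum representation)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def jt_string_kernel(s1, s2, k, q):
    c1, c2 = spectrum_counts(s1, k), spectrum_counts(s2, k)
    vocab = sorted(set(c1) | set(c2))
    mu1 = np.array([c1[w] for w in vocab], dtype=float)   # unnormalized count measures
    mu2 = np.array([c2[w] for w in vocab], dtype=float)
    return weighted_jt_kernel(mu1, mu2, q)

print(jt_string_kernel("information", "informative", k=3, q=1.0))
print(jt_string_kernel("information", "informative", k=3, q=0.5))
```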
74 The following proposition shows that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general. [sent-666, score-0.461]
75 Proposition 41 Let $\tilde{k}_{q,k}^{\mathrm{cond}}$ be defined as $\tilde{k}_{q,k}^{\mathrm{cond}}(s_1, s_2) \triangleq \left( S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1, s_2) \right)(\omega_1 + \omega_2)^q$; and $k_{q,k}^{\mathrm{cond}}$ be defined as $k_{q,k}^{\mathrm{cond}}(s_1, s_2) \triangleq S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1, s_2)$. [sent-668, score-0.564]
76 It holds that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general. [sent-669, score-0.282]
77 Despite the negative result in Proposition 41, the chain rule in Proposition 15 still allows us to define pd kernels by combining conditional JT q-differences. [sent-670, score-0.444]
78 Let $\beta_0 \geq \beta_1 \geq \cdots \geq \beta_n \to 0$ be a non-increasing sequence. Any kernel of the form $\sum_{k=0}^{\infty} \beta_k\, \tilde{k}_{q,k}^{\mathrm{cond}}$ (43) is pd for $q \in [0, 2]$; and any kernel of the form $\sum_{k=0}^{\infty} \beta_k\, k_{q,k}^{\mathrm{cond}}$ is pd for $q \in [0, 1]$, provided both series above converge pointwise. [sent-674, score-0.674]
79 Since (βk )k∈N is non-increasing, we have that (αk )k∈N\{0} is non-negative, which makes (44) the pointwise limit of a conic combination of pd kernels, and therefore a pd kernel. [sent-676, score-0.244]
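In code, this combination amounts to a weighted sum of per-order Gram matrices with a non-increasing weight sequence; the sketch below is a generic illustration with made-up matrices, not an implementation of the paper's kernels.

```python
import numpy as np

def combined_gram(per_order_grams, betas):
    """Return sum_k beta_k * K_k for a finite prefix of a non-increasing weight sequence."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(np.diff(betas) <= 0), "weights beta_k must be non-increasing"
    return sum(b * K for b, K in zip(betas, per_order_grams))

# toy per-order Gram matrices K_0, K_1, K_2 and geometrically decaying weights
grams = [np.array([[1.0, 0.4], [0.4, 1.0]]),
         np.array([[0.9, 0.3], [0.3, 0.9]]),
         np.array([[0.8, 0.2], [0.2, 0.8]])]
print(combined_gram(grams, betas=[1.0, 0.5, 0.25]))
```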
80 Therefore, we conclude that the JT string kernels introduced in this section subsume these two well-known string kernels. [sent-684, score-0.343]
81 Experiments. We illustrate the performance of the proposed nonextensive information theoretic kernels, in comparison with common kernels, for SVM-based text classification. [sent-686, score-0.302]
82 Therefore, both bags of words kernels and string kernels were employed for each task. [sent-691, score-0.461]
83 Although Lafferty and Lebanon (2005) provide empirical evidence that the heat kernel outperforms the linear kernel, it is not guaranteed to be pd for an arbitrary choice of t, as we show in Appendix E. [sent-708, score-0.26]
84 We report the performance of the Tsallis kernels as a function of the entropic index q. [sent-714, score-0.251]
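For experiments of this kind, one plausible way to plug these kernels into an SVM is through a precomputed Gram matrix; the sketch below is my own wiring using scikit-learn (the paper does not specify an SVM implementation) with the weighted JT kernel from the earlier sketch inlined.

```python
import numpy as np
from sklearn.svm import SVC

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) if np.isclose(q, 1.0) else (1.0 - np.sum(p ** q)) / (q - 1.0)

def weighted_jt_kernel(mu1, mu2, q):
    """Weighted JT kernel on unnormalized count vectors, as in the earlier sketch."""
    w1, w2 = mu1.sum(), mu2.sum()
    p1, p2 = mu1 / w1, mu2 / w2
    pi = np.array([w1, w2]) / (w1 + w2)
    jt = (tsallis_entropy(pi[0] * p1 + pi[1] * p2, q)
          - pi[0] ** q * tsallis_entropy(p1, q)
          - pi[1] ** q * tsallis_entropy(p2, q))
    return tsallis_entropy(pi, q) - jt

def gram(docs, q):
    """Pairwise kernel matrix between bag-of-words count vectors."""
    n = len(docs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = weighted_jt_kernel(docs[i], docs[j], q)
    return K

# toy term-count vectors (one per document) with binary labels
X = [np.array([3.0, 1.0, 0.0, 2.0]), np.array([2.0, 2.0, 1.0, 0.0]),
     np.array([0.0, 1.0, 4.0, 1.0]), np.array([1.0, 0.0, 3.0, 2.0])]
y = [0, 0, 1, 1]

clf = SVC(kernel="precomputed", C=1.0).fit(gram(X, q=1.0), y)   # q = 1: JS kernel, pd
print(clf.predict(gram(X, q=1.0)))                              # sanity check on training docs
```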
85 In the second task, the Tsallis kernel outperformed the ℓ2 -normalized linear kernel for both representations, and the heat kernel for tf-idf ; the differences are statistically significant (using the unpaired t test at the 0. [sent-717, score-0.286]
86 For the first task, the JT string kernel and the WASK outperformed the PSK (with statistical significance for p = 3), all kernels performed similarly for p = 4, and the JT string kernel outperformed the WASK for p = 5; all other differences are not statistically significant. [sent-768, score-0.491]
87 Figure 4: Results for earn-vs-acq using string kernels and p = 3, 4, 5. [sent-805, score-0.26]
88 We also observe that the 5-th order JT string kernel remarkably outperforms all bags of words kernels for the stud-vs-fac task, even though it does not use or build any sort of language model at the word level. [sent-810, score-0.358]
89 Conclusions. In this paper we have introduced a new family of positive definite kernels between measures, which includes previous information-theoretic kernels on probability measures as particular cases. [sent-812, score-0.404]
90 One of the key features of the new kernels is that they are defined on unnormalized measures (not necessarily normalized probabilities). [sent-813, score-0.241]
91 This is relevant, for example, for kernels on empirical measures (such as word counts, pixel intensity histograms); instead of the usual step of normalization (Hein et al. [sent-814, score-0.205]
92 Figure 5: Results for stud-vs-fac using string kernels and p = 3, 4, 5. [sent-841, score-0.26]
93 In addition, we define positive definite kernels between stochastic processes that subsume well-known string kernels. [sent-847, score-0.26]
94 Suyari's Axioms for Nonextensive Entropies. Suyari (2004) proposed the following set of axioms (referred to above as Suyari's axioms) that determine nonextensive entropies of the form stated in (1). [sent-864, score-0.376]
95 The resulting matrix (52) fails to be negative definite, since $JS_1^{\mathrm{cond}}(s_1, s_2) + JS_1^{\mathrm{cond}}(s_2, s_3) < JS_1^{\mathrm{cond}}(s_1, s_3)$, which violates the triangle inequality required for $JS_1^{\mathrm{cond}}$ to be a metric. [sent-969, score-0.564]
96 Interestingly, the 0-th order conditional Jensen-Shannon divergence matrix is ensured to be negative definite because it equals a standard Jensen-Shannon divergence matrix. [sent-970, score-0.297]
97 The resulting heat kernel approximation is $k^{\mathrm{heat}}(p_1, p_2) = (4\pi t)^{-\frac{n}{2}} \exp\!\left( -\frac{1}{4t}\, d_g^2(p_1, p_2) \right)$, where $t > 0$ and $d_g(p_1, p_2) = 2 \arccos\!\left( \sum_i \sqrt{p_{1i}\, p_{2i}} \right)$. [sent-978, score-0.226]
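Under the reconstruction above, the approximation of Lafferty and Lebanon (2005) can be coded directly; in this sketch n is taken as the length of the probability vectors (the exact convention for the multinomial dimension n may differ by one), and the helper name is mine.

```python
import numpy as np

def heat_kernel(p1, p2, t):
    """k_heat(p1, p2) = (4*pi*t)^(-n/2) * exp(-d_g(p1, p2)^2 / (4t)),
    with geodesic distance d_g(p1, p2) = 2 * arccos(sum_i sqrt(p1_i * p2_i))."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    n = p1.shape[0]
    inner = np.clip(np.sum(np.sqrt(p1 * p2)), -1.0, 1.0)   # guard against rounding past 1
    d_g = 2.0 * np.arccos(inner)
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-d_g ** 2 / (4.0 * t))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])
print(heat_kernel(p1, p2, t=0.5))
```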
98 On the convexity of some divergence measures based on entropy functions. [sent-1046, score-0.257]
99 A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. [sent-1321, score-0.211]
100 Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. [sent-1361, score-0.578]
wordName wordTfidf (topN-words)
[('sq', 0.451), ('tsallis', 0.344), ('jt', 0.283), ('nonextensive', 0.275), ('tq', 0.268), ('kernels', 0.177), ('js', 0.17), ('aguiar', 0.143), ('cond', 0.141), ('divergence', 0.137), ('pd', 0.122), ('artins', 0.115), ('easures', 0.115), ('igueiredo', 0.115), ('mith', 0.115), ('onextensive', 0.115), ('wjtk', 0.115), ('kq', 0.102), ('nformation', 0.097), ('heoretic', 0.097), ('pt', 0.097), ('jensen', 0.085), ('nyi', 0.083), ('string', 0.083), ('dq', 0.08), ('lnq', 0.08), ('ernels', 0.08), ('entropic', 0.074), ('kernel', 0.074), ('entropies', 0.073), ('hein', 0.07), ('kwjs', 0.069), ('str', 0.069), ('heat', 0.064), ('wask', 0.063), ('mutual', 0.06), ('shannon', 0.06), ('entropy', 0.059), ('burbea', 0.057), ('proposition', 0.057), ('px', 0.056), ('iq', 0.056), ('cuturi', 0.052), ('psk', 0.049), ('furuichi', 0.046), ('divergences', 0.045), ('rq', 0.045), ('ing', 0.044), ('supp', 0.04), ('jsq', 0.04), ('qcv', 0.04), ('suyari', 0.04), ('fq', 0.039), ('pm', 0.038), ('sr', 0.036), ('unnormalized', 0.036), ('tf', 0.036), ('lebanon', 0.035), ('hilbertian', 0.034), ('instituto', 0.034), ('jrq', 0.034), ('kbool', 0.034), ('radon', 0.034), ('rao', 0.034), ('strings', 0.034), ('convexity', 0.033), ('lim', 0.032), ('joint', 0.03), ('ct', 0.029), ('py', 0.029), ('pn', 0.029), ('charv', 0.029), ('denormalization', 0.029), ('havrda', 0.029), ('pxy', 0.029), ('semigroup', 0.029), ('measures', 0.028), ('axioms', 0.028), ('theoretic', 0.027), ('documents', 0.027), ('ps', 0.026), ('martins', 0.026), ('kl', 0.025), ('degenerate', 0.024), ('eq', 0.024), ('dg', 0.024), ('cq', 0.024), ('jr', 0.024), ('bags', 0.024), ('conditional', 0.023), ('lafferty', 0.023), ('fuglede', 0.023), ('jsk', 0.023), ('kjs', 0.023), ('klin', 0.023), ('portugal', 0.023), ('pi', 0.023), ('bt', 0.022), ('family', 0.022), ('vert', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
2 0.11373696 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
Author: Cynthia Rudin, Robert E. Schapire
Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
3 0.074651837 78 jmlr-2009-Refinement of Reproducing Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: We continue our recent study on constructing a refinement kernel for a given kernel so that the reproducing kernel Hilbert space associated with the refinement kernel contains that with the original kernel as a subspace. To motivate this study, we first develop a refinement kernel method for learning, which gives an efficient algorithm for updating a learning predictor. Several characterizations of refinement kernels are then presented. It is shown that a nontrivial refinement kernel for a given kernel always exists if the input space has an infinite cardinal number. Refinement kernels for translation invariant kernels and Hilbert-Schmidt kernels are investigated. Various concrete examples are provided. Keywords: reproducing kernels, reproducing kernel Hilbert spaces, learning with kernels, refinement kernels, translation invariant kernels, Hilbert-Schmidt kernels
4 0.072973616 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
5 0.050488599 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
6 0.048201911 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
8 0.044372972 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
9 0.043094344 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
10 0.041637041 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
11 0.033504754 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
12 0.033220198 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
13 0.03265705 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
14 0.032150388 23 jmlr-2009-Discriminative Learning Under Covariate Shift
15 0.031408526 90 jmlr-2009-Structure Spaces
16 0.031127548 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
17 0.030138614 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
18 0.026744632 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
19 0.026533695 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
20 0.025284657 49 jmlr-2009-Learning Permutations with Exponential Weights
topicId topicWeight
[(0, 0.146), (1, 0.002), (2, 0.061), (3, -0.039), (4, -0.137), (5, 0.093), (6, -0.02), (7, 0.069), (8, -0.114), (9, -0.061), (10, -0.017), (11, 0.187), (12, 0.087), (13, -0.017), (14, -0.027), (15, 0.111), (16, 0.127), (17, -0.077), (18, 0.113), (19, 0.179), (20, -0.159), (21, -0.2), (22, -0.165), (23, 0.022), (24, -0.139), (25, 0.046), (26, 0.058), (27, 0.106), (28, 0.196), (29, 0.003), (30, 0.088), (31, 0.002), (32, 0.057), (33, 0.183), (34, 0.193), (35, -0.066), (36, -0.104), (37, 0.097), (38, -0.076), (39, 0.153), (40, 0.254), (41, -0.057), (42, 0.076), (43, -0.147), (44, 0.083), (45, -0.054), (46, -0.007), (47, 0.103), (48, 0.068), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95932049 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
2 0.39229742 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
3 0.34741727 78 jmlr-2009-Refinement of Reproducing Kernels
Author: Yuesheng Xu, Haizhang Zhang
Abstract: We continue our recent study on constructing a refinement kernel for a given kernel so that the reproducing kernel Hilbert space associated with the refinement kernel contains that with the original kernel as a subspace. To motivate this study, we first develop a refinement kernel method for learning, which gives an efficient algorithm for updating a learning predictor. Several characterizations of refinement kernels are then presented. It is shown that a nontrivial refinement kernel for a given kernel always exists if the input space has an infinite cardinal number. Refinement kernels for translation invariant kernels and Hilbert-Schmidt kernels are investigated. Various concrete examples are provided. Keywords: reproducing kernels, reproducing kernel Hilbert spaces, learning with kernels, refinement kernels, translation invariant kernels, Hilbert-Schmidt kernels
Author: Leonid (Aryeh) Kontorovich, Boaz Nadler
Abstract: We propose a novel framework for supervised learning of discrete concepts. Since the 1970’s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning. Keywords: grammar induction, regular language, finite state automaton, maximum margin hyperplane, kernel approximation
5 0.29960966 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
Author: Cynthia Rudin, Robert E. Schapire
Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
6 0.29601115 38 jmlr-2009-Hash Kernels for Structured Data
7 0.25197995 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
8 0.23344451 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
9 0.17485398 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
10 0.16786216 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
11 0.15450868 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
12 0.15334344 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
13 0.15157045 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
14 0.12227608 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
15 0.11339355 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
16 0.11158641 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
17 0.11123943 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
18 0.10659222 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
19 0.10459161 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
20 0.1035808 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
topicId topicWeight
[(11, 0.029), (19, 0.029), (38, 0.019), (47, 0.03), (52, 0.046), (55, 0.033), (58, 0.024), (66, 0.094), (68, 0.016), (82, 0.511), (90, 0.042), (96, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.7283718 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
2 0.60751969 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar
Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extend. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation
3 0.23243392 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
4 0.22992314 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
5 0.22759853 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
7 0.22666658 13 jmlr-2009-Bounded Kernel-Based Online Learning
8 0.22585504 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
9 0.22494258 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
10 0.22467175 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
11 0.22436728 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
12 0.22348596 38 jmlr-2009-Hash Kernels for Structured Data
13 0.22329651 78 jmlr-2009-Refinement of Reproducing Kernels
14 0.22295223 18 jmlr-2009-Consistency and Localizability
15 0.22279002 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
16 0.22268471 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
17 0.22193013 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
18 0.22188695 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
19 0.22143403 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
20 0.22133964 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs