jmlr jmlr2005 jmlr2005-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Leila Mohammadi, Sara van de Geer
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y = 1 and Y = −1. We observe a covariate, or feature, X ∈ X ⊂ Rd. Consider a collection {ha} of classifiers indexed by a finite-dimensional parameter a, and the classifier ha∗ that minimizes the prediction error over this class. The parameter a∗ is estimated by the empirical risk minimizer ân over the class, where the empirical risk is calculated on a training sample of size n. We apply the Kim Pollard Theorem to show that under certain differentiability assumptions, ân converges to a∗ with rate n−1/3, and also present the asymptotic distribution of the renormalized estimator. For example, let V0 denote the set of x on which, given X = x, the label Y = 1 is more likely (than the label Y = −1). If X is one-dimensional, the set V0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n−1 in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers
Reference: text
sentIndex sentText sentNum sentScore
1 NL EURANDOM Post Office Box 513 5600 MB Eindhoven The Netherlands Sara van de Geer GEER @ STAT. [sent-3, score-0.118]
2 The parameter a∗ is estimated by the empirical risk minimizer ân over the class, where the empirical risk is calculated on a training sample of size n. [sent-10, score-0.238]
3 We apply the Kim Pollard Theorem to show that under certain differentiability assumptions, ân converges to a∗ with rate n−1/3, and also present the asymptotic distribution of the renormalized estimator. [sent-11, score-0.26]
4 The problem is then to estimate the thresholds of the intervals. [sent-14, score-0.066]
5 We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. [sent-15, score-0.221]
6 We furthermore consider an extension to higher-dimensional X, assuming basically that V0 has a smooth boundary in some given parametric class. [sent-16, score-0.07]
7 We also discuss various rates of convergence when the differentiability conditions are possibly violated. [sent-17, score-0.153]
8 We show that the rate is n−1 in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. [sent-19, score-0.11]
9 Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers 1. [sent-20, score-0.063]
10 Let the training set (X1 ,Y1 ), · · · , (Xn ,Yn ) consist of n independent copies of the couple (X,Y ) with distribution P, where X ∈ X ⊂ Rd is called a feature and Y ∈ {−1, 1} is the label of X. [sent-24, score-0.076]
11 Following Vapnik (2000) and Vapnik (1998), we [sent-27, score-0.118]
12 consider the empirical counterpart of the risk, which is the number of misclassified examples, i.e. [sent-28, score-0.08]
13 We will study empirical risk minimization over a model class H of classifiers h. [sent-32, score-0.08]
14 (2005), Koltchinskii (2003a), Koltchinskii (2003b), Mohammadi (2004) and Tsybakov and van de Geer (2005). [sent-43, score-0.118]
15 Under regularity assumptions, one can establish rates, as well as the asymptotic distributions. [sent-48, score-0.091]
16 In Section 2, we generalize the problem considered in Mohammadi and van de Geer (2003). [sent-52, score-0.118]
17 It gives an application of the cube root asymptotics derived by Kim and Pollard (1990). [sent-53, score-0.126]
18 A simple case, with just one threshold, has been presented in Mohammadi and van de Geer (2003). [sent-65, score-0.118]
19 We will establish the asymptotic behavior of estimators of the thresholds, using the set of classifiers with K thresholds as model class. [sent-66, score-0.151]
20 Here K is fixed, and not bigger than, but not necessarily equal to, the number of thresholds of the Bayes classifier. [sent-67, score-0.066]
21 We let X = (U,V), with U ∈ Rd−1 and V ∈ R, and minimize the empirical classification error over the classifiers ha(u, v) := 2·1{ka(u) ≥ v} − 1, where a is an r-dimensional parameter and ka : Rd−1 → R is some given smooth function of a. [sent-74, score-0.941]
22 Under differentiability conditions, this will again lead to cube root asymptotics. [sent-75, score-0.17]
23 In Section 3, we study various other rates, and also the asymptotic distribution in the case of a (1/n)-rate. [sent-76, score-0.063]
24 For example, in Corollary 2 the asymptotic distribution of the prediction error follows as a corollary. [sent-82, score-0.063]
25 The conclusion is that under certain assumptions on the distribution of the data, we can prove rates of convergence and asymptotic distributions. [sent-83, score-0.092]
26 The results of the present paper give more insight into the dependence of the asymptotic behavior on the underlying distribution. [sent-85, score-0.063]
27 We consider asymptotics as n → ∞, regarding the sample (X1 ,Y1 ), . [sent-86, score-0.057]
28 Define the empirical risk Ln(a) := Pn(ha(X) ≠ Y) (3) and the theoretical risk L(a) := P(ha(X) ≠ Y). (4) [sent-110, score-0.08]
29 Moreover, let ân = arg min_{a∈A} Ln(a) be the empirical risk minimizer, and let a∗ = arg min_{a∈A} L(a) be its theoretical counterpart. [sent-111, score-0.311]
30 Under regularity conditions, L(a) − L(a∗) behaves like the squared distance ‖a − a∗‖². [sent-120, score-0.096]
31 Moreover, again under regularity conditions, the right hand side of (6) behaves in probability like σ(ân)/√n, where σ(a) is the standard deviation of [νn(a) − νn(a∗)]. [sent-121, score-0.073]
32 Due to the fact that we are dealing with indicator functions, the standard deviation of [νn(a) − νn(a∗)] behaves like the square root ‖a − a∗‖^{1/2} of the distance between a and a∗. [sent-122, score-0.072]
33 Let us continue with a rough sketch of the arguments used for establishing the asymptotic distribution. [sent-125, score-0.063]
34 We may write ân = arg min_a { n^{1/6}[νn(a) − νn(a∗)] + n^{2/3}[L(a) − L(a∗)] }. [sent-126, score-0.086]
35 When we already have the n−1/3 -rate, it is convenient to renormalize to n^{1/3}(ân − a∗) = arg min_t { n^{1/6}[νn(a∗ + n^{−1/3}t) − νn(a∗)] + n^{2/3}[L(a∗ + n^{−1/3}t) − L(a∗)] }. [sent-127, score-0.086]
36 Now, under differentiability assumptions, n^{2/3}[L(a∗ + n^{−1/3}t) − L(a∗)] ≈ tᵀV t/2, where V is the matrix of second derivatives of L at a∗. [sent-128, score-0.101]
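The two displays above can be combined into a single balance; the following LaTeX sketch is our own annotation of the argument (with ≍ denoting order of magnitude), not a quotation from the paper, and it shows why both terms are of order one after the n^{−1/3} localization.

```latex
% Since the indicator differences have variance of order |a - a^*| (see the square-root behaviour
% of the standard deviation noted above), setting a = a^* + n^{-1/3} t gives:
\[
n^{2/3}\bigl\{L_n(a^* + n^{-1/3}t) - L_n(a^*)\bigr\}
= \underbrace{n^{1/6}\bigl\{\nu_n(a^* + n^{-1/3}t) - \nu_n(a^*)\bigr\}}_{\text{sd}\;\asymp\;\|t\|^{1/2}}
+ \underbrace{n^{2/3}\bigl\{L(a^* + n^{-1/3}t) - L(a^*)\bigr\}}_{\approx\; t^{\top} V t/2},
\qquad \nu_n := \sqrt{n}\,(P_n - P).
\]
% Both terms are O_P(1) for fixed t, so stochastic noise and curvature balance exactly at the
% scale n^{-1/3}: this is the cube-root (Kim-Pollard) regime.
```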
37 We call the locations of the K0 sign changes thresholds. [sent-141, score-0.107]
38 With a sign change we mean that the function has strictly opposite sign in sufficiently small intervals to the left and right side of each threshold. [sent-142, score-0.106]
39 The boundary points a^0_0 = 0 and a^0_{K0+1} = 1 are thus not considered as locations of a sign change. [sent-143, score-0.113]
40 For a ∈ UK (7), let ha(x) := ∑_{k=1}^{K+1} bk·1{ak−1 ≤ x < ak}, where a0 = 0, aK+1 = 1 and b1 = −1, bk+1 = −bk, k = 2, . . . [sent-151, score-0.618]
41 The empirical risk minimizer is ân := arg min_{a∈UK} Ln(a). (10) [sent-157, score-0.244]
42 We emphasize that we take the number of thresholds K in our model class fixed. [sent-158, score-0.066]
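As a concrete illustration of the K-threshold class and the empirical risk minimizer in (10), here is a minimal Python sketch; the data-generating F0, the grid size, and the brute-force search are our own illustrative choices, not the paper's procedure.

```python
import itertools
import numpy as np

def h_a(x, a):
    """K-threshold classifier on [0, 1]: -1 on [0, a_1), +1 on [a_1, a_2), -1 on [a_2, a_3), ..."""
    k = np.searchsorted(np.asarray(a), x, side="right")  # number of thresholds <= x
    return np.where(k % 2 == 1, 1, -1)                   # b_1 = -1, signs alternate

def empirical_risk(a, x, y):
    """L_n(a) = P_n(h_a(X) != Y)."""
    return np.mean(h_a(x, a) != y)

def erm_grid(x, y, K, grid_size=40):
    """Brute-force empirical risk minimizer over ordered K-threshold vectors on a grid."""
    grid = np.linspace(0.0, 1.0, grid_size)
    best_a, best_risk = None, np.inf
    for a in itertools.combinations(grid, K):            # a_1 < a_2 < ... < a_K
        r = empirical_risk(a, x, y)
        if r < best_risk:
            best_a, best_risk = a, r
    return np.array(best_a), best_risk

# Toy data: F_0(x) = P(Y = 1 | X = x) is 0.9 on (0.3, 0.7) and 0.1 elsewhere, so K_0 = 2.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
f0 = np.where((x > 0.3) & (x < 0.7), 0.9, 0.1)
y = np.where(rng.uniform(size=x.size) < f0, 1, -1)

a_hat, risk = erm_grid(x, y, K=2)
print(a_hat, risk)   # estimated thresholds should land close to (0.3, 0.7)
```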
43 With a consistent estimator of K in our model class, the asymptotics presented in this paper generally still go through. [sent-162, score-0.057]
44 The following theorem states that ân converges to the minimizer a∗ of L(a) with rate n−1/3 and also provides its asymptotic distribution after renormalization. [sent-169, score-0.257]
45 If K = K0, one can show that when the minimizer a∗ is unique, it is equal to a0. [sent-171, score-0.078]
46 Suppose that a∗ = (a∗_1, a∗_2, . . . , a∗_K) := arg min_{a∈UK} L(a) (11) is the unique minimizer of L(a), that a∗ is in the interior of UK, and that L(a) is a continuous function of a. [sent-181, score-0.237]
47 Suppose that F0 has non-zero derivative f0 in a neighborhood of a∗_k, k = 1, . [sent-182, score-0.061]
48 , K, where g, the density of G, is continuous in a neighborhood of a∗ . [sent-189, score-0.061]
49 The theorem therefore also provides us the rate n−2/3 for the convergence of the prediction error L(ân) of the classifier hân to the prediction error of ha∗, and the asymptotic distribution of the prediction error L(ân) after renormalization. [sent-215, score-0.166]
50 We present this asymptotic distribution in a corollary. [sent-216, score-0.063]
51 Recall that one of the conditions in the above theorem is that L has a unique minimizer in the interior of UK . [sent-219, score-0.217]
52 Note that L(a) = P(Y = 1, ha(X) = −1) + P(Y = −1, ha(X) = 1) = ∫_0^a F0 dG + ∫_a^1 (1 − F0) dG = ∫_0^a (2F0 − 1) dG + ∫_0^1 (1 − F0) dG. [sent-223, score-0.856]
53 If ∫_0^{a1} (2F0 − 1) dG > 0, then a∗ = a0 is the unique minimizer of L. [sent-224, score-0.104]
54 The minimizer is not in the open interval (0, 1), and Theorem 1 indeed fails. [sent-226, score-0.078]
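The one-threshold risk identity above is easy to check numerically; the sketch below uses a toy logistic F0 and G uniform on [0, 1] (both our own choices) and compares the integral representation of L(a) with a Monte Carlo estimate of P(ha(X) ≠ Y).

```python
import numpy as np
from scipy.integrate import quad

F0 = lambda x: 1.0 / (1.0 + np.exp(-10.0 * (x - 0.4)))   # toy P(Y=1 | X=x), crosses 1/2 at 0.4
a = 0.55                                                  # one threshold; h_a = -1 on [0, a), +1 on [a, 1]

# Integral form: L(a) = int_0^a (2F0 - 1) dG + int_0^1 (1 - F0) dG, with G uniform so dG = dx.
L_int = quad(lambda x: 2.0 * F0(x) - 1.0, 0.0, a)[0] + quad(lambda x: 1.0 - F0(x), 0.0, 1.0)[0]

# Monte Carlo form: L(a) = P(h_a(X) != Y).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200_000)
y = np.where(rng.uniform(size=x.size) < F0(x), 1, -1)
h = np.where(x >= a, 1, -1)
L_mc = np.mean(h != y)

print(L_int, L_mc)   # the two values should agree up to Monte Carlo error
```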
55 Consider given functions ka : Rd−1 → R, a ∈ A, and classifiers ha = 2·1{Ca} − 1, a ∈ A, where Ca := {(u, v) : v ≤ ka(u)}, a ∈ A. [sent-231, score-1.412]
56 A famous example is when ka is linear in a, see for instance Hastie et al. [sent-234, score-0.492]
57 Tsybakov and van de Geer (2005) consider this case for large r, depending on n. [sent-236, score-0.118]
58 Let again a∗ = arg min_a L(a) be the minimizer of the theoretical risk L(a), and ân = arg min_a Ln(a) be the empirical risk minimizer. [sent-238, score-0.389]
59 We would like to know the asymptotic distribution of ân. [sent-239, score-0.063]
60 We also suppose that ka is a regular function of the parameter a ∈ Rr, i.e., [sent-243, score-0.492]
61 the gradient k̇a(u) := ∂ka(u)/∂a (13) of ka(u) exists for all u, and also its Hessian k̈a(u) := ∂²ka(u)/∂a∂aᵀ. [sent-245, score-2.46]
62 It is called locally dominated integrable with respect to the measure µ and variable a if for each a there is a neighborhood I of a and a nonnegative µ-integrable function g1 such that for all u ∈ U and b ∈ I, | fb (u)| ≤ g1 (u). [sent-249, score-0.124]
63 The probability of misclassification using the classifier ha is L(a) = P(ha (X) = Y ) = = Z Ca Z Ca (1 − F0 )dG + Z c Ca F0 dG (1 − 2F0 )dG + P(Y = 1). [sent-250, score-0.428]
64 Assume furthermore that the functions m(u, ka(u)) k̇a(u) and ∂/∂aᵀ [m(u, ka(u)) k̇a(u)] are locally dominated integrable with respect to Lebesgue measure and variable a. [sent-254, score-1.047]
65 Also, assume that the function ∫ ka(u) g(u, ka(u)) du is uniformly bounded for a in a neighborhood of a∗, and that for each u, m′(u, ka(u)) and k̈a(u) are continuous in a neighborhood of a∗. [sent-255, score-2.116]
66 ∂²L(a)/∂a∂aᵀ = ∫ Σa(u) m(u, ka(u)) du, (16) where Σa(u) = k̇a(u) k̇aᵀ(u) m′(u, ka(u)) / m(u, ka(u)) + k̈a(u). (17) [sent-257, score-1.968]
67 In the following theorem, we show that n^{1/3}(ân − a∗) converges to the location of the minimum of some Gaussian process. [sent-258, score-0.539]
68 As an example of Theorem 4, suppose r = d and ka is the linear function ka (u) := a1 u1 + . [sent-266, score-0.984]
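A minimal sketch of this linear-boundary special case (our own toy construction, with a crude random search standing in for the empirical risk minimizer, purely for illustration):

```python
import numpy as np

def h_a(u, v, a):
    """h_a(u, v) = 2*1{k_a(u) >= v} - 1 with linear k_a(u) = a_1*u_1 + ... + a_{d-1}*u_{d-1} + a_d."""
    return np.where(u @ a[:-1] + a[-1] >= v, 1, -1)

def empirical_risk(a, u, v, y):
    return np.mean(h_a(u, v, a) != y)

rng = np.random.default_rng(2)
n = 5000
u = rng.uniform(-1.0, 1.0, size=(n, 2))                  # U in R^{d-1} with d = 3
v = rng.uniform(-1.0, 1.0, size=n)
a_true = np.array([0.8, -0.5, 0.1])                      # boundary v = 0.8 u_1 - 0.5 u_2 + 0.1
y = np.where(u @ a_true[:-1] + a_true[-1] >= v, 1, -1)
y = np.where(rng.uniform(size=n) < 0.15, -y, y)          # label noise: F_0 crosses 1/2 on the boundary

# Crude random-search stand-in for arg min_a L_n(a) (illustration only, not the paper's algorithm).
best_a, best_risk = None, np.inf
for _ in range(5000):
    a = rng.uniform(-1.5, 1.5, size=3)
    r = empirical_risk(a, u, v, y)
    if r < best_risk:
        best_a, best_risk = a, r
print(best_a, best_risk)                                 # best_a should be close to a_true up to noise
```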
69 Other Rates of Convergence. In this section, we will investigate the rates that can occur if we do not assume the differentiability conditions needed for the Kim Pollard Theorem. [sent-281, score-0.153]
70 We first assume K = 1, and that 2F0 − 1 has at most one sign change (i. [sent-283, score-0.053]
71 Now, either 2F0 − 1 changes sign at a∗ ∈ (0, 1) or there are no sign changes in (0, 1), i. [sent-289, score-0.162]
72 One easily verifies that a∗ is the minimizer of L(a) over a ∈ [0, 1]. [sent-294, score-0.078]
73 Throughout, a neighborhood of a∗ is some set of the form (a∗ − δ, a∗ + δ), δ > 0, intersected with [0, 1]. [sent-299, score-0.061]
74 Assumption B: Let there exist c > 0 and ε ≥ 0 such that |1 − 2F0 (x)|g(x) ≥ c|x − a∗ |ε , (19) for all x in a neighborhood of a∗ . [sent-300, score-0.061]
75 In Section 2, we assumed differentiability of F0 in a neighborhood of a∗ ∈ (0, 1), with positive derivative f0 . [sent-301, score-0.162]
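To connect Assumption B with the two regimes used elsewhere in the paper (Theorem 1 with ε = 1, Theorem 5 with ε = 0), here is a short worked illustration; the local expansions are our own hedged sketch, not quoted from the paper.

```latex
% Differentiable case (as in Section 2): if F_0(a^*) = 1/2, F_0 has derivative f_0(a^*) > 0 and g(a^*) > 0, then
%   |1 - 2F_0(x)|\, g(x) \approx 2 f_0(a^*)\, g(a^*)\, |x - a^*| \quad \text{near } a^*,
% so Assumption B holds with \varepsilon = 1 (the regime of Theorem 1, rate n^{-1/3}).
%
% Jump case: if F_0 - 1/2 is bounded away from zero on both sides of a^*, then |1 - 2F_0(x)|\, g(x) \ge c > 0
% near a^*, so Assumption B holds with \varepsilon = 0 (the regime of Theorem 5, where the rate n^{-1} can occur).
\[
|1 - 2F_0(x)|\, g(x) \;\ge\; c\, |x - a^*|^{\varepsilon} \qquad \text{(19)}
\]
```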
76 This theorem then gives that for large T and large n, P( sup_{|a−a∗| ≤ δn²} |νn(a) − νn(a∗)| ≥ C1 T² δn ) ≤ C exp(−T). [sent-326, score-0.079]
77 Now, by the peeling device, see for example van de Geer (2000), we can show that lim_{T→∞} lim sup_{n→∞} P( sup_{|a−a∗| > δn} |νn(a) − νn(a∗)| / √(|a − a∗|) ≥ T ) = 0. [sent-327, score-0.19]
78 Under the conditions of Theorem 5 with ε = 0, we derive the asymptotic distribution of the renormalized empirical risk, locally in a neighborhood of order 1/n of a∗ , the local empirical risk. [sent-338, score-0.212]
79 However, since the local empirical risk has a limit law which has no unique minimum, n(ân − a∗) generally does not converge in distribution. [sent-340, score-0.106]
80 3.2 Extension to Several Thresholds and Sign Changes. Recall that K0 is the number of sign changes of 2F0 − 1, and that K is the number of thresholds in the model class H defined in (8). [sent-392, score-0.147]
81 Below, whenever we mention the rate n−1/3 or n−1, we mean that the rate can be obtained under some conditions on F0 and g (see Theorem 1, where ε = 1, and Theorem 5 with ε = 0). [sent-393, score-0.075]
82 Recall that a0 denotes the K0 -vector of the locations of the sign changes of 2F0 − 1. [sent-394, score-0.107]
83 Then, K0 of the elements of ân converge to a0, and either â1,n converges to 0 or âK,n converges to 1. [sent-401, score-0.094]
84 The rate of convergence to the interior points is n−1/3 and the rate of convergence to the boundary point is n−1. [sent-402, score-0.133]
85 If K − K0 is odd, one element of ân converges to one of the boundary points 0 or 1. [sent-406, score-0.081]
86 The envelope GR of this class is defined as GR(·) = sup_{φ∈GR} |φ(·)|. [sent-420, score-0.121]
87 If G is VC-subgraph, then a sufficient condition for the class GR to be uniformly manageable is that its envelope function GR is uniformly square integrable for R near zero. [sent-424, score-0.278]
88 Theorem 7 (Kim and Pollard (1990)) Let {ân} be a sequence of estimators for which (i) Ln(ân) ≤ inf_{a∈A} Ln(a) + oP(n−2/3), (ii) ân converges in probability to the unique a∗ that minimizes L(a), (iii) a∗ is an interior point of A. [sent-425, score-0.165]
89 If Z has non-degenerate increments, then n^{1/3}(ân − a∗) converges in distribution to the (almost surely unique) random vector that minimizes {Z(t) : t ∈ Rr}. [sent-428, score-0.07]
90 To check Condition (ii), we note that, because {φ(·, a) : a ∈ UK} is a uniformly bounded VC-subgraph class, we have the uniform law of large numbers sup_a |Ln(a) − L(a)| → 0, a. [sent-430, score-0.062]
91 To check Condition (iv), for odd i, we have ∂P(ha(X) ≠ Y)/∂ai = (2F0(ai) − 1)g(ai), so ∂²P(ha(X) ≠ Y)/∂ai² |_{ai=a∗_i} = [2f0(ai)g(ai) + (2F0(ai) − 1)g′(ai)] |_{ai=a∗_i} = 2 f0(a∗_i)g(a∗_i). [sent-438, score-0.117]
92 Now, a∗ minimizes L(a) for a in the interior of UK, so 2F0 − 1 changes sign from negative to positive at a∗_k for odd k, and it changes sign from positive to negative at a∗_k for even k. [sent-462, score-0.285]
93 Finally, by (27) and (28), the limit of τ E φ(X,Y, a∗ + s/τ) φ(X,Y, a∗ + t/τ) as τ → ∞ becomes H(s,t) = ∑_{k=1}^{K} [ min(sk, tk) g(a∗_k) 1{sk, tk > 0} − max(sk, tk) g(a∗_k) 1{sk, tk < 0} ]. [sent-473, score-0.224]
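For K = 1 the covariance H above is that of a two-sided Brownian motion with variance g(a∗)|t|, and the second-derivative computation above gives V = 2f0(a∗)g(a∗); informally, the limit of n^{1/3}(ân − a∗) is then the argmin of f0(a∗)g(a∗)t² + √(g(a∗))B(t), a Chernoff-type distribution. The sketch below (our own discretization and parameter choices) simulates that argmin.

```python
import numpy as np

def simulate_argmin(f0, g, T=20.0, dt=0.01, n_rep=2000, seed=3):
    """Simulate arg min_t { f0*g*t^2 + sqrt(g)*B(t) } on [-T, T], B a two-sided standard Brownian motion."""
    rng = np.random.default_rng(seed)
    t = np.arange(dt, T + dt, dt)
    draws = np.empty(n_rep)
    for i in range(n_rep):
        # Two independent Brownian paths for t > 0 and t < 0, glued at B(0) = 0.
        b_pos = np.cumsum(rng.normal(0.0, np.sqrt(dt), t.size))
        b_neg = np.cumsum(rng.normal(0.0, np.sqrt(dt), t.size))
        grid = np.concatenate([-t[::-1], [0.0], t])
        B = np.concatenate([b_neg[::-1], [0.0], b_pos])
        Z = f0 * g * grid**2 + np.sqrt(g) * B
        draws[i] = grid[np.argmin(Z)]
    return draws

draws = simulate_argmin(f0=1.0, g=1.0)
print(draws.std())   # spread of the simulated limit distribution of n^(1/3)(a_hat - a*)
```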
94 Now we calculate an upper bound for the envelope function. [sent-478, score-0.085]
95 To maximize the function φ(x, y, a) = 1{ha(x) ≠ y} − 1{ha∗(x) ≠ y}, note that for y = 1, this function is increasing in the ak's for even k and decreasing in the ak's for odd k. [sent-480, score-0.285]
96 Similarly, 1{ha∗(x) ≠ y} − 1{ha(x) ≠ y} is maximized for y = 1 in case (30), and for y = −1 it is maximized in case (29). [sent-490, score-0.062]
97 So the envelope GR of GR satisfies GR ≤ ḠR, where ḠR = 1{x ∈ ∪_{k=1}^{K} [a∗_k − R, a∗_k + R]}. [sent-498, score-0.085]
98 Now, note that E(ḠR²) ≤ ∑_{k=1}^{K} P(a∗_k − R ≤ X ≤ a∗_k + R), and P(a∗_k − R ≤ X ≤ a∗_k + R)/R = 2Rg(ak)/R < R∗ for some R∗ < ∞ and some ak ∈ (a∗_k − R, a∗_k + R), when R is close to zero. [sent-499, score-0.116]
99 Since ḠR is bounded by one, it is also easy to see that ḠR is uniformly square integrable for R close to zero. [sent-501, score-0.068]
100 Finally, since G is VC-subgraph, we conclude that GR is uniformly manageable for the envelope GR . [sent-502, score-0.154]
wordName wordTfidf (topN-words)
[('ka', 0.492), ('ha', 0.428), ('pollard', 0.237), ('gr', 0.211), ('dg', 0.201), ('rr', 0.18), ('eer', 0.152), ('geer', 0.152), ('ohammadi', 0.152), ('inimization', 0.135), ('isk', 0.135), ('mpirical', 0.135), ('symptotics', 0.135), ('van', 0.118), ('op', 0.118), ('kim', 0.118), ('ak', 0.116), ('sk', 0.112), ('differentiability', 0.101), ('mohammadi', 0.101), ('ln', 0.094), ('envelope', 0.085), ('minimizer', 0.078), ('jn', 0.074), ('bk', 0.074), ('va', 0.068), ('thresholds', 0.066), ('tsybakov', 0.063), ('asymptotic', 0.063), ('uk', 0.063), ('neighborhood', 0.061), ('rd', 0.06), ('risk', 0.059), ('asymptotics', 0.057), ('sign', 0.053), ('odd', 0.053), ('subsection', 0.052), ('koltchinskii', 0.05), ('converges', 0.047), ('interior', 0.047), ('arg', 0.046), ('cc', 0.046), ('hb', 0.046), ('behaves', 0.045), ('theorem', 0.043), ('manageable', 0.043), ('cube', 0.042), ('integrable', 0.042), ('min', 0.04), ('ers', 0.039), ('lugosi', 0.038), ('ur', 0.038), ('poisson', 0.037), ('parametric', 0.036), ('sup', 0.036), ('boundary', 0.034), ('barbour', 0.034), ('eurandom', 0.034), ('han', 0.034), ('leila', 0.034), ('manageability', 0.034), ('ai', 0.032), ('maximized', 0.031), ('iv', 0.031), ('ca', 0.031), ('near', 0.029), ('rates', 0.029), ('changes', 0.028), ('regularity', 0.028), ('copies', 0.028), ('classi', 0.027), ('condition', 0.027), ('root', 0.027), ('union', 0.027), ('uniformly', 0.026), ('lebesgue', 0.026), ('zn', 0.026), ('du', 0.026), ('unique', 0.026), ('rate', 0.026), ('locations', 0.026), ('jump', 0.025), ('vii', 0.025), ('label', 0.025), ('paths', 0.025), ('bayes', 0.024), ('indexed', 0.023), ('ez', 0.023), ('renormalized', 0.023), ('sara', 0.023), ('met', 0.023), ('feature', 0.023), ('conditions', 0.023), ('minimizes', 0.023), ('pn', 0.022), ('differentiable', 0.022), ('estimators', 0.022), ('empirical', 0.021), ('max', 0.021), ('dominated', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
Author: Leila Mohammadi, Sara van de Geer
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y = 1 and Y = −1. We observe a covariate, or feature, X ∈ X ⊂ Rd . Consider a collection {ha } of classifiers indexed by a finite-dimensional parameter a, and the classifier ha∗ that minimizes the prediction error over this class. The parameter a∗ is estimated by the empirical risk minimizer an over the class, where the empirical risk is calculated on a training sample of size n. We apply ˆ the Kim Pollard Theorem to show that under certain differentiability assumptions, an converges to ˆ a∗ with rate n−1/3 , and also present the asymptotic distribution of the renormalized estimator. For example, let V0 denote the set of x on which, given X = x, the label Y = 1 is more likely (than the label Y = −1). If X is one-dimensional, the set V0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n−1 in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
3 0.068649992 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
4 0.052436016 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
5 0.036283188 3 jmlr-2005-A Classification Framework for Anomaly Detection
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs
6 0.035779607 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
7 0.033605974 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
8 0.033509169 22 jmlr-2005-Concentration Bounds for Unigram Language Models
9 0.031863295 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.031565424 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
11 0.031442311 71 jmlr-2005-Variational Message Passing
12 0.030910371 11 jmlr-2005-Algorithmic Stability and Meta-Learning
13 0.029131779 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
14 0.028755836 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
15 0.027462408 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
16 0.027402677 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
17 0.026123725 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
18 0.026032133 39 jmlr-2005-Information Bottleneck for Gaussian Variables
19 0.02558862 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
20 0.023326661 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
topicId topicWeight
[(0, 0.153), (1, -0.107), (2, -0.025), (3, 0.168), (4, -0.042), (5, -0.061), (6, -0.025), (7, -0.017), (8, -0.013), (9, -0.084), (10, -0.126), (11, -0.026), (12, -0.12), (13, -0.017), (14, 0.103), (15, -0.02), (16, 0.132), (17, 0.163), (18, -0.091), (19, -0.024), (20, -0.001), (21, 0.074), (22, -0.108), (23, -0.153), (24, -0.321), (25, 0.04), (26, -0.293), (27, 0.017), (28, -0.126), (29, 0.075), (30, -0.133), (31, -0.003), (32, -0.034), (33, -0.2), (34, -0.411), (35, -0.295), (36, -0.05), (37, -0.153), (38, -0.166), (39, -0.058), (40, -0.083), (41, 0.236), (42, 0.144), (43, -0.002), (44, 0.159), (45, 0.008), (46, 0.021), (47, 0.007), (48, -0.022), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.96440536 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
Author: Leila Mohammadi, Sara van de Geer
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y = 1 and Y = −1. We observe a covariate, or feature, X ∈ X ⊂ Rd . Consider a collection {ha } of classifiers indexed by a finite-dimensional parameter a, and the classifier ha∗ that minimizes the prediction error over this class. The parameter a∗ is estimated by the empirical risk minimizer an over the class, where the empirical risk is calculated on a training sample of size n. We apply ˆ the Kim Pollard Theorem to show that under certain differentiability assumptions, an converges to ˆ a∗ with rate n−1/3 , and also present the asymptotic distribution of the renormalized estimator. For example, let V0 denote the set of x on which, given X = x, the label Y = 1 is more likely (than the label Y = −1). If X is one-dimensional, the set V0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n−1 in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
3 0.2257897 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
4 0.18568909 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
5 0.17926957 3 jmlr-2005-A Classification Framework for Anomaly Detection
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs
6 0.1190635 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
7 0.11215633 20 jmlr-2005-Clustering with Bregman Divergences
8 0.10897784 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
9 0.10866489 11 jmlr-2005-Algorithmic Stability and Meta-Learning
10 0.10846742 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
11 0.10538542 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
12 0.1038609 39 jmlr-2005-Information Bottleneck for Gaussian Variables
13 0.10074594 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.099706896 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
15 0.097408637 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
16 0.094776057 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
17 0.091513999 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.089592405 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
19 0.08782924 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.086077906 47 jmlr-2005-Learning from Examples as an Inverse Problem
topicId topicWeight
[(17, 0.016), (19, 0.014), (37, 0.013), (43, 0.037), (47, 0.012), (52, 0.036), (88, 0.739)]
simIndex simValue paperId paperTitle
same-paper 1 0.99274677 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
Author: Leila Mohammadi, Sara van de Geer
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y = 1 and Y = −1. We observe a covariate, or feature, X ∈ X ⊂ Rd . Consider a collection {ha } of classifiers indexed by a finite-dimensional parameter a, and the classifier ha∗ that minimizes the prediction error over this class. The parameter a∗ is estimated by the empirical risk minimizer an over the class, where the empirical risk is calculated on a training sample of size n. We apply ˆ the Kim Pollard Theorem to show that under certain differentiability assumptions, an converges to ˆ a∗ with rate n−1/3 , and also present the asymptotic distribution of the renormalized estimator. For example, let V0 denote the set of x on which, given X = x, the label Y = 1 is more likely (than the label Y = −1). If X is one-dimensional, the set V0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n−1 in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error. Keywords: asymptotic distribution, classification theory, estimation error, nonparametric models, threshold-based classifiers
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
3 0.95742482 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
4 0.94800574 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
Author: Petros Drineas, Michael W. Mahoney
Abstract: A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n3 ), where n is the number of training examples. We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of ˜ the form Gk = CWk+CT , where C is a matrix consisting of a small number c of columns of G and Wk is the best rank-k approximation to W , the matrix formed by the intersection between those c columns of G and the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously-chosen and data-dependent nonuniform probability distribution. Let · 2 and · F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let Gk be the best rank-k approximation to G. We prove that by choosing O(k/ε4 ) columns G −CWk+CT ξ ≤ G − Gk ξ +ε n ∑ G2 , ii i=1 both in expectation and with high probability, for both ξ = 2, F, and for all k : 0 ≤ k ≤ rank(W ). This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage. The relationships between this algorithm, other related matrix decompositions, and the Nystr¨ m method from integral equation theory are discussed.1 o Keywords: kernel methods, randomized algorithms, Gram matrix, Nystr¨ m method o
5 0.81046093 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, Dan Roth
Abstract: We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter. Keywords: generalization bounds, area under the ROC curve, ranking, large deviations, uniform convergence ∗. Parts of the results contained in this paper were presented at the 18th Annual Conference on Neural Information Processing Systems in December, 2004 (Agarwal et al., 2005a) and at the 10th International Workshop on Artificial Intelligence and Statistics in January, 2005 (Agarwal et al., 2005b). ©2005 Shivan
6 0.79039997 11 jmlr-2005-Algorithmic Stability and Meta-Learning
7 0.784486 20 jmlr-2005-Clustering with Bregman Divergences
8 0.76862431 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
9 0.76280302 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures
10 0.75852507 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
11 0.75504887 48 jmlr-2005-Learning the Kernel Function via Regularization
12 0.74425852 67 jmlr-2005-Stability of Randomized Learning Algorithms
13 0.74252641 47 jmlr-2005-Learning from Examples as an Inverse Problem
14 0.71943992 44 jmlr-2005-Learning Module Networks
15 0.70905477 71 jmlr-2005-Variational Message Passing
16 0.70731217 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
17 0.70718706 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.69926023 3 jmlr-2005-A Classification Framework for Anomaly Detection
19 0.69844484 22 jmlr-2005-Concentration Bounds for Unigram Language Models
20 0.67281806 64 jmlr-2005-Semigroup Kernels on Measures