jmlr jmlr2008 jmlr2008-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
Reference: text
sentIndex sentText sentNum sentScore
1 To simplify our notation we denote $p_v \triangleq \Pr[X = v]$ and $q_v \triangleq \Pr[Y = 1 \mid X = v]$. [sent-77, score-0.951]
2 Let $c_v = |\{i : x_i = v\}|$ be the number of examples in $S$ for which the feature takes the value $v$ and let $c_v^+ = |\{i : x_i = v \wedge y_i = 1\}|$ be the number of examples in which the value of the feature is $v$ and the label is 1. [sent-84, score-0.471]
3 Then $\{p_v\}$ and $\{q_v\}$ are estimated as follows: $\hat{p}_v \triangleq \frac{c_v}{m}$ and $\hat{q}_v \triangleq \begin{cases} \frac{c_v^+}{c_v} & c_v > 0 \\ \frac{1}{2} & c_v = 0. \end{cases}$ [sent-85, score-2.499]
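The estimation step in the sentences above is plain counting. The following Python sketch (hypothetical helper names, not code from the paper) computes $c_v$, $c_v^+$ and the plug-in estimates $\hat{p}_v$ and $\hat{q}_v$, including the $\hat{q}_v = 1/2$ convention for unseen values.

```python
from collections import Counter

def estimate_counts(xs, ys):
    """Plug-in estimates: p_hat(v) = c_v / m, q_hat(v) = c_v^+ / c_v (1/2 if c_v = 0)."""
    m = len(xs)
    c = Counter(xs)                                         # c_v: examples with X = v
    c_pos = Counter(x for x, y in zip(xs, ys) if y == 1)    # c_v^+: those with Y = 1
    p_hat = lambda v: c[v] / m
    q_hat = lambda v: c_pos[v] / c[v] if c[v] > 0 else 0.5  # unseen values get 1/2
    return c, c_pos, p_hat, q_hat
```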
4 Note that $\hat{p}_v$ and $\hat{q}_v$ are implicit functions of the training set $S$. [sent-86, score-0.969]
5 , 2001) are the misclassification error $\sum_{v \in V} \hat{p}_v \min\{\hat{q}_v, (1 - \hat{q}_v)\}$, (1) and the Gini index $2 \sum_{v \in V} \hat{p}_v \hat{q}_v (1 - \hat{q}_v)$. [sent-88, score-2.435]
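A minimal sketch of the two classical training-error criteria quoted above (function names are illustrative, not from the paper); `values` is the set of observed feature values and `p_hat`, `q_hat` are the estimators from the previous sketch.

```python
def misclassification_error(values, p_hat, q_hat):
    # Eq. (1): sum over v of p_hat(v) * min{q_hat(v), 1 - q_hat(v)}
    return sum(p_hat(v) * min(q_hat(v), 1.0 - q_hat(v)) for v in values)

def gini_index(values, p_hat, q_hat):
    # Gini index criterion: 2 * sum over v of p_hat(v) * q_hat(v) * (1 - q_hat(v))
    return 2.0 * sum(p_hat(v) * q_hat(v) * (1.0 - q_hat(v)) for v in values)
```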
6 Because the SSN alone determines the target label, we have that $\hat{q}_v$ is either 0 or 1 for any $v$ such that $\hat{p}_v > 0$. [sent-94, score-0.453]
7 The generalization error of $h$ is the probability of incorrectly predicting the label, $\ell(h) \triangleq \sum_{v \in V} p_v (q_v (1 - h(v)) + (1 - q_v) h(v))$. [sent-109, score-0.518]
8 To see this, we note that the generalization error of $h_S^{Gini}$ is $\ell(h_S^{Gini}) = \sum_{v \in V} p_v (q_v (1 - \hat{q}_v) + (1 - q_v) \hat{q}_v)$. [sent-114, score-2.012]
9 If the estimated probabilities $\{\hat{p}_v\}$ and $\{\hat{q}_v\}$ coincide with the true probabilities $\{p_v\}$ and $\{q_v\}$, then $\ell(h_S^{Gini})$ is identical to the Gini index defined in Eq. [sent-115, score-0.48]
10 The second hypothesis we define is $h_S^{Bayes}(v) = \begin{cases} 1 & \hat{q}_v > \frac{1}{2} \\ 0 & \hat{q}_v < \frac{1}{2} \\ \frac{1}{2} & \hat{q}_v = \frac{1}{2}. \end{cases}$ (5) [sent-121, score-1.014]
11 Note that if $\{\hat{q}_v\}$ coincide with $\{q_v\}$ then $h_S^{Bayes}$ is the Bayes optimal classifier, which we denote by $h_\infty^{Bayes}$. [sent-122, score-0.498]
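Reading off sentences 8-11, the Gini-associated predictor outputs $\hat{q}_v$ itself, while the Bayes-style predictor thresholds it at $1/2$. A small illustrative sketch under that reading (names are not from the paper):

```python
def h_gini(q_hat):
    # h_S^Gini(v) = q_hat(v): predict label 1 with probability q_hat(v)
    return lambda v: q_hat(v)

def h_bayes(q_hat):
    # h_S^Bayes(v): 1 if q_hat(v) > 1/2, 0 if q_hat(v) < 1/2, 1/2 on a tie (Eq. 5)
    def h(v):
        q = q_hat(v)
        return 1.0 if q > 0.5 else 0.0 if q < 0.5 else 0.5
    return h
```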
12 If in addition $\{\hat{p}_v\}$ and $\{p_v\}$ are the same, then $\ell(h_S^{Bayes})$ is identical to the misclassification error defined in Eq. [sent-123, score-0.477]
13 For $X_1$, the SSN feature, we have $\ell(h_S^{Gini}) = \ell(h_S^{Bayes}) = \frac{1}{2} M_0$, where $M_0 = \sum_{v : c_v = 0} p_v$. [sent-129, score-0.48]
14 We propose the following estimator for the generalization error of $h_S^{Gini}$: $\hat{\ell} \triangleq \frac{|\{v : c_v = 1\}|}{2m} + \sum_{v : c_v \ge 2} \frac{2 c_v}{c_v - 1} \hat{p}_v \hat{q}_v (1 - \hat{q}_v)$. [sent-150, score-1.945]
15 (7), let us first rewrite $\ell(h_S^{Gini})$ as the sum $\sum_v \ell_v(h_S^{Gini})$, where $\ell_v(h_S^{Gini})$ is the amount of error due to value $v$ and is formally defined as $\ell_v(h) \triangleq \Pr[X = v] \Pr[h(X) \ne Y \mid X = v] = p_v (q_v (1 - h(v)) + (1 - q_v) h(v))$. [sent-254, score-0.983]
16 To estimate $\Pr[X = v]$, we use the straightforward estimator $\hat{p}_v$. [sent-258, score-0.497]
17 The resulting estimation therefore becomes $\frac{1}{c_v} \sum_{i \in S(v)} \frac{1}{c_v - 1} \sum_{j \in S(v) : j \ne i} \mathbb{1}_{y_i \ne y_j} = \frac{1}{c_v (c_v - 1)} \sum_{i, j \in S(v), i \ne j} \mathbb{1}_{y_i \ne y_j}$. [sent-269, score-0.781]
18 Multiplying this estimator by $\hat{p}_v$ we obtain the following estimator for $\ell_v(h_S^{Gini})$: $\hat{\ell}_v = \hat{p}_v \frac{1}{c_v (c_v - 1)} \sum_{i, j \in S(v), i \ne j} \mathbb{1}_{y_i \ne y_j}$ (12) $= \hat{p}_v \frac{2 c_v^+ (c_v - c_v^+)}{c_v (c_v - 1)} = \hat{p}_v \frac{2 c_v^2 \hat{q}_v (1 - \hat{q}_v)}{c_v (c_v - 1)} = \hat{p}_v \cdot \frac{2 c_v}{c_v - 1} \hat{q}_v (1 - \hat{q}_v)$. [sent-271, score-4.732]
19 Finally, for values $v$ that appear only once in the training set, the above cross-validation procedure cannot be applied, and we therefore estimate their generalization error to be $\frac{1}{2}$, the highest possible error. [sent-272, score-1.251]
20 The full definition of $\hat{\ell}_v$ is thus: $\hat{\ell}_v = \begin{cases} \hat{p}_v \cdot \frac{1}{2} & c_v \le 1 \\ \hat{p}_v \cdot \frac{2 c_v}{c_v - 1} \hat{q}_v (1 - \hat{q}_v) & c_v \ge 2. \end{cases}$ [sent-273, score-3.063]
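Putting sentence 20 together with $\hat{p}_v = c_v/m$, the Ginger estimate can be computed in one pass over the observed values. A hedged sketch (helper names assumed, not from the paper), reusing the counts from the earlier snippet:

```python
def ginger(m, c, c_pos):
    """Ginger estimate of the generalization error of h_S^Gini from the counts c_v, c_v^+."""
    total = 0.0
    for v, cv in c.items():
        p_hat = cv / m
        if cv <= 1:
            total += p_hat * 0.5                  # singleton values: worst-case error 1/2
        else:
            q_hat = c_pos[v] / cv
            total += p_hat * (2.0 * cv / (cv - 1.0)) * q_hat * (1.0 - q_hat)
    return total
```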
21 Lemma 5 For $k$ such that $1 < k \le m$, $\mathbb{E}[\hat{\ell}_v \mid c_v(S) = k] = \frac{k}{m} \cdot 2 q_v (1 - q_v)$. [sent-280, score-0.885]
22 (12), we have $\mathbb{E}[\hat{\ell}_v \mid c_v(S) = k] = \frac{k}{m \, k(k - 1)} \mathbb{E}\big[ \sum_{i, j \in S(v), i \ne j} \mathbb{1}_{y_i \ne y_j} \mid c_v(S) = k \big]$. [sent-283, score-0.774]
23 , $Z_k$ be independent binary random variables with $\Pr[Z_i = 1] = q_v$ for all $i \in [k]$. [sent-287, score-0.498]
24 (14) equals $\mathbb{E}[\sum_{i \ne j} \mathbb{1}_{Z_i \ne Z_j}] = \sum_{i \ne j} \mathbb{E}[\mathbb{1}_{Z_i \ne Z_j}] = \sum_{i \ne j} 2 q_v (1 - q_v) = k(k - 1) \cdot 2 q_v (1 - q_v)$. [sent-289, score-1.992]
25 lated the expectation of $\ell_v(h_S^{Gini})$. Lemma 6 $\mathbb{E}[\ell_v(h_S^{Gini})] = p_v (\frac{1}{2} - 2 q_v (1 - q_v)) \Pr[c_v = 0] + p_v \cdot 2 q_v (1 - q_v)$. [sent-298, score-0.951]
26 Proof From the definition of $\ell_v(h_S^{Gini})$, we have that $\mathbb{E}[\ell_v(h_S^{Gini})] = \mathbb{E}[p_v (q_v (1 - h_S^{Gini}(v)) + (1 - q_v) h_S^{Gini}(v))] = p_v (q_v (1 - \mathbb{E}[h_S^{Gini}(v)]) + (1 - q_v) \mathbb{E}[h_S^{Gini}(v)]) = p_v (q_v + (1 - 2 q_v) \mathbb{E}[h_S^{Gini}(v)])$. [sent-299, score-2.4]
27 (16) and Lemma 6 we have $\mathbb{E}[\hat{\ell}_v] - \mathbb{E}[\ell_v(h_S^{Gini})] = (\frac{1}{2} - 2 q_v (1 - q_v))(\frac{1}{m} \Pr[c_v = 1] - p_v \Pr[c_v = 0])$. [sent-311, score-0.951]
28 Also, it is easy to see that $\frac{1}{m} \Pr[c_v = 1] - p_v \Pr[c_v = 0] = p_v (1 - p_v)^{m-1} - p_v (1 - p_v)^m = p_v^2 (1 - p_v)^{m-1} = \frac{p_v}{m} \Pr[c_v = 1]$. [sent-312, score-3.171]
29 (20) we obtain: $\mathbb{E}[\hat{\ell}_v] - \mathbb{E}[\ell_v(h_S^{Gini})] = (\frac{1}{2} - 2 q_v (1 - q_v)) \frac{p_v}{m} \Pr[c_v = 1]$. [sent-314, score-0.951]
30 For any $q_v$ we have that $0 \le 2 q_v (1 - q_v) \le \frac{1}{2}$, which implies the following inequality: $0 \le \mathbb{E}[\hat{\ell}_v] - \mathbb{E}[\ell_v(h_S^{Gini})] \le \frac{p_v}{2m} \Pr[c_v = 1] \le \frac{p_v}{2m}$. [sent-315, score-1.902]
31 (19) we conclude that $0 \le \mathbb{E}[\hat{\ell}] - \mathbb{E}[\ell(h_S^{Gini})] \le \sum_v \frac{p_v}{2m} = \frac{1}{2m}$. [sent-317, score-0.453]
32 Based on the definitions of $\hat{p}_v = c_v/m$ and $\hat{q}_v = c_v^+/c_v$, we can rewrite Eq. [sent-368, score-1.338]
33 (13) as $\hat{\ell}_v(S) = \begin{cases} \frac{1}{2m} & c_v = 1 \\ \frac{2 c_v^+ (c_v - c_v^+)}{m (c_v - 1)} & c_v \ge 2. \end{cases}$ [sent-369, score-0.774]
34 Therefore, if $c_v \ge 3$, $|\hat{\ell}_v(S) - \hat{\ell}_v(S^{\setminus i})| = \frac{2 c_v^+}{m} \left| \frac{c_v - c_v^+}{c_v - 1} - \frac{c_v - c_v^+ - 1}{c_v - 2} \right| = \frac{2 c_v^+ (c_v^+ - 1)}{m (c_v - 1)(c_v - 2)} \le \frac{2 c_v (c_v - 1)}{m (c_v - 1)(c_v - 2)} = \frac{2 c_v}{m (c_v - 2)} \le \frac{6}{m}$, while if $c_v = 2$ then $\left| \frac{2 c_v^+ (2 - c_v^+)}{m} - \frac{1}{2m} \right| \le \frac{2}{m}$. [sent-370, score-2.322]
35 This lemma states that, except for values with small probabilities, we can be assured with high confidence that $c_v(S)$ grows with $p_v$. [sent-407, score-0.886]
36 This means that as long as $p_v$ is not too small, a single change of example in $S$ does not change $h_S^\delta(v)$ too much. [sent-408, score-0.84]
37 On the other hand, if $p_v$ is small then the value $v$ has little effect on the error to begin with. [sent-409, score-0.485]
38 Therefore, regardless of the probability $p_v$, the error $\ell(h_S^\delta)$ cannot be changed too much by a single change of example in $S$. [sent-410, score-0.484]
39 Lemma 10 below states that $c_v(S)$ is likely to be at least $\rho(\delta/m, p_v, m)$ for all values with non-negligible probabilities. [sent-414, score-0.84]
40 Then, $\forall^\delta S\ \forall v \in V : p_v \ge \frac{6 \ln(2m/\delta)}{m} \Rightarrow c_v(S) \ge \rho(\delta/m, p_v, m) > 1$. [sent-416, score-0.84]
41 This lemma states that for all $v \in V$ such that $p_v \ge \frac{3 \ln(2/\delta)}{m}$ we have that $\forall^\delta S\ |p_v - \hat{p}_v| \le \sqrt{\frac{p_v \cdot 3 \ln(2/\delta)}{m}}$. (24) [sent-418, score-1.405]
42 Based on this lemma, we immediately get that for all $v$ such that $p_v \ge 3 \ln(2/\delta)/m$, $\forall^\delta S\ c_v \ge \rho(\delta, p_v, m)$. [sent-419, score-0.84]
43 Note, however, that this bound is trivial for $p_v = 3 \ln(2/\delta)/m$, because in this case $\rho(\delta, p_v, m) = 0$. [sent-420, score-0.924]
44 We therefore use the bound for values in which $p_v \ge 6 \ln(2/\delta)/m$. [sent-421, score-0.478]
45 For these values it is easy to show that $\rho(\delta, p_v, m) > 1$ for any $\delta \in (0, 1)$. [sent-422, score-0.453]
46 Based on the bound given in the above lemma, we define $h_S^\delta$ to be $h_S^\delta(v) \triangleq \begin{cases} h_S^{Gini}(v) & p_v < \frac{6 \ln(2m/\delta)}{m} \text{ or } c_v \ge \rho(\frac{\delta}{m}, p_v, m) \\ \frac{c_v^+ + q_v (\rho(\frac{\delta}{m}, p_v, m) - c_v)}{\rho(\frac{\delta}{m}, p_v, m)} & \text{otherwise.} \end{cases}$ [sent-425, score-1.311]
47 That is, $h_S^\delta(v)$ is equal to $h_S^{Gini}(v)$ if either $p_v$ is negligible or if there are enough representatives of $v$ in the sample. [sent-426, score-0.453]
48 If this is not the case, then $S$ is not a typical sample and thus we “force” it to be typical by adding $\rho(\frac{\delta}{m}, p_v, m) - c_v$ ‘pseudo-examples’ to $S$ with the value $v$ and with labels that are distributed according to $q_v$. [sent-427, score-1.338]
49 Therefore, except for values with negligible probability $p_v$, the hypothesis $h_S^\delta(v)$ is determined by at least $\rho(\frac{\delta}{m}, p_v, m)$ ‘examples’. [sent-428, score-0.471]
50 First, for values $v$ such that $p_v < 6 \ln(\frac{2m}{\delta})/m$, we have that $h_S^{Gini}(v) = h_S^\delta(v)$. [sent-438, score-0.453]
51 For the rest of the values, $p_v \ge 6 \ln(\frac{2m}{\delta})/m$ and thus the definition of $\ell_v(h_S^\delta)$ implies $\mathbb{E}[\ell_v(h_S^{Gini}) - \ell_v(h_S^\delta)] = \Pr[c_v < \rho(\delta/m, p_v, m)] \cdot \mathbb{E}\big[ \ell_v(h_S^{Gini}) - \ell_v(h_S^\delta) \mid c_v < \rho(\delta/m, p_v, m) \big]$. [sent-440, score-1.746]
52 (24) again, we obtain that $\Pr[c_v < \rho(\delta/m, p_v, m)] \le \delta/m$. [sent-442, score-0.453]
53 In addition, since both $\ell_v(h_S^{Gini})$ and $\ell_v(h_S^\delta)$ are in $[0, p_v]$ we have that $\mathbb{E}\big[ \ell_v(h_S^{Gini}) - \ell_v(h_S^\delta) \mid c_v < \rho(\delta/m, p_v, m) \big] \le p_v$. [sent-443, score-0.453]
54 (26) we get $\mathbb{E}[\ell_v(h_S^{Gini}) - \ell_v(h_S^\delta)] \le \frac{p_v \delta}{m} \le \frac{p_v}{m}$. [sent-445, score-0.453]
55 (25) we conclude that $\mathbb{E}[\ell(h_S^{Gini}) - \ell(h_S^\delta)] \le \sum_v \frac{p_v}{m} = \frac{1}{m}$. [sent-447, score-0.453]
56 Then, $\mathbb{E}[\ell(h[S])] = \sum_v p_v \mathbb{E}[q_v (1 - f_h(c_v(S), c_v^+(S))) + (1 - q_v) f_h(c_v(S), c_v^+(S))] = \sum_v p_v (q_v + (1 - 2 q_v) \mathbb{E}[f_h(c_v(S), c_v^+(S))])$. [sent-489, score-0.922]
57 From the above expression it is clear that if $q_v < \frac{1}{2}$ then $\mathbb{E}[\ell(h[S])]$ is minimal when $\mathbb{E}[f_h(c_v(S), c_v^+(S))]$ is minimal, and if $q_v > \frac{1}{2}$ then $\mathbb{E}[\ell(h[S])]$ is minimal when $\mathbb{E}[f_h(c_v(S), c_v^+(S))]$ is maximal. [sent-490, score-0.996]
58 If $q_v = \frac{1}{2}$ the expectation equals $\frac{1}{2}$ regardless of the choice of $f_h$. [sent-491, score-0.522]
59 We have $\mathbb{E}[f_h(c_v(S), c_v^+(S))] = \sum_S \Pr[S] f_h(c_v(S), c_v^+(S)) = \sum_{k=0}^{m} \Pr[c_v(S) = k] \sum_{i=0}^{k} \Pr[c_v^+(S) = i \mid c_v(S) = k] f_h(k, i)$. Consider the summation on $i$ for a single $k$ from the above sum. [sent-492, score-0.615]
60 In the above expression, note that if $q_v < \frac{1}{2}$ then for each $i \le \frac{k-1}{2}$, $\Pr[c_v^+ = i \mid c_v = k] - \Pr[c_v^+ = k - i \mid c_v = k]$ is positive, and that if $q_v > \frac{1}{2}$ then this expression is negative. [sent-494, score-1.77]
61 Then v (h∞ ) = pv (1 − qv ), and 1 1 1 1 ˆ ˆ E[ v (hBayes )] = pv (qv Pr[qv < ] + (1 − qv )(1 − Pr[qv < ]) + Pr[qv = ]). [sent-503, score-1.902]
62 Subtracting, we have $\mathbb{E}[\ell_v(h_S^{Bayes})] - \ell_v(h_\infty^{Bayes}) = p_v (2 q_v - 1)\left( \Pr[\hat{q}_v < \tfrac{1}{2}] + \tfrac{1}{2} \Pr[\hat{q}_v = \tfrac{1}{2}] \right) \le p_v (2 q_v - 1) \Pr[c_v = 0] \cdot \tfrac{1}{2} + p_v \sum_{k=1}^{m} \Pr[c_v = k] (2 q_v - 1) \Pr[\hat{q}_v \le \tfrac{1}{2} \mid c_v = k]$. [sent-504, score-1.359]
63 This leads us to the following bound: $\mathbb{E}[\ell_v(h_S^{Bayes})] - \ell_v(h_\infty^{Bayes}) \le \tfrac{1}{2} p_v \Pr[c_v = 0] + \tfrac{1}{2} p_v \Pr[c_v = 1] + \tfrac{1}{2} p_v \Pr[c_v = 2] + \sum_{k=3}^{m} \frac{1}{\sqrt{e k}}\, p_v \Pr[c_v = k]$. [sent-507, score-1.812]
64 Additionally, given two samples $S$ and $S'$, let $\kappa(S, S')$ be the predicate that gets the value “true” if for all $v \in V$ we have $c_v(S) = c_v(S')$. [sent-528, score-0.782]
65 4 we bound each of the above terms as follows: First, to bound $\ell_1^\delta(S) - \mathbb{E}[\ell_1^\delta]$ (Lemma 15 below), we use the fact that for each $v \in V_1^\delta$ the probability $p_v$ is small. [sent-531, score-0.489]
66 To bound $\ell_2^\delta(S) - \mathbb{E}[\ell_2^\delta(S') \mid \kappa(S, S')]$ (Lemma 16 below) we note that the expectation is taken with respect to those samples $S'$ in which $c_v(S') = c_v(S)$ for all $v$. [sent-533, score-0.809]
67 In this case, $|\ell_1^\delta(S) - \ell_1^\delta(S^{\setminus i})| = |\ell_{x_i}(h_S^{Bayes}) - \ell_{x_i}(h_{S^{\setminus i}}^{Bayes})| \le p_v \le 6 \ln\left(\tfrac{2m}{\delta}\right) m^{-\frac{2}{3}}$. [sent-545, score-0.596]
68 Applying McDiarmid’s theorem, it follows that $|\ell_1^\delta(S) - \mathbb{E}[\ell_1^\delta]|$ is at most $\left( \tfrac{1}{2} m \cdot \left( 12 \ln\tfrac{2m}{\delta}\, m^{-\frac{2}{3}} \right)^2 \ln\tfrac{2}{\delta} \right)^{1/2} = \frac{12 \ln\frac{2m}{\delta} \sqrt{\frac{1}{2} \ln\frac{2}{\delta}}}{m^{1/6}}$. Lemma 16 $\forall \delta > 0\ \forall^\delta S\ |\ell_2^\delta(S) - \mathbb{E}[\ell_2^\delta(S') \mid \kappa(S, S')]| \le \frac{\sqrt{3 \ln(\frac{2m}{\delta}) \ln(\frac{2}{\delta})}}{m^{1/4}}$. [sent-546, score-0.572]
69 Proof Since the expectation is taken over samples $S'$ for which $c_v(S') = c_v(S)$ for each $v \in V$, we get that the value of the random variable $\ell_v(h_{S'}^{Bayes})$ for each $v$ depends only on the assignment of labels to the examples. [sent-548, score-0.813]
70 Thus, by Hoeffding’s inequality, $\Pr[|\ell_2^\delta(S) - \mathbb{E}[\ell_2^\delta(S') \mid \kappa(S, S')]| \ge t] \le 2 e^{-2 t^2 / \sum_{v \in V_2^\delta} p_v^2}$. Using the fact that for $v$ in $V_2^\delta$, $p_v \le 6 \ln\frac{2m}{\delta} / \sqrt{m}$, we obtain that $\sum_{v \in V_2^\delta} p_v^2 \le \max_{v \in V_2^\delta} \{p_v\} \cdot \sum_{v \in V_2^\delta} p_v \le \frac{6 \ln\frac{2m}{\delta}}{\sqrt{m}}$. [sent-552, score-1.192]
71 Since pv is no longer small, we expect cv to be relatively large. [sent-562, score-0.84]
72 The resulting bound depends on the difference between $q_v$ and $1/2$ and becomes vacuous whenever $q_v$ is close to $1/2$. [sent-564, score-0.516]
73 On the other hand, if qv is close to 1/2, the price we pay for a wrong prediction is small. [sent-565, score-0.498]
74 In the second part of this lemma, we balance these two terms and end up with a bound that does not depend on qv . [sent-566, score-0.516]
75 First, we show that if 3 3 cv (S) is large then v (S) is likely to be close to the expectation of v over samples S in which cv (S) = cv (S ). [sent-579, score-1.178]
76 Choose $v \in V_3^\delta$ and without loss of generality assume that $q_v \ge 1/2$. [sent-596, score-0.498]
77 Thus, from Lemma 18 and the definition of $\ell_v(S)$ we get that with probability of at least $1 - \delta/m$ over the choice of the labels in $S(v)$: $|\ell_v(S) - \mathbb{E}[\ell_v(S') \mid \kappa(S, S')]| \le p_v \sqrt{\frac{2 \ln\frac{m}{\delta}}{e \cdot c_v(S)}}$. (30) [sent-597, score-0.596]
78 e · cv (S) (30) By the definition of V3δ and Lemma 10, ∀δ S, ∀v ∈ V3δ , cv (S) ≥ ρ(δ/m, pv , m). [sent-598, score-1.227]
79 Using the fact that ρ is monotonically increasing with respect to pv it is possible to show (see Lemma 21 in the appendix)that ρ(δ/m, pv , m) ≥ 2 ln m m1/2 for all v ∈ V3δ for m ≥ 4. [sent-599, score-1.057]
80 Therefore, if indeed δ cv (S) ≥ ρ(δ/m, pv , m) for any v ∈ V3δ , we have that 2 ln m δ ≤ pv m−1/4 . [sent-600, score-1.436]
81 Proof As in the proof of Lemma 19, we use the definitions of $V_3^\delta$ and $V_2^\delta$ along with Lemma 10 and Lemma 21 to get that for $m \ge 8$, $\forall^\delta S\ \forall v \in V_2^\delta \cup V_3^\delta\ c_v(S) \ge \rho(\delta/m, p_v, m) \ge 3 \ln(m/\delta)\, m^{1/3}$. [sent-605, score-0.852]
82 To simplify our notation, denote $\alpha_1 = \Pr[\hat{q}_v(S') > 1/2 \mid c_v(S') = c_v(S)]$, $\beta_1 = \Pr[\hat{q}_v(S') < 1/2 \mid c_v(S') = c_v(S)]$, and $\gamma_1 = \Pr[\hat{q}_v(S') = 1/2 \mid c_v(S') = c_v(S)]$. [sent-607, score-2.322]
83 Using the definition of v we have that ˆ 1 E[ v (S ) | cv (S) = cv (S )] = pv (1 − qv ) α1 + q β1 + γ1 2 . [sent-609, score-1.725]
84 Similarly, for the unconditional expectation: 1 E[ v (S )] = pv (1 − qv ) α2 + q β2 + γ2 2 . [sent-610, score-0.951]
85 (32) Subtracting the above two equations and rearranging terms it can be shown that $\Delta \triangleq | \mathbb{E}[\ell_v(S') \mid c_v(S) = c_v(S')] - \mathbb{E}[\ell_v(S')] | = p_v (q_v - \tfrac{1}{2}) \left| (\beta_1 + \gamma_1) - (\beta_2 + \gamma_2) + (\gamma_1 - \gamma_2) \right|$. [sent-611, score-1.227]
86 sequence of random variables with Pr[Zi = 1] = qv . [sent-618, score-0.498]
87 Assume without loss of generality that $q_v \ge 1/2$. [sent-621, score-0.885]
88 To prove the right-hand side inequality, we note that $\beta_2 + \gamma_2 = \sum_{i=1}^{m} \Pr[c_v(S') = i] \Pr\big[ \hat{q}_v(S') \le \tfrac{1}{2} \mid c_v(S') = i \big] \le \Pr[c_v(S') \le \rho] + \Pr\big[ \hat{q}_v(S') \le \tfrac{1}{2} \mid c_v(S') = \rho \big] \le \frac{\delta}{m} + \Pr[\sum_{i=1}^{\rho} Z_i \le \rho/2]$. [sent-626, score-1.778]
89 (33) we get that $\Delta \le p_v (2 q_v - 1) \left( \frac{\delta}{m} + \Pr[\sum_{i=1}^{\rho} Z_i \le \rho/2] \right) \le p_v \left( \frac{1}{m} + \frac{1}{\sqrt{e \cdot \rho(\frac{\delta}{m}, p_v, m)}} \right)$, where the last inequality follows from Lemma 17. [sent-632, score-1.371]
90 Then, if $p_v > 6 \ln\frac{2}{\delta}\, m^{-c}$ and $m \ge 2^{\frac{1}{1-c}}$, we have $\forall \delta > 0\ \ \rho(\delta, p_v, m) \ge 3 \ln\frac{2}{\delta}\, m^{1-c}$. [sent-1011, score-1.192]
91 Proof By the definition of $\rho$, $\rho(\delta, p_v, m) = m p_v - \sqrt{m p_v \cdot 3 \ln\frac{2}{\delta}} = \sqrt{m p_v} \left( \sqrt{m p_v} - \sqrt{3 \ln\frac{2}{\delta}} \right)$. Therefore, $\rho(\frac{\delta}{m}, p_v, m)$ is upward monotonic with $p_v$. [sent-1012, score-1.646]
92 Thus if $p_v > 6 \ln\frac{2}{\delta}\, m^{-c}$, then $\rho(\delta, p_v, m) = m p_v - \sqrt{m p_v \cdot 3 \ln\frac{2}{\delta}}$ [sent-1013, score-1.55]
93 $\ge 6 \ln\frac{2}{\delta}\, m^{1-c} - \sqrt{6 \ln\frac{2}{\delta}\, m^{1-c} \cdot 3 \ln\frac{2}{\delta}} = 3 \ln\frac{2}{\delta}\, m^{\frac{1-c}{2}} \left( m^{\frac{1-c}{2}} + m^{\frac{1-c}{2}} - \sqrt{2} \right) \ge 3 \ln\frac{2}{\delta}\, m^{1-c}$. [sent-1014, score-0.572]
94 Therefore $|\ell(h_S^\delta) - \ell(h_{S^{\setminus i}}^\delta)| = |\ell_v(h_S^\delta) - \ell_v(h_{S^{\setminus i}}^\delta)| = \big| p_v (q_v (1 - h_S^\delta(v)) + (1 - q_v) h_S^\delta(v)) - p_v (q_v (1 - h_{S^{\setminus i}}^\delta(v)) + (1 - q_v) h_{S^{\setminus i}}^\delta(v)) \big| = \big| p_v (1 - 2 q_v)(h_S^\delta(v) - h_{S^{\setminus i}}^\delta(v)) \big| \le p_v \big| h_S^\delta(v) - h_{S^{\setminus i}}^\delta(v) \big|$. [sent-1019, score-3.804]
95 For $v$ such that $p_v < \frac{6 \ln(\frac{2m}{\delta})}{m}$, $|\ell(h_S^\delta) - \ell(h_{S^{\setminus i}}^\delta)| \le p_v < \frac{6 \ln(\frac{2m}{\delta})}{m}$. For $v$ such that $p_v \ge \frac{6 \ln(\frac{2m}{\delta})}{m}$, there are three possible cases for $c_v$ in $S$: 1. $c_v < \rho(\frac{\delta}{m}, p_v, m)$, [sent-1020, score-1.359]
96 2. $\rho(\frac{\delta}{m}, p_v, m) \le c_v < \rho(\frac{\delta}{m}, p_v, m) + 1$, 3. $c_v \ge \rho(\frac{\delta}{m}, p_v, m) + 1$. [sent-1023, score-1.293]
97 In case 1, $h_S^\delta(v) = \frac{c_v^+ + q_v (\rho(\frac{\delta}{m}, p_v, m) - c_v)}{\rho(\frac{\delta}{m}, p_v, m)}$ and $h_{S^{\setminus i}}^\delta(v) = \frac{c_v^+ + q_v (\rho(\frac{\delta}{m}, p_v, m) - (c_v - 1))}{\rho(\frac{\delta}{m}, p_v, m)}$, hence $|h_S^\delta(v) - h_{S^{\setminus i}}^\delta(v)| = \frac{q_v}{\rho(\frac{\delta}{m}, p_v, m)}$. [sent-1025, score-4.146]
98 In case 2, $\rho(\frac{\delta}{m}, p_v, m) = c_v$, therefore $h_S^\delta(v) = h_S^{Gini}(v) = \frac{c_v^+}{c_v}$ and $h_{S^{\setminus i}}^\delta(v) = \frac{c_v^+ + q_v (c_v - (c_v - 1))}{c_v}$, hence $|h_S^\delta(v) - h_{S^{\setminus i}}^\delta(v)| = \frac{q_v}{c_v} = \frac{q_v}{\rho(\frac{\delta}{m}, p_v, m)}$. [sent-1026, score-3.115]
99 In case 3, since $\rho(\frac{\delta}{m}, p_v, m) > 1$ we have $c_v \ge 2$ and $h_S^\delta(v) = h_S^{Gini}(v) = \frac{c_v^+}{c_v}$ and $h_{S^{\setminus i}}^\delta(v) = h_{S^{\setminus i}}^{Gini}(v) = \frac{c_v^+}{c_v - 1}$. Hence $|h_S^\delta(v) - h_{S^{\setminus i}}^\delta(v)| = c_v^+ \left( \frac{1}{c_v - 1} - \frac{1}{c_v} \right) = \frac{c_v^+}{c_v (c_v - 1)} \le \frac{1}{c_v - 1}$. [sent-1027, score-2.841]
100 (36), we have $|\ell(h_S^\delta) - \ell(h_{S^{\setminus i}}^\delta)| \le \max\left\{ \frac{6 \ln(\frac{2m}{\delta})}{m},\ \frac{p_v}{m p_v - \sqrt{m p_v \cdot 3 \ln(\frac{2m}{\delta})}} \right\} \le \max\left\{ \frac{6 \ln(\frac{2m}{\delta})}{m},\ \frac{4}{m} \right\} = \frac{6 \ln(\frac{2m}{\delta})}{m}$. [sent-1029, score-0.489]
wordName wordTfidf (topN-words)
[('qv', 0.498), ('pv', 0.453), ('cv', 0.387), ('hgini', 0.354), ('hbayes', 0.274), ('gini', 0.217), ('pr', 0.187), ('ln', 0.143), ('ginger', 0.126), ('ranking', 0.099), ('fh', 0.076), ('abato', 0.072), ('ategorical', 0.067), ('hwartz', 0.061), ('igr', 0.058), ('anking', 0.057), ('categorical', 0.054), ('hs', 0.052), ('lemma', 0.046), ('estimator', 0.044), ('mk', 0.044), ('criteria', 0.041), ('eatures', 0.041), ('generalization', 0.041), ('mpv', 0.036), ('ig', 0.035), ('criterion', 0.035), ('concentration', 0.033), ('ssn', 0.031), ('mcdiarmid', 0.031), ('predictor', 0.028), ('zi', 0.027), ('feature', 0.027), ('bayes', 0.025), ('error', 0.024), ('ek', 0.022), ('quinlan', 0.022), ('label', 0.022), ('tree', 0.021), ('mcallester', 0.019), ('pvi', 0.019), ('bound', 0.018), ('hypothesis', 0.018), ('training', 0.018), ('drukh', 0.018), ('splitting', 0.018), ('expectation', 0.017), ('decision', 0.016), ('features', 0.015), ('misclassi', 0.015), ('antos', 0.015), ('xm', 0.015), ('mitchell', 0.015), ('plots', 0.014), ('schapire', 0.014), ('haifa', 0.013), ('mantaras', 0.013), ('svi', 0.013), ('removal', 0.013), ('synthetic', 0.013), ('mansour', 0.012), ('proof', 0.012), ('inequality', 0.012), ('relevance', 0.011), ('mass', 0.011), ('person', 0.011), ('index', 0.011), ('hypotheses', 0.011), ('summing', 0.01), ('dence', 0.01), ('shai', 0.009), ('correction', 0.009), ('gain', 0.009), ('atypical', 0.009), ('mcdiardmid', 0.009), ('portrays', 0.009), ('sabato', 0.009), ('sivan', 0.009), ('torkkola', 0.009), ('unemployed', 0.009), ('growing', 0.009), ('chicago', 0.009), ('predictors', 0.009), ('ranked', 0.008), ('hoeffding', 0.008), ('effect', 0.008), ('missing', 0.008), ('let', 0.008), ('prove', 0.008), ('union', 0.008), ('probabilities', 0.008), ('mortgage', 0.008), ('limm', 0.008), ('obstacle', 0.008), ('monotonically', 0.008), ('leaf', 0.008), ('trees', 0.007), ('attributes', 0.007), ('therefore', 0.007), ('regardless', 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
2 0.081907377 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
3 0.07022161 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
4 0.065753758 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
5 0.056414735 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
6 0.055624332 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
7 0.047467891 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
8 0.045931265 54 jmlr-2008-Learning to Select Features using their Properties
9 0.044136148 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.039181329 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
11 0.037470441 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
12 0.031403881 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
13 0.030655226 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
14 0.029059412 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
15 0.02710798 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
16 0.022334343 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
18 0.019645872 7 jmlr-2008-A Tutorial on Conformal Prediction
19 0.018847439 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
20 0.018552756 58 jmlr-2008-Max-margin Classification of Data with Absent Features
topicId topicWeight
[(0, 0.115), (1, -0.053), (2, -0.06), (3, 0.019), (4, 0.027), (5, 0.094), (6, 0.029), (7, -0.215), (8, 0.024), (9, 0.019), (10, 0.076), (11, 0.092), (12, 0.1), (13, -0.032), (14, -0.063), (15, -0.074), (16, -0.043), (17, 0.088), (18, -0.075), (19, 0.049), (20, 0.111), (21, -0.196), (22, -0.18), (23, 0.267), (24, 0.06), (25, 0.134), (26, 0.116), (27, 0.031), (28, 0.057), (29, -0.054), (30, -0.16), (31, -0.127), (32, -0.16), (33, -0.107), (34, 0.138), (35, -0.044), (36, -0.202), (37, 0.037), (38, 0.175), (39, 0.177), (40, 0.051), (41, -0.11), (42, 0.069), (43, -0.119), (44, 0.029), (45, 0.047), (46, 0.187), (47, 0.164), (48, 0.003), (49, -0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.96694088 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
2 0.32181659 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say ˆ ˆ β1 , . . . , βL , nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem minimize ˆ ˆ max β (I − A)β =1,...,L subject to A ∈ Am∗ , where Am∗ is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal √ to m∗ , with m∗ being the true structural dimension. Under mild assumptions, n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors ˆ β for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
3 0.25953615 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
4 0.23592602 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
5 0.22878484 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
7 0.22232513 52 jmlr-2008-Learning from Multiple Sources
8 0.21477371 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
9 0.21195875 54 jmlr-2008-Learning to Select Features using their Properties
10 0.19432074 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
11 0.18308313 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
12 0.17505254 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
13 0.15811808 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
14 0.14644383 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
15 0.12919994 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
16 0.12116329 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
17 0.11945117 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.11479644 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
20 0.10336423 26 jmlr-2008-Consistency of Trace Norm Minimization
topicId topicWeight
[(0, 0.035), (40, 0.057), (49, 0.436), (54, 0.034), (58, 0.022), (66, 0.046), (76, 0.021), (88, 0.043), (92, 0.115), (94, 0.023), (99, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.72811908 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
2 0.31978127 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
3 0.31755558 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
4 0.2797918 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
Author: Imhoi Koo, Rhee Man Kil
Abstract: This paper presents a new method of model selection for regression problems using the modulus of continuity. For this purpose, we suggest the prediction risk bounds of regression models using the modulus of continuity which can be interpreted as the complexity of functions. We also present the model selection criterion referred to as the modulus of continuity information criterion (MCIC) which is derived from the suggested prediction risk bounds. The suggested MCIC provides a risk estimate using the modulus of continuity for a trained regression model (or an estimation function) while other model selection criteria such as the AIC and BIC use structural information such as the number of training parameters. As a result, the suggested MCIC is able to discriminate the performances of trained regression models, even with the same structure of training models. To show the effectiveness of the proposed method, the simulation for function approximation using the multilayer perceptrons (MLPs) was conducted. Through the simulation for function approximation, it was demonstrated that the suggested MCIC provides a good selection tool for nonlinear regression models, even with the limited size of data. Keywords: regression models, multilayer perceptrons, model selection, information criteria, modulus of continuity
6 0.2763539 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
7 0.26403174 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
8 0.26398081 56 jmlr-2008-Magic Moments for Structured Output Prediction
9 0.26318228 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
10 0.2602056 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
11 0.25474292 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.25359246 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
13 0.25343075 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
14 0.25048721 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
15 0.24987802 58 jmlr-2008-Max-margin Classification of Data with Absent Features
16 0.24953383 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.24810283 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.24789189 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
19 0.24618687 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
20 0.2460214 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition