jmlr jmlr2012 jmlr2012-80 knowledge-graph by maker-knowledge-mining

80 jmlr-2012-On Ranking and Generalization Bounds


Source: pdf

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland. Editor: Nicolas Vayatis. Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. [sent-2, score-0.472]

2 In this paper we consider ranking estimators that minimize the empirical convex risk. [sent-3, score-0.519]

3 We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. [sent-4, score-0.418]

4 We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. [sent-5, score-0.389]

5 Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process 1. [sent-7, score-0.378]

6 Introduction The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. [sent-8, score-0.472]

7 , 2005; Cossock and Zhang, 2006; Rudin, 2006; Clémençon et al. [sent-14, score-0.224]

8 We are to construct a function f : X × X → R, called a ranking rule, which predicts the ordering between objects in the following way: if f (x1 , x2 ) ≤ 0, then we predict that y1 ≤ y2 . [sent-20, score-0.472]

9 To measure the quality of a ranking rule f we introduce a probabilistic setting. [sent-21, score-0.367]

10 The most natural approach is to look for a function f which minimizes the risk (the probability of incorrect ranking) L(f) = P(sign(Y1 − Y2) f(X1, X2) < 0) (1) in some family of ranking rules F, where sign(t) = 1 for t > 0, sign(t) = −1 for t < 0 and sign(t) = 0 for t = 0. [sent-26, score-0.618]

11 , Zn = (Xn, Yn), then we can consider a sample analog of (1), namely the empirical risk Ln(f) = (1/(n(n − 1))) ∑_{i ≠ j} I[sign(Yi − Yj) f(Xi, Xj) < 0], (2) where I(·) is the indicator function. [sent-31, score-0.152]
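A minimal sketch of how the empirical ranking risk (2) could be computed in practice; the synthetic data and the linear rule below are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

def empirical_ranking_risk(f, X, y):
    """Ln(f): fraction of ordered pairs (i, j), i != j, with sign(y_i - y_j) * f(x_i, x_j) < 0."""
    n = len(y)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i != j and np.sign(y[i] - y[j]) * f(X[i], X[j]) < 0:
                errors += 1
    return errors / (n * (n - 1))

# Illustrative use with a linear ranking rule f(x1, x2) = theta^T (x1 - x2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta = np.array([1.0, -0.5, 0.2])
y = X @ theta + 0.1 * rng.normal(size=50)
print(empirical_ranking_risk(lambda x1, x2: theta @ (x1 - x2), X, y))
```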

12 The ranking rule that minimizes (2) can be used as an estimator of the function that minimizes (1). [sent-32, score-0.403]

13 Denote the "convex" risk of a ranking rule f by Q(f) = E ψ[sign(Y1 − Y2) f(X1, X2)], and the "convex" empirical risk as Qn(f) = (1/(n(n − 1))) ∑_{i ≠ j} ψ_f(Zi, Zj), where ψ_f(z1, z2) = ψ[sign(y1 − y2) f(x1, x2)]. [sent-40, score-0.625]
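As a companion to the display above, a small sketch of the convex empirical risk Qn(f) for a generic convex surrogate ψ; the hinge-type and exponential surrogates shown are standard choices, used here only as illustrations rather than the paper's prescription.

```python
import numpy as np

def convex_empirical_risk(f, X, y, psi):
    """Qn(f) = (1/(n(n-1))) * sum over i != j of psi(sign(y_i - y_j) * f(x_i, x_j))."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += psi(np.sign(y[i] - y[j]) * f(X[i], X[j]))
    return total / (n * (n - 1))

hinge = lambda t: max(0.0, 1.0 - t)      # SVM-type surrogate psi
exp_loss = lambda t: np.exp(-t)          # exponential (boosting-type) surrogate psi
```

Minimizing such a Qn over a class F with one of these surrogates gives an estimator of the kind fn discussed below.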

14 Therefore, features of the U-process {Qn(f) : f ∈ F} are the basis for our consideration of the statistical properties of the rule fn = argmin_{f ∈ F} Qn(f) as an estimator of the unknown function f* = argmin_{f ∈ F} Q(f). [sent-42, score-0.213]

15 Niemiro and Rejchel (2009) stated theorems about the strong consistency and the asymptotic normality of the estimator fn in the linear case, that is, when we consider linear ranking rules f(x1, x2) = θ^T (x1 − x2), where θ ∈ Rd. [sent-43, score-0.532]

16 In this paper we are interested in the excess risk of an estimator fn (in the general model, not necessarily linear). [sent-45, score-0.402]

17 This is the case when one compares the convex risk of fn with the convex risk of the best rule in the class. [sent-46, score-0.506]

18 In ranking one can find them in Clémençon et al. [sent-52, score-0.556]

19 Their inequalities can be applied to ranking analogs of support vector machines or boosting algorithms. [sent-55, score-0.422]

20 Noticing the close relation between ranking and classification theory, Clémençon et al. [sent-57, score-0.592]

21 (2008) formulated the question whether one can get generalization bounds with "fast rates" for the excess risk in ranking. [sent-58, score-0.301]

22 , 2008, Corollary 6) but only for estimators that minimize the empirical risk with 0-1 loss. [sent-60, score-0.152]

23 Convex loss functions and estimators that minimize the convex empirical risk are used in practice. [sent-62, score-0.293]

24 In this paper we indicate assumptions and methods that allowed us to obtain generalization bounds with better rates than 1/√n for the excess convex risk of such estimators. [sent-63, score-0.427]

25 We state the main theorem and describe its applications to commonly used ranking algorithms in Section 2. [sent-75, score-0.37]

26 Generalization Bounds. First, let us state the conditions on a family of ranking rules F that we need in later work. [sent-79, score-0.512]

27 Furthermore, we need some restrictions on the "richness" of a family of ranking rules F. [sent-83, score-0.512]

28 They are bounds for the covering number of F and are similar to conditions that can often be found in the literature (Pollard, 1984; de la Peña and Giné, 1999; Mendelson, 2002). [sent-84, score-0.162]

29 The family F that we consider satisfies one of the following conditions: Assumption A. There exist constants Di, Vi > 0, i = 1, 2, such that for all measures of the form µ1 = PX ⊗ Pn^X, µ2 = νn, and each t ∈ (0, 1], we have N(t, F, ρµi) ≤ Di t^(−Vi), i = 1, 2. [sent-89, score-0.268]

30 Assumption B. There exist constants Di > 0, Vi ∈ (0, 1), i = 1, 2, such that for all measures of the form µ1 = PX ⊗ Pn^X, µ2 = νn, and each t ∈ (0, 1], we have ln N(t, F, ρµi) ≤ Di t^(−Vi), i = 1, 2. [sent-90, score-0.389]

31 In what follows, we will separately look for probabilistic inequalities of the appropriate order for the empirical term and the degenerate term. [sent-97, score-0.127]

32 The modulus of convexity of a function ψ that appears in this theorem is described in the next subsection. [sent-139, score-0.364]

33 Theorem 2 Let the family of ranking rules F satisfy Assumption A and be convex. [sent-140, score-0.512]

34 If the family F satisfies Assumption B instead of Assumption A, then for every α ∈ (0, 1) and K > 1, with probability at least 1 − α, for all f ∈ F, Q(f) − Q(f*) ≤ (K/(K − 1)) Pn(Pψ_f − Pψ_f*) + C2 max(ln n / n, 1/n^β) + C3 ln(1/α)/n, where 2/3 < β = 2/(2 + V1) < 1. [sent-142, score-0.418]

35 Remark 3 Although the constants C1, C2, C3 can be recovered from the proofs, we do not write their explicit formulas, because our task is to prove bounds with better rates, that is, bounds which decrease fast as n → ∞. [sent-144, score-0.141]

36 Moreover, we show in the next subsection that if F is convex and the modulus of convexity of ψ satisfies the assumption given in Theorem 2, then one can prove that for some constant B and every function f ∈ F, E[Pψ_f(Z1) − Pψ_f*(Z1)]² ≤ B[Q(f) − Q(f*)]. [sent-149, score-0.486]

37 To finish the proof of the first part of the theorem we have to bound the fixed point of the sub-root function φ by ln n / n. [sent-155, score-0.284]

38 Using the Chaining Lemma for empirical processes (Pollard, 1984) we obtain E Rn(Gr*) ≤ (C1/√n) E ∫_0^(√ξ/4) √(ln N(t, Gr*, ρ_Pn)) dt, (6) where ρ_Pn(g1, g2) = √((1/n) ∑_{i=1}^n [g1(Zi) − g2(Zi)]²) and ξ = sup_{g ∈ Gr*} (1/n) ∑_{i=1}^n g²(Zi). [sent-159, score-0.335]

39 Notice that N(t, Gr*, ρ_Pn) ≤ N(t, G*, ρ_Pn) ≤ N(t/2, G, ρ_Pn) · (1/t), since from a cover of the family G with radius t/2 and a cover of the interval [0, 1] with radius t/2 one can easily construct a cover of the family G*. [sent-160, score-0.361]

40 Thus, Assumption A and the above properties of covering numbers imply that for some positive constants C and C1, ln N(t, Gr*, ρ_Pn) ≤ C1 V1 ln(C/t). [sent-162, score-0.632]

41 8) and Jensen's inequality we obtain C1 √(V1/n) E ∫_0^(√ξ/4) √(ln(C/t)) dt ≤ C1 √((V1/n) Eξ ln(C/Eξ)). [sent-165, score-0.332]

42 Furthermore, applying Talagrand (1994, Corollary 3. [sent-166, score-0.246]

43 Summarizing, we have just shown that E Rn(Gr*) ≤ C1 √((V1/n) [8 E Rn(Gr*) + r] ln(C/r)), which for the fixed point r* implies r* ≤ (C1 V1 / n) ln(C/r*), and now it is easy to get that r* ≤ C V1 ln n / n. [sent-168, score-0.738]

44 The reasoning is the same as in the previous case; we need only notice that ln N(t, Gr*, ρ_Pn) ≤ C [ln N(t, F, ρ_{PX ⊗ Pn^X}) + ln(C1/t)]. Therefore, the right side of (6) can be bounded by (C/√n) E ∫_0^(√ξ/4) √(t^(−V1)) dt + (C/√n) E ∫_0^(√ξ/4) √(ln(C1/t)) dt. [sent-170, score-0.246]

45 4) it is less than (C/√n) [8 E Rn(Gr*) + r]^(1/2 − V1/4), which implies that E Rn(Gr*) ≤ (C/√n) (8 E Rn(Gr*) + r)^(1/2 − V1/4) + √((C/n) (8 E Rn(Gr*) + r) ln(C1/r)). (8) [sent-175, score-0.246]

46 For the fixed point r* the inequality (8) takes the form r* ≤ (C/√n) (r*)^(1/2 − V1/4) + √((r*/n) ln(C1/r*)), so r* ≤ C max(ln n / n, 1/n^(2/(2+V1))), and 2/3 < 2/(2+V1) < 1, since 0 < V1 < 1. [sent-176, score-0.541]

47 2 On the Inequality (5) Theorem 2 in the previous subsection shows that better rates can be obtained if we are able to bound second moments of functions from the family PψF − Pψ f ∗ by their expectations. [sent-178, score-0.221]

48 (9) The key object in further analysis is the modulus of convexity of the loss ψ. [sent-180, score-0.326]

49 Definition 4 The modulus of convexity of ψ is the function δ : [0, ∞) → [0, ∞] defined as δ(t) = inf { [ψ(x1) + ψ(x2)]/2 − ψ((x1 + x2)/2) : |x1 − x2| ≥ t }. [sent-184, score-0.326]

50 We illustrate this object with a few examples: for the quadratic function ψ(x) = x² we obtain δ(t) = t²/4; the modulus of convexity of the exponential function defined on the interval [−a, a] is equal to δ(t) = t²/(8 exp(a)) + o(t²); whereas for ψ(x) = max[0, 1 − x] we have δ(t) = 0. [sent-185, score-0.326]
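A short worked computation (ours, added as a check rather than quoted from the paper) behind the first of these examples, ψ(x) = x²:

```latex
\frac{\psi(x_1)+\psi(x_2)}{2}-\psi\!\Big(\frac{x_1+x_2}{2}\Big)
  = \frac{x_1^2+x_2^2}{2}-\frac{(x_1+x_2)^2}{4}
  = \frac{(x_1-x_2)^2}{4},
\qquad\text{so}\qquad
\delta(t)=\inf_{|x_1-x_2|\ge t}\frac{(x_1-x_2)^2}{4}=\frac{t^2}{4}.
```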

51 If the class F is convex, then the risk Q : F → R is a convex functional. [sent-186, score-0.181]

52 This allows us to consider the modulus of convexity of Q, which is given by δ̃(t) = inf { [Q(f1) + Q(f2)]/2 − Q((f1 + f2)/2) : d(f1, f2) ≥ t }, where d is the L2-pseudometric on F, that is, d(f1, f2) = √(E[f1(X1, X2) − f2(X1, X2)]²). [sent-187, score-0.326]

53 The important property of the modulus of convexity is the fact that it can often be lower bounded by Ct^p for some C, p > 0. [sent-188, score-0.326]

54 This property implies a similar one for the modulus of convexity of the functional Q, which is sufficient to prove the relationship (9) between second moments and expectations of functions from the family ψ_F − ψ_f*. [sent-190, score-0.496]

55 (12) The second step of the proof is based on showing that if the modulus δ satisfies (10), then the modulus δ̃ also fulfills a similar condition. [sent-194, score-0.48]

56 , 2006, the proof of Lemma 8) indicates that the modulus δ̃ fulfills δ̃(t) ≥ Cp t^(max(2,p)), (13) where Cp = C for p ≥ 2 and Cp = C(2A1)^(p−2) otherwise. [sent-198, score-0.24]

57 Moreover, from the definition of the modulus δ̃ and the fact that f* is the minimizer of Q(f) in the convex class F we have [Q(f) + Q(f*)]/2 ≥ Q((f + f*)/2) + δ̃(d(f, f*)) ≥ Q(f*) + δ̃(d(f, f*)). [sent-199, score-0.315]

58 Combining this fact with the inequality (12) and the property (13) of the modulus δ̃ we get Q(f) − Q(f*) ≥ 2 δ̃(√(E[ψ_f(Z1, Z2) − ψ_f*(Z1, Z2)]²) / Lψ) ≥ 2 Cp (√(E[ψ_f(Z1, Z2) − ψ_f*(Z1, Z2)]²) / Lψ)^(max(2,p)), which is equivalent to the inequality (11). [sent-200, score-0.338]

59 Thus, for the convex functions mentioned before Lemma 5 we obtain in the inequality (11) an exponent equal to 1, because their modulus of convexity can easily be bounded from below with p = 2. [sent-201, score-0.45]

60 We bound the second term in Hoeffding's decomposition by 1/n, which is sufficient to get better rates for the excess risk of ranking estimators. [sent-207, score-0.64]
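For reference, the standard Hoeffding decomposition of a U-statistic with a symmetric kernel h, written in the notation of this summary (a reminder we add here, not a quotation from the paper); U_n(h_2) below is the degenerate part referred to in the surrounding sentences:

```latex
U_n(h)=\frac{1}{n(n-1)}\sum_{i\neq j} h(Z_i,Z_j)
      =\theta+\frac{2}{n}\sum_{i=1}^{n} h_1(Z_i)+U_n(h_2),
\qquad \theta=\mathbb{E}\,h(Z_1,Z_2),
\quad h_1(z)=\mathbb{E}\,h(z,Z_2)-\theta,
\quad h_2(z_1,z_2)=h(z_1,z_2)-h_1(z_1)-h_1(z_2)-\theta .
```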

61 Theorem 6 If a family of ranking rules F satisfies Assumption A, then for every α ∈ (0, 1), P(∀ f ∈ F : |Un(h_f − h_f*)| ≤ C1 max(V1, V2) ln(C2/α) / n) ≥ 1 − α for some constants C1, C2 > 0. [sent-211, score-0.655]

62 If a family of ranking rules F satisfies Assumption B, then for every α ∈ (0, 1), P(∀ f ∈ F : |Un(h_f − h_f*)| ≤ C3 ln(C4/α) / [(1 − max(V1, V2)) n]) ≥ 1 − α for some constants C3, C4 > 0. [sent-212, score-0.655]

63 Using symmetrization for U-processes (de la Peña and Giné, 1999) we can bound (15) by C2 E exp(C1 λ sup_{f ∈ F} |(n − 1) Sn(h_f − h_f*)|). [sent-221, score-0.15]

64 Therefore, we obtain the inequality Eε sup_{f ∈ F} |(n − 1) Sn(h_f − h_f*)| ≤ C1 ∫_0^(1/4) ln N(t, H, ρ) dt. [sent-243, score-0.338]

65 (18) Besides, the covering number of the family H1 + H2 = {h1 + h2 : h1 ∈ H1 , h2 ∈ H2 } clearly satisfies the inequality N(2t, H1 + H2 , ρ) ≤ N(t, H1 , ρ) N(t, H2 , ρ). [sent-244, score-0.217]

66 Therefore, for some constants C, C1 > 0, N(t, H, ρ) ≤ C t^(−C1 max(V1, V2)), and the right-hand side of (18) is bounded (for some constants C, C1, C2 > 0) by ∫_0^(1/4) C1 max(V1, V2) ln(C/t) dt ≤ C2 max(V1, V2). [sent-246, score-0.477]

67 If the family satisfies Assumption B, then the right-hand side of (18) is bounded (for some constants C, C1 > 0) by ∫_0^(1/4) C1 t^(−max(V1, V2)) dt ≤ C / (1 − max(V1, V2)). (19) [sent-247, score-0.259]

68 Summarizing, we obtain for every λ > 0, E exp(λ sup_{f ∈ F} |(n − 1) Un(h_f − h_f*)|) ≤ C2 exp(C1 λ²), and the form of the constant C1 depends on the assumption (A or B) that is satisfied by the family F. [sent-248, score-0.285]

69 4 Main Result and Examples Our task relied on showing that in ranking, similarly to classification theory, the convex excess risk can be bounded with better rates than the 1/√n rates proved in Clémençon et al. [sent-251, score-0.607]

70 Theorem 8 Let the family of ranking rules F satisfy Assumption A and be convex. [sent-258, score-0.512]

71 Moreover, if the modulus of convexity of a function ψ fulfills on the interval [−A1, A1] the condition δ(t) ≥ Ct^p for some constants C > 0 and p ≤ 2, then for every α ∈ (0, 1), P(Q(fn) − Q(f*) ≤ C1 [max(V1, V2) ln n + ln(C2/α)] / n) ≥ 1 − α (20) for some constants C1, C2. [sent-259, score-0.921]

72 If the family F satisfies Assumption B instead of Assumption A, then for every α ∈ (0, 1), P(Q(fn) − Q(f*) ≤ C3 max(ln n / n, 1/n^β) + C4 ln(C5/α) / [(1 − max(V1, V2)) n]) ≥ 1 − α for some constants C3, C4, C5 and β = 2/(2 + V1) ∈ (2/3, 1). [sent-260, score-0.623]

73 Remark 9 The dependence on the exponents V1, V2 in the inequality (20) is the same as in Clémençon et al. [sent-261, score-0.273]

74 (2008, Corollary 6), where one considered minimizers of the empirical risk with 0 − 1 loss and the family F with finite Vapnik-Chervonenkis dimension. [sent-262, score-0.277]

75 Moreover, for the rule fn that minimizes the empirical convex risk we have Qn ( fn ) − Qn ( f ∗ ) ≤ 0. [sent-265, score-0.48]

76 Now we give three examples of ranking procedures that we can apply Theorem 8 to. [sent-267, score-0.332]

77 Example 1 Consider the family F containing linear ranking rules, F = {f(x1, x2) = θ^T (x1 − x2) : θ, x1, x2 ∈ Rd}. In this case our prediction of the ordering between objects depends on which side of the hyperplane {v : θ^T v = 0} the vector x1 − x2 belongs to. [sent-268, score-0.652]
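A small sketch of how a linear ranking rule of this form might be fit by gradient descent on a smooth convex pairwise surrogate; the logistic-type surrogate, the step size and the toy data are our own illustrative assumptions.

```python
import numpy as np

def fit_linear_ranker(X, y, lr=0.1, n_iter=200):
    """Gradient descent on the pairwise logistic surrogate
    (1/#pairs) * sum_{i != j} log(1 + exp(-sign(y_i - y_j) * theta^T (x_i - x_j)))."""
    n, d = X.shape
    idx_i, idx_j = np.where(~np.eye(n, dtype=bool))   # all ordered pairs i != j
    D = X[idx_i] - X[idx_j]                           # pairwise differences x_i - x_j
    s = np.sign(y[idx_i] - y[idx_j])                  # pairwise labels in {-1, 0, +1}
    D, s = D[s != 0], s[s != 0]                       # drop tied pairs
    theta = np.zeros(d)
    for _ in range(n_iter):
        margins = s * (D @ theta)
        grad = -(D * (s / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)
theta_hat = fit_linear_ranker(X, y)
# Predicted ordering of two new objects x1, x2: predict y1 <= y2 when theta_hat @ (x1 - x2) <= 0.
```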

78 2), then we obtain generalization bounds for the excess risk of the estimator fn of the order ln n / n. [sent-276, score-0.448]

79 Theorem 8 can also be applied to a popular ranking procedure called "boosting". [sent-277, score-0.332]

80 Here we are interested in a ranking version of AdaBoost that uses the exponential loss function. [sent-278, score-0.332]
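A compact sketch of what such a ranking analogue of AdaBoost could look like when the "base" ranking rules are decision stumps on the coordinates of x1 − x2; this base class, the candidate thresholds and all other details are illustrative assumptions, not the paper's exact procedure. The pairwise arrays D (differences) and s (labels) can be built exactly as in the linear example above.

```python
import numpy as np

def rank_adaboost(D, s, n_rounds=20):
    """AdaBoost on pairs: D[k] = x_i - x_j, s[k] = sign(y_i - y_j) in {-1, +1}.
    Base rules are stumps r(x_i, x_j) = sign(D[k, feat] - thresh)."""
    m, d = D.shape
    w = np.full(m, 1.0 / m)                        # exponential-loss weights on pairs
    ensemble = []                                  # list of (alpha, feature, threshold)
    for _ in range(n_rounds):
        best = None
        for feat in range(d):
            for thresh in np.quantile(D[:, feat], [0.25, 0.5, 0.75]):
                pred = np.where(D[:, feat] > thresh, 1.0, -1.0)
                err = w[pred != s].sum()
                if best is None or err < best[0]:
                    best = (err, feat, thresh, pred)
        err, feat, thresh, pred = best
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * s * pred)             # reweight pairs under the exponential loss
        w /= w.sum()
        ensemble.append((alpha, feat, thresh))
    return ensemble

def rank_score(ensemble, x1, x2):
    """f(x1, x2) = sum_t alpha_t r_t(x1, x2); predict y1 <= y2 when the score is <= 0."""
    diff = x1 - x2
    return sum(a * (1.0 if diff[feat] > th else -1.0) for a, feat, th in ensemble)
```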

81 Example 2 Let R = {r : X × X → {−1, 1}} be a family of "base" ranking rules with finite Vapnik-Chervonenkis dimension. [sent-279, score-0.512]

82 12) we obtain that N(t, R, ρµ) ≤ C t^(−V) for some constants C, V > 0 and every probability measure µ on X × X. [sent-290, score-0.143]

83 Furthermore, the modulus of convexity of ψ(x) = exp(−x) fulfills on the interval [−A1, A1] the condition δ(t) > t²/(8 exp(A1)). [sent-292, score-0.326]

84 Thus, in this example we also obtain generalization bounds for the excess convex risk of fn of the order ln n / n. [sent-293, score-0.731]

85 The last example is a ranking version of support vector machines. [sent-294, score-0.332]

86 1) we obtain that for every compact set X, σ ≥ 1 and 0 < V < 1, ln N(t, F, C(X²)) ≤ C t^(−V) (21) for some constant C dependent on V, d, σ and R. [sent-308, score-0.292]

87 Therefore, we get P(Q(fn) − Q(f*) ≤ C1 max(ln n / n, 1/n^β) + C2 ln(C3/α)/n) ≥ 1 − α with 2/3 < β < 1. [sent-316, score-0.355]

88 In the paper we consider ranking estimators that minimize the convex empirical risk. [sent-317, score-0.519]

89 Is there any relation between the excess risk and the convex excess risk? [sent-319, score-0.519]

90 It is easy to see that the ranking rule f¯(x1 , x2 ) = 2 I[ρ+ (x1 ,x2 )≥ρ− (x1 ,x2 )] − 1 minimizes the risk (1) in the class of all measurable functions. [sent-322, score-0.473]

91 (2006) proved the relation between the excess risk and the convex excess risk for classification theory. [sent-326, score-0.519]

92 They obtained that for every ranking rule f, γ(L(f) − L*) ≤ Q(f) − Q* for some invertible function γ that depends on ψ. [sent-329, score-0.413]

93 (22) The first component in (22), the so-called "estimation error", tells us how close the risk of f is to the risk of the best element in the class F. [sent-332, score-0.212]

94 The second term (the "approximation error") describes how much we lose by using the family F. [sent-333, score-0.159]

95 We compare the performance of different SVMs for ranking problems. [sent-338, score-0.332]

96 Therefore, we can use an SVM for classification to solve ranking problems if we consider differences of observations in place of the observations themselves. [sent-344, score-0.332]
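A sketch of the reduction described in this sentence: form the dataset of pairwise differences x_i − x_j with labels sign(y_i − y_j) and hand it to an ordinary binary classifier; scikit-learn's LinearSVC is used below purely as one convenient stand-in for such a classifier, and the toy data are our own.

```python
import numpy as np
from sklearn.svm import LinearSVC   # any binary classifier could be plugged in here

def pairwise_difference_dataset(X, y):
    """Turn a sample (X, y) into a classification problem on differences:
    features x_i - x_j, labels sign(y_i - y_j), with tied pairs dropped."""
    n = len(y)
    idx_i, idx_j = np.where(~np.eye(n, dtype=bool))
    D = X[idx_i] - X[idx_j]
    s = np.sign(y[idx_i] - y[idx_j]).astype(int)
    return D[s != 0], s[s != 0]

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)
D, s = pairwise_difference_dataset(X, y)
clf = LinearSVC(C=1.0).fit(D, s)
# Predicted ordering of two new objects x1, x2: y1 <= y2 when clf.decision_function([x1 - x2]) <= 0.
```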

97 On the second subset we test the estimator, that is, we take two objects and check if the ordering indicated by the estimator is the same as the true one. [sent-357, score-0.176]

98 There are more than 1000 observations; 9 features are considered, such as the age of the material and the contents of water, cement and other ingredients, and finally the concrete compressive strength. [sent-362, score-0.14]

99 In Table 1 we compare errors in predicting the ordering between objects by six algorithms. [sent-363, score-0.14]

100 The quality of a wine was determined by wine experts. [sent-381, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ranking', 0.332), ('ln', 0.246), ('modulus', 0.24), ('ejchel', 0.224), ('mencon', 0.224), ('qn', 0.204), ('pn', 0.182), ('gin', 0.178), ('gr', 0.167), ('anking', 0.155), ('pollard', 0.155), ('excess', 0.151), ('ounds', 0.143), ('eneralization', 0.133), ('family', 0.125), ('cl', 0.121), ('pakes', 0.112), ('hoeffding', 0.111), ('fn', 0.109), ('risk', 0.106), ('un', 0.097), ('constants', 0.097), ('arcones', 0.096), ('lls', 0.095), ('pe', 0.093), ('zi', 0.092), ('scovel', 0.086), ('convexity', 0.086), ('bartlett', 0.08), ('rademacher', 0.077), ('nolan', 0.076), ('wine', 0.076), ('convex', 0.075), ('la', 0.075), ('objects', 0.072), ('ful', 0.07), ('talagrand', 0.07), ('sign', 0.069), ('ct', 0.068), ('ordering', 0.068), ('niemiro', 0.067), ('rejchel', 0.067), ('jn', 0.066), ('estimators', 0.066), ('mendelson', 0.063), ('lugosi', 0.058), ('cucker', 0.057), ('boosting', 0.057), ('rules', 0.055), ('px', 0.054), ('chaining', 0.052), ('blanchard', 0.052), ('rates', 0.051), ('inequality', 0.049), ('degenerate', 0.048), ('steinwart', 0.048), ('vayatis', 0.048), ('compressive', 0.048), ('rn', 0.047), ('zn', 0.047), ('concrete', 0.047), ('every', 0.046), ('empirical', 0.046), ('cement', 0.045), ('convt', 0.045), ('cossock', 0.045), ('dimitriadou', 0.045), ('moments', 0.045), ('bounds', 0.044), ('covering', 0.043), ('sup', 0.043), ('asymptotics', 0.042), ('nonnegative', 0.042), ('notice', 0.041), ('subgraph', 0.04), ('assumption', 0.039), ('kernels', 0.038), ('dit', 0.038), ('wojciech', 0.038), ('cortez', 0.038), ('pseudometric', 0.038), ('herbrich', 0.038), ('lemma', 0.038), ('theorem', 0.038), ('dt', 0.037), ('kernel', 0.037), ('cover', 0.037), ('sn', 0.037), ('freund', 0.036), ('estimator', 0.036), ('bousquet', 0.036), ('relation', 0.036), ('moreover', 0.036), ('rule', 0.035), ('lose', 0.034), ('svm', 0.034), ('boston', 0.033), ('consideration', 0.033), ('inequalities', 0.033), ('exp', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

2 0.15311874 82 jmlr-2012-On the Necessity of Irrelevant Variables

Author: David P. Helmbold, Philip M. Long

Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory

3 0.14063405 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi

Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression

4 0.13642204 97 jmlr-2012-Regularization Techniques for Learning with Matrices

Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari

Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning

5 0.10993078 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

Author: Michael U. Gutmann, Aapo Hyvärinen

Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image

6 0.10603953 111 jmlr-2012-Structured Sparsity and Generalization

7 0.10380727 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

8 0.099756442 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

9 0.092638604 73 jmlr-2012-Multi-task Regression using Minimal Penalties

10 0.089541599 91 jmlr-2012-Plug-in Approach to Active Learning

11 0.085442252 59 jmlr-2012-Linear Regression With Random Projections

12 0.084697656 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

13 0.083177917 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

14 0.082958445 68 jmlr-2012-Minimax Manifold Estimation

15 0.079685688 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

16 0.078902118 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

17 0.070855699 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

18 0.068583801 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

19 0.067954659 13 jmlr-2012-Active Learning via Perfect Selective Classification

20 0.066926882 20 jmlr-2012-Analysis of a Random Forests Model


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.307), (1, 0.118), (2, -0.231), (3, 0.062), (4, -0.04), (5, -0.164), (6, 0.068), (7, 0.088), (8, 0.031), (9, -0.039), (10, 0.063), (11, -0.007), (12, 0.184), (13, 0.055), (14, -0.142), (15, -0.049), (16, 0.066), (17, -0.035), (18, 0.051), (19, 0.05), (20, 0.156), (21, 0.012), (22, -0.049), (23, -0.048), (24, 0.025), (25, 0.074), (26, -0.037), (27, -0.096), (28, -0.035), (29, 0.049), (30, 0.005), (31, 0.067), (32, 0.015), (33, 0.013), (34, 0.05), (35, -0.002), (36, -0.095), (37, -0.038), (38, 0.022), (39, 0.015), (40, 0.097), (41, 0.058), (42, 0.086), (43, -0.029), (44, 0.127), (45, 0.004), (46, -0.053), (47, 0.17), (48, -0.028), (49, 0.084)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95204997 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

2 0.62699389 82 jmlr-2012-On the Necessity of Irrelevant Variables

Author: David P. Helmbold, Philip M. Long

Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory

3 0.58934563 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi

Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression

4 0.57944012 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

Author: Tim van Erven, Mark D. Reid, Robert C. Williamson

Abstract: Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses. Keywords: mixability, multiclass, prediction with expert advice, proper loss, learning rates

5 0.57227671 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

6 0.54928696 97 jmlr-2012-Regularization Techniques for Learning with Matrices

7 0.52660692 111 jmlr-2012-Structured Sparsity and Generalization

8 0.49097082 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

9 0.47522303 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

10 0.47152027 73 jmlr-2012-Multi-task Regression using Minimal Penalties

11 0.46733597 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

12 0.46456021 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

13 0.45233914 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

14 0.43272653 91 jmlr-2012-Plug-in Approach to Active Learning

15 0.41737691 68 jmlr-2012-Minimax Manifold Estimation

16 0.40208942 59 jmlr-2012-Linear Regression With Random Projections

17 0.39417571 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

18 0.3791396 20 jmlr-2012-Analysis of a Random Forests Model

19 0.354435 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

20 0.34476385 13 jmlr-2012-Active Learning via Perfect Selective Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.021), (21, 0.049), (26, 0.055), (29, 0.044), (35, 0.022), (49, 0.027), (56, 0.017), (57, 0.018), (59, 0.01), (72, 0.316), (75, 0.057), (77, 0.011), (79, 0.021), (92, 0.135), (96, 0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7496618 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/√n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

2 0.53341228 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

3 0.5323081 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao

Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization

4 0.53040063 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Author: Sangkyun Lee, Stephen J. Wright

Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification

5 0.5275023 73 jmlr-2012-Multi-task Regression using Minimal Penalties

Author: Matthieu Solnon, Sylvain Arlot, Francis Bach

Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory

6 0.52170253 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

7 0.52087885 111 jmlr-2012-Structured Sparsity and Generalization

8 0.51982385 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

9 0.51875526 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

10 0.51858133 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

11 0.51824826 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

12 0.5177334 82 jmlr-2012-On the Necessity of Irrelevant Variables

13 0.51650143 103 jmlr-2012-Sampling Methods for the Nyström Method

14 0.51577348 13 jmlr-2012-Active Learning via Perfect Selective Classification

15 0.51522648 34 jmlr-2012-Dynamic Policy Programming

16 0.5139029 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

17 0.50929415 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

18 0.50479704 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

19 0.50430399 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

20 0.50422812 84 jmlr-2012-Online Submodular Minimization