jmlr jmlr2006 jmlr2006-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, National Taiwan University, Taipei 106, Taiwan. Editor: Greg Ridgeway. Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. [sent-10, score-0.183]
2 To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. [sent-13, score-0.346]
3 It considers a set of k individuals for which P(individual i beats individual j) = p_i/(p_i + p_j), (1) and p_i > 0 is the overall skill of individual i. [sent-19, score-0.631]
4 Suppose that the outcomes of all comparisons are independent and denote rij as the number of times that i beats j. [sent-20, score-0.181]
5 One can then estimate p_i by min_p l(p) subject to 0 ≤ p_j, j = 1, ..., k, and ∑_{j=1}^k p_j = 1. [sent-25, score-0.24]
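A minimal sketch of how model (1) can be fit in practice is given below; it uses the standard fixed-point iteration for the Bradley-Terry likelihood (update each skill from its win counts, then renormalize). Function and variable names, the count-matrix layout, and the stopping rule are our own choices, and every individual is assumed to play at least one game; this is an illustration, not the paper's code.

```python
import numpy as np

def bradley_terry(R, iters=500, tol=1e-10):
    """Fit classic Bradley-Terry skills from a count matrix R, where
    R[i, j] is the number of times individual i beats individual j."""
    k = R.shape[0]
    p = np.full(k, 1.0 / k)                 # start from uniform skills
    for _ in range(iters):
        p_new = np.empty(k)
        for i in range(k):
            wins = R[i].sum() - R[i, i]     # total wins of individual i
            denom = sum((R[i, j] + R[j, i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            p_new[i] = wins / denom
        p_new /= p_new.sum()                # keep the simplex constraint
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```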
6 (p_1^t, ..., p_{s−1}^t, ( ∑_{i: i≠s} r_si ) / ( ∑_{i: i≠s} (r_si + r_is)/(p_s^t + p_i^t) ), p_{s+1}^t, ..., p_k^t)^T, [sent-45, score-1.278]
7 This work estimates p_i = P(x in class i) by minimizing the (weighted) Kullback-Leibler (KL) distance between r̄_ij and µ_ij ≡ p_i/(p_i + p_j), min_p ∑_{i≠j} n_ij r̄_ij log(r̄_ij/µ_ij), whose solution satisfies 0 < p_j < 1 and f(p)_j = 0, j = 1, ..., k. [sent-59, score-0.697]
8 Furthermore, an interesting side-result is that from ∑_{j=1}^k p_j = 1 and (13), we obtain a point in R^k satisfying (k + 1) equations. [sent-65, score-0.169]
9 As {p^t}_{t=0}^∞ is in a compact set {p | 0 ≤ p_s ≤ 1, ∑_{j=1}^k p_j = 1}, there is at least one convergent subsequence. [sent-71, score-0.558]
10 If in certain games s beats s_1, s_1 beats s_2, ..., [sent-89, score-0.233]
11 , and s_t beats j, then p_s^∗, the skill of individual s, should not be as bad as zero. [sent-92, score-0.193]
12 The idea is that if p_j^∗ > 0, and s beats I_{s_0}^−, a subset of I_{s_0}^− beats I_{s_1}^−, a subset of I_{s_1}^− beats I_{s_2}^−, ..., [sent-113, score-0.321]
13 Whether the generalized model satisfies Assumption 1 or not, an easy way to fulfill it is to add an additional term −µ ∑_{s=1}^k log( p_s / ∑_{j=1}^k p_j ) (14) to l(p), where µ is a small positive number. [sent-121, score-0.687]
14 As ∑_{j=1}^k p_j = 1 is one of the constraints, (14) reduces to −µ ∑_{s=1}^k log p_s, which is usually used as a barrier term in optimization to ensure that p_s does not go to zero. [sent-126, score-0.99]
15 , k, one can show by induction that p_j^t > 0, ∀t, and hence the denominator of (7) is never zero: If p_j^t > 0, Assumption 1 implies that there is some i such that I_i^+ = {j} or I_i^− = {j}. [sent-131, score-0.82]
16 Then either ∑_{i: j∈I_i^+} r_i/q_i^{t,+} or ∑_{i: j∈I_i^−} r_i'/q_i^{t,−} is positive. [sent-132, score-0.7]
17 As the convergence proof will use the strictly decreasing result, we note that Assumption 1 implies the condition ∑_{i: s∈I_i} (r_i + r_i') > 0, ∀s, required by Theorem 1. [sent-140, score-0.35]
18 Though r_i in the Bradley-Terry model is an integer indicating the number of times that team I_i^+ beats I_i^−, in the convergence proof we do not use such a property. [sent-143, score-0.536]
19 Hence later for multi-class probability estimates, where ri is a real number, the convergence result still holds. [sent-144, score-0.372]
20 Given n_i, the number of training data with classes in I_i = I_i^+ ∪ I_i^−, we assume here that for any given data x, r̄_i = P(x in classes of I_i^+ | x in classes of I_i) (15) is available, and the task is to estimate P(x in class s), s = 1, ..., k. [sent-178, score-0.391]
21 We minimize the (weighted) KL distance between r̄_i and q_i^+/q_i, similar to Hastie and Tibshirani (1998): min_p ∑_{i=1}^m n_i [ r̄_i log( r̄_i/(q_i^+/q_i) ) + (1 − r̄_i) log( (1 − r̄_i)/(q_i^−/q_i) ) ]. (16) [sent-182, score-2.527]
22 By defining r_i ≡ n_i r̄_i and r_i' ≡ n_i (1 − r̄_i) (17) and removing constant terms, (16) reduces to (6), the negative log-likelihood of the generalized Bradley-Terry model. [sent-183, score-1.53]
23 This and (17) suggest that we can solve the problem by simply taking n_i = 1 and r_i + r_i' = 1, ∀i. [sent-219, score-0.741]
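Under this reduction, the multi-class estimation step is one run of the generalized Bradley-Terry iteration on the binary outputs. The sketch below implements a multiplicative update of the form discussed above (numerator accumulated over the games a class wins as part of I_i^+ or I_i^−, denominator over all games it participates in), followed by renormalization. Names, the data layout, and the convergence test are our own choices, and Assumption 1 is assumed so the denominator never vanishes.

```python
import numpy as np

def generalized_bt(I_plus, I_minus, r_plus, r_minus, k, iters=1000, tol=1e-8):
    """Generalized Bradley-Terry skills by cyclic multiplicative updates.
    I_plus[i], I_minus[i]: index lists of the two teams in game i.
    r_plus[i], r_minus[i]: (possibly real-valued) win counts of each side."""
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        p_old = p.copy()
        for s in range(k):
            num, den = 0.0, 0.0
            for i in range(len(I_plus)):
                qp = p[I_plus[i]].sum()          # q_i^+
                qm = p[I_minus[i]].sum()         # q_i^-
                if s in I_plus[i]:
                    num += r_plus[i] / qp
                elif s in I_minus[i]:
                    num += r_minus[i] / qm
                else:
                    continue
                den += (r_plus[i] + r_minus[i]) / (qp + qm)
            p[s] *= num / den                    # multiplicative factor
            p /= p.sum()                         # renormalize
        if np.abs(p - p_old).max() < tol:
            break
    return p
```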
24 Setting ∂l(p)/∂p_s = 0, ∀s, we have r_s/p_s − (1 − r_s)/(1 − p_s) = k − ∑_{j=1}^k (1 − r_j)/(1 − p_j). (18) [sent-221, score-0.24]
25 Since the right-hand side of (18) is the same for all s, we can denote it by δ. [sent-222, score-0.169]
26 If δ = 0, (18) implies p_s = r_s, ∀s, and hence ∑_{i=1}^k r_i = 1; if δ ≠ 0, (18) implies [sent-225, score-0.372]
27 p_s = ( (1 + δ) − √((1 + δ)² − 4 r_s δ) ) / (2δ). (19) In Appendix E we show that p_s defined in (19) satisfies 0 ≤ p_s ≤ 1. [sent-226, score-0.744]
28 Then the solution procedure is as follows: If ∑_{i=1}^k r_i = 1, the optimal p = [r_1, ..., r_k]^T. [sent-228, score-0.35]
29 This "one-against-the-rest" setting is special as we can directly prove the existence of a solution satisfying k + 1 equations: ∑_{s=1}^k p_s = 1 and ∂l(p)/∂p_s = 0, s = 1, ..., k. [sent-240, score-0.372]
30 From (19), if δ > 0, larger p_s implies smaller √((1 + δ)² − 4 r_s δ) and hence larger r_s. [sent-245, score-0.427]
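For the one-against-the-rest setting, the procedure above reduces to a one-dimensional root finding in δ. A sketch under our reading of (18)-(19) follows; the use of scipy.optimize.brentq and the bracketing interval are our own choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import brentq

def one_vs_rest_probs(r):
    """Solve the 'one-against-the-rest' case: p_s = r_s if sum(r) = 1,
    otherwise find delta so that the p_s of (19) sum to one."""
    r = np.asarray(r, dtype=float)
    if abs(r.sum() - 1.0) < 1e-12:
        return r.copy()

    def p_of(delta):
        return ((1 + delta) - np.sqrt((1 + delta) ** 2 - 4 * r * delta)) / (2 * delta)

    gap = lambda delta: p_of(delta).sum() - 1.0
    # sum(r) > 1 needs delta > 0 (which shrinks the p_s); sum(r) < 1 needs delta < 0
    delta = brentq(gap, 1e-9, 1e6) if r.sum() > 1 else brentq(gap, -1e6, -1e-9)
    return p_of(delta)
```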
31 By considering the same optimization problem (16), she proposes a heuristic updating rule p_s^{t+1} ≡ ( ∑_{i: s∈I_i^+} r_i + ∑_{i: s∈I_i^−} r_i' ) / ( ∑_{i: s∈I_i^+} n_i q_i^{t,+}/q_i^t + ∑_{i: s∈I_i^−} n_i q_i^{t,−}/q_i^t ) · p_s^t, (20) but does not provide a convergence proof. [sent-257, score-2.353]
32 Since we keep ∑_{i=1}^k p_i^t = 1 in the algorithm, with the property r_i + r_i' = 1, for this example the factor in the updating rule (20) is ( r_s + ∑_{i: i≠s} r_i' ) / ( p_s^t + ∑_{i: i≠s} (1 − p_i^t) ) = ( k − 1 + 2r_s − ∑_{i=1}^k r_i ) / ( k − 2 + 2p_s^t ). [sent-265, score-2.696]
33 , k}, ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) − 1 < 0. [sent-273, score-1.632]
34 , m, we generate r_i by adding some noise to q_i^+/q_i and then check if the proposed method obtains good probability estimates. [sent-308, score-0.954]
35 Since q_i^+/q_i of these three cases are different, it is difficult to have a fair way of adding noise. [sent-309, score-0.582]
36 Furthermore, various ECOC settings (described later) will also result in different q_i^+/q_i. [sent-310, score-0.6]
37 An "absolute" amount of noise: r_i = min(max(ε, q_i^+/q_i + 0.1·N(0,1)), 1 − ε), (23) and then r_i' = 1 − r_i. [sent-312, score-1.05]
38 ε = 10⁻⁷ is used so that all r_i, r_i' are positive. [sent-315, score-1.282]
39 A "relative" amount of noise: r_i = min(max(ε, (q_i^+/q_i)·(1 + 0.1·N(0,1))), 1 − ε). (24) [sent-318, score-1.282]
40 Since in general q_i^+ ≈ q_i^− for "1vs1" while q_i^+ ≪ q_i^− for "1vsrest," [sent-359, score-2.328]
41 adding 0.1·N(0,1) to q_i^+/q_i results in very inaccurate r_i for "1vsrest." [sent-360, score-0.932]
42 On the other hand, if using a relative way, the noise added to r_i and r_i' for "1vsrest" is smaller than that for "1vs1." [sent-361, score-0.7]
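A small helper reproducing the two ways of perturbing q_i^+/q_i, under our reading of (23)-(24); the 0.1 noise scale and the clipping to [ε, 1 − ε] are taken from the fragments above, while the interface is ours.

```python
import numpy as np

def noisy_r(q_ratio, relative=False, eps=1e-7, rng=np.random.default_rng(0)):
    """Return noisy outcomes (r, r') from the true ratios q_i^+/q_i."""
    noise = 0.1 * rng.standard_normal(np.shape(q_ratio))
    r = q_ratio * (1 + noise) if relative else q_ratio + noise
    r = np.clip(r, eps, 1 - eps)        # keep r_i and r_i' strictly positive
    return r, 1 - r
```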
43 Figure 2 gives the (relative) mean squared error (MSE): MSE = (1/500) ∑_{j=1}^{500} [ ∑_{i=1}^k (p̂_i^j − p_i)² / ∑_{i=1}^k p_i² ], (25) where p̂^j is the probability estimate obtained in the jth of the 500 replicates. [sent-370, score-0.355]
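The relative MSE of (25) is straightforward to compute; a minimal sketch (the array shapes are our assumption):

```python
import numpy as np

def relative_mse(p_hat, p_true):
    """p_hat: (replicates, k) estimates; p_true: (k,) true probabilities."""
    p_true = np.asarray(p_true)
    err = ((p_hat - p_true) ** 2).sum(axis=1)
    return (err / (p_true ** 2).sum()).mean()
```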
44 [Figure 3: Accuracy of predicting the true class by four encodings and (24) for generating noise; panels (a)-(c), x-axis log₂ k.] [sent-441, score-0.157]
45 An improved implementation (Lin et al., 2003) of (Platt, 2000) obtains r_i using SVM decision values. [sent-513, score-0.35]
46 Test error: (1/l) ∑_{j=1}^l ∑_{i=1}^k (I_{y_j=i} − p̂_i^j)², where l is the number of test data, p̂^j is the probability estimate of the jth data, y_j is the true class label, and I_{y_j=i} is an indicator function (1 if y_j = i and 0 otherwise). [sent-530, score-0.42]
47 This measurement (Brier, 1950), popular in meteorology, satisfies the following property: arg min_{p̂} E_Y[ ∑_{i=1}^k (I_{Y=i} − p̂_i)² ] ≡ arg min_{p̂} ∑_{i=1}^k (p̂_i − p_i)², where Y, a random variable for the class label, has the probability distribution p. [sent-531, score-0.164]
48 Log loss: −(1/l) ∑_{j=1}^l log p̂_{y_j}^j, where p̂^j is the probability estimate of the jth data and y_j is its actual class label. [sent-541, score-0.478]
49 It is another useful criterion when true probabilities are unknown: min_{p̂} E_Y[ −∑_{i=1}^k I_{Y=i} log p̂_i ] ≡ min_{p̂} −∑_{i=1}^k p_i log p̂_i has the minimum at p̂_i = p_i, i = 1, ..., k. [sent-542, score-0.527]
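Both accuracy-independent criteria are easy to compute from a matrix of predicted class probabilities; a sketch of the two measures as we read them (function names are ours):

```python
import numpy as np

def brier_score(P, y):
    """Mean squared distance between one-hot true labels and predictions P."""
    onehot = np.eye(P.shape[1])[y]
    return ((onehot - P) ** 2).sum(axis=1).mean()

def log_loss(P, y, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    true_prob = np.clip(P[np.arange(len(y)), y], eps, 1.0)
    return -np.log(true_prob).mean()
```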
50 [Figure 7: MSE by four encodings; (a) 300 training, 500 testing and (b) 800 training, 1000 testing; data sets: dna, waveform, satimage, segment, usps, mnist, letter.] [sent-585, score-0.781]
51 [Figure 8: Log loss by four encodings; (a) 300 training, 500 testing and (b) 800 training, 1000 testing; same data sets.] [sent-603, score-0.781]
52 We can extend the generalized Bradley-Terry model to this case: Define q̄_i ≡ ∑_{j: j∈I_i} w_ij p_j, q̄_i^+ ≡ ∑_{j: j∈I_i^+} w_ij p_j, q̄_i^− ≡ ∑_{j: j∈I_i^−} w_ij p_j, where w_ij > 0 is a given weight parameter reflecting individual j's position in the game between I_i^+ and I_i^−. [sent-608, score-2.558]
53 Here Algorithm 2 can still be applied but with the updating rule replaced by p_s^{t+1} = ( ∑_{i: s∈I_i^+} r_i w_is/q̄_i^{t,+} + ∑_{i: s∈I_i^−} r_i' w_is/q̄_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i') w_is/q̄_i^t ) · p_s^t, (26) which is derived similarly to (7) so that the multiplicative factor is equal to one when ∂l(p)/∂p_s = 0. [sent-610, score-2.147]
54 However, it may be harder to obtain global optimality: Case 1 in Theorem 4 still holds, but Case 2 may not, since q̄_i need not be equal to one (the proof of Case 2 requires q̄_i = 1, which is guaranteed by |I_i| = k). [sent-612, score-1.183]
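The weighted rule (26) changes only how the team skills are accumulated; below is a sketch of one sweep, with w[i][j] assumed to hold the weight of individual j in game i (the layout and names are ours, not the paper's).

```python
def weighted_bt_sweep(p, I_plus, I_minus, r_plus, r_minus, w):
    """One pass of the weighted-skill update (26); p is a list of skills,
    modified in place and renormalized after every component update."""
    for s in range(len(p)):
        num, den = 0.0, 0.0
        for i in range(len(I_plus)):
            qp = sum(w[i][j] * p[j] for j in I_plus[i])    # weighted q_i^+
            qm = sum(w[i][j] * p[j] for j in I_minus[i])   # weighted q_i^-
            if s in I_plus[i]:
                num += r_plus[i] * w[i][s] / qp
            elif s in I_minus[i]:
                num += r_minus[i] * w[i][s] / qm
            else:
                continue
            den += (r_plus[i] + r_minus[i]) * w[i][s] / (qp + qm)
        if den > 0:
            p[s] *= num / den
        total = sum(p)
        for j in range(len(p)):
            p[j] /= total
    return p
```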
55 Let r̄_i ≥ 0 and r̄_i' ≥ 0 be the number of times that I_i^+ wins and loses at home, respectively. [sent-618, score-0.7]
56 For I_i^+'s away games, we let r̃_i ≥ 0 and r̃_i' ≥ 0 be the number of times that I_i^+ wins and loses. [sent-619, score-0.7]
57 The minimization of the negative log-likelihood function thus becomes: min_{p,θ} l(p, θ) = −∑_{i=1}^m [ r̄_i log( θq_i^+/(θq_i^+ + q_i^−) ) + r̄_i' log( q_i^−/(θq_i^+ + q_i^−) ) + r̃_i log( q_i^+/(q_i^+ + θq_i^−) ) + r̃_i' log( θq_i^−/(q_i^+ + θq_i^−) ) ] under the constraints in (6) and the condition θ ≥ 0. [sent-620, score-4.036]
58 For each s, ∂l(p, θ)/∂p_s = 0 leads to the following rule:6 p_s^{t+1} = ( ∑_{i: s∈I_i^+} (r̄_i + r̃_i)/q_i^+ + ∑_{i: s∈I_i^−} (r̄_i' + r̃_i')/q_i^− ) / ( ∑_{i: s∈I_i^+} [ θ(r̄_i + r̄_i')/(θq_i^+ + q_i^−) + (r̃_i + r̃_i')/(q_i^+ + θq_i^−) ] + ∑_{i: s∈I_i^−} [ (r̄_i + r̄_i')/(θq_i^+ + q_i^−) + θ(r̃_i + r̃_i')/(q_i^+ + θq_i^−) ] ) · p_s^t. (27) [sent-622, score-4.13]
59 s (27) For θ, from ∂l(p, θ)/∂θ = 0, we have θt+1 = m r i=1 (¯i m i=1 + r r qi (¯i +¯i ) − + θt qi +qi + ri ) ˜ + − r r qi (˜i +˜i ) − + qi +θt qi . [sent-623, score-3.26]
60 For convenience, q_i^{t,+} (q_i^{t,−}) is abbreviated as q_i^+ (q_i^−). [sent-625, score-1.164]
61 Unlike the case of updating p^t, there is no need to normalize θ^{t+1}. [sent-629, score-0.437]
62 If ps is updated, we can slightly modify the proof of Theorem 1 and obtain the strict decrease of l(p, θ). [sent-634, score-0.372]
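Given fixed team skills, the home-field parameter update (28), as reconstructed above, is a single ratio; a sketch with vectorized per-game quantities (the names and array layout are our own assumptions):

```python
import numpy as np

def update_theta_home(theta, qp, qm, r_hw, r_hl, r_aw, r_al):
    """One update of the home-field advantage theta in the spirit of (28).
    qp, qm: team skills q_i^+, q_i^-; r_hw/r_hl: home wins/losses of I_i^+;
    r_aw/r_al: away wins/losses of I_i^+ (all arrays indexed by game i)."""
    num = (r_hw + r_al).sum()
    den = (qp * (r_hw + r_hl) / (theta * qp + qm)
           + qm * (r_aw + r_al) / (qp + theta * qm)).sum()
    return num / den
```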
63 Extending the model proposed in (Rao and Kupper, 1967), we consider: P(I_i^+ beats I_i^−) = q_i^+/(q_i^+ + θq_i^−), P(I_i^− beats I_i^+) = q_i^−/(θq_i^+ + q_i^−), and P(I_i^+ ties I_i^−) = (θ² − 1) q_i^+ q_i^− / ( (q_i^+ + θq_i^−)(θq_i^+ + q_i^−) ), where θ > 1 is a threshold parameter to be estimated. [sent-641, score-3.781]
64 Let t_i be the number of times that I_i^+ ties I_i^−, with r_i, r_i' defined as before. [sent-642, score-0.82]
65 For updating p_s, θ is considered as a constant and (29) is in the form of the home-field model, so the rule is similar to (27). [sent-644, score-0.437]
66 For updating θ, we have θ^{t+1} = 1/(2C_t) + √(1 + 1/(4C_t²)), (31) where C_t = ( ∑_{i=1}^m [ (r_i + t_i) q_i^−/(q_i^+ + θ^t q_i^−) + (r_i' + t_i) q_i^+/(θ^t q_i^+ + q_i^−) ] ) / ( 2 ∑_{i=1}^m t_i ). [sent-646, score-2.561]
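The threshold update (31) only needs the constant C_t; a sketch under the reconstruction above (names and array layout are ours):

```python
import numpy as np

def update_theta_ties(theta, qp, qm, r_plus, r_minus, t):
    """One update of the tie threshold theta > 1 via (31);
    t[i] is the number of ties in game i, qp/qm the team skills."""
    C = ((r_plus + t) * qm / (qp + theta * qm)
         + (r_minus + t) * qp / (theta * qp + qm)).sum() / (2.0 * t.sum())
    return 1.0 / (2 * C) + np.sqrt(1.0 + 1.0 / (4 * C ** 2))
```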
67 Pendergrass and Bradley (1960) proposed using P(i best, j in the middle, and k worst) = P(i beats j and k) · P(j beats k) = p_i/(p_i + (p_j + p_k)) · p_j/(p_j + p_k). [sent-653, score-0.454]
68 A general model introduced in (Plackett, 1975) is: P(a(1) → a(2) → ··· → a(k)) = ∏_{i=1}^k p_{a(i)}/( p_{a(i)} + p_{a(i+1)} + ··· + p_{a(k)} ), (32) where a(i), 1 ≤ i ≤ k, is the ith ranked individual and → denotes the relation "is ranked higher than." [sent-654, score-0.372]
69 We consider the model: P(I_m^1 → I_m^2 → ··· → I_m^{g_m}) = ∏_{i=1}^{g_m} ( ∑_{s: s∈I_m^i} p_s ) / ( ∑_{j=i}^{g_m} ∑_{s: s∈I_m^j} p_s ). (33) [sent-662, score-0.867]
70 Defining q_m^i = ∑_{s: s∈I_m^i} p_s, we minimize the negative log-likelihood function: min_p l(p) = −∑_{m=1}^N ∑_{i=1}^{g_m} log( q_m^i / ∑_{j=i}^{g_m} q_m^j ) (34) under the constraints in (6). [sent-663, score-0.6]
71 Each ranking can be viewed as the result of a series of paired team comparisons: the first ranked team beats the others, the second ranked team beats the others except the first, and so on; for each paired comparison, r_i = 1 and r_i' = 0. [sent-665, score-1.244]
72 Therefore, Algorithm 2 can be applied and the updating rule is: p_s^{t+1} = ( ∑_{j: s∈I_j} (q_j^{φ_j(s)})⁻¹ ) / ( ∑_{j: s∈I_j} ∑_{i=1}^{φ_j(s)} ( ∑_{v=i}^{g_j} q_j^v )⁻¹ ) · p_s^t, (35) where φ_j(s) is the rank of the team that individual s belongs to in the jth game and I_j = ∪_{i=1}^{g_j} I_j^i. [sent-666, score-0.664]
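For full rankings, rule (35) accumulates, for every game an individual takes part in, the inverse skill of its own team in the numerator and the inverse "remaining mass" of every stage up to its rank in the denominator. A sketch of one sweep follows (the data layout and names are ours; teams within a game are assumed disjoint, and p is a numpy array of skills).

```python
import numpy as np

def ranking_sweep(p, games):
    """One pass of rule (35); `games` is a list of rankings, each a list of
    index lists ordered from first place to last."""
    for s in range(len(p)):
        num, den = 0.0, 0.0
        for teams in games:
            q = np.array([p[list(team)].sum() for team in teams])  # q_j^i
            tails = np.cumsum(q[::-1])[::-1]        # sum_{v >= i} q_j^v
            for rank, team in enumerate(teams):
                if s in team:
                    num += 1.0 / q[rank]
                    den += (1.0 / tails[: rank + 1]).sum()
                    break
        if den > 0:
            p[s] *= num / den
        p /= p.sum()
    return p
```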
73 Discussion and Conclusions We propose a generalized Bradley-Terry model which gives individual skill from group competition results. [sent-680, score-0.155]
74 In particular, minimizing the negative log likelihood of the proposed model coincides with minimizing the KL distance for multi-class probability estimates under error correcting output codes. [sent-683, score-0.19]
75 Proof of Theorem 1. Define q_{i\s}^{t,+} ≡ ∑_{j∈I_i^+, j≠s} p_j^t and q_{i\s}^{t,−} ≡ ∑_{j∈I_i^−, j≠s} p_j^t. [sent-693, score-0.804]
76 Using −log x ≥ 1 − log y − x/y, with equality if and only if x = y, we have Q₁(p_s) ≥ l([p_1^t, ..., p_{s−1}^t, p_s, p_{s+1}^t, ..., p_k^t]^T). [sent-694, score-0.154]
77 With p_s^t > 0, log(q_{i\s}^{t,+} + p_s) = log( (q_{i\s}^{t,+}/q_i^{t,+})·1 + (p_s^t/q_i^{t,+})·(p_s/p_s^t) ) + log(q_i^{t,+}) ≥ (p_s^t/q_i^{t,+})(log p_s − log p_s^t) + log q_i^{t,+}, with equality if p_s = p_s^t. [sent-702, score-0.744]
78 As ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} > 0, Q₂(p_s) is a strictly convex function of p_s. [sent-710, score-0.372]
79 Setting ∂Q₂(p_s)/∂p_s = 0, that is, ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) p_s^t = ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) p_s, leads to the updating rule. [sent-711, score-0.35]
80 Thus, if p_s^{t+1} ≠ p_s^t, then l(p^{t+1}) ≤ Q₂(p_s^{t+1}) < Q₂(p_s^t) = l(p^t). [sent-712, score-0.402]
81 Proof of Lemma 2. If the result does not hold, there is an index s̄ and an infinite index set T such that lim_{t∈T, t→∞} p_{s̄}^t = p_{s̄}^∗ = 0. [sent-714, score-0.816]
82 Since ∑_{s=1}^k p_s^t = 1, ∑_{s=1}^k lim_{t∈T, t→∞} p_s^t = 1; thus, there is an index j such that lim_{t∈T, t→∞} p_j^t = p_j^∗ > 0. [sent-716, score-0.888]
83 If this claim is wrong, then l(p^t) = −∑_{i=1}^m [ r_i log(q_i^{t,+}/q_i^t) + r_i' log(q_i^{t,−}/q_i^t) ] ≥ −r_{s_0} log( q_{s_0}^{t,+}/q_{s_0}^t ) = −r_{s_0} log( p_{s̄}^t/(p_{s̄}^t + ∑_{u∈I_{s_0}^−} p_u^t) ) → ∞ when t ∈ T, t → ∞. [sent-720, score-3.96]
84 Proof of Theorem 3. Recall that we assume lim_{t∈K, t→∞} p^t = p^∗. [sent-727, score-0.402]
85 For each pt , t ∈ K, there is a corresponding index s for the updating rule (7). [sent-728, score-0.437]
86 Without loss of generality, we assume that all pt , t ∈ K have the same corresponding s. [sent-733, score-0.402]
87 As p_{s̄}^∗ > 0, by applying one iteration of Algorithm 2 on p^∗, and using Theorem 1, we obtain p^{∗+1} ≠ p^∗ and l(p^{∗+1}) < l(p^∗). [sent-744, score-0.744]
88 Thus, at the t̄th iteration, lim_{t∈K, t→∞} p_s^{t+1} = lim_{t∈K, t→∞} ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) · p_s^t = ( ∑_{i: s∈I_i^+} r_i/q_i^{∗,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{∗,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^∗ ) · p_s^∗ = p_s^∗, and hence lim_{t∈K, t→∞} p^{t+1} = lim_{t∈K, t→∞} p^t = p^∗. [sent-746, score-4.236]
89 ¯ Assume s corresponds to the tth iteration, by a similar derivation, ¯ lim t∈K,t→∞ pt+1 = · · · = lim t∈K,t→∞ ¯ pt = p∗ and ¯ lim t∈K,t→∞ pt+1 = p∗+1 . [sent-747, score-0.528]
90 For the second case, as q_i = ∑_{j=1}^k p_j = 1, l(p) can be reformulated as l(p) = −∑_{i=1}^m ( r_i log q_i^+ + r_i' log q_i^− ), (38) which is a convex function of p. [sent-753, score-2.419]
91 From Assumption 1, for each p_s, there is i such that either I_i^+ = {s} with r_i > 0, or I_i^− = {s} with r_i' > 0. [sent-756, score-0.869]
92 Therefore, either −r_i log p_s or −r_i' log p_s appears in (38). [sent-757, score-0.898]
93 Since they are strictly convex functions of p_s, the summation over all s = 1, ..., k is strictly convex. [sent-758, score-0.372]
94 Solving Nonlinear Equations for the "One-against-the-rest" Approach. We show that if δ ≠ 0, p_s defined in (19) satisfies 0 ≤ p_s ≤ 1. [sent-765, score-0.744]
95 If δ > 0, then (1 + δ) ≥ √((1 + δ)² − 4 r_s δ), so p_s ≥ 0. [sent-766, score-0.372]
96 Using 0 < δ < 1 and r_s ≤ 1, p_s = ( (1 + δ) − √((1 + δ)² − 4 r_s δ) ) / (2δ) and −√((1 + δ)² − 4 r_s δ) ≤ −(1 − δ). [sent-775, score-0.372]
97 Adding 1 + δ to both sides and dividing by 2δ leads to p_s ≤ 1. [sent-781, score-0.372]
98 To find δ by solving ∑_{s=1}^k p_s − 1 = 0, the discontinuity at δ = 0 is a concern; a direct calculation shows lim_{δ→0} p_s = r_s. [sent-782, score-0.414]
99 Terms of l(p^t, θ) related to θ: −∑_{i=1}^m [ t_i log(θ² − 1) − (r_i + t_i) log(q_i^+ + θq_i^−) − (r_i' + t_i) log(θq_i^+ + q_i^−) ] ≤ −∑_{i=1}^m [ t_i log(θ² − 1) − (r_i + t_i)( log(q_i^+ + θ^t q_i^−) + (q_i^+ + θq_i^−)/(q_i^+ + θ^t q_i^−) − 1 ) − (r_i' + t_i)( log(θ^t q_i^+ + q_i^−) + (θq_i^+ + q_i^−)/(θ^t q_i^+ + q_i^−) − 1 ) ] ≡ Q(θ). [sent-796, score-6.216]
100 Then Q'(θ) = 0 implies ∑_{i=1}^m [ 2θ t_i/(θ² − 1) − (r_i + t_i) q_i^−/(q_i^+ + θ^t q_i^−) − (r_i' + t_i) q_i^+/(θ^t q_i^+ + q_i^−) ] = 0, and hence θ^{t+1} is defined as in (31). [sent-797, score-2.526]
wordName wordTfidf (topN-words)
[('qi', 0.582), ('pt', 0.402), ('ps', 0.372), ('ri', 0.35), ('pj', 0.169), ('ii', 0.168), ('beats', 0.107), ('weng', 0.1), ('ij', 0.095), ('huang', 0.083), ('mse', 0.082), ('log', 0.077), ('satimage', 0.076), ('ecoc', 0.072), ('pi', 0.071), ('ti', 0.066), ('encodings', 0.063), ('team', 0.058), ('paired', 0.056), ('rs', 0.055), ('rij', 0.054), ('ties', 0.054), ('dna', 0.054), ('hunter', 0.053), ('skill', 0.053), ('lin', 0.052), ('letter', 0.051), ('mnist', 0.051), ('game', 0.051), ('waveform', 0.049), ('generalized', 0.048), ('usps', 0.047), ('taiwan', 0.044), ('lim', 0.042), ('ni', 0.041), ('legend', 0.041), ('allwein', 0.041), ('gm', 0.041), ('im', 0.039), ('wij', 0.038), ('correcting', 0.038), ('segment', 0.038), ('wu', 0.037), ('brier', 0.036), ('isr', 0.036), ('rsi', 0.036), ('updating', 0.035), ('individual', 0.033), ('estimates', 0.032), ('teams', 0.031), ('coding', 0.028), ('erent', 0.027), ('zadrozny', 0.027), ('marked', 0.026), ('ist', 0.025), ('rk', 0.025), ('individuals', 0.023), ('qm', 0.023), ('eld', 0.023), ('ranked', 0.022), ('jth', 0.022), ('probability', 0.022), ('kl', 0.022), ('model', 0.021), ('dense', 0.021), ('figures', 0.021), ('gj', 0.021), ('home', 0.021), ('qj', 0.021), ('stationary', 0.02), ('multiplicative', 0.02), ('comparisons', 0.02), ('global', 0.019), ('yj', 0.019), ('sports', 0.019), ('taipei', 0.019), ('games', 0.019), ('bradley', 0.019), ('settings', 0.018), ('dashdot', 0.018), ('davidson', 0.018), ('iyj', 0.018), ('minorizing', 0.018), ('nsc', 0.018), ('pendergrass', 0.018), ('ruby', 0.018), ('simons', 0.018), ('wis', 0.018), ('probabilities', 0.018), ('decoding', 0.017), ('hastie', 0.017), ('globally', 0.017), ('pk', 0.017), ('convergent', 0.017), ('di', 0.017), ('four', 0.017), ('denominator', 0.016), ('testing', 0.016), ('unique', 0.016), ('earlier', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
2 0.15625776 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
3 0.10397349 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
4 0.089326434 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
5 0.08426223 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation
7 0.077348657 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
8 0.074955076 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
9 0.066200495 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
10 0.048994463 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
11 0.040147338 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
12 0.039419342 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.037802506 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
14 0.035560008 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
15 0.0327994 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
16 0.031714838 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.029057257 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
19 0.029005339 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
20 0.028121246 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(0, 0.165), (1, 0.047), (2, -0.102), (3, -0.007), (4, -0.039), (5, 0.192), (6, 0.156), (7, 0.009), (8, 0.238), (9, -0.172), (10, 0.129), (11, -0.223), (12, -0.199), (13, -0.347), (14, -0.167), (15, -0.007), (16, -0.027), (17, 0.04), (18, 0.178), (19, 0.159), (20, -0.084), (21, 0.143), (22, -0.072), (23, 0.022), (24, -0.071), (25, 0.044), (26, 0.081), (27, -0.144), (28, 0.074), (29, 0.016), (30, 0.013), (31, -0.075), (32, 0.046), (33, -0.032), (34, -0.04), (35, 0.022), (36, 0.033), (37, 0.07), (38, 0.066), (39, 0.057), (40, 0.012), (41, -0.021), (42, -0.043), (43, -0.039), (44, 0.044), (45, 0.024), (46, 0.03), (47, 0.003), (48, 0.106), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.98618549 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
2 0.62223208 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
Author: Chen Yanover, Talya Meltzer, Yair Weiss
Abstract: The problem of finding the most probable (MAP) configuration in graphical models comes up in a wide range of applications. In a general graphical model this problem is NP hard, but various approximate algorithms have been developed. Linear programming (LP) relaxations are a standard method in computer science for approximating combinatorial problems and have been used for finding the most probable assignment in small graphical models. However, applying this powerful method to real-world problems is extremely challenging due to the large numbers of variables and constraints in the linear program. Tree-Reweighted Belief Propagation is a promising recent algorithm for solving LP relaxations, but little is known about its running time on large problems. In this paper we compare tree-reweighted belief propagation (TRBP) and powerful generalpurpose LP solvers (CPLEX) on relaxations of real-world graphical models from the fields of computer vision and computational biology. We find that TRBP almost always finds the solution significantly faster than all the solvers in CPLEX and more importantly, TRBP can be applied to large scale problems for which the solvers in CPLEX cannot be applied. Using TRBP we can find the MAP configurations in a matter of minutes for a large range of real world problems.
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
5 0.32687539 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
6 0.29698035 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
7 0.27219003 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
8 0.21270677 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.21255697 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
10 0.20511851 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
11 0.19617553 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
12 0.19012004 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.17343563 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
14 0.14605287 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
15 0.14532577 53 jmlr-2006-Learning a Hidden Hypergraph
16 0.1448039 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
17 0.13771884 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
18 0.13364689 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
19 0.12429978 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs (Special Topic on Machine Learning and Optimization)
20 0.12273076 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
topicId topicWeight
[(8, 0.018), (35, 0.018), (36, 0.077), (45, 0.028), (50, 0.035), (63, 0.02), (76, 0.023), (78, 0.021), (81, 0.057), (83, 0.429), (84, 0.019), (90, 0.028), (91, 0.041), (96, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.74554574 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
3 0.28356236 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
Author: Francis R. Bach, David Heckerman, Eric Horvitz
Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification
4 0.27867261 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
5 0.27673176 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
7 0.27638716 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
8 0.27634808 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
9 0.27634302 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
10 0.27587458 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
11 0.27585679 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
12 0.27454576 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
13 0.27408695 70 jmlr-2006-Online Passive-Aggressive Algorithms
15 0.27132648 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
16 0.27014032 53 jmlr-2006-Learning a Hidden Hypergraph
18 0.26933846 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
19 0.2691775 88 jmlr-2006-Streamwise Feature Selection
20 0.26905981 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting