jmlr jmlr2006 jmlr2006-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, National Taiwan University, Taipei 106, Taiwan. Editor: Greg Ridgeway. Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. [sent-10, score-0.183]
2 To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. [sent-13, score-0.346]
3 It considers a set of k individuals for which P(individual i beats individual j) = p_i/(p_i + p_j), (1) and p_i > 0 is the overall skill of individual i. [sent-19, score-0.631]
4 Suppose that the outcomes of all comparisons are independent and denote rij as the number of times that i beats j. [sent-20, score-0.181]
5 One can then estimate p_i by min_p l(p) subject to 0 ≤ p_j, j = 1, ..., k, and ∑_{j=1}^k p_j = 1. [sent-25, score-0.24]
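A minimal sketch of how model (1) can be fit in practice is given below; it uses the standard fixed-point iteration for the Bradley-Terry likelihood (update each skill from its win counts, then renormalize). Function and variable names, the count-matrix layout, and the stopping rule are our own choices, and every individual is assumed to play at least one game; this is an illustration, not the paper's code.

```python
import numpy as np

def bradley_terry(R, iters=500, tol=1e-10):
    """Fit classic Bradley-Terry skills from a count matrix R, where
    R[i, j] is the number of times individual i beats individual j."""
    k = R.shape[0]
    p = np.full(k, 1.0 / k)                 # start from uniform skills
    for _ in range(iters):
        p_new = np.empty(k)
        for i in range(k):
            wins = R[i].sum() - R[i, i]     # total wins of individual i
            denom = sum((R[i, j] + R[j, i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            p_new[i] = wins / denom
        p_new /= p_new.sum()                # keep the simplex constraint
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```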
6 (p_1^t, ..., p_{s−1}^t, ( ∑_{i: i≠s} r_si ) / ( ∑_{i: i≠s} (r_si + r_is)/(p_s^t + p_i^t) ), p_{s+1}^t, ..., p_k^t)^T, [sent-45, score-1.278]
7 This work estimates p_i = P(x in class i) by minimizing the (weighted) Kullback-Leibler (KL) distance between r̄_ij and µ_ij ≡ p_i/(p_i + p_j), min_p ∑_{i≠j} n_ij r̄_ij log(r̄_ij/µ_ij), whose solution satisfies 0 < p_j < 1 and f(p)_j = 0, j = 1, ..., k. [sent-59, score-0.697]
8 Furthermore, an interesting side-result is that from ∑_{j=1}^k p_j = 1 and (13), we obtain a point in R^k satisfying (k + 1) equations. [sent-65, score-0.169]
9 As {p^t}_{t=0}^∞ is in a compact set {p | 0 ≤ p_s ≤ 1, ∑_{j=1}^k p_j = 1}, there is at least one convergent subsequence. [sent-71, score-0.558]
10 If in certain games s beats s_1, s_1 beats s_2, ..., [sent-89, score-0.233]
11 , and s_t beats j, then p_s^∗, the skill of individual s, should not be as bad as zero. [sent-92, score-0.193]
12 The idea is that if p_j^∗ > 0, and s beats I_{s_0}^−, a subset of I_{s_0}^− beats I_{s_1}^−, a subset of I_{s_1}^− beats I_{s_2}^−, ..., [sent-113, score-0.321]
13 Whether the generalized model satisfies Assumption 1 or not, an easy way to fulfill it is to add an additional term −µ ∑_{s=1}^k log( p_s / ∑_{j=1}^k p_j ) (14) to l(p), where µ is a small positive number. [sent-121, score-0.687]
14 As ∑_{j=1}^k p_j = 1 is one of the constraints, (14) reduces to −µ ∑_{s=1}^k log p_s, which is usually used as a barrier term in optimization to ensure that p_s does not go to zero. [sent-126, score-0.99]
15 , k, one can show by induction that p_j^t > 0, ∀t, and hence the denominator of (7) is never zero: If p_j^t > 0, Assumption 1 implies that there is some i such that I_i^+ = {j} or I_i^− = {j}. [sent-131, score-0.82]
16 Then either ∑_{i: j∈I_i^+} r_i/q_i^{t,+} or ∑_{i: j∈I_i^−} r_i'/q_i^{t,−} is positive. [sent-132, score-0.7]
17 As the convergence proof will use the strictly decreasing result, we note that Assumption 1 implies the condition ∑_{i: s∈I_i} (r_i + r_i') > 0, ∀s, required by Theorem 1. [sent-140, score-0.35]
18 Though r_i in the Bradley-Terry model is an integer indicating the number of times that team I_i^+ beats I_i^−, in the convergence proof we do not use such a property. [sent-143, score-0.536]
19 Hence later for multi-class probability estimates, where ri is a real number, the convergence result still holds. [sent-144, score-0.372]
20 Given n_i, the number of training data with classes in I_i = I_i^+ ∪ I_i^−, we assume here that for any given data x, r̄_i = P(x in classes of I_i^+ | x in classes of I_i) (15) is available, and the task is to estimate P(x in class s), s = 1, ..., k. [sent-178, score-0.391]
21 We minimize the (weighted) KL distance between r̄_i and q_i^+/q_i, similar to Hastie and Tibshirani (1998): min_p ∑_{i=1}^m n_i [ r̄_i log( r̄_i/(q_i^+/q_i) ) + (1 − r̄_i) log( (1 − r̄_i)/(q_i^−/q_i) ) ]. (16) [sent-182, score-2.527]
22 By defining r_i ≡ n_i r̄_i and r_i' ≡ n_i (1 − r̄_i) (17) and removing constant terms, (16) reduces to (6), the negative log-likelihood of the generalized Bradley-Terry model. [sent-183, score-1.53]
23 This and (17) suggest that we can solve the problem by simply taking n_i = 1 and r_i + r_i' = 1, ∀i. [sent-219, score-0.741]
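Under this reduction, the multi-class estimation step is one run of the generalized Bradley-Terry iteration on the binary outputs. The sketch below implements a multiplicative update of the form discussed above (numerator accumulated over the games a class wins as part of I_i^+ or I_i^−, denominator over all games it participates in), followed by renormalization. Names, the data layout, and the convergence test are our own choices, and Assumption 1 is assumed so the denominator never vanishes.

```python
import numpy as np

def generalized_bt(I_plus, I_minus, r_plus, r_minus, k, iters=1000, tol=1e-8):
    """Generalized Bradley-Terry skills by cyclic multiplicative updates.
    I_plus[i], I_minus[i]: index lists of the two teams in game i.
    r_plus[i], r_minus[i]: (possibly real-valued) win counts of each side."""
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        p_old = p.copy()
        for s in range(k):
            num, den = 0.0, 0.0
            for i in range(len(I_plus)):
                qp = p[I_plus[i]].sum()          # q_i^+
                qm = p[I_minus[i]].sum()         # q_i^-
                if s in I_plus[i]:
                    num += r_plus[i] / qp
                elif s in I_minus[i]:
                    num += r_minus[i] / qm
                else:
                    continue
                den += (r_plus[i] + r_minus[i]) / (qp + qm)
            p[s] *= num / den                    # multiplicative factor
            p /= p.sum()                         # renormalize
        if np.abs(p - p_old).max() < tol:
            break
    return p
```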
24 Setting ∂l(p)/∂p_s = 0, ∀s, we have r_s/p_s − (1 − r_s)/(1 − p_s) = k − ∑_{j=1}^k (1 − r_j)/(1 − p_j). (18) [sent-221, score-0.24]
25 Since the right-hand side of (18) is the same for all s, we can denote it by δ. [sent-222, score-0.169]
26 If δ = 0, (18) implies p_s = r_s, ∀s, and hence ∑_{i=1}^k r_i = 1; if δ ≠ 0, (18) implies [sent-225, score-0.372]
27 p_s = ( (1 + δ) − √((1 + δ)² − 4 r_s δ) ) / (2δ). (19) In Appendix E we show that p_s defined in (19) satisfies 0 ≤ p_s ≤ 1. [sent-226, score-0.744]
28 Then the solution procedure is as follows: If ∑_{i=1}^k r_i = 1, the optimal p = [r_1, ..., r_k]^T. [sent-228, score-0.35]
29 This "one-against-the-rest" setting is special as we can directly prove the existence of a solution satisfying k + 1 equations: ∑_{s=1}^k p_s = 1 and ∂l(p)/∂p_s = 0, s = 1, ..., k. [sent-240, score-0.372]
30 From (19), if δ > 0, larger p_s implies smaller √((1 + δ)² − 4 r_s δ) and hence larger r_s. [sent-245, score-0.427]
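For the one-against-the-rest setting, the procedure above reduces to a one-dimensional root finding in δ. A sketch under our reading of (18)-(19) follows; the use of scipy.optimize.brentq and the bracketing interval are our own choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import brentq

def one_vs_rest_probs(r):
    """Solve the 'one-against-the-rest' case: p_s = r_s if sum(r) = 1,
    otherwise find delta so that the p_s of (19) sum to one."""
    r = np.asarray(r, dtype=float)
    if abs(r.sum() - 1.0) < 1e-12:
        return r.copy()

    def p_of(delta):
        return ((1 + delta) - np.sqrt((1 + delta) ** 2 - 4 * r * delta)) / (2 * delta)

    gap = lambda delta: p_of(delta).sum() - 1.0
    # sum(r) > 1 needs delta > 0 (which shrinks the p_s); sum(r) < 1 needs delta < 0
    delta = brentq(gap, 1e-9, 1e6) if r.sum() > 1 else brentq(gap, -1e6, -1e-9)
    return p_of(delta)
```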
31 By considering the same optimization problem (16), she proposes a heuristic updating rule p_s^{t+1} ≡ ( ∑_{i: s∈I_i^+} r_i + ∑_{i: s∈I_i^−} r_i' ) / ( ∑_{i: s∈I_i^+} n_i q_i^{t,+}/q_i^t + ∑_{i: s∈I_i^−} n_i q_i^{t,−}/q_i^t ) · p_s^t, (20) but does not provide a convergence proof. [sent-257, score-2.353]
32 Since we keep ∑_{i=1}^k p_i^t = 1 in the algorithm, with the property r_i + r_i' = 1, for this example the factor in the updating rule (20) is ( r_s + ∑_{i: i≠s} r_i' ) / ( p_s^t + ∑_{i: i≠s} (1 − p_i^t) ) = ( k − 1 + 2r_s − ∑_{i=1}^k r_i ) / ( k − 2 + 2p_s^t ). [sent-265, score-2.696]
33 , k}, ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) − 1 < 0. [sent-273, score-1.632]
34 , m, we generate r_i by adding some noise to q_i^+/q_i and then check if the proposed method obtains good probability estimates. [sent-308, score-0.954]
35 Since q_i^+/q_i of these three cases are different, it is difficult to have a fair way of adding noise. [sent-309, score-0.582]
36 Furthermore, various ECOC settings (described later) will also result in different q_i^+/q_i. [sent-310, score-0.6]
37 An "absolute" amount of noise: r_i = min(max(ε, q_i^+/q_i + 0.1·N(0,1)), 1 − ε), (23) and then r_i' = 1 − r_i. [sent-312, score-1.05]
38 ε = 10⁻⁷ is used so that all r_i, r_i' are positive. [sent-315, score-1.282]
39 A "relative" amount of noise: r_i = min(max(ε, (q_i^+/q_i)·(1 + 0.1·N(0,1))), 1 − ε). (24) [sent-318, score-1.282]
40 Since in general q_i^+ ≈ q_i^− for "1vs1" while q_i^+ ≪ q_i^− for "1vsrest," [sent-359, score-2.328]
41 adding 0.1·N(0,1) to q_i^+/q_i results in very inaccurate r_i for "1vsrest." [sent-360, score-0.932]
42 On the other hand, if using a relative way, the noise added to r_i and r_i' for "1vsrest" is smaller than that for "1vs1." [sent-361, score-0.7]
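A small helper reproducing the two ways of perturbing q_i^+/q_i, under our reading of (23)-(24); the 0.1 noise scale and the clipping to [ε, 1 − ε] are taken from the fragments above, while the interface is ours.

```python
import numpy as np

def noisy_r(q_ratio, relative=False, eps=1e-7, rng=np.random.default_rng(0)):
    """Return noisy outcomes (r, r') from the true ratios q_i^+/q_i."""
    noise = 0.1 * rng.standard_normal(np.shape(q_ratio))
    r = q_ratio * (1 + noise) if relative else q_ratio + noise
    r = np.clip(r, eps, 1 - eps)        # keep r_i and r_i' strictly positive
    return r, 1 - r
```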
43 Figure 2 gives the (relative) mean squared error (MSE): MSE = (1/500) ∑_{j=1}^{500} [ ∑_{i=1}^k (p̂_i^j − p_i)² / ∑_{i=1}^k p_i² ], (25) where p̂^j is the probability estimate obtained in the jth of the 500 replicates. [sent-370, score-0.355]
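The relative MSE of (25) is straightforward to compute; a minimal sketch (the array shapes are our assumption):

```python
import numpy as np

def relative_mse(p_hat, p_true):
    """p_hat: (replicates, k) estimates; p_true: (k,) true probabilities."""
    p_true = np.asarray(p_true)
    err = ((p_hat - p_true) ** 2).sum(axis=1)
    return (err / (p_true ** 2).sum()).mean()
```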
44 [Figure 3: Accuracy of predicting the true class by four encodings and (24) for generating noise; panels (a)-(c), x-axis log₂ k.] [sent-441, score-0.157]
45 An improved implementation (Lin et al., 2003) of (Platt, 2000) obtains r_i using SVM decision values. [sent-513, score-0.35]
46 Test error: (1/l) ∑_{j=1}^l ∑_{i=1}^k (I_{y_j=i} − p̂_i^j)², where l is the number of test data, p̂^j is the probability estimate of the jth data, y_j is the true class label, and I_{y_j=i} is an indicator function (1 if y_j = i and 0 otherwise). [sent-530, score-0.42]
47 This measurement (Brier, 1950), popular in meteorology, satisfies the following property: arg min_{p̂} E_Y[ ∑_{i=1}^k (I_{Y=i} − p̂_i)² ] ≡ arg min_{p̂} ∑_{i=1}^k (p̂_i − p_i)², where Y, a random variable for the class label, has the probability distribution p. [sent-531, score-0.164]
48 Log loss: −(1/l) ∑_{j=1}^l log p̂_{y_j}^j, where p̂^j is the probability estimate of the jth data and y_j is its actual class label. [sent-541, score-0.478]
49 It is another useful criterion when true probabilities are unknown: min_{p̂} E_Y[ −∑_{i=1}^k I_{Y=i} log p̂_i ] ≡ min_{p̂} −∑_{i=1}^k p_i log p̂_i has the minimum at p̂_i = p_i, i = 1, ..., k. [sent-542, score-0.527]
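Both accuracy-independent criteria are easy to compute from a matrix of predicted class probabilities; a sketch of the two measures as we read them (function names are ours):

```python
import numpy as np

def brier_score(P, y):
    """Mean squared distance between one-hot true labels and predictions P."""
    onehot = np.eye(P.shape[1])[y]
    return ((onehot - P) ** 2).sum(axis=1).mean()

def log_loss(P, y, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    true_prob = np.clip(P[np.arange(len(y)), y], eps, 1.0)
    return -np.log(true_prob).mean()
```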
50 [Figure 7: MSE by four encodings; (a) 300 training, 500 testing and (b) 800 training, 1000 testing; data sets: dna, waveform, satimage, segment, usps, mnist, letter.] [sent-585, score-0.781]
51 [Figure 8: Log loss by four encodings; (a) 300 training, 500 testing and (b) 800 training, 1000 testing; same data sets.] [sent-603, score-0.781]
52 We can extend the generalized Bradley-Terry model to this case: Define q̄_i ≡ ∑_{j: j∈I_i} w_ij p_j, q̄_i^+ ≡ ∑_{j: j∈I_i^+} w_ij p_j, q̄_i^− ≡ ∑_{j: j∈I_i^−} w_ij p_j, where w_ij > 0 is a given weight parameter reflecting individual j's position in the game between I_i^+ and I_i^−. [sent-608, score-2.558]
53 Here Algorithm 2 can still be applied but with the updating rule replaced by p_s^{t+1} = ( ∑_{i: s∈I_i^+} r_i w_is/q̄_i^{t,+} + ∑_{i: s∈I_i^−} r_i' w_is/q̄_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i') w_is/q̄_i^t ) · p_s^t, (26) which is derived similarly to (7) so that the multiplicative factor is equal to one when ∂l(p)/∂p_s = 0. [sent-610, score-2.147]
54 However, it may be harder to obtain global optimality: Case 1 in Theorem 4 still holds, but Case 2 may not, since q̄_i need not be equal to one (the proof of Case 2 requires q̄_i = 1, which is guaranteed by |I_i| = k). [sent-612, score-1.183]
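The weighted rule (26) changes only how the team skills are accumulated; below is a sketch of one sweep, with w[i][j] assumed to hold the weight of individual j in game i (the layout and names are ours, not the paper's).

```python
def weighted_bt_sweep(p, I_plus, I_minus, r_plus, r_minus, w):
    """One pass of the weighted-skill update (26); p is a list of skills,
    modified in place and renormalized after every component update."""
    for s in range(len(p)):
        num, den = 0.0, 0.0
        for i in range(len(I_plus)):
            qp = sum(w[i][j] * p[j] for j in I_plus[i])    # weighted q_i^+
            qm = sum(w[i][j] * p[j] for j in I_minus[i])   # weighted q_i^-
            if s in I_plus[i]:
                num += r_plus[i] * w[i][s] / qp
            elif s in I_minus[i]:
                num += r_minus[i] * w[i][s] / qm
            else:
                continue
            den += (r_plus[i] + r_minus[i]) * w[i][s] / (qp + qm)
        if den > 0:
            p[s] *= num / den
        total = sum(p)
        for j in range(len(p)):
            p[j] /= total
    return p
```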
55 Let r̄_i ≥ 0 and r̄_i' ≥ 0 be the number of times that I_i^+ wins and loses at home, respectively. [sent-618, score-0.7]
56 For I_i^+'s away games, we let r̃_i ≥ 0 and r̃_i' ≥ 0 be the number of times that I_i^+ wins and loses. [sent-619, score-0.7]
57 The minimization of the negative log-likelihood function thus becomes: min_{p,θ} l(p, θ) = −∑_{i=1}^m [ r̄_i log( θq_i^+/(θq_i^+ + q_i^−) ) + r̄_i' log( q_i^−/(θq_i^+ + q_i^−) ) + r̃_i log( q_i^+/(q_i^+ + θq_i^−) ) + r̃_i' log( θq_i^−/(q_i^+ + θq_i^−) ) ] under the constraints in (6) and the condition θ ≥ 0. [sent-620, score-4.036]
58 For each s, ∂l(p, θ)/∂p_s = 0 leads to the following rule:6 p_s^{t+1} = ( ∑_{i: s∈I_i^+} (r̄_i + r̃_i)/q_i^+ + ∑_{i: s∈I_i^−} (r̄_i' + r̃_i')/q_i^− ) / ( ∑_{i: s∈I_i^+} [ θ(r̄_i + r̄_i')/(θq_i^+ + q_i^−) + (r̃_i + r̃_i')/(q_i^+ + θq_i^−) ] + ∑_{i: s∈I_i^−} [ (r̄_i + r̄_i')/(θq_i^+ + q_i^−) + θ(r̃_i + r̃_i')/(q_i^+ + θq_i^−) ] ) · p_s^t. (27) [sent-622, score-4.13]
59 s (27) For θ, from ∂l(p, θ)/∂θ = 0, we have θt+1 = m r i=1 (¯i m i=1 + r r qi (¯i +¯i ) − + θt qi +qi + ri ) ˜ + − r r qi (˜i +˜i ) − + qi +θt qi . [sent-623, score-3.26]
60 For convenience, q_i^{t,+} (q_i^{t,−}) is abbreviated as q_i^+ (q_i^−). [sent-625, score-1.164]
61 Unlike the case of updating p^t, there is no need to normalize θ^{t+1}. [sent-629, score-0.437]
62 If ps is updated, we can slightly modify the proof of Theorem 1 and obtain the strict decrease of l(p, θ). [sent-634, score-0.372]
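Given fixed team skills, the home-field parameter update (28), as reconstructed above, is a single ratio; a sketch with vectorized per-game quantities (the names and array layout are our own assumptions):

```python
import numpy as np

def update_theta_home(theta, qp, qm, r_hw, r_hl, r_aw, r_al):
    """One update of the home-field advantage theta in the spirit of (28).
    qp, qm: team skills q_i^+, q_i^-; r_hw/r_hl: home wins/losses of I_i^+;
    r_aw/r_al: away wins/losses of I_i^+ (all arrays indexed by game i)."""
    num = (r_hw + r_al).sum()
    den = (qp * (r_hw + r_hl) / (theta * qp + qm)
           + qm * (r_aw + r_al) / (qp + theta * qm)).sum()
    return num / den
```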
63 Extending the model proposed in (Rao and Kupper, 1967), we consider: P(I_i^+ beats I_i^−) = q_i^+/(q_i^+ + θq_i^−), P(I_i^− beats I_i^+) = q_i^−/(θq_i^+ + q_i^−), and P(I_i^+ ties I_i^−) = (θ² − 1) q_i^+ q_i^− / ( (q_i^+ + θq_i^−)(θq_i^+ + q_i^−) ), where θ > 1 is a threshold parameter to be estimated. [sent-641, score-3.781]
64 Let t_i be the number of times that I_i^+ ties I_i^−, with r_i, r_i' defined as before. [sent-642, score-0.82]
65 For updating p_s, θ is considered as a constant and (29) is in the form of the home-field model, so the rule is similar to (27). [sent-644, score-0.437]
66 For updating θ, we have θ^{t+1} = 1/(2C_t) + √(1 + 1/(4C_t²)), (31) where C_t = ( ∑_{i=1}^m [ (r_i + t_i) q_i^−/(q_i^+ + θ^t q_i^−) + (r_i' + t_i) q_i^+/(θ^t q_i^+ + q_i^−) ] ) / ( 2 ∑_{i=1}^m t_i ). [sent-646, score-2.561]
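The threshold update (31) only needs the constant C_t; a sketch under the reconstruction above (names and array layout are ours):

```python
import numpy as np

def update_theta_ties(theta, qp, qm, r_plus, r_minus, t):
    """One update of the tie threshold theta > 1 via (31);
    t[i] is the number of ties in game i, qp/qm the team skills."""
    C = ((r_plus + t) * qm / (qp + theta * qm)
         + (r_minus + t) * qp / (theta * qp + qm)).sum() / (2.0 * t.sum())
    return 1.0 / (2 * C) + np.sqrt(1.0 + 1.0 / (4 * C ** 2))
```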
67 Pendergrass and Bradley (1960) proposed using P(i best, j in the middle, and k worst) = P(i beats j and k) · P(j beats k) = p_i/(p_i + (p_j + p_k)) · p_j/(p_j + p_k). [sent-653, score-0.454]
68 A general model introduced in (Plackett, 1975) is: P(a(1) → a(2) → ··· → a(k)) = ∏_{i=1}^k p_{a(i)}/( p_{a(i)} + p_{a(i+1)} + ··· + p_{a(k)} ), (32) where a(i), 1 ≤ i ≤ k, is the ith ranked individual and → denotes the relation "is ranked higher than." [sent-654, score-0.372]
69 We consider the model: P(I_m^1 → I_m^2 → ··· → I_m^{g_m}) = ∏_{i=1}^{g_m} ( ∑_{s: s∈I_m^i} p_s ) / ( ∑_{j=i}^{g_m} ∑_{s: s∈I_m^j} p_s ). (33) [sent-662, score-0.867]
70 Defining q_m^i = ∑_{s: s∈I_m^i} p_s, we minimize the negative log-likelihood function: min_p l(p) = −∑_{m=1}^N ∑_{i=1}^{g_m} log( q_m^i / ∑_{j=i}^{g_m} q_m^j ) (34) under the constraints in (6). [sent-663, score-0.6]
71 Each ranking can be viewed as the result of a series of paired team comparisons: the first ranked team beats the others, the second ranked team beats the others except the first, and so on; for each paired comparison, r_i = 1 and r_i' = 0. [sent-665, score-1.244]
72 Therefore, Algorithm 2 can be applied and the updating rule is: p_s^{t+1} = ( ∑_{j: s∈I_j} (q_j^{φ_j(s)})⁻¹ ) / ( ∑_{j: s∈I_j} ∑_{i=1}^{φ_j(s)} ( ∑_{v=i}^{g_j} q_j^v )⁻¹ ) · p_s^t, (35) where φ_j(s) is the rank of the team that individual s belongs to in the jth game and I_j = ∪_{i=1}^{g_j} I_j^i. [sent-666, score-0.664]
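For full rankings, rule (35) accumulates, for every game an individual takes part in, the inverse skill of its own team in the numerator and the inverse "remaining mass" of every stage up to its rank in the denominator. A sketch of one sweep follows (the data layout and names are ours; teams within a game are assumed disjoint, and p is a numpy array of skills).

```python
import numpy as np

def ranking_sweep(p, games):
    """One pass of rule (35); `games` is a list of rankings, each a list of
    index lists ordered from first place to last."""
    for s in range(len(p)):
        num, den = 0.0, 0.0
        for teams in games:
            q = np.array([p[list(team)].sum() for team in teams])  # q_j^i
            tails = np.cumsum(q[::-1])[::-1]        # sum_{v >= i} q_j^v
            for rank, team in enumerate(teams):
                if s in team:
                    num += 1.0 / q[rank]
                    den += (1.0 / tails[: rank + 1]).sum()
                    break
        if den > 0:
            p[s] *= num / den
        p /= p.sum()
    return p
```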
73 Discussion and Conclusions We propose a generalized Bradley-Terry model which gives individual skill from group competition results. [sent-680, score-0.155]
74 In particular, minimizing the negative log likelihood of the proposed model coincides with minimizing the KL distance for multi-class probability estimates under error correcting output codes. [sent-683, score-0.19]
75 Proof of Theorem 1. Define q_{i\s}^{t,+} ≡ ∑_{j∈I_i^+, j≠s} p_j^t and q_{i\s}^{t,−} ≡ ∑_{j∈I_i^−, j≠s} p_j^t. [sent-693, score-0.804]
76 Using −log x ≥ 1 − log y − x/y, with equality if and only if x = y, we have Q₁(p_s) ≥ l([p_1^t, ..., p_{s−1}^t, p_s, p_{s+1}^t, ..., p_k^t]^T). [sent-694, score-0.154]
77 With p_s^t > 0, log(q_{i\s}^{t,+} + p_s) = log( (q_{i\s}^{t,+}/q_i^{t,+})·1 + (p_s^t/q_i^{t,+})·(p_s/p_s^t) ) + log(q_i^{t,+}) ≥ (p_s^t/q_i^{t,+})(log p_s − log p_s^t) + log q_i^{t,+}, with equality if p_s = p_s^t. [sent-702, score-0.744]
78 As ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} > 0, Q₂(p_s) is a strictly convex function of p_s. [sent-710, score-0.372]
79 Setting ∂Q₂(p_s)/∂p_s = 0, that is, ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) p_s^t = ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) p_s, leads to the updating rule. [sent-711, score-0.35]
80 Thus, if p_s^{t+1} ≠ p_s^t, then l(p^{t+1}) ≤ Q₂(p_s^{t+1}) < Q₂(p_s^t) = l(p^t). [sent-712, score-0.402]
81 Proof of Lemma 2. If the result does not hold, there is an index s̄ and an infinite index set T such that lim_{t∈T, t→∞} p_{s̄}^t = p_{s̄}^∗ = 0. [sent-714, score-0.816]
82 Since ∑_{s=1}^k p_s^t = 1, ∑_{s=1}^k lim_{t∈T, t→∞} p_s^t = 1; thus, there is an index j such that lim_{t∈T, t→∞} p_j^t = p_j^∗ > 0. [sent-716, score-0.888]
83 If this claim is wrong, then l(p^t) = −∑_{i=1}^m [ r_i log(q_i^{t,+}/q_i^t) + r_i' log(q_i^{t,−}/q_i^t) ] ≥ −r_{s_0} log( q_{s_0}^{t,+}/q_{s_0}^t ) = −r_{s_0} log( p_{s̄}^t/(p_{s̄}^t + ∑_{u∈I_{s_0}^−} p_u^t) ) → ∞ when t ∈ T, t → ∞. [sent-720, score-3.96]
84 Proof of Theorem 3. Recall that we assume lim_{t∈K, t→∞} p^t = p^∗. [sent-727, score-0.402]
85 For each pt , t ∈ K, there is a corresponding index s for the updating rule (7). [sent-728, score-0.437]
86 Without loss of generality, we assume that all pt , t ∈ K have the same corresponding s. [sent-733, score-0.402]
87 As p_{s̄}^∗ > 0, by applying one iteration of Algorithm 2 on p^∗, and using Theorem 1, we obtain p^{∗+1} ≠ p^∗ and l(p^{∗+1}) < l(p^∗). [sent-744, score-0.744]
88 Thus, at the t̄th iteration, lim_{t∈K, t→∞} p_s^{t+1} = lim_{t∈K, t→∞} ( ∑_{i: s∈I_i^+} r_i/q_i^{t,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{t,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^t ) · p_s^t = ( ∑_{i: s∈I_i^+} r_i/q_i^{∗,+} + ∑_{i: s∈I_i^−} r_i'/q_i^{∗,−} ) / ( ∑_{i: s∈I_i} (r_i + r_i')/q_i^∗ ) · p_s^∗ = p_s^∗, and hence lim_{t∈K, t→∞} p^{t+1} = lim_{t∈K, t→∞} p^t = p^∗. [sent-746, score-4.236]
89 ¯ Assume s corresponds to the tth iteration, by a similar derivation, ¯ lim t∈K,t→∞ pt+1 = · · · = lim t∈K,t→∞ ¯ pt = p∗ and ¯ lim t∈K,t→∞ pt+1 = p∗+1 . [sent-747, score-0.528]
90 For the second case, as q_i = ∑_{j=1}^k p_j = 1, l(p) can be reformulated as l(p) = −∑_{i=1}^m ( r_i log q_i^+ + r_i' log q_i^− ), (38) which is a convex function of p. [sent-753, score-2.419]
91 From Assumption 1, for each p_s, there is i such that either I_i^+ = {s} with r_i > 0, or I_i^− = {s} with r_i' > 0. [sent-756, score-0.869]
92 Therefore, either −r_i log p_s or −r_i' log p_s appears in (38). [sent-757, score-0.898]
93 Since they are strictly convex functions of p_s, the summation over all s = 1, ..., k is strictly convex. [sent-758, score-0.372]
94 Solving Nonlinear Equations for the "One-against-the-rest" Approach. We show that if δ ≠ 0, p_s defined in (19) satisfies 0 ≤ p_s ≤ 1. [sent-765, score-0.744]
95 If δ > 0, then (1 + δ) ≥ √((1 + δ)² − 4 r_s δ), so p_s ≥ 0. [sent-766, score-0.372]
96 Using 0 < δ < 1 and r_s ≤ 1, p_s = ( (1 + δ) − √((1 + δ)² − 4 r_s δ) ) / (2δ) and −√((1 + δ)² − 4 r_s δ) ≤ −(1 − δ). [sent-775, score-0.372]
97 Adding 1 + δ to both sides and dividing by 2δ leads to p_s ≤ 1. [sent-781, score-0.372]
98 To find δ by solving ∑_{s=1}^k p_s − 1 = 0, the discontinuity at δ = 0 is a concern; a direct calculation shows lim_{δ→0} p_s = r_s. [sent-782, score-0.414]
99 Terms of l(p^t, θ) related to θ: −∑_{i=1}^m [ t_i log(θ² − 1) − (r_i + t_i) log(q_i^+ + θq_i^−) − (r_i' + t_i) log(θq_i^+ + q_i^−) ] ≤ −∑_{i=1}^m [ t_i log(θ² − 1) − (r_i + t_i)( log(q_i^+ + θ^t q_i^−) + (q_i^+ + θq_i^−)/(q_i^+ + θ^t q_i^−) − 1 ) − (r_i' + t_i)( log(θ^t q_i^+ + q_i^−) + (θq_i^+ + q_i^−)/(θ^t q_i^+ + q_i^−) − 1 ) ] ≡ Q(θ). [sent-796, score-6.216]
100 Then Q'(θ) = 0 implies ∑_{i=1}^m [ 2θ t_i/(θ² − 1) − (r_i + t_i) q_i^−/(q_i^+ + θ^t q_i^−) − (r_i' + t_i) q_i^+/(θ^t q_i^+ + q_i^−) ] = 0, and hence θ^{t+1} is defined as in (31). [sent-797, score-2.526]
wordName wordTfidf (topN-words)
[('qi', 0.582), ('pt', 0.402), ('ps', 0.372), ('ri', 0.35), ('pj', 0.169), ('ii', 0.168), ('beats', 0.107), ('weng', 0.1), ('ij', 0.095), ('huang', 0.083), ('mse', 0.082), ('log', 0.077), ('satimage', 0.076), ('ecoc', 0.072), ('pi', 0.071), ('ti', 0.066), ('encodings', 0.063), ('team', 0.058), ('paired', 0.056), ('rs', 0.055), ('rij', 0.054), ('ties', 0.054), ('dna', 0.054), ('hunter', 0.053), ('skill', 0.053), ('lin', 0.052), ('letter', 0.051), ('mnist', 0.051), ('game', 0.051), ('waveform', 0.049), ('generalized', 0.048), ('usps', 0.047), ('taiwan', 0.044), ('lim', 0.042), ('ni', 0.041), ('legend', 0.041), ('allwein', 0.041), ('gm', 0.041), ('im', 0.039), ('wij', 0.038), ('correcting', 0.038), ('segment', 0.038), ('wu', 0.037), ('brier', 0.036), ('isr', 0.036), ('rsi', 0.036), ('updating', 0.035), ('individual', 0.033), ('estimates', 0.032), ('teams', 0.031), ('coding', 0.028), ('erent', 0.027), ('zadrozny', 0.027), ('marked', 0.026), ('ist', 0.025), ('rk', 0.025), ('individuals', 0.023), ('qm', 0.023), ('eld', 0.023), ('ranked', 0.022), ('jth', 0.022), ('probability', 0.022), ('kl', 0.022), ('model', 0.021), ('dense', 0.021), ('figures', 0.021), ('gj', 0.021), ('home', 0.021), ('qj', 0.021), ('stationary', 0.02), ('multiplicative', 0.02), ('comparisons', 0.02), ('global', 0.019), ('yj', 0.019), ('sports', 0.019), ('taipei', 0.019), ('games', 0.019), ('bradley', 0.019), ('settings', 0.018), ('dashdot', 0.018), ('davidson', 0.018), ('iyj', 0.018), ('minorizing', 0.018), ('nsc', 0.018), ('pendergrass', 0.018), ('ruby', 0.018), ('simons', 0.018), ('wis', 0.018), ('probabilities', 0.018), ('decoding', 0.017), ('hastie', 0.017), ('globally', 0.017), ('pk', 0.017), ('convergent', 0.017), ('di', 0.017), ('four', 0.017), ('denominator', 0.016), ('testing', 0.016), ('unique', 0.016), ('earlier', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
2 0.15625776 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
3 0.10397349 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
4 0.089326434 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
5 0.08426223 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation
7 0.077348657 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
8 0.074955076 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
9 0.066200495 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
10 0.048994463 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
11 0.040147338 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
12 0.039419342 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.037802506 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
14 0.035560008 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
15 0.0327994 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
16 0.031714838 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.029057257 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
19 0.029005339 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
20 0.028121246 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(0, 0.165), (1, 0.047), (2, -0.102), (3, -0.007), (4, -0.039), (5, 0.192), (6, 0.156), (7, 0.009), (8, 0.238), (9, -0.172), (10, 0.129), (11, -0.223), (12, -0.199), (13, -0.347), (14, -0.167), (15, -0.007), (16, -0.027), (17, 0.04), (18, 0.178), (19, 0.159), (20, -0.084), (21, 0.143), (22, -0.072), (23, 0.022), (24, -0.071), (25, 0.044), (26, 0.081), (27, -0.144), (28, 0.074), (29, 0.016), (30, 0.013), (31, -0.075), (32, 0.046), (33, -0.032), (34, -0.04), (35, 0.022), (36, 0.033), (37, 0.07), (38, 0.066), (39, 0.057), (40, 0.012), (41, -0.021), (42, -0.043), (43, -0.039), (44, 0.044), (45, 0.024), (46, 0.03), (47, 0.003), (48, 0.106), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.98618549 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
2 0.62223208 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
Author: Chen Yanover, Talya Meltzer, Yair Weiss
Abstract: The problem of finding the most probable (MAP) configuration in graphical models comes up in a wide range of applications. In a general graphical model this problem is NP hard, but various approximate algorithms have been developed. Linear programming (LP) relaxations are a standard method in computer science for approximating combinatorial problems and have been used for finding the most probable assignment in small graphical models. However, applying this powerful method to real-world problems is extremely challenging due to the large numbers of variables and constraints in the linear program. Tree-Reweighted Belief Propagation is a promising recent algorithm for solving LP relaxations, but little is known about its running time on large problems. In this paper we compare tree-reweighted belief propagation (TRBP) and powerful generalpurpose LP solvers (CPLEX) on relaxations of real-world graphical models from the fields of computer vision and computational biology. We find that TRBP almost always finds the solution significantly faster than all the solvers in CPLEX and more importantly, TRBP can be applied to large scale problems for which the solvers in CPLEX cannot be applied. Using TRBP we can find the MAP configurations in a matter of minutes for a large range of real world problems.
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
5 0.32687539 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
6 0.29698035 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
7 0.27219003 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
8 0.21270677 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.21255697 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
10 0.20511851 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
11 0.19617553 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
12 0.19012004 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
13 0.17343563 8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
14 0.14605287 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
15 0.14532577 53 jmlr-2006-Learning a Hidden Hypergraph
16 0.1448039 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
17 0.13771884 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
18 0.13364689 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
19 0.12429978 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs (Special Topic on Machine Learning and Optimization)
20 0.12273076 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
topicId topicWeight
[(8, 0.018), (35, 0.018), (36, 0.077), (45, 0.028), (50, 0.035), (63, 0.02), (76, 0.023), (78, 0.021), (81, 0.057), (83, 0.429), (84, 0.019), (90, 0.028), (91, 0.041), (96, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.74554574 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin
Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
3 0.28356236 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
Author: Francis R. Bach, David Heckerman, Eric Horvitz
Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification
4 0.27867261 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
5 0.27673176 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
7 0.27638716 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
8 0.27634808 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
9 0.27634302 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
10 0.27587458 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
11 0.27585679 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
12 0.27454576 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
13 0.27408695 70 jmlr-2006-Online Passive-Aggressive Algorithms
15 0.27132648 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
16 0.27014032 53 jmlr-2006-Learning a Hidden Hypergraph
18 0.26933846 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
19 0.2691775 88 jmlr-2006-Streamwise Feature Selection
20 0.26905981 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting