jmlr jmlr2011 jmlr2011-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
Reference: text
sentIndex sentText sentNum sentScore
1 Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. [sent-12, score-0.3]
2 We show that the mistake bound can be improved by the proposed online learning method. [sent-13, score-0.289]
3 In some trials of online learning, besides assigning a weight to the misclassified example, the proposed online learning algorithm also updates the weight for one of the existing support vectors, referred to as the auxiliary example. [sent-39, score-0.541]
4 The key challenge in the proposed online learning approach is to decide which existing support vector should be selected for weight updating. [sent-42, score-0.322]
5 In order to quantitatively analyze the impact of updating the weight for such an existing support vector, we employ an analysis that is based on the work of online convex programming by incremental dual ascent (Shalev-Shwartz and Singer, 2006, 2007). [sent-44, score-0.402]
6 Our analysis shows that under certain conditions, the proposed online learning algorithm can significantly reduce the mistake bound of the existing online algorithms. [sent-45, score-0.474]
7 Besides binary classification, we extend the double updating online learning algorithm to multi-class learning. [sent-46, score-0.421]
8 Extensive experiments show promising performance of the proposed online learning algorithm compared to the state-of-the-art algorithms for online learning. [sent-47, score-0.332]
9 Section 4 extends the double updating method to online multi-class learning. [sent-51, score-0.421]
10 One of the most well-known online approaches is the Perceptron algorithm (Rosenblatt, 1958; Freund and Schapire, 1999), which updates the learning function by adding the misclassified example with a constant weight to the current set of support vectors. [sent-60, score-0.257]
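As a concrete reference point for this single-update scheme, the following Python sketch implements a kernel Perceptron; it is not the authors' released DUOL code, and the RBF kernel and the fixed weight of 1 are illustrative assumptions.

    import numpy as np

    def rbf_kernel(x1, x2, gamma=1.0):
        # Illustrative kernel choice; any Mercer kernel could be plugged in.
        return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

    def kernel_perceptron(stream, kernel=rbf_kernel):
        # Single-update online learning: every mistake adds the example as a
        # support vector with a constant weight that is never revisited.
        sv_x, sv_y, sv_w, mistakes = [], [], [], 0
        for x, y in stream:
            f_x = sum(w * yi * kernel(x, xi) for w, yi, xi in zip(sv_w, sv_y, sv_x))
            if y * f_x <= 0:  # misclassified (zero margin counted as a mistake)
                mistakes += 1
                sv_x.append(x); sv_y.append(y); sv_w.append(1.0)
        return sv_x, sv_y, sv_w, mistakes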
11 Despite the difference, most online learning algorithms only update the weight of the newly added support vector, and keep the weights of the existing support vectors unchanged. [sent-69, score-0.353]
12 The proposed online learning algorithm is closely related to the recent work of online convex programming by incremental dual ascent (Shalev-Shwartz and Singer, 2006, 2007). [sent-71, score-0.387]
13 Although these online learning algorithms are capable of dynamically adjusting the weights of support vectors, they are designed either to fit within a budget on the number of support vectors or to handle drifting concepts, but not to reduce the number of classification mistakes in online learning. [sent-77, score-0.543]
14 Finally, several algorithms were proposed for online training of SVMs that update the weights of more than one support vector simultaneously (Cauwenberghs and Poggio, 2000; Bordes et al. [sent-78, score-0.273]
15 These algorithms differ from the proposed one in that they are designed for efficiently learning an SVM classification model, not for online learning, and therefore do not provide a guarantee on the mistake bound. [sent-84, score-0.289]
16 Double Updating Online Learning for Binary Classification In this section, we present the proposed double updating online learning method for solving online binary classification tasks. [sent-86, score-0.587]
17 {(x1, y1), . . . , (xT, yT)}, where xt ∈ Rd is a d-dimensional instance and yt ∈ Y = {−1, +1} is the class label assigned to xt. [sent-93, score-0.315]
18 2 Motivation We consider trial t in an online learning task where the training example (xa, ya) is misclassified (i.e., ya f(xa) ≤ 0). [sent-102, score-0.334]
19 i=1 In the conventional approach for online learning, we simply assign a constant weight, denoted by β ∈ (0,C], to (xa , ya ), and the resulting classifier becomes n f ′ (x) = βya κ(x, xa ) + ∑ αi yi κ(x, xi ) = βya κ(x, xa ) + f (x). [sent-115, score-0.748]
20 i=1 The shortcoming of the conventional online learning approach is that the introduction of the new support vector (xa , ya ) may harm the classification of existing support vectors in D , which is revealed by the following proposition. [sent-116, score-0.423]
21 Proof It follows from the fact that: ∃ xi ∈ D , yi ya κ(xi , xa ) < 0 when ya f (xa ) < 0. [sent-123, score-0.52]
22 When ya f (xa ) ≤ −γ, there exists one support vector (xb , yb ) ∈ D that satisfies βya yb k(xa , xb ) ≤ −βγ/n. [sent-125, score-1.001]
23 This support vector will be misclassified by the updated classifier f ′ (x) if yb f (xb ) ≤ βγ/n. [sent-126, score-0.314]
24 It is clear that by adding the misclassified example (xa , ya ) to classifier f (x) with weight β, the classification score of (xb , yb ) will be reduced by at least βρ, which could lead to a significant misclassification of the auxiliary example (xb , yb ). [sent-130, score-0.799]
25 To avoid such a mistake, we propose to update the weights for both (xa , ya ) and (xb , yb ) simultaneously. [sent-131, score-0.471]
26 In the next section, we show the details of the double updating algorithm for online learning, and the analysis of its mistake bound. [sent-132, score-0.544]
27 For the remaining part of this section, we denote by (xb , yb ) an auxiliary example that satisfies the two conditions specified before. [sent-141, score-0.352]
28 We define ka = κ(xa , xa ), kb = κ(xb , xb ), kab = κ(xa , xb ), wab = ya yb kab . [sent-142, score-2.022]
29 According to the assumption on the auxiliary example, we have wab = kab ya yb ≤ −ρ. [sent-143, score-0.924]
30 Finally, we denote by γ̂b the weight for the auxiliary example (xb, yb) that is used in the current classifier f(x), by γa and γb the updated weights for (xa, ya) and (xb, yb), respectively, and by dγb the difference γb − γ̂b. [sent-144, score-0.819]
31 3 Double Updating Online Learning for Binary Classification Recall that an auxiliary example (xb, yb) should satisfy two conditions: (I) yb f(xb) ≤ 0, and (II) wab ≤ −ρ. [sent-146, score-1.038]
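A minimal sketch of how such an auxiliary example could be located is given below; it assumes the support set is kept in parallel lists together with cached scores f(x_i), and the function name, the threshold value, and the rule of keeping the most conflicting candidate are illustrative assumptions rather than the paper's exact procedure.

    def find_auxiliary(x_a, y_a, sv_x, sv_y, sv_f, kernel, rho=0.2):
        # Look for a support vector satisfying the two conditions:
        #   (I)  it is itself misclassified: y_b * f(x_b) <= 0, and
        #   (II) it conflicts with the new example: w_ab = y_a*y_b*k(x_a,x_b) <= -rho.
        # The full algorithm additionally checks a budget condition on the existing
        # weight (e.g., C >= current weight + 1/(1 - rho)), which is omitted here.
        best_idx, best_wab = None, 0.0
        for i, (x_b, y_b, f_b) in enumerate(zip(sv_x, sv_y, sv_f)):
            w_ab = y_a * y_b * kernel(x_a, x_b)
            if y_b * f_b <= 0 and w_ab <= -rho and w_ab < best_wab:
                best_idx, best_wab = i, w_ab  # keep the most conflicting candidate
        return best_idx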
32 In addition, the example (xa , ya ) received in the current iteration t is misclassified, that is, ya f (xa ) ≤ 0. [sent-147, score-0.288]
33 Following the framework of dual formulation for online learning, the following lemma shows how to compute ∆t , that is, the improvement in the objective function of dual SVM by adjusting weights for (xa , ya ) and (xb , yb ). [sent-148, score-0.682]
34 Theorem 1 Assume C ≥ γ̂b + 1/(1 − ρ) with ρ ∈ [0, 1) for the selected auxiliary example (xb, yb); then we have the following bound for the bounding constant ∆: ∆ ≥ 1/(1 − ρ). [sent-152, score-0.352]
35 This is because, for a given γa ≥ 0, the optimal solution for dγb, given by dγb = (1 − yb f(xb) − wab γa)/kb, is positive because yb f(xb) ≤ 0 and wab ≤ −ρ. [sent-154, score-1.662]
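Read literally, that closed form can be computed as in the sketch below; the function name is made up, and the clipping of the resulting weight performed by the full algorithm is only indicated in a comment.

    def optimal_d_gamma_b(ell_b, w_ab, k_b, gamma_a):
        # ell_b = 1 - y_b f(x_b); for a fixed gamma_a the unconstrained optimum is
        #   d_gamma_b = (ell_b - w_ab * gamma_a) / k_b,
        # which is positive because y_b f(x_b) <= 0 (so ell_b >= 1) and w_ab <= -rho < 0.
        # The full update additionally keeps the new weight of (x_b, y_b) inside [0, C].
        return (ell_b - w_ab * gamma_a) / k_b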
36 Using the facts ka, kb ≤ 1, γa, dγb ≥ 0, ya f(xa) ≤ 0, yb f(xb) ≤ 0, and wab ≤ −ρ, we have h(γa, dγb) ≥ γa + dγb − (1/2)γa² − (1/2)dγb² + ρ γa dγb. [sent-155, score-0.838]
37 We have the following bound for ∆ when updating the weight for the misclassified example (xa, ya) and the auxiliary example (xb, yb): ∆ ≥ 1/2 + (1/2) min{(1 + ρ)², (C − γ)²}. [sent-161, score-0.622]
38 Although Theorems 1 and 2 show that the double update strategy could significantly improve the bounding constant ∆ over 1/2 and consequently reduce the mistake bound, it is applicable only when there exists an auxiliary example. [sent-164, score-0.38]
39 Specifically, we relax the condition for performing a double update as follows: there exists (xb, yb) ∈ D such that (i) wab ≤ −ρ, (ii) yb ft−1(xb) ≤ 1, and (iii) C ≥ γ̂b + ρ. [sent-166, score-1.147]
40 Theorem 3 Assume wab ≤ −ρ, yb ft−1(xb) ≤ 1 and C ≥ γ̂b + ρ; then we have the following bound for the bounding constant: ∆ ≥ (1 + ρ²)/2. [sent-168, score-0.686]
41 Proposition 2 Denote ℓa := 1 − ya f (xa ) and ℓb := 1 − yb f (xb ). [sent-176, score-0.422]
42 In this algorithm, to efficiently find the auxiliary example (xb, yb), we introduce a variable f_t^i for each support vector to keep track of its classification score. [sent-180, score-0.435]
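One plausible way to maintain these cached scores is sketched below, assuming parallel lists for the support set; whenever some support vector's weight changes by delta, every cached score is shifted by delta times the corresponding kernel value. This is an assumption about the bookkeeping, not the released implementation.

    def refresh_scores(sv_x, sv_f, x, y, delta, kernel):
        # After increasing the weight of support vector (x, y) by delta, shift the
        # cached classification score of every stored support vector, so that the
        # search for an auxiliary example costs O(|SV|) per trial.
        for i, x_i in enumerate(sv_x):
            sv_f[i] += delta * y * kernel(x_i, x)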
43 In addition, we denote by Mds(ρ) and Mdw(ρ) the sets of indexes for the cases of strong and weak double updating, respectively, that is, Mds(ρ) = {t | ∃ (xb, yb) s.t. wab ≤ −ρ, yb ft−1(xb) ≤ 0 and C ≥ γ̂b + 1/(1 − ρ) for (xt, yt), t ∈ M}, [sent-184, score-0.537]
44 Mdw(ρ) = {t | ∃ (xb, yb) s.t. wab ≤ −ρ, yb ft−1(xb) ≤ 1 and C ≥ γ̂b + ρ, t ∈ M \ Mds(ρ)}. [sent-186, score-0.686]
45 Let (x1, y1), . . . , (xT, yT) be a sequence of examples, where xt ∈ Rd, yt ∈ {−1, +1} and κ(xt, xt) ≤ 1 for all t, and assume C ≥ 1. [sent-194, score-0.315]
46 As revealed by the above theorem, the number of mistakes made by the proposed double updating online learning algorithm will be smaller than that of an online learning algorithm that performs only a single update in each trial. [sent-200, score-0.713]
47 The difference in the mistake bound is essentially due to the double updating: the more double updates are performed, the more advantageous the proposed algorithm becomes. [sent-201, score-0.431]
48 Besides, the above bound also indicates that a strong double update is more powerful than a weak double update, given that the associated weight of a strong double update, (1 + ρ)/(1 − ρ), is always much larger than that of a weak double update, ρ²/2. [sent-202, score-0.757]
49 Multiclass Double Updating Online Learning In this section, we extend the proposed double updating online learning algorithm to multiclass learning where each instance can be assigned to multiple classes. [sent-208, score-0.487]
50 1 Online Multiclass Learning Similar to online binary classification, online multiclass learning is performed over a sequence of training examples (x1 ,Y1 ), . [sent-210, score-0.398]
51 (H(Ya ) · H(Yb ))κ(xa , xb ) ≤ −2ρ where ρ ∈ (0, 1) is a threshold. [sent-246, score-0.265]
52 Given κ(xa, xb) ≥ 0, the second condition of the auxiliary example implies H(Ya) · H(Yb) ≤ 0, which further indicates that the two examples (xa, Ya) and (xb, Yb) have opposite predictions, that is, (ra = sb) or (sa = rb). [sent-249, score-0.437]
53 To ease our further discussions, we define ka = κ(xa, xa), kb = κ(xb, xb), and wab = (H(Ya) · H(Yb))κ(xa, xb). [sent-254, score-1.56]
54 The following proposition shows the optimization problem related to the multiclass double updating online learning algorithm, which forms the basis for deriving the bounding constant ∆. [sent-255, score-0.505]
55 Similar to double updating for binary classification, we introduce the weak double update when there exists (xb, Yb) s.t. [sent-263, score-0.438]
56 wab ≤ −2ρ, ft−1,rb(xb) − ft−1,sb(xb) ≤ 1, and C ≥ γ̂b + ρ. [sent-265, score-0.408]
57 wab ≤ −2ρ, ft−1,rb(xb) − ft−1,sb(xb) ≤ 1, C ≥ γ̂b + ρ, and the current instance is misclassified, then we have the following bounding constant: ∆ ≥ (1 + ρ²)/2. [sent-268, score-0.408]
58 Experimental Results In this section, we evaluate the empirical performance of the proposed double updating online learning algorithms for online learning tasks. [sent-287, score-0.587]
59 We first evaluate the performance of DUOL for binary classification, followed by the evaluation of multiclass double updating online learning. [sent-288, score-0.487]
60 Note that one may also compare with the online SVM algorithm (Shalev-Shwartz and Singer, 2006), which updates the weights for all support vectors in each trial. [sent-295, score-0.274]
61 We evaluate the online learning performance by measuring the mistake rate, that is, the percentage of examples that are misclassified by the online learning algorithm. [sent-306, score-0.455]
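Concretely, this follows the usual predict-then-update protocol; the sketch below assumes a learner object exposing hypothetical predict and update methods, and is only an illustration of how the mistake rate is accumulated.

    def online_mistake_rate(learner, stream):
        # Predict first, count the error, then let the learner update on (x, y).
        mistakes, total = 0, 0
        for x, y in stream:
            if learner.predict(x) != y:
                mistakes += 1
            learner.update(x, y)
            total += 1
        return mistakes / max(total, 1)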
62 Second, we observe that among the three variants of double updating online learning, the DUOL approach, which solves the optimization problem exactly, yields the lowest mistake rate with the smallest number of support vectors for most of the cases. [sent-320, score-0.602]
63 This shows that the proposed double updating approach is effective in improving the performance of online prediction. [sent-703, score-0.421]
64 Third, we observe that the proposed DUOL algorithm is significantly more accurate than the other two variants of double updating online learning algorithms (DUOLiter and DUOLappr ) for varied C values, as we expected. [sent-722, score-0.421]
65 This is because it is easier to find an auxiliary example for double updating when ρ is small. [sent-809, score-0.329]
66 This is because the condition for conducting a strong double update is significantly more difficult to satisfy than that for a weak double update. [sent-812, score-0.337]
67 They are: (i) “Max”, the perceptron method based on the max-score multiclass update, (ii) “Uniform”, the perceptron method based on the Uniform multiclass update, and (iii) “Prop”, the perceptron method based on the proportion multiclass update. [sent-821, score-0.498]
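For orientation, the "Max" baseline corresponds to the standard max-score multiclass Perceptron; a linear (non-kernel) textbook-style sketch is shown below, and it is not taken from the compared implementations.

    import numpy as np

    def max_score_perceptron(stream, n_classes, n_features):
        # W[c] scores class c; on a mistake, promote the true class and demote the
        # top-scoring wrong class (the "max-score" update). x is a feature vector.
        W = np.zeros((n_classes, n_features))
        mistakes = 0
        for x, y in stream:
            y_hat = int(np.argmax(W @ x))
            if y_hat != y:
                mistakes += 1
                W[y] += x
                W[y_hat] -= x
        return W, mistakes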
68 First, by comparing all the baseline algorithms, we find that the two PA algorithms yield considerably lower mistake rates than the other single-updating online learning algorithms. [sent-832, score-0.289]
69 Table 4: Evaluation of multiclass online learning algorithms on the multiclass data sets (segment, satimage, usps, mnist, letter, protein). [sent-1229, score-0.298]
70 Second, among the three variants of double updating for multi-label learning, it is not surprising to observe that M-DUOL yields the lowest mistake rates for all data sets. [sent-1232, score-0.378]
71 For future work, it is possible to extend other single-update online learning methods, such as EG (Kivinen and Warmuth, 1995), to double updating. [sent-1243, score-0.349]
72 Finally, we plan to extend the proposed double updating framework for budget online learning to make sparse classifiers. [sent-1245, score-0.421]
73 Conclusions This paper presented a novel “double updating” approach to online learning named “DUOL”, which not only updates the weight of the misclassified example, but also adjusts the weight of one existing support vector that most seriously conflicts with the new support vector. [sent-1247, score-0.337]
74 We show that the mistake bound for an online classification task can be significantly reduced by the proposed DUOL algorithms. [sent-1248, score-0.289]
75 Promising empirical results showed that the proposed double updating online learning algorithms consistently outperform the single-update online learning algorithms. [sent-1250, score-0.587]
76 The optimization problem (Equations (6)–(9)) is: min_{γa, dγb} (ka/2)γa² + (kb/2)dγb² + wab γa dγb − ℓa γa − ℓb dγb, subject to γa − C ≤ 0, −γa ≤ 0, dγb − C + γ̂b ≤ 0, and −dγb − γ̂b ≤ 0, where ka, kb > 0, wab ≤ 0, ℓa = 1 − ya f(xa) ≥ 0, ℓb = 1 − yb f(xb) ≥ 0 and γ̂b > 0. [sent-1255, score-2.07]
77 Plugging the results γa = C, λ2 = 0, dγb = C − γ̂b and λ4 = 0 into the zero gradient condition, we have ka C + wab(C − γ̂b) − ℓa + λ1 = 0 and kb(C − γ̂b) + wab C − ℓb + λ3 = 0. [sent-1266, score-0.698]
78 Thus, we have λ1 = −[ka C + wab(C − γ̂b) − ℓa] and λ3 = −[kb(C − γ̂b) + wab C − ℓb]. [sent-1267, score-0.408]
79 As a result, if −(ka C + wab(C − γ̂b) − ℓa) > 0 and −(kb(C − γ̂b) + wab C − ℓb) > 0, then the KKT conditions are satisfied and (γa, dγb) = (C, C − γ̂b) is the unique solution. [sent-1268, score-0.408]
80 Plugging the results λ2 = 0, γa = C, λ3 = 0 and dγb = −γ̂b into the zero gradient conditions: ka C + wab(−γ̂b) − ℓa + λ1 = 0 and kb(−γ̂b) + wab C − ℓb − λ4 = 0. [sent-1280, score-0.698]
81 But since kb(−γ̂b) < 0, wab C ≤ 0 and ℓb, λ4 ≥ 0, we have kb(−γ̂b) + wab C − ℓb − λ4 < 0, which contradicts the equation above. [sent-1281, score-0.58]
82 Plugging the conditions γa = C, λ2 = 0, λ3 = 0 and λ4 = 0 into the zero gradient equations: kaC + wab dγb − ℓa + λ1 = 0 and kb dγb + wabC − ℓb = 0. [sent-1286, score-0.698]
83 Solving the above equations leads to the following: λ1 = (wab² C − wab ℓb − ka kb C + kb ℓa)/kb and dγb = (ℓb − wab C)/kb. [sent-1287, score-2.112]
84 If (wab² C − wab ℓb − ka kb C + kb ℓa)/kb > 0 and (ℓb − wab C)/kb ∈ [−γ̂b, C − γ̂b], then the KKT conditions are all satisfied; as a result, (γa, dγb) = (C, (ℓb − wab C)/kb) is the unique optimal solution. [sent-1288, score-0.29]
85 Since λ3[dγb − (C − γ̂b)] = 0, plugging the conditions λ1 = 0, γa = 0, dγb = C − γ̂b and λ4 = 0 into the zero gradient conditions: wab(C − γ̂b) − ℓa − λ2 = 0 and kb(C − γ̂b) − ℓb + λ3 = 0. [sent-1303, score-0.698]
86 Since wab ≤ 0, C − γ̂b ≥ 0 and ℓa ≥ 0, we conclude λ2 = wab(C − γ̂b) − ℓa ≤ 0. [sent-1304, score-0.816]
87 From the conditions λ1 = 0, γa = 0, λ3 = 0 and dγb = −γ̂b and the zero gradient conditions, we have wab(−γ̂b) − ℓa − λ2 = 0 and kb(−γ̂b) − ℓb − λ4 = 0. [sent-1313, score-0.698]
88 Since kb, γ̂b > 0 and ℓb ≥ 0, we conclude λ4 = kb(−γ̂b) − ℓb < 0. [sent-1314, score-0.58]
89 • Else if λ4 = 0, from the conditions λ1 = 0, γa = 0, λ3 = 0 and λ4 = 0 and the zero gradient conditions, we have wab dγb − ℓa − λ2 = 0 and kb dγb − ℓb = 0. [sent-1316, score-0.698]
90 Since wab ≤ 0, ℓb, ℓa ≥ 0 and kb > 0, we have λ2 = (wab ℓb)/kb − ℓa ≤ 0, which contradicts λ2 > 0 (since λ2 ≠ 0). [sent-1317, score-1.396]
91 From the conditions λ1 = λ2 = λ4 = 0, dγb = C − γ̂b and the zero gradient conditions: ka γa + wab(C − γ̂b) − ℓa = 0 and kb(C − γ̂b) + wab γa − ℓb + λ3 = 0. [sent-1329, score-1.232]
92 As a result, if (ℓa − wab(C − γ̂b))/ka ∈ [0, C] and ℓb − kb(C − γ̂b) − wab(ℓa − wab(C − γ̂b))/ka > 0, the unique optimal solution is (γa, dγb) = ((ℓa − wab(C − γ̂b))/ka, C − γ̂b). [sent-1330, score-1.766]
93 From λ1 = λ2 = λ3 = 0, dγb = −γ̂b and the zero gradient conditions: ka γa + wab(−γ̂b) − ℓa = 0 and kb(−γ̂b) + wab γa − ℓb − λ4 = 0; since λ4 = kb(−γ̂b) + wab γa − ℓb < 0, this contradicts the condition λ4 > 0. [sent-1337, score-1.93]
94 • If λ4 = 0, from λ1 = λ2 = λ3 = λ4 = 0 and the zero gradient conditions: ka γa + wab dγb − ℓa = 0 and kb dγb + wab γa − ℓb = 0. [sent-1338, score-1.232]
95 As a result, if γa and dγb satisfy the following: γa = (kb ℓa − wab ℓb)/(ka kb − wab²) ∈ [0, C] and dγb = (ka ℓb − wab ℓa)/(ka kb − wab²) ∈ [−γ̂b, C − γ̂b], then (γa, dγb) = ((kb ℓa − wab ℓb)/(ka kb − wab²), (ka ℓb − wab ℓa)/(ka kb − wab²)) is the unique optimal solution. [sent-1339, score-2.192]
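The interior case above (all multipliers zero) can be checked numerically; the sketch below computes the unconstrained stationary point of the two-variable quadratic and compares it with a coarse grid search over the box, with all constants made up purely for illustration.

    import numpy as np

    # Illustrative constants (not taken from the paper's experiments).
    k_a, k_b, w_ab, ell_a, ell_b, C, gamma_b_hat = 1.0, 1.0, -0.5, 1.2, 1.1, 5.0, 0.3

    def objective(g_a, d_b):
        # The boxed quadratic being minimized in this appendix.
        return (0.5 * k_a * g_a**2 + 0.5 * k_b * d_b**2 + w_ab * g_a * d_b
                - ell_a * g_a - ell_b * d_b)

    # Closed-form stationary point of the unconstrained problem.
    det = k_a * k_b - w_ab**2
    g_a_star = (k_b * ell_a - w_ab * ell_b) / det
    d_b_star = (k_a * ell_b - w_ab * ell_a) / det

    # Coarse grid search over the feasible box as a numerical cross-check.
    grid_a = np.linspace(0.0, C, 201)
    grid_b = np.linspace(-gamma_b_hat, C - gamma_b_hat, 201)
    best = min((objective(a, b), a, b) for a in grid_a for b in grid_b)
    print("closed form:", (g_a_star, d_b_star), "grid minimum:", best[1:])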
96 We can check the value of σ(ra, b) − σ(sa, b) by examining all possible cases as follows: (1) If ra = rb, which implies that xa and xb have the same relevant labels, then we should have H(Ya) · H(Yb) = 1 − σ(sa, b) ≥ 1 (either 1 or 2); (2) If ra ≠ rb, then: [sent-1343, score-0.655]
97 The Proof of Proposition 4 In this appendix, we will derive the dual ascent by the multiclass double updating approach. [sent-1354, score-0.376]
98 Now our goal is to derive the dual ascent guaranteed by the proposed double updating scheme. [sent-1392, score-0.31]
99 Assuming we conduct a double update for (xa, Ya) and some auxiliary example (xb, Yb), we can prove Proposition 4 as follows. [sent-1394, score-0.329]
100 Hence, the dual ascent is computed as follows: ∆D = Dt − Dt−1 = γa[1 − (ft−1,ra(xa) − ft−1,sa(xa))] + dγb[1 − (ft−1,rb(xb) − ft−1,sb(xb))] − (ka/2)γa² ∑i σ(i, a)² − (kb/2)dγb² ∑i σ(i, b)² − ∑i σ(i, a)σ(i, b) γa dγb κ(xa, xb). [sent-1397, score-0.44]
wordName wordTfidf (topN-words)
[('wab', 0.408), ('duol', 0.407), ('kb', 0.29), ('yb', 0.278), ('xb', 0.265), ('xa', 0.206), ('online', 0.166), ('romma', 0.16), ('double', 0.154), ('ft', 0.145), ('ya', 0.144), ('ka', 0.126), ('mistake', 0.123), ('misclassi', 0.12), ('wabc', 0.107), ('yt', 0.105), ('xt', 0.105), ('updating', 0.101), ('perceptron', 0.1), ('mistakes', 0.097), ('ouble', 0.093), ('crammer', 0.09), ('hao', 0.079), ('pdating', 0.079), ('nline', 0.076), ('auxiliary', 0.074), ('md', 0.074), ('oi', 0.071), ('duolappr', 0.067), ('multiclass', 0.066), ('pa', 0.066), ('ab', 0.064), ('sa', 0.064), ('st', 0.058), ('sb', 0.056), ('agg', 0.053), ('mira', 0.053), ('wmin', 0.053), ('ra', 0.05), ('duoliter', 0.047), ('fti', 0.047), ('kac', 0.047), ('koby', 0.045), ('ri', 0.045), ('rb', 0.042), ('iter', 0.041), ('numer', 0.041), ('cgt', 0.04), ('duo', 0.04), ('prop', 0.04), ('singer', 0.039), ('yoram', 0.037), ('dual', 0.037), ('support', 0.036), ('alma', 0.034), ('hya', 0.033), ('hyb', 0.033), ('kbc', 0.033), ('kivinen', 0.033), ('rt', 0.031), ('updates', 0.03), ('classi', 0.03), ('update', 0.029), ('bordes', 0.028), ('fs', 0.028), ('mds', 0.028), ('aggressive', 0.028), ('shai', 0.028), ('appr', 0.027), ('mdw', 0.027), ('mitface', 0.027), ('waba', 0.027), ('earning', 0.027), ('kk', 0.026), ('rand', 0.026), ('yi', 0.026), ('weight', 0.025), ('fn', 0.024), ('trial', 0.024), ('spambase', 0.023), ('dorothea', 0.023), ('icts', 0.023), ('vectors', 0.022), ('mushrooms', 0.02), ('ub', 0.02), ('ms', 0.02), ('weights', 0.02), ('antoine', 0.02), ('claudio', 0.02), ('duolrand', 0.02), ('fink', 0.02), ('fri', 0.02), ('kab', 0.02), ('rosenblatt', 0.02), ('margin', 0.019), ('dekel', 0.019), ('splice', 0.019), ('existing', 0.019), ('ascent', 0.018), ('proposition', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
2 0.15784511 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some√ Euclidean space, and the cost functions are linear. ˜ Only recently an efficient algorithm attaining O( T ) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which √ ˜ obtains a tighter regret bound of O( Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
3 0.15385212 41 jmlr-2011-Improved Moves for Truncated Convex Models
Author: M. Pawan Kumar, Olga Veksler, Philip H.S. Torr
Abstract: We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of αβ-Swap and α-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides the same multiplicative bounds as the standard linear programming (LP) relaxation in polynomial time. Compared to previous approaches based on the LP relaxation, for example interior-point algorithms or tree-reweighted message passing (TRW), our methods are faster as they use only the efficient st-MINCUT algorithm in their design. We demonstrate the usefulness of the proposed approaches on both synthetic and standard real data problems. Keywords: truncated convex models, move making algorithms, range moves, multiplicative bounds, linear programming relaxation
Author: Jim C. Huang, Brendan J. Frey
Abstract: We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the ‘derivative-sum-product’ (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research. Keywords: graphical models, cumulative distribution function, message-passing algorithm, inference
5 0.11390073 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
6 0.060054362 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
7 0.059801783 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
8 0.056046374 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
9 0.053588994 36 jmlr-2011-Generalized TD Learning
10 0.05240592 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
11 0.050452419 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
12 0.033355545 101 jmlr-2011-Variable Sparsity Kernel Learning
13 0.031495225 66 jmlr-2011-Multiple Kernel Learning Algorithms
14 0.031404518 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.028042641 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
16 0.026912037 35 jmlr-2011-Forest Density Estimation
17 0.025800303 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
18 0.024946706 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
19 0.02479028 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
20 0.024182184 58 jmlr-2011-Learning from Partial Labels
topicId topicWeight
[(0, 0.158), (1, 0.196), (2, -0.106), (3, -0.062), (4, 0.133), (5, 0.114), (6, 0.173), (7, 0.035), (8, 0.087), (9, -0.082), (10, -0.123), (11, 0.084), (12, 0.241), (13, -0.167), (14, 0.198), (15, 0.223), (16, 0.025), (17, 0.054), (18, 0.062), (19, 0.086), (20, 0.182), (21, -0.071), (22, -0.056), (23, -0.129), (24, 0.006), (25, 0.066), (26, -0.094), (27, -0.04), (28, -0.141), (29, -0.132), (30, -0.045), (31, -0.025), (32, -0.102), (33, -0.058), (34, -0.053), (35, -0.065), (36, 0.057), (37, -0.09), (38, 0.09), (39, -0.058), (40, 0.054), (41, -0.016), (42, -0.018), (43, 0.059), (44, -0.127), (45, -0.034), (46, -0.127), (47, -0.031), (48, -0.061), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.95506835 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
2 0.59768248 41 jmlr-2011-Improved Moves for Truncated Convex Models
Author: M. Pawan Kumar, Olga Veksler, Philip H.S. Torr
Abstract: We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of αβ-Swap and α-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides the same multiplicative bounds as the standard linear programming (LP) relaxation in polynomial time. Compared to previous approaches based on the LP relaxation, for example interior-point algorithms or tree-reweighted message passing (TRW), our methods are faster as they use only the efficient st-MINCUT algorithm in their design. We demonstrate the usefulness of the proposed approaches on both synthetic and standard real data problems. Keywords: truncated convex models, move making algorithms, range moves, multiplicative bounds, linear programming relaxation
3 0.40249014 14 jmlr-2011-Better Algorithms for Benign Bandits
Author: Elad Hazan, Satyen Kale
Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some√ Euclidean space, and the cost functions are linear. ˜ Only recently an efficient algorithm attaining O( T ) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which √ ˜ obtains a tighter regret bound of O( Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning
Author: Jim C. Huang, Brendan J. Frey
Abstract: We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the ‘derivative-sum-product’ (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research. Keywords: graphical models, cumulative distribution function, message-passing algorithm, inference
5 0.36906034 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
6 0.25057131 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
7 0.24798578 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
8 0.21643557 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
9 0.19695623 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
10 0.1861655 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.18052134 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
12 0.17662305 36 jmlr-2011-Generalized TD Learning
13 0.174106 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.16355093 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
15 0.14900401 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
16 0.13891909 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
17 0.13421974 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
18 0.13195431 58 jmlr-2011-Learning from Partial Labels
19 0.13044459 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
20 0.12367827 101 jmlr-2011-Variable Sparsity Kernel Learning
topicId topicWeight
[(4, 0.028), (9, 0.022), (10, 0.015), (24, 0.414), (31, 0.06), (32, 0.03), (41, 0.014), (67, 0.028), (71, 0.016), (73, 0.022), (78, 0.078), (83, 0.157)]
simIndex simValue paperId paperTitle
1 0.90805846 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
same-paper 2 0.88871282 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
3 0.86568791 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
4 0.6377871 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.56797737 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
6 0.56089634 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
7 0.53863418 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
8 0.53261364 61 jmlr-2011-Logistic Stick-Breaking Process
9 0.51014119 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
10 0.49569893 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.48774436 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
12 0.48099709 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
13 0.46180505 48 jmlr-2011-Kernel Analysis of Deep Networks
14 0.46109754 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
16 0.45428044 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
17 0.452306 16 jmlr-2011-Clustering Algorithms for Chains
18 0.44896126 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
19 0.44685075 66 jmlr-2011-Multiple Kernel Learning Algorithms
20 0.44570413 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking