nips nips2003 nips2003-41 knowledge-graph by maker-knowledge-mining

41 nips-2003-Boosting versus Covering


Source: pdf

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. [sent-7, score-0.4]

2 In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. [sent-10, score-0.649]

3 On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). [sent-11, score-0.558]

4 Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. [sent-12, score-0.213]

5 We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. [sent-13, score-0.197]

6 1 Introduction. The boosting method has become a powerful paradigm of machine learning. [sent-18, score-0.105]

7 In this method a highly accurate hypothesis is built by combining many “weak” hypotheses. [sent-19, score-0.28]

8 AdaBoost [FS97, SS99] is the most common boosting algorithm. [sent-20, score-0.105]

9 We start with m examples labeled with ±1. [sent-22, score-0.124]

10 At each iteration t, the algorithm receives a ±1 valued weak hypothesis ht whose error (weighted by the current distribution on the examples) is slightly smaller than 1/2. [sent-24, score-1.023]

11 It then updates its distribution so that after the update, the hypothesis ht has weighted error exactly 1/2. [sent-25, score-0.799]

12 The final hypothesis is a linear combination of the received weak hypotheses and it stops when this final hypothesis is consistent with all examples. [sent-26, score-1.031]

13 It is well known [SS99] that if each weak hypothesis has weighted error at most 1/2 − γ/2, then the upper bound on the training error reduces by a factor of √(1 − γ²) [sent-27, score-0.612]

14 and after O((1/γ²) log m) iterations, the final hypothesis is consistent with all examples. [sent-29, score-0.347]

15 Also, it has been shown that if the final hypotheses are restricted to (unweighted) majority votes of weak hypotheses [Fre95], then this upper bound on the number of iterations cannot be improved by more than a constant factor. [sent-30, score-0.759]

16 However, if there always is a positively one-sided weak hypothesis (i.e. [sent-31, score-0.514]

17 its positive predictions are always correct) that has error at most 1/2 − γ/2, then a set cover algorithm can be used to reduce the training error by a factor of 1 − γ, and O((1/γ) log m) weak hypotheses suffice to form a consistent hypothesis [Nat91]. [sent-33, score-0.82]

18 In particular, consider the problem of finding a consistent hypothesis for m examples labeled by a k literal disjunction. [sent-35, score-0.485]

19 Assume we use the literals as the pool of weak hypotheses and always choose as the weak hypothesis a literal that is consistent with all negative examples. [sent-36, score-1.101]
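A minimal sketch of this choice of weak hypothesis (an illustration added here, not the authors' code; it assumes a numpy matrix of ±1-valued literal values): among the literals that never predict +1 on a negative example, pick the one with the smallest weighted error under the current distribution D.

```python
import numpy as np

def best_one_sided_literal(X, y, D):
    """X: (m, N) matrix of literal values in {-1, +1}; y: labels in {-1, +1};
    D: distribution over the m examples.  Returns (literal index, weighted error)
    over the literals that are consistent with all negative examples."""
    neg = (y == -1)
    best_j, best_err = None, np.inf
    for j in range(X.shape[1]):
        if np.any(X[neg, j] == 1):           # literal predicts +1 on a negative example,
            continue                          # so it is not positively one-sided: skip it
        err = float(np.sum(D[X[:, j] != y]))  # weighted error of literal X_j under D
        if err < best_err:
            best_j, best_err = j, err
    return best_j, best_err
```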

20 Then it can be shown that, for any distribution D on the examples, there exists a literal (or a constant hypothesis) h with weighted error at most 1/2 − 1/(4k) (see e.g. [sent-37, score-0.138]

21 Therefore, the upper bound on the training error of AdaBoost reduces by a factor of √(1 − 1/(4k²)) and O(k² log m) iterations suffice. [sent-40, score-0.191]
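As an added back-of-the-envelope step (not in the extracted text): after T iterations the training-error bound is (1 − 1/(4k²))^(T/2), and it drops below 1/m (which forces consistency on m examples) once

```latex
\Bigl(1-\tfrac{1}{4k^{2}}\Bigr)^{T/2} < \tfrac{1}{m}
\;\Longleftrightarrow\;
T > \frac{2\ln m}{\ln\!\bigl(1/(1-\tfrac{1}{4k^{2}})\bigr)} \;\approx\; 8k^{2}\ln m .
```

The same calculation with the covering factor 1 − γ for γ = Θ(1/k) gives the O(k log m) count quoted in the next sentence for the greedy covering algorithm.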

22 However, a trivial greedy set covering algorithm, which follows a protocol strikingly similar to that of the boosting algorithms, finds a consistent disjunction with O(k log m) literals. [sent-41, score-0.504]

23 We show that InfoBoost mimics the set cover algorithm in this case (and attains the improved factor of 1 − 1/k). [sent-42, score-0.109]

24 We then show that Ω(k 2 log m) iterations are really required by AdaBoost using both an explicit construction (which requires some assumptions) and artificial experiments. [sent-44, score-0.123]

25 The differences are quite large: For m = 10, 000 random examples and a disjunction of size k = 60, AdaBoost requires 2400 iterations (on the average), whereas Covering and InfoBoost require 60 iterations. [sent-45, score-0.223]

26 We then show that InfoBoost has the improved reduction factor if the weak hypotheses happen to be one-sided. [sent-46, score-0.46]

27 Finally we give a modified version of AdaBoost that exploits the one-sidedness of the weak hypotheses and avoids some technical problems that can occur with InfoBoost. [sent-47, score-0.415]

28 The instances xi are in some domain X and the labels yi are in {−1, 1}. [sent-53, score-0.21]

29 The boosting algorithms maintain a distribution Dt over the examples. [sent-54, score-0.122]

30 At the t-th iteration, the algorithm chooses a weak hypothesis ht : X → {−1, 1} and then updates its distribution. [sent-56, score-0.762]

31 The most popular boosting algorithm does this as follows: AdaBoost: Dt+1(i) = Dt(i) exp{−yi ht(xi) αt} / Zt. (Footnote: This assumes equal weight on both types of examples.) [sent-57, score-0.575]

32 For the sake of simplicity we focus on the case when the range of the labels and the weak hypotheses is ±1-valued. [sent-59, score-0.418]

33 Here Zt is a normalization constant and the coefficient αt depends on the error εt at iteration t: αt = (1/2) ln((1 − εt)/εt) and εt = PrDt[ht(xi) ≠ yi]. [sent-61, score-0.336]

34 The final hypothesis is given by the sign of the following linear combination of the chosen weak hypotheses: H(x) = Σ_{t=1}^T αt ht(x). [sent-62, score-0.964]
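A minimal sketch of the AdaBoost loop just described (an illustration in numpy, not the authors' code; the greedy choice of the weak hypothesis with the smallest weighted error is assumed):

```python
import numpy as np

def adaboost(preds_pool, y, T):
    """preds_pool: (n_hyps, m) array, preds_pool[j, i] = prediction in {-1, +1}
    of the j-th candidate weak hypothesis on example i; y: +/-1 labels."""
    y = np.asarray(y)
    m = len(y)
    D = np.ones(m) / m                               # start from the uniform distribution
    chosen, alphas = [], []
    F = np.zeros(m)                                  # running linear combination on the sample
    for t in range(T):
        errs = (preds_pool != y) @ D                 # weighted error of every candidate
        j = int(np.argmin(errs))                     # greedy choice
        eps = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-y * preds_pool[j] * alpha)   # reweight so h_t gets error exactly 1/2
        D = D / D.sum()                              # the normalizer is Z_t
        chosen.append(j)
        alphas.append(alpha)
        F += alpha * preds_pool[j]
        if np.all(np.sign(F) == y):                  # final hypothesis consistent: stop
            break
    return chosen, alphas
```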

35 The constraint can be easily understood using the table of Figure 1. [sent-70, score-0.463]

36 (Figure 1: the 2×2 table of the four types of examples, with weights a, b, c, d indexed by the label yi and the prediction ht.) The second boosting algorithm we discuss in this paper has the following update: InfoBoost: Dt+1(i) = Dt(i) exp{−yi ht(xi) αt[ht(xi)]} / Zt, where αt[±1] = (1/2) ln((1 − εt[±1])/εt[±1]), εt[±1] = PrDt[ht(xi) ≠ yi | ht(xi) = ±1], and Zt is the normalization factor. [sent-77, score-0.818]

37 The final hypothesis is given by the sign of H(x) = Σ_{t=1}^T αt[ht(x)] ht(x). [sent-78, score-0.738]
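A minimal sketch of one InfoBoost step (an illustration, not the authors' code), with the two conditional errors εt[±1] and coefficients αt[±1] computed exactly as in the update above:

```python
import numpy as np

def infoboost_step(pred, y, D, eps_floor=1e-12):
    """pred: +/-1 predictions of the chosen weak hypothesis h_t on the sample;
    y: +/-1 labels; D: current distribution D_t (all numpy arrays).
    Returns (alpha_plus, alpha_minus, D_next)."""
    alpha = {}
    for s in (+1, -1):
        on_s = (pred == s)
        p_s = float(D[on_s].sum())                        # Pr_D[h_t(x) = s]
        eps_s = float(D[on_s & (pred != y)].sum()) / max(p_s, eps_floor)
        eps_s = min(max(eps_s, eps_floor), 1 - eps_floor)
        alpha[s] = 0.5 * np.log((1 - eps_s) / eps_s)      # alpha_t[s]
    a = np.where(pred == 1, alpha[1], alpha[-1])          # alpha_t[h_t(x_i)] per example
    D_next = D * np.exp(-y * pred * a)                    # InfoBoost reweighting
    return alpha[1], alpha[-1], D_next / D_next.sum()
```

The final hypothesis then accumulates αt[ht(x)] ht(x) over the iterations, i.e. alpha_plus on the examples where ht predicts +1 and alpha_minus where it predicts −1.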

38 In the original paper [Asl00], the InfoBoost update was motivated by seeking a distribution Dt+1 for which the error of ht is half and yi and ht (xi ) have mutual information zero. [sent-79, score-1.087]

39 Here we motivate InfoBoost as a minimization of the same relative entropy subject to the AdaBoost constraint b + c = 1/2 and a second simultaneously enforced constraint a + b = 1/2. [sent-80, score-0.103]

40 A natural question is why not just do two steps of AdaBoost at each iteration t: one for ht and then, sequentially, one for the constant hypothesis 1. [sent-85, score-0.507]

41 We call the latter algorithm AdaBoost with Bias, since the constant hypothesis introduces a bias into the final hypothesis. [sent-86, score-0.366]

42 Figure 2: Updating based on a positively one-sided hypothesis ht (weight c is 0): the updated distributions Dt+1 (shown in the figure for AdaBoost with Bias and for InfoBoost) on the four types of examples are quite different. [sent-89, sent-92]

44 We will show in the next section that in the case of learning disjunctions, AdaBoost with Bias (and plain AdaBoost) can require many more iterations than InfoBoost and the trivial covering algorithm. [sent-93, score-0.23]

45 A natural extension would be to constrain the errors of all past hypotheses to half which is the Totally Corrective Algorithm of [KW99]. [sent-95, score-0.254]

46 3 Lower bounds of AdaBoost for Learning k disjunctions. So far we did not specify how the weak hypothesis ht is chosen at iteration t. [sent-97, score-1.008]

47 We assume there is a pool H of weak hypotheses and distinguish two methods: Greedy: Choose a ht ∈ H for which the normalization factor Zt in the update of the algorithm is minimized. [sent-98, score-0.95]

48 Minimal: Choose ht with error smaller than a given threshold 1/2 − δ. [sent-99, score-0.472]

49 The greedy method is motivated by the fact that ∏_t Zt upper bounds the training error of the final hypothesis ([SS99, Asl00]) and this method greedily minimizes this upper bound. [sent-100, score-0.446]

50 In our lower bounds on the number of iterations the example set is always consistent with a k-literal monotone disjunction over N variables. [sent-102, score-0.235]

51 More precisely the instances xi are in {±1}^N and the label yi is xi,1 ∨ xi,2 ∨ · · · ∨ xi,k. [sent-103, score-0.207]

52 The pool of weak learners consists of the N literals Xj , where Xj (xi ) = xij . [sent-107, score-0.302]

53 For the greedy method we show that on random data sets InfoBoost and the covering algorithm use drastically fewer iterations than AdaBoost with Bias. [sent-108, score-0.286]

54 We chose 10,000 examples as follows: the first k bits of each example are chosen independently at random so that the probability of label +1 is half (i.e. [sent-109, score-0.133]

55 Figure 3 shows the number of iterations of AdaBoost with Bias until consistency is reached on all 10,000 examples, as a function of the size k of the disjunction (averaged over 20 runs). (Figure 3: Average # of steps.) [sent-112, score-0.185]
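A sketch of one way to generate data matching this description (the handling of the remaining N − k irrelevant bits is an assumption here; the relevant-bit probability is chosen so that Pr[label = +1] = 1/2):

```python
import numpy as np

def make_disjunction_data(m=10_000, N=100, k=60, seed=0):
    rng = np.random.default_rng(seed)
    p = 1.0 - 2.0 ** (-1.0 / k)                          # gives 1 - (1-p)**k = 1/2
    X = np.where(rng.random((m, N)) < 0.5, 1, -1)        # irrelevant bits: uniform (assumed)
    X[:, :k] = np.where(rng.random((m, k)) < p, 1, -1)   # relevant bits: +1 with probability p
    y = np.where((X[:, :k] == 1).any(axis=1), 1, -1)     # k-literal monotone disjunction
    return X, y
```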

56 However we now give an explicit construction for the minimal method of choosing the weak hypothesis for which the number of iterations of greedy covering and InfoBoost grows linearly in k while the number of iterations of AdaBoost with Bias is quadratic in k. [sent-118, score-0.859]

57 The i-th instance xi is the i-th row of this matrix and the first N examples (rows) are labeled +1 and the label of the last row yN +1 is −1. [sent-121, score-0.188]

58 Clearly the literal XN is consistent with the labels and thus always has error 0 w. [sent-122, score-0.187]

59 But note that the disjunction of the last k literals is also consistent (for any k). [sent-126, score-0.199]
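A sketch of one example matrix with the properties just stated (this is an assumption for illustration; the paper's Figure 4 fixes the exact matrix and the initial distribution D1, which are not recoverable from the extraction): literal Xj covers exactly the first j positive examples, so XN is consistent, the disjunction of the last k literals is consistent, and xt is not covered by X1, . . . , Xt−1.

```python
import numpy as np

def adversary_matrix(N):
    X = -np.ones((N + 1, N), dtype=int)
    for i in range(N):                 # positive example x_{i+1} ...
        X[i, i:] = 1                   # ... is covered by X_{i+1}, ..., X_N
    y = np.ones(N + 1, dtype=int)
    y[N] = -1                          # the last row is the single negative example
    return X, y
```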

60 We will construct a distribution on the rows that gives high probability to the early rows (see Figure 4) and allows the following “minimal” choice of weak hypotheses: at iteration t, AdaBoost with Bias is given the weak hypothesis Xt. [sent-127, score-0.778]

61 This weak hypothesis will have error 1/2 − 1/(2k) (δ = 1/(2k)) w.r.t. [sent-128, score-0.499]

62 Contrary to our construction, the initial distribution for boosting applications is typically uniform. [sent-132, score-0.122]

63 Figure 4: The examples (rows of the matrix), the labels, and the distribution D1. [sent-141, score-0.19]

64 AdaBoost with Bias does two update steps at iteration t (constrain the error of ht to half and then, sequentially, the error of the constant hypothesis 1 to half). [sent-143, score-0.634]

65 D′t(i) = Dt(i) exp{−yi ht(xi) αt} / Zt and Dt+1(i) = D′t(i) exp{−yi α′t} / Z′t. [sent-144, score-0.436]

66 The Z’s are normalization factors, αt = (1/2) ln((1 − εt)/εt) and α′t = (1/2) ln(D′t(x≤N) / D′t(xN+1)). [sent-145, score-0.223]

67 The final hypothesis is the sign of the following linear combination: H(x) = Σ_{t=1}^T (αt ht(x) + α′t). [sent-146, score-0.302]
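A minimal sketch of one iteration of AdaBoost with Bias as described in the last few sentences (an illustration, not the authors' code): an AdaBoost step for ht followed sequentially by an AdaBoost step for the constant hypothesis 1.

```python
import numpy as np

def adaboost_with_bias_step(pred, y, D, eps_floor=1e-12):
    """pred: +/-1 predictions of h_t; y: +/-1 labels; D: current distribution D_t.
    Returns (alpha_t, alpha_prime_t, D_{t+1})."""
    # step 1: constrain the error of h_t to one half
    eps = float(np.clip(D[pred != y].sum(), eps_floor, 1 - eps_floor))
    alpha = 0.5 * np.log((1 - eps) / eps)
    Dp = D * np.exp(-y * pred * alpha)
    Dp = Dp / Dp.sum()                                  # divide by Z_t
    # step 2: constrain the error of the constant hypothesis 1 to one half
    eps_b = float(np.clip(Dp[y == -1].sum(), eps_floor, 1 - eps_floor))  # 1 errs exactly on negatives
    alpha_b = 0.5 * np.log((1 - eps_b) / eps_b)
    D_next = Dp * np.exp(-y * alpha_b)                  # divide by Z'_t after normalizing
    return alpha, alpha_b, D_next / D_next.sum()
```

The final hypothesis is then the sign of the accumulated Σt (αt ht(x) + α′t).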

68 For AdaBoost with Bias and t ≤ N, PrDt[Xt(xi) ≠ yi] = 1/2 − 1/(2k). [sent-148, score-0.121]

69 Since PrDt[Xt(xi) = yi] = Dt(xN+1) + Dt(x≤t) and Dt(xN+1) = 1/2, it suffices to show that Dt(x≤t) = 1/(2k) for t ≤ N. [sent-151, score-0.121]

70 Note that the example xt is not covered by any of the previous hypotheses X1, . . . , Xt−1. [sent-156, score-0.31]

71 Using the inductive assumption that PrDt′[Xt′(xi) ≠ yi] = 1/2 − 1/(2k) for t′ < t, one can show that αt = (1/2) ln((k+1)/(k−1)), Zt = (1/k)√((k−1)(k+1)), D′t(x≤N) = 1/2 + 1/(2(k+1)), D′t(xN+1) = 1/2 − 1/(2(k+1)), α′t = (1/2) ln((k+2)/k), and Z′t = (1/(k+1))√(k(k+2)). [sent-160, score-0.323]

72 For the described example set, initial distribution D1, and minimal choice of weak hypotheses, AdaBoost with Bias needs at least N iterations to construct a final hypothesis whose error with respect to D1 is below ε. [sent-163, score-0.663]

73 At the end of the iteration t, the examples xt+1 , . [sent-166, score-0.109]

74 , xN are not correctly classified by the past weak hypotheses X1, . . . , Xt. [sent-169, score-0.4]

75 In particular, the final linear combination evaluated at xN is H(xN) = Σ_{j=1}^t (αj Xj(xN) + α′j) = −Σ_{j=1}^t αj + Σ_{j=1}^t α′j = −(t/2) ln((k+1)/(k−1)) + (t/2) ln((k+2)/k) < 0. [sent-173, score-0.124]
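The final inequality can be checked directly (an added step, not in the extracted text): since

```latex
(k+2)(k-1) \;=\; k^{2}+k-2 \;<\; k^{2}+k \;=\; k(k+1)
\quad\Longrightarrow\quad
\ln\frac{k+2}{k} \;<\; \ln\frac{k+1}{k-1},
```

the negative term dominates, so sign(H(xN)) = −1 as stated in the next sentence.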

76 αj = − ln 2 k−1 2 k j=1 Thus sign(H(xN )) = −1 and the final hypothesis has error at least D1 (xN ) ≥ ε with respect to D1 . [sent-174, score-0.417]

77 To show a similar lower bound for plain AdaBoost we use the same example set and the following sequence of weak hypotheses X1 , 1, X2 , 1, . [sent-175, score-0.446]

78 For odd iteration numbers t the above proposition shows the error of the weak hypothesis is 1/2 − 1/(2k), and for even iteration numbers one can show that the constant hypothesis 1 has error 1/2 − 1/(2(k+1)). [sent-179, score-0.943]

79 4 InfoBoost and SemiBoost for one-sided weak hypotheses. Aslam [Asl00] proved the following upper bound on the training error of InfoBoost: Theorem 3. [sent-180, score-0.434]

80 The training error of the final hypothesis produced by InfoBoost is bounded by ∏_{t=1}^T Zt, where Zt = PrDt[ht(xi) = +1] √(1 − γt[+1]²) + PrDt[ht(xi) = −1] √(1 − γt[−1]²) and the edge γt[±1] = 1 − 2εt[±1]. [sent-181, score-0.316]

81 However, if ht is one-sided, InfoBoost gives the improved factor of √(1 − γt): Corollary 4. [sent-184, score-0.496]

82 So the second summand becomes 2√(PrDt[ht(xi) = −1, yi = +1] · PrDt[ht(xi) = −1, yi = −1]) = 2√(PrDt[yi = +1] PrDt[yi = −1]) · √(PrDt[ht(xi) = −1 | yi = +1] · PrDt[ht(xi) = −1 | yi = −1]) ≤ √(PrDt[ht(xi) = −1 | yi = +1]). [sent-195, score-0.269]

83 When the considered weak hypotheses are positively one-sided, then the trivial greedy covering algorithm (which simply chooses the set that covers the most uncovered positive examples) achieves the improved factor of 1 − γ, which means at most (1/γ) ln m iterations. [sent-197, score-0.847]
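A minimal sketch of this trivial greedy covering algorithm (an illustration, not the authors' code), restricted to positively one-sided literals:

```python
import numpy as np

def greedy_cover(X, y):
    """X: (m, N) matrix in {-1, +1}; y: +/-1 labels.  Only literals that never
    fire on a negative example are considered.  Returns the indices of the
    chosen literals; their disjunction covers all positive examples."""
    pos, neg = (y == 1), (y == -1)
    one_sided = np.flatnonzero(~(X[neg] == 1).any(axis=0))   # consistent with negatives
    if one_sided.size == 0:
        raise ValueError("no literal is consistent with all negative examples")
    uncovered = pos.copy()
    chosen = []
    while uncovered.any():
        gains = (X[uncovered][:, one_sided] == 1).sum(axis=0)  # newly covered positives
        if gains.max() == 0:
            raise ValueError("no one-sided literal covers the remaining positives")
        j = int(one_sided[int(np.argmax(gains))])
        chosen.append(j)
        uncovered &= (X[:, j] == -1)          # examples covered by X_j are removed
    return chosen
```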

84 By a careful analysis (not included), one can show that the factor for InfoBoost can be improved to 1 − γ, if all weak hypotheses are one-sided. [sent-198, score-0.46]

85 So in this case InfoBoost indeed matches the 1 − γ factor of the greedy covering algorithm. [sent-199, score-0.224]

86 Thus InfoBoost terminates in a single iteration and outputs a hypothesis that predicts +1 for any instance and InfoBoost cannot be used for constructing a cover. [sent-203, score-0.351]

87 Recall that the final hypothesis of InfoBoost is given by H(x) = Σ_{t=1}^T αt[ht(x)] ht(x). [sent-205, score-0.436]

88 This doesn’t seem to be a linear combination of hypotheses from H since the coefficients vary with the prediction of weak hypotheses. [sent-206, score-0.423]

89 However observe that αt[ht(x)] ht(x) = αt[+1] h+(x) + αt[−1] h−(x), where h±(x) = h(x) if h(x) = ±1 and 0 otherwise. [sent-207, score-0.436]

90 So the final hypothesis of InfoBoost and the new algorithm we will define in a moment is a bias plus a linear combination of the original weak learners in H. [sent-210, score-0.603]

91 We propose the following variant of AdaBoost (called SemiBoost): in each iteration execute one step of AdaBoost, but the chosen weak hypothesis must be a semi hypothesis of one of the original hypotheses h ∈ H which has a positive edge. [sent-211, score-1.171]
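A minimal sketch of the semi hypotheses and the edge test used by SemiBoost (an illustration; treating the edge of an abstaining hypothesis as Σi D(i) yi h(xi) is an assumption, consistent with γ = 1 − 2ε for ±1-valued hypotheses):

```python
import numpy as np

def semi_split(pred):
    """pred: +/-1 predictions of an original hypothesis h in H.
    Returns (h_plus, h_minus) with values in {+1, 0} and {-1, 0} respectively."""
    h_plus = np.where(pred == 1, pred, 0)
    h_minus = np.where(pred == -1, pred, 0)
    return h_plus, h_minus

def edge(semi_pred, y, D):
    """Edge of an abstaining (semi) hypothesis under distribution D; SemiBoost
    only accepts semi hypotheses whose edge is positive."""
    return float(np.sum(D * y * semi_pred))
```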

92 Also if all the chosen hypotheses are of the h+ type then the final hypothesis is a disjunction. [sent-213, score-0.517]

93 If hypotheses are chosen by smallest error (largest edge), then the greedy covering algorithm is simulated. [sent-214, score-0.484]

94 Analogously, if all the chosen hypotheses are of the h− type then one can show that the final hypothesis of SemiBoost is a conjunction. [sent-215, score-0.517]

95 Furthermore, two steps of SemiBoost (with hypothesis h+ in the first step followed by the sibling hypothesis h− in the second step) are equivalent to one step of InfoBoost with hypothesis h. [sent-216, score-0.868]

96 Finally we note that the final hypothesis of InfoBoost (or SemiBoost) is not well-defined when it includes both types of one-sided hypotheses, i.e. [sent-217, score-0.3]

97 Second, we allow infinite coefficients but interpret the final hypothesis as a version of a decision list [Riv87]: whenever more than one semi hypothesis with an infinite coefficient is non-zero on the current instance, the semi hypothesis with the lowest iteration number determines the label. [sent-223, score-0.952]

98 Once such a consistent decision list over some set of hypotheses ht and 1 has been found, it is easy to find an alternate linear combination of the same set of hypotheses (using linear programming) that maximizes the margin or minimizes the one-norm of the coefficient vector subject to consistency. [sent-224, score-1.018]
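A sketch of one concrete linear-programming formulation of the one-norm minimization mentioned here (an assumption about the formulation, not the paper's code; it requires scipy): split the coefficient vector as w = u − v with u, v ≥ 0, add a free bias b, and require margin-1 consistency yi(Σt wt ht(xi) + b) ≥ 1.

```python
import numpy as np
from scipy.optimize import linprog

def min_one_norm_consistent(Hmat, y):
    """Hmat: (m, T) matrix with Hmat[i, t] = h_t(x_i) in {-1, +1}; y: +/-1 labels.
    Returns (w, b) minimizing ||w||_1 subject to y_i (Hmat[i] @ w + b) >= 1."""
    y = np.asarray(y, dtype=float)
    m, T = Hmat.shape
    c = np.concatenate([np.ones(2 * T), [0.0]])              # minimize sum(u) + sum(v)
    yH = y[:, None] * Hmat
    A_ub = np.hstack([-yH, yH, -y[:, None]])                 # -y_i(H_i (u - v) + b) <= -1
    b_ub = -np.ones(m)
    bounds = [(0, None)] * (2 * T) + [(None, None)]          # u, v >= 0, b free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    u, v, b = res.x[:T], res.x[T:2 * T], res.x[-1]
    return u - v, b
```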

99 Conclusion: We showed that AdaBoost can require significantly more iterations than the simple greedy cover algorithm when the weak hypotheses are one-sided and gave a variant of AdaBoost that can readily exploit one-sidedness. [sent-225, score-0.623]

100 He encouraged us to analyze AdaBoost with Bias and suggested writing the final hypothesis of InfoBoost as a linear combination of semi hypotheses. [sent-228, score-0.355]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('infoboost', 0.474), ('ht', 0.436), ('adaboost', 0.347), ('hypothesis', 0.28), ('dt', 0.236), ('hypotheses', 0.217), ('weak', 0.183), ('zt', 0.16), ('prdt', 0.158), ('semiboost', 0.126), ('yi', 0.121), ('boosting', 0.105), ('covering', 0.103), ('ln', 0.101), ('disjunction', 0.096), ('greedy', 0.094), ('xt', 0.093), ('iterations', 0.075), ('bias', 0.072), ('nal', 0.071), ('xi', 0.071), ('xn', 0.07), ('pr', 0.069), ('literal', 0.069), ('coe', 0.065), ('iteration', 0.057), ('literals', 0.055), ('semi', 0.052), ('examples', 0.052), ('consistent', 0.048), ('aslam', 0.047), ('hatano', 0.047), ('error', 0.036), ('labeled', 0.036), ('positively', 0.035), ('manfred', 0.033), ('pool', 0.033), ('improved', 0.033), ('disjunctions', 0.032), ('learners', 0.031), ('plain', 0.03), ('di', 0.029), ('construction', 0.029), ('rows', 0.029), ('entropy', 0.029), ('summand', 0.027), ('constraint', 0.027), ('factor', 0.027), ('erent', 0.025), ('cients', 0.025), ('bits', 0.024), ('combination', 0.023), ('cruz', 0.023), ('su', 0.022), ('half', 0.022), ('zj', 0.022), ('trivial', 0.022), ('sign', 0.022), ('updated', 0.022), ('normalization', 0.021), ('cover', 0.021), ('minimal', 0.02), ('chosen', 0.02), ('motivate', 0.02), ('santa', 0.02), ('types', 0.02), ('log', 0.019), ('variant', 0.019), ('update', 0.019), ('chooses', 0.018), ('labels', 0.018), ('upper', 0.018), ('distribution', 0.017), ('uc', 0.017), ('negative', 0.017), ('protocol', 0.017), ('irrelevant', 0.017), ('weighted', 0.016), ('schapire', 0.016), ('corollary', 0.016), ('statement', 0.016), ('always', 0.016), ('bound', 0.016), ('avoids', 0.015), ('constrain', 0.015), ('constructs', 0.015), ('label', 0.015), ('maintains', 0.015), ('steps', 0.014), ('instance', 0.014), ('updates', 0.014), ('proposition', 0.014), ('algorithm', 0.014), ('acknowledgment', 0.014), ('enforcement', 0.014), ('formulae', 0.014), ('mimics', 0.014), ('sibling', 0.014), ('sequentially', 0.014), ('list', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

2 0.36219501 143 nips-2003-On the Dynamics of Boosting

Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire

Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1

3 0.25060609 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

Author: David C. Parkes, Satinder P. Singh

Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total longterm value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1

4 0.13722549 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

5 0.12653023 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

6 0.12096925 133 nips-2003-Mutual Boosting for Contextual Inference

7 0.098828018 148 nips-2003-Online Passive-Aggressive Algorithms

8 0.097888902 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

9 0.091166295 122 nips-2003-Margin Maximizing Loss Functions

10 0.088411845 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

11 0.084370919 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

12 0.080426708 124 nips-2003-Max-Margin Markov Networks

13 0.073440537 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

14 0.072381586 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

15 0.071874052 102 nips-2003-Large Scale Online Learning

16 0.071588002 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

17 0.06630829 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

18 0.062961884 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

19 0.059862576 170 nips-2003-Self-calibrating Probability Forecasting

20 0.058134113 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.176), (1, -0.029), (2, -0.049), (3, -0.275), (4, 0.093), (5, -0.2), (6, -0.032), (7, 0.111), (8, -0.074), (9, -0.057), (10, -0.21), (11, -0.104), (12, 0.212), (13, -0.258), (14, -0.145), (15, -0.009), (16, 0.108), (17, -0.152), (18, -0.105), (19, -0.061), (20, -0.201), (21, 0.009), (22, -0.148), (23, 0.098), (24, 0.049), (25, 0.059), (26, -0.099), (27, 0.034), (28, 0.07), (29, -0.085), (30, 0.046), (31, -0.014), (32, 0.043), (33, 0.002), (34, 0.054), (35, 0.038), (36, 0.026), (37, -0.046), (38, 0.016), (39, -0.09), (40, 0.044), (41, -0.016), (42, -0.069), (43, -0.032), (44, 0.043), (45, -0.107), (46, 0.033), (47, 0.039), (48, 0.05), (49, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97002488 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

2 0.83859766 143 nips-2003-On the Dynamics of Boosting

Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire

Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1

3 0.52193677 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

Author: David C. Parkes, Satinder P. Singh

Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total longterm value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1

4 0.41548482 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

5 0.34761682 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

6 0.29647133 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

7 0.29148629 133 nips-2003-Mutual Boosting for Contextual Inference

8 0.27653122 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

9 0.26148367 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

10 0.24709658 122 nips-2003-Margin Maximizing Loss Functions

11 0.24179348 102 nips-2003-Large Scale Online Learning

12 0.23339021 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

13 0.22735977 124 nips-2003-Max-Margin Markov Networks

14 0.22195333 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

15 0.21408321 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

16 0.21317038 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

17 0.2083876 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

18 0.20742653 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

19 0.19980119 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

20 0.19103087 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.054), (11, 0.021), (35, 0.055), (53, 0.101), (58, 0.248), (71, 0.106), (76, 0.042), (85, 0.109), (91, 0.117), (99, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92634362 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

same-paper 2 0.84472549 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

3 0.83642632 48 nips-2003-Convex Methods for Transduction

Author: Tijl D. Bie, Nello Cristianini

Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1

4 0.67107475 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

5 0.6647262 3 nips-2003-AUC Optimization vs. Error Rate Minimization

Author: Corinna Cortes, Mehryar Mohri

Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.66132319 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

7 0.65876842 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

8 0.65828615 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

9 0.65661311 143 nips-2003-On the Dynamics of Boosting

10 0.65585947 126 nips-2003-Measure Based Regularization

11 0.65564996 145 nips-2003-Online Classification on a Budget

12 0.65348929 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

13 0.65343273 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

14 0.65296221 113 nips-2003-Learning with Local and Global Consistency

15 0.65223294 78 nips-2003-Gaussian Processes in Reinforcement Learning

16 0.65097153 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

17 0.64968103 187 nips-2003-Training a Quantum Neural Network

18 0.6487425 189 nips-2003-Tree-structured Approximations by Expectation Propagation

19 0.64797807 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

20 0.64701337 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks