nips nips2003 nips2003-109 knowledge-graph by maker-knowledge-mining

109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection


Source: pdf

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. [sent-5, score-0.895]

2 Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. [sent-7, score-1.403]

3 We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. [sent-9, score-0.953]

4 1 Introduction Fast and robust face detection is an important computer vision problem with applications to surveillance, multimedia processing, and HCI. [sent-11, score-0.524]

5 Face detection is often formulated as a search and classification problem: a search strategy generates potential image regions and a classifier determines whether or not they contain a face. [sent-12, score-0.511]

6 When a brute-force search strategy is used, face detection is a rare event detection problem, in the sense that among the millions of image regions, only very few contain faces. [sent-14, score-1.214]
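To make the scale of that search concrete, a rough back-of-the-envelope sketch in Python; the 24x24 base window, unit stride, 1.25 scale factor, and image size are illustrative assumptions, not parameters taken from the paper:

```python
# Count candidate windows in a brute-force multi-scale scan of one image.
# Window size, stride, scale factor, and image size are illustrative
# assumptions; the point is only that the count reaches the millions.
def count_windows(width=640, height=480, base=24, stride=1, scale=1.25):
    total, size = 0, base
    while size <= min(width, height):
        total += ((width - size) // stride + 1) * ((height - size) // stride + 1)
        size = int(size * scale)
    return total

print(count_windows())  # a few million candidate windows, only a few faces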

7 The resulting classifier design problem is very challenging: The detection rate must be very high in order to avoid missing any rare events. [sent-15, score-0.617]

8 At the same time, the false positive rate must be very low (e.g., on the order of $10^{-6}$, given the millions of candidate windows per image). [sent-16, score-0.383]

9 In their seminal work [4], Viola and Jones proposed a face detection method based on a cascade of classifiers, illustrated in figure 1. [sent-20, score-1.092]

10 Most image regions are rejected quickly, resulting in very fast face detection performance. [sent-22, score-0.566]

11 There are three elements in the Viola-Jones framework: the cascade architecture, a rich over-complete set of rectangle features, and an algorithm based on AdaBoost for constructing ensembles of rectangle features in each classifier node. [sent-23, score-0.905]

12 Much of the recent work on face detection following Viola-Jones has explored alternative boosting algorithms such as FloatBoost [5], GentleBoost [6], and Asymmetric AdaBoost [7] (see [8] for a related method). [sent-24, score-0.603]

13 Figure 1: Illustration of the cascade architecture with n nodes $H_1, \ldots, H_n$, where node $H_i$ has detection rate $d_i$ and false positive rate $f_i$. A window rejected by any node is labeled non-face; only windows accepted by every node are labeled face. [sent-28, score-0.656]

14 This paper is motivated by the observation that the AdaBoost feature selection method is an indirect way to meet the learning goals of the cascade. [sent-29, score-0.515]

15 For example, weeks of computation are required to produce the final cascade in [4]. [sent-31, score-0.564]

16 In this paper we present a new cascade learning algorithm which uses direct forward feature selection to construct the ensemble classifiers in each node of the cascade. [sent-32, score-1.297]

17 We demonstrate empirically that our algorithm is two orders of magnitude faster than the Viola-Jones algorithm, and produces cascades which are indistinguishable in face detection performance. [sent-33, score-0.712]

18 Our results also suggest that a large portion of the effectiveness of the Viola-Jones detector should be attributed to the cascade design and the choice of the feature set. [sent-35, score-0.753]

19 2 Cascade Architecture for Rare Event Detection The learning goal for the cascade in figure 1 is the construction of a set of classifiers $\{H_i\}_{i=1}^{n}$. [sent-36, score-0.564]

20 Each $H_i$ is required to have a very high detection rate, but only a moderate false positive rate (e.g., $f_i \approx 0.5$). [sent-37, score-0.705]

21 If the $\{H_i\}$ can be constructed to produce independent errors, then the overall detection rate $d$ and false positive rate $f$ for the cascade are given by $d = \prod_{i=1}^{n} d_i$ and $f = \prod_{i=1}^{n} f_i$, respectively. [sent-41, score-1.419]

22 In a hypothetical example, a 20-node cascade with $d_i = 0.999$ and $f_i = 0.5$ achieves an overall detection rate of $0.999^{20} \approx 0.98$ and an overall false positive rate of $0.5^{20} \approx 9.5 \times 10^{-7}$. [sent-42, score-0.667]
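As a quick numerical check of those products, a minimal sketch using the hypothetical per-node rates above:

```python
# Overall cascade rates under the independent-errors assumption:
# d = prod(d_i) and f = prod(f_i), here with every node sharing the
# hypothetical values d_i = 0.999 and f_i = 0.5 from the example above.
d_i, f_i, n = 0.999, 0.5, 20
d, f = d_i ** n, f_i ** n
print(f"overall detection rate d = {d:.4f}")       # ~0.9802
print(f"overall false positive rate f = {f:.2e}")  # ~9.54e-07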

23 As in [4], the overall cascade learning method in this paper is a stage-wise, greedy feature selection process. [sent-47, score-0.861]

24 In moving from node Hi to Hi+1 during training, negative examples that were classified successfully by the cascade are discarded and replaced with new ones, using the standard bootstrapping approach from [1]. [sent-52, score-0.722]
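A sketch of that bootstrapping step; `cascade_accepts` and `sample_nonface_patch` are hypothetical helper names for illustration, not functions from the paper or from [1]:

```python
# Rebuild the negative set for training the next node: negatives the
# current cascade already rejects are discarded, and only non-face
# patches it still (wrongly) accepts -- its current false positives --
# are kept. Helper names are illustrative placeholders.
def bootstrap_negatives(cascade_accepts, sample_nonface_patch, n_needed):
    negatives = []
    while len(negatives) < n_needed:
        patch = sample_nonface_patch()
        if cascade_accepts(patch):  # still fools the current cascade
            negatives.append(patch)
    return negatives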

25 The cascade architecture in figure 1 should be suitable for other rare event problems, such as network intrusion detection in which an attack constitutes a few packets out of tens of millions. [sent-54, score-1.26]

26 Recent work in that community has also explored a cascade approach [9]. [sent-55, score-0.592]

27 For each node in the cascade architecture, given a training set $\{x_i, y_i\}$, the learning objective is to select a set of weak classifiers $\{h_t\}$ from a total set of $F$ features and combine them into an ensemble $H$ with a high detection rate $d$ and a moderate false positive rate $f$. [sent-56, score-2.053]

28–32 Figure 2: Diagram for training one node in the cascade architecture; (a) is the Viola-Jones method and (b) is the proposed method. F and D are the false positive rate and detection rate goals, respectively, while f and d are the current ensemble rates. In (a), all weak classifiers are trained, the feature with minimum weighted error is added to the ensemble, and the ensemble threshold is adjusted to meet the detection rate goal (d > D?), looping while f >= F. In (b), all weak classifiers are trained once; then, while d <= D, the feature that maximizes the ensemble detection rate is added, and otherwise the feature that minimizes its false positive rate, looping while f >= F or d <= D. [sent-60, score-0.726]

33 A weak classifier is formed from a rectangle feature by applying the feature to the input pattern and thresholding the result. [sent-62, score-0.576]
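For illustration, a minimal sketch of one such weak classifier: a two-rectangle contrast feature evaluated on an integral image and thresholded. The specific geometry and function names are assumptions for illustration; the paper uses the full Viola-Jones rectangle feature set.

```python
import numpy as np

def integral_image(img):
    # Cumulative row/column sums: any rectangle sum then costs 4 lookups.
    return np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    # Sum of img[top:top+h, left:left+w] computed from integral image ii.
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def weak_classify(ii, top, left, h, w, theta, polarity=1):
    # Two-rectangle feature: contrast between left and right halves of a
    # rectangle, thresholded into a binary vote (illustrative geometry only).
    f = rect_sum(ii, top, left, h, w // 2) - rect_sum(ii, top, left + w // 2, h, w // 2)
    return 1 if polarity * f >= polarity * theta else 0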

34 In [4], an algorithm based on AdaBoost trains weak classifiers, adds them to the ensemble, and computes the ensemble weights. [sent-64, score-0.417]

35 AdaBoost [10] is an iterative method for obtaining an ensemble of weak classifiers by evolving a distribution of weights, $D_t$, over the training data. [sent-65, score-0.481]

36 After $T$ rounds of boosting, the decision of the ensemble is defined as $H(x) = 1$ if $\sum_{t=1}^{T} \alpha_t h_t(x) \ge \theta$ and $H(x) = 0$ otherwise, where the $\alpha_t$ are the standard AdaBoost ensemble weights and $\theta$ is the threshold of the ensemble. [sent-67, score-0.609]

37 This threshold is adjusted to meet the detection rate goal. [sent-68, score-0.637]

38 More features are then added if necessary to meet the false positive rate goal. [sent-69, score-0.634]
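A sketch of that threshold computation: $\theta$ is set directly from the ensemble's scores on the positive examples so that the detection rate goal is met (the score array and names are assumed inputs, not identifiers from the paper):

```python
import math

def adjust_threshold(pos_scores, d_goal):
    # Lower theta until at least a fraction d_goal of positive examples
    # satisfy sum_t alpha_t * h_t(x) >= theta: take the k-th largest
    # ensemble score among positives, with k = ceil(d_goal * N).
    s = sorted(pos_scores, reverse=True)
    k = math.ceil(d_goal * len(s))
    return s[k - 1]

# e.g. theta = adjust_threshold(ensemble_scores_of_positives, d_goal=0.999)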

39 The process of sequentially adding features which individually minimize the weighted error is at best an indirect way to meet the learning goals for the ensemble. [sent-71, score-0.413]

40 For example, the false positive goal is relatively easy to meet, compared to the detection rate goal which is near 100%. [sent-72, score-0.705]

41 As a consequence, the threshold θ produced by AdaBoost must be discarded in favor of a threshold computed directly from the ensemble performance. [sent-73, score-0.355]

42 Beyond these concerns is a more basic question about the cascade learning problem: What is the role of boosting in forming an effective ensemble? [sent-76, score-0.667]

43 Our hypothesis is that the overall success of the method depends upon having a sufficiently rich feature set, which defines the space of possible weak classifiers. [sent-77, score-0.359]

44 3 Direct Feature Selection Method We propose a new cascade learning algorithm based on forward feature selection [11]. [sent-81, score-0.883]

45 Pseudo-code of the algorithm for building an ensemble classifier for a single node is given in table 1. (A feature and its corresponding classifier will be used interchangeably.) [sent-82, score-0.562]

46 Given $d$, the minimum detection rate, and $f$, the maximum false positive rate. [sent-85, score-0.705]

47 For every feature $j$, train a weak classifier $h_j$ whose false positive rate is $f$. [sent-87, score-0.58]

48 While $d_t < d$ or $f_t > f$: (a) if $d_t < d$, find the feature $k$ such that adding it to $H$ gives the new ensemble the largest detection rate $d_{t+1}$. [sent-96, score-0.995]

49 (b) else, find the feature $k$ such that adding it to $H$ gives the new ensemble the smallest false positive rate $f_{t+1}$. [sent-97, score-0.845]

50 The decision of the ensemble classifier is formed by a majority vote of the weak classifiers in $H$, i.e., $H(x) = 1$ if $\sum_{h_j \in H} h_j(x) \ge \theta$ and $H(x) = 0$ otherwise. [sent-100, score-0.547]

51 Table 1: The direct feature selection method for building an ensemble classifier. [sent-104, score-0.608]
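A compact sketch of the table 1 procedure in Python, assuming the weak classifiers have already been trained once and their 0/1 votes on the training set cached in a matrix; that representation and the fixed majority-vote threshold are simplifying assumptions, not details fixed by the paper:

```python
import numpy as np

def direct_feature_selection(votes, y, d_goal, f_goal, theta=1.0):
    # votes: (F, N) 0/1 outputs of the F pre-trained weak classifiers on
    # the N training examples; y: (N,) labels, 1 = face, 0 = non-face.
    # The ensemble predicts face when the summed votes reach theta.
    y = np.asarray(y)
    F, N = votes.shape
    chosen, tally = [], np.zeros(N)
    d_t, f_t = 0.0, 1.0
    while (d_t < d_goal or f_t > f_goal) and len(chosen) < F:
        best_j, best_val = None, -np.inf
        for j in range(F):
            if j in chosen:
                continue
            pred = (tally + votes[j]) >= theta
            # (a) while short of the detection goal, maximize d;
            # (b) otherwise minimize the false positive rate.
            val = pred[y == 1].mean() if d_t < d_goal else -pred[y == 0].mean()
            if val > best_val:
                best_j, best_val = j, val
        chosen.append(best_j)
        tally += votes[best_j]
        pred = tally >= theta
        d_t, f_t = pred[y == 1].mean(), pred[y == 0].mean()
    return chosen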

52 The first step in our algorithm is to train each of the weak classifiers to meet the false positive rate goal for the ensemble. [sent-107, score-0.731]

53 In each iteration, we consider adding each possible classifier to the ensemble and select the one which makes the largest improvement to the ensemble performance. [sent-110, score-0.6]

54 The key difference is that we train the weak classifiers only once per node, while in the Viola-Jones method they are trained once for each feature in the cascade. [sent-114, score-0.347]

55 For the cascade of 32 nodes with 4297 features in [4], the difference in learning time will be dramatic. [sent-119, score-0.7]

56 The decision of the ensemble classifier is formed by a weighted average of the weak classifiers in $H$. [sent-138, score-0.446]

57 Decrease the threshold θ until the ensemble reaches the detection rate goal. [sent-139, score-0.745]

58 In the first experiment we constructed three face detection cascades. [sent-154, score-0.524]

59 One cascade used the direct feature selection method from table 1. [sent-155, score-0.927]

60 The second cascade used the weight setting algorithm in table 2. [sent-156, score-0.651]

61 The third cascade used our implementation of the Viola-Jones algorithm. [sent-158, score-0.564]

62 The third cascade was stopped after 28 nodes because the AdaBoost based training algorithm could not meet the learning goal. [sent-160, score-0.894]

63–64 With 200 features, when the detection rate is 99.9%, the AdaBoost ensemble's false positive rate is larger than 97%. [sent-162, score-0.383]

65 We constructed the ROC curves by removing nodes from the cascade to generate points with increasing detection and false positive rates. [sent-165, score-1.285]
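A minimal sketch of that construction under the independence assumption of section 2: each cascade prefix (the cascade with trailing nodes removed) contributes one ROC point, with the per-node rates as assumed inputs:

```python
import numpy as np

def cascade_roc_points(d_nodes, f_nodes):
    # One operating point per cascade prefix: dropping nodes from the
    # end raises both the detection rate (prod of d_i) and the false
    # positive rate (prod of f_i), tracing out an ROC curve.
    d = np.cumprod(d_nodes)
    f = np.cumprod(f_nodes)
    return list(zip(f, d))  # ordered from 1-node cascade to full cascade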

66 The second experiment explored the ability of the rectangle feature set to meet the detection rate goal for the ensemble on a difficult node. [sent-167, score-1.151]

67–68 Figure 3(b) shows the false positive and detection rates for the ensemble (i.e., one node in the cascade architecture) as a function of the number of features that were added to the ensemble. [sent-170, score-0.753]

69 The training set used was the bootstrapped training set for the 19th node in the cascade, which was trained by the Viola-Jones method. [sent-171, score-0.818]

70 Even for this difficult learning task, the algorithm can improve the detection rate from about 0. [sent-172, score-0.471]

71 Our hypothesis is that the strength of this feature set in the context of the cascade architecture is the key to its success. (Footnote 3: We found that the criterion for automatically finding detection errors in [6] was too loose.) [sent-176, score-1.137]

72 This criterion yielded higher detection rates and lower false positive rates than manual counting. [sent-177, score-0.583]

73 Figure 3: (a) shows ROC curves of the proposed method and the Viola-Jones method, and (b) shows the trend of detection and false positive rates as more features are combined in one node. [sent-185, score-0.793]

74 We conducted a third experiment in which we focused on learning one node in the cascade architecture. [sent-187, score-0.667]

75 Figure 4 shows ROC curves of the Viola-Jones, direct feature selection, and weight setting methods for one node of the cascade. [sent-188, score-0.392]

76 In the region of high detection rates (detection rate > 99%), our algorithms yield better ROC curve performance than the Viola-Jones method. [sent-194, score-0.444]

77 5 Related Work A survey of face detection methods can be found in [12]. [sent-196, score-0.496]

78 We restrict our attention here to frontal face detection algorithms related to the cascade idea. [sent-197, score-1.083]

79 Other cascade structures have been constructed for SVM classifiers. [sent-201, score-0.592]

80 An alternative cascade framework for SVM classifiers is proposed by Heisele et al. [sent-205, score-0.564]

81 Keren et al. proposed another object detection method which consists of a series of antiface templates [15]. [sent-209, score-0.394]

82 In an extension of their original work [7], Viola and Jones proposed an asymmetric AdaBoost algorithm in which false negatives are penalized more than false positives. [sent-217, score-0.477]

83 This is an interesting attempt to incorporate the rare event observation more explicitly into their framework. [sent-218, score-0.695]

84 6 Conclusions Face detection is a canonical example of a rare event detection task, in which target patterns occur with much lower frequency than non-targets. [sent-230, score-0.895]

85 It results in a challenging classifier design problem: The detection rate must be very high in order to avoid missing any rare events and the false positive rate must be very low to dodge the flood of non-events. [sent-231, score-1.042]

86 A cascade classifier architecture is well-suited to rare event detection. [sent-232, score-0.907]

87 The Viola-Jones face detection framework consists of a cascade architecture, a rich overcomplete feature set, and a learning algorithm based on AdaBoost. [sent-233, score-1.274]

88 We have demonstrated that a simpler direct algorithm based on forward feature selection can produce cascades of similar quality with two orders of magnitude less computation. [sent-234, score-0.488]

89 This is because the learning goal is a highly-skewed tradeoff between detection rate and false positive rate which does not fit naturally into the weighted error framework of AdaBoost. [sent-236, score-0.861]

90 Our experiments suggest that the feature set and cascade structure in the Viola-Jones framework are the key elements in the success of the method. [sent-237, score-0.746]

91 Three issues that we plan to explore in future work are: the necessary properties for feature sets, global feature selection methods, and the incorporation of search into the cascade framework. [sent-238, score-1.033]

92 The rectangle feature set seems particularly well-suited for face detection. [sent-239, score-0.415]

93 What general properties must a feature set possess to be successful in the cascade framework? [sent-240, score-0.723]

94 In other rare event detection tasks where a large set of diverse features is not naturally available, methods to create such a feature set may be useful. [sent-241, score-0.818]

95 Finally, the current detection method relies on a brute-force search strategy for generating candidate regions. [sent-246, score-0.428]

96 We plan to explore the cascade architecture in conjunction with more general interest operators, such as those defined in [18, 20]. [sent-247, score-0.656]

97 A statistical model for 3d object detection applied to faces and cars. [sent-265, score-0.416]

98 Rapid object detection using a boosted cascade of simple features. [sent-271, score-0.959]

99 Empirical analysis of detection cascades of boosted classifiers for rapid object detection. [sent-291, score-0.485]

100 Feature reduction and hierarchy of classifiers for fast object detection in video images. [sent-351, score-0.362]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cascade', 0.564), ('detection', 0.322), ('ensemble', 0.273), ('false', 0.207), ('adaboost', 0.176), ('face', 0.174), ('rare', 0.173), ('meet', 0.165), ('feature', 0.159), ('classi', 0.15), ('rate', 0.122), ('weak', 0.117), ('hi', 0.113), ('selection', 0.106), ('node', 0.103), ('er', 0.092), ('architecture', 0.092), ('viola', 0.092), ('cascades', 0.09), ('ers', 0.089), ('features', 0.086), ('rectangle', 0.082), ('boosting', 0.079), ('event', 0.078), ('roc', 0.073), ('floatboost', 0.062), ('violajones', 0.062), ('curves', 0.06), ('training', 0.059), ('positive', 0.054), ('faces', 0.054), ('nodes', 0.05), ('demanding', 0.049), ('gure', 0.047), ('jones', 0.045), ('search', 0.045), ('sequentially', 0.045), ('image', 0.042), ('carmichael', 0.042), ('classifiers', 0.042), ('dodge', 0.042), ('keren', 0.042), ('ood', 0.042), ('owchart', 0.042), ('hj', 0.041), ('orders', 0.041), ('object', 0.04), ('train', 0.039), ('direct', 0.038), ('cvpr', 0.038), ('pattern', 0.037), ('asymmetric', 0.036), ('rehg', 0.036), ('heisele', 0.036), ('ensembles', 0.036), ('lienhart', 0.036), ('rowley', 0.036), ('ht', 0.035), ('weighted', 0.034), ('bootstrapped', 0.033), ('boosted', 0.033), ('weight', 0.032), ('method', 0.032), ('dt', 0.032), ('intrusion', 0.031), ('detector', 0.03), ('adding', 0.03), ('faster', 0.029), ('indistinguishable', 0.029), ('bootstrapping', 0.029), ('millions', 0.029), ('stopped', 0.029), ('goals', 0.029), ('strategy', 0.029), ('table', 0.028), ('regions', 0.028), ('threshold', 0.028), ('constructed', 0.028), ('vision', 0.028), ('rich', 0.028), ('explored', 0.028), ('voting', 0.027), ('vote', 0.027), ('forward', 0.027), ('algorithm', 0.027), ('intelligence', 0.027), ('majority', 0.026), ('discarded', 0.026), ('detecting', 0.025), ('ft', 0.025), ('incorporated', 0.025), ('add', 0.024), ('forming', 0.024), ('indirect', 0.024), ('ieee', 0.024), ('select', 0.024), ('frontal', 0.023), ('scales', 0.023), ('success', 0.023), ('formed', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

2 0.26678348 133 nips-2003-Mutual Boosting for Contextual Inference

Author: Michael Fink, Pietro Perona

Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. 
While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). 
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. 
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex

3 0.23347446 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

4 0.18910743 143 nips-2003-On the Dynamics of Boosting

Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire

Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove Rätsch and Warmuth’s conjecture that AdaBoost may fail to converge to a maximal-margin combined classifier when given a ‘non-optimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1

5 0.16230819 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman

Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1

6 0.13722549 41 nips-2003-Boosting versus Covering

7 0.098878354 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors

8 0.0985962 181 nips-2003-Statistical Debugging of Sampled Programs

9 0.096675463 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

10 0.092602558 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

11 0.090199038 73 nips-2003-Feature Selection in Clustering Problems

12 0.087600589 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

13 0.085528515 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

14 0.084375612 53 nips-2003-Discriminating Deformable Shape Classes

15 0.081066161 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

16 0.080142118 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification

17 0.07593894 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

18 0.068235591 51 nips-2003-Design of Experiments via Information Theory

19 0.063787602 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees

20 0.063611686 1 nips-2003-1-norm Support Vector Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.217), (1, -0.076), (2, 0.051), (3, -0.269), (4, -0.16), (5, -0.288), (6, -0.112), (7, -0.034), (8, -0.067), (9, 0.058), (10, -0.149), (11, -0.097), (12, 0.085), (13, -0.22), (14, 0.019), (15, -0.037), (16, -0.002), (17, 0.039), (18, 0.083), (19, 0.089), (20, -0.021), (21, -0.048), (22, 0.019), (23, -0.05), (24, -0.006), (25, -0.017), (26, 0.009), (27, 0.031), (28, -0.018), (29, -0.022), (30, -0.062), (31, -0.012), (32, -0.04), (33, 0.021), (34, -0.097), (35, 0.01), (36, 0.049), (37, 0.033), (38, 0.006), (39, 0.069), (40, -0.014), (41, -0.015), (42, 0.037), (43, 0.019), (44, -0.004), (45, 0.055), (46, -0.125), (47, -0.008), (48, -0.065), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96812135 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin


2 0.85006618 133 nips-2003-Mutual Boosting for Contextual Inference

Author: Michael Fink, Pietro Perona

Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. 
While most object detection applications extract a local window containing the object and discard the remaining image (including the object's positional information), Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects' relative positions in the training images. As we have previously stated, the detected objects might lie at various levels of a compositional hierarchy (e.g., complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts, and scenes, enabling any compositional structure of the data to emerge naturally. We term any contextual reference that is not directly grounded in the basic features a cross-referencing of objects (footnote 6).

Footnote 2: The most efficient size of the contextual neighborhoods might vary, from the immediate vicinity to the entire image, and therefore should be learned empirically.
Footnote 3: Images without target objects (see the experimental section below).
Footnote 4: Unlike the weak learners, the intermediate strong learners do not apply a threshold.
Footnote 5: In order to reduce the number of detection-map integral-image recalculations, these maps might be updated every k (e.g., 50) iterations rather than at each iteration.
Footnote 6: Scenes can be cross-referenced as well if scene labels are available (office/lab, etc.).

Input:
    H0+/H0-: positive / negative raw images
    Cm[i]: position of instance i of object m = 1,..,M in image H0+
Initialization:
    initialize the boosting weights of all instances i of each object m to 1
    initialize the detection maps Hm+(0)/Hm-(0) to 0
For t = 1,…,T:
    For m = 1,..,M and n = 0,..,M:
        (A) cut out and downscale local (n = 0) or contextual (n > 0) windows WINm of the instances i of object m (at Cm[i]) from all existing images Hn(t-1)
    For m = 1,..,M:
        normalize the boosting weights of the instances of object m [1]
        (B1 & B2) select the map Hn*(t-1) and the weak learner WLmn* that minimize the error on WINm
        decrease the boosting weights of the instances that WLmn* labeled correctly [1]
        (C) DetectionLayermn* ← WLmn*(Hn*(t-1))
        calculate αm(t), the weak-learner contribution factor, from the empirical error [1]
        (D) update the m-detection map: Hm(t) ← Hm(t-1) + αm(t) · DetectionLayermn*
Return the strong learner ABm(T), including WLmn*(1,..,T) and αm(1,..,T) (m = 1,..,M)

Figure 4: Mutual Boosting diagram and pseudocode. Each raw image H0 is analyzed by M object-detection maps Hm, m = 1,..,M, updated by iterating through four steps: (A) cut out and downscale from the existing maps Hn(t-1), n = 0,..,M, a local (n = 0) or contextual (n > 0) PxP window containing a neighborhood of object m; (B1 & B2) select the best-performing map Hn* and weak learner WLmn* that optimize detection of object m; (C) run WLmn* on the Hn* map to generate a new binary m-detection layer; (D) add the m-detection layer to the existing detection map Hm. [1] Standard AdaBoost stages are not elaborated.

To maintain local and global natural-scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects, and by centering the local and contextual windows of the positive and negative examples on the object positions of the positive images (see Figure 2). A rough code sketch of this training loop follows below.
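The sketch below mirrors the loop above under strong simplifications: the weak learner is a crude stand-in (a threshold on mean window intensity) rather than a rectangle contrast stump, the helper names and parameters are ours, and nothing here should be read as the authors' implementation. Its point is the channel-selection structure, in which channel 0 is the raw image and channels 1,..,M are the evolving detection maps:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def crop(chan, cy, cx, r):
        # Zero-padded square crop of half-width r centred on (cy, cx).
        p = np.pad(chan, r)
        return p[cy:cy + 2 * r, cx:cx + 2 * r]

    def mutual_boost(raw, pos, neg, M=2, T=50, r=12):
        # raw: list of 2-D float images; pos[m] / neg[m]: lists of
        # (image_index, y, x) positive / negative positions for object m.
        H = [np.zeros((M,) + im.shape) for im in raw]   # M maps per image
        samples = [pos[m] + neg[m] for m in range(M)]
        labels = [np.array([1] * len(pos[m]) + [0] * len(neg[m]))
                  for m in range(M)]
        w = [np.ones(len(samples[m])) for m in range(M)]
        strong = [[] for _ in range(M)]
        for t in range(T):
            for m in range(M):
                w[m] /= w[m].sum()                      # normalize weights
                best = None
                for n in range(M + 1):                  # channel 0 = raw
                    feats = np.array([                  # step (A)
                        crop(raw[i] if n == 0 else H[i][n - 1], y, x, r).mean()
                        for i, y, x in samples[m]])
                    theta = np.median(feats)
                    pred = (feats > theta).astype(int)
                    err = w[m] @ (pred != labels[m])
                    if best is None or err < best[0]:   # step (B1 & B2)
                        best = (err, n, theta, pred)
                err, n_star, theta, pred = best
                err = np.clip(err, 1e-6, 1 - 1e-6)
                alpha = 0.5 * np.log((1 - err) / err)
                # AdaBoost reweighting: shrink correct, grow incorrect
                w[m] *= np.exp(-alpha * np.where(pred == labels[m], 1, -1))
                for i in range(len(raw)):               # steps (C) and (D)
                    chan = raw[i] if n_star == 0 else H[i][n_star - 1]
                    layer = (uniform_filter(chan, 2 * r) > theta).astype(float)
                    H[i][m] += alpha * layer
                strong[m].append((n_star, theta, alpha))
        return strong, H

As in the pseudocode, recomputing a detection map costs one dense evaluation of the newly chosen weak learner on its selected channel, so detecting M objects per iteration is no more expensive than running M independent AdaBoost detectors.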
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5):

• Features could be used to directly detect parts and objects (A & B)
• Objects could be used to detect other (or identical) objects in the image (C)
• Parts could be used to detect other (or identical) nearby parts (D & E)
• Parts could be used to detect objects (F)
• Objects could be used to detect parts

Figure 5: A-F. Emerging features of eyes, mouths, and faces (presented on windows of raw images for legibility): A. eye feature from the raw image; B. face feature from the raw image; C. face feature from the face detection image; D. eye feature from the eye detection image; E. mouth feature from the eye detection image; F. face feature from the mouth detection image. The windows' scale is defined by the detected object size and by the map mode (local or contextual). C: faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned.

5 Experiments

In order to test the contribution of the Mutual Boosting process, we focused on the detection of objects in what we term a face-scene (right eye, left eye, nose, mouth, and face). We chose to perform contextual detection in the face-scene for two main reasons. First, as detailed in Figure 5, face-scenes demonstrate a range of potential part and object cross-references. Second, faces have been the focus of object detection research for many years, enabling a systematic comparison of results.

Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve, independently trained object detectors using local windows. Face-scene images were downloaded from the web and manually labeled (footnote 7). Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local-window AdaBoost and contextual-window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows (footnote 8; see Figure 3).

Footnote 7: By following CMU database conventions (R-eye, L-eye, nose & mouth positions) we derive both the local window position and the relative position of objects in the image.
Footnote 8: Local windows were created by downscaling objects to 25x25 grids.

Figure 6: A. Two examples from the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs (detection probability Pd versus false-alarm probability Pfa) on the CMU/MIT face database.

Test-image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C & D). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-(T). Figure 6B demonstrates the superiority of Mutual Boosting over grounded-feature AdaBoost.

Our second experiment was aimed at assessing the performance of Mutual Boosting as the variance of the detected configurations changes. Assuming a normal distribution of face configurations, we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths, and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1, or 4, and generated 100 artificial configurations by positioning four contrasting rectangles at the estimated positions of the facial components; a sketch of this sampling step follows below.
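Sampling such artificial configurations could look like the following sketch; this is our own illustration, and the mean vector and covariance below are placeholders rather than the statistics estimated from the labeled set:

    import numpy as np

    rng = np.random.default_rng(0)

    # Four facial parts (R-eye, L-eye, nose, mouth) as an 8-vector of
    # (y, x) coordinates; in the experiment these statistics are estimated
    # from the labeled face configurations (placeholder values here).
    mu = np.array([30., 20., 30., 44., 42., 32., 52., 32.])
    cov = np.diag([4., 4., 4., 4., 6., 6., 6., 6.])

    def sample_configs(scale, n=100):
        # Multiplying the covariance by 0.25, 1 or 4 yields configurations
        # that are more rigid or more variable than the natural statistics.
        return rng.multivariate_normal(mu, scale * cov, size=n).reshape(n, 4, 2)

    rigid, natural, loose = (sample_configs(s) for s in (0.25, 1.0, 4.0))

Each sampled configuration then drives the placement of the four contrasting rectangles that stand in for the facial components.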
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations (footnote 9; Figure 7).

Figure 7: A. Artificial face configurations with increasing covariance (COV 0.25, 1.00, 4.00). B. MB and AB equal-error-rate performance on configurations with varying covariance (sigma = 0.25, 1.00, 4.00) as a function of boosting iterations.

Footnote 9: In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration.

6 Discussion

In evaluating the performance of Mutual Boosting, it should be emphasized that we did not implement the VJ cascade approach; we therefore only attempt to demonstrate that the power of a single AdaBoost learner can be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects at multiple scales. For simplicity, the scale of neighboring objects and parts was assumed to be fixed, so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale, a single m-detection map will not suffice. However, by transforming each m-detection image into an m-detection cube (having scale as the third dimension), multi-scale context detection can be achieved (footnote 10). The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) ensures that the running time with varying-scale context is only doubled.

Footnote 10: By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double the 4 operations required in the integral-image case).

It should be noted that the face-scene is highly structured and is therefore a good candidate for demonstrating Mutual Boosting; however, as suggested by Figure 7B, Mutual Boosting can handle highly varying configurations, and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 1 (footnote 11). Notice that Mutual Boosting does not require a priori knowledge of the compositional structure but rather permits structure to emerge naturally in the cross-referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the detection rate of the entire object set, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained.

Is Mutual Boosting specific to image object detection? Indeed, it requires labeled input of multiple objects in a scene, supplying a local description of the objects as well as information on their contextual mutual positioning. But these criteria are shared by other complex

3 0.82279813 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

4 0.73572403 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman

Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1

5 0.65360898 143 nips-2003-On the Dynamics of Boosting

Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire

Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove Rätsch and Warmuth’s conjecture that AdaBoost may fail to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1

6 0.56063008 181 nips-2003-Statistical Debugging of Sampled Programs

7 0.54751092 3 nips-2003-AUC Optimization vs. Error Rate Minimization

8 0.49046534 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

9 0.4564386 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors

10 0.45284784 41 nips-2003-Boosting versus Covering

11 0.44357482 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

12 0.43515527 53 nips-2003-Discriminating Deformable Shape Classes

13 0.41929212 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

14 0.41234851 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

15 0.40878493 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

16 0.40401399 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

17 0.38801843 187 nips-2003-Training a Quantum Neural Network

18 0.38462651 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification

19 0.37152871 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

20 0.35200185 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.068), (11, 0.038), (25, 0.179), (30, 0.015), (35, 0.065), (53, 0.073), (66, 0.012), (71, 0.114), (76, 0.052), (82, 0.011), (85, 0.135), (91, 0.127)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90560138 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

2 0.86332142 12 nips-2003-A Model for Learning the Semantics of Pictures

Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon

Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allow us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1

3 0.78788531 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman

Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1

4 0.77422726 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

5 0.77281022 3 nips-2003-AUC Optimization vs. Error Rate Minimization

Author: Corinna Cortes, Mehryar Mohri

Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.

1 Motivation

In many applications, the overall classification error rate is not the most pertinent performance measure; criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine's ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable.

A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8] (footnote 1). However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate.

Footnote *: This author's new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com.
Footnote 1: The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14].

Figure 1: An example of an ROC curve (AUC = 0.718), plotting the true positive rate (correctly classified positive / total positive) against the false positive rate (incorrectly classified negative / total negative). The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context.
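To illustrate the threshold sweep that traces an ROC curve, here is a small sketch of our own; the trapezoidal area under the resulting points equals the Wilcoxon-Mann-Whitney value mentioned in footnote 1 when there are no ties:

    import numpy as np

    def roc_points(pos_scores, neg_scores):
        # Sweep the decision threshold from the most positive output to the
        # most negative, recording (false positive rate, true positive rate).
        pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
        pts = [(0.0, 0.0)]                  # everything classified negative
        for t in np.sort(np.concatenate([pos, neg]))[::-1]:
            pts.append((np.mean(neg >= t), np.mean(pos >= t)))
        return pts                          # ends at (1.0, 1.0)

    fpr, tpr = zip(*roc_points([0.9, 0.8, 0.3], [0.7, 0.2]))
    print(np.trapz(tpr, fpr))               # 0.8333..., the AUC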
In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate (footnote 2). We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.

Footnote 2: An attempt in that direction was made by [15], but, unfortunately, the authors' analysis and the result are both wrong.

2 Definition and properties of the AUC

Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. 1 shows an example of an ROC curve. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification, as shown more formally by Lemma 1 below.

Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1_X the indicator function of a set X.

Lemma 1 ([8]) Let c be a fixed classifier. Let x_1, ..., x_m be the output of c on the positive examples and y_1, ..., y_n its output on the negative examples. Then the AUC, A, associated to c is given by

    A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i > y_j},    (1)

that is, the value of the Wilcoxon-Mann-Whitney statistic [8].

Proof. The proof is based on the observation that the AUC value is exactly the probability P(X > Y), where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8].

Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC.
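Lemma 1 translates directly into a few lines; this sketch of ours assumes the strictly ordered (tie-free) outputs of the lemma's setting:

    import numpy as np

    def auc_wmw(x, y):
        # Eq. (1): the fraction of positive/negative pairs (x_i, y_j)
        # ranked in the correct order.
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(np.mean(x[:, None] > y[None, :]))

    print(auc_wmw([0.9, 0.8, 0.3], [0.7, 0.2]))   # 5/6 = 0.8333...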
Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. (Schematic: a threshold θ separates m − (k − x) positive and x negative examples above it from n − x negative and k − x positive examples below it.)

3 The Expected Value of the AUC

In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare it to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs above θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with outputs below θ without changing the error rate.

Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general.

For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 2 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1), can be divided into three parts:

    A = \frac{A_1 + A_2 + A_3}{mn},    (2)

where A_1 is the sum over all pairs (x_i, y_j) with x_i and y_j in distinct regions, A_2 the sum over all pairs with both in the region above the threshold, and A_3 the sum over all pairs with both in the region below the threshold.

The first term, A_1, is easy to compute. Since there are m − (k − x) positive examples above the threshold and n − x negative examples below it, A_1 is given by

    A_1 = (m − (k − x))(n − x).    (3)

To compute A_2, we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold, and let α_1 < ... < α_x denote the positions, in increasing order, of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A_2,

    A_2 = \sum_{i=1}^{x} [(N − α_i) − (x − i)],    (4)

where the first term, N − α_i, is the number of examples ranked higher than the i-th example, and the second term, x − i, discounts the number of negative examples incorrectly ranked higher than the i-th example. Similarly, let α'_1 < ... < α'_{k−x} denote the positions of the k − x positive examples below the threshold, counting positions in reverse, starting from the threshold. Then A_3 is given by

    A_3 = \sum_{j=1}^{x'} [(N' − α'_j) − (x' − j)],    (5)

with N' = n − x + (k − x) and x' = k − x. Combining the expressions of A_1, A_2, and A_3 leads to

    A = 1 + \frac{(k − 2x)^2 + k}{2mn} − \frac{\sum_{i=1}^{x} α_i + \sum_{j=1}^{x'} α'_j}{mn}.    (6)

Lemma 2 For a fixed x, the average value of the AUC A is given by

    \langle A \rangle_x = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right).    (7)

Proof. The proof is based on computing the average values of \sum_{i=1}^{x} α_i and \sum_{j=1}^{x'} α'_j for a given x. We start by computing the average value <α_i>_x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α_1, ..., α_{i−1} and α_{i+1}, ..., α_x when the value of α_i is fixed at, say, α_i = l. We have i ≤ l ≤ N − (x − i), since there must be at least i − 1 positions before α_i and x − i positions after it. There are \binom{l−1}{i−1} possible choices for α_1, ..., α_{i−1} and \binom{N−l}{x−i} possible choices for α_{i+1}, ..., α_x. Since the total number of ways of choosing the x positions α_1, ..., α_x out of N is \binom{N}{x}, the average value <α_i>_x is

    \langle α_i \rangle_x = \frac{1}{\binom{N}{x}} \sum_{l=i}^{N−(x−i)} l \binom{l−1}{i−1} \binom{N−l}{x−i}.    (8)

Thus,

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{1}{\binom{N}{x}} \sum_{l=1}^{N} l \sum_{i=1}^{x} \binom{l−1}{i−1} \binom{N−l}{x−i}.    (9)

Using the classical identity \sum_{p_1 + p_2 = p} \binom{u}{p_1} \binom{v}{p_2} = \binom{u+v}{p}, we can write

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{1}{\binom{N}{x}} \binom{N−1}{x−1} \sum_{l=1}^{N} l = \frac{N(N+1)}{2} \cdot \frac{\binom{N−1}{x−1}}{\binom{N}{x}} = \frac{x(N+1)}{2}.    (10)

Similarly, we have

    \sum_{j=1}^{x'} \langle α'_j \rangle_x = \frac{x'(N'+1)}{2}.    (11)

Replacing <\sum α_i>_x and <\sum α'_j>_x in Eq. (6) by the expressions given by Eq. (10) and Eq. (11) leads to

    \langle A \rangle_x = 1 + \frac{(k − 2x)^2 + k − x(N+1) − x'(N'+1)}{2mn} = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right),    (12)

which ends the proof of the lemma.

Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes.
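Lemma 2 can be verified by brute force for small m, n, k: the sketch below (ours) enumerates every equiprobable placement of the x false positives and k − x false negatives for a fixed x and averages the AUC over the induced rankings:

    import numpy as np
    from itertools import combinations

    def mean_auc_fixed_x(m, n, k, x):
        # Enumerate all rankings with x negatives above the threshold and
        # k - x positives below it, then average the pairwise AUC.
        N, Np = m - (k - x) + x, n - x + (k - x)
        aucs = []
        for neg_above in combinations(range(N), x):
            for pos_below in combinations(range(Np), k - x):
                order = [0 if i in neg_above else 1 for i in range(N)] + \
                        [1 if j in pos_below else 0 for j in range(Np)]
                pos = [r for r, c in enumerate(order) if c == 1]  # ranks
                neg = [r for r, c in enumerate(order) if c == 0]
                aucs.append(np.mean([[p < q for q in neg] for p in pos]))
        return float(np.mean(aucs))

    m, n, k, x = 4, 5, 3, 1
    print(mean_auc_fixed_x(m, n, k, x))        # 0.65
    print(1 - (x / n + (k - x) / m) / 2)       # Lemma 2 also gives 0.65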
Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by

    \langle A \rangle = 1 − \frac{k}{m+n} − \frac{(n−m)^2 (m+n+1)}{4mn} \left[ \frac{k}{m+n} − \frac{\sum_{x=0}^{k−1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right].    (13)

Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we weight the expression of Eq. (7) by the total number of possible classifications for a given x. There are \binom{N}{x} possible ways of choosing the positions of the x misclassified negative examples, and similarly \binom{N'}{x'} possible ways of choosing the positions of the x' = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by

    \langle A \rangle = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} \left( 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k−x}{m} \right) \right)}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}}.    (14)

Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate (0 to 0.5), with one curve per fixed ratio r = n/(n+m) in {0.01, 0.05, 0.1, 0.25, 0.5}. The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m.

This expression can be simplified into Eq. (13) (footnote 3) using the following novel identities:

    \sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \binom{n+m+1}{x},    (15)

    \sum_{x=0}^{k} x \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \frac{(k−x)(m−n) + k}{2} \binom{n+m+1}{x},    (16)

which we obtained by using Zeilberger's algorithm (footnote 4) and numerous combinatorial 'tricks'.

Footnote 3: An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors' analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n+m), which is false.
Footnote 4: We thank Neil Sloane for having pointed us to Zeilberger's algorithm and Maple package.

From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n ≠ m, the expected value of the AUC is a monotonic function of the accuracy; see Fig. 3 (left). For a fixed ratio n/(n+m), the curves are obtained by increasing the accuracy from n/(n+m) to 1. The average AUC varies monotonically in the range of accuracies between 0.5 and 1.0. In other words, on average, there seems to be nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n+m) when k ≠ 0.
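As a numerical cross-check of Proposition 1, this sketch of ours compares the closed form of Eq. (13) with the direct configuration-weighted average of Eq. (14); the two agree on small cases:

    from math import comb

    def expected_auc_closed(m, n, k):
        # Eq. (13).
        ratio = sum(comb(m + n, x) for x in range(k)) / \
                sum(comb(m + n + 1, x) for x in range(k + 1))
        return (1 - k / (m + n)
                - (n - m) ** 2 * (m + n + 1) / (4 * m * n)
                  * (k / (m + n) - ratio))

    def expected_auc_direct(m, n, k):
        # Eq. (14): weight Lemma 2's <A>_x by the number of configurations.
        tot = wsum = 0.0
        for x in range(k + 1):
            N, Np = m - (k - x) + x, n - x + (k - x)
            wgt = comb(N, x) * comb(Np, k - x)
            tot += wgt * (1 - (x / n + (k - x) / m) / 2)
            wsum += wgt
        return tot / wsum

    print(expected_auc_closed(2, 3, 1), expected_auc_direct(2, 3, 1))
    # both print 0.7857142857...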
4 The Variance of the AUC

Let D = mn + \frac{(k − 2x)^2 + k}{2}, a = \sum_{i=1}^{x} α_i, a' = \sum_{j=1}^{x'} α'_j, and α = a + a'. Then, by Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ^2(A), is given by

    (mn)^2 σ^2(A) = \langle (D − α)^2 \rangle − (\langle D \rangle − \langle α \rangle)^2
                  = \langle D^2 \rangle − \langle D \rangle^2 + \langle α^2 \rangle − \langle α \rangle^2 − 2 (\langle αD \rangle − \langle α \rangle \langle D \rangle).    (17)

As before, to compute the average of a term X over all classifications, we can first determine its average <X>_x for a fixed x and then use the function F defined by

    F(Y) = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} Y}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}},    (18)

so that <X> = F(<X>_x). A crucial step in computing the exact value of the variance of the AUC is to determine the value of terms of the type <a^2>_x = <(\sum_{i=1}^{x} α_i)^2>_x.

Lemma 3 For a fixed x, the average of (\sum_{i=1}^{x} α_i)^2 is given by

    \langle a^2 \rangle_x = \frac{x(N+1)}{12} (3Nx + 2x + N).    (19)

Proof. By definition of a, <a^2>_x = b + 2c, with

    b = \Bigl\langle \sum_{i=1}^{x} α_i^2 \Bigr\rangle_x, \qquad c = \Bigl\langle \sum_{1 \le i < j \le x} α_i α_j \Bigr\rangle_x.    (20)

6 0.76791173 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

7 0.76430619 41 nips-2003-Boosting versus Covering

8 0.76406574 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

9 0.75948226 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

10 0.7559492 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

11 0.75572217 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

12 0.75568926 117 nips-2003-Linear Response for Approximate Inference

13 0.75470984 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

14 0.75304055 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

15 0.75107479 145 nips-2003-Online Classification on a Budget

16 0.75079417 100 nips-2003-Laplace Propagation

17 0.74870908 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

18 0.74632007 113 nips-2003-Learning with Local and Global Consistency

19 0.74510765 189 nips-2003-Tree-structured Approximations by Expectation Propagation

20 0.74473 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints