nips nips2003 nips2003-133 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. [sent-5, score-0.642]
2 When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. [sent-6, score-1.115]
3 This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. [sent-7, score-0.616]
4 In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. [sent-8, score-0.221]
5 Results show that the method outperforms applications overlooking contextual information. [sent-9, score-0.36]
6 We suggest that achieving contextual integration is a step toward human-like detection capabilities. [sent-10, score-0.707]
7 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. [sent-11, score-0.345]
8 Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. [sent-12, score-0.683]
9 Our goal is to propose an efficient method for detection of multiple objects in natural scenes. [sent-13, score-0.651]
10 Knowledge on position of several objects might shed light on the entire scene (Figure 1). [sent-15, score-0.365]
11 Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. [sent-16, score-0.326]
12 A B Figure 1: Contextual spatial relationships assist detection A. [sent-17, score-0.318]
13 Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. [sent-22, score-1.044]
14 First, part and whole detection are often evidently interacting [5, 6]. [sent-24, score-0.318]
15 The insights emerging from observing human perception were adopted by the object detection community. [sent-27, score-0.597]
16 Many object detection algorithms bypass stages of a strict compositional hierarchy. [sent-28, score-0.666]
17 The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. [sent-29, score-0.583]
18 Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. [sent-30, score-0.318]
19 Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. [sent-31, score-0.63]
20 Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. [sent-32, score-1.281]
21 Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. [sent-33, score-0.613]
22 We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. [sent-34, score-0.595]
23 Section 2 will start by posing the multiclass detection problem from labeled images. [sent-35, score-0.404]
24 In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. [sent-36, score-0.437]
25 Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. [sent-37, score-0.498]
26 An application of the Mutual Boosting framework to facial component detection is presented in Section 5. [sent-38, score-0.369]
27 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. [sent-40, score-0.779]
28 Could we make use of these objects' contextual relations to improve detection? [sent-41, score-0.36]
29 Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. [sent-42, score-0.392]
30 In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. [sent-43, score-0.733]
31 We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. [sent-44, score-0.348]
32 Let M denote the cardinality of the object set we wish to detect in natural scenes. [sent-45, score-0.261]
33 Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. [sent-46, score-1.122]
34 We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,. [sent-47, score-0.471]
35 ,M indicating the likelihood of object m appearing at different positions in a target image. [sent-49, score-0.247]
36 Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. [sent-56, score-0.472]
37 Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. [sent-58, score-0.715]
38 The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. [sent-60, score-0.864]
39 In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. [sent-62, score-0.5]
40 It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. [sent-63, score-0.301]
41 By making use of features that capture the object regularities well, one can improve performance or reduce detection time. [sent-66, score-0.593]
42 A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. [sent-69, score-0.491]
43 A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. [sent-70, score-0.94]
44 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. [sent-71, score-0.23]
45 The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. [sent-72, score-1.038]
46 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. [sent-73, score-0.427]
47 The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. [sent-74, score-0.848]
48 In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). [sent-75, score-0.282]
49 To apply this observation for our multiclass detection problem we simultaneously train M object detectors. [sent-76, score-0.584]
50 At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). [sent-77, score-0.882]
51 This process gradually enriches the initial grounded features with composite contextual features. [sent-79, score-0.483]
52 The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). [sent-80, score-0.517]
53 ,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). [sent-83, score-0.681]
54 This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. [sent-84, score-0.345]
55 Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. [sent-85, score-0.918]
56 The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. [sent-86, score-0.763]
57 While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). [sent-87, score-1.149]
58 Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. [sent-88, score-0.487]
59 As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e. [sent-89, score-0.421]
60 Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. [sent-92, score-0.329]
61 We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. [sent-93, score-0.45]
62 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. [sent-94, score-0.486]
63 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e. [sent-95, score-0.944]
64 H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,. [sent-100, score-0.366]
65 ,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,. [sent-102, score-0.732]
66 ,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,. [sent-106, score-0.881]
67 Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. [sent-120, score-0.363]
68 Each raw image H0 is analyzed by M object detection maps Hm=1,. [sent-121, score-0.764]
69 mouth feature from eye feature from face detection image detection image F. [sent-130, score-1.177]
70 eye feature from eye detection image detection image Figure 5: A-E. [sent-132, score-1.05]
71 Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). [sent-133, score-0.352]
72 The windows’ scale is defined by the detected object size and by the map mode (local or contextual). [sent-134, score-0.304]
73 faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. [sent-136, score-0.624]
74 Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). [sent-138, score-0.61]
75 We chose to perform contextual detection in the face-scene for two main reasons. [sent-139, score-0.678]
76 First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. [sent-140, score-0.389]
77 Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. [sent-141, score-0.607]
78 Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. [sent-142, score-0.441]
79 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. [sent-149, score-0.62]
80 ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. [sent-152, score-0.578]
81 Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). [sent-159, score-0.332]
82 25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. [sent-161, score-0.181]
83 MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. [sent-171, score-0.483]
84 The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. [sent-173, score-0.678]
85 For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. [sent-174, score-0.393]
86 This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. [sent-175, score-0.311]
87 However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. [sent-176, score-0.452]
88 This procedure explains the superior AdaBoost performance in the first boosting iteration. [sent-179, score-0.397]
89 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). [sent-180, score-0.195]
90 Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. [sent-181, score-0.182]
91 Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). [sent-182, score-0.201]
92 Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. [sent-183, score-0.391]
93 Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. [sent-184, score-0.756]
94 This approach is optimal when many objects with common characteristics are trained. [sent-185, score-0.234]
95 Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. [sent-187, score-1.324]
96 DNA sequences include multiple objects (Genes) in mutual positions, and therefore might be handled by a variant of Mutual Boosting. [sent-189, score-0.634]
97 Mutual Boosting combines parallel boosting, with a similar feature approach to efficiently incorporate contextual information. [sent-191, score-0.46]
98 We suggest that achieving wide contextual integration is one step towards human-like object detection capabilities. [sent-192, score-0.917]
99 (1991) Coding visual images of objects in the inferotemporal cortex of the macaque monkey. [sent-207, score-0.317]
100 Scene perception: Detecting the judging objects undergoing relational violations. [sent-226, score-0.234]
wordName wordTfidf (topN-words)
[('boosting', 0.397), ('contextual', 0.36), ('mutual', 0.346), ('detection', 0.318), ('objects', 0.234), ('object', 0.21), ('adaboost', 0.154), ('wlmn', 0.127), ('hm', 0.115), ('compositional', 0.111), ('image', 0.109), ('mb', 0.109), ('vj', 0.103), ('parts', 0.102), ('detectors', 0.099), ('learner', 0.096), ('face', 0.095), ('sigma', 0.095), ('windows', 0.092), ('configurations', 0.086), ('cm', 0.085), ('scenes', 0.084), ('hn', 0.08), ('eye', 0.074), ('pxp', 0.073), ('efficient', 0.072), ('aimed', 0.072), ('maps', 0.068), ('features', 0.065), ('rectangle', 0.062), ('weak', 0.06), ('scene', 0.06), ('window', 0.059), ('raw', 0.059), ('grounded', 0.058), ('mouth', 0.058), ('emerge', 0.058), ('multiclass', 0.056), ('biederman', 0.055), ('images', 0.053), ('facial', 0.051), ('detect', 0.051), ('detected', 0.049), ('feature', 0.048), ('learners', 0.048), ('ab', 0.048), ('wl', 0.047), ('faces', 0.047), ('map', 0.045), ('position', 0.044), ('win', 0.043), ('emerging', 0.043), ('viola', 0.04), ('perceptual', 0.04), ('semantics', 0.04), ('cube', 0.038), ('positions', 0.037), ('cutout', 0.036), ('detectionlayermn', 0.036), ('downscale', 0.036), ('downscaling', 0.036), ('hmt', 0.036), ('mouths', 0.036), ('office', 0.036), ('oliva', 0.036), ('winm', 0.036), ('perona', 0.036), ('intermediate', 0.035), ('mn', 0.034), ('cascade', 0.034), ('cov', 0.034), ('existing', 0.034), ('eyes', 0.033), ('local', 0.033), ('neighboring', 0.032), ('enabling', 0.032), ('referencing', 0.032), ('torralba', 0.032), ('visual', 0.03), ('training', 0.03), ('jones', 0.03), ('labeled', 0.03), ('achieving', 0.029), ('define', 0.029), ('gist', 0.029), ('fink', 0.029), ('nose', 0.029), ('trained', 0.027), ('might', 0.027), ('layer', 0.027), ('instances', 0.027), ('stages', 0.027), ('efficiently', 0.027), ('referring', 0.027), ('neighborhoods', 0.027), ('multiple', 0.027), ('detector', 0.027), ('human', 0.026), ('incorporate', 0.025), ('could', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
2 0.29856947 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
3 0.26678348 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan
Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
5 0.15774341 143 nips-2003-On the Dynamics of Boosting
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1
6 0.12096925 41 nips-2003-Boosting versus Covering
7 0.098823003 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
8 0.094519958 122 nips-2003-Margin Maximizing Loss Functions
9 0.090359531 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors
10 0.081682205 12 nips-2003-A Model for Learning the Semantics of Pictures
11 0.079129502 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
12 0.077258006 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
13 0.074084006 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.072047569 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
15 0.071264058 119 nips-2003-Local Phase Coherence and the Perception of Blur
16 0.068773217 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
17 0.066785276 53 nips-2003-Discriminating Deformable Shape Classes
18 0.060481299 46 nips-2003-Clustering with the Connectivity Kernel
19 0.05834635 73 nips-2003-Feature Selection in Clustering Problems
20 0.05825074 68 nips-2003-Eye Movements for Reward Maximization
topicId topicWeight
[(0, -0.194), (1, -0.059), (2, 0.092), (3, -0.234), (4, -0.232), (5, -0.299), (6, -0.079), (7, -0.038), (8, -0.083), (9, -0.016), (10, -0.202), (11, -0.091), (12, 0.062), (13, -0.135), (14, 0.009), (15, -0.067), (16, -0.029), (17, 0.04), (18, -0.056), (19, 0.166), (20, -0.061), (21, -0.038), (22, 0.078), (23, -0.03), (24, 0.038), (25, -0.148), (26, 0.05), (27, -0.059), (28, -0.007), (29, -0.027), (30, 0.035), (31, -0.052), (32, -0.062), (33, -0.016), (34, -0.116), (35, -0.001), (36, -0.012), (37, 0.067), (38, 0.008), (39, 0.057), (40, 0.022), (41, -0.087), (42, -0.012), (43, 0.066), (44, -0.078), (45, 0.022), (46, -0.045), (47, -0.042), (48, -0.068), (49, -0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.98568434 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
2 0.82484591 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
3 0.78412324 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan
Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
5 0.51830751 143 nips-2003-On the Dynamics of Boosting
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1
6 0.51730049 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors
7 0.41515926 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
8 0.37849334 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
9 0.36501399 12 nips-2003-A Model for Learning the Semantics of Pictures
10 0.35737804 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
11 0.3499704 41 nips-2003-Boosting versus Covering
12 0.34488449 181 nips-2003-Statistical Debugging of Sampled Programs
13 0.30014539 3 nips-2003-AUC Optimization vs. Error Rate Minimization
14 0.29619965 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution
15 0.29380009 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
16 0.29067409 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
17 0.28008375 73 nips-2003-Feature Selection in Clustering Problems
18 0.27969563 119 nips-2003-Local Phase Coherence and the Perception of Blur
19 0.27205181 53 nips-2003-Discriminating Deformable Shape Classes
20 0.27009636 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
topicId topicWeight
[(0, 0.043), (11, 0.032), (30, 0.015), (35, 0.024), (53, 0.048), (71, 0.534), (76, 0.032), (85, 0.062), (91, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.98545599 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
2 0.97071582 108 nips-2003-Learning a Distance Metric from Relative Comparisons
Author: Matthew Schultz, Thorsten Joachims
Abstract: This paper presents a method for learning a distance metric from relative comparison such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1
3 0.96544766 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
4 0.96417165 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
5 0.96138048 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
6 0.78682601 117 nips-2003-Linear Response for Approximate Inference
7 0.67938048 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
8 0.65660918 121 nips-2003-Log-Linear Models for Label Ranking
9 0.65571988 41 nips-2003-Boosting versus Covering
10 0.64267069 145 nips-2003-Online Classification on a Budget
11 0.64183903 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
12 0.6336323 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
13 0.63359773 100 nips-2003-Laplace Propagation
14 0.63080949 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
15 0.62459195 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
16 0.62057769 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
17 0.6160996 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
18 0.61551803 187 nips-2003-Training a Quantum Neural Network
19 0.61348677 122 nips-2003-Margin Maximizing Loss Functions
20 0.60756499 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation