nips nips2006 nips2006-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. (b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping subwindows. This chained method allows the culling of most of this high-resolution image processing. 4.2 Experiments For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1.This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities will control the trade-off between false positives and true positives, and between classification error and speed. Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm’s performance to the performance of support vector machines (SVM) [6] and Adaboost [1] trained and tested on the highest resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernels. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and Adaboost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates. 5 Related Work Cascade detectors for vision processing (see [8] or [9] for example) may appear to be similar to the work in this paper. Especially at first glance for the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates define the increasing accuracy of the detector cascade. By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors are seeking to minimize computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage. The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see. Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches. Classic sequence analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from (usually) a known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper. 6 Conclusions We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly. Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report “face,” the classifier must process each stage, even if the result is assured earlier. In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to informationgathering applications. Acknowledgements This research was supported through the grant “Adaptive Decision Making for Silicon Wafer Testing” from Intel Research and UC MICRO. References [1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995. [2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996. [3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002. [4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003. [5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html. [6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992. [7] T. Joachims. Making large-scale SVM learning practical. In B. Schlkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT-Press, 1999. [8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001. [9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005. [10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947. [11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968. ˇ [12] Jan Sochman and Jiˇ´ Matas. Waldboost — learning for time constrained sequential detection. rı In CVPR, pages 150–156, 2005.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. [sent-8, score-0.15]
2 Further processing costs time and resources, but may add value. [sent-10, score-0.264]
3 Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. [sent-11, score-0.828]
4 We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. [sent-13, score-0.199]
5 After each stage some tests are performed to judge the quality of the wafer. [sent-18, score-0.457]
6 If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. [sent-19, score-0.284]
7 Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). [sent-23, score-0.159]
8 However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. [sent-24, score-0.202]
9 Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. [sent-25, score-0.423]
10 Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. [sent-26, score-0.215]
11 In this way, we save on many pixel operations and can reduce the cost in time to process an image. [sent-27, score-0.225]
12 Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. [sent-28, score-0.602]
13 Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. [sent-29, score-0.718]
14 Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. [sent-30, score-0.187]
15 Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). [sent-31, score-0.449]
16 We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. [sent-32, score-0.233]
17 1 Problem Formulation Let there be s stages to the processing pipeline. [sent-35, score-0.184]
18 In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. [sent-38, score-0.706]
19 Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. [sent-39, score-0.679]
20 Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. [sent-40, score-0.253]
21 To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. [sent-42, score-1.16]
22 For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. [sent-43, score-0.548]
23 Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. [sent-49, score-0.226]
24 We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. [sent-50, score-0.26]
25 We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). [sent-51, score-0.253]
26 The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . [sent-52, score-0.35]
27 j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. [sent-53, score-0.185]
28 We use superscripts to denote the example index and subscripts to denote the stage index. [sent-58, score-0.457]
29 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. [sent-59, score-0.185]
30 1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . [sent-61, score-0.717]
31 (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. [sent-62, score-0.263]
32 Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. [sent-63, score-0.274]
33 Initialize wi = ck for all stages i and examples k. [sent-71, score-0.26]
34 For each stage i: (a) Calculate targets for each training example, as shown in equation 5. [sent-73, score-0.525]
35 Select the stage ¯ with the best (highest) score. [sent-77, score-0.457]
36 1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. [sent-86, score-0.617]
37 We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. [sent-89, score-0.944]
38 If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . [sent-90, score-0.248]
39 That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). [sent-92, score-0.487]
40 For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). [sent-93, score-0.331]
41 + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. [sent-96, score-0.154]
42 ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. [sent-99, score-0.537]
43 Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). [sent-108, score-0.228]
44 We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. [sent-109, score-0.171]
45 Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. [sent-112, score-0.537]
46 Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up. [sent-114, score-0.487]
47 We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . [sent-120, score-0.15]
48 We assume that the weights α at each stage have been scaled to sum to 1. [sent-129, score-0.488]
49 If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. [sent-157, score-0.611]
50 For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . [sent-160, score-0.32]
51 Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). [sent-173, score-0.365]
52 This database contains 19-by-19 gray-scale images of faces and non-faces. [sent-175, score-0.175]
53 The training set has 2429 face images and 4548 non-face images. [sent-176, score-0.177]
54 The testing set has 472 faces and 23573 non-faces. [sent-177, score-0.156]
55 We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. [sent-178, score-0.401]
56 35 average cost/error per example 200 training cost training error testing cost testing error 0. [sent-190, score-0.438]
57 8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. [sent-195, score-0.311]
58 1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). [sent-197, score-0.161]
59 Most image patches (or subwindows) do not contain faces. [sent-198, score-0.184]
60 We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. [sent-199, score-0.194]
61 The first stage looks at image patches of size 3-by-3 (i. [sent-200, score-0.641]
62 The next stage looks at the same image, but at a resolution of 6-by-6. [sent-203, score-0.561]
63 The total classifier at any stage is a linear combination of these simple classifiers. [sent-207, score-0.457]
64 The total classifier for the whole stage is merely the sum of these look-up functions. [sent-211, score-0.457]
65 Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. [sent-212, score-0.679]
66 We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. [sent-213, score-0.669]
67 If the image is a face, we add a negative cost (i. [sent-214, score-0.317]
68 bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). [sent-216, score-0.349]
69 If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i. [sent-217, score-0.933]
70 While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. [sent-220, score-0.168]
71 More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). [sent-221, score-0.216]
72 To wit, while running the final 12-by12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. [sent-223, score-1.186]
73 This 6-by-6 processing would be the necessary pre-processing step to running the 12-by12 stage at a higher resolution. [sent-224, score-0.518]
74 As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). [sent-225, score-0.577]
75 This chained method allows the culling of most of this high-resolution image processing. [sent-227, score-0.374]
76 2 Experiments For each example, we construct a vector of stage costs as above. [sent-229, score-0.58]
77 This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. [sent-233, score-0.167]
78 There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. [sent-234, score-0.471]
79 Our method (by either training scheme) produces lower processing cost for most error rates. [sent-244, score-0.213]
80 5 Related Work Cascade detectors for vision processing (see [8] or [9] for example) may appear to be similar to the work in this paper. [sent-245, score-0.214]
81 However, cascade detection and this work (chained detection) are quite different. [sent-247, score-0.298]
82 The examples which pass that detector are then passed to a finer detector for training, and so on. [sent-250, score-0.206]
83 By contrast, our chain detectors are trained as an ensemble. [sent-252, score-0.193]
84 First, we assume that the information available at each stage changes. [sent-254, score-0.457]
85 Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. [sent-255, score-1.289]
86 By contrast, cascade detectors are seeking to minimize computational power necessary for a fixed decision. [sent-256, score-0.338]
87 Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage. [sent-257, score-0.246]
88 The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. [sent-259, score-0.163]
89 We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. [sent-263, score-0.69]
90 Cascade detectors are often used for the same purpose. [sent-264, score-0.153]
91 In cascade detectors, the time taken to evaluate a given image patch is reduced. [sent-266, score-0.308]
92 In our chained detector formulation, image patches are ignored completely based on analysis of lower resolution patches in the image pyramid. [sent-267, score-0.805]
93 To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). [sent-268, score-0.492]
94 1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. [sent-270, score-0.315]
95 Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. [sent-271, score-0.587]
96 Both are valid (and separate) ways of tackling image processing (and other tasks as well). [sent-272, score-0.184]
97 In the face detection example above, this means that in order to report “face,” the classifier must process each stage, even if the result is assured earlier. [sent-290, score-0.202]
98 In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. [sent-291, score-0.152]
99 While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to informationgathering applications. [sent-293, score-0.201]
100 Rapid object detection using a boosted cascade of simple features. [sent-333, score-0.377]
wordName wordTfidf (topN-words)
[('stage', 0.457), ('chained', 0.251), ('subclassi', 0.223), ('wafer', 0.223), ('cascade', 0.185), ('gi', 0.16), ('detectors', 0.153), ('gj', 0.148), ('boosting', 0.124), ('costs', 0.123), ('stages', 0.123), ('image', 0.123), ('er', 0.122), ('cost', 0.114), ('detection', 0.113), ('manufacturing', 0.112), ('decision', 0.107), ('xi', 0.106), ('resolution', 0.104), ('classi', 0.098), ('bonus', 0.097), ('recommends', 0.097), ('face', 0.089), ('faces', 0.089), ('dh', 0.088), ('ck', 0.087), ('egi', 0.084), ('riverside', 0.084), ('waldboost', 0.084), ('resolutions', 0.082), ('detector', 0.082), ('add', 0.08), ('object', 0.079), ('rules', 0.078), ('rounds', 0.074), ('pixel', 0.074), ('pipeline', 0.073), ('gn', 0.068), ('testing', 0.067), ('ers', 0.067), ('ci', 0.067), ('sup', 0.062), ('base', 0.062), ('patches', 0.061), ('processing', 0.061), ('xk', 0.061), ('risk', 0.061), ('weight', 0.057), ('wafers', 0.056), ('rademacher', 0.055), ('decisions', 0.055), ('slope', 0.053), ('rejected', 0.053), ('false', 0.053), ('bound', 0.053), ('images', 0.05), ('wi', 0.05), ('en', 0.049), ('pipelined', 0.048), ('downsampling', 0.048), ('historic', 0.048), ('early', 0.048), ('stopping', 0.047), ('adaboost', 0.046), ('positives', 0.046), ('run', 0.045), ('cb', 0.044), ('indicator', 0.044), ('sequential', 0.042), ('pass', 0.042), ('yoav', 0.041), ('throw', 0.041), ('processed', 0.041), ('series', 0.041), ('chain', 0.04), ('silicon', 0.039), ('training', 0.038), ('returns', 0.038), ('dedicated', 0.037), ('save', 0.037), ('pixels', 0.037), ('continue', 0.036), ('database', 0.036), ('let', 0.036), ('cations', 0.034), ('bounds', 0.033), ('jmlr', 0.033), ('rejection', 0.033), ('draws', 0.032), ('ease', 0.032), ('multiply', 0.032), ('weights', 0.031), ('asymmetric', 0.031), ('read', 0.031), ('fi', 0.03), ('equation', 0.03), ('gained', 0.03), ('freund', 0.03), ('svm', 0.029), ('theorem', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000015 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. (b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping subwindows. This chained method allows the culling of most of this high-resolution image processing. 4.2 Experiments For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1.This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities will control the trade-off between false positives and true positives, and between classification error and speed. Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm’s performance to the performance of support vector machines (SVM) [6] and Adaboost [1] trained and tested on the highest resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernels. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and Adaboost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates. 5 Related Work Cascade detectors for vision processing (see [8] or [9] for example) may appear to be similar to the work in this paper. Especially at first glance for the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates define the increasing accuracy of the detector cascade. By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors are seeking to minimize computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage. The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see. Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches. Classic sequence analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from (usually) a known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper. 6 Conclusions We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly. Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report “face,” the classifier must process each stage, even if the result is assured earlier. In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to informationgathering applications. Acknowledgements This research was supported through the grant “Adaptive Decision Making for Silicon Wafer Testing” from Intel Research and UC MICRO. References [1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995. [2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996. [3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002. [4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003. [5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html. [6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992. [7] T. Joachims. Making large-scale SVM learning practical. In B. Schlkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT-Press, 1999. [8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001. [9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005. [10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947. [11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968. ˇ [12] Jan Sochman and Jiˇ´ Matas. Waldboost — learning for time constrained sequential detection. rı In CVPR, pages 150–156, 2005.
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
3 0.15217294 185 nips-2006-Subordinate class recognition using relational object models
Author: Aharon B. Hillel, Daphna Weinshall
Abstract: We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like ’jagged wheels’). Accordingly, we suggest a two-stage algorithm: First, a relational part based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model’s part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two stage algorithm where classification is based on a model of the sub-ordinate class. 1
4 0.12178982 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
Author: Shai Avidan, Moshe Butman
Abstract: Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure MultiParty computations use cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute and we introduce a couple of machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first method is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are enough for classifying an image patch, but not enough to actually reconstruct it. The second machine learning technique is active learning that allows Alice to construct an online classifier, based on a small number of calls to Bob’s face detector. She can then use her online classifier as a fast rejector before using a cryptographically secure classifier on the remaining image patches. 1
5 0.11652468 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.10681783 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
7 0.10605746 154 nips-2006-Optimal Change-Detection and Spiking Neurons
8 0.10120451 66 nips-2006-Detecting Humans via Their Pose
9 0.098669574 21 nips-2006-AdaBoost is Consistent
10 0.094229236 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
11 0.093981415 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.090551741 193 nips-2006-Tighter PAC-Bayes Bounds
13 0.087640151 186 nips-2006-Support Vector Machines on a Budget
14 0.087548532 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
15 0.086471923 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
16 0.0857042 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
17 0.081830502 116 nips-2006-Learning from Multiple Sources
18 0.080771662 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
19 0.077840939 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
20 0.076709047 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
topicId topicWeight
[(0, -0.257), (1, 0.032), (2, 0.007), (3, -0.099), (4, 0.01), (5, 0.133), (6, -0.236), (7, -0.012), (8, 0.024), (9, 0.107), (10, 0.097), (11, 0.048), (12, 0.122), (13, 0.068), (14, 0.081), (15, -0.022), (16, 0.013), (17, 0.011), (18, 0.006), (19, -0.05), (20, 0.074), (21, 0.141), (22, -0.005), (23, -0.01), (24, 0.035), (25, -0.029), (26, -0.1), (27, 0.102), (28, -0.009), (29, -0.079), (30, 0.01), (31, 0.029), (32, 0.096), (33, 0.066), (34, 0.012), (35, -0.005), (36, 0.113), (37, -0.046), (38, 0.037), (39, 0.017), (40, -0.045), (41, 0.023), (42, 0.03), (43, 0.147), (44, -0.059), (45, -0.074), (46, 0.038), (47, 0.046), (48, 0.119), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95428854 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. (b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping subwindows. This chained method allows the culling of most of this high-resolution image processing. 4.2 Experiments For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1.This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities will control the trade-off between false positives and true positives, and between classification error and speed. Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm’s performance to the performance of support vector machines (SVM) [6] and Adaboost [1] trained and tested on the highest resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernels. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and Adaboost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates. 5 Related Work Cascade detectors for vision processing (see [8] or [9] for example) may appear to be similar to the work in this paper. Especially at first glance for the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates define the increasing accuracy of the detector cascade. By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors are seeking to minimize computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage. The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see. Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches. Classic sequence analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from (usually) a known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper. 6 Conclusions We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly. Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report “face,” the classifier must process each stage, even if the result is assured earlier. In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to informationgathering applications. Acknowledgements This research was supported through the grant “Adaptive Decision Making for Silicon Wafer Testing” from Intel Research and UC MICRO. References [1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995. [2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996. [3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002. [4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003. [5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html. [6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992. [7] T. Joachims. Making large-scale SVM learning practical. In B. Schlkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT-Press, 1999. [8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001. [9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005. [10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947. [11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968. ˇ [12] Jan Sochman and Jiˇ´ Matas. Waldboost — learning for time constrained sequential detection. rı In CVPR, pages 150–156, 2005.
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
3 0.7270872 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
Author: Shai Avidan, Moshe Butman
Abstract: Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure MultiParty computations use cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute and we introduce a couple of machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first method is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are enough for classifying an image patch, but not enough to actually reconstruct it. The second machine learning technique is active learning that allows Alice to construct an online classifier, based on a small number of calls to Bob’s face detector. She can then use her online classifier as a fast rejector before using a cryptographically secure classifier on the remaining image patches. 1
4 0.64375669 185 nips-2006-Subordinate class recognition using relational object models
Author: Aharon B. Hillel, Daphna Weinshall
Abstract: We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like ’jagged wheels’). Accordingly, we suggest a two-stage algorithm: First, a relational part based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model’s part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two stage algorithm where classification is based on a model of the sub-ordinate class. 1
5 0.52437007 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
6 0.52070892 20 nips-2006-Active learning for misspecified generalized linear models
7 0.50647324 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
8 0.50506943 156 nips-2006-Ordinal Regression by Extended Binary Classification
9 0.50176787 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
10 0.48384482 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.47992554 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
12 0.47787192 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
13 0.47701246 52 nips-2006-Clustering appearance and shape by learning jigsaws
14 0.46941683 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
15 0.46141648 186 nips-2006-Support Vector Machines on a Budget
16 0.45543724 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
17 0.44296592 66 nips-2006-Detecting Humans via Their Pose
18 0.42916274 170 nips-2006-Robotic Grasping of Novel Objects
19 0.42060155 155 nips-2006-Optimal Single-Class Classification Strategies
20 0.41725183 33 nips-2006-Analysis of Representations for Domain Adaptation
topicId topicWeight
[(1, 0.095), (3, 0.011), (7, 0.052), (9, 0.028), (22, 0.048), (44, 0.062), (57, 0.58), (65, 0.038), (69, 0.013)]
simIndex simValue paperId paperTitle
1 0.96970332 66 nips-2006-Detecting Humans via Their Pose
Author: Alessandro Bissacco, Ming-Hsuan Yang, Stefano Soatto
Abstract: We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. 1
same-paper 2 0.96544886 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. (b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping subwindows. This chained method allows the culling of most of this high-resolution image processing. 4.2 Experiments For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1.This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities will control the trade-off between false positives and true positives, and between classification error and speed. Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm’s performance to the performance of support vector machines (SVM) [6] and Adaboost [1] trained and tested on the highest resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernels. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and Adaboost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates. 5 Related Work Cascade detectors for vision processing (see [8] or [9] for example) may appear to be similar to the work in this paper. Especially at first glance for the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates define the increasing accuracy of the detector cascade. By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors are seeking to minimize computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage. The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see. Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches. Classic sequence analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from (usually) a known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper. 6 Conclusions We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly. Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report “face,” the classifier must process each stage, even if the result is assured earlier. In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to informationgathering applications. Acknowledgements This research was supported through the grant “Adaptive Decision Making for Silicon Wafer Testing” from Intel Research and UC MICRO. References [1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995. [2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996. [3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002. [4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003. [5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html. [6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992. [7] T. Joachims. Making large-scale SVM learning practical. In B. Schlkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT-Press, 1999. [8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001. [9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005. [10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947. [11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968. ˇ [12] Jan Sochman and Jiˇ´ Matas. Waldboost — learning for time constrained sequential detection. rı In CVPR, pages 150–156, 2005.
3 0.95353538 52 nips-2006-Clustering appearance and shape by learning jigsaws
Author: Anitha Kannan, John Winn, Carsten Rother
Abstract: Patch-based appearance models are used in a wide range of computer vision applications. To learn such models it has previously been necessary to specify a suitable set of patch sizes and shapes by hand. In the jigsaw model presented here, the shape, size and appearance of patches are learned automatically from the repeated structures in a set of training images. By learning such irregularly shaped ‘jigsaw pieces’, we are able to discover both the shape and the appearance of object parts without supervision. When applied to face images, for example, the learned jigsaw pieces are surprisingly strongly associated with face parts of different shapes and scales such as eyes, noses, eyebrows and cheeks, to name a few. We conclude that learning the shape of the patch not only improves the accuracy of appearance-based part detection but also allows for shape-based part detection. This enables parts of similar appearance but different shapes to be distinguished; for example, while foreheads and cheeks are both skin colored, they have markedly different shapes. 1
4 0.95352328 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
Author: Su-in Lee, Varun Ganapathi, Daphne Koller
Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
5 0.76409453 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1
6 0.74286437 34 nips-2006-Approximate Correspondences in High Dimensions
7 0.74112886 42 nips-2006-Bayesian Image Super-resolution, Continued
8 0.72863543 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
9 0.72395456 47 nips-2006-Boosting Structured Prediction for Imitation Learning
10 0.71935129 110 nips-2006-Learning Dense 3D Correspondence
11 0.71901834 185 nips-2006-Subordinate class recognition using relational object models
12 0.7076326 45 nips-2006-Blind Motion Deblurring Using Image Statistics
13 0.70652527 122 nips-2006-Learning to parse images of articulated bodies
14 0.6792714 154 nips-2006-Optimal Change-Detection and Spiking Neurons
15 0.67331004 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
16 0.67087561 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
17 0.66947132 130 nips-2006-Max-margin classification of incomplete data
18 0.6656276 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
19 0.66543984 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
20 0.65099275 53 nips-2006-Combining causal and similarity-based reasoning