nips2003-143: On the Dynamics of Boosting
Source: pdf
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost's dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost's output. By considering AdaBoost as a dynamical system, we are able to prove Rätsch and Warmuth's conjecture that AdaBoost may fail to converge to a maximal-margin combined classifier when given a 'non-optimal' weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost* and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv.
A "weak" classifier produced by the weak learning algorithm has a probability of misclassification slightly below 50%. AdaBoost maintains a distribution (set of weights) over the training examples, and requests a weak classifier from the weak learning algorithm at each iteration. Training examples that were misclassified by the weak classifier at the current iteration then receive higher weights at the following iteration. When it is possible to achieve a positive margin, AdaBoost has been shown to approximately maximize the margin [2]. In particular, it is known that AdaBoost achieves a margin of at least $\rho/2$, where $\rho$ is the largest margin that can possibly be attained by a combined classifier (other bounds appear in [3]). Despite the extensive theoretical and empirical study of AdaBoost, it is still unknown whether AdaBoost achieves a maximal margin solution, and thus the best upper bound on the probability of error (among margin-based bounds).
First we simplify the algorithm to reveal a nonlinear iterated map for AdaBoost's weight vector. This iterated map gives a direct relation between the weights at time t and the weights at time t + 1, including renormalization, and thus provides a much more concise mapping than the original algorithm. We then provide a specific set of examples in which trajectories of this iterated map converge to a limit cycle, allowing us to calculate AdaBoost's output vector directly. There are two interesting cases governing the dynamics: the case where the optimal weak classifiers are chosen at each iteration (the 'optimal' case), and the case where permissible non-optimal weak classifiers may be chosen (the 'non-optimal' case). In the optimal case, the weak learning algorithm is required to choose a weak classifier with the largest edge at every iteration. In the non-optimal case, the weak learning algorithm may choose any weak classifier as long as its edge exceeds $\rho$, the maximum margin achievable by a combined classifier. Based on large-scale experiments and a gap in theoretical bounds, Rätsch and Warmuth [3] conjectured that AdaBoost does not necessarily converge to a maximum margin classifier in the non-optimal case. In practice, the weak classifiers are generated by CART or another heuristic weak learning algorithm, implying that the choice need not always be optimal.
AdaBoost, as shown by Breiman [5] and others, is actually a coordinate descent algorithm on a particular exponential loss function. However, minimizing this function in other ways does not necessarily achieve large margins; the process of coordinate descent must somehow be responsible. In Section 4, we introduce a differentiable function that can be maximized to achieve maximal margins; performing coordinate ascent on this function yields a new boosting algorithm that directly maximizes margins. We approximate the update rule for coordinate ascent on this function and derive an algorithm with updates that are slightly more aggressive than those of arc-gv. Then we decouple the dynamics of AdaBoost in the binary case to reveal a nonlinear iterated map. We use the resulting cycles to show that AdaBoost produces a maximal margin solution in the optimal case; this result generalizes to the m × m case. Then, we produce the example promised above to show that AdaBoost does not necessarily converge to a maximal margin solution in the non-optimal case. In Section 4 we introduce a differentiable function that can be used to maximize the margin via coordinate ascent, and then approximate the coordinate ascent update step to derive a new algorithm.
Denote by $d_t \in \mathbb{R}^m$ the distribution (weights) over the training examples at iteration t, expressed as a column vector. The weak classifiers are denoted $h_1, \dots, h_n$, with $h_j : X \to \{1, -1\}$; we assume that for every $h_j$ on this list, $-h_j$ also appears. Define the $m \times n$ matrix M by $M_{ij} = y_i h_j(x_i)$; that is, $M_{ij} = +1$ if training example i is classified correctly by hypothesis $h_j$, and $-1$ otherwise. The (unnormalized) coefficient of classifier $h_j$ in the final combined hypothesis is denoted $\lambda_j$, so that the final combined hypothesis is $f_{\text{Ada}}(x) = \sum_{j=1}^{n} (\lambda_j / \|\lambda\|_1)\, h_j(x)$, where $\|\lambda\|_1 = \sum_{j=1}^{n} \lambda_j$. The margin of training example i is defined by $y_i f_{\text{Ada}}(x_i)$, or equivalently $(M\lambda)_i / \|\lambda\|_1$, and the edge of hypothesis j with respect to the training data (weighted by d) is $(d^T M)_j$, or $1 - 2 \times$ (probability of error of $h_j$ on the training set weighted by d). We call the minimum margin over training examples the margin of classifier $\lambda$.
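To make this notation concrete, here is a minimal NumPy sketch (the function names are ours, not the paper's) computing the edge of each weak classifier and the margin of a combined classifier:

    import numpy as np

    def edges(d, M):
        # Edge of each weak classifier j under distribution d: (d^T M)_j.
        return d @ M

    def margin(lam, M):
        # Margin of the combined classifier: min_i (M lam)_i / ||lam||_1.
        return (M @ lam).min() / lam.sum()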
3. For t = 1, ..., t_max:
(a) $d_{t,i} = e^{-(M\lambda_t)_i} / \sum_{i=1}^{m} e^{-(M\lambda_t)_i}$ for $i = 1, \dots, m$
(b) $j_t = \arg\max_j (d_t^T M)_j$
(c) $r_t = (d_t^T M)_{j_t}$
(d) $\alpha_t = \frac{1}{2} \ln \frac{1+r_t}{1-r_t}$
(e) $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$, where $e_{j_t}$ is 1 in position $j_t$ and 0 elsewhere.
Output: $\lambda_{\text{combined},j} = \lambda_{t_{\max}+1,j} / \|\lambda_{t_{\max}+1}\|_1$. Thus at each iteration, the distribution $d_t$ is computed (Step 3a), the classifier $j_t$ with maximum edge is selected (Step 3b), and the weight of that classifier is updated (Steps 3c, 3d, 3e).
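The loop above translates directly into code. The following sketch is our reading of Steps 3a-3e, assuming M is a NumPy array of ±1 entries and the chosen edge satisfies $r_t < 1$ so that $\alpha_t$ stays finite:

    import numpy as np

    def adaboost(M, t_max):
        # M: m x n matrix with M[i, j] = +1 iff h_j classifies example i correctly.
        m, n = M.shape
        lam = np.zeros(n)
        for _ in range(t_max):
            w = np.exp(-(M @ lam))                  # Step 3a: unnormalized weights
            d = w / w.sum()                         #          distribution d_t
            e = d @ M                               # edges of all weak classifiers
            j = int(np.argmax(e))                   # Step 3b: direction j_t
            r = e[j]                                # Step 3c: edge r_t
            lam[j] += 0.5 * np.log((1 + r) / (1 - r))  # Steps 3d, 3e
        return lam / lam.sum()                      # Output: lambda_combined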
AdaBoost can be reduced to the following iterated map for the $d_t$'s. This map gives a direct relationship between $d_t$ and $d_{t+1}$, taking the normalization of Step 3a into account automatically.
$$d_{t+1,i} = \frac{e^{-(M\lambda_t)_i}\, e^{-M_{ij_t}\alpha_t}}{\sum_{i=1}^{m} e^{-(M\lambda_t)_i}\, e^{-M_{ij_t}\alpha_t}}, \quad \text{where } \alpha_t = \frac{1}{2}\ln\frac{1+r_t}{1-r_t}, \text{ so}$$
$$e^{-M_{ij_t}\alpha_t} = \left(\frac{1-r_t}{1+r_t}\right)^{\frac{1}{2}M_{ij_t}} = \left(\frac{1-M_{ij_t}r_t}{1+M_{ij_t}r_t}\right)^{1/2}, \quad \text{thus} \quad d_{t+1,i} = \frac{d_{t,i}}{1+M_{ij_t} r_t}.$$
Thus, writing $d^+ = \sum_{i:M_{ij_t}=1} d_{t,i}$ and $d^- = \sum_{i:M_{ij_t}=-1} d_{t,i}$, we have $d^+ = \frac{1+r_t}{2}$ and $d^- = \frac{1-r_t}{2}$; for each i such that $M_{ij_t} = 1$, we find $d_{t+1,i} = \frac{d_{t,i}}{1+r_t}$.
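In code, the reduced map replaces the exponential reweighting and renormalization with a single division. A sketch, under the same assumptions as above:

    import numpy as np

    def map_step(d, M):
        # One step of the iterated map: d_{t+1,i} = d_{t,i} / (1 + M[i, j_t] * r_t).
        e = d @ M
        j = int(np.argmax(e))
        return d / (1.0 + M[:, j] * e[j])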
Then we will consider a larger matrix and show that AdaBoost fails to converge to a maximum margin solution in the non-optimal case. First, consider the simple 3 × 3 case, in which each weak classifier misclassifies exactly one of the three training examples. We are in the optimal case, where $j_t = \arg\max_j (d_t^T M)_j$. In the triangular region with vertices (0, 0, 1), (1/3, 1/3, 1/3), (0, 1, 0), $j_t$ will be 1. Similarly, we have regions for $j_t = 2$ and $j_t = 3$ (see Figure 1(a)). Since $d_{t+1}$ will always satisfy $(d_{t+1}^T M)_{j_t} = 0$, the dynamics are restricted to the edges of a triangle with vertices $(0, \frac{1}{2}, \frac{1}{2})$, $(\frac{1}{2}, 0, \frac{1}{2})$, $(\frac{1}{2}, \frac{1}{2}, 0)$ after the first iteration (see Figure 1(b)).
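The extracted text does not print M explicitly here, but the regions just described (classifier j is selected where $d_j$ is smallest, since $(d^T M)_j = 1 - 2d_j$) and the generalization noted later in this section (each classifier misclassifies exactly one example) correspond, in our reading, to the matrix
$$M = \begin{pmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix}.$$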
Figure 1: (a-Left) Regions of $d_t$-space where classifiers $j_t = 1, 2, 3$ will respectively be selected; the axes give the first and second components of $d_t$, and the lines $(d_t^T M)_1 = 0$, $(d_t^T M)_2 = 0$, $(d_t^T M)_3 = 0$ mark the boundaries. (b-Right) The triangle to which the dynamics are restricted after the first iteration.
Figure 2: (a-Upper Left) The iterated map on the unfolded triangle (axes: position along the triangle). (b-Upper Right) The map from (a) iterated twice, showing where $d_{t+3}$ will be, given $d_t$. (c-Lower Left) 50 timesteps of AdaBoost showing convergence of the $d_t$'s to a cycle; there are many concentric rings at positions $d_{cyc}^{(1)}$, $d_{cyc}^{(2)}$, and $d_{cyc}^{(3)}$. (d-Lower Right) AdaBoost settling into a longer cycle for a randomly generated matrix (referenced in Sections 3 and 4).
On this reduced 1-dimensional phase space, the iterated map has no stable fixed points or orbits of length 2. However, consider the following periodic orbit of length 3: $d_{cyc}^{(1)T} = (\frac{3-\sqrt{5}}{4}, \frac{\sqrt{5}-1}{4}, \frac{1}{2})$, $d_{cyc}^{(2)T} = (\frac{1}{2}, \frac{3-\sqrt{5}}{4}, \frac{\sqrt{5}-1}{4})$, $d_{cyc}^{(3)T} = (\frac{\sqrt{5}-1}{4}, \frac{1}{2}, \frac{3-\sqrt{5}}{4})$. This is clearly a cycle: starting from $d_{cyc}^{(1)}$, AdaBoost will choose $j_t = 1$. Now, computing $d_{cyc,i}^{(1)}/(1 + M_{i,1} r_1)$ for each i yields $d_{cyc}^{(2)}$. There is in fact another 3-cycle, $d_{cyc}^{(1)T} = (\frac{3-\sqrt{5}}{4}, \frac{1}{2}, \frac{\sqrt{5}-1}{4})$, $d_{cyc}^{(2)T} = (\frac{1}{2}, \frac{\sqrt{5}-1}{4}, \frac{3-\sqrt{5}}{4})$, $d_{cyc}^{(3)T} = (\frac{\sqrt{5}-1}{4}, \frac{3-\sqrt{5}}{4}, \frac{1}{2})$. To find these cycles, we hypothesized only that a cycle of length 3 exists, visiting each hypothesis in turn, and used the reduced equations from Section 2 to solve for the cycle coordinates.
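These coordinates are easy to check numerically with the reduced map; a small sketch of the verification, inlining the map step from the earlier sketch and using the 3 × 3 matrix with −1 on the diagonal:

    import numpy as np

    M = np.array([[-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)
    s5 = np.sqrt(5.0)
    d = np.array([(3 - s5) / 4, (s5 - 1) / 4, 0.5])   # d_cyc^(1)
    for _ in range(3):                                 # three steps of the map
        e = d @ M
        j = int(np.argmax(e))
        d = d / (1.0 + M[:, j] * e[j])
    assert np.allclose(d, [(3 - s5) / 4, (s5 - 1) / 4, 0.5])  # back at d_cyc^(1)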
We give the following outline of the proof of global stability: this map is a contraction, so any small perturbation from the cycle will diminish, yielding local stability of the cycles. One only needs to consider the one-dimensional map defined on the unfolded triangle, since within one iteration every trajectory lands on the triangle. One can break the unfolded triangle into intervals and find the region of attraction of each fixed cycle; in fact the whole triangle is the union of the two regions of attraction. The convergence to one of these two 3-cycles is very fast; Figure 2(b) shows that the absolute slope of the second iterated map at the fixed points is much less than 1. AdaBoost's asymptotic combined classifier is $\lambda_{\text{combined}} = (1/3, 1/3, 1/3)$, and since $\min_i (M\lambda_{\text{combined}})_i = 1/3$, AdaBoost produces a maximal margin solution. This 3 × 3 case can be generalized to m classifiers, each having one misclassified training example; in this case there will be periodic cycles of length m, and the contraction will also persist (the cycles will be stable). We note that for every low-dimensional case we tried, periodic cycles of larger lengths seem to exist (such as in Figure 2(d)), but the contraction at each iteration does not, so it is harder to show stability. Now, we give an example to show that non-optimal AdaBoost does not necessarily converge to a maximal margin solution.
Recall that in the non-optimal case, $j_t \in \{j : (d_t^T M)_j \ge \rho\}$. This cycle is the same cycle as in our previous example (although there is one extra dimension). In any case, the value of the margin produced by this cycle is 1/3, not 1/2. In practice, it may be possible for AdaBoost to converge to a maximum margin solution when hypotheses are chosen to be only slightly non-optimal; however, the notion of non-optimality we use is very natural, and we have shown that AdaBoost may fail to converge to $\rho$ here. Note that for some matrices M a maximum margin solution may still be attained in the non-optimal case (for example, the simple 3 × 3 matrix we analyzed above), but as our example shows, it is not attained in general. We are not saying that the only way for AdaBoost to converge to a non-optimal solution is to fall into the wrong cycle; there may be many other non-cyclic ways for the algorithm to fail to converge to a maximum margin solution.
4 Coordinate Ascent for Maximum Margins
AdaBoost can be interpreted as an algorithm based on coordinate descent. There are other algorithms, such as AdaBoost* and arc-gv, that attempt to maximize the margin explicitly, but these are not based on coordinate descent. We now suggest a boosting algorithm that aims to maximize the margin explicitly (like arc-gv and AdaBoost*) yet is based on coordinate ascent. An important note is that AdaBoost and our new algorithm choose the direction of descent/ascent (the value of $j_t$) using the same formula, $j_t = \arg\max_j (d_t^T M)_j$. This lends further credence to the conjecture that AdaBoost maximizes the margin in the optimal case, since the direction AdaBoost chooses is the same direction one would choose to maximize the margin directly via coordinate ascent. The function that AdaBoost minimizes via coordinate descent is $F(\lambda) = \sum_{i=1}^{m} e^{-(M\lambda)_i}$. So it must be the process of coordinate descent which awards AdaBoost its ability to increase margins, not simply AdaBoost's ability to minimize F.
Consider instead the function $G(\lambda) = -\ln F(\lambda)/\|\lambda\|_1$. G is a concave function for each fixed value of $\|\lambda\|_1$, whose maximum only occurs in the limit as $\|\lambda\|_1 \to \infty$; more importantly, as $\|\lambda\|_1 \to \infty$ we have $G(\lambda) \to \mu(\lambda)$, where $\mu(\lambda) = (\min_i (M\lambda)_i)/\|\lambda\|_1$ is the margin of $\lambda$. Rather than performing coordinate descent on F as in AdaBoost, let us perform coordinate ascent on G.
The choice of direction $j_t$ at iteration t is:
$$j_t = \arg\max_j \left.\frac{dG(\lambda_t + \alpha e_j)}{d\alpha}\right|_{\alpha=0} = \arg\max_j \left[ \frac{\sum_{i=1}^{m} e^{-(M\lambda_t)_i} M_{ij}}{F(\lambda_t)\,\|\lambda_t\|_1} + \frac{\ln(F(\lambda_t))}{\|\lambda_t\|_1^2} \right].$$
Since the second term does not depend on j, this is exactly AdaBoost's direction $\arg\max_j (d_t^T M)_j$. An approximate coordinate ascent algorithm which avoids a line search solves the following approximation to this maximization problem:
$$0 \approx \frac{\sum_{i=1}^{m} e^{-(M\lambda_t)_i} e^{-M_{ij_t}\alpha} M_{ij_t}}{F(\lambda_t + \alpha e_{j_t})} - G(\lambda_t).$$
We can solve for $\alpha_t$ analytically:
$$\alpha_t = \frac{1}{2}\ln\frac{1+r_t}{1-r_t} - \frac{1}{2}\ln\frac{1+g_t}{1-g_t}, \quad \text{where } g_t = \max\{0, G(\lambda_t)\}.$$
The update $\alpha_t$ is strictly positive (in the case of positive margins) due to the von Neumann min-max theorem and equation (2), that is:
$$r_t \ge \rho = \min_{d\in\Delta_m} \max_j (d^T M)_j = \max_{\lambda\in\Delta_n} \min_i (M\lambda)_i \ge \min_i (M\lambda_t)_i / \|\lambda_t\|_1 > G(\lambda_t),$$
and thus $\alpha_t > 0$ for all t.
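Putting the pieces together, one update of the approximate coordinate ascent algorithm looks as follows. This is our sketch; as one detail not fixed by the text, we set $g_t = 0$ while $\|\lambda\|_1 = 0$ (where G is undefined), so the first step coincides with AdaBoost's:

    import numpy as np

    def approx_coord_ascent_step(lam, M):
        w = np.exp(-(M @ lam))
        F = w.sum()
        d = w / F
        e = d @ M
        j = int(np.argmax(e))                # same direction j_t as AdaBoost
        r = e[j]
        g = max(0.0, -np.log(F) / lam.sum()) if lam.sum() > 0 else 0.0
        alpha = 0.5 * np.log((1 + r) / (1 - r)) - 0.5 * np.log((1 + g) / (1 - g))
        out = lam.copy()
        out[j] += alpha                      # slightly less aggressive than AdaBoost
        return out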
We have preliminary proofs that the value of G increases at each iteration of our approximate coordinate ascent algorithm, and that our algorithms converge to a maximum margin solution, even in the non-optimal case. AdaBoost* converges to a combined classifier attaining a margin inside the interval $[\rho - \nu, \rho]$ within $2(\log_2 m)/\nu^2$ steps, but does not guarantee asymptotic convergence to $\rho$ for a fixed $\nu$. There are many other boosting algorithms, but some of them require minimization over non-convex functions; here, we choose to compare with the simple updates of AdaBoost (due to its fast convergence rate), AdaBoost*, and arc-gv.
Figure 3: (a-Left) Performance of all algorithms (AdaBoost, AdaBoost*, arc-gv, approximate coordinate ascent, and coordinate ascent) in the optimal case on a random 11 × 21 input matrix, plotted against iterations. (b-Right) AdaBoost, arc-gv, and approximate coordinate ascent on synthetic data.
Figure 3(a) compares AdaBoost, arc-gv, AdaBoost* ($\nu$ = 0.001), approximate coordinate ascent, and coordinate ascent on G (with a line search for $\alpha_t$ at every iteration) on a reduced, randomly generated 11 × 21 matrix, in the optimal case. AdaBoost settles into a cycle (as shown in Figure 2(d)), so its updates remain consistently large, causing $\|\lambda_t\|_1$ to grow faster and thus converge faster with respect to G. The values of $r_t$ in the cycle happen to produce an optimal margin solution, so AdaBoost quickly converges to this solution. The approximate coordinate ascent algorithm has slightly less aggressive updates than AdaBoost, and is very closely aligned with coordinate ascent; arc-gv is slower. For the synthetic experiment of Figure 3(b), the j-th weak learner is $h_j(x) = x(j)$; thus $M_{ij} = y_i x_i(j)$. As expected, the convergence rate of approximate coordinate ascent falls between those of AdaBoost and arc-gv.
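A hypothetical version of this synthetic setup (the dimensions and random seed are ours, not the paper's): draw ±1 features and labels, then form M from the relation $M_{ij} = y_i x_i(j)$:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.choice([-1.0, 1.0], size=(200, 21))   # examples x_i with +/-1 features
    y = rng.choice([-1.0, 1.0], size=(200, 1))    # labels y_i
    M = y * x                                     # M[i, j] = y_i * x_i(j)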
5 Conclusions
We have used the nonlinear iterated map defined by AdaBoost to understand its update rule in low-dimensional cases and to uncover cyclic dynamics. We produced an example to show that AdaBoost does not necessarily maximize the margin in the non-optimal case. Then, we introduced a coordinate ascent algorithm and an approximate coordinate ascent algorithm that aim to maximize the margin directly. Here, the direction of ascent agrees with the direction chosen by AdaBoost and other algorithms.
100 Boosting as a regularized path to a maximum margin classifier. [sent-299, score-0.215]
wordName wordTfidf (topN-words)
[('adaboost', 0.7), ('mijt', 0.231), ('jt', 0.224), ('dcyc', 0.217), ('dt', 0.21), ('rt', 0.206), ('margin', 0.191), ('ascent', 0.176), ('coordinate', 0.149), ('iterated', 0.119), ('ejt', 0.116), ('weak', 0.11), ('hj', 0.095), ('cycle', 0.094), ('boosting', 0.075), ('classi', 0.075), ('er', 0.068), ('iteration', 0.066), ('triangle', 0.065), ('map', 0.059), ('cycles', 0.059), ('dynamics', 0.056), ('margins', 0.055), ('mini', 0.053), ('aggressive', 0.05), ('argmaxj', 0.05), ('tmax', 0.05), ('maximize', 0.047), ('ln', 0.046), ('mij', 0.046), ('converge', 0.046), ('timesteps', 0.043), ('unfolded', 0.043), ('descent', 0.039), ('contraction', 0.038), ('combined', 0.037), ('hypothesis', 0.037), ('periodic', 0.036), ('ers', 0.035), ('gv', 0.034), ('misclassi', 0.034), ('convergence', 0.034), ('princeton', 0.033), ('maximal', 0.032), ('robert', 0.031), ('gt', 0.03), ('rings', 0.03), ('updates', 0.03), ('schapire', 0.03), ('atsch', 0.029), ('coord', 0.029), ('fada', 0.029), ('ijt', 0.029), ('unused', 0.029), ('arc', 0.029), ('necessarily', 0.028), ('direction', 0.028), ('conjecture', 0.027), ('choose', 0.027), ('stable', 0.025), ('attained', 0.024), ('maximum', 0.024), ('reduced', 0.023), ('component', 0.023), ('dg', 0.023), ('approximate', 0.022), ('differentiable', 0.022), ('optimal', 0.022), ('maximizes', 0.022), ('yi', 0.021), ('update', 0.021), ('hypotheses', 0.021), ('training', 0.02), ('solution', 0.02), ('manfred', 0.02), ('weight', 0.019), ('slightly', 0.019), ('argmax', 0.019), ('warmuth', 0.019), ('xi', 0.019), ('achieve', 0.019), ('initially', 0.018), ('gunnar', 0.018), ('derive', 0.018), ('dynamical', 0.018), ('tsch', 0.018), ('position', 0.017), ('matrix', 0.017), ('understand', 0.017), ('explicitly', 0.017), ('edge', 0.017), ('reveal', 0.016), ('analyze', 0.016), ('weights', 0.016), ('algorithm', 0.016), ('examples', 0.016), ('outline', 0.016), ('learner', 0.015), ('produced', 0.015), ('ability', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 143 nips-2003-On the Dynamics of Boosting
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1
2 0.36219501 41 nips-2003-Boosting versus Covering
Author: Kohei Hatano, Manfred K. Warmuth
Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1
Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan
Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
4 0.18910743 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
5 0.15774341 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. 
While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). 
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. 
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
6 0.15341055 122 nips-2003-Margin Maximizing Loss Functions
7 0.14619647 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
8 0.11273254 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
9 0.099303804 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
10 0.084425285 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
11 0.071673468 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
12 0.058404446 1 nips-2003-1-norm Support Vector Machines
13 0.055839643 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.053800896 124 nips-2003-Max-Margin Markov Networks
15 0.05335502 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
16 0.053174134 168 nips-2003-Salient Boundary Detection using Ratio Contour
17 0.046775378 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
18 0.044558007 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
19 0.042879883 51 nips-2003-Design of Experiments via Information Theory
20 0.041818421 121 nips-2003-Log-Linear Models for Label Ranking
topicId topicWeight
[(0, -0.166), (1, -0.042), (2, -0.012), (3, -0.318), (4, 0.012), (5, -0.259), (6, -0.058), (7, 0.051), (8, -0.062), (9, 0.002), (10, -0.218), (11, -0.1), (12, 0.245), (13, -0.258), (14, -0.146), (15, 0.018), (16, 0.098), (17, -0.075), (18, 0.004), (19, -0.106), (20, -0.207), (21, 0.036), (22, -0.032), (23, -0.009), (24, -0.004), (25, -0.027), (26, -0.185), (27, -0.044), (28, 0.051), (29, -0.048), (30, -0.04), (31, 0.026), (32, 0.084), (33, -0.062), (34, 0.036), (35, -0.014), (36, -0.083), (37, 0.06), (38, 0.009), (39, 0.024), (40, -0.002), (41, 0.004), (42, -0.068), (43, -0.011), (44, 0.038), (45, -0.05), (46, 0.026), (47, 0.057), (48, 0.025), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97213793 143 nips-2003-On the Dynamics of Boosting
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove R¨ tsch and Warmuth’s conjecture that AdaBoost may fail a to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1
2 0.83667457 41 nips-2003-Boosting versus Covering
Author: Kohei Hatano, Manfred K. Warmuth
Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1
Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan
Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
4 0.50485176 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
5 0.4424732 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features can be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set, and by using features that capture the object regularities well, one can improve performance or reduce detection time.

VJ extract training windows that capture the exact area of the training faces; we term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, is to use training windows that capture wide regions around the object (Figure 3)1.

Figure 3: A. A local window (VJ). B. A contextual window that captures relative position information from objects or parts around and within the detected object.

1 Contextual neighborhoods emerge by downscaling larger regions of the original image to a PxP resolution window.

The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range, as will be demonstrated in the Mutual Boosting scheme presented in the following section2.

4 Mutual Boosting

The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost, any object detector might combine the previously evolving intermediate detectors to generate new weak learners. To elaborate this insight, first note that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100), one could calculate an intermediate strong learner at each step along the way (AB2-AB99).

To apply this observation to our multiclass detection problem, we simultaneously train M object detectors. At each boosting iteration t, the M detectors ABm^(t-1) emerging at the previous stage t-1 are used to filter the positive and negative3 training images, producing intermediate m-detection maps Hm^(t-1) (the likelihood of object m in the images4). Next, the Mutual Boosting stage takes place: all the existing Hm^(t-1) maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched over a PxP-wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, the detection maps Hm (m=1,..,M) are constantly maintained and updated, so that recalculating Hm^t only requires the last chosen weak learner WLmn*^t to be evaluated on channel Hn*^(t-1) of the training image (Figure 4). This evaluation produces a binary detection layer that is weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage's map5.

Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations.
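As a concrete reading of the cutout-and-downscale operation that produces these contextual neighborhoods (step (A) in the pseudocode of the next section), consider the sketch below. The 5x scale factor and the P=25 grid mirror the experimental choices reported later; the function name, the zero-padding at image borders and the block-averaging downscale are our own assumptions.

import numpy as np

def contextual_window(channel, center, obj_size, scale=5, P=25):
    # Cut a region `scale` times the object size around `center` from one
    # channel (raw image or an intermediate detection map) and reduce it
    # to a PxP grid by block averaging.
    r, c = center
    half = (scale * obj_size) // 2
    window = np.zeros((2 * half, 2 * half))
    # Clip the neighborhood to the channel; pixels outside stay zero.
    r0, r1 = max(r - half, 0), min(r + half, channel.shape[0])
    c0, c1 = max(c - half, 0), min(c + half, channel.shape[1])
    window[r0 - (r - half):r1 - (r - half),
           c0 - (c - half):c1 - (c - half)] = channel[r0:r1, c0:c1]
    # Block-average down to PxP (any remainder rows/columns are cropped).
    block = (2 * half) // P
    trimmed = window[:block * P, :block * P]
    return trimmed.reshape(P, block, P, block).mean(axis=(1, 3))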
While most object detection applications extract a local window containing the object information and discard the remaining image (including the objects' positional information), Mutual Boosting processes the entire image during training and detection, making constant use of the information characterizing objects' relative positions in the training images. As previously stated, the detected objects might occupy various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting treats objects, parts and scenes alike, enabling any compositional structure of the data to emerge naturally. We term any contextual reference that is not directly grounded in the basic features a cross referencing of objects6.

2 The most efficient size of the contextual neighborhoods might vary, from the immediate vicinity to the entire image, and should therefore be learned empirically.
3 Images without target objects (see the experimental section below).
4 Unlike the weak learners, the intermediate strong learners do not apply a threshold.
5 To limit the number of detection-map integral-image recalculations, these maps might be updated every k (e.g. 50) iterations rather than at each iteration.
6 Scenes can be cross referenced as well if scene labels are available (office/lab etc.).

The Mutual Boosting procedure can be summarized as follows:

Input:
  H0+/H0-: positive/negative raw images
  Cm[i]: position of instance i of object m=1,..,M in image H0+
Initialization:
  initialize the boosting weights of instances i of object m to 1
  initialize the detection maps Hm+^0/Hm-^0 to 0
For t=1,…,T
  For m=1,..,M and n=0,..,M
    (A) cut out & downscale local (n=0) or contextual (n>0) windows WINm of instances i of object m (at Cm[i]) from all existing images Hn^(t-1)
  For m=1,..,M
    normalize the boosting weights of object m's instances [1]
    (B1&2) select the map Hn*^(t-1) and weak learner WLmn* that minimize the error on WINm
    decrease the boosting weights of instances that WLmn* labeled correctly [1]
    (C) DetectionLayermn* <- WLmn*(Hn*^(t-1))
    calculate αm^t, the weak-learner contribution factor, from the empirical error [1]
    (D) update the m-detection map: Hm^t <- Hm^(t-1) + αm^t · DetectionLayermn*
Return the strong learners ABm^T comprising WLmn*^1,..,T and αm^1,..,T (m=1,..,M)

Figure 4: Mutual Boosting diagram & pseudocode. Each raw image H0 is analyzed by M object detection maps Hm (m=1,..,M), updated by iterating through four steps: (A) cut out & downscale from the existing maps Hn^(t-1) (n=0,..,M) a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m; (B1&2) select the best performing map Hn* and weak learner WLmn* that optimize object m detection; (C) run WLmn* on the Hn* map to generate a new binary m-detection layer; (D) add the m-detection layer to the existing detection map Hm. [1] Standard AdaBoost stages are not elaborated.

To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects, and by centering the local and contextual windows of the positive and negative examples on the object positions of the positive images (see Figure 2).
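A minimal Python sketch of one round of the above procedure for a single detector m follows. Decision stumps over precomputed feature responses stand in for the rectangle-feature weak learners, the threshold search is deliberately crude, and steps (C)/(D), which evaluate the winning stump densely on channel Hn* and add the weighted layer to Hm, are noted in comments rather than implemented.

import numpy as np

def mutual_boost_round(windows, labels, weights):
    # windows[n]: (examples x features) responses on the PxP cutouts from
    # channel n (n=0: raw image; n>0: the other detectors' maps Hn^(t-1)).
    # labels: +1/-1 per example; weights: current, normalized AdaBoost weights.
    best = None
    for n, X in enumerate(windows):            # (B1&2): search every channel
        for f in range(X.shape[1]):            # and every candidate feature
            thresh = np.median(X[:, f])        # crude stand-in for a threshold search
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thresh) > 0, 1, -1)
                err = weights[pred != labels].sum()
                if best is None or err < best[0]:
                    best = (err, n, f, thresh, pol, pred)
    err, n_star, f, thresh, pol, pred = best
    err = min(max(err, 1e-10), 1 - 1e-10)      # guard the logarithm
    alpha = 0.5 * np.log((1 - err) / err)      # weak-learner contribution factor
    weights = weights * np.exp(-alpha * labels * pred)  # down-weight correct instances
    weights = weights / weights.sum()          # renormalize [1]
    # (C)/(D) would follow here: run the stump (f, thresh, pol) densely on
    # channel n_star and add alpha times the binary layer to detector m's map.
    return n_star, (f, thresh, pol), alpha, weights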
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5):
• Features could be used to directly detect parts and objects (A & B)
• Objects could be used to detect other (or identical) objects in the image (C)
• Parts could be used to detect other (or identical) nearby parts (D & E)
• Parts could be used to detect objects (F)
• Objects could be used to detect parts

Figure 5: A-F. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility): A. eye feature from the raw image; B. face feature from the raw image; C. face feature from the face detection image; D. eye feature from the eye detection image; E. mouth feature from the eye detection image; F. face feature from the mouth detection image. The windows' scale is defined by the detected object size and by the map mode (local or contextual). In C, faces are detected using the face detection map HFace, exploiting the fact that faces tend to be horizontally aligned.

5 Experiments

In order to test the contribution of the Mutual Boosting process, we focused on detecting the objects of what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First, as detailed in Figure 5, face-scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, enabling a systematic comparison of results.

Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve, independently trained object detectors using local windows. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and of contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3).

7 By following CMU database conventions (R-eye, L-eye, nose & mouth positions) we derive both the local window position and the relative positions of objects in the image.
8 Local windows were created by downscaling objects to 25x25 grids.

Test-image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C & D). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using the threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-^T. Figure 6B demonstrates the superiority of Mutual Boosting over grounded-feature AdaBoost.

Figure 6: A. Two examples from the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs (detection probability Pd vs. false-alarm probability Pfa) on the CMU/MIT face database.

Our second experiment was aimed at assessing the performance of Mutual Boosting as the variance of the detected configurations changes. Assuming a normal distribution of face configurations, we estimated (from our labeled set) the spatial covariance between four facial components (nose, mouth and both eyes). We then scaled the covariance matrix by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles at the estimated positions of the facial components, as sketched below.
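A possible reading of that configuration-sampling step, under the stated normality assumption; the array shapes and the function name are ours, not the authors'.

import numpy as np

def sample_configurations(mean_xy, cov, cov_scale=1.0, n=100, seed=0):
    # mean_xy: (4, 2) mean positions of R-eye, L-eye, nose and mouth;
    # cov: (8, 8) covariance of the stacked (x, y) coordinates estimated
    # from the labeled set; cov_scale is 0.25, 1 or 4 in the experiment.
    rng = np.random.default_rng(seed)
    flat = rng.multivariate_normal(mean_xy.ravel(), cov_scale * cov, size=n)
    return flat.reshape(n, 4, 2)  # n configurations of four component positions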
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists in both rigid and varying configurations9 (Figure 7).

Figure 7: A. Artificial face configurations with increasing covariance (COV 0.25, 1.00, 4.00). B. MB and AB equal-error-rate performance on configurations with varying covariance, as a function of boosting iterations (curves for MB and AB at sigma = 0.25, 1.00, 4.00).

9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explanation of the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration.

6 Discussion

While evaluating the performance of Mutual Boosting, it should be emphasized that we did not implement the VJ cascade approach; we therefore only attempt to demonstrate that the power of a single AdaBoost learner can be augmented by Mutual Boosting.

The VJ detector is rescaled in order to perform efficient detection of objects at multiple scales. For simplicity, the scale of neighboring objects and parts was assumed to be fixed, so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale, a single m-detection map will not suffice. However, by transforming each m-detection image into an m-detection cube (with scale as the third dimension), multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) ensures that the running time with varying-scale context would only double.

10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations, only double the 4 operations required in the integral-image case.

It should be noted that the face-scene is highly structured and is therefore a good candidate for demonstrating Mutual Boosting; however, as suggested by Figure 7B, Mutual Boosting can handle highly varying configurations, and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 1. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure, but rather permits structure to emerge naturally in the cross referencing pattern (see examples in Figure 5).

Mutual Boosting could be enhanced by unifying the selection of weak learners, rather than selecting an individual weak learner for each object detector. Unified selection aims at choosing weak learners that maximize the detection rate over the entire object set, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained.

Is Mutual Boosting specific to image object detection? It does require labeled input of multiple objects in a scene, supplying a local description of the objects as well as information on their contextual mutual positioning. But these criteria are shared by other complex
6 0.35269451 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
7 0.32435879 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
8 0.29766977 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
9 0.2900019 122 nips-2003-Margin Maximizing Loss Functions
10 0.24418668 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
12 0.23474409 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
13 0.19844355 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
14 0.19293337 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
15 0.19059375 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
16 0.18979135 124 nips-2003-Max-Margin Markov Networks
17 0.18862231 3 nips-2003-AUC Optimization vs. Error Rate Minimization
18 0.17137127 1 nips-2003-1-norm Support Vector Machines
19 0.16916363 102 nips-2003-Large Scale Online Learning
20 0.15917604 181 nips-2003-Statistical Debugging of Sampled Programs
topicId topicWeight
[(0, 0.053), (11, 0.016), (30, 0.017), (35, 0.055), (45, 0.015), (53, 0.112), (69, 0.015), (71, 0.067), (76, 0.032), (78, 0.015), (85, 0.087), (91, 0.117), (95, 0.259), (99, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.79701066 143 nips-2003-On the Dynamics of Boosting
Author: Cynthia Rudin, Ingrid Daubechies, Robert E. Schapire
Abstract: In order to understand AdaBoost’s dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost’s output. By considering AdaBoost as a dynamical system, we are able to prove Rätsch and Warmuth’s conjecture that AdaBoost may fail to converge to a maximal-margin combined classifier when given a ‘nonoptimal’ weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost∗ and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv. 1
2 0.62501991 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.62230676 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
4 0.6214999 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
5 0.61946023 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.61842555 107 nips-2003-Learning Spectral Clustering
7 0.61396515 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
8 0.61032248 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
9 0.60992974 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
10 0.60987443 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
11 0.60923457 30 nips-2003-Approximability of Probability Distributions
12 0.60867089 81 nips-2003-Geometric Analysis of Constrained Curves
13 0.60832089 158 nips-2003-Policy Search by Dynamic Programming
14 0.60755831 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.60697615 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
16 0.6052388 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
17 0.60505986 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
18 0.60485244 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
19 0.60450244 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.60413849 68 nips-2003-Eye Movements for Reward Maximization