nips nips2006 nips2006-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^ν iterations (for sample size n and ν < 1), the sequence of risks of the classifiers it produces approaches the Bayes risk if the Bayes risk L* > 0.
Reference: text
Boosting algorithms such as AdaBoost belong to a group of voting methods, for example [1, 2, 3], that produce a classifier as a linear combination of base or weak classifiers.
While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see [3]), theoretical results do not give a complete explanation of its effectiveness.
Breiman [4] showed that, under some assumptions on the underlying distribution, "population boosting" converges to the Bayes risk as the number of iterations goes to infinity. Since the population version assumes infinite sample size, this does not imply a similar result for AdaBoost, especially given the results of Jiang [5] showing that there are examples where AdaBoost has prediction error that is asymptotically suboptimal at t = ∞ (t is the number of iterations).
Consistency has been established for several modified versions of the algorithm; these modifications include restricting the l1-norm of the combined classifier [6, 7] and restricting the step size of the algorithm [8].
Jiang [9] analyses the unmodified boosting algorithm and proves a process consistency property under certain assumptions. Process consistency means that there exists a sequence (t_n) such that, if AdaBoost with sample size n is stopped after t_n iterations, its risk approaches the Bayes risk.
However, Jiang also imposes strong conditions on the underlying distribution: the distribution of X (the predictor) has to be absolutely continuous with respect to Lebesgue measure, and the function F_B(X) = (1/2) ln[P(Y = 1|X) / P(Y = −1|X)] has to be continuous on X. Also, Jiang's proof is not constructive and does not give any hint as to when the algorithm should be stopped.
Bickel, Ritov and Zakai [10] prove a consistency result for AdaBoost under the assumption that the probability distribution is such that the steps taken by the algorithm are not too large. We would like to obtain a simple stopping rule that guarantees consistency and does not require any modification of the algorithm.
This paper provides a constructive answer to all of the mentioned issues:
1. We provide a simple stopping rule: the number of iterations t is a fixed function of the sample size n.
2. We assume only that the class of base classifiers has finite VC-dimension, and that the span of this class is sufficiently rich.
We are given X, the measurable (feature) space, and Y = {−1, 1}, the set of (binary) labels, together with a sample S_n = ((X_1, Y_1), . . . , (X_n, Y_n)) of n i.i.d. pairs drawn from an unknown distribution on X × Y. Our goal is to construct a classifier g_n : X → Y based on this sample. The quality of the classifier g_n is given by the misclassification probability L(g_n) = P(g_n(X) ≠ Y | S_n). Of course we want this probability to be as small as possible and close to the Bayes risk
L* = inf_g L(g) = E( min{ η(X), 1 − η(X) } ),
where the infimum is taken over all possible (measurable) classifiers and η(·) is the conditional probability η(x) = P(Y = 1 | X = x).
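As a concrete illustration of the Bayes risk formula, the following sketch computes L* for a toy discrete distribution; the numbers are invented purely for the example.

```python
# Illustrative sketch: Bayes risk L* = E min{eta(X), 1 - eta(X)} on a toy
# discrete feature space. The distribution below is made up for illustration.

p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # marginal P(X = x), hypothetical
eta = {0: 0.9, 1: 0.4, 2: 0.5}   # eta(x) = P(Y = 1 | X = x), hypothetical

bayes_risk = sum(p_x[x] * min(eta[x], 1.0 - eta[x]) for x in p_x)
print(f"L* = {bayes_risk:.2f}")  # 0.5*0.1 + 0.3*0.4 + 0.2*0.5 = 0.27
```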
We are going to produce a classifier as a linear combination of base classifiers in H = {h | h : X → Y}; the classifier associated with a real-valued combination f is obtained by thresholding at zero, g(f) = sign(f).
We shall assume that the class H has finite VC (Vapnik–Chervonenkis) dimension
d_VC(H) = max{ |S| : S ⊆ X, |H_|S| = 2^|S| }.
Denote the empirical φ-risk and the φ-risk by
R_{φ,n}(f) = (1/n) Σ_{i=1}^n φ(−Y_i f(X_i))   and   R_φ(f) = E φ(−Y f(X)),
abbreviated R_n and R when the loss is clear from context. Then the boosting procedure can be described as follows.
Start with f_0 = 0; then, for k = 1, . . . , t, set f_k = f_{k−1} + α_{k−1} h_{k−1}, where α_{k−1} ∈ ℝ and h_{k−1} ∈ H are chosen so that
R_n(f_k) = inf_{h∈H, α∈ℝ} R_n(f_{k−1} + αh).
We call α_i the step size of the algorithm at step i.
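To make the procedure concrete, here is a minimal sketch under stated assumptions: decision stumps as the base class H (on a small quantile grid of thresholds, to keep the sketch cheap), the exponential loss with its closed-form exact line search, and the paper's stopping rule t = ⌊n^ν⌋. This is an illustration, not the authors' code; a usage example appears after Theorem 3 below.

```python
import numpy as np

def fit_adaboost_stopped(X, y, nu=0.5):
    """AdaBoost with exact line search on the exponential loss, stopped at n^nu."""
    n, d = X.shape
    t_n = int(n ** nu)                           # stopping rule: t = floor(n^nu)
    # Base class H: stumps h(x) = s * sign(x[j] - thr); the quantile grid of
    # thresholds is a simplification made for this sketch.
    stumps = [(j, thr, s) for j in range(d)
              for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19))
              for s in (-1.0, 1.0)]
    H = np.array([s * np.where(X[:, j] > thr, 1.0, -1.0)
                  for (j, thr, s) in stumps])    # stump predictions on the sample
    f = np.zeros(n)                              # f_0 = 0
    model = []                                   # list of (alpha, j, thr, s)
    for _ in range(t_n):
        w = np.exp(-y * f)                       # weights e^{-Y_i f(X_i)}
        w /= w.sum()
        errs = ((H * y) < 0).astype(float) @ w   # weighted error of each stump
        k = int(np.argmin(errs))                 # greedy choice of h_{k-1}
        e = float(np.clip(errs[k], 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - e) / e)        # exact line search for exp loss
        f += alpha * H[k]
        model.append((alpha,) + stumps[k])
    return model

def predict(model, X):
    f = np.zeros(len(X))
    for alpha, j, thr, s in model:
        f += alpha * s * np.where(X[:, j] > thr, 1.0, -1.0)
    return np.where(f > 0, 1.0, -1.0)            # the classifier g(f) = sign(f)
```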
We shall also use the convex hull of H scaled by λ ≥ 0,
F_λ = { f | f = Σ_{i=1}^n λ_i h_i, n ∈ ℕ ∪ {0}, λ_i ≥ 0, Σ_{i=1}^n λ_i = λ, h_i ∈ H },
as well as the set of k-combinations, k ∈ ℕ, of functions in H,
F^k = { f | f = Σ_{i=1}^k λ_i h_i, λ_i ∈ ℝ, h_i ∈ H }.
We shall also need to define the l_*-norm: for any f ∈ F,
‖f‖_* = inf{ Σ_i |α_i| : f = Σ_i α_i h_i, h_i ∈ H }.
The set of classifiers based on a class F is denoted by g ∘ F = { f̃ | f̃ = g(f), f ∈ F }.
Define the derivative of an arbitrary function Q(·) in the direction of h as
Q′(f; h) = ∂Q(f + λh)/∂λ |_{λ=0}.
The second derivative Q″(f; h) is defined similarly.
3 Consistency of the boosting procedure
We shall need the following assumption.
Assumption 1. Let the distribution P and the class H be such that
lim_{λ→∞} inf_{f∈F_λ} R(f) = R*,
where R* = inf R(f), the infimum being taken over all measurable functions f. For many classes H, the above assumption is satisfied for all possible distributions P.
We begin with a simple lemma (see [1, Theorem 8] or [11, Theorem 6.1]):
Lemma 1. For any t ∈ ℕ, if d_VC(H) ≥ 2, the following holds:
d_P(F^t) ≤ 2(t + 1)(d_VC(H) + 1) log₂[2(t + 1)/ln 2],
where d_P(F^t) is the pseudodimension of the class F^t.
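The bound of Lemma 1 grows essentially linearly in t (up to a log factor), which is what drives the rate constraints later. A quick numeric look, purely illustrative and with a hypothetical d_VC(H) = 3:

```python
import math

# Evaluate the Lemma 1 bound d_P(F^t) <= 2(t+1)(d_VC(H)+1) log2[2(t+1)/ln 2].
def pdim_bound(t, V=3):
    return 2 * (t + 1) * (V + 1) * math.log2(2 * (t + 1) / math.log(2))

for t in (10, 100, 1000, 10000):
    print(f"t = {t:>5}: bound ~ {pdim_bound(t):,.0f}")
```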
The proof of AdaBoost consistency is based on the following result, which builds on a result of Koltchinskii and Panchenko [12] and resembles [6, Lemma 2]. Here π_λ denotes clipping to the interval [−λ, λ], L_φ,λ denotes the Lipschitz constant of the loss φ on the relevant range, and M_φ,λ a bound on its magnitude there.
Lemma 2. For any t and λ,
E sup_{f∈π_λ∘F^t} |R_φ(f) − R_{φ,n}(f)| ≤ cλ L_φ,λ √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ).   (3)
Also, for any δ > 0, with probability at least 1 − δ,
sup_{f∈π_λ∘F^t} |R_φ(f) − R_{φ,n}(f)| ≤ cλ L_φ,λ √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ) + M_φ,λ √( ln(1/δ) / (2n) )   (4)
and
sup_{f∈F_λ} |R_φ(f) − R_{φ,n}(f)| ≤ 4λ L_φ,λ √( 2V ln(4n + 2) / n ) + M_φ,λ √( ln(1/δ) / (2n) ).   (5)
Proof. We begin with symmetrization to get
E sup_{f∈π_λ∘F^t} |R_φ(f) − R_{φ,n}(f)| ≤ 2 E sup_{f∈π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i ( φ(−Y_i f(X_i)) − φ(0) ),
where the σ_i are i.i.d. Rademacher random variables, and then apply the contraction principle ([13, pp. 112–113]) with the function ψ(x) = (φ(x) − φ(0))/L_φ,λ to get
E sup_{f∈π_λ∘F^t} |R_φ(f) − R_{φ,n}(f)| ≤ 4 L_φ,λ E sup_{f∈π_λ∘F^t} (1/n) Σ_{i=1}^n (−σ_i Y_i f(X_i)) = 4 L_φ,λ E sup_{f∈π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i).
Notice that functions in π_λ∘F^t are bounded, clipped to absolute value at most λ; therefore we can rescale π_λ∘F^t by (2λ)^{−1} and get
E sup_{f∈π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i) = 2λ E sup_{f∈(2λ)^{−1}∘π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i).
Next, we are going to use Dudley's entropy integral [14] to bound the right-hand side above:
E sup_{f∈(2λ)^{−1}∘π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i) ≤ (12/√n) ∫₀^∞ √( ln N(ε, (2λ)^{−1}∘π_λ∘F^t, L₂(P_n)) ) dε.
Next, since (2λ)^{−1}∘π_λ is a non-decreasing transform, we use the inequality d_P((2λ)^{−1}∘π_λ∘F^t) ≤ d_P(F^t) (see, e.g., [15]) to obtain
E sup_{f∈(2λ)^{−1}∘π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i) ≤ c √( d_P(F^t) / n ).
And then, since Lemma 1 gives an upper bound on the pseudodimension of the class F^t, we have
E sup_{f∈π_λ∘F^t} (1/n) Σ_{i=1}^n σ_i f(X_i) ≤ cλ √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ),
with the constant c above independent of H, t and λ.
To prove the second statement we use McDiarmid's bounded difference inequality [16, p. 136], since for all i
sup_{(x_j,y_j)_{j=1}^n, (x′_i,y′_i)} | sup_{f∈π_λ∘F^t} |R_φ(f) − R_{φ,n}(f)| − sup_{f∈π_λ∘F^t} |R_φ(f) − R′_{φ,n}(f)| | ≤ M_φ,λ / n,
where R′_{φ,n}(f) is obtained from R_{φ,n}(f) by changing the pair (x_i, y_i) to (x′_i, y′_i). This completes the proof.
Lemma 2, unlike [6, Lemma 2], allows us to choose the number of steps t, which describes the complexity of the linear combination of base functions, in addition to the parameter λ, which governs the size of the deviations of the functions in F; this is essential for the proof of consistency.
For AdaBoost (i.e., the exponential loss φ(x) = e^{−x}) we have to choose λ = κ ln n and t = n^ν with κ > 0, ν > 0 and 2κ + ν < 1.
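A small sketch of this parameter schedule; the particular κ and ν below are arbitrary choices satisfying the stated constraint, not values from the paper:

```python
import math

# The paper's schedule: lambda_n = kappa * ln(n), t_n = n^nu,
# subject to 2*kappa + nu < 1. The constants here are illustrative.
def schedule(n, kappa=0.1, nu=0.5):
    assert 2 * kappa + nu < 1, "need 2*kappa + nu < 1"
    return kappa * math.log(n), int(n ** nu)

for n in (10**3, 10**4, 10**5):
    lam, t = schedule(n)
    print(f"n = {n:>6}: lambda_n = {lam:.2f}, t_n = {t}")
```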
We need the following simple consequence of the proof of [10, Theorem 1].
Theorem 1. Let the function Q(f) be convex in f. Then, for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m:
Q(f_m) ≤ Q(f̄) + √( 8B³ Q(f_0)(Q(f_0) − Q(f̄)) / β³ ) · ( ln( (ℓ_0² + c_3(m + 1)) / ℓ_0² ) )^{−1/2},   (6)
where ℓ_k = ‖f̄ − f_k‖_*, c_3 = 2Q(f_0)/β, β = inf{ Q″(f; h) : Q(f̄) < Q(f) < Q(f_0), h ∈ H }, and B = sup{ Q″(f; h) : Q(f) < Q(f_0), h ∈ H }.
Proof. The statement of the theorem is a version of the result implicit in the proof of [10, Theorem 1]. If for some m̄ we have Q(f_m̄) ≤ Q(f̄), then the theorem is trivially true for all m ≥ m̄. Therefore, we are going to consider only the case Q(f_{m+1}) > Q(f̄).
By convexity of Q(·),
|Q′(f_m; f_m − f̄)| ≥ Q(f_m) − Q(f̄) = ε_m.   (7)
Let f_m − f̄ = Σ_i α̃_i h̃_i, where the α̃_i and h̃_i correspond to the best representation (the one with the smallest l_*-norm). Then from (7) and the linearity of the derivative we have
ε_m ≤ Σ_i α̃_i Q′(f_m; h̃_i) ≤ sup_{h∈H} |Q′(f_m; h)| Σ_i |α̃_i|,
and therefore
sup_{h∈H} |Q′(f_m; h)| ≥ ε_m / ‖f_m − f̄‖_*.   (8)
A second-order Taylor expansion around f_m, together with the definition of β, gives
inf_{α∈ℝ} Q(f_m + α h_m) ≥ Q(f_m) − |Q′(f_m; h_m)|² / (2β).   (9)
On the other hand,
Q(f_m + α_m h_m) = inf_{h∈H, α∈ℝ} Q(f_m + αh) ≤ inf_{h∈H, α∈ℝ} ( Q(f_m) + α Q′(f_m; h) + (α²/2) B ) = Q(f_m) − sup_{h∈H} |Q′(f_m; h)|² / (2B).   (10)
Therefore, combining (9) and (10), we get
|Q′(f_m; h_m)| ≥ √(β/B) sup_{h∈H} |Q′(f_m; h)|.   (11)
Another Taylor expansion, this time around f_{m+1}, gives us
Q(f_m) = Q(f_{m+1}) + (1/2) α_m² Q″(f̃_m; h_m),   (12)
where f̃_m is some (other) function on the path from f_m to f_{m+1}. Therefore, if |α_m| < |Q′(f_m; h_m)|/B, then by (12) and Q″ ≤ B,
Q(f_m) − Q(f_{m+1}) < |Q′(f_m; h_m)|² / (2B);
but by (10),
Q(f_m) − Q(f_{m+1}) ≥ sup_{h∈H} |Q′(f_m; h)|² / (2B) ≥ |Q′(f_m; h_m)|² / (2B),
a contradiction. Hence |α_m| ≥ |Q′(f_m; h_m)|/B and, combining this with (11) and (8), we conclude that
|α_m| ≥ (√β / B^{3/2}) · ε_m / ‖f_m − f̄‖_*.   (13)
Combining (12) and (13), and summing over the iterations (here one uses Σ_j α_j² ≤ 2Q(f_0)/β = c_3, so that ℓ_j² grows at most like ℓ_0² + c_3(j + 1)), we obtain a recursion of the form 1/ε_i² ≥ a + b·i for suitable a and b; solving it yields
ε_m ≤ √( 8B³ Q(f_0)(Q(f_0) − Q(f̄)) / β³ ) · ( ln( (ℓ_0² + (2Q(f_0)/β)(m + 1)) / ℓ_0² ) )^{−1/2},
which is exactly (6) and completes the proof.
The theorem above allows us to get an upper bound on the difference between the φ-risk of the function output by AdaBoost and the φ-risk of an appropriate reference function.
Theorem 2. Let t_n = n^ν be the number of steps we run AdaBoost and let λ_n = κ ln n, with ν > 0, κ > 0 and ν + 2κ < 1. Let f̄_n be a minimizer of the function R_n(·) within F_{λ_n}. Then, for n large enough, with high probability the following holds:
R_n(f_{t_n}) ≤ R_n(f̄_n) + ( 8 / (R*)^{3/2} ) · ( ln( (λ_n² + (4/R*) t_n) / λ_n² ) )^{−1/2}.
Proof.
Because in AdaBoost
R″_n(f; h) = (1/n) Σ_{i=1}^n (−Y_i h(X_i))² e^{−Y_i f(X_i)} = (1/n) Σ_{i=1}^n e^{−Y_i f(X_i)} = R_n(f)
(using Y_i² = h(X_i)² = 1), all the conditions of Theorem 1 are satisfied (with Q(f) replaced by R_n(f)), and in Equation (6) we have B = R_n(f_0) = 1, β ≥ R_n(f̄_n), and ‖f_0 − f̄_n‖_* ≤ λ_n.
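As a numerical sanity check of this identity (an illustration, not part of the proof), a central finite difference of the empirical exponential risk in the direction of a ±1-valued h reproduces the risk itself:

```python
import numpy as np

# Check R_n''(f; h) = R_n(f) for the exponential loss by finite differences.
rng = np.random.default_rng(1)
n = 100
y = rng.choice([-1.0, 1.0], size=n)     # labels Y_i
f = rng.normal(size=n)                  # values f(X_i) on the sample
h = rng.choice([-1.0, 1.0], size=n)     # values h(X_i) of a base classifier

def Rn(fv):
    return np.mean(np.exp(-y * fv))

eps = 1e-4
d2 = (Rn(f + eps * h) - 2 * Rn(f) + Rn(f - eps * h)) / eps**2
print(d2, Rn(f))                        # the two numbers should agree closely
```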
Since in the case R_n(f_t) ≤ R_n(f̄_n) the theorem is trivially true, we only have to notice that Lemma 2 guarantees that, with probability at least 1 − δ,
sup_{f∈F_{λ_n}} |R(f) − R_n(f)| ≤ 4λ_n L_φ,λ_n √( 2V ln(4n + 2) / n ) + M_φ,λ_n √( ln(1/δ) / (2n) ),
so that R_n(f̄_n) concentrates around R(f̄_n) ≥ R*.
Theorem 3. Assume V = d_VC(H) < ∞, L* > 0,
lim_{λ→∞} inf_{f∈F_λ} R(f) = R*,
t_n → ∞, and t_n = O(n^ν) for ν < 1. Then AdaBoost stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L*.
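Before the proof, a quick end-to-end illustration of the statement (it reuses the fit_adaboost_stopped and predict sketches above; the data model, with labels flipped with probability 0.2 so that L* = 0.2 > 0, is invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n, noise=0.2):
    X = rng.uniform(-1, 1, size=(n, 2))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)         # true boundary: x_0 = 0
    flip = rng.random(n) < noise                 # label noise => L* = noise
    return X, np.where(flip, -y, y)

X_test, y_test = sample(20000)
for n in (100, 1000, 10000):
    X, y = sample(n)
    model = fit_adaboost_stopped(X, y, nu=0.5)   # stop after n^0.5 rounds
    err = np.mean(predict(model, X_test) != y_test)
    print(f"n = {n:>5}: test error {err:.3f}  (Bayes risk 0.2)")
```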
Proof. Let f̄ be a minimizer of R and let f̄_n be a minimizer of R_n within F_{λ_n}.
We decompose R(π_{λ_n}(f_{t_n})) − R* via a chain of inequalities (15)–(17), bounding in turn the estimation error over π_{λ_n}∘F^{t_n} (via Lemma 2), the optimization error (via Theorem 2), and the estimation error over F_{λ_n} (via Lemma 2 again). Inequalities (15) and (17) hold with probability at least 1 − δ_n, while inequality (16) is true for sufficiently large n when (17) holds.
The ε's above are
ε_1 = c n^κ κ ln n √( (V + 1)(n^ν + 1) log₂[2(n^ν + 1)/ln 2] / n ) + n^κ √( ln(1/δ_n) / (2n) ),
ε_2 = ( 8 / (R*)^{3/2} ) · ( ln( ((κ ln n)² + (4/R*) n^ν) / (κ ln n)² ) )^{−1/2},
ε_3 = 4 n^κ κ ln n √( 2V ln(4n + 2) / n ) + n^κ √( ln(1/δ_n) / (2n) ),
and φ(λ_n) = n^{−κ}. Now we appeal to the Borel–Cantelli lemma and arrive at R(π_λ(f_{t_n})) → R* a.s. Eventually we can use [17, Theorem 3] to conclude that L(g(f_{t_n})) → L* a.s. Hence AdaBoost is consistent if stopped after n^ν steps.
4 Discussion
We showed that AdaBoost is consistent if stopped sufficiently early, after t_n iterations for t_n = n^ν with ν < 1, given that the Bayes risk L* > 0.
Results by Jiang [5] imply that for some X and function class H the AdaBoost algorithm will achieve zero training error after t_n steps, where n²/t_n = o(1). We do not know what happens in between O(n^{1−ε}) and O(n² ln n).
We analyzed only AdaBoost, the boosting algorithm that uses the loss function φ(x) = e^{−x}. Since the proof of Theorem 2 relies on the properties of the exponential loss, we cannot make a similar conclusion for other versions of boosting, e.g. logit boosting with φ(x) = ln(1 + e^{−x}): in this case the assumption on the second derivative holds only with R″_n(f; h) ≥ R_n(f)/n, and the resulting inequality is trivial, since the factor 1/n precludes us from finding any useful bound.
It is a subject of future work to find an analog of Theorem 2 that will handle the logit loss.
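A small numerical contrast of the two losses (my own example, not from the paper): for the exponential loss the directional second derivative of the empirical risk equals the risk itself, while for the logistic loss the curvature essentially vanishes on badly misclassified examples, which is the behavior that breaks the Theorem 2 argument.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
f = -5.0 * y                                  # large negative margins everywhere
h = rng.choice([-1.0, 1.0], size=n)

def d2(Q, fv, hv, eps=1e-4):                  # central finite difference
    return (Q(fv + eps * hv) - 2 * Q(fv) + Q(fv - eps * hv)) / eps**2

Q_exp = lambda fv: np.mean(np.exp(-y * fv))            # exponential loss
Q_log = lambda fv: np.mean(np.log1p(np.exp(-y * fv)))  # logistic (logit) loss

print("exp  :", d2(Q_exp, f, h), "vs risk", Q_exp(f))  # curvature = risk
print("logit:", d2(Q_log, f, h), "vs risk", Q_log(f))  # curvature << risk
```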
References (surviving entries)
Wenxin Jiang. On weak base hypotheses and their implications for boosting regression and classification. The Annals of Statistics, 2002.
Gábor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 2004.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.
[16] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
2 0.12610248 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification
Author: Zhi-hua Zhou, Min-ling Zhang
Abstract: In this paper, we formalize multi-instance multi-label learning, where each training example is associated with not only multiple instances but also multiple class labels. Such a problem can occur in many real-world tasks, e.g. an image usually contains multiple patches each of which can be described by a feature vector, and the image can belong to multiple categories since its semantics can be recognized in different ways. We analyze the relationship between multi-instance multi-label learning and the learning frameworks of traditional supervised learning, multi-instance learning and multi-label learning. Then, we propose the MIMLBoost and MIMLSVM algorithms which achieve good performance in an application to scene classification.
3 0.12184228 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set.
4 0.11340296 98 nips-2006-Inferring Network Structure from Co-Occurrences
Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak
Abstract: We consider the problem of inferring the structure of a network from co-occurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability.
5 0.098669574 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. 
Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. 
(b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. 
Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). 
j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). 
While this dataset has only segmented image patches, in a real application the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by-12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by-12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping sub-windows. This chained method allows the culling of most of this high-resolution image processing.

4.2 Experiments

For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per Section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1. This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities controls the trade-off between false positives and true positives, and between classification error and speed.

Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm's performance to the performance of support vector machines (SVM) [6] and AdaBoost [1] trained and tested on the highest-resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernel. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and AdaBoost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates.

5 Related Work

Cascade detectors for vision processing (see [8] or [9], for example) may appear to be similar to the work in this paper; especially at first glance in the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time: a coarse detector is first trained, the examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates defines the increasing accuracy of the detector cascade.
By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors seek to minimize the computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage.

The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see.

Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower-resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches.

Classic sequential analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from a (usually) known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution, and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this idea. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) do not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper.

6 Conclusions

We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. It obviously also applies to manufacturing pipelines where errors can be introduced at different stages, and it should also be applicable to scenarios where information gathering is costly. Our current formulation only allows for early negative detection: in the face detection example above, this means that in order to report "face," the classifier must process each stage, even if the result is assured earlier.
In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to information-gathering applications.

Acknowledgements

This research was supported through the grant "Adaptive Decision Making for Silicon Wafer Testing" from Intel Research and UC MICRO.

References

[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995.
[2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.
[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002.
[4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003.
[5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html.
[6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
[7] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT Press, 1999.
[8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001.
[9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005.
[10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947.
[11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968.
[12] Jan Šochman and Jiří Matas. WaldBoost — learning for time constrained sequential detection. In CVPR, pages 150–156, 2005.
6 0.09468393 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
7 0.093771383 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
8 0.088303357 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
9 0.087034084 116 nips-2006-Learning from Multiple Sources
10 0.07624846 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.075395547 61 nips-2006-Convex Repeated Games and Fenchel Duality
12 0.065133929 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
13 0.059311017 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects
14 0.05741147 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
15 0.057304811 175 nips-2006-Simplifying Mixture Models through Function Approximation
16 0.056692507 109 nips-2006-Learnability and the doubling dimension
17 0.050023962 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
18 0.048311882 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
19 0.047178533 181 nips-2006-Stability of $K$-Means Clustering
20 0.042732075 47 nips-2006-Boosting Structured Prediction for Imitation Learning
topicId topicWeight
[(0, -0.133), (1, 0.052), (2, -0.094), (3, -0.031), (4, -0.025), (5, 0.187), (6, -0.059), (7, 0.054), (8, -0.002), (9, 0.08), (10, 0.12), (11, 0.042), (12, 0.026), (13, 0.046), (14, -0.013), (15, -0.022), (16, 0.036), (17, 0.038), (18, -0.03), (19, -0.035), (20, -0.019), (21, 0.193), (22, 0.042), (23, 0.053), (24, -0.002), (25, -0.065), (26, -0.014), (27, -0.046), (28, -0.06), (29, 0.133), (30, -0.051), (31, -0.061), (32, 0.068), (33, 0.233), (34, 0.119), (35, -0.089), (36, -0.019), (37, 0.103), (38, 0.007), (39, -0.144), (40, -0.029), (41, 0.009), (42, -0.152), (43, 0.03), (44, -0.075), (45, 0.083), (46, 0.077), (47, 0.108), (48, -0.029), (49, -0.203)]
simIndex simValue paperId paperTitle
same-paper 1 0.95864099 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
2 0.64748412 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1
3 0.62338591 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification
Author: Zhi-hua Zhou, Min-ling Zhang
Abstract: In this paper, we formalize multi-instance multi-label learning, where each training example is associated with not only multiple instances but also multiple class labels. Such a problem can occur in many real-world tasks, e.g. an image usually contains multiple patches each of which can be described by a feature vector, and the image can belong to multiple categories since its semantics can be recognized in different ways. We analyze the relationship between multi-instance multi-label learning and the learning frameworks of traditional supervised learning, multiinstance learning and multi-label learning. Then, we propose the M IML B OOST and M IML S VM algorithms which achieve good performance in an application to scene classification. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is "not too concentrated" in an L2-norm sense. Using a recent result of Håstad, we extend the analysis to obtain a similar (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
5 0.38985151 47 nips-2006-Boosting Structured Prediction for Imitation Learning
Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff
Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBOOST, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by "boosting" in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1
6 0.38646725 50 nips-2006-Chained Boosting
7 0.38592812 116 nips-2006-Learning from Multiple Sources
8 0.36874509 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
9 0.34692383 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
10 0.32841212 109 nips-2006-Learnability and the doubling dimension
11 0.31941366 98 nips-2006-Inferring Network Structure from Co-Occurrences
12 0.31808832 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
13 0.2898095 193 nips-2006-Tighter PAC-Bayes Bounds
14 0.26233035 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
15 0.26223606 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
16 0.25637835 41 nips-2006-Bayesian Ensemble Learning
17 0.25343609 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects
18 0.24986306 61 nips-2006-Convex Repeated Games and Fenchel Duality
19 0.24435094 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.23413138 181 nips-2006-Stability of $K$-Means Clustering
topicId topicWeight
[(1, 0.079), (3, 0.012), (7, 0.08), (9, 0.054), (20, 0.01), (22, 0.09), (44, 0.092), (57, 0.117), (65, 0.041), (69, 0.018), (94, 0.273)]
simIndex simValue paperId paperTitle
same-paper 1 0.78170902 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
2 0.68126953 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
3 0.59302372 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
4 0.59113538 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1
5 0.58796674 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao
Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1
6 0.58567762 34 nips-2006-Approximate Correspondences in High Dimensions
7 0.58391678 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.58311164 20 nips-2006-Active learning for misspecified generalized linear models
9 0.58273512 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
10 0.58224499 158 nips-2006-PG-means: learning the number of clusters in data
11 0.58216006 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
12 0.57913339 65 nips-2006-Denoising and Dimension Reduction in Feature Space
13 0.57816613 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
14 0.57799649 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
15 0.57795054 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
16 0.57623684 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
17 0.57609433 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
18 0.57587034 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
19 0.57577777 98 nips-2006-Inferring Network Structure from Co-Occurrences
20 0.57543999 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency