nips nips2011 nips2011-299 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. [sent-6, score-0.126]
2 VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. [sent-7, score-0.286]
3 Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. [sent-8, score-0.683]
4 Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. [sent-9, score-0.454]
5 Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. [sent-10, score-0.294]
6 Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). [sent-11, score-0.361]
7 1 Introduction. Many machine learning algorithms implement empirical risk minimization or a regularized variant of it. [sent-12, score-0.151]
8 For example, the popular AdaBoost [4] algorithm minimizes exponential loss on the training examples. [sent-13, score-0.099]
9 Similarly, the support vector machine [11] minimizes hinge loss on the training examples. [sent-14, score-0.099]
10 Therefore, empirical risk minimization on the training set is often performed while regularizing the complexity of the function classes being explored. [sent-17, score-0.178]
11 The rationale behind this regularization approach is that it ensures that the empirical risk converges (uniformly) to the true unknown risk. [sent-18, score-0.129]
12 A key tool in obtaining such concentration inequalities is Hoeffding’s inequality which relates the empirical mean of a bounded random variable to its true mean. [sent-20, score-0.096]
13 Bernstein’s and Bennett’s inequalities relate the true mean of a random variable to the empirical mean but also incorporate the true variance of the random variable. [sent-21, score-0.145]
14 Recently, there have been empirical counterparts of Bernstein’s inequality [1, 5]; these bounds incorporate the empirical variance of a random variable rather than its true variance. [sent-23, score-0.205]
15 An alternative to empirical risk minimization, called sample variance penalization [5], has been proposed and is motivated by empirical Bernstein bounds. [sent-26, score-0.411]
16 A new boosting algorithm is proposed in this paper which implements sample variance penalization. [sent-27, score-0.167]
17 The algorithm minimizes the empirical risk on the training set as well as the empirical variance. [sent-28, score-0.265]
18 Moreover, the algorithm proposed in this article does not require exhaustive enumeration of the weak learners (unlike an earlier algorithm by [10]). [sent-30, score-0.397]
19 Assume that a training set (Xi, yi)_{i=1}^n is provided, where Xi ∈ X and yi ∈ {±1} are drawn independently and identically distributed (iid) from a fixed but unknown distribution D. [sent-31, score-0.199]
20 In the rest of this article, G : X → {±1} denotes the so-called weak learner. [sent-33, score-0.198]
21 The notation Gs denotes the weak learner in a particular iteration s. [sent-34, score-0.307]
22 Further, the two index sets Is and Js denote, respectively, the examples that the weak learner Gs correctly classified and misclassified, i.e. [sent-35, score-0.341]
23 Is := {i | Gs(Xi) = yi} and Js := {j | Gs(Xj) ≠ yj}. [sent-37, score-0.106]
24 Algorithm 1 AdaBoost. Require: (Xi, yi)_{i=1}^n and weak learners H. Initialize the weights: wi ← 1/n for i = 1, ..., n. [sent-38, score-0.682]
25 for s ← 1 to S do: estimate a weak learner Gs(·) from training examples weighted by (wi)_{i=1}^n. [sent-42, score-0.365]
26 Set αs = (1/2) log( Σ_{i:Gs(Xi)=yi} wi / Σ_{j:Gs(Xj)≠yj} wj ); if αs ≤ 0 then break; set f(·) ← f(·) + αs Gs(·); update wi ← wi exp(−yi Gs(Xi) αs)/Zs, where Zs is such that Σ_{i=1}^n wi = 1; end for. [sent-43, score-1.296]
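To make the update concrete, here is a minimal sketch of one AdaBoost round in Python (our own illustration, not code from the paper); it assumes the weak learner has already been fitted and makes at least one mistake.

```python
import numpy as np

def adaboost_round(w, y, pred):
    """One round of Algorithm 1: given current weights w, labels y and the
    +/-1 predictions pred of a fitted weak learner, return (alpha_s, new w)."""
    correct = (pred == y)
    # alpha_s = (1/2) log( sum of weights on correct examples / sum on mistakes )
    alpha = 0.5 * np.log(w[correct].sum() / w[~correct].sum())
    if alpha <= 0:                      # weak learner no better than chance
        return alpha, w
    w_new = w * np.exp(-y * pred * alpha)
    return alpha, w_new / w_new.sum()   # dividing by Z_s keeps sum_i w_i = 1

# toy check: 4 equally weighted examples, the weak learner gets 3 of them right
w = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
pred = np.array([1, 1, -1, 1])
alpha, w = adaboost_round(w, y, pred)
print(alpha)   # 0.5 * log(0.75 / 0.25), roughly 0.549
print(w)       # the misclassified example now carries half of the total weight
```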
27 Algorithm 2 VadaBoost. Require: (Xi, yi)_{i=1}^n, scalar parameter 0 ≤ λ ≤ 1, and weak learners H. Initialize the weights: wi ← 1/n for i = 1, ..., n. [sent-44, score-0.699]
28 for s ← 1 to S do: set ui ← λ n wi² + (1 − λ) wi; estimate a weak learner Gs(·) from training examples weighted by (ui)_{i=1}^n. [sent-48, score-0.383]
29 Set αs = (1/4) log( Σ_{i:Gs(Xi)=yi} ui / Σ_{j:Gs(Xj)≠yj} uj ); if αs ≤ 0 then break; set f(·) ← f(·) + αs Gs(·); update wi ← wi exp(−yi Gs(Xi) αs)/Zs, where Zs is such that Σ_{i=1}^n wi = 1; end for. [sent-49, score-0.886]
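A minimal end-to-end sketch of Algorithm 2 (our own illustrative code, with a brute-force threshold stump standing in for the weak learner class H; any learner that accepts per-example weights could be substituted). The only departures from the AdaBoost loop are the transformed weights ui = λ n wi² + (1 − λ) wi handed to the weak learner and the 1/4 (rather than 1/2) factor in αs.

```python
import numpy as np

def fit_stump(X, y, u):
    """Weak learner: pick the axis-aligned threshold stump with the smallest
    weighted error under the weights u, and return it as a prediction function."""
    best_err, best_rule = np.inf, None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, d] <= thr, 1, -1)
                err = u[pred != y].sum()
                if err < best_err:
                    best_err, best_rule = err, (d, thr, sign)
    d, thr, sign = best_rule
    return lambda Z: sign * np.where(Z[:, d] <= thr, 1, -1)

def vadaboost(X, y, lam=0.5, rounds=10):
    """Sketch of Algorithm 2: boosting with variance-penalized example weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []                                   # list of (alpha_s, G_s)
    for _ in range(rounds):
        u = lam * n * w**2 + (1.0 - lam) * w        # u_i = lam*n*w_i^2 + (1-lam)*w_i
        G = fit_stump(X, y, u)
        pred = G(X)
        correct = (pred == y)
        if u[~correct].sum() == 0:                  # perfect weak learner: stop with
            ensemble.append((1.0, G))               # an arbitrary finite weight
            break
        alpha = 0.25 * np.log(u[correct].sum() / u[~correct].sum())
        if alpha <= 0:
            break
        ensemble.append((alpha, G))
        w = w * np.exp(-y * pred * alpha)
        w /= w.sum()                                # Z_s renormalization
    return lambda Z: np.sign(sum(a * G(Z) for a, G in ensemble))

# usage sketch on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
f = vadaboost(X, y, lam=0.5, rounds=10)
print("training accuracy:", (f(X) == y).mean())
```

Setting lam=0 reduces the example weights to ui = wi, which recovers an AdaBoost-like loop up to the 1/4 factor in αs discussed later in the paper.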
30 AdaBoost (Algorithm 1) assigns a weight wi to each training example. [sent-52, score-0.334]
31 In each step of AdaBoost, a weak learner Gs(·) is obtained on the weighted examples and a weight αs is assigned to it. [sent-53, score-0.363]
32 If a training example is correctly classified, its weight is exponentially decreased; if it is misclassified, its weight is exponentially increased. [sent-55, score-0.101]
33 AdaBoost essentially performs empirical risk minimization: min_{f∈F} (1/n) Σ_{i=1}^n e^{−yi f(Xi)}, greedily constructing the function f(·) as Σ_{s=1}^S αs Gs(·). [sent-57, score-0.143]
34 Recently an alternative to empirical risk minimization has been proposed. [sent-58, score-0.151]
35 This new criterion, known as sample variance penalization [5], trades off the empirical risk against the empirical variance: arg min_{f∈F} (1/n) Σ_{i=1}^n l(f(Xi), yi) + τ √( V̂[l(f(X), y)] / n )   (1), where τ ≥ 0 controls the trade-off between the two quantities. [sent-59, score-0.497]
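For concreteness, a small sketch (ours) of evaluating objective (1) for a vector of per-example losses; the square-root form of the variance term and the unbiased-variance convention are assumptions based on the empirical Bernstein literature.

```python
import numpy as np

def svp_objective(losses, tau):
    """Objective (1): empirical mean of the loss plus tau * sqrt(sample variance / n)."""
    n = len(losses)
    return losses.mean() + tau * np.sqrt(losses.var(ddof=1) / n)

losses = np.array([0.2, 0.0, 1.3, 0.4, 0.1])
print(svp_objective(losses, tau=0.0))   # plain empirical risk minimization
print(svp_objective(losses, tau=1.0))   # also penalizes the spread of the losses
```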
36 The motivation for sample variance penalization comes from the following theorem [5]: Theorem 1. Let (Xi, yi)_{i=1}^n be drawn iid from a distribution D. [sent-60, score-0.347]
37 From the above uniform convergence result, it can be argued that future loss can be minimized by minimizing the right-hand side of the bound on the training examples. [sent-65, score-0.097]
38 Since the variance V̂[l(f(X), y)] has a multiplicative factor involving M(n), δ and n, for a given problem it is difficult to specify the relative importance between the empirical risk and the empirical variance a priori. [sent-66, score-0.326]
39 Hence, sample variance penalization (1) necessarily involves a trade-off parameter τ . [sent-67, score-0.225]
40 Empirical risk minimization or sample variance penalization on the 0 − 1 loss is a hard problem; this problem is often circumvented by minimizing a convex upper bound on the 0 − 1 loss. [sent-68, score-0.435]
41 With the above loss, it was shown by [10] that sample variance penalization is equivalent to minimizing the following cost: ( Σ_{i=1}^n e^{−yi f(Xi)} )² + λ ( n Σ_{i=1}^n e^{−2yi f(Xi)} − ( Σ_{i=1}^n e^{−yi f(Xi)} )² ). [sent-70, score-0.225]
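A direct transcription of the reconstructed cost above into a short sketch (ours), taking the margins mi = yi f(Xi) as input.

```python
import numpy as np

def exp_svp_cost(margins, lam):
    """(sum_i e^{-m_i})^2 + lam * ( n * sum_i e^{-2 m_i} - (sum_i e^{-m_i})^2 )."""
    e = np.exp(-margins)
    s = e.sum()
    return s**2 + lam * (len(margins) * (e**2).sum() - s**2)

margins = np.array([0.5, 1.2, -0.3, 0.8])
print(exp_svp_cost(margins, lam=0.0))   # lam = 0 recovers AdaBoost's (squared) cost
print(exp_svp_cost(margins, lam=0.5))
```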
42 Even though the exponential loss is unbounded, boosting is typically performed only for a finite number of iterations in most practical applications. [sent-72, score-0.126]
43 Moreover, since weak learners typically perform only slightly better than random guessing, each αs in AdaBoost (or in VadaBoost) is typically small thus limiting the range of the function learned. [sent-73, score-0.314]
44 Furthermore, experiments will confirm that sample variance penalization results in a significant empirical performance improvement over empirical risk minimization. [sent-74, score-0.411]
45 3 Derivation of VadaBoost. In the sth iteration, our objective is to choose a weak learner Gs and a weight αs such that the cost (3) is reduced. [sent-80, score-0.366]
46 Denote by wi the quantity e^{−yi Σ_{t=1}^{s−1} αt Gt(Xi)} / Zs; for a candidate weak learner Gs(·), the cost (3) for the function Σ_{t=1}^{s−1} αt Gt(·) + α Gs(·) can be expressed as a function of α: V(α; w, λ, I, J) := ( Σ_{i∈I} wi e^{−α} + Σ_{j∈J} wj e^{α} )² + λ ( n Σ_{i∈I} wi² e^{−2α} + n Σ_{j∈J} wj² e^{2α} − ( Σ_{i∈I} wi e^{−α} + Σ_{j∈J} wj e^{α} )² )   (4) [sent-82, score-0.683]
47 up to a multiplicative factor. [sent-83, score-0.428]
48 Let the vector w whose ith component is wi denote the current set of weights on the training examples. [sent-85, score-0.334]
49 Lemma 2. The update of αs in Algorithm 2 minimizes the cost U(α; w, λ, I, J) := Σ_{i∈I} ( λ n wi² + (1 − λ) wi ) e^{−2α} + Σ_{j∈J} ( λ n wj² + (1 − λ) wj ) e^{2α}.   (5) [sent-87, score-0.137]
50 Theorem 3. Assume that 0 ≤ λ ≤ 1 and Σ_{i=1}^n wi = 1 (i.e., the weights are normalized); then V(α; w, λ, I, J) ≤ U(α; w, λ, I, J) for all α, with equality at α = 0. [sent-91, score-0.282]
51 That is, U is an upper bound on V and the bound is exact at α = 0. [sent-95, score-0.146]
52 On the next line, we used the fact that Σ_{i∈I} wi + Σ_{j∈J} wj = Σ_{i=1}^n wi = 1. [sent-99, score-0.564]
53 Figure 1: Typical upper bound U(α; w, λ, I, J) and actual cost function V(α; w, λ, I, J) values under varying α. [sent-116, score-0.153]
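The comparison in Figure 1 can be mimicked numerically; the sketch below (ours) evaluates the reconstructed V from (4) and U from (5) at random normalized weights and checks that U upper-bounds V with equality at α = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
w = rng.random(n)
w /= w.sum()                      # normalized weights, sum_i w_i = 1
I = np.arange(n) < 6              # indices the weak learner gets right
J = ~I                            # indices it gets wrong
lam = 0.5

def V(a):
    lin = (w[I] * np.exp(-a)).sum() + (w[J] * np.exp(a)).sum()
    quad = n * ((w[I]**2 * np.exp(-2 * a)).sum() + (w[J]**2 * np.exp(2 * a)).sum())
    return lin**2 + lam * (quad - lin**2)

def U(a):
    u = lam * n * w**2 + (1 - lam) * w
    return u[I].sum() * np.exp(-2 * a) + u[J].sum() * np.exp(2 * a)

alphas = np.linspace(-1.0, 1.0, 41)
assert all(U(a) >= V(a) - 1e-9 for a in alphas)   # Theorem 3: U upper-bounds V
assert abs(U(0.0) - V(0.0)) < 1e-9                # ... and the bound is exact at alpha = 0
print(min(U(a) - V(a) for a in alphas))
```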
54 We point out that we use a different upper bound in each iteration, since V and U are parameterized by the current weights in the VadaBoost algorithm. [sent-120, score-0.137]
55 Although the choice 0 ≤ λ ≤ 1 seems restrictive, intuitively, it is natural to have a higher penalization on the empirical mean rather than the empirical variance during minimization. [sent-122, score-0.311]
56 Also, a closer look at the empirical Bernstein inequality in [5] shows that the empirical variance term is multiplied by 1/n while the empirical mean is multiplied by one. [sent-123, score-0.298]
57 Thus, for large values of n, the weight on the sample variance is small. [sent-124, score-0.123]
58 Also, our upper bound is loosest for the case λ = 0. [sent-128, score-0.14]
59 We visualize the upper bound and the true cost for two settings of λ in Figure 1. [sent-129, score-0.181]
60 Since the cost (4) is minimized via an upper bound (5), a natural question is: how good is this approximation? [sent-130, score-0.181]
61 In the other extreme when λ = 0, the cost of VadaBoost coincides with AdaBoost and the bound is effectively at its loosest. [sent-133, score-0.135]
62 Even in this extreme case, VadaBoost derived through an upper bound only requires at most twice the number of iterations as AdaBoost to achieve a particular cost. [sent-134, score-0.148]
63 Theorem 5 Let OA denote the squared cost obtained by AdaBoost after S iterations. [sent-136, score-0.112]
64 For weak learners in any iteration achieving a fixed error rate ε < 0.5, VadaBoost (with λ = 0) needs at most log(4ε(1−ε)) / log(2ε(1−ε) + √(ε(1−ε))) · S iterations to reach a squared cost no larger than OA. [sent-137, score-0.33]
65 Proof. Denote the weight on example i in the sth iteration by wi^s. [sent-139, score-0.373]
66 The weighted error rate of the sth classifier is εs = Σ_{j∈Js} wj^s. [sent-140, score-0.217]
67 We have, for both algorithms, wi^{S+1} = wi^S exp(−yi αS G^S(Xi)) / ZS = exp(−yi Σ_{s=1}^S αs G^s(Xi)) / ( n Π_{s=1}^S Zs ).   (6) [sent-141, score-0.282]
68 The value of the normalization factor in the case of AdaBoost is Zs^a = Σ_{i∈Is} wi^s e^{−αs} + Σ_{j∈Js} wj^s e^{αs} = 2 √( εs (1 − εs) ).   (7) [sent-142, score-0.71]
69 Similarly, the value of the normalization factor for VadaBoost is given by Zs^v = Σ_{i∈Is} wi^s e^{−αs} + Σ_{j∈Js} wj^s e^{αs} = ( εs (1 − εs) )^{1/4} ( √εs + √(1 − εs) ).   (8) [sent-143, score-0.428]
70 The squared cost function of AdaBoost after S steps is given by OA = ( Σ_{i=1}^n exp(−yi Σ_{s=1}^S αs G^s(Xi)) )² = ( Σ_{i=1}^n n Π_{s=1}^S Zs^a wi^{S+1} )² = n² Π_{s=1}^S (Zs^a)² = n² Π_{s=1}^S 4 εs (1 − εs). [sent-144, score-0.48]
71 Here we used (6), (7) and the fact that Σ_{i=1}^n wi^{S+1} = 1; similarly, for λ = 0 the cost of VadaBoost satisfies OV = ( Σ_{i=1}^n exp(−yi Σ_{s=1}^S αs G^s(Xi)) )² = ( Σ_{i=1}^n n Π_{s=1}^S Zs^v wi^{S+1} )² = n² Π_{s=1}^S (Zs^v)² = n² Π_{s=1}^S ( 2 εs (1 − εs) + √( εs (1 − εs) ) ). [sent-146, score-0.735]
72 Then, with a fixed error rate ε in every iteration, the squared cost achieved by AdaBoost is n² (4 ε (1 − ε))^S. [sent-148, score-0.112]
73 To achieve the same cost value, VadaBoost with weak learners at the same error rate needs at most log(4 ε (1 − ε)) / log(2 ε (1 − ε) + √(ε (1 − ε))) · S iterations. [sent-149, score-0.399]
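The iteration overhead in the theorem can be tabulated directly; a tiny sketch (ours):

```python
import numpy as np

def iteration_factor(eps):
    """How many times more iterations VadaBoost (lambda = 0) may need than AdaBoost
    when every weak learner has weighted error eps < 0.5."""
    return np.log(4 * eps * (1 - eps)) / np.log(2 * eps * (1 - eps) + np.sqrt(eps * (1 - eps)))

for eps in (0.05, 0.1, 0.2, 0.3, 0.4, 0.45):
    print(f"error rate {eps:.2f} -> factor {iteration_factor(eps):.3f}")
# the factor only approaches 2 as eps -> 0 and is roughly 4/3 for eps near 0.5,
# consistent with the "at most twice the number of iterations" remark above
```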
74 Algorithm 3 EBBoost. Require: (Xi, yi)_{i=1}^n, scalar parameter λ ≥ 0, and weak learners H. Initialize the weights: wi ← 1/n for i = 1, ..., n. [sent-154, score-0.699]
75 A limitation of the EBBoost algorithm. A sample variance penalization algorithm known as EBBoost was previously explored [10]. [sent-159, score-0.252]
76 While this algorithm was simple to implement and showed significant improvements over AdaBoost, it suffers from a severe limitation: it requires enumeration and evaluation of every possible weak learner per iteration. [sent-160, score-0.341]
77 An implementation of EBBoost requires exhaustive enumeration of weak learners in search of the one that minimizes cost (3). [sent-162, score-0.517]
78 It is preferable, instead, to find the best weak learner by providing weights on the training examples and efficiently computing the rule whose performance on that weighted set of examples is guaranteed to be better than random guessing. [sent-163, score-0.416]
79 Thus, it becomes necessary to exhaustively enumerate weak learners in Algorithm 3. [sent-165, score-0.314]
80 While enumeration of weak learners is possible in the case of decision stumps, it poses serious difficulties in the case of weak learners such as decision trees, ridge regression, etc. [sent-166, score-0.742]
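The interface difference behind this limitation can be sketched as follows (our own illustration; the function names are hypothetical): EBBoost must score its cost for every candidate rule explicitly, whereas VadaBoost only needs a weak learner that minimizes weighted error on examples weighted by ui, such as the stump fit sketched earlier or an off-the-shelf weighted CART.

```python
def best_rule_by_enumeration(candidate_rules, svp_cost):
    """EBBoost-style search: evaluate the full sample-variance-penalized cost for
    every candidate weak learner; feasible only when the hypothesis class (e.g.
    decision stumps) is small enough to enumerate."""
    return min(candidate_rules, key=svp_cost)

def best_rule_by_weighted_oracle(weighted_fit, X, y, u):
    """VadaBoost-style search: hand the per-example weights u to any learner that
    minimizes weighted training error (stumps, CART trees, ...)."""
    return weighted_fit(X, y, u)
```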
81 Thus, VadaBoost is the more versatile boosting algorithm for sample variance penalization. [sent-167, score-0.167]
82 2 The cost which VadaBoost minimizes at λ = 0 is the squared cost of AdaBoost; we do not square it again. [sent-168, score-0.249]
83 Table 1: Mean and standard errors with decision stump as the weak learner. [sent-169, score-0.257]
84 Table 2: Mean and standard errors with CART as the weak learner. [sent-350, score-0.198]
85 The primary purpose of our experiments is to compare sample variance penalization versus empirical risk minimization and to show that we can efficiently perform sample variance penalization for weak learners beyond decision stumps. [sent-496, score-0.963]
86 The first set of experiments use decision stumps as the weak learners. [sent-503, score-0.294]
87 The second set of experiments used Classification and Regression Trees or CART [3] as weak learners. [sent-504, score-0.198]
88 With CART as the weak learner, VadaBoost is once again significantly better than AdaBoost. [sent-517, score-0.198]
89 An assumption in that theorem was that the error rate of each weak learner was fixed. [sent-521, score-0.31]
90 However, in practice, the error rates of the weak learners are not constant over the iterations. [sent-522, score-0.314]
91 In Figure 2 we show the cost (plus 1) for each algorithm (the AdaBoost cost has been squared) versus the number of iterations, using a logarithmic scale on the Y-axis. [sent-524, score-0.223]
92 From the figure, it can be seen that the number of iterations required by VadaBoost is roughly twice the number of iterations required by AdaBoost. [sent-527, score-0.123]
93 7 Conclusions. This paper identified a key weakness in the EBBoost algorithm and proposed a novel algorithm that efficiently overcomes the limitation to enumerable weak learners. [sent-530, score-0.31]
94 VadaBoost reduces a well motivated cost by iteratively minimizing an upper bound which, unlike EBBoost, allows the boosting method to handle any weak learner by estimating weights on the data. [sent-531, score-0.589]
95 Furthermore, despite the use of an upper bound, the novel boosting method remains efficient. [sent-533, score-0.115]
96 Even when the bound is at its loosest, the number of iterations required by VadaBoost is a small constant factor more than the number of iterations required by AdaBoost. [sent-534, score-0.158]
97 Experimental results showed that VadaBoost outperforms AdaBoost in terms of classification accuracy while applying efficiently to any family of weak learners. [sent-535, score-0.198]
98 The effectiveness of boosting has been explained via margin theory [9] though it has taken a number of years to settle certain open questions [8]. [sent-536, score-0.105]
99 Another future research direction is to design efficient sample variance penalization algorithms for other problems such as multi-class classification, ranking, and so on. [sent-538, score-0.225]
100 How boosting the margin can also boost classifier complexity. [sent-594, score-0.107]
wordName wordTfidf (topN-words)
[('vadaboost', 0.639), ('adaboost', 0.31), ('ebboost', 0.286), ('wi', 0.282), ('gs', 0.266), ('weak', 0.198), ('wj', 0.146), ('penalization', 0.127), ('zs', 0.118), ('learners', 0.116), ('js', 0.115), ('learner', 0.093), ('bernstein', 0.09), ('cart', 0.089), ('yi', 0.086), ('cost', 0.085), ('risk', 0.072), ('variance', 0.07), ('boosting', 0.069), ('nwi', 0.067), ('stumps', 0.064), ('empirical', 0.057), ('minimizes', 0.052), ('bound', 0.05), ('musk', 0.05), ('nwj', 0.05), ('sth', 0.05), ('enumeration', 0.05), ('upper', 0.046), ('loosest', 0.044), ('oa', 0.041), ('xi', 0.039), ('iterations', 0.037), ('misclassi', 0.034), ('abalone', 0.034), ('splice', 0.034), ('twonorm', 0.034), ('wisc', 0.034), ('decision', 0.032), ('initialize', 0.03), ('validation', 0.03), ('ringnorm', 0.03), ('mnist', 0.028), ('sample', 0.028), ('training', 0.027), ('stump', 0.027), ('mushrooms', 0.027), ('squared', 0.027), ('classi', 0.027), ('limitation', 0.027), ('examples', 0.026), ('shivaswamy', 0.025), ('spambase', 0.025), ('wine', 0.025), ('weights', 0.025), ('weight', 0.025), ('jebara', 0.024), ('correctly', 0.024), ('iteratively', 0.023), ('waveform', 0.023), ('minimization', 0.022), ('brute', 0.022), ('break', 0.022), ('inequality', 0.021), ('weighted', 0.021), ('margin', 0.021), ('yj', 0.02), ('hoeffding', 0.02), ('gt', 0.02), ('loss', 0.02), ('theorem', 0.019), ('inequalities', 0.018), ('actual', 0.018), ('ui', 0.018), ('multiplied', 0.018), ('shaded', 0.018), ('boost', 0.017), ('required', 0.017), ('scalar', 0.017), ('szepesv', 0.017), ('exp', 0.017), ('article', 0.017), ('iid', 0.017), ('iteration', 0.016), ('versus', 0.016), ('exhaustive', 0.016), ('twice', 0.015), ('trees', 0.015), ('bandit', 0.015), ('homeland', 0.015), ('tony', 0.015), ('onoda', 0.015), ('olshen', 0.015), ('freund', 0.015), ('ny', 0.015), ('effectiveness', 0.015), ('gure', 0.014), ('force', 0.014), ('criterion', 0.014), ('performs', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
2 0.18730059 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
3 0.15962084 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.1232925 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
5 0.11154678 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
6 0.091762841 49 nips-2011-Boosting with Maximum Adaptive Sampling
7 0.088539444 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
8 0.079501912 153 nips-2011-Learning large-margin halfspaces with more malicious noise
9 0.071605712 217 nips-2011-Practical Variational Inference for Neural Networks
10 0.064806633 258 nips-2011-Sparse Bayesian Multi-Task Learning
11 0.060353987 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
12 0.055580955 180 nips-2011-Multiple Instance Filtering
13 0.054578882 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
14 0.05207172 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
15 0.051425278 28 nips-2011-Agnostic Selective Classification
16 0.049933612 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
17 0.046136726 198 nips-2011-On U-processes and clustering performance
18 0.044255305 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
19 0.043234181 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
20 0.042746708 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
topicId topicWeight
[(0, 0.13), (1, -0.026), (2, -0.056), (3, -0.052), (4, 0.004), (5, 0.074), (6, 0.021), (7, -0.084), (8, -0.229), (9, 0.056), (10, -0.068), (11, 0.002), (12, -0.015), (13, -0.081), (14, 0.016), (15, -0.012), (16, -0.01), (17, -0.128), (18, -0.103), (19, -0.146), (20, 0.022), (21, -0.048), (22, -0.034), (23, 0.036), (24, 0.127), (25, -0.041), (26, 0.04), (27, 0.031), (28, 0.087), (29, 0.016), (30, -0.055), (31, -0.009), (32, 0.073), (33, 0.015), (34, 0.006), (35, 0.128), (36, 0.032), (37, -0.063), (38, 0.007), (39, 0.009), (40, -0.053), (41, -0.039), (42, 0.037), (43, 0.044), (44, 0.043), (45, -0.029), (46, -0.051), (47, 0.043), (48, -0.093), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.94232702 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
2 0.73580092 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
3 0.72925347 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
4 0.61684251 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
5 0.57333672 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
6 0.55274546 49 nips-2011-Boosting with Maximum Adaptive Sampling
7 0.4421851 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.42313689 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
9 0.39002824 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
10 0.36848637 111 nips-2011-Hashing Algorithms for Large-Scale Learning
11 0.34374124 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
12 0.34156695 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
13 0.34015712 95 nips-2011-Fast and Accurate k-means For Large Datasets
14 0.33810112 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
15 0.33612472 217 nips-2011-Practical Variational Inference for Neural Networks
16 0.32221562 198 nips-2011-On U-processes and clustering performance
17 0.32098874 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
18 0.30896038 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
19 0.30838057 240 nips-2011-Robust Multi-Class Gaussian Process Classification
20 0.30760714 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
topicId topicWeight
[(0, 0.013), (4, 0.032), (20, 0.029), (26, 0.034), (31, 0.07), (33, 0.027), (43, 0.111), (45, 0.105), (57, 0.062), (65, 0.013), (74, 0.034), (83, 0.026), (85, 0.282), (99, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.76327682 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
2 0.75904632 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
3 0.72273195 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
4 0.60001546 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
5 0.56363308 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama
Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1
6 0.56101358 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.55932212 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
8 0.55681717 24 nips-2011-Active learning of neural response functions with Gaussian processes
9 0.55499202 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
10 0.55439132 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
11 0.55311149 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
12 0.55248392 186 nips-2011-Noise Thresholds for Spectral Clustering
13 0.54950219 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
14 0.54891258 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
15 0.54838735 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
16 0.54828906 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
17 0.54822671 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
18 0.54776323 29 nips-2011-Algorithms and hardness results for parallel large margin learning
19 0.54719126 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
20 0.54618382 109 nips-2011-Greedy Model Averaging