nips nips2011 nips2011-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: The problem of multi-class boosting is considered. [sent-4, score-0.374]
2 A new framework, based on multi-dimensional codewords and predictors is introduced. [sent-5, score-0.442]
3 The optimal set of codewords is derived, and a margin enforcing loss proposed. [sent-6, score-0.745]
4 The resulting risk is minimized by gradient descent on a multidimensional functional space. [sent-7, score-0.542]
5 Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. [sent-8, score-0.611]
6 The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. [sent-9, score-0.661]
7 Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. [sent-11, score-0.725]
8 1 Introduction. Boosting is a popular approach to classifier design in machine learning. [sent-12, score-0.026]
9 It is a simple and effective procedure to combine many weak learners into a strong classifier. [sent-13, score-0.375]
10 However, most existing boosting methods were designed primarily for binary classification. [sent-14, score-0.448]
11 In many cases, the extension to M-ary problems (M > 2) is not straightforward. [sent-15, score-0.075]
12 Nevertheless, the design of multi-class boosting algorithms has been investigated since the introduction of AdaBoost in [8]. [sent-16, score-0.4]
13 The first approach is to reduce the multiclass problem to a collection of binary sub-problems. [sent-18, score-0.401]
14 Methods in this class include the popular “one vs all” approach, as well as “all pairs”, ECOC [4, 1], AdaBoost-M2 [7], AdaBoost-MR [18] and AdaBoost-MH [18, 9]. [sent-19, score-0.043]
15 The second approach is to boost an M-ary classifier directly, using multiclass weak learners, such as trees. [sent-21, score-0.488]
16 These methods require relatively strong weak learners, which substantially increases complexity and carries a high potential for overfitting. [sent-23, score-0.375]
17 This is particularly problematic because, although there is a unified view of these methods under the game theory interpretation of boosting [16], none of them has been shown to maximize the multiclass margin. [sent-24, score-0.701]
18 Overall, the problem of optimal and efficient M-ary boosting is still not as well understood as its binary counterpart. [sent-25, score-0.496]
19 We propose two algorithms to solve this optimization: CD-MCBoost, which is a functional coordinate descent procedure, and GD-MCBoost, which implements functional gradient descent. [sent-27, score-0.524]
20 The two algorithms differ in terms of the strategy used to update the multidimensional predictor. [sent-28, score-0.092]
21 CD-MCBoost supports any type of weak learner, updating one component of the predictor per boosting iteration; GD-MCBoost requires multiclass weak learners but updates all components simultaneously. [sent-29, score-1.68]
22 Both methods directly optimize the predictor of the multiclass problem and are shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the classification risk. [sent-30, score-0.888]
23 Experiments show that they outperform comparable prior methods on a number of datasets. [sent-32, score-0.024]
24 2 Multiclass boosting. We start by reviewing the fundamental ideas behind the classical use of boosting for the design of binary classifiers, and then extend these ideas to the multiclass setting. [sent-33, score-1.258]
25 2.1 Binary classification. A binary classifier, F(x), is a mapping from examples x ∈ X to class labels y ∈ {−1, 1}. [sent-35, score-0.156]
26 The optimal classifier, in the minimum probability of error sense, is the Bayes decision rule F(x) = arg max_{y∈{−1,1}} P_{Y|X}(y|x). [sent-36, score-0.214]
27 In practice, the optimal predictor is learned from a sample D = {(x_i, y_i)}_{i=1}^n of training examples, and (4) is approximated by the empirical risk Σ_{i=1}^n L[y_i, f(x_i)]. [sent-41, score-0.584]
28 A loss L[·, ·] is said to be Bayes consistent if (1) and (2) are equivalent. [sent-44, score-0.029]
29 For large margin methods, such as boosting, the loss is also a function of the classification margin yf(x), i.e. of the form L[y, f(x)] = φ(yf(x)). [sent-45, score-0.592]
30 This dependence on the margin yf (x) guarantees that the classifier has good generalization when the training sample is small [19]. [sent-49, score-0.361]
31 Boosting learns the optimal predictor f*(x): X → R as the solution of min_{f(x)} R[f(x)] s.t. f(x) ∈ span(H), [sent-50, score-0.406]
32 where H = {h_1(x), . . . , h_p(x)} is a set of weak learners h_i(x): X → R, and the optimization is carried out by gradient descent in the functional space span(H) of linear combinations of the h_i(x) [14]. [sent-55, score-0.842]
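The binary setup above can be made concrete with a short sketch: functional gradient descent over span(H) for the exponential loss, picking at each iteration the weak learner most aligned with the negative functional gradient and taking an AdaBoost-style step. The function names, the finite candidate set `weak_learners`, and the closed-form step size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_binary_boost(X, y, weak_learners, T=50):
    """Minimal sketch of boosting as functional gradient descent over span(H)
    for the binary exponential loss L[y, f(x)] = exp(-y f(x)).
    `y` holds labels in {-1, +1}; `weak_learners` is a finite list of
    candidate functions h: X -> {-1, +1}. Illustrative only."""
    f = np.zeros(len(y))                  # current predictor evaluated on the sample
    ensemble = []                         # chosen (alpha, h) pairs
    for _ in range(T):
        w = np.exp(-y * f)                # example weights = pointwise loss values
        # weak learner most aligned with the negative functional gradient
        scores = [np.sum(w * y * h(X)) for h in weak_learners]
        h_best = weak_learners[int(np.argmax(scores))]
        hx = h_best(X)
        # closed-form line search for +/-1 weak learners (AdaBoost step)
        err = np.clip(np.sum(w * (hx != y)) / np.sum(w), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        f += alpha * hx
        ensemble.append((alpha, h_best))
    return ensemble
```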
33 2.2 Multiclass setting. To extend the above formulation to the multiclass setting, we note that the definition of the classification labels as ±1 plays a significant role in the formulation of the binary case. [sent-57, score-0.49]
34 One of the difficulties of the multiclass case is that these labels do not have an immediate multiclass counterpart. [sent-58, score-0.767]
35 To address this problem, we return to the classical setting, where the class labels of an M-ary problem take values in the set {1, . . . , M}. [sent-59, score-0.082]
36 Each class k is then mapped into a distinct class label y k , which can be thought of as a codeword that identifies the class. [sent-63, score-0.121]
37 In the binary case, these codewords are defined as y 1 = 1 and y 2 = −1. [sent-64, score-0.479]
38 It is possible to derive an alternative form for the expressions of the margin and classifier F (x) that depends explicitly on codewords. [sent-65, score-0.186]
39 For this, we note that (2) can be written as F(x) = arg max_k y^k f*(x) (8) and the margin can be expressed as yf = {f if k = 1; −f if k = 2} = {(1/2)(y^1 f − y^2 f) if k = 1; (1/2)(y^2 f − y^1 f) if k = 2} = (1/2)(y^k f − max_{l≠k} y^l f). (9) [sent-66, score-0.496]
40 The interesting property of these forms is that they are directly extensible to the M-ary classification case. [sent-67, score-0.038]
41 For this, we assume that the codewords y^k and the predictor f(x) are multi-dimensional, i.e. y^k ∈ R^d and f(x): X → R^d. [sent-68, score-0.708]
42 The margin of f(x) with respect to class k is then defined as M(f(x), y^k) = (1/2)[<f(x), y^k> − max_{l≠k} <f(x), y^l>] (10) and the classifier as F(x) = arg max_k <f(x), y^k>, (11) where <·, ·> denotes the dot product. [sent-71, score-0.394]
43 Note that this is equivalent to F(x) = arg max_{k∈{1,...,M}} M(f(x), y^k), (12) [sent-74, score-0.108]
44 and thus F(x) is the class of largest margin for the predictor f(x). [sent-77, score-0.532]
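A small sketch of (10)-(12), with hypothetical names: `codewords` is an (M, M-1) array holding the y^k as rows and `f_x` the value of the multi-dimensional predictor at one point.

```python
import numpy as np

def multiclass_margin(f_x, codewords, k):
    """M(f(x), y^k) of (10): half the gap between the projection of f(x) onto
    the codeword of class k and its largest projection onto any other codeword."""
    proj = codewords @ f_x                 # <f(x), y^l> for every class l
    return 0.5 * (proj[k] - np.delete(proj, k).max())

def classify(f_x, codewords):
    """F(x) of (11)/(12): the class of largest projection, i.e. of largest margin."""
    return int(np.argmax(codewords @ f_x))
```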
45 This definition is closely related to previous notions of multiclass margin. [sent-78, score-0.327]
46 For example, it generalizes that of [11], where the codewords y k are restricted to the binary vectors in the canonical basis of Rd , and is a special case of that in [1], where the dot products < f (x), y k > are replaced by a generic function of f, x, and k. [sent-79, score-0.479]
47 Given a training sample D = {(x_i, y_i)}_{i=1}^n, the optimal predictor f*(x) minimizes the risk R_M(f) = E_{X,Y}{L_M[y, f(x)]} ≈ Σ_{i=1}^n L_M[y_i, f(x_i)], (13) where L_M[·, ·] is a multiclass loss function. [sent-80, score-0.611]
48 A natural extension of (6) and (9) is a loss of the form L_M[y, f(x)] = φ(M(f(x), y)). (14) [sent-83, score-0.082]
49 To avoid the nonlinearity of the max operator in (10), we rely on L_M[y, f(x)] = Σ_{k=1}^M e^{−(1/2)[<f(x), y> − <f(x), y^k>]}. (15) [sent-84, score-0.027]
50 It follows that the minimization of the risk of (13) encourages predictors of large margin M(f ∗ (xi ), yi ), ∀i. [sent-86, score-0.494]
51 For M = 2, L_M[y, f(x)] reduces to L_2[y, f(x)] = 1 + e^{−yf(x)}, (16) and the risk minimization problem is identical to that of AdaBoost [8]. [sent-87, score-0.197]
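As a sanity check on the loss above (written here in its reconstructed form), a short sketch that also verifies the M = 2 reduction of (16); all names are hypothetical.

```python
import numpy as np

def multiclass_loss(f_x, y, codewords):
    """Loss of (15): sum_k exp(-0.5 * [<f(x), y> - <f(x), y^k>]),
    where `y` is the codeword of the true class."""
    return float(np.sum(np.exp(-0.5 * (f_x @ y - codewords @ f_x))))

# M = 2 with codewords y^1 = +1, y^2 = -1: the loss equals 1 + exp(-y f(x)),
# i.e. the binary reduction stated in (16).
codewords2 = np.array([[1.0], [-1.0]])
f_x = np.array([0.7])
assert np.isclose(multiclass_loss(f_x, codewords2[0], codewords2),
                  1.0 + np.exp(-0.7))
```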
52 In appendices B and C it is shown that R_M(f) is convex and Bayes consistent, in the sense that if f*(x) is the minimizer of (13), then <f*(x), y^k> = log P_{Y|X}(y^k|x) + c, ∀k, (17) and (11) implements the Bayes decision rule F(x) = arg max_k P_{Y|X}(y^k|x). (18) [sent-88, score-0.263]
53 2.3 Optimal set of codewords. From (15), the choice of codewords y^k has an impact on the optimal predictor f*(x), which is determined by the projections <f*(x), y^k>. [sent-90, score-0.453]
54 To maximize the margins of (10), the difference between these projections should be as large as possible. [sent-91, score-0.061]
55 To accomplish this we search for the set of M distinct unit codewords Y = {y^1, . . . , y^M} ⊂ R^d that are as dissimilar as possible, [sent-92, score-0.405]
56 i.e. the solution of max_{d, y^1,...,y^M} min_{i≠j} ||y^i − y^j||^2 s.t. ||y^k|| = 1 ∀k. (19) [sent-95, score-0.028]
57 Figure 1: Optimal codewords for M = 2, 3, 4. [sent-121, score-0.043]
58 To solve this problem, we start by noting that, for d < M, the smallest distance of (19) can be increased by simply increasing d, since this leads to a larger space. [sent-122, score-0.028]
59 The codewords y^1, . . . , y^M lie in an at most (M − 1)-dimensional subspace of R^d. [sent-126, score-0.028]
60 On the contrary, as shown in Appendix D, if d > M − 1 there exists a vector v ∈ R^d with equal projection on all codewords, <y^i, v> = <y^j, v>, ∀i, j = 1, . . . , M. (20) [sent-129, score-0.038]
61 Since the addition of v to the predictor f(x) does not change the classification rule of (11), this makes the optimal predictor underdetermined. [sent-132, score-0.038]
62 In this case, as shown in Appendix E, the vertices of an (M − 1)-dimensional regular simplex centered at the origin [3] are solutions of (19). [sent-135, score-0.135]
63 Figure 1 presents the set of optimal codewords when M = 2, 3, 4. [sent-135, score-0.453]
64 Note that in the binary case this set consists of the traditional codewords yi ∈ {+1, −1}. [sent-136, score-0.553]
65 In general, there is no closed form solution for the vertices of a regular simplex of M vectors. [sent-137, score-0.167]
66 However, these can be derived from those of a regular simplex of M − 1 vectors, and a recursive solution is possible [3]. [sent-138, score-0.128]
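The recursion mentioned above can be sketched as follows; this is one standard construction of unit-norm simplex vertices centered at the origin and is not necessarily the exact procedure of [3].

```python
import numpy as np

def simplex_codewords(M):
    """M unit-norm codewords forming the vertices of a regular (M-1)-simplex
    centered at the origin: pairwise dot products -1/(M-1), vertices sum to zero.
    For M = 2 this returns the usual binary codewords {+1, -1}."""
    if M == 2:
        return np.array([[1.0], [-1.0]])
    prev = simplex_codewords(M - 1)              # (M-1) vertices in R^{M-2}
    c = -1.0 / (M - 1)                           # dot product with the new vertex
    a = np.sqrt(1.0 - c ** 2)                    # scaling that preserves unit norm
    new_vertex = np.eye(M - 1)[-1]               # e_{M-1}
    rest = np.hstack([a * prev, np.full((M - 1, 1), c)])
    return np.vstack([new_vertex, rest])

Y = simplex_codewords(4)   # an optimal codeword set for M = 4, up to rotation
```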
67 3 Risk minimization. We have so far defined a proper margin loss function for M-ary classification and identified an optimal codebook. [sent-139, score-0.317]
68 In this section, we derive two boosting algorithms for the minimization of the classification risk of (13). [sent-140, score-0.571]
69 The first is a functional coordinate descent algorithm, which updates a single component of the predictor per boosting iteration. [sent-142, score-1.083]
70 The second is a functional gradient descent algorithm that updates all components simultaneously. [sent-143, score-0.384]
71 3.1 Coordinate descent. In the first method, each component f_i*(x) of the optimal predictor f*(x) = [f_1*(x), . . . , f_{M−1}*(x)] is the linear combination of weak learners that solves the optimization problem [sent-145, score-0.495]
72 min_{f_1(x),...,f_{M−1}(x)} R[f_1(x), . . . , f_{M−1}(x)] s.t. f_i(x) ∈ span(H), i = 1, . . . , M − 1, (21) [sent-147, score-0.375]
73 where H = {h_1(x), . . . , h_p(x)} is a set of weak learners, h_i(x): X → R. [sent-160, score-0.267]
74 Let f^t(x) = [f_1^t(x), . . . , f_{M−1}^t(x)] be the predictor available after t boosting iterations. [sent-165, score-0.677]
75 At iteration t + 1 a single component f_j(x) of f(x) is updated with a step in the direction of the scalar functional g that most decreases the risk R[f_1^t, . . . , f_j^t + g, . . . , f_{M−1}^t] = R[f^t(x) + g(x)1_j], [sent-166, score-0.512]
76 where 1_j ∈ R^d is a vector whose j-th element is one and the remainder zero, i.e. the j-th vector of the canonical basis. [sent-174, score-0.029]
77 Using the risk of (13), it is shown in Appendix F that −δR[f^t; j, g] = Σ_{i=1}^n g(x_i) w_i^j, (23) with w_i^j = (1/2) Σ_{k=1}^M <1_j, y_i − y^k> e^{−(1/2)<f^t(x_i), y_i − y^k>}, (24) where δR[f^t; j, g] denotes the functional derivative of the risk at f^t along the direction g·1_j. [sent-181, score-0.311]
78 The direction of greatest risk decrease is the weak learner g_j*(x) = arg max_{g∈H} Σ_{i=1}^n g(x_i) w_i^j, (25) and the optimal step size along this direction is α_j* = arg min_{α∈R} R[f^t(x) + α g_j*(x) 1_j]. (26) [sent-182, score-0.85]
79 The classifier is thus updated as f^{t+1}(x) = f^t(x) + α_j* g_j*(x) 1_j = [f_1^t, . . . , f_j^t + α_j* g_j*, . . . , f_{M−1}^t]. (27) [sent-183, score-0.101]
80 The resulting algorithm, CD-MCBoost, starts with f(x) = 0 ∈ R^d and updates the predictor components sequentially. [sent-190, score-0.401]
81 Note that, since (13) is a convex function of f (x), it converges to the global minimum of the risk. [sent-191, score-0.053]
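A sketch of one CD-MCBoost coordinate update following (23)-(27). The array names are assumptions, the weight formula is the reconstructed (24), and a crude grid search stands in for the optimal step size of (26).

```python
import numpy as np

def cd_mcboost_step(F, Yc, y_idx, j, weak_learners, X,
                    alphas=np.linspace(0.0, 2.0, 41)):
    """One coordinate-descent update of predictor component j, following (23)-(27).
    F: (n, M-1) values f^t(x_i); Yc: (M, M-1) codewords; y_idx: (n,) class indices;
    weak_learners: candidate scalar functions h: X -> R. A sketch, not the paper's code."""
    diffs = Yc[y_idx][:, None, :] - Yc[None, :, :]           # (n, M, M-1): y_i - y^k
    expo = np.exp(-0.5 * np.einsum('nd,nkd->nk', F, diffs))  # exp(-0.5 <f^t(x_i), y_i - y^k>)
    w = 0.5 * np.sum(diffs[:, :, j] * expo, axis=1)          # weights w_i^j of (24)

    g_best = max(weak_learners, key=lambda h: np.sum(h(X) * w))   # steepest-descent learner, (25)
    gx = g_best(X)

    def risk(alpha):                                         # empirical risk (13) after the step
        Fa = F.copy()
        Fa[:, j] += alpha * gx
        return float(np.sum(np.exp(-0.5 * np.einsum('nd,nkd->nk', Fa, diffs))))

    alpha_best = min(alphas, key=risk)                       # grid stand-in for (26)
    F_new = F.copy()
    F_new[:, j] += alpha_best * gx                           # update (27)
    return F_new, g_best, alpha_best
```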
82 3.2 Gradient descent. Alternatively, (13) can be minimized by learning a linear combination of multiclass weak learners. [sent-193, score-0.623]
83 In this case, the optimization problem is min_{f(x)} R[f(x)] (28) s.t. f(x) ∈ span(H), [sent-194, score-0.055]
84 where H = {h_1(x), . . . , h_p(x)} is a set of multiclass weak learners, h_i(x): X → R^{M−1}, such as decision trees. [sent-198, score-0.056]
85 Note that to fit tree classifiers in this definition their output (usually a class number) should be translated into a class codeword. [sent-199, score-0.112]
86 As before, let f t (x) ∈ RM −1 be the predictor available after t boosting iterations. [sent-200, score-0.677]
87 At iteration t + 1 a step is given along the direction g(x) ∈ H of largest decrease of the risk R[f (x)]. [sent-201, score-0.266]
88 For this, we consider the directional functional derivative of R[f (x)] along the direction of the functional g : X → RM −1 , at point f (x) = f t (x). [sent-202, score-0.422]
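The joint update of Section 3.2 can be sketched analogously: differentiating (13)/(15) along a multiclass direction g gives vector-valued weights (the analogue of (24)), and the selected weak learner is the one most aligned with them. The selection rule and grid line search below are assumptions consistent with that derivation, not necessarily the paper's exact equations.

```python
import numpy as np

def gd_mcboost_step(F, Yc, y_idx, weak_learners, X,
                    alphas=np.linspace(0.0, 2.0, 41)):
    """One gradient-descent update of all predictor components (Section 3.2 sketch).
    Weak learners are multiclass, h: X -> R^{M-1}; F, Yc, y_idx as in the CD sketch."""
    diffs = Yc[y_idx][:, None, :] - Yc[None, :, :]            # (n, M, M-1): y_i - y^k
    expo = np.exp(-0.5 * np.einsum('nd,nkd->nk', F, diffs))   # (n, M)
    W = 0.5 * np.einsum('nk,nkd->nd', expo, diffs)            # (n, M-1) vector weights

    # multiclass weak learner most aligned with the negative functional gradient
    g_best = max(weak_learners, key=lambda h: np.sum(h(X) * W))
    Gx = g_best(X)                                            # (n, M-1)

    def risk(alpha):
        e = np.exp(-0.5 * np.einsum('nd,nkd->nk', F + alpha * Gx, diffs))
        return float(np.sum(e))

    alpha_best = min(alphas, key=risk)
    return F + alpha_best * Gx, g_best, alpha_best
```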
wordName wordTfidf (topN-words)
[('codewords', 0.405), ('boosting', 0.374), ('multiclass', 0.327), ('predictor', 0.303), ('learners', 0.214), ('margin', 0.186), ('yf', 0.175), ('weak', 0.161), ('risk', 0.159), ('functional', 0.139), ('lm', 0.123), ('fj', 0.117), ('adaboost', 0.11), ('classi', 0.109), ('py', 0.107), ('fm', 0.104), ('descent', 0.102), ('gj', 0.101), ('span', 0.084), ('er', 0.083), ('arg', 0.081), ('rm', 0.079), ('bayes', 0.079), ('appendix', 0.078), ('wi', 0.078), ('saberian', 0.076), ('nuno', 0.076), ('hi', 0.075), ('rd', 0.074), ('binary', 0.074), ('yi', 0.074), ('regular', 0.068), ('updates', 0.067), ('multidimensional', 0.064), ('enforcing', 0.061), ('simplex', 0.06), ('maxk', 0.057), ('coordinate', 0.056), ('minf', 0.055), ('direction', 0.055), ('diego', 0.053), ('optimal', 0.048), ('gradient', 0.045), ('loss', 0.045), ('convergent', 0.044), ('implements', 0.043), ('class', 0.043), ('component', 0.042), ('laboratory', 0.04), ('labels', 0.039), ('vertices', 0.039), ('xi', 0.038), ('minimization', 0.038), ('ary', 0.038), ('extensible', 0.038), ('vasconcelos', 0.038), ('exits', 0.038), ('extension', 0.037), ('predictors', 0.037), ('san', 0.036), ('codeword', 0.035), ('reliance', 0.035), ('minimized', 0.033), ('stumps', 0.033), ('derivative', 0.032), ('projections', 0.032), ('components', 0.031), ('mohammad', 0.031), ('miny', 0.031), ('hp', 0.031), ('greatest', 0.03), ('consistent', 0.029), ('california', 0.029), ('th', 0.029), ('directional', 0.029), ('reviewing', 0.029), ('margins', 0.029), ('rule', 0.029), ('increased', 0.028), ('dimensional', 0.028), ('differ', 0.028), ('minimum', 0.028), ('along', 0.028), ('dissimilar', 0.028), ('decision', 0.028), ('ideas', 0.027), ('max', 0.027), ('minimizes', 0.027), ('avoided', 0.027), ('implement', 0.026), ('translated', 0.026), ('design', 0.026), ('culty', 0.026), ('cation', 0.025), ('formulation', 0.025), ('convex', 0.025), ('mini', 0.025), ('hull', 0.025), ('outperform', 0.024), ('decrease', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
2 0.25935742 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
3 0.23142651 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
4 0.19417436 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
5 0.18730059 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
6 0.1737636 49 nips-2011-Boosting with Maximum Adaptive Sampling
7 0.1565105 59 nips-2011-Composite Multiclass Losses
8 0.1241069 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
9 0.11714942 153 nips-2011-Learning large-margin halfspaces with more malicious noise
10 0.11520724 162 nips-2011-Lower Bounds for Passive and Active Learning
11 0.097550906 19 nips-2011-Active Classification based on Value of Classifier
12 0.093654148 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
13 0.090351306 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
14 0.088252418 28 nips-2011-Agnostic Selective Classification
15 0.082939893 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
16 0.077247001 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
17 0.075948425 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
18 0.075833254 271 nips-2011-Statistical Tests for Optimization Efficiency
19 0.074004374 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.065020651 180 nips-2011-Multiple Instance Filtering
topicId topicWeight
[(0, 0.188), (1, -0.017), (2, -0.106), (3, -0.092), (4, 0.009), (5, 0.162), (6, 0.067), (7, -0.1), (8, -0.385), (9, 0.042), (10, -0.094), (11, 0.041), (12, 0.015), (13, -0.032), (14, -0.014), (15, -0.044), (16, 0.007), (17, -0.116), (18, -0.125), (19, -0.197), (20, 0.069), (21, -0.091), (22, 0.011), (23, 0.094), (24, 0.06), (25, -0.136), (26, -0.016), (27, -0.079), (28, 0.009), (29, 0.074), (30, 0.031), (31, -0.004), (32, 0.032), (33, -0.06), (34, -0.03), (35, 0.011), (36, -0.048), (37, -0.176), (38, -0.068), (39, 0.06), (40, -0.012), (41, -0.022), (42, -0.045), (43, -0.018), (44, 0.026), (45, 0.041), (46, 0.028), (47, 0.083), (48, -0.056), (49, -0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.96292931 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
2 0.77817446 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
3 0.76435173 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
4 0.65558201 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
5 0.58853686 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
6 0.57231766 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
7 0.56287926 49 nips-2011-Boosting with Maximum Adaptive Sampling
8 0.54702497 153 nips-2011-Learning large-margin halfspaces with more malicious noise
9 0.50094348 19 nips-2011-Active Classification based on Value of Classifier
10 0.45410135 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
11 0.45085198 28 nips-2011-Agnostic Selective Classification
12 0.41180319 33 nips-2011-An Exact Algorithm for F-Measure Maximization
13 0.37150195 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
14 0.36377791 143 nips-2011-Learning Anchor Planes for Classification
15 0.35519543 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
16 0.33939654 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
17 0.33598116 111 nips-2011-Hashing Algorithms for Large-Scale Learning
18 0.33412972 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
19 0.33154023 277 nips-2011-Submodular Multi-Label Learning
20 0.33039793 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
topicId topicWeight
[(0, 0.036), (4, 0.035), (20, 0.011), (26, 0.045), (31, 0.06), (33, 0.016), (43, 0.178), (45, 0.133), (57, 0.012), (65, 0.036), (66, 0.18), (74, 0.06), (83, 0.068), (99, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.851641 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
2 0.83340865 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
Author: Angela Yao, Juergen Gall, Luc V. Gool, Raquel Urtasun
Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from “simple data”, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. 1
3 0.82329261 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
Author: Andre S. Barreto, Doina Precup, Joelle Pineau
Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1
4 0.81395078 301 nips-2011-Variational Gaussian Process Dynamical Systems
Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou
Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1
5 0.8115806 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
6 0.78124768 264 nips-2011-Sparse Recovery with Brownian Sensing
7 0.77966338 288 nips-2011-Thinning Measurement Models and Questionnaire Design
8 0.77857226 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
9 0.77798283 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis
10 0.77040726 282 nips-2011-The Fast Convergence of Boosting
11 0.76791257 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
12 0.76790613 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
13 0.76691711 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
14 0.75987023 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
15 0.75928551 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
16 0.75835341 258 nips-2011-Sparse Bayesian Multi-Task Learning
17 0.75677633 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
18 0.75262713 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
19 0.75098079 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
20 0.74732274 265 nips-2011-Sparse recovery by thresholded non-negative least squares