nips nips2011 nips2011-252 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua
Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. [sent-2, score-0.348]
2 This implies that features should be shared by several classes. [sent-3, score-0.108]
3 We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. [sent-4, score-0.31]
4 We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. [sent-5, score-0.323]
5 We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. [sent-6, score-0.098]
6 1 Introduction Learning to classify an object into a relevant target class surfaces in many domains such as document categorization, object recognition in computer vision, and web advertisement. [sent-8, score-0.082]
7 In multiclass learning problems we use training examples to learn a classifier which will later be used for accurately classifying new objects. [sent-9, score-0.209]
8 Typically, the classifier first calculates several features from the input object and then classifies the object based on those features. [sent-10, score-0.131]
9 We start with predictors that are based on linear combinations of features. [sent-13, score-0.079]
10 Later, in Section 3, we show how our framework enables learning highly non-linear predictors by embedding non-linearity in the construction of the features. [sent-14, score-0.111]
11 While, in general, finding the most accurate sparse predictor is known to be NP-hard, two main approaches have been proposed for overcoming the hardness result. [sent-17, score-0.16]
12 The second approach relies on forward greedy selection of features (e. [sent-21, score-0.19]
13 A popular model for multiclass predictors maintains a weight vector for each one of the classes. [sent-24, score-0.256]
14 Since the number of classes can be rather large, and our goal is to learn a model with an overall small number of features, we would like the weight vectors to share the features with non-zero weights as much as possible. [sent-26, score-0.113]
15 Organizing the weight vectors of all classes as rows of a single matrix, this is equivalent to requiring sparsity of the columns of the matrix. [sent-27, score-0.113]
16 In this paper we describe and analyze an efficient algorithm for learning a multiclass predictor whose corresponding matrix of weights has a small number of non-zero columns. [sent-30, score-0.29]
17 We apply our algorithm to natural multiclass learning problems and demonstrate its advantages over previously proposed state-of-the-art methods. [sent-32, score-0.177]
18 Our algorithm is a generalization of the forward greedy selection approach to sparsity in columns. [sent-33, score-0.158]
19 We discuss some generic methods for constructing features in Section 3. [sent-44, score-0.09]
20 Our goal is to learn a multiclass predictor, which is a mapping from the features of an object into Y. [sent-50, score-0.279]
21 We focus on the set of predictors parametrized by matrices W ∈ R^{k×d} that take the following form: h_W(x) = argmax_{y∈Y} (Wx)_y. [sent-51, score-0.079]
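A minimal numpy sketch of this linear multiclass predictor (the function name and shapes are illustrative, not taken from the paper's code):

```python
import numpy as np

def predict(W, x):
    """Multiclass linear predictor h_W(x) = argmax_y (W x)_y.

    W : (k, d) weight matrix, one row of weights per class.
    x : (d,) feature vector of the object.
    """
    scores = W @ x                # (k,) vector of per-class scores
    return int(np.argmax(scores))
```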
22 More generally, given a matrix W and a pair of norms ‖·‖_p, ‖·‖_r, we denote ‖W‖_{p,r} = ‖(‖W_{·,1}‖_p, …, ‖W_{·,d}‖_p)‖_r. [sent-56, score-0.117]
23 The 0–1 loss of a multiclass predictor h_W on an example (x, y) is defined as 1[h_W(x) ≠ y]. [sent-60, score-0.309]
24 Since this loss function is not convex with respect to W, we use a surrogate convex loss function based on the following easy-to-verify inequalities: 1[h_W(x) ≠ y] ≤ 1[h_W(x) ≠ y] − (Wx)_y + (Wx)_{h_W(x)} ≤ max_{y'∈Y} ( 1[y' ≠ y] − (Wx)_y + (Wx)_{y'} ) ≤ ln Σ_{y'∈Y} e^{1[y'≠y] − (Wx)_y + (Wx)_{y'}}. [sent-62, score-0.131]
25 This logistic multiclass loss function ℓ(W, (x, y)) has several nice properties — see for example [39]. [sent-68, score-0.211]
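A sketch of this surrogate, read off the last bound above; the log-sum-exp stabilization is an implementation choice, not something stated in the paper:

```python
import numpy as np

def multiclass_logistic_loss(W, x, y):
    """Surrogate loss ln sum_{y'} exp(1[y' != y] - (Wx)_y + (Wx)_{y'}),
    which upper-bounds the 0-1 loss of h_W on (x, y)."""
    scores = W @ x                      # (k,)
    margins = np.ones_like(scores)      # 1[y' != y] for every candidate label y'
    margins[y] = 0.0
    z = margins - scores[y] + scores    # one exponent per candidate label
    zmax = z.max()                      # stabilize the log-sum-exp
    return zmax + np.log(np.exp(z - zmax).sum())
```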
26 The reason we need the loss function to be both convex and smooth is as follows. [sent-70, score-0.089]
27 the first order approximation of the loss) to make a greedy choice of which column to update. [sent-75, score-0.102]
28 To ensure that this greedy choice indeed yields a significant improvement we must know that the first order approximation is indeed close to the actual loss function, and for that we need both lower and upper bounds on the quality of the first order approximation. [sent-76, score-0.108]
29 , (x_m, y_m), the average training loss of a matrix W is: L(W) = (1/m) Σ_{(x,y)∈S} ℓ(W, (x, y)). [sent-80, score-0.095]
30 min_{W ∈ R^{k×d}} L(W) s.t. W has at most s non-zero columns. (4) That is, find the matrix W with minimal training loss among all matrices with column sparsity of at most s, where s is a user-defined parameter. [sent-84, score-0.147]
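The two ingredients of this constrained objective, as a short sketch that reuses the loss helper above (S is assumed to be a list of (x, y) pairs):

```python
import numpy as np

def average_loss(W, S):
    """L(W): average surrogate loss over the training set S."""
    return sum(multiclass_logistic_loss(W, x, y) for x, y in S) / len(S)

def column_sparsity(W):
    """Number of columns of W that contain at least one non-zero entry."""
    return int(np.count_nonzero(np.any(W != 0, axis=0)))
```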
31 In Section 4 we show that for sparse models, a small training error is likely to yield a small error on unseen examples as well. [sent-86, score-0.103]
32 To overcome the hardness result, the ShareBoost algorithm will follow the forward greedy selection approach. [sent-90, score-0.124]
33 The algorithm comes with formal generalization and sparsity guarantees (described in Section 4) that make ShareBoost an attractive multiclass learning engine, due to its efficiency (both during training and at test time) and accuracy. [sent-91, score-0.295]
34 2 Related Work The centrality of the multiclass learning problem has spurred the development of various approaches for tackling the task. [sent-93, score-0.193]
35 Perhaps the most straightforward approach is a reduction from multiclass to binary, e. [sent-94, score-0.177]
36 The more direct approach we choose, in particular, the multiclass predictors of the form given in eqn. [sent-97, score-0.256]
37 This construction is common in generalized additive models [17], multiclass versions of boosting [16, 28], and has been popularized lately due to its role in prediction with structured output where the number of classes is exponentially large (see e. [sent-100, score-0.286]
38 While this approach can yield predictors with a rather mild dependency of the required features on k (see for example the analysis in [39, 31, 14]), it relies on a-priori assumptions on the structure of X and Y. [sent-103, score-0.227]
39 The class of predictors of the form given in eqn. [sent-105, score-0.079]
40 (1) can be trained using Frobenius norm regularization (as done by multiclass SVM – see e. [sent-106, score-0.234]
41 However, as pointed out in [26], these regularizers might yield a matrix with many non-zero columns, and hence will lead to a predictor that uses many features. [sent-109, score-0.133]
42 For example, in Section C we show cases where mixed-norm regularization does not yield sparse solutions while ShareBoost does yield a sparse solution. [sent-115, score-0.116]
43 For example, when using decision stumps, features will be highly correlated which will violate Assumption 4. [sent-128, score-0.098]
44 For example, when the features are constructed using decision stumps, d will be extremely large, but ShareBoost can still be implemented efficiently. [sent-134, score-0.098]
45 As mentioned before, ShareBoost follows the forward greedy selection approach for tackling the hardness of solving eqn. [sent-136, score-0.14]
46 The greedy approach has been widely studied in the context of learning sparse predictors for linear regression. [sent-138, score-0.162]
47 However, in multiclass problems, one needs sparsity of groups of variables (columns of W ). [sent-139, score-0.231]
48 ShareBoost generalizes the fully corrective greedy selection procedure given in [29] to the case of selection of groups of variables, and our analysis follows similar techniques. [sent-140, score-0.128]
49 Obtaining group sparsity by greedy methods has also been studied recently in [20, 23], and indeed, ShareBoost shares similarities with these works. [sent-141, score-0.126]
50 For the multiclass problem with log loss, the criterion of ShareBoost is much easier to compute, especially in large scale problems. [sent-148, score-0.177]
51 [23] suggested many other selection rules that are geared toward the squared loss, which is far from being an optimal loss function for multiclass problems. [sent-149, score-0.227]
52 While the original presentation in [34] seems rather different than the type of predictors we describe in eqn. [sent-151, score-0.079]
53 In particular, the features x are assumed to be decision stumps and each column W_{·,i} is constrained to be α_i · (1[1 ∈ C_i], …, 1[k ∈ C_i]). [sent-153, score-0.186]
54 That is, the stump is shared by all classes in the subset C_i. [sent-157, score-0.107]
55 JointBoost chooses such shared decision stumps in a greedy manner by applying the GentleBoost algorithm on top of this presentation. [sent-158, score-0.18]
56 This leads to higher accuracy while using fewer features, as shown in our experiments on image classification. [sent-165, score-0.088]
57 2 The ShareBoost Algorithm ShareBoost is a forward greedy selection approach for solving eqn. [sent-168, score-0.101]
58 Usually, in a greedy approach, we update the weight of one feature at a time. [sent-170, score-0.086]
59 We will choose the column that maximizes the ℓ1 norm of the corresponding column of the gradient of the loss at W. [sent-172, score-0.15]
60 (Eqns. (6) and (7) give the gradient entries ∂L(W)/∂W_{q,r}.) Finally, after choosing the column for which ‖∇_r L(W)‖_1 is maximized, we re-optimize all the columns of W which were selected so far. [sent-182, score-0.105]
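One iteration of this forward-greedy scheme might look as follows; the gradient formula follows from the chain rule on the log-sum-exp loss above, and plain gradient descent stands in for the accelerated method the paper uses for the re-optimization step (the step count and step size are arbitrary placeholders):

```python
import numpy as np

def loss_gradient(W, x, y):
    """Gradient of the surrogate loss at one example:
    d loss / d W_{q,r} = x_r * (rho_q - 1[q == y]),
    where rho is the softmax of the exponents used in the loss."""
    scores = W @ x
    margins = np.ones_like(scores)
    margins[y] = 0.0
    z = margins - scores[y] + scores
    rho = np.exp(z - z.max())
    rho /= rho.sum()
    rho[y] -= 1.0
    return np.outer(rho, x)            # (k, d)

def shareboost_step(W, S, selected, n_reopt=100, lr=0.1):
    """One ShareBoost iteration: pick the column whose gradient has the largest
    l1 norm, add it to `selected` (a set of column indices chosen so far), then
    re-optimize the selected columns, here with plain gradient descent as a stand-in."""
    G = sum(loss_gradient(W, x, y) for x, y in S) / len(S)
    r = int(np.argmax(np.abs(G).sum(axis=0)))        # argmax_r ||grad_r L(W)||_1
    selected = selected | {r}
    cols = sorted(selected)
    for _ in range(n_reopt):
        G = sum(loss_gradient(W, x, y) for x, y in S) / len(S)
        W[:, cols] -= lr * G[:, cols]                # update only selected columns
    return W, selected
```

Running T such steps from W = np.zeros((k, d)) and selected = set() leaves a weight matrix with at most T non-zero columns.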
61 In our experiments, we used Nesterov's accelerated gradient method [25], whose runtime is O(mtk/√ε) for a smooth objective, where ε is the desired accuracy. [sent-194, score-0.097]
62 Therefore, the overall runtime is O(Tmdk + T²mk/√ε). [sent-195, score-0.089]
63 It is interesting to compare this runtime to the complexity of minimizing the mixed-norm regularization objective given in eqn. [sent-196, score-0.112]
64 Since the objective is no longer smooth, the runtime of using Nesterov's accelerated method would be O(mdk/ε), which can be much larger than the runtime of ShareBoost when d ≫ T. [sent-198, score-0.148]
65 Modifying the Greedy Choice Rule ShareBoost chooses the feature r which maximizes the ℓ1 norm of the r-th column of the gradient matrix. [sent-202, score-0.098]
66 Of course, the runtime of choosing r will now be much larger. [sent-207, score-0.081]
67 Selecting a Group of Features at a Time In some situations, features can be divided into groups where the runtime of calculating a single feature in each group is almost the same as the runtime of calculating all features in the group. [sent-209, score-0.382]
68 In such cases, it makes sense to choose groups of features at each iteration of ShareBoost. [sent-210, score-0.105]
69 This can be easily done by simply choosing the group of features J that maximizes Σ_{j∈J} ‖∇_j L(W)‖_1. [sent-211, score-0.133]
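A sketch of this group-selection rule, assuming the gradient matrix G of L(W) (shape (k, d)) has been computed as above and `groups` is a list of column-index lists:

```python
import numpy as np

def pick_feature_group(G, groups):
    """Pick the group J maximizing sum_{j in J} ||grad_j L(W)||_1."""
    col_l1 = np.abs(G).sum(axis=0)                   # ||grad_j||_1 for each column j
    return max(groups, key=lambda J: col_l1[list(J)].sum())
```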
70 That is, we construct a non-linear predictor by first mapping the original features into a higher dimensional space and then learning a linear predictor in that space, which corresponds to a non-linear predictor over the original feature space. [sent-224, score-0.393]
71 The first is the decision stumps method which is widely used by Boosting algorithms. [sent-226, score-0.085]
72 The second approach shows how to use ShareBoost for learning piece-wise linear predictors and is inspired by the super-vectors construction recently described in [40]. [sent-227, score-0.096]
73 A decision stump is a binary feature of the form 1[v_i ≤ θ], for some feature i and threshold θ. [sent-230, score-0.109]
74 To construct a non-linear predictor we can map each object v into a feature-vector x that contains all possible decision stumps. [sent-234, score-0.152]
75 First note that for each i, all stump features corresponding to i can get at most m + 1 values on a training set of size m. [sent-237, score-0.137]
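A sketch of this construction; placing candidate thresholds at midpoints between consecutive training values is an implementation choice that realizes every non-constant stump pattern on the training set:

```python
import numpy as np

def stump_thresholds(V):
    """Candidate thresholds per coordinate of the training matrix V (m, p):
    midpoints between consecutive distinct values of that coordinate."""
    thresholds = []
    for i in range(V.shape[1]):
        vals = np.unique(V[:, i])                  # sorted distinct values
        thresholds.append((vals[:-1] + vals[1:]) / 2.0)
    return thresholds

def stump_features(V, thresholds):
    """Map every object to the binary vector of all stumps 1[v_i <= theta]."""
    blocks = [(V[:, [i]] <= th[None, :]).astype(float)
              for i, th in enumerate(thresholds) if th.size > 0]
    return np.hstack(blocks) if blocks else np.zeros((V.shape[0], 0))
```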
76 the number of pieces, (u_j, b_j) represents the linear function corresponding to piece j, and (v_j, r_j) represents the center and radius of piece j. [sent-266, score-0.09]
77 In the case of learning a multiclass predictor, we shall learn a predictor v ↦ … [sent-271, score-0.275]
78 to learn a piece-wise linear model with few pieces (that is, each group of features will correspond to one piece of the model). [sent-275, score-0.152]
79 Then, we train ShareBoost so as to choose a multiclass predictor that only uses a few pairs (v_j, r_j). [sent-277, score-0.314]
80 The advantage of using ShareBoost here is that while it learns a non-linear model, it will try to find a model with few linear “pieces”, which is advantageous both in terms of test runtime and generalization performance. [sent-278, score-0.113]
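One plausible encoding of these grouped features, under the assumption that each piece j contributes the object itself plus a bias term, gated by the ball with center v_j and radius r_j (the paper follows the super-vector construction of [40]; the exact gating used there may differ):

```python
import numpy as np

def piecewise_linear_features(v, centers, radii):
    """Grouped features: piece j contributes 1[||v - v_j|| <= r_j] * [v, 1],
    so a linear predictor over the blocks realizes a local linear function
    (u_j, b_j) on each active piece; ShareBoost then selects few pieces."""
    blocks = []
    for v_j, r_j in zip(centers, radii):
        gate = 1.0 if np.linalg.norm(v - v_j) <= r_j else 0.0
        blocks.append(gate * np.append(v, 1.0))
    return np.concatenate(blocks)
```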
81 We first show that if the algorithm has managed to find a matrix W with a small number of non-zero columns and a small training error, then the generalization error of W is also small. [sent-281, score-0.115]
82 We perform experiments with an OCR data set and demonstrate a mild growth of the number of features as the number of classes grows from 2 to 36. [sent-299, score-0.137]
83 The second experiment shows that ShareBoost can construct predictors with state-of-the-art accuracy while only requiring few features, which amounts to fast prediction runtime. [sent-300, score-0.108]
84 The main finding is that ShareBoost outperforms the mixed-norm regularization method when the output predictor needs to be very sparse, while mixed-norm regularization can be better in the regime of rather dense predictors. [sent-304, score-0.158]
85 Feature Sharing The main motivation for deriving the ShareBoost algorithm is the need for a multiclass predictor that uses only few features, and in particular, the number of features should increase slowly with the number of classes. [sent-306, score-0.348]
86 We trained ShareBoost with the number of classes varying from 2 to 36, corresponding to the 10 digits and 26 capital letters. [sent-308, score-0.14]
87 We calculated how many features were required to achieve a certain fixed accuracy as a function of the number of classes. [sent-309, score-0.088]
88 Namely, we minimize the binary logistic loss using a greedy algorithm. [sent-312, score-0.094]
89 Both methods aim at constructing sparse predictors using the same greedy approach. [sent-313, score-0.179]
90 The difference between the methods is that ShareBoost selects features in a shared manner while the 1-vs-rest approach selects features for each binary problem separately. [sent-314, score-0.181]
91 In Fig. 2 we plot the overall number of features required by both methods to achieve a fixed accuracy on the test set as a function of the number of classes. [sent-316, score-0.102]
92 As can be easily seen, the increase in the number of required features is mild for ShareBoost but significant for the 1-vs-rest approach. [sent-317, score-0.097]
93 Figure 2: The number of features required to achieve a fixed accuracy as a function of the number of classes for ShareBoost (dashed) and the 1-vs-rest (solid circles). [sent-318, score-0.168]
94 Constructing fast and accurate predictors The goal of this experiment is to show that ShareBoost achieves state-of-the-art performance while constructing very fast predictors. [sent-320, score-0.112]
95 We experimented with the MNIST digit dataset, which consists of a training set of 60,000 digits represented by centered size-normalized 28 × 28 images, and a test set of 10,000 digits. [sent-321, score-0.095]
96 The MNIST dataset has been extensively studied and is considered a standard test for multiclass classification of handwritten digits. [sent-322, score-0.206]
97 47% with the expanded training set of 360,000 examples generated by adding five deformed instances per original example and with T = 305 rounds. [sent-337, score-0.076]
98 As can be seen, less than 75 features suffice to obtain an error rate of < 1%. (Figure 3 plots the test error rate.) [sent-342, score-0.115]
99 Trading accuracy for sparsity in optimization problems with sparsity constraints. [sent-548, score-0.091]
100 Sharing visual features for multiclass and multiview object detection. [sent-581, score-0.279]
wordName wordTfidf (topN-words)
[('shareboost', 0.903), ('multiclass', 0.177), ('kw', 0.152), ('hw', 0.118), ('predictor', 0.098), ('predictors', 0.079), ('features', 0.073), ('jointboost', 0.066), ('runtime', 0.066), ('stumps', 0.06), ('greedy', 0.06), ('mac', 0.047), ('classes', 0.04), ('sharing', 0.039), ('boosting', 0.038), ('sparsity', 0.038), ('mnist', 0.037), ('shared', 0.035), ('columns', 0.035), ('loss', 0.034), ('jerusalem', 0.032), ('kv', 0.032), ('stump', 0.032), ('training', 0.032), ('smooth', 0.031), ('regularization', 0.03), ('object', 0.029), ('kp', 0.029), ('column', 0.028), ('group', 0.028), ('classi', 0.027), ('norm', 0.027), ('norms', 0.027), ('krr', 0.027), ('orcam', 0.027), ('piece', 0.026), ('feature', 0.026), ('forward', 0.025), ('deferred', 0.025), ('decision', 0.025), ('pieces', 0.025), ('nesterov', 0.025), ('convex', 0.024), ('rounds', 0.024), ('mild', 0.024), ('xr', 0.024), ('document', 0.024), ('mdk', 0.023), ('deformed', 0.023), ('fink', 0.023), ('sparse', 0.023), ('hardness', 0.023), ('rj', 0.023), ('israel', 0.021), ('expanded', 0.021), ('yield', 0.02), ('corrective', 0.02), ('digits', 0.02), ('generalization', 0.019), ('patch', 0.018), ('ci', 0.018), ('vj', 0.018), ('quattoni', 0.018), ('compressed', 0.017), ('maximizes', 0.017), ('kr', 0.017), ('calculating', 0.017), ('er', 0.017), ('constructing', 0.017), ('construction', 0.017), ('million', 0.017), ('tackling', 0.016), ('objective', 0.016), ('selection', 0.016), ('collins', 0.016), ('groups', 0.016), ('accurate', 0.016), ('relies', 0.016), ('choose', 0.016), ('experimented', 0.015), ('bj', 0.015), ('rf', 0.015), ('embedding', 0.015), ('matrix', 0.015), ('operations', 0.015), ('decrease', 0.015), ('choosing', 0.015), ('guarantees', 0.015), ('accuracy', 0.015), ('handwritten', 0.015), ('np', 0.015), ('verify', 0.015), ('assumptions', 0.015), ('test', 0.014), ('digit', 0.014), ('advantageous', 0.014), ('approximation', 0.014), ('error', 0.014), ('prediction', 0.014), ('ym', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua
Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1
2 0.1241069 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
3 0.07142289 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
4 0.068604968 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
5 0.058132324 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
6 0.057560965 261 nips-2011-Sparse Filtering
7 0.056394268 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
8 0.05068865 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
9 0.050143145 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.049499635 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
11 0.049274176 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
12 0.048807263 49 nips-2011-Boosting with Maximum Adaptive Sampling
13 0.048732888 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
14 0.04814364 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
15 0.045640267 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
16 0.044811897 4 nips-2011-A Convergence Analysis of Log-Linear Training
17 0.043632995 260 nips-2011-Sparse Features for PCA-Like Linear Regression
18 0.042863108 244 nips-2011-Selecting Receptive Fields in Deep Networks
19 0.042198975 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.041780598 53 nips-2011-Co-Training for Domain Adaptation
topicId topicWeight
[(0, 0.124), (1, 0.018), (2, -0.064), (3, -0.017), (4, 0.005), (5, 0.06), (6, 0.049), (7, 0.012), (8, -0.119), (9, 0.02), (10, -0.044), (11, 0.002), (12, -0.01), (13, 0.022), (14, -0.042), (15, 0.014), (16, -0.012), (17, -0.009), (18, 0.019), (19, -0.067), (20, -0.007), (21, 0.015), (22, -0.016), (23, 0.033), (24, 0.007), (25, -0.028), (26, -0.017), (27, -0.045), (28, -0.026), (29, 0.013), (30, 0.082), (31, 0.035), (32, 0.005), (33, 0.057), (34, 0.049), (35, -0.023), (36, -0.045), (37, -0.127), (38, 0.023), (39, -0.016), (40, 0.064), (41, 0.016), (42, -0.089), (43, -0.019), (44, 0.032), (45, 0.079), (46, 0.024), (47, 0.026), (48, -0.029), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.8737852 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua
Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1
2 0.71529609 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
3 0.665645 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
4 0.65006894 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
5 0.56619751 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
6 0.54348463 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
7 0.54187506 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
8 0.52116358 4 nips-2011-A Convergence Analysis of Log-Linear Training
9 0.51597053 33 nips-2011-An Exact Algorithm for F-Measure Maximization
10 0.51325184 282 nips-2011-The Fast Convergence of Boosting
11 0.50181162 28 nips-2011-Agnostic Selective Classification
12 0.49043989 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
13 0.47600824 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
14 0.46851161 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
15 0.46810919 261 nips-2011-Sparse Filtering
16 0.45201546 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
17 0.44545329 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
18 0.43558022 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
19 0.43541938 275 nips-2011-Structured Learning for Cell Tracking
20 0.43525392 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
topicId topicWeight
[(0, 0.038), (4, 0.046), (20, 0.067), (26, 0.054), (28, 0.185), (31, 0.049), (33, 0.027), (43, 0.078), (45, 0.154), (57, 0.041), (65, 0.014), (74, 0.046), (83, 0.028), (99, 0.058)]
simIndex simValue paperId paperTitle
1 0.89078033 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
Author: Liang Xiong, Barnabás Póczos, Jeff G. Schneider
Abstract: An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies. We evaluate the effectiveness of FGM on both synthetic and real data sets including images and turbulence data, and show that it is superior to existing approaches in detecting group anomalies. 1
2 0.86495131 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o( nk) (where n is the number of data points; note that computing the cost, given a solution, takes 8(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
3 0.85895252 226 nips-2011-Projection onto A Nonnegative Max-Heap
Author: Jun Liu, Liang Sun, Jieping Ye
Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative maxheap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal roottree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p2 ) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1
4 0.83852029 156 nips-2011-Learning to Learn with Compound HD Models
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
same-paper 5 0.82073963 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua
Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1
6 0.75022161 22 nips-2011-Active Ranking using Pairwise Comparisons
7 0.74761862 263 nips-2011-Sparse Manifold Clustering and Embedding
8 0.74672925 29 nips-2011-Algorithms and hardness results for parallel large margin learning
9 0.7466892 80 nips-2011-Efficient Online Learning via Randomized Rounding
10 0.74492532 109 nips-2011-Greedy Model Averaging
11 0.74249303 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
12 0.74135011 202 nips-2011-On the Universality of Online Mirror Descent
13 0.74052805 251 nips-2011-Shaping Level Sets with Submodular Functions
14 0.74038702 186 nips-2011-Noise Thresholds for Spectral Clustering
15 0.73838985 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
16 0.73757041 265 nips-2011-Sparse recovery by thresholded non-negative least squares
17 0.73746365 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
18 0.73733312 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
19 0.73470265 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
20 0.73460263 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning