nips nips2011 nips2011-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shedding some light on this problem, i.e., given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Some applications, in particular in computer vision, may involve up to millions of training examples and features. [sent-6, score-0.15]
2 In such contexts, the training time may become prohibitive. [sent-7, score-0.087]
3 Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. [sent-8, score-0.119]
4 given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. [sent-12, score-0.157]
5 We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. [sent-13, score-0.353]
6 Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. [sent-14, score-0.211]
7 However, while textbook AdaBoost repeatedly selects each of them using all the training examples and all the features for a predetermined number of rounds, one is not obligated to do so and can instead choose only to look at a subset of examples and features. [sent-17, score-0.366]
8 For the sake of simplicity, we identify the space of weak learners and the feature space by considering all the thresholded versions of the latter. [sent-18, score-0.13]
9 More sophisticated combinations of features can be envisioned in our framework by expanding the feature space. [sent-19, score-0.148]
10 The computational cost of one iteration of Boosting is roughly proportional to the product of the number of candidate weak learners Q and the number of samples T considered, and the performance increases with both. [sent-20, score-0.271]
11 More samples allow a more accurate estimation of the weak-learners’ performance, and more candidate weak-learners increase the performance of the best one. [sent-21, score-0.098]
12 Therefore, one wants at the same time to look at a large number of candidate weak-learners, in order to find a good one, but also needs to look at a large number of training examples, to get an accurate estimate of the weak-learner performances. [sent-22, score-0.209]
13 As Boosting progresses, the candidate weak-learners tend to behave more and more similarly, as their performance degrades. [sent-23, score-0.063]
14 While a small number of samples is initially sufficient to characterize the good weak-learners, it becomes more and more difficult, and the optimal values for a fixed product QT move to larger T and smaller Q. [sent-24, score-0.08]
15 Our main analytical results are Equations (13) and (17) in § 3. [sent-26, score-0.053]
16 They give exact expressions of the expected edge of the selected weak-learner – that is, the immediate loss reduction it provides in the considered Boosting iteration – as a function of the number T of samples and the number Q of weak-learners used in the optimization process. [sent-27, score-0.497]
17 From this result we derive several algorithms described in § 4, and estimate their performance compared to standard and state-of-the-art baselines in § 5. [sent-28, score-0.108]
18 In the simplest version of the procedure, one has to estimate for each candidate weak-learner a score dubbed the “edge”, which requires looping through every training example. [sent-30, score-0.146]
19 Reducing this computational cost is crucial to cope with high-dimensional feature spaces or very large training sets. [sent-31, score-0.159]
20 This can be achieved through two main strategies: sampling the training examples, or the feature space, since there is a direct relation between features and weak-learners. [sent-32, score-0.323]
21 Sampling the training set was introduced historically to deal with weak-learners which cannot be trained with weighted samples. [sent-33, score-0.087]
22 This procedure consists of sampling examples from the training set according to their Boosting weights, and of approximating a weighted average over the full set by a non-weighted average over the sampled subset. [sent-34, score-0.266]
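To make the weighting-by-sampling idea concrete, here is a minimal NumPy sketch (ours, not from the paper): examples are drawn according to the current Boosting weights, and the weighted average over the full training set is approximated by an unweighted average over the drawn subset. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def weighted_average_exact(values, weights):
    # Exact weighted average over the full training set (weights sum to 1).
    return float(np.dot(weights, values))

def weighted_average_by_sampling(values, weights, T, rng):
    # Draw T example indices according to the Boosting weights, then
    # approximate the weighted average by an unweighted one over the sample.
    idx = rng.choice(len(values), size=T, replace=True, p=weights)
    return float(values[idx].mean())

rng = np.random.default_rng(0)
values = rng.normal(size=100_000)        # e.g. y_n * h(x_n) for one weak-learner
weights = rng.random(100_000)
weights /= weights.sum()                 # Boosting weights, normalized to sum to 1

print(weighted_average_exact(values, weights))
print(weighted_average_by_sampling(values, weights, T=1_000, rng=rng))
```

The number of drawn examples T directly controls the variance of the approximation, which is the trade-off discussed in the next sentence.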
23 Such a procedure has been re-introduced recently for computational reasons [5, 8, 7], since the number of subsampled examples controls the trade-off between statistical accuracy and computational cost. [sent-37, score-0.117]
24 Sampling the feature space is the central idea behind LazyBoost [6], and consists simply of replacing the brute-force exhaustive search over the full feature set by an optimization over a subset produced by sampling uniformly a predefined number of features. [sent-38, score-0.183]
25 The natural redundancy of most of the families of features makes such a procedure particularly efficient. [sent-39, score-0.254]
26 Recently developed methods rely on multi-armed bandit methods to properly balance the exploitation of features known to be informative and the exploration of new features [3, 4]. [sent-40, score-0.428]
27 The idea behind those methods is to associate a bandit arm to every feature, and to see the loss reduction as a reward. [sent-41, score-0.262]
28 Maximizing the overall reduction is achieved with a standard bandit strategy such as UCB [1], or Exp3. [sent-42, score-0.189]
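As an illustration of this bandit view (arm = feature family, reward = observed edge), here is a generic UCB1 sketch; it is not the exact procedure of [3, 4], the class name is ours, and the rewards are assumed to be rescaled to a comparable range.

```python
import math
import numpy as np

class UCB1FamilySampler:
    """Generic UCB1 over K feature families; the reward fed back is the
    (rescaled) edge obtained by the weak-learner built from that family."""

    def __init__(self, n_families):
        self.counts = np.zeros(n_families)
        self.sums = np.zeros(n_families)
        self.t = 0

    def select(self):
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:          # play every arm once first
            return int(untried[0])
        means = self.sums / self.counts
        bonus = np.sqrt(2.0 * math.log(self.t) / self.counts)
        return int(np.argmax(means + bonus))

    def update(self, family, reward):
        self.counts[family] += 1
        self.sums[family] += reward

# Toy usage with hypothetical mean rewards per family.
rng = np.random.default_rng(0)
true_quality = [0.2, 0.5, 0.35]
sampler = UCB1FamilySampler(len(true_quality))
for _ in range(1_000):
    k = sampler.select()
    sampler.update(k, rng.normal(true_quality[k], 0.1))
print(sampler.counts)
```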
29 First they make the assumption that the quality of a feature – the expected loss reduction of a weak-learner using it – is stationary. [sent-46, score-0.131]
30 This goes against the underpinning of Boosting, which is that at any iteration the performance of the learners is relative to the sample weights, which evolves over the training (Exp3. [sent-47, score-0.214]
31 Second, without additional knowledge about the feature space, the only structure they can exploit is the stationarity of individual features. [sent-49, score-0.072]
32 Hence, improvement over random selection can only be achieved by sampling again the exact same features one has already seen in the past. [sent-50, score-0.199]
33 We therefore only use those methods in a context where features come from multiple families. [sent-51, score-0.135]
34 This allows us to model the quality, and to bias the sampling, at the level of families instead of individual features. [sent-52, score-0.143]
35 Those approaches exploit information about features to bias the sampling, hence making it more efficient, and reducing the number of weak-learners required to achieve the same loss reduction. [sent-53, score-0.16]
36 In particular, there is no notion of varying the number of samples used for the estimation of the weak-learners’ performance. [sent-55, score-0.06]
37 3 Boosting with noisy maximization We present in this section some analytical results to approximate a standard round of AdaBoost – or most other Boosting algorithms – by sampling both the training examples and the features used to build the weak-learners. [sent-56, score-0.435]
38 Our main goal is to devise a way of selecting the optimal numbers of weak-learners Q and samples T to look at, so that their product is upper-bounded by a given constant, and that the expectation of the real performance of the selected weak-learner is maximal. [sent-57, score-0.237]
39 In § 3.1 we recall standard notation for Boosting, the concept of the edge of a weak-learner, and how it can be approximated by a sampling of the training examples. [sent-59, score-0.432]
40 In § 3.2 we formalize the optimization of the learners and derive the expectation E[G∗] of the true edge G∗ of the selected weak-learner, and we illustrate these results in the Gaussian case in § 3.3. [sent-61, score-0.31]
41 3.1 Edge estimation with weighting-by-sampling. Given a training set (xn, yn) ∈ X × {−1, 1}, n = 1, . . . , N (1) [sent-73, score-0.131]
42 and a set H of weak-learners, the standard Boosting procedure consists of building a strong classifier f(x) = Σi αi hi(x) (2) by choosing the terms αi ∈ R and hi ∈ H in a greedy manner to minimize a loss estimated over the training samples. [sent-76, score-0.18]
43 At every iteration, choosing the optimal weak-learner boils down to finding the weak-learner with the largest edge, which is the derivative of the loss reduction w. [sent-77, score-0.115]
44 The higher this value, the more the loss can be reduced locally, and thus the better the weak-learner. [sent-81, score-0.049]
45 The edge is a linear function of the responses of the weak-learner over the samples: G(h) = Σ_{n=1}^{N} yn ωn h(xn), (3) where the ωn depend on the current responses of f over the xn. [sent-82, score-0.305]
46 We consider without loss of generality that they have been normalized such that Σ_{n=1}^{N} ωn = 1. [sent-83, score-0.049]
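The greedy selection rule just described – compute the edge of every candidate and keep the best – can be sketched as follows for decision stumps (our own simplification; the helper names are hypothetical), with the standard AdaBoost coefficient recovered from the edge, using the fact that with normalized weights the weighted error equals (1 − G)/2.

```python
import numpy as np

def best_stump(X, y, w, feature_ids):
    """Return the stump (feature, threshold, polarity) maximizing the edge
    G(h) = sum_n y_n * w_n * h(x_n) over the given candidate features."""
    best_edge, best_params = -np.inf, None
    for j in feature_ids:
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                h = polarity * np.where(X[:, j] > thr, 1, -1)
                edge = float(np.dot(y * w, h))
                if edge > best_edge:
                    best_edge, best_params = edge, (j, thr, polarity)
    return best_edge, best_params

def boosting_round(X, y, w, feature_ids):
    edge, stump = best_stump(X, y, w, feature_ids)
    # With weights summing to one, the weighted error is (1 - edge) / 2,
    # which gives the usual AdaBoost coefficient.
    err = (1.0 - edge) / 2.0
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    return stump, alpha
```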
47 The approximated edge fluctuates around the true value, with a binomial distribution. [sent-92, score-0.313]
48 The Boosting algorithm selects the weak-learner with the highest approximated edge, which has a real edge G∗ . [sent-93, score-0.288]
49 In this figure, the largest approximated edge is H1, hence the real edge G∗ of the selected weak-learner is equal to G1, which is less than G3. [sent-94, score-0.448]
50 which follows a binomial distribution centered on the true edge, with a variance decreasing with the number of samples T . [sent-95, score-0.116]
51 Let G1, . . . , GQ be a series of independent, real-valued random variables standing for the true edges of Q weak-learners sampled randomly. [sent-101, score-0.129]
52 Let ∆1, . . . , ∆Q be a series of independent, real-valued random variables standing for the noise in the estimation of the edges due to the sampling of only T training examples, and finally ∀q, let Hq = Gq + ∆q be the approximated edge. [sent-105, score-0.345]
53 We define G∗ as the true edge of the weak-learner which has the highest approximated edge: G∗ = G_{argmax_{1≤q≤Q} Hq} (8). This quantity is random due to both the sampling of the weak-learners and the sampling of the training examples. [sent-106, score-0.739]
54 The quantity we want to optimize is e(Q, T ) = E[G∗ ], the expectation of the true edge of the selected learner, which increases with both Q and T . [sent-107, score-0.248]
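The paper's closed-form expressions (Equations (13) and (17)) are not reproduced in this summary; the following Monte Carlo sketch only illustrates the quantity e(Q, T) under an assumed Gaussian model for the true edges and for the estimation noise, whose standard deviation is taken to shrink as 1/√T.

```python
import numpy as np

def expected_true_edge(Q, T, sigma_g=0.1, noise_scale=1.0, trials=20_000, seed=0):
    """Monte Carlo estimate of e(Q, T) = E[G*]: draw Q true edges, corrupt each
    with estimation noise of std noise_scale / sqrt(T), select the largest
    *noisy* edge and report the mean of the corresponding *true* edge."""
    rng = np.random.default_rng(seed)
    G = rng.normal(0.0, sigma_g, size=(trials, Q))
    H = G + rng.normal(0.0, noise_scale / np.sqrt(T), size=G.shape)
    picked = H.argmax(axis=1)
    return float(G[np.arange(trials), picked].mean())

# For a fixed cost Q * T, the best trade-off depends on the noise level.
for Q, T in [(1_000, 100), (100, 1_000), (10, 10_000)]:
    print(Q, T, round(expected_true_edge(Q, T), 4))
```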
55 In the case of multiple families of weak-learners, it makes sense to model the distributions of the edges Gq separately for each family, as they often have a more homogeneous behavior inside a family than across families. [sent-111, score-0.27]
56 We define e(Q1, . . . , QK, T), the expected edge of the selected weak-learner when we sample T examples from the training set, and Qk weak-learners from the kth family. [sent-115, score-0.341]
57 4 Adaptive Boosting Algorithms We propose here several new algorithms to sample features and training examples adaptively at each Boosting step. [sent-122, score-0.261]
58 While the formulation above deals with uniform sampling of weak-learners, we actually sample features and optimize thresholds to build stumps. [sent-123, score-0.154]
59 MAS 1.Q takes into account the decomposition of the feature set into K families of feature extractors. [sent-128, score-0.239]
60 It models the distributions of the Gq s separately, estimating the distribution of each on a small number of features and examples sampled at the beginning of each iteration, chosen so as to account for 10% of the total cost. [sent-129, score-0.224]
61 From these models, it optimizes Q, T and the index l of the family to sample from, to maximize e(Q·1{l=1}, . . . , Q·1{l=K}, T). [sent-130, score-0.087]
62 Hence, in a given Boosting step, it does not mix weak-learners based on features from different families. [sent-134, score-0.111]
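As a rough sketch of this adaptive step (ours; the paper uses its closed-form expressions rather than Monte Carlo, and the grid, noise model and per-family Gaussian fits below are assumptions), one can pick the family l and the split (Q, T) under a fixed cost budget as follows, with the per-family parameters fitted on the 10% pilot sample mentioned above.

```python
import numpy as np

def e_hat(mu, sigma, Q, T, noise_scale=1.0, trials=5_000, seed=0):
    # Estimated expected true edge when Q edges are drawn from N(mu, sigma^2)
    # and each is observed through noise of std noise_scale / sqrt(T).
    rng = np.random.default_rng(seed)
    G = rng.normal(mu, sigma, size=(trials, Q))
    H = G + rng.normal(0.0, noise_scale / np.sqrt(T), size=G.shape)
    return float(G[np.arange(trials), H.argmax(axis=1)].mean())

def choose_family_and_split(family_stats, budget, q_grid=(10, 30, 100, 300, 1_000)):
    """Pick the family l and the pair (Q, T) with Q * T <= budget maximizing
    the estimated expected edge; family_stats holds per-family (mu, sigma)
    fitted on a small pilot sample of features and examples."""
    best_score, best_choice = -np.inf, None
    for l, (mu, sigma) in enumerate(family_stats):
        for Q in q_grid:
            T = budget // Q
            if T < 1:
                continue
            score = e_hat(mu, sigma, Q, T)
            if score > best_score:
                best_score, best_choice = score, (l, Q, T)
    return best_choice

# Hypothetical per-family edge models fitted from a pilot sample.
print(choose_family_and_split([(0.02, 0.05), (0.05, 0.02), (0.01, 0.10)], budget=100_000))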
63 the number of sampled weak-learner Q and the number of samples T . [sent-151, score-0.088]
64 Right: same value as a function of Q alone, for different fixed costs (product of the number of examples T and Q). [sent-152, score-0.063]
65 More precisely: given a fixed Q0 and T0 , at every Boosting iteration, the Laminating first samples Q0 weak-learners and T0 training examples. [sent-158, score-0.168]
66 Then, it computes the approximated edges and keeps the Q0 /2 best weak-learners. [sent-159, score-0.158]
67 If more than one remains, it samples 2T0 examples, and re-iterates. [sent-160, score-0.06]
68 The cost of each iteration is constant, equal to Q0 T0 , and there are at most log2 (Q0 ) of them, leading to an overall cost of O(log2 (Q0 )Q0 T0 ). [sent-161, score-0.115]
69 In the experiments, we equalize the computational cost with the MAS approaches parametrized by T, Q by forcing log2 (Q0 )Q0 T0 = T Q. [sent-162, score-0.057]
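A minimal sketch of one Laminating iteration follows (our own; `candidate_stumps` is assumed to be a list of callables returning ±1 responses, and the weights w are the normalized Boosting weights used for weighting-by-sampling).

```python
import numpy as np

def laminating_round(X, y, w, candidate_stumps, Q0, T0, rng):
    """Start from Q0 randomly chosen candidate weak-learners and T0 examples;
    repeatedly keep the better half of the candidates (by approximated edge)
    while doubling the number of sampled examples, until one remains."""
    N = len(y)
    survivors = list(rng.choice(len(candidate_stumps), size=Q0, replace=False))
    T = T0
    while len(survivors) > 1:
        idx = rng.choice(N, size=T, replace=True, p=w)   # weighting-by-sampling
        edges = [float(np.mean(y[idx] * candidate_stumps[s](X[idx]))) for s in survivors]
        order = np.argsort(edges)[::-1]
        survivors = [survivors[i] for i in order[: max(1, len(survivors) // 2)]]
        T *= 2
    return survivors[0]
```

Each pass costs roughly (number of survivors) × T ≈ Q0 T0, and there are at most log2(Q0) passes, matching the cost analysis above.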
70 5 Experiments We demonstrate the validity of our approach for pattern recognition on two standard data-sets, using multiple types of image features. [sent-163, score-0.047]
71 We compare our algorithms both to different flavors of uniform sampling and to state-of-the-art bandit based methods, all tuned to deal properly with multiple families of features. [sent-164, score-0.447]
72 5.1 Datasets and features. For the first set of experiments, we use the well-known MNIST handwritten digits database [10], containing respectively 60,000/10,000 train/test grayscale images of size 28 × 28 pixels, divided into ten classes. [sent-166, score-0.221]
73 We use features computed by multiple image descriptors, leading to a total of 16,451 features. [sent-167, score-0.158]
74 Those descriptors can be broadly divided in two categories. [sent-168, score-0.06]
75 We dub it challenging, as state-of-the-art results without using additional training data barely reach 65% accuracy. [sent-173, score-0.109]
76 We use directly as features the same image descriptors as described above for MNIST, plus additional versions of some of them making use of color information. [sent-174, score-0.171]
77 5.2 Baselines. We first define three baselines extending LazyBoost in the context of multiple feature families. [sent-176, score-0.169]
78 The most naive strategy one could think of, that we call Uniform Naive, simply ignores the families, and picks features uniformly at random. [sent-177, score-0.234]
79 This strategy does not properly distribute the sampling among the families, thus if one of them had a far greater cardinality than the others, all features would come from it. [sent-178, score-0.254]
80 Uniform 1.Q picks one of the feature families at random and then samples the Q features from that single family, and Uniform Q.1 picks uniformly at random Q families of features and then picks one feature uniformly in each family. [sent-180, score-0.379] [sent-181, score-0.278]
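A sketch of these sampling baselines (function and variable names are ours) is given below; `families` is assumed to be a list of per-descriptor feature index arrays.

```python
import numpy as np

def uniform_naive(families, Q, rng):
    # Ignore the family structure: pool every feature and pick Q at random,
    # so a very large family would dominate the draw.
    pool = np.concatenate(families)
    return rng.choice(pool, size=Q, replace=False)

def uniform_1q(families, Q, rng):
    # Pick one family at random, then Q features within it.
    k = rng.integers(len(families))
    return rng.choice(families[k], size=Q, replace=False)

def uniform_q1(families, Q, rng):
    # Pick Q families uniformly at random (with replacement, an assumption
    # needed when Q exceeds the number of families), then one feature in each.
    picked = rng.choice(len(families), size=Q, replace=True)
    return np.array([rng.choice(families[k]) for k in picked])

rng = np.random.default_rng(0)
families = [np.arange(0, 500), np.arange(500, 16_451)]   # toy split of the 16,451 features
print(uniform_naive(families, 10, rng))
print(uniform_1q(families, 10, rng))
print(uniform_q1(families, 10, rng))
```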
82 The second family of baselines we have tested biases its sampling at every Boosting iteration according to the edges observed in previous iterations, and balances the exploitation of families of features known to perform well with the exploration of new families by using bandit algorithms [3, 4]. [sent-182, score-0.807]
83 P, ε-greedy), which differ only by the underlying bandit algorithm used. [sent-184, score-0.123]
84 We tune the meta-parameters of these techniques – namely the scale of the reward and the exploration-exploitation trade-off – by training them multiple times over a large range of parameters and keeping only the results of the run with the smallest final Boosting loss. [sent-185, score-0.111]
85 Table 2: Mean and standard deviation of the Boosting loss (log10) on the two data-sets and for each method, estimated on ten randomized runs. [sent-353, score-0.137]
86 Methods highlighted with a require the tuning of meta-parameters which have been optimized by training fully multiple times. [sent-354, score-0.141]
87 5.3 Results and analysis. We report the results of the proposed algorithms against the baselines introduced in § 5.2. [sent-356, score-0.108]
88 We ran each configuration ten times and report the mean and standard deviation of each. [sent-359, score-0.059]
89 We set the maximum cost of all the algorithms to 10N , setting Q = 10 and T = N for the baselines, as this configuration leads to the best results after 10,000 Boosting rounds of AdaBoost. [sent-360, score-0.058]
90 2, the meta-parameters of the bandit methods have been optimized by running the training fully ten times, with the corresponding computational effort. [sent-366, score-0.269]
91 Table 3: Mean and standard deviation of the test error (in percent) on the two data-sets and for each method, estimated on ten randomized runs. [sent-533, score-0.088]
92 Methods highlighted with a require the tuning of meta-parameters which have been optimized by training fully multiple times. [sent-534, score-0.141]
93 Still, our algorithms perform substantially better than the baselines for 10 and 100 weak-learners, gaining more than 10% in the test error rates, and behave similarly to the baselines for larger numbers of stumps. [sent-538, score-0.241]
94 As stated in § 1, the optimal values for a fixed product QT move to larger T and smaller Q. [sent-539, score-0.044]
95 For instance, on the MNIST data-set with MAS Naive, averaging over ten randomized runs, for respectively 10, 100, 1,000, and 10,000 stumps, T = 1,580, 13,030, 37,100, 43,600, and Q = 388, 73, 27, 19. [sent-540, score-0.088]
96 The poor behavior of the bandit methods for small numbers of stumps may be related to the large variations of the sample weights during the first iterations of Boosting, which go against the underlying assumption of stationarity of the loss reduction. [sent-547, score-0.325]
97 This allowed us to maximize the loss reduction under strict control of the computational cost. [sent-549, score-0.119]
98 The first one is to blur the boundary between the MAS procedures and the Laminating, by deriving an analytical model of the loss reduction for generalized sampling procedures: Instead of doubling the number of samples and halving the number of weak-learners, we could adapt both set sizes optimally. [sent-552, score-0.295]
99 This would account for the degrading density estimation when weak-learner families have not been sampled for a while, and induce an exploratory sampling which may be missing in the current algorithms. [sent-554, score-0.281]
100 Israel, Associate Professor Emeritus at the University of British Columbia for his help on the derivation of the expectation of the true edge of the weak-learner with the highest approximated edge (equations (9) to (13)). [sent-558, score-0.509]
wordName wordTfidf (topN-words)
[('gq', 0.466), ('hq', 0.402), ('boosting', 0.331), ('mas', 0.327), ('tq', 0.253), ('laminating', 0.176), ('edge', 0.164), ('families', 0.143), ('bandit', 0.123), ('features', 0.111), ('baselines', 0.108), ('mnist', 0.097), ('stumps', 0.095), ('approximated', 0.093), ('qk', 0.091), ('sampling', 0.088), ('training', 0.087), ('naive', 0.081), ('weaklearners', 0.075), ('hu', 0.068), ('adaboost', 0.064), ('examples', 0.063), ('learners', 0.062), ('samples', 0.06), ('ten', 0.059), ('ucb', 0.058), ('subsampled', 0.054), ('analytical', 0.053), ('fleuret', 0.05), ('idiap', 0.05), ('lazyboost', 0.05), ('loss', 0.049), ('iteration', 0.045), ('reduction', 0.045), ('qth', 0.044), ('yn', 0.044), ('look', 0.042), ('edges', 0.041), ('family', 0.039), ('candidate', 0.038), ('xn', 0.037), ('descriptors', 0.037), ('feature', 0.037), ('standing', 0.036), ('uniform', 0.035), ('cost', 0.035), ('stationarity', 0.035), ('properly', 0.034), ('expectation', 0.033), ('maximization', 0.033), ('expressions', 0.032), ('binomial', 0.032), ('british', 0.032), ('highest', 0.031), ('weak', 0.031), ('highlighted', 0.03), ('tiny', 0.029), ('randomized', 0.029), ('multiarmed', 0.028), ('pick', 0.028), ('sampled', 0.028), ('handwritten', 0.028), ('en', 0.028), ('exploitation', 0.028), ('transforms', 0.027), ('selected', 0.027), ('bandits', 0.026), ('maximize', 0.025), ('behave', 0.025), ('associate', 0.024), ('keeps', 0.024), ('auer', 0.024), ('stated', 0.024), ('multiple', 0.024), ('true', 0.024), ('image', 0.023), ('rounds', 0.023), ('optimizes', 0.023), ('divided', 0.023), ('behavior', 0.023), ('lund', 0.022), ('avors', 0.022), ('dub', 0.022), ('equalize', 0.022), ('kalal', 0.022), ('matas', 0.022), ('professor', 0.022), ('ql', 0.022), ('oriented', 0.022), ('account', 0.022), ('hi', 0.022), ('guration', 0.022), ('every', 0.021), ('balance', 0.021), ('uniformly', 0.021), ('strategy', 0.021), ('gaussian', 0.02), ('variances', 0.02), ('moves', 0.02), ('underpinning', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
2 0.28347251 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^(1+ζ)), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.
3 0.1737636 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
4 0.12925301 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/✏2 )). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/✏)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/✏)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/✏), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
5 0.12679851 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.091762841 299 nips-2011-Variance Penalizing AdaBoost
7 0.081149645 153 nips-2011-Learning large-margin halfspaces with more malicious noise
8 0.077459559 268 nips-2011-Speedy Q-Learning
9 0.063950256 32 nips-2011-An Empirical Evaluation of Thompson Sampling
10 0.061280854 291 nips-2011-Transfer from Multiple MDPs
11 0.059640858 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
12 0.059607353 261 nips-2011-Sparse Filtering
13 0.055505656 258 nips-2011-Sparse Bayesian Multi-Task Learning
14 0.053372387 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
15 0.053292654 244 nips-2011-Selecting Receptive Fields in Deep Networks
16 0.052231736 175 nips-2011-Multi-Bandit Best Arm Identification
17 0.051343847 4 nips-2011-A Convergence Analysis of Log-Linear Training
18 0.050719522 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
19 0.049381059 275 nips-2011-Structured Learning for Cell Tracking
20 0.048919514 61 nips-2011-Contextual Gaussian Process Bandit Optimization
topicId topicWeight
[(0, 0.169), (1, -0.05), (2, -0.039), (3, 0.052), (4, -0.016), (5, 0.058), (6, 0.027), (7, -0.012), (8, -0.177), (9, 0.009), (10, -0.125), (11, 0.03), (12, -0.026), (13, -0.081), (14, -0.047), (15, -0.072), (16, -0.016), (17, -0.134), (18, -0.098), (19, -0.174), (20, -0.128), (21, 0.052), (22, -0.011), (23, 0.012), (24, -0.082), (25, -0.153), (26, 0.062), (27, -0.037), (28, -0.017), (29, 0.082), (30, -0.046), (31, 0.05), (32, 0.074), (33, 0.112), (34, -0.069), (35, -0.048), (36, -0.018), (37, -0.124), (38, 0.028), (39, -0.035), (40, -0.103), (41, 0.107), (42, 0.038), (43, 0.116), (44, 0.014), (45, 0.033), (46, 0.056), (47, -0.118), (48, -0.115), (49, -0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.92011911 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
2 0.60682833 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^(1+ζ)), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.
3 0.59272313 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O(log(n)/(ǫ^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.
4 0.54753768 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
5 0.53507555 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
6 0.51123893 282 nips-2011-The Fast Convergence of Boosting
7 0.44009113 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
8 0.43136442 29 nips-2011-Algorithms and hardness results for parallel large margin learning
9 0.40445474 153 nips-2011-Learning large-margin halfspaces with more malicious noise
10 0.38144222 291 nips-2011-Transfer from Multiple MDPs
11 0.33744171 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.32121274 275 nips-2011-Structured Learning for Cell Tracking
13 0.31179115 19 nips-2011-Active Classification based on Value of Classifier
14 0.3065176 175 nips-2011-Multi-Bandit Best Arm Identification
15 0.30490792 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
16 0.30361477 32 nips-2011-An Empirical Evaluation of Thompson Sampling
17 0.30184942 109 nips-2011-Greedy Model Averaging
18 0.30052406 33 nips-2011-An Exact Algorithm for F-Measure Maximization
19 0.2996363 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
20 0.29829124 240 nips-2011-Robust Multi-Class Gaussian Process Classification
topicId topicWeight
[(0, 0.035), (4, 0.047), (20, 0.057), (26, 0.024), (31, 0.07), (33, 0.024), (43, 0.094), (45, 0.142), (57, 0.04), (58, 0.21), (74, 0.051), (79, 0.016), (83, 0.024), (84, 0.018), (99, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.79955196 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
2 0.76645011 231 nips-2011-Randomized Algorithms for Comparison-based Search
Author: Dominique Tschopp, Suhas Diggavi, Payam Delgosha, Soheil Mohajer
Abstract: This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D which defines approximate triangle n inequalities on ranks. We present a lower bound of Ω(D log D + D2 ) average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst case behavior. We develop 3 a randomized scheme for NN retrieval in O(D3 log2 n + D log2 n log log nD ) 3 questions. The learning requires asking O(nD3 log2 n + D log2 n log log nD ) questions and O(n log2 n/ log(2D)) bits to store.
3 0.70075423 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
4 0.69940031 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
5 0.69910991 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.69797605 199 nips-2011-On fast approximate submodular minimization
7 0.69671416 109 nips-2011-Greedy Model Averaging
8 0.69607145 80 nips-2011-Efficient Online Learning via Randomized Rounding
9 0.69604248 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
10 0.69535971 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
11 0.6948868 265 nips-2011-Sparse recovery by thresholded non-negative least squares
12 0.69396001 263 nips-2011-Sparse Manifold Clustering and Embedding
13 0.69375288 258 nips-2011-Sparse Bayesian Multi-Task Learning
14 0.69374031 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
15 0.69324553 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
16 0.69269341 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
17 0.6925267 25 nips-2011-Adaptive Hedge
18 0.69222373 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
19 0.69210738 29 nips-2011-Algorithms and hardness results for parallel large margin learning
20 0.69149435 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods