nips nips2013 nips2013-275 knowledge-graph by maker-knowledge-mining

275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning


Source: pdf

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory-constrained setting.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. [sent-5, score-0.584]

2 This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. [sent-6, score-0.117]

3 We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory-constrained setting. [sent-8, score-0.073]

4 Introduction: Learning a boosted classifier from a set of samples S = {X, Y}^N ⊂ R^D × {−1, 1} is usually addressed in the context of two main frameworks. [sent-9, score-0.179]

5 In offline Boosting settings [10] it is assumed that the learner has full access to the entire dataset S at any given time. [sent-10, score-0.257]

6 At each iteration t, the learning algorithm calculates a weight wi for each sample i – the derivative of the loss with respect to the classifier response on that sample – and feeds these weights together with the entire dataset to a weak learning algorithm, which learns a predictor ht . [sent-11, score-0.533]

7 The coefficient a_t of the chosen weak learner h_t is then calculated based on its weighted error. [sent-12, score-0.526]
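
A minimal sketch of one such offline iteration, assuming the logistic loss that the paper's experiments report using later and an AdaBoost-style formula for the coefficient; `learn_stump` is a placeholder for any weighted weak-learning routine, and none of the names below come from the paper.

```python
import numpy as np

def offline_boosting_round(X, y, F, learn_stump):
    """One offline boosting iteration (sketch).

    X: samples, y: labels in {-1, +1}, F: current ensemble scores on X.
    Weights are the magnitudes of the loss derivatives w.r.t. F, here for the
    logistic loss L = log(1 + exp(-y F)) (an assumption for this sketch).
    """
    w = 1.0 / (1.0 + np.exp(y * F))          # |dL/dF_i| = sigmoid(-y_i F_i)
    h = learn_stump(X, y, w)                 # weak learner trained on weighted data
    pred = h(X)                              # responses in {-1, +1}
    err = np.sum(w * (pred != y)) / np.sum(w)
    a = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # AdaBoost-style coefficient
    return h, a, F + a * pred                # updated scores
```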

8 A common approach used in practice to deal with such limitations is that of sub-sampling the data-set using strategies based on the sample weights W [9, 13]. [sent-16, score-0.108]

9 Though these methods address the limits of the weak learning algorithm's resources, they nonetheless assume a) access to the entire data-set at all times, and b) the ability to calculate the weights W of the N samples and to sub-sample K of these, all in an efficient manner. [sent-17, score-0.468]

10 For large datasets, in order to address such issues, the framework of online learning is frequently employed. [sent-19, score-0.073]

11 Online Boosting algorithms [15] typically assume access solely to a Filter() function, by which they mine samples from the data-set, usually one at a time. [sent-20, score-0.187]

12 Due to their online nature, such approaches typically treat the weak learning algorithm as a black box, assuming that it can be trained in an online manner, and concentrate on different approaches to calculating the weak learner coefficients [15, 4]. [sent-21, score-0.877]

13 Notable exceptions are the works of [11] and [14], where weak learner selectors are introduced, one for each weak learner in the ensemble, which are capable of picking a weak learner from a predetermined pool. [sent-22, score-1.305]

14 All these approaches, however, are similar in that they must predetermine the number of weak learners in the boosted strong classifier. [sent-23, score-0.453]

15 We propose here a middle ground between these two extremes in which the boosted classifier can store some of the already processed samples in a reservoir, possibly keeping them through multiple rounds of training. [sent-24, score-0.179]

16 As in online learning, we assume access only to a Filter() through which we can obtain a batch Qt of samples at each Boosting iteration. [sent-25, score-0.248]

17 The drawback of this approach is a) that after each iteration all old samples are discarded, and b) that the algorithm must process an increasing number of samples at each iteration as the weights become increasingly smaller. [sent-27, score-0.378]

18 We propose to acquire a fixed number of samples at each iteration and to add these to a persistent reservoir, discarding only a subset. [sent-28, score-0.191]

19 Rather they use a simple sliding window approach and concentrate on the removal and adding of weak learners to tackle this drift. [sent-30, score-0.416]

20 A related concept to the work presented here is that of learning on a budget [6], where, as in the online learning setting, samples are presented one at a time to the learner, a perceptron, which builds a classification model by retaining an active subset of these samples. [sent-31, score-0.238]

21 As noted, these approaches are concerned with the complexity of the classification model, that is, the budget refers to the number of samples which have non-zero coefficients in the dual representation of the classifier. [sent-36, score-0.194]

22 As mentioned, the batch version of Boosting consists of iteratively selecting a weak learner h_t at each iteration t, based on the loss reduction it induces on the full training set S. [sent-43, score-0.604]

23 In the reservoir setting, weak learners are selected solely from the information provided by the samples contained in the reservoir Rt . [sent-44, score-1.629]

24 We consider here one iteration of a Boosting procedure, where each sample is weighted according to its contribution to the overall loss. [sent-49, score-0.1]

25 We define wB similarly, and for any weak learner h ∈ H we let hB ∈ {−1, 1}^|B| stand for the vector of the |B| responses over the samples in B. [sent-56, score-0.6]

26 At each iteration t, the learning algorithm is presented with a batch of fresh samples Qt ⊂ S, |Qt | = q, and must choose r samples from the full set of samples Rt ∪ Qt at its disposal, in order to build Rt+1 with |Rt+1 | = r, which it subsequently uses for training. [sent-57, score-0.474]

27 Using the samples from Rt, the learner chooses a weak learner h_t ∈ H to maximize ⟨h_Rt ◦ y_Rt, w_Rt⟩, where ◦ stands for the Hadamard component-wise vector product. [sent-58, score-0.882]

28 Maximizing this latter quantity corresponds to minimizing the weighted error estimated on the samples currently in Rt . [sent-59, score-0.152]

29 The weight a_t of the selected weak learner can also be estimated from Rt. [sent-60, score-0.448]

30 The learner then receives a fresh batch of samples Qt+1 and the process continues iteratively. [sent-61, score-0.354]

31 In the following we will address the issue of which strategy to employ to discard the q samples at each time step t. [sent-63, score-0.179]

32 for choosing which q samples from Rt ∪ Qt to discard. [sent-67, score-0.141]

33 Max Weights (Max) At each iteration t the weight vector wRt ∪Qt is computed for the r + q samples and the r samples with the largest weights are kept. [sent-69, score-0.354]

34 Weighted Sampling (WSam) As above, w_{Rt∪Qt} is computed, then normalized to sum to 1, and used as a distribution to sample r samples to keep, without replacement. [sent-70, score-0.162]

35 Random Sampling (Rand) The reservoir is constructed by sampling uniformly r samples from the r + q available, without replacement. [sent-71, score-0.684]
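
A minimal sketch of these three baseline strategies; each receives the candidate samples, their labels, and the weight vector over the r + q candidates, and returns the indices of the r samples to keep. The uniform (X, y, w, r, rng) signature and the function names are illustrative choices, not from the paper; the samples themselves are unused here but keep the interface compatible with a GEEM-style strategy that does need them.

```python
import numpy as np

def max_weights(X, y, w, r, rng):
    # Max: keep the r samples with the largest boosting weights
    return np.argsort(w)[::-1][:r]

def weighted_sampling(X, y, w, r, rng):
    # WSam: normalize the weights to a distribution and sample r indices
    # to keep, without replacement
    p = w / w.sum()
    return rng.choice(len(w), size=r, replace=False, p=p)

def random_sampling(X, y, w, r, rng):
    # Rand: keep r of the r + q candidates uniformly at random, without replacement
    return rng.choice(len(w), size=r, replace=False)
```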

36 These baselines are presented to highlight that a more sophisticated reservoir strategy is needed to ensure competitive performance, rather than to serve as examples of state-of-the-art baselines. [sent-73, score-0.619]

37 Our objective will be to populate the reservoir with samples that will allow for an optimal selection of weak learners, as close as possible to the choice we would make if we could keep all samples. [sent-74, score-0.972]

38 The issue at hand is similar to that of feature selection: the selected samples should be jointly informative for choosing good weak learners. [sent-75, score-0.395]

39 This forces us to find a proper balance between the individual importance of the kept samples (i.e. [sent-76, score-0.121]

40 choosing those with large weights) and maximizing the heterogeneity of the weak learners' responses on them. [sent-78, score-0.445]

41 Greedy Edge Expectation Maximization: In the reservoir setting, given a set of samples A from which we must discard some samples and retain only a subset B, what we would like is to retain a training set that is as representative as possible of the entire set A. [sent-80, score-0.919]

42 Ideally, we would like B to be such that if we pick the optimal weak-learner according to the samples it contains, h* = argmax_{h∈H} ⟨hB ◦ yB, wB⟩ (1), then it maximizes the same quantity estimated on all the samples in A. [sent-81, score-0.242]

43 There may be many weak-learners in H that have the exact same responses as h∗ on the samples in B, and since we consider a situation where we will not have access to the samples from A \ B anymore, we model the choice among these weak-learners as a random choice. [sent-85, score-0.305]

44 If H is picked uniformly in H, under a reasonable assumption of symmetry, we propose H ◦ y ∼ N(µ, Σ) (3), where µ is the vector of dimension N of the expectations of the weak learner edges, and Σ is a covariance matrix of size N × N. [sent-88, score-0.447]

45 We note that choosing B based on (7) requires estimates of three quantities: the expected weak-learner edges µA, the covariance matrix ΣAA, and the weak learner h* trained on B. [sent-93, score-0.477]

46 Computing Σ and µ: As mentioned, the proposed method requires in particular an estimate of the vector of expected edges µA of the samples in A, as well as the corresponding covariance matrix ΣAA. [sent-96, score-0.177]

47 In practice, the estimation of the above depends on the nature of the weak learner family H. [sent-97, score-0.424]

48 A classification stump h_{θ,α,d}, with threshold θ ∈ R, polarity α ∈ {−1, 1}, and dimension d ∈ {1, …, D}, has the following form: ∀x ∈ R^D, h_{θ,α,d}(x) = 1 if α x_d ≥ α θ, and −1 otherwise (8), where x_d refers to the value of the d-th component of x. [sent-102, score-0.139]

49 In practice when choosing the optimal stump for a given set of samples A, a learner would sort all the samples according to each of the D dimensions, and for each dimension d it would consider stumps with thresholds θ between two consecutive samples in that sorted list. [sent-103, score-0.67]
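
A sketch of this sort-and-scan stump search, which returns the (dimension, threshold, polarity) maximizing the weighted edge ⟨h ◦ y, w⟩; using midpoints between consecutive distinct values as thresholds and the function name are illustrative choices.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustive stump search (sketch): sort each dimension and scan the
    thresholds between consecutive samples, keeping the (dimension, threshold,
    polarity) that maximizes the weighted edge sum_i w_i y_i h(x_i)."""
    n, D = X.shape
    best_edge, best_params = -np.inf, None
    for d in range(D):
        order = np.argsort(X[:, d])
        xs, wy = X[order, d], (w * y)[order]
        total = wy.sum()
        prefix = np.cumsum(wy)                # prefix[k] = sum of w*y over the k+1 smallest
        for k in range(n - 1):
            if xs[k] == xs[k + 1]:
                continue                      # no threshold fits between equal values
            theta = 0.5 * (xs[k] + xs[k + 1])
            edge = total - 2.0 * prefix[k]    # alpha = +1: h = +1 iff x_d >= theta
            for alpha, e in ((1, edge), (-1, -edge)):
                if e > best_edge:
                    best_edge, best_params = e, (d, theta, alpha)
    return best_params
```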

50 The covariance of the edge of two samples can also be calculated efficiently, with O(|A|² D) complexity. [sent-105, score-0.144]

51 For two given samples i, j we have, ∀h ∈ H, y_i h_i y_j h_j ∈ {−1, 1} (9). [sent-106, score-0.152]

52 Having sorted the samples along a specific dimension d, we have that for α = 1, y_i h_i y_j h_j = y_i y_j for those weak learners which agree on samples i and j, i.e. those whose threshold does not lie between them, and −y_i y_j for those which disagree. [sent-107, score-0.723]
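
One way to realize this pairwise computation, as a simplified sketch: assuming stump thresholds lie between consecutive sorted samples, the number of stumps in dimension d that separate samples i and j equals their rank difference along d, which gives the expected edge product in closed form; a covariance estimate then follows by subtracting the outer product of an estimate of µ. The estimator and names below are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def edge_product_moments(X, y):
    """Pairwise edge-product moments in O(|A|^2 D) (sketch).

    M[i, j] approximates E[y_i h_i y_j h_j] under a uniform choice of
    (dimension, threshold); the polarity alpha cancels in the product.
    """
    n, D = X.shape
    M = np.zeros((n, n))
    for d in range(D):
        ranks = np.argsort(np.argsort(X[:, d])).astype(float)
        sep = np.abs(ranks[:, None] - ranks[None, :])   # number of separating thresholds
        M += 1.0 - 2.0 * sep / (n - 1)                  # E_d[h_i h_j]
    M /= D
    return M * np.outer(y, y)
```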

53 Computing the probability that a weak learner h will be the chosen one, given that it will be trained on B, and integrating over H is consistent with the Gaussian assumption but computationally prohibitive. [sent-117, score-0.446]

54 The first is to set h*_B ◦ y_B = (1, …, 1), which is equivalent to making the assumption that the training process will result in a weak learner which performs perfectly on the training data B. [sent-122, score-0.494]

55 The second is to generate a number |H_Lattice| of weak learner edges by sampling on the {−1, 1}^|B| lattice using the Gaussian H ◦ y ∼ N(µB, ΣBB) restricted to this lattice, and to keep the optimal h* = argmax_{h ∈ H_Lattice} ⟨hB ◦ yB, wB⟩. [sent-124, score-0.568]
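
A crude sketch of this second option: draw candidate edge vectors from N(µB, ΣBB), snap them to the lattice by taking signs (a stand-in for sampling the Gaussian restricted to the lattice, which is what the text describes), and keep the best-scoring one. The snapping step and all names are assumptions for illustration.

```python
import numpy as np

def lattice_argmax(mu_B, Sigma_BB, w_B, n_draws, rng):
    """Sample candidate edge vectors h_B o y_B from N(mu_B, Sigma_BB), project
    them onto the {-1, +1}^|B| lattice, and return the one maximizing
    <h_B o y_B, w_B> (sketch)."""
    draws = rng.multivariate_normal(mu_B, Sigma_BB, size=n_draws)
    lattice = np.sign(draws)
    lattice[lattice == 0] = 1.0               # break exact zeros arbitrarily
    scores = lattice @ w_B                    # the Gaussian models H o y directly
    return lattice[np.argmax(scores)]
```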

56 For these reasons we propose a greedy approach to building the reservoir population which is computationally bounded. [sent-133, score-0.604]

57 The greedy process then iteratively goes through the |B| samples in B and finds the sample j such that, for B' = B \ {j}, the value ⟨Σ_{B̄'B'} Σ_{B'B'}^{−1} (h*_{B'} ◦ y_{B'}), w_{B̄'}⟩ + ⟨h*_{B'} ◦ y_{B'}, w_{B'}⟩ (11) is maximized, where, in this context, h*_{B'} refers to the weak learner chosen by training on B'. [sent-137, score-0.672]

58 Evaluation of E(⟨h*_A, w_A⟩ | B): Each step in the above greedy process requires going through all the samples j in the current B and calculating E(⟨h*_A, w_A⟩ | B') for B' = B \ {j}. [sent-145, score-0.195]

59 The main computational cost here is the first factor, incurred in calculating the inverse of the covariance matrix Σ_{B'B'}, which results from the matrix Σ_{BB} by removing a single row and column. [sent-148, score-0.086]

60 Thus, by pre-calculating the products Σ_{BB}^{−1} (h*_B ◦ y_B) and w_{B̄}^T Σ_{B̄B} once at the beginning of each greedy optimization step, we can incur a cost of O(|B|) for each sample j and an O(|B|²) cost overall. [sent-158, score-0.145]
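
A minimal sketch of this greedy backward elimination, written with the first simplification above (h*_B ◦ y_B set to the all-ones vector) and without the pre-calculation trick of the previous sentence, so each step naively re-solves a linear system; the small ridge term and all names are illustrative assumptions.

```python
import numpy as np

def geem_greedy_select(Sigma, w, r):
    """Start from all of A and repeatedly drop the sample whose removal maximizes
    the conditional-expectation score of eq. (11), assuming the weak learner is
    perfect on the kept set. Returns the indices of the r samples kept."""
    A = np.arange(len(w))
    B = list(A)

    def score(keep):
        keep = np.asarray(keep)
        out = np.setdiff1d(A, keep)
        e = np.ones(len(keep))                      # h*_B o y_B assumed = (1, ..., 1)
        K = Sigma[np.ix_(keep, keep)] + 1e-8 * np.eye(len(keep))
        x = np.linalg.solve(K, e)                   # Sigma_{BB}^{-1} (h*_B o y_B)
        # <Sigma_{out,B} Sigma_{BB}^{-1} e, w_out> + <e, w_B>
        return w[out] @ (Sigma[np.ix_(out, keep)] @ x) + w[keep] @ e

    while len(B) > r:
        candidates = [(score([i for i in B if i != j]), j) for j in B]
        _, worst = max(candidates)
        B.remove(worst)
    return np.array(B)
```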

61 Weights w̃_B: GEEM provides a method for selecting which samples to keep and which to discard. [sent-160, score-0.141]

62 However, in doing so, it creates a biased sample B of the set A, and consequently the weights wB are not representative of the weight distribution wA. [sent-161, score-0.085]

63 Overall Complexity: The proposed method GEEM comprises, at each boosting iteration, three main steps: (1) the calculation of ΣAA, (2) the optimization of B, and (3) the training of the weak learner h_t. The third step is common to all the reservoir strategies presented here. [sent-164, score-1.397]

64 In the case of classification stumps, by presorting the samples along each dimension and exploiting the structure of the hypothesis space H, we can incur a cost of O(D|B| log |B|), where D is the dimensionality of the input space. [sent-165, score-0.232]

65 Thus the overall complexity of the proposed method is O(|A|³ + D|A| log |A|), which in practice should not be significantly larger than O(D|B| log |B|), the cost of the remaining reservoir strategies. [sent-169, score-0.571]

66 We note that this analysis ignores the cost of processing the incoming samples Qt, which is also common to all strategies; depending on the task, this cost may handily dominate all others. [sent-170, score-0.183]

67 Experiments: In order to experimentally validate both the framework of reservoir boosting and the proposed method GEEM, we conducted experiments on four popular computer vision datasets. [sent-171, score-0.789]

68 Original experiments with the exponential loss in a reservoir setting showed it to be unstable and to lead to degraded performance for all the reservoir strategies presented here. [sent-174, score-1.127]

69 In [14] the authors performed an extensive comparison in an online setting and also found logitboost to yield the best results. [sent-175, score-0.122]

70 We set the number of weak learners T in the boosted classifier to T = 250, common to all methods. [sent-176, score-0.453]

71 In the case of the online boosting algorithms this translates to fixing the number of weak learners. [sent-177, score-0.607]

72 Finally, for the methods that use a reservoir – that is, GEEM and the baselines outlined in Section 3 – we set r = q. [sent-178, score-0.595]

73 Thus at every iteration, the reservoir is populated with |Rt| = r samples and the algorithm receives a further |Qt| = r samples from the filter. [sent-179, score-0.782]

74 The reservoir strategy is then used to discard r of these samples to build Rt+1 . [sent-180, score-0.719]
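
Putting the pieces together, a sketch of the overall loop with r = q as described above; `filter_fn`, `strategy`, and `learn_weak` are placeholders for the data stream, a reservoir strategy such as Max/WSam/Rand/GEEM, and a weighted weak-learning routine, and the logistic-loss weights repeat the assumption made earlier in this summary.

```python
import numpy as np

def reservoir_boosting(filter_fn, strategy, learn_weak, r, T, rng):
    """Reservoir boosting loop (sketch): at each round r fresh samples arrive,
    the strategy keeps r of the 2r candidates, and a weak learner is trained
    on the reservoir."""
    X_res, y_res = None, None
    ensemble = []                                    # list of (coefficient, weak learner)
    for t in range(T):
        Xq, yq = filter_fn(r)                        # fresh batch Q_t with |Q_t| = r
        if X_res is None:
            X_all, y_all = Xq, yq
        else:
            X_all = np.vstack([X_res, Xq])
            y_all = np.concatenate([y_res, yq])
        F = (sum(a * h(X_all) for a, h in ensemble)
             if ensemble else np.zeros(len(y_all)))
        w = 1.0 / (1.0 + np.exp(y_all * F))          # logistic-loss boosting weights
        keep = strategy(X_all, y_all, w, r, rng)     # indices of the r samples kept
        X_res, y_res = X_all[keep], y_all[keep]
        h, a = learn_weak(X_res, y_res, w[keep])     # weak learner and its coefficient
        ensemble.append((a, h))
    return ensemble
```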

75 Data-sets: We used four standard datasets. CIFAR-10 is a recognition dataset consisting of 32 × 32 images of 10 distinct classes depicting vehicles and animals. [sent-182, score-0.065]

76 The training data consists of 5000 images of each class. [sent-183, score-0.066]

77 MNIST is a well-known optical digit recognition dataset comprising 60000 images of size 28 × 28 of digits from 0 − 9. [sent-185, score-0.065]

78 INRIA: The training set consists of 12180 images of size 64 × 128 of both pedestrians and background images, from which we extract HoG features [7]. [sent-188, score-0.097]

79 STL-10: An image recognition dataset consisting of images of size 96 × 96 belonging to 10 classes, each represented by 500 images in the training set. [sent-189, score-0.131]

80 Baselines: The baselines for the reservoir strategy have already been outlined in Section 3, and we also benchmarked three online Boosting algorithms: Oza [15], Chen [4], and Bisc [11]. [sent-192, score-0.692]

81 The first two algorithms treat weak learners as a black-box but predefine their number. [sent-193, score-0.395]

82 Note that these approaches are not mutually exclusive with the proposed method, as the weak learners picked by GEEM can be combined with an online boosting algorithm optimizing their coefficients. [sent-195, score-0.717]

83 For the final method [11], we set the number of selectors to K = 250, resulting in the same number of weak learners as for the other methods. [sent-196, score-0.428]

84 Finally we compared our method against two sub-sampling methods that have access to the full dataset and subsample r samples using a weighted sampling routine. [sent-198, score-0.242]

85 At each iteration, these methods compute the boosting weights of all the samples in the dataset and use weighted sampling to obtain a subset Rt . [sent-199, score-0.498]

86 The first method is a simple weighted sampling method (WSS), while the second is MadaBoost (Mada), which combines weighted sampling with weight adjustment for the sub-sampled samples. [sent-200, score-0.132]
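
For reference, a sketch of the WSS baseline, which sees the whole dataset; Mada additionally adjusts the weights of the sub-sampled points, a step not reproduced here. The function name and the logistic-loss weights are assumptions carried over from the earlier sketches.

```python
import numpy as np

def full_data_subsample(y, F, r, rng):
    # WSS-style baseline (sketch): compute boosting weights on the whole dataset,
    # then weighted-sample r indices to train on, without replacement
    w = 1.0 / (1.0 + np.exp(y * F))       # logistic-loss weights, as in the sketches above
    p = w / w.sum()
    return rng.choice(len(y), size=r, replace=False, p=p)
```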

87 We furthermore show a comparison with a fixed reservoir baseline (Fix): this baseline subsamples the dataset once prior to learning and then trains the ensemble using offline AdaBoost, so the contents of the reservoir in this case do not change from iteration to iteration. [sent-201, score-1.229]

88 Results and Discussion: Tables 3, 4, and 5 list, respectively, the performance of the reservoir baselines, the online Boosting techniques, and the sub-sampling methods. [sent-202, score-0.613]

89 Table 3: Test Accuracy on the four datasets (CIFAR, STL, INRIA, MNIST) for the different reservoir strategies. [sent-268, score-0.33]

90 Table 4: Comparison of GEEM with the online Boosting algorithms (Oza, Chen, Bisc) on the four datasets. [sent-300, score-0.33]

91 Table 5: Comparison of GEEM with the subsampling algorithms (WSS, Mada). As can be seen, GEEM outperforms the other reservoir strategies on three of the four datasets and performs on par with the best on the fourth (MNIST). [sent-364, score-0.62]

92 It also outperforms the online Boosting techniques on three data-sets and performs on par with the best baselines on MNIST. [sent-365, score-0.088]

93 Note that the Fix baseline was provided with ten times the number of samples to reach a similar level of performance. [sent-367, score-0.121]

94 These results demonstrate that both the reservoir framework we propose for Boosting and the specific GEEM algorithm provide performance greater than or on par with existing state-of-the-art methods. [sent-368, score-0.573]

95 When compared with other reservoir strategies, GEEM suffers from larger complexity which translates to a longer training time. [sent-369, score-0.606]

96 For the INRIA dataset and r = 100 GEEM requires circa 70 seconds for training as opposed to 50 for the WSam strategy, while for r = 250 GEEM takes approximately 320 seconds to train compared to 70 for WSam. [sent-370, score-0.069]

97 We note however that even when equating training time, which translates to using r = 100 for GEEM and r = 250 for WSam, GEEM still outperforms the simpler reservoir strategies. [sent-371, score-0.606]

98 It could be applied in the context of parallel processing, where a dataset can be split among CPUs each training a classifier on a different portion of the data. [sent-376, score-0.069]

99 Fast kernel classifiers with online and active learning. [sent-381, score-0.073]

100 Tracking concept change with incremental boosting by minimization of the evolving exponential loss. [sent-438, score-0.249]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reservoir', 0.54), ('geem', 0.387), ('weak', 0.254), ('boosting', 0.249), ('wb', 0.24), ('yb', 0.188), ('learner', 0.17), ('bb', 0.156), ('learners', 0.141), ('qt', 0.134), ('hb', 0.127), ('samples', 0.121), ('rt', 0.114), ('aa', 0.086), ('wsam', 0.074), ('online', 0.073), ('ine', 0.072), ('ht', 0.071), ('ya', 0.069), ('eh', 0.067), ('stumps', 0.06), ('boosted', 0.058), ('mej', 0.055), ('xd', 0.055), ('baselines', 0.055), ('er', 0.053), ('classi', 0.05), ('wrt', 0.049), ('logitboost', 0.049), ('ilter', 0.049), ('stl', 0.049), ('oza', 0.049), ('iteration', 0.048), ('strategies', 0.047), ('inria', 0.047), ('ha', 0.046), ('ensemble', 0.044), ('budget', 0.044), ('wa', 0.043), ('greedy', 0.042), ('weights', 0.04), ('cifar', 0.038), ('mnist', 0.038), ('fresh', 0.037), ('bisc', 0.037), ('hlattice', 0.037), ('idiap', 0.037), ('mada', 0.037), ('madaboost', 0.037), ('martigny', 0.037), ('populate', 0.037), ('training', 0.035), ('discard', 0.034), ('dataset', 0.034), ('lattice', 0.034), ('solely', 0.033), ('access', 0.033), ('par', 0.033), ('edges', 0.033), ('stump', 0.033), ('loaded', 0.033), ('selectors', 0.033), ('calculating', 0.032), ('cost', 0.031), ('calculation', 0.031), ('images', 0.031), ('yj', 0.031), ('weighted', 0.031), ('translates', 0.031), ('responses', 0.03), ('wss', 0.03), ('refers', 0.029), ('antoine', 0.028), ('batch', 0.026), ('pkdd', 0.026), ('workshops', 0.026), ('bordes', 0.026), ('filter', 0.026), ('jth', 0.025), ('stands', 0.025), ('sorted', 0.024), ('retain', 0.024), ('switzerland', 0.024), ('strategy', 0.024), ('weight', 0.024), ('contents', 0.023), ('sampling', 0.023), ('covariance', 0.023), ('fix', 0.023), ('yoram', 0.022), ('rand', 0.022), ('computationally', 0.022), ('discarding', 0.022), ('sample', 0.021), ('concentrate', 0.021), ('incur', 0.02), ('entire', 0.02), ('keep', 0.02), ('ecml', 0.02), ('choosing', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory-constrained setting.

2 0.15965821 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

3 0.14051682 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

Author: Alexander Zimin, Gergely Neu

Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L|X ||A|T log(|X ||A|/L) in the bandit setting and 2L T log(|X ||A|/L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1

4 0.13978966 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

5 0.11927202 230 nips-2013-Online Learning with Costly Features and Labels

Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari

Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1

6 0.092229672 318 nips-2013-Structured Learning via Logistic Regression

7 0.078332558 75 nips-2013-Convex Two-Layer Modeling

8 0.066348612 188 nips-2013-Memory Limited, Streaming PCA

9 0.064440452 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

10 0.063279465 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

11 0.060410216 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

12 0.058536958 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

13 0.058398344 171 nips-2013-Learning with Noisy Labels

14 0.054536939 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

15 0.053636499 89 nips-2013-Dimension-Free Exponentiated Gradient

16 0.052293416 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

17 0.050461818 325 nips-2013-The Pareto Regret Frontier

18 0.048707634 84 nips-2013-Deep Neural Networks for Object Detection

19 0.048515711 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

20 0.046501905 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.149), (1, 0.001), (2, 0.008), (3, -0.073), (4, 0.061), (5, -0.024), (6, -0.04), (7, 0.011), (8, -0.013), (9, 0.004), (10, -0.04), (11, -0.02), (12, 0.004), (13, 0.003), (14, 0.032), (15, -0.066), (16, -0.047), (17, 0.01), (18, 0.006), (19, -0.069), (20, -0.104), (21, 0.041), (22, 0.029), (23, 0.015), (24, 0.03), (25, 0.077), (26, 0.016), (27, -0.046), (28, 0.027), (29, -0.041), (30, 0.028), (31, 0.103), (32, -0.054), (33, 0.018), (34, -0.006), (35, -0.0), (36, -0.072), (37, -0.133), (38, 0.03), (39, 0.034), (40, -0.119), (41, -0.004), (42, -0.03), (43, 0.03), (44, 0.127), (45, -0.023), (46, -0.076), (47, -0.042), (48, -0.03), (49, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91502774 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory-constrained setting.

2 0.7204777 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

Author: Xiaojin Zhu

Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1

3 0.68827808 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

4 0.6396662 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

5 0.60890156 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

6 0.57022315 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

7 0.56700456 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

8 0.56593359 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

9 0.55959308 171 nips-2013-Learning with Noisy Labels

10 0.55694503 230 nips-2013-Online Learning with Costly Features and Labels

11 0.53469318 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

12 0.50271451 244 nips-2013-Parametric Task Learning

13 0.49208072 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

14 0.48732716 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

15 0.48365903 134 nips-2013-Graphical Models for Inference with Missing Data

16 0.47174004 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

17 0.46249536 340 nips-2013-Understanding variable importances in forests of randomized trees

18 0.46036556 76 nips-2013-Correlated random features for fast semi-supervised learning

19 0.4488419 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors

20 0.44568458 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (7, 0.188), (16, 0.031), (19, 0.01), (33, 0.134), (34, 0.094), (36, 0.022), (41, 0.024), (49, 0.037), (56, 0.103), (70, 0.082), (85, 0.048), (89, 0.029), (93, 0.081), (95, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84847927 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang

Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.

same-paper 2 0.82176495 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory-constrained setting.

3 0.74872684 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori

Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1

4 0.74805123 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average

Author: Yao-Liang Yu

Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1

5 0.74417359 121 nips-2013-Firing rate predictions in optimal balanced networks

Author: David G. Barrett, Sophie Denève, Christian K. Machens

Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1

6 0.7409485 157 nips-2013-Learning Multi-level Sparse Representations

7 0.73988521 5 nips-2013-A Deep Architecture for Matching Short Texts

8 0.73918003 251 nips-2013-Predicting Parameters in Deep Learning

9 0.73874527 215 nips-2013-On Decomposing the Proximal Map

10 0.73873174 64 nips-2013-Compete to Compute

11 0.73605305 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

12 0.73542362 99 nips-2013-Dropout Training as Adaptive Regularization

13 0.73437858 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

14 0.73394507 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

15 0.73156393 201 nips-2013-Multi-Task Bayesian Optimization

16 0.73132497 30 nips-2013-Adaptive dropout for training deep neural networks

17 0.73055553 15 nips-2013-A memory frontier for complex synapses

18 0.73023891 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

19 0.72881162 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

20 0.72775894 331 nips-2013-Top-Down Regularization of Deep Belief Networks