nips nips2013 nips2013-90 knowledge-graph by maker-knowledge-mining

90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting


Source: pdf

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. [sent-9, score-0.288]

2 Moreover, it has been shown that all boosting algorithms based on convex functions are susceptible to random classification noise [14]. [sent-22, score-0.227]

3 In this paper, we propose a boosting method for binary classification – DirectBoost – a greedy coordinate descent algorithm that directly minimizes classification error over labeled training examples to build an ensemble linear classifier of weak classifiers. [sent-25, score-0.84]

4 The algorithm stops once a (local coordinatewise) maximum of the margins is reached. [sent-27, score-0.211]

5 In the next section, we first present a coordinate descent algorithm that directly minimizes 0-1 loss over labeled training examples. [sent-28, score-0.272]

6 We then describe coordinate ascent algorithms that aim to directly maximize any targeted arbitrarily defined margins right after we reach a (local coordinatewise) minimum of 0-1 loss. [sent-29, score-0.5]

7 Let H = {h1 , · · · , hl } denote the set of all possible weak classifiers that can be produced by the weak learning algorithm, where a weak classifier hj ∈ H is a mapping from an instance space X to Y = {−1, 1}. [sent-35, score-0.576]

8 We assume that the training set consists of examples with labels {(xi , yi )}, i = 1, · · · , n, where each (xi , yi ) ∈ X × Y is generated independently from p(X, Y ). [sent-39, score-0.252]

9 Minimizing 0-1 Loss: Similar to AdaBoost, DirectBoost works by sequentially running an iterative greedy coordinate descent algorithm, each time directly minimizing the true empirical classification error (1) instead of the weighted empirical classification error used in AdaBoost. [sent-42, score-0.305]

10 That is, for each iteration, only the parameter of a weak classifier that leads to the most significant true classification error reduction is updated, while the weights of all other weak classifiers are kept unchanged. [sent-43, score-0.433]

11 Consider the tth iteration; the ensemble classifier is ft (x) = Σ_{k=1}^{t} αk hk (x) (3), where the previous t − 1 weak classifiers hk (x) and corresponding weights αk , k = 1, · · · , t − 1 have been selected and determined. [sent-45, score-0.507]

12 From these two linear scoring functions, the one with the higher score determines the predicted label yi of the ensemble classifier ft (xi ). [sent-48, score-0.257]

13 The intersection point êi of these two linear scoring functions is the critical point at which the predicted label ŷi switches its sign; the intersection point satisfies the condition that Ft (xi , +1) = Ft (xi , −1) = 0, i.e. [sent-49, score-0.202]

14 These points divide the range of αt into (at most) n + 1 intervals; each interval has a fixed value of the true classification error, thus the classification error is a stepwise function of αt (Algorithm 1: greedy coordinate descent algorithm that minimizes the 0-1 loss). [sent-52, score-0.237]

15 3: for a weak classifier hk ∈ H do 4: Visit each sample in the order that |a(xi )| is increasing. [sent-54, score-0.267]

16 5: Compute the slope and intercept of F (xi , yi ) = yi hk (xi )α + yi a(xi ). [sent-55, score-0.402]

17 7: If (slope > 0 and intercept < 0), the error update on the right-hand side of êi is -1. [sent-57, score-0.234]

18 8: If (slope < 0 and intercept > 0), the error update on the right-hand side of êi is +1. [sent-58, score-0.234]

19 11: end for 12: Pick the weak classifiers that lead to the largest classification error reduction. [sent-61, score-0.241]

20 13: Among these selected weak classifiers, only update the weight of the one weak classifier that gives the smallest exponential loss. [sent-62, score-0.384]

21 The greedy coordinate descent algorithm that sequentially minimizes the 0-1 loss is described in Algorithm 1; lines 3-11 are the weak learning steps and the rest are boosting steps. [sent-66, score-0.633]
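To make the per-classifier line search in Algorithm 1 concrete, here is a minimal Python sketch of the weak-learning step (lines 3-11) for a single candidate weak classifier. It assumes a(xi) denotes the score of the current ensemble f_{t-1}(xi), that h(xi) and yi take values in {−1, +1}, and that the inputs are numpy arrays; the function name and the handling of ties at interval boundaries are illustrative, not taken from the paper.

```python
import numpy as np

def best_weight_for_weak_classifier(h_vals, y, prev_scores):
    """Line search over alpha_t > 0 for one weak classifier (Algorithm 1, lines 3-11).

    h_vals:      h(x_i) in {-1, +1} for each training sample
    y:           labels y_i in {-1, +1}
    prev_scores: a(x_i), assumed to be the score of the current ensemble f_{t-1}(x_i)
    Returns (min_error_count, alpha_t) with alpha_t lying just inside an interval
    attaining the minimum 0-1 loss.
    """
    slope = y * h_vals              # slope of F(x_i, y_i) = y_i h(x_i) alpha + y_i a(x_i)
    intercept = y * prev_scores     # intercept of F(x_i, y_i)

    # error just to the right of alpha_t = 0: samples whose score is not positive
    err = int(np.sum(intercept <= 0))

    # crossing points e_i = -intercept/slope; only positive crossings matter for alpha_t > 0
    events = []
    for s, b in zip(slope, intercept):
        if s > 0 and b < 0:
            events.append((-b / s, -1))   # sample becomes correctly classified: error -1
        elif s < 0 and b > 0:
            events.append((-b / s, +1))   # sample becomes misclassified: error +1
    events.sort()

    best_err, best_alpha = err, 0.0
    for e, delta in events:
        err += delta
        if err < best_err:
            best_err, best_alpha = err, e
    return best_err, best_alpha + 1e-8    # small offset keeps alpha_t inside the chosen interval
```

Sweeping the sorted crossing points and keeping a running error count is what makes the whole line search O(n log n) per weak classifier in this sketch.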

22 Suppose for a weak classifier, we have Ft (xi , yi ), i = 1, 2, 3, 4 as shown in Figure 1. [sent-68, score-0.251]

23 We incrementally update the classification error on the intervals of êi , i = 1, 2, 3, 4: for Ft (x1 , y1 ), its slope is negative and its intercept is negative, so sample x1 always has a negative margin for αt > 0; thus there is no error update on the right-hand side of ê1 . [sent-70, score-0.734]

24 For Ft (x2 , y2 ), its slope is positive and its intercept is negative; then when αt is at the right side of ê2 , sample x2 has a positive margin and becomes correctly classified, so we update the error by -1, and the error rate is reduced to 25%. [sent-71, score-0.615]

25 For Ft (x3 , y3 ), its slope is negative and its intercept is positive; then when αt is at the right side of ê3 , sample x3 has a negative margin and becomes misclassified, so we update the error by +1 and the error rate changes to 50% again. [sent-72, score-0.616]

26 For Ft (x4 , y4 ), its slope is positive and its intercept is positive, so sample x4 always has a positive margin for αt > 0; thus there is no error update on the right-hand side of ê4 . [sent-73, score-0.566]

27 Figure 1: An example of computing the minimum 0-1 loss of a weak learner over 4 samples (left: the linear scoring functions Ft (xi , yi ) over αt ; right: the resulting stepwise classification error, dropping from 50% to 25% between ê2 and ê3 ). [sent-75, score-0.344]

28 A nice property of the above greedy coordinate descent algorithm is that the classification error is monotonically decreasing. [sent-78, score-0.287]

29 Assume there are M weak classifiers being considered; the computational complexity of Algorithm 1 in the training stage is O(M n) for each iteration. [sent-79, score-0.308]

30 Algorithm 1 is a heuristic procedure that minimizes the 0-1 loss; it is not guaranteed to find the global minimum and may become trapped at a coordinatewise local minimum [22] of the 0-1 loss. [sent-83, score-0.203]

31 Nevertheless, at that point we switch to the algorithms, presented below, that directly maximize the margins. [sent-84, score-0.245]

32 Instead, we can prove that the generalization error of any ensemble classifier is bounded in terms of the average margin of the bottom n′ samples or the n′ th order margin of the training examples, as well as the number of training examples and the complexity of the base classifiers. [sent-87, score-1.209]

33 This view motivates us to propose a coordinate ascent algorithm to directly maximize several types of margins right after the training error reaches a (local coordinatewise) minimum. [sent-88, score-0.556]

34 We denote the minimum margin, the average margin, and the median margin over the training examples as gmin = min_{i∈{1,··· ,n}} mi , gaverage = (1/n) Σ_{i=1}^{n} mi , and gmedian = median{mi , i = 1, · · · , n}. [sent-91, score-0.822]

35 Furthermore, we can sort the margins over all training examples in increasing order, consider the n′ ≤ n worst training examples that have the smallest margins, and compute the average margin over those n′ training examples. [sent-92, score-0.983]

36 We call this the average margin of the bottom n′ samples, and denote it as gaverage n′ = (1/n′) Σ_{i∈Bn′} mi , where Bn′ denotes the set of n′ samples having the smallest margins. [sent-93, score-0.713]
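As a reference for these definitions, the following sketch computes gmin, gaverage, gmedian, the average margin of the bottom n′ samples, and the n′ th order margin from the ensemble weights and weak-classifier outputs. It assumes the normalized margin mi = yi Σk αk hk(xi) / Σk αk used in Theorem 2 below, numpy-array inputs, and illustrative function and key names.

```python
import numpy as np

def margin_statistics(alphas, weak_outputs, y, n_prime):
    """Margin summaries for an ensemble sum_k alpha_k h_k.

    alphas:       nonnegative weights alpha_1..alpha_t
    weak_outputs: array of shape (t, n) with entries h_k(x_i) in {-1, +1}
    y:            labels y_i in {-1, +1}
    """
    alphas = np.asarray(alphas, dtype=float)
    scores = alphas @ np.asarray(weak_outputs, dtype=float)   # sum_k alpha_k h_k(x_i)
    margins = y * scores / alphas.sum()                       # normalized margins m_i

    sorted_m = np.sort(margins)
    return {
        "g_min": sorted_m[0],
        "g_average": margins.mean(),
        "g_median": float(np.median(margins)),
        "g_average_bottom": sorted_m[:n_prime].mean(),   # average margin of the bottom n' samples
        "g_nth_order": sorted_m[n_prime - 1],            # n'-th order (n'-th smallest) margin
    }
```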

37 The margin maximization method described below is a greedy coordinate ascent algorithm that adds a weak classifier achieving maximum margin. [sent-94, score-0.81]

38 It allows us to continuously maximize the margin while keeping the training error at a minimum by running the greedy coordinate descent algorithm presented in the previous section. [sent-95, score-0.77]

39 The margin mi is a linear fractional function of α, and it is both quasiconvex and quasiconcave, i.e., quasilinear. [sent-96, score-0.403]

40 Theorem 2 shows that the average margin of the bottom n′ examples is quasiconcave in the region of zero training error. [sent-99, score-0.695]

41 Theorem 2 Denote the average margin of the bottom n′ samples as gaverage n′ (α) = (1/n′) Σ_{i∈{Bn′ |α}} yi Σ_{k=1}^{t} αk hk (xi ) / Σ_{k=1}^{t} αk , where {Bn′ |α} denotes the set of n′ samples whose margins are at the bottom for the fixed α. [sent-100, score-1.132]

42 Then gaverage n′ (α) in the region of zero training error is quasiconcave. [sent-101, score-0.312]

43 Therefore, the margin on the ith example (xi , yi ) with respect to αt is either concave when it is monotonically increasing or convex when it is monotonically decreasing. [sent-104, score-0.539]

44 Consider a greedy coordinate ascent algorithm that maximizes the average margin gaverage over all training examples. (From the caption of Figure 2: at points q2 and q4 , the set of bottom n′ = 3 examples ...) [sent-108, score-0.968]

45 ∂gaverage /∂αt = (1/n) (c Σ_{i=1}^{n} bi,t − Σ_{i=1}^{n} ai ) / (c + αt )^2   (9)   (Algorithm 2: greedy coordinate ascent algorithm that maximizes the average margin of the bottom n′ examples.) [sent-110, score-0.756]

46 3: for a weak classifier do 4: Determine the lowest sample whose margin is decreasing and determine d. [sent-114, score-0.536]

47 7: Compute the intersection qj+1 of the (j + 1)th highest increasing margin in Bn′ and the (j + 1)th smallest decreasing margin in Bn′c (the complement of the set Bn′ ). [sent-117, score-0.74]

48 13: Compute the average margin of the bottom n′ examples at q ∗ . [sent-121, score-0.532]

49 14: end if 15: end for 16: Pick the weak classifier with the largest increment of the average margin of the bottom n′ examples, with its weight set to q ∗ . [sent-122, score-0.77]

50 17: Repeat 2-16 until no increment in average margin of bottom n′ examples. [sent-123, score-0.531]

51 Therefore, the maximum average margin can only occur at the two ends of the interval. [sent-124, score-0.378]

52 As shown in Figure 2, the maximum average margin is either at the origin or at point d, which depends on the sign of the derivative in (9). [sent-125, score-0.451]

53 If it is positive, the average margin is monotonically increasing and we set αt = d − ǫ; otherwise we set αt = 0. [sent-126, score-0.409]

54 The greedy coordinate ascent algorithm proceeds by looking at all weak classifiers in H: if the numerator in (9) is positive, we set the classifier's weight within ǫ of the right end of the interval where the training error is minimum, and compute the resulting value of the average margin. [sent-127, score-0.649]

55 We add the weak classifier which has the largest average margin increment. [sent-128, score-0.57]
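The coordinate step just described can be sketched as follows. It assumes each margin has the linear fractional form mi(αt) = (ai + bi,t αt)/(c + αt) used later in the text, that d is the right end of the interval keeping the training error at its minimum, and that the sign test uses the numerator of (9) as reconstructed above; names are illustrative.

```python
def average_margin_coordinate_step(a, b, c, d, eps=1e-6):
    """One coordinate move of the ascent on g_average over all examples.

    a, b: per-sample coefficients a_i and b_{i,t} for the candidate weak classifier
    c:    sum of the weights already in the ensemble
    d:    right end of the interval [0, d] on which the training error stays minimal
    Returns the chosen alpha_t and the resulting average margin.
    """
    numerator = c * sum(b) - sum(a)            # sign of d g_average / d alpha_t, as in (9)
    alpha_t = d - eps if numerator > 0 else 0.0
    g_avg = sum((ai + bi * alpha_t) / (c + alpha_t)
                for ai, bi in zip(a, b)) / len(a)
    return alpha_t, g_avg
```

Running this step for every weak classifier in H and keeping the one with the largest increase reproduces the boosting step described above.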

56 Theorem 3 When constrained to the region of zero training error, the greedy coordinate ascent algorithm that maximizes the average margin over all examples converges to an optimal solution. [sent-131, score-0.848]

57 Now consider a greedy coordinate ascent algorithm maximizing the average margin of bottom n′ training examples, gaverage n′ . [sent-132, score-0.998]

58 Maximizing the minimum margin is clearly the special case obtained by choosing n′ = 1. [sent-133, score-0.423]

59 Our aim is to maximize the average margin of the bottom 3 examples. [sent-135, score-0.519]

60 The interval [0, d] of αt indicates an interval where the training error is zero. [sent-136, score-0.204]

61 At the point d, the sample margin m3 alters from positive to negative, which causes the training error to jump from 0 to 1/6. [sent-137, score-0.48]

62 As shown in Figure 2, the margin of each of six training examples is either monotonically increasing or decreasing. [sent-138, score-0.529]

63 In general, the set of bottom n′ training examples on an interval of αt with minimum training error varies over αt , so it is required to precisely track the snapshot of the bottom n′ examples at each different value of α. [sent-140, score-0.614]

64 To address this, we first examine when the margins of two examples intersect. [sent-141, score-0.258]

65 Consider the ith example (xi , yi ) with margin mi = (ai + bi,t αt )/(c + αt ) and the jth example (xj , yj ) with margin mj = (aj + bj,t αt )/(c + αt ). [sent-142, score-0.86]
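Because the two margins share the denominator (c + αt), their crossing point has a closed form; the tiny helper below, with illustrative names, computes it.

```python
def margin_intersection(a_i, b_i, a_j, b_j):
    """alpha_t at which m_i(alpha_t) = m_j(alpha_t).

    Both margins share the denominator (c + alpha_t), so they cross where the
    numerators agree: a_i + b_i * alpha_t = a_j + b_j * alpha_t.
    Returns None when b_i == b_j (the margins never cross).
    """
    if b_i == b_j:
        return None
    return (a_j - a_i) / (b_i - b_j)
```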

66 The greedy coordinate ascent algorithm that sequentially maximizes the average margin of bottom n′ examples is described in Algorithm 2, lines 3-15 are the weak learning steps and the rest are boosting steps. [sent-146, score-1.205]

67 Since the function of the average margin of bottom n′ examples is quasiconcave, we can determine the optimal point q ∗ by Dn′ , and only need to compute the margin value at q ∗ . [sent-148, score-0.876]

68 We add the weak learner which has the largest increment of the average margin over the bottom n′ examples into the ensemble classifier. [sent-149, score-0.723]

69 This procedure terminates if there is no increment in the average margin of bottom n′ examples over the considered weak classifiers. [sent-150, score-0.77]

70 If M weak learners are considered, the computational complexity of Algorithm 2 in the training stage is O (max(n log n, M n′ )) for each iteration. [sent-151, score-0.344]

71 ǫ-relaxation: Unfortunately, there is a fundamental difficulty in the greedy coordinate ascent algorithm that maximizes the average margin of bottom n′ samples: It gets stuck at a corner, from which it is impossible to make progress along any coordinate direction. [sent-154, score-0.887]

72 The main idea is to allow a single coordinate to change even if this worsens the margin function. [sent-157, score-0.443]

73 When a coordinate is changed, it is set to the value that maximizes the margin function along that coordinate, plus or minus ǫ, where ǫ is a positive number. [sent-158, score-0.493]
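A minimal sketch of the ǫ-relaxation idea just described: the chosen coordinate is moved to its line maximizer shifted by ±ǫ, and the move is kept even when it does not improve the objective. The callables and names here are assumptions for illustration, not the paper's interface.

```python
def eps_relaxed_step(objective, alpha, t, line_maximizer, eps=1e-3):
    """One epsilon-relaxed coordinate move on a margin objective.

    objective:      maps a weight vector to the targeted margin value
    line_maximizer: returns the value of coordinate t maximizing the objective
                    with the other coordinates held fixed (a 1-D line search)
    """
    best_val = line_maximizer(alpha, t)
    trials = []
    for shift in (+eps, -eps):
        candidate = list(alpha)
        candidate[t] = best_val + shift
        trials.append((objective(candidate), candidate))
    # accept the better perturbed point even if it is worse than alpha itself,
    # which lets the ascent escape corners where no coordinate move helps
    return max(trials, key=lambda p: p[0])[1]
```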

74 We can design a similar greedy coordinate ascent algorithm to directly maximize the bottom n′ th sample margin by only making a slight modification to Algorithm 2: for a weak classifier, we choose the intersection point that leads to the largest increase of the bottom n′ th margin. [sent-159, score-1.147]

75 As shown in Figure 2, the bottom n′ th margin is a multimodal function; this algorithm, even with ǫ-relaxation, is very sensitive to the choice of n′ , and without ǫ-relaxation it usually gets stuck at a bad coordinatewise point. [sent-161, score-0.609]

76 For all the algorithms in our comparison, we use decision trees with a depth of either 1 or 3 as weak learners, since for the small datasets decision stumps (tree depth of 1) are already strong enough. [sent-165, score-0.369]

77 DirectBoost with decision trees is implemented by a greedy top-down recursive partitioning algorithm to find the tree, but differs from AdaBoost and LPBoost since DirectBoost does not maintain a distribution over training samples. [sent-166, score-0.193]

78 Instead, for each splitting node, DirectBoost simply chooses the attribute to split on by minimizing 0-1 loss or maximizing the predefined margin value. [sent-167, score-0.405]

79 Experiments on UCI data: We first compare DirectBoost with AdaBoost, LogitBoost, soft margin LPBoost and BrownBoost on 10 UCI datasets from the UCI Machine Learning Repository [8]. [sent-172, score-0.377]

80 (Fragment of Table 1) Column headers: Datasets, N, D, depth, AdaBoost, LogitBoost, LPBoost, BrownBoost, DirectBoostavg, DirectBoostǫavg, DirectBoostorder; first row: Tic-tac-toe, N = 958, D = 9, depth 3. [sent-181, score-0.21]

81 Table 1: Percent test errors of AdaBoost, LogitBoost, soft margin LPBoost with column generation, BrownBoost, and three DirectBoost methods on 10 UCI datasets, each with N samples and D attributes. [sent-321, score-0.423]

82 We run the algorithms until convergence where the stopping criterion is that the change of loss is less than 1e-6, and then choose the ensemble classifier from the round with minimum error on the validation data. [sent-323, score-0.252]

83 LPBoost maximizes the soft margin subject to linear constraints; its objective is equivalent to that of DirectBoost maximizing the average margin of the bottom n′ samples [19], thus we set the same candidate parameters n′ /n = {0. [sent-335, score-0.968]

84 For DirectBoost, the algorithm terminates when there is no increment in the targeted margin value, and we select the model with the optimal n′ by the validation set. [sent-343, score-0.464]

85 Clearly DirectBoostavg , DirectBoostǫavg and DirectBoostorder outperform the other boosting algorithms in general; in particular, DirectBoostǫavg is better than AdaBoost, LogitBoost, LPBoost and BrownBoost over all data sets except Cancer-wdbc. [sent-346, score-0.536]

86 Among the family of DirectBoost algorithms, DirectBoostavg wins on two datasets where it searches for the optimal margin solution in the region of zero training error; this means that keeping the training error at zero may lead to good performance in some cases. [sent-347, score-0.641]

87 With ǫ-relaxation, DirectBoostǫavg searches for the optimal margin solution in the whole parameter space and gives the best performance on the remaining 5 data sets. [sent-349, score-0.523]

88 It is well known that AdaBoost performs well on datasets with a small test error, such as Tic-tac-toe and Fourclass, and it is extremely hard for other boosting algorithms to beat AdaBoost there. [sent-350, score-0.247]

89 Figure 3: The value of the average margin of the bottom n′ samples vs. the number of iterations for LPBoost with column generation and DirectBoostǫavg on the Australian dataset; left: decision tree, right: decision stump. [sent-356, score-0.8]

90 When decision stumps are used as weak learners, LPBoost converges to a globally optimal solution, and DirectBoostǫavg nearly converges to [sent-357, score-0.396]

91 the maximum margin, as shown by the right plot in Figure 3, even though no theoretical justification is known for this observed phenomenon. [sent-358, score-0.344]

92 All these three algorithms employ decision trees with a depth of 3 as weak learners. [sent-360, score-0.251]

93 Table 2: Number of iterations and total run times (in seconds) in the training stage on the Adult dataset with 10000 training samples and a decision-tree depth of 3. The experiments are conducted on a PC with ... [sent-361, score-0.26]

94 Evaluate noise robustness: In the experiments conducted below, we evaluate the noise robustness of each boosting method. [sent-369, score-0.226]

95 This is a simple counterexample showing that, for a broad class of convex loss functions, no boosting algorithm is provably robust to random label noise; this class includes AdaBoost, LogitBoost, etc. [sent-371, score-0.267]

96 LPBoost and its variations [25, 26] do not satisfy the preconditions of the theorem presented in [14], but Glocer [12] showed experimentally that these soft margin boosting methods have the same problem as AdaBoost and LogitBoost in handling random noise. [sent-372, score-0.555]

97 Table 4: Percent test errors of AdaBoost (AB), LogitBoost (LB), LPBoost (LPB), BrownBoost (BB), DirectBoostǫavg , and DirectBoostorder on two UCI datasets with random noise. [sent-438, score-0.199]

98 We repeat the synthetic learning problem with binary-valued weak classifiers that is described in [14]. [sent-439, score-0.211]

99 On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms. [sent-552, score-0.37]

100 Direct 0-1 loss minimization and margin maximization with boosting. [sent-592, score-0.375]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('directboost', 0.5), ('margin', 0.344), ('lpboost', 0.315), ('margins', 0.211), ('adaboost', 0.203), ('weak', 0.192), ('avg', 0.179), ('boosting', 0.178), ('logitboost', 0.164), ('brownboost', 0.143), ('directboostorder', 0.143), ('gaverage', 0.143), ('classi', 0.137), ('coordinatewise', 0.129), ('ft', 0.11), ('bottom', 0.107), ('coordinate', 0.099), ('bn', 0.095), ('intercept', 0.087), ('training', 0.087), ('er', 0.085), ('greedy', 0.078), ('ascent', 0.076), ('hk', 0.075), ('xi', 0.074), ('ers', 0.071), ('slope', 0.063), ('mi', 0.059), ('yi', 0.059), ('uci', 0.059), ('directboostavg', 0.057), ('lpb', 0.057), ('ensemble', 0.055), ('maximizes', 0.05), ('error', 0.049), ('minimum', 0.049), ('examples', 0.047), ('adult', 0.046), ('increment', 0.046), ('ei', 0.046), ('ai', 0.046), ('glocer', 0.043), ('quasiconcave', 0.043), ('validation', 0.043), ('derivative', 0.043), ('lb', 0.042), ('intersect', 0.042), ('qj', 0.041), ('dn', 0.039), ('generation', 0.038), ('learners', 0.036), ('average', 0.034), ('interval', 0.034), ('maximize', 0.034), ('label', 0.033), ('soft', 0.033), ('region', 0.033), ('intersection', 0.032), ('depth', 0.031), ('voting', 0.031), ('monotonically', 0.031), ('targeted', 0.031), ('loss', 0.031), ('bb', 0.03), ('maximizing', 0.03), ('descent', 0.03), ('sign', 0.03), ('stage', 0.029), ('ab', 0.029), ('th', 0.029), ('ith', 0.029), ('dborder', 0.029), ('righthand', 0.029), ('decision', 0.028), ('cation', 0.027), ('samples', 0.026), ('negative', 0.025), ('fourclass', 0.025), ('zhai', 0.025), ('stopping', 0.025), ('minimizes', 0.025), ('convex', 0.025), ('mj', 0.025), ('ht', 0.024), ('noise', 0.024), ('percent', 0.024), ('side', 0.023), ('stumps', 0.023), ('learner', 0.023), ('incrementally', 0.023), ('negation', 0.021), ('xia', 0.021), ('wins', 0.021), ('adds', 0.021), ('datasets', 0.02), ('bj', 0.02), ('increasing', 0.02), ('tolerant', 0.02), ('repeat', 0.019), ('sort', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

2 0.15965821 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1

3 0.12760559 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

4 0.10866652 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

5 0.1030203 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . 
(a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). 
The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. 
For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. 
We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. 
Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. 
While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. 
For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. 
Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . 
Proof of Theorem 4.1. For a feasible solution (w, b, ξ_1, …, ξ_m) of the MMC problem,

    Σ_{i=1}^m |⟨w, x_i⟩ + b| ≥ m − Σ_{i=1}^m ξ_i.

Let x̃_i = ⟨w, x_i⟩ + b, m_p = |{i : x̃_i > 0}|, m_n = |{i : x̃_i ≤ 0}|, μ̃_p = (Σ_{i : x̃_i > 0} x̃_i)/m_p and μ̃_n = (Σ_{i : x̃_i ≤ 0} x̃_i)/m_n. Then

    m_p μ̃_p − m_n μ̃_n ≥ m − Σ_i ξ_i.

Without loss of generality, assume m_p ≥ m_n. The balance constraint (Equation (7)) then tells us that m_p ≤ m(1 + ω)/2 and m_n ≥ m(1 − ω)/2, so

    μ̃_p − μ̃_n + ω(μ̃_p + μ̃_n) ≥ 2 − (2/m) Σ_i ξ_i.

Since μ̃_p > 0 and μ̃_n ≤ 0, we have |μ̃_p + μ̃_n| ≤ (μ̃_p − μ̃_n), and hence (1 + ω)(μ̃_p − μ̃_n) ≥ 2 − (2/m) Σ_i ξ_i. For an unsupervised split the data is always separable, since there is no misclassification; this implies that ξ_i ≤ 1 for all i. Hence

    μ̃_p − μ̃_n ≥ ( 2 − (2/m) |{i : ξ_i > 0}| ) / (1 + ω) ≥ 2 (1 − α) / (1 + ω),        (10)

since |{i : ξ_i > 0}| is the number of support vectors in the solution. The Cauchy–Schwarz inequality implies that

    ‖μ(A_l) − μ(A_r)‖ ≥ |⟨w, μ(A_l) − μ(A_r)⟩| / ‖w‖ = (μ̃_p − μ̃_n) γ,

since μ̃_n = ⟨w, μ(A_l)⟩ + b and μ̃_p = ⟨w, μ(A_r)⟩ + b. From Equation (10), it follows that ‖μ(A_l) − μ(A_r)‖² ≥ 4 γ² (1 − α)² / (1 + ω)². Also, for ω-balanced splits, |A_l| |A_r| ≥ (1 − ω²) m² / 4. Combining these in Equation (9) from Proposition 4.1, we have

    V_S(A) − V_S({A_l, A_r}) ≥ (1 − ω²) γ² ( (1 − α)/(1 + ω) )² = γ² (1 − α)² (1 − ω)/(1 + ω).   (11)

Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ_1, …, λ_D the eigenvalues of Cov(A ∩ S). Then

    V_S(A) = (1/|A ∩ S|) Σ_{x ∈ A ∩ S} ‖x − μ(A)‖² = tr(Cov(A ∩ S)) = Σ_{i=1}^D λ_i.

Dividing Equation (11) by V_S(A) gives the statement of the theorem. ∎

5 Conclusions and future directions

Our results theoretically verify that BSP-trees with better vector quantization performance and larger partition margins have better search performance guarantees, as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β∗ while the MM-tree explicitly maximizes γ (which also implies a smaller β∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general-dimensional large margin partitions in the MM-tree construction can be quite expensive. The idea of large margin partitions can nevertheless be used to enhance any simpler space-partition heuristic: for any chosen direction (such as a coordinate axis or the principal eigenvector of the data covariance matrix), a one-dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently, yielding improved search performance.
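As an illustration of this one-dimensional large-margin heuristic, the sketch below projects a cell onto its principal eigenvector and splits at the widest gap among the ω-admissible (balanced) split positions. This gap-based surrogate, the function name and the parameters are illustrative assumptions; it is not the MMC solver used to build the MM-tree.

```python
import numpy as np

def large_margin_split_1d(points, omega=0.2):
    """Split a cell at the widest gap of the 1-D projections onto its principal
    eigenvector, restricted to omega-balanced split positions."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]                        # 1-D projections along the top eigenvector
    order = np.argsort(proj)
    m = len(proj)
    lo = int(np.ceil(m * (1 - omega) / 2))         # smallest admissible left-cell size
    hi = int(np.floor(m * (1 + omega) / 2))        # largest admissible left-cell size
    sorted_proj = proj[order]
    gaps = sorted_proj[lo:hi] - sorted_proj[lo - 1:hi - 1]
    k = lo + int(np.argmax(gaps))                  # split position with the widest gap
    threshold = 0.5 * (sorted_proj[k - 1] + sorted_proj[k])
    left_mask = proj <= threshold
    margin = 0.5 * gaps.max()                      # half the widest admissible gap
    return left_mask, vt[0], threshold, margin

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (300, 10)), rng.normal(+2, 1, (300, 10))])
mask, direction, offset, gamma = large_margin_split_1d(X)
print(mask.sum(), (~mask).sum(), round(gamma, 3))   # roughly balanced cells, wide margin
```

The split costs only a sort of the projections, so it can be dropped into a kd-tree or PA-tree construction with negligible overhead.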
This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. Plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix; this was (empirically) improved upon by learning an orthogonal rotation of the projected data that minimizes the quantization error of each bin in the hash table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity-graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for a given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.
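For reference, here is a minimal sketch of the plain-vanilla LSH scheme mentioned above: random linear projections with random thresholds define the hash-table buckets. The class and its parameters are hypothetical, and a practical implementation would use several tables and multi-probe lookups rather than a single bucket per query.

```python
import numpy as np

class RandomProjectionLSH:
    """Plain-vanilla LSH: k random linear projections with random thresholds
    define a k-bit bucket id for every point."""

    def __init__(self, dim, k=12, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((k, dim))   # random projection directions
        self.b = None                            # thresholds, set from the data range in fit()

    def fit(self, X):
        proj = X @ self.W.T
        rng = np.random.default_rng(1)
        self.b = rng.uniform(proj.min(0), proj.max(0))   # one random threshold per projection
        self.X = X
        self.table = {}
        for i, key in enumerate(self._keys(X)):
            self.table.setdefault(key, []).append(i)
        return self

    def _keys(self, X):
        bits = (X @ self.W.T > self.b).astype(np.uint8)
        return [bytes(row) for row in bits]

    def query(self, q):
        """Return the index of the nearest point among the candidates in q's bucket."""
        candidates = self.table.get(self._keys(q[None, :])[0], [])
        if not candidates:
            return None
        d = np.linalg.norm(self.X[candidates] - q, axis=1)
        return candidates[int(np.argmin(d))]

rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 32))
lsh = RandomProjectionLSH(dim=32, k=12).fit(X)
print(lsh.query(X[7] + 0.01 * rng.standard_normal(32)))   # usually returns 7
```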
References

[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computing, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. In Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. In Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. In Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of the ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

6 0.1018066 171 nips-2013-Learning with Noisy Labels

7 0.099624187 318 nips-2013-Structured Learning via Logistic Regression

8 0.098835848 75 nips-2013-Convex Two-Layer Modeling

9 0.076645546 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

10 0.074134216 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

11 0.071860395 148 nips-2013-Latent Maximum Margin Clustering

12 0.064901181 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

13 0.057857499 89 nips-2013-Dimension-Free Exponentiated Gradient

14 0.05590269 60 nips-2013-Buy-in-Bulk Active Learning

15 0.055667657 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

16 0.053268231 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

17 0.053233292 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

18 0.052449435 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

19 0.050664969 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning

20 0.050293501 158 nips-2013-Learning Multiple Models via Regularized Weighting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.15), (1, 0.03), (2, -0.005), (3, -0.063), (4, 0.095), (5, -0.01), (6, -0.057), (7, 0.007), (8, -0.003), (9, 0.056), (10, -0.004), (11, -0.031), (12, -0.012), (13, -0.027), (14, 0.03), (15, -0.036), (16, -0.039), (17, 0.061), (18, 0.059), (19, -0.045), (20, -0.055), (21, 0.104), (22, 0.084), (23, 0.079), (24, 0.005), (25, 0.008), (26, 0.038), (27, -0.088), (28, 0.085), (29, -0.125), (30, 0.015), (31, 0.086), (32, -0.036), (33, -0.071), (34, 0.101), (35, -0.024), (36, -0.002), (37, -0.117), (38, 0.019), (39, 0.086), (40, -0.074), (41, -0.008), (42, 0.003), (43, 0.073), (44, 0.013), (45, -0.03), (46, -0.071), (47, 0.013), (48, -0.016), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94351888 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

2 0.7605918 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

3 0.71949673 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision datasets, on which it outperforms both online and offline methods in a memory-constrained setting. 1

4 0.69046044 244 nips-2013-Parametric Task Learning

Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima

Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.

5 0.67962229 171 nips-2013-Learning with Noisy Labels

Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari

Abstract: In this paper, we theoretically study the problem of binary classification in the presence of random classification noise — the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is class-conditional — the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence — methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.

6 0.65194768 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

7 0.64384335 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

8 0.62990463 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

9 0.60994256 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

10 0.59883291 318 nips-2013-Structured Learning via Logistic Regression

11 0.58513558 76 nips-2013-Correlated random features for fast semi-supervised learning

12 0.57047695 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

13 0.56630075 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

14 0.54486918 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

15 0.53948677 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

16 0.52026987 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

17 0.51164877 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

18 0.50520211 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

19 0.4956519 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

20 0.49514213 340 nips-2013-Understanding variable importances in forests of randomized trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.01), (16, 0.022), (33, 0.102), (34, 0.088), (41, 0.026), (49, 0.021), (56, 0.092), (70, 0.447), (85, 0.033), (89, 0.023), (93, 0.042), (95, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92905462 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems

Author: Hesham Mostafa, Lorenz. K. Mueller, Giacomo Indiveri

Abstract: We present a recurrent neuronal network, modeled as a continuous-time dynamical system, that can solve constraint satisfaction problems. Discrete variables are represented by coupled Winner-Take-All (WTA) networks, and their values are encoded in localized patterns of oscillations that are learned by the recurrent weights in these networks. Constraints over the variables are encoded in the network connectivity. Although there are no sources of noise, the network can escape from local optima in its search for solutions that satisfy all constraints by modifying the effective network connectivity through oscillations. If there is no solution that satisfies all constraints, the network state changes in a seemingly random manner and its trajectory approximates a sampling procedure that selects a variable assignment with a probability that increases with the fraction of constraints satisfied by this assignment. External evidence, or input to the network, can force variables to specific values. When new inputs are applied, the network re-evaluates the entire set of variables in its search for states that satisfy the maximum number of constraints, while being consistent with the external input. Our results demonstrate that the proposed network architecture can perform a deterministic search for the optimal solution to problems with non-convex cost functions. The network is inspired by canonical microcircuit models of the cortex and suggests possible dynamical mechanisms to solve constraint satisfaction problems that can be present in biological networks, or implemented in neuromorphic electronic circuits. 1

2 0.85256088 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

3 0.83287108 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

Author: Jose Bento, Nate Derbinsky, Javier Alonso-Mora, Jonathan S. Yedidia

Abstract: We describe a novel approach for computing collision-free global trajectories for p agents with specified initial and final configurations, based on an improved version of the alternating direction method of multipliers (ADMM). Compared with existing methods, our approach is naturally parallelizable and allows for incorporating different cost functionals with only minor adjustments. We apply our method to classical challenging instances and observe that its computational requirements scale well with p for several cost functionals. We also show that a specialization of our algorithm can be used for local motion planning by solving the problem of joint optimization in velocity space. 1

4 0.83251691 157 nips-2013-Learning Multi-level Sparse Representations

Author: Ferran Diego Andilla, Fred A. Hamprecht

Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neurosciences image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that allows to optimize the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows to simultaneously interpret a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world Calcium imaging data. 1

same-paper 5 0.82602906 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

6 0.7608456 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average

7 0.73581123 15 nips-2013-A memory frontier for complex synapses

8 0.61381447 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

9 0.60303169 121 nips-2013-Firing rate predictions in optimal balanced networks

10 0.59669882 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

11 0.58452702 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement

12 0.54440212 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

13 0.5387761 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

14 0.53501523 264 nips-2013-Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively

15 0.52297479 64 nips-2013-Compete to Compute

16 0.51147091 255 nips-2013-Probabilistic Movement Primitives

17 0.50643045 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

18 0.49794686 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

19 0.49760327 86 nips-2013-Demixing odors - fast inference in olfaction

20 0.4959985 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking