nips nips2003 nips2003-134 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. [sent-5, score-0.617]
2 A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. [sent-6, score-0.153]
3 Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. [sent-7, score-0.492]
4 For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. [sent-8, score-0.696]
5 We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n^{-1/2}, the parametric rate. [sent-9, score-0.433]
6 We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. [sent-10, score-0.201]
7 1 Introduction We previously studied dyadic classification trees, equipped with simple binary decision rules at each leaf, in [1]. [sent-12, score-0.497]
8 There we applied standard structural risk minimization to derive a pruning rule that minimizes the empirical error plus a complexity penalty proportional to the square root of the size of the tree. [sent-13, score-0.629]
9 Our main result concerned the rate of convergence of the expected error probability of our pruned dyadic classification tree to the Bayes error for a certain class of problems. [sent-14, score-0.928]
10 This class, which essentially requires the Bayes decision boundary to be locally Lipschitz, had previously been studied by Mammen and Tsybakov [2]. [sent-15, score-0.238]
11 They showed the minimax rate of convergence for this class to be n^{-1/d}, where n is the number of labeled training samples, and d is the dimension of each sample. [sent-16, score-0.307]
12 They also demonstrated a classification rule achieving this rate, but the rule requires minimization of the empirical error over the entire class of decision boundaries, an infeasible task in practice. [sent-17, score-0.322]
13 In this paper we exhibit a new pruning strategy that is both computationally efficient and realizes the minimax rate to within a log factor. [sent-19, score-0.598]
14 Those works develop a theory of local uniform convergence, which allows the error to be decomposed in a spatially adaptive way (unlike conventional structural risk minimization). [sent-21, score-0.272]
15 In essence, the associated pruning rules allow a more refined partition in a region where the classification problem is harder (i.e., near the decision boundary). [sent-22, score-0.397]
16 Heuristic arguments and anecdotal evidence in both [3] and [4] suggest that spatially adaptive penalties lead to improved performance compared to “global” penalties. [sent-25, score-0.186]
17 In this work, we give theoretical support to this claim (for a specific kind of classification tree, the DCT) by showing a superior rate of convergence for DCTs pruned according to spatially adaptive penalties. [sent-26, score-0.486]
18 We go on to study DCTs equipped with polynomial classification rules at each leaf. [sent-27, score-0.174]
19 PDCTs can be practically implemented by employing polynomial kernel SVMs at each leaf node of a pruned DCT. [sent-30, score-0.533]
20 For any distribution whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). [sent-31, score-0.543]
21 As γ → ∞ the rate tends to within a log factor of the parametric rate, n^{-1/2}. [sent-32, score-0.251]
22 Perceptron trees, tree classifiers having linear splits at each node, have been investigated by many authors and in particular we point to the works [5,6]. [sent-33, score-0.146]
23 A key aspect of PDCTs is their spatial adaptivity, which enables local (rather than global) polynomial fitting of the decision boundary. [sent-35, score-0.249]
24 Traditional polynomial kernel-based methods are not capable of achieving such rates of convergence due to their lack of spatial adaptivity, and it is unlikely that other kernels can solve this problem for the same reason. [sent-36, score-0.31]
25 Then the error in approximation is O(1), a constant, which is the best one could hope for in learning a Hölder smooth boundary with a traditional polynomial kernel scheme. [sent-38, score-0.266]
26 On the other hand, if we partition the domain into hypercubes of side length O(1/m) and fit an individual polynomial on each hypercube, then the approximation error decays like O(m^{-γ}). [sent-39, score-0.229]
27 On the other hand, pruning back the partition helps to avoid overfitting. [sent-41, score-0.355]
28 2 Dyadic Classification Trees In this section we review our earlier results on dyadic classification trees. [sent-43, score-0.311]
29 DCTs are based on the concept of a cyclic dyadic partition (CDP). [sent-47, score-0.397]
30 We define a dyadic classification tree (DCT) to be a cyclic dyadic partition with a binary class label assigned to each leaf. (Figure 1, panels (a)-(c): Example of a dyadic classification tree when d = 2.) [sent-68, score-1.311]
31 Polynomial-decorated DCTs, discussed in Section 4, are similar in structure, but a polynomial decision rule is employed at each leaf of the pruned tree, instead of a simple binary label. [sent-72, score-0.603]
32 Previously we presented a rule for pruning DCTs with consistency and rate of convergence properties. [sent-76, score-0.527]
33 Let m = 2^J be a dyadic integer, and define T0 to be the DCT that has every leaf node at depth dJ. [sent-78, score-0.652]
34 Then each leaf of T0 corresponds to a cube of side length 1/m, and T0 has m^d total leaf nodes. [sent-79, score-0.487]
35 A subtree T of T0 is referred to as a pruned subtree, denoted T ≤ T0 , if T includes the root of T0 , if every internal node of T has both its children in T , and if the nodes of T inherit their labels from T0 . [sent-81, score-0.503]
36 The size of a tree T , denoted |T |, is the number of leaf nodes. [sent-82, score-0.321]
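As a concrete illustration of the construction just described (a cyclic dyadic partition grown to depth dJ so that every leaf is a cube of side 1/m), here is a minimal sketch in Python; the Node fields, the depth-mod-d choice of split coordinate, and the default leaf label are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class Node:
    """A cell [lower, upper) of the cyclic dyadic partition, plus tree links."""
    def __init__(self, lower, upper, depth):
        self.lower, self.upper, self.depth = lower, upper, depth
        self.left = self.right = None   # children created by a dyadic split
        self.label = 0                  # binary label carried by a leaf

    def is_leaf(self):
        return self.left is None

def build_T0(d, J):
    """Grow the initial DCT T0: split coordinate (depth mod d) at its midpoint,
    cycling through coordinates, down to depth d*J, so every leaf is a cube of
    side length 1/m with m = 2**J (hence m**d leaves in total)."""
    root = Node(np.zeros(d), np.ones(d), depth=0)
    stack = [root]
    while stack:
        v = stack.pop()
        if v.depth == d * J:
            continue                    # leaf of T0 reached
        k = v.depth % d                 # coordinate split at this level
        mid = 0.5 * (v.lower[k] + v.upper[k])
        lo_upper, hi_lower = v.upper.copy(), v.lower.copy()
        lo_upper[k], hi_lower[k] = mid, mid
        v.left = Node(v.lower.copy(), lo_upper, v.depth + 1)
        v.right = Node(hi_lower, v.upper.copy(), v.depth + 1)
        stack.extend([v.left, v.right])
    return root

# Usage: build_T0(d=2, J=3) yields 2**(2*3) = 64 leaf cells of side 1/8.
```
In a real DCT the leaf labels would be set from the training data (e.g., by majority vote within each cell), which this sketch omits.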
37 We defined the complexity penalized dyadic classification tree Tn to be the solution of $T_n = \arg\min_{T \le T_0} \hat{\epsilon}(T) + \alpha_n \sqrt{|T|}$, (1) where $\alpha_n = \sqrt{32 \log(en)/n}$, and $\hat{\epsilon}(T)$ is the empirical error, i.e., the fraction of training samples misclassified by T. [sent-83, score-0.521]
38 (The solution to this pruning problem can be computed efficiently [7]. [sent-86, score-0.301]
39 ) We showed that if X ∈ [0, 1]^d with probability one, and m^d = o(n/ log n), then E{ε(Tn)} → ε* with probability one (i.e., the classifier is consistent). [sent-87, score-0.165]
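To make the objective in (1) concrete, the short sketch below computes the penalized cost ε̂(T) + α_n·sqrt(|T|) for a candidate pruned tree, given only its empirical error and leaf count; the exact form α_n = sqrt(32 log(en)/n) is taken from the reconstruction of (1) above and should be treated as an assumption about the constants.

```python
import math

def alpha_n(n):
    """Penalty weight in (1); assumes alpha_n = sqrt(32 * log(e*n) / n)."""
    return math.sqrt(32.0 * math.log(math.e * n) / n)

def penalized_cost(emp_error, num_leaves, n):
    """Objective of pruning rule (1): empirical error plus a penalty
    proportional to the square root of the tree size |T|."""
    return emp_error + alpha_n(n) * math.sqrt(num_leaves)

# Example: with n = 1000 samples, a 64-leaf tree must cut the empirical error
# by more than the penalty gap to beat a 16-leaf tree under this objective.
n = 1000
print(penalized_cost(0.10, 64, n), penalized_cost(0.12, 16, n))
```
Finding the pruned subtree T ≤ T0 that minimizes this cost over all prunings is what the square-root pruning algorithm of [7] does efficiently.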
40 We also demonstrated a rate of convergence result for Tn , under certain assumptions on the distribution of (X, Y ). [sent-93, score-0.174]
41 Definition 1 Let c1 , c2 > 0, and let m0 be a dyadic integer. [sent-96, score-0.349]
42 A2 (Regularity): For all dyadic integers m ≥ m0, if we subdivide the unit cube into cubes of side length 1/m, the Bayes decision boundary passes through at most c2 m^{d−1} of the resulting m^d cubes. [sent-98, score-0.715]
43 These assumptions are satisfied when the density of X is essentially bounded with respect to Lebesgue measure, and when the Bayes decision boundary for the distribution on (X, Y ) behaves locally like a Lipschitz function. [sent-99, score-0.321]
44 See, for example, the boundary fragment class of [2] with γ = 1 therein. [sent-100, score-0.148]
45 In [1], we showed that if the distribution of (X, Y ) belongs to F, and m ∼ (n/ log n)^{1/(d+1)}, then E{ε(Tn)} − ε* = O((log n/n)^{1/(d+1)}). [sent-101, score-0.132]
46 However, this upper bound on the rate of convergence is not tight. [sent-102, score-0.22]
47 The results of Mammen and Tsybakov [2] show that the minimax rate of convergence, $\inf_{\phi_n} \sup_{F} E\{\epsilon(\phi_n)\} - \epsilon^*$, is on the order of n^{-1/d} (here φn ranges over all possible discrimination rules). [sent-103, score-0.195]
48 In the next section, we introduce a new strategy for pruning DCTs, which leads to an improved rate of convergence of (log n/n)^{1/d} (i.e., within a logarithmic factor of the minimax rate). [sent-104, score-0.475]
49 3 Improved Tree Pruning with Spatially Adaptive Penalties An improved rate of convergence is achieved by pruning the initial tree T0 using a new complexity penalty. [sent-108, score-0.621]
50 Given a node v in a tree T , let Tv denote the subtree of T rooted at v. [sent-109, score-0.418]
51 Let S denote the training data, and let nv denote the number of training samples reaching node v. [sent-110, score-0.193]
52 Consider the pruning rule that selects $T_n = \arg\min_{T \le T_0} \big[\, \hat{\epsilon}(T) + \min_{R \le T} \Delta(T, S, R) \,\big]$, (2) where $\Delta(T, S, R) = \sum_{v \in L(R)} \frac{1}{n} \Big[ \sqrt{48\, n_v |T_v| \log(2n)} + \sqrt{48\, n_v\, d \log(m)} \Big]$. [sent-114, score-0.481]
53 Observe that the penalty is data-dependent (since nv depends on S) and spatially adaptive (choosing R ≤ T to minimize ∆). [sent-115, score-0.275]
54 The first term in the penalty is written $\sum_{v \in L(R)} p_v \sqrt{48 |T_v| \log(2n)/n_v}$, where $p_v = n_v/n$. [sent-117, score-0.193]
55 The second term can be interpreted as the “cost” of spatially decomposing the bound on the generalization error. [sent-119, score-0.163]
56 Consider pruning one of two subtrees, both with the same size, and assume that both options result in the same increase in the empirical error. [sent-121, score-0.301]
57 Then the subtree with more data is selected for pruning. [sent-122, score-0.144]
58 Since deeper nodes typically have less data, this shows that the penalty favors unbalanced trees, which may promote higher resolution (deeper leaf nodes) in the vicinity of the decision boundary. [sent-123, score-0.479]
59 In contrast, the pruning rule (1) penalizes balanced and unbalanced trees (with the same size) equally. [sent-124, score-0.533]
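The following sketch evaluates the spatially adaptive penalty Δ(T, S, R) of (2) for a given set of pruning-root leaves, using the per-leaf term (1/n)[sqrt(48 n_v |T_v| log 2n) + sqrt(48 n_v d log m)] from the reconstruction above; the constants are assumptions carried over from that reconstruction. The two example calls illustrate the point just made: for subtrees of equal size, the data-rich one carries the larger penalty term, so pruning it yields the larger penalty reduction.

```python
import math

def delta(leaves, n, d, m):
    """Spatially adaptive penalty Delta(T, S, R) of rule (2).
    `leaves` lists, for each leaf v of the pruning root set R, a pair
    (n_v, size_Tv): the number of training samples reaching v and the
    number of leaves of the subtree T_v hanging below v."""
    total = 0.0
    for n_v, size_Tv in leaves:
        total += math.sqrt(48.0 * n_v * size_Tv * math.log(2 * n)) / n
        total += math.sqrt(48.0 * n_v * d * math.log(m)) / n
    return total

# Two subtrees of equal size |T_v| = 8; the one reached by more data carries
# the larger penalty, so (all else equal) pruning it is preferred.
n, d, m = 1000, 2, 32
print(delta([(400, 8)], n, d, m))   # data-rich subtree
print(delta([(50, 8)], n, d, m))    # data-poor subtree
```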
60 The following theorem bounds the expected error of Tn . [sent-125, score-0.131]
61 Recall that m specifies the depth of the initial tree T0 . [sent-127, score-0.222]
62 Theorem 1 If m ∼ (n/ log n)^{1/d}, then $E\{\epsilon(T_n) - \epsilon^*\} \le \min_{T \le T_0} \Big[ (\epsilon(T) - \epsilon^*) + E\big\{ \min_{R \le T} \Delta(T, S, R) \big\} \Big] + O\!\left(\frac{\log n}{n}\right)$. [sent-128, score-0.332]
63 Since the bound holds for all T , one feature of the pruning rule (2) is that Tn performs at least as well as the subtree T ≤ T0 that minimizes the bound. [sent-131, score-0.57]
64 This theorem may be applied to give us our desired rate of convergence result. [sent-132, score-0.257]
65 In other words, the pruning rule (2) comes within a log factor of the minimax rate. [sent-135, score-0.587]
66 Within each cube the Bayes decision boundary is described by a function (one coordinate is a function of the others) with Hölder regularity γ. [sent-141, score-0.376]
67 The collection G contains all distributions whose Bayes decision boundaries behave locally like the graph of a function with Hölder regularity γ. [sent-142, score-0.352]
68 We propose a classifier, called a polynomial-decorated dyadic classification tree (PDCT), that achieves fast rates of convergence for distributions satisfying A3. [sent-144, score-0.593]
69 Given a positive integer r, a PDCT of degree r is a DCT, with class labels at each leaf node assigned by a degree r polynomial classifier. [sent-145, score-0.495]
70 Consider the pruning rule that selects $T_{n,r} = \arg\min_{T \le T_0} \big[\, \hat{\epsilon}(T) + \min_{R \le T} \Delta_r(T, S, R) \,\big]$, (3) where $\Delta_r(T, S, R) = \sum_{v \in L(R)} \frac{1}{n} \Big[ \sqrt{48\, n_v V_{d,r} |T_v| \log(2n)} + \sqrt{48\, n_v (d + \gamma) \log(m)} \Big]$. [sent-146, score-0.481]
71 Here $V_{d,r} = \binom{d+r}{r}$ is the VC dimension of the collection of degree r polynomial classifiers in d dimensions. [sent-147, score-0.162]
72 We actually consider a search over all pruned subtrees of T0 , and with all possible configurations of degree r polynomial classifiers at the leaf nodes. [sent-149, score-0.536]
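A rough sketch of the practical recipe from Section 4, polynomial-kernel SVMs fitted within the cells of a dyadic partition, is given below; the use of scikit-learn's SVC, the uniform grid of side-1/m cubes standing in for a pruned DCT's leaves, and the toy data are all assumptions for illustration. The helper also computes V_{d,r} = C(d+r, r) for reference.

```python
import numpy as np
from math import comb
from sklearn.svm import SVC

def vc_dim_poly(d, r):
    """V_{d,r} = C(d+r, r), the VC dimension of degree-r polynomial
    classifiers in d dimensions."""
    return comb(d + r, r)

def fit_pdct_leaves(X, y, m, r):
    """Fit a degree-r polynomial-kernel SVM in each occupied cell of the
    uniform partition of [0,1)^d into cubes of side 1/m (a stand-in for
    the leaves of a pruned DCT)."""
    cells = {}
    idx = np.minimum((X * m).astype(int), m - 1)      # cell index per sample
    for key in set(map(tuple, idx)):
        mask = np.all(idx == key, axis=1)
        if len(np.unique(y[mask])) < 2:               # one class: constant label
            cells[key] = int(y[mask][0])
        else:                                          # cell crossed by boundary
            cells[key] = SVC(kernel="poly", degree=r).fit(X[mask], y[mask])
    return cells

# Toy usage: d = 2, smooth boundary y = 1{x2 > 0.5 + 0.2 sin(4 x1)}.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (X[:, 1] > 0.5 + 0.2 * np.sin(4 * X[:, 0])).astype(int)
leaves = fit_pdct_leaves(X, y, m=4, r=2)
print(vc_dim_poly(d=2, r=2), len(leaves))
```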
73 Moreover, if r = γ − 1, then a decision boundary with Hölder regularity γ is well approximated by a PDCT of degree r. [sent-151, score-0.372]
74 In this case, Tn,r converges to the Bayes risk at rates bounded by the next theorem. [sent-152, score-0.14]
75 Also notice that as γ → ∞, the rate of convergence comes within a logarithmic factor of the parametric rate n^{-1/2}. [sent-156, score-0.371]
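The rates discussed above can be compared side by side: the earlier pruning rule of [1] gives exponent 1/(d+1), the spatially adaptive rule (2) gives 1/d, and the PDCT exponent γ/(d+2γ−2) increases toward 1/2 with the smoothness γ. A small sketch of this arithmetic, with d = 3 chosen purely for illustration:

```python
def pdct_exponent(gamma, d):
    """Exponent in the PDCT rate (log n/n)^(gamma / (d + 2*gamma - 2))."""
    return gamma / (d + 2 * gamma - 2)

d = 3
print("old SRM pruning:", 1 / (d + 1))   # (log n/n)^(1/(d+1))
print("adaptive pruning:", 1 / d)        # (log n/n)^(1/d)
for gamma in (1, 2, 4, 16, 128):
    print("PDCT, gamma =", gamma, "->", pdct_exponent(gamma, d))
# As gamma grows the exponent tends to 1/2, i.e. the parametric rate n^(-1/2);
# for gamma = 1 it reduces to 1/d, matching the DCT rate.
```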
76 5 Efficient Algorithms The optimally pruned subtree Tn of rule (2) can be computed exactly in O(|T0 |2 ) operations. [sent-158, score-0.368]
77 This follows from a simple bottom-up dynamic programming algorithm, which we describe below, and uses a method for “square-root” pruning studied in [7]. [sent-159, score-0.301]
78 Given a node v ∈ T0, let T_v^* be the subtree of T0 rooted at v that minimizes the objective function of (2), and let R_v^* be the associated subtree that minimizes ∆(T_v^*, S, R). [sent-163, score-0.508]
79 If v is a leaf node of T0, then clearly T_v^* = R_v^* = {v}. [sent-165, score-0.265]
80 Using the first algorithm in [7], the overall pruning procedure may be accomplished in O(|T0|^2) operations. [sent-169, score-0.301]
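The bottom-up dynamic program can be sketched as follows; the additive per-leaf `penalty_at` callback is a simplification standing in for the square-root penalty of (2), whose exact handling requires the square-root pruning method of [7], and the node attributes (.left, .right, .leaf_error) are assumed for illustration.

```python
def prune_bottom_up(v, penalty_at):
    """Generic bottom-up pruning recursion: return (best_cost, keep_children)
    for the subtree rooted at v, where
        best_cost = min( cost of turning v into a leaf,
                         sum of the children's best costs ).
    `v.leaf_error` is v's contribution to the empirical error if it is made a
    leaf, and `penalty_at(v)` is an additive per-leaf penalty standing in for
    the square-root penalty of rule (2)."""
    prune_here = v.leaf_error + penalty_at(v)
    if v.left is None:                      # leaf of T0: nothing below to keep
        return prune_here, False
    left_cost, _ = prune_bottom_up(v.left, penalty_at)
    right_cost, _ = prune_bottom_up(v.right, penalty_at)
    keep_cost = left_cost + right_cost
    if keep_cost < prune_here:
        return keep_cost, True              # keep the children of v
    return prune_here, False                # prune the subtree at v
```
A full implementation would record the keep/prune decision at every node so the optimal pruned subtree T_v^* can be reconstructed after the recursion.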
81 Determining the optimally pruned degree r PDCT is more challenging. [sent-170, score-0.211]
82 The problem requires the construction, at each node of T0, of a polynomial classifier of degree r having minimum empirical error. [sent-171, score-0.225]
83 Moreover, linear SVMs in perceptron trees have been shown to work well [6]. [sent-175, score-0.175]
84 6 Conclusions A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. [sent-176, score-0.153]
85 Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion that promotes unbalanced trees that focus on the decision boundary. [sent-177, score-0.78]
86 For distributions on (X, Y ) whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). [sent-178, score-0.491]
87 Polynomial kernel methods are not capable of achieving such rates due to their lack of spatial adaptivity. [sent-179, score-0.131]
88 When γ = 1, the DCT convergence rate is within a logarithmic factor of the minimax optimal rate. [sent-180, score-0.354]
89 As γ → ∞ the rate tends to within a log factor of n^{-1/2}, the parametric rate. [sent-181, score-0.251]
90 However, the rates for γ > 1 are not within a logarithmic factor of the minimax rate [2]. [sent-182, score-0.324]
91 Given S ∈ Ωm, we know $\epsilon(T_n) \le \hat{\epsilon}(T_n) + \min_{R \le T_n} f(T_n, S, R, 3m^{-d}) = \hat{\epsilon}(T_n) + \min_{R \le T_n} \Delta(T_n, S, R) + \frac{4d \log(m)}{n} \le \hat{\epsilon}(T) + \min_{R \le T} \Delta(T, S, R) + \frac{4d \log(m)}{n}$, where the last inequality comes from the definition of Tn. [sent-193, score-0.192]
92 1 Proof of Theorem 2 By Theorem 1, it suffices to find a tree T* ≤ T0 such that $E\big\{ \min_{R \le T^*} \Delta(T^*, S, R) \big\} + (\epsilon(T^*) - \epsilon^*) = O\Big( \big(\tfrac{\log n}{n}\big)^{1/d} \Big)$. [sent-197, score-0.248]
93 Define T ∗ to be the tree obtained by pruning back T0 at every node (thought of as a region of space) that does not intersect the Bayes decision boundary. [sent-198, score-0.645]
94 Define R∗ to be the pruned subtree of T ∗ consisting of all nodes in T ∗ up to depth j0 d, where j0 = J − (1/d) log2 (J) (truncated if necessary). [sent-202, score-0.442]
95 The first term, consisting of a sum of terms over the leaf nodes of R∗ , is dominated by the sum of those terms over the leaf nodes of R∗ at depth j0 d. [sent-212, score-0.526]
96 Recall the leaf nodes of T ∗ at maximum depth are cells of side length 1/m. [sent-219, score-0.332]
97 Nowak, “Dyadic classification trees via structural risk minimization,” in Advances in Neural Information Processing Systems 14, S. [sent-228, score-0.218]
98 Mansour, “A fast, bottom-up decision tree pruning algorithm with nearoptimal generalization,” in International Conference on Machine Learning, 1998, pp. [sent-241, score-0.555]
99 Wu, “Enlarging the margins in perceptron decision trees,” Machine Learning, vol. [sent-258, score-0.149]
100 Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. [sent-277, score-0.665]
wordName wordTfidf (topN-words)
[('tn', 0.313), ('dyadic', 0.311), ('tv', 0.305), ('pruning', 0.301), ('dcts', 0.248), ('rv', 0.207), ('leaf', 0.175), ('pruned', 0.172), ('tree', 0.146), ('pdct', 0.145), ('subtree', 0.144), ('dct', 0.144), ('trees', 0.134), ('bayes', 0.128), ('decision', 0.108), ('minimax', 0.104), ('classi', 0.104), ('log', 0.102), ('polynomial', 0.096), ('rate', 0.091), ('node', 0.09), ('spatially', 0.09), ('boundary', 0.09), ('theorem', 0.083), ('convergence', 0.083), ('cdp', 0.083), ('lder', 0.083), ('mammen', 0.083), ('nowak', 0.083), ('ri', 0.079), ('depth', 0.076), ('mansour', 0.072), ('tsybakov', 0.072), ('adaptivity', 0.072), ('penalty', 0.07), ('nv', 0.065), ('min', 0.064), ('md', 0.063), ('rice', 0.062), ('ers', 0.061), ('risk', 0.056), ('scott', 0.055), ('partition', 0.054), ('subtrees', 0.054), ('rates', 0.053), ('regularity', 0.052), ('behaves', 0.052), ('rule', 0.052), ('adaptive', 0.05), ('nodes', 0.05), ('lemma', 0.049), ('logarithmic', 0.048), ('error', 0.048), ('root', 0.047), ('penalties', 0.046), ('mcallester', 0.046), ('lipschitz', 0.046), ('unbalanced', 0.046), ('bound', 0.046), ('spatial', 0.045), ('rk', 0.043), ('cube', 0.043), ('boundaries', 0.042), ('rules', 0.042), ('cation', 0.042), ('perceptron', 0.041), ('pdcts', 0.041), ('resolvability', 0.041), ('locally', 0.04), ('degree', 0.039), ('smooth', 0.039), ('let', 0.038), ('equipped', 0.036), ('subdivide', 0.036), ('okamoto', 0.036), ('achieving', 0.033), ('cubes', 0.033), ('collections', 0.033), ('cyclic', 0.032), ('inequalities', 0.032), ('bounded', 0.031), ('side', 0.031), ('parametric', 0.03), ('deeper', 0.03), ('belongs', 0.03), ('proof', 0.029), ('class', 0.029), ('bennett', 0.029), ('fragment', 0.029), ('pv', 0.029), ('lebesgue', 0.029), ('factor', 0.028), ('structural', 0.028), ('collection', 0.027), ('minimizes', 0.027), ('slower', 0.027), ('aware', 0.027), ('svms', 0.027), ('integer', 0.027), ('generalization', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.11207295 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
3 0.10585114 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 10^6 dimensions and 10^5 records, and show non-trivial speedups while giving exact answers. 1
4 0.10490619 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
5 0.098512203 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
6 0.089494355 194 nips-2003-Warped Gaussian Processes
7 0.088718504 172 nips-2003-Semi-Supervised Learning with Trees
8 0.081217803 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
9 0.078463167 189 nips-2003-Tree-structured Approximations by Expectation Propagation
10 0.070045583 51 nips-2003-Design of Experiments via Information Theory
11 0.066400029 128 nips-2003-Minimax Embeddings
12 0.064199284 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
13 0.063787602 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
14 0.061404828 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
15 0.061156433 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.061018191 168 nips-2003-Salient Boundary Detection using Ratio Contour
17 0.060097985 42 nips-2003-Bounded Finite State Controllers
18 0.059643768 122 nips-2003-Margin Maximizing Loss Functions
19 0.059373621 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.05752394 151 nips-2003-PAC-Bayesian Generic Chaining
topicId topicWeight
[(0, -0.194), (1, -0.041), (2, -0.056), (3, -0.003), (4, 0.061), (5, -0.064), (6, -0.054), (7, -0.032), (8, -0.043), (9, 0.144), (10, 0.018), (11, 0.015), (12, 0.045), (13, 0.036), (14, 0.028), (15, -0.06), (16, -0.057), (17, 0.114), (18, 0.169), (19, -0.02), (20, -0.023), (21, -0.103), (22, -0.06), (23, 0.004), (24, 0.005), (25, 0.103), (26, 0.085), (27, 0.153), (28, 0.147), (29, -0.036), (30, 0.04), (31, 0.104), (32, 0.173), (33, 0.121), (34, -0.141), (35, -0.131), (36, -0.085), (37, 0.043), (38, -0.031), (39, 0.025), (40, -0.01), (41, 0.091), (42, 0.117), (43, -0.133), (44, 0.075), (45, -0.076), (46, -0.027), (47, -0.02), (48, -0.031), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.9501487 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.72886616 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 10^6 dimensions and 10^5 records, and show non-trivial speedups while giving exact answers. 1
3 0.63748837 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
4 0.63073659 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
5 0.50087577 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
6 0.44161585 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
7 0.43186438 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
8 0.39860681 181 nips-2003-Statistical Debugging of Sampled Programs
9 0.38720408 3 nips-2003-AUC Optimization vs. Error Rate Minimization
10 0.37986505 99 nips-2003-Kernels for Structured Natural Language Data
11 0.35846967 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
12 0.35571325 51 nips-2003-Design of Experiments via Information Theory
13 0.3557086 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
14 0.34882116 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
15 0.340754 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
16 0.3405219 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
17 0.33925122 189 nips-2003-Tree-structured Approximations by Expectation Propagation
18 0.33348337 187 nips-2003-Training a Quantum Neural Network
19 0.32639745 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
20 0.31588364 30 nips-2003-Approximability of Probability Distributions
topicId topicWeight
[(0, 0.06), (1, 0.315), (11, 0.02), (29, 0.01), (30, 0.015), (35, 0.047), (53, 0.075), (69, 0.026), (71, 0.066), (76, 0.029), (85, 0.134), (91, 0.078), (99, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.78713983 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.68819761 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
3 0.53221786 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
4 0.52458876 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
5 0.52042919 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.51564687 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
7 0.50933367 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
8 0.50896114 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
9 0.50800198 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
10 0.5064593 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
11 0.50568491 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
12 0.50433171 78 nips-2003-Gaussian Processes in Reinforcement Learning
13 0.50427973 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.50250107 41 nips-2003-Boosting versus Covering
15 0.50240874 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
16 0.50069249 158 nips-2003-Policy Search by Dynamic Programming
17 0.4977012 172 nips-2003-Semi-Supervised Learning with Trees
18 0.49561793 113 nips-2003-Learning with Local and Global Consistency
19 0.49471369 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
20 0.49426118 145 nips-2003-Online Classification on a Budget