nips nips2003 nips2003-134 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. [sent-5, score-0.617]
2 A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. [sent-6, score-0.153]
3 Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. [sent-7, score-0.492]
4 For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. [sent-8, score-0.696]
5 We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n^{-1/2}, the parametric rate. [sent-9, score-0.433]
6 We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. [sent-10, score-0.201]
7 1 Introduction We previously studied dyadic classification trees, equipped with simple binary decision rules at each leaf, in [1]. [sent-12, score-0.497]
8 There we applied standard structural risk minimization to derive a pruning rule that minimizes the empirical error plus a complexity penalty proportional to the square root of the size of the tree. [sent-13, score-0.629]
9 Our main result concerned the rate of convergence of the expected error probability of our pruned dyadic classification tree to the Bayes error for a certain class of problems. [sent-14, score-0.928]
10 This class, which essentially requires the Bayes decision boundary to be locally Lipschitz, had previously been studied by Mammen and Tsybakov [2]. [sent-15, score-0.238]
11 They showed the minimax rate of convergence for this class to be n^{-1/d}, where n is the number of labeled training samples, and d is the dimension of each sample. [sent-16, score-0.307]
12 They also demonstrated a classification rule achieving this rate, but the rule requires minimization of the empirical error over the entire class of decision boundaries, an infeasible task in practice. [sent-17, score-0.322]
13 In this paper we exhibit a new pruning strategy that is both computationally efficient and realizes the minimax rate to within a log factor. [sent-19, score-0.598]
14 Those works develop a theory of local uniform convergence, which allows the error to be decomposed in a spatially adaptive way (unlike conventional structural risk minimization). [sent-21, score-0.272]
15 In essence, the associated pruning rules allow a more refined partition in a region where the classification problem is harder (i.e., near the decision boundary). [sent-22, score-0.397]
16 Heuristic arguments and anecdotal evidence in both [3] and [4] suggest that spatially adaptive penalties lead to improved performance compared to “global” penalties. [sent-25, score-0.186]
17 In this work, we give theoretical support to this claim (for a specific kind of classification tree, the DCT) by showing a superior rate of convergence for DCTs pruned according to spatially adaptive penalties. [sent-26, score-0.486]
18 We go on to study DCTs equipped with polynomial classification rules at each leaf. [sent-27, score-0.174]
19 PDCTs can be practically implemented by employing polynomial kernel SVMs at each leaf node of a pruned DCT. [sent-30, score-0.533]
20 For any distribution whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). [sent-31, score-0.543]
21 As γ → ∞ the rate tends to within a log factor of the parametric rate, n^{-1/2}. [sent-32, score-0.251]
22 Perceptron trees, tree classifiers having linear splits at each node, have been investigated by many authors and in particular we point to the works [5,6]. [sent-33, score-0.146]
23 A key aspect of PDCTs is their spatial adaptivity, which enables local (rather than global) polynomial fitting of the decision boundary. [sent-35, score-0.249]
24 Traditional polynomial kernel-based methods are not capable of achieving such rates of convergence due to their lack of spatial adaptivity, and it is unlikely that other kernels can solve this problem for the same reason. [sent-36, score-0.31]
25 Then the error in approximation is O(1), a constant, which is the best one could hope for in learning a Hölder smooth boundary with a traditional polynomial kernel scheme. [sent-38, score-0.266]
26 On the other hand, if we partition the domain into hypercubes of side length O(1/m) and fit an individual polynomial on each hypercube, then the approximation error decays like O(m^{-γ}). [sent-39, score-0.229]
27 On the other hand, pruning back the partition helps to avoid overfitting. [sent-41, score-0.355]
28 2 Dyadic Classification Trees In this section we review our earlier results on dyadic classification trees. [sent-43, score-0.311]
29 DCTs are based on the concept of a cyclic dyadic partition (CDP). [sent-47, score-0.397]
30 We define a dyadic classification tree (DCT) to be a cyclic dyadic partition with a binary class label assigned to each leaf. (Figure 1, panels (a)-(c): Example of a dyadic classification tree when d = 2.) [sent-68, score-1.311]
31 Polynomial-decorated DCTs, discussed in Section 4, are similar in structure, but a polynomial decision rule is employed at each leaf of the pruned tree, instead of a simple binary label. [sent-72, score-0.603]
32 Previously we presented a rule for pruning DCTs with consistency and rate of convergence properties. [sent-76, score-0.527]
33 Let m = 2^J be a dyadic integer, and define T0 to be the DCT that has every leaf node at depth dJ. [sent-78, score-0.652]
34 Then each leaf of T0 corresponds to a cube of side length 1/m, and T0 has m^d total leaf nodes. [sent-79, score-0.487]
35 A subtree T of T0 is referred to as a pruned subtree, denoted T ≤ T0 , if T includes the root of T0 , if every internal node of T has both its children in T , and if the nodes of T inherit their labels from T0 . [sent-81, score-0.503]
36 The size of a tree T , denoted |T |, is the number of leaf nodes. [sent-82, score-0.321]
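As a concrete illustration of the construction just described (a cyclic dyadic partition grown to depth dJ so that every leaf is a cube of side 1/m), here is a minimal sketch in Python; the Node fields, the depth-mod-d choice of split coordinate, and the default leaf label are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class Node:
    """A cell [lower, upper) of the cyclic dyadic partition, plus tree links."""
    def __init__(self, lower, upper, depth):
        self.lower, self.upper, self.depth = lower, upper, depth
        self.left = self.right = None   # children created by a dyadic split
        self.label = 0                  # binary label carried by a leaf

    def is_leaf(self):
        return self.left is None

def build_T0(d, J):
    """Grow the initial DCT T0: split coordinate (depth mod d) at its midpoint,
    cycling through coordinates, down to depth d*J, so every leaf is a cube of
    side length 1/m with m = 2**J (hence m**d leaves in total)."""
    root = Node(np.zeros(d), np.ones(d), depth=0)
    stack = [root]
    while stack:
        v = stack.pop()
        if v.depth == d * J:
            continue                    # leaf of T0 reached
        k = v.depth % d                 # coordinate split at this level
        mid = 0.5 * (v.lower[k] + v.upper[k])
        lo_upper, hi_lower = v.upper.copy(), v.lower.copy()
        lo_upper[k], hi_lower[k] = mid, mid
        v.left = Node(v.lower.copy(), lo_upper, v.depth + 1)
        v.right = Node(hi_lower, v.upper.copy(), v.depth + 1)
        stack.extend([v.left, v.right])
    return root

# Usage: build_T0(d=2, J=3) yields 2**(2*3) = 64 leaf cells of side 1/8.
```
In a real DCT the leaf labels would be set from the training data (e.g., by majority vote within each cell), which this sketch omits.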
37 We defined the complexity penalized dyadic classification tree Tn to be the solution of $T_n = \arg\min_{T \le T_0} \hat{\epsilon}(T) + \alpha_n \sqrt{|T|}$, (1) where $\alpha_n = \sqrt{32 \log(en)/n}$, and $\hat{\epsilon}(T)$ is the empirical error, i.e., the fraction of training samples misclassified by T. [sent-83, score-0.521]
38 (The solution to this pruning problem can be computed efficiently [7]. [sent-86, score-0.301]
39 ) We showed that if X ∈ [0, 1]^d with probability one, and m^d = o(n/ log n), then E{ε(Tn)} → ε* with probability one (i.e., the classifier is consistent). [sent-87, score-0.165]
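To make the objective in (1) concrete, the short sketch below computes the penalized cost ε̂(T) + α_n·sqrt(|T|) for a candidate pruned tree, given only its empirical error and leaf count; the exact form α_n = sqrt(32 log(en)/n) is taken from the reconstruction of (1) above and should be treated as an assumption about the constants.

```python
import math

def alpha_n(n):
    """Penalty weight in (1); assumes alpha_n = sqrt(32 * log(e*n) / n)."""
    return math.sqrt(32.0 * math.log(math.e * n) / n)

def penalized_cost(emp_error, num_leaves, n):
    """Objective of pruning rule (1): empirical error plus a penalty
    proportional to the square root of the tree size |T|."""
    return emp_error + alpha_n(n) * math.sqrt(num_leaves)

# Example: with n = 1000 samples, a 64-leaf tree must cut the empirical error
# by more than the penalty gap to beat a 16-leaf tree under this objective.
n = 1000
print(penalized_cost(0.10, 64, n), penalized_cost(0.12, 16, n))
```
Finding the pruned subtree T ≤ T0 that minimizes this cost over all prunings is what the square-root pruning algorithm of [7] does efficiently.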
40 We also demonstrated a rate of convergence result for Tn , under certain assumptions on the distribution of (X, Y ). [sent-93, score-0.174]
41 Definition 1 Let c1 , c2 > 0, and let m0 be a dyadic integer. [sent-96, score-0.349]
42 A2 (Regularity): For all dyadic integers m ≥ m0, if we subdivide the unit cube into cubes of side length 1/m, the Bayes decision boundary passes through at most c2 m^{d−1} of the resulting m^d cubes. [sent-98, score-0.715]
43 These assumptions are satisfied when the density of X is essentially bounded with respect to Lebesgue measure, and when the Bayes decision boundary for the distribution on (X, Y ) behaves locally like a Lipschitz function. [sent-99, score-0.321]
44 See, for example, the boundary fragment class of [2] with γ = 1 therein. [sent-100, score-0.148]
45 In [1], we showed that if the distribution of (X, Y ) belongs to F, and m ∼ (n/ log n)^{1/(d+1)}, then E{ε(Tn)} − ε* = O((log n/n)^{1/(d+1)}). [sent-101, score-0.132]
46 However, this upper bound on the rate of convergence is not tight. [sent-102, score-0.22]
47 The results of Mammen and Tsybakov [2] show that the minimax rate of convergence, $\inf_{\phi_n} \sup_{F} E\{\epsilon(\phi_n)\} - \epsilon^*$, is on the order of n^{-1/d} (here φn ranges over all possible discrimination rules). [sent-103, score-0.195]
48 In the next section, we introduce a new strategy for pruning DCTs, which leads to an improved rate of convergence of (log n/n)^{1/d} (i.e., within a logarithmic factor of the minimax rate). [sent-104, score-0.475]
49 3 Improved Tree Pruning with Spatially Adaptive Penalties An improved rate of convergence is achieved by pruning the initial tree T0 using a new complexity penalty. [sent-108, score-0.621]
50 Given a node v in a tree T , let Tv denote the subtree of T rooted at v. [sent-109, score-0.418]
51 Let S denote the training data, and let nv denote the number of training samples reaching node v. [sent-110, score-0.193]
52 Consider the pruning rule that selects $T_n = \arg\min_{T \le T_0} \big[\, \hat{\epsilon}(T) + \min_{R \le T} \Delta(T, S, R) \,\big]$, (2) where $\Delta(T, S, R) = \sum_{v \in L(R)} \frac{1}{n} \Big[ \sqrt{48\, n_v |T_v| \log(2n)} + \sqrt{48\, n_v\, d \log(m)} \Big]$. [sent-114, score-0.481]
53 Observe that the penalty is data-dependent (since nv depends on S) and spatially adaptive (choosing R ≤ T to minimize ∆). [sent-115, score-0.275]
54 The first term in the penalty is written $\sum_{v \in L(R)} p_v \sqrt{48 |T_v| \log(2n)/n_v}$, where $p_v = n_v/n$. [sent-117, score-0.193]
55 The second term can be interpreted as the “cost” of spatially decomposing the bound on the generalization error. [sent-119, score-0.163]
56 Consider pruning one of two subtrees, both with the same size, and assume that both options result in the same increase in the empirical error. [sent-121, score-0.301]
57 Then the subtree with more data is selected for pruning. [sent-122, score-0.144]
58 Since deeper nodes typically have less data, this shows that the penalty favors unbalanced trees, which may promote higher resolution (deeper leaf nodes) in the vicinity of the decision boundary. [sent-123, score-0.479]
59 In contrast, the pruning rule (1) penalizes balanced and unbalanced trees (with the same size) equally. [sent-124, score-0.533]
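The following sketch evaluates the spatially adaptive penalty Δ(T, S, R) of (2) for a given set of pruning-root leaves, using the per-leaf term (1/n)[sqrt(48 n_v |T_v| log 2n) + sqrt(48 n_v d log m)] from the reconstruction above; the constants are assumptions carried over from that reconstruction. The two example calls illustrate the point just made: for subtrees of equal size, the data-rich one carries the larger penalty term, so pruning it yields the larger penalty reduction.

```python
import math

def delta(leaves, n, d, m):
    """Spatially adaptive penalty Delta(T, S, R) of rule (2).
    `leaves` lists, for each leaf v of the pruning root set R, a pair
    (n_v, size_Tv): the number of training samples reaching v and the
    number of leaves of the subtree T_v hanging below v."""
    total = 0.0
    for n_v, size_Tv in leaves:
        total += math.sqrt(48.0 * n_v * size_Tv * math.log(2 * n)) / n
        total += math.sqrt(48.0 * n_v * d * math.log(m)) / n
    return total

# Two subtrees of equal size |T_v| = 8; the one reached by more data carries
# the larger penalty, so (all else equal) pruning it is preferred.
n, d, m = 1000, 2, 32
print(delta([(400, 8)], n, d, m))   # data-rich subtree
print(delta([(50, 8)], n, d, m))    # data-poor subtree
```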
60 The following theorem bounds the expected error of Tn . [sent-125, score-0.131]
61 Recall that m specifies the depth of the initial tree T0 . [sent-127, score-0.222]
62 Theorem 1 If m ∼ (n/ log n)^{1/d}, then $E\{\epsilon(T_n) - \epsilon^*\} \le \min_{T \le T_0} \Big[ (\epsilon(T) - \epsilon^*) + E\big\{ \min_{R \le T} \Delta(T, S, R) \big\} \Big] + O\!\left(\frac{\log n}{n}\right)$. [sent-128, score-0.332]
63 Since the bound holds for all T , one feature of the pruning rule (2) is that Tn performs at least as well as the subtree T ≤ T0 that minimizes the bound. [sent-131, score-0.57]
64 This theorem may be applied to give us our desired rate of convergence result. [sent-132, score-0.257]
65 In other words, the pruning rule (2) comes within a log factor of the minimax rate. [sent-135, score-0.587]
66 Within each cube the Bayes decision boundary is described by a function (one coordinate is a function of the others) with Hölder regularity γ. [sent-141, score-0.376]
67 The collection G contains all distributions whose Bayes decision boundaries behave locally like the graph of a function with Hölder regularity γ. [sent-142, score-0.352]
68 We propose a classifier, called a polynomial-decorated dyadic classification tree (PDCT), that achieves fast rates of convergence for distributions satisfying A3. [sent-144, score-0.593]
69 Given a positive integer r, a PDCT of degree r is a DCT, with class labels at each leaf node assigned by a degree r polynomial classifier. [sent-145, score-0.495]
70 Consider the pruning rule that selects $T_{n,r} = \arg\min_{T \le T_0} \big[\, \hat{\epsilon}(T) + \min_{R \le T} \Delta_r(T, S, R) \,\big]$, (3) where $\Delta_r(T, S, R) = \sum_{v \in L(R)} \frac{1}{n} \Big[ \sqrt{48\, n_v V_{d,r} |T_v| \log(2n)} + \sqrt{48\, n_v (d + \gamma) \log(m)} \Big]$. [sent-146, score-0.481]
71 Here $V_{d,r} = \binom{d+r}{r}$ is the VC dimension of the collection of degree r polynomial classifiers in d dimensions. [sent-147, score-0.162]
72 We actually consider a search over all pruned subtrees of T0 , and with all possible configurations of degree r polynomial classifiers at the leaf nodes. [sent-149, score-0.536]
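A rough sketch of the practical recipe from Section 4, polynomial-kernel SVMs fitted within the cells of a dyadic partition, is given below; the use of scikit-learn's SVC, the uniform grid of side-1/m cubes standing in for a pruned DCT's leaves, and the toy data are all assumptions for illustration. The helper also computes V_{d,r} = C(d+r, r) for reference.

```python
import numpy as np
from math import comb
from sklearn.svm import SVC

def vc_dim_poly(d, r):
    """V_{d,r} = C(d+r, r), the VC dimension of degree-r polynomial
    classifiers in d dimensions."""
    return comb(d + r, r)

def fit_pdct_leaves(X, y, m, r):
    """Fit a degree-r polynomial-kernel SVM in each occupied cell of the
    uniform partition of [0,1)^d into cubes of side 1/m (a stand-in for
    the leaves of a pruned DCT)."""
    cells = {}
    idx = np.minimum((X * m).astype(int), m - 1)      # cell index per sample
    for key in set(map(tuple, idx)):
        mask = np.all(idx == key, axis=1)
        if len(np.unique(y[mask])) < 2:               # one class: constant label
            cells[key] = int(y[mask][0])
        else:                                          # cell crossed by boundary
            cells[key] = SVC(kernel="poly", degree=r).fit(X[mask], y[mask])
    return cells

# Toy usage: d = 2, smooth boundary y = 1{x2 > 0.5 + 0.2 sin(4 x1)}.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (X[:, 1] > 0.5 + 0.2 * np.sin(4 * X[:, 0])).astype(int)
leaves = fit_pdct_leaves(X, y, m=4, r=2)
print(vc_dim_poly(d=2, r=2), len(leaves))
```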
73 Moreover, if r = γ − 1, then a decision boundary with Hölder regularity γ is well approximated by a PDCT of degree r. [sent-151, score-0.372]
74 In this case, Tn,r converges to the Bayes risk at rates bounded by the next theorem. [sent-152, score-0.14]
75 Also notice that as γ → ∞, the rate of convergence comes within a logarithmic factor of the parametric rate n^{-1/2}. [sent-156, score-0.371]
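The rates discussed above can be compared side by side: the earlier pruning rule of [1] gives exponent 1/(d+1), the spatially adaptive rule (2) gives 1/d, and the PDCT exponent γ/(d+2γ−2) increases toward 1/2 with the smoothness γ. A small sketch of this arithmetic, with d = 3 chosen purely for illustration:

```python
def pdct_exponent(gamma, d):
    """Exponent in the PDCT rate (log n/n)^(gamma / (d + 2*gamma - 2))."""
    return gamma / (d + 2 * gamma - 2)

d = 3
print("old SRM pruning:", 1 / (d + 1))   # (log n/n)^(1/(d+1))
print("adaptive pruning:", 1 / d)        # (log n/n)^(1/d)
for gamma in (1, 2, 4, 16, 128):
    print("PDCT, gamma =", gamma, "->", pdct_exponent(gamma, d))
# As gamma grows the exponent tends to 1/2, i.e. the parametric rate n^(-1/2);
# for gamma = 1 it reduces to 1/d, matching the DCT rate.
```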
76 5 Efficient Algorithms The optimally pruned subtree Tn of rule (2) can be computed exactly in O(|T0 |2 ) operations. [sent-158, score-0.368]
77 This follows from a simple bottom-up dynamic programming algorithm, which we describe below, and uses a method for “square-root” pruning studied in [7]. [sent-159, score-0.301]
78 Given a node v ∈ T0, let T_v^* be the subtree of T0 rooted at v that minimizes the objective function of (2), and let R_v^* be the associated subtree that minimizes ∆(T_v^*, S, R). [sent-163, score-0.508]
79 If v is a leaf node of T0, then clearly T_v^* = R_v^* = {v}. [sent-165, score-0.265]
80 Using the first algorithm in [7], the overall pruning procedure may be accomplished in O(|T0|^2) operations. [sent-169, score-0.301]
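The bottom-up dynamic program can be sketched as follows; the additive per-leaf `penalty_at` callback is a simplification standing in for the square-root penalty of (2), whose exact handling requires the square-root pruning method of [7], and the node attributes (.left, .right, .leaf_error) are assumed for illustration.

```python
def prune_bottom_up(v, penalty_at):
    """Generic bottom-up pruning recursion: return (best_cost, keep_children)
    for the subtree rooted at v, where
        best_cost = min( cost of turning v into a leaf,
                         sum of the children's best costs ).
    `v.leaf_error` is v's contribution to the empirical error if it is made a
    leaf, and `penalty_at(v)` is an additive per-leaf penalty standing in for
    the square-root penalty of rule (2)."""
    prune_here = v.leaf_error + penalty_at(v)
    if v.left is None:                      # leaf of T0: nothing below to keep
        return prune_here, False
    left_cost, _ = prune_bottom_up(v.left, penalty_at)
    right_cost, _ = prune_bottom_up(v.right, penalty_at)
    keep_cost = left_cost + right_cost
    if keep_cost < prune_here:
        return keep_cost, True              # keep the children of v
    return prune_here, False                # prune the subtree at v
```
A full implementation would record the keep/prune decision at every node so the optimal pruned subtree T_v^* can be reconstructed after the recursion.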
81 Determining the optimally pruned degree r PDCT is more challenging. [sent-170, score-0.211]
82 The problem requires the construction, at each node of T0, of a polynomial classifier of degree r having minimum empirical error. [sent-171, score-0.225]
83 Moreover, linear SVMs in perceptron trees have been shown to work well [6]. [sent-175, score-0.175]
84 6 Conclusions A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. [sent-176, score-0.153]
85 Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion that promotes unbalanced trees that focus on the decision boundary. [sent-177, score-0.78]
86 For distributions on (X, Y ) whose Bayes decision boundary behaves locally like a Hölder-γ smooth function, we show that the PDCT error converges to the Bayes error at a rate no slower than O((log n/n)^{γ/(d+2γ−2)}). [sent-178, score-0.491]
87 Polynomial kernel methods are not capable of achieving such rates due to their lack of spatial adaptivity. [sent-179, score-0.131]
88 When γ = 1, the DCT convergence rate is within a logarithmic factor of the minimax optimal rate. [sent-180, score-0.354]
89 As γ → ∞ the rate tends to within a log factor of n^{-1/2}, the parametric rate. [sent-181, score-0.251]
90 However, the rates for γ > 1 are not within a logarithmic factor of the minimax rate [2]. [sent-182, score-0.324]
91 Given S ∈ Ωm, we know $\epsilon(T_n) \le \hat{\epsilon}(T_n) + \min_{R \le T_n} f(T_n, S, R, 3m^{-d}) = \hat{\epsilon}(T_n) + \min_{R \le T_n} \Delta(T_n, S, R) + \frac{4d \log(m)}{n} \le \hat{\epsilon}(T) + \min_{R \le T} \Delta(T, S, R) + \frac{4d \log(m)}{n}$, where the last inequality comes from the definition of Tn. [sent-193, score-0.192]
92 1 Proof of Theorem 2 By Theorem 1, it suffices to find a tree T* ≤ T0 such that $E\big\{ \min_{R \le T^*} \Delta(T^*, S, R) \big\} + (\epsilon(T^*) - \epsilon^*) = O\Big( \big(\tfrac{\log n}{n}\big)^{1/d} \Big)$. [sent-197, score-0.248]
93 Define T ∗ to be the tree obtained by pruning back T0 at every node (thought of as a region of space) that does not intersect the Bayes decision boundary. [sent-198, score-0.645]
94 Define R∗ to be the pruned subtree of T ∗ consisting of all nodes in T ∗ up to depth j0 d, where j0 = J − (1/d) log2 (J) (truncated if necessary). [sent-202, score-0.442]
95 The first term, consisting of a sum of terms over the leaf nodes of R∗ , is dominated by the sum of those terms over the leaf nodes of R∗ at depth j0 d. [sent-212, score-0.526]
96 Recall the leaf nodes of T ∗ at maximum depth are cells of side length 1/m. [sent-219, score-0.332]
97 Nowak, “Dyadic classification trees via structural risk minimization,” in Advances in Neural Information Processing Systems 14, S. [sent-228, score-0.218]
98 Mansour, “A fast, bottom-up decision tree pruning algorithm with nearoptimal generalization,” in International Conference on Machine Learning, 1998, pp. [sent-241, score-0.555]
99 Wu, “Enlarging the margins in perceptron decision trees,” Machine Learning, vol. [sent-258, score-0.149]
100 Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. [sent-277, score-0.665]
wordName wordTfidf (topN-words)
[('tn', 0.313), ('dyadic', 0.311), ('tv', 0.305), ('pruning', 0.301), ('dcts', 0.248), ('rv', 0.207), ('leaf', 0.175), ('pruned', 0.172), ('tree', 0.146), ('pdct', 0.145), ('subtree', 0.144), ('dct', 0.144), ('trees', 0.134), ('bayes', 0.128), ('decision', 0.108), ('minimax', 0.104), ('classi', 0.104), ('log', 0.102), ('polynomial', 0.096), ('rate', 0.091), ('node', 0.09), ('spatially', 0.09), ('boundary', 0.09), ('theorem', 0.083), ('convergence', 0.083), ('cdp', 0.083), ('lder', 0.083), ('mammen', 0.083), ('nowak', 0.083), ('ri', 0.079), ('depth', 0.076), ('mansour', 0.072), ('tsybakov', 0.072), ('adaptivity', 0.072), ('penalty', 0.07), ('nv', 0.065), ('min', 0.064), ('md', 0.063), ('rice', 0.062), ('ers', 0.061), ('risk', 0.056), ('scott', 0.055), ('partition', 0.054), ('subtrees', 0.054), ('rates', 0.053), ('regularity', 0.052), ('behaves', 0.052), ('rule', 0.052), ('adaptive', 0.05), ('nodes', 0.05), ('lemma', 0.049), ('logarithmic', 0.048), ('error', 0.048), ('root', 0.047), ('penalties', 0.046), ('mcallester', 0.046), ('lipschitz', 0.046), ('unbalanced', 0.046), ('bound', 0.046), ('spatial', 0.045), ('rk', 0.043), ('cube', 0.043), ('boundaries', 0.042), ('rules', 0.042), ('cation', 0.042), ('perceptron', 0.041), ('pdcts', 0.041), ('resolvability', 0.041), ('locally', 0.04), ('degree', 0.039), ('smooth', 0.039), ('let', 0.038), ('equipped', 0.036), ('subdivide', 0.036), ('okamoto', 0.036), ('achieving', 0.033), ('cubes', 0.033), ('collections', 0.033), ('cyclic', 0.032), ('inequalities', 0.032), ('bounded', 0.031), ('side', 0.031), ('parametric', 0.03), ('deeper', 0.03), ('belongs', 0.03), ('proof', 0.029), ('class', 0.029), ('bennett', 0.029), ('fragment', 0.029), ('pv', 0.029), ('lebesgue', 0.029), ('factor', 0.028), ('structural', 0.028), ('collection', 0.027), ('minimizes', 0.027), ('slower', 0.027), ('aware', 0.027), ('svms', 0.027), ('integer', 0.027), ('generalization', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.11207295 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
3 0.10585114 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 10^6 dimensions and 10^5 records, and show non-trivial speedups while giving exact answers. 1
4 0.10490619 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
5 0.098512203 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
6 0.089494355 194 nips-2003-Warped Gaussian Processes
7 0.088718504 172 nips-2003-Semi-Supervised Learning with Trees
8 0.081217803 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
9 0.078463167 189 nips-2003-Tree-structured Approximations by Expectation Propagation
10 0.070045583 51 nips-2003-Design of Experiments via Information Theory
11 0.066400029 128 nips-2003-Minimax Embeddings
12 0.064199284 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
13 0.063787602 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
14 0.061404828 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
15 0.061156433 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.061018191 168 nips-2003-Salient Boundary Detection using Ratio Contour
17 0.060097985 42 nips-2003-Bounded Finite State Controllers
18 0.059643768 122 nips-2003-Margin Maximizing Loss Functions
19 0.059373621 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.05752394 151 nips-2003-PAC-Bayesian Generic Chaining
topicId topicWeight
[(0, -0.194), (1, -0.041), (2, -0.056), (3, -0.003), (4, 0.061), (5, -0.064), (6, -0.054), (7, -0.032), (8, -0.043), (9, 0.144), (10, 0.018), (11, 0.015), (12, 0.045), (13, 0.036), (14, 0.028), (15, -0.06), (16, -0.057), (17, 0.114), (18, 0.169), (19, -0.02), (20, -0.023), (21, -0.103), (22, -0.06), (23, 0.004), (24, 0.005), (25, 0.103), (26, 0.085), (27, 0.153), (28, 0.147), (29, -0.036), (30, 0.04), (31, 0.104), (32, 0.173), (33, 0.121), (34, -0.141), (35, -0.131), (36, -0.085), (37, 0.043), (38, -0.031), (39, 0.025), (40, -0.01), (41, 0.091), (42, 0.117), (43, -0.133), (44, 0.075), (45, -0.076), (46, -0.027), (47, -0.02), (48, -0.031), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.9501487 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.72886616 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 10^6 dimensions and 10^5 records, and show non-trivial speedups while giving exact answers. 1
3 0.63748837 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
4 0.63073659 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
5 0.50087577 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
6 0.44161585 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
7 0.43186438 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
8 0.39860681 181 nips-2003-Statistical Debugging of Sampled Programs
9 0.38720408 3 nips-2003-AUC Optimization vs. Error Rate Minimization
10 0.37986505 99 nips-2003-Kernels for Structured Natural Language Data
11 0.35846967 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
12 0.35571325 51 nips-2003-Design of Experiments via Information Theory
13 0.3557086 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
14 0.34882116 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
15 0.340754 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
16 0.3405219 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
17 0.33925122 189 nips-2003-Tree-structured Approximations by Expectation Propagation
18 0.33348337 187 nips-2003-Training a Quantum Neural Network
19 0.32639745 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
20 0.31588364 30 nips-2003-Approximability of Probability Distributions
topicId topicWeight
[(0, 0.06), (1, 0.315), (11, 0.02), (29, 0.01), (30, 0.015), (35, 0.047), (53, 0.075), (69, 0.026), (71, 0.066), (76, 0.029), (85, 0.134), (91, 0.078), (99, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.78713983 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
2 0.68819761 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
3 0.53221786 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
4 0.52458876 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
5 0.52042919 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.51564687 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
7 0.50933367 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
8 0.50896114 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
9 0.50800198 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
10 0.5064593 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
11 0.50568491 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
12 0.50433171 78 nips-2003-Gaussian Processes in Reinforcement Learning
13 0.50427973 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.50250107 41 nips-2003-Boosting versus Covering
15 0.50240874 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
16 0.50069249 158 nips-2003-Policy Search by Dynamic Programming
17 0.4977012 172 nips-2003-Semi-Supervised Learning with Trees
18 0.49561793 113 nips-2003-Learning with Local and Global Consistency
19 0.49471369 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
20 0.49426118 145 nips-2003-Online Classification on a Budget