nips nips2002 nips2002-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. [sent-2, score-0.198]
2 In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. [sent-4, score-0.929]
3 We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. [sent-8, score-0.359]
4 Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. [sent-9, score-0.095]
5 Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based. [sent-10, score-0.486]
6 A procedure that constructs a classifier for every training sample size is called a discrimination rule. [sent-25, score-0.1]
7 In this paper, we examine a family of classifiers called dyadic classification trees (DCTs), built by recursive, dyadic partitioning of the input space. [sent-28, score-1.244]
8 The appropriate tree from this family is obtained by building an initial tree (in a data-independent fashion), followed by a data-dependent pruning operation based on structural risk minimization (SRM). [sent-29, score-0.852]
9 Thus, one important distinction between our approach and usual decision trees is that the initial tree is not adaptively grown to fit the data. [sent-30, score-0.512]
10 The pruning strategy resembles that used by CART, except that the penalty assigned to a subtree is proportional to the square root of its size. [sent-31, score-0.315]
11 SRM penalized DCTs lead to a strongly consistent discrimination rule for input data with support in the unit cube [0, 1]^d. [sent-32, score-0.312]
12 We also derive bounds on the rate of convergence of DCTs to the Bayes error. [sent-33, score-0.163]
13 Under a modest regularity assumption (in terms of the box-counting dimension) on the underlying optimal Bayes decision boundary, we bound the rate at which complexity-regularized DCTs converge to the Bayes error. [sent-34, score-0.152]
14 Moreover, we give a lower bound on the minimax error rate for this class that nearly matches the rate attained by DCTs. [sent-35, score-0.37]
15 This shows that dyadic classification trees are near minimax-rate optimal, i.e., no discrimination rule can perform significantly better in this minimax sense. [sent-36, score-0.725]
17 We also present an efficient algorithm for implementing the pruning strategy, which in turn yields an efficient algorithm for DCT construction. [sent-39, score-0.26]
18 The pruning algorithm requires a number of operations that grows only linearly with the number of terminal nodes of the initial tree, and is based on the familiar pruning algorithm used by CART [1]. [sent-40, score-0.788]
19 Finally, we compare DCTs with a CART-like tree classifier on four common datasets. [sent-41, score-0.16]
20 Let a tree-structured partition of the input space be given, in which each cell is a hyperrectangle with sides parallel to the coordinate axes. [sent-46, score-0.119]
21 If a cell lies at depth j in the tree, its two children are the rectangles formed by splitting the cell at its midpoint along the coordinate designated for that depth. [sent-48, score-0.184]
22 As a convention, assume the first child contains those points of the parent cell that are less than or equal to the midpoint along the dimension being split. [sent-49, score-0.117]
23 Definition 1 A sequential dyadic partition (SDP) is any partition of the input space obtained recursively from the trivial partition (the unit cube itself) by midpoint splits of this kind. We define a dyadic classification tree (DCT) to be a sequential dyadic partition with a class label (0 or 1) assigned to each node in the tree. [sent-64, score-1.983]
24 The partitions are sequential because children must be split along the next coordinate after the coordinate where their parent was split. [sent-65, score-0.173]
25 The partitions are dyadic because we only allow midpoint splits. [sent-67, score-0.54]
26 By a complete DCT of depth J, we mean a DCT such that every possible split up to depth J has been made. [sent-68, score-0.151]
27 If J is a multiple of the input dimension d, then the terminal nodes of a complete DCT are hypercubes of sidelength 2^(-J/d). [sent-70, score-0.121]
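To make the construction concrete, the following is a minimal Python sketch of a complete sequential dyadic partition, assuming cyclic midpoint splits as described above; the names Cell and complete_dct_cells are illustrative and not from the paper, and boundary ties (which child owns a midpoint) are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cell:
    """A hyperrectangle in [0, 1]^d, stored as per-coordinate lower/upper bounds."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]

def complete_dct_cells(d: int, depth: int) -> List[Cell]:
    """Terminal cells of a complete sequential dyadic partition of [0, 1]^d.

    At tree depth j the split coordinate is j mod d, so the recursion cycles
    through the coordinates, and every cell is split at its midpoint.
    """
    def split(cell: Cell, j: int) -> List[Cell]:
        if j == depth:
            return [cell]
        c = j % d                                   # next coordinate in the cycle
        mid = 0.5 * (cell.lo[c] + cell.hi[c])
        left = Cell(cell.lo, cell.hi[:c] + (mid,) + cell.hi[c + 1:])
        right = Cell(cell.lo[:c] + (mid,) + cell.lo[c + 1:], cell.hi)
        return split(left, j + 1) + split(right, j + 1)

    return split(Cell((0.0,) * d, (1.0,) * d), 0)

# Example: depth 4 in d = 2 dimensions gives 16 square cells of sidelength 1/4.
assert len(complete_dct_cells(d=2, depth=4)) == 16
```

A DCT then amounts to attaching a 0/1 label to each node of the corresponding binary tree.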
28 3 SRM for DCTs. Structural risk minimization (SRM) is an inductive principle for selecting a classifier from a sequence of sets of classifiers based on complexity regularization. [sent-71, score-0.185]
29 We formulate structural risk minimization for dyadic classification trees by applying results from [4]. [sent-74, score-0.924]
30 The Vapnik-Chervonenkis dimension (or VC dimension) of a class of classifiers is the largest integer k for which there exist k points that the class shatters. [sent-79, score-0.213]
31 For a dyadic integer m, and for each number of leaves k, let the model class consist of all DCTs with k terminal nodes and bounded depth, so that no terminal node has a side of length below the resolution determined by m. [sent-86, score-0.674]
32 Given a dyadic integer m and the training data, the SRM principle proceeds in two steps. [sent-96, score-0.29]
33 For each candidate number of leaves, a tree is first selected by empirical risk minimization over the corresponding class of DCTs. [sent-100, score-0.322]
34 Among these candidates, the SRM principle then selects the classifier minimizing the penalized risk, defined as the empirical risk plus a penalty proportional to the square root of the tree size. [sent-102, score-0.601]
35 We refer to the selected tree as a penalized or complexity-regularized dyadic classification tree. [sent-104, score-0.636]
36 Even though the penalized DCT does not know the tree size that optimally balances the two terms, it performs as though it does, because of the “min” in the expression. [sent-114, score-0.153]
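Schematically, and assuming a routine best_pruning_of_size(k) that returns the empirical-risk-minimizing pruned DCT with k leaves (Section 5 describes how such prunings can be obtained), the selection rule could be sketched as follows; the penalty constant c is illustrative and not the paper's exact expression.

```python
import math

def srm_select(best_pruning_of_size, empirical_risk, k_max, n, c=1.0):
    """Structural risk minimization over tree size.

    For each candidate number of leaves k, take the empirical-risk minimizer
    among DCT prunings with k leaves, then return the candidate minimizing
    empirical risk + c * sqrt(k / n).  (The paper's exact penalty constant
    is not reproduced here.)
    """
    best_tree, best_score = None, float("inf")
    for k in range(1, k_max + 1):
        tree = best_pruning_of_size(k)
        if tree is None:                 # no pruning with exactly k leaves
            continue
        score = empirical_risk(tree) + c * math.sqrt(k / n)
        if score < best_score:
            best_tree, best_score = tree, score
    return best_tree
```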
37 Nobel [6] gives similar results for classifiers based on initial trees that depend on the data. [sent-115, score-0.227]
38 The next result demonstrates strong consistency for the penalized DCT, where strong consistency means that the risk converges to the Bayes risk with probability one. [sent-116, score-0.29]
39 Theorem 2 Suppose the resolution parameter m = m(n) grows with the sample size n, assuming only dyadic integer values. [sent-117, score-0.539]
40 If m grows at a suitable rate, then the penalized dyadic classification tree is strongly consistent for all distributions supported on the unit hypercube. [sent-118, score-0.796]
41 Sketch of proof: The proof follows from the first part of Theorem 1 and strong universal consistency of the regular histogram classifier. [sent-120, score-0.096]
42 4 Rates of Convergence. In this section, we investigate bounds on the rate of convergence of complexity-regularized DCTs. [sent-122, score-0.163]
43 First we obtain upper bounds on the rate of convergence for a particular class of distributions on the unit cube. [sent-123, score-0.22]
44 We then state a minimax lower bound on the rate of convergence of any data-based classifier for this class. [sent-124, score-0.327]
45 Most rate of convergence studies in pattern recognition place a constraint on the regression function by requiring it to belong to a certain smoothness class (e.g., a Lipschitz or Besov class). [sent-125, score-0.188]
46 In contrast, the class we study is defined in terms of the regularity of the Bayes decision boundary. [sent-128, score-0.215]
47 We allow the regression function to be arbitrarily irregular away from the decision boundary, so long as it is well behaved near it. [sent-129, score-0.1]
48 The Bayes decision boundary is informally defined as the set of points where the regression function equals one half. [sent-130, score-0.157]
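For reference, the standard textbook definitions underlying this discussion, written in generic notation (the paper's exact notation may differ): the regression function, the Bayes classifier, and the Bayes decision boundary.

```latex
\[
\eta(x) = \mathbb{P}(Y = 1 \mid X = x), \qquad
g^{*}(x) = \mathbf{1}\{\eta(x) \ge \tfrac{1}{2}\}, \qquad
\partial G^{*} = \partial \{x : \eta(x) \ge \tfrac{1}{2}\}.
\]
```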
49 Partition the unit cube into hypercubes of sidelength 1/m, for each dyadic integer m. We now define a class of distributions. [sent-139, score-0.514]
50 A1 (Bounded marginal): For any such cube A intersecting the Bayes decision boundary, the marginal probability of A is bounded by a constant times λ(A), where λ denotes the Lebesgue measure. [sent-143, score-0.121]
51 A2 (Regularity): The Bayes decision boundary passes through at most a constant multiple of m^(d-1) of the resulting cubes, the natural scaling when the boundary behaves like a (d-1)-dimensional set. Define the class under study to be the set of all distributions satisfying A1 and A2. [sent-146, score-0.188]
52 The second condition can be shown to hold when one coordinate of the Bayes decision boundary is a Lipschitz function of the others. [sent-149, score-0.212]
53 See, for example, the boundary fragment class of [7]. [sent-150, score-0.119]
54 The regularity condition A2 is closely related to the notion of box-counting dimension of the Bayes decision boundary [8]. [sent-152, score-0.3]
55 Roughly speaking, A2 holds (for some constant) if and only if the Bayes decision boundary has box-counting dimension at most d - 1. [sent-153, score-0.217]
56 The box-counting dimension is an upper bound on the Hausdorff dimension, and the two dimensions are equal for most “reasonable” sets. [sent-154, score-0.115]
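As an illustration of the box-counting idea (in two dimensions, with the boundary given as the graph of a function), the number of grid cells hit by the boundary can be estimated numerically; the function name and the example boundary below are illustrative.

```python
import math

def cells_hit_by_graph(f, m):
    """Count cells of the regular m-by-m grid on [0, 1]^2 that the curve
    y = f(x) passes through, estimated by dense sampling along the curve.

    Condition A2 asks that this count not grow too quickly with m; for a
    curve in the plane the natural benchmark is a constant times m.
    """
    hit = set()
    samples = 50 * m                         # dense enough for a Lipschitz graph
    for i in range(samples + 1):
        x = i / samples
        y = min(max(f(x), 0.0), 1.0)
        hit.add((min(int(x * m), m - 1), min(int(y * m), m - 1)))
    return len(hit)

# Example: a smooth (Lipschitz) boundary; the count grows roughly linearly in m.
boundary = lambda x: 0.5 + 0.25 * math.sin(4.0 * x)
for m in (4, 8, 16, 32):
    print(m, cells_hit_by_graph(boundary, m))
```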
57 Theorem 3 Assume the distribution belongs to the class just defined, and consider the penalized dyadic classification tree as described in Section 3. [sent-169, score-0.636]
58 If the resolution m is chosen appropriately, then there exists a constant such that, for all n, the expected excess risk over the Bayes error is bounded by that constant times a polynomially decaying function of n. Sketch of proof: It can be shown that for each dyadic m, there exists a pruned DCT, with a controlled number of leaf nodes, whose approximation error is correspondingly small. [sent-170, score-0.626]
59 Plugging this into the risk bound in Theorem 1 and minimizing over the free parameter produces the desired result [5]. [sent-171, score-0.166]
60 Note that similar rate of convergence results for data-grown trees would be more difficult to establish, since the approximation error is random in those cases. [sent-174, score-0.316]
61 To illustrate the significance of Theorem 3, consider a penalized histogram classifier, with bin width determined adaptively by structural risk minimization, as described in [4, Problem 18]. [sent-178, score-0.376]
62 For that rule, the best exponent on the rate of convergence for our class is smaller (i.e., the rate is slower) than the exponent attained by our rule. [sent-180, score-0.187]
63 Intuitively, this is because the adaptive resolution of dyadic classification trees enables them to focus on the lower-dimensional decision boundary, rather than the full-dimensional regression function. [sent-181, score-0.789]
64 Thus, the penalized DCT is able to automatically adapt to the dimensionality of the input data. [sent-189, score-0.153]
65 2 Minimax Lower Bound. The next result demonstrates that complexity-regularized DCTs nearly achieve the minimax rate for our class of distributions. [sent-194, score-0.146]
66 Theorem 4 Consider any discrimination rule based on the training data. [sent-195, score-0.107]
67 We suspect that the class studied by Tsybakov [7], used in our minimax proof, is more restrictive than our class. [sent-201, score-0.287]
68 Therefore, it may be that the exponent in the above theorem can be decreased, in which case we achieve the minimax rate. [sent-202, score-0.218]
69 Although bounds on the minimax rate of convergence in pattern recognition have been investigated in previous work [9, 10], the focus has been on placing regularity assumptions on the regression function. [sent-208, score-0.465]
70 Our approach is to study minimax rates for distributions defined in terms of the regularity of the Bayes decision boundary. [sent-214, score-0.37]
71 With this framework, we see that minimax rates for classification can be orders of magnitude faster than for estimation of the regression function, since the regression function may be arbitrarily irregular away from the decision boundary for distributions in our class. [sent-215, score-0.431]
72 This view of minimax classification has also been adopted by Mammen and Tsybakov [7,11]. [sent-216, score-0.18]
73 Our contribution with respect to their work is an implementable discrimination rule, with guaranteed computational complexity, that nearly achieves the minimax lower bounds. [sent-217, score-0.289]
74 The rates obtained by those authors require much stronger assumptions on the smoothness of the decision boundary than we employ in this paper. [sent-220, score-0.157]
75 5 An Efficient Pruning Algorithm. In this section we describe an algorithm to compute the penalized DCT efficiently. [sent-222, score-0.153]
76 Breiman et al. [1] showed that, for an additive penalty proportional to the number of leaf nodes, the optimal prunings of an initial tree form a nested family of subtrees that can be computed efficiently. [sent-230, score-0.701]
77 A similar result holds for the square-root penalty, and the trees produced are a subset of the trees produced by the additive penalty [5]. [sent-234, score-0.451]
78 Theorem 5 For each value of the square-root penalty weight, there exists an additive penalty weight yielding the same optimal pruning. Therefore, pruning with the square-root penalty always produces one of the trees in the nested family. [sent-236, score-0.536]
79 We may then determine the pruned tree minimizing the penalized risk by minimizing this quantity over the trees in that nested family. [sent-237, score-0.519]
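Putting Theorem 5 to work, and assuming a routine cart_pruning_family(tree) that returns the nested family of additive-penalty (CART-style) prunings, the square-root-penalized tree could be selected as in the following sketch; the routine names and the penalty constant are placeholders.

```python
import math

def prune_sqrt_penalty(initial_tree, cart_pruning_family, empirical_risk,
                       num_leaves, n, c=1.0):
    """Select the pruning minimizing  empirical risk + c * sqrt(|T| / n).

    By Theorem 5 the minimizer over all prunings lies in the nested family
    produced by additive-penalty (CART-style) pruning, so it suffices to
    scan that family rather than all possible prunings.
    """
    family = cart_pruning_family(initial_tree)   # nested subtrees, largest to smallest
    def penalized(tree):
        return empirical_risk(tree) + c * math.sqrt(num_leaves(tree) / n)
    return min(family, key=penalized)
```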
80 In the context of constructing a penalized DCT, we start with an initial tree that is a complete DCT. [sent-240, score-0.371]
81 For the classifiers in Theorems 2 and 3, the size of this initial tree is determined by the resolution parameter m. Table 1: Comparison of a greedy tree growing procedure, with model selection based on a holdout error estimate, and two DCT-based methods. [sent-241, score-0.548]
82 [Table 1 datasets: Pima Indian Diabetes, Wisconsin Breast Cancer, Ionosphere, Waveform.] 6 Experimental Comparison. To gain a rough idea of the usefulness of dyadic classification trees in practice, we compared two DCT-based classifiers with a greedy tree growing procedure, similar to that used by CART [1] or C4.5. [sent-258, score-0.931]
83 For the greedy growing scheme, we used half of the training data to grow the tree, and constructed every possible pruning of the initial tree with an additive penalty. [sent-266, score-0.539]
84 The best pruned tree was chosen to minimize the holdout error on the rest of the training data. [sent-267, score-0.338]
85 The second classifier, DCT-HOLD, was constructed in a similar manner, except that the initial tree was a complete DCT, and all of the training data was used for computing the holdout error estimate. [sent-269, score-0.327]
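A sketch of the holdout-based model selection used by CART-HOLD and DCT-HOLD: enumerate the pruning sequence of the initial tree and keep the pruning with the smallest error on the held-out data. The routine names are placeholders.

```python
def select_by_holdout(pruning_sequence, predict, holdout_X, holdout_y):
    """Pick the pruned tree with the lowest holdout misclassification rate.

    `pruning_sequence` is the nested family of prunings of the initial tree,
    and `predict(tree, x)` returns the 0/1 label that `tree` assigns to x.
    """
    def holdout_error(tree):
        wrong = sum(predict(tree, x) != y for x, y in zip(holdout_X, holdout_y))
        return wrong / len(holdout_y)
    return min(pruning_sequence, key=holdout_error)
```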
86 From these experiments, we might conclude two things: (i) The greedily-grown partition outperforms the dyadic partition; and (ii) Much of the discrepancy between CART-HOLD and DCT-SRM comes from the partitioning, and not from the model selection method (holdout versus SRM). [sent-272, score-0.547]
87 It has been shown in [4, Ch. 20] that greedy partitioning based on impurity functions can perform arbitrarily poorly for some distributions, while this is never the case for complexity-regularized DCTs. [sent-275, score-0.161]
88 7 Conclusion. Dyadic classification trees exhibit desirable theoretical properties (finite sample risk bounds, consistency, near minimax-rate optimality) and can be trained extremely rapidly. [sent-277, score-0.379]
89 The minimax result demonstrates that other discrimination rules, such as neural networks or support vector machines, cannot significantly outperform DCTs (in this minimax sense). [sent-278, score-0.473]
90 This minimax result is asymptotic, and considers worst-case distributions. [sent-279, score-0.18]
91 The sequential dyadic partitioning scheme is especially susceptible to poor performance when many of the features are irrelevant, since it must cycle through all features before splitting a feature again. [sent-281, score-0.568]
92 Several modifications to the current dyadic partitioning scheme may be envisioned, such as free dyadic or median splits. [sent-282, score-1.022]
93 Such modified tree induction strategies would still possess many of the desirable theoretical properties of DCTs. [sent-283, score-0.187]
94 Indeed, Nobel has derived risk bounds and consistency results for classification trees grown according to data [6]. [sent-284, score-0.458]
95 Our square-root pruning algorithm now provides a means of implementing his pruning schemes for comparison with other model selection techniques. [sent-285, score-0.52]
96 It remains to be seen whether the rate of convergence analysis presented here extends to his work. [sent-288, score-0.118]
97 C. Scott and R. Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. Rep., Rice University. [sent-313, score-0.781]
98 A. Nobel, “Analysis of a complexity-based pruning scheme for classification trees,” IEEE Transactions on Information Theory. [sent-320, score-0.26]
99 J. S. Marron, “Optimal rates of convergence to Bayes risk in nonparametric discrimination,” Annals of Statistics. [sent-334, score-0.248]
100 P. A. Chou, T. Lookabaugh, and R. M. Gray, “Optimal pruning with applications to tree-structured source coding and modeling,” IEEE Transactions on Information Theory. [sent-352, score-0.26]
wordName wordTfidf (topN-words)
[('dyadic', 0.483), ('dcts', 0.326), ('dct', 0.322), ('pruning', 0.26), ('classi', 0.202), ('trees', 0.198), ('minimax', 0.18), ('tree', 0.16), ('penalized', 0.153), ('risk', 0.137), ('cart', 0.113), ('srm', 0.113), ('holdout', 0.109), ('bayes', 0.102), ('boundary', 0.088), ('nobel', 0.087), ('tsybakov', 0.087), ('qp', 0.086), ('regularity', 0.083), ('cation', 0.081), ('theorem', 0.076), ('discrimination', 0.076), ('convergence', 0.073), ('pruned', 0.069), ('decision', 0.069), ('er', 0.066), ('lugosi', 0.065), ('shatters', 0.065), ('partition', 0.064), ('dimension', 0.06), ('structural', 0.058), ('vc', 0.058), ('midpoint', 0.057), ('sdp', 0.057), ('partitioning', 0.056), ('integer', 0.056), ('growing', 0.055), ('terminal', 0.055), ('penalty', 0.055), ('coordinate', 0.055), ('ers', 0.052), ('lipschitz', 0.052), ('cube', 0.052), ('consistency', 0.05), ('sketch', 0.048), ('minimization', 0.048), ('proof', 0.046), ('rate', 0.045), ('bounds', 0.045), ('near', 0.044), ('depth', 0.044), ('besov', 0.043), ('impurity', 0.043), ('mammen', 0.043), ('zeger', 0.043), ('regression', 0.039), ('exponent', 0.038), ('splits', 0.038), ('rates', 0.038), ('breiman', 0.038), ('nowak', 0.038), ('rice', 0.038), ('subtrees', 0.038), ('demonstrates', 0.037), ('nodes', 0.037), ('greedy', 0.035), ('split', 0.034), ('nearly', 0.033), ('scott', 0.032), ('lebesgue', 0.032), ('denoted', 0.032), ('min', 0.031), ('class', 0.031), ('arg', 0.031), ('rule', 0.031), ('rules', 0.029), ('complete', 0.029), ('sequential', 0.029), ('bound', 0.029), ('irregular', 0.029), ('initial', 0.029), ('let', 0.028), ('leaf', 0.028), ('adaptively', 0.028), ('grown', 0.028), ('arbitrarily', 0.027), ('possess', 0.027), ('upper', 0.026), ('transactions', 0.026), ('bounded', 0.026), ('vg', 0.026), ('alexander', 0.026), ('ne', 0.025), ('remark', 0.025), ('yang', 0.025), ('datasets', 0.024), ('operations', 0.024), ('called', 0.024), ('vapnik', 0.023), ('exists', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
2 0.23522666 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
4 0.12761904 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
5 0.11623336 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
Author: Ron Meir, Tong Zhang
Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.
6 0.11405857 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
7 0.10699195 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
8 0.10694863 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
9 0.10668721 92 nips-2002-FloatBoost Learning for Classification
10 0.10533473 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
11 0.092521206 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
12 0.0898242 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
13 0.089160152 85 nips-2002-Fast Kernels for String and Tree Matching
14 0.088013001 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
15 0.084886305 89 nips-2002-Feature Selection by Maximum Marginal Diversity
16 0.082886055 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
17 0.075437225 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
18 0.073842272 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.072631128 55 nips-2002-Combining Features for BCI
20 0.070120282 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
topicId topicWeight
[(0, -0.209), (1, -0.124), (2, 0.036), (3, -0.055), (4, 0.212), (5, -0.006), (6, -0.036), (7, -0.065), (8, -0.029), (9, 0.007), (10, 0.017), (11, 0.063), (12, 0.064), (13, -0.037), (14, 0.077), (15, -0.122), (16, 0.069), (17, -0.0), (18, -0.043), (19, -0.081), (20, -0.104), (21, 0.215), (22, -0.099), (23, -0.054), (24, -0.042), (25, 0.063), (26, -0.074), (27, 0.013), (28, 0.092), (29, -0.153), (30, 0.012), (31, 0.138), (32, 0.066), (33, 0.033), (34, 0.023), (35, -0.061), (36, -0.117), (37, 0.006), (38, 0.05), (39, -0.041), (40, 0.02), (41, -0.133), (42, -0.083), (43, -0.0), (44, 0.05), (45, 0.186), (46, -0.063), (47, -0.041), (48, -0.016), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94267082 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
2 0.65701723 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
3 0.58639157 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
4 0.57087743 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
5 0.5474543 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
Author: Martin J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky
Abstract: We describe a method for computing provably exact maximum a posteriori (MAP) estimates for a subclass of problems on graphs with cycles. The basic idea is to represent the original problem on the graph with cycles as a convex combination of tree-structured problems. A convexity argument then guarantees that the optimal value of the original problem (i.e., the log probability of the MAP assignment) is upper bounded by the combined optimal values of the tree problems. We prove that this upper bound is met with equality if and only if the tree problems share an optimal configuration in common. An important implication is that any such shared configuration must also be the MAP configuration for the original problem. Next we develop a tree-reweighted max-product algorithm for attempting to find convex combinations of tree-structured problems that share a common optimum. We give necessary and sufficient conditions for a fixed point to yield the exact MAP estimate. An attractive feature of our analysis is that it generalizes naturally to convex combinations of hypertree-structured distributions.
6 0.53816777 92 nips-2002-FloatBoost Learning for Classification
7 0.53194296 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
8 0.49197692 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
9 0.48934096 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
10 0.47534296 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
11 0.44442818 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
12 0.4439671 85 nips-2002-Fast Kernels for String and Tree Matching
13 0.44090995 6 nips-2002-A Formulation for Minimax Probability Machine Regression
14 0.41489166 185 nips-2002-Speeding up the Parti-Game Algorithm
15 0.39381701 89 nips-2002-Feature Selection by Maximum Marginal Diversity
16 0.3933647 55 nips-2002-Combining Features for BCI
17 0.37914082 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
18 0.37828138 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.37785235 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
20 0.37576643 67 nips-2002-Discriminative Binaural Sound Localization
topicId topicWeight
[(23, 0.012), (42, 0.039), (54, 0.121), (55, 0.022), (68, 0.016), (73, 0.014), (74, 0.068), (92, 0.471), (98, 0.131)]
simIndex simValue paperId paperTitle
Author: Sumio Watanabe, Shun-ichi Amari
Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exits a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1
same-paper 2 0.90447438 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
3 0.89273942 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
Author: Dörthe Malzahn, Manfred Opper
Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
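For concreteness, a sketch of the plain Monte-Carlo bootstrap averaging that the analytical approximation is compared against, applied to Gaussian process regression on toy data; the kernel, noise level, and data set are arbitrary illustrative choices, and this is only the sampling baseline, not the paper's replica/variational method:

import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, length=0.5):
    # Squared-exponential kernel on 1-D inputs (illustrative choice).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_predict(x_train, y_train, x_test, noise=0.1):
    # GP regression posterior mean with fixed hyperparameters.
    K = rbf(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(K, y_train)

# Toy data set
x = np.sort(rng.uniform(-3, 3, size=60))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
x_test = np.linspace(-3, 3, 200)
y_test = np.sin(x_test)

# Bootstrap average of the squared test error over resampled training sets
n_bootstrap = 500
errors = []
for _ in range(n_bootstrap):
    idx = rng.integers(0, x.size, size=x.size)   # resample with replacement
    pred = gp_predict(x[idx], y[idx], x_test)
    errors.append(np.mean((pred - y_test) ** 2))
print("Monte-Carlo bootstrap estimate of the generalization error:", np.mean(errors))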
4 0.89078838 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee
Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.
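A sketch of the multiplicative update for a generic nonnegative quadratic program min_v 0.5 v'Av + b'v subject to v >= 0. The closed-form rescaling factor below follows the commonly cited form of these updates; the SVM-specific construction of A and b from the kernel matrix and labels is omitted, so treat this as an illustrative reimplementation rather than the paper's code:

import numpy as np

def multiplicative_nqp(A, b, n_iter=500, eps=1e-12):
    """Multiplicative updates for  min_v 0.5 v^T A v + b^T v,  v >= 0.

    A is split into its positive and negative parts, and each coordinate
    is rescaled by a nonnegative closed-form factor at every iteration.
    (Sketch; details may differ from the paper.)
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    A_pos = np.maximum(A, 0.0)            # positive elements of A
    A_neg = np.maximum(-A, 0.0)           # magnitudes of negative elements
    v = np.ones(len(b))                   # strictly positive starting point
    for _ in range(n_iter):
        a_plus = A_pos @ v
        a_minus = A_neg @ v
        factor = (-b + np.sqrt(b ** 2 + 4.0 * a_plus * a_minus)) / (2.0 * a_plus + eps)
        v *= factor                       # coordinates can only stay nonnegative
    return v

# Tiny example with a positive semidefinite A, as in the SVM dual.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T
b = rng.normal(size=5)
v = multiplicative_nqp(A, b)
print("solution:", np.round(v, 4))
print("objective:", 0.5 * v @ A @ v + b @ v)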
5 0.6696251 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas.

2 Combining Schema-based Synthesis and Bayesian Networks

Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The listing below shows the specification of the mixture-of-Gaussians example used throughout this paper (see footnote 2):

1  model mog as 'Mixture of Gaussians';
2  const int n_points as 'nr. of data points'
3    with 0 < n_points;
4  const int n_classes := 3 as 'nr. classes'
5    with 0 < n_classes
6    with n_classes << n_points;
7  double phi(1..n_classes) as 'weights'
8    with 1 = sum(I := 1..n_classes, phi(I));
9  double mu(1..n_classes);
10 double sigma(1..n_classes);
11 int c(1..n_points) as 'class labels';
12 c ~ disc(vec(I := 1..n_classes, phi(I)));
13 data double x(1..n_points) as 'data';
14 x(I) ~ gauss(mu(c(I)), sigma(c(I)));
15 max pr(x | phi, mu, sigma) wrt phi, mu, sigma;

Note the constraint that the sum of the class probabilities must equal one (line 8), along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.

Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic, which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal.

Footnote 1: Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which requires both Jacobian and Hessian), and EM and its variational extension [6] (which require a latent-variable structure model).

Footnote 2: Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d.

Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules, which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.

Schema-based synthesis.
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM.

First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example, with nodes phi, mu, sigma, c (discrete) and x (gauss), and plates over n_classes and n_points. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework, which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node:

Lemma 1. Let U and V be sets of variables over a Bayesian network. Then appropriate conditions on the descendents and parents of U hold in the corresponding dependency graph iff the following probability statement holds: pr(U | V) = ∏_{u ∈ U} pr(u | parents(u)).
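To make the running example concrete, here is a hand-written minimal EM routine for the one-dimensional mixture-of-Gaussians model declared in the specification above. This is illustrative code, not AUTO BAYES output; the function name and the synthetic data are made up:

import numpy as np

def em_mog(x, n_classes=3, n_iter=100, seed=0):
    # Plain EM for a 1-D Gaussian mixture with weights phi, means mu and
    # standard deviations sigma (a sketch of the kind of algorithm the
    # specification would compile to, not the system's generated code).
    rng = np.random.default_rng(seed)
    n = x.size
    phi = np.full(n_classes, 1.0 / n_classes)
    mu = rng.choice(x, size=n_classes, replace=False)
    sigma = np.full(n_classes, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = pr(c_i = k | x_i, phi, mu, sigma)
        log_r = (np.log(phi)
                 - np.log(sigma)
                 - 0.5 * ((x[:, None] - mu) / sigma) ** 2)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form parameter updates
        nk = r.sum(axis=0)
        phi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return phi, mu, sigma

# Example run on synthetic data drawn from three Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300),
                    rng.normal(0, 0.5, 300),
                    rng.normal(5, 2, 300)])
print(em_mog(x))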
6 0.6231038 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
7 0.60489678 161 nips-2002-PAC-Bayes & Margins
8 0.59802932 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
9 0.57084292 45 nips-2002-Boosted Dyadic Kernel Discriminants
10 0.56695151 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
11 0.56663185 27 nips-2002-An Impossibility Theorem for Clustering
12 0.55868173 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
13 0.55351645 6 nips-2002-A Formulation for Minimax Probability Machine Regression
14 0.55059695 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
15 0.54970902 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
16 0.54350704 140 nips-2002-Margin Analysis of the LVQ Algorithm
17 0.54081798 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
18 0.53937304 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
19 0.5361104 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
20 0.53018731 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction