jmlr jmlr2008 jmlr2008-89 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
Reference: text
sentIndex sentText sentNum sentScore
1 The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. [sent-9, score-0.72]
2 We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. [sent-10, score-0.936]
3 The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. [sent-11, score-1.486]
4 We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. [sent-12, score-0.949]
5 In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. [sent-14, score-1.188]
6 Keywords: ensemble learning, boosting, support vector machine, kernel [sent-16, score-0.679]
7 Some theories show that the finiteness places a restriction on the capacity of the ensemble (Freund and Schapire, 1997), and some theories suggest that the performance of boosting can be linked to its asymptotic behavior when the ensemble is allowed to be of an infinite size (Rätsch et al., 2001). [sent-20, score-1.069]
8 Furthermore, the framework allows us to compare SVM and ensemble learning algorithms in a fair manner using the same base hypothesis set. [sent-29, score-0.595]
9 Based on the framework, we derive two novel SVM kernels, the stump kernel and the perceptron kernel, from an ensemble learning perspective (Lin and Li, 2005a). [sent-30, score-1.367]
10 The stump kernel embodies infinitely many decision stumps, and as a consequence measures the similarity between examples by the ℓ1-norm distance. [sent-31, score-0.974]
11 While there exist similar kernels in the literature, our derivation from an ensemble learning perspective is nevertheless original. [sent-33, score-0.55]
12 Experimental results show that SVM with these kernels is superior to successful ensemble learning algorithms with the same base hypothesis set. [sent-35, score-0.646]
13 These results reveal some weakness in traditional ensemble learning algorithms, and help understand both SVM and ensemble learning better. [sent-36, score-0.957]
14 Experimentally, SVM with the Laplacian-RBF kernel performs better than ensemble learning algorithms with decision trees. [sent-40, score-0.787]
15 In addition, our derivation from an ensemble learning perspective helps to explain the success of the kernel on some specific applications (Chapelle et al. [sent-41, score-0.679]
16 We then derive the stump kernel in Section 4, the perceptron kernel in Section 5, and the Laplacian-RBF kernel in Section 6. [sent-46, score-1.33]
17 Adaptive Boosting and Linear Programming Boosting: The adaptive boosting (AdaBoost) algorithm (Freund and Schapire, 1996) is perhaps the most popular and successful approach for ensemble learning. [sent-80, score-0.545]
18 For a given integer T and a hypothesis set H, AdaBoost iteratively selects T hypotheses ht ∈ H and weights wt ≥ 0 to construct an ensemble classifier gT(x) = sign(∑_{t=1}^T wt ht(x)). [sent-81, score-0.938]
19 As shown by Rätsch et al. (2001), when T → ∞, AdaBoost asymptotically approximates an infinite ensemble classifier g∞(x) = sign(∑_{t=1}^∞ wt ht(x)) (2), such that (w, h) is an optimal solution to (P3), which minimizes ∑_{t=1}^∞ wt over wt ∈ R and ht ∈ H subject to the corresponding margin constraints. [sent-84, score-0.775]
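The finite ensemble prediction in the formula above is simple to compute; below is a minimal sketch, assuming a pool of already-trained base hypotheses and nonnegative weights. The helper names and toy data are illustrative, not from the paper.

```python
import numpy as np

def ensemble_predict(X, hypotheses, weights):
    """Evaluate g_T(x) = sign(sum_t w_t * h_t(x)) for a finite ensemble.

    `hypotheses` is a list of callables mapping an (N, D) array to {-1, +1}
    predictions, and `weights` is a matching array of nonnegative w_t.
    """
    votes = np.zeros(len(X))
    for w_t, h_t in zip(weights, hypotheses):
        votes += w_t * h_t(X)
    # Break ties toward +1 so the output stays in {-1, +1}.
    return np.where(votes >= 0, 1, -1)

# Toy usage with two decision stumps on a 2-D data set (illustrative only).
X = np.array([[0.2, 1.5], [0.8, -0.3], [-1.0, 0.4]])
h1 = lambda X: np.where(X[:, 0] > 0.5, 1, -1)
h2 = lambda X: np.where(X[:, 1] > 0.0, 1, -1)
print(ensemble_predict(X, [h1, h2], weights=np.array([0.7, 0.3])))
```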
20 We will introduce the soft-margin LPBoost, which constructs an ensemble classifier like (2) from the optimal solution to the linear program (P4), minimized over wt ∈ R and ht ∈ H under soft-margin constraints. [sent-106, score-0.535]
21 The algorithm works by adding one unambiguous ht to the ensemble in each iteration. [sent-120, score-0.528]
22 Note that AdaBoost and LPBoost require wt ≥ 0 for ensemble learning. [sent-157, score-0.535]
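For a fixed, finite pool of candidate hypotheses, a soft-margin LPBoost-style problem can be written as a single linear program. The sketch below solves that finite-pool surrogate with SciPy rather than the column-generation procedure LPBoost actually uses; the matrix H, the labels, and the value of C are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_finite_pool(H, y, C=1.0):
    """Solve a soft-margin LPBoost-style LP over a fixed hypothesis pool.

    H is an (N, T) matrix with H[i, t] = h_t(x_i) in {-1, +1}; y holds labels.
    Variables are z = [w_1..w_T, xi_1..xi_N]; we minimize sum(w) + C*sum(xi)
    subject to y_i * sum_t w_t h_t(x_i) >= 1 - xi_i and w, xi >= 0.
    """
    N, T = H.shape
    c = np.concatenate([np.ones(T), C * np.ones(N)])
    # Margin constraints rewritten as A_ub @ z <= b_ub.
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(N)])
    b_ub = -np.ones(N)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (T + N), method="highs")
    w, xi = res.x[:T], res.x[T:]
    return w, xi
```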
23 SVM-Based Framework for Infinite Ensemble Learning: Vapnik (1998) proposed a challenging task of designing an algorithm that actually generates an infinite ensemble classifier, that is, an ensemble classifier with infinitely many nonzero wt. [sent-169, score-1.0]
24 One is to actually derive the kernel, and the other is to handle the constraints wt ≥ 0 to make (1) an ensemble classifier. [sent-175, score-0.535]
25 Stump Kernel: In this section, we present the stump kernel, which embodies infinitely many decision stumps. [sent-229, score-0.76]
26 The decision stump s_{q,d,α}(x) = q · sign((x)_d − α) works on the d-th element of x, and classifies x according to q ∈ {−1, +1} and the threshold α (Holte, 1993). [sent-230, score-0.666]
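A direct transcription of the stump s_{q,d,α}(x) = q · sign((x)_d − α); the tie-breaking convention at the threshold is an assumption.

```python
import numpy as np

def decision_stump(X, q, d, alpha):
    """s_{q,d,alpha}(x) = q * sign((x)_d - alpha), vectorized over rows of X.

    q is +1 or -1, d indexes the feature, alpha is the threshold.  We map
    sign(0) to +1 so the output stays in {-1, +1}; the paper's tie-breaking
    convention may differ.
    """
    return q * np.where(X[:, d] - alpha >= 0, 1, -1)

X = np.array([[0.3, 2.0], [1.2, -0.5]])
print(decision_stump(X, q=+1, d=0, alpha=1.0))  # -> [-1  1]
```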
27 Formulation and Properties: To construct the stump kernel, we consider the following set of decision stumps: S = {s_{q,d,αd} : q ∈ {−1, +1}, d ∈ {1, …, D}, αd ∈ [Ld, Rd]}. [sent-245, score-0.766]
28 The stump kernel KS defined below can then be used in Algorithm 1 to obtain an infinite ensemble of decision stumps. [sent-251, score-1.308]
29 Definition 3: The stump kernel is KS with r(q, d, αd) = rS = 1/2 and KS(x, x′) = ΔS − ∑_{d=1}^D |(x)_d − (x′)_d| = ΔS − ‖x − x′‖1, where ΔS = (1/2) ∑_{d=1}^D (Rd − Ld) is a constant. [sent-252, score-0.735]
30 The validity of the stump kernel follows directly from Theorem 2 of the general framework. [sent-257, score-0.735]
31 That is, the stump kernel is an inner product in a Hilbert space of some square integrable functions ϕ(q, d, αd), and it produces a PSD Gram matrix for any set of input vectors {xi}_{i=1}^N ∈ X^N. [sent-258, score-0.735]
32 Given the ranges (Ld, Rd), the stump kernel is very simple to compute. [sent-259, score-0.757]
33 Theorem 4: Solving (P2) with the stump kernel KS is the same as solving (P2) with the simplified stump kernel K̃S(x, x′) = −‖x − x′‖1. [sent-261, score-1.47]
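A sketch of Definition 3 and Theorem 4 in code: the (simplified) stump kernel is a negated ℓ1 distance, which can be handed to an SVM as a precomputed Gram matrix. The toy data, the value of C, and the per-dimension ranges used for ΔS are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel(X1, X2, delta_S=0.0):
    """K_S(x, x') = Delta_S - ||x - x'||_1; delta_S = 0 gives the simplified kernel."""
    dists = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)
    return delta_S - dists

# Toy training set (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 0.2], [0.9, 1.1], [0.1, 1.0]])
y = np.array([-1, -1, +1, +1])

# Delta_S from assumed per-dimension ranges [L_d, R_d] = [0, 2].
delta_S = 0.5 * ((2.0 - 0.0) * X.shape[1])
K_train = stump_kernel(X, X, delta_S)

clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y)
print(clf.predict(stump_kernel(X, X, delta_S)))  # predictions on the training set
```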
34 The equivalence demonstrates the usefulness of the stump kernel on histogram-based features, which would be further discussed in Subsection 6. [sent-273, score-0.735]
35 The simplified stump kernel is simple to compute, yet useful in the sense of dichotomizing the training set, which comes from the following positive definite (PD) property. [sent-277, score-0.759]
36 Chang and Lin (2001b) showed that when the Gram matrix of the kernel is PD, a hard-margin SVM with such a kernel can always dichotomize the training set perfectly. [sent-281, score-0.538]
37 We obtain a similar theorem for the stump kernel. [sent-283, score-0.554]
38 We can see that every possible decision stump makes 50% of errors on the training input vectors. [sent-288, score-0.653]
39 Thus, AdaBoost and LPBoost would terminate with one bad decision stump in the ensemble. [sent-289, score-0.629]
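A tiny check of the kind of pathological case described here, using a 2-D XOR-style set of four points; the concrete points are an assumption, not necessarily the paper's example, but every decision stump misclassifies exactly half of them.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])  # XOR-style labels

def stump(X, q, d, alpha):
    return q * np.where(X[:, d] - alpha >= 0, 1, -1)

errors = []
for q in (+1, -1):
    for d in (0, 1):
        # Candidate thresholds: below, between, and above the feature values.
        for alpha in (-2.0, 0.0, 2.0):
            errors.append(np.mean(stump(X, q, d, alpha) != y))
print(set(errors))  # -> {0.5}: every stump errs on exactly half the points
```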
40 Similarly, SVM with the stump kernel cannot dichotomize this training set perfectly, regardless of the choice of C. [sent-290, score-0.845]
41 Such a problem is inherent in any ensemble model that combines decision stumps, because the model belongs to the family of generalized additive models (Hastie and Tibshirani, 1990; Hastie et al. [sent-291, score-0.573]
42 Second, although Theorem 6 indicates how the stump kernel can be used to dichotomize the training set perfectly, the classifier obtained usually overfits to noise (Keerthi and Lin, 2003). [sent-293, score-0.845]
43 We observe similar experimental results for the stump kernel (see Section 7). [sent-296, score-0.735]
44 Recall that the set of hypotheses can be partitioned into groups and traditional ensemble learning algorithms can only pick a few representatives within each group. [sent-299, score-0.616]
45 In the following theorem, we demonstrate this behavior with the stump kernel, and show how the decision stumps group together in the final ensemble classifier. [sent-308, score-1.231]
46 Theorem 8 indicates that the infinite ensemble of decision stumps produced by our framework is equivalent to a finite ensemble of data-dependent and smoother variants. [sent-317, score-1.209]
47 Then, Theorem 8 indicates that an infinite ensemble of decision stumps can be obtained by fitting an additive model of finite size using these special splines as the bases. [sent-320, score-0.71]
48 On the other hand, our SVM-based framework with the stump kernel can be thought of as a route to solve this special spline fitting problem efficiently via the kernel trick. [sent-323, score-0.983]
49 As shown in the proof of Theorem 8, the averaged stump ŝ_{q,d,a} represents the group of ambiguous decision stumps with αd between (x̃)_{d,a} and (x̃)_{d,a+1}. [sent-324, score-0.766]
50 Traditional ensemble learning algorithms like AdaBoost or LPBoost rely on a base learner to choose one decision stump as the only representative within each group, and the base learner usually returns the middle stump m_{q,d,a}. [sent-326, score-1.715]
51 As shown in Figure 2, the threshold of the middle stump is at the mean of (x̃)_{d,a} and (x̃)_{d,a+1}. [sent-327, score-0.521]
52 Even though each decision stump only has an infinitesimal hypothesis weight, the averaged stump ŝ_{q,d,a} has a concrete weight in the ensemble. [sent-329, score-1.196]
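A small sketch of the "middle stump" representatives mentioned above: for each feature, the candidate thresholds a base learner would typically consider are the midpoints of consecutive sorted distinct feature values. The function name and data are illustrative.

```python
import numpy as np

def middle_stump_thresholds(X, d):
    """Midpoints of consecutive sorted distinct values of feature d.

    Each midpoint is the threshold of the 'middle stump' representing one
    group of mutually ambiguous decision stumps on the training set.
    """
    vals = np.unique(X[:, d])            # sorted distinct values (x~)_{d,a}
    return (vals[:-1] + vals[1:]) / 2.0  # means of neighboring values

X = np.array([[0.1, 3.0], [0.4, 1.0], [0.9, 1.0]])
print(middle_stump_thresholds(X, d=0))  # -> [0.25 0.65]
```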
53 Perceptron Kernel: In this section, we extend the stump kernel to the perceptron kernel, which embodies infinitely many perceptrons. [sent-331, score-1.033]
54 Thus, the perceptron kernel KP defined below can be used in Algorithm 1 to obtain an infinite ensemble of perceptrons. [sent-340, score-0.846]
55 With the perceptron kernel, we can construct an infinite ensemble of perceptrons. [sent-344, score-0.632]
56 The perceptron kernel shares many properties with the stump kernel. [sent-350, score-0.902]
57 Second, SVM with the perceptron kernel can also dichotomize the training set perfectly, which comes from the usefulness of the simplified perceptron kernel K̃P in interpolation. [sent-353, score-0.872]
58 Recall that we construct the perceptron kernel (and the stump kernel) with an embedding constant rP (and rS), and from Definition 1, multiplying the constant by √γ for γ > 0 is equivalent to scaling the Gram matrix K by γ. [sent-363, score-0.964]
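A sketch of the simplified perceptron kernel, which (like the simplified stump kernel) reduces to a negated norm distance, here the ℓ2 distance; the scaling remark above then corresponds to multiplying the Gram matrix by γ. The helper names and data are assumptions.

```python
import numpy as np

def simplified_perceptron_kernel(X1, X2):
    """Simplified perceptron kernel: K~_P(x, x') = -||x - x'||_2."""
    diffs = X1[:, None, :] - X2[None, :, :]
    return -np.sqrt((diffs ** 2).sum(axis=2))

X = np.random.RandomState(0).randn(5, 3)
K = simplified_perceptron_kernel(X, X)

# Scaling the embedding constant by sqrt(gamma) scales the Gram matrix by gamma.
gamma = 2.0
K_scaled = gamma * K
```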
59 In other words, if we use K(x, x′) in Algorithm 1, we could obtain an ensemble classifier over H1 ∪ H2 when the union is negation complete and contains a constant hypothesis. [sent-376, score-0.532]
60 In traditional ensemble learning, when multiple sets of hypotheses are considered altogether, it is usually necessary to call a base learner for each set. [sent-377, score-0.666]
61 In fact, as shown in the next theorem, our framework can be applied to work with any countable sets of hypotheses, which may not be an easy task for traditional ensemble learning algorithms. [sent-379, score-0.526]
62 If K(x, x′) exists for all x, x′ ∈ X, and H = ∪_{j=1}^J H_j is negation complete and contains a constant hypothesis, then Algorithm 1 using K(x, x′) outputs an ensemble classifier over H. [sent-388, score-0.532]
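A sketch of the summation rule: adding the Gram matrices of the (simplified) stump and perceptron kernels yields a kernel whose ensemble ranges over the union of decision stumps and perceptrons. The data and C are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def l1_dist(X1, X2):
    return np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)

def l2_dist(X1, X2):
    return np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))

X = np.random.RandomState(1).randn(20, 4)
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)

# Summing the simplified stump and perceptron Gram matrices gives an
# ensemble over the union of decision stumps and perceptrons.
K_union = (-l1_dist(X, X)) + (-l2_dist(X, X))
clf = SVC(kernel="precomputed", C=1.0).fit(K_union, y)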
63 Stump Region Kernel, Decision Tree Kernel, and Laplacian-RBF Kernel: Next, we use the stump kernel to demonstrate the usefulness of summation and multiplication. [sent-422, score-0.768]
64 Definition 17: The L-level stump region kernel KT_L is recursively defined by KT_1(x, x′) = KS(x, x′) + ΔS with Δ_1 = 2ΔS, and KT_{L+1}(x, x′) = (KT_L(x, x′) + Δ_L)(KS(x, x′) + ΔS) with Δ_{L+1} = 2Δ_LΔS, for L ∈ N. [sent-425, score-0.735]
65 Then, by applying Theorem 13, we obtain an ensemble classifier over stump regions of any level. [sent-431, score-0.986]
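A sketch of the recursion in Definition 17 as reconstructed above; the product form of the recursive step is inferred from Δ_{L+1} = 2Δ_LΔS and should be checked against the original paper. The function operates elementwise on Gram matrices.

```python
import numpy as np

def stump_region_kernel(K_S, delta_S, L):
    """L-level stump region kernel built from a stump kernel Gram matrix K_S.

    Follows the reconstructed recursion KT_1 = K_S + Delta_S and
    KT_{L+1} = (KT_L + Delta_L) * (K_S + Delta_S) with
    Delta_{L+1} = 2 * Delta_L * Delta_S (elementwise products of Gram matrices).
    """
    K_T = K_S + delta_S
    delta_L = 2.0 * delta_S
    for _ in range(L - 1):
        K_T = (K_T + delta_L) * (K_S + delta_S)
        delta_L = 2.0 * delta_L * delta_S
    return K_T
```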
66 The set of stump regions of any level contains all AND/OR combinations of decision stumps. [sent-444, score-0.71]
67 It is not hard to see that every stump region can be represented by recursive axis-parallel partitions that output {−1, +1}, that is, a decision tree (Quinlan, 1986; Hastie et al. [sent-445, score-0.665]
68 By recursively replacing each root node condition with a decision stump, we see that every decision tree can be represented as a stump region hypothesis. [sent-448, score-0.773]
69 Thus, the set T that contains stump regions of any level is the same as the set of all possible decision trees, which leads to the name decision tree kernel. [sent-449, score-0.773]
70 Decision trees are popular for ensemble learning, but traditional algorithms can only deal with trees of finite levels (Breiman, 1999; Dietterich, 2000). [sent-450, score-0.562]
71 On the other hand, when the decision tree kernel KT is plugged into our framework, it allows us to actually build an infinite ensemble over decision trees of arbitrary levels. [sent-451, score-0.966]
72 This result is a novel interpretation of the Laplacian-RBF kernel: under suitable parameters, SVM with the Laplacian-RBF kernel allows us to obtain an infinite ensemble classifier over decision trees of any level. [sent-455, score-0.822]
73 Then, the Laplacian-RBF kernel and the decision tree kernel could be used to dichotomize the training set perfectly. [sent-457, score-0.682]
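A sketch contrasting the two RBF kernels discussed here: the Laplacian-RBF kernel exp(−γ‖x − x′‖1), which carries the decision-tree interpretation, versus the Gaussian-RBF kernel exp(−γ‖x − x′‖2²); the value of γ and the data are placeholders.

```python
import numpy as np

def laplacian_rbf(X1, X2, gamma=1.0):
    """Laplacian-RBF kernel: exp(-gamma * ||x - x'||_1)."""
    d1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)
    return np.exp(-gamma * d1)

def gaussian_rbf(X1, X2, gamma=1.0):
    """Gaussian-RBF kernel: exp(-gamma * ||x - x'||_2^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

X = np.random.RandomState(2).randn(4, 3)
print(laplacian_rbf(X, X, gamma=0.5).round(3))
print(gaussian_rbf(X, X, gamma=0.5).round(3))
```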
74 Discussion on Radial Basis Function Kernels: Note that the stump kernel, the perceptron kernel, the Laplacian-RBF kernel, and the Gaussian-RBF kernel are all radial basis functions. [sent-459, score-0.933]
75 They can all be used to dichotomize the training set perfectly under mild conditions, while the first three connect to explanations from an ensemble perspective. [sent-460, score-0.575]
76 The stump kernel and the Laplacian-RBF kernel deal with the ℓ1-norm distance between input vectors, while the others work on the ℓ2-norm distance. [sent-463, score-0.949]
77 From the construction of the perceptron kernel, we can see how the rotation invariance is obtained from an ensemble point-of-view. [sent-465, score-0.655]
78 We use the name decision tree kernel for KT in Theorem 18 because the kernel embodies an infinite number of decision tree “hypotheses” and can be used in our framework to construct an infinite ensemble of decision trees. [sent-468, score-1.454]
79 Note that the techniques in Theorem 18 can be coupled with Theorem 14 to show that the Laplacian-RBF kernel with any γ > 0 embodies XOR stump regions (a special type of decision tree) of any level. [sent-471, score-0.974]
80 We emphasize the AND-OR stump regions here to connect better to general decision trees. [sent-472, score-0.629]
81 We have also discussed some image recognition applications using the histogram intersection kernel, which is equivalent to the stump kernel, on histogram-based features. [sent-478, score-0.55]
82 Gene expression analysis, as demonstrated by Lin and Li (2005b), is another area where the stump kernel could be helpful. [sent-479, score-0.735]
83 The simplified stump kernel and the simplified perceptron kernel are scale-invariant, which means that C is the only parameter that needs to be determined. [sent-481, score-1.116]
84 Thus, SVM with the simplified stump kernel or the simplified perceptron kernel enjoys an advantage on speed during parameter selection. [sent-483, score-1.157]
85 Thus, SVM applications that consider speed as an important factor may benefit from using the simplified stump kernel or the simplified perceptron kernel. [sent-486, score-0.902]
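A sketch of why parameter selection is cheaper with the scale-invariant kernels: SVM-Stump or SVM-Perc only needs a search over C, whereas the Gaussian-RBF kernel needs a full (C, γ) grid; the grids, data, and fold count are assumptions. The smaller grid is what makes parameter selection faster in practice.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(3)
X = rng.randn(60, 5)
y = np.where(X[:, 0] + 0.5 * X[:, 1] >= 0, 1, -1)

# Simplified stump kernel: only C needs to be tuned.
K = -np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
stump_search = GridSearchCV(SVC(kernel="precomputed"),
                            {"C": [0.1, 1, 10, 100]}, cv=3).fit(K, y)

# Gaussian-RBF kernel: a full (C, gamma) grid is needed.
rbf_search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [0.1, 1, 10, 100],
                           "gamma": [0.01, 0.1, 1, 10]}, cv=3).fit(X, y)

print(len(stump_search.cv_results_["params"]),   # 4 candidates
      len(rbf_search.cv_results_["params"]))     # 16 candidates
```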
86 Experiments: We first compare our SVM-based infinite ensemble learning framework with AdaBoost and LPBoost using decision stumps, perceptrons, or decision trees as the base hypothesis set. [sent-488, score-0.846]
87 The simplified stump kernel (SVM-Stump), the simplified perceptron kernel (SVM-Perc), and the Laplacian-RBF kernel (SVM-Dec) are plugged into Algorithm 1 respectively. [sent-489, score-1.33]
88 The deterministic decision stump algorithm (Holte, 1993), the random coordinate descent perceptron algorithm (Li and Lin, 2007), and the C4.5 decision tree algorithm. [sent-491, score-0.796]
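A sketch of how a plugged-in kernel is evaluated on held-out data: the test Gram matrix is rectangular, with one row per test point and one column per training point. The data, split, and C are placeholders, and the baselines in the tables are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def simplified_stump_kernel(A, B):
    return -np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

rng = np.random.RandomState(4)
X = rng.randn(100, 6)
y = np.where(X[:, 0] - X[:, 2] >= 0, 1, -1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(simplified_stump_kernel(X_tr, X_tr), y_tr)               # (n_tr, n_tr)
test_acc = clf.score(simplified_stump_kernel(X_te, X_tr), y_te)  # (n_te, n_tr)
print(f"test accuracy: {test_acc:.3f}")
```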
89 Table 2: Test error (%) of several ensemble learning algorithms using decision stumps, over the data sets twonorm, twonorm-n, threenorm, threenorm-n, ringnorm, ringnorm-n, australian, breast, german, heart, ionosphere, pima, sonar, votes84, a1a, splice, svmguide1, and w1a. [sent-671, score-1.165]
90 Table 4: Test error (%) of several ensemble learning algorithms using decision trees. [sent-863, score-0.608]
91 Comparison of Ensemble Learning Algorithms: Tables 2, 3, and 4 show the test performance of several ensemble learning algorithms on different base hypothesis sets. [sent-864, score-0.561]
92 Interestingly, SVM-Mid often performs better than AdaBoost-Stump, which means that a nonsparse ensemble introduced by minimizing the ℓ2-norm of w is better than a sparse one. [sent-871, score-0.523]
93 SVM-Stump obtains the smooth boundary by averaging over infinitely many decision stumps; SVM-Mid can also generate a smooth boundary by constructing a nonsparse ensemble over a finite number of decision stumps. [sent-878, score-0.739]
94 Note that traditional ensemble learning and our SVM-based framework differ in the concept of sparsity. [sent-881, score-0.526]
95 As discussed earlier, traditional ensemble learning prefers sparse ensemble classifiers, that is, ensembles that include a small number of hypotheses. [sent-883, score-0.994]
96 Nevertheless, our experimental results indicate that sparse ensemble classifiers are sometimes not sophisticated enough in practice, especially when the base hypothesis set is simple. [sent-887, score-0.561]
97 Furthermore, SVM-Stump, albeit slightly worse than SVM-Perc or SVM-Gauss, could still be a useful alternative in some applications where decision stump ensemble models are preferred, such as those described in Subsection 6. [sent-1051, score-1.094]
98 Conclusion: We derived two novel kernels based on the infinite ensemble learning framework. [sent-1054, score-0.55]
99 The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. [sent-1055, score-1.486]
100 This property makes the perceptron kernel a favorable choice over the Gaussian-RBF kernel in practice. [sent-1062, score-0.595]
wordName wordTfidf (topN-words)
[('stump', 0.521), ('ensemble', 0.465), ('kernel', 0.214), ('lpboost', 0.173), ('perceptron', 0.167), ('adaboost', 0.14), ('ks', 0.137), ('stumps', 0.137), ('embodies', 0.131), ('achinery', 0.125), ('nfinite', 0.125), ('nsemble', 0.125), ('hypotheses', 0.124), ('kh', 0.124), ('svm', 0.116), ('kp', 0.115), ('decision', 0.108), ('ector', 0.095), ('upport', 0.095), ('dichotomize', 0.086), ('kernels', 0.085), ('boosting', 0.08), ('ktl', 0.077), ('ld', 0.074), ('wt', 0.07), ('threenorm', 0.067), ('negation', 0.067), ('demiriz', 0.065), ('ht', 0.063), ('nitely', 0.063), ('tsch', 0.059), ('perceptrons', 0.058), ('nonsparse', 0.058), ('gram', 0.058), ('ling', 0.057), ('twonorm', 0.056), ('lin', 0.055), ('rosset', 0.053), ('nite', 0.051), ('base', 0.05), ('hypothesis', 0.046), ('kt', 0.044), ('rp', 0.044), ('er', 0.043), ('earning', 0.042), ('enjoys', 0.041), ('freund', 0.041), ('breast', 0.04), ('german', 0.04), ('ionosphere', 0.04), ('ringnorm', 0.04), ('rd', 0.04), ('embedding', 0.039), ('neutrality', 0.038), ('psd', 0.038), ('australian', 0.037), ('ensembles', 0.037), ('sonar', 0.037), ('sign', 0.037), ('ambiguous', 0.036), ('tree', 0.036), ('splice', 0.035), ('schapire', 0.035), ('trees', 0.035), ('framework', 0.034), ('neutral', 0.033), ('theorem', 0.033), ('pima', 0.033), ('summation', 0.033), ('simpli', 0.033), ('hettich', 0.033), ('niteness', 0.031), ('radial', 0.031), ('heart', 0.03), ('subsection', 0.03), ('hsu', 0.029), ('caltech', 0.029), ('gunnar', 0.029), ('histogram', 0.029), ('boughorbel', 0.029), ('chang', 0.029), ('classi', 0.028), ('logic', 0.028), ('rs', 0.028), ('traditional', 0.027), ('ad', 0.027), ('robert', 0.025), ('xor', 0.025), ('meir', 0.024), ('reed', 0.024), ('training', 0.024), ('xi', 0.024), ('smola', 0.024), ('scaling', 0.023), ('intercept', 0.023), ('trevor', 0.023), ('keerthi', 0.023), ('rotation', 0.023), ('sch', 0.023), ('ranges', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
2 0.27920818 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
Author: Kristin P. Bennett
Abstract: unkown-abstract
3 0.1663208 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
Author: David Mease, Abraham Wyner
Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost
4 0.12965958 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
5 0.10188253 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.097476304 92 jmlr-2008-Universal Multi-Task Kernels
7 0.079235457 86 jmlr-2008-SimpleMKL
8 0.077915132 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
9 0.073824897 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
10 0.071407042 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.070674546 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
12 0.062363684 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
13 0.055232868 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
14 0.054621387 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
15 0.05281505 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
16 0.048995495 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
17 0.0443356 58 jmlr-2008-Max-margin Classification of Data with Absent Features
18 0.044241276 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.043721758 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
20 0.043287959 54 jmlr-2008-Learning to Select Features using their Properties
topicId topicWeight
[(0, 0.235), (1, -0.144), (2, 0.494), (3, -0.138), (4, 0.137), (5, 0.009), (6, -0.24), (7, 0.027), (8, 0.005), (9, 0.082), (10, 0.03), (11, -0.03), (12, 0.089), (13, -0.011), (14, -0.018), (15, 0.05), (16, -0.023), (17, 0.046), (18, 0.021), (19, 0.004), (20, 0.013), (21, -0.025), (22, -0.029), (23, -0.051), (24, 0.023), (25, -0.09), (26, 0.032), (27, 0.042), (28, -0.084), (29, 0.019), (30, -0.017), (31, -0.011), (32, 0.016), (33, 0.007), (34, 0.065), (35, 0.006), (36, 0.032), (37, 0.037), (38, 0.042), (39, -0.037), (40, -0.014), (41, 0.036), (42, 0.01), (43, 0.032), (44, 0.026), (45, 0.004), (46, 0.015), (47, -0.004), (48, 0.033), (49, -0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.91426039 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
2 0.76731247 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
3 0.71744704 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
4 0.47469842 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
5 0.44475329 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.39674902 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
7 0.3711333 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.37033471 86 jmlr-2008-SimpleMKL
9 0.36475831 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
10 0.33876982 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.31917125 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
12 0.30810204 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
13 0.25530642 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
14 0.23578405 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
15 0.23275608 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
16 0.22906061 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
17 0.20419486 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.20044547 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
19 0.19932766 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
20 0.18771021 58 jmlr-2008-Max-margin Classification of Data with Absent Features
topicId topicWeight
[(0, 0.042), (5, 0.02), (31, 0.016), (34, 0.312), (40, 0.077), (54, 0.075), (58, 0.031), (66, 0.071), (76, 0.021), (78, 0.021), (88, 0.113), (92, 0.029), (94, 0.055), (99, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.77315235 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
2 0.4826746 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.47478157 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
4 0.47470546 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
5 0.47014293 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
6 0.46561426 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
7 0.46536049 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
8 0.46432954 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.46402696 9 jmlr-2008-Active Learning by Spherical Subdivision
10 0.46088284 86 jmlr-2008-SimpleMKL
11 0.46080571 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
12 0.46050435 96 jmlr-2008-Visualizing Data using t-SNE
13 0.45777261 83 jmlr-2008-Robust Submodular Observation Selection
14 0.45689517 56 jmlr-2008-Magic Moments for Structured Output Prediction
15 0.4557232 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.45397273 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
17 0.45154083 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.44654688 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.44353259 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
20 0.44345805 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers