jmlr jmlr2008 jmlr2008-89 knowledge-graph by maker-knowledge-mining

89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning


Source: pdf

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. [sent-9, score-0.72]

2 We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. [sent-10, score-0.936]

3 The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. [sent-11, score-1.486]

4 We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. [sent-12, score-0.949]

5 In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. [sent-14, score-1.188]

6 Keywords: ensemble learning, boosting, support vector machine, kernel 1. [sent-16, score-0.679]

7 Some theories show that the finiteness places a restriction on the capacity of the ensemble (Freund and Schapire, 1997), and some theories suggest that the performance of boosting can be linked to its asymptotic behavior when the ensemble is allowed to be of an infinite size (Rätsch et al. [sent-20, score-1.069]

8 Furthermore, the framework allows us to compare SVM and ensemble learning algorithms in a fair manner using the same base hypothesis set. [sent-29, score-0.595]

9 Based on the framework, we derive two novel SVM kernels, the stump kernel and the perceptron kernel, from an ensemble learning perspective (Lin and Li, 2005a). [sent-30, score-1.367]

10 The stump kernel embodies infinitely many decision stumps, and as a consequence measures the similarity between examples by the ℓ1-norm distance. [sent-31, score-0.974]

11 While there exist similar kernels in the literature, our derivation from an ensemble learning perspective is nevertheless original. [sent-33, score-0.55]

12 Experimental results show that SVM with these kernels is superior to successful ensemble learning algorithms with the same base hypothesis set. [sent-35, score-0.646]

13 These results reveal some weakness in traditional ensemble learning algorithms, and help understand both SVM and ensemble learning better. [sent-36, score-0.957]

14 Experimentally, SVM with the Laplacian-RBF kernel performs better than ensemble learning algorithms with decision trees. [sent-40, score-0.787]

15 In addition, our derivation from an ensemble learning perspective helps to explain the success of the kernel on some specific applications (Chapelle et al. [sent-41, score-0.679]

16 We then derive the stump kernel in Section 4, the perceptron kernel in Section 5, and the Laplacian-RBF kernel in Section 6. [sent-46, score-1.33]

17 Adaptive Boosting and Linear Programming Boosting: The adaptive boosting (AdaBoost) algorithm (Freund and Schapire, 1996) is perhaps the most popular and successful approach for ensemble learning. [sent-80, score-0.545]

18 For a given integer T and a hypothesis set H, AdaBoost iteratively selects T hypotheses ht ∈ H and weights wt ≥ 0 to construct an ensemble classifier $g_T(x) = \operatorname{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right)$. [sent-81, score-0.938]
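
As a minimal sketch of the finite ensemble classifier above (the base hypotheses `hs` and the weights are illustrative placeholders, not the paper's learned values), the prediction is just the sign of the weighted vote:

```python
def ensemble_predict(x, hypotheses, weights):
    """Finite ensemble classifier g_T(x) = sign(sum_t w_t * h_t(x))."""
    vote = sum(w * h(x) for w, h in zip(weights, hypotheses))
    return 1.0 if vote >= 0 else -1.0  # ties broken toward +1 by convention

# Illustrative base hypotheses: three stumps on feature 0 with toy weights.
hs = [lambda x, a=a: 1.0 if x[0] > a else -1.0 for a in (0.1, 0.4, 0.7)]
print(ensemble_predict([0.5], hs, [0.5, 0.3, 0.2]))  # -> 1.0
```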

19 In Rätsch et al. (2001), it is shown that when T → ∞, AdaBoost asymptotically approximates an infinite ensemble classifier $g_\infty(x) = \operatorname{sign}\left(\sum_{t=1}^{\infty} w_t h_t(x)\right)$ (2), such that (w, h) is an optimal solution to $(P_3)\colon \min_{w_t \in \mathbb{R},\, h_t \in \mathcal{H}} \sum_{t=1}^{\infty} w_t$ s.t. … [sent-84, score-0.775]

20 We will introduce the soft-margin LPBoost, which constructs an ensemble classifier like (2) with the optimal solution to $(P_4)\colon \min_{w_t \in \mathbb{R},\, h_t \in \mathcal{H}} \dots$ s.t. … [sent-106, score-0.535]

21 The algorithm works by adding one unambiguous ht to the ensemble in each iteration. [sent-120, score-0.528]

22 Note that AdaBoost and LPBoost require wt ≥ 0 for ensemble learning. [sent-157, score-0.535]

23 SVM-Based Framework for Infinite Ensemble Learning: Vapnik (1998) proposed a challenging task of designing an algorithm that actually generates an infinite ensemble classifier, that is, an ensemble classifier with infinitely many nonzero wt. [sent-169, score-1.0]

24 One is to actually derive the kernel, and the other is to handle the constraints wt ≥ 0 to make (1) an ensemble classifier. [sent-175, score-0.535]

25 Stump Kernel: In this section, we present the stump kernel, which embodies infinitely many decision stumps. [sent-229, score-0.76]

26 The decision stump $s_{q,d,\alpha}(x) = q \cdot \operatorname{sign}\big((x)_d - \alpha\big)$ works on the d-th element of x, and classifies x according to q ∈ {−1, +1} and the threshold α (Holte, 1993). [sent-230, score-0.666]
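
Sentence 26 defines the base hypothesis itself; a minimal sketch follows (the tie-breaking convention sign(0) = +1 is our assumption, chosen only to keep the output binary):

```python
def decision_stump(x, q, d, alpha):
    """Decision stump s_{q,d,alpha}(x) = q * sign((x)_d - alpha), with q in {-1, +1}."""
    return q * (1.0 if x[d] >= alpha else -1.0)

# Example: direction q = +1, feature d = 2, threshold alpha = 0.5.
print(decision_stump([0.1, -0.3, 0.9], q=+1, d=2, alpha=0.5))  # -> 1.0
```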

27 Formulation and Properties: To construct the stump kernel, we consider the following set of decision stumps $\mathcal{S} = \{\, s_{q,d,\alpha_d} : q \in \{-1,+1\},\ d \in \{1, \dots, D\},\ \alpha_d \in [L_d, R_d] \,\}$. [sent-245, score-0.766]

28 The stump kernel KS defined below can then be used in Algorithm 1 to obtain an infinite ensemble of decision stumps. [sent-251, score-1.308]

29 Definition 3: The stump kernel is $K_{\mathcal{S}}$ with $r(q, d, \alpha_d) = r_{\mathcal{S}} = \tfrac{1}{2}$, $K_{\mathcal{S}}(x, x') = \Delta_{\mathcal{S}} - \sum_{d=1}^{D} \lvert (x)_d - (x')_d \rvert = \Delta_{\mathcal{S}} - \lVert x - x' \rVert_1$, where $\Delta_{\mathcal{S}} = \tfrac{1}{2} \sum_{d=1}^{D} (R_d - L_d)$ is a constant. [sent-252, score-0.735]

30 The validity of the stump kernel follows directly from Theorem 2 of the general framework. [sent-257, score-0.735]

31 That is, the stump kernel is an inner product in a Hilbert space of some square integrable functions ϕ(q, d, αd), and it produces a PSD Gram matrix for any set of input vectors $\{x_i\}_{i=1}^{N} \in \mathcal{X}^N$. [sent-258, score-0.735]

32 Given the ranges (Ld, Rd), the stump kernel is very simple to compute. [sent-259, score-0.757]

33 Theorem 4: Solving (P2) with the stump kernel KS is the same as solving (P2) with the simplified stump kernel $\tilde{K}_{\mathcal{S}}(x, x') = -\lVert x - x' \rVert_1$. [sent-261, score-1.47]
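
A sketch of the simplified stump kernel from Theorem 4 as a Gram-matrix routine (the function name and the NumPy broadcasting approach are ours, not the paper's code):

```python
import numpy as np

def simplified_stump_kernel(X, Z=None):
    """Gram matrix of the simplified stump kernel: K(x, x') = -||x - x'||_1."""
    Z = X if Z is None else Z
    # Pairwise L1 distances via broadcasting: result has shape (len(X), len(Z)).
    return -np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)

X = np.random.rand(5, 3)
print(simplified_stump_kernel(X).shape)  # (5, 5), with zeros on the diagonal
```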

34 The equivalence demonstrates the usefulness of the stump kernel on histogram-based features, which will be further discussed in Subsection 6. [sent-273, score-0.735]

35 The simplified stump kernel is simple to compute, yet useful in the sense of dichotomizing the training set, which comes from the following positive definite (PD) property. [sent-277, score-0.759]

36 Chang and Lin (2001b) showed that when the Gram matrix of the kernel is PD, a hard-margin SVM with such a kernel can always dichotomize the training set perfectly. [sent-281, score-0.538]

37 We obtain a similar theorem for the stump kernel. [sent-283, score-0.554]

38 We can see that every possible decision stump errs on 50% of the training input vectors. [sent-288, score-0.653]

39 Thus, AdaBoost and LPBoost would terminate with one bad decision stump in the ensemble. [sent-289, score-0.629]

40 Similarly, SVM with the stump kernel cannot dichotomize this training set perfectly, regardless of the choice of C. [sent-290, score-0.845]

41 Such a problem is inherent in any ensemble model that combines decision stumps, because the model belongs to the family of generalized additive models (Hastie and Tibshirani, 1990; Hastie et al. [sent-291, score-0.573]

42 Second, although Theorem 6 indicates how the stump kernel can be used to dichotomize the training set perfectly, the classifier obtained usually overfits to noise (Keerthi and Lin, 2003). [sent-293, score-0.845]

43 We observe similar experimental results for the stump kernel (see Section 7). [sent-296, score-0.735]

44 Recall that the set of hypotheses can be partitioned into groups and that traditional ensemble learning algorithms can only pick a few representatives within each group. [sent-299, score-0.616]

45 In the following theorem, we demonstrate this behavior with the stump kernel, and show how the decision stumps group together in the final ensemble classifier. [sent-308, score-1.231]

46 Theorem 8 indicates that the infinite ensemble of decision stumps produced by our framework is equivalent to a finite ensemble of data-dependent and smoother variants. [sent-317, score-1.209]

47 Then, Theorem 8 indicates that an infinite ensemble of decision stumps can be obtained by fitting an additive model of finite size using these special splines as the bases. [sent-320, score-0.71]

48 On the other hand, our SVM-based framework with the stump kernel can be thought of as a route to solve this special spline fitting problem efficiently via the kernel trick. [sent-323, score-0.983]

49 As shown in the proof of Theorem 8, the averaged stump $\hat{s}_{q,d,a}$ represents the group of ambiguous decision stumps with $\alpha_d \in \big((x)_{d,a}, (x)_{d,a+1}\big)$. [sent-324, score-0.766]

50 Traditional ensemble learning algorithms like AdaBoost or LPBoost rely on a base learner to choose one decision stump as the only representative within each group, and the base learner usually returns the middle stump $m_{q,d,a}$. [sent-326, score-1.715]

51 As shown in Figure 2, the threshold of the middle stump is at the mean of $(x)_{d,a}$ and $(x)_{d,a+1}$. [sent-327, score-0.521]

52 Even though each decision stump only has an infinitesimal hypothesis weight, the averaged stump $\hat{s}_{q,d,a}$ has a concrete weight in the ensemble. [sent-329, score-1.196]
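
The contrast between the middle stump and the averaged stump can be sketched as follows; the linear-ramp form of the averaged stump is our reading of the smoother variants in Theorem 8 (averaging stumps with thresholds spread over the interval), not a formula quoted from the paper:

```python
def middle_stump(xd, q, lo, hi):
    """Middle stump m_{q,d,a}: a hard step at the midpoint of (lo, hi)."""
    return q if xd >= (lo + hi) / 2.0 else -q

def averaged_stump(xd, q, lo, hi):
    """Averaged stump: the uniform average of stumps with thresholds in (lo, hi),
    ramping linearly from -q at lo to +q at hi (our reading, see lead-in)."""
    t = min(max((xd - lo) / (hi - lo), 0.0), 1.0)
    return q * (2.0 * t - 1.0)

# Between two adjacent training values lo=0.2 and hi=0.8 the averaged stump
# interpolates, while the middle stump jumps at 0.5.
print(middle_stump(0.35, +1, 0.2, 0.8), averaged_stump(0.35, +1, 0.2, 0.8))  # -1  -0.5
```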

53 Perceptron Kernel: In this section, we extend the stump kernel to the perceptron kernel, which embodies infinitely many perceptrons. [sent-331, score-1.033]

54 Thus, the perceptron kernel KP defined below can be used in Algorithm 1 to obtain an infinite ensemble of perceptrons. [sent-340, score-0.846]

55 With the perceptron kernel, we can construct an infinite ensemble of perceptrons. [sent-344, score-0.632]

56 The perceptron kernel shares many properties with the stump kernel. [sent-350, score-0.902]

57 Second, SVM with the perceptron kernel can also dichotomize the training set perfectly, which comes from the usefulness of the simplified perceptron kernel $\tilde{K}_P$ in interpolation. [sent-353, score-0.872]
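
The simplified perceptron kernel is the ℓ2 analogue of the simplified stump kernel; the sketch below assumes the form $\tilde{K}_P(x, x') = -\lVert x - x' \rVert_2$, which is our reading of the construction (the embedding constants of the full kernel KP are not reproduced here):

```python
import numpy as np

def simplified_perceptron_kernel(X, Z=None):
    """Gram matrix of the simplified perceptron kernel: K(x, x') = -||x - x'||_2."""
    Z = X if Z is None else Z
    diff = X[:, None, :] - Z[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=2))
```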

58 Recall that we construct the perceptron kernel (and the stump kernel) with an embedding constant rP (and rS), and from Definition 1, multiplying the constant by $\sqrt{\gamma} > 0$ is equivalent to scaling the Gram matrix K by γ. [sent-363, score-0.964]

59 In other words, if we use K(x, x') in Algorithm 1, we could obtain an ensemble classifier over H1 ∪ H2 when the union is negation complete and contains a constant hypothesis. [sent-376, score-0.532]

60 In traditional ensemble learning, when multiple sets of hypotheses are considered altogether, it is usually necessary to call a base learner for each set. [sent-377, score-0.666]

61 In fact, as shown in the next theorem, our framework can be applied to work with any countable sets of hypotheses, which may not be an easy task for traditional ensemble learning algorithms. [sent-379, score-0.526]

62 If K(x, x') exists for all x, x' ∈ X, and $\mathcal{H} = \bigcup_{j=1}^{J} \mathcal{H}_j$ is negation complete and contains a constant hypothesis, Algorithm 1 using K(x, x') outputs an ensemble classifier over $\mathcal{H}$. [sent-388, score-0.532]
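
In code, combining several hypothesis sets amounts to summing their Gram matrices before handing the result to the SVM solver; the sketch below reuses the simplified kernels sketched earlier, and the scaling constants stand in for the embedding constants of Definition 1 (illustrative values only):

```python
def combined_kernel(X, kernel_fns, scales=None):
    """Summed Gram matrix K = sum_j c_j * K_j(X, X) over several hypothesis sets."""
    scales = [1.0] * len(kernel_fns) if scales is None else scales
    return sum(c * k(X) for c, k in zip(scales, kernel_fns))

# e.g., an ensemble over decision stumps and perceptrons together:
# K = combined_kernel(X, [simplified_stump_kernel, simplified_perceptron_kernel])
```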

63 Stump Region Kernel, Decision Tree Kernel, and Laplacian-RBF Kernel: Next, we use the stump kernel to demonstrate the usefulness of summation and multiplication. [sent-422, score-0.768]

64 Definition 17: The L-level stump region kernel $K_{\mathcal{T}_L}$ is recursively defined by $K_{\mathcal{T}_1}(x, x') = K_{\mathcal{S}}(x, x') + \Delta_{\mathcal{S}}$, $\Delta_1 = 2\Delta_{\mathcal{S}}$, and $K_{\mathcal{T}_{L+1}}(x, x') = \big(K_{\mathcal{T}_L}(x, x') + \Delta_L\big)\big(K_{\mathcal{S}}(x, x') + \Delta_{\mathcal{S}}\big)$, $\Delta_{L+1} = 2\Delta_L \Delta_{\mathcal{S}}$, for L ∈ ℕ. [sent-425, score-0.735]
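
A sketch of the recursion in Definition 17, under our reading that each level multiplies the shifted kernels elementwise (the parentheses and the product were lost in the extraction above, so treat this as a reconstruction rather than the paper's exact statement):

```python
def stump_region_kernel(X, L, delta_S, stump_kernel_fn):
    """L-level stump region kernel K_{T_L}:
    K_{T_1} = K_S + Delta_S with Delta_1 = 2*Delta_S, and
    K_{T_{L+1}} = (K_{T_L} + Delta_L) * (K_S + Delta_S) with Delta_{L+1} = 2*Delta_L*Delta_S.
    The product is taken elementwise on the Gram matrices."""
    KS = stump_kernel_fn(X)          # Gram matrix of the stump kernel K_S
    K, delta = KS + delta_S, 2.0 * delta_S
    for _ in range(L - 1):
        K = (K + delta) * (KS + delta_S)
        delta = 2.0 * delta * delta_S
    return K
```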

65 Then, by applying Theorem 13, we obtain an ensemble classifier over stump regions of any level. [sent-431, score-0.986]

66 The set of stump regions of any level contains all AND/OR combinations of decision stumps. [sent-444, score-0.71]

67 It is not hard to see that every stump region can be represented by recursive axis-parallel partitions that output {−1, +1}, that is, a decision tree (Quinlan, 1986; Hastie et al. [sent-445, score-0.665]

68 By recursively replacing each root node condition with a decision stump, we see that every decision tree can be represented as a stump region hypothesis. [sent-448, score-0.773]

69 Thus, the set T that contains stump regions of any level is the same as the set of all possible decision trees, which leads to the name decision tree kernel. [sent-449, score-0.773]

70 Decision trees are popular for ensemble learning, but traditional algorithms can only deal with trees of finite levels (Breiman, 1999; Dietterich, 2000). [sent-450, score-0.562]

71 On the other hand, when the decision tree kernel KT is plugged into our framework, it allows us to actually build an infinite ensemble over decision trees of arbitrary levels. [sent-451, score-0.966]

72 This result is a novel interpretation of the Laplacian-RBF kernel: under suitable parameters, SVM with the Laplacian-RBF kernel allows us to obtain an infinite ensemble classifier over decision trees of any level. [sent-455, score-0.822]
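
The Laplacian-RBF kernel referred to throughout is the standard $\exp(-\gamma \lVert x - x' \rVert_1)$; a minimal sketch of its Gram matrix (function name ours):

```python
import numpy as np

def laplacian_rbf_kernel(X, Z=None, gamma=1.0):
    """Laplacian-RBF Gram matrix: K(x, x') = exp(-gamma * ||x - x'||_1)."""
    Z = X if Z is None else Z
    l1 = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)
    return np.exp(-gamma * l1)
```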

73 Then, the Laplacian-RBF kernel and the decision tree kernel could be used to dichotomize the training set perfectly. [sent-457, score-0.682]

74 Discussion on Radial Basis Function Kernels: Note that the stump kernel, the perceptron kernel, the Laplacian-RBF kernel, and the Gaussian-RBF kernel are all radial basis functions. [sent-459, score-0.933]

75 They can all be used to dichotomize the training set perfectly under mild conditions, while the first three connect to explanations from an ensemble perspective. [sent-460, score-0.575]

76 The stump kernel and the Laplacian-RBF kernel deal with the ℓ1-norm distance between input vectors, while the others work on the ℓ2-norm distance. [sent-463, score-0.949]

77 From the construction of the perceptron kernel, we can see how the rotation invariance is obtained from an ensemble point-of-view. [sent-465, score-0.655]

78 We use the name decision tree kernel for KT in Theorem 18 because the kernel embodies an infinite number of decision tree “hypotheses” and can be used in our framework to construct an infinite ensemble of decision trees. [sent-468, score-1.454]

79 Note that the techniques in Theorem 18 can be coupled with Theorem 14 to show that the Laplacian-RBF kernel with any γ > 0 embodies XOR stump regions (a special type of decision tree) of any level. [sent-471, score-0.974]

80 We emphasize the AND-OR stump regions here to connect better to general decision trees. [sent-472, score-0.629]

81 In an earlier subsection, we have also discussed some image recognition applications using the histogram intersection kernel, which is equivalent to the stump kernel, on histogram-based features. [sent-478, score-0.55]

82 Gene expression analysis, as demonstrated by Lin and Li (2005b), is another area in which the stump kernel could be helpful. [sent-479, score-0.735]

83 The simplified stump kernel and the simplified perceptron kernel are scale-invariant, which means that C is the only parameter that needs to be determined. [sent-481, score-1.116]

84 Thus, SVM with the simplified stump kernel or the simplified perceptron kernel enjoys an advantage in speed during parameter selection. [sent-483, score-1.157]

85 Thus, SVM applications that consider speed as an important factor may benefit from using the simplified stump kernel or the simplified perceptron kernel. [sent-486, score-0.902]
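
The practical payoff of scale invariance is a one-dimensional search over C instead of a grid over (C, γ); the sketch below uses scikit-learn's precomputed-kernel support purely as an illustration (the paper's experiments were run with LIBSVM, and the data here are random placeholders; `simplified_stump_kernel` is the function sketched earlier):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(60, 4)
y = np.where(X.sum(axis=1) > 2.0, 1, -1)   # toy labels
K = simplified_stump_kernel(X)              # Gram matrix from the earlier sketch

search = GridSearchCV(SVC(kernel="precomputed"),
                      param_grid={"C": np.logspace(-2, 3, 6)}, cv=5)
search.fit(K, y)   # only C is tuned; a Gaussian-RBF SVM would also need gamma
print(search.best_params_)
```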

86 Experiments: We first compare our SVM-based infinite ensemble learning framework with AdaBoost and LPBoost using decision stumps, perceptrons, or decision trees as the base hypothesis set. [sent-488, score-0.846]

87 The simplified stump kernel (SVM-Stump), the simplified perceptron kernel (SVM-Perc), and the Laplacian-RBF kernel (SVM-Dec) are plugged into Algorithm 1, respectively. [sent-489, score-1.33]
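
Plugging one of these kernels into an SVM reduces, in practice, to handing the solver a precomputed Gram matrix for training and the test-versus-training kernel values for prediction; the sketch below uses scikit-learn as a stand-in for the LIBSVM setup described in the paper, with `simplified_stump_kernel` from the earlier sketch (i.e., SVM-Stump) and toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, X_test = rng.random((80, 5)), rng.random((20, 5))
y_train = np.where(X_train.sum(axis=1) > 2.5, 1, -1)            # toy labels

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(simplified_stump_kernel(X_train), y_train)               # K(train, train)
y_pred = clf.predict(simplified_stump_kernel(X_test, X_train))   # K(test, train)
print(y_pred[:5])
```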

88 The deterministic decision stump algorithm (Holte, 1993), the random coordinate descent perceptron algorithm (Li and Lin, 2007), and the C4. [sent-491, score-0.796]

89 Table 2: Test error (%) of several ensemble learning algorithms using decision stumps. [Table entries are not reproduced in this extract; the benchmark data sets are twonorm, twonorm-n, threenorm, threenorm-n, ringnorm, ringnorm-n, australian, breast, german, heart, ionosphere, pima, sonar, votes84, a1a, splice, svmguide1, and w1a.] [sent-671, score-1.165]

90 Table 4: Test error (%) of several ensemble learning algorithms using decision trees. [sent-863, score-0.608]

91 Comparison of Ensemble Learning Algorithms: Tables 2, 3, and 4 show the test performance of several ensemble learning algorithms on different base hypothesis sets. [sent-864, score-0.561]

92 Interestingly, SVM-Mid often performs better than AdaBoost-Stump, which means that a nonsparse ensemble introduced by minimizing the ℓ2-norm of w is better than a sparse one. [sent-871, score-0.523]

93 SVM-Stump obtains the smooth boundary by averaging over infinitely many decision stumps; SVM-Mid can also generate a smooth boundary by constructing a nonsparse ensemble over a finite number of decision stumps. [sent-878, score-0.739]

94 Note that traditional ensemble learning and our SVM-based framework differ in the concept of sparsity. [sent-881, score-0.526]

95 As discussed earlier, traditional ensemble learning prefers sparse ensemble classifiers, that is, ensembles that include a small number of hypotheses. [sent-883, score-0.994]

96 Nevertheless, our experimental results indicate that sparse ensemble classifiers are sometimes not sophisticated enough in practice, especially when the base hypothesis set is simple. [sent-887, score-0.561]

97 Furthermore, SVM-Stump, albeit slightly worse than SVM-Perc or SVM-Gauss, could still be a useful alternative in some applications where decision stump ensemble models are preferred, such as those described in Subsection 6. [sent-1051, score-1.094]

98 Conclusion: We derived two novel kernels based on the infinite ensemble learning framework. [sent-1054, score-0.55]

99 The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. [sent-1055, score-1.486]

100 This property makes the perceptron kernel a favorable alternative to the Gaussian-RBF kernel in practice. [sent-1062, score-0.595]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stump', 0.521), ('ensemble', 0.465), ('kernel', 0.214), ('lpboost', 0.173), ('perceptron', 0.167), ('adaboost', 0.14), ('ks', 0.137), ('stumps', 0.137), ('embodies', 0.131), ('achinery', 0.125), ('nfinite', 0.125), ('nsemble', 0.125), ('hypotheses', 0.124), ('kh', 0.124), ('svm', 0.116), ('kp', 0.115), ('decision', 0.108), ('ector', 0.095), ('upport', 0.095), ('dichotomize', 0.086), ('kernels', 0.085), ('boosting', 0.08), ('ktl', 0.077), ('ld', 0.074), ('wt', 0.07), ('threenorm', 0.067), ('negation', 0.067), ('demiriz', 0.065), ('ht', 0.063), ('nitely', 0.063), ('tsch', 0.059), ('perceptrons', 0.058), ('nonsparse', 0.058), ('gram', 0.058), ('ling', 0.057), ('twonorm', 0.056), ('lin', 0.055), ('rosset', 0.053), ('nite', 0.051), ('base', 0.05), ('hypothesis', 0.046), ('kt', 0.044), ('rp', 0.044), ('er', 0.043), ('earning', 0.042), ('enjoys', 0.041), ('freund', 0.041), ('breast', 0.04), ('german', 0.04), ('ionosphere', 0.04), ('ringnorm', 0.04), ('rd', 0.04), ('embedding', 0.039), ('neutrality', 0.038), ('psd', 0.038), ('australian', 0.037), ('ensembles', 0.037), ('sonar', 0.037), ('sign', 0.037), ('ambiguous', 0.036), ('tree', 0.036), ('splice', 0.035), ('schapire', 0.035), ('trees', 0.035), ('framework', 0.034), ('neutral', 0.033), ('theorem', 0.033), ('pima', 0.033), ('summation', 0.033), ('simpli', 0.033), ('hettich', 0.033), ('niteness', 0.031), ('radial', 0.031), ('heart', 0.03), ('subsection', 0.03), ('hsu', 0.029), ('caltech', 0.029), ('gunnar', 0.029), ('histogram', 0.029), ('boughorbel', 0.029), ('chang', 0.029), ('classi', 0.028), ('logic', 0.028), ('rs', 0.028), ('traditional', 0.027), ('ad', 0.027), ('robert', 0.025), ('xor', 0.025), ('meir', 0.024), ('reed', 0.024), ('training', 0.024), ('xi', 0.024), ('smola', 0.024), ('scaling', 0.023), ('intercept', 0.023), ('trevor', 0.023), ('keerthi', 0.023), ('rotation', 0.023), ('sch', 0.023), ('ranges', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

2 0.27920818 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unkown-abstract

3 0.1663208 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

4 0.12965958 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

5 0.10188253 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

6 0.097476304 92 jmlr-2008-Universal Multi-Task Kernels

7 0.079235457 86 jmlr-2008-SimpleMKL

8 0.077915132 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

9 0.073824897 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

10 0.071407042 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

11 0.070674546 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

12 0.062363684 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

13 0.055232868 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

14 0.054621387 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

15 0.05281505 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

16 0.048995495 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

17 0.0443356 58 jmlr-2008-Max-margin Classification of Data with Absent Features

18 0.044241276 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

19 0.043721758 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

20 0.043287959 54 jmlr-2008-Learning to Select Features using their Properties


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.235), (1, -0.144), (2, 0.494), (3, -0.138), (4, 0.137), (5, 0.009), (6, -0.24), (7, 0.027), (8, 0.005), (9, 0.082), (10, 0.03), (11, -0.03), (12, 0.089), (13, -0.011), (14, -0.018), (15, 0.05), (16, -0.023), (17, 0.046), (18, 0.021), (19, 0.004), (20, 0.013), (21, -0.025), (22, -0.029), (23, -0.051), (24, 0.023), (25, -0.09), (26, 0.032), (27, 0.042), (28, -0.084), (29, 0.019), (30, -0.017), (31, -0.011), (32, 0.016), (33, 0.007), (34, 0.065), (35, 0.006), (36, 0.032), (37, 0.037), (38, 0.042), (39, -0.037), (40, -0.014), (41, 0.036), (42, 0.01), (43, 0.032), (44, 0.026), (45, 0.004), (46, 0.015), (47, -0.004), (48, 0.033), (49, -0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91426039 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

2 0.76731247 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unkown-abstract

3 0.71744704 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

4 0.47469842 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller

Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality

5 0.44475329 92 jmlr-2008-Universal Multi-Task Kernels

Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces

6 0.39674902 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

7 0.3711333 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

8 0.37033471 86 jmlr-2008-SimpleMKL

9 0.36475831 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

10 0.33876982 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

11 0.31917125 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

12 0.30810204 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

13 0.25530642 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

14 0.23578405 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

15 0.23275608 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

16 0.22906061 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

17 0.20419486 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

18 0.20044547 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

19 0.19932766 85 jmlr-2008-Shark    (Machine Learning Open Source Software Paper)

20 0.18771021 58 jmlr-2008-Max-margin Classification of Data with Absent Features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (5, 0.02), (31, 0.016), (34, 0.312), (40, 0.077), (54, 0.075), (58, 0.031), (66, 0.071), (76, 0.021), (78, 0.021), (88, 0.113), (92, 0.029), (94, 0.055), (99, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77315235 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

2 0.4826746 58 jmlr-2008-Max-margin Classification of Data with Absent Features

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa

3 0.47478157 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

Author: Rémi Munos, Csaba Szepesvári

Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control

4 0.47470546 57 jmlr-2008-Manifold Learning: The Price of Normalization

Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov

Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap

5 0.47014293 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction

6 0.46561426 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

7 0.46536049 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

8 0.46432954 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

9 0.46402696 9 jmlr-2008-Active Learning by Spherical Subdivision

10 0.46088284 86 jmlr-2008-SimpleMKL

11 0.46080571 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

12 0.46050435 96 jmlr-2008-Visualizing Data using t-SNE

13 0.45777261 83 jmlr-2008-Robust Submodular Observation Selection

14 0.45689517 56 jmlr-2008-Magic Moments for Structured Output Prediction

15 0.4557232 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

16 0.45397273 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

17 0.45154083 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

18 0.44654688 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

19 0.44353259 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

20 0.44345805 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers