jmlr jmlr2010 jmlr2010-113 knowledge-graph by maker-knowledge-mining

113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems


Source: pdf

Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu

Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error bound, margin-based theory, pattern classification, tree decomposition, support vector machine, VC theory

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. [sent-13, score-0.136]

2 Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. [sent-14, score-0.096]

3 First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. [sent-15, score-0.077]

4 For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. [sent-18, score-0.145]

5 The boosting method (Schapire, 1990; Schapire and Singer, 2000) trains SVMs in a sequential manner, and the training of a particular SVM is dependent on the training and performance of previously trained SVMs. [sent-50, score-0.122]

6 This approach can reduce the total training time because the time complexity of training an SVM is on the order of n^2, where n is the number of training samples (Platt, 1998; Joachims, 1998). [sent-75, score-0.183]
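
A rough back-of-the-envelope comparison makes this rationale concrete. It is only a sketch under the assumption that the quadratic scaling holds within each region and that the decomposition yields roughly n/σ regions of at most σ samples each:

```latex
% Illustrative cost comparison (assumes quadratic per-region training cost
% and L \approx n/\sigma regions of at most \sigma samples each):
\[
\underbrace{n^2}_{\text{full training set}}
\qquad\text{vs.}\qquad
\sum_{i=1}^{L}\sigma^2 \;\approx\; \frac{n}{\sigma}\,\sigma^2 \;=\; n\,\sigma ,
\quad\text{a speedup of roughly } n/\sigma .
\]
```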

7 However, it differs from other MSR methods in that it uses a decision tree to obtain multiple reduced data sets, whereas other methods use unsupervised clustering (Rida et al.). [sent-81, score-0.096]

8 A decision tree decomposes a data space recursively into smaller regions. [sent-86, score-0.096]

9 In terms of how the regions are formed, a decision tree can be classified into one of three types: axis-parallel, oblique, and Voronoi. [sent-87, score-0.137]

10 In this paper, we take an axis-parallel decision tree as our decomposition scheme because of its speed in both the training and testing phases. [sent-95, score-0.184]

11 To the best of our knowledge, using a decision tree to speed up the training of multiclass SVMs has not been proposed previously. [sent-108, score-0.157]

12 Using a decision tree as a decomposition scheme can yield the following benefits when dealing with large-scale SVM problems. [sent-109, score-0.096]

13 First, the decision tree may decompose the data space so that certain decomposed regions become homogeneous; that is, they contain only samples with the same label. [sent-110, score-0.141]

14 Then, in the testing phase, when a data point flows to a homogeneous region, we simply classify it in terms of the common label of that region. [sent-111, score-0.077]

15 In fact, our experiments revealed that, for certain data sets, more than 90% of the training samples reside in homogeneous regions; thus, the decision tree method saves an enormous amount of time when training SVMs. [sent-113, score-0.252]
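
A minimal sketch of this decomposition step, assuming a nested-dict tree representation; the names (`grow_tree`, `ceiling_size`) and the split-selection rule used here (midpoint of the widest feature) are illustrative placeholders, not the paper's actual criterion:

```python
import numpy as np

def grow_tree(X, y, ceiling_size):
    """Recursive axis-parallel decomposition of the data space.

    A node stops splitting when it is homogeneous (a single label) or when
    it holds at most `ceiling_size` samples (the sigma of the paper);
    non-homogeneous leaves are the ones that later receive an lSVM.
    """
    labels = np.unique(y)
    if len(labels) == 1:                       # homogeneous leaf: tree classifies by itself
        return {"leaf": True, "homogeneous": True, "label": labels[0]}
    if len(X) <= ceiling_size:                 # small enough to train an lSVM on later
        return {"leaf": True, "homogeneous": False, "X": X, "y": y}

    spans = X.max(axis=0) - X.min(axis=0)
    f_E = int(np.argmax(spans))                            # feature compared at this node
    v_E = float((X[:, f_E].max() + X[:, f_E].min()) / 2)   # threshold value
    left = X[:, f_E] < v_E
    if left.all() or (~left).all():            # degenerate split: stop here
        return {"leaf": True, "homogeneous": False, "X": X, "y": y}

    return {"leaf": False, "feature": f_E, "value": v_E,
            "left": grow_tree(X[left], y[left], ceiling_size),
            "right": grow_tree(X[~left], y[~left], ceiling_size)}
```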

16 Another benefit of using the decision tree is the convenience it provides when searching for all the relevant parameter values to maximize the solution’s validation accuracy, which in turn helps maintain good test accuracy rates. [sent-115, score-0.144]

17 The goal of the DTSVM method is to attain comparable validation accuracy while consuming less time than training SVMs on full data sets. [sent-116, score-0.086]

18 Such searches are particularly easy under the DTSVM method because a decision tree is constructed in a recursive manner; hence, obtaining a tree with a larger ceiling size σ does not require reconstructing a decision tree for that value of σ. [sent-121, score-0.244]

19 Using a decision tree also reduces the cost of searching for the optimal values of SVM-parameters. [sent-122, score-0.096]

20 Our strategy involves training SVMs with all combinations of SVM-parameter values, but only for decomposed regions with an initial σ-level. [sent-125, score-0.106]

21 Given the n^2-complexity of SVM training, restricting the full search of SVM-parameter values to regions at the initial σ-level certainly reduces the SVM training time. [sent-129, score-0.084]
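
The restricted parameter search can be sketched as follows. Here `leaves` is assumed to map each non-homogeneous leaf of the initial σ-level tree to its training block, and `route(x)` to return the id of the leaf a validation point flows to; validation points that land in homogeneous leaves are assumed to be classified by the tree itself and are omitted. The helper names and grids are illustrative, not the paper's implementation:

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

def leaf_grid_search(leaves, route, X_val, y_val,
                     C_grid=(0.1, 1, 10, 100), gamma_grid=(0.01, 0.1, 1)):
    """Pick (C, gamma) by validation accuracy while training lSVMs only on
    the small leaves of the initial sigma-level decomposition."""
    best_params, best_acc = None, -1.0
    for C, gamma in product(C_grid, gamma_grid):
        # Train one lSVM per non-homogeneous leaf with this parameter pair.
        models = {leaf_id: SVC(C=C, gamma=gamma).fit(Xi, yi)
                  for leaf_id, (Xi, yi) in leaves.items()}
        preds = np.array([models[route(x)].predict(x.reshape(1, -1))[0]
                          for x in X_val])
        acc = float(np.mean(preds == np.asarray(y_val)))
        if acc > best_acc:
            best_params, best_acc = (C, gamma), acc
    return best_params, best_acc
```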

22 Although the decision tree method may not be the only way to achieve the above benefits for large-scale SVM problems, its effect can be understood in theory and a generalization error bound can be derived for the DTSVM classifier. [sent-131, score-0.119]

23 Our experimental results show that the numerical value of the dominant term is as small as, or of the same order of magnitude as, its counterpart in the generalization error bound for SVM training conducted on the whole data set. [sent-133, score-0.084]

24 These trees can be obtained by using a randomized, rather than the optimal, split point at each tree node (Dietterich, 2000). [sent-136, score-0.077]

25 By so doing, we train SVMs on all the decomposed regions and classify the test data based on a majority vote strategy. [sent-137, score-0.102]

26 In terms of test accuracy, multiple decompositions are not as effective as searching for the optimal σ-level of decomposed regions in a single decision tree. [sent-139, score-0.112]

27 We then used the training component to build DTSVM classifiers, the validation component to determine the optimal parameters, and the test component to measure the test accuracy. [sent-143, score-0.107]

28 When evaluating the DTSVM method, we found that it could train DTSVM classifiers that achieved test accuracy rates comparable to those of SVM classifiers. [sent-148, score-0.102]

29 For seven medium-size data sets, in which the largest number of samples was 494K and the largest number of features was 62K, DTSVM achieved speedup factors between 4 and 3,691 for 1A1 training, and between 29 and 5,775 for 1AO training. [sent-149, score-0.096]

30 For all the data sets, DTSVM could complete 1A1 training and 1AO training within 18. [sent-154, score-0.122]

31 Note that the training time included the time required to build a decision tree, the time to train SVMs on all the leaves, and the time to search for the optimal parameters. [sent-156, score-0.123]

32 The DTSVM Method In this section, we describe the decision tree that we use as the decomposition scheme, and discuss the training process for the DTSVM method. [sent-163, score-0.157]

33 To grow a binary tree, we follow a recursive process, whereby each training sample flowing to a node is sent to its left-hand or right-hand child node. [sent-176, score-0.079]

34 At a given node E, a certain feature fE of the training samples flowing to E is compared with a certain value vE so that all samples with fE < vE are sent to the left-hand child node, and the remaining samples are sent to the right-hand child node. [sent-177, score-0.097]
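
Continuing the earlier sketch, the same comparison fE < vE routes points at test time: a homogeneous leaf answers with its common label, while a non-homogeneous leaf defers to its lSVM. sklearn's `SVC` stands in for the paper's lSVM trainer, and all names remain illustrative:

```python
from sklearn.svm import SVC

def train_leaf_svms(node, C=1.0, gamma="scale"):
    """Attach an lSVM to every non-homogeneous leaf of the tree sketched earlier."""
    if node["leaf"]:
        if not node["homogeneous"]:
            node["svm"] = SVC(C=C, gamma=gamma).fit(node["X"], node["y"])
        return node
    train_leaf_svms(node["left"], C, gamma)
    train_leaf_svms(node["right"], C, gamma)
    return node

def predict_one(node, x):
    """Route a test point down the tree, then classify it at the leaf it reaches."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["value"] else node["right"]
    if node["homogeneous"]:
        return node["label"]                         # classified by the tree itself
    return node["svm"].predict(x.reshape(1, -1))[0]  # classified by the leaf's lSVM
```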

35 Figure 2: The test accuracy rates obtained by DTSVM on the seven data sets when σ0 = 1,500 and k = 1, 3, 5, 7 and 9. [sent-183, score-0.09]

36 To observe how these settings impact the test accuracy, we first fix σ0 at 1,500 and vary k from 1 to 9 at a step size of 2; we then plot the test accuracy rates obtained by DTSVM on the seven data sets whose details are shown in Table 1. [sent-190, score-0.113]

37 Figure 3: The test accuracy rates obtained by DTSVM on the seven data sets when k = 5, and σ0 = 500, 1,000, 1,500, 2,000, 2,500 and 3,000. [sent-198, score-0.09]

38 We randomly divided each data set into six parts of approximately equal size, and used four parts as the training component, one part as the validation component, and the remaining part as the test component. [sent-214, score-0.084]

39 On completion of the training process, we applied the output DTSVM classifier to the corresponding test data set to obtain the test accuracy rate. [sent-217, score-0.132]

40 If CART performs as well as DTSVM in every respect, then there is no need for DTSVM, since CART runs much faster than DTSVM in both the training and testing phases. [sent-262, score-0.088]

41 In the training phase, when a size σ is given, DTSVM randomly assigns a training sample to one of d subsets, where d is the smallest integer that is greater than or equal to n/σ and n is the number of training samples. [sent-265, score-0.183]
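
A sketch of this random assignment (illustrative only; the function name and seeding are not from the paper):

```python
import numpy as np

def random_decompose(n, sigma, seed=0):
    """RDSVM-style split: assign each of the n training samples to one of
    d = ceil(n / sigma) subsets uniformly at random."""
    d = int(np.ceil(n / sigma))
    rng = np.random.default_rng(seed)
    return rng.integers(0, d, size=n)   # subset index for every sample
```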

42 LIBSVM (Chang and Lin, 2005) is now the most widely used software for training and testing SVMs. [sent-287, score-0.088]

43 If a compared method is faster in training than LIBSVM, it has a speedup factor above 1. [sent-289, score-0.113]

44 LIBLINEAR (Fan et al., 2008) is a fast package for training and testing linear SVMs. [sent-295, score-0.088]

45 Figure 4 and Figure 9 show the training times of the seven methods. [sent-311, score-0.086]

46 The training time of each method comprises the time required to obtain reduced data sets if it is a data reduction method, the time to train all SVMs and the time to search for optimal parameters; however, the time required to input or output data is not included. [sent-312, score-0.079]

47 Figure 5 and Figure 10 show the speedup factors of all the methods except LIBSVM, where the speedup factor of a method M is computed as LIBSVM’s training time divided by M ’s training time. [sent-316, score-0.226]

48 Note that the DTSVM test accuracy is that of the DTSVM classifier with the ceiling size σopt and SVM-parameters θopt. [sent-318, score-0.116]

49 NESV is defined as the number of SVs contained in the decision function used to classify a test sample. [sent-321, score-0.083]

50 In terms of training time, DTSVM outperformed all the other methods, except CART; and in terms of test accuracy and NESV, DTSVM outperformed or performed comparably to all the other methods. [sent-329, score-0.191]

51 It also achieved very large speedup factors and NESV ratios on “Shuttle”, “Poker”, “CI” and “KDD-10%”. [sent-330, score-0.1]

52 CBD achieved speedup factors above 1 and NESV ratios above 1 on most data sets; however, its scores were overall not as high as those of DTSVM. [sent-336, score-0.1]

53 Bagging achieved speedup factors below 1 and NESV ratios below 1 on several data sets. [sent-371, score-0.1]

54 TNSV expresses the space-complexity of a training process, whereas NESV expresses the time-complexity of a test process. [sent-381, score-0.084]

55 Clearly, the DTC testing time takes an extremely small proportion of DTSVM’s testing time. [sent-389, score-0.078]

56 Even though DTSVM was not as fast as CART and LIBLINEAR in training, it achieved consistently high test accuracy rates on all four data sets. [sent-397, score-0.084]

57 The DTC testing time takes a very small proportion of DTSVM’s testing time. [sent-447, score-0.078]

58 Recall that RDSVM achieved equally good test accuracy rates on all the medium-size data sets. [sent-455, score-0.084]

59 Once again, it is clear that DTC’s testing time only takes a very small proportion of DTSVM’s testing time. [sent-472, score-0.078]

60 5 Further Discussion To gain insight into why DTSVM is so effective, we show in Table 7 the σopt derived by DTSVM on medium-size data sets, along with the proportion of training samples that flow to homogeneous leaves. [sent-481, score-0.119]

61 Note that a single table suffices to show all the results because 1A1 training and 1AO training employ the same decision trees and DTSVM yields the same σopt value for both approaches. [sent-482, score-0.21]

62 Furthermore, the proportion of training samples that flowed to homogeneous leaves under DTSVM was very high in “Shuttle” and “KDD-10%”. [sent-485, score-0.143]

63 DTC’s testing time only takes a very small proportion of DTSVM’s testing time. [sent-517, score-0.078]

64 Table 7: The σopt obtained by DTSVM on the medium-size data sets and the proportion of training samples that flow to homogeneous leaves. [sent-522, score-0.119]

65 Table 9: The DTSVM training times required for different ceiling sizes. [sent-524, score-0.129]

Ceiling size   Letter 1A1   Letter 1AO   News20 1A1   News20 1AO
1,500          633          2,730        1,631        7,056
6,000          45           178          665          2,952
24,000         90           373          757          3,365

66 Table 10: The DTSVM test accuracy rates that correspond to different ceiling sizes. [sent-537, score-0.133]

67 Because no lSVMs need to be trained in any homogeneous leaves, DTSVM achieved very high speedup factors and NESV ratios on these two data sets. [sent-538, score-0.134]

68 The results explain why RDSVM achieved much smaller speedup factors and NESV ratios on “CI” and “Poker”. [sent-544, score-0.1]

69 Even so, DTSVM still achieved speedup factors above 1 because it only trained lSVMs for all the parameter values on leaves with a ceiling size of 1,500, which took much less time than training them on the full training component. [sent-547, score-0.285]

70 Table 11: The σopt values obtained by DTSVM on the large-size data sets and the proportion of training samples that flowed to homogeneous leaves. [sent-554, score-0.119]

71 Table 9 shows the training times required for different ceiling sizes. [sent-556, score-0.129]

72 We observe that DTSVM spent most of its time on the leaves with the lowest ceiling size. [sent-557, score-0.092]

73 In addition, Table 10 shows the test accuracy rates corresponding to different ceiling sizes, assuming that the training was terminated at those sizes. [sent-558, score-0.194]

74 The results demonstrate the benefit of searching for the σopt because, if we terminated the training at ceiling size 1,500 or 6,000, we would obtain significantly lower test accuracy rates. [sent-559, score-0.177]

75 For completeness, Table 11 shows the σopt values obtained by DTSVM on the large-size data sets, along with the proportion of training samples that flowed to homogeneous leaves. [sent-561, score-0.119]

76 Here f = (f1, . . . , fL), and fi is related to the lSVM trained on leaf i of π for i = 1, . . . , L. [sent-578, score-0.1]

77 The decision function on each leaf i, for i = 1, . . . , L, is expressed as fi ◦ Φ, where Φ maps an input in Rd to a Hilbert space H, and fi is a linear function from H to R. [sent-589, score-0.16]

78 Note that if π(x) = i, then h(x, π, f) = sign(fi(Φ(x))). [sent-591, score-0.08]
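
Putting the last few sentences together, the DTSVM decision rule can be written compactly as follows; this is only a restatement of the reconstructed text, not an addition to the theory:

```latex
\[
h(x, \pi, \mathbf{f}) \;=\; \operatorname{sign}\!\bigl(f_{\pi(x)}(\Phi(x))\bigr),
\qquad
\mathbf{f} = (f_1, \dots, f_L),\quad
\Phi : \mathbb{R}^d \to H,\quad
f_i : H \to \mathbb{R} \ \text{linear}.
\]
```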

79 First, we define the notion of the shatter coefficient, which we use to measure the complexity of the partition functions corresponding to binary decision trees. [sent-611, score-0.093]

80 We say that fπ has margin γ on Xn, or mg(fπ, Xn) ≥ γ, if y · fπ(x) ≡ y · fi(Φ(x)) ≥ γi for any i ∈ {1, . . . , L} and any (x, y) ∈ Xn with π(x) = i. [sent-624, score-0.12]
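
The margin condition, restated in display form as reconstructed from the truncated sentence above:

```latex
\[
\operatorname{mg}(\mathbf{f}_\pi, X_n) \ge \gamma
\iff
y \cdot f_{\pi(x)}(\Phi(x)) \ge \gamma_{\pi(x)}
\quad\text{for every } (x, y) \in X_n,
\qquad \gamma = (\gamma_1, \dots, \gamma_L).
\]
```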

81 Suppose for some G and F = F1 × · · · × FL , we can build a classifier sign(fπ ) with π ∈ G and f ∈ F that has a large margin and zero training error. [sent-653, score-0.101]

82 For example, if we do not choose G properly (say, by using a random partition), the training examples in some parts of the partition may not be separated by SVMs with a large margin. [sent-658, score-0.08]

83 Our main contribution is the discovery that decision trees are good partition functions when combined with SVMs, since they allow a large margin and only require a short training time for lSVMs. [sent-659, score-0.189]

84 We also define BL(Rd) as the class of partition functions associated with binary trees containing L leaves that partition the space Rd into L axis-aligned parts, as described in Section 2. [sent-669, score-0.087]

85 , L is sufficiently small), and it classifies each training sample with a large margin (i. [sent-695, score-0.101]

86 Note that the bound in Theorem 7 has a similar form to the generalization error bound of the perceptron decision trees proposed by Bennett et al. [sent-700, score-0.092]

87 Note that Theorem 7 works in the case where the training data Xn can be separated with a margin vector γ. [sent-705, score-0.101]

88 Then, we can define the margin slack vector of fi as follows. [sent-718, score-0.12]

89 For 1 ≤ i ≤ L and 1 ≤ j ≤ ni , let ξi, j = max(0, γi − yi, j · fi (Φ(xi, j ))). [sent-723, score-0.08]

90 We call ξi = (ξi,1, . . . , ξi,ni) the margin slack vector of fi with respect to π and γi over Xn. [sent-728, score-0.12]
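
The per-leaf slack computation is a one-liner; `scores` holds the values fi(Φ(xi,j)) for the ni samples routed to leaf i, `labels` their ±1 labels, and `gamma_i` the target margin (the names are illustrative):

```python
import numpy as np

def margin_slack(scores, labels, gamma_i):
    """Margin slack vector of one leaf's lSVM:
    xi_{i,j} = max(0, gamma_i - y_{i,j} * f_i(Phi(x_{i,j})))."""
    return np.maximum(0.0, gamma_i - np.asarray(labels) * np.asarray(scores))
```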

91 For 1 ≤ i ≤ L and for any (x, y) ∉ Xn, fˆi(τρ(Φ(x))) = fi(Φ(x)). [sent-746, score-0.08]

92 Let f = (f1, . . . , fL) ∈ F, such that for 1 ≤ i ≤ L, fi has a margin slack vector ξi with respect to π and γi over Xn. [sent-767, score-0.12]

93 Recall that a DTSVM classifier is associated with a tree with L leaves, and each leaf is associated with an lSVM. [sent-780, score-0.096]

94 The new classifiers are built with the same structure (i.e., the same binary trees and same ceiling sizes) as the old classifiers, but different (C, γ) values. [sent-808, score-0.106]

95 There exists g = (g1, . . . , gL) ∈ Bπ^{γ/2}(X2n) such that for any (x, y) ∈ Xn, if π(x) = i, then |fi(Φ(x)) − gi(Φ(x))| ≤ γi/2. [sent-862, score-0.097]

96 Now, for 1 ≤ i ≤ L, we can define fˆi : H → R by fˆi = (fi, gi/ρ), where gi ∈ I(H) is defined by gi = ∑_{j=1}^{ni} ξi,j · yi,j · δΦ(xi,j). [sent-894, score-0.131]

97 Since fi ∈ L(H, βi), there exists wi ∈ H with ‖wi‖ ≤ βi such that fi(z) = ⟨wi, z⟩. [sent-895, score-0.16]

98 Furthermore, for any (xi,j, yi,j) ∈ Xn^(i), we have yi,j · fˆi(τρ(Φ(xi,j))) = yi,j · fi(Φ(xi,j)) + yi,j · (ξi,j · yi,j) = yi,j · fi(Φ(xi,j)) + ξi,j ≥ γi, by the definition of ξi,j. [sent-898, score-0.16]

99 Finally, for 1 ≤ i ≤ L and for any (x, y) ∉ Xn, we have fˆi(τρ(Φ(x))) = fi(Φ(x)) + ∑_{j=1}^{ni} ξi,j · yi,j · δΦ(xi,j)(Φ(x)) = fi(Φ(x)). [sent-900, score-0.16]

100 Multiclass text classification: a decision tree based SVM approach. [sent-1185, score-0.096]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dtsvm', 0.852), ('rdsvm', 0.153), ('cart', 0.152), ('nesv', 0.141), ('ecomposition', 0.104), ('uo', 0.1), ('nesvs', 0.086), ('svm', 0.085), ('arge', 0.08), ('roblems', 0.08), ('cale', 0.08), ('fi', 0.08), ('gsvm', 0.074), ('poker', 0.074), ('ceiling', 0.068), ('libsvm', 0.068), ('opt', 0.064), ('training', 0.061), ('shuttle', 0.061), ('svms', 0.061), ('hang', 0.058), ('mg', 0.058), ('xn', 0.058), ('ree', 0.056), ('cbd', 0.055), ('tree', 0.052), ('lasvm', 0.052), ('ppi', 0.052), ('svs', 0.052), ('speedup', 0.052), ('dtc', 0.049), ('pr', 0.048), ('webspam', 0.047), ('err', 0.045), ('letter', 0.044), ('decision', 0.044), ('fl', 0.043), ('liblinear', 0.042), ('rd', 0.042), ('cristianini', 0.041), ('sign', 0.041), ('outperformed', 0.041), ('margin', 0.04), ('bennett', 0.038), ('phw', 0.037), ('supp', 0.037), ('pl', 0.034), ('classi', 0.034), ('homogeneous', 0.034), ('bagging', 0.033), ('msr', 0.031), ('lsvm', 0.031), ('shatter', 0.03), ('ratios', 0.029), ('bordes', 0.028), ('er', 0.028), ('forest', 0.028), ('testing', 0.027), ('trees', 0.025), ('accuracy', 0.025), ('seven', 0.025), ('dnl', 0.025), ('panda', 0.025), ('subprocess', 0.025), ('tnsv', 0.025), ('leaves', 0.024), ('proportion', 0.024), ('regions', 0.023), ('generalization', 0.023), ('bl', 0.023), ('test', 0.023), ('decomposed', 0.022), ('lagged', 0.021), ('sinica', 0.021), ('ers', 0.021), ('leaf', 0.02), ('event', 0.019), ('covering', 0.019), ('table', 0.019), ('cascade', 0.019), ('partition', 0.019), ('achieved', 0.019), ('train', 0.018), ('lsvms', 0.018), ('oblique', 0.018), ('osuna', 0.018), ('owed', 0.018), ('rida', 0.018), ('sent', 0.018), ('breiman', 0.018), ('comprised', 0.017), ('iis', 0.017), ('rates', 0.017), ('gi', 0.017), ('ci', 0.017), ('lemma', 0.017), ('tw', 0.016), ('pegasos', 0.016), ('classify', 0.016), ('ows', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu

Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error bound, margin-based theory, pattern classification, tree decomposition, support vector machine, VC theory

2 0.098212093 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin

Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing

3 0.069343805 40 jmlr-2010-Fast and Scalable Local Kernel Machines

Author: Nicola Segata, Enrico Blanzieri

Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning

4 0.046921082 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

Author: Markus Ojala, Gemma C. Garriga

Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing

5 0.045668386 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

6 0.044202104 15 jmlr-2010-Approximate Tree Kernels

7 0.040029597 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

8 0.038458407 22 jmlr-2010-Classification Using Geometric Level Sets

9 0.035896085 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

10 0.034057926 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

11 0.033139557 69 jmlr-2010-Lp-Nested Symmetric Distributions

12 0.030471416 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

13 0.028343236 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

14 0.027978286 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

15 0.027425235 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

16 0.026677942 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

17 0.026074164 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

18 0.025675764 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

19 0.025239289 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

20 0.02520344 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.127), (1, 0.003), (2, -0.036), (3, 0.104), (4, 0.045), (5, 0.064), (6, -0.085), (7, 0.006), (8, -0.021), (9, -0.023), (10, 0.071), (11, -0.022), (12, 0.019), (13, 0.049), (14, -0.091), (15, -0.003), (16, 0.044), (17, -0.123), (18, -0.026), (19, -0.118), (20, 0.188), (21, 0.097), (22, 0.011), (23, -0.011), (24, 0.081), (25, 0.015), (26, -0.025), (27, 0.058), (28, 0.223), (29, 0.06), (30, -0.04), (31, -0.094), (32, -0.115), (33, 0.032), (34, 0.125), (35, -0.066), (36, -0.13), (37, -0.102), (38, 0.036), (39, -0.083), (40, 0.024), (41, 0.162), (42, -0.104), (43, 0.107), (44, 0.02), (45, 0.065), (46, 0.199), (47, 0.082), (48, 0.21), (49, -0.157)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88250715 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu

Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error ¨bound, margin-based theory, pattern classification, ı tree decomposition, support vector machine, VC theory

2 0.59258044 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin

Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing

3 0.54468578 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

4 0.50095916 40 jmlr-2010-Fast and Scalable Local Kernel Machines

Author: Nicola Segata, Enrico Blanzieri

Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning

5 0.30785948 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

Author: Dotan Di Castro, Ron Meir

Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference

6 0.2726081 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

7 0.27199391 61 jmlr-2010-Learning From Crowds

8 0.24344735 109 jmlr-2010-Stochastic Composite Likelihood

9 0.23918724 22 jmlr-2010-Classification Using Geometric Level Sets

10 0.23233747 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

11 0.22193091 69 jmlr-2010-Lp-Nested Symmetric Distributions

12 0.19088757 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

13 0.19069327 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

14 0.18824308 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

15 0.18087773 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases

16 0.1805259 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

17 0.17408277 70 jmlr-2010-MOA: Massive Online Analysis

18 0.17169584 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

19 0.16884625 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

20 0.16681345 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.373), (3, 0.022), (4, 0.034), (8, 0.022), (21, 0.041), (24, 0.011), (32, 0.052), (36, 0.03), (37, 0.035), (75, 0.142), (81, 0.014), (85, 0.08), (96, 0.015), (97, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85082448 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

same-paper 2 0.69383878 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu

Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error ¨bound, margin-based theory, pattern classification, ı tree decomposition, support vector machine, VC theory

3 0.49187422 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.4826777 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

5 0.47361314 60 jmlr-2010-Learnability, Stability and Uniform Convergence

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization

6 0.46935514 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

7 0.46841407 102 jmlr-2010-Semi-Supervised Novelty Detection

8 0.46317601 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

9 0.46176797 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

10 0.46134612 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

11 0.4538672 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

12 0.45137674 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

13 0.44963151 84 jmlr-2010-On Spectral Learning

14 0.44903305 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

15 0.44801232 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

16 0.44465372 25 jmlr-2010-Composite Binary Losses

17 0.44419184 109 jmlr-2010-Stochastic Composite Likelihood

18 0.44318205 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

19 0.44241202 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

20 0.44072801 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data