jmlr jmlr2010 jmlr2010-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
Reference: text
sentIndex sentText sentNum sentScore
1 While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. [sent-7, score-0.297]
2 Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. [sent-8, score-0.467]
3 This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. [sent-9, score-0.474]
4 Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. [sent-10, score-0.541]
5 Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. [sent-11, score-0.239]
6 Support vector machines (SVMs) and maximum margin classifiers (Vapnik, 1995; Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004) have been a particularly successful approach both in theory and in practice. [sent-16, score-0.239]
7 The parameters of the hyperplane (w, b) are estimated by maximizing the margin (e. [sent-20, score-0.239]
8 In practice, the margin is maximized by minimizing ½ w⊤w plus an upper bound on the misclassification rate. [sent-23, score-0.263]
9 While maximum margin classification works well in practice, its solution can easily be perturbed by an (invertible) affine or scaling transformation of the input space. [sent-28, score-0.325]
10 This article will explore such shortcomings in maximum margin solutions (or equivalently, SVMs in the context of this article), which exclusively measure margin by the points near the classification boundary regardless of how spread out the remaining data is away from the separating hyperplane. [sent-32, score-0.651]
11 The key is to recover a large margin solution while normalizing the margin by the spread of the data. [sent-35, score-0.611]
12 Thus, margin is measured in a relative sense rather than in the absolute sense. [sent-36, score-0.302]
13 The resulting classifier will be referred to as the relative margin machine (RMM) and was first introduced by Shivaswamy and Jebara (2009a) with this longer article serving to provide more details, more thorough empirical evaluation and more theoretical support. [sent-38, score-0.318]
14 The estimation of spread should not make second-order assumptions about the data and should be tied to the margin criterion (Vapnik, 1995). [sent-44, score-0.353]
15 In this prior work, a feature’s contribution to margin is compared to its effect on the radius of the data by computing bounding hyper-spheres rather than simple second-order statistics. [sent-48, score-0.275]
16 While these previous methods showed performance improvements, they relied on multiple-step locally optimal algorithms for interleaving spread information with margin estimation. [sent-55, score-0.353]
17 Alternatively, one may consider distributions over classifier solutions which provide a different estimate than the maximum margin setting and have also shown empirical improvements over SVMs (Jaakkola et al. [sent-63, score-0.239]
18 These distribution assumptions permit update rules that resemble whitening of the data, thus alleviating adversarial affine transformations and producing changes to the basic maximum margin formulation that are similar in spirit to those the RMM provides. [sent-69, score-0.333]
19 In addition, recently, a new batch algorithm called the Gaussian margin machine (GMM) (Crammer et al. [sent-70, score-0.239]
20 In principle, these methods also change the way margin is measured and the way regularization is applied to the learning problem. [sent-80, score-0.239]
21 The argument is that, without any additional assumptions beyond the simple classification problem, maximizing margin in the absolute sense may be suboptimal and that maximizing relative margin is a promising alternative. [sent-82, score-0.541]
22 Further, large margin methods have been successfully applied to a variety of tasks such as parsing (Collins and Roark, 2004; Taskar et al. [sent-83, score-0.239]
23 The relative margin machine formulation is detailed in Section 3 and several variants and implementations are proposed. [sent-91, score-0.301]
24 Motivation This section provides three different (an intuitive, a probabilistic and an affine transformation based) motivations for maximizing the margin relative to the data spread. [sent-102, score-0.315]
25 The figure depicts three scaled versions of the two dimensional problem to illustrate potential problems with the large margin solution. [sent-105, score-0.239]
26 The red (or dark shade) solution is the SVM estimate while the green (or light shade) solution is the proposed maximum relative margin alternative. [sent-107, score-0.337]
27 Clearly, the SVM solution achieves the largest margin possible while separating both classes, yet is this necessarily the best solution? [sent-108, score-0.258]
28 With progressive scaling, the SVM increasingly deviates from the maximum relative margin solution (green), clearly indicating that the SVM decision boundary is sensitive to affine transformations of the data. [sent-111, score-0.352]
29 Meanwhile, an algorithm producing the maximum relative margin (green) decision boundary could remain resilient to adversarial scaling. [sent-115, score-0.315]
30 Unlike the maximum margin solution, this solution accounts for the spread of the data in various directions. [sent-117, score-0.372]
31 This permits it to recover a solution which has a large margin relative to the spread in that direction. [sent-118, score-0.411]
32 Such a solution would otherwise be overlooked by a maximum margin criterion. [sent-119, score-0.258]
33 A small margin relative to a correspondingly small data spread might be better than a large absolute margin with a correspondingly large data spread. [sent-120, score-0.616]
34 This particular weakness in large margin estimation has only received limited attention in previous work. [sent-121, score-0.239]
35 In this case, the maximum relative margin (green) decision boundary should obtain zero test error even if it is estimated from a finite number of examples. [sent-124, score-0.32]
36 The above arguments show that large margin on its own is not enough; it is also necessary to control the spread of the data after projection. [sent-130, score-0.353]
37 Therefore, maximum margin should be traded off or balanced with the goal of simultaneously minimizing the spread of the projected data, for instance, by bounding the spread |w⊤x + b|. [sent-131, score-0.485]
38 This will allow the linear classifier to recover large margin solutions not in the absolute sense but rather relative to the spread of the data in that projection direction. [sent-132, score-0.434]
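To make the notion of relative margin concrete, the following sketch (a hypothetical helper, not code from the article) computes the absolute margin of a linear classifier together with the same margin normalized by the spread of the projections w⊤x_i + b; the ratio is one simple way to quantify "margin relative to the spread in the projection direction."

```python
import numpy as np

def absolute_and_relative_margin(w, b, X, y):
    """Absolute (functional) margin and margin relative to the spread of the projections.

    X: (n, d) data matrix, y: labels in {-1, +1}, (w, b): a linear classifier.
    """
    proj = X @ w + b                    # projections w^T x_i + b onto the real line
    absolute = np.min(y * proj)         # smallest functional margin over the data
    spread = np.max(np.abs(proj))       # spread of the projected data
    return absolute, absolute / spread  # relative margin; the scale of w cancels in the ratio
```

The percentages quoted later for the toy example in Figure 1 (41%, 28%, 21% versus 100%) are ratios of exactly this kind: the same absolute margin looks small or large depending on how far the remaining projections spread.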
39 No matter how they are mapped initially, a large margin solution still projects these points to the real line where the margin of separation is maximized. [sent-137, score-0.497]
40 Given the above motivation, it is important to achieve a large margin relative to the spread of the projections even in such situations. [sent-139, score-0.421]
41 2 Probabilistic Motivation In this subsection, an informal motivation is provided to illustrate why maximizing relative margin may be helpful. [sent-142, score-0.302]
42 Instead of precisely finding low variance and high mean projections, this paper implements this intuition by trading off between large margin and small projections of the data while correctly classifying most of the examples with a hinge loss. [sent-154, score-0.289]
43 3 Motivation From an Affine Invariance Perspective Another motivation for maximum relative margin can be made by reformulating the classification problem altogether. [sent-156, score-0.302]
44 The data will be mapped by an affine transformation such that it is separated with large margin while it also produces a small radius. [sent-158, score-0.276]
45 Recall that maximum margin classification and SVMs are motivated by generalization bounds based on Vapnik-Chervonenkis complexity arguments. [sent-159, score-0.288]
46 Figure 1: Left: As the data is scaled, the maximum margin SVM solution (red or dark shade) deviates from the maximum relative margin solution (green or light shade). [sent-160, score-0.622]
47 The absolute margins for the maximum margin solution (red) are 1. [sent-164, score-0.282]
48 For the maximum relative margin solution (green) the absolute margin is merely 0. [sent-168, score-0.591]
49 However, the relative margin (the ratio of absolute margin to the spread of the projections) is 41%, 28%, and 21% for the maximum margin solution (red) and 100% for the relative margin solution (green). [sent-170, score-1.21]
50 These generalization bounds depend on the ratio of the margin to the radius of the data (Vapnik, 1995). [sent-172, score-0.257]
51 Instead of learning a classification rule, the optimization problem considered in this section will recover an affine transformation which achieves a large margin from a fixed decision rule while also achieving small radius. [sent-175, score-0.276]
52 Assume the classification hyperplane is given a priori via the decision boundary w⊤ x+b0 = 0 with the two supporting margin hyperplanes w⊤ x+b0 = ±ρ. [sent-176, score-0.258]
53 The following Lemma shows that this affine transformation learning problem is basically equivalent to learning a large margin solution with a small spread. [sent-185, score-0.295]
54 Learning a transformation matrix Ã so as to maximize the margin while minimizing the radius given an a priori hyperplane (w0, b0) is no different from learning a classification hyperplane (w, b) with a large margin as well as a small spread. [sent-194, score-0.533]
55 This is because the rank of the affine transformation Ã∗ is one; thus, Ã∗ merely maps all the points xi onto a line achieving a certain margin ρ but also limiting the output or spread. [sent-195, score-0.36]
56 This means that finding an affine transformation which achieves a large margin and small radius is equivalent to finding a w and b with a large margin and with projections constrained to remain close to the origin. [sent-196, score-0.562]
57 From Absolute Margin to Relative Margin This section will provide an upgrade path from the maximum margin classifier (or SVM) to a maximum relative margin formulation. [sent-200, score-0.517]
58 The above is an easily solvable quadratic program (QP) and maximizes the margin by minimizing ‖w‖². [sent-204, score-0.239]
59 Thus, the above formulation maximizes the margin while minimizing an upper bound on the number of classification errors. [sent-206, score-0.286]
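For reference, a minimal sketch of the soft-margin SVM quadratic program described above, written here with cvxpy; the function name and solver choice are illustrative, not taken from the article.

```python
import cvxpy as cp

def svm_primal(X, y, C=1.0):
    """Soft-margin SVM: min 1/2 ||w||^2 + C * sum(xi)  s.t.  y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)            # slack variables upper-bounding misclassifications
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```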
60 1 The Whitened SVM One way of limiting sensitivity to affine transformations while recovering a large margin solution is to whiten the data with the covariance matrix prior to estimating the SVM solution. [sent-222, score-0.276]
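A minimal numpy sketch of the whitening step described here, assuming an ordinary SVM trainer is available separately; the regularization term eps is an assumption added to keep the covariance invertible.

```python
import numpy as np

def whiten(X, eps=1e-6):
    """Whiten data with its covariance so the transformed sample covariance is (close to) identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False) + eps * np.eye(X.shape[1])  # regularized covariance estimate
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T               # symmetric inverse square root cov^{-1/2}
    return Xc @ W, W

# Usage sketch: Xw, W = whiten(X); train a standard SVM on (Xw, y) to obtain the whitened SVM.
```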
61 A parameter will also be introduced that helps trade off between large margin and small spread of the projection of the data. [sent-245, score-0.371]
62 Consider the following formulation called the relative margin machine (RMM): $\min_{w,b}\;\tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$ (8) s.t. [sent-248, score-0.301]
63 Specifically, with a smaller B, the RMM still finds a large margin solution but with a smaller projection of the training examples. [sent-259, score-0.308]
64 By trying different B values (within the aforementioned thresholds), different large relative margin solutions are explored. [sent-260, score-0.278]
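A minimal sketch of formulation (8) with the bounded-projection constraints discussed above (|w⊤x_i + b| ≤ B alongside the usual hinge constraints); the cvxpy encoding and default values are illustrative assumptions, and B is meant to be swept between the thresholds mentioned in the text.

```python
import cvxpy as cp

def rmm_primal(X, y, C=1.0, B=2.0):
    """Relative margin machine: soft-margin SVM plus a bound B on |w^T x_i + b| for every training point."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    proj = X @ w + b
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, proj) >= 1 - xi,  # classification constraints with hinge slack
                   cp.abs(proj) <= B]               # limit the spread of the projections
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

Setting B large enough makes the extra constraints inactive and recovers the plain SVM, while shrinking B toward 1 trades absolute margin for a smaller projected spread.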
65 Differentiating with respect to the primal variables and equating to zero produces: $\big(I + \sum_{i=1}^{n}\lambda_i x_i x_i^\top\big)w + b\sum_{i=1}^{n}\lambda_i x_i = \sum_{i=1}^{n}\alpha_i y_i x_i$, $\;\frac{1}{\lambda^\top \mathbf{1}}\big(\sum_{i=1}^{n}\alpha_i y_i - \sum_{i=1}^{n}\lambda_i w^\top x_i\big) = b$, $\;\alpha_i + \beta_i = C \;\;\forall\, 1 \le i \le n$. [sent-263, score-0.305]
66 The RMM maximizes the margin while also limiting the spread of projections on the training data. [sent-381, score-0.414]
67 Above, take $\bar{D} := 1 - D$ and $0 < D < 1$ trades off between large margin and small spread on the projections. [sent-383, score-0.353]
68 With these landmark examples, the modified RMM function class can be written as: Definition 5 $\mathcal{H}^{V}_{E,D} := \{x \mapsto w^\top x \;|\; \bar{D}\,w^\top w + D\,(w^\top v_i)^2 \le E \;\;\forall\, 1 \le i \le n_v\}$. [sent-390, score-0.406]
69 Note that the parameter E is fixed in $\mathcal{H}_{E,D}$ but $n_v$ may be different from n. [sent-398, score-0.294]
70 With $\lambda_1, \ldots, \lambda_{n_v} \ge 0$, the Lagrangian of (15) is: $\mathcal{L}(w, \lambda) = -w^\top\sum_{i=1}^{n}\sigma_i x_i + \frac{1}{2}\sum_{i=1}^{n_v}\lambda_i\big(\bar{D}\,w^\top w + D\,(w^\top v_i)^2 - E\big)$. [sent-446, score-0.371]
71 Substituting this w in $\mathcal{L}(w, \lambda)$ gives the dual of (15): $\min_{\lambda \ge 0}\;\sum_{i=1}^{n}\sigma_i x_i^\top\,\Sigma_{\lambda,D}^{-1}\sum_{j=1}^{n}\sigma_j x_j + E\sum_{i=1}^{n_v}\lambda_i$. [sent-448, score-0.318]
72 When an already defined set, such as V (with a known number nv of elements) is an argument to T2 , λ will be subscripted with i or j. [sent-454, score-0.294]
73 Theorem 17 For $n_v = O(\sqrt{n})$, with probability at least $1 - 2\delta$, $\mathcal{H}^{S}_{E,D} \subseteq \mathcal{H}^{V}_{E,D}$. [sent-550, score-0.294]
74 Similarly, applying (19) on the set V, with probability at least $1 - \delta$, $\frac{1}{n_v}\sum_{i=1}^{n_v} I_{[g_w(v_i) > E]} \le E_{D_x}\big[I_{[g_w(x) > E]}\big] + \frac{2}{n_v}\sqrt{\frac{2E}{\bar{D}}\,\mathrm{tr}(K_v)} + 3\sqrt{\frac{\ln(2/\delta)}{2 n_v}}$. [sent-557, score-0.882]
75 This is because, if there is a $w \in \mathcal{H}^{S}_{E,D}$ that is not in $\mathcal{H}^{V}_{E,D}$, then for such a w, $\frac{1}{n_v}\sum_{i=1}^{n_v} I_{[g_w(v_i) > E]} \ge \frac{1}{n_v}$ and $\frac{1}{n}\sum_{i=1}^{n} I_{[g_w(x_i) > E]} = 0$. [sent-561, score-0.588]
76 Thus, equating the right hand side of (26) to $\frac{1}{n_v}$ and solving for $n_v$, the result follows. [sent-562, score-0.294]
77 Both an exact value and the asymptotic behavior of $n_v$ are derived in Appendix C. [sent-563, score-0.588]
78 For the SVM bound, the quantity of interest is $\frac{4}{n\gamma}\sqrt{\mathrm{tr}(K)}$; for the whitened SVM bound, the quantity of interest is $\frac{4\sqrt{2E}}{n\hat{\gamma}}\sqrt{\sum_{i=1}^{n} x_i^\top\big(\bar{D} I + D\sum_{j=1}^{n} u_j u_j^\top\big)^{-1} x_i}$; for the RMM bound, the quantity of interest is $\frac{2}{\hat{\gamma}}\sqrt{\min_{\lambda \ge 0}\;\frac{1}{n}\sum_{i=1}^{n} x_i^\top\big(\bar{D}\sum_{j=1}^{n_v}\lambda_j I + D\sum_{j=1}^{n_v}\lambda_j v_j v_j^\top\big)^{-1} x_i + E\sum_{i=1}^{n_v}\lambda_i}$. [sent-580, score-0.988]
79 However, in both the cases, the absolute margin of separation γ remains the same. [sent-591, score-0.263]
80 This merely recovers the absolute margin γ which is shown in the figure. [sent-597, score-0.294]
81 Thus, when a linear function is selected from the function class $\mathcal{G}^{S}_{D,E}$, the margin $\hat{\gamma}$ is estimated from a whitened version of the data. [sent-619, score-0.502]
82 On this whitened data set, the margin $\hat{\gamma}$ appears much larger in toy example 2 since it is large compared to the spread. [sent-625, score-0.295]
83 This gives further flexibility and rescales data not only along principal eigen-directions but in any direction where the margin is large relative to the spread of the data. [sent-629, score-0.392]
84 By exploring D values, margin can be measured relative to the spread of the data rather than in the absolute sense. [sent-630, score-0.416]
85 However, the first term in the bound is the average of the slack variables divided by the margin, which does not go to zero asymptotically with increasing n and eventually dominates the bound. [sent-696, score-0.285]
86 Clearly, as B decreases, the absolute margin decreases; however, the test error rate drops drastically. [sent-788, score-0.286]
87 This empirically suggests that maximizing the relative margin can have a beneficial effect on the performance of a classifier. [sent-789, score-0.278]
88 This experiment once again demonstrates that a large absolute margin does not always result in a small test error. [sent-901, score-0.286]
89 Conclusions The article showed that support vector machines and maximum margin classifiers can be sensitive to affine transformations of the input data and are biased in favor of separating data along directions with large spread. [sent-1215, score-0.297]
90 The relative margin machine was proposed to overcome such problems and optimizes the projection direction such that the margin is large only relative to the spread of the data. [sent-1216, score-0.688]
91 Empirically, the RMM and maximum relative margin approach showed significant improvements in classification accuracy. [sent-1219, score-0.278]
92 Directions of future work include exploring the connections between maximum relative margin and generalization bounds based on margin distributions (Schapire et al. [sent-1237, score-0.566]
93 By bounding outputs, the RMM is potentially finding a better margin distribution on the training examples. [sent-1239, score-0.289]
94 Furthermore, the maximization of relative margin is a fairly promising and general concept which may be compatible with other popular problems that have recently been tackled by the maximum margin paradigm. [sent-1241, score-0.517]
95 Apply the Woodbury matrix inversion identity to the term inside the square root: $\sum_{i=1}^{n} x_i^\top\Big(\bar{D} I + \frac{D}{n} u_k u_k^\top\Big)^{-1} x_i = \frac{1}{\bar{D}}\sum_{i=1}^{n} x_i^\top\Big(I - \frac{\frac{D}{n} u_k u_k^\top}{\bar{D} + \frac{D}{n} u_k^\top u_k}\Big) x_i = \frac{1}{\bar{D}}\Big(\sum_{i=1}^{n} x_i^\top x_i - \frac{\frac{D}{n}\sum_{i=1}^{n}(x_i^\top u_k)^2}{\bar{D} + \frac{D}{n} u_k^\top u_k}\Big)$. [sent-1294, score-0.354]
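A quick numerical sanity check of the rank-one Woodbury (Sherman–Morrison) step above; the dimensions, the value of D, and the random data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, D = 5, 20, 0.3
Dbar = 1.0 - D
u = rng.normal(size=d)
X = rng.normal(size=(n, d))

# Left-hand side: sum_i x_i^T (Dbar*I + (D/n) u u^T)^{-1} x_i
lhs = np.trace(X @ np.linalg.inv(Dbar * np.eye(d) + (D / n) * np.outer(u, u)) @ X.T)

# Sherman-Morrison: (Dbar*I + (D/n) u u^T)^{-1} = (1/Dbar) * (I - (D/n) u u^T / (Dbar + (D/n) u^T u))
rhs = (1.0 / Dbar) * (np.trace(X @ X.T) - (D / n) * ((X @ u) @ (X @ u)) / (Dbar + (D / n) * (u @ u)))

assert np.isclose(lhs, rhs)
```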
96 …, $u_{n_v}$: $\min_{\lambda \ge 0}\;\frac{1}{n}\sum_{i=1}^{n} x_i^\top\Big(\bar{D}\sum_{j=1}^{n_v}\lambda_j I + D\sum_{j=1}^{n_v}\lambda_j u_j u_j^\top\Big)^{-1} x_i + E\sum_{i=1}^{n_v}\lambda_i$. [sent-1308, score-0.935]
97 Solving for $n_v$: let $x = \frac{1}{\sqrt{n_v}}$, $c = 4R\sqrt{2E}$ and $b = 3\sqrt{\frac{\ln(2/\delta)}{2}}$. [sent-1317, score-0.331]
98 Thus, $n_v = 1/\big(b + \sqrt{b^2 + (c + 2b)/\sqrt{n}}\,\big)^2$, which gives an exact expression for $n_v$. [sent-1322, score-0.588]
99 Dropping terms from the denominator produces the simpler expression: $n_v \le \sqrt{n}/(c + 2b)$. [sent-1323, score-0.294]
100 Empirical margin distributions and bounding the generalization error of combined classifiers. [sent-1463, score-0.279]
wordName wordTfidf (topN-words)
[('rmm', 0.728), ('nv', 0.294), ('margin', 0.239), ('svm', 0.158), ('universum', 0.151), ('ebara', 0.144), ('hivaswamy', 0.144), ('argin', 0.137), ('ependent', 0.137), ('rademacher', 0.122), ('elative', 0.117), ('spread', 0.114), ('aximum', 0.106), ('gw', 0.088), ('landmark', 0.088), ('af', 0.08), ('egularization', 0.074), ('mnist', 0.073), ('digits', 0.06), ('shivaswamy', 0.059), ('hw', 0.055), ('whitening', 0.053), ('xi', 0.053), ('tr', 0.049), ('klda', 0.048), ('classi', 0.046), ('landmarks', 0.042), ('article', 0.04), ('uk', 0.039), ('relative', 0.039), ('fe', 0.039), ('transformation', 0.037), ('weston', 0.037), ('ln', 0.037), ('yi', 0.036), ('cristianini', 0.036), ('yq', 0.035), ('digit', 0.034), ('rm', 0.033), ('training', 0.032), ('iid', 0.032), ('shade', 0.032), ('lda', 0.032), ('toy', 0.032), ('merely', 0.031), ('validation', 0.031), ('scaling', 0.03), ('bayes', 0.029), ('projections', 0.029), ('er', 0.028), ('edx', 0.027), ('rmms', 0.027), ('pr', 0.027), ('bounds', 0.027), ('kernel', 0.027), ('di', 0.026), ('crammer', 0.024), ('qp', 0.024), ('whitened', 0.024), ('dual', 0.024), ('rbf', 0.024), ('lipschitz', 0.024), ('vi', 0.024), ('motivation', 0.024), ('absolute', 0.024), ('bound', 0.024), ('sinz', 0.023), ('explored', 0.023), ('test', 0.023), ('paired', 0.023), ('perceptron', 0.023), ('mcdiarmid', 0.023), ('formulation', 0.023), ('slack', 0.022), ('generalization', 0.022), ('raetsch', 0.021), ('primal', 0.021), ('green', 0.021), ('examples', 0.021), ('synthetic', 0.021), ('labellings', 0.021), ('jebara', 0.02), ('svms', 0.02), ('kernels', 0.02), ('optical', 0.019), ('usps', 0.019), ('inequality', 0.019), ('boundary', 0.019), ('solution', 0.019), ('transformations', 0.018), ('radius', 0.018), ('projection', 0.018), ('bounding', 0.018), ('gram', 0.018), ('eu', 0.018), ('yw', 0.018), ('kernelized', 0.018), ('resilient', 0.018), ('gmm', 0.018), ('deviates', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
2 0.092791595 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
3 0.084809422 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
4 0.080861531 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
5 0.065236963 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
Author: Yael Ben-Haim, Elad Tom-Tov
Abstract: We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm’s accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy. Keywords: decision tree classifiers, distributed computing, streaming data, scalability
6 0.064356081 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
7 0.058425441 22 jmlr-2010-Classification Using Geometric Level Sets
8 0.047688417 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
9 0.046542402 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
10 0.045796666 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
11 0.045668386 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
12 0.045406777 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
13 0.044818748 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
14 0.04448488 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
15 0.042251665 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
16 0.039236881 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
17 0.039049305 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
18 0.037611969 25 jmlr-2010-Composite Binary Losses
19 0.037511688 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
20 0.036652882 48 jmlr-2010-How to Explain Individual Classification Decisions
topicId topicWeight
[(0, -0.18), (1, -0.033), (2, -0.003), (3, 0.136), (4, -0.002), (5, 0.105), (6, -0.043), (7, 0.047), (8, -0.006), (9, -0.071), (10, 0.044), (11, -0.081), (12, 0.084), (13, 0.067), (14, -0.056), (15, 0.019), (16, 0.042), (17, -0.093), (18, 0.003), (19, -0.08), (20, 0.11), (21, 0.058), (22, -0.008), (23, 0.016), (24, 0.183), (25, -0.148), (26, 0.073), (27, 0.074), (28, 0.216), (29, -0.079), (30, -0.037), (31, -0.006), (32, 0.039), (33, 0.114), (34, -0.062), (35, -0.083), (36, 0.01), (37, 0.04), (38, -0.042), (39, -0.096), (40, 0.0), (41, 0.093), (42, -0.084), (43, -0.027), (44, 0.069), (45, 0.117), (46, -0.026), (47, 0.054), (48, 0.06), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.8803274 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
2 0.65764785 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu
Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error bound, margin-based theory, pattern classification, tree decomposition, support vector machine, VC theory
3 0.6219731 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
4 0.57928163 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
5 0.47151509 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
6 0.39137775 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.37160811 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
8 0.36442915 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
9 0.3617374 22 jmlr-2010-Classification Using Geometric Level Sets
10 0.34963107 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
11 0.30559808 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
12 0.30444098 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
13 0.30146161 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.2946077 84 jmlr-2010-On Spectral Learning
15 0.27623674 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
16 0.27520925 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
17 0.27417424 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
18 0.27261129 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
19 0.26429084 94 jmlr-2010-Quadratic Programming Feature Selection
20 0.25915489 69 jmlr-2010-Lp-Nested Symmetric Distributions
topicId topicWeight
[(1, 0.012), (3, 0.019), (4, 0.015), (8, 0.023), (21, 0.018), (24, 0.01), (32, 0.08), (36, 0.033), (37, 0.071), (71, 0.303), (75, 0.17), (81, 0.023), (85, 0.093), (96, 0.022), (97, 0.011)]
simIndex simValue paperId paperTitle
1 0.82922351 48 jmlr-2010-How to Explain Individual Classification Decisions
Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller
Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity
same-paper 2 0.72063923 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
3 0.58149773 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.58139759 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
5 0.57851893 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
6 0.57586873 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.57320988 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
8 0.57204789 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
9 0.57128012 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
10 0.5706439 102 jmlr-2010-Semi-Supervised Novelty Detection
11 0.56880742 109 jmlr-2010-Stochastic Composite Likelihood
12 0.56791055 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
13 0.56773192 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
14 0.56581038 63 jmlr-2010-Learning Instance-Specific Predictive Models
15 0.56437236 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
16 0.56427807 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
17 0.56365275 22 jmlr-2010-Classification Using Geometric Level Sets
19 0.56247765 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
20 0.56116354 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning