jmlr jmlr2010 jmlr2010-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function into local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers, at least for non-high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instance-based learning
Reference: text
sentIndex sentText sentNum sentScore
1 IT Department of Information Engineering and Computer Science, University of Trento, Trento, Italy. Editor: Léon Bottou. Abstract: A computationally efficient approach to local learning with kernel methods is presented. [sent-5, score-0.214]
2 The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. [sent-6, score-0.456]
3 Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. [sent-7, score-0.301]
4 The introduction of a fast local model selection further speeds-up the learning process. [sent-8, score-0.22]
5 More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. [sent-10, score-0.355]
6 Keywords: locality, kernel methods, local learning algorithms, support vector machines, instance-based learning 1. [sent-11, score-0.214]
7 The result is an approximation of the optimal separation and a smoothing of the decision function which is more influenced by the global distribution of the examples than by the local behaviour of the unknown target function in each specific sub-region. [sent-23, score-0.279]
8 The idea is thus to trade locality for scalability, permitting, with a potentially high level of under-fitting, a fast convergence to an approximated solution of the optimisation problem. [sent-25, score-0.342]
9 We show here that locality is not necessarily related to computational inefficiency; instead, it can be the key factor in obtaining very fast kernel methods without the need to locally smooth the global decision function. [sent-26, score-0.348]
10 In our proposed approach, the model is formed by a set of accurate local models trained on fixed-cardinality sub-regions of the training set, and the prediction module uses, for each query point, the most appropriate local model. [sent-27, score-0.508]
11 In this setting, we are not approximating with some level of inaccuracy the original SVM optimisation problem, but we are separately considering different parts of the decision function with the potential advantage of better capturing the local separation. [sent-28, score-0.324]
12 In this work we present Fast Local Kernel Support Vector Machine (FaLK-SVM), that precomputes a set of local SVMs covering with adjustable redundancy the whole training set and uses for prediction a model which is the nearest (in terms of neighbourhood rank in feature space) to each testing point. [sent-31, score-0.591]
13 The locality of the approach is regulated by the neighbourhood size k and the method uses all the training points. [sent-35, score-0.369]
14 We also introduce a procedure for local model selection in order to speed up the selection of the parameters and to better capture the local properties of the data. [sent-37, score-0.354]
15 This is due to the combination of an intrinsic dimensionality which is low with respect to the training set size and of a decision function which is not simple to learn. [sent-44, score-0.236]
16 In general, locality plays a more important role as the number of training examples increases because the ratio between training set cardinality and the dimensionality is more favourable and the local characteristics are more evident. [sent-45, score-0.514]
17 (ii) The approach is also an enhancement of the local learning algorithms because the learning process is not delayed until the prediction phase (lazy learning) but the construction of the local models occurs during training (eager learning). [sent-52, score-0.418]
18 (iv) For complex classification problems that require a high fraction of support vectors (SVs), we exploit locality to avoid the need to bound the total number of SVs, as existing approximated SVM solvers do for computational reasons. [sent-54, score-0.219]
19 In the next section we analyse the work on local learning algorithms, Local SVM, and fast large-margin classifiers, which are all related to our work. [sent-57, score-0.263]
20 Section 3 formally introduces some machine learning tools that we need in order to introduce FaLK-SVM in Section 4 and analyse its learning bounds, complexity bounds, implementation, local model selection procedure and intuitive interpretation. [sent-58, score-0.216]
21 We review in this section the areas that are most related to our approach: local learning algorithms, local support vector machines, and approximated and scalable SVM solvers. [sent-62, score-0.362]
22 Instead of estimating a decision function which is optimal (with respect to some criteria) for all possible unseen testing examples, the idea underlying LLAs consists in estimating the optimal decision function for each single testing point. [sent-65, score-0.304]
23 For a local learning algorithm, the points in the proximity of the query point have a higher influence on the training of the local model. [sent-67, score-0.514]
24 A proper choice of the locality parameter can reduce the generalisation error with respect to a global classifier, as formalised by the Local Risk Minimization principle (Vapnik and Bottou, 1993; Vapnik, 2000). [sent-70, score-0.217]
25 However, Local SVM suffers from the high computational cost of the testing phase that comprises, for each example, (i) the selection of the k nearest neighbours and (ii) the computation of the maximal separating hyperplane on the k examples. [sent-85, score-0.271]
26 Bagging with SVM can thus be used to obtain scalability as long as the advantage of training smaller SVM models on subsets of the training set (SVM training can scale cubically) overcomes the disadvantage of training multiple SVMs. [sent-123, score-0.291]
27 The common idea of all the proposed methods is that the advantage of having a method that uses a huge number of training points overcomes the disadvantage of approximating the decision function with a linear model. [sent-141, score-0.22]
28 However, when the needed decision function is highly non-linear and the intrinsic dimensionality of the space is relatively small, the linear SVM approach cannot compete with SVM using non-linear kernels in terms of generalisation accuracy. [sent-144, score-0.26]
29 Notice that in situations where the neighbourhood contains only one class the local SVM does not find any separation and so considers all the neighbourhood to belong to the predominant class similarly to the behaviour of the majority rule. [sent-246, score-0.468]
30 The generalisation of kNNSVM for multi-class classification can occur locally, that is solving the local multi-class SVM problem, or globally, that is applying the binary kNNSVM classifier on multiple global binary problems. [sent-248, score-0.219]
31 In more detail, a cover tree can be viewed as a sub-graph of a navigating net (Krauthgamer and Lee, 2004), and it is a levelled tree in which each level (indexed by a decreasing integer i) is a cover (i.e., it covers the level beneath it). [sent-255, score-0.246]
32 The covering tree invariant implies that every node has a parent in a higher level such that the distance between the respective points is less than b^i, while the separation invariant assures that the distance between every pair of points associated to the nodes of a level i is higher than b^i. [sent-262, score-0.24]
33 As shown by Beygelzimer et al. (2006), the space required by the cover tree data structure is linear in the data set size (O(n)), the computational time of single point insertions, deletions and exact nearest neighbour queries is logarithmic (O(log n)), while the cover tree can be built in O(n log n). [sent-266, score-0.446]
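For intuition only, a small sketch of the role such an index plays: scikit-learn provides no cover tree, so a ball tree with comparable query behaviour stands in for it, and the data are synthetic.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Sketch: build a spatial index once, then answer k-NN queries quickly.
# BallTree is an assumption/stand-in for the cover tree used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))      # hypothetical training set
tree = BallTree(X)                      # built once, reused for all queries
dist, idx = tree.query(X[:5], k=10)     # 10-neighbourhoods of the first 5 points
print(idx.shape)                        # (5, 10): neighbour indices in rank order
```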
34 Initially we detail the way to pre-compute the local models during training (Section 4.1). [sent-269, score-0.23]
35 We then describe how to reduce the number of local models (Section 4.2) and our approach for fast local model selection (Section 4.3). [sent-274, score-0.22]
36 4.1 Pre-computing the Local Models during the Training Phase. For the local approach we are proposing here, we need to generalise the decision rule of kNNSVM to the case in which the local model is trained on the k-neighbourhood of a point that is, in general, distinct from the query point. [sent-281, score-0.514]
37 A modified decision function for a query point q ∈ H and another (possibly different) point t ∈ H is: kNNSVM_t(q) = sign( Σ_{i=1}^{k} α_{r_t(i)} y_{r_t(i)} K(x_{r_t(i)}, q) + b )  (4), where r_t(i) is the kNNSVM ordering function (see Section 3 above). [sent-282, score-0.225]
38 In the following we will refer to kNNSVM_t(q) as being centred on t, to t as the centre of the model, and, if t ∈ X, to V_t as the Voronoi cell induced by t in X; formally, V_t = {p ∈ H s.t. d(p, t) ≤ d(p, x) for all x ∈ X}. [sent-284, score-0.225]
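A minimal sketch of Equation 4 under stated assumptions: RBF kernel, illustrative C and gamma values, a brute-force neighbour search in place of the cover tree, and scikit-learn's SVC in place of the paper's solver.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def knnsvm_decision(X, y, t_idx, q, k=50, C=1.0, gamma=0.1):
    # k-neighbourhood of the centre x_t, i.e., the ordering r_t of Equation 4
    order = np.argsort(np.linalg.norm(X - X[t_idx], axis=1))[:k]
    Xk, yk = X[order], y[order]
    if len(np.unique(yk)) == 1:         # one-class neighbourhood: fall back to the majority rule
        return float(yk[0])
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(Xk, yk)
    # sum_i alpha_i y_i K(x_i, q) + b, evaluated through the fitted dual expansion
    Kq = rbf_kernel(Xk[clf.support_], q.reshape(1, -1), gamma=gamma)
    return (clf.dual_coef_ @ Kq + clf.intercept_).item()
```

The sign of the returned value gives the predicted label, matching the sign(·) of Equation 4.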
39 The first modification of kNNSVM consists in predicting the label of a test point q using the local SVM model built on the k-neighbourhood of its nearest neighbour in X . [sent-291, score-0.343]
40 This is possible by computing a local SVM model for each x ∈ X during the training phase, obtaining the set {(t, kNNSVM_t) | t ∈ X}, and applying the precomputed kNNSVM_t model with t = x_{r_q(1)} for each query point q during the testing phase. [sent-295, score-0.475]
41 Instead of estimating the decision function for a given test example q and thus for a specific point in the input metric space, we estimate a decision function for each Voronoi cell Vx induced by the training set in the input metric space. [sent-297, score-0.273]
42 In this way, the construction of the models in the training phase requires the estimation of N local decision functions. [sent-298, score-0.368]
43 The prediction of a test point q is done using the model built for the Voronoi region in which q lies (V_h with h = x_{r_q(1)}), which can be retrieved by searching for the nearest neighbour of q in X. [sent-299, score-0.24]
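A hedged end-to-end sketch of this first variant (one local model per training point, prediction through the nearest neighbour's model); SVC and NearestNeighbors stand in for the paper's solver and cover tree, and k, C, gamma are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def train_local_models(X, y, k=50, C=1.0, gamma=0.1):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)   # k-neighbourhood of every training point
    models = []
    for idx in neigh:
        yi = y[idx]
        if len(np.unique(yi)) == 1:                   # unanimous neighbourhood: constant model
            models.append(yi[0])
        else:
            models.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], yi))
    return nn, models

def predict_local(nn, models, Q):
    nearest = nn.kneighbors(Q, n_neighbors=1, return_distance=False).ravel()
    preds = []
    for t, q in zip(nearest, Q):
        m = models[t]                                 # model of the Voronoi cell the query falls in
        preds.append(m.predict(q.reshape(1, -1))[0] if hasattr(m, "predict") else m)
    return np.array(preds)
```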
44 4.2 Reducing the Number of Local Models that Need to Be Trained. The pre-computation of the local models during the training phase introduced above increases the computational efficiency of the prediction step. [sent-301, score-0.275]
45 In fact, the training of an SVM for each training point can be slower than the training of a unique global SVM (especially for non-small values of k), so we introduce another modification of the method which aims to dramatically reduce the number of SVMs that need to be pre-computed. [sent-303, score-0.261]
46 The idea is that we can relax the constraint that a query point x′ is always evaluated using the model trained around its nearest training point. [sent-304, score-0.31]
47 The decision function of this approach is FastLSVM(x) = kNNSVM_{f(x)}(x) (6), where f : H → C ⊆ X is a function mapping each unseen example x to a unique training example f(x) which is, according to Equation 4, the centre of the local model that is used to evaluate x. [sent-305, score-0.455]
48 Notice that if f(x) = x_{r_x(1)} for every x, we have C = X and FastLSVM(x) is equivalent to the kNNSVM formulation of Equation 5; this happens if we use all the examples in the training set as centres for local SVM models. [sent-307, score-0.532]
49 In the general case, however, we select only a proper subset C ⊂ X of points to be used as centres of kNNSVM models. [sent-308, score-0.301]
50 In this case, if x_{r_x(1)} ∈ C then f(x) can be defined as f(x) = x_{r_x(1)}, but if x_{r_x(1)} ∉ C then f(x) must be defined in a way such that the principle of locality is preserved and the retrieval of the model is fast at prediction time. [sent-309, score-0.678]
51 4.2.1 Selecting the Centres of the Local Models. The approach we developed for selecting the set C of centres of the local models is based on the idea that each training point must be in the k′-neighbourhood of at least one centre, with k′ being a fixed parameter and k′ ≤ k. [sent-313, score-0.623]
52 From a slightly different viewpoint, we need to cover the entire training set with a set of hyper-spheres whose centres are the examples in C and each of which contains exactly k′ points. [sent-314, score-0.427]
53 We can formalise this idea with the concept of k′-neighbourhood covering set: Definition 1 Given k′ ∈ N, a k′-neighbourhood covering set of centres C ⊆ X is a subset of the training set such that ⋃_{c ∈ C} { x_{r_c(i)} | i = 1, . . . , k′ } = X. [sent-315, score-0.494]
54 Theoretically, for a fixed k′, the minimisation of the number of local SVMs that we need to train can be obtained by computing the SVMs centred on the points contained in the minimal k′-neighbourhood covering set of centres. [sent-320, score-0.415]
55 Definition 2 The minimal k′-neighbourhood covering set of centres is a k′-neighbourhood covering set C ⊆ X with minimal cardinality. [sent-321, score-0.407]
56 However, in the SC and MSSC problems one specifies the radius of the spheres rather than their cardinality in terms of points they contain and it is not required that the centres of the hyperspheres correspond to points in the set. [sent-323, score-0.341]
57 In our case, however, we do not need the minimality of the constraints of the k′ -neighbourhood covering set of centres to be strictly satisfied, because training some more local SVMs is acceptable instead of solving an NP-hard problem. [sent-326, score-0.564]
58 The first k′-neighbourhood is selected by randomly choosing its centre in X; the following k′-neighbourhoods are retrieved by selecting centres that are not yet members of other k′-neighbourhoods and are as far as possible from the already selected centres. [sent-329, score-0.393]
59 With this strategy we can choose the centres of the local models first in the set S_{i+1}, then in the set S_i and so on, thus selecting first the centres that are assured to be distant at least d(S_{i+1}), then at least d(S_i) < d(S_{i+1}) and so on. [sent-346, score-0.665]
60 We can now formalise the selection of the centres from X using the Si sets. [sent-365, score-0.295]
61 The next centre c2 is chosen among the non-empty Sl sets obtained removing from Si the first centre c1 and the points in its k′ -neighbourhood; in particular c2 is chosen from the non-empty Sl with highest l. [sent-367, score-0.304]
62 The general case for the c j centre is similar, with the only difference being that we remove from the Si sets all the centres ct with t < j and their k′ -neighbourhood. [sent-368, score-0.393]
63 The removed set is the union of all the k′-neighbourhoods of the centres c_1, . . . , c_l already included in C. [sent-373, score-0.261]
64 In fact, the iterative procedure for selecting the centres in C terminates when the choose() function cannot select a point from S_l because all the S_j with j = bot, . . . , top are empty. [sent-375, score-0.261]
65 Computationally, the selection of the centres from the S j sets with Equation 9 can be performed efficiently once the S j are identified. [sent-381, score-0.295]
66 Taking all these facts into consideration, we chose to use the levels of cover tree as the S j sets from which we select the centres as reported in Equation 9. [sent-389, score-0.384]
67 Consistently with the goal of reducing the number of local models, this approach no longer requires that a local SVM be trained for each training example; we need to train only |C| SVMs, one centred on each c ∈ C, obtaining the following models: kNNSVM_c(x), ∀c ∈ C. [sent-390, score-0.5]
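A rough sketch of this reduction under stated assumptions: a random scan replaces the cover-tree-level ordering used in the paper, SVC stands in for the paper's solver, k′ ≤ k is assumed, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def select_centres_and_train(X, y, k_prime=10, k=50, C=1.0, gamma=0.1, seed=0):
    # Greedy k'-neighbourhood covering: every training point ends up in the
    # k'-neighbourhood of at least one centre (it is either covered or promoted).
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)
    covered = np.zeros(len(X), dtype=bool)
    centres, models = [], {}
    for idx in np.random.default_rng(seed).permutation(len(X)):
        if covered[idx]:                              # already inside some centre's k'-neighbourhood
            continue
        covered[neigh[idx, :k_prime]] = True          # its k'-neighbourhood is now covered
        centres.append(idx)
        yi = y[neigh[idx]]                            # train on the full k-neighbourhood
        if len(np.unique(yi)) == 1:
            models[idx] = yi[0]                       # unanimous neighbourhood: no SVM needed
        else:
            models[idx] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X[neigh[idx]], yi)
    return np.array(centres), models, nn
```

Only |C| local SVMs are fitted, and centres whose neighbourhoods are unanimous cost nothing, mirroring the behaviour described above.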
68 Figure 1: Graphical representation of the proposed approach using local models with k′ = 4, k = 15, and local SVM with RBF kernel. [sent-391, score-0.286]
69 The bold dotted circles highlight the k′-neighbourhoods covering the whole training set (with some unavoidable redundancy), while the thin dotted circles denote the k-neighbourhoods on which the local models are trained. [sent-392, score-0.369]
70 The local SVM (with RBF kernel) decision functions are drawn in blue. [sent-394, score-0.236]
71 Notice that, due both to the adoption of the k′ -neighbourhood cover set and to the fact that only a fraction of the neighbourhoods need to be trained, we have only 17 local decision functions for 185 points. [sent-395, score-0.381]
72 Moreover if a neighbourhood contains only points belonging to one class the local model is the majority rule (specifically, unanimity) and the training of the SVM is avoided. [sent-396, score-0.411]
73 If k′ is low, a large number of k′ -neighbourhoods are required to cover the entire training set, whereas if k′ is large fewer k′ neighbourhoods are needed. [sent-400, score-0.232]
74 4.2.2 Selecting the Local Models for Testing Points. Once the set of centres C is defined and the corresponding local models are trained, we need to select the proper model to use for predicting the label of a test point. [sent-404, score-0.404]
75 A simple strategy we can adopt consists in selecting the model whose centre c ∈ C is the nearest centre with respect to the testing example. [sent-405, score-0.411]
76 Using the general definition of FastLSVM of Equation 6 with f(x) = x_{r^C_x(1)}, where r^C corresponds to the reordering function defined in Equation 3 performed on the set C instead of X, the method, called FaLK-SVMc, is defined as: FaLK-SVMc(x) = kNNSVM_c(x) where c = x_{r^C_x(1)}. [sent-406, score-0.418]
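A small sketch of this nearest-centre strategy, reusing the hypothetical `centres` and `models` returned by the previous sketch; Euclidean distance and scikit-learn's NearestNeighbors are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def predict_falk_svmc(X, centres, models, Q):
    cnn = NearestNeighbors(n_neighbors=1).fit(X[centres])     # index over the centres only
    nearest = cnn.kneighbors(Q, return_distance=False).ravel()
    preds = []
    for j, q in zip(nearest, Q):
        m = models[centres[j]]                                # model of the nearest centre
        preds.append(m.predict(q.reshape(1, -1))[0] if hasattr(m, "predict") else m)
    return np.array(preds)
```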
77 However, it does not assure that the testing point is evaluated with the model centred on the point for which the testing point itself is the nearest in terms of neighbour ranking. [sent-408, score-0.411]
78 For example, a testing point q can be closer to c1 than c2 using the Euclidean distance, but at the same time we can have that q is the i-th nearest neighbour of c1 in X and the j-th nearest neighbour of c2 with i > j. [sent-409, score-0.459]
79 In order to overcome this issue of FaLK-SVMc we propose to use, for a testing point q, the model centred on the training point which is the nearest in terms of the neighbourhood ranking to its training nearest neighbour. [sent-411, score-0.643]
80 The cnt function finds, for each example x, the minimum value h such that x is in the h-neighbourhood of at least one centre c ∈ C ; then, among the centres having x in their h-neighbourhoods, it selects the centre with the minimum index. [sent-416, score-0.57]
81 In this way each training point is uniquely assigned to a centre, and the decision function of this approximation of Local SVM, derived from FastLSVM of Equation 6 with f(x) = cnt(x) and called FaLK-SVM, is simply: FaLK-SVM(x) = kNNSVM_{cnt(t)}(x) where t = x_{r_x(1)}. (12) [sent-418, score-0.525]
The association between training points and centres defined by Equation 11 can be efficiently precomputed during the training phase, delaying to the testing phase only the retrieval of the nearest neighbour of the testing point and the evaluation of the local SVM model. [sent-419, score-1.008]
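One possible reading of the cnt(·) assignment (Equation 11) and of the prediction rule (Equation 12), as a sketch only: the naive quadratic rank computation and the scikit-learn calls are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_cnt(X, centres, k):
    # For each training point, keep the centre in whose neighbourhood it ranks
    # earliest; ties go to the smaller centre index (strict '<' plus sorted scan).
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    best_rank = np.full(len(X), k + 1, dtype=int)
    cnt = np.full(len(X), -1, dtype=int)
    for c in sorted(centres):
        ranked = nn.kneighbors(X[c].reshape(1, -1), return_distance=False).ravel()
        for h, x_idx in enumerate(ranked, start=1):   # h = rank of x_idx in c's neighbourhood
            if h < best_rank[x_idx]:
                best_rank[x_idx], cnt[x_idx] = h, c
    return cnt

def predict_falk_svm(X, cnt, models, Q):
    nn1 = NearestNeighbors(n_neighbors=1).fit(X)
    nearest = nn1.kneighbors(Q, return_distance=False).ravel()
    preds = []
    for t, q in zip(nearest, Q):
        m = models[cnt[t]]                            # centre precomputed for the query's nearest training point
        preds.append(m.predict(q.reshape(1, -1))[0] if hasattr(m, "predict") else m)
    return np.array(preds)
```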
4.3 FaLK-SVM with Internal Model Selection: FaLK-SVMl. For training a kernel machine, once a proper kernel is chosen, it is crucial to carefully tune the kernel parameters and, for SVM, to set the soft margin regularisation constant C. [sent-423, score-0.338]
84 Although κ can be confused with the neighbourhood size k or with the kernel function K, κ is always used for denoting κ-fold CV, so the context should be sufficient to avoid ambiguity. [sent-426, score-0.212]
85 The only difference with SVM is that our local kernel machines need to estimate an additional parameter which is the neighbourhood size k (which is however usually chosen in a small set of possible values). [sent-435, score-0.355]
86 However, since the local model is used by kNNSVM only for the central point, the model selection should be performed in order to make the local models predictive especially for the very internal points. [sent-444, score-0.32]
87 For FaLK-SVM we can apply the k′ -internal κ-fold CV for kNNSVM model selection on a randomly chosen training example and use the resulting parameters for all the local models. [sent-447, score-0.264]
88 Definition 5 (k′ -internal κ-fold CV model selection for FaLK-SVM) The procedure applies the k′ -internal κ-fold CV for kNNSVM model selection on the k-neighbourhoods of 1 ≤ m ≤ |C | randomly chosen centres selecting the parameters that minimise the average error rate among the m applications. [sent-449, score-0.36]
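A rough stand-in for Definition 5 under stated assumptions: the grid, m, and the use of GridSearchCV are illustrative; the paper's k′-internal procedure additionally scores only the k′ innermost points of each neighbourhood, which is not reproduced here, and the number of folds may need lowering for small or unbalanced neighbourhoods.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors

def local_model_selection(X, y, centres, k=50, m=3, folds=10, seed=0):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
    scores = {}
    rng = np.random.default_rng(seed)
    for c in rng.choice(centres, size=min(m, len(centres)), replace=False):
        idx = nn.kneighbors(X[c].reshape(1, -1), return_distance=False).ravel()
        if len(np.unique(y[idx])) == 1:               # skip unanimous neighbourhoods
            continue
        gs = GridSearchCV(SVC(kernel="rbf"), grid, cv=folds).fit(X[idx], y[idx])
        for params, s in zip(gs.cv_results_["params"], gs.cv_results_["mean_test_score"]):
            key = (params["C"], params["gamma"])
            scores[key] = scores.get(key, 0.0) + s
    return max(scores, key=scores.get)                # (C, gamma) with the best average CV accuracy
```

The winning pair is then reused for every local model, as the definition prescribes.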
89 Since FaLK-SVMl selects the local model parameters using a small subset of the training set, the variance of the error may be higher than the standard cross-validation strategies. [sent-451, score-0.23]
4.4 Generalisation Bounds for kNNSVM and FaLK-SVM. The class of LLAs introduced by Bottou and Vapnik (1992) includes kNNSVM, and it can be theoretically analysed using the framework based on local risk minimisation (Vapnik and Bottou, 1993; Vapnik, 2000). [sent-462, score-0.28]
91 In fact, LLAs compute the local function for each specific testing point thus delaying the neighbourhood retrieval and model training until the testing point is available. [sent-464, score-0.516]
We need to recall the bound for local risk minimisation, which is a generalisation of the global risk minimisation theory. [sent-466, score-0.365]
The possibility for local approaches to obtain a lower bound on the test misclassification probability by acting on the locality parameter, stated in Vapnik and Bottou (1993) and Vapnik (2000) for LLAs, is even more evident for kNNSVM considering Equation 13. [sent-473, score-0.284]
In fact, although choosing k < N is not by itself sufficient to lower the bound, as k decreases the model training becomes more and more local and (very likely) the training misclassification rate ν_{x′} decreases as well. [sent-474, score-0.317]
Moreover, the complexity of the classifier (and thus h_Σ) can decrease when the neighbourhood decreases, because simpler decision functions can be used when fewer points are considered. [sent-475, score-0.274]
96 FaLK-SVM pre-computes local models to be used for testing points lying in sub-regions (k-NN Voronoi cells) of the training set. [sent-478, score-0.329]
For the testing phase of FaLK-SVM we can distinguish two steps (for each testing point): (i) the retrieval of the nearest training point, which scales as O(log N); and (ii) the prediction of the testing label using the selected local model, which scales as O(k). [sent-496, score-0.653]
, 2006; Dong, 2005), is rather critical; for FaLK-SVM it is sufficient that, every time the points for a model are retrieved, the training of the local SVM is performed on a different processor. [sent-501, score-0.27]
For large data sets, FaLK-SVM can still maintain in memory the entire local kernel matrix (if k is not too large), whereas SVM must discard some kernel values, thus increasing the SVM time complexity due to the need to recompute them. [sent-506, score-0.285]
(Fragment of the FaLK-SVM training pseudocode:) indexes ⇐ {1, . . . , N} // the indexes for centres selection; Randomise(indexes) // randomise the indexes; for i ⇐ 1 to N do: index ⇐ indexes[i] // get the i-th index; if modelPtrs[index] = null then // the point has not yet been assigned to a model. [sent-521, score-0.493]
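A hypothetical end-to-end run tying the earlier sketches together on synthetic data; the function names are the ones introduced in the sketches above, not the authors' API.

```python
import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
Xtr, ytr, Xte, yte = X[:1500], y[:1500], X[1500:], y[1500:]

centres, models, _ = select_centres_and_train(Xtr, ytr, k_prime=10, k=50)
cnt = build_cnt(Xtr, centres, k=50)
pred = predict_falk_svm(Xtr, cnt, models, Xte)
print("test accuracy:", (pred == yte).mean())
```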
wordName wordTfidf (topN-words)
[('knnsvm', 0.452), ('svm', 0.29), ('centres', 0.261), ('xrx', 0.213), ('rx', 0.205), ('blanzieri', 0.173), ('local', 0.143), ('neighbourhood', 0.141), ('locality', 0.141), ('calable', 0.133), ('egata', 0.133), ('lanzieri', 0.133), ('centre', 0.132), ('llas', 0.12), ('segata', 0.12), ('cv', 0.116), ('achines', 0.114), ('ocal', 0.113), ('neighbour', 0.112), ('si', 0.111), ('query', 0.101), ('ernel', 0.097), ('decision', 0.093), ('centred', 0.093), ('nearest', 0.088), ('optimisation', 0.088), ('training', 0.087), ('melgani', 0.08), ('sbot', 0.08), ('cover', 0.079), ('generalisation', 0.076), ('covering', 0.073), ('kernel', 0.071), ('knnsvmt', 0.066), ('mssc', 0.066), ('neighbourhoods', 0.066), ('rknnsvm', 0.066), ('indexes', 0.066), ('minimisation', 0.066), ('vapnik', 0.066), ('testing', 0.059), ('lla', 0.057), ('fastlsvm', 0.053), ('bordes', 0.053), ('voronoi', 0.051), ('bottou', 0.05), ('eager', 0.045), ('cnt', 0.045), ('neighbours', 0.045), ('phase', 0.045), ('tree', 0.044), ('separation', 0.043), ('scales', 0.043), ('knn', 0.043), ('fast', 0.043), ('xr', 0.041), ('rbf', 0.04), ('risk', 0.04), ('bot', 0.04), ('modelptrs', 0.04), ('ncl', 0.04), ('nesting', 0.04), ('xrq', 0.04), ('points', 0.04), ('approximated', 0.04), ('analyse', 0.039), ('classi', 0.039), ('solvers', 0.038), ('margin', 0.038), ('sl', 0.038), ('scalable', 0.036), ('performances', 0.035), ('stop', 0.035), ('kernels', 0.035), ('kecman', 0.034), ('trained', 0.034), ('selection', 0.034), ('svms', 0.032), ('minimise', 0.031), ('joachims', 0.031), ('analysed', 0.031), ('dist', 0.031), ('yr', 0.031), ('scalability', 0.03), ('er', 0.029), ('dimensionality', 0.029), ('tsang', 0.028), ('accuracies', 0.027), ('retrieval', 0.027), ('intrinsic', 0.027), ('xi', 0.027), ('bbot', 0.027), ('btop', 0.027), ('chvatal', 0.027), ('dagsvm', 0.027), ('disi', 0.027), ('electing', 0.027), ('enrico', 0.027), ('favourable', 0.027), ('knnsvmc', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999851 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
2 0.14460175 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
3 0.084809422 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
4 0.08195626 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
5 0.078501679 102 jmlr-2010-Semi-Supervised Novelty Detection
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. Keywords: semi-supervised learning, novelty detection, Neyman-Pearson classification, learning reduction, two-sample problem, multiple testing
6 0.075651243 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
7 0.074067615 22 jmlr-2010-Classification Using Geometric Level Sets
8 0.07182309 15 jmlr-2010-Approximate Tree Kernels
9 0.069343805 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
10 0.068762101 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
11 0.066226393 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
12 0.055667888 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory
13 0.051570628 48 jmlr-2010-How to Explain Individual Classification Decisions
14 0.050757401 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.044380624 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
17 0.041603368 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
18 0.041453782 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
19 0.041307967 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.040830482 44 jmlr-2010-Graph Kernels
topicId topicWeight
[(0, -0.213), (1, 0.036), (2, -0.032), (3, 0.175), (4, 0.046), (5, 0.155), (6, -0.124), (7, -0.007), (8, -0.047), (9, -0.062), (10, 0.113), (11, -0.046), (12, 0.013), (13, 0.044), (14, -0.101), (15, -0.023), (16, 0.006), (17, -0.142), (18, 0.051), (19, -0.136), (20, 0.236), (21, 0.085), (22, 0.003), (23, 0.111), (24, 0.009), (25, -0.135), (26, -0.062), (27, 0.02), (28, 0.22), (29, -0.034), (30, -0.06), (31, -0.036), (32, 0.034), (33, 0.119), (34, 0.016), (35, 0.058), (36, 0.008), (37, -0.099), (38, 0.049), (39, 0.14), (40, 0.031), (41, 0.069), (42, -0.01), (43, -0.095), (44, -0.025), (45, -0.076), (46, 0.029), (47, 0.068), (48, -0.074), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.92764068 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
2 0.76412249 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
3 0.66181225 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
4 0.64699823 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu
Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error ¨bound, margin-based theory, pattern classification, ı tree decomposition, support vector machine, VC theory
5 0.53339505 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
6 0.38333428 22 jmlr-2010-Classification Using Geometric Level Sets
7 0.37298971 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
8 0.36762553 102 jmlr-2010-Semi-Supervised Novelty Detection
9 0.36148942 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
10 0.35934544 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
11 0.35910845 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
12 0.35427898 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
14 0.33183947 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
15 0.28853655 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.28650838 48 jmlr-2010-How to Explain Individual Classification Decisions
17 0.28367376 15 jmlr-2010-Approximate Tree Kernels
18 0.27498373 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
19 0.27292451 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
20 0.2699706 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
topicId topicWeight
[(1, 0.013), (3, 0.014), (4, 0.015), (8, 0.018), (21, 0.038), (22, 0.01), (32, 0.053), (33, 0.01), (36, 0.03), (37, 0.038), (75, 0.176), (79, 0.401), (81, 0.019), (85, 0.078), (96, 0.012)]
simIndex simValue paperId paperTitle
1 0.94267291 35 jmlr-2010-Error-Correcting Output Codes Library
Author: Sergio Escalera, Oriol Pujol, Petia Radeva
Abstract: In this paper, we present an open source Error-Correcting Output Codes (ECOC) library. The ECOC framework is a powerful tool to deal with multi-class categorization problems. This library contains both state-of-the-art coding (one-versus-one, one-versus-all, dense random, sparse random, DECOC, forest-ECOC, and ECOC-ONE) and decoding designs (hamming, euclidean, inverse hamming, laplacian, β-density, attenuated, loss-based, probabilistic kernel-based, and lossweighted) with the parameters defined by the authors, as well as the option to include your own coding, decoding, and base classifier. Keywords: error-correcting output codes, multi-class classification, coding, decoding, open source, matlab, octave 1. Error-Correcting Output Codes The Error-Correcting Output Codes (ECOC) framework (Dietterich and Bakiri, 1995) is a simple but powerful framework to deal with the multi-class categorization problem based on the embedding of binary classifiers. Given a set of Nc classes, the basis of the ECOC framework consists of designing a codeword for each of the classes. These codewords encode the membership information of each class for a given binary problem. Arranging the codewords as rows of a matrix, we obtain a ”coding matrix” Mc , where Mc ∈ {−1, 0, 1}Nc ×n , being n the length of the codewords codifying each class. From the point of view of learning, Mc is constructed by considering n binary problems, each one corresponding to a column of the matrix Mc . Each of these binary problems (or dichotomizers) splits the set of classes in two partitions (coded by +1 or -1 in Mc according to their class set membership, or 0 if the class is not considered by the current binary problem). Then, at the decoding step, applying the n trained binary classifiers, a code is obtained for each data point in the test set. This code is compared to the base codewords of each class defined in the matrix Mc , and the data point is assigned to the class with the ”closest” codeword. Several decoding strategies have been proposed in literature. The reader is referred to Escalera et al. (2008) for a more detailed review. An example of an ECOC design is described in Fig. 1. The ECOC designs are independent of the base classifier applied. They involve error-correcting properties (Dietterich and Bakiri, 1995) and have shown to be able to reduce the bias and variance produced by the learning algorithm (Kong and Dietterich, 1995). Because of these reasons, ECOCs have been widely used to deal with multi-class categorization problems. c 2010 Sergio Escalera, Oriol Pujol and Petia Radeva. E SCALERA , P UJOL AND R ADEVA ECOC coding design for a 4-class problem. White, black, and grey positions corresponds to the symbols +1, -1, and 0, respectively. Once the four binary problems are learnt, at the decoding step a new test sample X is tested by the n classifiers. Then, the new codeword x = {x1 , .., xn } is compared with the class codewords {C1 , ..C4 }, classifying the new sample by the class ci which codeword minimizes the decoding measure. Figure 1: ECOC design example. 2. Library Algorithms The ECOCs library is a Matlab/Octave code under the open source GPL license (gpl) with the implementation of the state-of-the-art coding and decoding ECOC designs. A main function defines the multi-class data, coding, decoding, and base classifier. A list of parameters are also included in order to tune the different strategies. 
In addition to the implemented coding and decoding designs, which are described in the following section, the user can include his own coding, decoding, and base classifier as defined in the user guide. 2.1 Implemented Coding Designs The ECOC designs of the ECOC library cover the state-of-the-art of coding strategies, mainly divided in two main groups: problem-independent approaches, which do not take into account the distribution of the data to define the coding matrix, and the problem-dependent designs, where information of the particular domain is used to guide the coding design. 2.1.1 P ROBLEM - INDEPENDENT ECOC D ESIGNS • One-versus-all (Rifkin and Klautau, 2004): Nc dichotomizers are learnt for Nc classes, where each one splits one class from the rest of classes. • One-versus-one (Nilsson, 1965): n = Nc (Nc − 1)/2 dichotomizers are learnt for Nc classes, splitting each possible pair of classes. • Dense Random (Allwein et al., 2002): n = 10 · log Nc dichotomizers are suggested to be learnt for Nc classes, where P(−1) = 1 − P(+1), being P(−1) and P(+1) the probability of the symbols -1 and +1 to appear, respectively. Then, from a set of defined random matrices, the one which maximizes a decoding measure among all possible rows of Mc is selected. • Sparse Random (Escalera et al., 2009): n = 15 · log Nc dichotomizers are suggested to be learnt for Nc classes, where P(0) = 1 − P(−1) − P(+1), defining a set of random matrices Mc and selecting the one which maximizes a decoding measure among all possible rows of Mc . 662 E RROR -C ORRECTING O UPUT C ODES L IBRARY 2.1.2 P ROBLEM - DEPENDENT ECOC D ESIGNS • DECOC (Pujol et al., 2006): problem-dependent design that uses n = Nc − 1 dichotomizers. The partitions of the problem are learnt by means of a binary tree structure using exhaustive search or a SFFS criterion. Finally, each internal node of the tree is embedded as a column in Mc . • Forest-ECOC (Escalera et al., 2007): problem-dependent design that uses n = (Nc − 1) · T dichotomizers, where T stands for the number of binary tree structures to be embedded. This approach extends the variability of the classifiers of the DECOC design by including extra dichotomizers. • ECOC-ONE (Pujol et al., 2008): problem-dependent design that uses n = 2 · Nc suggested dichotomizers. A validation sub-set is used to extend any initial matrix Mc and to increase its generalization by including new dichotomizers that focus on difficult to split classes. 2.2 Implemented Decoding Designs The software comes with a complete set of ECOC decoding strategies. The notation used refers to that used in (Escalera et al., 2008): j • Hamming decoding: HD(x, yi ) = ∑n (1 − sign(x j · yi ))/2, being x a test codeword and yi a j=1 codeword from Mc corresponding to class Ci . • Inverse Hamming decoding: IHD(x, yi ) = max(∆−1 DT ), where ∆(i1 , i2 ) = HD(yi1 , yi2 ), and D is the vector of Hamming decoding values of the test codeword x for each of the base codewords yi . • Euclidean decoding: ED(x, yi ) = j ∑n (x j − yi )2 . j=1 • Attenuated Euclidean decoding: AED(x, yi ) = j j ∑n | yi || x j | (x j − yi )2 . j=1 • Loss-based decoding: LB(ρ, yi ) = ∑n L(yi · f j (ρ)), where ρ is a test sample, L is a lossj=1 function, and f is a real-valued function f : R n → R . 
j • Probabilistic-based decoding: PD(yi , x)=−log ∏ j∈[1,..,n]:Mc (i, j)=0 P(x j = Mc (i, j)| f j ) + K , where K is a constant factor that collects the probability mass dispersed on the invalid codes, and the probability P(x j = Mc (i, j)| f j ) j 1 , where vectors υ and ω are obtained by is estimated by means of P(x j = yi | f j ) = j j j j 1+eyi (υ f +ω ) solving an optimization problem (Passerini et al., 2004). αi +1 • Laplacian decoding: LAP(x, yi ) = αi +βi +K , where αi is the number of matched positions between x and yi , βi is the number of miss-matches without considering the positions coded by 0, and K is an integer value that codifies the number of classes considered by the classifier. νi 1 • Pessimistic β-Density Distribution decoding: accuracy si : νi −si ψi (ν, αi , βi )dν = 3 , where 1 ψi (ν, αi , βi ) = K ναi (1 − ν)βi , ψi is the β-Density Distribution between a codeword x and a class codeword yi for class ci , and ν ∈ R : [0, 1]. R j • Loss-Weighted decoding: LW (ρ, i) = ∑n MW (i, j)L(yi · f (ρ, j)), where MW (i, j) = ∑nH(i, j) j) , j=1 H(i, j=1 j 1, if x j = yi , 1 H(i, j) = mi ∑mi ϕ(h j (ρi ), i, j), ϕ(x j , i, j) = , mi is the number of training k k=1 0, otherwise. samples from class Ci , and ρi is the kth sample from class Ci . k 663 E SCALERA , P UJOL AND R ADEVA 3. Implementation Details The ECOCs Library comes with detailed documentation. A user guide describes the usage of the software. All the strategies and parameters used in the functions and files are described in detail. The user guide also presents examples of variable setting and execution, including a demo file. About the computational complexity, the training and testing time depends on the data size, coding and decoding algorithms, as well as the base classifier used in the ECOC design. Acknowledgments This work has been supported in part by projects TIN2009-14404-C02 and CONSOLIDER-INGENIO CSD 2007-00018. References URL http://www.gnu.org/licences/. E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2002. T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–282, 1995. S. Escalera, Oriol Pujol, and Petia Radeva. Boosted landmarks of contextual descriptors and ForestECOC: A novel framework to detect and classify objects in clutter scenes. Pattern Recognition Letters, 28(13):1759–1768, 2007. S. Escalera, O. Pujol, and P. Radeva. On the decoding process in ternary error-correcting output codes. IEEE Transactions in Pattern Analysis and Machine Intelligence, 99, 2008. S. Escalera, O. Pujol, and P. Radeva. Separability of ternary codes for sparse designs of errorcorrecting output codes. Pattern Recognition Letters, 30:285–297, 2009. E. B. Kong and T. G. Dietterich. Error-correcting output coding corrects bias and variance. International Conference of Machine Learning, pages 313–321, 1995. N. J. Nilsson. Learning Machines. McGraw-Hill, 1965. A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks, 15(1):45–54, 2004. O. Pujol, P. Radeva, , and J. Vitri` . Discriminant ECOC: A heuristic method for application depena dent design of error correcting output codes. IEEE Transactions in Pattern Analysis and Machine Intelligence, 28:1001–1007, 2006. O. Pujol, S. Escalera, and P. Radeva. 
An incremental node embedding technique for error-correcting output codes. Pattern Recognition, 4:713–725, 2008. R. Rifkin and A. Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101–141, 2004. 664
same-paper 2 0.67446595 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
3 0.45509264 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
4 0.45196643 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
5 0.45143092 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin
Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing
6 0.45082515 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
7 0.44980362 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
8 0.44936818 63 jmlr-2010-Learning Instance-Specific Predictive Models
9 0.44864145 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.44827732 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
11 0.44818562 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
12 0.44662067 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
13 0.44660604 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
14 0.44652164 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
15 0.44631904 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
16 0.4449288 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
17 0.44469658 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
18 0.44387865 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.44307736 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
20 0.44287968 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory