jmlr jmlr2005 jmlr2005-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
Reference: text
sentIndex sentText sentNum sentScore
1 Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels. [sent-14, score-0.274]
2 Before introducing the new Online SVM, let us briefly describe other existing online kernel methods, beginning with the kernel Perceptron. [sent-93, score-0.229]
3 (2004) suggest an additional step for removing support vectors from the kernel expansion (2). [sent-119, score-0.184]
4 This modified algorithm outperforms all other online kernel classifiers on noisy data sets and matches the performance of Support Vector Machines with fewer support vectors. [sent-128, score-0.273]
5 g = (g1 , . . . , gn ) is the gradient of W (α), and gk = ∂W (α)/∂αk = yk − ∑i αi K(xi , xk ) = yk − ŷ(xk ) + b. (9) [sent-145, score-0.199]
6 Sequential Minimal Optimization: Platt (1999) observes that direction search computations are much faster when the search direction u mostly contains zero coefficients. [sent-146, score-0.188]
7 LASVM is an online kernel classifier sporting a support vector removal step: vectors collected in the current kernel expansion can be removed during the online process. [sent-170, score-0.477]
8 Direction searches of the REPROCESS kind involve two examples that already are support vectors in the current kernel expansion. [sent-183, score-0.22]
9 2) αk ← 0, gk ← yk − ∑s∈S αs Kks , S ← S ∪ {k}; 3) If yk = +1 then i ← k, j ← arg min s∈S gs with αs > As , else j ← k, i ← arg max s∈S gs with αs < Bs ; 4) Bail out if (i, j) is not a τ-violating pair. [sent-196, score-0.32]
10 LASVM REPROCESS: 1) i ← arg max s∈S gs with αs < Bs , j ← arg min s∈S gs with αs > As ; 2) Bail out if (i, j) is not a τ-violating pair. [sent-201, score-0.18]
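Both PROCESS and REPROCESS reduce to an SMO-style direction search on a τ-violating pair (i, j). The following Python sketch illustrates that single step; the array names (alpha, g, K, A, B) and the precomputed kernel matrix are illustrative assumptions, not the authors' implementation, which computes kernel values on demand.

```python
import numpy as np

def smo_direction_search(i, j, alpha, g, K, A, B, tau):
    """One SMO-style direction search along u = e_i - e_j, as used by both
    PROCESS and REPROCESS (illustrative sketch only)."""
    # (i, j) must be a tau-violating pair: g_i - g_j > tau with room to move.
    if g[i] - g[j] <= tau or alpha[i] >= B[i] or alpha[j] <= A[j]:
        return False
    # Unconstrained optimum of the dual along the direction, clipped to the box.
    curvature = K[i, i] + K[j, j] - 2.0 * K[i, j]
    lam = (g[i] - g[j]) / max(curvature, 1e-12)
    lam = min(lam, B[i] - alpha[i], alpha[j] - A[j])
    # Update the two coefficients and the gradients g_s = y_s - sum_t alpha_t K(x_t, x_s).
    alpha[i] += lam
    alpha[j] -= lam
    g -= lam * (K[i, :] - K[j, :])
    return True
```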
11 The online iterations process fresh training examples as they come. [sent-209, score-0.263]
12 A single epoch is consistent with the use of LASVM in the online setup. [sent-215, score-0.202]
13 Each epoch performs n online iterations by sequentially visiting the randomly shuffled training examples. [sent-239, score-0.28]
14 Performing P epochs of online iterations requires a number of operations proportional to n P S̄. [sent-242, score-0.21]
15 The average number S̄ of support vectors scales no more than linearly with n because each online iteration brings at most one new support vector. [sent-243, score-0.316]
16 Only PROCESS on a new example xkt accesses S fresh kernel values K(xkt , xi ) for i ∈ S . [sent-249, score-0.198]
17 The reordering overhead is acceptable during the online iterations because the computation of fresh kernel values takes much more time. [sent-253, score-0.25]
18 The set S of support vectors is split into an active set Sa and an inactive set Si . [sent-256, score-0.306]
19 The REPROCESS iterations are restricted to the active set Sa and do not perform any reordering. [sent-258, score-0.25]
20 About every 1000 iterations, support vectors that hit the boundaries of the box constraints are either removed from the set S of support vectors or moved from the active set Sa to the inactive set Si . [sent-259, score-0.407]
21 When all τ-violating pairs of the active set are exhausted, the inactive set examples are transferred back into the active set. [sent-260, score-0.445]
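A minimal sketch of this bookkeeping, assuming plain Python sets and that a coefficient exactly at zero can be discarded; it illustrates the idea rather than reproducing the paper's implementation.

```python
def shrink(active, inactive, alpha, A, B):
    """Park support vectors sitting at their box bounds in the inactive set
    so that REPROCESS only scans the active set (run every ~1000 iterations)."""
    for k in list(active):
        if alpha[k] <= A[k] or alpha[k] >= B[k]:    # hit a box constraint
            active.remove(k)
            if alpha[k] == 0.0:
                continue                            # zero coefficient: drop it
            inactive.add(k)                         # bounded SV: keep but stop scanning

def unshrink(active, inactive):
    """When the active set contains no tau-violating pair, transfer the
    inactive examples back before continuing."""
    active |= inactive
    inactive.clear()
```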
22 The LASVM×1 results were obtained by performing a single epoch of online iterations: each training example was processed exactly once during a single sequential sweep over the training set. [sent-281, score-0.268]
23 A single epoch of the Online LASVM algorithm gathers most of the support vectors of the SVM solution computed by LIBSVM. [sent-307, score-0.184]
24 • In the case of LIBSVM, the cache must accommodate about n R terms: the examples selected for the SMO iterations are usually chosen among the R free support vectors. [sent-333, score-0.213]
25 The examples selected by REPROCESS are usually chosen among the R free support vectors; for each selected example, REPROCESS needs one distinct dot-product per support vector in set S . [sent-336, score-0.185]
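For illustration, a row-based least-recently-used kernel cache along the lines discussed above; the class name, interface, and eviction policy are assumptions rather than the caching code used in the paper.

```python
from collections import OrderedDict
import numpy as np

class KernelCache:
    """Row-based LRU kernel cache sketch: stores K(x_i, .) restricted to the
    current index list S and evicts the least recently used row when the
    memory budget is exceeded."""
    def __init__(self, kernel, X, max_rows):
        self.kernel, self.X, self.max_rows = kernel, X, max_rows
        self.rows = OrderedDict()          # i -> dict {j: K(x_i, x_j)}

    def get(self, i, S):
        row = self.rows.pop(i, {})         # pop + re-insert marks row i as recent
        for j in S:
            if j not in row:               # compute only the fresh kernel values
                row[j] = self.kernel(self.X[i], self.X[j])
        self.rows[i] = row
        if len(self.rows) > self.max_rows:
            self.rows.popitem(last=False)  # evict the least recently used row
        return np.array([row[j] for j in S])
```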
26 Section 4 then proposes an active learning method to contain the growth of the number of support vectors, and recover the full benefits of the online approach. [sent-343, score-0.399]
27 The fast kernel computation times expose the relative weakness of our kernel cache implementation. [sent-354, score-0.194]
28 Therefore LASVM online iterations are able to very quickly discard a large number of examples with a high confidence. [sent-361, score-0.199]
29 The online iterations pretty much select the right support vectors on clean data sets such as “Waveform”, “Reuters” or “USPS”, and the finishing step does very little. [sent-365, score-0.265]
30 On the other problems the online iterations keep much more examples as potential support vectors. [sent-366, score-0.274]
31 Figure 8: Comparison of LIBSVM versus LASVM×1: Test error rates (Error), number of support vectors (SV), number of kernel calls (KCalc), and training time (Time). [sent-433, score-0.213]
32 5 The Collection of Potential Support Vectors. The final step of the REPROCESS operation computes the current value of the kernel expansion bias b and the stopping criterion δ: gmax = max s∈S gs with αs < Bs , gmin = min s∈S gs with αs > As , b = (gmax + gmin )/2 , δ = gmax − gmin . (10) [sent-437, score-0.294]
33 The definition of PROCESS and the equality (9) indicate indeed that PROCESS(k) adds the support vector xk to the kernel expansion if and only if yk ŷ(xk ) < 1 + δ/2 − τ. [sent-441, score-0.246]
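A small sketch of these two computations with plain Python containers and illustrative names; it simply restates the b, δ formulas and the PROCESS insertion test above in code form.

```python
def bias_and_gap(S, alpha, g, A, B):
    """Final REPROCESS bookkeeping: gmax over coefficients below their upper
    bound, gmin over coefficients above their lower bound, then b and delta."""
    g_max = max(g[s] for s in S if alpha[s] < B[s])
    g_min = min(g[s] for s in S if alpha[s] > A[s])
    return (g_max + g_min) / 2.0, g_max - g_min     # b, delta

def process_would_insert(y_k, y_hat_k, delta, tau):
    """PROCESS(k) adds x_k iff y_k * yhat(x_k) < 1 + delta/2 - tau,
    where y_hat_k is the current decision value including the bias."""
    return y_k * y_hat_k < 1.0 + delta / 2.0 - tau
```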
34 These measurements were performed after one epoch of online iterations without the finishing step, and after one and two epochs followed by the finishing step. [sent-452, score-0.293]
35 There clearly is a sweet spot around δmax = 3 when one epoch of online iterations alone almost matches the SVM performance and also makes the finishing step very fast. [sent-461, score-0.247]
36 Both LASVM and the SimpleSVM update a current kernel expansion by adding or removing one or two support vectors at each iteration. [sent-471, score-0.184]
37 Whereas each SimpleSVM iteration seeks the optimal solution of the SVM QP problem restricted to the current set of support vectors, the LASVM online iterations merely attempt to improve the value of the dual objective function W (α). [sent-473, score-0.26]
38 In both cases, the LASVM online iterations pick random training examples. [sent-483, score-0.222]
39 It is also justified in the online setup when the continuous stream of fresh training examples is too costly to process, either because the computational requirements are too high, or because it is impractical to label all the potential training examples. [sent-486, score-0.251]
40 Depending on the class yk , the PROCESS(k) operation considers pair (k, j) or (i, k) where i and j are the indices of the examples in S with extreme gradients: i = arg max gs with αs < Bs , j = arg min gs with αs > As . [sent-493, score-0.273]
41 Using the expression (9) of the gradients and the value of b and δ computed during the previous REPROCESS (10), we can write: when yk = +1, gk − gj = yk gk − (gi + gj )/2 + (gi − gj )/2 = 1 + δ/2 − yk ŷ(xk ); when yk = −1, gi − gk = (gi + gj )/2 + yk gk + (gi − gj )/2 = 1 + δ/2 − yk ŷ(xk ). [sent-495, score-0.528]
42 This expression shows that the Gradient Selection Criterion simply suggests to pick the most misclassified example kG = arg mink yk ŷ(xk ). [sent-496, score-0.211]
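As a sketch, the Gradient Selection Criterion then amounts to a one-line argmin over a candidate pool; the y_hat callable returning the current decision value for an index is an assumed helper.

```python
def gradient_selection(candidates, y, y_hat):
    """Pick the most misclassified candidate: k_G = argmin_k y_k * yhat(x_k)."""
    return min(candidates, key=lambda k: y[k] * y_hat(k))
```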
43 Early active learning literature, also known as Experiment Design (Fedorov, 1972), contrasts the passive learner, who observes examples (x, y), with the active learner, who constructs queries x and observes their labels y. [sent-507, score-0.445]
44 In this setup, the active learner cannot beat the passive learner because he lacks information about the input pattern distribution (Eisenberg and Rivest, 1990). [sent-508, score-0.205]
45 Pool-based active learning algorithms observe the pattern distribution from a vast pool of unlabelled examples. [sent-509, score-0.205]
46 , 2000; Schohn and Cohn, 2000; Tong and Koller, 2000) propose incremental active learning algorithms that clearly are related to Active Selection. [sent-512, score-0.205]
47 In the online setup, randomized search is the only practical way to select training examples. [sent-527, score-0.211]
48 Each online iteration of the above algorithm is about M times more computationally expensive than an online iteration of the basic LASVM algorithm. [sent-540, score-0.28]
49 Finally, the last two paragraphs of appendix A discuss the convergence of LASVM with example selection according to the gradient selection criterion or the active selection criterion. [sent-544, score-0.342]
50 On the other hand, the active selection criterion only does so when one uses the sampling method. [sent-546, score-0.241]
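A hedged sketch of the sampled active selection criterion discussed here: draw M random unlabeled candidates and send the one closest to the current decision boundary to PROCESS. The default M = 50 and the helper names are illustrative assumptions.

```python
import random

def active_selection(unlabeled, y_hat, M=50):
    """Sampled Active Selection Criterion: among M randomly drawn candidates,
    return the index whose current output |yhat(x_k)| is smallest."""
    pool = random.sample(list(unlabeled), min(M, len(unlabeled)))
    return min(pool, key=lambda k: abs(y_hat(k)))
```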
51 More specifically, we run a predefined number of online iterations, save the LASVM state, perform the finishing step, measure error rates and number of support vectors, and restore the saved LASVM state before proceeding with more online iterations. [sent-575, score-0.337]
52 Although we have not been able to observe it on this data set, we expect that, after a very long time, the ACTIVE curve for the noisy training set converges to the accuracy and the number of support vectors achieved by the LIBSVM solution obtained for the noisy training data. [sent-616, score-0.215]
53 5 Online SVMs for Active Learning The ACTIVE LASVM algorithm implements two dramatic speedups with respect to existing active learning algorithms such as (Campbell et al. [sent-618, score-0.205]
54 Figure 16: Comparing active learning methods on the USPS and Reuters data sets (curves: LASVM ACTIVE 50, LASVM ACTIVE ALL, RETRAIN ACTIVE 50, RETRAIN ACTIVE ALL, RANDOM; x-axis: number of labels). [sent-634, score-0.205]
55 All the active learning methods performed approximately the same, and were superior to random selection. [sent-638, score-0.205]
56 Kernel classifier algorithms usually maintain an active set of potential support vectors and work by iterations. [sent-666, score-0.306]
57 Their computing requirements are readily associated with the training examples that belong to the active set. [sent-667, score-0.273]
58 Adding a training example to the active set increases the computing time associated with each subsequent iteration because they will require additional kernel computations involving this new support vector. [sent-668, score-0.389]
59 Removing a training example from the active set reduces the cost of each subsequent iteration. [sent-669, score-0.238]
60 The final set of support vectors is intrinsically defined by the SVM QP problem, regardless of the path followed by the online learning process. [sent-673, score-0.22]
61 Intrinsic support vectors provide a benchmark to evaluate the impact of changes in the active set of current support vectors. [sent-674, score-0.381]
62 Augmenting the active set with an example that is not an intrinsic support vector moderately increases the cost of each iteration without clear benefits. [sent-675, score-0.301]
63 • Test error rates are sometimes improved by active example selection. [sent-699, score-0.229]
64 In fact this effect has already been observed in the active learning setups (Schohn and Cohn, 2000). [sent-700, score-0.205]
65 Empirical evidence suggests that active example selection yields transitory kernel classifiers that achieve low error rates with far fewer support vectors. [sent-707, score-0.395]
66 The present work suggests that this cost might be smaller than n times the reduced number of support vectors achievable with the active learning technique. [sent-712, score-0.306]
67 We have then shown how active example selection can yield faster training, higher accuracies and simpler models using only a fraction of the training example labels. [sent-731, score-0.333]
68 Results are presented for the case where directions are selected from a well chosen finite pool, like SMO (Platt, 1999), and for the stochastic algorithms, like the online and active SVM discussed in the body of this contribution. [sent-737, score-0.429]
69 Then it introduces the notion of witness family of directions which leads to a more compact characterization of the optimum. [sent-742, score-0.242]
70 Intuitively, a direction u ≠ 0 is feasible in x when we can start from x and make a little movement along direction u without leaving the convex set F . [sent-748, score-0.212]
71 Instead of considering all feasible directions in Rn , we wish to only consider the feasible directions from a smaller set U . [sent-771, score-0.274]
72 Definition 5 A set of directions U ⊂ Rn is a “witness family for F ” when, for any point x ∈ F , any feasible direction u ∈ Dx can be expressed as a positive linear combination of a finite number of feasible directions v j ∈ U ∩ Dx . [sent-785, score-0.336]
73 Theorem 6 Let U be a witness family for convex set F . [sent-787, score-0.193]
74 The following proposition provides an example of witness family for the convex domain Fs that appears in the SVM QP problem (5). [sent-806, score-0.228]
75 Set Us = {ei − ej , i ≠ j} is a witness family for the convex set Fs defined by the constraints: x ∈ Fs ⇔ ∀i, Ai ≤ xi ≤ Bi and ∑i xi = 0. [sent-811, score-0.193]
76 If there is a finite witness family for F , then F is a convex polytope. [sent-853, score-0.193]
77 Consider the set {v1 , . . . , vk } = U ∩ Dx of feasible witness directions in x. [sent-864, score-0.305]
78 4 Stochastic Witness Direction Search Each iteration of the following algorithm randomly chooses a feasible witness direction and performs an optimization along this direction. [sent-889, score-0.314]
79 The successive search directions ut are randomly selected (step 2a) according to some distribution Pt defined on U . [sent-890, score-0.201]
80 , 2a) Draw a direction ut ∈ U from a distribution Pt 2b) If u ∈ Dxt−1 and ut ∇ f (xt−1 ) > 0 , xt ← argmax f (x) under x ∈ {xt−1 + λut ∈ F , λ ≥ 0} otherwise xt ← xt−1 . [sent-896, score-0.532]
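To make the loop concrete, here is a hedged Python sketch of the Stochastic WDS iteration: feasible_step(x, u) is an assumed helper returning the largest feasible step length φ(x, u), and the one-dimensional maximization is done by ternary search under a concavity assumption on f.

```python
import random

def maximize_along(f, x, u, phi, iters=50):
    """Ternary search for the maximum of the 1-D slice lam -> f(x + lam*u)
    over [0, phi], assuming that slice is concave."""
    lo, hi = 0.0, phi
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(x + m1 * u) < f(x + m2 * u):
            lo = m1
        else:
            hi = m2
    return x + 0.5 * (lo + hi) * u

def stochastic_wds(f, grad_f, feasible_step, x, U, n_iter):
    """Stochastic Witness Direction Search sketch (illustrative, not the
    paper's pseudo-code verbatim): draw a witness direction u_t and, if it is
    a feasible ascent direction at the current point, maximize f along it."""
    for _ in range(n_iter):
        u = random.choice(list(U))                 # step 2a: draw u_t from P_t
        phi = feasible_step(x, u)                  # largest feasible step (assumed helper)
        if phi > 0 and float(grad_f(x) @ u) > 0:   # step 2b: feasible ascent direction
            x = maximize_along(f, x, u, phi)
        # otherwise keep x_t = x_{t-1}
    return x
```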
81 Assume U is a finite witness set for set F , and let the sequence xt be defined by the Stochastic WDS algorithm above. [sent-902, score-0.308]
82 The definition demands a finite witness family because this leads to proposition 13 establishing that κτ-approximate solutions indicate the location of actual solutions when κ and τ tend to zero. [sent-940, score-0.203]
83 Proposition 13 Let U be a finite witness family for bounded convex set F . [sent-941, score-0.193]
84 , 2a) Draw a direction ut ∈ U from a probability distribution Pt 2b) If ut is a κτ-violating direction, xt ← argmax f (x) under x ∈ {xt−1 + λut ∈ F , λ ≥ 0} otherwise xt ← xt−1 . [sent-955, score-0.532]
85 The successive search directions ut are drawn from some unspecified distributions Pt defined on U . [sent-956, score-0.201]
86 Proposition 14 If ut is a κτ-violating direction in xt−1 , φ(xt , ut ) ut ∇ f (xt ) = 0. [sent-959, score-0.347]
87 Proof Let the maximum f (xt ) = f ∗ (xt−1 , ut ) be attained in xt = xt−1 +λ∗ ut with 0 ≤ λ∗ ≤ φ(xt−1 , ut ). [sent-960, score-0.425]
88 We know that λ∗ ≠ 0 because ut is κτ-violating and proposition 2 implies f ∗ (xt−1 , ut ) > f (xt−1 ). [sent-961, score-0.225]
89 Otherwise xt is an unconstrained maximum and ut ∇ f (xt ) = 0. [sent-963, score-0.235]
90 Otherwise let the maximum f (xt ) = f ∗ (xt−1 , ut ) be attained in xt = xt−1 + λ∗ ut . [sent-966, score-0.33]
91 Since xt is a maximum, f (xt ) − f (xt−1 ) = f (xt−1 + λ∗ ut ) − f (xt−1 ) ≥ f (xt−1 + λut ) − f (xt−1 ). [sent-968, score-0.235]
92 A Taylor expansion with the Cauchy remainder gives | f (xt−1 + λut ) − f (xt−1 ) − λ ut ∇ f (xt−1 ) | ≤ (1/2) λ² ‖ut ‖² H, or, more specifically, f (xt−1 + λut ) − f (xt−1 ) − λ ut ∇ f (xt−1 ) ≥ −(1/2) λ² ‖ut ‖² H. [sent-970, score-0.218]
93 Recalling ut ∇ f (xt−1 ) > τ and λ‖ut ‖ = ν‖xt − xt−1 ‖, we obtain f (xt ) − f (xt−1 ) ≥ ‖xt − xt−1 ‖ (ντ/U − (1/2) ν² D H), where U = maxu∈U ‖u‖ and D is the diameter of the compact convex set F . [sent-972, score-0.495]
94 Therefore there is a strictly increasing sequence of positive indices kt such that ukt = u is κτ-violating in point xkt −1 . [sent-982, score-0.191]
95 Assume U is a finite witness set for set F , and let the sequence xt be defined by the Approximate Stochastic WDS algorithm above. [sent-995, score-0.308]
96 Assume U is a finite witness set for set F , and let the sequence xt be defined by the Approximate Stochastic WDS algorithm above. [sent-1007, score-0.308]
97 Proof Proposition 16 establishes that for each sequence of selected directions ut , there is a time t0 and a point x∗ ∈ F such that xt = x∗ for all t ≥ t0 . [sent-1011, score-0.309]
98 For any kt > T , we know that U contains κτ-violating directions in xkt −1 = x∗ . [sent-1025, score-0.24]
99 would make xkt different from xkt −1 = x∗ . [sent-1030, score-0.224]
100 Support vector machine active learning with applications to text classification. [sent-1375, score-0.205]
wordName wordTfidf (topN-words)
[('lasvm', 0.72), ('active', 0.205), ('reprocess', 0.192), ('witness', 0.168), ('xt', 0.14), ('eston', 0.124), ('ordes', 0.124), ('ottou', 0.124), ('rtekin', 0.124), ('online', 0.119), ('libsvm', 0.112), ('xkt', 0.112), ('qt', 0.105), ('nline', 0.104), ('lassifiers', 0.104), ('nishing', 0.102), ('ut', 0.095), ('svm', 0.09), ('epoch', 0.083), ('support', 0.075), ('directions', 0.074), ('smo', 0.074), ('gs', 0.068), ('ernel', 0.068), ('qp', 0.067), ('mnist', 0.065), ('feasible', 0.063), ('wds', 0.062), ('direction', 0.062), ('bottou', 0.06), ('yk', 0.058), ('cache', 0.058), ('retrain', 0.056), ('kernel', 0.055), ('kt', 0.054), ('pt', 0.053), ('assertion', 0.05), ('epochs', 0.046), ('earning', 0.046), ('iterations', 0.045), ('perceptron', 0.043), ('collobert', 0.042), ('gi', 0.038), ('autoactive', 0.037), ('bordes', 0.037), ('selection', 0.036), ('cx', 0.035), ('proposition', 0.035), ('examples', 0.035), ('training', 0.033), ('search', 0.032), ('usps', 0.032), ('simplesvm', 0.031), ('banana', 0.031), ('fresh', 0.031), ('bail', 0.031), ('gilbert', 0.031), ('polytope', 0.031), ('takahashi', 0.031), ('reuters', 0.031), ('stochastic', 0.031), ('xk', 0.03), ('searches', 0.029), ('gradient', 0.029), ('expansion', 0.028), ('forest', 0.028), ('kx', 0.028), ('ers', 0.027), ('dx', 0.027), ('randomized', 0.027), ('prede', 0.027), ('assertions', 0.026), ('vectors', 0.026), ('reaches', 0.026), ('fast', 0.026), ('convex', 0.025), ('adult', 0.025), ('bs', 0.025), ('gmin', 0.025), ('graf', 0.025), ('nishi', 0.025), ('ukt', 0.025), ('pick', 0.025), ('noisy', 0.024), ('accuracies', 0.024), ('gk', 0.024), ('rates', 0.024), ('coef', 0.024), ('competitive', 0.024), ('fs', 0.024), ('accumulation', 0.023), ('perceptrons', 0.023), ('schohn', 0.023), ('informative', 0.022), ('keerthi', 0.022), ('ten', 0.022), ('arg', 0.022), ('ine', 0.022), ('criteria', 0.022), ('iteration', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
2 0.12138735 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
3 0.094146386 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: Standard SVM training has O(m3 ) time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
4 0.077731565 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
Author: S. Sathiya Keerthi, Dennis DeCoste
Abstract: This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight , SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber’s loss function and the L1 loss function, and also for solving ordinal regression. Keywords: linear SVMs, classification, conjugate gradient
5 0.069575764 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a flexible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
6 0.062921017 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
7 0.059566308 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
8 0.059384733 5 jmlr-2005-A Generalization Error for Q-Learning
9 0.047343779 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
10 0.045993634 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
11 0.043765981 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
12 0.043760147 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
13 0.042014353 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
14 0.040202573 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
15 0.038655888 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
16 0.038316336 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
17 0.037676856 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
18 0.037141461 48 jmlr-2005-Learning the Kernel Function via Regularization
19 0.036725033 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
20 0.035966471 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
topicId topicWeight
[(0, 0.216), (1, 0.112), (2, 0.248), (3, -0.095), (4, -0.066), (5, -0.04), (6, 0.183), (7, 0.095), (8, -0.256), (9, -0.081), (10, 0.004), (11, -0.193), (12, -0.2), (13, 0.148), (14, -0.019), (15, 0.126), (16, 0.014), (17, -0.108), (18, -0.134), (19, -0.203), (20, -0.053), (21, 0.035), (22, 0.076), (23, -0.039), (24, 0.019), (25, 0.012), (26, 0.016), (27, 0.068), (28, -0.054), (29, 0.009), (30, 0.046), (31, -0.062), (32, -0.047), (33, -0.029), (34, 0.066), (35, 0.002), (36, -0.053), (37, -0.124), (38, 0.008), (39, 0.054), (40, -0.165), (41, 0.112), (42, 0.008), (43, 0.086), (44, -0.019), (45, 0.027), (46, -0.104), (47, 0.062), (48, -0.003), (49, 0.123)]
simIndex simValue paperId paperTitle
same-paper 1 0.93664014 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
2 0.58690578 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
3 0.5023343 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: Standard SVM training has O(m3 ) time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
4 0.39838931 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a flexible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
5 0.31950468 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
Author: S. Sathiya Keerthi, Dennis DeCoste
Abstract: This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight , SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber’s loss function and the L1 loss function, and also for solving ordinal regression. Keywords: linear SVMs, classification, conjugate gradient
6 0.27836728 5 jmlr-2005-A Generalization Error for Q-Learning
7 0.27466491 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
9 0.22348623 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
10 0.21478347 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
11 0.19799422 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
12 0.19471745 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
13 0.18930082 47 jmlr-2005-Learning from Examples as an Inverse Problem
14 0.18191507 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
15 0.17688198 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
16 0.17348666 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
17 0.17027871 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
18 0.16730523 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
19 0.16612753 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.16548441 48 jmlr-2005-Learning the Kernel Function via Regularization
topicId topicWeight
[(13, 0.05), (17, 0.016), (19, 0.04), (32, 0.261), (36, 0.032), (37, 0.042), (42, 0.017), (43, 0.027), (47, 0.057), (52, 0.111), (59, 0.021), (70, 0.034), (88, 0.088), (90, 0.029), (94, 0.049), (96, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.73986346 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
2 0.53763503 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: Standard SVM training has O(m3 ) time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
3 0.52613252 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.50868225 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.50283152 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
6 0.49954289 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
7 0.49810654 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
8 0.4957194 71 jmlr-2005-Variational Message Passing
9 0.49240634 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
10 0.4899531 39 jmlr-2005-Information Bottleneck for Gaussian Variables
11 0.48865831 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
12 0.48460707 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
13 0.48430422 3 jmlr-2005-A Classification Framework for Anomaly Detection
14 0.47980124 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.47672561 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.47628897 5 jmlr-2005-A Generalization Error for Q-Learning
17 0.47624135 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
18 0.47558269 44 jmlr-2005-Learning Module Networks
19 0.47512189 11 jmlr-2005-Algorithmic Stability and Meta-Learning
20 0.47337675 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial