jmlr jmlr2009 jmlr2009-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
Reference: text
sentIndex sentText sentNum sentScore
1 …39, 72076 Tübingen, Germany. Editor: Michele Sebag. Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. [sent-8, score-0.423]
2 Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. [sent-14, score-0.447]
3 Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. [sent-15, score-0.345]
4 BMRM is based on iterative approximation of the risk term by cutting planes. [sent-55, score-0.268]
5 It solves a reduced problem obtained by substituting the cutting plane approximation of the risk into the original problem (1). [sent-56, score-0.43]
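For orientation, problem (1) itself is not reproduced in this extract. From the reduced problem and Equation (13) below, its regularized-risk form can be reconstructed as follows (a reconstruction for the binary hinge-loss case, not a verbatim quote of the paper):

```latex
% Master problem (1), reconstructed: regularized empirical risk minimization.
% R(w) is the empirical risk; for the binary SVM it is the averaged hinge loss.
\begin{equation*}
  w^{*} = \operatorname*{argmin}_{w \in \mathbb{R}^{n}} F(w), \qquad
  F(w) := \tfrac{1}{2}\lVert w\rVert^{2} + C\,R(w), \qquad
  R(w) = \frac{1}{m}\sum_{i=1}^{m}\max\bigl\{0,\,1 - y_i\langle w, x_i\rangle\bigr\}.
\end{equation*}
```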
6 However, we will show that the implementation of the cutting plane algorithm (CPA) used in BMRM can be drastically sped up, making it efficient even for large-scale binary and multi-class SVM classification problems. [sent-59, score-0.351]
7 Building on the work of Teo et al. (2007), we propose an optimized cutting plane algorithm (OCA) that extends the standard CPA in two ways. [sent-62, score-0.374]
8 Second, we introduce a new cutting plane selection strategy that reduces the number of cutting planes required to achieve the prescribed precision and thus significantly speeds up convergence. [sent-64, score-0.666]
9 In the first application, we attack a splice site detection problem from bioinformatics. [sent-79, score-0.221]
10 The standard cutting plane algorithm (CPA) to solve (1) is reviewed in Section 2. [sent-88, score-0.351]
11 In Section 3, we point out a source of inefficiency of CPA and propose a new optimized cutting plane algorithm (OCA) to drastically reduce training times. [sent-89, score-0.408]
12 In the first of these applications, we apply OCAS to a human acceptor splice site recognition problem, achieving state-of-the-art results by training on all available sequences—a data set of 50 million examples (itself about 7 GB in size) using a 12-million-dimensional feature space. [sent-96, score-0.425]
13 In the second, we apply OCAM to learn a behavioral malware classifier, achieving a speedup of a factor of 20 over the previous approach and of a factor of 10 over the state-of-the-art implementation of CPA. [sent-98, score-0.21]
14 Following Teo et al. (2007), one may define a reduced problem of (1) which reads $w_t = \operatorname{argmin}_{w} F_t(w) := \frac{1}{2}\|w\|^2 + C R_t(w)$. [sent-103, score-0.265]
15 In CPA terminology, $\langle a', w\rangle + b' = 0$ is called a cutting plane. [sent-112, score-0.219]
16 To approximate the risk R better than by using a single cutting plane, we can compute a collection of cutting planes $\{\langle a_i, w\rangle + b_i = 0 \mid i = 1, \dots, t\}$. [sent-113, score-0.604]
17 …$w_t\}$ and take their point-wise maximum $R_t(w) = \max\bigl\{0,\; \max_{i=1,\dots,t}\,(\langle a_i, w\rangle + b_i)\bigr\}$. [sent-119, score-0.238]
18 The zero cutting plane is added to the maximization as the risk R is assumed to be non-negative. [sent-123, score-0.4]
19 The subscript t denotes the number of cutting planes used in the approximation Rt . [sent-124, score-0.291]
20 …$w_t\}$ and that $R_t$ lower bounds R, that is, that $R(w) \ge R_t(w)$, $\forall w \in \mathbb{R}^n$, holds. [sent-128, score-0.193]
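A small sketch makes the cutting plane model concrete. The helper names below are hypothetical, and the binary hinge-loss risk is assumed; the point is only that each plane is tangent to R at the point where it was cut, so the point-wise maximum R_t never exceeds R:

```python
import numpy as np

def hinge_risk(w, X, y):
    # R(w) = (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def risk_subgradient(w, X, y):
    # one subgradient a in dR(w): average of -y_i * x_i over margin violators
    viol = (y * (X @ w)) < 1.0
    return -(X[viol] * y[viol, None]).sum(axis=0) / len(y)

def cutting_plane_model(w, planes):
    # R_t(w) = max(0, max_i <a_i, w> + b_i), with planes = [(a_1, b_1), ...]
    vals = [a @ w + b for a, b in planes]
    return max(0.0, max(vals)) if vals else 0.0

# A plane cut at w0 is a = risk_subgradient(w0, X, y),
# b = hinge_risk(w0, X, y) - a @ w0; by convexity of R,
# cutting_plane_model(w, planes) <= hinge_risk(w, X, y) for every w.
```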
21 The number of constraints in (5) equals the number of cutting planes t which can be drastically lower than the number of constraints in the master problem (1) when expressed as a constrained QP task. [sent-135, score-0.342]
22 Having solved (6), the primal solution can be computed as $w_t = -\sum_{i=1}^{t} a_i [\alpha_t]_i$ and $\xi_t = \max_{i=1,\dots,t}\bigl(\langle w_t, a_i\rangle + b_i\bigr)$. [sent-142, score-0.431]
23 Solving the reduced problem is beneficial if we can effectively select a small number of cutting planes such that the solution of the reduced problem is sufficiently close to that of the master problem. [sent-146, score-0.402]
24 CPA selects the cutting planes using a simple strategy described by Algorithm 1. [sent-147, score-0.291]
25 2: repeat 3: Compute $w_t$ by solving the reduced problem (5). [sent-149, score-0.223]
26 4: Add a new cutting plane to approximate the risk R at the current solution $w_t$, that is, compute $a_{t+1} \in \partial R(w_t)$ and $b_{t+1} := R(w_t) - \langle a_{t+1}, w_t\rangle$. [sent-150, score-0.786]
27 To use it for a particular problem one only needs to supply a formula to compute the cutting plane as required in Step 4, that is, formulas for computing the subgradient $a \in \partial R(w)$ and the objective value R(w) at a given point w. [sent-153, score-0.427]
28 Because $F_t(w_t)$ is a lower bound on the optimal value $F(w^*)$, it follows that a solution $w_t$ satisfying (7) also guarantees $F(w_t) - F(w^*) \le \varepsilon$, that is, the objective value differs from the optimal one by at most $\varepsilon$. [sent-155, score-0.243]
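The following is a compact sketch of the CPA loop for the binary SVM objective. It reuses the helpers from the sketch above and solves the dual of the reduced problem with a generic SLSQP call rather than a specialized QP solver, so it illustrates the structure of Algorithm 1 rather than the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def cpa(X, y, C, eps=1e-3, max_iter=200):
    """Cutting plane algorithm for F(w) = 0.5*||w||^2 + C*R(w) with hinge risk R."""
    n = X.shape[1]
    w = np.zeros(n)
    A, b = [], []                                  # cutting planes <a_i, w> + b_i
    for _ in range(max_iter):
        # cut a new plane at the current solution (Step 4)
        a_new = risk_subgradient(w, X, y)
        A.append(a_new)
        b.append(hinge_risk(w, X, y) - a_new @ w)

        # reduced problem via its dual:
        #   max_{alpha >= 0, sum(alpha) <= C}  -0.5*||A^T alpha||^2 + b^T alpha
        A_mat, b_vec, t = np.array(A), np.array(b), len(A)
        neg_dual = lambda al: 0.5 * np.dot(A_mat.T @ al, A_mat.T @ al) - b_vec @ al
        res = minimize(neg_dual, np.full(t, C / t), method="SLSQP",
                       bounds=[(0.0, C)] * t,
                       constraints=[{"type": "ineq", "fun": lambda al: C - al.sum()}])
        w = -A_mat.T @ res.x                       # primal solution from the dual alphas

        # stopping condition: gap between master and reduced objectives is small
        F_val  = 0.5 * w @ w + C * hinge_risk(w, X, y)
        Ft_val = 0.5 * w @ w + C * cutting_plane_model(w, list(zip(A, b)))
        if F_val - Ft_val <= eps:
            break
    return w
```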
29 (Teo et al., 2007) Assume that $\|\partial R(w)\| \le G$ for all $w \in \mathcal{W}$, where $\mathcal{W}$ is some domain of interest containing all $w_{t'}$ for $t' \le t$. [sent-161, score-0.193]
30 CPA selects a new cutting plane such that the reduced problem objective function $F_t(w_t)$ monotonically increases w.r.t. the number of iterations. [sent-165, score-0.431]
31 The reason for these fluctuations is that at each iteration t, CPA selects the cutting plane that perfectly approximates the master objective F at the current solution $w_t$. [sent-173, score-0.645]
32 However, there is no guarantee that such a cutting plane will be an active constraint in the vicinity of the optimum $w^*$, nor must the new solution $w_{t+1}$ of the reduced problem improve the master objective. [sent-174, score-0.432]
33 As a result, many of the selected cutting planes do not contribute to the approximation of the master objective around the optimum, which, in turn, increases the number of iterations. [sent-176, score-0.392]
34 To speed up the convergence of CPA, we propose a new method which we call the optimized cutting plane algorithm (OCA). [sent-177, score-0.396]
35 In addition, OCA tries to select cutting planes that have a higher chance of actively contributing to the approximation of the master objective function F around the optimum $w^*$. [sent-179, score-0.392]
36 Change 2: The new best-so-far solution $w_t^b$ is found by searching along a line starting at the previous solution $w_{t-1}^b$ and crossing the reduced problem's solution $w_t$, that is, $w_t^b = w_{t-1}^b(1-k^*) + w_t k^*$ with $k^* = \operatorname{argmin}_{k \ge 0} F\bigl(w_{t-1}^b(1-k) + w_t k\bigr)$ (8). [sent-185, score-0.74]
37 Change 3: The new cutting plane is computed to approximate the master objective F at a point $w_t^c$ which lies in the vicinity of the best-so-far solution $w_t^b$. [sent-186, score-0.798]
38 In particular, the point $w_t^c$ is computed as $w_t^c = w_t^b(1-\mu) + w_t\mu$ (9), where $\mu \in (0, 1]$ is a prescribed parameter. [sent-187, score-0.723]
39 Having the point $w_t^c$, the new cutting plane is given by $a_{t+1} \in \partial R(w_t^c)$ and $b_{t+1} = R(w_t^c) - \langle a_{t+1}, w_t^c\rangle$. [sent-188, score-0.719]
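Layered on the CPA sketch above, the per-iteration body of OCA differs only in where the objective is monitored and where the plane is cut. The fragment below is a simplification with hypothetical helper names: the authors solve the line-search exactly, whereas here a bounded scalar minimizer with an arbitrary upper bound stands in for it:

```python
from scipy.optimize import minimize_scalar

def master_objective(w, X, y, C):
    return 0.5 * w @ w + C * hinge_risk(w, X, y)

def oca_update(w_t, w_b_prev, X, y, C, mu=0.1, k_max=100.0):
    """One OCA iteration body, run after the reduced problem returned w_t."""
    # Change 2: line-search over k >= 0 along w_b_prev*(1-k) + w_t*k (Eq. (8));
    # k_max truncates the half-line only for this sketch.
    obj = lambda k: master_objective(w_b_prev * (1.0 - k) + w_t * k, X, y, C)
    k_star = minimize_scalar(obj, bounds=(0.0, k_max), method="bounded").x
    w_b = w_b_prev * (1.0 - k_star) + w_t * k_star

    # Change 3: cut the new plane at w_c = w_b*(1-mu) + w_t*mu (Eq. (9))
    w_c = w_b * (1.0 - mu) + w_t * mu
    a_new = risk_subgradient(w_c, X, y)
    b_new = hinge_risk(w_c, X, y) - a_new @ w_c

    # the stopping test then compares F(w_b) against F_t(w_t) instead of F(w_t)
    return w_b, (a_new, b_new)
```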
40 2: repeat 3: Compute $w_t$ by solving the reduced problem (5). [sent-202, score-0.223]
41 5: Add a new cutting plane: compute at+1 ∈ ∂R(wtc ) and bt+1 := R(wtc ) − at+1 , wtc where wtc is given by (9). [sent-204, score-0.587]
42 6: t := t + 1 7: until a stopping condition is satisfied. Theorem 2: Assume that $\|\partial R(w)\| \le G$ for all $w \in \mathcal{W}$, where $\mathcal{W}$ is some domain of interest containing all $w_{t'}$ for $t' \le t$. [sent-205, score-0.234]
43 To this end, we develop fast methods to solve the problem-dependent subtasks, the line-search step (step 4 in Algorithm 2) and the addition of a new cutting plane (step 5 in Algorithm 2). [sent-216, score-0.351]
44 We call the resulting algorithm the optimized cutting plane algorithm for SVM (OCAS). [sent-233, score-0.374]
45 Line-Search for Linear Binary SVM Classifiers: The line-search (8) requires minimization of a univariate convex function $F\bigl(w_{t-1}^b(1-k) + w_t k\bigr) = \frac{1}{2}\bigl\|w_{t-1}^b(1-k) + w_t k\bigr\|^2 + C R\bigl(w_{t-1}^b(1-k) + w_t k\bigr)$ (13), with R defined by (12). [sent-236, score-0.579]
46 We abbreviate $F\bigl(w_{t-1}^b(1-k) + w_t k\bigr)$ by $f(k)$, which is defined as $f(k) := f_0(k) + \sum_{i=1}^m f_i(k) = \frac{1}{2}k^2 A_0 + k B_0 + C_0 + \sum_{i=1}^m \max\{0, k B_i + C_i\}$, where $A_0 = \|w_{t-1}^b - w_t\|^2$, $B_0 = \langle w_{t-1}^b, w_t - w_{t-1}^b\rangle$, $C_0 = \frac{1}{2}\|w_{t-1}^b\|^2$, $B_i = \frac{C}{m} y_i \langle x_i, w_{t-1}^b - w_t\rangle$, $C_i = \frac{C}{m}\bigl(1 - y_i \langle x_i, w_{t-1}^b\rangle\bigr)$, $i = 1, \dots, m$. [sent-239, score-0.876]
47 Hence the line-search (8) involves solving $k^* = \operatorname{argmin}_{k \ge 0} f(k)$ and computing $w_t^b = w_{t-1}^b(1-k^*) + w_t k^*$. [sent-246, score-0.355]
48 The subdifferential of f reads $\partial f(k) = k A_0 + B_0 + \sum_{i=1}^m \partial f_i(k)$, where $\partial f_i(k) = 0$ if $k B_i + C_i < 0$, $\partial f_i(k) = B_i$ if $k B_i + C_i > 0$, and $\partial f_i(k) = [0, B_i]$ if $k B_i + C_i = 0$. [sent-248, score-0.205]
49 Note that $A_0 = \|w_{t-1}^b - w_t\|^2$ equals 0 only if the algorithm has converged to the optimum $w^*$, but in this case the line-search is not invoked. [sent-251, score-0.386]
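Since f is convex and piecewise quadratic, its minimizer over k ≥ 0 lies either at 0, at a breakpoint −C_i/B_i of one of the hinge terms, or at the unconstrained minimizer of one of the quadratic pieces. Enumerating these candidates gives an exact, if not the asymptotically fastest, line-search; the sketch below follows that observation and is not the authors' implementation:

```python
import numpy as np

def exact_line_search(w_b_prev, w_t, X, y, C):
    """Minimize f(k) = 0.5*k^2*A0 + k*B0 + C0 + sum_i max(0, k*Bi + Ci) over k >= 0."""
    m = len(y)
    d = w_b_prev - w_t
    A0 = d @ d                                    # > 0 unless the algorithm converged
    B0 = w_b_prev @ (w_t - w_b_prev)
    C0 = 0.5 * (w_b_prev @ w_b_prev)
    Bi = (C / m) * y * (X @ d)
    Ci = (C / m) * (1.0 - y * (X @ w_b_prev))

    def f(k):
        return 0.5 * k * k * A0 + k * B0 + C0 + np.maximum(0.0, k * Bi + Ci).sum()

    breaks = -Ci[Bi != 0] / Bi[Bi != 0]
    candidates = [0.0] + [k for k in breaks if k > 0]
    for k in [0.0] + sorted(k for k in breaks if k > 0):
        # hinge terms active just to the right of k
        active = (k * Bi + Ci > 0) | (np.isclose(k * Bi + Ci, 0.0) & (Bi > 0))
        stat = -(B0 + Bi[active].sum()) / A0      # stationary point of that piece
        candidates.append(max(0.0, stat))
    return min(candidates, key=f)
```

A sorting-and-sweeping implementation of the same idea, which walks the subdifferential across the ordered breakpoints, reduces the cost further; the plain enumeration above trades that efficiency for brevity.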
50 This involves computation of the dot products $\langle w_t, x_i\rangle$ for all $i = 1, \dots, m$. [sent-277, score-0.193]
51 Note that the remaining products with data required by OCAS, that is, $\langle w_t^b, x_i\rangle$ and $\langle w_t^c, x_i\rangle$, can be computed from $\langle w_t, x_i\rangle$ in time $O(m)$. [sent-282, score-0.539]
52 The formula for computing the subgradient $a \in \partial R(w)$ of the risk (15) reads $a = \frac{1}{m}\sum_{i=1}^m \bigl(\Psi(x_i, \hat{y}_i) - \Psi(x_i, y_i)\bigr)$, where $\hat{y}_i = \operatorname{argmax}_{y \in \mathcal{Y}} \bigl(\delta(y_i, y) + \langle \Psi(x_i, y) - \Psi(x_i, y_i), w\rangle\bigr)$. [sent-322, score-0.213]
53 We call the resulting method the optimized cutting plane algorithm for multi-class SVM (OCAM). [sent-323, score-0.259]
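For a linear multi-class model the joint feature map can be represented implicitly by a weight matrix with one row per class. Under that representation (an illustrative encoding, not necessarily the paper's exact Ψ), the risk (15) and the subgradient above take the following form:

```python
import numpy as np

def multiclass_risk_and_subgradient(W, X, y, delta=None):
    """Margin-rescaled multi-class risk and one subgradient w.r.t. W.

    W : (n_classes, n_features) weight matrix; row y plays the role of <w, Psi(x, y)>.
    y : integer labels in {0, ..., n_classes - 1}.
    delta : (n_classes, n_classes) label loss; the 0/1 loss is used if None.
    """
    m, n_classes = len(y), W.shape[0]
    if delta is None:
        delta = 1.0 - np.eye(n_classes)           # delta(y, y') = [y != y']
    scores = X @ W.T                              # <w, Psi(x_i, y)> for all y
    # loss-augmented decoding: y_hat_i = argmax_y delta(y_i, y) + score_y - score_{y_i}
    aug = delta[y] + scores - scores[np.arange(m), y][:, None]
    y_hat = aug.argmax(axis=1)
    risk = aug[np.arange(m), y_hat].mean()
    # subgradient a = (1/m) sum_i (Psi(x_i, y_hat_i) - Psi(x_i, y_i)), as a matrix
    G = np.zeros_like(W)
    np.add.at(G, y_hat, X / m)
    np.add.at(G, y, -X / m)
    return risk, G
```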
54 The goal of the line-search is to minimize the univariate function $F\bigl(w_{t-1}^b(1-k) + w_t k\bigr)$ defined by (13) with the risk R given by (15). [sent-334, score-0.242]
55 We can abbreviate $F\bigl(w_{t-1}^b(1-k) + w_t k\bigr)$ by $f(k)$, which reads $f(k) := f_0(k) + \sum_{i=1}^m f_i(k) = \frac{1}{2}k^2 A_0 + k B_0 + C_0 + \sum_{i=1}^m \max_{y \in \mathcal{Y}} \bigl(k B_i^y + C_i^y\bigr)$, where the constants $A_0$, $B_0$, $C_0$, $(B_i^y, C_i^y)$, $i = 1, \dots, m$, are defined analogously to the binary case. [sent-335, score-0.291]
56 Then the subdifferential of $f(k)$ reads $\partial f(k) = k A_0 + B_0 + \sum_{i=1}^m \partial f_i(k)$, where $\partial f_i(k) = \mathrm{co}\{B_i^y \mid y \in \hat{\mathcal{Y}}_i(k)\}$. [sent-341, score-0.205]
57 Experiments: In this section we perform an extensive empirical evaluation of the proposed optimized cutting plane algorithm (OCA) applied to linear binary SVM classification (OCAS) and multi-class SVM classification (OCAM). [sent-414, score-0.374]
58 We show that OCAS outcompetes previous solvers gaining speedups of several orders of magnitude over some of the methods and we also analyze the speedups gained by parallelizing the various core components of OCAS. [sent-421, score-0.22]
59 We augmented the Cov1, CCAT, and Astro data sets from Joachims (2006) with MNIST, an artificial dense data set, and two larger bioinformatics splice site recognition data sets for worm and human. [sent-486, score-0.287]
60 The value of $\mu$ determines the point $w_t^c = w_t^b(1-\mu) + w_t\mu$ at which the new cutting plane is selected. [sent-512, score-0.89]
61 Training Time and Objective for Optimal C: We trained all methods on all data sets except the human splice data set using the training data, measured training time (in seconds), and computed the unconstrained objective value F(w) (cf. [sent-558, score-0.33]
62 Also, on the splice data set we ran Pegasos for $10^{10}$ iterations, which took 13,000 seconds and achieved an objective value similar to that of SVMperf. [sent-650, score-0.205]
63 Figure 5: Objective value vs. training time for CPA, SGD, and OCAS on the Astro, CCAT, MNIST, and Artificial data sets (vertical axis: objective, (obj−min)/obj; horizontal axis: time in seconds, log scale).
67 Figure 6: This figure displays how the methods scale with data set size on the Worm splice data set (training time in seconds vs. data set size, log-log, for cpa, sgd, ocas, and a linear-scaling reference). [sent-757, score-1.116]
68 Effects of Parallelization: As OCAS training times are very low on the above data sets, we also apply OCAS to the 15 million human splice data set. [sent-762, score-0.262]
69 Table 5 (fragment; values 6, 117, 45, 410, 671): Speedups due to parallelizing OCAS on the 15 million human splice data set. [sent-774, score-0.255]
70 Both implementations use the Improved Mitchell-Demyanov-Malozemov algorithm (Franc, 2005) as the core QP solver, and they use exactly the same functions for the computation of cutting planes and classifier outputs. [sent-841, score-0.342]
71 In the first application, we apply OCAS to a human acceptor splice site recognition problem. [sent-928, score-0.309]
72 Human Acceptor Splice Site Recognition: To demonstrate the effectiveness of our proposed method, OCAS, we apply it to the problem of human acceptor splice site detection. [sent-932, score-0.309]
73 Most splice sites are so-called canonical splice sites that are characterized by the presence of the dimers GT and AG at the donor and acceptor sites, respectively. [sent-935, score-0.458]
74 Here we focus on detecting the so-called acceptor splice sites that employ the AG consensus and are found at the “left-hand side” boundary of exons. [sent-938, score-0.257]
75 The classification task for splice site sensors, therefore, consists in discriminating true splice sites from decoy sites that also exhibit the consensus dimers. [sent-941, score-0.468]
76 Assuming a uniform distribution of the four bases, adenine (A), cytosine (C), guanine (G) and thymine (T), one would expect 1/16th of the dimers to contain the AG acceptor splice site consensus. [sent-942, score-0.277]
77 Many different methods to detect splice sites have been proposed. [sent-944, score-0.201]
78 They all predict splice sites based on the local context, that is, a short window around the AG dimer. [sent-945, score-0.201]
79 Currently, support vector machines are the most accurate splice site detectors (Degroeve et al. [sent-946, score-0.221]
80 38 aoPRC in a genome-wide study on human acceptor splice sites—also available as the DNA data set used in the large-scale learning challenge. [sent-959, score-0.243]
81 , 2005) feature spaces: Left and right of the splice site, spectrum kernels of order 3 up to order 6 were used (Leslie et al. [sent-963, score-0.243]
82 (2007b) using the features corresponding to two weighted spectrum kernels (one left and one right of the splice site, that is, positions 1-59 and 62-141) and a WD kernel (applied to the whole string). [sent-968, score-0.201]
83 Finally, it should be noted that storing the 138 cutting planes required almost 13 GB of memory. [sent-1001, score-0.291]
84 Table 9: Timing statistics for the human acceptor splice site experiment (Iterations: 138; Output computation: 34 hours; Line search: 222 s; Adding $a_t$: 7 hours; Solver: 5 min; Total: 41 hours). [sent-1002, score-0.375]
85 Our proposed optimized cutting plane algorithm (OCA) extends the standard CPA algorithm of Teo et al. [sent-1089, score-0.374]
86 (2007) by, first, optimizing directly the primal problem via a line-search and, second, developing a new cutting plane selection strategy which significantly reduces the number of cutting planes needed to achieve an accurate solution. [sent-1090, score-0.642]
87 Applying OCAS to a real-world splice site detection problem, we were able to train on a 12-million dimensional data set containing 50 million examples, achieving a new record performance for that task. [sent-1096, score-0.262]
88 In the standard CPA, it is trivial to show that the new added cutting plane violates these constraints by at least εt . [sent-1115, score-0.351]
89 Due to the sophisticated cutting plane selection strategy used in OCA, the violation of constraints is not obvious. [sent-1116, score-0.351]
90 While in the standard CPA the violation is guaranteed to be $\varepsilon_t$, the inequality (21) shows that the new cutting plane added in OCA violates the constraints by $\varepsilon_t + \frac{1}{2}\|w_t^c - w_t\|^2$. [sent-1121, score-0.728]
91 Theorem 3: Assume that $\|\partial R(w)\| \le G$ for all $w \in \mathcal{W}$, where $\mathcal{W}$ is some domain of interest containing all $w_{t'}$ for $t' \le t$, and that $F(w_t^b) - F_t(w_t) = \varepsilon_t > 0$. [sent-1126, score-0.193]
92 Then Algorithm 2 computes a new cutting plane $\langle w, a_{t+1}\rangle + b_{t+1} = 0$ which violates the constraints of the reduced problem (5) solved in the t-th iteration by at least $\varepsilon_t$, that is, it holds that $C\bigl(b_{t+1} + \langle w_t, a_{t+1}\rangle - \xi_t\bigr) \ge \varepsilon_t + \frac{1}{2}\|w_t^c - w_t\|^2 \ge \varepsilon_t$ (21). [sent-1129, score-0.767]
93 Proof: We use the subgradient $w_t^c + C a_{t+1} \in \partial F(w_t^c)$ to put a lower bound on the master objective F by means of a linear function at the point $w_t^c$, that is, $f(w) = F(w_t^c) + \langle w_t^c + C a_{t+1}, w - w_t^c\rangle \le F(w)$, $\forall w \in \mathbb{R}^n$ (22). [sent-1130, score-0.863]
94 In Step 4 of Algorithm 2, the new best-so-far solution, $w_t^b$, is computed as the minimizer of F over the line connecting the old best-so-far solution, $w_{t-1}^b$, and the solution of the reduced problem, $w_t$. [sent-1131, score-0.385]
95 In Step 5, the new cutting plane $\langle w, a_{t+1}\rangle + b_{t+1} = 0$ is taken at the point $w_t^c = w_t^b(1-\mu) + w_t\mu$, $\mu \in (0, 1]$. [sent-1132, score-0.89]
96 Since $w_t^c$ lies on the line segment connecting $w_t^b$ with $w_t$, and because $f(w_t^b) \le f(w_t^c)$, we conclude that $f(w_t^c) \le f(w_t)$ (23). [sent-1135, score-0.539]
97 Using (22) we can rewrite (23) as $F(w_t^c) \le F(w_t^c) + \langle w_t^c + C a_{t+1}, w_t - w_t^c\rangle$. [sent-1137, score-0.561]
98 Finally, substituting (24) into the latter inequality yields $F(w_t^c) + \langle w_t^c + C a_{t+1}, w_t - w_t^c\rangle - F_t(w_t) \ge \varepsilon_t$, which can be further rearranged into (21). [sent-1139, score-0.561]
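For completeness, the rearrangement alluded to can be written out. This is a reconstruction using the definitions above, namely F(w) = ½‖w‖² + C R(w), b_{t+1} = R(w_t^c) − ⟨a_{t+1}, w_t^c⟩ and F_t(w_t) = ½‖w_t‖² + Cξ_t, so the exact constants should be read against the paper's original Equation (21):

```latex
% Expand F(w_t^c) and F_t(w_t) in the inequality obtained at the end of the proof.
\begin{align*}
\varepsilon_t
 &\le F(w_t^c) + \langle w_t^c + C a_{t+1},\, w_t - w_t^c\rangle - F_t(w_t)\\
 &= C\bigl(b_{t+1} + \langle a_{t+1}, w_t\rangle - \xi_t\bigr)
    - \tfrac{1}{2}\lVert w_t^c - w_t\rVert^{2},
\end{align*}
% i.e. C(b_{t+1} + <a_{t+1}, w_t> - xi_t) >= eps_t + 0.5*||w_t^c - w_t||^2 >= eps_t,
% which is the form of (21) assumed in the cleaned statement above.
```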
99 SpliceMachine: predicting splice sites from high-dimensional local context representations. [sent-1246, score-0.201]
100 OCAS optimized cutting plane algorithm for support vector machines. [sent-1282, score-0.374]
wordName wordTfidf (topN-words)
[('ocas', 0.535), ('cpa', 0.283), ('ocam', 0.259), ('cutting', 0.219), ('wt', 0.193), ('oca', 0.189), ('wtc', 0.184), ('wtb', 0.162), ('splice', 0.155), ('sgd', 0.143), ('malware', 0.138), ('plane', 0.132), ('bmrm', 0.114), ('cpam', 0.114), ('onnenburg', 0.108), ('achines', 0.102), ('ptimized', 0.102), ('upport', 0.102), ('utting', 0.102), ('solvers', 0.101), ('ranc', 0.092), ('lane', 0.087), ('ector', 0.087), ('rieck', 0.084), ('svmmulti', 0.084), ('svmperf', 0.078), ('svm', 0.077), ('svmlight', 0.077), ('astro', 0.072), ('planes', 0.072), ('sonnenburg', 0.069), ('ft', 0.069), ('ccat', 0.066), ('worm', 0.066), ('site', 0.066), ('wd', 0.061), ('acceptor', 0.056), ('fi', 0.056), ('kiz', 0.054), ('obj', 0.054), ('solver', 0.051), ('master', 0.051), ('subdifferential', 0.051), ('teo', 0.05), ('objective', 0.05), ('pegasos', 0.049), ('risk', 0.049), ('franc', 0.048), ('liblinear', 0.047), ('speedups', 0.046), ('sites', 0.046), ('larank', 0.046), ('mnist', 0.045), ('bi', 0.045), ('argmink', 0.042), ('kbi', 0.042), ('reads', 0.042), ('stopping', 0.041), ('million', 0.041), ('dt', 0.04), ('speedup', 0.036), ('string', 0.035), ('training', 0.034), ('parallelization', 0.034), ('challenge', 0.032), ('human', 0.032), ('reduced', 0.03), ('approximative', 0.03), ('gpdt', 0.03), ('zeta', 0.029), ('tsch', 0.029), ('joachims', 0.029), ('parallelizing', 0.027), ('cpu', 0.027), ('subgradient', 0.026), ('cy', 0.026), ('bz', 0.025), ('unconstrained', 0.025), ('ki', 0.025), ('precision', 0.024), ('kernel', 0.024), ('runtime', 0.024), ('iiz', 0.024), ('shwartz', 0.024), ('iterations', 0.024), ('yi', 0.024), ('optimized', 0.023), ('alpha', 0.023), ('subtasks', 0.023), ('svms', 0.022), ('convergence', 0.022), ('hours', 0.022), ('ci', 0.022), ('spectrum', 0.022), ('classi', 0.021), ('gamma', 0.021), ('rt', 0.021), ('bt', 0.021), ('aoprc', 0.02), ('czech', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
2 0.20633705 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
3 0.13590513 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.073552854 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
5 0.071789421 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
6 0.066740796 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
7 0.058341708 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.052784152 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
10 0.048150994 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.045975886 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
12 0.043488927 13 jmlr-2009-Bounded Kernel-Based Online Learning
13 0.03725468 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
14 0.032857489 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
15 0.03127227 67 jmlr-2009-Online Learning with Sample Path Constraints
16 0.031177931 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
17 0.029609809 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
18 0.028907485 38 jmlr-2009-Hash Kernels for Structured Data
20 0.026003039 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
topicId topicWeight
[(0, 0.198), (1, 0.114), (2, 0.021), (3, -0.14), (4, 0.29), (5, 0.149), (6, 0.008), (7, -0.073), (8, 0.11), (9, 0.123), (10, -0.11), (11, -0.022), (12, -0.051), (13, -0.023), (14, -0.045), (15, 0.058), (16, 0.003), (17, 0.1), (18, 0.015), (19, 0.113), (20, -0.09), (21, 0.019), (22, -0.035), (23, 0.112), (24, -0.002), (25, 0.051), (26, -0.091), (27, 0.061), (28, 0.045), (29, 0.076), (30, -0.015), (31, -0.021), (32, -0.025), (33, 0.039), (34, -0.145), (35, 0.053), (36, -0.029), (37, -0.04), (38, 0.02), (39, -0.075), (40, 0.062), (41, 0.037), (42, -0.07), (43, 0.0), (44, -0.142), (45, 0.018), (46, 0.066), (47, 0.193), (48, -0.043), (49, -0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.91276926 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
2 0.70333302 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
3 0.64745665 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.43658409 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér
Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics
6 0.26817775 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
7 0.24188542 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
8 0.23481505 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
10 0.22194585 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
11 0.20492394 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
12 0.1983169 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
13 0.19343378 13 jmlr-2009-Bounded Kernel-Based Online Learning
14 0.19136283 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
15 0.19131966 23 jmlr-2009-Discriminative Learning Under Covariate Shift
16 0.19119869 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
17 0.18419479 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
18 0.18073766 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
19 0.16267945 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
20 0.16215695 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
topicId topicWeight
[(8, 0.016), (11, 0.026), (19, 0.028), (24, 0.022), (26, 0.013), (38, 0.031), (47, 0.014), (52, 0.04), (55, 0.031), (58, 0.527), (66, 0.075), (68, 0.019), (90, 0.042), (96, 0.027)]
simIndex simValue paperId paperTitle
Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér
Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics
2 0.80445594 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti
Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels
same-paper 3 0.80054182 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
4 0.3802717 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors (Special Topic on Causality)
Author: Changsung Kang, Jin Tian
Abstract: A linear causal model with correlated errors, represented by a DAG with bi-directed edges, can be tested by the set of conditional independence relations implied by the model. A global Markov property specifies, by the d-separation criterion, the set of all conditional independence relations holding in any model associated with a graph. A local Markov property specifies a much smaller set of conditional independence relations which will imply all other conditional independence relations which hold under the global Markov property. For DAGs with bi-directed edges associated with arbitrary probability distributions, a local Markov property is given in Richardson (2003) which may invoke an exponential number of conditional independencies. In this paper, we show that for a class of linear structural equation models with correlated errors, there is a local Markov property which will invoke only a linear number of conditional independence relations. For general linear models, we provide a local Markov property that often invokes far fewer conditional independencies than that in Richardson (2003). The results have applications in testing linear structural equation models with correlated errors. Keywords: Markov properties, linear causal models, linear structural equation models, graphical models
5 0.3703382 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
6 0.35379714 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
8 0.34131581 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
9 0.34019262 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
10 0.33382618 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
11 0.32890403 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
12 0.32796955 18 jmlr-2009-Consistency and Localizability
13 0.32619366 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
14 0.32121283 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
15 0.3194491 38 jmlr-2009-Hash Kernels for Structured Data
16 0.31901011 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
17 0.31316233 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
18 0.31312826 48 jmlr-2009-Learning Nondeterministic Classifiers
19 0.31048012 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models