jmlr jmlr2009 jmlr2009-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
Reference: text
sentIndex sentText sentNum sentScore
1 Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. [sent-8, score-0.175]
2 Keywords: support vector machine, stochastic gradient descent 1. [sent-11, score-0.15]
3 In various domains such as biology, networking, or information retrieval, fast classification methods able to scale to millions of training instances are needed. [sent-13, score-0.07]
4 Evaluations were carried out on the basis of various performance curves such as training time versus test error, data set size versus test error, and data set size versus training time. [sent-18, score-0.122]
5 Unfortunately, even in the Wild Track case, the evaluation criteria for the competition reward good scaling properties and short training durations more than they punish suboptimal test errors. [sent-21, score-0.107]
6 However, in the large scale setup, when the bottleneck is the computing time rather than the number of training examples, Bottou and Bousquet (2008) have shown that stochastic algorithms often yield the best generalization performance in spite of being the worst optimizers. [sent-33, score-0.125]
7 SGD algorithms were therefore a natural choice for the “Wild Track” of the competition which focuses on the relation between training time and test performance. [sent-34, score-0.085]
8 This quasi-Newton stochastic method is statistically efficient but is penalized in practice by the cost of storing and manipulating the metric tensor. [sent-43, score-0.089]
9 The limited storage version of this algorithm is a quasi-Newton stochastic method whose cost by iteration is a small multiple of the cost of a standard SGD iteration. [sent-46, score-0.16]
10 Stochastic gradient works well in models with nonlinear parametrization. [sent-64, score-0.064]
11 Analysis. This section describes our notation and summarizes theoretical results that are relevant to the design of a fast variant of stochastic gradient algorithms. [sent-70, score-0.119]
12 The linear SVM classifier is obtained by minimizing the primal cost function
P_n(w) = (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n ℓ(y_i w⊤x_i) = (1/n) ∑_{i=1}^n [ (λ/2) ‖w‖² + ℓ(y_i w⊤x_i) ] ,   (1)
where the hyper-parameter λ > 0 controls the strength of the regularization term. [sent-73, score-0.127]
13 Each iteration of the SGD algorithm consists of drawing a random training example (x_t, y_t) and computing a new value of the parameter w_t as
w_{t+1} = w_t − (1/(t + t_0)) B g_t(w_t)   where   g_t(w_t) = λ w_t + ℓ′(y_t w_t⊤x_t) y_t x_t ,   (2)
where the rescaling matrix B is positive definite. [sent-76, score-2.229]
14 • The traditional first-order SGD algorithm, with decreasing learning rate, is obtained by setting B = λ^{-1} I in the generic update (2):
w_{t+1} = w_t − (1/(λ(t + t_0))) g_t(w_t) .   (3) [sent-79, score-0.868]
15 • The second-order SGD algorithm is obtained by setting B to the inverse of the Hessian matrix H = P_n″(w*_n) computed at the optimum w*_n of the primal cost P_n(w):
w_{t+1} = w_t − (1/(t + t_0)) H^{-1} g_t(w_t) .   (4) [sent-80, score-0.93]
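For illustration, here is a minimal Python sketch of the first-order special case (3) with the hinge loss ℓ(z) = max(0, 1 − z). It is not the authors' implementation: the function names, the shuffling strategy, and the default arguments are assumptions made for this example.

```python
import numpy as np

def hinge_dloss(z):
    # Derivative of the hinge loss l(z) = max(0, 1 - z) with respect to z.
    return -1.0 if z < 1.0 else 0.0

def first_order_sgd(X, y, lam, t0, epochs=5):
    """First-order SGD (3): w <- w - g_t(w) / (lam * (t + t0)), with g_t from (2)."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):                         # sequential passes over a shuffled set
        for i in np.random.permutation(n):
            z = y[i] * np.dot(w, X[i])
            g = lam * w + hinge_dloss(z) * y[i] * X[i]   # g_t(w) = lam*w + l'(z) y x
            w -= g / (lam * (t + t0))                    # B = (1/lam) I
            t += 1
    return w
```

Replacing the scalar step 1/(λ(t + t_0)) in the line that updates w by (1/(t + t_0)) H^{-1} would give the second-order update (4).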
16 In practice, one simply performs sequential passes over the randomly shuffled training set. [sent-82, score-0.069]
17 2 What Matters Are the Constant Factors Bottou and Bousquet (2008) characterize the asymptotic learning properties of stochastic gradient algorithms in the large scale regime, that is, when the bottleneck is the computing time rather than the number of training examples. [sent-84, score-0.189]
18 The table reports the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. [sent-86, score-0.101]
19 Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). [sent-88, score-0.092]
20 Legend: n number of examples; d parameter dimension; c positive constant that appears in the generalization bounds; κ condition number of the Hessian matrix H; ν = tr GH−1 with G the Fisher matrix (see Theorem 1 for more details). [sent-89, score-0.182]
21 The excess test error E measures how much the test error is worse than the best possible error for this problem. [sent-91, score-0.079]
22 Both the first-order and the second-order SGD require a time inversely proportional to ε to reach the target test error. [sent-97, score-0.083]
23 To reach an excess error ε, the most favorable generalization bounds suggest that one needs a number of examples proportional to 1/ε. [sent-103, score-0.087]
24 In fact, Bottou and Bousquet (2008) obtain slightly worse scaling laws for typical non-stochastic gradient algorithms. [sent-105, score-0.086]
25 1740 SGD-QN Theorem 1 Let Eσ denote the expectation with respect to the random selection of the examples (xt , yt ) drawn independently from the training set at each iteration. [sent-109, score-0.137]
26 Let w*_n = arg min_w P_n(w) be an optimum of the primal cost. [sent-110, score-0.085]
27 Define the Hessian matrix H = ∂²P_n(w*_n)/∂w² and the Fisher matrix G = G_t = E_σ[ g_t(w*_n) g_t(w*_n)⊤ ]. [sent-111, score-0.064]
28 If the eigenvalues of HB are in the range λ_max ≥ λ_min > 1/2, and if the SGD algorithm (2) converges to w*_n, the following inequality holds:
tr(HBGB)/(2λ_max − 1) · t^{-1} + o(t^{-1})  ≤  E_σ[P_n(w_t) − P_n(w*_n)]  ≤  tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}) . [sent-112, score-0.236]
29 Convergence also holds under slightly stronger assumptions when the rescaling matrix B changes over time. [sent-119, score-0.224]
30 Under the assumptions of Theorem 1, we have E_σ[P_n(w_t) − P_n(w*_n)] = tr(GH^{-1}) t^{-1} + o(t^{-1}) = ν t^{-1} + o(t^{-1}) . [sent-127, score-0.118]
31 Under the assumptions of Theorem 1, we have E_σ[P_n(w_t) − P_n(w*_n)] ≤ λ^{-2} tr(H²GH^{-1}) t^{-1} + o(t^{-1}) ≤ κ²ν t^{-1} + o(t^{-1}) . [sent-129, score-0.118]
32 Given a sample of n independent examples (x_i, y_i), define w*_n = arg min_w P_n(w) and compute w_n by applying the second-order SGD update (4) to each of the n examples. [sent-131, score-0.085]
33 This result means that, asymptotically and on average, the parameter w_n obtained after one pass of second-order SGD is as close to the infinite training set solution w* as the true optimum of the primal, w*_n. [sent-133, score-0.157]
34 Therefore, when the training set is large enough, we can expect that a single pass of second-order SGD (n iterations of (4)) optimizes the primal accurately enough to replicate the test error of the actual SVM solution. [sent-134, score-0.182]
35 When we replace the full second-order rescaling matrix B = H−1 by a more computationally acceptable approximation, Theorem 1 indicates that we lose a constant factor k on the required number of iterations to reach that accuracy. [sent-135, score-0.309]
36 In other words, we can expect to replicate the SVM test error after k passes over the randomly reshuffled training set. [sent-136, score-0.094]
37 On the other hand, a well chosen approximation of the rescaling matrix can save a large constant factor on the computation of the generic SGD update (2). [sent-137, score-0.31]
38 The best training times are therefore obtained by carefully trading the quality of the approximation for sparse representations. [sent-138, score-0.065]
39 Table 2: The regularization term in the primal cost can be viewed as an additional training example with an arbitrarily chosen frequency and a specific loss function.
                      Frequency    Loss
  Special example:    n/skip       skip · (λ/2) ‖w‖²
  Examples 1 to n:    1            ℓ(y_i w⊤x_i) [sent-139, score-0.427]
40 There are of course many other ways to save constant factors: • Exploiting the sparsity of the patterns (see Section 3) can save a constant factor in the cost of each first-order iteration. [sent-142, score-0.137]
41 Scheduling Stochastic Updates to Exploit Sparsity. First-order SGD iterations can be made substantially faster when the patterns x_t are sparse. [sent-150, score-0.168]
42 The first-order SGD update has the form w_{t+1} = w_t − α_t w_t − β_t x_t ,  (5)  where α_t and β_t are scalar coefficients. [sent-151, score-1.27]
43 Subtracting βt xt from the parameter vector involves solely the nonzero coefficients of the pattern xt . [sent-152, score-0.267]
44 On the other hand, subtracting αt wt involves all d coefficients. [sent-153, score-0.554]
45 (2007) circumvent this problem by representing the parameter wt as the product st vt of a scalar and a vector. [sent-156, score-0.611]
46 The update (5) can then be computed as s_{t+1} = (1 − α_t) s_t and v_{t+1} = v_t − β_t x_t / s_{t+1}, in time proportional to the number of nonzero coefficients in x_t. [sent-157, score-0.262]
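A minimal sketch of this scalar-times-vector bookkeeping, assuming the coefficients α_t and β_t of update (5) have already been computed and that the pattern is given as a list of (index, value) pairs; the names are illustrative.

```python
def sparse_step(s, v, alpha_t, beta_t, x_sparse):
    """Apply w <- (1 - alpha_t) w - beta_t x while storing w as the product s * v."""
    s = (1.0 - alpha_t) * s            # handles the dense shrinkage -alpha_t * w
    for i, xi in x_sparse:             # touches only the nonzero coordinates of x
        v[i] -= beta_t * xi / s        # keeps s * v equal to the updated w
    return s, v
```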
47 A more general method consists of treating the regularization term in the primal cost (1) as an additional training example occurring with an arbitrarily chosen frequency with a specific loss function. [sent-159, score-0.163]
48 Consider examples with the frequencies and losses listed in Table 2 and write the average loss:
(1 / (n + n/skip)) [ (n/skip) · (skip λ/2) ‖w‖² + ∑_{i=1}^n ℓ(y_i w⊤x_i) ] = (skip / (1 + skip)) [ (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n ℓ(y_i w⊤x_i) ] . [sent-160, score-0.681]
49 Minimizing this loss is of course equivalent to minimizing the primal cost (1) with its regularization term. [sent-162, score-0.127]
50 The parameter skip regulates the relative frequencies of these updates. [sent-164, score-0.132]
51 The SVMSGD2 algorithm (Bottou, 2007) measures the average pattern sparsity and picks a frequency that ensures that the amortized cost of the regularization update is proportional to the number of nonzero coefficients. [sent-165, score-0.183]
52 Both algorithms handle the real examples at each iteration (line 3) but SVMSGD2 only performs a regularization update every skip iterations (line 6). [sent-167, score-0.262]
53 Assume s is the average proportion of nonzero coefficients in the patterns xi and set skip to c/s where c is a predefined constant (we use c = 16 in our experiments). [sent-168, score-0.176]
54 Each regularization update (line 6) requires d operations but occurs s/c times less often. [sent-170, score-0.068]
55 The average cost per iteration is therefore O(sd) instead of O(d). [sent-171, score-0.091]
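The skip mechanism described above can be sketched as follows; the constant c = 16 follows the text, while the function name, the hinge loss, and the data layout (lists of (index, value) pairs) are assumptions of this example, not the authors' SVMSGD2 code.

```python
import numpy as np

def sgd_with_skip(X_sparse, y, lam, t0, skip, d, epochs=5):
    """Sparse pattern updates plus one regularization update every `skip` iterations."""
    w = np.zeros(d)
    t, count = 0, skip
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            eta = 1.0 / (lam * (t + t0))
            z = y[i] * sum(w[j] * xj for j, xj in X_sparse[i])
            if z < 1.0:                              # hinge loss: touch nonzeros only
                for j, xj in X_sparse[i]:
                    w[j] += eta * y[i] * xj
            count -= 1
            if count <= 0:                           # amortized regularization update
                w *= max(0.0, 1.0 - skip * eta * lam)
                count = skip
            t += 1
    return w

# A plausible choice of skip, following the text: skip = c / s with c = 16,
# where s is the average proportion of nonzero coefficients in the patterns.
```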
56 This is no longer true in the case of large scale machine learning because we care about the test error instead of the training error. [sent-175, score-0.095]
57 Our implementation uses a full vector for representing the parameter w, but handles the patterns x using either a full vector representation or a sparse representation as an ordered list of index/value pairs. [sent-181, score-0.072]
58 For example, on a dense data set with 500 attributes, using sparse vectors increases the training time by 50%; on the sparse RCV1 data set (see Table 4), using a sparse vector to represent the parameter w increases the training time by more than 900%. [sent-184, score-0.159]
59 SGD-QN: A Careful Diagonal Quasi-Newton SGD. As explained in Section 2, designing an efficient quasi-Newton SGD algorithm involves a careful trade-off between the sparsity of the scaling matrix representation B and the quality of its approximation of the inverse Hessian H^{-1}. [sent-192, score-0.145]
60 In the case of a linear model, such preconditioning is similar to using a constant diagonal scaling matrix. [sent-198, score-0.068]
61 ∀i = 1 . . . d :   [W_{t+1}]_i = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t W_t⊤X_t) y_t [X_t]_i ) = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t w_t⊤x_t) y_t b_i [x_t]_i ) . [sent-204, score-0.202]
62 Multiplying by b_i shows how the original parameter vector w_t is affected: [sent-205, score-0.554]
63 ∀i = 1 . . . d :   [w_{t+1}]_i = [w_t]_i − η_t ( λ[w_t]_i + ℓ′(y_t w_t⊤x_t) y_t b_i² [x_t]_i ) . [sent-208, score-0.101]
64 We observe that rescaling the input is equivalent to multiplying the gradient by a fixed diagonal matrix B whose elements are the squares of the coefficients b_i. [sent-209, score-0.334]
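This equivalence can be checked numerically. The following script, which uses an arbitrary smooth loss and random data purely for illustration, compares one step taken in the rescaled parametrization (mapped back to the original coordinates) with one step using the fixed diagonal matrix with entries b_i².

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, eta = 5, 0.1, 0.05
b = rng.uniform(0.5, 2.0, size=d)          # per-feature rescaling coefficients b_i
x, y = rng.normal(size=d), 1.0
w = rng.normal(size=d)

def dloss(z):                               # derivative of the log loss log(1 + exp(-z))
    return -1.0 / (1.0 + np.exp(z))

# One step in the rescaled parametrization W = w / b, X = b * x, mapped back to w.
W, X = w / b, b * x
z = y * W.dot(X)                            # equals y * w.dot(x)
w_rescaled = b * (W - eta * (lam * W + dloss(z) * y * X))

# One step where the loss gradient is multiplied by the diagonal matrix diag(b_i^2).
w_diag = w - eta * (lam * w + dloss(z) * y * (b ** 2) * x)

assert np.allclose(w_rescaled, w_diag)      # both updates produce the same parameter
```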
65 Instead we could consider the current value of the Hessian H_{w_t} = P_n″(w_t) and compute the diagonal rescaling matrix B that makes B H_{w_t} closest to the identity. [sent-212, score-0.27]
66 Becker and Le Cun (1989) approximate the optimal diagonal rescaling matrix by inverting the diagonal coefficients of the Hessian. [sent-214, score-0.316]
67 Low Rank Rescaling Matrices. The popular LBFGS optimization algorithm (Nocedal, 1980) maintains a low-rank approximation of the inverse Hessian by storing the k most recent rank-one BFGS updates instead of the full inverse Hessian matrix. [sent-219, score-0.1]
68 When the successive full gradients P_n′(w_{t−1}) and P_n′(w_t) are available, standard rank-one updates can be used to directly estimate the inverse Hessian matrix H^{-1}. [sent-220, score-0.125]
69 Using this method with stochastic gradient is tricky because the full gradients P_n′(w_{t−1}) and P_n′(w_t) are not readily available. [sent-221, score-0.161]
70 Instead we only have access to the stochastic estimates gt−1 (wt−1 ) and gt (wt ) which are too noisy to compute good rescaling matrices. [sent-222, score-0.497]
71 This reduces the noise to an acceptable level at the expense of the computation of the additional gradient vector gt−1 (wt ). [sent-225, score-0.064]
72 The most expensive part however remains the multiplication of the gradient gt (wt ) by the low-rank estimate of the inverse Hessian. [sent-227, score-0.341]
73 3 SGD-QN The SGD-QN algorithm estimates a diagonal rescaling matrix using a technique inspired by oLBFGS. [sent-230, score-0.27]
74 For any pair of parameters w_{t−1} and w_t, a Taylor series of the gradient of the primal cost P_n provides the secant equation:
w_t − w_{t−1} ≈ H^{-1} ( P_n′(w_t) − P_n′(w_{t−1}) ) .   (6) [sent-231, score-1.271]
75 We would then like to replace the inverse Hessian matrix H^{-1} by a diagonal estimate B:
w_t − w_{t−1} ≈ B ( P_n′(w_t) − P_n′(w_{t−1}) ) . [sent-232, score-1.767]
76 ′ Since we are designing a stochastic algorithm, we do not have access to the full gradient Pn . [sent-233, score-0.141]
77 Following oLBFGS, we replace them by the local gradients gt−1 (wt ) and gt−1 (wt−1 ) and obtain wt − wt−1 ≈ B gt−1 (wt ) − gt−1 (wt−1 ) . [sent-234, score-0.574]
78 1745 B ORDES , B OTTOU AND G ALLINARI Since we chose to use a diagonal rescaling matrix B, we can write the term-by-term equality [wt − wt−1 ]i ≈ Bii [gt−1 (wt ) − gt−1 (wt−1 )]i , where the notation [v]i still represents the i-th coefficient of vector v. [sent-235, score-0.27]
79 d , and where the integer r is incremented whenever we update the matrix B. [sent-240, score-0.072]
80 Since the curvature of the primal cost (1) is always larger than λ, the ratio [gt−1 (wt ) − gt−1 (wt−1 )]i /[wt − wt−1 ]i is always larger than λ. [sent-242, score-0.099]
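A sketch of how the diagonal coefficients could be reestimated from these term-by-term ratios is given below. The clipping at 1/λ reflects the curvature bound just mentioned; the running-average weight 2/r and the default value used for 0/0 coordinates are assumptions of this sketch rather than the paper's exact formulas.

```python
import numpy as np

def reestimate_diagonal(B, r, w_prev, w_new, g_prev_at_prev, g_prev_at_new, lam):
    """Update B_ii from the ratios [w_t - w_{t-1}]_i / [g_{t-1}(w_t) - g_{t-1}(w_{t-1})]_i."""
    dw = w_new - w_prev
    dg = g_prev_at_new - g_prev_at_prev
    ratio = np.full_like(B, 1.0 / lam)       # default for 0/0 coordinates (zero inputs)
    nz = dg != 0.0
    ratio[nz] = np.clip(dw[nz] / dg[nz], 0.0, 1.0 / lam)   # curvature >= lam => ratio <= 1/lam
    r += 1                                   # r counts how many reestimations happened
    B += (2.0 / r) * (ratio - B)             # damped running average of the observed ratios
    return B, r
```

As described below, this reestimation would only be scheduled on the iterations flagged by updateB, so its O(d) cost is amortized in the same way as the regularization update.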
81 Performing the weight update (2) with a diagonal rescaling matrix B consists in performing term-by-term operations with a time complexity that is marginally greater than the complexity of the first-order SGD (3) update. [sent-246, score-0.31]
82 Fortunately this higher computational cost per iteration can be nearly avoided by scheduling the reestimation of the rescaling matrix with the same frequency as the regularization updates. [sent-248, score-0.402]
83 1 has shown that a diagonal rescaling matrix does little more than rescaling the input variables. [sent-250, score-0.462]
84 Since a fixed diagonal rescaling matrix already works quite well, there is little need to update its coefficients very often. [sent-251, score-0.31]
85 Whenever SVMSGD2 performs a regularization update, we set the flag updateB to schedule a reestimation of the rescaling coefficients during the next iteration. [sent-253, score-0.304]
86 Therefore the rescaling matrix reestimation schedule can be regulated with the same skip parameter as the regularization updates. [sent-255, score-0.468]
87 Because SGD-QN reestimates the rescaling matrix after a pattern update, special care must be taken when the ratio [wt − wt−1 ]i /[gt−1 (wt ) − gt−1 (wt−1 )]i has the form 0/0 because the corresponding input coefficient [xt−1 ]i is zero. [sent-257, score-0.224]
88 Since the secant Equation (6) is valid for any two values of the parameter vector, one can compute the ratios with parameter vectors wt−1 and wt + ε and derive the correct value by continuity when ε → 0. [sent-258, score-0.554]
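Working this limit out from the definitions above: for a coordinate i with [x_{t−1}]_i = 0, the loss term of g_{t−1} does not act on coordinate i, so [g_{t−1}(w) − g_{t−1}(w′)]_i = λ[w − w′]_i for any w, w′, and

$$\lim_{\varepsilon \to 0}\frac{[\,w_t + \varepsilon - w_{t-1}\,]_i}{[\,g_{t-1}(w_t+\varepsilon) - g_{t-1}(w_{t-1})\,]_i} = \lim_{\varepsilon \to 0}\frac{[\,w_t + \varepsilon - w_{t-1}\,]_i}{\lambda\,[\,w_t + \varepsilon - w_{t-1}\,]_i} = \frac{1}{\lambda},$$

so by continuity the ratio for such coordinates is simply λ^{-1}.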
89 Data set characteristics:
                       ALPHA      DELTA      RCV1
  Training examples    100,000    100,000    781,265
  Test examples        50,000     50,000     23,149
  Features             500        500        47,152
  s                    1          1          0. [sent-265, score-0.132]
90 Experiments We demonstrate the good scaling properties of SGD-QN in two ways: we present a detailed comparison with other stochastic gradient methods, and we summarize the results obtained on the PASCAL Large Scale Challenge. [sent-268, score-0.141]
91 Our implementation of oLBFGS maintains a rank 10 rescaling matrix. [sent-296, score-0.192]
92 2 Quasi-Newton Figure 3 shows how the primal cost Pn (w) of the Alpha data set evolves with the number of passes (left) and the training time (right). [sent-311, score-0.168]
93 5 2 Figure 3: Primal costs according to the number of epochs (left) and the training duration (right) on the Alpha data set. [sent-346, score-0.077]
94 3 Training Speed Figure 4 displays the test errors achieved on the Alpha, Delta and RCV1 data sets as a function of the number of passes (left) and the training time (right). [sent-349, score-0.094]
95 According to Theorem 4, given a large enough training set, a perfect second-order SGD algorithm would reach the batch test error after a single pass. [sent-353, score-0.099]
96 ) RCV1 DATA SET Figure 4: Test errors (in %) according to the number of epochs (left) and training duration (right). [sent-421, score-0.077]
97 Conclusion The SGD-QN algorithm strikes a good compromise for large scale application because it has low time and memory requirements per iteration and because it reaches competitive test errors after a small number of iterations. [sent-427, score-0.096]
98 Proof of Theorem 1. Define v_t = w_t − w*_n and observe that a second-order Taylor expansion of the primal gives
P_n(w_t) − P_n(w*_n) = v_t⊤H v_t + o(t^{-2}) = tr( H v_t v_t⊤ ) + o(t^{-2}) . [sent-434, score-0.794]
99 Expanding the update recursion expresses H v_t v_t⊤ in terms of H v_{t−1} v_{t−1}⊤, with correction terms of order (t + t_0)^{-1} and (t + t_0)^{-2}; taking the conditional expectation E_{t−1} and then the expectation E_σ of tr( H v_t v_t⊤ ) yields a recurrence in t. Let λ_max ≥ λ_min > 1/2 be the extreme eigenvalues of HB. [sent-438, score-0.354]
100 Then E_σ[ tr( H v_t v_t⊤ ) ] ≤ tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}), and
tr(HBGB)/(2λ_max − 1) · t^{-1} + o(t^{-1}) ≤ E_σ[P_n(w_t) − P_n(w*_n)] ≤ tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}) . [sent-441, score-0.354]
wordName wordTfidf (topN-words)
[('wt', 0.554), ('sgd', 0.459), ('gt', 0.25), ('olbfgs', 0.239), ('rescaling', 0.192), ('hvt', 0.191), ('pn', 0.148), ('skip', 0.132), ('xt', 0.122), ('tr', 0.118), ('bii', 0.109), ('allinari', 0.107), ('ordes', 0.107), ('ottou', 0.107), ('yt', 0.101), ('hessian', 0.095), ('bgb', 0.083), ('hbgb', 0.072), ('bottou', 0.071), ('liblinear', 0.07), ('primal', 0.065), ('gradient', 0.064), ('count', 0.061), ('schraudolph', 0.06), ('updateb', 0.06), ('vt', 0.057), ('stochastic', 0.055), ('bordes', 0.054), ('alpha', 0.052), ('pascal', 0.051), ('wild', 0.05), ('reestimation', 0.048), ('diagonal', 0.046), ('epochs', 0.041), ('cun', 0.041), ('update', 0.04), ('reach', 0.038), ('sparsity', 0.038), ('sonnenburg', 0.037), ('iteration', 0.037), ('antoine', 0.036), ('schedule', 0.036), ('training', 0.036), ('delta', 0.036), ('eapp', 0.036), ('lpha', 0.036), ('nec', 0.036), ('coef', 0.035), ('cost', 0.034), ('scale', 0.034), ('gh', 0.033), ('passes', 0.033), ('matrix', 0.032), ('descent', 0.031), ('pass', 0.031), ('svms', 0.031), ('scheduling', 0.031), ('secondorder', 0.03), ('sparse', 0.029), ('excess', 0.029), ('track', 0.028), ('regularization', 0.028), ('inverse', 0.027), ('bfgs', 0.026), ('careful', 0.026), ('test', 0.025), ('wn', 0.025), ('tricks', 0.025), ('iterations', 0.025), ('generic', 0.024), ('updates', 0.024), ('competition', 0.024), ('eest', 0.024), ('elta', 0.024), ('eopt', 0.024), ('sident', 0.024), ('online', 0.024), ('cients', 0.023), ('amari', 0.023), ('hsieh', 0.023), ('becker', 0.023), ('nonzero', 0.023), ('bousquet', 0.023), ('scaling', 0.022), ('full', 0.022), ('save', 0.022), ('patrick', 0.022), ('listed', 0.021), ('patterns', 0.021), ('france', 0.021), ('universit', 0.021), ('solvers', 0.021), ('murata', 0.02), ('minw', 0.02), ('table', 0.02), ('proportional', 0.02), ('paris', 0.02), ('gradients', 0.02), ('speedup', 0.02), ('dual', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
2 0.4057714 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.20633705 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
4 0.15611006 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
5 0.13288967 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.12727655 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.091364376 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
9 0.091345057 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
10 0.090252928 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
11 0.088987313 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.077168152 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
13 0.070302457 13 jmlr-2009-Bounded Kernel-Based Online Learning
14 0.069171138 90 jmlr-2009-Structure Spaces
15 0.066818111 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
16 0.063587941 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
17 0.059848774 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
18 0.05807589 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
19 0.045502663 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
20 0.041615374 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
topicId topicWeight
[(0, 0.316), (1, 0.256), (2, -0.016), (3, -0.193), (4, 0.517), (5, 0.244), (6, -0.084), (7, -0.152), (8, -0.0), (9, 0.003), (10, -0.053), (11, -0.015), (12, -0.07), (13, 0.093), (14, -0.026), (15, 0.044), (16, -0.016), (17, 0.083), (18, -0.021), (19, 0.093), (20, -0.014), (21, -0.031), (22, -0.016), (23, 0.077), (24, -0.023), (25, 0.045), (26, -0.007), (27, -0.023), (28, 0.009), (29, -0.006), (30, 0.011), (31, 0.0), (32, 0.007), (33, 0.001), (34, -0.013), (35, 0.039), (36, 0.016), (37, 0.004), (38, 0.02), (39, -0.004), (40, 0.029), (41, -0.005), (42, -0.04), (43, -0.003), (44, 0.004), (45, 0.003), (46, -0.016), (47, 0.03), (48, -0.016), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97762883 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
2 0.92479008 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.7496962 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér
Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics
5 0.40704679 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.34744579 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
7 0.32360804 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.31610188 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
9 0.27238914 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.2492002 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.24001713 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
12 0.22269237 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
13 0.21738975 90 jmlr-2009-Structure Spaces
14 0.20957977 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
15 0.19940953 13 jmlr-2009-Bounded Kernel-Based Online Learning
16 0.19042511 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
17 0.17835793 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
18 0.17414783 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
19 0.16825357 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.16776946 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
topicId topicWeight
[(8, 0.016), (11, 0.012), (19, 0.424), (24, 0.033), (26, 0.011), (38, 0.051), (47, 0.012), (52, 0.053), (55, 0.055), (58, 0.038), (66, 0.094), (68, 0.056), (90, 0.04), (96, 0.023)]
simIndex simValue paperId paperTitle
Author: Leonid (Aryeh) Kontorovich, Boaz Nadler
Abstract: We propose a novel framework for supervised learning of discrete concepts. Since the 1970’s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning. Keywords: grammar induction, regular language, finite state automaton, maximum margin hyperplane, kernel approximation
same-paper 2 0.79298556 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
3 0.38476264 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
4 0.38049468 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario. Keywords: identity management, permutations, approximate inference, group theoretical methods, sensor networks
5 0.37918645 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
6 0.37083715 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
7 0.36850962 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
8 0.36526346 18 jmlr-2009-Consistency and Localizability
9 0.36492747 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
10 0.36356992 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.36170375 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.35982379 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
13 0.35902429 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
14 0.3582617 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
15 0.35626549 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
16 0.35565948 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
17 0.35516563 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
18 0.35511291 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence
19 0.35359079 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
20 0.35084522 25 jmlr-2009-Distributed Algorithms for Topic Models