jmlr jmlr2008 jmlr2008-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 Research, Santa Clara, CA 95054, USA. Editor: Alexander Smola. Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. [sent-10, score-0.179]
2 In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. [sent-11, score-0.721]
3 Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. [sent-13, score-0.218]
4 Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines 1. [sent-15, score-1.059]
5 Introduction The logistic regression model is useful for two-class classification. [sent-16, score-0.153]
6 There are many methods for training logistic regression models. [sent-30, score-0.196]
7 , 2003), nonlinear conjugate gradient, quasi Newton (in particular, limited memory BFGS) (Liu and Nocedal, 1989; Benson and Moré, 2001), and truncated Newton (Komarek and Moore, 2005). [sent-34, score-0.432]
8 Truncated Newton methods have been an effective approach for large-scale unconstrained optimization, but their use for logistic regression has not been fully exploited. [sent-53, score-0.182]
9 In Section 2, we discuss an efficient and robust truncated Newton method for logistic regression. [sent-55, score-0.247]
10 This approach, called the trust region Newton method, uses only approximate Newton steps in the beginning, but takes full Newton directions in the end for fast convergence. [sent-56, score-0.547]
11 In Sections 3 and 4, we discuss some existing optimization methods for logistic regression and conduct comparisons. [sent-57, score-0.181]
12 Section 5 investigates a variant of our proposed method by using preconditioned conjugate gradients in the trust region framework. [sent-60, score-0.778]
13 In Section 6, we extend the proposed trust region method to solve L2-loss support vector machines. [sent-61, score-0.568]
14 For large-scale logistic regression, we then propose a trust region Newton method, which is a type of truncated Newton approach. [sent-72, score-0.773]
15 2.1 Newton and Truncated Newton Methods. To discuss Newton methods, we need the gradient and Hessian of f(w): ∇f(w) = w + C ∑_{i=1}^{l} (σ(y_i w^T x_i) − 1) y_i x_i (3) and ∇²f(w) = I + C X^T D X (4), where I is the identity matrix, σ(y_i w^T x_i) = (1 + e^{−y_i w^T x_i})^{−1}, D is the diagonal matrix with D_ii = σ(y_i w^T x_i)(1 − σ(y_i w^T x_i)), and X is the l×n matrix whose rows are the instances x_i^T. [sent-74, score-0.179]
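To make these quantities concrete, here is a minimal NumPy sketch (not from the paper) of the function value, the gradient (3), and a closure for the Hessian-vector product used later in (7); it assumes X is an l×n sparse matrix whose rows are the instances x_i^T, y is a vector of ±1 labels, and all names are illustrative. The Hessian (4) is never formed explicitly.

import numpy as np

def fun_grad_hessvec(w, X, y, C):
    # f(w) = 0.5 w^T w + C * sum_i log(1 + exp(-y_i w^T x_i)),
    # its gradient (3), and a closure for the Hessian-vector product (7).
    z = y * (X @ w)                               # y_i w^T x_i
    sigma = 1.0 / (1.0 + np.exp(-z))              # sigma(y_i w^T x_i)
    f = 0.5 * w.dot(w) + C * np.sum(np.logaddexp(0.0, -z))
    grad = w + C * (X.T @ ((sigma - 1.0) * y))    # eq. (3)
    D = sigma * (1.0 - sigma)                     # diagonal entries of D in (4)
    def hess_vec(s):                              # (I + C X^T D X) s, as in (7)
        return s + C * (X.T @ (D * (X @ s)))
    return f, grad, hess_vec

Each hess_vec call costs two matrix-vector products, with X and with X^T, which is the O(nnz) per conjugate gradient iteration counted later in (15).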
16 Since ∇²f(w^k) is invertible, the simplest Newton method updates w by w^{k+1} = w^k + s^k, (5) where k is the iteration index and s^k, the Newton direction, is the solution of the linear system ∇²f(w^k) s^k = −∇f(w^k) (6). [sent-83, score-0.649]
17 Two techniques are often used: line search and trust region. [sent-92, score-0.396]
18 Therefore, for large logistic regression, iterative methods are more suitable than direct methods, which require the whole Hessian matrix. [sent-100, score-0.155]
19 Among iterative methods, conjugate gradient methods are currently the most commonly used within Newton methods. [sent-101, score-0.199]
20 The optimization procedure then has two layers of iterations: at each outer iteration an inner conjugate gradient procedure finds the Newton direction. [sent-102, score-0.352]
21 Unfortunately, conjugate gradient methods may suffer from lengthy iterations in certain situations. [sent-103, score-0.352]
22 Komarek and Moore (2005) are among the first to apply truncated Newton methods for logistic regression. [sent-106, score-0.226]
23 They approximately solve (6) by conjugate gradient procedures and use (5) to update w^k. [sent-107, score-0.751]
24 They terminate the conjugate gradient procedure if the relative difference of log likelihoods between two consecutive conjugate gradient iterations is smaller than a threshold. [sent-108, score-0.603]
25 Some comparisons between limited memory quasi Newton and truncated Newton are by Nocedal and Nash (1991) and Zou et al. [sent-116, score-0.263]
26 2.2 A Trust Region Newton Method. We consider the trust region method (Lin and Moré, 1999), which is a truncated Newton method to deal with general bound-constrained optimization problems (i.e., the variables are subject to bound constraints). [sent-119, score-0.718]
27 At each iteration of a trust region Newton method for minimizing f(w), we have an iterate w^k, a size ∆_k of the trust region, and a quadratic model q_k(s) = ∇f(w^k)^T s + (1/2) s^T ∇²f(w^k) s. [sent-124, score-1.517]
28 Algorithm 1: A trust region algorithm for logistic regression. [sent-126, score-0.7]
29 • Find an approximate solution s^k of the trust region sub-problem min_s q_k(s) subject to ‖s‖ ≤ ∆_k. (11) [sent-133, score-0.709]
30 Next, we find a step s^k to approximately minimize q_k(s) subject to the constraint ‖s‖ ≤ ∆_k. [sent-138, score-0.162]
31 We then update w^k and ∆_k by checking the ratio ρ_k = (f(w^k + s^k) − f(w^k)) / q_k(s^k) (8) of the actual reduction in the function to the predicted reduction in the quadratic model. [sent-139, score-0.634]
32 The direction is accepted if ρ_k is large enough: w^{k+1} = w^k + s^k if ρ_k > η_0, and w^{k+1} = w^k if ρ_k ≤ η_0, (9) where η_0 > 0 is a pre-specified value. [sent-140, score-0.975]
33 The trust region bound ∆_k is updated by the rules ∆_{k+1} ∈ [σ_1 min{‖s^k‖, ∆_k}, σ_2 ∆_k] if ρ_k ≤ η_1; ∆_{k+1} ∈ [σ_1 ∆_k, σ_3 ∆_k] if ρ_k ∈ (η_1, η_2); ∆_{k+1} ∈ [∆_k, σ_3 ∆_k] if ρ_k ≥ η_2. (10) [sent-142, score-0.626]
34 Similar rules are used in most modern trust region methods. [sent-143, score-0.547]
35 A description of our trust region algorithm is given in Algorithm 1. [sent-144, score-0.547]
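As an illustration of Algorithm 1 and of the rules (8)-(10), the following sketch combines the fun_grad_hessvec routine above with a cg_steihaug routine for the sub-problem (11) (a version is sketched after Algorithm 2 below). The initial ∆_0 = ‖∇f(w^0)‖, η_0 = 10^{-4}, and the gradient-based stopping rule follow the text; the remaining constants and the particular point chosen inside each interval of (10) are illustrative assumptions, not necessarily the paper's choices.

import numpy as np

def tron_logreg(X, y, C, max_iter=100, eps=1e-3):
    w = np.zeros(X.shape[1])
    f, g, hess_vec = fun_grad_hessvec(w, X, y, C)
    delta = np.linalg.norm(g)                      # Delta_0 = ||grad f(w^0)||
    eta0, eta1, eta2 = 1e-4, 0.25, 0.75            # eta0 from the text; eta1, eta2 assumed
    sigma1, sigma3 = 0.25, 4.0                     # assumed, with sigma1 <= 1 <= sigma3
    for _ in range(max_iter):
        if np.linalg.norm(g, np.inf) <= eps:       # stopping condition (18)
            break
        s = cg_steihaug(g, hess_vec, delta)        # approximate solution of (11)
        pred = g.dot(s) + 0.5 * s.dot(hess_vec(s)) # q_k(s^k)
        f_new, g_new, hv_new = fun_grad_hessvec(w + s, X, y, C)
        rho = (f_new - f) / pred                   # ratio (8)
        if rho > eta0:                             # acceptance test (9)
            w, f, g, hess_vec = w + s, f_new, g_new, hv_new
        # radius update: one admissible point in each interval of (10)
        if rho <= eta1:
            delta = sigma1 * min(np.linalg.norm(s), delta)
        elif rho >= eta2:
            delta = sigma3 * delta
        # for rho in (eta1, eta2), keeping delta lies inside [sigma1*delta, sigma3*delta]
    return w

Note that when a step is rejected only the trust region radius shrinks before the next inner solve; the gradient stays the same, as remarked later in the text.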
36 The conjugate gradient method to approximately solve the trust region sub-problem (11) is given in Algorithm 2. [sent-147, score-0.847]
37 Note that only one Hessian-vector product is needed at each conjugate gradient iteration. [sent-150, score-0.279]
38 Since r_i = −∇f(w^k) − ∇²f(w^k) s̄_i, the stopping condition (12) is the same as ‖−∇f(w^k) − ∇²f(w^k) s̄_i‖ ≤ ξ_k ‖∇f(w^k)‖. Algorithm 2: Conjugate gradient procedure for approximately solving the trust region sub-problem (11). [sent-151, score-0.77]
39 (inner iterations) • If ‖r_i‖ ≤ ξ_k ‖∇f(w^k)‖, (12) then output s^k = s̄_i and stop. [sent-158, score-0.188]
40 • If ‖s̄_{i+1}‖ ≥ ∆_k, compute τ such that ‖s̄_i + τ d_i‖ = ∆_k, (13) then output s^k = s̄_i + τ d_i and stop. [sent-161, score-0.169]
41 However, Algorithm 2 is different from standard conjugate gradient methods for linear systems, as the constraint ‖s‖ ≤ ∆_k must be taken care of. [sent-166, score-0.279]
42 With s̄_0 = 0, we have ‖s̄_i‖ < ‖s̄_{i+1}‖, ∀i, so in a finite number of conjugate gradient iterations, either (12) is satisfied or s̄_{i+1} violates the trust region constraint. [sent-168, score-0.871]
43 In the latter situation, (13) finds a point on the trust region boundary, as q_k(s̄_i + τ d_i) < q_k(s̄_i). [sent-169, score-0.713]
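A minimal sketch of Algorithm 2 in the same illustrative style (assumed names, not the paper's code): since ∇²f(w^k) is positive definite here, no negative-curvature handling is needed; the inner stopping rule is (12) and the boundary step solves ‖s̄_i + τ d_i‖ = ∆_k as in (13).

import numpy as np

def cg_steihaug(grad, hess_vec, delta, xi=0.1, max_cg=250):
    # Approximately minimize q(s) = grad^T s + 0.5 s^T H s subject to ||s|| <= delta.
    s = np.zeros_like(grad)
    r = -grad                                      # r_i = -grad - H s_i, with s_0 = 0
    d = r.copy()
    gnorm = np.linalg.norm(grad)
    if np.linalg.norm(r) <= xi * gnorm:            # stopping rule (12)
        return s
    for _ in range(max_cg):
        Hd = hess_vec(d)
        alpha = r.dot(r) / d.dot(Hd)               # d^T H d > 0 because H is positive definite
        s_next = s + alpha * d
        if np.linalg.norm(s_next) >= delta:
            # boundary step (13): the root tau >= 0 of ||s + tau*d|| = delta
            a, b, c = d.dot(d), 2.0 * s.dot(d), s.dot(s) - delta ** 2
            tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
            return s + tau * d
        r_next = r - alpha * Hd
        if np.linalg.norm(r_next) <= xi * gnorm:   # stopping rule (12)
            return s_next
        beta = r_next.dot(r_next) / r.dot(r)
        d = r_next + beta * d
        s, r = s_next, r_next
    return s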
44 The whole procedure is a careful design to make sure that the approximate Newton direction is good enough and the trust region method converges. [sent-170, score-0.568]
45 Next, we discuss convergence properties of the trust region Newton method. [sent-171, score-0.574]
46 If ξ_k < 1, then the trust region method Q-linearly converges: lim_{k→∞} ‖w^{k+1} − w*‖ / ‖w^k − w*‖ < 1, (14) where w* is the unique optimal solution of (2). [sent-186, score-1.016]
47 Regarding the computational complexity, the cost per iteration is (15): O(nnz) for one function and zero or one gradient evaluations, plus O(nnz) × (number of conjugate gradient iterations), where nnz is the number of nonzero elements in the sparse matrix X. [sent-192, score-1.01]
48 Note that if wk is not updated in (9), then the gradient is the same for the next iteration. [sent-193, score-0.558]
49 Related Methods and Implementation Issues In this section, we discuss a general limited memory quasi Newton implementation (Liu and Nocedal, 1989). [sent-195, score-0.162]
50 Many consider it to be very efficient for training logistic regression. [sent-196, score-0.168]
51 We also discuss implementation issues of the proposed trust region Newton method. [sent-197, score-0.547]
52 This property is useful for large logistic regression as we cannot afford to store Hk . [sent-204, score-0.153]
53 In contrast, limited memory quasi Newton methods consider only approximate Hessian matrices, so we can expect them to have slower convergence. [sent-234, score-0.162]
54 Moreover, compared to (15), the cost per iteration is less than that for our trust region method. [sent-237, score-0.592]
55 We set the initial ∆_0 = ‖∇f(w^0)‖ and take η_0 = 10^{−4} in (9) to update w^k. [sent-245, score-0.472]
56 In the conjugate gradient procedure, we use ξ_k = 0.1. [sent-254, score-0.279]
57 These choices are considered appropriate following the research on trust region methods in the past several decades. [sent-257, score-0.547]
58 It is unclear yet if they are the best for logistic regression problems, but certainly we would like to try custom settings first. [sent-258, score-0.153]
59 Experiments In this section, we compare our approach with a quasi Newton implementation for logistic regression. [sent-260, score-0.218]
60 4.2 Comparisons. We compare two logistic regression implementations: • TRON: the trust region Newton method discussed in Section 2. [sent-370, score-0.721]
61 • LBFGS: the limited memory quasi Newton implementation (Liu and Nocedal, 1989). [sent-372, score-0.162]
62 For the first one, we simulate the practical use of logistic regression by setting a stopping condition and checking the prediction ability. [sent-395, score-0.202]
63 Most unconstrained optimization software uses gradient information to terminate the iterative procedure, so we use ‖∇f(w^k)‖_∞ ≤ 10^{−3} (18) as the stopping condition. [sent-396, score-0.267]
64 On training time, TRON is better than LBFGS, so truncated Newton methods are effective for training logistic regression. [sent-406, score-0.312]
65 For example, if X = [−10 0 −20 0; 30 0 0 10] (a 2×4 matrix), then its compressed column format is given by three arrays: X_val = [−10, 30, −20, 10], X_rowind = [1, 2, 1, 2], X_colptr = [1, 3, 3, 4, 5], where rowind means row indices and colptr means column pointers. [sent-449, score-0.275]
66 The main conjugate gradient operation (7) involves two matrix-vector products: one is with X^T, and the other is with X. [sent-454, score-0.279]
67 If the number of nonzeros in a column is significantly larger than that in a row, very likely a column cannot fit into the same memory level as a row. [sent-462, score-0.152]
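As a small sketch, the two products in (7) can be run directly on the three compressed column arrays above (0-based indices here, whereas the arrays in the text are 1-based; names are illustrative).

import numpy as np

X_val    = np.array([-10.0, 30.0, -20.0, 10.0])
X_rowind = np.array([0, 1, 0, 1])        # text's [1, 2, 1, 2], shifted to 0-based
X_colptr = np.array([0, 2, 2, 3, 4])     # text's [1, 3, 3, 4, 5], shifted to 0-based
l, n = 2, 4                              # numbers of rows and columns of X

def csc_X_times(v):                      # u = X v, one column at a time
    u = np.zeros(l)
    for j in range(n):
        for p in range(X_colptr[j], X_colptr[j + 1]):
            u[X_rowind[p]] += X_val[p] * v[j]
    return u

def csc_XT_times(u):                     # v = X^T u, same column traversal
    v = np.zeros(n)
    for j in range(n):
        for p in range(X_colptr[j], X_colptr[j + 1]):
            v[j] += X_val[p] * u[X_rowind[p]]
    return v

With the column format both products traverse columns, but X^T u reads (and X v writes) the length-l vector at scattered row positions, which is one way to see the memory-level consideration just discussed.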
68 Preconditioned Conjugate Gradient Methods To reduce the number of conjugate gradient iterations, in the truncated Newton method one often uses preconditioned conjugate gradient procedures. [sent-475, score-0.721]
69 If the approximate factorization (19) is good enough, P^{−1} ∇²f(w^k) P^{−T} is close to the identity and fewer conjugate gradient iterations are needed. [sent-477, score-0.324]
70 However, as we need extra effort to find P and the cost per conjugate gradient iteration is higher, a smaller number of conjugate gradient iterations may not lead to shorter training time. [sent-478, score-0.581]
71 First, lengthy conjugate gradient iterations often occur at final outer steps, but for machine learning applications the algorithm may stop before reaching such a stage. [sent-482, score-0.375]
72 The cost of obtaining the preconditioner is thus no more than that of one conjugate gradient iteration. [sent-488, score-0.31]
73 Algorithm 4: Preconditioned conjugate gradient procedure for approximately solving the trust region sub-problem (21). [sent-491, score-0.826]
74 (inner iterations) • If ‖r̂_i‖ ≤ ξ_k ‖ĝ‖, then output s^k = P^{−T} ŝ_i and stop. [sent-498, score-0.188]
75 With ŝ = P^T s, we transform (20) to min_ŝ q̂_k(ŝ) subject to ‖ŝ‖ ≤ ∆_k, (21) where q̂_k(ŝ) = ĝ^T ŝ + (1/2) ŝ^T Ĥ ŝ, ĝ = P^{−1} ∇f(w^k), and Ĥ = P^{−1} ∇²f(w^k) P^{−T}. [sent-505, score-0.166]
76 Note that in practical implementations we calculate Ĥ d_i in a way similar to (7): P^{−1}(P^{−T} d_i + C(X^T(D(X(P^{−T} d_i))))). [sent-509, score-0.16]
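A sketch of this product, assuming a diagonal preconditioner whose entries are the square roots of the diagonal of ∇²f(w^k); that specific choice of P (which is cheap to form, consistent with the cost remark above) and all names are assumptions for illustration, not a restatement of the paper's exact preconditioner. D denotes the vector of D_ii values from (4).

import numpy as np
import scipy.sparse as sp

def make_precond_hessvec(X, D, C):
    # P = diag(p) with p_j = sqrt(1 + C * sum_i D_ii X_ij^2), the Hessian diagonal.
    Xsq = X.multiply(X) if sp.issparse(X) else X * X
    p = np.sqrt(1.0 + C * np.asarray(Xsq.T @ D).ravel())
    def hess_vec_hat(d):
        # P^{-1} (P^{-T} d + C (X^T (D (X (P^{-T} d)))))
        u = d / p
        return (u + C * (X.T @ (D * (X @ u)))) / p
    return hess_vec_hat, p

Algorithm 4 then runs conjugate gradients with hess_vec_hat and ĝ = ∇f(w^k)/p, and maps the inner solution back through s^k = P^{−T} ŝ, that is, one more division by p.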
77 In Table 4, we present the average number of conjugate gradient iterations per fold in the CV procedure. [sent-510, score-0.347]
78 Table 4: Average number of conjugate gradient iterations per fold in the CV procedure.
Problem        CG    PCG
a9a            567   263
real-sim       104   160
news20          71   155
citeseer       113   115
yahoo-japan    278   326
rcv1           225   280
yahoo-korea    779   736
[sent-515, score-0.347]
79 Trust Region Method for L2-SVM The second term in (2) can be considered as a loss function, so regularized logistic regression is related to other learning approaches such as Support Vector Machines (SVM) (Boser et al. [sent-521, score-0.153]
80 In this section, we extend our trust region method for L2-SVM. [sent-535, score-0.568]
81 The gradient ∇f_2(w) is Lipschitz continuous, so one can define the generalized Hessian matrix B(w) = I + 2C X^T D X, where D_ii = 1 if 1 − y_i w^T x_i > 0; D_ii = any element in [0, 1] if 1 − y_i w^T x_i = 0; and D_ii = 0 if 1 − y_i w^T x_i < 0. [sent-543, score-0.303]
82 Then the trust region method (Lin and Moré, 1999) can be applied by replacing ∇²f(w) in Section 2. [sent-544, score-0.568]
83 The Hessian-vector product in the conjugate gradient procedure is then B(w)s = s + 2C · X_{I,:}^T (D_{I,I} (X_{I,:} s)). [sent-553, score-0.279]
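A sketch of the corresponding L2-SVM pieces (function value, gradient, and the generalized Hessian-vector product above, restricted to the index set I = {i : 1 − y_i w^T x_i > 0}); this is an illustrative NumPy version with assumed names, taking X to support row selection (e.g., a CSR matrix or a dense array), not the paper's implementation.

import numpy as np

def l2svm_fun_grad_hessvec(w, X, y, C):
    # f_2(w) = 0.5 w^T w + C * sum_i max(0, 1 - y_i w^T x_i)^2
    margin = 1.0 - y * (X @ w)
    active = margin > 0                            # the index set I
    XI, yI = X[active], y[active]
    f = 0.5 * w.dot(w) + C * np.sum(margin[active] ** 2)
    grad = w + 2.0 * C * (XI.T @ (XI @ w - yI))    # uses y_i^2 = 1
    def hess_vec(s):                               # B(w) s = s + 2C X_I^T (X_I s), since D_ii = 1 on I
        return s + 2.0 * C * (XI.T @ (XI @ s))
    return f, grad, hess_vec

Plugging this hess_vec into the trust region loop sketched earlier is, in spirit, how the text extends the method from logistic regression to L2-SVM.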
84 Keerthi and DeCoste (2005) use conjugate gradient methods to solve (26), and the procedure is described in Algorithm 5. [sent-563, score-0.279]
85 In contrast, the convergence of our trust region method holds when the conjugate gradient procedure only approximately minimizes the trust-region sub-problem. [sent-567, score-0.874]
86 Solve (26) by the conjugate gradient procedure and obtain w̄^k. [sent-576, score-0.727]
87 Find α_k = arg min_{α≥0} f(w^k + α s^k), and set w^{k+1} = w^k + α_k s^k. [sent-578, score-0.527]
88 6.3 Comparisons. We compare our proposed trust region implementation (TRON) in Section 6. [sent-580, score-0.547]
89 To solve (26), SVMlin considers a relative stopping condition for the conjugate gradient procedure. [sent-586, score-0.328]
90 Following their convergence result, we modify SVMlin to quite accurately solve the linear system (26): recall in Algorithm 5 that we sequentially obtain the following items: w^k → I_k → w̄^k. [sent-587, score-0.923]
91 We then use ‖(I + 2C X_{I_k,:}^T X_{I_k,:}) w̄^k − 2C X_{I_k,:}^T y_{I_k}‖_∞ ≤ 10^{−3} as the stopping condition of the conjugate gradient procedure in SVMlin. [sent-588, score-0.328]
92 Both approaches spend most of their time on the operation (24) in the conjugate gradient procedure. [sent-590, score-0.305]
93 In contrast, trust region methods are effective on using only approximate directions in the early stage of the procedure. [sent-594, score-0.547]
94 Discussion and Conclusions. As logistic regression is a special case of maximum entropy models and conditional random fields, it is possible to extend the proposed approach to them. [sent-596, score-0.153]
95 In summary, we have shown that a trust region Newton method is effective for training large-scale logistic regression problems as well as L2-SVM. [sent-612, score-0.764]
96 It is interesting that we do not need many special settings for logistic regression; a rather direct use of modern trust region techniques already yields excellent performance. [sent-614, score-0.672]
97 If this property is wrong, there is a sequence {w^k} in the set (29) satisfying ‖w^k‖ → ∞. [sent-625, score-0.448]
98 However, f(w^k) ≥ (1/2)‖w^k‖² → ∞ contradicts the fact that f(w^k) ≤ f(w^0), ∀k. [sent-626, score-0.448]
99 Making logistic regression a core data mining tool: A practical investigation of accuracy, speed, and simplicity. [sent-699, score-0.153]
100 The conjugate gradient method and trust regions in large scale optimization. [sent-771, score-0.696]
wordName wordTfidf (topN-words)
[('wk', 0.448), ('trust', 0.396), ('newton', 0.369), ('mor', 0.206), ('tron', 0.185), ('conjugate', 0.169), ('lbfgs', 0.154), ('region', 0.151), ('egion', 0.134), ('ogistic', 0.134), ('cv', 0.132), ('logistic', 0.125), ('eerthi', 0.124), ('ewton', 0.113), ('rust', 0.113), ('gradient', 0.11), ('truncated', 0.101), ('eng', 0.094), ('wt', 0.094), ('ethod', 0.093), ('quasi', 0.093), ('egression', 0.086), ('hessian', 0.085), ('qk', 0.083), ('nocedal', 0.083), ('di', 0.08), ('sk', 0.079), ('preconditioners', 0.073), ('hk', 0.072), ('ri', 0.064), ('lin', 0.063), ('nnz', 0.061), ('svmlin', 0.061), ('komarek', 0.051), ('yahoo', 0.051), ('stopping', 0.049), ('argonne', 0.049), ('steihaug', 0.049), ('bfgs', 0.047), ('decoste', 0.047), ('keerthi', 0.046), ('iterations', 0.045), ('si', 0.045), ('liu', 0.044), ('training', 0.043), ('memory', 0.043), ('format', 0.043), ('jorge', 0.042), ('formats', 0.041), ('preconditioned', 0.041), ('nonzeros', 0.041), ('row', 0.04), ('bouaricha', 0.036), ('hdi', 0.036), ('column', 0.034), ('stephen', 0.034), ('yi', 0.034), ('sathiya', 0.031), ('differentiable', 0.031), ('preconditioner', 0.031), ('iterative', 0.03), ('unconstrained', 0.029), ('regression', 0.028), ('optimization', 0.028), ('compressed', 0.028), ('lengthy', 0.028), ('mangasarian', 0.028), ('ruby', 0.028), ('pietra', 0.028), ('dii', 0.028), ('convergence', 0.027), ('time', 0.026), ('limited', 0.026), ('document', 0.026), ('weng', 0.025), ('pt', 0.025), ('continuously', 0.025), ('benson', 0.024), ('colptr', 0.024), ('malouf', 0.024), ('normalizations', 0.024), ('pcg', 0.024), ('rong', 0.024), ('rowind', 0.024), ('update', 0.024), ('nonzero', 0.023), ('outer', 0.023), ('xi', 0.023), ('per', 0.023), ('dx', 0.022), ('iteration', 0.022), ('taiwan', 0.022), ('matrix', 0.022), ('sparse', 0.022), ('software', 0.021), ('instances', 0.021), ('moore', 0.021), ('method', 0.021), ('twice', 0.021), ('asuncion', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
2 0.44274959 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent
3 0.15626225 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
4 0.078567103 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
5 0.076479286 2 jmlr-2008-A Library for Locally Weighted Projection Regression (Machine Learning Open Source Software Paper)
Author: Stefan Klanke, Sethu Vijayakumar, Stefan Schaal
Abstract: In this paper we introduce an improved implementation of locally weighted projection regression (LWPR), a supervised learning algorithm that is capable of handling high-dimensional input data. As the key features, our code supports multi-threading, is available for multiple platforms, and provides wrappers for several programming languages. Keywords: regression, local learning, online learning, C, C++, Matlab, Octave, Python
6 0.071262084 46 jmlr-2008-LIBLINEAR: A Library for Large Linear Classification (Machine Learning Open Source Software Paper)
7 0.063177243 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
8 0.055624332 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
9 0.049620785 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
10 0.042886458 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.041004416 52 jmlr-2008-Learning from Multiple Sources
12 0.037598375 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
13 0.035268597 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.034373932 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
15 0.034198642 86 jmlr-2008-SimpleMKL
16 0.034170199 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
17 0.033850849 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.03021778 96 jmlr-2008-Visualizing Data using t-SNE
19 0.029578194 80 jmlr-2008-Ranking Individuals by Group Comparisons
20 0.029110109 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
topicId topicWeight
[(0, 0.202), (1, -0.108), (2, 0.09), (3, 0.719), (4, 0.002), (5, -0.057), (6, -0.1), (7, -0.174), (8, 0.136), (9, -0.008), (10, 0.096), (11, 0.035), (12, -0.009), (13, -0.002), (14, -0.006), (15, 0.053), (16, 0.004), (17, -0.028), (18, -0.055), (19, -0.04), (20, 0.05), (21, -0.027), (22, 0.037), (23, -0.013), (24, 0.034), (25, -0.015), (26, 0.037), (27, -0.041), (28, 0.007), (29, -0.024), (30, -0.065), (31, -0.039), (32, 0.03), (33, -0.015), (34, 0.003), (35, -0.024), (36, 0.034), (37, 0.007), (38, 0.057), (39, 0.053), (40, -0.004), (41, 0.026), (42, -0.027), (43, 0.052), (44, -0.036), (45, -0.021), (46, 0.098), (47, -0.031), (48, -0.084), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.96918297 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
2 0.94962102 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent
3 0.38716203 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
4 0.2429688 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
5 0.22063243 2 jmlr-2008-A Library for Locally Weighted Projection Regression (Machine Learning Open Source Software Paper)
Author: Stefan Klanke, Sethu Vijayakumar, Stefan Schaal
Abstract: In this paper we introduce an improved implementation of locally weighted projection regression (LWPR), a supervised learning algorithm that is capable of handling high-dimensional input data. As the key features, our code supports multi-threading, is available for multiple platforms, and provides wrappers for several programming languages. Keywords: regression, local learning, online learning, C, C++, Matlab, Octave, Python
6 0.21768695 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
7 0.19988556 46 jmlr-2008-LIBLINEAR: A Library for Large Linear Classification (Machine Learning Open Source Software Paper)
8 0.18907945 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
9 0.17786573 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
10 0.15376261 86 jmlr-2008-SimpleMKL
11 0.15337054 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.14320731 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
13 0.14263691 96 jmlr-2008-Visualizing Data using t-SNE
14 0.13139725 58 jmlr-2008-Max-margin Classification of Data with Absent Features
15 0.13060397 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
16 0.12878579 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
17 0.12062351 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.1185094 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
19 0.11286525 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
20 0.11090921 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
topicId topicWeight
[(0, 0.015), (5, 0.015), (31, 0.018), (40, 0.023), (54, 0.036), (58, 0.038), (66, 0.037), (76, 0.016), (88, 0.039), (92, 0.021), (94, 0.625), (99, 0.03)]
simIndex simValue paperId paperTitle
1 0.99126619 2 jmlr-2008-A Library for Locally Weighted Projection Regression (Machine Learning Open Source Software Paper)
Author: Stefan Klanke, Sethu Vijayakumar, Stefan Schaal
Abstract: In this paper we introduce an improved implementation of locally weighted projection regression (LWPR), a supervised learning algorithm that is capable of handling high-dimensional input data. As the key features, our code supports multi-threading, is available for multiple platforms, and provides wrappers for several programming languages. Keywords: regression, local learning, online learning, C, C++, Matlab, Octave, Python
same-paper 2 0.97212887 91 jmlr-2008-Trust Region Newton Method for Logistic Regression
Author: Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi
Abstract: Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM). Keywords: logistic regression, newton method, trust region, conjugate gradient, support vector machines
3 0.95116192 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
Author: Suhrid Balakrishnan, David Madigan
Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO
4 0.83593619 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent
Author: Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin
Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets. Keywords: large-scale linear classification, logistic regression, support vector machines, open source, machine learning
6 0.55793184 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.55520445 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
8 0.55426919 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
9 0.54699439 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
10 0.53770059 86 jmlr-2008-SimpleMKL
11 0.49934024 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
12 0.48989332 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
13 0.48966777 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
14 0.48015493 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
15 0.47804642 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
16 0.46110252 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.45586991 83 jmlr-2008-Robust Submodular Observation Selection
18 0.44632116 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
19 0.44569519 9 jmlr-2008-Active Learning by Spherical Subdivision
20 0.43675679 56 jmlr-2008-Magic Moments for Structured Output Prediction