jmlr jmlr2010 jmlr2010-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale
Reference: text
sentIndex sentText sentNum sentScore
1 When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. [sent-9, score-0.349]
2 Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. [sent-13, score-1.114]
3 ℓ1 -regularized logistic regression or so-called sparse logistic regression (Tibshirani, 1996), where the weight vector of the classifier has a small number of nonzero values, has been shown to have attractive properties such as feature selection and robustness to noise. [sent-24, score-0.643]
4 The classifier parameters w and v can be determined by minimizing the average logistic loss function, arg min_{w,v} lavg(w, v). [sent-41, score-0.583]
5 2 ℓ1-Regularized Logistic Regression The so-called sparse logistic regression has emerged as a popular linear decoder in the field of machine learning, adding the ℓ1-penalty on the weights w: arg min_{w,v} lavg(w, v) + λ||w||_1, (1) where λ is a regularization parameter. [sent-44, score-0.815]
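To make problem (1) concrete, the following minimal Python sketch evaluates the ℓ1-regularized average logistic loss; the data layout (rows a_i of A, labels y in {−1, +1}, intercept v) is an assumption about the notation and not code from the paper.

```python
import numpy as np

def lavg(A, y, w, v):
    """Average logistic loss: (1/m) * sum_i log(1 + exp(-y_i * (a_i^T w + v))). Assumes labels y_i in {-1, +1}."""
    margins = y * (A @ w + v)
    return np.mean(np.logaddexp(0.0, -margins))   # log(1 + exp(-z)) evaluated stably

def sparse_logistic_objective(A, y, w, v, lam):
    """Objective of problem (1): lavg(w, v) + lam * ||w||_1, with no penalty on the intercept v."""
    return lavg(A, y, w, v) + lam * np.sum(np.abs(w))
```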
6 An exact form of sparsity can be sought using the ℓ0 regularization, which explicitly penalizes the number of nonzero components, arg min_{w,v} lavg(w, v) + λ||w||_0. [sent-52, score-0.514]
7 In the realm of machine learning, ℓ1 regularization exists in various forms of classifiers, including ℓ1 -regularized logistic regression (Tibshirani, 1996), ℓ1 -regularized probit regression (Figueiredo and Jain, 2001; Figueiredo, 2003), ℓ1 -regularized support vector machines (Zhu et al. [sent-77, score-0.379]
8 Glmpath, a solver for ℓ1 -regularized generalized linear models using path following methods, can also handle the logistic regression problem (Park and Hastie, 2007). [sent-93, score-0.419]
9 MOSEK is a general purpose primal-dual interior point solver, which can solve the ℓ1 -regularized logistic regression by formulating the dual problem, or treating it as a geometric program (Boyd et al. [sent-94, score-0.441]
10 SMLR, algorithms for various sparse linear classifiers, can also solve sparse logistic regression (Krishnapuram et al. [sent-96, score-0.408]
11 4 Our Hybrid Algorithm In this paper, we propose a hybrid algorithm that is comprised of two phases: the first phase is based on a new algorithm called iterative shrinkage, inspired by a fixed point continuation (FPC) (Hale et al. [sent-106, score-0.694]
12 , 2008), which is computationally fast and memory friendly; the second phase is a customized interior point method, devised by Koh et al. [sent-107, score-0.325]
13 The iterative shrinkage phase only performs matrix-vector multiplications with the data matrix X, as well as a very simple shrinkage operation (see (6) below), and therefore requires minimal memory consumption. [sent-111, score-0.772]
14 Specifically, our algorithm predicts the sign changes in future shrinkage iterations, and when the signs of wk are likely to be stable, switches to the interior point method and operates on a reduced problem that is much smaller than the original. [sent-115, score-0.476]
15 The interior point method achieves high accuracy in the solution, making our hybrid algorithm equally accurate, as will be shown in Section 4. [sent-116, score-0.36]
16 The HIS algorithm is comprised of two phases: the iterative shrinkage phase and the interior point phase. [sent-118, score-0.666]
17 The iterative shrinkage is inspired by a fixed point continuation method (Hale et al. [sent-119, score-0.674]
18 The rationale of the hybrid approach is based on the observation that the iterative shrinkage phase reduces the algorithm to gradient projection after a finite number of iterations, which will be described in Section 3. [sent-127, score-0.683]
19 (2008), we customize the iterative shrinkage algorithm for the sparse logistic regression, whose objective function is not quadratic. [sent-131, score-0.7]
20 In particular, the step length in the iterative shrinkage algorithm is not constant, unlike the compressive sensing problem. [sent-132, score-0.615]
21 In addition, the ℓ1 regularization is only applied to the w component and not v in sparse logistic regression. [sent-134, score-0.35]
22 This change in the model requires a different shrinkage step, as well as a careful treatment in the line search strategy. [sent-135, score-0.383]
23 In Section 2, we present the iterative shrinkage algorithm for sparse logistic regression, and prove its global convergence and Q-linear convergence. [sent-137, score-0.719]
24 In Section 3, we provide the rationale for the hybrid approach, together with a description of the hybrid algorithm. [sent-138, score-0.31]
25 Sparse Logistic Regression using Iterative Shrinkage The iterative shrinkage algorithm used in the first phase is inspired by a fixed point continuation algorithm by Hale et al. [sent-142, score-0.81]
26 2 Review of Fixed Point Continuation for ℓ1 -minimization A fixed-point continuation algorithm was proposed in Hale et al. [sent-162, score-0.321]
27 (2008) obtained the following optimality condition of x*: x* ∈ X* ⇐⇒ 0 ∈ g(x*) + λ SGN(x*) ⇐⇒ x* = (I + τT_1)^{-1}(I − τT_2)x*, where T_2(·) = g(·), the gradient of f(·), and (I + τT_1)^{-1}(·) is the shrinkage operator. [sent-174, score-0.31]
28 In each iteration, the gradient descent step h reduces f (x) by moving along the negative gradient direction of f (xk ) and the shrinkage step s reduces the ℓ1 -norm by “shrinking” the magnitude of each nonzero component in the input vector. [sent-179, score-0.465]
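As a generic illustration of the two mappings just described, here is a minimal sketch of one fixed point continuation step for min f(x) + λ||x||_1; the step length τ and the gradient callback are placeholders, and this is not the authors' implementation.

```python
import numpy as np

def shrink(z, alpha):
    """Component-wise soft-thresholding (shrinkage): sign(z) * max(|z| - alpha, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def fpc_iteration(x, grad_f, tau, lam):
    """One fixed point continuation step for min f(x) + lam * ||x||_1."""
    h = x - tau * grad_f(x)        # gradient step h(x) = x - tau * g(x)
    return shrink(h, tau * lam)    # shrinkage step s reduces the l1-norm
```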
29 3 Iterative Shrinkage for Sparse Logistic Regression Recall that in the sparse logistic regression problem (1), the ℓ1 regularization is only applied to w, not to v. [sent-181, score-0.408]
30 This reduces (1) to min_u lavg(u) + λ||u_{1:n}||_1, where lavg(u) = (1/m) Σ_{i=1}^m θ(c_i^T u) and θ denotes the logistic transfer function θ(z) = log(1 + exp(−z)). [sent-187, score-0.583]
31 The gradient and Hessian of lavg with respect to u are given by g(u) ≡ ∇lavg(u) = (1/m) Σ_{i=1}^m θ′(c_i^T u) c_i and H(u) ≡ ∇²lavg(u) = (1/m) Σ_{i=1}^m θ″(c_i^T u) c_i c_i^T, where θ′(z) = −(1 + e^z)^{−1} and θ″(z) = (2 + e^{−z} + e^z)^{−1}. [sent-188, score-1.26]
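A direct transcription of these formulas, assuming the rows of C are the vectors c_i^T; forming the dense Hessian is only for illustration, since the algorithm itself needs only Hessian-vector products (see the step-length computation below).

```python
import numpy as np

def grad_hess_lavg(C, u):
    """Gradient g(u) and Hessian H(u) of lavg(u) = (1/m) * sum_i theta(c_i^T u)."""
    m = C.shape[0]
    z = C @ u
    theta_p = -1.0 / (1.0 + np.exp(z))               # theta'(z) = -(1 + e^z)^{-1}
    theta_pp = 1.0 / (2.0 + np.exp(-z) + np.exp(z))  # theta''(z) = (2 + e^{-z} + e^z)^{-1}
    g = C.T @ theta_p / m                            # (1/m) * sum_i theta'(c_i^T u) * c_i
    H = (C.T * theta_pp) @ C / m                     # (1/m) * sum_i theta''(c_i^T u) * c_i * c_i^T
    return g, H
```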
32 The iterative shrinkage algorithm for sparse logistic regression is u^{k+1}_{1:n} = s ∘ h_{1:n}(u^k), for the w component, u^{k+1}_{n+1} = h_{n+1}(u^k), for the v component, (7) which is a composition of two mappings h and s from R^n to R^n, where the gradient operator is h(·) = · − τg(·) = · − τ∇lavg(·). [sent-190, score-0.797]
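A sketch of the asymmetric update in (7); the assumption that u stacks w in its first n entries and v in the last entry is mine, inferred from the surrounding text.

```python
import numpy as np

def his_shrinkage_step(u, g, tau, lam, n):
    """One iteration of (7): gradient step everywhere, shrinkage on the w part u[:n] only."""
    h = u - tau * g                # h(u) = u - tau * grad lavg(u)
    u_next = h.copy()              # the intercept entry keeps the plain gradient step
    u_next[:n] = np.sign(h[:n]) * np.maximum(np.abs(h[:n]) - tau * lam, 0.0)
    return u_next
```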
33 We will present the convergence of the iterative shrinkage algorithm in Section 2. [sent-193, score-0.456]
34 First, the iteration (7) is shown to be non-expansive in ℓ2, that is, ||u^k − u*|| does not increase in k with the assumption on the step length τ. [sent-207, score-0.408]
35 Specifically, in the Assumption, the step length τ is chosen small enough to guarantee that ||h(u^k) − h(u*)|| ≤ ||u^k − u*|| (in practice, τ is determined, for example, by line search). [sent-208, score-0.424]
36 Finally, to obtain the finite convergence result, we need to take a closer look at the shrinkage operator s(·). [sent-216, score-0.346]
37 5 Line Search An important element of the iterative shrinkage algorithm is the step length τ at each iteration. [sent-223, score-0.515]
38 A line search method, at each iteration, computes the step length α_k and the search direction p_k: u^{k+1} = u^k + α_k p_k. [sent-232, score-0.706]
39 For our sparse logistic regression, a sequence of step length candidates are identified, and a decision is made to accept one when certain conditions are satisfied. [sent-235, score-0.369]
40 Ideally the choice of step length α_0 would be a global minimizer of the smooth part of the objective function, ϕ(α) = lavg(u^k + α p^k), α > 0, which is too expensive to evaluate, unlike the quadratic case in compressive sensing. [sent-238, score-0.594]
41 , 2007), we compute the heuristic step length through a minimizer of the quadratic approximation for ϕ(α), lavg(u^k − α∇lavg(u^k)) ≈ lavg(u^k) − α ∇lavg(u^k)^T ∇lavg(u^k) + 0.5 α² ∇lavg(u^k)^T H(u^k) ∇lavg(u^k). [sent-241, score-0.92]
42 Differentiating the right-hand side with respect to α and setting the derivative to zero, we obtain α_0 = (∇lavg(ū^k)^T ∇lavg(ū^k)) / (∇lavg(ū^k)^T H(ū^k) ∇lavg(ū^k)), (11) where ū^k_i = 0 if u_i = 0 or |g_i| < λ, and ū^k_i = u^k_i otherwise. [sent-243, score-0.846]
43 Computationally, a very useful trick is not to compute the Hessian matrix directly, since we only use the vector-matrix product between the gradient vector ∇lavg(u^k) and the Hessian matrix H(u^k). [sent-245, score-0.446]
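The trick can be sketched as follows, where u_bar and g_bar stand for ū^k and ∇lavg(ū^k) in (11); treating C (rows c_i^T) as the data matrix is an assumption about the layout, not the authors' code.

```python
import numpy as np

def heuristic_step_length(C, u_bar, g_bar):
    """Heuristic step length (11): (g^T g) / (g^T H g), with H(u_bar) @ g formed by
    two matrix-vector products with C instead of building the Hessian explicitly."""
    z = C @ u_bar
    theta_pp = 1.0 / (2.0 + np.exp(-z) + np.exp(z))   # theta''(c_i^T u_bar)
    Hg = C.T @ (theta_pp * (C @ g_bar)) / C.shape[0]  # H(u_bar) @ g_bar
    return float(g_bar @ g_bar) / float(g_bar @ Hg)
```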
44 Based on the heuristic step length α_0, we can obtain the search direction p^k, which is a combination of the gradient descent step and the shrinkage step: u^{k−} = u^k − α_0 ∇lavg(u^k), u^{k+} = s_{1:n}(u^{k−}, λα_0), p^k = u^{k+} − u^k. [sent-246, score-1.205]
45 Using the fact u^+ = s_{1:n}(u − τg, λτ), p = u^+ − u, we get g^T p + λ(||u^+_{1:n}||_1 − ||u_{1:n}||_1) ≤ −p^T p/τ, which means ∇lavg(u^k)^T p^k + λ||u^{k+}_{1:n}||_1 − λ||u^k_{1:n}||_1 ≤ 0. [sent-250, score-0.347]
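The search direction from the two preceding sentences can be formed as in the following sketch, again assuming that the first n entries of u are the w component.

```python
import numpy as np

def search_direction(u, g, alpha0, lam, n):
    """p^k = s_{1:n}(u - alpha0 * g, lam * alpha0) - u."""
    u_minus = u - alpha0 * g                           # u^{k-} = u^k - alpha0 * grad lavg(u^k)
    u_plus = u_minus.copy()
    u_plus[:n] = np.sign(u_minus[:n]) * np.maximum(np.abs(u_minus[:n]) - lam * alpha0, 0.0)
    return u_plus - u                                  # p^k = u^{k+} - u^k
```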
46 Notice that the Armijo-like condition for line search stipulates that the step length α j in the search direction pk should produce a sufficient decrease of the objective function φ(u). [sent-255, score-0.387]
47 C_k is a reference value with respect to the previous objective values, while the decrease in the objective function is described as Δ_k := ∇lavg(u^k)^T p^k + λ||u^{k+}||_1 − λ||u^k||_1 ≤ 0. [sent-256, score-0.403]
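A possible backtracking loop consistent with this condition is sketched below; the parameters sigma and beta, the cap on backtracks, and the exact way C_k and Δ_k enter the test are illustrative assumptions rather than the paper's exact choices.

```python
def armijo_backtracking(phi, u, p, delta_k, c_k, alpha0, sigma=0.1, beta=0.5, max_backtracks=30):
    """Shrink alpha until the Armijo-like test phi(u + alpha*p) <= C_k + sigma*alpha*Delta_k holds."""
    alpha = alpha0
    for _ in range(max_backtracks):
        if phi(u + alpha * p) <= c_k + sigma * alpha * delta_k:
            break
        alpha *= beta              # backtrack
    return alpha
```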
48 For the Armijo-like line search, we illustrate both the heuristic step length α0 (solid black curve) and the actual step length after backtracking (dashed red curve). [sent-268, score-0.318]
49 Red asterisk labels the transition points on the continuation path, a concept we will discuss in the next section. [sent-269, score-0.321]
50 Figure 2: Illustration of the Armijo-like line search, comparing the iterative shrinkage algorithm with (right column) and without (left column) line search. [sent-272, score-0.499]
51 (a) The objective function of the iterative shrinkage algorithm without line search, attaining convergence after 6000 iterations. [sent-273, score-0.567]
52 (b) The objective of the iterative shrinkage algorithm with line search, converging at around 150 iterations. [sent-274, score-0.482]
53 The transition point on the continuation path is indicated in (red asterisk). [sent-279, score-0.403]
54 6 Continuation Path A continuation strategy is adopted in our algorithm, by designing a regularization path similar to that used in Hale et al. [sent-290, score-0.49]
55 The rationale of using such a continuation strategy is due to a fast rate of convergence for large λ. [sent-296, score-0.34]
56 In the case of the logistic regression, we have decided to use the geometric progression for the continuation path. [sent-300, score-0.499]
57 The continuation path uses λ_l = λ_0 β^l, l = 0, 1, ..., L − 1, where λ_0 can be calculated based on the ultimate λ̄ we are interested in and the continuation path length L, that is, λ_0 = λ̄/β^{L−1}. [sent-304, score-0.403]
58 As mentioned earlier, the goal of a continuation strategy is to construct a path with different rates of convergence, with which we can speed up the whole algorithm. [sent-305, score-0.519]
59 Note that we design the path length L and the geometric progression rate β in such a way that the initial regularization λ0 is fairly large, leading to a sparse solution for the initial path. [sent-307, score-0.423]
60 Another design issue regarding such a continuation strategy is that we stop each subpath according to some criteria, in an endeavor to approximate the solution for the next λ as fast as possible. [sent-309, score-0.387]
61 The following two stopping criteria are used: ||u^{k+1} − u^k|| / max(||u^k||, 1) < utol_i and ||∇lavg(u^k)||_∞ / λ − 1 < gtol. [sent-311, score-0.608]
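The path construction and the per-subpath test can be sketched as follows; note that β < 1 here, so the path starts from a large λ_0 and decreases geometrically to the target λ̄, and the division by λ in the gradient test is an assumption about how the optimality condition is scaled.

```python
import numpy as np

def continuation_lambdas(lam_bar, L, beta):
    """Geometric continuation path with lambda_0 = lam_bar / beta**(L-1), ending at lam_bar (beta < 1)."""
    lam0 = lam_bar / beta ** (L - 1)
    return [lam0 * beta ** l for l in range(L)]

def subpath_done(u_new, u_old, grad_new, lam, utol_i, gtol):
    """Stopping test for one subpath: small relative change and near-satisfied optimality."""
    rel_change = np.linalg.norm(u_new - u_old) / max(np.linalg.norm(u_old), 1.0)
    return rel_change < utol_i and (np.linalg.norm(grad_new, np.inf) / lam - 1.0) < gtol
```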
62 Figure 3 shows the continuation path using fixed utol and a varying utol following geometric progression. [sent-320, score-0.729]
63 Further optimization of the continuation path can potentially accelerate the computation even more, which remains an open question for future research. [sent-325, score-0.46]
64 Figure 3: Illustration of the continuation strategy (a) using a fixed utol = 0. [sent-334, score-0.441]
65 Note that a stronger convergence is not necessary in earlier stages on the continuation path. [sent-336, score-0.34]
66 By using a varying utol, especially tightening utol as we move along the path, we can accelerate the fixed point continuation algorithm. [sent-337, score-0.498]
67 A continuation path of length 8, starting from 0. [sent-340, score-0.481]
68 Hybrid Iterative Shrinkage (HIS) Algorithm In this section we describe a hybrid approach called HIS, which uses the iterative shrinkage algorithm described previously to enforce sparsity and identify the support in the data, followed by a subspace optimization via an interior point method. [sent-344, score-0.788]
69 The hybrid approach is based on an interesting observation for the iterative shrinkage algorithm, regarding some finite convergence properties. [sent-347, score-0.583]
70 According to the above finite convergence result, the iterative shrinkage algorithm obtains L and E, and thus the optimal support and signs of the optimal nonzero components, in a finite number of steps. [sent-359, score-0.516]
71 The fixed point continuation reduces to the gradient projection after a finite number of iterations. [sent-365, score-0.332]
72 Corollary 2 implies an important fact: there are two phases in the fixed point continuation algorithm. [sent-367, score-0.324]
73 Precisely, it means that for all k > K, the nonzero entries in uk include all true nonzero entries in u∗ with the matched signs. [sent-369, score-0.393]
74 At this point, the fixed point continuation reduces to the gradient projection, starting the second phase of the algorithm. [sent-371, score-0.412]
75 This is due to the quadratic function, and in an application to compressive sensing, the fixed point continuation algorithm alone has resulted in super-fast performance for large-scale problems (Hale et al. [sent-375, score-0.374]
76 In the case of sparse logistic regression, we have a non-strictly convex f , the average logistic regression. [sent-377, score-0.439]
77 In view of the continuation strategy we have, this greatly affects the speed of the last subpath, with the regularization parameter λ̄ of interest. [sent-379, score-0.418]
78 In some sense, we have designed a continuation path that is super-fast until it reaches the second phase of the final subpath. [sent-380, score-0.483]
79 This is not surprising given that the fixed point continuation algorithm is based on gradient descent and shrinkage operator. [sent-381, score-0.631]
80 Based on this intuition, we are now in a position to describe a hybrid algorithm: a fixed point continuation plus an interior point truncated Newton method. [sent-383, score-0.625]
81 The interior point phase solves the equivalent constrained formulation: minimize (1/m) Σ_{i=1}^m θ(w^T a_i + v b_i) + λ 1^T u subject to −u_i ≤ w_i ≤ u_i, i = 1, ..., n. [sent-396, score-0.434]
82 3 The Hybrid Algorithm The hybrid algorithm leverages the computational strengths of both the iterative shrinkage solver and the interior point solver. [sent-419, score-0.816]
83 Recall that we use a continuation strategy for the iterative shrinkage phase, where a sequence of λ’s is used along a regularization path. [sent-422, score-0.761]
84 1 states that iterative shrinkage recovers the true support in a finite number of steps. [sent-425, score-0.381]
85 In addition, iterative shrinkage obtains all true nonzero components long before the true support is obtained. [sent-426, score-0.441]
86 Therefore, as long as the iterative shrinkage seems to stagnate, which can be observed when the objective function evolves very slowly, it is highly likely that all true nonzero components are obtained. [sent-427, score-0.469]
87 In practice, we require the following transition condition, ||u^{k+1} − u^k|| / max(||u^k||, 1) < utol_t, and extract the nonzero components in w as the input to the interior point solver. [sent-429, score-0.783]
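A minimal sketch of this hand-off: once the relative change falls below utol_t, the nonzero entries of w define the reduced subproblem passed to the interior point solver; the helper names are hypothetical.

```python
import numpy as np

def ready_for_interior_point(u_new, u_old, utol_t):
    """Transition test: switch phases once the shrinkage iterates stagnate."""
    return np.linalg.norm(u_new - u_old) / max(np.linalg.norm(u_old), 1.0) < utol_t

def reduced_support(u, n):
    """Indices of the nonzero entries of w = u[:n]; the interior point phase refines only these."""
    return np.flatnonzero(u[:n])
```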
88 The resulting hybrid algorithm achieves high computational speed while attaining the same numerical accuracy as the interior point method, as demonstrated with empirical results in the next section. [sent-431, score-0.436]
89 Table 1: Comparison of solution accuracy for our hybrid iterative shrinkage (HIS) algorithm and the interior point (IP) algorithm. [sent-500, score-0.772]
90 Note that in the simulation, the continuation path used in the iterative shrinkage may or may not be optimal, which means that the speed profile for the HIS algorithm can be essentially accelerated even more. [sent-578, score-0.85]
91 We used a spiking neuron model of primary visual cortex to map the input into cortical space, and decoded the resulting spike trains using sparse logistic regression (Shi et al. [sent-593, score-0.321]
92 Note that there is an issue of model selection when we apply the sparse logistic regression model to the data, in the sense that there exists an optimal level of sparsity that achieves the best classification result. [sent-599, score-0.368]
93 Conclusion We have presented in this paper a computationally efficient algorithm for the ℓ1 -regularized logistic regression, also called the sparse logistic regression. [sent-620, score-0.467]
94 The sparse logistic regression is a widely used model for binary classification in supervised learning. [sent-621, score-0.321]
95 Solving the large-scale sparse logistic regression usually requires expensive computational resources, depending on the specific solver, memory and/or CPU time. [sent-657, score-0.361]
96 We have presented the HIS algorithm, which couples a fast shrinkage method and a slower but more accurate interior point method. [sent-659, score-0.448]
97 The iterative shrinkage algorithm has global convergence with a Q-linear rate. [sent-660, score-0.456]
98 Various techniques such as line search and continuation strategy are used to accelerate the computation. [sent-661, score-0.462]
99 The shrinkage solver only involves the gradient descent and the shrinkage operator, both of which are first-order. [sent-662, score-0.656]
100 Based solely on efficient memory operations such as matrix-vector multiplication, the shrinkage solver serves as the first phase for the algorithm. [sent-663, score-0.466]
wordName wordTfidf (topN-words)
[('lavg', 0.407), ('continuation', 0.293), ('uk', 0.273), ('shrinkage', 0.271), ('interior', 0.177), ('logistic', 0.176), ('hale', 0.169), ('hybrid', 0.155), ('ip', 0.153), ('utol', 0.148), ('ybrid', 0.148), ('hrinkage', 0.148), ('terative', 0.148), ('ajda', 0.133), ('hi', 0.129), ('sher', 0.114), ('path', 0.11), ('iterative', 0.11), ('sgn', 0.106), ('koh', 0.105), ('donoho', 0.089), ('regularization', 0.087), ('sparse', 0.087), ('osher', 0.085), ('ue', 0.084), ('phase', 0.08), ('length', 0.078), ('solver', 0.075), ('pk', 0.074), ('kfold', 0.074), ('yin', 0.07), ('search', 0.067), ('rn', 0.064), ('subpath', 0.063), ('newton', 0.063), ('hee', 0.062), ('utoli', 0.062), ('nonzero', 0.06), ('regression', 0.058), ('accelerate', 0.057), ('cardinality', 0.053), ('compressive', 0.053), ('uci', 0.05), ('friendly', 0.049), ('figueiredo', 0.049), ('krishnapuram', 0.047), ('parra', 0.047), ('shi', 0.047), ('sparsity', 0.047), ('convergence', 0.047), ('sensing', 0.047), ('line', 0.045), ('grafting', 0.042), ('wotao', 0.042), ('iterations', 0.04), ('memory', 0.04), ('xk', 0.04), ('gradient', 0.039), ('speed', 0.038), ('genkin', 0.038), ('attaining', 0.038), ('benchmark', 0.037), ('claerbout', 0.037), ('fpc', 0.037), ('gtol', 0.037), ('shooting', 0.037), ('speedup', 0.037), ('ionosphere', 0.037), ('dimensions', 0.036), ('boyd', 0.035), ('columbia', 0.035), ('hessian', 0.035), ('ck', 0.033), ('backtracking', 0.033), ('cand', 0.033), ('lewis', 0.032), ('jianing', 0.032), ('rice', 0.032), ('solution', 0.031), ('az', 0.031), ('phases', 0.031), ('geometric', 0.03), ('bregman', 0.03), ('wavelet', 0.029), ('iteration', 0.029), ('asterisk', 0.028), ('customized', 0.028), ('gerson', 0.028), ('oe', 0.028), ('sajda', 0.028), ('subproblem', 0.028), ('objective', 0.028), ('attractive', 0.028), ('operator', 0.028), ('red', 0.028), ('step', 0.028), ('classi', 0.028), ('algorithm', 0.028), ('tk', 0.028), ('ui', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
2 0.10321908 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
3 0.080559827 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
4 0.072342075 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
6 0.062377207 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
7 0.056109536 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
8 0.054892443 69 jmlr-2010-Lp-Nested Symmetric Distributions
9 0.054604135 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
10 0.054111198 22 jmlr-2010-Classification Using Geometric Level Sets
11 0.052237943 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
12 0.05206497 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
13 0.051244833 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
14 0.051185716 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
15 0.050373856 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
16 0.050317019 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
17 0.049833652 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
18 0.046254203 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
19 0.044258986 72 jmlr-2010-Matrix Completion from Noisy Entries
20 0.042217754 102 jmlr-2010-Semi-Supervised Novelty Detection
topicId topicWeight
[(0, -0.216), (1, -0.079), (2, 0.084), (3, 0.008), (4, 0.027), (5, 0.012), (6, 0.083), (7, -0.07), (8, -0.008), (9, -0.083), (10, 0.027), (11, -0.007), (12, 0.066), (13, 0.026), (14, 0.038), (15, -0.049), (16, 0.106), (17, -0.026), (18, -0.078), (19, -0.176), (20, -0.009), (21, -0.069), (22, 0.028), (23, 0.049), (24, -0.014), (25, 0.016), (26, 0.077), (27, -0.108), (28, -0.159), (29, -0.03), (30, 0.07), (31, 0.097), (32, -0.036), (33, -0.236), (34, 0.043), (35, 0.109), (36, 0.039), (37, -0.042), (38, -0.158), (39, 0.022), (40, -0.12), (41, 0.031), (42, -0.216), (43, -0.061), (44, -0.015), (45, -0.026), (46, 0.042), (47, 0.038), (48, 0.054), (49, 0.201)]
simIndex simValue paperId paperTitle
same-paper 1 0.9418366 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
2 0.5799067 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
3 0.55479324 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
4 0.51047593 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides an unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
5 0.36499485 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
6 0.33580235 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
7 0.33444542 37 jmlr-2010-Evolving Static Representations for Task Transfer
8 0.32073116 22 jmlr-2010-Classification Using Geometric Level Sets
9 0.31847247 69 jmlr-2010-Lp-Nested Symmetric Distributions
10 0.31535661 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
11 0.30094618 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
12 0.29810712 102 jmlr-2010-Semi-Supervised Novelty Detection
13 0.28718653 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
14 0.27260172 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
16 0.26529616 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
17 0.25862852 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
18 0.25084656 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
19 0.24828833 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
20 0.24062264 72 jmlr-2010-Matrix Completion from Noisy Entries
topicId topicWeight
[(3, 0.014), (8, 0.028), (21, 0.014), (32, 0.039), (36, 0.024), (37, 0.038), (75, 0.675), (85, 0.044)]
simIndex simValue paperId paperTitle
1 0.99777585 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees 1. Overview The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features ξ(Y |X = x) = f (x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f (x) = ∑ p f j (x). Each model component f j (x) might take only j=1 c 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner. H OTHORN , B ÜHLMANN , K NEIB , S CHMID AND H OFNER a subset of the features into account. Special cases are linear models f (x) = x⊤ β, additive models f (x) = ∑ p f j (x( j) ), where f j is a function of the jth feature x( j) only (smooth functions or j=1 stumps, for example) or a more complex function where f (x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., limited number of boosting iterations). The model is then fitted using n p ( fˆ1 , . . . , fˆp ) = argmin ∑ wi ρ yi , ∑ f j (x) . ( f1 ,..., f p ) i=1 j=1 Here (yi , xi ), i = 1, . . . , n, are n training samples with responses yi and potentially high-dimensional feature vectors xi , and wi are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). 
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007). As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses, that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,1 quantile1 and expectile1 regression, censored regression (including Cox, Weibull1 , log-logistic1 or lognormal1 models) as well as Poisson and negative binomial regression1 for count data can be performed. Because the structure of the regression function f (x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009). 2. Design and Implementation The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f j , and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust and regression tests have been set-up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available. 1. Model family is new in version 2.0 or was added after the release of mboost 1.0. 2110 M ODEL - BASED B OOSTING 2.0 Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.2 3. User Interface by Example We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. 
Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference) and iv) using a robust median regression approach instead of L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function are defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:3 R> library(
same-paper 2 0.99588275 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
3 0.99568212 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
4 0.99524134 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuoustime Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
5 0.98930132 6 jmlr-2010-A Rotation Test to Verify Latent Structure
Author: Patrick O. Perry, Art B. Owen
Abstract: In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors. We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure of non-Gaussianity in the histogram of the principal eigenvector components instead of on the eigenvalue. The method finds and verifies some latent dichotomies in the microarray data from the AGEMAP consortium. Keywords: independent components analysis, Kronecker covariance, latent variables, projection pursuit, transposable data
6 0.98778796 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
7 0.93110961 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
8 0.93062508 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
10 0.92052048 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.90764344 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
12 0.90599936 84 jmlr-2010-On Spectral Learning
13 0.9042784 40 jmlr-2010-Fast and Scalable Local Kernel Machines
14 0.90222389 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
15 0.90170968 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
16 0.90145171 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
17 0.89902407 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.89837682 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
19 0.89833456 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
20 0.89620477 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning