jmlr jmlr2007 jmlr2007-10 knowledge-graph by maker-knowledge-mining

10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression


Source: pdf

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. [sent-10, score-0.177]

2 Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods. [sent-11, score-0.266]

3 Introduction In this section we describe the basic logistic regression problem, the ℓ2- and ℓ1-regularized versions, and the regularization path. [sent-13, score-0.26]

4 We use plog(v, w) ∈ R^m to denote the vector of conditional probabilities, according to the logistic model, plog(v, w)_i = Prob(b_i | x_i) = exp(w^T a_i + v b_i) / (1 + exp(w^T a_i + v b_i)), i = 1, ..., m. [sent-32, score-0.633]

5 The likelihood function associated with the samples is ∏_{i=1}^m plog(v, w)_i, and the log-likelihood function is given by ∑_{i=1}^m log plog(v, w)_i = − ∑_{i=1}^m f(w^T a_i + v b_i), where f is the logistic loss function f(z) = log(1 + exp(−z)). [sent-36, score-0.667]

6 The negative of the log-likelihood function is called the (empirical) logistic loss, and dividing by m we obtain the average logistic loss, lavg(v, w) = (1/m) ∑_{i=1}^m f(w^T a_i + v b_i). [sent-38, score-0.607]

7 The average logistic loss is always nonnegative, that is, lavg (v, w) ≥ 0, since f (z) ≥ 0 for any z. [sent-41, score-0.316]

8 For the choice w = 0, v = 0, we have lavg(0, 0) = log 2 ≈ 0.693. [sent-42, score-0.179]
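
To make the definitions above concrete, here is a minimal NumPy sketch of plog, the logistic loss f, and the average loss lavg; the data matrix A (with rows a_i = b_i x_i), the labels b, and the random test data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def f_logistic(z):
    # logistic loss f(z) = log(1 + exp(-z)), computed stably
    return np.logaddexp(0.0, -z)

def plog(v, w, A, b):
    # vector of conditional probabilities under the logistic model:
    # plog(v, w)_i = exp(w^T a_i + v b_i) / (1 + exp(w^T a_i + v b_i))
    z = A @ w + v * b
    return 1.0 / (1.0 + np.exp(-z))

def lavg(v, w, A, b):
    # average logistic loss (1/m) sum_i f(w^T a_i + v b_i)
    z = A @ w + v * b
    return np.mean(f_logistic(z))

# sanity check from the text: lavg(0, 0) = log 2
rng = np.random.default_rng(0)
m, n = 50, 10
X = rng.standard_normal((m, n))
b = rng.choice([-1.0, 1.0], size=m)
A = b[:, None] * X            # rows a_i = b_i x_i
assert np.isclose(lavg(0.0, np.zeros(n), A, b), np.log(2.0))
```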

9 For example, when w ≠ 0, we can form the logistic classifier φ(x) = sgn(w^T x + v), (3) where sgn(z) = +1 if z > 0 and −1 if z ≤ 0, which picks the more likely outcome, given x, according to the logistic model. [sent-57, score-0.274]

10 The ℓ2-regularized logistic regression problem is: minimize lavg(v, w) + λ‖w‖_2^2 = (1/m) ∑_{i=1}^m f(w^T a_i + v b_i) + λ ∑_{i=1}^n w_i^2. (4) [sent-64, score-0.509]

11 Here λ > 0 is the regularization parameter, used to control the trade-off between the average logistic loss and the size of the weight vector, as measured by the ℓ2-norm. [sent-65, score-0.221]

12 Newton’s method and variants are very effective for small and medium sized problems, while conjugate-gradients and limited memory Newton (or truncated Newton) methods can handle very large problems. [sent-72, score-0.203]

13 For large-scale iterative methods such as truncated Newton or CG, the convergence typically improves as the regularization parameter λ is increased, since (roughly speaking) this makes the objective more quadratic, and improves the conditioning of the problem. [sent-77, score-0.231]

14 The ℓ1-regularized logistic regression problem is: minimize lavg(v, w) + λ‖w‖_1 = (1/m) ∑_{i=1}^m f(w^T a_i + v b_i) + λ ∑_{i=1}^n |w_i|, (5) where λ > 0 is the regularization parameter. [sent-80, score-0.593]
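
A one-function sketch of the ℓ1-regularized objective (5), reusing lavg and np from the snippet above; lam stands for the regularization parameter λ.

```python
def l1_objective(v, w, A, b, lam):
    # l1-regularized logistic regression objective (5):
    # lavg(v, w) + lambda * ||w||_1
    return lavg(v, w, A, b) + lam * np.sum(np.abs(w))
```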

15 Any solution of the ℓ1-regularized logistic regression problem (5) can be interpreted in a Bayesian framework as a MAP estimate of w and v, when w has a Laplacian prior distribution and v has the (improper) uniform prior. [sent-83, score-0.213]

16 Despite the additional computational challenge posed by ℓ1-regularized logistic regression, compared to ℓ2-regularized logistic regression, interest in its use has been growing. [sent-85, score-0.274]

17 When w_j = 0, the associated logistic model does not use the jth component of the feature vector, so sparse w corresponds to a logistic model that uses only a few of the features, that is, components of the feature vector. [sent-88, score-0.358]

18 A logistic model with sparse w is, in a sense, simpler or more parsimonious than one with nonsparse w. [sent-92, score-0.221]

19 We can solve the ℓ1-regularized LRP (5) by solving an equivalent formulation with linear inequality constraints: minimize lavg(v, w) + λ 1^T u subject to −u_i ≤ w_i ≤ u_i, i = 1, ..., n. [sent-127, score-0.277]

20 The main goal of this paper is to describe a specialized interior-point method for solving the ℓ1-regularized logistic regression problem that is very efficient for problems of all sizes. [sent-146, score-0.213]

21 In many applications, the regularization path (or some portion) needs to be computed, in order to determine an appropriate value of λ. [sent-155, score-0.177]

22 Using this fact, the entire regularization path in a (small or medium-size) ℓ1-regularized linear regression problem can be computed efficiently (Hastie et al.). [sent-158, score-0.309]

23 Park and Hastie (2006a) describe an algorithm for (approximately) computing the entire regularization path for generalized linear models (GLMs), including logistic regression models. [sent-165, score-0.353]

24 When the number of kinks on the portion of the regularization path of interest is modest, however, these path-following methods can be very fast, requiring just a small multiple of the effort needed to solve one regularized problem to compute the whole path (or a portion). [sent-168, score-0.374]

25 This is essential when computing the regularization path in a large-scale problem. [sent-173, score-0.177]

26 Our method is far more efficient than path-following methods in computing a good approximation of the regularization path for a medium-sized or large data set. [sent-175, score-0.177]

27 In Section 6, we consider the problem of computing the regularization path efficiently, at a variety of values of λ (but potentially far fewer than the number of kinks on the path). [sent-185, score-0.213]

28 Optimality Conditions: The objective function of the ℓ1-regularized LRP, lavg(v, w) + λ‖w‖_1, is convex but not differentiable, so we use a first-order optimality condition based on subdifferential calculus (see Bertsekas, 1999, Prop. [sent-192, score-0.224]

29 This occurs if and only if b^T(1 − plog(v, 0)) = 0 and ‖(1/m) A^T(1 − plog(v, 0))‖_∞ ≤ λ. [sent-202, score-0.376]

30 Using this value of v, the second condition becomes λ ≥ λmax = ‖(1/m) A^T(1 − plog(log(m+/m−), 0))‖_∞. [sent-204, score-0.188]

31 The number λmax gives us an upper bound on the useful range of the regularization parameter λ: for any larger value of λ, the logistic model obtained from ℓ1-regularized LR has weight zero (and therefore has no ability to classify). [sent-205, score-0.221]

32 We can give a more explicit formula for λmax: λmax = (1/m) ‖ (m−/m) ∑_{b_i=1} a_i + (m+/m) ∑_{b_i=−1} a_i ‖_∞ = (1/m) ‖X^T b̃‖_∞, where b̃_i = m−/m if b_i = 1 and b̃_i = −m+/m if b_i = −1, i = 1, ..., m. [sent-207, score-0.328]
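
A sketch of this explicit λmax formula, under the assumption that X holds the raw feature vectors x_i as rows and b the ±1 labels (the names are illustrative); NumPy is used as in the earlier snippets.

```python
def lambda_max(X, b):
    # lambda_max = (1/m) * || X^T btilde ||_inf, with
    # btilde_i = m_minus/m if b_i = +1, and -m_plus/m if b_i = -1
    m = b.shape[0]
    m_plus = np.sum(b == 1)
    m_minus = np.sum(b == -1)
    btilde = np.where(b == 1, m_minus / m, -m_plus / m)
    return np.linalg.norm(X.T @ btilde, np.inf) / m
```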

33 Dual Problem: To derive a Lagrange dual of the ℓ1-regularized LRP (5), we first introduce a new variable z ∈ R^m, as well as new equality constraints z_i = w^T a_i + v b_i, i = 1, ..., m. [sent-214, score-0.269]

34 We obtain the equivalent problem: minimize (1/m) ∑_{i=1}^m f(z_i) + λ‖w‖_1 subject to z_i = w^T a_i + v b_i, i = 1, ..., m. [sent-217, score-0.194]

35 For general background on convex duality and conjugates, see, for example, Boyd and Vandenberghe (2004, Chap. [sent-223, score-0.221]

36 They are related by θ* = (1/m)(1 − plog(v*, w*)). [sent-235, score-0.188]

37 We also note that the dual problem (10) can be derived starting from the equivalent problem (6), by introducing new variables zi (as we did in (9)), and associating dual variables θ+ ≥ 0 for the inequalities w ≤ u, and θ− ≥ 0 for the inequalities −u ≤ w. [sent-236, score-0.19]

38 Define v̄ as v̄ = arg min_v lavg(v, w), (12) that is, v̄ is the optimal intercept for the weight vector w, characterized by b^T(1 − plog(v̄, w)) = 0. [sent-240, score-0.404]

39 Now, we define θ̄ as θ̄ = (s/m)(1 − plog(v̄, w)), (13) where the scaling constant s is s = min{ mλ / ‖A^T(1 − plog(v̄, w))‖_∞, 1 }. [sent-241, score-0.376]

40 This is a one-dimensional smooth convex optimization problem, which can be solved very efficiently, for example, by a bisection method on the optimality condition b^T(1 − plog(v, w)) = 0, since the left-hand side is a monotone function of v. [sent-244, score-0.267]

41 The difference between the primal objective value of (v, w) and the associated lower bound G(θ̄) is called the duality gap, and denoted η: η(v, w) = lavg(v, w) + λ‖w‖_1 − G(θ̄) = (1/m) ∑_{i=1}^m ( f(w^T a_i + v b_i) + f*(−m θ̄_i) ) + λ‖w‖_1. [sent-247, score-0.391]
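
The following sketch puts (12), (13), and the duality gap together; it assumes the standard conjugate of the logistic loss, f*(y) = (−y)log(−y) + (1+y)log(1+y) on −1 ≤ y ≤ 0, uses scipy's brentq as the one-dimensional solver for the optimal intercept, and reuses plog, f_logistic, and np from the earlier snippets; the bisection bracket [−50, 50] is an arbitrary illustrative choice.

```python
from scipy.optimize import brentq

def f_conj(y):
    # conjugate of f(z) = log(1 + exp(-z)); finite only for -1 <= y <= 0
    # (the convention 0 log 0 = 0 is used at the endpoints)
    return np.where(y < 0, -y * np.log(np.maximum(-y, 1e-300)), 0.0) + \
           np.where(y > -1, (1 + y) * np.log(np.maximum(1 + y, 1e-300)), 0.0)

def dual_point_and_gap(v, w, A, b, lam):
    # optimal intercept vbar for this w, from b^T (1 - plog(vbar, w)) = 0;
    # the left-hand side is monotone in v, so a scalar root-finder suffices
    h = lambda vv: b @ (1.0 - plog(vv, w, A, b))
    vbar = brentq(h, -50.0, 50.0)
    r = 1.0 - plog(vbar, w, A, b)
    m = b.shape[0]
    # scaling s = min{ m lam / ||A^T r||_inf, 1 } makes theta = (s/m) r dual feasible
    s = min(m * lam / np.linalg.norm(A.T @ r, np.inf), 1.0)
    theta = (s / m) * r
    # duality gap: lavg(v, w) + lam ||w||_1 - G(theta)
    #            = (1/m) sum_i [ f(w^T a_i + v b_i) + f*(-m theta_i) ] + lam ||w||_1
    z = A @ w + v * b
    eta = np.mean(f_logistic(z) + f_conj(-m * theta)) + lam * np.sum(np.abs(w))
    return theta, eta
```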

42 An Interior-Point Method: In this section we describe an interior-point method for solving the ℓ1-regularized LRP in the equivalent formulation: minimize lavg(v, w) + λ 1^T u subject to −u_i ≤ w_i ≤ u_i, i = 1, ..., n. [sent-251, score-0.246]

43 Logarithmic Barrier and Central Path: The logarithmic barrier for the bound constraints −u_i ≤ w_i ≤ u_i is Φ(w, u) = − ∑_{i=1}^n log(u_i + w_i) − ∑_{i=1}^n log(u_i − w_i) = − ∑_{i=1}^n log(u_i^2 − w_i^2), with domain dom Φ = {(w, u) ∈ R^n × R^n | |w_i| < u_i, i = 1, ..., n}. [sent-256, score-0.2]
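
A small sketch of the barrier and the centering objective; the form φt(v, w, u) = t (lavg(v, w) + λ 1^T u) + Φ(w, u) is the standard primal barrier construction and is assumed here (it is consistent with the 2n/t gap quoted below).

```python
def log_barrier(w, u):
    # Phi(w, u) = - sum_i log(u_i^2 - w_i^2), finite only if |w_i| < u_i for all i
    d = u**2 - w**2
    if np.any(d <= 0):
        return np.inf
    return -np.sum(np.log(d))

def phi_t(t, v, w, u, A, b, lam):
    # centering objective: t * (lavg + lam * 1^T u) + Phi(w, u)
    return t * (lavg(v, w, A, b) + lam * np.sum(u)) + log_barrier(w, u)
```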

44 With the point (v*(t), w*(t), u*(t)) we associate θ̄(t) = (1/m)(1 − plog(v*(t), w*(t))), which can be shown to be dual feasible. [sent-266, score-0.263]

45 The associated duality gap satisfies lavg(v*(t), w*(t)) + λ‖w*(t)‖_1 − G(θ̄(t)) ≤ lavg(v*(t), w*(t)) + λ 1^T u*(t) − G(θ̄(t)) = 2n/t. [sent-269, score-0.687]

46 The construction of a dual feasible point and duality gap, in steps 4–6, is explained in Section 2. [sent-308, score-0.251]

47 If (v, w, u) is on the central path, that is, φt is minimized, the duality gap is η = 2n/t. [sent-321, score-0.329]

48 Thus t̂ is the value of t for which the associated central point has the same duality gap as the current point. [sent-322, score-0.329]

49 Another interpretation is that if t were held constant at t = t̂, (v, w, u) would converge to (v*(t̂), w*(t̂), u*(t̂)), at which point the duality gap would be exactly η. [sent-323, score-0.329]

50 In the first case, the duality gap η converges to zero, so the algorithm must exit. [sent-332, score-0.329]

51 Therefore the duality gap converges to η = 2n/t̄. [sent-335, score-0.329]

52 The gradient g = ∇φt(v, w, u) is given by g = (g1, g2, g3) ∈ R^{2n+1}, where g1 = ∇v φt(v, w, u) = −(t/m) b^T(1 − plog(v, w)) ∈ R. [sent-340, score-0.188]

53 The second block is g2 = ∇w φt(v, w, u) = −(t/m) A^T(1 − plog(v, w)) + ( 2w_i/(u_i^2 − w_i^2) )_{i=1}^n ∈ R^n. [sent-342, score-0.188]

54 In reporting card(w), we consider a component w_i to be zero when (1/m) ( A^T(1 − plog(v, w)) )_i ≤ τλ. [sent-424, score-0.223]

55 The vertical axis shows duality gap, and the horizontal axis shows iteration number, which is the natural measure of computational effort when dense linear algebra methods are used. [sent-468, score-0.211]

56 The figures show that the algorithm has linear convergence, with the duality gap decreasing by a roughly constant factor per iteration. [sent-469, score-0.329]

57 When the search direction in Newton’s method is computed approximately, using an iterative method such as PCG, the overall algorithm is called a conjugate gradient Newton method, or a truncated Newton method (Ruszczynski, 2006; Dembo and Steihaug, 1983). [sent-494, score-0.178]

58 repeat: k := k + 1; z := H p_k; θ_k := y_{k−1}^T r_{k−1} / p_k^T z; x_k := x_{k−1} + θ_k p_k; r_k := r_{k−1} − θ_k z; y_k := P^{−1} r_k; µ_{k+1} := y_k^T r_k / y_{k−1}^T r_{k−1}; p_{k+1} := y_k + µ_{k+1} p_k; until ‖r_k‖_2 / ‖g‖_2 ≤ εpcg or k = Npcg. [sent-507, score-0.286]
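
A direct NumPy transcription of the PCG iteration above, applied to the Newton system H Δ = −g; Hmul (Hessian-vector product) and Pinv (the preconditioner solve P^{-1} r) are assumed to be supplied as callables, which is where the cheap sparse matrix-vector products enter.

```python
def pcg(Hmul, Pinv, g, eps_pcg, N_pcg):
    # approximately solve H x = -g with preconditioned conjugate gradients
    x = np.zeros_like(g)
    r = -g.copy()                 # residual of H x = -g at x = 0
    y = Pinv(r)
    p = y.copy()
    g_norm = np.linalg.norm(g)
    for k in range(1, N_pcg + 1):
        z = Hmul(p)
        theta = (y @ r) / (p @ z)       # theta_k = y_{k-1}^T r_{k-1} / p_k^T z
        x = x + theta * p
        r_new = r - theta * z
        y_new = Pinv(r_new)
        mu = (y_new @ r_new) / (y @ r)  # mu_{k+1} = y_k^T r_k / y_{k-1}^T r_{k-1}
        p = y_new + mu * p
        r, y = r_new, y_new
        if np.linalg.norm(r) / g_norm <= eps_pcg:   # ||r_k||_2 / ||g||_2 <= eps_pcg
            break
    return x
```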

59 Truncated Newton Interior-Point Method: The truncated Newton interior-point method is the same as the interior-point algorithm described in Section 3, with the search direction computed using the PCG algorithm. [sent-513, score-0.178]

60 The cost of computing H pk is O(p) flops when A is sparse with p nonzero elements. [sent-515, score-0.177]

61 The Hessian can be written as H = t∇2 lavg (v, w) + ∇2 Φ(w, u). [sent-518, score-0.179]

62 To obtain the preconditioner, we replace the first term with its diagonal part, to get P = diag(t∇²lavg(v, w)) + ∇²Φ(w, u) = [ d0 0 0; 0 D3 D2; 0 D2 D1 ], (17) where d0 = t b^T D0 b and D3 = diag(t A^T D0 A) + D1. [sent-519, score-0.235]

63 The PCG relative tolerance parameter εpcg has to be carefully chosen to obtain good efficiency in a truncated Newton method. [sent-533, score-0.193]

64 If the tolerance is too small, too many PCG steps are needed to compute each search direction; if the tolerance is too high, then the computed search directions do not give adequate reduction in duality gap per iteration. [sent-534, score-0.483]

65 We set εpcg = min{ 0.1, ξη/‖g‖_2 }, (18) where g is the gradient and η is the duality gap at the current iterate. [sent-536, score-0.329]

66 In other words, we solve the Newton system with low accuracy (but never worse than 10%) at early iterations, and solve it more accurately as the duality gap decreases. [sent-540, score-0.391]
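
As a one-line sketch of the adaptive rule (18), reusing np from above; xi stands for the constant ξ:

```python
def pcg_tolerance(xi, eta, g):
    # eps_pcg = min{0.1, xi * eta / ||g||_2}: loose while the duality gap is large,
    # tighter as the gap eta shrinks
    return min(0.1, xi * eta / np.linalg.norm(g))
```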

67 In extensive testing, we found the truncated Newton interior-point method to be very efficient, requiring a total number of PCG steps ranging between a few hundred (for medium-size problems) and several thousand (for large problems). [sent-543, score-0.248]

68 For medium size (and sparse) problems it was faster than the basic interior-point method; moreover the truncated Newton interior-point method was able to solve very large problems, for which forming the Hessian H (let alone computing the search direction) would be prohibitively expensive. [sent-544, score-0.265]

69 The lefthand plot shows the duality gap versus outer iterations; the righthand plot shows duality gap versus cumulative PCG iterations, which is the more accurate measure of computational effort. [sent-578, score-0.692]

70 As a very rough comparison with the direct method applied to this sparse problem, the truncated Newton interior-point method is much more efficient than the basic interior-point method, which does not exploit the sparsity of the data. [sent-582, score-0.263]

71 Figure 4 shows the progress of the algorithm, with duality gap versus iteration (lefthand plot), and duality gap versus cumulative PCG iteration (righthand plot). [sent-618, score-0.658]

72 The increase in running time, for decreasing λ, is due primarily to an increase in the average number of PCG iterations required per iteration, but also to an increase in the overall number of iterations required. [sent-628, score-0.18]

73 Figure 5: Runtime of the truncated Newton interior-point method, for randomly generated sparse problems, with three values of λ (horizontal axis: number of features n, from 10^2 to 10^7). [sent-639, score-0.231]

74 We compare the runtimes of the truncated Newton interior-point method and the basic interior-point method, which uses dense linear algebra methods to compute the search direction. [sent-647, score-0.213]

75 The truncated Newton interior-point method is far more efficient for medium problems. [sent-650, score-0.203]

76 The number of cold-start iterations required is always near 36, while the number of warm-start iterations varies, but is always smaller, and typically much smaller, with an average value of 3. [sent-693, score-0.18]
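
A sketch of the warm-start strategy for approximating the regularization path: λ values are visited in decreasing order and each solve is initialized at the previous solution; solve_l1lr and its (v0, w0) keyword arguments are hypothetical stand-ins for whichever solver is used (e.g., the truncated Newton interior-point method).

```python
def regularization_path(X, b, lambdas, solve_l1lr):
    # lambdas should start near lambda_max and decrease
    A = b[:, None] * X
    n = X.shape[1]
    v, w = 0.0, np.zeros(n)          # cold start only for the first value
    path = []
    for lam in sorted(lambdas, reverse=True):
        v, w = solve_l1lr(A, b, lam, v0=v, w0=w)   # warm start at previous solution
        path.append((lam, v, w.copy()))
    return path
```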

77 Comparison: In this section we compare the performance of our basic and truncated Newton interior-point methods, implemented in C (called l1_logreg), with several existing methods for ℓ1-regularized logistic regression. We make comparisons with MOSEK (MOSEK ApS, 2002), IRLS-LARS (Lee et al. [sent-725, score-0.284]

78 We ran it until the primal objective is within tolerance of the optimal objective value, which is computed using l1_logreg with a small (10^−12) duality gap. [sent-741, score-0.258]

79 The stopping criterion is based on lack of progress, and not on a suboptimality bound or duality gap. [sent-747, score-0.221]

80 glmpath uses a path-following method for generalized linear models, including logistic models, and computes a portion of the regularization path. [sent-749, score-0.302]

81 [Figure: card(w) and cumulative PCG iterations along the regularization path, comparing cold start and warm start.] [sent-779, score-0.181]

82 We also report the actual accuracy achieved, that is, the difference between the achieved primal objective and the optimal objective value (as computed by l1_logreg with duality gap 10^−12). [sent-792, score-0.41]

83 We report comparison results using four data sets: the leukemia cancer gene expression and spambase data sets (two dense benchmark data sets used in Section 4.1), the Internet advertisements data set (the medium sparse data set used in Section 5), and the 20 Newsgroups data set. [sent-795, score-0.24] [sent-796, score-0.266]

85 The solvers could handle the standardized Internet advertisements data set but do not exploit the sparse plus rank-one structure of the standardized data matrix. [sent-800, score-0.471]

86 Therefore, the Internet advertisements data set was not standardized as well. [sent-801, score-0.236]

87 2GHz Pentium IV under Linux; for the medium and large problems (Internet advertisements and 20 Newsgroups), the solvers were run on an AMD Opteron 254 (with 8GB RAM) under Linux. [sent-803, score-0.223]

88 MOSEK could not handle the unstandardized large 20 Newsgroups data set, so we compared l1_logreg with BBR and IRLS-LARS-MC on the unstandardized 20 Newsgroups data set, for the regularization parameter λ = 0. [sent-820, score-0.219]

89 We consider the leukemia cancer gene expression data set and the Internet advertisements data set as benchmark examples of small dense and medium sparse problems, respectively. [sent-824, score-0.438]

90 glmpath finds an approximation of the regularization path by choosing the kink points adaptively over the same interval. [sent-830, score-0.258]

91 Table 3: Comparison results with two standardized data sets (leukemia and spambase) and two unstandardized data sets (Internet advertisements and 20 Newsgroups). [sent-849, score-0.281]

92 Table 4: Regularization path computation time (in seconds) of the warm-start method and glmpath for standardized leukemia cancer gene expression data and Internet advertisements data. [sent-858, score-0.547]

93 The warm-start method is more efficient than glmpath for the Internet advertisements data set, a medium-sized sparse problem. [sent-861, score-0.291]

94 The performance discrepancy is partially explained by the fact that our warm-start method exploits the sparse plus rank-one structure of the standardized data matrix, whereas glmpath does not. [sent-862, score-0.275]

95 In ℓ1-regularized (binary) classification, we have y_i ∈ {−1, +1} (binary labels), and z_i has the form z_i = y_i(w^T x_i + v), so we have the form (19) with a_i = y_i x_i, b_i = y_i, and c_i = 0. [sent-873, score-0.185]

96 As with ℓ1-regularized logistic regression, we can derive a bound on the suboptimality of (v, w) by constructing a dual feasible point θ̄ from an arbitrary w, with entries built from φ′(w^T a_i + v̄ b_i + c_i), i = 1, ..., m. [sent-885, score-0.257]

97 Here v̄ is the optimal intercept for the offset w, v̄ = arg min_v (1/m) ∑_{i=1}^m φ(w^T a_i + v b_i + c_i), and the scaling constant s is given by s = min{ mλ / ‖A^T p(v̄, w)‖_∞, 1 }. [sent-888, score-0.191]

98 Using this method for cheaply computing a dual feasible point and associated duality gap for any (v, w), we can extend the custom interior-point method for ℓ1-regularized LRPs to general ℓ1-regularized convex (twice differentiable) loss problems. [sent-889, score-0.524]

99 We carry out logistic regression (possibly regularized) using the data matrix Astd in place of A, to obtain (standardized) logistic model parameters w_std, v_std. [sent-904, score-0.313]
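
A sketch of the standardization step, assuming the usual convention of shifting each feature column of X to zero mean and unit standard deviation before forming Astd = diag(b) Xstd; the exact convention in the paper may differ, so this is only illustrative.

```python
def standardize(X, b):
    # column-wise standardization of the raw feature matrix
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    X_std = (X - mu) / sigma
    # data matrix for the (standardized) logistic regression problem
    A_std = b[:, None] * X_std
    return A_std, mu, sigma
```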

100 A simple and efficient algorithm for gene selection using sparse logistic regression. [sent-1394, score-0.271]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pcg', 0.368), ('lrp', 0.242), ('newton', 0.228), ('plog', 0.188), ('lavg', 0.179), ('duality', 0.176), ('nterior', 0.17), ('oint', 0.161), ('oyd', 0.161), ('gap', 0.153), ('koh', 0.152), ('truncated', 0.147), ('logistic', 0.137), ('cale', 0.137), ('egularized', 0.137), ('advertisements', 0.126), ('arge', 0.118), ('wt', 0.11), ('standardized', 0.11), ('vbi', 0.108), ('egression', 0.104), ('ogistic', 0.104), ('ethod', 0.103), ('astd', 0.099), ('lrps', 0.099), ('path', 0.093), ('max', 0.091), ('lr', 0.091), ('iterations', 0.09), ('newsgroups', 0.087), ('leukemia', 0.087), ('sparse', 0.084), ('regularization', 0.084), ('mosek', 0.084), ('tat', 0.081), ('glmpath', 0.081), ('tbt', 0.081), ('internet', 0.077), ('dual', 0.075), ('psfrag', 0.07), ('spambase', 0.068), ('hessian', 0.068), ('im', 0.066), ('card', 0.065), ('boyd', 0.064), ('replacements', 0.063), ('hred', 0.063), ('preconditioned', 0.061), ('bi', 0.059), ('nonzero', 0.059), ('diag', 0.056), ('medium', 0.056), ('hastie', 0.056), ('bbr', 0.054), ('colon', 0.054), ('ops', 0.054), ('lasso', 0.051), ('runtime', 0.051), ('gene', 0.05), ('vandenberghe', 0.048), ('rk', 0.046), ('ai', 0.046), ('bt', 0.046), ('tolerance', 0.046), ('preconditioner', 0.046), ('standardization', 0.046), ('cand', 0.045), ('interiorpoint', 0.045), ('logreg', 0.045), ('suboptimality', 0.045), ('unstandardized', 0.045), ('convex', 0.045), ('rosset', 0.044), ('donoho', 0.043), ('solvers', 0.041), ('park', 0.04), ('zi', 0.04), ('outcome', 0.04), ('regression', 0.039), ('custom', 0.038), ('regularized', 0.037), ('intercept', 0.037), ('primal', 0.036), ('abs', 0.036), ('kinks', 0.036), ('prob', 0.036), ('smin', 0.036), ('rn', 0.035), ('wi', 0.035), ('dense', 0.035), ('geophysics', 0.034), ('lefthand', 0.034), ('pk', 0.034), ('sparsity', 0.032), ('descent', 0.032), ('ui', 0.032), ('zou', 0.031), ('solve', 0.031), ('logarithmic', 0.031), ('search', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.

2 0.10593387 42 jmlr-2007-Infinitely Imbalanced Logistic Regression

Author: Art B. Owen

Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data

3 0.090508074 90 jmlr-2007-Value Regularization and Fenchel Duality

Author: Ryan M. Rifkin, Ross A. Lippert

Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning

4 0.068437621 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks

5 0.061952159 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron

6 0.059017312 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

7 0.053428877 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

8 0.048945773 77 jmlr-2007-Stagewise Lasso

9 0.04400865 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

10 0.043424949 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

11 0.042223398 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

12 0.04160466 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

13 0.040487442 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

14 0.040131681 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring

15 0.039852098 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

16 0.039347485 35 jmlr-2007-General Polynomial Time Decomposition Algorithms     (Special Topic on the Conference on Learning Theory 2005)

17 0.039328866 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

18 0.038779344 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

19 0.038165167 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

20 0.036432568 52 jmlr-2007-Margin Trees for High-dimensional Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.217), (1, -0.024), (2, 0.032), (3, 0.027), (4, 0.04), (5, -0.011), (6, -0.247), (7, 0.052), (8, -0.004), (9, 0.011), (10, 0.094), (11, -0.054), (12, -0.016), (13, 0.062), (14, 0.129), (15, 0.205), (16, 0.059), (17, -0.069), (18, -0.227), (19, -0.225), (20, -0.089), (21, -0.165), (22, 0.041), (23, 0.075), (24, 0.072), (25, 0.096), (26, 0.052), (27, 0.117), (28, 0.044), (29, 0.103), (30, 0.058), (31, 0.046), (32, 0.01), (33, -0.165), (34, 0.048), (35, -0.023), (36, 0.067), (37, -0.021), (38, -0.009), (39, -0.004), (40, 0.046), (41, -0.1), (42, 0.11), (43, -0.136), (44, 0.095), (45, -0.085), (46, -0.122), (47, -0.046), (48, 0.067), (49, -0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95156676 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.

2 0.56248361 42 jmlr-2007-Infinitely Imbalanced Logistic Regression

Author: Art B. Owen

Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data

3 0.45303416 90 jmlr-2007-Value Regularization and Fenchel Duality

Author: Ryan M. Rifkin, Ross A. Lippert

Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning

4 0.436474 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire

Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling

5 0.39586213 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks

6 0.38074571 77 jmlr-2007-Stagewise Lasso

7 0.3364448 35 jmlr-2007-General Polynomial Time Decomposition Algorithms     (Special Topic on the Conference on Learning Theory 2005)

8 0.28862488 15 jmlr-2007-Bilinear Discriminant Component Analysis

9 0.28722614 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

10 0.28594846 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

11 0.25571597 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

12 0.23502974 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

13 0.22945222 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

14 0.22879136 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

15 0.22683354 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

16 0.20629096 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

17 0.19094232 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

18 0.18596685 44 jmlr-2007-Large Margin Semi-supervised Learning

19 0.18589926 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

20 0.18198517 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.016), (8, 0.019), (10, 0.026), (11, 0.01), (12, 0.019), (15, 0.031), (22, 0.015), (28, 0.051), (40, 0.093), (45, 0.018), (48, 0.022), (60, 0.032), (80, 0.013), (85, 0.055), (98, 0.088), (99, 0.429)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73252422 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.

2 0.32348531 90 jmlr-2007-Value Regularization and Fenchel Duality

Author: Ryan M. Rifkin, Ross A. Lippert

Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning

3 0.32251745 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire

Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling

4 0.31295681 77 jmlr-2007-Stagewise Lasso

Author: Peng Zhao, Bin Yu

Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulted sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences

5 0.31223947 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

6 0.30306101 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

7 0.30124509 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

8 0.30113709 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

9 0.30010349 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

10 0.29860991 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

11 0.29794934 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

12 0.29500544 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

13 0.29438397 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

14 0.29421034 42 jmlr-2007-Infinitely Imbalanced Logistic Regression

15 0.29368937 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

16 0.29353574 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

17 0.29342398 39 jmlr-2007-Handling Missing Values when Applying Classification Models

18 0.29303259 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

19 0.29248807 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

20 0.29196003 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition