nips nips2008 nips2008-21 knowledge-graph by maker-knowledge-mining

21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations


Source: pdf

Author: Pierre Garrigues, Laurent El Ghaoui

Abstract: It has been shown that the problem of ℓ1-penalized least-squares regression, commonly referred to as the Lasso or Basis Pursuit DeNoising, leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. [sent-8, score-0.859]

2 We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. [sent-9, score-0.439]

3 Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. [sent-10, score-0.806]

4 We also propose an algorithm to automatically update the regularization parameter after observing a new data point. [sent-11, score-0.39]

5 We wish to fit a linear model to predict the response yi as a function of xi and a feature vector θ ∈ Rm, yi = xi^T θ + νi, where νi represents the noise in the observation. [sent-19, score-0.146]

6 The Lasso optimization problem is given by min_θ (1/2) Σ_{i=1}^n (xi^T θ − yi)² + µn ‖θ‖_1, (1) where µn is a regularization parameter. [sent-20, score-0.29]
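
As a concrete reference for objective (1), here is a minimal NumPy sketch of the ℓ1-penalized least-squares objective; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def lasso_objective(theta, X, y, mu):
    """Objective (1): 0.5 * sum_i (x_i^T theta - y_i)^2 + mu * ||theta||_1."""
    residual = X @ theta - y
    return 0.5 * residual @ residual + mu * np.abs(theta).sum()
```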

7 The solution θ has few entries that are non-zero, and therefore identifies which dimensions in xi are useful to predict yi. [sent-23, score-0.197]

8 Homotopy methods have also been applied to the Lasso to compute the full regularization path when λ varies [5] [6][7]. [sent-27, score-0.306]
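
For illustration, such a regularization path can be computed with scikit-learn's LARS implementation; this is a generic example on synthetic data, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = X[:, :3] @ np.array([1.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(50)

# method="lasso" returns the piecewise-linear Lasso regularization path:
# one column of `coefs` per kink (transition) in the path.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas.shape, coefs.shape)
```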

9 They are particularly efficient when the solution is very sparse [8]. [sent-28, score-0.207]

10 We propose an algorithm to compute the solution of the Lasso when the training examples (yi, xi), i = 1, ..., n, are observed sequentially. [sent-30, score-0.225]

11 Let θ(n) be the solution of the Lasso after observing n training examples and θ(n+1) the solution after observing a new data point (yn+1 , xn+1 ) ∈ R × Rm . [sent-34, score-0.445]

12 We introduce an optimization problem that allows us to compute an homotopy from θ(n) to θ(n+1) . [sent-35, score-0.511]

13 Hence we use the previously computed solution as a “warm-start”, which makes our method particularly efficient when the supports of θ(n) and θ(n+1) are close. [sent-36, score-0.144]

14 In Section 2 we review the optimality conditions of the Lasso, which we use in Section 3 to derive our algorithm. [sent-37, score-0.132]

15 We test in Section 4 our algorithm numerically, and show applications to compressive sensing with sequential observations and leave-one-out cross-validation. [sent-38, score-0.49]

16 We also propose an algorithm to automatically select the regularization parameter each time we observe a new data point. [sent-39, score-0.283]

17 The optimality conditions for the Lasso are given by X^T(Xθ − y) + µn v = 0, v ∈ ∂‖θ‖_1. [sent-47, score-0.132]

18 We define the active set as the indices of the elements of θ that are non-zero. [sent-48, score-0.264]

19 To simplify notation we assume that the active set appears first, i.e. θ^T = (θ1^T, θ2^T), where θ1 contains the non-zero coefficients and θ2 = 0. [sent-49, score-0.258]

20 Let X = (X1 X2) be the partitioning of X according to the active set. [sent-52, score-0.227]

21 If the solution is unique it can be shown that X1^T X1 is invertible, and we can rewrite the optimality conditions as θ1 = (X1^T X1)^{-1}(X1^T y − µn v1). [sent-53, score-0.327]

22 −µn v2 = X2^T(X1 θ1 − y). Note that if we know the active set and the signs of the coefficients of the solution, then we can compute it in closed form. [sent-54, score-0.417]
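
A minimal NumPy sketch of this closed-form computation (names are illustrative): given the active set and sign vector, it returns the candidate solution and checks the subgradient condition on the inactive block.

```python
import numpy as np

def lasso_solution_from_support(X, y, mu, active, v1):
    """theta1 = (X1^T X1)^{-1} (X1^T y - mu * v1); the inactive block must then
    satisfy |X2^T (X1 theta1 - y)| <= mu for this to be the Lasso solution."""
    X1 = X[:, active]
    theta1 = np.linalg.solve(X1.T @ X1, X1.T @ y - mu * v1)
    inactive = np.setdiff1d(np.arange(X.shape[1]), active)
    ok = np.all(np.abs(X[:, inactive].T @ (X1 @ theta1 - y)) <= mu + 1e-10)
    theta = np.zeros(X.shape[1])
    theta[active] = theta1
    return theta, ok
```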

23 Proposed homotopy algorithm (outline). Suppose we have computed the solution θ(n) to the Lasso with n observations and that we are given an additional observation (yn+1, xn+1) ∈ R × Rm. [sent-56, score-0.575]

24 Our goal is to compute the solution θ(n+1) of the augmented problem. [sent-57, score-0.193]

25 We propose an algorithm that computes a path from θ(n) to θ(n+1) in two steps: Step 1 Vary the regularization parameter from µn to µn+1 with t = 0. [sent-60, score-0.336]

26 This amounts to computing the regularization path between µn and µn+1 as done in Lars. [sent-61, score-0.257]

27 The solution path is piecewise linear and we do not review it in this paper (see [15][7][5]). [sent-62, score-0.297]

28 We saw in Section 2 that the solution to the Lasso can be easily computed once the active set and signs of the coefficients are known. [sent-69, score-0.512]

29 This information is available at t = 0, and we show that the active set and signs will remain the same for t in an interval [0, t∗ ) where the solution θ(t) is smooth. [sent-70, score-0.557]

30 We call a point where the active set changes a "transition point" and show how to compute it analytically. [sent-71, score-0.313]

31 At t∗ we update the active set and signs which will remain valid until t reaches the next transition point. [sent-72, score-0.814]

32 This process is iterated until we know the active set and signs of the solution at t = 1, and therefore can compute the desired solution θ(n+1) . [sent-73, score-0.705]

33 We suppose, as in Section 2 and without loss of generality, that the solution at t = 0 is such that θ(0)^T = (θ1^T, 0^T) and v^T = (v1^T, v2^T) ∈ ∂‖θ(0)‖_1 satisfy the optimality conditions. [sent-74, score-0.274]

34 There exists t∗ > 0 such that for all t ∈ [0, t∗), the solution of (2) has the same support and the same sign as θ(0). [sent-77, score-0.194]

35 We show that there exist a solution θ(t)^T = (θ1(t)^T, 0^T) and a subgradient w(t)^T = (v1^T, w2(t)^T) ∈ ∂‖θ(t)‖_1 satisfying the optimality conditions for t sufficiently small. [sent-80, score-0.276]

36 We partition xn+1^T = (xn+1,1^T, xn+1,2^T) according to the active set. [sent-81, score-0.227]

37 We rewrite the optimality conditions as X1^T(X1 θ1(t) − y) + t² xn+1,1 (xn+1,1^T θ1(t) − yn+1) + µ v1 = 0 and X2^T(X1 θ1(t) − y) + t² xn+1,2 (xn+1,1^T θ1(t) − yn+1) + µ w2(t) = 0. [sent-82, score-0.183]

38 The solution θ(t) will therefore be smooth until t reaches a transition point where either a component of θ1(t) becomes zero, or one of the components of w2(t) reaches one in absolute value. [sent-89, score-0.734]

39 We now show how to compute the value of the transition point. [sent-90, score-0.27]

40 Let X̃ = (X; xn+1^T) and ỹ = (y; yn+1) denote the augmented data; we partition X̃ = (X̃1 X̃2) according to the active set. [sent-92, score-0.227]

41 We have u = (X̃1^T X̃1)^{-1} xn+1,1, α = xn+1,1^T u, ē = yn+1 − xn+1,1^T θ̃1, and t1i = (1 + (ē ui/(−θ̃1i) − α)^{-1})^{1/2}, where θ̃1 is the solution at t = 1 for the current active set and signs. We now examine the case where a component of w2(t) reaches one in absolute value. [sent-95, score-0.192]

42 The j-th component of w2(t) will become 1 in absolute value as soon as |cj^T e + (ē(t² − 1)/(1 + (t² − 1)α))(x(j) − cj^T X̃1 u)| = µ, where cj is the j-th column of X̃2, x(j) the j-th entry of xn+1,2, and e = ỹ − X̃1 θ̃1. [sent-99, score-0.139]

43 This yields (t±j)² = 1 + (ē(x(j) − cj^T X̃1 u)/(±µ − cj^T e) − α)^{-1}. Hence the transition point will be equal to t∗ = min{mini t1i, minj t+j, minj t−j}, where we restrict ourselves to the real solutions that lie between 0 and 1. [sent-104, score-0.358]
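
A sketch of the transition-time computation, following the expressions reconstructed above; the names and the exact algebra should be read as assumptions, not as the authors' code.

```python
import numpy as np

def next_transition_time(X1t, X2t, yt, x2, mu, theta1t):
    """Smallest transition time in (0, 1] for the homotopy in t (sketch).

    X1t / X2t: active / inactive columns of the augmented design X~ (last row
    x_{n+1}^T); yt: augmented response (last entry y_{n+1}); x2: inactive part
    of x_{n+1}; theta1t: candidate solution theta~1 at t = 1 for the current
    active set and signs."""
    x1 = X1t[-1]                                 # x_{n+1,1}
    u = np.linalg.solve(X1t.T @ X1t, x1)         # (X~1^T X~1)^{-1} x_{n+1,1}
    alpha = x1 @ u
    e = yt - X1t @ theta1t                       # residual at the t = 1 candidate
    ebar = e[-1]                                 # y_{n+1} - x_{n+1,1}^T theta~1
    with np.errstate(divide="ignore", invalid="ignore"):
        # Case 1: an active coefficient theta_{1i}(t) crosses zero.
        cand = list(1.0 + 1.0 / (ebar * u / (-theta1t) - alpha))
        # Case 2: an inactive subgradient w_{2j}(t) reaches +/- 1.
        d = x2 - X2t.T @ (X1t @ u)
        for s in (+1.0, -1.0):
            cand += list(1.0 + 1.0 / (ebar * d / (s * mu - X2t.T @ e) - alpha))
    # Keep real squared times in (0, 1]; the full algorithm also discards
    # candidates at or before the previous transition point.
    times = [float(np.sqrt(c)) for c in cand if np.isfinite(c) and 0.0 < c <= 1.0]
    return min(times) if times else 1.0
```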

44 Algorithm 1 (RecLasso: homotopy algorithm for online Lasso). 1: Compute the path from θ(n) = θ(0, µn) to θ(0, µn+1). [sent-106, score-0.551]

45 2: Initialize the active set to the non-zero coefficients of θ(0, µn+1 ) and let v = sign (θ(0, µn+1 )). [sent-107, score-0.277]

46 Let v1 and xn+1,1 be the subvectors of v and xn+1 corresponding to the active set, and X̃1 the submatrix of X̃ whose columns correspond to the active set. [sent-108, score-0.454]

47 If it is smaller than the previous transition point or greater than 1, go to Step 5. [sent-112, score-0.258]

48 Case 1: the component of θ1(t∗) corresponding to the i-th coefficient goes to zero: remove i from the active set. [sent-113, score-0.317]

49 Case 2: the component of w2(t∗) corresponding to the j-th coefficient reaches one in absolute value: add j to the active set. [sent-115, score-0.465]

50 4: Update v1, X̃1 and xn+1,1 according to the updated active set. [sent-119, score-0.227]

51 5: Compute the final value at t = 1, where the values of θ(n+1) on the active set are given by θ̃1. [sent-122, score-0.227]
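
To make Step 5 concrete, the following runnable check (on assumed synthetic data, with scikit-learn standing in for a generic Lasso solver) verifies that, once the active set and signs of θ(n+1) are known, the closed form on the active set reproduces the solution of the augmented problem.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, m = 40, 10
X = rng.standard_normal((n, m))
theta0 = np.zeros(m); theta0[:3] = [2.0, -1.5, 1.0]
y = X @ theta0 + 0.1 * rng.standard_normal(n)
x_new = rng.standard_normal(m)
y_new = x_new @ theta0 + 0.1 * rng.standard_normal()

mu = 1.0  # penalty as in objective (1); sklearn's alpha = mu / n_samples
Xa, ya = np.vstack([X, x_new]), np.append(y, y_new)
solver = Lasso(alpha=mu / (n + 1), fit_intercept=False, tol=1e-10, max_iter=100000)
theta_next = solver.fit(Xa, ya).coef_

# Step 5: closed form on the final active set and signs.
active = np.flatnonzero(np.abs(theta_next) > 1e-8)
v1 = np.sign(theta_next[active])
X1 = Xa[:, active]
theta1_tilde = np.linalg.solve(X1.T @ X1, X1.T @ ya - mu * v1)
print(np.allclose(theta1_tilde, theta_next[active], atol=1e-6))  # True, up to solver tolerance
```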

52 The initialization amounts to computing the solution of the Lasso when we have only one data point (y, x) ∈ R × Rm . [sent-123, score-0.181]

53 In this case, the active set has at most one element. [sent-124, score-0.227]

54 We illustrate our algorithm on a simple numerical example in Figure 1, showing the solution path as the regularization parameter and then t are successively varied. [sent-127, score-0.448]

55 Complexity: the complexity of our algorithm is dominated by the inversion of the matrix X̃1^T X̃1 at each transition point. [sent-129, score-0.265]

56 Figure 1: Solution path for both steps of our algorithm. [sent-131, score-0.168]

57 On the left is the homotopy when the regularization parameter goes from µn to µn+1. [sent-135, score-0.697]

58 There is one transition point as θ2 becomes inactive. [sent-138, score-0.258]

59 On the right is the piecewise smooth path of θ(t) when t goes from 0 to 1. [sent-139, score-0.202]

60 We can see that θ3 becomes zero, θ2 goes from being 0 to being positive, whereas θ1 , θ4 and θ5 remain active with their signs unchanged. [sent-140, score-0.462]

61 The three transition points are shown as black dots. [sent-141, score-0.256]

62 As the update to this matrix after a transition point is rank one, the cost of computing the inverse is O(q²). [sent-142, score-0.258]
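
For the new-observation term, the rank-one refresh of the cached inverse is the standard Sherman–Morrison identity; a minimal sketch (the function name is ours, and active-set insertions and removals would use the analogous O(q²) block-inverse updates, not shown).

```python
import numpy as np

def add_observation_inverse(A_inv, x):
    """Sherman-Morrison: ((X1~^T X1~) + x x^T)^{-1} from (X1~^T X1~)^{-1} in O(q^2).

    Appending the row x to the active design adds the rank-one term x x^T,
    so the cached inverse is refreshed without an O(q^3) re-factorization."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
```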

63 Let k be the total number of transition points after varying the regularization parameter from µn to µn+1 and t from 0 to 1. [sent-143, score-0.516]

64 In practice, the size of the active set d is much smaller than q, and if it remains ∼ d throughout the homotopy, the complexity is O(kd²). [sent-145, score-0.271]

65 For the non-penalized least-squares problem the solution typically has m non-zero elements, and therefore the cost of updating it after a new observation is O(m²). [sent-147, score-0.325]

66 Hence if the solution is sparse (d small) and the active set does not change much (k small), updating the solution of the Lasso will be faster than updating the solution to the non-penalized least-squares problem. [sent-148, score-0.796]

67 Suppose that we applied Lars directly to the problem with n + 1 observations without using knowledge of θ(n) by varying the regularization parameter from a large value where the size of the active set is 0 to µn+1 . [sent-149, score-0.538]

68 The complexity of this approach is O(k′q²), where k′ is the corresponding number of transition points, and we can therefore compare the efficiency of these two approaches by comparing their numbers of transition points. [sent-151, score-0.265]

69 Applications: compressive sensing. Let θ0 ∈ Rm be an unknown vector that we wish to reconstruct. [sent-153, score-0.209]

70 This approach is known as compressive sensing [16][17] and has generated a tremendous amount of interest in the signal processing community. [sent-157, score-0.487]

71 The reconstruction is given by the solution of the Basis Pursuit (BP) problem min_θ ‖θ‖_1 subject to Xθ = y. [sent-158, score-0.214]
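
BP is a linear program; a minimal SciPy sketch of the standard reformulation θ = p − q with p, q ≥ 0 (a generic solver, not the sequential algorithms of [18][19]).

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """min ||theta||_1 subject to X theta = y, via the split theta = p - q."""
    n, m = X.shape
    c = np.ones(2 * m)                 # sum(p) + sum(q) = ||theta||_1
    A_eq = np.hstack([X, -X])          # X (p - q) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * m), method="highs")
    return res.x[:m] - res.x[m:]
```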

72 If measurements are obtained sequentially, it is advantageous to start estimating the unknown sparse signal as measurements arrive, as opposed to waiting for a specified number of measurements. [sent-159, score-0.35]

73 Algorithms to solve BP with sequential measurements have been proposed in [18][19], and it has been shown that the change in the active set gives a criterion for how many measurements are needed to recover the underlying signal [19]. [sent-160, score-0.628]

74 In the case where the measurements are noisy (σ > 0), a standard approach to recover θ0 is to solve the Basis Pursuit DeNoising problem instead [20]. [sent-161, score-0.145]

75 Hence, our algorithm is well suited for compressive sensing with sequential and noisy measurements. [sent-162, score-0.439]

76 We also compare our method to coordinate descent [11] with warm start: when receiving a new measurement, we initialize coordinate descent (CD) at the current solution. [sent-164, score-0.249]
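
A sketch of this warm-started CD baseline using scikit-learn, purely for illustration; `stream` is a hypothetical iterator of observations, and the mapping from µn in objective (1) to scikit-learn's alpha is an assumption about its 1/(2·n_samples) scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cd_with_warm_start(stream, m, mu):
    """Re-fit the Lasso after each new observation, warm-starting coordinate
    descent at the previous solution (the CD baseline described above)."""
    X, y = np.empty((0, m)), np.empty(0)
    model = Lasso(alpha=1.0, fit_intercept=False, warm_start=True)
    solutions = []
    for x_new, y_new in stream:
        X, y = np.vstack([X, x_new]), np.append(y, y_new)
        model.alpha = mu / len(y)      # objective (1) penalty mu -> sklearn alpha
        model.fit(X, y)                # warm_start=True reuses the previous coef_
        solutions.append(model.coef_.copy())
    return solutions
```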

77 We sample measurements of a model where m = 100, the vector θ0 used to sample the data has 25 non-zero elements whose values are Bernoulli ±1, xi ∼ N(0, Im), σ = 1, and we set µn = . [sent-165, score-0.139]

78 The reconstruction error decreases as the number of measurements grows (not plotted). [sent-167, score-0.136]

79 The parameter that controls the complexity of Lars and RecLasso is the number of transition points. [sent-168, score-0.312]

80 We see in Figure 2 that this quantity is consistently smaller for RecLasso, and that after 100 measurements, when the support of the solution does not change much, there are typically fewer than 5 transition points for RecLasso. [sent-169, score-0.502]

81 We observed that CD requires many iterations to converge to the optimal solution when n < m, and we found it difficult to set a stopping criterion that ensures convergence. [sent-171, score-0.144]

82 On the left is the comparison of the number of transition points for Lars and RecLasso, and on the right is the timing comparison for the three algorithms. [sent-175, score-0.291]

83 Selection of the regularization parameter. We have supposed until now a pre-determined regularization schedule, an assumption that is not practical. [sent-178, score-0.387]

84 The amount of regularization indeed depends on the variance of the noise present in the data, which is not known a priori. [sent-179, score-0.17]

85 The problem with n observations is given by θ(λ) = arg min_θ (1/(2n)) Σ_{i=1}^n (xi^T θ − yi)² + λ ‖θ‖_1. [sent-183, score-0.14]

86 We have seen previously that θ(λ) is piecewise linear, and we can therefore compute its gradient unless λ is a transition point. [sent-184, score-0.336]

87 We therefore use the new observation as a test set, which allows us to update the regularization parameter before introducing the new observation by varying t from 0 to 1. [sent-187, score-0.341]
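
The extracted text does not give the exact update rule, so the sketch below is one plausible instantiation and should be read as an assumption, not the authors' method: take a gradient step on log λ to reduce the squared prediction error on the new observation, using the piecewise-linear dependence of θ(λ) on λ.

```python
import numpy as np

def update_lambda(lmbda, theta, X, y, x_new, y_new, lr=0.01):
    """Illustrative update of the regularization parameter (assumption, see above)."""
    active = np.flatnonzero(theta)
    if active.size == 0:
        return lmbda
    X1, v1, n = X[:, active], np.sign(theta[active]), X.shape[0]
    # For the 1/(2n)-scaled objective, d theta_1 / d lambda = -n (X1^T X1)^{-1} v1
    # on any segment where lambda is not a transition point.
    dtheta = -n * np.linalg.solve(X1.T @ X1, v1)
    pred_err = x_new @ theta - y_new
    grad = pred_err * (x_new[active] @ dtheta)      # d/d lambda of 0.5 * pred_err^2
    return float(np.exp(np.log(lmbda) - lr * lmbda * grad))  # gradient step on log(lambda)
```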

88 We obtain a very similar result, and understanding the convergence properties of our proposed update rule for the regularization parameter is the object of current research. [sent-194, score-0.298]

89 For a range of values of λ, we use n − 1 data points to solve the Lasso with regularization parameter (n − 1)λ and then compute the prediction error on the data point that was left out. [sent-202, score-0.381]

90 Our proposed algorithm can be adapted to the case where we wish to update the solution of the Lasso after a data point is removed. [sent-205, score-0.302]

91 To do so, we compute the first homotopy by varying the regularization parameter from nλ to (n − 1)λ. [sent-206, score-0.74]

92 We then compute the second homotopy by varying t from 1 to 0, which has the effect of removing the data point that will be used for testing. [sent-207, score-0.598]
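
For context, this is the brute-force leave-one-out computation that the two homotopies accelerate; the scikit-learn refit below is a stand-in for the removal homotopy, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def loo_cv_error(X, y, lmbda):
    """Leave-one-out CV error for one lambda, by refitting from scratch.
    The homotopy approach instead starts from the full-data solution and runs
    the two homotopies described above (n*lambda -> (n-1)*lambda, then t: 1 -> 0)."""
    n = X.shape[0]
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        # Lasso on n-1 points with parameter (n-1)*lambda; in sklearn's
        # 1/(2*n_samples) scaling this is alpha = lambda.
        model = Lasso(alpha=lmbda, fit_intercept=False).fit(X[mask], y[mask])
        err += (X[i] @ model.coef_ - y[i]) ** 2
    return err / n
```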

93 We show in Figure 4 the histogram of the number of transition points for our algorithm when solving the Lasso with n − 1 data points (we solve this problem 10 × n times). [sent-214, score-0.367]

94 Note that in the majority of cases there are very few transition points, which makes our approach very efficient in this setting. [sent-215, score-0.221]

95 Figure 3: Evolution of the regularization parameter when using our proposed update rule. [sent-216, score-0.298]

96 Figure 4: Histogram of the number of transition points when removing an observation. [sent-217, score-0.294]

97 We use the current solution as a “warm-start” and introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. [sent-219, score-1.003]

98 The algorithm is particularly efficient if the active set does not change much, and we show a computational advantage as compared to Lars and Coordinate Descent with warm-start for applications such as compressive sensing with sequential observations and leave-one-out cross-validation. [sent-220, score-0.717]

99 We have also proposed an algorithm to automatically select the regularization parameter where each new measurement is used as a test set. [sent-221, score-0.251]

100 Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. [sent-318, score-0.316]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('homotopy', 0.431), ('lasso', 0.294), ('xn', 0.236), ('active', 0.227), ('transition', 0.221), ('compressive', 0.199), ('yn', 0.184), ('regularization', 0.17), ('sensing', 0.169), ('reclasso', 0.166), ('lars', 0.164), ('solution', 0.144), ('signs', 0.141), ('xt', 0.104), ('measurements', 0.102), ('reaches', 0.099), ('optimality', 0.094), ('path', 0.087), ('icassp', 0.086), ('compressed', 0.084), ('signal', 0.083), ('update', 0.081), ('acoustics', 0.077), ('rm', 0.076), ('yx', 0.074), ('sequential', 0.071), ('err', 0.066), ('figueiredo', 0.066), ('ipm', 0.066), ('malioutov', 0.066), ('piecewise', 0.066), ('ct', 0.065), ('sparse', 0.063), ('coordinate', 0.061), ('observing', 0.06), ('icip', 0.058), ('ima', 0.058), ('rozell', 0.058), ('toulouse', 0.058), ('pursuit', 0.058), ('cd', 0.055), ('yi', 0.053), ('absolute', 0.052), ('observations', 0.051), ('rewrite', 0.051), ('speech', 0.051), ('sign', 0.05), ('minj', 0.05), ('compute', 0.049), ('goes', 0.049), ('subdifferential', 0.047), ('parameter', 0.047), ('th', 0.046), ('initialize', 0.045), ('remain', 0.045), ('qp', 0.045), ('eecs', 0.045), ('september', 0.045), ('complexity', 0.044), ('coef', 0.044), ('international', 0.043), ('solve', 0.043), ('varying', 0.043), ('proceedings', 0.043), ('bp', 0.043), ('component', 0.041), ('descent', 0.041), ('wish', 0.04), ('denoising', 0.039), ('conditions', 0.038), ('removing', 0.038), ('vi', 0.038), ('point', 0.037), ('updating', 0.037), ('cients', 0.037), ('elements', 0.037), ('processing', 0.036), ('suppose', 0.036), ('min', 0.036), ('timing', 0.035), ('sgn', 0.035), ('march', 0.035), ('points', 0.035), ('reconstruction', 0.034), ('select', 0.034), ('thresholding', 0.034), ('histogram', 0.033), ('online', 0.033), ('cj', 0.032), ('vj', 0.032), ('propose', 0.032), ('boyd', 0.032), ('conference', 0.031), ('france', 0.031), ('notations', 0.031), ('hastie', 0.031), ('optimization', 0.031), ('topics', 0.029), ('regression', 0.029), ('elghaoui', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

Author: Pierre Garrigues, Laurent El Ghaoui

Abstract: It has been shown that the problem of ℓ1-penalized least-squares regression, commonly referred to as the Lasso or Basis Pursuit DeNoising, leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

2 0.24623206 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularization that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1-regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.19787848 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

4 0.15991321 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

5 0.1493154 138 nips-2008-Modeling human function learning with Gaussian processes

Author: Thomas L. Griffiths, Chris Lucas, Joseph Williams, Michael L. Kalish

Abstract: Accounts of how people learn functional relationships between continuous variables have tended to focus on two possibilities: that people are estimating explicit functions, or that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a Gaussian process model of human function learning that combines the strengths of both approaches. 1

6 0.14168848 99 nips-2008-High-dimensional support union recovery in multivariate regression

7 0.13597497 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

8 0.12292062 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

9 0.12054706 75 nips-2008-Estimating vector fields using sparse basis field expansions

10 0.12028393 214 nips-2008-Sparse Online Learning via Truncated Gradient

11 0.11052889 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

12 0.1091277 101 nips-2008-Human Active Learning

13 0.10590298 62 nips-2008-Differentiable Sparse Coding

14 0.09779904 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

15 0.089781016 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

16 0.086499386 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

17 0.080098607 206 nips-2008-Sequential effects: Superstition or rational behavior?

18 0.079231225 226 nips-2008-Supervised Dictionary Learning

19 0.078918472 78 nips-2008-Exact Convex Confidence-Weighted Learning

20 0.075548716 216 nips-2008-Sparse probabilistic projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.247), (1, -0.007), (2, -0.132), (3, 0.126), (4, 0.133), (5, 0.07), (6, -0.186), (7, 0.031), (8, -0.074), (9, 0.131), (10, -0.172), (11, -0.145), (12, -0.104), (13, -0.104), (14, 0.012), (15, -0.008), (16, -0.015), (17, -0.066), (18, 0.028), (19, 0.009), (20, -0.004), (21, -0.1), (22, 0.023), (23, 0.082), (24, 0.021), (25, 0.01), (26, 0.065), (27, -0.056), (28, -0.049), (29, -0.03), (30, 0.067), (31, 0.028), (32, -0.041), (33, -0.0), (34, -0.07), (35, 0.081), (36, -0.026), (37, -0.076), (38, 0.026), (39, -0.087), (40, -0.174), (41, -0.063), (42, 0.03), (43, 0.024), (44, 0.144), (45, 0.014), (46, -0.016), (47, -0.042), (48, 0.05), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96636957 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

Author: Pierre Garrigues, Laurent El Ghaoui

Abstract: It has been shown that the problem of ℓ1-penalized least-squares regression, commonly referred to as the Lasso or Basis Pursuit DeNoising, leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

2 0.6956774 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularization that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1-regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.66215551 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

4 0.64145505 179 nips-2008-Phase transitions for high-dimensional joint support recovery

Author: Sahand Negahban, Martin J. Wainwright

Abstract: Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞ error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞ (n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α) s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1,∞ ≥ 1 + δ, and converges to 0 for scalings for which θ1,∞ ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1

5 0.63713193 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Author: Tong Zhang

Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1

6 0.60821259 75 nips-2008-Estimating vector fields using sparse basis field expansions

7 0.60117346 99 nips-2008-High-dimensional support union recovery in multivariate regression

8 0.57331806 185 nips-2008-Privacy-preserving logistic regression

9 0.57147127 138 nips-2008-Modeling human function learning with Gaussian processes

10 0.55511004 101 nips-2008-Human Active Learning

11 0.53028488 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

12 0.52039504 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

13 0.51965922 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

14 0.50039744 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

15 0.45228669 62 nips-2008-Differentiable Sparse Coding

16 0.4495393 214 nips-2008-Sparse Online Learning via Truncated Gradient

17 0.44932288 249 nips-2008-Variational Mixture of Gaussian Process Experts

18 0.44667128 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction

19 0.42154613 216 nips-2008-Sparse probabilistic projections

20 0.41133618 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.102), (7, 0.101), (12, 0.021), (15, 0.01), (28, 0.167), (53, 0.237), (57, 0.069), (59, 0.03), (63, 0.043), (71, 0.011), (77, 0.062), (78, 0.012), (83, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79477078 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

Author: Pierre Garrigues, Laurent El Ghaoui

Abstract: It has been shown that the problem of ℓ1-penalized least-squares regression, commonly referred to as the Lasso or Basis Pursuit DeNoising, leads to solutions that are sparse and therefore achieves model selection. We propose in this paper RecLasso, an algorithm to solve the Lasso with online (sequential) observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and Coordinate Descent, and present an application to compressive sensing with sequential observations. Our approach can easily be extended to compute an homotopy from the current solution to the solution that corresponds to removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation. We also propose an algorithm to automatically update the regularization parameter after observing a new data point. 1

2 0.73896277 48 nips-2008-Clustering via LP-based Stabilities

Author: Nikos Komodakis, Nikos Paragios, Georgios Tziritas

Abstract: A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and the duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain, and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, can automatically determine the number of clusters, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a centerbased clustering algorithm (selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well defined LP-based quantity that plays a key role to our algorithm’s success. Furthermore, we also introduce, what we call, the margins (another key ingredient in our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potentials of our method.

3 0.71262324 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

4 0.70451659 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

5 0.70294482 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to ℓ1-regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

6 0.69995201 75 nips-2008-Estimating vector fields using sparse basis field expansions

7 0.69735456 194 nips-2008-Regularized Learning with Networks of Features

8 0.69525534 143 nips-2008-Multi-label Multiple Kernel Learning

9 0.69508922 226 nips-2008-Supervised Dictionary Learning

10 0.6927346 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.69185144 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

12 0.69142342 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

13 0.69106477 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

14 0.69063872 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

15 0.69048178 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

16 0.69033825 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

17 0.69006068 196 nips-2008-Relative Margin Machines

18 0.68881971 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

19 0.68868893 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

20 0.68858123 179 nips-2008-Phase transitions for high-dimensional joint support recovery