nips nips2011 nips2011-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. [sent-7, score-0.331]
2 The non-overlapping group structure limits its applicability in practice. [sent-8, score-0.193]
3 There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. [sent-9, score-0.159]
4 The resulting optimization is, however, much more challenging to solve due to the group overlaps. [sent-10, score-0.24]
5 In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. [sent-11, score-0.466]
6 We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. [sent-12, score-1.721]
7 We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. [sent-13, score-0.681]
8 However, in some applications [28], we are interested in finding important explanatory factors in predicting the response variable, where each explanatory factor is represented by a group of input features. [sent-20, score-0.235]
9 In such cases, the selection of important features corresponds to the selection of groups of features. [sent-21, score-0.138]
10 As an extension of the Lasso, the group Lasso [28], based on the combination of the ℓ1 norm and the ℓ2 norm, has been proposed for group feature selection, and several efficient algorithms [16, 17, 19] have been developed for its optimization. [sent-22, score-0.482]
11 However, the non-overlapping group structure in group Lasso limits its applicability in practice. [sent-23, score-0.386]
12 For example, in microarray gene expression data analysis, genes may form overlapping groups as each gene may participate in multiple pathways [12]. [sent-24, score-0.953]
13 Several recent works [3, 12, 15, 18, 29] study the overlapping group Lasso, where groups of features are given, potentially with overlaps between the groups. [sent-25, score-0.573]
14 The resulting optimization is, however, much more challenging to solve due to the group overlaps. [sent-26, score-0.24]
15 When optimizing the overlapping group Lasso problem, one can reformulate it as a second order cone program and solve it by a generic toolbox, which, however, does not scale well. [sent-27, score-0.46]
16 However, SLasso involves an expensive matrix inversion at each alternating iteration, and there is no known global convergence rate for such an alternating procedure. [sent-30, score-0.113]
17 [1] adopted the proximal gradient method for solving the overlapping group Lasso, and a fixed point method was developed to compute the proximal operator. [sent-33, score-1.223]
18 [6] employed a smoothing technique to solve the overlapping group Lasso problem. [sent-35, score-0.46]
19 Mairal [18] proposed to solve the proximal operator associated with the overlapping group Lasso defined as the sum of the ℓ∞ norms, which, however, is not applicable to the overlapping group Lasso defined as the sum of the ℓ2 norms considered in this paper. [sent-36, score-1.384]
20 In this paper, we develop an efficient algorithm for the overlapping group Lasso penalized problem via the accelerated gradient descent method. [sent-37, score-0.571]
21 The accelerated gradient descent method has recently received increasing attention in machine learning due to its fast convergence rate even for nonsmooth convex problems. [sent-38, score-0.182]
22 One of the key operations is the computation of the proximal operator associated with the penalty. [sent-39, score-0.539]
23 We reveal several key properties of the proximal operator associated with the overlapping group Lasso penalty, and propose several reformulations that can be solved efficiently. [sent-40, score-1.056]
24 We have performed empirical evaluations using both synthetic data and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. [sent-42, score-0.681]
25 2 The Overlapping Group Lasso. We consider the following overlapping group Lasso penalized problem: $\min_{x \in \mathbb{R}^p} f(x) = l(x) + \phi_{\lambda_2}^{\lambda_1}(x)$ (1), where $l(\cdot)$ is a smooth convex loss function, e.g., the least squares loss. [sent-50, score-0.525]
26 $\phi_{\lambda_2}^{\lambda_1}(x) = \lambda_1 \|x\|_1 + \lambda_2 \sum_{i=1}^{g} w_i \|x_{G_i}\|$ (2) is the overlapping group Lasso penalty, $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$ are regularization parameters, and $w_i > 0$, $i = 1, 2, \ldots, g$. [sent-52, score-0.512]
27 $G_i \subseteq \{1, 2, \ldots, p\}$ contains the indices corresponding to the i-th group of features, and $\|\cdot\|$ denotes the Euclidean norm. [sent-58, score-0.214]
28 The g groups of features are pre-specified, and they may overlap. [sent-60, score-0.138]
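To make the penalty in (2) concrete, the following Python sketch evaluates $\phi_{\lambda_2}^{\lambda_1}(x)$ for a list of (possibly overlapping) index groups. It is only an illustration of the definition; the group list, weights, and test vector are made-up assumptions, and $w_i = \sqrt{|G_i|}$ is one common convention rather than something prescribed here.

```python
import numpy as np

def overlapping_group_lasso_penalty(x, groups, weights, lam1, lam2):
    """Evaluate phi(x) = lam1*||x||_1 + lam2*sum_i w_i*||x_{G_i}||_2 with overlapping groups."""
    l1_part = lam1 * np.sum(np.abs(x))
    group_part = lam2 * sum(w * np.linalg.norm(x[G]) for G, w in zip(groups, weights))
    return l1_part + group_part

# Illustrative example: two groups sharing index 2.
x = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
groups = [np.array([0, 1, 2]), np.array([2, 3, 4])]
weights = [np.sqrt(len(G)) for G in groups]   # w_i = sqrt(|G_i|), an assumed choice
print(overlapping_group_lasso_penalty(x, groups, weights, lam1=0.1, lam2=0.2))
```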
29 When the groups are disjoint with λ1 = 0 and λ2 > 0, the model in (1) reduces to the group Lasso [28]. [sent-62, score-0.331]
30 In this paper, we propose to make use of the accelerated gradient descent (AGD) [2, 21, 22] for solving (1), due to its fast convergence rate. [sent-64, score-0.172]
31 The algorithm is called “FoGLasso”, which stands for Fast overlapping Group Lasso. [sent-65, score-0.221]
32 One of the key steps in the proposed FoGLasso algorithm is the computation of the proximal operator associated with the penalty in (2); and we present an efficient algorithm for the computation in the next section. [sent-66, score-0.602]
33 The model $f_{L,x}(y)$ consists of the first-order Taylor expansion of the smooth function $l(\cdot)$ at the point $x$, the non-smooth penalty $\phi_{\lambda_2}^{\lambda_1}(\cdot)$, and a regularization term $\frac{L}{2}\|y - x\|^2$. [sent-68, score-0.093]
34 A key building block in FoGLasso is the minimization of (3), whose solution is known as the proximal operator [20]. [sent-72, score-0.558]
35 The computation of the proximal operator is the main technical contribution of this paper. [sent-73, score-0.506]
36 The pseudocode of FoGLasso is summarized in Algorithm 1, where the proximal operator π(·) is defined in (4). [sent-74, score-0.533]
37 2: for i = 1 to k do; 3: Set $\beta_i = (\alpha_{i-2} - 1)/\alpha_{i-1}$, $s_i = x_i + \beta_i (x_i - x_{i-1})$; 4: Find the smallest $L = 2^j L_{i-1}$, $j = 0, 1, \ldots$ [sent-77, score-0.132]
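The Algorithm 1 fragment above is heavily garbled by extraction, so the following sketch restates the accelerated gradient loop with backtracking line search in executable form. It is a simplified reconstruction, not the paper's Matlab/C implementation: the names `foglasso_sketch`, `loss_l`, `grad_l`, and `prox` are placeholders, the alpha/beta updates follow the standard Nesterov scheme, and the proximal operator is treated as a black box (it is the subject of Section 3).

```python
import numpy as np

def foglasso_sketch(loss_l, grad_l, prox, x0, L0=1.0, max_iter=100):
    """Accelerated proximal gradient with backtracking line search.

    loss_l, grad_l: value and gradient of the smooth loss l(.)
    prox(v, t):     solver for argmin_x 0.5*||x - v||^2 + t*phi(x)
    """
    x_prev, x = x0.copy(), x0.copy()
    alpha_prev, alpha = 1.0, 1.0
    L = L0
    for _ in range(max_iter):
        beta = (alpha_prev - 1.0) / alpha
        s = x + beta * (x - x_prev)                     # search point s_i
        g = grad_l(s)
        while True:                                     # find the smallest L = 2^j * L_{i-1} that works
            y = prox(s - g / L, 1.0 / L)
            d = y - s
            if loss_l(y) <= loss_l(s) + g @ d + 0.5 * L * (d @ d):
                break
            L *= 2.0
        x_prev, x = x, y
        alpha_prev, alpha = alpha, (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
    return x

# Usage sketch with a least squares loss l(x) = 0.5*||Ax - b||^2:
#   loss_l = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
#   grad_l = lambda x: A.T @ (A @ x - b)
```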
38 Recently, it has been shown in [14] that the efficient computation of the proximal operator is key to many sparse learning algorithms. [sent-82, score-0.539]
39 In Section 3.1, we discuss some key properties of the proximal operator, based on which we propose a pre-processing technique that will significantly reduce the size of the problem. [sent-86, score-0.433]
40 We then propose to solve it via the dual formulation in Section 3.2. [sent-87, score-0.151]
41 Several alternative methods for solving the proximal operator via proximal splitting methods are discussed in Section 3.3. [sent-89, score-0.965]
42 3.1 Key Properties of the Proximal Operator. We first reveal several basic properties of the proximal operator $\pi_{\lambda_2}^{\lambda_1}(v)$. [sent-92, score-0.537]
43 The following holds: 1) if $v_i > 0$, then $0 \le x_i^* \le v_i$; 2) if $v_i < 0$, then $v_i \le x_i^* \le 0$; 3) if $v_i = 0$, then $x_i^* = 0$; 4) $\mathrm{SGN}(v) \subseteq \mathrm{SGN}(x^*)$; and 5) $\pi_{\lambda_2}^{\lambda_1}(v) = \mathrm{sgn}(v) \odot \pi_{\lambda_2}^{\lambda_1}(|v|)$. [sent-99, score-0.348]
44 If $x_i^* > v_i$, then we can construct an $\hat{x}$ as follows: $\hat{x}_j = x_j^*$ for $j \ne i$, and $\hat{x}_i = v_i$. [sent-106, score-0.148]
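Property 5) of the lemma is directly useful in implementations: the proximal operator only has to be computed for non-negative inputs, and the signs are restored afterwards. A minimal sketch, where `prox_nonneg` stands for any (hypothetical) solver that assumes a non-negative input:

```python
import numpy as np

def prox_with_sign_reduction(v, prox_nonneg):
    """Property 5): pi(v) = sgn(v) * pi(|v|), so only non-negative inputs need handling."""
    return np.sign(v) * prox_nonneg(np.abs(v))

# Sanity check with the plain L1 proximal operator (soft thresholding) as prox_nonneg:
soft_nonneg = lambda u: np.maximum(u - 0.3, 0.0)
v = np.array([1.0, -0.2, -2.5, 0.1])
print(prox_with_sign_reduction(v, soft_nonneg))   # equals elementwise soft thresholding of v
```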
45 The difficulty in the optimization of (5) lies in the large number of groups that may overlap. [sent-128, score-0.16]
46 In practice, many groups will be zero, thus achieving a sparse solution (a sparse solution is desirable in many applications). [sent-129, score-0.176]
47 However, the zero groups are not known in advance. [sent-130, score-0.158]
48 The key question we aim to address is how we can identify as many zero groups as possible to reduce the complexity of the optimization. [sent-131, score-0.191]
49 Next, we present a sufficient condition for a group to be zero. [sent-132, score-0.193]
50 If the i-th group satisfies $\|u_{G_i}\| \le \lambda_2 w_i$, then $x^*_{G_i} = 0$, i.e., the i-th group is a zero group. [sent-135, score-0.268]
51 Lemma 2 may not identify many true zero groups due to the strong condition imposed. [sent-146, score-0.158]
52 Intuitively, for a group $G_i$, we first identify all existing zero groups that overlap with $G_i$, and then compute the overlapping index subset $S_i$ of $G_i$ as $S_i = \bigcup_j (G_j \cap G_i)$, where the union is taken over the identified zero groups $G_j$. [sent-148, score-0.572]
53 By removing these groups, the original problem (5) can then be reduced to: $\min_{x(I_1) \in \mathbb{R}^{|I_1|}} \frac{1}{2}\|x(I_1) - u(I_1)\|^2 + \lambda_2 \sum_{i \in \mathcal{G}_1} w_i \|x_{G_i - S_i}\|$, where $I_1$ is the reduced index set. [sent-157, score-0.094]
54 Lemma 3 naturally leads to an iterative procedure for identifying the zero groups: for each group $G_i$, if $\|u_{G_i}\| \le \lambda_2 w_i$, then we set $u_{G_i} = 0$; we cycle through all groups repeatedly until $u$ does not change. [sent-166, score-0.426]
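The iterative procedure of Lemma 3 translates almost literally into code. The sketch below assumes `u` is the (already ℓ1-thresholded) input to the group part of the proximal operator, as in the reduction discussed above; the group index arrays and weights are illustrative placeholders.

```python
import numpy as np

def identify_zero_groups(u, groups, weights, lam2):
    """Cycle through the groups, zeroing any group with ||u_{G_i}|| <= lam2 * w_i,
    until u no longer changes; returns the reduced u and the indices of zero groups."""
    u = u.copy()
    changed = True
    while changed:
        changed = False
        for G, w in zip(groups, weights):
            block = u[G]
            if np.any(block != 0) and np.linalg.norm(block) <= lam2 * w:
                u[G] = 0.0
                changed = True
    zero_groups = [i for i, G in enumerate(groups) if not np.any(u[G])]
    return u, zero_groups
```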
55 Making use of the dual norm of the Euclidean norm $\|\cdot\|$, we can rewrite $h_{\lambda_2}(x)$ as: $h_{\lambda_2}(x) = \max_{Y \in \Omega} \frac{1}{2}\|x - u\|^2 + \sum_{i=1}^{g} \langle x, Y^i \rangle$ (12), where $\Omega = \{Y \in \mathbb{R}^{p \times g} : Y^i_{\overline{G_i}} = 0, \|Y^i\| \le \lambda_2 w_i, i = 1, 2, \ldots, g\}$. [sent-175, score-0.223]
56 Therefore, we convert the non-smooth problem (11) to the smooth problem (15), making the smooth convex optimization tools applicable. [sent-189, score-0.116]
57 In this paper, we employ the accelerated gradient descent to solve (15), due to its fast convergence property. [sent-190, score-0.177]
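A minimal sketch of solving the proximal operator through this dual is given below. For clarity it uses plain projected gradient rather than the accelerated scheme, but the ingredients are the same: the dual variable has one block $Y^i$ per group, the gradient step on block $i$ only involves the primal residual restricted to $G_i$, and feasibility is maintained by projecting each block onto the Euclidean ball of radius $\lambda_2 w_i$. All names are placeholders and the step size is a conservative assumption.

```python
import numpy as np

def prox_group_part_dual(u, groups, weights, lam2, max_iter=1000, tol=1e-10):
    """Projected gradient on the smooth dual of
         min_x 0.5*||x - u||^2 + lam2 * sum_i w_i * ||x_{G_i}||.
    The primal solution is recovered as x = u - sum_i Y^i."""
    counts = np.zeros(len(u))
    for G in groups:
        counts[G] += 1.0
    step = 1.0 / max(counts.max(), 1.0)          # 1 / Lipschitz constant of the dual gradient
    Y = [np.zeros(len(G)) for G in groups]       # only the G_i block of each Y^i is stored
    for _ in range(max_iter):
        x = u.copy()
        for G, y in zip(groups, Y):
            x[G] -= y                            # x = u - sum_i Y^i
        max_change = 0.0
        for k, (G, w) in enumerate(zip(groups, weights)):
            y_new = Y[k] + step * x[G]           # dual gradient step on block k
            radius = lam2 * w
            nrm = np.linalg.norm(y_new)
            if nrm > radius:
                y_new *= radius / nrm            # project onto {||Y^k|| <= lam2 * w_k}
            max_change = max(max_change, np.max(np.abs(y_new - Y[k])))
            Y[k] = y_new
        if max_change < tol:
            break
    x = u.copy()
    for G, y in zip(groups, Y):
        x[G] -= y
    return x, Y
```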
58 3.2.1 Computing the Duality Gap. We show how to estimate the duality gap of the min-max problem (13), which can be used to check the quality of the solution and determine the convergence of the algorithm. [sent-195, score-0.181]
59 The duality gap for the min-max problem (13) at the point $(\tilde{x}, \tilde{Y})$ can be computed as: $\mathrm{gap}(\tilde{Y}) = \max_{Y \in \Omega} \psi(\tilde{x}, Y) - \min_{x} \psi(x, \tilde{Y})$ (16). [sent-197, score-0.178]
60 In our experiments, we terminate the algorithm when the estimated duality gap is less than $10^{-10}$. [sent-206, score-0.157]
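Given a dual point, both the inner maximization over Y and the inner minimization over x in (16) have closed forms, so the duality gap can be evaluated cheaply. The sketch below uses the same per-group block representation of Y as in the dual sketch above; it is an illustration, not the paper's code.

```python
import numpy as np

def duality_gap(u, Y, groups, weights, lam2):
    """Gap between the primal objective at x = u - sum_i Y^i and the dual objective at Y."""
    s = np.zeros_like(u)
    for G, y in zip(groups, Y):
        s[G] += y
    x = u - s
    primal = 0.5 * np.dot(x - u, x - u) + lam2 * sum(
        w * np.linalg.norm(x[G]) for G, w in zip(groups, weights))
    dual = np.dot(u, s) - 0.5 * np.dot(s, s)
    return primal - dual

# Terminate the dual solver once duality_gap(...) < 1e-10, matching the stopping rule above.
```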
61 3.3 Proximal Splitting Methods. Recently, a family of proximal splitting methods [8] has been proposed for converting a challenging optimization problem into a series of sub-problems with a closed form solution. [sent-208, score-0.486]
62 We consider two reformulations of the proximal operator (4), based on the Dykstra-like Proximal Splitting Method and the alternating direction method of multipliers (ADMM). [sent-209, score-0.617]
63 The efficiency of these two methods for overlapping Group Lasso will be demonstrated in the next section. [sent-210, score-0.221]
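As one concrete instance of such a reformulation, a consensus ADMM sketch for the proximal operator is shown below: each group keeps its own copy of the overlapping block, the x-update is a coordinate-wise quadratic, and the z-update is block soft thresholding. This is an illustrative reconstruction, not the ADMM variant benchmarked in the paper; the parameter names and the choice rho=1.0 are assumptions.

```python
import numpy as np

def prox_via_admm(u, groups, weights, lam2, rho=1.0, max_iter=200):
    """Consensus ADMM for min_x 0.5*||x - u||^2 + lam2 * sum_i w_i * ||x_{G_i}||,
    with one copy z^i of x_{G_i} per group and scaled dual variables t^i."""
    p = len(u)
    counts = np.zeros(p)
    for G in groups:
        counts[G] += 1.0                               # how many groups cover each coordinate
    z = [np.zeros(len(G)) for G in groups]
    t = [np.zeros(len(G)) for G in groups]
    x = u.copy()
    for _ in range(max_iter):
        # x-update: separable quadratic, solved coordinate-wise
        acc = np.zeros(p)
        for G, zi, ti in zip(groups, z, t):
            acc[G] += zi - ti
        x = (u + rho * acc) / (1.0 + rho * counts)
        # z-update: block soft thresholding, followed by the dual update
        for k, (G, w) in enumerate(zip(groups, weights)):
            v = x[G] + t[k]
            nrm = np.linalg.norm(v)
            thresh = lam2 * w / rho
            z[k] = np.zeros_like(v) if nrm <= thresh else (1.0 - thresh / nrm) * v
            t[k] += x[G] - z[k]
    return x
```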
64 It has been suggested that the original overlapping group Lasso problem (1) can be reformulated and solved by ADMM directly. [sent-212, score-0.615]
65 We use both synthetic data sets and a real-world data set, and the evaluation is done under various problem size and precision settings. [sent-216, score-0.092]
66 The proposed algorithms are mainly implemented in Matlab, with the proximal operator implemented in standard C for improved efficiency. [sent-217, score-0.531]
67 1 Synthetic Data In the first set of simulation we consider only the key component of our algorithm, the proximal operator. [sent-223, score-0.412]
68 The group indices are predefined such that G1 = {1, 2, . [sent-224, score-0.193]
69 , with each group overlapping half of the previous group. [sent-233, score-0.414]
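For reference, the helper below generates consecutive-index groups in which each group overlaps half of the previous one, mimicking the synthetic structure just described; the group size used in the actual simulation is not stated here, so the value in the example is an assumption.

```python
import numpy as np

def make_chain_groups(p, group_size):
    """Groups of consecutive indices, each overlapping half of the previous group."""
    step = group_size // 2
    groups, start = [], 0
    while start + group_size <= p:
        groups.append(np.arange(start, start + group_size))
        start += step
    return groups

print([G.tolist() for G in make_chain_groups(p=20, group_size=6)])
# five groups: indices 0-5, 3-8, 6-11, 9-14, 12-17
```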
70 100 examples are generated for each set of fixed problem size p and group size g, and the results are summarized in Figure 1. [sent-234, score-0.22]
71 As we can observe from the figure, the dual formulation yields the best performance, followed closely by ADMM and then the Dykstra method. [sent-235, score-0.101]
72 We can also observe that our method scales very well to high dimensional problems, since even with p = 106 , the proximal operator can be computed in a few seconds. [sent-236, score-0.506]
73 The group number is fixed in the left figure and the problem size is fixed in the middle figure. [sent-241, score-0.193]
74 As is evident from Figure 1, the dual formulation proposed in Section 3.2 is the most efficient. [sent-244, score-0.126]
75 In the following experiments, only the dual method will be used for computing the proximal operator, and our method will then be referred to as "FoGLasso". [sent-246, score-0.459]
76 4.2 Gene Expression Data. We have also conducted experiments to evaluate the efficiency of the proposed algorithm using the breast cancer gene expression data set [26], which consists of 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic). [sent-248, score-0.737]
77 For the sake of analyzing microarrays in terms of biologically meaningful gene sets, different approaches have been used to organize the genes into (overlapping) gene sets. [sent-249, score-0.463]
78 In our experiments, we follow [12] and employ the following two approaches for generating the overlapping gene sets (groups): pathways [24] and edges [7]. [sent-250, score-0.48]
79 For pathways, the canonical pathways from the Molecular Signatures Database (MSigDB) [24] are used. [sent-251, score-0.106]
80 It contains 639 groups of genes, of which 637 groups involve the genes in our study. [sent-252, score-0.511]
81 The statistics of the 637 gene groups are summarized as follows: the average number of genes in each group is 23.7. [sent-253, score-0.707]
82 The largest gene group has 213 genes, and 3,510 genes appear in these 637 groups with an average appearance frequency of about 4. [sent-254, score-0.68]
83 For edges, the network built in [7] will be used, and we follow [12] to extract 42,594 edges from the network, leading to 42,594 overlapping gene sets of size 2. [sent-255, score-0.374]
84 All 8,141 genes appear in the 42,594 groups with an average appearance frequency of about 10. [sent-256, score-0.373]
85 We vary the number of genes involved, and report the total computational time (seconds) including all nine regularization parameters in Figure 2. [sent-262, score-0.258]
86 We can observe that: 1) for all precision levels, our proposed FoGLasso is much more efficient than SLasso, ADMM and Prox-Grad; 2) the advantage of FoGLasso over the other three methods in efficiency grows with the increasing number of genes (variables). [sent-263, score-0.328]
87 For example, with the grouping by pathways, FoGLasso is about 25 and 70 times faster than SLasso for 1000 and 2000 genes, respectively; and 3) the efficiency on edges is inferior to that on pathways, due to the larger number of overlapping groups. [sent-264, score-0.26]
88 Table 1: Comparison of FoGLasso and Picard-Nesterov using different numbers (p) of genes and various precision levels. [sent-271, score-0.303]
89 For each particular method, the first row denotes the number of outer iterations required for convergence, while the second row represents the total number of inner iterations. [sent-272, score-0.101]
90 As we can observe from the table, though the Picard-Nesterov method often takes fewer outer iterations to converge, it takes many more inner iterations to compute the proximal operator. [sent-283, score-0.491]
91 It is straightforward to verify that the inner iterations in the Picard-Nesterov method and our proposed method have the same complexity of O(pg). [sent-284, score-0.104]
92 5 Conclusion In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem based on the accelerated gradient descent method. [sent-285, score-0.593]
93 We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator via solving the smooth and convex dual problem. [sent-286, score-1.652]
94 Numerical experiments on both the synthetic and the breast cancer data sets demonstrate the efficiency of the proposed algorithm. [sent-287, score-0.218]
95 Although the optimal convergence rate of the accelerated gradient descent might not be guaranteed with an inexact proximal operator [23, 11], the algorithm performs quite well empirically. [sent-288, score-0.556]
96 In the future, we also plan to apply the proposed algorithm to other real-world applications involving overlapping groups. [sent-290, score-0.246]
97 An efficient proximal gradient method for general structured sparse learning. [sent-336, score-0.41]
98 An accelerated inexact proximal point algorithm for convex minimization. [sent-372, score-0.492]
99 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-399, score-0.394]
100 A gene-expression signature as a predictor of survival in breast cancer. [sent-459, score-0.11]
wordName wordTfidf (topN-words)
[('proximal', 0.379), ('foglasso', 0.357), ('admm', 0.314), ('slasso', 0.236), ('genes', 0.235), ('sgn', 0.231), ('overlapping', 0.221), ('gi', 0.21), ('lasso', 0.201), ('group', 0.193), ('ugi', 0.179), ('groups', 0.138), ('operator', 0.127), ('gene', 0.114), ('breast', 0.11), ('prox', 0.107), ('xgi', 0.107), ('pathways', 0.106), ('grad', 0.094), ('si', 0.088), ('dual', 0.08), ('dykstra', 0.079), ('cpu', 0.078), ('wi', 0.075), ('gap', 0.073), ('precision', 0.068), ('duality', 0.064), ('splitting', 0.06), ('cancer', 0.059), ('accelerated', 0.058), ('fli', 0.054), ('jenatton', 0.053), ('lemma', 0.052), ('vi', 0.052), ('reformulations', 0.047), ('alternating', 0.044), ('xi', 0.044), ('tempe', 0.041), ('minimizer', 0.04), ('involved', 0.039), ('edges', 0.039), ('penalty', 0.038), ('arxiv', 0.038), ('descent', 0.038), ('ygi', 0.036), ('arizona', 0.034), ('az', 0.034), ('key', 0.033), ('smooth', 0.032), ('argyriou', 0.032), ('reformulation', 0.032), ('obozinski', 0.032), ('iterations', 0.032), ('reveal', 0.031), ('gj', 0.031), ('gradient', 0.031), ('convex', 0.03), ('ciency', 0.03), ('penalized', 0.03), ('outer', 0.027), ('mairal', 0.027), ('summarized', 0.027), ('verify', 0.026), ('supplemental', 0.026), ('miny', 0.026), ('solve', 0.025), ('preprint', 0.025), ('composite', 0.025), ('convergence', 0.025), ('proposed', 0.025), ('expression', 0.025), ('inexact', 0.025), ('synthetic', 0.024), ('liu', 0.024), ('norm', 0.023), ('regularization', 0.023), ('et', 0.023), ('max', 0.022), ('optimization', 0.022), ('overlaps', 0.021), ('explanatory', 0.021), ('reformulate', 0.021), ('li', 0.021), ('denotes', 0.021), ('technique', 0.021), ('arg', 0.021), ('pre', 0.021), ('formulation', 0.021), ('ef', 0.021), ('rp', 0.021), ('inner', 0.021), ('solving', 0.02), ('gure', 0.02), ('multipliers', 0.02), ('terminate', 0.02), ('zero', 0.02), ('matlab', 0.02), ('solution', 0.019), ('min', 0.019), ('ui', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
2 0.19513202 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.17344764 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
Author: Shilin Ding, Grace Wahba, Xiaojin Zhu
Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1
4 0.16147396 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
Author: Edouard Grave, Guillaume R. Obozinski, Francis R. Bach
Abstract: Using the 1 -norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net. 1
5 0.1280309 251 nips-2011-Shaping Level Sets with Submodular Functions
Author: Francis R. Bach
Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
6 0.10803055 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
7 0.10466299 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
8 0.089290231 226 nips-2011-Projection onto A Nonnegative Max-Heap
9 0.074863091 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
10 0.07348185 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.073364325 258 nips-2011-Sparse Bayesian Multi-Task Learning
12 0.07007359 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
13 0.068418309 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
14 0.066881642 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
15 0.066796884 72 nips-2011-Distributed Delayed Stochastic Optimization
16 0.064249277 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
17 0.063680001 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
18 0.063012071 276 nips-2011-Structured sparse coding via lateral inhibition
19 0.062306568 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
20 0.062022615 271 nips-2011-Statistical Tests for Optimization Efficiency
topicId topicWeight
[(0, 0.163), (1, 0.01), (2, -0.073), (3, -0.124), (4, -0.083), (5, 0.048), (6, -0.079), (7, 0.096), (8, -0.039), (9, 0.084), (10, 0.062), (11, -0.053), (12, -0.134), (13, 0.059), (14, -0.049), (15, 0.106), (16, 0.065), (17, 0.002), (18, 0.104), (19, 0.009), (20, -0.003), (21, 0.014), (22, -0.056), (23, 0.058), (24, -0.067), (25, -0.003), (26, 0.014), (27, -0.092), (28, 0.095), (29, -0.138), (30, 0.065), (31, 0.016), (32, -0.183), (33, 0.019), (34, -0.099), (35, 0.124), (36, 0.085), (37, -0.154), (38, 0.102), (39, 0.012), (40, 0.053), (41, -0.009), (42, 0.09), (43, -0.089), (44, 0.048), (45, -0.043), (46, -0.043), (47, 0.092), (48, -0.036), (49, 0.149)]
simIndex simValue paperId paperTitle
same-paper 1 0.95400709 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
2 0.66797006 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.65157944 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
Author: Edouard Grave, Guillaume R. Obozinski, Francis R. Bach
Abstract: Using the 1 -norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net. 1
4 0.57523251 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
5 0.57103968 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
Author: Shilin Ding, Grace Wahba, Xiaojin Zhu
Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1
6 0.5434655 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
7 0.44132346 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
8 0.39990488 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
9 0.39425308 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
10 0.3823286 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
11 0.37173998 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
12 0.35288414 251 nips-2011-Shaping Level Sets with Submodular Functions
13 0.34909698 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
14 0.34547174 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
15 0.34440947 226 nips-2011-Projection onto A Nonnegative Max-Heap
16 0.33898297 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
17 0.33390874 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
18 0.32637843 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
19 0.31668702 265 nips-2011-Sparse recovery by thresholded non-negative least squares
20 0.31626663 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
topicId topicWeight
[(0, 0.037), (4, 0.034), (20, 0.015), (26, 0.041), (31, 0.052), (33, 0.02), (43, 0.104), (45, 0.198), (55, 0.249), (57, 0.019), (74, 0.052), (83, 0.039), (99, 0.041)]
simIndex simValue paperId paperTitle
Author: Shuai Huang, Jing Li, Jieping Ye, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman
Abstract: Diagnosis of Alzheimer’s disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment that relies primarily on cognitive measures proves low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modality. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by the DC (difference of convex functions) programming. We show that in using the DC programming, the property of the nonconvex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature. 1 In tro du cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder that currently affects over five million people in the U.S. It leads to substantial, progressive neuron damage that is irreversible, which eventually causes death. Early diagnosis of AD is of great clinical importance, because disease-modifying therapies given to patients at the early stage of their disease development will have a much better effect in slowing down the disease progression and helping preserve some cognitive functions of the brain. However, current clinical assessment that majorly relies on cognitive measures proves low sensitivity and specificity in early diagnosis of AD. This is because these cognitive measures are vulnerable to the confounding effect from some non-AD related factors such as patients’ mood, and presence of other illnesses or major life events [1]. The confounding effect is especially severe in the diagnosis of early AD, at which time cognitive 1 impairment is not yet apparent. On the other hand, fast growing neuroimaging techniques, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET), provide great opportunities for improving early diagnosis of AD, due to their ability for overcoming the limitations of conventional cognitive measures. There are two major categories of neuroimaging techniques, i.e., functional and structure neuroimaging. MRI is a typical structural neuroimaging technique, which allows for visualization of brain anatomy. 
PET is a typical functional neuroimaging technique, which measures the cerebral metabolic rate for glucose. Both techniques have been extensively applied to AD studies. For example, studies based on MRI have consistently revealed brain atrophy that involves the hippocampus and entorhinal cortex [2-6]; studies based on PET have revealed functional abnormality that involves the posterior temporal and parietal association cortices [8-10], posterior cingulate, precuneus, and medial temporal cortices [11-14]. There is overlap between the disease-related brain regions detected by MRI and those by PET, such as regions in the hippocampus area and the mesia temporal lobe [15-17]. This is not surprising since MRI and PET are two complementary measures for the same disease pathology, i.e., it starts mainly in the hippocampus and entorhinal cortex, and subsequently spreads throughout temporal and orbiogrontal cortext, poseterior cingulated, and association cortex [7]. However, most existing studies only exploited structural and functional alterations in separation, which ignore the potential interaction between them. The fusion of MRI and PET imaging modalities will increase the statistical power in identification of disease-related brain regions, especially for early AD, at which stage the disease-related regions are most likely to be weakeffect regions that are difficult to be detected from MRI or PET alone. Once a good set of diseaserelated brain regions is identified, they can be further used to build an effective classifier (i.e., a biomarker from the clinical perspective) to enable AD diagnose with high sensitivity and specificity. The idea of multi-modality data fusion in the research of neurodegenerative disorders has been exploited before. For example, a number of models have been proposed to combine electroencephalography (EEG) and functional MRI (fMRI), including parallel EEG-fMRI independent component analysis [18]-[19], EEG-informed fMRI analysis [18] [20], and variational Bayesian methods [18] [21]. The purpose of these studies is different from ours, i.e., they aim to combine EEG, which has high temporal resolution but low spatial resolution, and fMRI, which has low temporal resolution but high spatial resolution, so as to obtain an accurate picture for the whole brain with both high spatial and high temporal resolutions [18]-[21]. Also, there have been some studies that include both MRI and PET data for classification [15], [22][25]. However, these studies do not make use of the fact that MRI and PET measure the same underlying disease pathology from two complementary perspectives (i.e., structural and functional perspectives), so that the analysis of one imaging modality can borrow strength from the other. In this paper, we focus on the problem of identifying disease-related brain regions from multimodality data. This is actually a variable selection problem. Because MRI and PET data are highdimensional, regularization techniques are needed for effective variable selection, such as the L1regularization technique [25]-[30] and the L2/L1-regularization technique [31]. In particular, L2/L1-regularization has been used for variable selection jointly on multiple related datasets, also known as multitask feature selection [31], which has a similar nature to our problem. Note that both L1- and L2/L1-regularizations are convex regularizations, which have gained them popularity in the literature. 
On the other hand, there is increasing evidence that these convex regularizations tend to produce too severely shrunken parameter estimates. Therefore, these convex regularizations could lead to miss-identification of the weak-effect disease-related brain regions, which unfortunately make up a large portion of the disease-related brain regions especially in early AD. Also, convex regularizations tend to select many irrelevant variables to compensate for the overly severe shrinkage in the parameters of the relevant variables. Considering these limitations of convex regularizations, we study non-convex regularizations [33]-[35] [39], which have the advantage of producing mildly or slightly shrunken parameter estimates so as to be able to preserve weak-effect disease-related brain regions and the advantage of avoiding selecting many disease-irrelevant regions. Specifically in this paper, we propose a sparse composite linear discriminant analysis model, called SCLDA, for identification of disease-related brain regions from multi-modality data. The contributions of our paper include: 2 • • • 2 Formulation: We propose a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the data sources and a parameter specific to each data source, which enables joint analysis of all the data sources and borrowing strength from one another. We further prove that this formulation is equivalent to a penalized likelihood with non-convex regularization. Algorithm: We show that the proposed non-convex optimization can be solved by the DC (difference of convex functions) programming [39]. More importantly, we show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. Application: We apply the proposed SCLDA to the PET and MRI data of early AD patients and normal controls (NC). Our study identifies disease-related brain regions that are consistent with the findings in the AD literature. AD vs. NC classification based on these identified regions achieves high accuracy, which makes the proposed method a useful tool for clinical diagnosis of early AD. In contrast, the convex-regularization based multitask feature selection method [31] identifies more irrelevant brain regions and yields a lower classification accuracy. Rev iew o f L D A a nd it s v ari an ts ! Denote đ?’ = đ?‘?!, đ?‘?! , ‌ , đ?‘?! as the variables and assume there are đ??˝ classes. Denote đ?‘ ! as the ! sample size of class đ?‘— and đ?‘ = !!! đ?‘ ! is the total sample size. Let đ??ł = đ?’›! , đ?’›! , ‌ , đ?’›! ! be the đ?‘ Ă—đ?‘? sample matrix, where đ?’›! is the đ?‘– !! sample and đ?‘” đ?‘– is its associated class index. Let ! ! ! ! đ?›?! = !!! đ?’›! be the overall sample mean, !!!,! ! !! đ?’›! be the sample mean of class đ?‘—, đ?›? = !! ! ! đ?’› − đ?›? ! !!! ! ! ! đ??–! = !! !!!,! ! !! ! ! đ?‘ đ??– be the ! !!! ! ! đ??“= ! đ?’›! − đ?›? đ?’›! − đ?›?! ! be the total normalized sum of squares and products (SSQP), đ?’›! − đ?›?! ! be the normalized class SSQP of class đ?‘—, and đ??–= overall normalized class SSQP. The objective of LDA is to seek for a đ?‘?Ă—đ?‘ž linear transformation matrix, đ?›‰! , with which đ?›‰! đ?‘? ! retains the maximum amount of class discrimination information in đ?‘?. To achieve this objective, one approach is to seek for the đ?›‰! that maximizes the between-class variance of đ?›‰! đ?‘?, which can ! be measured by tr(đ?›‰! đ??“đ?›‰! ), while minimizing the within-class variance of đ?›‰! 
đ?‘?, which can be ! ! measured by tr(đ?›‰! đ??–đ?›‰! ). Here tr() is the matrix trace operator. This is equivalent to solving the ! following optimization problem:  đ?›‰! = argmax đ?›‰! đ??đ??Ť(đ?›‰! đ??“đ?›‰! ) ! đ??đ??Ť(đ?›‰! đ??–đ?›‰! ) ! . (1) Note that đ?›‰! corresponds to the right eigenvector of đ??– !! đ??“ and đ?‘ž = đ??˝ − 1. Another approach used for finding the đ?›‰! is to use the maximum likelihood estimation for Gaussian populations that have different means and a common covariance matrix. Specifically, as in [36], this approach is developed by assuming the class distributions are Gaussian with a common covariance matrix, and their mean differences lie in a đ?‘ž-dimensional subspace of the đ?‘?dimensional original variable space. Hastie [37] further generalized this approach by assuming that class distributions are a mixture of Gaussians, which has more flexibility than LDA. However, both approaches assume a common covariance matrix for all the classes, which is too strict in many practical applications, especially in high-dimensional problems where the covariance matrices of different classes tend to be different. Consequently, the linear transformation explored by LDA may not be effective. In [38], a heterogeneous LDA (HLDA) is developed to relax this assumption. The HLDA seeks for a đ?‘?Ă—đ?‘? linear transformation matrix, đ?›‰, in which only the first đ?‘ž columns (đ?›‰! ) contain discrimination information and the remaining đ?‘? − đ?‘ž columns (đ?›‰!!! ) contain no discrimination information. For Gaussian models, assuming lack of discrimination information is equivalent to assuming that the means and the covariance matrices of the class distributions are the same for all 3 classes, in the đ?‘? − đ?‘ž dimensional subspace. Following this, the log-likelihood function of đ?›‰ can be written as below [38]: ! đ?‘™ đ?›‰|đ??™ = − log đ?›‰! đ??“đ?›‰!!! − !!! ! !! ! !!! ! log đ?›‰! đ??–! đ?›‰! + đ?‘ log đ?›‰ , ! (2) Here đ??€ denotes the determinant of matrix đ??€. There is no closed-form solution for đ?›‰. As a result, numeric methods are needed to derive the maximum likelihood estimate for đ?›‰. It is worth mentioning that the LDA in the form of (1) is a special case of the HLDA [38]. 3 T he p ro po sed SC L DA Suppose that there are multiple data sources, đ??™ ! , đ??™ ! , ‌ , đ??™ ! , with each data source capturing one aspect of the same set of physical variables, e.g., the MRI and PET capture the structural and functional aspects of the same brain regions. For each data source, đ??™ ! , there is a linear transformation matrix đ?›‰ ! , which retains the maximum amount of class discrimination information in đ??™ ! . A naive way for estimating đ?šŻ = đ?›‰ ! , đ?›‰ ! , ‌ , đ?›‰ ! is to separately estimate each đ?›‰ ! based on đ??™ ! . Apparently, this approach does not take advantage of the fact that all the data sources measure the same physical process. Also, when the sample size of each data source is small, this approach may lead to unreliable estimates for the đ?›‰ ! ’s. To tackle these problems, we propose a composite parameterization following the line as [40]. ! Specifically, let đ?œƒ!,! be the element at the k-th row and l-th column of đ?›‰ ! . We treat ! ! ! ! ! ! đ?œƒ!,! , đ?œƒ!,! , ‌ , đ?œƒ!,! as an interrelated group and parameterize each đ?œƒ!,! as đ?œƒ!,! = đ?›ż! đ?›ž!,! , for 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘? and 1 ≤ đ?‘š ≤ đ?‘€. In order to assure identifiability, we restrict each đ?›ż! ≼ 0. Here, đ?›ż! 
represents the common information shared by all the data sources about variable đ?‘˜, while ! đ?›ž!,! represents the specific information only captured by the đ?‘š !! data source. For example, for disease-related brain region identification, if đ?›ż! = 0, it means that all the data sources indicate variable đ?‘˜ is not a disease-related brain region; otherwise, variable đ?‘˜ is a disease-related brain ! region. đ?›ž!,! ≠0 means that the đ?‘š !! data source supports this assertion. The log-likelihood function of đ?šŻ is: đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! = ! !!! − ! ! ! đ?‘ ! ! log đ?›‰!!! đ??“ ! log đ?›‰ ! ! ! ! đ?›‰!!! − !! ! !!! ! ! log đ?›‰! đ??–! ! ! đ?›‰! +  , which follows the same line as (2). However, our formulation includes the following constraints on đ?šŻ: ! ! đ?œƒ!,! = đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€. Let ! đ?šŞ = đ?›ž!,! , 1 ≤ đ?‘˜ ≤ đ?‘?, 1 ≤ đ?‘™ ≤ đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€ and (3) đ?šż = đ?›ż! , 1 ≤ đ?‘˜ ≤ đ?‘? . An intuitive choice for estimation of đ?šŞ and đ?šż is to maximize the đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ !   subject to the constraints in (3). However, it can be anticipated that no element in the estimated đ?šŞ and đ?šż will be exactly zero, resulting in a model which is not interpretable, i.e., poor identification of diseaserelated regions. Thus, we encourage the estimation of đ?šż and the first đ?‘ž columns of đ?šŞ (i.e., the columns containing discrimination information) to be sparse, by imposing the L1-penalty on đ?šŞ and đ?šż. By doing so, we obtain the following optimization problem for the proposed SCLDA: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œƒ!,! = = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ ! , đ??™ ! , ‌ , đ??™ ! ! đ?œ†! !,!,! đ?›ž!,!  , subject to ! đ?›ż! đ?›ž!,! , đ?›ż! ≼ 0, 1 ≤ đ?‘˜, đ?‘™ ≤ +  đ?œ†! ! đ?›ż! + đ?‘?, 1 ≤ đ?‘š ≤ đ?‘€.                               (4) Here, đ?œ†! and đ?œ†! control the degrees of sparsity of đ?šż and đ?šŞ, respectively. Tuning of two regularization parameters is difficult. Fortunately, we prove the following Theorem which indicates that formulation (4) is equivalent to a simpler optimization problem involving only one regularization parameter. 4 Theorem 1: The optimization problem (4) is equivalent to the following optimization problem: đ?šŻ = argmin đ?šŻ đ?‘™! đ?šŻ| đ??™ = argmin đ?šŻ −đ?‘™! đ?šŻ| đ??™ with đ?œ† = 2 ! ! , đ??™ ! ! , đ??™ ,‌, đ??™ ! ! ,‌, đ??™ +  đ?œ† ! ! ! !!! ! !!! ! đ?œƒ!,!  , (5) ! đ?œ†! đ?œ†! , i.e., đ?œƒ!,! = đ?œƒ!,! . The proof can be found in the supplementary document. It can also be found in the supplementary material how this formulation will serve the purpose of the composite parameterization, i.e., common information and specific information can be estimated separately and simultaneously. The optimization problem (5) is a non-convex optimization problem that is difficult to solve. We address this problem by using an iterative two-stage procedure known as Difference of Convex functions (DC) programming [39]. A full description of the algorithm can be found in the supplemental material. 4 S im ula tion s tu d ies In this section, we conduct experiments to compare the performance of the proposed SCLDA with sparse LDA (SLDA) [42] and multitask feature selection [31]. Specifically, as we focus on LDA, we use the multitask feature selection method developed in [31] on LDA, denoted as MSLDA. Both SLDA and MSLDA adopt convex regularizations. 
Specifically, SLDA selects features from one single data source with L1-regularization; MSLDA selects features from multiple data sources with L2/L1 regularization. We evaluate the performances of these three methods across various parameters settings, including the number of variables, đ?‘?, the number of features, đ?‘™, the number of data sources, M, sample size, đ?‘›, and the degree of overlapping of the features across different data sources, s% (the larger the đ?‘ %, the more shared features among the datasets). Definition of đ?‘ % can be found in the simulation procedure that is included in the supplemental material. For each specification of the parameters settings, đ?‘€ datasets can be generated following the simulation procedure. We apply the proposed SCLDA to the đ?‘€ datasets, and identify one feature vector đ?›‰(!) for each dataset, with đ?œ† and đ?‘ž chosen by the method described in section 3.3. The result can be described by the number of true positives (TPs) as well as the number of false positives (FPs). Here, true positives are the non-zero elements in the learned feature vector đ?›‰(!) which are also non-zero in the đ?›ƒ! ; false positives are the non-zero elements in đ?›‰(!) , which are actually zero in đ?›ƒ! . As there are đ?‘š pairs of the TPs and FPs for the đ?‘€ datasets, the average TP over the M datasets and the average FP over the M datasets are used as the performance measures. This procedure (i.e., from data simulation, to SCLDA, to TPs and FPs generation) can be repeated for đ??ľ times, and đ??ľ pairs of average TP and average FP are collected for SCLDA. In a similar way, we can obtain đ??ľ pairs of average TP and average FP for both SLDA and MSLDA. Figures 1 (a) and (b) show comparison between SCLDA, SLDA and MSLDA by scattering the average TP against the average FP for each method. Each point corresponds to one of the N repetitions. The comparison is across various parameters settings, including the number of variables (đ?‘? = 100,200,500), the number of data sources (đ?‘š = 2,5,10), and the degree of overlapping of the features across different data sources (đ?‘ % = 90%, 70%). Additionally, đ?‘› đ?‘? is kept constant, đ?‘› đ?‘? = 1. A general observation is that SCLDA is better than SLDA and MSLDA across all the parameter settings. Some specific trends can be summarized as follows: (i) Both SCLDA and MSLDA outperform SLDA in terms of TPs; SCLDA further outperforms MSLDA in terms of FPs. (ii) In Figure 2 (a), rows correspond to different numbers of data sources, i.e., đ?‘š = 2,5,10 respectively. It is clear that the advantage of SCLDA over both SLDA and MSLDA is more significant when there are more data sources. Also, MSLDA performs consistently better than SLDA. Similar phenomena are shown in Figure 2 (b). This demonstrates that in analyzing each data source, both SCLDA and MSLDA are able to make use of the information contained in other data sources. SCLDA can use this information more efficiently, as SCLDA can produce less shrunken parameter estimates than MSLDA and thus it is able to preserve weak-effect features. (iii) Comparing Figures 2 (a) and (b), it can be seen that the advantage of SCLDA or MSLDA over SLDA is more significant as the data sources have more degree of overlapping in their 5 features. Finally, although not presented here, our simulation shows that the three methods perform similarly when đ?‘ % = 40 or less. (a) (b) Figure 1: Average numbers of TPs vs FPs for SCLDA (green symbols “+â€?), SLDA (blue symbols “*â€?) 
and MSLDA (red symbols “oâ€?) (a) đ?‘ % = 90%, đ?‘› đ?‘? = 1; (b) đ?‘ % = 70%, đ?‘› đ?‘? = 1 5 C ase st ud y 5.1 Data preprocessing Our study includes 49 AD patient and 67 age-matched normal controls (NC), with each subject of AD or NC being scanned both by PET and MRI. The PET and MRI images can be downloaded from the database by the Alzheimer’s Disease Neuroimaging Initiative. In what follows, we outline the data preprocessing steps. Each image is spatially normalized to the Montreal Neurological Institute (MNI) template, using the affine transformation and subsequent non-linear wraping algorithm [43] implemented in the SPM MATLAB toolbox. This is to ensure that each voxel is located in the same anatomical region for all subjects, so that spatial locations can be reported and interpreted in a consistent manner. Once all the images in the MNI template, we further apply the Automated Anatomical Labeling (AAL) technique [43] to segment the whole brain of each subject into 116 brain regions. The 90 regions that belong to the cerebral cortex are selected for the later analysis, as the other regions are not included in the cerebral cortex are rarely considered related with AD in the literature. The measurement of each region in the PET data is regional cerebral blood flow (rCBF); the measurement of each region in the MRI data is the structural volume of the region. 5.2 Disease-related brain regions SCLDA is applied to the preprocessed PET and MRI data of AD and NC with the penalty parameter selected by the AIC method mentioned in section 3. 26 disease-related brain regions are identified from PET and 21 from MRI (see Table 1 for their names). The maps of the diseaserelated brain regions identified from PET and MRI are highlighted in Figure 2 (a) and (b), respectively, with different colors given to neighboring regions in order to distinguish them. Each figure is a set of horizontal cut away slices of the brain as seen from the top, which aims to provide a full view of locations of the regions. One major observation is that the identified disease-related brain regions from MRI are in the hippocampus, parahippocampus, temporal lobe, frontal lobe, and precuneus, which is consistent with the existing literature that reports structural atrophy in these brain areas. [3-6,12-14]. The identified disease-related brain regions from PET are in the temporal, frontal and parietal lobes, which is consistent with many functional neuroimaging studies that report reduced rCBF or 6 reduced cortical glucose metabolism in these areas [8-10, 12-14]. Many of these identified disease-related regions can be explained in terms of the AD pathology. For example, hippocampus is a region affected by AD the earliest and severely [6] Also, as regions in the temporal lobe are essential for memory, damage on these regions by AD can explain the memory loss which is a major clinic symptom of AD. The consistency of our findings with the AD literature supports effectiveness of the proposed SCLDA. Another finding is that there is a large overlap between the identified disease-related regions from PET and those from MRI, which implies strong interaction between functional and structural alterations in these regions. Although well-accepted biological mechanisms underlying this interaction are still not very clear, there are several explanations existing in the literature. 
5.2 Disease-related brain regions

SCLDA is applied to the preprocessed PET and MRI data of AD and NC, with the penalty parameter selected by the AIC method mentioned in Section 3. 26 disease-related brain regions are identified from PET and 21 from MRI (see Table 1 for their names). The maps of the disease-related brain regions identified from PET and MRI are highlighted in Figure 2 (a) and (b), respectively, with different colors given to neighboring regions in order to distinguish them. Each figure is a set of horizontal cut-away slices of the brain as seen from the top, which aims to provide a full view of the locations of the regions. One major observation is that the disease-related brain regions identified from MRI lie in the hippocampus, parahippocampus, temporal lobe, frontal lobe, and precuneus, which is consistent with the existing literature that reports structural atrophy in these brain areas [3-6, 12-14]. The disease-related brain regions identified from PET lie in the temporal, frontal and parietal lobes, which is consistent with many functional neuroimaging studies that report reduced rCBF or reduced cortical glucose metabolism in these areas [8-10, 12-14]. Many of these identified disease-related regions can be explained in terms of the AD pathology. For example, the hippocampus is a region affected by AD the earliest and most severely [6]. Also, as regions in the temporal lobe are essential for memory, damage to these regions by AD can explain the memory loss that is a major clinical symptom of AD. The consistency of our findings with the AD literature supports the effectiveness of the proposed SCLDA. Another finding is that there is a large overlap between the identified disease-related regions from PET and those from MRI, which implies a strong interaction between functional and structural alterations in these regions. Although the biological mechanisms underlying this interaction are not yet well understood, several explanations exist in the literature. The first explanation is that both functional and structural alterations could be the consequence of the loss of dendritic arborizations, which results from intracellular accumulation of PHF-tau and further leads to neuron death and grey matter loss [14]. The second explanation is that the AD pathology may include a vascular component, which may result in reduced rCBF due to limited blood supply and may ultimately result in structural alterations such as brain atrophy [45].

Figure 2: Locations of disease-related brain regions identified from (a) MRI; (b) PET.

5.3 Classification accuracy

As one of our primary goals is to distinguish AD from NC, the disease-related brain regions identified through SCLDA are further utilized to build a classification model. Specifically, for each subject, the rCBF values of the 26 disease-related brain regions identified from PET and the structural volumes of the 21 disease-related brain regions identified from MRI are used together, as a joint spatial pattern of both brain physiology and structure. As a result, each subject is associated with a vector of 47 features/variables. A linear SVM (Support Vector Machine) is employed as the classifier. The classification accuracy based on 10-fold cross-validation is 94.3%. For comparison purposes, MSLDA is also applied, which identifies 45 and 38 disease-related brain regions from PET and MRI, respectively. A linear SVM applied to the 45+38 features gives a classification accuracy of only 85.8%. Note that MSLDA identifies a much larger number of disease-related brain regions than SCLDA, but some of the regions identified by MSLDA may in fact be disease-irrelevant, and including them deteriorates the classification.
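A minimal sketch of the classification experiment just described is given below, assuming the 47 regional features per subject have already been assembled into a matrix X (one row per subject) with diagnosis labels y. The use of scikit-learn, the feature standardization step, and the particular cross-validation settings are our assumptions, not the paper's stated implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

def cv_accuracy(X, y, n_folds=10, seed=0):
    """
    X : (n_subjects, 47) array of regional rCBF (26 PET regions) and volumes (21 MRI regions).
    y : (n_subjects,) array with 1 for AD and 0 for NC.
    Returns the mean 10-fold cross-validated accuracy of a linear SVM.
    """
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return scores.mean()
```

The same routine, applied to the 45+38 features selected by MSLDA, gives the comparison accuracy quoted above.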
5.4 Relationship between structural atrophy and abnormal rCBF, and severity of cognitive impairment in AD

In addition to classification, it is also of interest to further verify the relevance of the identified disease-related regions to AD in an alternative way. One approach is to investigate the degree to which those disease-related regions are relevant to cognitive impairment, which can be measured by the Alzheimer's disease assessment scale – cognitive subscale (ADAS-cog). ADAS measures the severity of the most important symptoms of AD, while its subscale, ADAS-cog, is the most popular cognitive testing instrument used in clinical trials. The ADAS-cog consists of 11 items measuring disturbances of memory, language, praxis, attention and other cognitive abilities that are often affected by AD. As the total score of these 11 items provides an overall assessment of cognitive impairment, we regress this ADAS-cog total score (the response) against the rCBF or structural volume measurement (the predictor) of each identified brain region, using a simple regression. The regression results are listed in Table 1. It is not surprising to find that some regions in the hippocampus area and temporal lobes are among the best predictors, as these regions are extensively reported in the literature as the most severely affected by AD [3-6]. Also, it is found that most of these brain regions are weak-effect predictors, as most of them can only explain a small portion of the variability in the ADAS-cog total score, i.e., many R-square values in Table 1 are less than 10%. However, although the effects are weak, most of them are significant, i.e., most of the p-values in Table 1 are smaller than 0.05. Furthermore, it is worth noting that 70.22% of the variability in ADAS-cog can be explained by taking all the 26 brain regions identified from PET as predictors in a multiple regression model; 49.72% of the variability can be explained by taking all the 21 brain regions identified from MRI as predictors in a multiple regression model. All these findings imply that the disease-related brain regions are indeed weak-effect features if considered individually, but jointly they can play a strong role in characterizing AD. This verifies the suitability of the proposed SCLDA for AD studies, as SCLDA can preserve weak-effect features.

Table 1: Explanatory power of regional rCBF and structural volume for variability in ADAS-cog ("~" means the region is not identified from PET (or MRI) as a disease-related region by SCLDA).

Brain regions      PET R2   PET p-value   MRI R2   MRI p-value
Precentral_L       0.003    0.503         0.027    0.077
Precentral_R       0.044    0.022         ~        ~
Frontal_Sup_L      0.051    0.013         0.047    0.018
Frontal_Sup_R      0.044    0.023         ~        ~
Frontal_Mid_R      0.056    0.010         0.072    0.003
Frontal_M_O_L      0.036    0.040         0.086    0.001
Frontal_M_O_R      0.019    0.138         0.126    0.000
Insula_L           0.016    0.171         0.163    <10^-4
Insula_R           ~        ~             0.125    0.000
Cingulum_A_R       0.004    0.497         0.082    0.001
Cingulum_Mid_L     0.001    0.733         0.040    0.030
Cingulum_Post_L    0.184    <10^-4        ~        ~
Hippocampus_L      0.158    <10^-4        ~        ~
Hippocampus_R      ~        ~             0.242    <10^-4
ParaHippocamp_L    0.206    <10^-4        ~        ~
Amygdala_L         0.090    0.001         0.313    <10^-4
Calcarine_L        0.038    0.034         0.028    0.070
Lingual_L          0.066    0.005         0.044    0.023
Postcentral_L      0.038    0.035         0.026    0.081
Parietal_Sup_R     0.001    0.677         ~        ~
Angular_R          0.173    <10^-4        0.063    0.006
Precuneus_R        0.063    0.006         0.025    0.084
Paracentr_Lobu_L   0.035    0.043         0.000    0.769
Pallidum_L         0.082    0.001         ~        ~
Pallidum_R         ~        ~             0.020    0.122
Heschl_L           0.001    0.640         ~        ~
Heschl_R           0.000    0.744         0.111    0.000
Temporal_P_S_R     0.008    0.336         0.071    0.003
Temporal_Inf_R     0.187    <10^-4        0.147    <10^-4
All regions        0.702    <10^-4        0.497    <10^-4
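The per-region entries of Table 1 come from simple regressions of the ADAS-cog total score on a single regional measurement, and the "All regions" row comes from a multiple regression on all identified regions jointly. A rough sketch of both computations is given below; the array names and the use of scipy/numpy least squares are our assumptions about the implementation, not the paper's stated code.

```python
import numpy as np
from scipy import stats

def simple_regressions(adas, X, region_names):
    """One simple regression of ADAS-cog on each regional measurement; returns (name, R^2, p)."""
    rows = []
    for j, name in enumerate(region_names):
        slope, intercept, r, p, se = stats.linregress(X[:, j], adas)
        rows.append((name, r ** 2, p))
    return rows

def multiple_regression_r2(adas, X):
    """R^2 when all regions are used jointly as predictors (ordinary least squares)."""
    design = np.column_stack([np.ones(len(adas)), X])     # add an intercept column
    coef, *_ = np.linalg.lstsq(design, adas, rcond=None)
    resid = adas - design @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((adas - adas.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical usage:
# pet_X : (n_subjects, 26) rCBF values of the PET-identified regions
# multiple_regression_r2(adas_total, pet_X)   # corresponds to the ~0.70 reported above
```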
6 Conclusion

In this paper, we proposed the SCLDA model for identifying disease-related brain regions of AD from multi-modality data; it is capable of preserving weak-effect disease-related brain regions because it imposes less shrinkage on its parameters. We applied SCLDA to the PET and MRI data of early AD patients and normal controls. As MRI and PET measure two complementary aspects (structural and functional, respectively) of the same AD pathology, fusing these two image modalities can make effective use of their interaction and thus improve the statistical power in identifying disease-related brain regions. Our findings were consistent with the literature and also revealed some new aspects that may suggest further investigation in neuroimaging research in the future.

References

[1] deToledo-Morrell, L., Stoub, T.R., Bulgakova, M. 2004. MRI-derived entorhinal volume is a good predictor of conversion from MCI to AD. Neurobiol. Aging 25, 1197-1203.
[2] Morra, J.H., Tu, Z. 2008. Validation of automated hippocampal segmentation method. NeuroImage 43, 59-68.
[3] Morra, J.H., Tu, Z. 2009a. Automated 3D mapping of hippocampal atrophy. Hum. Brain Map. 30, 2766-2788.
[4] Morra, J.H., Tu, Z. 2009b. Automated mapping of hippocampal atrophy in 1-year repeat MRI data. NeuroImage 45, 213-221.
[5] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196-1206.
[6] Braak, H., Braak, E. 1991. Neuropathological stageing of Alzheimer-related changes. Acta Neuro. 82, 239-259.
[7] Bradley, K.M., O'Sullivan. 2002. Cerebral perfusion SPET correlated with Braak pathological stage in AD. Brain 125, 1772-1781.
[8] Keilp, J.G., Alexander, G.E. 1996. Inferior parietal perfusion, lateralization, and neuropsychological dysfunction in AD. Brain Cogn. 32, 365-383.
[9] Schroeter, M.L., Stein, T. 2009. Neural correlates of AD and MCI. NeuroImage 47, 1196-1206.
[10] Asllani, I., Habeck, C. 2008. Multivariate and univariate analysis of continuous arterial spin labeling perfusion MRI in AD. J. Cereb. Blood Flow Metab. 28, 725-736.
[11] Du, A.T., Jahng, G.H. 2006. Hypoperfusion in frontotemporal dementia and AD. Neurology 67, 1215-1220.
[12] Ishii, K., Kitagaki, H. 1996. Decreased medial temporal oxygen metabolism in AD. J. Nucl. Med. 37, 1159-1165.
[13] Johnson, N.A., Jahng, G.H. 2005. Pattern of cerebral hypoperfusion in AD. Radiology 234, 851-859.
[14] Wolf, H., Jelic, V. 2003. A critical discussion of the role of neuroimaging in MCI. Acta Neurol. 107(4), 52-76.
[15] Tosun, D., Mojabi, P. 2010. Joint analysis of structural and perfusion MRI for cognitive assessment and classification of AD and normal aging. NeuroImage 52, 186-197.
[16] Alsop, D., Casement, M. 2008. Hippocampal hyperperfusion in Alzheimer's disease. NeuroImage 42, 1267-1274.
[17] Mosconi, L., Tsui, W.-H. 2005. Reduced hippocampal metabolism in MCI and AD. Neurology 64, 1860-1867.
[18] Mulert, C., Lemieux, L. 2010. EEG-fMRI: physiological basis, technique and applications. Springer.
[19] Xu, L., Qiu, C., Xu, P. and Yao, D. 2010. A parallel framework for simultaneous EEG/fMRI analysis: methodology and simulation. NeuroImage 52(3), 1123-1134.
[20] Philiastides, M. and Sajda, P. 2007. EEG-informed fMRI reveals spatiotemporal characteristics of perceptual decision making. Journal of Neuroscience 27(48), 13082-13091.
[21] Daunizeau, J., Grova, C. 2007. Symmetrical event-related EEG/fMRI information fusion. NeuroImage 36, 69-87.
[22] Jagust, W. 2006. PET and MRI in the diagnosis and prediction of dementia. Alzheimer's Dement 2, 36-42.
[23] Kawachi, T., Ishii, K. and Sakamoto, S. 2006. Comparison of the diagnostic performance of FDG-PET and VBM. Eur. J. Nucl. Med. Mol. Imaging 33, 801-809.
[24] Matsunari, I., Samuraki, M. 2007. Comparison of 18F-FDG PET and optimized voxel-based morphometry for detection of AD. J. Nucl. Med. 48, 1961-1970.
[25] Schmidt, M., Fung, G. and Rosales, R. 2007. Fast optimization methods for L1-regularization: a comparative study and two new approaches. ECML 2007.
[26] Liu, J., Ji, S. and Ye, J. 2009. SLEP: Sparse Learning with Efficient Projections. Arizona State University.
[27] Tibshirani, R. 1996. Regression shrinkage and selection via the Lasso. JRSS, Series B, 58(1), 267-288.
[28] Friedman, J., Hastie, T. and Tibshirani, R. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 8(1), 1-10.
[29] Zou, H., Hastie, T. and Tibshirani, R. 2006. Sparse PCA. J. of Comp. and Graphical Statistics 15(2), 262-286.
[30] Qiao, Z., Zhou, L. and Huang, J. 2006. Sparse LDA with applications to high dimensional low sample size data. IAENG Applied Mathematics 39(1).
[31] Argyriou, A., Evgeniou, T. and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3), 243-272.
[32] Huang, S., Li, J., et al. 2010. Learning brain connectivity of AD by sparse inverse covariance estimation. NeuroImage 50, 935-949.
[33] Candes, E., Wakin, M. and Boyd, S. 2008. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications 14(5), 877-905.
[34] Mazumder, R. and Friedman, J. 2009. SparseNet: coordinate descent with non-convex penalties. Manuscript.
[35] Zhang, T. 2008. Multi-stage convex relaxation for learning with sparse regularization. NIPS 2008.
[36] Campbell, N. 1984. Canonical variate analysis - a general formulation. Australian Journal of Statistics 26, 86-96.
[37] Hastie, T. and Tibshirani, R. 1994. Discriminant analysis by Gaussian mixtures. Technical report, AT&T Bell Laboratories.
[38] Kumar, N. and Andreou, G. 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication 26(4), 283-297.
[39] Gasso, G., Rakotomamonjy, A. and Canu, S. 2009. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. Signal Processing 57(12), 4686-4698.
[40] Guo, J., Levina, E., Michailidis, G. and Zhu, J. 2011. Joint estimation of multiple graphical models. Biometrika 98(1), 1-15.
[41] Bertsekas, D. 1982. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim. 20, 221-246.
[42] Clemmensen, L., Hastie, T., Witten, D. and Ersboll, B. 2011. Sparse discriminant analysis. Technometrics (in press).
[43] Friston, K.J., Ashburner, J. 1995. Spatial registration and normalization of images. HBM 2, 89-165.
[44] Tzourio-Mazoyer, N., et al. 2002. Automated anatomical labelling of activations in SPM. NeuroImage 15, 273-289.
[45] Bidzan, L. 2005. Vascular factors in dementia. Psychiatr. Pol. 39, 977-986.
same-paper 2 0.81858951 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
3 0.75854212 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
Author: Feng Yan, Yuan Qi
Abstract: For many real-world applications, we often need to select correlated variables— such as genetic variations and imaging features associated with Alzheimer’s disease—in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experimental results on synthetic data and imaging genetics data demonstrate the superior predictive performance of the EigenNet over the lasso, the elastic net, and the automatic relevance determination. 1
4 0.7006458 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
5 0.69852871 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
6 0.69709712 178 nips-2011-Multiclass Boosting: Theory and Algorithms
7 0.69662434 251 nips-2011-Shaping Level Sets with Submodular Functions
8 0.69607681 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
9 0.69591057 220 nips-2011-Prediction strategies without loss
10 0.69563848 202 nips-2011-On the Universality of Online Mirror Descent
11 0.69481266 80 nips-2011-Efficient Online Learning via Randomized Rounding
12 0.69472188 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
13 0.69250733 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
14 0.69002002 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
15 0.68961239 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
16 0.68900543 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
17 0.68885076 4 nips-2011-A Convergence Analysis of Log-Linear Training
18 0.68818313 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
19 0.68632776 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
20 0.68630463 265 nips-2011-Sparse recovery by thresholded non-negative least squares