nips nips2010 nips2010-217 knowledge-graph by maker-knowledge-mining

217 nips-2010-Probabilistic Multi-Task Feature Selection


Source: pdf

Author: Yu Zhang, Dit-Yan Yeung, Qian Xu

Abstract: Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including q, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data.
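
As a concrete illustration of the norms discussed in the abstract, the sketch below (not taken from the paper) computes ∥W∥1,q under the common row-wise convention, where rows of W index features and columns index tasks, so the regularizer sums the q-norms of the rows; q = 2 and q = ∞ recover the ℓ1,2 and ℓ1,∞ cases. The function name and the toy matrix are illustrative only.

```python
import numpy as np

def l1q_norm(W: np.ndarray, q: float) -> float:
    """Sum of the q-norms of the rows of W (features x tasks).

    Assumes the row-wise convention: a row is penalized jointly
    across all tasks, which is what yields feature selection.
    """
    if q == np.inf:
        row_norms = np.max(np.abs(W), axis=1)            # l_infinity per row
    else:
        row_norms = np.sum(np.abs(W) ** q, axis=1) ** (1.0 / q)
    return float(np.sum(row_norms))

# Toy example: 4 features, 3 tasks; the second feature is unused by every task.
W = np.array([[0.9, 1.1, 0.8],
              [0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0],
              [1.5, 1.4, 1.6]])
for q in (2, 3, np.inf):
    print(q, l1q_norm(W, q))
```

Because the penalty is paid per nonzero row, it drives whole rows of W to zero, which is exactly the joint feature selection across tasks described in the abstract.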

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. [sent-7, score-0.428]

2 In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. [sent-13, score-0.402]

3 Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. [sent-14, score-0.861]

4 Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. [sent-17, score-0.392]

5 Experiments have been conducted on two cancer classification applications using microarray gene expression data. [sent-21, score-0.603]

6 A key advantage of ℓ1 regularization is its ability to enforce sparsity in the solutions. [sent-25, score-0.273]

7 Matrix norms, such as the ℓ1,2 and ℓ1,∞ norms, were proposed to enforce sparsity via joint regularization [24, 17, 28, 1, 2, 15, 20, 16, 18]. [sent-29, score-0.311]

8 Regularizers based on these two matrix norms encourage row sparsity, i.e., [sent-35, score-0.315]

9 they encourage entire rows of the matrix to be entirely zero. [sent-37, score-0.143]

10 Moreover, these norms have also been used for enforcing group sparsity among features in conventional classification and regression problems. [sent-38, score-0.453]

11 Recently, they have been widely used in multi-task learning, compressed sensing and other related areas. [sent-41, score-0.177]

12 In this paper, we study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection [17, 15]. [sent-43, score-0.548]

13 We restrict q to 1 < q ≤ ∞ to ensure that all norms in this family are convex, making it easier to solve the optimization problem formulated based on it. [sent-51, score-0.274]

14 Using the ℓ1,q norm, we formulate the general multi-task feature selection problem and give it a probabilistic interpretation. [sent-57, score-0.416]

15 It is noted that the automatic relevance determination (ARD) prior [9, 3, 26] comes as a special case under this interpretation. [sent-58, score-0.195]

16 Based on this probabilistic interpretation, we develop a probabilistic formulation using a noninformative prior called the Jeffreys prior [10]. [sent-59, score-0.568]

17 Moreover, an underlying assumption of existing multi-task feature selection methods is that all tasks are similar to each other and they share the same features. [sent-62, score-0.482]

18 This assumption may not be correct in practice because there may exist outlier tasks or tasks with negative correlation. [sent-63, score-0.441]

19 As another contribution of this paper, we propose to use a matrix variate generalized normal prior [13] for the model parameters to learn the relationships between tasks. [sent-64, score-0.828]

20 The task relationships learned here can be seen as an extension of the task covariance used in [4, 32, 31]. [sent-65, score-0.235]

21 Experiments will be reported on two cancer classification applications using microarray gene expression data. [sent-66, score-0.603]

22 In applications such as document classification, the feature dimensionality is usually very high and it has been found that linear methods usually perform better. [sent-115, score-0.121]

23 The objective functions of most existing multi-task feature selection methods [24, 17, 28, 1, 2, 15, 20, 16, 18] can be expressed in the following form: a sum over tasks and their training examples of a loss L(y, w_i^T x + b_i), plus a regularization term λR(W). [sent-116, score-0.267]

24 Here R(·) is the regularization function that enforces feature sparsity under the multi-task setting, and λ is the regularization parameter. [sent-144, score-0.293]

25 Multi-task feature selection seeks to minimize the objective function above to obtain the optimal parameters {w_i, b_i}. [sent-146, score-0.267]

26 Two regularization functions are widely used in existing multi-task feature selection methods. [sent-151, score-0.327]

27 We use the ℓ1,q norm of W to define a more general regularization function: R(W) = ∥W∥1,q. [sent-172, score-0.163]

28 When q = 1, each element of W is independent of the others and so the regularization function cannot enforce joint feature sparsity across tasks. [sent-184, score-0.345]

29 Although any q with 1 < q ≤ ∞ can enforce feature sparsity between different tasks, different values of q behave differently. [sent-188, score-0.334]

30 When q approaches 1, the cost grows almost linearly with the number of tasks that use a feature, and when q approaches ∞, only the largest weight matters, so there is no extra cost for additional tasks sharing the feature. [sent-191, score-0.182]
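
To make the generic objective of sentences 23–25 concrete, here is a minimal least-squares instantiation (a sketch under the assumption of a squared loss and the row-wise ℓ1,q regularizer shown after the abstract; the paper's exact loss, scaling and notation may differ).

```python
import numpy as np

def mtfs_objective(W, b, Xs, ys, lam, q):
    """Squared-loss multi-task objective with an l_{1,q} penalty.

    Xs, ys: lists of per-task data matrices / target vectors.
    W: (n_features, n_tasks); b: (n_tasks,). A sketch only -- the
    paper's loss, scaling and notation may differ.
    """
    loss = 0.0
    for i, (X, y) in enumerate(zip(Xs, ys)):
        resid = X @ W[:, i] + b[i] - y
        loss += 0.5 * np.sum(resid ** 2)
    if q == np.inf:
        reg = np.sum(np.max(np.abs(W), axis=1))
    else:
        reg = np.sum(np.sum(np.abs(W) ** q, axis=1) ** (1.0 / q))
    return loss + lam * reg
```

Minimizing this over W and b for a fixed q gives the regularized formulation; the paper instead treats the problem probabilistically and learns q and the hyperparameters with EM.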

31 In the following, we first give a probabilistic interpretation for multi-task feature selection methods. [sent-195, score-0.499]

32 Based on this probabilistic interpretation, we then develop a probabilistic model which, among other things, can solve the model selection problem automatically by estimating q. [sent-196, score-0.479]

33 3 Probabilistic Interpretation In this section, we will show that existing multi-task feature selection methods are related to the maximum a posteriori (MAP) solution of a probabilistic model. [sent-198, score-0.416]

34 This probabilistic interpretation sets the stage for introducing our probabilistic model in the next section. [sent-199, score-0.381]

35 We first introduce the generalized normal distribution [11] which is useful for the model to be introduced. [sent-200, score-0.362]

36 A random variable z is a univariate generalized normal random variable iff its probability density function (p.d.f.) is proportional to exp(−(|z − μ|/ρ)^q). [sent-202, score-0.506]

37 If z is a univariate generalized normal random variable, we write z ∼ 𝒢𝒩(μ, ρ, q). [sent-217, score-0.445]

38 The (ordinary) normal distribution can be viewed as a special case of the generalized normal distribution when q = 2. [sent-224, score-0.619]

39 As q approaches +∞, the generalized normal distribution approaches the uniform distribution on the range [μ − ρ, μ + ρ]. [sent-228, score-0.362]

40 The generalized normal distribution has proven useful in Bayesian analysis and robustness studies. [sent-233, score-0.362]

41 An r × 1 multivariate generalized normal random variable z = (z_1, …, z_r)^T is defined analogously. [sent-235, score-0.431]

42 If z is an r × 1 multivariate generalized normal random variable, we write z ∼ ℳ𝒩(μ, ρ, q). [sent-248, score-0.401]
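
Sentences 36–40 rely on the univariate generalized normal density; the sketch below uses the standard parameterization with location μ, scale ρ and shape q (the paper's exact notation and normalization may differ). Setting q = 2 recovers a Gaussian with variance ρ²/2, and large q flattens the density toward a uniform on [μ − ρ, μ + ρ].

```python
import math

def gen_normal_pdf(z: float, mu: float, rho: float, q: float) -> float:
    """Generalized normal density, standard parameterization:
    p(z) = exp(-(|z - mu| / rho)**q) / (2 * rho * Gamma(1 + 1/q)).
    """
    norm_const = 2.0 * rho * math.gamma(1.0 + 1.0 / q)
    return math.exp(-(abs(z - mu) / rho) ** q) / norm_const

# q = 2 matches a Gaussian with variance rho**2 / 2.
print(gen_normal_pdf(0.3, 0.0, 1.0, 2.0))
# Large q flattens the density towards uniform on [mu - rho, mu + rho].
print(gen_normal_pdf(0.3, 0.0, 1.0, 20.0))
```

scipy.stats.gennorm implements the same family (its shape parameter beta plays the role of q), if a library routine is preferred.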

43 With these definitions, we now begin to present our probabilistic interpretation for multi-task feature selection by proposing a probabilistic model. [sent-270, score-0.648]

44 For notational simplicity, we assume that all tasks perform regression. [sent-271, score-0.182]

45 Extension to include classification tasks follows a similar derivation. [sent-272, score-0.22]

46 For a regression problem, we use the normal distribution to define the likelihood for each data point. [sent-273, score-0.257]

47 Here 𝒩(μ, σ²) denotes the (univariate) normal distribution with mean μ and variance σ². [sent-290, score-0.283]

48 We impose the generalized normal prior on each element of W. [sent-293, score-0.513]

49 When q = 2, this becomes the ARD prior [9, 3, 26] commonly used in Bayesian methods for enforcing sparsity. [sent-321, score-0.154]

50 In this sense, the generalized normal prior can be viewed as a generalization of the ARD prior. [sent-322, score-0.481]
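
A quick worked check of the ARD connection in sentences 49–50 (a sketch assuming, as in the density sketch above, an element-wise prior of the form exp(−|w_ji|^q / ρ_j^q) with one scale ρ_j shared per feature j, where j indexes features and i indexes tasks): setting q = 2 gives

p(w_ji | ρ_j) ∝ exp(−w_ji² / ρ_j²),

which is, up to normalization, a zero-mean normal density with variance ρ_j²/2. Each feature therefore carries its own learned variance, which is the per-feature relevance structure of the ARD prior; other values of q replace the quadratic exponent and change how strongly small entries are pushed to zero.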

51 So the solutions of multi-task feature selection methods can be viewed as solutions of the relaxed optimization problem above. [sent-421, score-0.298]

52 4 A Probabilistic Framework for Multi-Task Feature Selection In the probabilistic interpretation above, we use a type II method to estimate the scale parameters {ρ_i}. [sent-496, score-0.232]

53 The {ρ_i} are the scales of the generalized normal prior, which can be viewed as a generalization of the ARD prior. [sent-498, score-0.481]

54 In the following, we will present our probabilistic framework for multi-task feature selection by imposing priors on the hyperparameters. [sent-503, score-0.416]

55 is also defined based on the normal distribution. [sent-506, score-0.226]

56 We allow different parameters for different tasks to make our model more flexible. [sent-523, score-0.182]

57 We treat ρ_i as a random variable with the noninformative Jeffreys prior. [sent-535, score-0.124]

58 Extension to Deal with Outlier Tasks and Tasks with Negative Correlation An underlying assumption of multi-task feature selection using the ℓ1,q norm [sent-965, score-0.267]

59 ℓ1,q norm is that all tasks are similar to each other and they share the same features. [sent-967, score-0.318]

60 This assumption may not be correct in practice because there may exist outlier tasks (i.e., [sent-968, score-0.259]

61 tasks that are not related to all other tasks) or tasks with negative correlation (i.e., [sent-970, score-0.364]

62 tasks that are negatively correlated with some other tasks). [sent-972, score-0.182]

63 In this section, we will discuss how to extend our probabilistic model to deal with these tasks. [sent-973, score-0.149]

64 We first introduce the matrix variate generalized normal distribution [13] which is a generalization of the generalized normal distribution to random matrices. [sent-974, score-0.999]

65 A random matrix Z is a matrix variate generalized normal random variable iff its p.d.f. has the form defined in [13]. [sent-978, score-0.698]

66 We write Z ∼ ℳ𝒩(M, Σ, Ω, q) for a matrix variate generalized normal random variable Z. [sent-1032, score-0.667]

67 When q = 2, the matrix variate generalized normal distribution becomes the (ordinary) matrix variate normal distribution [12] with row covariance matrix ΣΣ^T and column covariance matrix ΩΩ^T. [sent-1034, score-1.244]

68 From this view, Σ is used to model the relationships between the rows of Z and Ω is used to model the relationships between the columns. [sent-1037, score-0.246]
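
For the q = 2 special case in sentence 67, the (ordinary) matrix variate normal density has a simple Kronecker structure. The sketch below evaluates its log-density with row covariance A (playing the role of ΣΣ^T above) and column covariance B (playing the role of ΩΩ^T above); it illustrates only this special case, not the general matrix variate generalized normal of [13].

```python
import numpy as np

def matrix_normal_logpdf(Z, M, A, B):
    """Log-density of the matrix variate normal MN(M, A, B).

    Z, M: (r, c) matrices; A: (r, r) row covariance; B: (c, c) column
    covariance. Equivalent to a multivariate normal on vec(Z) with
    covariance kron(B, A) -- a sketch of the q = 2 special case only.
    """
    r, c = Z.shape
    D = Z - M
    A_inv_D = np.linalg.solve(A, D)                       # A^-1 (Z - M)
    quad = np.trace(np.linalg.solve(B, D.T) @ A_inv_D)    # tr(B^-1 D^T A^-1 D)
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_B = np.linalg.slogdet(B)
    return -0.5 * (r * c * np.log(2 * np.pi) + c * logdet_A + r * logdet_B + quad)
```

When the columns of Z index tasks, the column covariance plays the role of the task covariance that sentence 20 relates to [4, 32, 31].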

69 Here 0 denotes a zero vector or matrix of proper size and I denotes an identity matrix of proper size. [sent-1051, score-0.127]

70 The likelihood is still based on the normal distribution. [sent-1069, score-0.226]

71 Since in practice the relationships between tasks are not known in advance, we also need to estimate Ω from data. [sent-1070, score-0.285]

72 For a fixed Ω, the problem with respect to W is a convex problem and we use conjugate gradient to solve it with the corresponding subgradient. [sent-1298, score-0.142]

73 5 Related Work Some probabilistic multi-task feature selection methods have been proposed before [28, 2]. [sent-1379, score-0.416]

74 [30] proposed a latent variable model for multi-task learning by using the Laplace prior to enforce sparsity. [sent-1384, score-0.219]

75 This corresponds to the ℓ1,1 norm in our framework which, as discussed above, cannot enforce group sparsity among different features over all tasks. [sent-1387, score-0.346]

76 6 Experiments In this section, we study our methods empirically on two cancer classification applications using microarray gene expression data. [sent-1388, score-0.603]

77 We compare our methods with three related methods: multi-task feature learning (MTFL) [1], multi-task feature selection using [sent-1389, score-0.388]

78 ℓ1,2 regularization [16], and multi-task feature selection using ℓ1,∞ regularization. [sent-1390, score-0.327]

79 Breast Cancer Classification We first conduct an empirical study on a breast cancer classification application. [sent-1393, score-0.527]

80 This application consists of three learning tasks with data collected under different platforms [21]. [sent-1394, score-0.268]

81 The dataset for the first task, collected at the Koo Foundation Sun Yat-Sen Cancer Centre in Taipei, contains 89 samples with 8948 genes per sample. [sent-1395, score-0.123]

82 The dataset for the third task, obtained using 22K Agilent oligonucleotide arrays, contains 114 samples with 12065 genes per sample. [sent-1398, score-0.117]

83 Even though these three datasets were collected under different platforms, they share 6092 common genes which are used in our experiments. [sent-1399, score-0.156]

84 This verifies the usefulness of exploiting the relationships between tasks in multi-task feature selection. [sent-1409, score-0.406]

85 Table 1: Comparison of different methods on the breast cancer classification application in terms of classification error rate (in mean±std-dev). [sent-1418, score-0.527]

86 Prostate Cancer Classification We next study a prostate cancer classification application consisting of two tasks. [sent-1451, score-0.576]

87 The RMA preprocessing method was used to produce gene expression values from these images. [sent-1453, score-0.165]

88 On the other hand, the Welsh dataset [25] for the second task is already in the form of gene expression values. [sent-1464, score-0.231]

89 Table 2: Comparison of different methods on the prostate cancer classification application in terms of classification error rate (in mean±std-dev). [sent-1474, score-0.576]

90 Concluding Remarks In this paper, we have proposed a probabilistic framework for general multi-task feature selection using the ℓ1,q norm. [sent-1496, score-0.416]

91 Besides considering the case in which all tasks are similar, we have also considered the more general and challenging case in which there also exist outlier tasks or tasks with negative correlation. [sent-1502, score-0.623]

92 Compressed sensing aims at recovering the sparse signal w from a measurement vector b = Aw for a given matrix A. [sent-1503, score-0.284]

93 Compressed sensing can be extended to the multiple measurement vector (MMV) model in which the signals are represented as a set of jointly sparse vectors sharing a common set of nonzero elements [7, 6, 23]. [sent-1504, score-0.18]

94 Specifically, joint compressed sensing considers the reconstruction of the signal represented by a matrix W, which is given by a dictionary (or measurement matrix) A and multiple measurement vectors B such that B = AW. [sent-1505, score-0.441]

95 Similar to multi-task feature selection, we can use ∥W∥1,q as the regularizer. [sent-1506, score-0.121]

96 So we can use the probabilistic model presented in Section 4 to develop a probabilistic model for joint compressed sensing. [sent-1512, score-0.429]
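
As an illustration of the joint-sparsity idea in the MMV setting (a sketch, not the probabilistic model of Section 4): for the q = 2 member of the family, the regularized problem min_W 0.5∥B − AW∥²_F + λ∥W∥1,2 can be solved by proximal gradient descent, whose proximal step soft-thresholds whole rows of W.

```python
import numpy as np

def mmv_l12_recover(A, B, lam, step=None, n_iter=500):
    """Proximal gradient for min_W 0.5*||B - A W||_F^2 + lam*||W||_{1,2}.

    A: (n, m) measurement matrix, B: (n, k) multiple measurements.
    Row-wise soft-thresholding is the proximal operator of the l_{1,2}
    norm (sum of row 2-norms). A sketch for the q = 2 case only.
    """
    n, m = A.shape
    k = B.shape[1]
    W = np.zeros((m, k))
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = A.T @ (A @ W - B)
        V = W - step * grad
        row_norms = np.linalg.norm(V, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam / np.maximum(row_norms, 1e-12), 0.0)
        W = shrink * V                               # row-wise soft threshold
    return W
```

The same row-grouping is what ∥W∥1,q enforces in multi-task feature selection, with the columns of W indexing tasks rather than measurement vectors.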

97 Joint covariate selection and joint subspace selection for multiple classification problems. [sent-1644, score-0.368]

98 Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. [sent-1735, score-0.322]

99 Model selection and estimation in regression with grouped variables. [sent-1760, score-0.177]

100 A convex formulation for learning task relationships in multi-task learning. [sent-1772, score-0.199]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ln', 0.35), ('cancer', 0.343), ('normal', 0.226), ('ard', 0.216), ('variate', 0.205), ('th', 0.185), ('tasks', 0.182), ('norms', 0.176), ('prostate', 0.157), ('probabilistic', 0.149), ('selection', 0.146), ('generalized', 0.136), ('jeffreys', 0.126), ('det', 0.123), ('feature', 0.121), ('sparsity', 0.112), ('breast', 0.108), ('mtfl', 0.107), ('gene', 0.105), ('relationships', 0.103), ('norm', 0.103), ('enforce', 0.101), ('noninformative', 0.094), ('compressed', 0.093), ('prior', 0.088), ('genes', 0.086), ('sensing', 0.084), ('interpretation', 0.083), ('univariate', 0.083), ('outlier', 0.077), ('classi', 0.071), ('matrix', 0.07), ('enforcing', 0.066), ('task', 0.066), ('element', 0.063), ('mmv', 0.063), ('welsh', 0.063), ('em', 0.063), ('measurement', 0.061), ('regularization', 0.06), ('expression', 0.06), ('xiong', 0.058), ('wipf', 0.058), ('microarray', 0.057), ('denotes', 0.057), ('sign', 0.056), ('unify', 0.051), ('platforms', 0.049), ('reweighted', 0.045), ('aw', 0.044), ('hong', 0.043), ('mm', 0.043), ('relevance', 0.042), ('standardized', 0.041), ('diag', 0.041), ('conjugate', 0.041), ('zhang', 0.041), ('devise', 0.04), ('rows', 0.04), ('multivariate', 0.039), ('rao', 0.039), ('cation', 0.038), ('joint', 0.038), ('collected', 0.037), ('gupta', 0.037), ('gradient', 0.036), ('row', 0.036), ('sparse', 0.035), ('surrogate', 0.035), ('laplace', 0.035), ('liu', 0.035), ('solve', 0.035), ('chapman', 0.034), ('signal', 0.034), ('singh', 0.034), ('ordinary', 0.034), ('determination', 0.033), ('encourage', 0.033), ('share', 0.033), ('family', 0.032), ('automatic', 0.032), ('iff', 0.031), ('besides', 0.031), ('taipei', 0.031), ('qian', 0.031), ('febbo', 0.031), ('ladd', 0.031), ('sapinoso', 0.031), ('richie', 0.031), ('jackson', 0.031), ('laser', 0.031), ('oligonucleotide', 0.031), ('renshaw', 0.031), ('formulated', 0.031), ('viewed', 0.031), ('regression', 0.031), ('variable', 0.03), ('convex', 0.03), ('column', 0.03), ('group', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 217 nips-2010-Probabilistic Multi-Task Feature Selection

Author: Yu Zhang, Dit-Yan Yeung, Qian Xu

Abstract: Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including q, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data.

2 0.24572971 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization

Author: Feiping Nie, Heng Huang, Xiao Cai, Chris H. Ding

Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint ℓ2,1-norm minimization on both loss function and regularization. The ℓ2,1-norm based loss function is robust to outliers in data points and the ℓ2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.

3 0.16409305 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Author: Yi Zhang, Jeff G. Schneider

Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.

4 0.15281466 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

Author: Rina Foygel, Mathias Drton

Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1

5 0.10699146 258 nips-2010-Structured sparsity-inducing norms through submodular functions

Author: Francis R. Bach

Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1-norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.

6 0.10125393 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

7 0.094783343 231 nips-2010-Robust PCA via Outlier Pursuit

8 0.090912342 23 nips-2010-Active Instance Sampling via Matrix Partition

9 0.085056119 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

10 0.083788581 265 nips-2010-The LASSO risk: asymptotic results and real world examples

11 0.081768513 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

12 0.080456495 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

13 0.079262294 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

14 0.078961581 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

15 0.077456653 89 nips-2010-Factorized Latent Spaces with Structured Sparsity

16 0.077054769 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings

17 0.076697223 5 nips-2010-A Dirty Model for Multi-task Learning

18 0.073773548 146 nips-2010-Learning Multiple Tasks using Manifold Regularization

19 0.073704235 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference

20 0.073295131 183 nips-2010-Non-Stochastic Bandit Slate Problems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.24), (1, 0.052), (2, 0.078), (3, 0.034), (4, 0.037), (5, -0.1), (6, -0.02), (7, 0.075), (8, -0.158), (9, -0.043), (10, -0.009), (11, 0.13), (12, 0.075), (13, 0.045), (14, 0.041), (15, 0.008), (16, -0.059), (17, 0.095), (18, 0.1), (19, 0.035), (20, -0.132), (21, -0.062), (22, -0.024), (23, 0.091), (24, 0.006), (25, 0.024), (26, 0.038), (27, -0.001), (28, 0.103), (29, 0.171), (30, -0.051), (31, -0.062), (32, -0.069), (33, -0.147), (34, 0.13), (35, -0.014), (36, -0.027), (37, 0.003), (38, -0.002), (39, 0.094), (40, -0.06), (41, 0.009), (42, 0.092), (43, -0.158), (44, -0.102), (45, -0.075), (46, -0.06), (47, -0.099), (48, -0.006), (49, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95975703 217 nips-2010-Probabilistic Multi-Task Feature Selection

Author: Yu Zhang, Dit-Yan Yeung, Qian Xu

Abstract: Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including q, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data.

2 0.83724642 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization

Author: Feiping Nie, Heng Huang, Xiao Cai, Chris H. Ding

Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint ℓ2,1-norm minimization on both loss function and regularization. The ℓ2,1-norm based loss function is robust to outliers in data points and the ℓ2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.

3 0.71184033 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Author: Yi Zhang, Jeff G. Schneider

Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.

4 0.57371736 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

Author: Rina Foygel, Mathias Drton

Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1

5 0.53498673 172 nips-2010-Multi-Stage Dantzig Selector

Author: Ji Liu, Peter Wonka, Jieping Ye

Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ R^{n×m} (m ≫ n) and a noisy observation vector y ∈ R^n satisfying y = Xβ* + ε where ε is the noise vector following a Gaussian distribution N(0, σ²I), how to recover the signal (or parameter vector) β* when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β*. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β* measured in terms of the ℓp norm (p ≥ 1) is bounded as ∥β̂ − β*∥p ≤ C(s − N)^{1/p} (√(log m) + Δ) σ, where C is a constant, s is the number of nonzero entries in β*, Δ is independent of m and is much smaller than the first term, and N is the number of entries of β* larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from C s^{1/p} √(log m) σ to C(s − N)^{1/p} √(log m) σ where the value N depends on the number of large entries in β*. When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.

6 0.52571112 5 nips-2010-A Dirty Model for Multi-task Learning

7 0.50945175 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference

8 0.5045467 91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS

9 0.49177909 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods

10 0.49138853 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

11 0.47882175 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection

12 0.4695456 231 nips-2010-Robust PCA via Outlier Pursuit

13 0.46913937 45 nips-2010-CUR from a Sparse Optimization Viewpoint

14 0.46807718 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression

15 0.46620402 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

16 0.43817413 23 nips-2010-Active Instance Sampling via Matrix Partition

17 0.43689591 204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks

18 0.43616071 290 nips-2010-t-logistic regression

19 0.42350733 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

20 0.41221452 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.091), (17, 0.014), (27, 0.07), (30, 0.062), (35, 0.083), (45, 0.204), (50, 0.114), (52, 0.043), (60, 0.014), (71, 0.153), (77, 0.042), (90, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92411906 247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies

Author: Paul Mckeigue, Jon Krohn, Amos J. Storkey, Felix V. Agakov

Abstract: This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as “instruments” to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with ∼10^5 genetic instruments and ∼47 × 10^3 gene transcripts.

2 0.87040669 203 nips-2010-Parametric Bandits: The Generalized Linear Case

Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári

Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1

same-paper 3 0.8575024 217 nips-2010-Probabilistic Multi-Task Feature Selection

Author: Yu Zhang, Dit-Yan Yeung, Qian Xu

Abstract: Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including q, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data.

4 0.84002823 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

Author: Pierre Garrigues, Bruno A. Olshausen

Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model. 1

5 0.8299619 117 nips-2010-Identifying graph-structured activation patterns in networks

Author: James Sharpnack, Aarti Singh

Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1

6 0.82529455 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

7 0.81902111 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

8 0.81823462 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection

9 0.81747371 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

10 0.81694442 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

11 0.81416196 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

12 0.81360936 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

13 0.81268793 5 nips-2010-A Dirty Model for Multi-task Learning

14 0.80935574 265 nips-2010-The LASSO risk: asymptotic results and real world examples

15 0.80880213 148 nips-2010-Learning Networks of Stochastic Differential Equations

16 0.80729491 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups

17 0.80657488 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA

18 0.80640024 258 nips-2010-Structured sparsity-inducing norms through submodular functions

19 0.80587143 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

20 0.805206 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning