nips nips2012 nips2012-221 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. [sent-6, score-0.229]
2 Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. [sent-8, score-0.201]
3 In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. [sent-9, score-0.157]
4 Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. [sent-11, score-0.215]
5 Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. [sent-12, score-0.215]
6 1 Introduction Multi-task learning (MTL) exploits the relationships among multiple related tasks to improve the generalization performance. [sent-13, score-0.08]
7 In this paper, we focus on multi-task feature learning, in which we learn the features specific to each task as well as the common features shared among tasks. [sent-16, score-0.263]
8 Although many multi-task feature learning algorithms have been proposed, most of them assume that the relevant features are shared by all tasks. [sent-17, score-0.154]
9 Jalali et al. (2010) [9] proposed an ℓ1 + ℓ1,∞ regularized formulation, called the dirty model, to leverage the common features shared among tasks. [sent-20, score-0.301]
10 The dirty model allows a certain feature to be shared by some tasks but not all tasks. [sent-21, score-0.322]
11 Jalali et al. (2010) also presented a theoretical analysis under the incoherence condition [5, 15], which is more restrictive than the RIP condition [3, 27]. [sent-23, score-0.179]
12 The ℓ1 + ℓ1,∞ regularizer is a convex relaxation of the ℓ0-type one, which, however, is too loose to approximate the ℓ0-type regularizer well and usually achieves suboptimal performance (requiring restrictive conditions or obtaining a suboptimal error bound) [23, 26, 27]. [sent-24, score-0.197]
13 To remedy the shortcoming, we propose to use a non-convex regularizer for multi-task feature learning in this paper. [sent-25, score-0.091]
14 Contributions: We propose to employ a capped-ℓ1,ℓ1 regularized formulation (non-convex) to learn the features specific to each task as well as the common features shared among tasks. [sent-27, score-0.258]
15 Specifically, we present a detailed theoretical analysis on the parameter estimation error bound for the MSMTFL algorithm. [sent-30, score-0.215]
16 Our analysis shows that, under the sparse eigenvalue condition which is weaker than the incoherence condition in Jalali et al. [sent-31, score-0.306]
17 (2010) [9], MSMTFL improves the error bound during the multi-stage iteration, i.e., [sent-32, score-0.105]
18 the error bound at the current iteration improves the one at the previous iteration. [sent-34, score-0.105]
19 Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of the MSMTFL algorithm in comparison with the state of the art algorithms. [sent-35, score-0.096]
20 Matrices and sets are denoted by capital letters and calligraphic capital letters, respectively. [sent-37, score-0.093]
21 For a d×m matrix W and sets Ii ⊆ Nd ×{i}, I ⊆ Nd ×Nm, we let wIi be a d×1 vector with the j-th entry being wji if (j, i) ∈ Ii, and 0 otherwise. [sent-42, score-0.106]
22 We also let WI be a d × m matrix with the (j, i)-th entry being wji , if (j, i) ∈ I, and 0, otherwise. [sent-43, score-0.106]
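As a concrete illustration of this masking notation, here is a minimal numpy sketch; the helper names are ours, not the paper's, and zero-based indexing is assumed.

```python
import numpy as np

def mask_matrix(W, index_set):
    """Return W_I: same shape as W, keeping w_ji for (j, i) in index_set, zero elsewhere."""
    WI = np.zeros_like(W)
    for (j, i) in index_set:
        WI[j, i] = W[j, i]
    return WI

def mask_vector(W, Ii):
    """Return w_{I_i}: a d-dimensional vector whose j-th entry is w_ji when (j, i) is in I_i,
    and zero otherwise (all pairs in I_i share the same task index i)."""
    w = np.zeros(W.shape[0])
    for (j, i) in Ii:
        w[j] = W[j, i]
    return w

# Example: keep only entries (0, 1) and (2, 1) of a 4 x 2 weight matrix.
W = np.arange(8, dtype=float).reshape(4, 2)
print(mask_matrix(W, {(0, 1), (2, 1)}))
print(mask_vector(W, {(0, 1), (2, 1)}))
```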
23 We consider learning a weight matrix W = [w1 , · · · , wm ] ∈ Rd×m consisting of the weight vectors for m linear predictive models: yi ≈ fi (Xi ) = Xi wi , i ∈ Nm . [sent-45, score-0.184]
24 In this paper, we propose a non-convex multi-task feature learning formulation to learn these m models simultaneously, based on the capped-ℓ1,ℓ1 regularization. [sent-46, score-0.106]
25 In this paper, we focus on the quadratic loss function: l(W) = ∑_{i=1}^{m} (1/(m ni)) ∥Xi wi − yi∥². [sent-50, score-0.064]
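A minimal sketch of this quadratic loss, assuming each Xi is an ni×d numpy array, yi an ni-vector, and W the d×m weight matrix; the function name is ours.

```python
import numpy as np

def quadratic_loss(W, X_list, y_list):
    """l(W) = sum_i ||X_i w_i - y_i||^2 / (m * n_i), summed over the m tasks."""
    m = len(X_list)
    total = 0.0
    for i, (Xi, yi) in enumerate(zip(X_list, y_list)):
        ni = Xi.shape[0]
        residual = Xi @ W[:, i] - yi
        total += residual @ residual / (m * ni)
    return total
```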
26 For a nonzero row (w⋆)k, some entries may be zero, due to the ℓ1-norm imposed on each row of W. [sent-55, score-0.117]
27 Under the formulation in Eq. (1), a certain feature can be shared by some tasks but not all the tasks. [sent-57, score-0.18]
28 Therefore, the proposed formulation can leverage the common features shared among tasks. [sent-58, score-0.172]
29 Note that if we terminate the algorithm with ℓ = 1, the MSMTFL algorithm is equivalent to the ℓ1 regularized multi-task feature learning algorithm (Lasso). [sent-62, score-0.093]
30 Specifically, we will theoretically show that the solution obtained by Algorithm 1 improves the parameter estimation error bound during the multi-stage iteration. [sent-65, score-0.194]
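Below is a hedged sketch of the multi-stage idea: each stage solves a reweighted ℓ1 problem in which a row of W keeps its ℓ1 penalty only while its ℓ1 norm stays below the cap θ, so a single stage coincides with the Lasso baseline mentioned above. The proximal-gradient inner solver, step size, and iteration counts are our own choices for illustration and are not taken from the paper's Algorithm 1.

```python
import numpy as np

def msmtfl(X_list, y_list, lam, theta, n_stages=5, inner_iters=500):
    """Multi-stage multi-task feature learning sketch with a capped-l1,l1 penalty.
    Each stage solves  min_W  l(W) + sum_j lam_j * ||W[j, :]||_1  by proximal gradient,
    where lam_j = lam if the j-th row of the previous estimate has l1 norm below theta,
    and lam_j = 0 otherwise (the row has escaped the cap)."""
    d, m = X_list[0].shape[1], len(X_list)
    W = np.zeros((d, m))
    # crude global step size from the largest per-task Lipschitz constant of the loss gradient
    lr = 1.0 / max(2.0 * np.linalg.norm(Xi, 2) ** 2 / (m * Xi.shape[0]) for Xi in X_list)
    row_weights = lam * np.ones(d)              # stage 1: plain l1 penalty on every row
    for _ in range(n_stages):
        for _ in range(inner_iters):            # ISTA inner loop
            G = np.zeros_like(W)
            for i, (Xi, yi) in enumerate(zip(X_list, y_list)):
                G[:, i] = 2.0 * Xi.T @ (Xi @ W[:, i] - yi) / (m * Xi.shape[0])
            V = W - lr * G
            thr = lr * row_weights[:, None]     # soft-threshold each entry with its row's weight
            W = np.sign(V) * np.maximum(np.abs(V) - thr, 0.0)
        # reweight: rows whose l1 norm already exceeds theta are no longer penalized
        row_weights = lam * (np.abs(W).sum(axis=1) < theta).astype(float)
    return W
```

Setting n_stages=1 reproduces the ℓ1-regularized (Lasso) special case noted above.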
31 3 Theoretical Analysis In this section, we theoretically analyze the parameter estimation performance of the solution obtained by the MSMTFL algorithm. [sent-68, score-0.089]
32 To simplify the notation in the theoretical analysis, we assume that all tasks have the same number of samples. [sent-69, score-0.127]
33 However, our theoretical analysis can be easily extended to the case where the tasks have different sample sizes. [sent-70, score-0.077]
34 We first present a sub-Gaussian noise assumption, which is standard in the sparse regularization literature [23, 25, 26, 27]. [sent-71, score-0.128]
35 We next introduce the following sparse eigenvalue concept, which is also common in the sparse regularization literature [22, 23, 25, 26, 27]. [sent-78, score-0.209]
36 ρ+_i(k) = max_{∥w∥0 ≤ k} ∥Xi w∥²/(n∥w∥²), ρ−_i(k) = min_{∥w∥0 ≤ k} ∥Xi w∥²/(n∥w∥²), with ρ+_max(k) = max_{i∈Nm} ρ+_i(k) and ρ−_min(k) = min_{i∈Nm} ρ−_i(k). Remark 3 ρ+_i(k) (ρ−_i(k)) is in fact the maximum (minimum) eigenvalue of (Xi)_S^T(Xi)_S/n, where S is a set satisfying |S| ≤ k and (Xi)_S is a submatrix composed of the columns of Xi indexed by S. [sent-80, score-0.092]
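Following Remark 3, these sparse eigenvalues can be evaluated by brute force for small d and k: take the extreme eigenvalues of (Xi)_S^T(Xi)_S/n over all supports with |S| ≤ k. The sketch below is ours and is only feasible for tiny problems.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(Xi, k):
    """Return (rho_plus, rho_minus): the max/min eigenvalue of (Xi)_S^T (Xi)_S / n
    over all column subsets S with |S| <= k (brute-force enumeration)."""
    n, d = Xi.shape
    rho_plus, rho_minus = -np.inf, np.inf
    for size in range(1, k + 1):
        for S in combinations(range(d), size):
            eigs = np.linalg.eigvalsh(Xi[:, S].T @ Xi[:, S] / n)
            rho_plus = max(rho_plus, eigs[-1])
            rho_minus = min(rho_minus, eigs[0])
    return rho_plus, rho_minus

# rho+_max(k) and rho-_min(k) over the m tasks are then plain max/min reductions:
# rho_plus_max = max(sparse_eigenvalues(Xi, k)[0] for Xi in X_list)
# rho_minus_min = min(sparse_eigenvalues(Xi, k)[1] for Xi in X_list)
```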
37 We present our parameter estimation error bound on MSMTFL in the following theorem: Theorem 1 Let Assumption 1 hold. [sent-82, score-0.194]
38 Define F̄i = {(j, i) : w̄ji ≠ 0} and F̄ = ∪_{i∈Nm} F̄i. [sent-83, score-0.082]
39 If we choose λ and θ such that, for some s ≥ r̄, λ ≥ 12σ√(2ρ+_max(1) ln(2dm/η)/n) and θ ≥ 11mλ/ρ−_min(2r̄ + s), then the following parameter estimation error bound holds with probability larger than 1 − η: [sent-86, score-0.194]
40 Eq. (3) assumes that the ℓ1-norm of each nonzero row of W̄ is bounded away from zero. [sent-94, score-0.083]
41 Eq. (4) is called the sparse eigenvalue condition [27], which requires the eigenvalue ratio ρ+_i(s)/ρ−_i(s) to grow sub-linearly with respect to s. [sent-97, score-0.248]
42 Such a condition is very common in the analysis of sparse regularization [22, 25] and is slightly weaker than the RIP condition [3, 27]. [sent-98, score-0.246]
43 Eq. (7) dominates the error bound, which is of the order ∥Ŵ^Lasso − W̄∥2,1 = O(m√(r̄ ln(dm/η)/n)) (8), since λ satisfies the condition in Eq. [sent-100, score-0.16]
44 When ℓ is sufficiently large, in the order of O(ln(m r̄/n) + ln ln(dm)), this term tends to zero and we obtain the following parameter estimation error bound: ∥Ŵ^(ℓ) − W̄∥2,1 = O(m√(r̄/n + ln(1/η)/n)). [sent-104, score-0.242]
45 Jalali et al. (2010) [9] gave an ℓ∞,∞-norm error bound ∥Ŵ^Dirty − W̄∥∞,∞ = O(·), as well as a sign consistency result between Ŵ and W̄. [sent-106, score-0.105]
46 On the other hand, the worst-case estimate of the ℓ2,1-norm error bound of the algorithm in Jalali et al. (2010) [9] is of the same order as Eq. [sent-108, score-0.105]
47 When dm is large and the ground truth W̄ has a large number of sparse rows (i.e. [sent-111, score-0.085]
48 Jalali et al. (2010) [9] presented an ℓ∞,∞-norm parameter estimation error bound and hence a sign consistency result can be obtained. [sent-116, score-0.194]
49 The results are derived under the incoherence condition which is more restrictive than the RIP condition and hence more restrictive than the sparse eigenvalue condition in Eq. [sent-117, score-0.411]
50 From the viewpoint of the parameter estimation error, our proposed algorithm can achieve a better bound under weaker conditions. [sent-119, score-0.163]
51 Please refer to [19, 25, 27] for more details about the incoherence condition, the RIP condition, the sparse eigenvalue condition and their relationships. [sent-120, score-0.221]
52 Remark 7 The capped-ℓ1 regularized formulation in Zhang (2010) [26] is a special case of our formulation when m = 1. [sent-121, score-0.101]
53 Different from previous work on multi-stage sparse learning which focuses on a single task [26, 27], we study a more general multi-stage framework in the multi-task setting. [sent-123, score-0.082]
54 We need to exploit the relationship among tasks, by using the relations of the sparse eigenvalues ρ+_i(k) (ρ−_i(k)) and treating the ℓ1-norm of each row of the weight matrix as a whole. [sent-124, score-0.152]
55 Lemma 1 Let Ῡ = [ϵ̄1, · · · , ϵ̄m] with ϵ̄i = [ϵ̄1i, · · · , ϵ̄di]^T = (1/n) Xi^T(Xi w̄i − yi) (i ∈ Nm). [sent-127, score-0.064]
56 Under the conditions of Assumption 1 and the notation of Theorem 1, the following hold with probability larger than 1 − η: (10) ∥Ῡ∥∞,∞ ≤ σ√(2ρ+_max(1) ln(2dm/η)/n); (11) ∥Ῡ_H̄∥²_F ≤ mσ²ρ+_max(r̄)(7. [sent-129, score-0.09]
57 This lemma provides a fundamental basis for the proof of Theorem 1. [sent-138, score-0.067]
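A small sketch of the quantity Ῡ in Lemma 1, under the same numpy layout as before and with equal sample sizes n across tasks, as assumed in the analysis; the function name is ours.

```python
import numpy as np

def noise_correlation(X_list, y_list, W_bar):
    """Upsilon_bar = [eps_1, ..., eps_m] with eps_i = X_i^T (X_i w_bar_i - y_i) / n."""
    d, m = X_list[0].shape[1], len(X_list)
    Upsilon = np.zeros((d, m))
    for i, (Xi, yi) in enumerate(zip(X_list, y_list)):
        n = Xi.shape[0]
        Upsilon[:, i] = Xi.T @ (Xi @ W_bar[:, i] - yi) / n
    return Upsilon

# The bound in Eq. (10) controls the sup-norm of this matrix:
# np.abs(noise_correlation(X_list, y_list, W_bar)).max()
```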
58 If 2 max_i ∥ϵ̄i∥∞ < λ̂_G, then at any stage ℓ ≥ 1, ∑_{(j,i)∈G} |ŵ^(ℓ)_ji| is bounded by a multiple (involving 2∥Ῡ∥∞,∞ + λ̂_0) of ∑_{(j,i)∈G^c} |Δŵ^(ℓ)_ji|. [sent-144, score-0.081]
59 Lemma 2 says that ∥ΔŴ^(ℓ)_G∥1,1 = ∥Ŵ^(ℓ)_G∥1,1 is upper bounded in terms of ∥ΔŴ^(ℓ)_{G^c}∥1,1, which indicates that the error of the estimated coefficients lying outside of F̄ should be small enough. [sent-146, score-0.061]
60 This provides an intuitive explanation why the parameter estimation error of our algorithm can be small. [sent-147, score-0.15]
61 Eq. (12) provides a parameter estimation error bound in terms of the ℓ2,1-norm, via ∥Ῡ_{(G^(ℓ))^c}∥²_F and the regularization parameters λ̂^(ℓ−1)_ji (see the definition of λ̂_ji (λ̂^(ℓ−1)_ji) in Lemma 2). [sent-157, score-0.397]
62 Eq. (13) states that the error bound is upper bounded in terms of λ, the right-hand side of which constitutes the shrinkage part of the error bound in Eq. [sent-160, score-0.231]
63 It bounds ∑_{(j,i)∈H̄} λ̂²_ji by ∥Ŵ_H̄ − W̄_H̄∥²2,1, which is critical for building the recursive relationship between ∥Ŵ^(ℓ) − W̄∥2,1 and ∥Ŵ^(ℓ−1) − W̄∥2,1 in the proof of Theorem 1. [sent-166, score-0.516]
64 This recursive relation is crucial for the shrinkage part of the error bound in Eq. [sent-167, score-0.105]
65 ∥Ŵ^(ℓ) − W̄∥²2,1 = ∥ΔŴ^(ℓ)∥²2,1 ≤ 2s̄(4∥Ῡ_{G^(ℓ)}∥²_F + ∑_{(j,i)∈F̄}(λ̂_ji)²)/(ρ−_min(2r̄ + s))² ≤ 78m(4u + (37/36)mλ²θ⁻²∥Ŵ^(ℓ−1) − W̄∥²2,1)/(ρ−_min(2r̄ + s))². [sent-178, score-0.08]
66 Substituting Eq. [sent-194, score-0.08]
67 5.1 Synthetic Data Experiments We generate synthetic data by setting the number of tasks as m, where each task has n samples of dimensionality d; each element of the data matrix Xi ∈ Rn×d (i ∈ Nm) for the i-th task is sampled i.i.d. [sent-200, score-0.151]
68 from the uniform distribution in the interval [−10, 10]; we randomly set 90% of the rows of W̄ to zero vectors and 80% of the entries in the remaining nonzero rows to zero; each entry of the noise δi ∈ Rn is sampled i.i.d. [sent-206, score-0.073]
69 from the Gaussian distribution N(0, σ²); the responses are computed as yi = Xi w̄i + δi (i ∈ Nm). [sent-209, score-0.064]
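A sketch of this synthetic data generator; since the extract does not say how the nonzero entries of W̄ are drawn, sampling them from a standard normal below is our assumption.

```python
import numpy as np

def generate_synthetic(m=15, n=40, d=250, sigma=0.01, seed=0,
                       zero_row_frac=0.90, zero_entry_frac=0.80):
    """Synthetic multi-task data following the description above: X_i has entries
    Uniform[-10, 10]; 90% of the rows of W_bar are zero vectors and 80% of the entries
    in the remaining rows are zeroed out; y_i = X_i w_bar_i + Gaussian noise N(0, sigma^2).
    Nonzero entries of W_bar are drawn from N(0, 1) -- an assumption, not from the paper."""
    rng = np.random.default_rng(seed)
    W_bar = rng.standard_normal((d, m))
    zero_rows = rng.choice(d, size=int(zero_row_frac * d), replace=False)
    W_bar[zero_rows, :] = 0.0
    nonzero_rows = np.setdiff1d(np.arange(d), zero_rows)
    kill = rng.random((len(nonzero_rows), m)) < zero_entry_frac
    W_bar[nonzero_rows, :] = np.where(kill, 0.0, W_bar[nonzero_rows, :])
    X_list, y_list = [], []
    for i in range(m):
        Xi = rng.uniform(-10.0, 10.0, size=(n, d))
        y_list.append(Xi @ W_bar[:, i] + rng.normal(0.0, sigma, size=n))
        X_list.append(Xi)
    return X_list, y_list, W_bar
```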
70 We first report the averaged parameter estimation error ∥Ŵ − W̄∥2,1 vs. the stage number ℓ (Figure 1). [sent-210, score-0.206]
71 We observe that the error decreases as ℓ increases, which shows the advantage of our proposed algorithm over Lasso. [sent-212, score-0.061]
72 Moreover, the parameter estimation error decreases quickly and converges in a few stages. [sent-214, score-0.15]
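For reference, the reported error can be computed as below, assuming the ℓ2,1 norm is the sum of the row-wise ℓ2 norms (the usual convention in multi-task feature learning).

```python
import numpy as np

def l21_error(W_hat, W_bar):
    """||W_hat - W_bar||_{2,1}: sum over rows of the row-wise l2 norm of the difference."""
    return float(np.linalg.norm(W_hat - W_bar, axis=1).sum())
```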
73 [Figure 1 panel: m=15, n=40, d=250, σ=0.0005; x-axis: Stage; y-axis: Parameter estimation error (ℓ2,1)]
74 We then report the averaged parameter estimation error ∥Ŵ − W̄∥2,1 for the four algorithms under different parameter settings (Figure 2). [sent-223, score-0.178]
75 For a fair comparison, we compare the smallest estimation errors of the four algorithms in all the parameter settings [25, 26]. [sent-224, score-0.148]
76 As expected, the parameter estimation error of the MSMTFL algorithm is the smallest among the four algorithms. [sent-225, score-0.233]
77 We also have the following observations: (a) When λ is large enough, all four algorithms tend to have the same parameter estimation error. [sent-227, score-0.119]
78 [Figure 1 panel: m=15, n=40, d=250, σ=0.001; y-axis: Parameter estimation error (ℓ2,1)]
79 Figure 1: Averaged parameter estimation error ∥Ŵ − W̄∥2,1 vs. the stage number ℓ.
80 Figure 2: Averaged parameter estimation error ∥Ŵ − W̄∥2,1 vs. the regularization parameter λ.
81 MSMTFL has the smallest parameter estimation error among the four algorithms. [sent-247, score-0.233]
82 Thus, we have 6 tasks with each task corresponding to a time point and the sample sizes corresponding to 6 tasks are 648, 642, 293, 569, 389 and 87, respectively. [sent-259, score-0.143]
83 (2) The Isolet data set2 is collected from 150 speakers who speak the name of each English letter of the alphabet twice. [sent-260, score-0.076]
84 Thus, we naturally have 5 tasks with each task corresponding to a subset. [sent-263, score-0.087]
85 The 5 tasks respectively have 1560, 1560, 1560, 1558, and 1559 samples (Three samples are historically missing), where each sample includes 617 features and the response is the English letter label (1-26). [sent-264, score-0.118]
86 For both data sets, we randomly extract the training samples from each task with different training ratios (15%, 20% and 25%) and use the rest of samples to form the test set. [sent-266, score-0.075]
87 We evaluate the four multi-task feature learning algorithms in terms of normalized mean squared error (nMSE) and averaged mean squared error (aMSE), which are commonly used. [sent-267, score-0.276]
88 Table 1: Comparison of four multi-task feature learning algorithms on the MRI data set in terms of averaged nMSE and aMSE (standard deviation), which are averaged over 10 random splittings. [sent-273, score-0.21]
89 For each training ratio, both nMSE and aMSE are averaged over 10 random splittings of training and test sets and the standard deviation is also shown. [sent-329, score-0.1]
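A hedged sketch of this evaluation protocol: split each task by the given training ratio and score the estimated weight matrix on the held-out samples. The extract does not define nMSE and aMSE precisely; below, nMSE normalizes each task's test MSE by the variance of its test responses and aMSE by their mean squared value, which are common conventions and should be checked against the paper.

```python
import numpy as np

def split_by_ratio(X_list, y_list, ratio, seed=0):
    """Randomly take `ratio` of each task's samples for training and keep the rest for test."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for Xi, yi in zip(X_list, y_list):
        perm = rng.permutation(Xi.shape[0])
        n_tr = int(round(ratio * Xi.shape[0]))
        tr, te = perm[:n_tr], perm[n_tr:]
        train.append((Xi[tr], yi[tr]))
        test.append((Xi[te], yi[te]))
    return train, test

def nmse_amse(W_hat, test):
    """Average over tasks of the per-task test MSE normalized by (a) the variance of the
    test responses (nMSE) and (b) their mean squared value (aMSE); both are assumptions."""
    nmse_vals, amse_vals = [], []
    for i, (Xi, yi) in enumerate(test):
        mse = np.mean((Xi @ W_hat[:, i] - yi) ** 2)
        nmse_vals.append(mse / np.var(yi))
        amse_vals.append(mse / np.mean(yi ** 2))
    return float(np.mean(nmse_vals)), float(np.mean(amse_vals))
```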
90 From these results, we observe that: (a) Our proposed MSMTFL algorithm outperforms all the competing feature learning algorithms on both data sets, with the smallest regression errors (nMSE and aMSE) as well as the smallest standard deviations. [sent-352, score-0.126]
91 The performance for the 15% training ratio is comparable to that for the 25% training ratio. [sent-354, score-0.082]
92 6 Conclusions In this paper, we propose a non-convex multi-task feature learning formulation based on the capped-ℓ1,ℓ1 regularization. [sent-357, score-0.106]
93 The proposed formulation learns the specific features of each task as well as the common features shared among tasks. [sent-358, score-0.233]
94 We also present a detailed theoretical analysis in terms of the parameter estimation error bound for the MSMTFL algorithm. [sent-360, score-0.215]
95 The analysis shows that our MSMTFL algorithm achieves good performance under the sparse eigenvalue condition, which is weaker than the incoherence condition. [sent-361, score-0.196]
96 Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our proposed MSMTFL algorithm in comparison with the state of the art multi-task feature learning algorithms. [sent-362, score-0.164]
97 In our future work, we will focus on a general non-convex regularization framework for multi-task feature learning settings (involving different loss functions and non-convex regularization terms) and derive theoretical bounds. [sent-363, score-0.151]
98 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-425, score-0.084]
99 The sparsity and bias of the lasso selection in high-dimensional linear regression. [sent-496, score-0.084]
100 A general theory of concave regularization for high dimensional sparse estimation problems. [sent-501, score-0.17]
wordName wordTfidf (topN-words)
[('msmtfl', 0.768), ('dirtymtl', 0.202), ('nm', 0.182), ('ji', 0.172), ('amse', 0.162), ('dirty', 0.142), ('paramter', 0.121), ('nmse', 0.118), ('jalali', 0.113), ('gi', 0.111), ('mri', 0.105), ('ln', 0.092), ('wg', 0.088), ('lasso', 0.084), ('wji', 0.082), ('feature', 0.068), ('lemma', 0.067), ('estimation', 0.065), ('incoherence', 0.063), ('error', 0.061), ('isolet', 0.06), ('wh', 0.056), ('shared', 0.056), ('tasks', 0.056), ('averaged', 0.056), ('condition', 0.055), ('rip', 0.054), ('eigenvalue', 0.052), ('sparse', 0.051), ('notations', 0.05), ('fi', 0.049), ('nonzero', 0.049), ('wj', 0.047), ('ye', 0.046), ('remark', 0.046), ('stage', 0.045), ('speakers', 0.044), ('bound', 0.044), ('wi', 0.041), ('followings', 0.04), ('rni', 0.04), ('restrictive', 0.04), ('min', 0.04), ('ratio', 0.038), ('formulation', 0.038), ('effectiveness', 0.037), ('inequality', 0.036), ('row', 0.034), ('dm', 0.034), ('gong', 0.034), ('synthetic', 0.033), ('xi', 0.033), ('zhang', 0.033), ('multistage', 0.033), ('letter', 0.032), ('tx', 0.032), ('nn', 0.032), ('task', 0.031), ('regularization', 0.031), ('letters', 0.031), ('capital', 0.031), ('mmse', 0.031), ('mtl', 0.031), ('tsinghua', 0.031), ('features', 0.03), ('four', 0.03), ('weaker', 0.03), ('wm', 0.029), ('smallest', 0.029), ('exp', 0.029), ('tresp', 0.028), ('arizona', 0.028), ('art', 0.026), ('negahban', 0.025), ('sigkdd', 0.025), ('suboptimal', 0.025), ('regularized', 0.025), ('english', 0.025), ('among', 0.024), ('common', 0.024), ('entry', 0.024), ('parameter', 0.024), ('concave', 0.023), ('theorem', 0.023), ('regularizer', 0.023), ('evgeniou', 0.023), ('plots', 0.023), ('pages', 0.023), ('yi', 0.023), ('obozinski', 0.022), ('relations', 0.022), ('assumption', 0.022), ('training', 0.022), ('rd', 0.022), ('side', 0.021), ('theoretical', 0.021), ('weight', 0.021), ('please', 0.021), ('multitask', 0.02), ('mini', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 221 nips-2012-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
2 0.071037687 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
Author: Yi Wu, David P. Wipf
Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1
3 0.070360348 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
Author: Arnak Dalalyan, Yin Chen
Abstract: In this paper, we develop a novel approach to the problem of learning sparse representations in the context of fused sparsity and unknown noise level. We propose an algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes the aforementioned learning task by means of a second-order cone program. A special emphasize is put on the particular instance of fused sparsity corresponding to the learning in presence of outliers. We establish finite sample risk bounds and carry out an experimental evaluation on both synthetic and real data. 1
4 0.064785786 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
5 0.064568587 324 nips-2012-Stochastic Gradient Descent with Only One Projection
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T/T) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1
6 0.062184598 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
7 0.059690122 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
8 0.056332089 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
9 0.054211836 319 nips-2012-Sparse Prediction with the $k$-Support Norm
10 0.053755872 304 nips-2012-Selecting Diverse Features via Spectral Regularization
11 0.051668815 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
12 0.051079124 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
13 0.05043507 16 nips-2012-A Polynomial-time Form of Robust Regression
14 0.050346855 187 nips-2012-Learning curves for multi-task Gaussian process regression
15 0.049590148 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
16 0.049042255 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
17 0.048034277 34 nips-2012-Active Learning of Multi-Index Function Models
18 0.047888085 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
19 0.04765154 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
20 0.04747786 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
topicId topicWeight
[(0, 0.142), (1, 0.028), (2, 0.056), (3, -0.039), (4, 0.055), (5, 0.036), (6, -0.006), (7, 0.034), (8, -0.021), (9, -0.03), (10, 0.011), (11, -0.007), (12, -0.028), (13, -0.001), (14, 0.011), (15, -0.011), (16, 0.044), (17, 0.037), (18, 0.018), (19, -0.023), (20, 0.008), (21, 0.018), (22, -0.025), (23, 0.001), (24, -0.01), (25, 0.023), (26, 0.002), (27, 0.015), (28, -0.011), (29, -0.025), (30, -0.028), (31, 0.007), (32, 0.041), (33, -0.047), (34, -0.041), (35, 0.033), (36, 0.014), (37, -0.047), (38, -0.046), (39, 0.012), (40, -0.076), (41, -0.003), (42, -0.027), (43, -0.002), (44, -0.036), (45, 0.083), (46, 0.024), (47, 0.037), (48, -0.031), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.88179213 221 nips-2012-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
2 0.71380502 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
3 0.68243009 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
4 0.67867178 319 nips-2012-Sparse Prediction with the $k$-Support Norm
Author: Andreas Argyriou, Rina Foygel, Nathan Srebro
Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
5 0.66998488 247 nips-2012-Nonparametric Reduced Rank Regression
Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty
Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1
6 0.66869795 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
7 0.63937044 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
8 0.63407165 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning
9 0.62329888 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
10 0.61955363 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
11 0.61015564 271 nips-2012-Pointwise Tracking the Optimal Regression Function
12 0.60910994 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
13 0.59730172 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
14 0.59537798 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
15 0.58800685 86 nips-2012-Convex Multi-view Subspace Learning
16 0.58643675 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem
17 0.58334035 254 nips-2012-On the Sample Complexity of Robust PCA
18 0.57831633 304 nips-2012-Selecting Diverse Features via Spectral Regularization
19 0.57550716 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
20 0.57101053 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
topicId topicWeight
[(0, 0.039), (3, 0.028), (17, 0.011), (21, 0.015), (38, 0.142), (39, 0.023), (42, 0.036), (44, 0.014), (54, 0.035), (64, 0.174), (74, 0.041), (76, 0.178), (80, 0.065), (92, 0.057), (94, 0.01), (96, 0.013)]
simIndex simValue paperId paperTitle
1 0.88850552 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
2 0.87264955 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
3 0.87116557 147 nips-2012-Graphical Models via Generalized Linear Models
Author: Eunho Yang, Genevera Allen, Zhandong Liu, Pradeep K. Ravikumar
Abstract: Undirected graphical models, also known as Markov networks, enjoy popularity in a variety of applications. The popular instances of these models such as Gaussian Markov Random Fields (GMRFs), Ising models, and multinomial discrete models, however do not capture the characteristics of data in many settings. We introduce a new class of graphical models based on generalized linear models (GLMs) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate multivariate Markov networks given any univariate exponential distribution, such as Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We also provide examples of non-Gaussian high-throughput genomic networks learned via our GLM graphical models. 1
same-paper 4 0.86105919 221 nips-2012-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
5 0.85344511 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
6 0.80798137 304 nips-2012-Selecting Diverse Features via Spectral Regularization
7 0.80279696 163 nips-2012-Isotropic Hashing
8 0.80155259 148 nips-2012-Hamming Distance Metric Learning
9 0.80153596 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
10 0.80144233 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
11 0.80140805 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
12 0.80077744 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
13 0.79985023 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
14 0.79961878 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
15 0.79919732 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
16 0.79905367 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
17 0.7979396 38 nips-2012-Algorithms for Learning Markov Field Policies
18 0.79756361 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
19 0.79755193 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means
20 0.79705775 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization