nips nips2011 nips2011-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jiayu Zhou, Jianhui Chen, Jieping Ye
Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. [sent-5, score-0.109]
2 Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. [sent-6, score-0.17]
3 As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i. [sent-8, score-0.295]
4 , tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. [sent-10, score-0.292]
5 The objectives in ASO and CMTL differ in how multiple tasks are related. [sent-11, score-0.135]
6 Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. [sent-12, score-0.119]
7 The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. [sent-13, score-0.226]
8 We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. [sent-14, score-0.565]
9 In addition, we present three algorithms for solving the convex CMTL formulation. [sent-15, score-0.084]
10 A naive approach is to apply single task learning (STL) where each task is solved independently and thus the task relatedness is not exploited. [sent-18, score-0.21]
11 Recently, there has been growing interest in multi-task learning (MTL), where we learn multiple related tasks simultaneously by extracting appropriate shared information across tasks. [sent-19, score-0.154]
12 In MTL, multiple tasks are expected to benefit from each other, resulting in improved generalization performance. [sent-20, score-0.109]
13 Many different MTL approaches have been proposed in the past; they differ in how the relatedness among different tasks is modeled. [sent-23, score-0.161]
14 [8] proposed the regularized MTL which constrained the models of all tasks to be close to each other. [sent-25, score-0.107]
15 The task relatedness can also be modeled by constraining multiple tasks to share a common underlying structure [4, 6, 9, 10]. [sent-26, score-0.236]
16 Ando and Zhang [5] proposed a structural learning formulation, which assumed multiple predictors for different tasks shared a common structure on the underlying predictor space. [sent-27, score-0.193]
17 For linear predictors, they proposed the alternating structure optimization (ASO) that simultaneously performed inference on multiple tasks and discovered the shared low-dimensional predictive structure. [sent-28, score-0.309]
18 One limitation of the original ASO formulation is that it involves a non-convex optimization problem and a globally optimal solution is not guaranteed. [sent-30, score-0.123]
19 A convex relaxation of ASO called CASO was proposed and analyzed in [13]. [sent-31, score-0.151]
20 Many existing MTL formulations are based on the assumption that all tasks are related. [sent-32, score-0.118]
21 In practical applications, the tasks may exhibit a more sophisticated group structure where the models of tasks from the same group are closer to each other than those from a different group. [sent-33, score-0.199]
22 There has been much prior work along this line of research, known as clustered multi-task learning (CMTL). [sent-34, score-0.093]
23 In [14], the mutual relatedness of tasks was estimated and knowledge of one task could be transferred to other tasks in the same cluster. [sent-35, score-0.284]
24 Bakker and Heskes [15] used clustered multi-task learning in a Bayesian setting by considering a mixture of Gaussians instead of single Gaussian priors. [sent-36, score-0.093]
25 [8] proposed the task clustering regularization and showed how cluster information could be encoded in MTL; however, the group structure was required to be known a priori. [sent-38, score-0.19]
26 In [17], a clustered MTL framework was proposed that simultaneously identified clusters and performed multi-task inference. [sent-41, score-0.111]
27 Because the formulation is non-convex, they also proposed a convex relaxation to obtain a global optimum [17]. [sent-42, score-0.244]
28 [18] used a similar idea to consider clustered tasks by introducing an inter-task regularization. [sent-44, score-0.182]
29 , ASO which aims to identify a shared low-dimensional predictive structure for all tasks) which are based on the standard assumption that each task can learn equally well from any other task. [sent-47, score-0.151]
30 Next, we show that the spectral relaxation of the clustering (on tasks) in CMTL and the projection (on the features) in ASO lead to an identical regularization, related to the negative Ky Fan k-norm of the weight matrix involving all task models, thus establishing their equivalence relationship. [sent-50, score-0.257]
31 One major limitation of the ASO/CMTL formulation is that it involves a non-convex optimization, as the negative Ky Fan k-norm is concave. [sent-53, score-0.093]
32 We propose a convex relaxation of CMTL, and establish the equivalence relationship between the proposed convex relaxation of CMTL and the convex ASO formulation proposed in [13]. [sent-54, score-0.565]
33 We show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. [sent-55, score-0.169]
34 We further develop three algorithms for solving the convex CMTL formulation based on the block coordinate descent, accelerated projected gradient, and gradient descent, respectively. [sent-56, score-0.251]
35 …, (x^i_{n_i}, y^i_{n_i})} ⊂ R^d × R, and a linear predictive function f_i: f_i(x^i_j) = w_i^T x^i_j, where w_i is the weight vector of the i-th task, d is the data dimensionality, and n_i is the number of samples of the i-th task. [sent-67, score-0.21]
36 2.1 Alternating structure optimization. In ASO [5], all tasks are assumed to share a common feature space Θ ∈ R^{h×d}, where h ≤ min(m, d) is the dimensionality of the shared feature space and Θ has orthonormal rows, i.e., ΘΘ^T = I_h. [sent-76, score-0.245]
37 The predictive function of ASO is: f_i(x^i_j) = w_i^T x^i_j = u_i^T x^i_j + v_i^T Θ x^i_j, where the weight w_i = u_i + Θ^T v_i consists of two components: the weight u_i for the high-dimensional feature space and the weight v_i for the low-dimensional space based on Θ. [sent-79, score-0.355]
38 ASO minimizes the following objective function: L(W) + α Σ_{i=1}^m ||u_i||_2^2, subject to ΘΘ^T = I_h, where α is the regularization parameter for task relatedness. [sent-80, score-0.092]
39 We can further include a penalty, β Σ_{i=1}^m ||w_i||_2^2, to improve the generalization performance as in traditional supervised learning. [sent-81, score-0.153]
40 Since u_i = w_i − Θ^T v_i, we obtain the following ASO formulation: [sent-82, score-0.145]
41 min_{W, {v_i}, Θ: ΘΘ^T = I_h} L(W) + Σ_{i=1}^m ( α ||w_i − Θ^T v_i||_2^2 + β ||w_i||_2^2 ). (1) [sent-83, score-0.12]
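As a concrete illustration of Eq. (1), the following NumPy sketch evaluates the ASO objective for a least-squares loss; the specific loss, the variable names, and the data layout (tasks as columns of W) are assumptions made here for illustration, not part of the original formulation.

```python
import numpy as np

def aso_objective(W, V, Theta, Xs, ys, alpha, beta):
    """Evaluate Eq. (1) with a least-squares loss L(W).

    W:     d x m matrix whose i-th column is the task weight vector w_i.
    V:     h x m matrix whose i-th column is the low-dimensional weight v_i.
    Theta: h x d projection with orthonormal rows (Theta @ Theta.T = I_h).
    Xs, ys: per-task data matrices (n_i x d) and target vectors (n_i,).
    """
    loss = sum(np.sum((X @ W[:, i] - y) ** 2)
               for i, (X, y) in enumerate(zip(Xs, ys)))
    residual = W - Theta.T @ V            # columns are w_i - Theta^T v_i
    return loss + alpha * np.sum(residual ** 2) + beta * np.sum(W ** 2)
```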
42 2.2 Clustered multi-task learning. In CMTL, we assume that the tasks are clustered into k < m clusters, and the index set of the j-th cluster is defined as I_j = {v | v ∈ cluster j}. [sent-84, score-0.268]
43 2.3 Equivalence of ASO and CMTL. In the ASO formulation in Eq. (1), the optimal v_i for a given Θ is v_i^* = Θ w_i. [sent-91, score-0.093]
44 Thus, the penalty in ASO has the following equivalent form: Ω_ASO(W, Θ) = Σ_{i=1}^m ( α ||w_i − Θ^T Θ w_i||_2^2 + β ||w_i||_2^2 ) = α ( tr(W^T W) − tr(W^T Θ^T Θ W) ) + β tr(W^T W), (6) resulting in the following equivalent ASO formulation: [sent-93, score-0.674]
45 min_{W, Θ: ΘΘ^T = I_h} L(W) + Ω_ASO(W, Θ). (7) [sent-94, score-0.119]
46 The penalty of the ASO formulation in Eq. (7) looks very similar to the penalty of the CMTL formulation in Eq. (5). [sent-95, score-0.119]
47 However, in the CMTL formulation in Eq. (5) the matrix F operates on the task dimension, as it is derived from the K-means clustering on the tasks, while in the ASO formulation in Eq. (7) the matrix Θ operates on the feature dimension, as it aims to identify a shared low-dimensional predictive structure for all tasks. [sent-98, score-0.199] [sent-99, score-0.139]
49 The CMTL formulation in Eq. (5) and the ASO formulation in Eq. (7) are equivalent if the cluster number, k, in K-means equals the size, h, of the shared low-dimensional feature space. [sent-105, score-0.105]
50 Denote Q(W) = L(W) + (α + β) tr(W^T W), with α, β > 0. [sent-107, score-0.165]
51 Then, CMTL and ASO solve the following optimization problems: min_{W, F: F^T F = I_k} Q(W) − α tr(W F F^T W^T) and min_{W, Θ: ΘΘ^T = I_h} Q(W) − α tr(W^T Θ^T Θ W), respectively. [sent-108, score-0.426]
52 Thus, the optimal F and Θ for these two optimization problems are given by solving: [CMTL] max_{F: F^T F = I_k} tr(W F F^T W^T), [ASO] max_{Θ: ΘΘ^T = I_k} tr(W^T Θ^T Θ W). [sent-110, score-0.36]
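Both maximization problems above attain the sum of the k largest squared singular values of W (the negative Ky Fan k-norm connection mentioned earlier), with the optimal F and Θ built from the top-k right and left singular vectors of W, respectively. The small NumPy check below (with illustrative sizes chosen here) verifies this numerically; it is a sanity check, not part of the paper.

```python
import numpy as np

# Both relaxed problems attain the sum of the top-k squared singular values of W.
rng = np.random.default_rng(0)
d, m, k = 30, 10, 3
W = rng.standard_normal((d, m))                 # d x m weight matrix, columns = tasks

U, s, Vt = np.linalg.svd(W, full_matrices=False)
ky_fan = np.sum(s[:k] ** 2)                     # sum of top-k squared singular values

F = Vt[:k].T                                    # m x k: top-k right singular vectors
Theta = U[:, :k].T                              # k x d: top-k left singular vectors

cmtl_val = np.trace(W @ F @ F.T @ W.T)          # spectral-relaxed CMTL objective
aso_val = np.trace(W.T @ Theta.T @ Theta @ W)   # ASO objective

assert np.allclose([cmtl_val, aso_val], ky_fan)
```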
53 3 Convex Relaxation of CMTL. The CMTL formulation in Eq. (5) is non-convex. [sent-114, score-0.093]
54 A natural approach is to perform a convex relaxation on CMTL. [sent-116, score-0.133]
55 We first reformulate the penalty in Eq. (5) as follows: Ω_CMTL0(W, F) = α tr(W ((1 + η) I − F F^T) W^T), (8) where η is defined as η = β/α > 0. [sent-118, score-0.165]
56 We then rewrite Eq. (8) in the following equivalent form: Ω_CMTL1(W, F) = αη(1 + η) tr(W (ηI + F F^T)^{-1} W^T). [sent-121, score-0.165]
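Since F F^T is an orthogonal projection (its eigenvalues are 0 or 1 whenever F has orthonormal columns), the two penalty expressions above coincide exactly. A quick numerical check, with sizes and parameter values chosen arbitrarily here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, alpha, beta = 20, 8, 3, 2.0, 0.5
eta = beta / alpha
W = rng.standard_normal((d, m))
F, _ = np.linalg.qr(rng.standard_normal((m, k)))       # m x k with F^T F = I_k

P = F @ F.T                                            # orthogonal projection of rank k
omega0 = alpha * np.trace(W @ ((1 + eta) * np.eye(m) - P) @ W.T)   # Omega_CMTL0, Eq. (8)
omega1 = alpha * eta * (1 + eta) * np.trace(
    W @ np.linalg.inv(eta * np.eye(m) + P) @ W.T)                  # Omega_CMTL1
assert np.isclose(omega0, omega1)
```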
57 Following [13, 17], we obtain the following convex relaxation of Eq. (10): min_{W, M} L(W) + Ω_cCMTL(W, M), s.t. tr(M) = k, M ⪯ I, M ∈ S_+^m, (11) [sent-123, score-0.075]
58 where Ω_cCMTL(W, M) is defined as: Ω_cCMTL(W, M) = αη(1 + η) tr(W (ηI + M)^{-1} W^T). [sent-127, score-0.165]
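For concreteness, here is a small NumPy sketch of the relaxed regularizer Ω_cCMTL and of a feasibility check for the constraint set in Eq. (11); the helper names and the tolerance are assumptions made for illustration.

```python
import numpy as np

def omega_ccmtl(W, M, alpha, eta):
    """Omega_cCMTL(W, M) = alpha * eta * (1 + eta) * tr(W (eta*I + M)^{-1} W^T)."""
    m = M.shape[0]
    inv_times_Wt = np.linalg.solve(eta * np.eye(m) + M, W.T)   # (eta*I + M)^{-1} W^T
    return alpha * eta * (1 + eta) * np.trace(W @ inv_times_Wt)

def is_feasible(M, k, tol=1e-8):
    """Constraint set of Eq. (11): tr(M) = k, 0 <= M <= I, M symmetric positive semidefinite."""
    eig = np.linalg.eigvalsh((M + M.T) / 2)
    return abs(np.trace(M) - k) < tol and eig.min() > -tol and eig.max() < 1 + tol
```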
59 3.1 Equivalence of cASO and cCMTL. A convex relaxation (cASO) of the ASO formulation in Eq. (7) was proposed in [13]: [sent-131, score-0.058]
60 min_{W, S} L(W) + Ω_cASO(W, S), s.t. tr(S) = h, S ⪯ I, S ∈ S_+^d, (13) where Ω_cASO is defined as: Ω_cASO(W, S) = αη(1 + η) tr(W^T (ηI + S)^{-1} W). [sent-134, score-0.33]
61 The cCMTL formulation in Eq. (11) and the cASO formulation in Eq. (13) are equivalent if the cluster number, k, in K-means equals the size, h, of the shared low-dimensional feature space. [sent-144, score-0.105]
62 Define the following two convex functions of W: g_cCMTL(W) = min_M tr(W (ηI + M)^{-1} W^T), s.t. tr(M) = k, M ⪯ I, M ∈ S_+^m, (15) [sent-146, score-0.256]
63 and g_cASO(W) = min_S tr(W^T (ηI + S)^{-1} W), s.t. tr(S) = h, S ⪯ I, S ∈ S_+^d. (16) [sent-148, score-0.363]
64 The cCMTL and cASO formulations can then be written as optimization problems with respect to W: [cCMTL] min_W L(W) + c · g_cCMTL(W), [cASO] min_W L(W) + c · g_cASO(W), where c = αη(1 + η). [sent-154, score-0.066]
65 Let M = P_1 Λ_1 P_1^T and W = Q_1 Σ Q_2^T denote the eigendecomposition of M and the SVD of W; it follows from the basic properties of the trace that: tr(W (ηI + M)^{-1} W^T) = tr((ηI + Λ_1)^{-1} P_1^T Q_2 Σ^2 Q_2^T P_1). [sent-168, score-0.33]
66 The problem in Eq. (15) is thus equivalent to: min_{P_1, Λ_1} tr((ηI + Λ_1)^{-1} P_1^T Q_2 Σ^2 Q_2^T P_1), s.t. tr(Λ_1) = k, 0 ⪯ Λ_1 ⪯ I. [sent-170, score-0.198]
67 It follows that g_cCMTL(W) = tr((ηI + Λ_1^*)^{-1} Σ^2). [sent-176, score-0.165]
68 Similarly, we can show that g_cASO(W) = tr((ηI + Λ_2^*)^{-1} Σ^2), where Λ_2^* = argmin_{Λ_2} Σ_{i=1}^q σ_i^2 / (η + λ_i^{(2)}), s.t. tr(Λ_2) = h, 0 ⪯ Λ_2 ⪯ I. [sent-177, score-0.165]
69 In many practical MTL problems the data dimensionality d is much larger than the task number m, and in such cases cCMTL is significantly more efficient in terms of both time and space. [sent-188, score-0.078]
70 We propose three algorithms, i.e., the Alternating Optimization Method (altCMTL), the Accelerated Projected Gradient Method (apgCMTL), and the Direct Gradient Descent Method (graCMTL), for solving the convex relaxation in Eq. (11). [sent-193, score-0.159]
71 The altCMTL algorithm involves the following two steps in each iteration: (Optimization of W) For a fixed M, the optimal W can be obtained via solving: min_W L(W) + c tr(W (ηI + M)^{-1} W^T). [sent-201, score-0.198]
72 (Optimization of M) For a fixed W, the optimal M can be obtained via solving: min_M tr(W (ηI + M)^{-1} W^T), s.t. tr(M) = k, M ⪯ I, M ∈ S_+^m. [sent-206, score-0.198]
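The sketch below gives one possible NumPy realization of this alternating scheme, under assumptions made explicit here: a least-squares loss, a plain fixed-step gradient descent for the W-step, d ≥ m, and an M-step that exploits the spectral structure of the subproblem (M aligned with the right singular vectors of W, with eigenvalues found by bisection on a Lagrange multiplier). The function and parameter names (m_step, w_step, alt_cmtl, lr, outer_iters) are illustrative, not from the paper.

```python
import numpy as np

def m_step(W, eta, k):
    """M-step: minimize tr(W (eta*I + M)^{-1} W^T) over {M : 0 <= M <= I, tr(M) = k}.
    Assumes d >= m and k <= rank(W); extra trace mass could otherwise go to the null space."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    if sigma.max() < 1e-12:                       # degenerate W: any feasible M is optimal
        return (k / W.shape[1]) * np.eye(W.shape[1])
    lo, hi = 1e-12, 1e12                          # bracket for the multiplier gamma
    for _ in range(200):
        gamma = np.sqrt(lo * hi)                  # bisection in log-space
        lam = np.clip(sigma / np.sqrt(gamma) - eta, 0.0, 1.0)
        if lam.sum() > k:                         # the sum decreases as gamma grows
            lo = gamma
        else:
            hi = gamma
    return Vt.T @ np.diag(lam) @ Vt

def w_step(W, M, Xs, ys, c, eta, lr=1e-3, iters=500):
    """W-step: gradient descent on sum_i ||X_i w_i - y_i||^2 + c*tr(W (eta*I + M)^{-1} W^T)."""
    A = np.linalg.inv(eta * np.eye(M.shape[0]) + M)
    for _ in range(iters):
        G = np.column_stack([X.T @ (X @ W[:, i] - y)
                             for i, (X, y) in enumerate(zip(Xs, ys))])
        W = W - lr * (2.0 * G + 2.0 * c * W @ A)
    return W

def alt_cmtl(Xs, ys, k, alpha, beta, outer_iters=20):
    d, m = Xs[0].shape[1], len(Xs)
    eta = beta / alpha
    c = alpha * eta * (1.0 + eta)
    W = np.zeros((d, m))
    M = (k / m) * np.eye(m)                       # feasible start: tr(M) = k
    for _ in range(outer_iters):
        W = w_step(W, M, Xs, ys, c, eta)
        M = m_step(W, eta, k)
    return W, M
```

In practice a line search or a closed-form W-step would replace the fixed step size used in this sketch.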
73 4.2 Accelerated Projected Gradient Method. The accelerated projected gradient method (APG) has been applied to solve many machine learning formulations [24]. [sent-216, score-0.103]
74 We apply APG to solve the cCMTL formulation in Eq. (11). [sent-217, score-0.093]
75 The key subproblem in each iteration of apgCMTL is a constrained minimization over (W_Z, M_Z) subject to: tr(M_Z) = k, M_Z ⪯ I, M_Z ∈ S_+^m, (21) where the details about the construction of Ŵ_S and M̂_S can be found in [24]. [sent-222, score-0.165]
76 The update of M_Z is subject to: tr(M_Z) = k, M_Z ⪯ I, M_Z ∈ S_+^m, (23) where M̂_S is not guaranteed to be positive semidefinite. [sent-234, score-0.165]
77 The problem in Eq. (23) admits an analytical solution via solving a simple convex projection problem. [sent-236, score-0.1]
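One standard way to realize such a projection (assumed here, since the paper defers the details to [24]) is to project the symmetrized M̂_S onto the set {M : 0 ⪯ M ⪯ I, tr(M) = k} by eigen-decomposing it and projecting the eigenvalues onto a capped simplex via bisection on a scalar shift. A NumPy sketch, with a hypothetical function name:

```python
import numpy as np

def project_capped_trace(M_hat, k, iters=100):
    """Euclidean projection of a symmetric matrix onto {M : 0 <= M <= I, tr(M) = k}."""
    M_hat = (M_hat + M_hat.T) / 2.0               # symmetrize first
    lam, Q = np.linalg.eigh(M_hat)
    # Project eigenvalues onto {x : 0 <= x_i <= 1, sum(x) = k}: x_i = clip(lam_i - t, 0, 1)
    # for the shift t solving sum_i clip(lam_i - t, 0, 1) = k (monotone in t -> bisection).
    lo, hi = lam.min() - 1.0, lam.max()
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if np.clip(lam - t, 0.0, 1.0).sum() > k:
            lo = t
        else:
            hi = t
    x = np.clip(lam - 0.5 * (lo + hi), 0.0, 1.0)
    return Q @ np.diag(x) @ Q.T
```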
78 Given the intermediate solution W_{k−1} from the (k−1)-th iteration of graCMTL, we compute the gradient of g_cCMTL(W) and then apply the general gradient descent scheme [25] to obtain W_k. [sent-242, score-0.086]
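The gradient of g_cCMTL can be computed from the minimizing M for the current W (for instance, via the m_step sketch above): treating that minimizer as fixed (a Danskin-type argument), the gradient of tr(W (ηI + M^*)^{-1} W^T) with respect to W is 2 W (ηI + M^*)^{-1}. A minimal helper with an assumed interface:

```python
import numpy as np

def g_ccmtl_gradient(W, M_star, eta):
    """Gradient of g_cCMTL at W, given M_star minimizing Eq. (15) for this W.
    Treating the minimizer as fixed, grad_W tr(W (eta*I + M*)^{-1} W^T)
    equals 2 * W @ inv(eta*I + M*)."""
    m = M_star.shape[0]
    return 2.0 * W @ np.linalg.inv(eta * np.eye(m) + M_star)
```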
79 In the ground truth there are 100 tasks clustered into 5 groups. [sent-250, score-0.182]
80 5 Experiments. In this section, we empirically evaluate the effectiveness and the efficiency of the proposed algorithms on synthetic and real-world data sets. [sent-257, score-0.065]
81 Simulation Study. We apply the proposed cCMTL formulation in Eq. (11) on a synthetic data set (with a pre-defined cluster structure). [sent-262, score-0.111] [sent-263, score-0.07]
83 We construct the synthetic data set following a procedure similar to the one in [17]: it consists of 5 clusters, where each cluster includes 20 (regression) tasks and each task is represented by a weight vector of length d = 300. [sent-265, score-0.259]
84 From the result we can observe that (1) cCMTL is able to capture the cluster structure among tasks and achieves a small test error; (2) RegMTL is better than RidgeSTL in terms of test error. [sent-269, score-0.153]
85 It, however, introduces unnecessary correlation among tasks, possibly due to the assumption that all tasks are related; (3) in cCMTL we also notice some 'noisy' correlation, which may be due to the spectral relaxation. [sent-270, score-0.196]
86 Effectiveness Comparison. Next, we empirically evaluate the effectiveness of the cCMTL formulation in comparison with RidgeSTL and RegMTL using real-world benchmark datasets, including the School data and the Sarcos data. [sent-335, score-0.131]
87 Efficiency Comparison. We compare the efficiency of the three algorithms, altCMTL, apgCMTL, and graCMTL, for solving the cCMTL formulation in Eq. (11). [sent-349, score-0.119]
88 Specifically, we study how the feature dimensionality, the sample size, and the task number affect the required computation cost (in seconds) for convergence. [sent-353, score-0.069]
89 Because in the Yahoo data the task number is very small, we construct a synthetic data set for the third experiment. [sent-356, score-0.079]
90 In the first experiment, we vary the feature dimensionality in the set [500 : 500 : 2500] with the sample size fixed at 4000 and the task number fixed at 17. [sent-357, score-0.111]
91 In the second experiment, we vary the sample size in the set [3000 : 1000 : 9000] with the dimensionality fixed at 500 and the task number fixed at 17. [sent-359, score-0.094]
92 In the third experiment, we vary the task number in the set [10 : 10 : 190] with the feature dimensionality fixed at 600 and the sample size fixed at 2000. [sent-362, score-0.111]
93 The employed synthetic data set is constructed as follows: for each task, we generate the entries of the data matrix X_i from N(0, 1) and the entries of the weight vector w_i from N(0, 1); the response vector y_i is computed as y_i = X_i w_i + ξ, where ξ ∼ N(0, 0. [sent-363, score-0.108]
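A small NumPy sketch of this construction follows; the noise standard deviation is an assumed placeholder value, and the function name and seed are illustrative.

```python
import numpy as np

def make_synthetic_tasks(m_tasks, n_samples, d, noise_std=0.1, seed=0):
    """Generate regression tasks as described above: entries of X_i and of each
    weight vector w_i drawn from N(0, 1), and y_i = X_i w_i + xi with Gaussian
    noise whose standard deviation (noise_std) is an assumed value."""
    rng = np.random.default_rng(seed)
    Xs, ys, Ws = [], [], []
    for _ in range(m_tasks):
        X = rng.standard_normal((n_samples, d))
        w = rng.standard_normal(d)
        y = X @ w + noise_std * rng.standard_normal(n_samples)
        Xs.append(X)
        ys.append(y)
        Ws.append(w)
    return Xs, ys, np.column_stack(Ws)   # column i of the last output is w_i
```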
94 6 Conclusion. In this paper we establish the equivalence relationship between two multi-task learning techniques: alternating structure optimization (ASO) and clustered multi-task learning (CMTL). [sent-367, score-0.309]
95 We further establish the equivalence relationship between our proposed convex relaxation of CMTL and an existing convex relaxation of ASO. [sent-368, score-0.396]
96 In addition, we propose three algorithms for solving the convex CMTL formulation and demonstrate their effectiveness and efficiency on benchmark datasets. [sent-369, score-0.215]
97 A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. [sent-379, score-0.088]
98 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-406, score-0.142]
99 A convex formulation for learning shared structures from multiple tasks. [sent-459, score-0.216]
100 Clustering learning tasks and the selective cross-task transfer of knowledge. [sent-465, score-0.089]
wordName wordTfidf (topN-words)
[('cmtl', 0.539), ('aso', 0.447), ('ccmtl', 0.355), ('mtl', 0.206), ('altcmtl', 0.199), ('caso', 0.184), ('gracmtl', 0.17), ('tr', 0.165), ('mz', 0.129), ('apgcmtl', 0.128), ('regmtl', 0.114), ('ridgestl', 0.114), ('gcmtl', 0.099), ('formulation', 0.093), ('clustered', 0.093), ('wz', 0.092), ('tasks', 0.089), ('relaxation', 0.075), ('nmse', 0.071), ('wi', 0.06), ('convex', 0.058), ('equivalence', 0.056), ('relatedness', 0.054), ('alternating', 0.053), ('task', 0.052), ('evgeniou', 0.048), ('sm', 0.047), ('shared', 0.045), ('cluster', 0.043), ('amse', 0.043), ('gcaso', 0.043), ('gccmtl', 0.043), ('sarcos', 0.043), ('ik', 0.039), ('ih', 0.039), ('wk', 0.038), ('ws', 0.037), ('clustering', 0.035), ('supplemental', 0.034), ('min', 0.033), ('predictive', 0.033), ('vi', 0.032), ('relationship', 0.032), ('inherent', 0.031), ('optimization', 0.03), ('gradient', 0.03), ('formulations', 0.029), ('gaso', 0.028), ('ky', 0.028), ('ciency', 0.027), ('synthetic', 0.027), ('solving', 0.026), ('accelerated', 0.026), ('objectives', 0.026), ('descent', 0.026), ('dimensionality', 0.026), ('penalty', 0.026), ('seconds', 0.026), ('establish', 0.024), ('tenth', 0.023), ('operated', 0.023), ('sse', 0.023), ('ms', 0.022), ('sd', 0.022), ('apg', 0.022), ('stl', 0.022), ('regularization', 0.021), ('weight', 0.021), ('structure', 0.021), ('effectiveness', 0.02), ('marketing', 0.02), ('bakker', 0.02), ('wv', 0.02), ('ui', 0.02), ('multiple', 0.02), ('qt', 0.02), ('objective', 0.02), ('completes', 0.019), ('yahoo', 0.019), ('minw', 0.019), ('ando', 0.019), ('xue', 0.019), ('fan', 0.019), ('wm', 0.019), ('school', 0.018), ('benchmark', 0.018), ('projected', 0.018), ('ding', 0.018), ('pontil', 0.018), ('fi', 0.018), ('proposed', 0.018), ('spectral', 0.018), ('micchelli', 0.017), ('argyriou', 0.017), ('reformulate', 0.017), ('feature', 0.017), ('vary', 0.016), ('admits', 0.016), ('ij', 0.016), ('record', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
Author: Jiayu Zhou, Jianhui Chen, Jieping Ye
Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1
2 0.12506786 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
Author: Nico Goernitz, Christian Widmer, Georg Zeller, Andre Kahles, Gunnar Rätsch, Sören Sonnenburg
Abstract: We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output prediction often leads to difficult inference problems and hence requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit additional information from related learning tasks by means of hierarchical regularization. Training SO models on the combined set of examples from multiple tasks can easily become infeasible for real world applications. To be able to solve the optimization problems underlying multitask structured output learning, we propose an efficient algorithm based on bundle-methods. We demonstrate the performance of our approach in applications from the domain of computational biology addressing the key problem of gene finding. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach outperforms considered non-MTL methods. 1
3 0.083583795 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
4 0.064059548 54 nips-2011-Co-regularized Multi-view Spectral Clustering
Author: Abhishek Kumar, Piyush Rai, Hal Daume
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
5 0.061339792 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
6 0.053057764 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
7 0.050166205 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
8 0.049577534 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
9 0.0451927 186 nips-2011-Noise Thresholds for Spectral Clustering
10 0.043968529 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
11 0.043089811 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
12 0.042123433 78 nips-2011-Efficient Methods for Overlapping Group Lasso
13 0.041672248 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
14 0.041512061 145 nips-2011-Learning Eigenvectors for Free
15 0.041180365 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
16 0.037714697 198 nips-2011-On U-processes and clustering performance
17 0.036426011 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
18 0.035247143 4 nips-2011-A Convergence Analysis of Log-Linear Training
19 0.03481672 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
20 0.034720886 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
topicId topicWeight
[(0, 0.111), (1, 0.015), (2, -0.04), (3, -0.062), (4, -0.042), (5, 0.028), (6, -0.004), (7, 0.006), (8, -0.008), (9, 0.038), (10, 0.013), (11, 0.048), (12, 0.008), (13, -0.039), (14, -0.04), (15, 0.069), (16, -0.094), (17, 0.05), (18, 0.093), (19, -0.057), (20, 0.033), (21, -0.021), (22, 0.009), (23, 0.07), (24, -0.045), (25, -0.014), (26, 0.017), (27, -0.026), (28, -0.017), (29, -0.021), (30, -0.014), (31, 0.005), (32, -0.016), (33, 0.054), (34, -0.025), (35, 0.087), (36, -0.041), (37, -0.024), (38, -0.107), (39, -0.074), (40, -0.028), (41, -0.015), (42, 0.059), (43, -0.052), (44, -0.048), (45, 0.001), (46, 0.01), (47, 0.045), (48, 0.062), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.90008134 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
Author: Jiayu Zhou, Jianhui Chen, Jieping Ye
Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1
2 0.65899855 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
Author: Nico Goernitz, Christian Widmer, Georg Zeller, Andre Kahles, Gunnar Rätsch, Sören Sonnenburg
Abstract: We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output prediction often leads to difficult inference problems and hence requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit additional information from related learning tasks by means of hierarchical regularization. Training SO models on the combined set of examples from multiple tasks can easily become infeasible for real world applications. To be able to solve the optimization problems underlying multitask structured output learning, we propose an efficient algorithm based on bundle-methods. We demonstrate the performance of our approach in applications from the domain of computational biology addressing the key problem of gene finding. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach outperforms considered non-MTL methods. 1
3 0.59828371 181 nips-2011-Multiple Instance Learning on Structured Data
Author: Dan Zhang, Yan Liu, Luo Si, Jian Zhang, Richard D. Lawrence
Abstract: Most existing Multiple-Instance Learning (MIL) algorithms assume data instances and/or data bags are independently and identically distributed. But there often exists rich additional dependency/structure information between instances/bags within many applications of MIL. Ignoring this structure information limits the performance of existing MIL algorithms. This paper explores the research problem as multiple instance learning on structured data (MILSD) and formulates a novel framework that considers additional structure information. In particular, an effective and efficient optimization algorithm has been proposed to solve the original non-convex optimization problem by using a combination of ConcaveConvex Constraint Programming (CCCP) method and an adapted Cutting Plane method, which deals with two sets of constraints caused by learning on instances within individual bags and learning on structured data. Our method has the nice convergence property, with specified precision on each set of constraints. Experimental results on three different applications, i.e., webpage classification, market targeting, and protein fold identification, clearly demonstrate the advantages of the proposed method over state-of-the-art methods. 1
4 0.51974553 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
5 0.5167796 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
6 0.50469226 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
7 0.4826059 258 nips-2011-Sparse Bayesian Multi-Task Learning
8 0.46138239 277 nips-2011-Submodular Multi-Label Learning
9 0.45526171 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
10 0.45392668 4 nips-2011-A Convergence Analysis of Log-Linear Training
11 0.45015693 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
12 0.44888574 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
13 0.44806194 165 nips-2011-Matrix Completion for Multi-label Image Classification
14 0.44322887 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
15 0.43229893 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
16 0.42542925 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
17 0.4238849 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
18 0.4233281 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
19 0.41719925 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
20 0.41636848 260 nips-2011-Sparse Features for PCA-Like Linear Regression
topicId topicWeight
[(0, 0.035), (4, 0.032), (20, 0.026), (26, 0.023), (31, 0.062), (33, 0.036), (43, 0.05), (45, 0.12), (57, 0.026), (69, 0.34), (74, 0.043), (83, 0.031), (99, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.69950378 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
Author: Jiayu Zhou, Jianhui Chen, Jieping Ye
Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1
2 0.65613008 15 nips-2011-A rational model of causal inference with continuous causes
Author: Thomas L. Griffiths, Michael James
Abstract: Rational models of causal induction have been successful in accounting for people’s judgments about causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2× 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to diminish the gap between empirical and theoretical approaches and real-world causal induction. This model successfully predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people’s inferences in a new experiment. 1
3 0.46315771 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
4 0.4546659 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
5 0.45430976 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
6 0.45372078 271 nips-2011-Statistical Tests for Optimization Efficiency
7 0.45271894 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
8 0.45121631 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
9 0.45020267 80 nips-2011-Efficient Online Learning via Randomized Rounding
10 0.44956744 202 nips-2011-On the Universality of Online Mirror Descent
11 0.44948447 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
12 0.44945249 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
13 0.4493134 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
14 0.44902256 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
15 0.44779781 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
16 0.4475477 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
17 0.44708854 29 nips-2011-Algorithms and hardness results for parallel large margin learning
18 0.4461146 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
19 0.44584626 251 nips-2011-Shaping Level Sets with Submodular Functions
20 0.44579494 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning