nips nips2007 nips2007-12 knowledge-graph by maker-knowledge-mining

12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning


Source: pdf

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. [sent-12, score-0.288]

2 Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. [sent-13, score-0.247]

3 We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. [sent-14, score-0.299]

4 This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. [sent-15, score-0.387]

5 We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. [sent-17, score-0.197]

6 Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. [sent-18, score-0.218]

7 This problem is important in a variety of applications, ranging from conjoint analysis [12], to object detection in computer vision [18], to multiple microarray data set integration in computational biology [8] – to mention just a few. [sent-20, score-0.112]

8 Finding this common structure is important because it allows pooling information across the tasks, a property which is particularly appealing when there are many tasks but only a few data points per task. [sent-22, score-0.281]

9 Moreover, knowledge of the common structure may facilitate learning new tasks (transfer learning), see [6] and references therein. [sent-23, score-0.247]

10 In this paper, we extend the formulation of [4], where the structure shared by the tasks is described by a positive definite matrix. [sent-24, score-0.259]

11 In Section 2, we propose a framework in which the task parameters and the structure matrix are jointly computed by minimizing a regularization function. [sent-25, score-0.413]

12 When the structure matrix is fixed, the function decomposes across the tasks, which can hence be learned independently with standard methods such as SVMs. [sent-27, score-0.14]

13 When the task parameters are fixed, the optimal structure matrix is a spectral function of the covariance of the tasks and can often be explicitly computed. [sent-28, score-0.512]

14 As we shall see, spectral functions are of particular interest in this context because they lead to an efficient alternating minimization algorithm. [sent-29, score-0.453]

15 Second, in Section 4 we characterize the spectral functions which relate to Schatten Lp regularization and present the alternating minimization algorithm. [sent-32, score-0.497]

16 Third, in Section 5 we discuss the connection between our framework and the convex optimization method for learning the kernel [11, 15], which leads to a much simpler proof of the convexity in the kernel than the one given in [15]. [sent-33, score-0.318]

17 The experiments indicate that the alternating algorithm runs significantly faster than gradient descent and that our method improves on state of the art statistical performance on these data sets. [sent-35, score-0.291]

18 We let T be the number of tasks which we want to simultaneously learn. [sent-43, score-0.179]

19 We assume for simplicity that each task t ∈ INT is well described by a linear function defined, for every x ∈ IRd, as wt⊤x, where wt is a fixed vector of coefficients. [sent-44, score-0.322]

20 In this paper we follow the formulation in [4], where the tasks’ structure is summarized by a positive definite matrix D which is linked to the covariance matrix between the tasks, W W⊤. [sent-51, score-0.267]

21 Here, W denotes the d × T matrix whose t-th column is given by the vector wt (we have assumed for simplicity that the mean task is zero). [sent-52, score-0.284]

22 The former may be any convex function, bounded from below, evaluated at the values wt⊤xtj, t ∈ INT, j ∈ INm. [sent-55, score-0.182]

23 Typically, it will be the average error on the tasks, namely, Err(W) = Σ_{t∈INT} Lt(wt), where Lt(wt) = Σ_{j∈INm} ℓ(ytj, wt⊤xtj) and ℓ : IR × IR → [0, ∞) is a prescribed loss function (e.g. [sent-56, score-0.258]

24 We shall assume that the loss is convex in its second argument, which ensures that the function Err is also convex. [sent-60, score-0.15]

25 The latter term favors the tasks sharing some common structure and is given by Penalty(W, D) = tr(F(D) W W⊤) = Σ_{t=1}^T wt⊤ F(D) wt, (2.2) [sent-61, score-0.416]

26 where F : Sd → Sd is a prescribed spectral matrix function. [sent-62, score-0.279]

27 In the rest of the paper, we will always use F to denote a spectral matrix function and f to denote the associated real function, as above. [sent-72, score-0.244]
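
To make the notation concrete, here is a minimal Python sketch (ours, not from the paper) of how a spectral matrix function F(D) = U diag(f(λ1), ..., f(λd)) U⊤ can be evaluated from the eigendecomposition of D; the function name is illustrative.

```python
import numpy as np

def apply_spectral_function(D, f):
    """Evaluate F(D) = U diag(f(lambda_1), ..., f(lambda_d)) U^T for symmetric
    positive definite D, where f acts on the eigenvalues of D."""
    lam, U = np.linalg.eigh(D)            # D = U diag(lam) U^T
    return (U * f(lam)) @ U.T             # broadcast f over the eigenvalues

# Example: f(lambda) = 1/lambda corresponds to F(D) = D^{-1}.
D = np.array([[2.0, 0.5], [0.5, 1.0]])
assert np.allclose(apply_spectral_function(D, lambda x: 1.0 / x), np.linalg.inv(D))
```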

28 Minimization of the function Reg allows us to learn the tasks and at the same time a good representation for them which is summarized by the eigenvectors and eigenvalues of the matrix D. [sent-73, score-0.354]

29 Different choices of the function f reflect different properties which we would like the tasks to share. [sent-74, score-0.179]

30 In the special case that f is a constant, the tasks are totally independent and the regularizer (2. [sent-75, score-0.255]

31 In the case f (λ) = λ−1 , which is considered in [4], the regularizer favors a sparse representation in the sense that the tasks share a small common set of features. [sent-77, score-0.282]

32 Thus, we propose to solve the minimization problem inf { Reg(W, D) : W ∈ IRd×T, D ∈ Sd++, tr D ≤ 1 } (2.4) [sent-81, score-0.273]

33 Moreover, even though the infimum above is not attained in general, the problem in W resulting after partial minimization over D admits a minimizer. [sent-85, score-0.19]

34 That is, we can compute the infimum Ωf(W) := inf { tr(F(D) W W⊤) : D ∈ Sd++, tr D ≤ 1 }. [sent-88, score-0.186]

35 In particular, in Section 4 we shall show that Ωf is a function of the singular values of W only. [sent-95, score-0.126]

36 Hence, the only matrix operation required by the alternating minimization is a singular value decomposition; the rest are merely vector problems. [sent-96, score-0.29]
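
As an illustration of this reduction, the case f(λ) = 1/λ can be checked numerically: with σi the singular values of W, the resulting vector problem inf { Σi σi²/δi : δi > 0, Σi δi ≤ 1 } has the closed form (Σi σi)², attained at δi = σi / Σj σj (a Cauchy–Schwarz argument). The sketch below is ours and only verifies this closed form against random feasible δ; it assumes the eigenvalue reduction stated in the lemma quoted further below.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
sigma = np.linalg.svd(W, compute_uv=False)     # singular values of W

# Closed form for f(lambda) = 1/lambda:
# inf { sum_i sigma_i^2 / delta_i : delta_i > 0, sum_i delta_i <= 1 } = (sum_i sigma_i)^2.
closed_form = np.sum(sigma) ** 2

# Random feasible deltas (on the simplex) never beat the closed form.
deltas = rng.dirichlet(np.ones(len(sigma)), size=2000)
assert np.min(np.sum(sigma**2 / deltas, axis=1)) >= closed_form - 1e-9
```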

37 3 Joint Convexity via Matrix Concave Functions In this section, we address the issue of convexity of the regularization function (2. [sent-98, score-0.184]

38 Our main result characterizes the class of spectral functions F for which the term w⊤F(D)w is jointly convex in (w, D), which in turn implies that (2. [sent-100, score-0.337]

39 We say that the real-valued function g : (0, ∞) → IR is matrix concave of order d if G(λA + (1 − λ)B) ⪰ λG(A) + (1 − λ)G(B) ∀ A, B ∈ Sd++ and λ ∈ [0, 1], where G is defined as in (2.

40 If g is a matrix concave function of order d for any d ∈ IN, we simply say that g is matrix concave. [sent-106, score-0.282]

41 We also say that g is matrix convex (of order d) if −g is matrix concave (of order d). [sent-107, score-0.378]

42 Clearly, matrix concavity implies matrix concavity of smaller orders (and hence standard concavity). [sent-108, score-0.33]

43 Then the function ρ : IRd × Sd++ → [0, ∞) defined as ρ(w, D) = w⊤F(D)w is jointly convex if and only if 1/f is matrix concave of order d. [sent-112, score-0.349]

44 7]) A−1 + B−1 ⪯ C−1, or, using the initial notation, λ(F(D1))−1 + (1 − λ)(F(D2))−1 ⪯ (F(λD1 + (1 − λ)D2))−1; by definition, this inequality holds for any D1, D2 ∈ Sd++ and λ ∈ (0, 1) if and only if 1/f is matrix concave of order d. [sent-127, score-0.13]

45 Examples of matrix concave functions on (0, ∞) are log(x + 1) and the function x^s for s ∈ [0, 1] – see [7] for other examples and theoretical results. [sent-129, score-0.308]
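
A quick numerical sanity check of matrix concavity, sketched in Python for g(x) = x^{1/2} (one of the functions mentioned above); the helper name is ours, and the loop only samples random positive definite matrices rather than proving anything.

```python
import numpy as np

def matfun(A, g):
    """Apply a real function g to a symmetric matrix through its eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return (U * g(lam)) @ U.T

rng = np.random.default_rng(1)
g = np.sqrt                                   # g(x) = x^s with s = 1/2
for _ in range(100):
    X, Y = rng.standard_normal((2, 4, 4))
    A, B = X @ X.T + np.eye(4), Y @ Y.T + np.eye(4)   # random positive definite matrices
    t = rng.uniform()
    gap = matfun(t * A + (1 - t) * B, g) - (t * matfun(A, g) + (1 - t) * matfun(B, g))
    assert np.min(np.linalg.eigvalsh(gap)) >= -1e-10  # matrix concavity: gap is PSD
```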

46 We conclude with the remark that, whenever 1/f is matrix concave of order d, the function Ωf in (2.5) [sent-130, score-0.183]

47 is convex, because it is the partial infimum of a jointly convex function [9, Sec. [sent-131, score-0.2]

48 The computation of (2.5) reduces to a minimization problem in IRd, by application of a useful matrix inequality. [sent-139, score-0.186]

49 Let F : Sd → Sd be a spectral function, B ∈ Sd and βi , i ∈ INd , the eigenvalues of B. [sent-143, score-0.193]

50 Then, inf{tr(F(D)B) : D ∈ Sd++, tr D ≤ 1} = inf{ Σ_{i∈INd} f(δi)βi : δi > 0, i ∈ INd, Σ_{i∈INd} δi ≤ 1 }. [sent-144, score-0.182]

51 Then we have that (tr B^s)^{1/s} = inf{ tr(D^{(s−1)/s} B) : D ∈ Sd++, tr D ≤ 1 }. [sent-158, score-0.186]

52 Moreover, if B ∈ Sd++ the infimum is attained and the minimizer is given by D = B^s / tr(B^s). [sent-159, score-0.145]
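
The identity and the claimed minimizer are easy to check numerically; the sketch below is ours (illustrative names, s chosen arbitrarily in (0, 1]) and verifies that D = B^s/tr(B^s) attains the value (tr B^s)^{1/s}, while another feasible D does not do better.

```python
import numpy as np

def matpow(A, p):
    """A^p for a symmetric positive definite matrix A via its eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return (U * lam**p) @ U.T

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4))
B = X @ X.T + 0.1 * np.eye(4)                 # a positive definite B
s = 0.75                                      # s = p/2 for Schatten L_p (an assumption of this sketch)

lhs = np.trace(matpow(B, s)) ** (1.0 / s)
D_star = matpow(B, s) / np.trace(matpow(B, s))        # claimed minimizer, tr(D_star) = 1
assert np.isclose(lhs, np.trace(matpow(D_star, (s - 1.0) / s) @ B))

# Any other feasible D (positive definite, trace <= 1) gives a value >= lhs.
Z = rng.standard_normal((4, 4))
D_other = Z @ Z.T + 0.1 * np.eye(4)
D_other /= np.trace(D_other)
assert np.trace(matpow(D_other, (s - 1.0) / s) @ B) >= lhs - 1e-9
```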

53 The Schatten Lp prenorm is the Lp prenorm of the singular values of a matrix. [sent-168, score-0.27]

54 In particular, trace norm regularization (see [1, 17]) corresponds to the case p = 1. [sent-169, score-0.168]
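
As a small illustration (ours), the Schatten Lp prenorm can be computed directly from the singular values; p = 1 gives the trace norm mentioned above and p = 2 gives the Frobenius norm.

```python
import numpy as np

def schatten_prenorm(W, p):
    """Schatten L_p prenorm: the l_p prenorm of the singular values of W."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return np.sum(sigma**p) ** (1.0 / p)

W = np.diag([3.0, 4.0])
assert np.isclose(schatten_prenorm(W, 1), 7.0)   # trace norm (p = 1)
assert np.isclose(schatten_prenorm(W, 2), 5.0)   # Frobenius norm (p = 2)
```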

55 We also note that generalization error bounds for Schatten Lp norm regularization can be derived along the lines of [14]. [sent-170, score-0.132]

56 These regularization problems are computationally appealing, since they decompose into vector problems in d variables along with a singular value decomposition of the matrix W. [sent-175, score-0.203]

57 Problem (2.4) is solved using an alternating minimization algorithm, which is an extension of the one presented in [4] for the special case F(D) = D−1. [sent-189, score-0.228]

58 This consists in solving the problem min { Σ_{t∈INT} Lt(wt) + γ Σ_{t∈INT} wt⊤F(D)wt : W ∈ IRd×T }. [sent-192, score-0.137]

59 This minimization can be carried out independently for each task since the regularizer decouples when D is fixed. [sent-193, score-0.211]

60 Specifically, introducing new variables for (F(D))^{1/2} wt yields a standard L2 regularization problem for each task with the same kernel K(x, z) = x⊤(F(D))−1 z, x, z ∈ IRd. [sent-194, score-0.336]

61 In other words, we simply learn the parameters wt – the columns of matrix W – independently by a regularization method, for example by an SVM or ridge regression method, for which there are well-developed toolboxes. [sent-195, score-0.373]
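
The Python sketch below puts the two alternating steps together for one concrete instance; it is ours, not the authors' Matlab implementation. It assumes the square loss, the special case F(D) = D−1 (the trace norm case, p = 1), a closed-form ridge solution for the supervised step, and the perturbed update D = (W W⊤ + εI)^{1/2} / tr((W W⊤ + εI)^{1/2}), which follows from the lemma above with s = 1/2; all names and the toy data are illustrative.

```python
import numpy as np

def msqrt(B):
    """Matrix square root of a symmetric positive semidefinite matrix."""
    lam, U = np.linalg.eigh(B)
    return (U * np.sqrt(np.maximum(lam, 0.0))) @ U.T

def alternating_min(X, Y, gamma=1.0, eps=1e-6, iters=50):
    """Alternating minimization sketch for the square loss and F(D) = D^{-1}.
    X: list of (d, m_t) input matrices, one per task; Y: list of (m_t,) targets."""
    d, T = X[0].shape[0], len(X)
    D = np.eye(d) / d                          # start from the uniform structure matrix
    W = np.zeros((d, T))
    for _ in range(iters):
        D_inv = np.linalg.inv(D)
        # Supervised step: one ridge-type regression per task with metric D^{-1}.
        for t in range(T):
            W[:, t] = np.linalg.solve(X[t] @ X[t].T + gamma * D_inv, X[t] @ Y[t])
        # Unsupervised step: closed-form spectral update of the structure matrix.
        C = msqrt(W @ W.T + eps * np.eye(d))
        D = C / np.trace(C)
    return W, D

# Toy usage: 5 tasks whose parameter vectors share a 2-dimensional subspace of d = 10.
rng = np.random.default_rng(3)
d, T, m = 10, 5, 20
U_shared = rng.standard_normal((d, 2))
X = [rng.standard_normal((d, m)) for _ in range(T)]
w_true = [U_shared @ rng.standard_normal(2) for _ in range(T)]
Y = [X[t].T @ w_true[t] + 0.01 * rng.standard_normal(m) for t in range(T)]
W, D = alternating_min(X, Y)
```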

62 We only note that following the proof detailed in [3] for the case p = 1, one can show that the sequence produced by the algorithm converges to the unique minimizer of Reg ε if p ∈ [1, 2], or to a local minimizer if p ∈ (0, 1). [sent-199, score-0.152]

63 To this end, we define the kernel Kf(D)(x, z) = x⊤(F(D))−1 z, x, z ∈ IRd, the set of kernels Kf = {Kf(D) : D ∈ Sd++, tr D ≤ 1} and, for every kernel K, the task kernel matrix Kt = (K(xti, xtj) : i, j ∈ INm), t ∈ INT. [sent-208, score-0.511]
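
For concreteness, a two-line sketch (our naming) of the task kernel matrix Kt induced by this linear kernel, assuming the inputs of task t are stored as the columns of a d × m_t matrix Xt:

```python
import numpy as np

def task_kernel_matrix(Xt, F_D):
    """K_t = (K(x_ti, x_tj))_{i,j} for the linear kernel K(x, z) = x^T (F(D))^{-1} z."""
    return Xt.T @ np.linalg.solve(F_D, Xt)   # columns of Xt are the inputs of task t
```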

64 5] that the set Kf is convex if and only if 1/f is matrix concave. [sent-214, score-0.195]

65 Problem (2.4) is equivalent to minimizing the function Σ_{t∈INT} ( Σ_{i∈INm} ℓ(yti, (Kt ct)i) + γ ct⊤ Kt ct ) (5. [sent-218, score-0.291]

66 However, minimizing each term over the vector ct gives a convex function of K. [sent-222, score-0.211]

67 For every a ∈ IRm and K ∈ Kf, we define the function Gt(a, K) = Σ_{i∈INm} ℓ(yti, ai) + γ a⊤Kt−1 a, which is jointly convex by Theorem 3. [sent-229, score-0.166]

68 Recalling that the partial minimum of a jointly convex function is convex [9, Sec. [sent-232, score-0.296]

69 Here, we were able to simplify the proof of this result by appealing to the joint convexity property stated in Theorem 3. [sent-237, score-0.147]

70 6 Experiments In this section, we first report a comparison of the computational cost between the alternating minimization algorithm and the gradient descent algorithm. [sent-239, score-0.339]

71 The first one is the computer survey data from [12]. [sent-242, score-0.115]

72 It was taken from a survey of 180 persons who rated the likelihood of purchasing one of 20 different personal computers. [sent-243, score-0.116]

73 Here the persons correspond to tasks and the computer models to examples. [sent-244, score-0.244]

74 Here, in order to compare our results with those in [5], we used the measure of percentage explained variance, which is defined as one minus the mean squared test error over the variance of the test data and indicates the percentage of variance explained by the prediction model. [sent-260, score-0.184]
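
A small helper matching that definition (the name is ours):

```python
import numpy as np

def percentage_explained_variance(y_true, y_pred):
    """1 - (mean squared test error) / (variance of the test targets), in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * (1.0 - np.mean((y_true - y_pred) ** 2) / np.var(y_true))
```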

75 In the first experiment, we study the computational cost of the alternating minimization algorithm against the gradient descent algorithm, both implemented in Matlab, for the Schatten L 1. [sent-262, score-0.339]

76 1) versus the number of iterations, on the computer survey data. [sent-265, score-0.115]

77 The alternating algorithm curve for ε = 10 −16 is also shown. [sent-268, score-0.141]

78 It is clear that our algorithm is at least an order of magnitude faster than gradient descent with the optimal learning rate and scales better with the number of tasks. [sent-271, score-0.111]

79 We note that the computational cost of our method is mainly due to the T ridge regressions in the supervised step (learning W) and the singular value decomposition in the unsupervised step (learning D). [sent-272, score-0.182]

80 A singular value decomposition is also needed in gradient descent, for computing the gradient of the Schatten Lp norm. [sent-273, score-0.194]

81 We have observed that the cost per iteration is smaller for gradient descent but the number of iterations is at least an order of magnitude larger, leading to the large difference in time cost. [sent-274, score-0.111]

82 Figure 1: Comparison between the alternating algorithm and the gradient descent algorithm (plotted against the number of tasks and against the number of iterations). [sent-284, score-0.431]

83 Figure 2: Performance versus p for the computer survey data (left) and the school data (right). [sent-312, score-0.104]

84 Table 1: Comparison of different methods on the computer survey data (left) and school data (right). [sent-313, score-0.182]

85 In the second experiment, we study the statistical performance of our method as the spectral function changes. [sent-325, score-0.145]

86 The results, shown in Figure 2, indicate that the trace norm is the best norm on these data sets. [sent-327, score-0.104]

87 However, on the computer survey data a value of p less than one gives the best result overall. [sent-328, score-0.115]

88 In contrast, on the school data the trace norm gives almost the best result. [sent-330, score-0.137]

89 Our method improves on the HB method on the computer survey data and is competitive on the school data (even though our regularizer is simpler than HB and the data splits of [5] are not available). [sent-333, score-0.258]

90 On the computer survey data, we trained our method with p = 1 on 150 randomly selected tasks and then used the learned structure matrix D for training 30 ridge regressions on the remaining tasks. [sent-335, score-0.512]

91 In comparison, when using the raw data (D = I/d) on the 30 tasks we obtained an RMSE of 3. [sent-339, score-0.179]

92 A similar experiment was performed on the school data, first training on a random subset of 110 schools and then transferring D to the remaining 29 schools. [sent-341, score-0.106]

93 8% on the 110 tasks but still better than the explained variance of 13. [sent-345, score-0.271]

94 7 Conclusion We have presented a spectral regularization framework for learning the structure shared by many supervised tasks. [sent-347, score-0.353]

95 This structure is summarized by a positive definite matrix which is a spectral function of the tasks’ covariance matrix. [sent-348, score-0.313]

96 Theoretically, it brings to bear the rich class of spectral functions which is well-studied in matrix analysis. [sent-350, score-0.27]

97 Practically, we have argued via the concrete example of negative power spectral functions, that the tasks’ parameters and the structure matrix can be efficiently computed using an alternating minimization algorithm, improving upon state of the art statistical performance on two real data sets. [sent-351, score-0.552]

98 A natural question is to which extent the framework can be generalized to allow for more complex task sharing mechanisms, in which the structure parameters depend on higher order statistical properties of the tasks. [sent-352, score-0.151]

99 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-365, score-0.209]

100 Learning multiple related tasks using latent independent component analysis. [sent-488, score-0.179]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ind', 0.432), ('sd', 0.319), ('ird', 0.281), ('schatten', 0.248), ('tasks', 0.179), ('int', 0.174), ('inm', 0.174), ('spectral', 0.145), ('alternating', 0.141), ('wt', 0.137), ('reg', 0.13), ('lp', 0.129), ('tr', 0.119), ('kt', 0.111), ('prenorm', 0.099), ('matrix', 0.099), ('regularization', 0.098), ('convex', 0.096), ('kf', 0.092), ('mum', 0.091), ('ct', 0.088), ('minimization', 0.087), ('irm', 0.086), ('xtj', 0.086), ('convexity', 0.086), ('concave', 0.084), ('survey', 0.083), ('err', 0.079), ('regularizer', 0.076), ('minimizer', 0.076), ('hb', 0.074), ('yti', 0.074), ('ir', 0.073), ('singular', 0.072), ('rmse', 0.071), ('jointly', 0.07), ('attained', 0.069), ('inf', 0.067), ('school', 0.067), ('concavity', 0.066), ('descent', 0.066), ('penalty', 0.066), ('argyriou', 0.065), ('od', 0.065), ('appealing', 0.061), ('explained', 0.057), ('shall', 0.054), ('transfer', 0.054), ('kernel', 0.053), ('evgeniou', 0.052), ('london', 0.052), ('albany', 0.05), ('bristol', 0.05), ('conjoint', 0.05), ('gower', 0.05), ('prenorms', 0.05), ('ytj', 0.05), ('eigenvalues', 0.048), ('task', 0.048), ('inequality', 0.046), ('gradient', 0.045), ('lt', 0.044), ('structure', 0.041), ('semide', 0.041), ('proposition', 0.04), ('examination', 0.039), ('micchelli', 0.039), ('regressions', 0.039), ('schools', 0.039), ('shared', 0.039), ('ridge', 0.039), ('art', 0.039), ('regularizers', 0.037), ('trace', 0.036), ('diag', 0.035), ('cw', 0.035), ('prescribed', 0.035), ('variance', 0.035), ('partial', 0.034), ('norm', 0.034), ('moreover', 0.033), ('persons', 0.033), ('computer', 0.032), ('sharing', 0.032), ('bayes', 0.032), ('decomposition', 0.032), ('hierarchical', 0.031), ('lemma', 0.031), ('microarray', 0.03), ('framework', 0.03), ('equality', 0.03), ('uk', 0.029), ('summarized', 0.028), ('favors', 0.027), ('gt', 0.027), ('cpu', 0.027), ('street', 0.027), ('facilitate', 0.027), ('minimizing', 0.027), ('functions', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

2 0.14318359 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

3 0.10377467 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

4 0.10344908 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

5 0.099333137 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

Author: Bill Triggs, Jakob J. Verbeek

Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1

6 0.092329107 134 nips-2007-Multi-Task Learning via Conic Programming

7 0.09218163 40 nips-2007-Bundle Methods for Machine Learning

8 0.088649258 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

9 0.079150252 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

10 0.076450989 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

11 0.070807241 199 nips-2007-The Price of Bandit Information for Online Optimization

12 0.068115287 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

13 0.06803216 63 nips-2007-Convex Relaxations of Latent Variable Training

14 0.067264259 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

15 0.066221394 166 nips-2007-Regularized Boost for Semi-Supervised Learning

16 0.065678641 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

17 0.064284891 21 nips-2007-Adaptive Online Gradient Descent

18 0.06130765 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

19 0.058555637 175 nips-2007-Semi-Supervised Multitask Learning

20 0.058534045 156 nips-2007-Predictive Matrix-Variate t Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.201), (1, 0.0), (2, -0.097), (3, 0.112), (4, 0.008), (5, -0.029), (6, -0.016), (7, -0.01), (8, -0.055), (9, 0.05), (10, 0.075), (11, -0.096), (12, -0.003), (13, 0.05), (14, -0.077), (15, -0.004), (16, -0.05), (17, -0.08), (18, 0.097), (19, 0.0), (20, -0.187), (21, 0.05), (22, 0.002), (23, -0.066), (24, 0.027), (25, 0.053), (26, -0.041), (27, -0.015), (28, 0.009), (29, -0.02), (30, -0.098), (31, -0.008), (32, -0.088), (33, -0.102), (34, 0.029), (35, 0.124), (36, 0.121), (37, 0.009), (38, 0.14), (39, -0.05), (40, 0.056), (41, 0.059), (42, -0.01), (43, 0.15), (44, 0.027), (45, 0.018), (46, 0.067), (47, 0.115), (48, -0.079), (49, -0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93316376 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

2 0.5840894 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

3 0.5767467 40 nips-2007-Bundle Methods for Machine Learning

Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ ) steps to precision for general convex problems and in O(log(1/ )) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach. 1

4 0.52860641 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

5 0.52624428 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

Author: M. M. Mahmud, Sylvian Ray

Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1

6 0.51920104 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

7 0.51665395 134 nips-2007-Multi-Task Learning via Conic Programming

8 0.49090633 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

9 0.48215774 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

10 0.47349522 156 nips-2007-Predictive Matrix-Variate t Models

11 0.46770278 70 nips-2007-Discriminative K-means for Clustering

12 0.40978375 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

13 0.40636596 49 nips-2007-Colored Maximum Variance Unfolding

14 0.40236548 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

15 0.3998307 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

16 0.39338315 24 nips-2007-An Analysis of Inference with the Universum

17 0.39300564 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

18 0.3904005 63 nips-2007-Convex Relaxations of Latent Variable Training

19 0.38249502 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

20 0.3711566 112 nips-2007-Learning Monotonic Transformations for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.034), (13, 0.026), (16, 0.033), (21, 0.049), (34, 0.014), (35, 0.402), (47, 0.064), (49, 0.029), (83, 0.123), (85, 0.054), (90, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87708062 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

Author: Richard Turner, Maneesh Sahani

Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10−1 s); glottal pulses (∼ 10−2 s); and formants ( 10−3 s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscienceinspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task. 1

2 0.87057012 160 nips-2007-Random Features for Large-Scale Kernel Machines

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines. 1

3 0.8552075 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

Author: Anton Chechetka, Carlos Guestrin

Abstract: We present the first truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufficiently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on fixed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1

same-paper 4 0.84080726 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

5 0.57563746 16 nips-2007-A learning framework for nearest neighbor search

Author: Lawrence Cayton, Sanjoy Dasgupta

Abstract: Can we leverage learning techniques to build a fast nearest-neighbor (ANN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures. 1

6 0.56365639 96 nips-2007-Heterogeneous Component Analysis

7 0.55056441 134 nips-2007-Multi-Task Learning via Conic Programming

8 0.54704452 175 nips-2007-Semi-Supervised Multitask Learning

9 0.54174089 156 nips-2007-Predictive Matrix-Variate t Models

10 0.52723783 49 nips-2007-Colored Maximum Variance Unfolding

11 0.51952761 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

12 0.51805884 63 nips-2007-Convex Relaxations of Latent Variable Training

13 0.51697397 135 nips-2007-Multi-task Gaussian Process Prediction

14 0.51544333 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

15 0.51453513 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

16 0.51295406 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

17 0.50994426 116 nips-2007-Learning the structure of manifolds using random projections

18 0.50823653 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

19 0.50101519 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

20 0.49981567 70 nips-2007-Discriminative K-means for Clustering