nips nips2008 nips2008-47 knowledge-graph by maker-knowledge-mining

47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation


Source: pdf

Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach

Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 fr Abstract In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. [sent-7, score-0.36]

2 In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. [sent-8, score-0.189]

3 In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. [sent-9, score-0.691]

4 We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. [sent-10, score-0.685]

5 We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. [sent-11, score-0.332]

6 In particular, regularization by squared Euclidean norms or squared Hilbert norms has been thoroughly studied in various settings, leading to efficient practical algorithms based on linear algebra, and to very good theoretical understanding (see, e. [sent-13, score-0.356]

7 In recent years, regularization by non Hilbert norms, such as ℓp norms with p = 2, has also generated considerable interest for the inference of linear functions in supervised classification or regression. [sent-16, score-0.202]

8 Indeed, such norms can sometimes both make the problem statistically and numerically better-behaved, and impose various prior knowledge on the problem. [sent-17, score-0.187]

9 For example, the ℓ1 -norm (the sum of absolute values) imposes some of the components to be equal to zero and is widely used to estimate sparse functions [3], while various combinations of ℓp norms can be defined to impose various sparsity patterns. [sent-18, score-0.187]

10 That is, assuming a given prior knowledge, how can we design a norm that will enforce it? [sent-20, score-0.161]

11 In multi-task learning several related inference tasks are considered simultaneously, with the hope that by an appropriate sharing 1 of information across tasks, each one may benefit from the others. [sent-22, score-0.321]

12 When linear functions are estimated, each task is associated with a weight vector, and a common strategy to design multi-task learning algorithm is to translate some prior hypothesis about how the tasks are related to each other into constraints on the different weight vectors. [sent-23, score-0.59]

13 For example, such constraints are typically that the weight vectors of the different tasks belong (a) to a Euclidean ball centered at the origin [5], which implies no sharing of information between tasks apart from the size of the different vectors, i. [sent-24, score-0.746]

14 In this paper, we consider a different prior hypothesis that we believe could be more relevant in some applications: the hypothesis that the different tasks are in fact clustered into different groups, and that the weight vectors of tasks within a group are similar to each other. [sent-27, score-0.872]

15 A key difference with [5], where a similar hypothesis is studied, is that we don’t assume that the groups are known a priori, and in a sense our goal is both to identify the clusters and to use them for multi-task learning. [sent-28, score-0.247]

16 Besides an improved performance if the hypothesis turns out to be correct, we also expect this approach to be able to identify the cluster structure among the tasks as a by-product of the inference step, e. [sent-31, score-0.573]

17 In order to translate this hypothesis into a working algorithm, we follow the general strategy mentioned above which is to design a norm or a penalty over the set of weights which can be used as regularization in classical inference algorithms. [sent-34, score-0.411]

18 We construct such a penalty by first assuming that the partition of the tasks into clusters is known, similarly to [5]. [sent-35, score-0.543]

19 This optimization problem over the set of partitions being computationally challenging, we propose a convex relaxation of the problem which results in an efficient algorithm. [sent-37, score-0.229]

20 2 Multi-task learning with clustered tasks We consider m related inference tasks that attempt to learn linear functions over X = Rd from a training set of input/output pairs (xi , yi )i=1,. [sent-38, score-0.665]

21 Each training example (xi , yi ) is associated to a particular task t ∈ [1, m], and we denote by I(t) ⊂ [1, n] the set of indices of training examples associated to the task t. [sent-43, score-0.164]

22 In order to learn simultaneously the m tasks, we follow the now well-established approach which looks for a set of weight vectors W that minimizes the empirical risk regularized by a penalty functional, i. [sent-56, score-0.257]

23 For example, [5] suggests to penalize both the norms of the wi ’s and their variance, 2 i. [sent-59, score-0.3]

24 , to consider a function of the form: Ωvariance (W ) = w ¯ 2 + m i=1 β m wi − w ¯ 2 , (3) n ′ where w = ( i=1 wi ) /m is the mean weight vector. [sent-61, score-0.291]

25 This penalty enforces a clustering of the wi s ¯ towards their mean when β increases. [sent-62, score-0.392]

26 Alternatively, [7] propose to penalize the trace norm of W : Ωtrace (W ) = min(d,m) i=1 σi (W ) , (4) where σ1 (W ), . [sent-63, score-0.252]

27 , constrains the different wi ’s to live in a low-dimensional subspace. [sent-69, score-0.146]

28 Here we would like to define a penalty function Ω(W ) that encodes as prior knowledge that tasks are clustered into r < m groups. [sent-70, score-0.491]

29 , we have a partition of the set of tasks into r groups. [sent-73, score-0.268]

30 For a given cluster c ∈ [1, r], let us denote J (c) ⊂ [1, m] the set of tasks in c, mc = |J (c)| the number of tasks in the cluster c, and E the m × r binary matrix which describes the cluster assignment for the m tasks, i. [sent-75, score-1.232]

31 Let us further denote by wc = ¯ m ( i∈J (c) wi )/mc the average weight vector for the tasks in c, and recall that w = ( i=1 wi ) /m ¯ denotes the average weight vector over all tasks. [sent-78, score-0.809]

32 M can also be written I − L, where L is the normalized Laplacian of the graph G whose nodes are the tasks connected by an edge if and only if they are in the same cluster. [sent-80, score-0.268]

33 Then we can define three semi-norms of interest on W that quantify different orthogonal aspects: • A global penalty, which measures on average how large the weight vectors are: Ωmean (W ) = n w ¯ 2 = trW U W ⊤ . [sent-81, score-0.169]

34 • A measure of between-cluster variance, which quantifies how close to each other the different clusters are: Ωbetween (W ) = r c=1 mc wc − w ¯ ¯ 2 = trW (M − U )W ⊤ . [sent-82, score-0.368]

35 • A measure of within-cluster variance, which quantifies the compactness of the clusters: Ωwithin (W ) = r c=1 i∈J (c) wi − wc ¯ 2 = trW (I − M )W ⊤ . [sent-83, score-0.292]

36 We note that both Ωbetween (W ) and Ωwithin (W ) depend on the particular choice of clusters E, or equivalently of M . [sent-84, score-0.177]

37 We now propose to consider the following general penalty function: Ω(W ) = εM Ωmean (W ) + εB Ωbetween (W ) + εW Ωwithin (W ) , (5) where εM , εB and εW are non-negative parameters that can balance the importance of the components of the penalty. [sent-85, score-0.137]

38 Plugging this quadratic penalty into (2) leads to the general problem: minW ∈Rd×m ℓ(W ) + λtrW Σ(M )−1 W ⊤ , (6) Σ(M )−1 = εM U + εB (M − U ) + εW (I − M ) . [sent-86, score-0.168]

39 (7) where Here we use the notation Σ(M ) to insist on the fact that this quadratic penalty depends on the cluster structure through the matrix M . [sent-87, score-0.44]

40 Observing that the matrices U , M − U and I − M are orthogonal projections onto orthogonal supplementary subspaces, we easily get from (7): Σ(M ) = ε−1 U + ε−1 (M − U ) + ε−1 (I − M ) = ε−1 I + (ε−1 − ε−1 )U + (ε−1 − ε−1 )M . [sent-88, score-0.164]

41 • For εW = εB > εM , we recover the penalty of [5] without clusters: m Ω(W ) = trW (εM U + εB (I − U )) W ⊤ = εM n w 2 + εB i=1 wi − w 2 . [sent-90, score-0.298]

42 ¯ ¯ In that case, a global similarity between tasks is enforced, in addition to the general constraint on their mean. [sent-91, score-0.268]

43 The structure in clusters plays no role since the sum of the betweenand within-cluster variance is independent of the particular choice of clusters. [sent-92, score-0.208]

44 • For εW > εB = εM we recover the penalty of [5] with clusters: r Ω(W ) = trW (εM M + εW (I − M )) W ⊤ = εM mc wc ¯ 2 + εW εM c=1 i∈J (c) wi − wc ¯ 2 In order to enforce a cluster hypothesis on the tasks, we therefore see that a natural choice is to take εW > εB > εM in (5). [sent-93, score-0.979]

45 Of course, a major limitation at this point is that we assumed the cluster structure known a priori (through the matrix E, or equivalently M ). [sent-95, score-0.38]

46 In many cases of interest, we would like instead to learn the cluster structure itself from the data. [sent-96, score-0.24]

47 We propose to learn the cluster structure in our framework by optimizing our objective function (6) both in W and M , i. [sent-97, score-0.272]

48 , to consider the problem: minW ∈Rd×m ,M ∈Mr ℓ(W ) + λtrW Σ(M )−1 W ⊤ , (9) where Mr denotes the set of matrices M = E(E ⊤ E)−1 E ⊤ defined by a clustering of the m tasks into r clusters and Σ(M ) is defined in (8). [sent-99, score-0.572]

49 (10) m The objective function in (10) is jointly convex in W ∈ Rd×m and Σ ∈ S+ , the set of m×m positive semidefinite matrices, however the (finite) set Sr is not convex, making this problem intractable. [sent-101, score-0.123]

50 We are now going to propose a convex relaxation of (10) by optimizing over a convex set of positive semidefinite matrices that contains Sr . [sent-102, score-0.45]

51 3 Convex relaxation In order to formulate a convex relaxation of (10), we observe that in the penalty term (5) the cluster structure only contributes to the second and third terms Ωbetween (W ) and Ωwithin (W ), and that these penalties only depend on the centered version of W . [sent-103, score-0.712]

52 Plugging (11) in (7) and (9), we get the penalty trW Σ(M )−1 W ⊤ = εM trW ⊤ W U + (W Π)(εB M + εW (I − M ))(W Π)⊤ , (12) in which, again, only the second part needs to be optimized with respect to the clustering M . [sent-112, score-0.237]

53 A possible convex relaxation of the discrete set of matrices M is therefore {M : 0 M I, trM = r − 1}. [sent-122, score-0.295]

54 B (14) + (r − Incorporating the first part of the and γ = (m − r + with α = β= penalty (12) into the empirical risk term by defining ℓc (W ) = λℓ(W ) + εM trW ⊤ W U , we are now ready to state our relaxation of (10): (15) minW ∈Rd×m ,Σc ∈Sc ℓc (W ) + λtrW ΠΣ−1 (W Π)⊤ . [sent-124, score-0.243]

55 1 Reinterpretation in terms of norms We denote W 2 = minΣc ∈Sc trW Σ−1 W T the cluster norm (CN). [sent-127, score-0.477]

56 For any convex set Sc , we obc c tain a norm on W (that we apply here to its centered version). [sent-128, score-0.241]

57 By putting some different constraints on the set Sc , we obtain different norms on W , and in fact all previous multi-task formulations may be cast in this way, i. [sent-129, score-0.191]

58 , trace constraint for the trace norm, and simply a singleton for the Frobenius norm). [sent-133, score-0.198]

59 Thus, designing norms for multitask learning is equivalent to designing a set of positive matrices. [sent-134, score-0.217]

60 Note that we have selected a simple spectral convex set Sc in order to make the optimization simpler in Section 3. [sent-136, score-0.155]

61 Finally, when r = 1 (one cluster) and r = m (one cluster per task), we get back the formulation of [5]. [sent-138, score-0.237]

62 2 Reinterpretation as a convex relaxation of K-means In this section we show that the semi-norm W Π 2 that we have designed earlier, can be interpreted c as a convex relaxation of K-means on the tasks [9]. [sent-140, score-0.726]

63 Indeed, given W ∈ Rd×m , K-means aims to decompose it in the form W = µE ⊤ where µ ∈ Rd×r are cluster centers and E represents a partition. [sent-141, score-0.246]

64 Thus, a natural strategy F outlined by [9], is to alternate between optimizing µ, the partition E and the weight vectors W . [sent-143, score-0.152]

65 We now show that our convex norm is obtained when minimizing in closed form with respect to µ and relaxing. [sent-144, score-0.241]

66 , after optimization of the cluster centers) is equal to trΠW ⊤ W Π(ΠE(E ⊤ E)−1 E ⊤ Π/λ + I)−1 = trΠW ⊤ W Π(ΠM Π/λ + I)−1 . [sent-148, score-0.205]

67 (13), we see that our formulation is indeed a convex relaxation of K-means. [sent-150, score-0.293]

68 The optimal V is the matrix of the eigenvectors of W ΠW ⊤ , and we obtain the value of the objective function at the optimum: minΣ∈S trW ΠΣ−1 (W Π)⊤ = minλ∈Rm , α≤λi ≤β, λ1=γ 2 m σi i=1 λi , where σ and λ are the vectors containing the singular values of W Π and Σ respectively. [sent-158, score-0.148]

69 The only coupling in this formulation comes from the trace constraint. [sent-160, score-0.131]

70 Now this may not be 5 √ in the constraint set (α, β), so if σi < α ν then the minimum in λi ∈ [α, β] of (17) is reached √ √ for λi = α, and if σi > β ν it is reached for λi = β. [sent-167, score-0.13]

71 1 Experiments Artificial data We generated synthetic data consisting of two clusters of two tasks. [sent-183, score-0.138]

72 For each cluster, a center wc was generated in Rd−2 , so that the two clusters be orthogonal. [sent-185, score-0.319]

73 More ¯ 2 2 precisely, each wc had (d − 2)/2 random features randomly drawn from N (0, σr ), σr = 900, and ¯ (d − 2)/2 zero features. [sent-186, score-0.181]

74 Then, each tasks t was computed as wt + wc (t), where c(t) was the cluster ¯ of t. [sent-187, score-0.699]

75 wt had the same zero feature as its cluster center, and the other features were drawn from 2 2 2 N (0, σc ), σc = 16. [sent-188, score-0.25]

76 The last two features were non-zero for all the tasks and drawn from N (0, σc ). [sent-189, score-0.268]

77 In a first experiment, we compared our cluster norm . [sent-191, score-0.323]

78 2 with the single-task learning given by the c Frobenius norm, and with the trace norm, that corresponds to the assumption that the tasks live in a low-dimension space. [sent-192, score-0.402]

79 In a second setting, we compare CN to alternative methods that differ in the way they learn Σ: • The True metric approach, that simply plugs the actual clustering in E and optimizes W using this fixed metric. [sent-194, score-0.14]

80 • The k-means approach, that alternates between optimizing the tasks in W given the metric Σ and re-learning Σ by clustering the tasks wi [9]. [sent-196, score-0.819]

81 We also tried one run of CN followed by a run of True metric using the learned Σ reprojected in Sr by rounding, i. [sent-199, score-0.219]

82 6 Only the first method requires to know the true clustering a priori, all the other methods can be run without any knowledge of the clustering structure of the tasks. [sent-202, score-0.268]

83 The training points were equally separated between the two clusters and for each cluster, 5/6th of the points were used for the first task and 1/6th for the second, in order to simulate a natural setting were some tasks have fewer data. [sent-204, score-0.588]

84 We used the 2000 points of each task to build 3 training folds, and the remaining points were used for testing. [sent-205, score-0.182]

85 We used the mean RMSE across the tasks as a criterion, and a quadratic loss for ℓ(W ). [sent-206, score-0.299]

86 CN penalization on the other hand always gives better testing error than the trace norm penalization, with a stronger advantage when very few training points are available. [sent-209, score-0.367]

87 For 28 training points, no method recovers the correct clustering structure, as displayed on Figure 2, although CN performs slightly better than the k-means approach since the metric it learns is more diffuse. [sent-224, score-0.227]

88 For 50 training points, CN performs much better than the k-means approach, which completely fails to recover the clustering structure as illustrated by the Σ learned for 28 and 50 training points on Figure 2. [sent-225, score-0.321]

89 When more training points become available, the k-means approach perfectly recovers the clustering structure and outperforms the relaxed approach. [sent-227, score-0.272]

90 2 MHC-I binding data We also applied our method to the IEDB MHC-I peptide binding benchmark proposed in [10]. [sent-231, score-0.402]

91 This database contains binding affinities of various peptides, i. [sent-232, score-0.164]

92 This binding process is central in the immune system, and predicting it is crucial, for example to design vaccines. [sent-235, score-0.207]

93 We used an orthogonal coding of the amino acids to represent the peptides and balanced 7 Table 1: Prediction error for the 10 molecules with less than 200 training peptides in IEDB. [sent-238, score-0.361]

94 Besides, it is well known in the vaccine design community that some molecules can be grouped into empirically defined supertypes known to have similar binding behaviors. [sent-252, score-0.362]

95 More accurate results could arise from such a cross validation, in particular concerning the number of clusters (here we limited the choice to 2 or 10 clusters). [sent-256, score-0.138]

96 On the other hand, all the multitask approaches improve the accuracy, the cluster norm giving the best performance. [sent-259, score-0.386]

97 The learned Σ, however, did not recover the known supertypes, although it may contain some relevant information on the binding behavior of the molecules. [sent-260, score-0.214]

98 5 Conclusion We have presented a convex approach to clustered multi-task learning, based on the design of a dedicated norm. [sent-261, score-0.297]

99 A community resource benchmarking predictions of peptide binding to MHC-I molecules. [sent-359, score-0.238]

100 Efficient peptide-MHC-I binding prediction for alleles with few known binders. [sent-374, score-0.164]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('trw', 0.493), ('tasks', 0.268), ('cluster', 0.205), ('sc', 0.201), ('cn', 0.187), ('wc', 0.181), ('binding', 0.164), ('norms', 0.154), ('clusters', 0.138), ('penalty', 0.137), ('convex', 0.123), ('norm', 0.118), ('reprojected', 0.113), ('wi', 0.111), ('sr', 0.11), ('relaxation', 0.106), ('minw', 0.106), ('rd', 0.102), ('clustering', 0.1), ('trace', 0.099), ('molecules', 0.099), ('clustered', 0.086), ('iedb', 0.085), ('peptides', 0.085), ('trm', 0.085), ('rmse', 0.08), ('customers', 0.076), ('peptide', 0.074), ('priori', 0.069), ('weight', 0.069), ('beforehand', 0.068), ('matrices', 0.066), ('frobenius', 0.066), ('hypothesis', 0.065), ('reached', 0.065), ('singular', 0.065), ('evgeniou', 0.063), ('multitask', 0.063), ('rue', 0.06), ('penalization', 0.057), ('cninit', 0.056), ('reinterpretation', 0.056), ('supertypes', 0.056), ('semide', 0.056), ('sharing', 0.053), ('vectors', 0.051), ('recover', 0.05), ('points', 0.05), ('cbio', 0.049), ('fontainebleau', 0.049), ('honor', 0.049), ('inserm', 0.049), ('mines', 0.049), ('nities', 0.049), ('saint', 0.049), ('pooling', 0.049), ('mc', 0.049), ('orthogonal', 0.049), ('regularization', 0.048), ('min', 0.048), ('mr', 0.045), ('dedicated', 0.045), ('om', 0.045), ('wt', 0.045), ('dual', 0.044), ('enforces', 0.044), ('recovers', 0.044), ('groups', 0.044), ('training', 0.043), ('design', 0.043), ('curie', 0.042), ('canceling', 0.042), ('jacob', 0.042), ('centers', 0.041), ('metric', 0.04), ('france', 0.04), ('task', 0.039), ('equivalently', 0.039), ('constraints', 0.037), ('paristech', 0.037), ('emerged', 0.037), ('institut', 0.035), ('successive', 0.035), ('penalize', 0.035), ('live', 0.035), ('quanti', 0.035), ('variance', 0.035), ('structure', 0.035), ('impose', 0.033), ('run', 0.033), ('formulation', 0.032), ('matrix', 0.032), ('optimizing', 0.032), ('indeed', 0.032), ('spectral', 0.032), ('eigenvalues', 0.031), ('quadratic', 0.031), ('preferences', 0.03), ('denoting', 0.03), ('tr', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach

Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1

2 0.14950949 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

Author: Tong Zhang

Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1

3 0.12469028 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

4 0.12417263 117 nips-2008-Learning Taxonomies by Dependence Maximization

Author: Matthew Blaschko, Arthur Gretton

Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1

5 0.12224044 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

6 0.11998736 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

7 0.11981785 48 nips-2008-Clustering via LP-based Stabilities

8 0.099633709 214 nips-2008-Sparse Online Learning via Truncated Gradient

9 0.097294591 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

10 0.096954428 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

11 0.094721638 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

12 0.088140406 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising

13 0.086882301 193 nips-2008-Regularized Co-Clustering with Dual Supervision

14 0.086062238 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph

15 0.082386382 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

16 0.080961816 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

17 0.080575563 104 nips-2008-Improved Moves for Truncated Convex Models

18 0.077930547 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

19 0.077807903 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

20 0.075702809 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.242), (1, -0.065), (2, -0.114), (3, 0.083), (4, 0.027), (5, -0.032), (6, -0.046), (7, -0.102), (8, -0.044), (9, -0.071), (10, 0.015), (11, 0.004), (12, -0.093), (13, 0.043), (14, 0.009), (15, 0.122), (16, -0.158), (17, 0.037), (18, 0.025), (19, -0.009), (20, -0.043), (21, 0.001), (22, -0.096), (23, 0.128), (24, 0.007), (25, -0.041), (26, 0.052), (27, 0.058), (28, -0.155), (29, 0.07), (30, -0.09), (31, -0.036), (32, 0.03), (33, 0.009), (34, 0.02), (35, 0.029), (36, -0.088), (37, -0.021), (38, 0.004), (39, 0.07), (40, 0.101), (41, -0.022), (42, 0.028), (43, -0.041), (44, -0.107), (45, 0.072), (46, 0.038), (47, -0.002), (48, 0.027), (49, -0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9503473 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach

Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1

2 0.7776379 48 nips-2008-Clustering via LP-based Stabilities

Author: Nikos Komodakis, Nikos Paragios, Georgios Tziritas

Abstract: A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and the duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain, and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, can automatically determine the number of clusters, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a centerbased clustering algorithm (selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well defined LP-based quantity that plays a key role to our algorithm’s success. Furthermore, we also introduce, what we call, the margins (another key ingredient in our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potentials of our method.

3 0.66432941 117 nips-2008-Learning Taxonomies by Dependence Maximization

Author: Matthew Blaschko, Arthur Gretton

Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1

4 0.65975827 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph

Author: Deli Zhao, Xiaoou Tang

Abstract: Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we tackle the problem of clustering complex data of multiple distributions and multiple scales. To this end, we develop an algorithm named Zeta l-links (Zell) which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local l-links of samples. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A new mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of a cluster in determinantal form. The popularity character of a cluster is conceptualized as the global fusion of variations of such a structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the hierarchical agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data clustering, imagery pattern clustering, and image segmentation show the competitive performance of Zell. The 98.1% accuracy, in the sense of the normalized mutual information (NMI), is obtained on the FRGC face data of 16028 samples and 466 facial clusters. 1

5 0.63999164 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering

Author: Shai Ben-David, Margareta Ackerman

Abstract: Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kleinberg, ([1]) that showed an impossibility result for such axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in [1]. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the object to be axiomatized. We show that principles like those formulated in Kleinberg’s axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure (CQM) is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how strong or conclusive the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (axioms) for such measures. Our axioms capture the principles expressed by Kleinberg’s axioms while retaining consistency. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we analyze the computational complexity of evaluating the quality of a given clustering and show that, for the proposed CQMs, it can be computed in polynomial time. 1

6 0.61491382 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

7 0.56591642 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

8 0.51946396 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

9 0.51089025 29 nips-2008-Automatic online tuning for fast Gaussian summation

10 0.51057565 202 nips-2008-Robust Regression and Lasso

11 0.49751276 193 nips-2008-Regularized Co-Clustering with Dual Supervision

12 0.47894439 104 nips-2008-Improved Moves for Truncated Convex Models

13 0.47392684 85 nips-2008-Fast Rates for Regularized Objectives

14 0.47025931 218 nips-2008-Spectral Clustering with Perturbed Data

15 0.46828756 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

16 0.46149671 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

17 0.46097335 214 nips-2008-Sparse Online Learning via Truncated Gradient

18 0.45714602 194 nips-2008-Regularized Learning with Networks of Features

19 0.45424062 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

20 0.42465851 196 nips-2008-Relative Margin Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.074), (7, 0.065), (12, 0.046), (15, 0.302), (16, 0.012), (28, 0.167), (57, 0.05), (59, 0.012), (63, 0.028), (71, 0.016), (77, 0.046), (78, 0.028), (83, 0.088)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94174373 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models

Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan

Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports. There are strong ties between decision making and psychiatry, with maladaptive decisions and behaviors being very prominent in people with psychiatric disorders. Depression is classically seen as following life events such as divorces and job losses. Longitudinal studies, however, have revealed that a significant fraction of the stressors associated with depression do in fact follow MDD onset, and that they are likely due to maladaptive behaviors prominent in MDD (Kendler et al., 1999). Clinically effective ’talking’ therapies for MDD such as cognitive and dialectical behavior therapies (DeRubeis et al., 1999; Bortolotti et al., 2008; Gotlib and Hammen, 2002; Power, 2005) explicitly concentrate on altering patients’ maladaptive behaviors and decision making processes. Decision making is a promising avenue into psychiatry for at least two more reasons. First, it offers powerful analytical tools. Control problems related to decision making are prevalent in a huge diversity of fields, ranging from ecology to economics, computer science and engineering. These fields have produced well-founded and thoroughly characterized frameworks within which many issues in decision making can be framed. Here, we will focus on framing issues identified in psychiatric settings within a normative decision making framework. Its second major strength comes from its relationship to neurobiology, and particularly those neuromodulatory systems which are powerfully affected by all major clinically effective pharmacotherapies in psychiatry. The understanding of these systems has benefited significantly from theoretical accounts of optimal control such as reinforcement learning (Montague et al., 1996; Kapur and Remington, 1996; Smith et al., 1999; Yu and Dayan, 2005; Dayan and Yu, 2006). Such accounts may be useful to identify in more specific terms the roles of the neuromodulators in psychiatry (Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008; Dayan and Huys, 2008). ∗ qhuys@cantab.net, joshuav@jhu.edu, dayan@gatsby.ucl.ac.uk; www.gatsby.ucl.ac.uk/∼qhuys/pub.html 1 Master Yoked Control Figure 1: The learned helplessness (LH) paradigm. Three sets of rats are used in a sequence of two tasks. In the first task, rats are exposed to escapable or inescapable shocks. Shocks come on at random times. The master rat is given escapable shocks: it can switch off the shock by performing an action, usually turning a wheel mounted in front of it. The yoked rat is exposed to precisely the same shocks as the master rat, i.e its shocks are terminated when the master rat terminates the shock. Thus its shocks are inescapable, there is nothing it can do itself to terminate them. A third set of rats is not exposed to shocks. Then, all three sets of rats are exposed to a shuttlebox escape task. Shocks again come on at random times, and rats have to shuttle to the other side of the box to terminate the shock. Only yoked rats fail to acquire the escape response. Yoked rats generally fail to acquire a wide variety of instrumental behaviours, either determined by reward or, as here, by punishment contingencies. This paper represents an initial attempt at validating this approach experimentally. We will frame core notions of MDD in a reinforcement learning framework and use it to design behavioral decision making experiments. More specifically, we will concentrate on two concepts central to current thinking about MDD: anhedonia and learned helplessness (LH, Maier and Seligman 1976; Maier and Watkins 2005). We formulate helplessness parametrically as prior beliefs on aspects of decision trees, and anhedonia as the effective reward size. This allows us to use choice behavior to infer the degree to which subjects’ behavioral choices are characterized by either of these. For validation, we correlate the parameters inferred from subjects’ behavior with standard, questionnaire-based measures of hopelessness and anhedonia, and finally use the inferred parameters alone to attempt to recover the diagnostic classification. 1 Core concepts: helplessness and anhedonia The basic LH paradigm is explained in figure 1. Its importance is manifold: the effect of inescapable shock on subsequent learning is sensitive to most classes of clinically effective antidepressants; it has arguably been a motivation framework for the development of the main talking therapies for depression (cognitive behavioural therapy, Williams (1992), it has motivated the development of further, yet more specific animal models (Willner, 1997), and it has been the basis of very specific research into the cognitive basis of depression (Peterson et al., 1993). Behavioral control is the central concept in LH: yoked and master rat do not differ in terms of the amount of shock (stress) they have experienced, only in terms of the behavioural control over it. It is not a standard notion in reinforcement learning, and there are several ways one could translate the concept into RL terms. At a simple level, there is intuitively more behavioural control if, when repeating one action, the same outcome occurs again and again, than if this were not true. Thus, at a very first level, control might be related to the outcome entropy of actions (see Maier and Seligman 1976 for an early formulation). Of course, this is too simple. If all available actions deterministically led to the same outcome, the agent has very little control. Finally, if one were able to achieve all outcomes except for the one one cares about (in the rats’ case switching off or avoiding the shock), we would again not say that there is much control (see Huys (2007); Huys and Dayan (2007) for a more detailed discussion). Despite its obvious limitations, we will here concentrate on the simplest notion for reasons of mathematical expediency. 2 0.6 0.5 Exploration vs Exploitation Predictive Distributions Q(aknown)−Q(aunknown) P(reward a known ) 0.7 2 0 1 2 3 4 5 0.4 0.3 0.2 Choose blue slot machine 0.5 0 −0.5 0.1 0 1 1 2 3 4 5 Reward −1 Choose orange slot machine 1 High control Low control 2 3 4 5 Tree depth Figure 2: Effect of γ on predictions, Q-values and exploration behaviour. Assume a slot machine (blue) has been chosen five times, with possible rewards 1-5, and that reward 2 has been obtained twice, and reward 4 three times (inset in left panel). Left: Predictive distribution for a prior with negative γ (low control) in light gray, and large γ (extensive control) in dark gray. We see that, if the agent believes he has much control (and outcome distributions have low entropy), the predictive distribution puts all mass on the observations. Right: Assume now the agent gets up to 5 more pulls (tree depth 1-5) between the blue slot machine and a new, orange slot machine. The orange slot machine’s predictive distribution is flat as it has never been tried, and its expected value is therefore 3. The plot shows the difference between the values for the two slot machines. First consider the agent only has one more pull to take. In this case, independently of the priors about control, the agent will choose the blue machine, because it is just slightly better than average. Note though that the difference is more pronounced if the agent has a high control prior. But things change if the agent has two or more choices. Now, it is worth trying out the new machine if the agent has a high-control prior. For in that case, if the new machine turns out to yield a large reward on the first try, it is likely to do so again for the second and subsequent times. Thus, the prior about control determines the exploration bonus. The second central concept in current conceptions of MDD is that of reward sensitivity. Anhedonia, an inability to enjoy previously enjoyable things, is one of two symptoms necessary for the diagnosis of depression (American Psychiatric Association, 1994). A number of tasks in the literature have attempted to measure reward sensitivity behaviourally. While these generally concur in finding decreased reward sensitivity in subjects with MDD, these results need further clarification. Some studies show interactions between reward and punishment sensitivities with respect to MDD, but important aspects of the tasks are not clearly understood. For instance, Henriques et al. (1994); Henriques and Davidson (2000) show decreased resonsiveness of MDD subjects to rewards, but equally show decreased resonsiveness of healthy subjects to punishments. Pizzagalli et al. (2005) introduced an asymmetrically rewarded perceptual discrimination task and show that the rate of change of the response bias is anticorrelated with subjects’ anhedonic symptoms. Exactly how decreased reward responsivity can account for this is at pressent not clear. Great care has to be taken to disentangle these two concepts. Anhedonia and helplessness both provide good reasons for not taking an action: either because the reinforcements associated with the action are insufficient (anhedonia), or because the outcome is not judged a likely result of taking some particular action (if actions are thought to have large outcome entropy). 2 A Bayesian formulation of control We consider a scenario where subjects have no knowledge of the outcome distributions of actions, but rather learn about them. This means that their prior beliefs about the outcome distributions are not overwhelmed by the likelihood of observations, and may thus have measurable effects on their action choices. In terms of RL, this means that agents do not know the decision tree of the problem they face. Control is formulated as a prior distribution on the outcome distributions, and thereby as a prior distribution on the decision trees. The concentration parameter α of a Dirichlet process can very simply parametrise entropy, and, if used as a prior, allow for very efficient updates of the predictive distributions of actions. Let us assume we have actions A which have as outcomes rewards R, and keep count Nt (r, a) = 3 k:k < 0. Here, we included a regressor for the AGE as that was a confounding variable in our subject sample. Furthermore, if it is true that anhedonia, as expressed by the questionnaire, relates to reward sensitivity specifically, we should be able to write a similar regression for the learning rate ǫ (from equation 5) ǫ(BDIa, AGE) = θǫ BDIa + cǫ AGE + ζǫ but find that θǫ is not different from zero. Figure 4 shows the ML values for the parameters of interest (emphasized in blue in the equations) and confirms that people who express higher levels of anhedonia do indeed show less reward sensitivity, but do not differ in terms of learning rate. If it were the case that subjects with higher BDIa score were just less attentive to the task, one might also expect an effect of BDIa on learning rate. 3.2 Control Validation: The control task is new, and we first need to ascertain that subjects were indeed sensitive to main features of the task. We thus fit both a RW-learning rule (as in the previous section, but adjusted for the varying number of available actions), and the full control model. Importantly, both these models have two parameters, but only the full control model has a notion of outcome entropy, and evaluations a tree. The chance probability of subjects’ actions was 0.37, meaning that, on average, there were just under three machines on the screen. The probability of the actions under the RW-learning rule was better at 0.48, and that of the full control model 0.54. These differences are highly significant as the total number of choices is 29600. Thus, we conclude that subjects were indeed sensitive to the manipulation of outcome entropy, and that they did look ahead in a tree. Prior belief about control: Applying the procedure from the previous task to the main task, we write the main parameters of equations 2 and 4 as functions of the questionnaire measures and infer linear parameters: γ1 (BDIa, BHS, age) = χγ1 BHS + θγ1 BDIa + cγ1 AGE + ζγ1 γ2 (BDIa, BHS, age) = χγ2 BHS + θγ2 BDIa + cγ2 AGE + ζγ2 β(BDIa, BHS, age) = χβ BHS + θβ BDIa + cβ AGE + ζβ Importantly, because the BDIa scores and the BHS scores are correlated in our sample (they tend to be large for the subjects with MDD), we include the cross-terms (θγ1 , θγ2 , χγ ), as we are interested in the specific effects of BDIa on β, as before, and of BHS on γ. 6 3 control γ 2 Figure 6: Classification. Controls are shown as black dots, and depressed subjects as red crosses. The blue line is a linear classifier. Thus, the patients and controls can be approximately classified purely on the basis of behaviour. 1 0 83% correct 69% sensitivity 94% specificity −1 −2 2 4 6 8 10 12 14 16 reward sensitivity β We here infer and display two separate values γ1 and γ2 . These correspond to the level of control in the first and the second half of the experiment. In fact, to parallel the LH experiments better, the slot machines in the first 50 rooms were actually very noisy (low true γ), which means that subjects were here exposed to low levels of control just like the yoked rats in the original experiment. In the second half of the experiment on the other hand, slot machines tended to be quite reliable (high true γ). Figure 5 shows again the ML values for the parameters of interest (emphasized in blue in the equations). Again, we find that our parameter estimate are very significantly different from zero (> three standard deviations). The effect of the BHS score on the prior beliefs about control γ is much stronger in the second half than of the experiment in the first half, i.e. the effect of BHS on the prior belief about control is particularly prominent when subjects are in a high-control environment and have previously been exposed to a low-control environment. This is an interesting parallel to the learned helplessness experiments in animals. 3.3 Classification Finally we combine the two tasks. We integrate out the learning rate ǫ, which we had found not be related to the questionnaire measures (c.f. figure 4), and use the distribution over β from the first task as a prior distribution on β for the second task. We also put weak priors on γ and infer both β and γ for the second task on a subject-by-subject basis. Figure 6 shows the posterior values for γ and β for MDD and healthy subjects and the ability of a linear classifier to classify them. 4 Discussion In this paper, we have attempted to provide a specific formulation of core psychiatric concepts in reinforcement learning terms, i.e. hopelessness as a prior belief about controllability, and anhedonia as reward sensitivity. We have briefly explained how we expect these formulations to have effect in a behavioural situation, have presented a behavioral task explicitly designed to be sensitive to our formulations, and shown that people’s verbal expression of hopelessness and anhedonia do have specific behavioral impacts. Subjects who express anhedonia display insensitivity to rewards and those expressing hopelessness behave as if they had prior beliefs that outcome distributions of actions (slot machines) are very broad. Finally, we have shown that these purely behavioural measures are also predictive of their psychiatric status, in that we were able to classify patients and healthy controls purely on the basis of performance. Several aspects of this work are novel. There have been previous attempts to map aspects of psychiatric dysfunction onto specific parametrizations (Cohen et al., 1996; Smith et al., 2004; Williams and Dayan, 2005; Moutoussis et al., 2008), but we believe that our work represents the first attempt to a) apply it to MDD; b) make formal predictions about subject behavior c) present strong evidence linking anhedonia specifically to reward insensitivity across two tasks d) combine tasks to tease helplessness and anhedonia apart and e) to use the behavioral inferences for classification. The latter point is particularly important, as it will determine any potential clinical significance (Veiel, 1997). In the future, rather than cross-validating with respect to say DSM-IV criteria, it may also be important to validate measures such as ours in their own right in longitudinal studies. 7 Several important caveats do remain. First, the populations are not fully matched for age. We included age as an additional regressor and found all results to be robust. Secondly, only the healthy subjects were remunerated. However, repeating the analyses presented using only the MDD subjects yields the same results (data not shown). Thirdly, we have not yet fully mirrored the LH experiments. We have so far only tested the transfer from a low-control environment to a high-control environment. To make statements like those in animal learned helplessness experiments, the transfer from high-control to low-control environments will need to be examined, too. Fourth, the notion of control we have used is very simple, and more complex notions should certainly be tested (see Dayan and Huys 2008). Fifth, and maybe most importantly, we have so far only attempted to classify MDD and healthy subjects, and can thus not yet make any statements about the specificity of these effects with respect to MDD. Finally, it will be important to replicate these results independently, and possibly in a different modality. Nevertheless, we believe these results to be very encouraging. Acknowledgments: This work would not have been possible without the help of Sarah Hollingsworth Lisanby, Kenneth Miller and Ramin V. Parsey. We would also like to thank Nathaniel Daw and Hanneke EM Den Ouden and Ren´ Hen for invaluable discussions. Support for this work was provided by the Gatsby Charitable e Foundation (PD), a UCL Bogue Fellowship and the Swartz Foundation (QH) and a Columbia University startup grant to Kenneth Miller. References American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association Press. Bortolotti, B., Menchetti, M., Bellini, F., Montaguti, M. B., and Berardi, D. (2008). Psychological interventions for major depression in primary care: a meta-analytic review of randomized controlled trials. Gen Hosp Psychiatry, 30(4):293–302. Cohen, J. D., Braver, T. S., and O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346):1515–1527. Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879. Dayan, P. and Huys, Q. J. M. (2008). Serotonin, inhibition, and negative mood. PLoS Comput Biol, 4(2):e4. Dayan, P. and Yu, A. J. (2006). Phasic norepinephrine: a neural interrupt signal for unexpected events. Network, 17(4):335– 350. DeRubeis, R. J., Gelfand, L. A., Tang, T. Z., and Simons, A. D. (1999). Medications versus cognitive behavior therapy for severely depressed outpatients: mega-analysis of four randomized comparisons. Am J Psychiatry, 156(7):1007–1013. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002a). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-Patient Edition. (SCID-I/NP). Biometrics Research, New York State Psychiatric Institute. First, M. B., Spitzer, R. L., Gibbon, M., and Williams, J. B. (2002b). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition. (SCID-I/P). Biometrics Research, New York State Psychiatric Institute. Gotlib, I. H. and Hammen, C. L., editors (2002). Handbook of Depression. The Guilford Press. Henriques, J. B. and Davidson, R. J. (2000). Decreased responsiveness to reward in depression. Cognition and Emotion, 14(5):711–24. Henriques, J. B., Glowacki, J. M., and Davidson, R. J. (1994). Reward fails to alter response bias in depression. J Abnorm Psychol, 103(3):460–6. Huys, Q. J. M. (2007). Reinforcers and control. Towards a computational ætiology of depression. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, University of London. Huys, Q. J. M. and Dayan, P. (2007). A bayesian formulation of behavioral control. Under Review, 0:00. Kapur, S. and Remington, G. (1996). Serotonin-dopamine interaction and its relevance to schizophrenia. Am J Psychiatry, 153(4):466–76. Kendler, K. S., Karkowski, L. M., and Prescott, C. A. (1999). Causal relationship between stressful life events and the onset of major depression. Am. J. Psychiatry, 156:837–41. Maier, S. and Seligman, M. (1976). Learned Helplessness: Theory and Evidence. Journal of Experimental Psychology: General, 105(1):3–46. Maier, S. F. and Watkins, L. R. (2005). Stressor controllability and learned helplessness: the roles of the dorsal raphe nucleus, serotonin, and corticotropin-releasing factor. Neurosci. Biobehav. Rev., 29(4-5):829–41. Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16(5):1936–47. Moutoussis, M., Bentall, R. P., Williams, J., and Dayan, P. (2008). A temporal difference account of avoidance learning. Network, 19(2):137–160. Peterson, C., Maier, S. F., and Seligman, M. E. P. (1993). Learned Helplessness: A theory for the age of personal control. OUP, Oxford, UK. Pizzagalli, D. A., Jahn, A. L., and O’Shea, J. P. (2005). Toward an objective characterization of an anhedonic phenotype: a signal-detection approach. Biol Psychiatry, 57(4):319–327. Power, M., editor (2005). Mood Disorders: A Handbook of Science and Practice. John Wiley and Sons, paperback edition. Smith, A., Li, M., Becker, S., and Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: a computational approach. Neuropsychopharm., 29(6):1040–9. Smith, K. A., Morris, J. S., Friston, K. J., Cowen, P. J., and Dolan, R. J. (1999). Brain mechanisms associated with depressive relapse and associated cognitive impairment following acute tryptophan depletion. Br. J. Psychiatry, 174:525–9. Veiel, H. O. F. (1997). A preliminary profile of neuropsychological deficits associated with major depression. J. Clin. Exp. Neuropsychol., 19:587–603. Williams, J. and Dayan, P. (2005). Dopamine, learning, and impulsivity: a biological account of attentiondeficit/hyperactivity disorder. J Child Adolesc Psychopharmacol, 15(2):160–79; discussion 157–9. Williams, J. M. G. (1992). The psychological treatment of depression. Routledge. Willner, P. (1997). Validity, reliability and utility of the chronic mild stress model of depression: a 10-year review and evaluation. Psychopharm, 134:319–29. Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692. 8

2 0.82654244 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

Author: Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani

Abstract: We consider the problem of extracting smooth, low-dimensional neural trajectories that summarize the activity recorded simultaneously from tens to hundreds of neurons on individual experimental trials. Current methods for extracting neural trajectories involve a two-stage process: the data are first “denoised” by smoothing over time, then a static dimensionality reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way, and account for spiking variability that may vary both across neurons and across time. We then present a novel method for extracting neural trajectories, Gaussian-process factor analysis (GPFA), which unifies the smoothing and dimensionality reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that GPFA provided a better characterization of the population activity than the two-stage methods. 1

same-paper 3 0.7837494 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

Author: Laurent Jacob, Jean-philippe Vert, Francis R. Bach

Abstract: In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the IEDB MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1

4 0.74981374 224 nips-2008-Structured ranking learning using cumulative distribution networks

Author: Jim C. Huang, Brendan J. Frey

Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1

5 0.61901581 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We propose new families of models and algorithms for high-dimensional nonparametric learning with joint sparsity constraints. Our approach is based on a regularization method that enforces common sparsity patterns across different function components in a nonparametric additive model. The algorithms employ a coordinate descent approach that is based on a functional soft-thresholding operator. The framework yields several new models, including multi-task sparse additive models, multi-response sparse additive models, and sparse additive multi-category logistic regression. The methods are illustrated with experiments on synthetic data and gene microarray data. 1

6 0.61483383 231 nips-2008-Temporal Dynamics of Cognitive Control

7 0.61455238 194 nips-2008-Regularized Learning with Networks of Features

8 0.60985816 202 nips-2008-Robust Regression and Lasso

9 0.60760558 75 nips-2008-Estimating vector fields using sparse basis field expansions

10 0.60700738 95 nips-2008-Grouping Contours Via a Related Image

11 0.60437322 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

12 0.60427719 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

13 0.60089839 66 nips-2008-Dynamic visual attention: searching for coding length increments

14 0.59865797 10 nips-2008-A rational model of preference learning and choice prediction by children

15 0.59827173 226 nips-2008-Supervised Dictionary Learning

16 0.59547848 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding

17 0.59468722 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

18 0.59443307 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

19 0.59441298 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

20 0.59429032 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning