nips nips2010 nips2010-177 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Novi Quadrianto, James Petterson, Tibério S. Caetano, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. This is in contrast with existing methods which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. Our proposed approach has a direct application for data integration with different label spaces, such as integrating Yahoo! and DMOZ web directories. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. [sent-5, score-0.682]
2 This is in contrast with existing methods which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. [sent-6, score-0.518]
3 Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. [sent-7, score-0.131]
4 Our proposed approach has a direct application for data integration with different label spaces, such as integrating Yahoo! [sent-8, score-0.219]
5 1 Introduction In machine learning it is widely known that if several tasks are related, then learning them simultaneously can improve performance [1–4]. [sent-10, score-0.145]
6 If one views learning as the task of inferring a function f from the input space X to the output space Y, then multitask learning is the problem of inferring several functions fi : Xi → Yi simultaneously. [sent-12, score-0.283]
7 Traditionally, one either assumes that the sets of labels Yi are the same for all tasks (that is, Yi = Y for all i), or that we have access to an oracle mapping function gi,j : Yi → Yj. [sent-13, score-0.219]
8 Our motivating example is the problem of learning to automatically categorize objects on the web into an ontology or directory. [sent-15, score-0.294]
9 It is well established that many web-related objects such as web directories and RSS directories admit a (hierarchical) categorization, and web directories aim to do this in a semi-automated fashion. [sent-16, score-1.23]
10 A natural goal is then to enrich one such categorization, for instance the Yahoo! directory, to take into account other web directories such as DMOZ. [sent-18, score-0.465]
11 Although the tasks are clearly related, their label sets are not identical. [sent-19, score-0.3]
12 Furthermore, different editors may have made different decisions about the ontology depth and structure, leading to incompatibilities. [sent-21, score-0.127]
13 To make matters worse, these ontologies evolve with time and certain topic labels may die naturally due to lack of interest or expertise while other new topic labels may be added to the directory. [sent-22, score-0.565]
14 Given the large label space, it is unrealistic to expect that a label mapping function is readily available. [sent-23, score-0.371]
15 However, the two tasks are clearly related and learning them simultaneously is likely to improve performance. [sent-24, score-0.145]
16 This paper presents a method to learn classifiers from a collection of related tasks or data sets, in which each task has its own label dictionary, without constructing an explicit label mapping among them. [sent-25, score-0.545]
17 We formulate the problem as that of maximizing mutual information among the label sets. [sent-26, score-0.166]
18 We then show that this maximization problem yields an objective function which can be written as a difference of concave functions. [sent-27, score-0.123]
19 By exploiting convex duality [5], we can solve the resulting optimization problem efficiently in the dual space using existing DC programming algorithms [6]. [sent-28, score-0.173]
20 Related Work As described earlier, our work is closely related to research efforts on multitask learning, where the problem of simultaneously learning multiple related tasks is addressed. [sent-33, score-0.369]
21 Several papers have empirically and theoretically highlighted the benefits of multitask learning over single-task learning when the tasks are related. [sent-34, score-0.34]
22 The works of [2, 7, 8] consider the setting when the tasks to be learned jointly share a common subset of features. [sent-36, score-0.159]
23 There is also work on data integration via multitask learning where each data source has the same binary label space, whereas the attributes of the inputs can admit different orderings as well as be linearly transformed [11]. [sent-41, score-0.482]
24 We briefly develop background on the maximum entropy estimation problem and its dual in Section 2. [sent-43, score-0.275]
25 We introduce in Section 3 the novel multitask formulation in terms of a mutual information maximization criterion. [sent-44, score-0.318]
26 Section 4 presents the algorithm to solve the optimization problem posed by the multitask formulation. [sent-45, score-0.257]
27 We then present the experimental results, including applications to news article and web directory data integration, in Section 5. [sent-46, score-0.694]
28 2 Maximum Entropy Duality for Conditional Distributions Here we briefly summarize the well-known duality relation between approximate conditional maximum entropy estimation and maximum a posteriori (MAP) estimation [5, 12]. [sent-48, score-0.303]
29 Also note that by enforcing the moment matching constraint exactly, that is, setting $\epsilon = 0$, we recover the well-known duality between maximum (Shannon) entropy and maximum likelihood (ML) estimation. [sent-64, score-0.303]
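A sketch of this relation (up to constants, with the exact norms and regularizers as in [5, 12]): the approximate conditional maximum entropy problem $\max_{p(y|x)} \sum_{i=1}^{m} H(y|x_i)$ subject to $\|E_{y \sim p(y|X)}[\phi(X, y)] - \mu\| \le \epsilon$ has as its dual a regularized maximum likelihood (MAP) problem in the natural parameter of an exponential family,

$\min_{\theta} \; -\sum_{i=1}^{m} \log p(y_i|x_i, \theta) + m\,\epsilon\,\|\theta\|_{*}, \qquad p(y|x, \theta) \propto \exp(\langle \phi(x, y), \theta \rangle),$

where $\|\cdot\|_{*}$ denotes the norm dual to the one used in the constraint.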
30 If we are interested in solving each of the categorization tasks independently, the maximum entropy estimator described in Section 2 can be readily employed [13]. [sent-84, score-0.557]
31 Here we would like to learn the two tasks simultaneously in order to improve classification accuracy. [sent-85, score-0.17]
32 Assuming that the labels are different yet correlated, the joint distribution p(y, y') should display high mutual information between y and y'. [sent-86, score-0.219]
33 Recall that the mutual information between random variables y and y' is defined as $I(y, y') = H(y) + H(y') - H(y, y')$, and that this quantity is high when the two variables are mutually dependent. [sent-87, score-0.123]
34 For instance, when integrating the Yahoo! and DMOZ web directories, we would expect a high mutual dependency between the section heading ‘Computer & Internet’ in the Yahoo! [sent-89, score-0.362]
35 directory and ‘Computers’ in the DMOZ directory, although they are named slightly differently. [sent-90, score-0.207]
36 Since the marginal distributions over the labels, p(y) and p(y'), are fixed, maximizing mutual information can then be viewed as minimizing the joint entropy $H(y, y') = -\sum_{y, y'} p(y, y') \log p(y, y')$ (3). [sent-91, score-0.341]
37 This reasoning leads us to adding the joint entropy as an additional term in the objective function of the multitask problem. [sent-92, score-0.508]
38 If we define $\mu = \frac{1}{m}\sum_{i=1}^{m} \phi(x_i, y_i)$ and $\mu' = \frac{1}{m'}\sum_{i=1}^{m'} \phi'(x'_i, y'_i)$ (4), then we have the following objective function: $\max_{p(y|x),\, p(y'|x')} \; \sum_{i=1}^{m} H(y|x_i) + \sum_{i=1}^{m'} H(y'|x'_i) - \lambda H(y, y')$ (5), subject to $\|E_{y \sim p(y|X)}[\phi(X, y)] - \mu\| \le \epsilon$ and $\|E_{y' \sim p(y'|X')}[\phi'(X', y')] - \mu'\| \le \epsilon'$. [sent-93, score-0.113]
39 We can recover the single-task maximum entropy estimator by removing the joint entropy term (by setting λ = 0), since the optimization problem (the objective function as well as the constraints) in (5) then decouples in terms of p(y|x) and p(y'|x'). [sent-97, score-0.679]
40 There are two main challenges in solving (5): • The joint entropy term H(y, y') is concave, hence the objective of the above optimization problem is not concave in general (it is a difference of two concave functions). [sent-98, score-0.456]
41 We therefore propose to solve this non-concave problem using DC programming [6], in particular the concave-convex procedure (CCCP) [14, 15]. [sent-99, score-0.156]
42 • The joint distribution between labels, p(y, y'), is unknown. [sent-100, score-0.125]
43 We will estimate this quantity (and therefore the joint entropy) from the observations x and x'. [sent-101, score-0.312]
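A minimal sketch of such a plug-in estimate (an assumption of this sketch, consistent with the averaged product of conditionals appearing in (9): both models are evaluated on a common set of m inputs, and the joint is estimated as p̂(y, y') = (1/m) Σ_j p(y|x_j, θ) p(y'|x_j, θ')):

import numpy as np

def estimated_joint_entropy(P, P_prime, eps=1e-12):
    # P[i, y]        = p(y  | x_i, theta)   for the first task,  shape (m, |Y|)
    # P_prime[i, y'] = p(y' | x_i, theta')  for the second task, shape (m, |Y'|)
    m = P.shape[0]
    joint = P.T @ P_prime / m            # p_hat(y, y'), shape (|Y|, |Y'|)
    return -np.sum(joint * np.log(joint + eps))

# Toy usage with random conditionals over 3 and 4 labels on m = 5 inputs.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=5)
P_prime = rng.dirichlet(np.ones(4), size=5)
print(estimated_joint_entropy(P, P_prime))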
44 4 Optimization The concave-convex procedure (CCCP) works as follows: for a given function f(x) = g(x) − h(x), where g is concave and −h is convex, a lower bound can be found by $f(x) \ge g(x) - h(x_0) - \langle \partial h(x_0),\, x - x_0 \rangle$ (6). [sent-109, score-0.209]
45 This lower bound is concave and can be maximized effectively over a convex domain. [sent-110, score-0.123]
46 Therefore, one potential approach to solving the optimization problem in (5) is to use successive linearizations of H(y, y') and to solve the resulting decoupled problems in p(y|x) and p(y'|x') separately. [sent-113, score-0.143]
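As a toy illustration of the CCCP template in (6) (a generic sketch, not the solver used in the paper: here g(x) = -x^4 and h(x) = -(x-1)^2 are both concave, and the inner maximization of the linearized surrogate has a closed form):

import numpy as np

def cccp_step(x_t):
    # Linearize the concave part h at x_t:  h(x) ~ h(x_t) + h'(x_t) * (x - x_t),
    # then maximize the concave surrogate  g(x) - h'(x_t) * x  in closed form.
    dh = -2.0 * (x_t - 1.0)          # h'(x_t) for h(x) = -(x - 1)^2
    return np.cbrt(-dh / 4.0)        # argmax_x of  -x^4 - dh * x

x = 0.0
for _ in range(30):
    x = cccp_step(x)                 # each step cannot decrease f(x) = g(x) - h(x)
print(x)                             # converges to a stationary point near x = -1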
47 Therefore, taking the derivative of the joint entropy with respect to p(y|x_i) and evaluating it at the parameters from iteration t−1, denoted $\theta^{t-1}$ and $\theta'^{t-1}$, yields $g_y(x_i) := -\partial_{p(y|x_i)} H(y, y'|X)$ (8) $= \frac{1}{m} \sum_{y'} \Big( 1 + \log \frac{1}{m} \sum_{j=1}^{m} p(y|x_j, \theta^{t-1})\, p(y'|x_j, \theta'^{t-1}) \Big)\, p(y'|x_i, \theta'^{t-1})$ (9). [sent-116, score-0.357]
48 Define $g_{y'}(x_i)$, $g_y(x'_i)$, and $g_{y'}(x'_i)$ similarly for the derivatives with respect to $p(y'|x_i)$, $p(y|x'_i)$, and $p(y'|x'_i)$, respectively. [sent-117, score-0.33]
49 This leads, by optimizing the lower bound in (6), to the following decoupled optimization problem in p(y|x_i) (with an analogous problem in p(y'|x'_i)): $\min_{p(y|x)} \; \sum_{i=1}^{m} \Big( -H(y|x_i) + \lambda \sum_{y} g_y(x_i)\, p(y|x_i) \Big) + \sum_{i=1}^{m'} \Big( -H(y|x'_i) + \lambda \sum_{y} g_y(x'_i)\, p(y|x'_i) \Big)$ (10a) subject to $\| E_{y \sim p(y|X)}[\phi(X, y)] - \mu \| \le \epsilon$ (10b). [sent-118, score-0.297]
50 The above objective function is still in the form of maximum entropy estimation, with the linearization of the joint entropy quantities acting like additional evidence terms. [sent-119, score-0.538]
51 Furthermore, we also impose an additional maximum entropy requirement on the ‘off-set’ observations p(y|x'_i), since we also want the ‘simplicity’ requirement on the distribution p for the inputs x'_i. [sent-120, score-0.403]
52 While we succeed in reducing the non-concave objective function in (5) to a decoupled concave objective function in (10), it might be desirable to solve the problem in the dual space due to the difficulty of handling the constraint in (10b). [sent-122, score-0.322]
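A sketch of what this means in practice (derived from the standard maximum entropy dual; the precise regularizer depends on the norm used in (10b)): each subproblem retains an exponential-family solution in which the linearized joint entropy enters as a fixed offset,

$p(y|x_i, \theta) \propto \exp\big( \langle \phi(x_i, y), \theta \rangle - \lambda\, g_y(x_i) \big),$

with θ obtained by minimizing the corresponding regularized negative log-likelihood, as in Section 2.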
53 Initialization For each iteration of CCCP, the linearization of the joint entropy function requires the values of θ and θ' from the previous iteration (refer to (9)). [sent-128, score-0.278]
54 Algorithms We could not find, in the multitask learning literature, methods addressing the same problem as the one we study: learning multiple tasks when there is no correspondence between the output spaces. [sent-149, score-0.365]
55 Therefore we compared the performance of our multitask method against the baseline given by the maximum entropy estimator applied to each of the tasks independently. [sent-150, score-0.599]
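Since the single-task maximum entropy estimator coincides with (regularized) multinomial logistic regression, the baseline can be sketched, for example, with scikit-learn (an illustrative sketch on synthetic stand-in features; the feature extraction and regularization used in the actual experiments are not specified here):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
tasks = [(rng.normal(size=(200, 20)), rng.integers(0, 5, size=200)),   # task 1: 5 labels
         (rng.normal(size=(150, 20)), rng.integers(0, 4, size=150))]   # task 2: 4 labels

# Baseline: one independent maximum entropy (multinomial logistic) model per task.
baseline = [LogisticRegression(max_iter=1000).fit(X, y) for X, y in tasks]
print([clf.score(X, y) for clf, (X, y) in zip(baseline, tasks)])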
56 Note that we focus on the setting in which the data sources have disjoint sets of covariate observations (see Section 3), and thus a simple strategy of multilabel prediction over the union of the label sets corresponds to our baseline. [sent-151, score-0.283]
57 The weight λ on the joint entropy term was set to 1. [sent-155, score-0.247]
58 Pairwise Label Correlation Section 3 describes the multitask objective function for the case of the 2-task problem. [sent-156, score-0.261]
59 As a more computationally efficient alternative, we can consider the joint entropy of the pairwise distributions instead. [sent-158, score-0.247]
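One natural reading of this (a sketch of one such instantiation, not necessarily the exact form used in the experiments): for T tasks, instead of the full joint entropy $H(y^{(1)}, \dots, y^{(T)})$, which requires a table over the product of all label sets, penalize the sum of pairwise joint entropies $\sum_{s < t} H(y^{(s)}, y^{(t)})$, each of which only involves a $|\mathcal{Y}^{(s)}| \times |\mathcal{Y}^{(t)}|$ table.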
60 We find that, on average, jointly learning the multiple related tasks always improves the classification accuracy. [sent-161, score-0.159]
61 STL: single-task learning; MTL: multi-task learning; Upper Bound: multi-class learning. [sent-165, score-0.224]
62 When assessing the performance on each of the tasks, we notice that the advantage of learning jointly is particularly significant for those tasks with a smaller number of observations. [sent-331, score-0.159]
63 We use the Reuters1-v2 news article dataset [18], which has been pre-processed. [sent-334, score-0.149]
64 In the pre-processing stage, the label hierarchy is reorganized by mapping the data set to the second level of the topic hierarchy. [sent-335, score-0.437]
65 For this experiment, we use 12,500 news articles to form one set of data and another 12,500 news articles to form the second set of data. [sent-341, score-0.378]
66 In the first set, we group the news articles having the labels {1, 2}, {3, 4}, {5, 6}, {7, 8} and {9, 10} and re-label them as {1, 2, 3, 4, 5}. [sent-342, score-0.386]
67 The second set of data also has 5 labels, but this time the labels are obtained from a different grouping of the original classes. [sent-343, score-0.144]
68 STL: single-task learning accuracy; MTL: multi-task learning accuracy; % Imp.: percentage improvement. [sent-350, score-0.171]
69 Interestingly, DMOZ has a similar topic, but it is called ‘Computers’ and achieves an accuracy of about 75%. [sent-358, score-0.239]
70 STL: single-task learning accuracy; MTL: multi-task learning accuracy; % Imp.: percentage improvement. [sent-405, score-0.171]
71 The improvement of multitask over single-task learning on each topic is negligible for the DMOZ web directory. [sent-407, score-0.72]
72 Arguably, this can be partly explained by the fact that DMOZ has a higher average topic categorization accuracy than Yahoo! [sent-408, score-0.414]
73 We split the news articles in each set equally to form training and test sets. [sent-456, score-0.229]
74 We run maximum entropy estimators, p(y|x, θ) and p(y'|x', θ'), independently on the two sets, achieving accuracies of about 92%. [sent-457, score-0.343]
75 We then learn the two sets of news articles jointly and, on the first test set, achieve an accuracy of about 93%. [sent-460, score-0.381]
76 This experiment further emphasizes that it is possible to learn several related tasks simultaneously even though they have different label sets, and that it is beneficial to do so. [sent-464, score-0.381]
77 We consider the top level of Yahoo!’s topic tree and sample web links listed in the directory. [sent-468, score-0.386]
78 Similarly, we also consider the top level of the DMOZ topic tree and retrieve sampled web links. [sent-469, score-0.422]
79 We consider the content of the first page of each web link as our input data. [sent-470, score-0.204]
80 It is possible that the first page linked from the web directory contains mostly images (to attract visitors), so we only consider those webpages that have enough text to be valid inputs. [sent-471, score-0.423]
81 However, we find that crawling deeper for additional text is quite damaging, because as we crawl deeper the topics of the texts change rapidly. [sent-475, score-0.297]
82 From the experimental results on web directory integration, we observe the following: • Similarly to the experiments on MNIST digits and Reuters1-v2 news articles, multitask learning always helps on average, i.e., it improves the average accuracy for both the Yahoo! [sent-481, score-0.145]
83 and DMOZ web directories; • The improvement of multitask over single-task learning on each topic is more prominent for the Yahoo! [sent-484, score-0.144]
84 web directories and is negligible for the DMOZ web directories; [sent-485, score-0.955]
85 has a lower average topic categorization accuracy than DMOZ; [sent-489, score-0.362]
86 Interestingly, DMOZ has a similar topic, but it is called ‘Computers’ and achieves an accuracy of about 75%. [sent-501, score-0.239]
87 The improvement might be partly because our proposed method is able to discover the implicit label correlations despite the two topics being named differently; • Regarding the worst-classified categories, we have ‘News & Media’ for Yahoo! [sent-503, score-0.302]
88 This too is quite intuitive, as the ‘Health’ category contains mostly specific jargon and the ‘World’ category has much language-specific webpage content. [sent-508, score-0.145]
89 6 Discussion and Conclusion We presented a method to learn classifiers from a collection of related tasks or data sets, in which each task has its own label set. [sent-509, score-0.357]
90 Our method works without the need for an explicit mapping between the label spaces of the different tasks. [sent-510, score-0.188]
91 We formulate the problem as one of maximizing the mutual information among the label sets. [sent-511, score-0.251]
92 Our experiments on binary n-task (n ∈ {3, 5, 7, 10}) and multiclass 2-task problems revealed that, on average, jointly learning the multiple related tasks, albeit with different label sets, always improves the classification accuracy. [sent-512, score-0.231]
93 5% better prior to the application of our method; this shows that the method was able to transfer classification accuracy from the DMOZ task to the Yahoo! [sent-519, score-0.116]
94 Furthermore, the experiments seem to suggest that our proposed method is able to discover implicit label correlations despite the lack of label correspondences. [sent-521, score-0.34]
95 Although the experiments on web directory integration are encouraging, we have clearly only scratched the surface of the possibilities to be explored. [sent-522, score-0.527]
96 In the extreme case, we might consider the labels as corresponding to a directed acyclic graph (DAG) and encode the feature map associated with the label hierarchy accordingly. [sent-524, score-0.26]
97 One instance, as considered in [19], is to use a feature map φ(y) ∈ R^k for the k nodes in the DAG (excluding the root node) and to associate with every label y the vector describing the path from the root node to y, ignoring the root node itself. [sent-525, score-0.157]
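For a small tree-shaped hierarchy this feature map can be sketched as follows (an illustrative sketch over a hypothetical toy hierarchy: φ(y) ∈ {0, 1}^k marks the non-root nodes on the path from the root to y):

# Toy hierarchy: each node maps to its parent; the root is excluded from the feature map.
parent = {'computers': 'root', 'internet': 'computers', 'hardware': 'computers',
          'health': 'root', 'fitness': 'health'}
nodes = sorted(parent)                      # the k non-root nodes, in a fixed order
index = {n: i for i, n in enumerate(nodes)}

def phi(y):
    """phi(y) in {0,1}^k: indicators of the nodes on the path from the root to y."""
    vec = [0] * len(nodes)
    while y != 'root':
        vec[index[y]] = 1
        y = parent[y]
    return vec

print(phi('internet'))    # activates 'internet' and its ancestor 'computers'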
98 Furthermore, applications of data integration that admit a hierarchical categorization go beyond web-related objects. [sent-526, score-0.428]
99 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-545, score-0.116]
100 RCV1: A new benchmark collection for text categorization research. [sent-651, score-0.123]
wordName wordTfidf (topN-words)
[('dmoz', 0.499), ('yahoo', 0.367), ('directories', 0.261), ('multitask', 0.224), ('web', 0.204), ('entropy', 0.194), ('topic', 0.182), ('label', 0.157), ('news', 0.149), ('categorization', 0.123), ('tasks', 0.116), ('gy', 0.11), ('mutual', 0.094), ('ey', 0.094), ('xi', 0.092), ('ontology', 0.09), ('stl', 0.09), ('concave', 0.086), ('webpages', 0.085), ('mtl', 0.085), ('directory', 0.083), ('articles', 0.08), ('digit', 0.08), ('decoupled', 0.077), ('labels', 0.072), ('integration', 0.062), ('cccp', 0.06), ('task', 0.059), ('health', 0.058), ('ontologies', 0.057), ('accuracy', 0.057), ('multi', 0.053), ('joint', 0.053), ('partly', 0.052), ('internet', 0.052), ('dual', 0.052), ('duality', 0.051), ('texts', 0.051), ('computers', 0.05), ('recreation', 0.047), ('mnist', 0.047), ('regional', 0.045), ('jointly', 0.043), ('xm', 0.042), ('named', 0.041), ('admit', 0.039), ('heading', 0.038), ('arts', 0.038), ('yi', 0.038), ('classi', 0.037), ('objective', 0.037), ('editors', 0.037), ('convex', 0.037), ('observations', 0.036), ('estimator', 0.036), ('level', 0.036), ('sources', 0.036), ('yc', 0.036), ('dag', 0.036), ('crawl', 0.036), ('theodoros', 0.036), ('digits', 0.036), ('banach', 0.034), ('solve', 0.033), ('media', 0.032), ('correspondences', 0.032), ('economy', 0.032), ('government', 0.032), ('mapping', 0.031), ('massimiliano', 0.031), ('shannon', 0.031), ('linearization', 0.031), ('relatedness', 0.031), ('business', 0.031), ('hierarchy', 0.031), ('multiclass', 0.031), ('shared', 0.03), ('lewis', 0.03), ('lugosi', 0.03), ('quantity', 0.029), ('world', 0.029), ('simultaneously', 0.029), ('lastly', 0.029), ('dictionary', 0.029), ('maximum', 0.029), ('simon', 0.028), ('deeper', 0.028), ('sets', 0.027), ('experiment', 0.027), ('verlag', 0.027), ('australian', 0.027), ('readily', 0.026), ('requirement', 0.026), ('implicit', 0.026), ('evgeniou', 0.026), ('improvement', 0.026), ('dependency', 0.026), ('editor', 0.025), ('learn', 0.025), ('negligible', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999875 177 nips-2010-Multitask Learning without Label Correspondences
Author: Novi Quadrianto, James Petterson, Tibério S. Caetano, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. This is in contrast with existing methods which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. Our proposed approach has a direct application for data integration with different label spaces, such as integrating Yahoo! and DMOZ web directories. 1
2 0.1569379 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
Author: Arvind Agarwal, Samuel Gerber, Hal Daume
Abstract: We present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. One proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets. 1
3 0.13341242 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.12430516 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
Author: Yi Zhang, Jeff G. Schneider
Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.
5 0.10857416 286 nips-2010-Word Features for Latent Dirichlet Allocation
Author: James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, Alex J. Smola
Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model. 1
6 0.10794546 151 nips-2010-Learning from Candidate Labeling Sets
7 0.1067697 138 nips-2010-Large Margin Multi-Task Metric Learning
8 0.10576531 152 nips-2010-Learning from Logged Implicit Exploration Data
9 0.10290235 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
10 0.094464764 235 nips-2010-Self-Paced Learning for Latent Variable Models
11 0.093678474 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.093566962 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
13 0.084272891 150 nips-2010-Learning concept graphs from text with stick-breaking priors
14 0.082899012 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
15 0.075677298 118 nips-2010-Implicit Differentiation by Perturbation
16 0.075241834 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
17 0.074053742 27 nips-2010-Agnostic Active Learning Without Constraints
18 0.072828792 194 nips-2010-Online Learning for Latent Dirichlet Allocation
19 0.072198376 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
20 0.070366107 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
topicId topicWeight
[(0, 0.203), (1, 0.061), (2, 0.054), (3, -0.095), (4, -0.073), (5, 0.039), (6, 0.022), (7, -0.035), (8, -0.04), (9, -0.077), (10, 0.025), (11, 0.06), (12, 0.079), (13, 0.052), (14, 0.034), (15, -0.003), (16, -0.039), (17, -0.087), (18, -0.026), (19, -0.063), (20, -0.135), (21, -0.186), (22, 0.065), (23, 0.062), (24, -0.029), (25, -0.035), (26, -0.003), (27, -0.015), (28, 0.089), (29, 0.083), (30, 0.071), (31, 0.007), (32, 0.01), (33, 0.037), (34, -0.018), (35, -0.002), (36, -0.052), (37, 0.072), (38, -0.085), (39, 0.075), (40, -0.027), (41, -0.117), (42, 0.035), (43, 0.058), (44, 0.103), (45, 0.07), (46, -0.005), (47, 0.017), (48, 0.053), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.9312973 177 nips-2010-Multitask Learning without Label Correspondences
2 0.64262027 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
Author: Yi Zhang, Jeff G. Schneider
Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.
3 0.63023806 138 nips-2010-Large Margin Multi-Task Metric Learning
Author: Shibin Parameswaran, Kilian Q. Weinberger
Abstract: Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to support vector machines (svm) by Evgeniou et al. [15]. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers. 1
4 0.6243583 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
Author: Arvind Agarwal, Samuel Gerber, Hal Daume
Abstract: We present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. One proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets. 1
5 0.59761107 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
6 0.54066539 151 nips-2010-Learning from Candidate Labeling Sets
7 0.51770151 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
8 0.49798927 267 nips-2010-The Multidimensional Wisdom of Crowds
9 0.47478175 228 nips-2010-Reverse Multi-Label Learning
10 0.46074748 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
11 0.4545055 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.45061392 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
13 0.45029059 142 nips-2010-Learning Bounds for Importance Weighting
14 0.43918714 158 nips-2010-Learning via Gaussian Herding
15 0.43709832 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
16 0.43639037 114 nips-2010-Humans Learn Using Manifolds, Reluctantly
17 0.43069702 286 nips-2010-Word Features for Latent Dirichlet Allocation
18 0.42893609 99 nips-2010-Gated Softmax Classification
19 0.42688298 150 nips-2010-Learning concept graphs from text with stick-breaking priors
20 0.42547479 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
topicId topicWeight
[(13, 0.072), (17, 0.031), (27, 0.085), (30, 0.064), (35, 0.026), (45, 0.22), (47, 0.195), (50, 0.06), (52, 0.034), (60, 0.027), (77, 0.035), (78, 0.036), (90, 0.055)]
simIndex simValue paperId paperTitle
Author: Ken Takiyama, Masato Okada
Abstract: We propose an algorithm for simultaneously estimating state transitions among neural states, the number of neural states, and nonstationary firing rates using a switching state space model (SSSM). This algorithm enables us to detect state transitions on the basis of not only the discontinuous changes of mean firing rates but also discontinuous changes in temporal profiles of firing rates, e.g., temporal correlation. We construct a variational Bayes algorithm for a non-Gaussian SSSM whose non-Gaussian property is caused by binary spike events. Synthetic data analysis reveals that our algorithm has the high performance for estimating state transitions, the number of neural states, and nonstationary firing rates compared to previous methods. We also analyze neural data that were recorded from the medial temporal area. The statistically detected neural states probably coincide with transient and sustained states that have been detected heuristically. Estimated parameters suggest that our algorithm detects the state transition on the basis of discontinuous changes in the temporal correlation of firing rates, which transitions previous methods cannot detect. This result suggests that our algorithm is advantageous in real-data analysis.
same-paper 2 0.84493697 177 nips-2010-Multitask Learning without Label Correspondences
3 0.79073471 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
4 0.79015779 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
5 0.78973073 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
6 0.78758568 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
8 0.78571987 63 nips-2010-Distributed Dual Averaging In Networks
9 0.78557038 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
10 0.78554028 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
11 0.78493333 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
12 0.78427488 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
13 0.78397357 155 nips-2010-Learning the context of a category
14 0.78385419 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.78369719 265 nips-2010-The LASSO risk: asymptotic results and real world examples
16 0.78338897 158 nips-2010-Learning via Gaussian Herding
17 0.78310233 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
18 0.78241301 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
19 0.78210998 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
20 0.7816177 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average