nips nips2012 nips2012-222 knowledge-graph by maker-knowledge-mining

222 nips-2012-Multi-Task Averaging


Source: pdf

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Frigyik (Department of Electrical Engineering, University of Washington, Seattle, WA 98103). Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. [sent-3, score-0.079]

2 Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. [sent-6, score-0.072]

3 1 Introduction The motivating hypothesis behind multi-task learning (MTL) algorithms is that leveraging data from related tasks can yield superior performance over learning from each task independently. [sent-7, score-0.088]

4 Early evidence for this hypothesis is Stein’s work on the estimation of the means of T distributions (tasks) [1]. [sent-8, score-0.077]

5 Stein showed that it is better (in a summed squared error sense) to estimate each of the means of T Gaussian random variables using data sampled from all of them, even if they are independent and have different means. [sent-9, score-0.1]

6 That is, it is beneficial to consider samples from seemingly unrelated distributions in the estimation of the tth mean. [sent-10, score-0.142]

7 Estimating means is perhaps the most common of all estimation tasks, and often multiple means need to be estimated. [sent-12, score-0.132]

8 In this paper we consider a multi-task regularization approach to the problem of estimating multiple means that we call multi-task averaging (MTA). [sent-13, score-0.14]

9 In particular, we state the optimal amount of regularization to be used, and show that this optimal amount can be effectively estimated. [sent-17, score-0.079]

10 Simulations in Section 5 verify the advantage of MTA over standard sample means and James-Stein estimation if the true means are close compared to the sample variance. [sent-18, score-0.176]

11 In Section 6.1, two experiments estimating expected sales show that MTA can reduce real errors by over 30% compared to the sample mean. [sent-20, score-0.099]

12 MTA can be used anywhere multiple averages are needed; we demonstrate this by applying it fruitfully to the averaging step of kernel density estimation in Section 6. [sent-21, score-0.167]

13 2 Multi-Task Averaging: Consider the T-task problem of estimating the means of T random variables that have finite mean and variance. [sent-23, score-0.105]

14 D_tt = Σ_{r=1}^T A_tr. In addition, assume that the T × T matrix A describes the relatedness or similarity of any pair of the T tasks, with A_tt = 0 for all t without loss of generality (because the diagonal self-similarity terms are canceled in the objective below). [sent-32, score-0.096]

15 The regularization parameter γ balances the empirical risk and the multi-task regularizer. [sent-35, score-0.079]

16 Note that if γ = 0, then (1) decomposes into T separate minimization problems, producing the sample averages Ȳt. [sent-36, score-0.079]

17 The normalization of each error term in (1) by its task-specific variance σt² (which may be estimated) scales the T empirical loss terms relative to the variance of their distribution; this ensures that high-variance tasks do not disproportionately dominate the loss term. [sent-37, score-0.136]

18 If L is chosen to be any Bregman loss, then setting γ = 0 will produce the T sample averages [3]. [sent-39, score-0.079]

19 The task similarity matrix A can be specified as side information (e. [sent-41, score-0.108]

20 In Section 4 we derive two optimal choices of A for the T = 2 case: the A that minimizes expected squared error, and a minimax A. [sent-44, score-0.188]

21 The most closely related work is Stein estimation, an empirical Bayes strategy for estimating multiple means simultaneously [7, 8, 2, 9]. [sent-50, score-0.089]

22 James and Stein [7] showed that the maximum likelihood estimate of the tth mean µt can be dominated by a shrinkage estimate given Gaussian assumptions. [sent-51, score-0.145]

23 We hypothesize that this is due to the high variance of the estimate of the maximum eigenvalue of Σ. [sent-56, score-0.071]

24 MTA can be interpreted as estimating means of T Gaussians with an intrinsic Gaussian Markov random field prior [11]. [sent-57, score-0.089]

25 A key issue for MTA and many other multi-task learning methods is how to estimate the similarity (or task relatedness) between tasks and/or samples if it is not provided. [sent-59, score-0.172]

26 A common approach is to estimate the similarity matrix jointly with the task parameters [12, 13, 5, 14, 15]. [sent-60, score-0.132]

27 For example, Zhang and Yeung [15] assumed that there exists a covariance matrix for the task relatedness, and proposed a convex optimization approach to estimate the task covariance matrix and the task parameters in a joint, alternating way. [sent-61, score-0.237]

28 However, the simplicity of MTA enables us to specify the optimal task similarity matrix for T = 2 (see Sec. [sent-63, score-0.139]

29 It is straightforward to show that (1) has the closed-form solution Y* = (I + (γ/T) Σ L)^{-1} Ȳ (3), where Ȳ is the vector of sample averages with tth entry Ȳt = (1/Nt) Σ_{i=1}^{Nt} Yti, L is the graph Laplacian of A, and Σ is defined as before. [sent-66, score-0.16]
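
The following NumPy sketch assembles the closed-form estimate in (3). It assumes Σ = diag(σt²/Nt), the covariance of the sample averages, and the combinatorial graph Laplacian L = D − A; the function name, variable names, and toy numbers are illustrative, not taken from the paper's code.

```python
import numpy as np

def mta_estimate(y_bar, sigma2, n, A, gamma=1.0):
    """Closed-form MTA estimate Y* = (I + (gamma/T) Sigma L)^{-1} y_bar, as in Eq. (3).
    Assumes Sigma = diag(sigma_t^2 / N_t), the covariance of the sample averages."""
    y_bar = np.asarray(y_bar, dtype=float)
    T = len(y_bar)
    Sigma = np.diag(np.asarray(sigma2, dtype=float) / np.asarray(n, dtype=float))
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian of the task-similarity matrix A
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ y_bar

# toy usage: three tasks, all-ones similarity with zero diagonal
y_bar = [1.0, 1.2, 0.9]
sigma2 = [0.5, 0.4, 0.6]
n = [10, 15, 8]
A = np.ones((3, 3)) - np.eye(3)
print(mta_estimate(y_bar, sigma2, n, A, gamma=1.0))
```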

30 1 Convexity of MTA Solution: From inspection of (3), it is clear that each of the elements of Y* is a linear combination of the sample averages Ȳ. [sent-71, score-0.079]

31 However, a stronger statement can be made. Theorem: If γ ≥ 0, 0 ≤ Ars < ∞ for all r, s, and 0 < σt²/Nt < ∞ for all t, then the MTA estimates {Yt*} given in (3) are a convex combination of the task sample averages {Ȳt}. [sent-72, score-0.17]
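
A quick numerical check of the theorem, under the same assumptions as the sketch above: the rows of W = (I + (γ/T) Σ L)^{-1} should be non-negative and sum to one, so each Yt* is a convex combination of the sample averages. The random A and Σ below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6
A = np.abs(rng.normal(size=(T, T)))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)                        # A_tt = 0 without loss of generality
Sigma = np.diag(rng.uniform(0.1, 2.0, size=T))  # stands in for diag(sigma_t^2 / N_t)
L = np.diag(A.sum(axis=1)) - A
W = np.linalg.inv(np.eye(T) + (1.0 / T) * Sigma @ L)   # gamma = 1
assert np.all(W >= -1e-12)                      # non-negative weights
assert np.allclose(W.sum(axis=1), 1.0)          # each row sums to one: a convex combination
print(W.round(3))
```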

32 2 Optimal A for the Two Task Case In this section we analyze the T = 2 task case, with N1 and N2 samples for tasks 1 and 2 respectively. [sent-81, score-0.112]

33 Suppose {Y1i} are iid (independently and identically distributed) with finite mean µ1 and finite variance σ1², and {Y2i} are iid with finite mean µ2 = µ1 + ∆ and finite variance σ2². [sent-82, score-0.15]

34 Note that as a approaches 0 from above, the RHS of (6) approaches infinity, which means that a small amount of regularization can be helpful even when the difference between the task means ∆ is large. [sent-87, score-0.202]

35 Summarizing, if the two task means are close relative to each task’s sample variance, MTA will help. [sent-88, score-0.131]

36 The risk is the sum of the mean squared errors: MSE[Y1∗ ]+MSE[Y2∗ ], which is a convex, continuous, and differentiable function of a, and therefore the first derivative can be used to specify the optimal value a∗ , when all the other variables are fixed. [sent-89, score-0.137]

37 In the limit case, when the difference in the task means ∆ goes to zero (while the σt² stay constant), the optimal task-relatedness a* goes to infinity, and the weights in (4) on Ȳ1 and Ȳ2 become 1/2 each. [sent-95, score-0.151]

38 The first method is designed to minimize the approximate risk using a constant similarity matrix. [sent-98, score-0.117]

39 1 Constant MTA: Recalling that E[Ȳ Ȳᵀ] = µµᵀ + Σ, the risk of the estimator Ŷ = W Ȳ of the unknown parameter vector µ for the squared loss is the sum of the mean squared errors: R(µ, W Ȳ) = E[(W Ȳ − µ)ᵀ(W Ȳ − µ)] = tr(W Σ Wᵀ) + µᵀ(I − W)ᵀ(I − W)µ. [sent-103, score-0.181]

40 A natural way to generalize the two-task results to arbitrary T is to try to find a symmetric, non-negative matrix A such that the (convex, differentiable) risk R(µ, W Ȳ) is minimized for W = (I + (γ/T) Σ L)^{-1} (recall that L is the graph Laplacian of A). [sent-105, score-0.072]

41 The problem with this approach is two-fold: (i) the solution is not analytically tractable for T > 2 and (ii) an arbitrary A has T(T − 1) degrees of freedom, which is considerably more than the number of means we are trying to estimate in the first place. [sent-106, score-0.093]

42 To avoid these problems, we generalize the two-task results by constraining A to be a scaled constant matrix A = a11ᵀ, and find the optimal a* that minimizes the risk in (8). [sent-107, score-0.129]

43 we set γ to 1, and for analytic tractability we assume that all the tasks have the same variance, estimating Σ as (tr(Σ)/T) I. [sent-112, score-0.068]

44 Then it remains to solve: a* = arg min_a R(µ, (I + (1/T)(tr(Σ)/T) L(a11ᵀ))^{-1} Ȳ), which has the solution a* = 2 [ (1/(T(T−1))) Σ_{r=1}^T Σ_{s=1}^T (µr − µs)² ]^{-1}, which reduces to the optimal two-task MTA solution (7) when T = 2. [sent-113, score-0.068]

45 So, to estimate a* we use the sample means {ȳr}: â* = 2 [ (1/(T(T−1))) Σ_{r=1}^T Σ_{s=1}^T (ȳr − ȳs)² ]^{-1}. [sent-115, score-0.101]

46 Using this estimated optimal constant similarity and an estimated covariance matrix Σ̂ produces what we refer to as the constant MTA estimate Y* = (I + (1/T) Σ̂ L(â* 11ᵀ))^{-1} Ȳ. (9) [sent-116, score-0.184]

47 Note that we made the assumption that the entries of Σ were the same in order to be able to derive the constant similarity â*, but we do not need nor suggest that assumption on the Σ̂ used with â* in (9). [sent-117, score-0.063]
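
Putting (9) together, the sketch below estimates â* from the sample means and forms the constant MTA estimate. The diagonal form of Σ̂ and all names are assumptions made for illustration; the paper only requires some estimated covariance here.

```python
import numpy as np

def constant_mta(y_bar, sigma2, n):
    """Constant MTA (Eq. 9): A = a_hat * 1 1^T with a_hat estimated from the sample means.
    Sigma_hat is assumed diagonal, diag(sigma_t^2 / N_t)."""
    y_bar = np.asarray(y_bar, dtype=float)
    T = len(y_bar)
    pair_sq = (y_bar[:, None] - y_bar[None, :]) ** 2
    a_hat = 2.0 / (pair_sq.sum() / (T * (T - 1)))          # 2 / (mean pairwise squared difference)
    A = a_hat * (np.ones((T, T)) - np.eye(T))              # zero diagonal; the Laplacian is unaffected
    Sigma_hat = np.diag(np.asarray(sigma2, dtype=float) / np.asarray(n, dtype=float))
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(T) + (1.0 / T) * Sigma_hat @ L) @ y_bar   # gamma = 1

print(constant_mta([1.0, 1.2, 0.9], [0.5, 0.4, 0.6], [10, 15, 8]))
```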

48 4 Minimax MTA Bock’s James-Stein estimator is minimax in that it minimizes the worst-case loss, not necessarily the expected loss [10]. [sent-119, score-0.222]

49 In this section, we derive a minimax version of MTA, that prescribes less regularization than the constant MTA. [sent-121, score-0.189]

50 Formally, an estimator Y^M of µ is called minimax if it minimizes the maximum risk: inf_Ŷ sup_µ R(µ, Ŷ) = sup_µ R(µ, Y^M). [sent-122, score-0.205]

51 First, we will specify minimax MTA for the T = 2 case. [sent-123, score-0.154]

52 To find a minimax estimator Y^M it is sufficient to show that (i) Y^M is a Bayes estimator w.r.t. [sent-124, score-0.241]

53 the least favorable prior (LFP) and (ii) it has constant risk [10]. [sent-127, score-0.081]

54 To find a LFP, we first need to specify a constraint set for µt ; we use an interval: µt ∈ [bl , bu ], for all t, where bl ∈ R and bu ∈ R. [sent-128, score-0.11]

55 With this constraint set the minimax estimator is: Y^M = (I + (2/(T(bu − bl)²)) Σ L(11ᵀ))^{-1} Ȳ, (10) which reduces to (7) when T = 2. [sent-129, score-0.216]

56 This minimax analysis is only valid for the case when T = 2, but we found good practical results for larger T using (10) with the data-dependent interval b̂l = min_t ȳt and b̂u = max_t ȳt. [sent-130, score-0.299]
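
A sketch of (10) with this data-dependent interval: b̂l and b̂u are taken as the minimum and maximum of the sample means, and the same diagonal Σ assumption as in the earlier sketches is used; names are illustrative.

```python
import numpy as np

def minimax_mta(y_bar, sigma2, n):
    """Minimax MTA (Eq. 10) with the data-dependent interval [min_t y_bar_t, max_t y_bar_t]."""
    y_bar = np.asarray(y_bar, dtype=float)
    T = len(y_bar)
    b_l, b_u = y_bar.min(), y_bar.max()
    Sigma = np.diag(np.asarray(sigma2, dtype=float) / np.asarray(n, dtype=float))
    L = T * np.eye(T) - np.ones((T, T))        # graph Laplacian of the all-ones matrix 1 1^T
    coef = 2.0 / (T * (b_u - b_l) ** 2)        # prescribes less regularization than constant MTA
    return np.linalg.inv(np.eye(T) + coef * Sigma @ L) @ y_bar

print(minimax_mta([1.0, 1.2, 0.9], [0.5, 0.4, 0.6], [10, 15, 8]))
```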

57 Simulation parameters are given in the table in Figure 1, and were set so that the variances of the distribution of the true means were the same in both types of simulations. [sent-132, score-0.084]

58 We compared constant MTA and minimax MTA to single-task sample averages and to the JamesStein estimator given in (2). [sent-134, score-0.295]

59 We also compared to a randomized 5-fold 50/50 cross-validated (CV) version of constant MTA, minimax MTA, and the James-Stein estimator (which is simply a convex regularization towards the average of the sample means: λ ȳt + (1 − λ) ȳ_avg, where ȳ_avg denotes that average). [sent-135, score-0.263]

60–64 [Residue of the simulation-parameter table and Figure 1. Recoverable content: in the Gaussian simulations, µt and σt² are drawn randomly and yti ∼ N(µt, σt²); in the uniform simulations, µt ∼ U(−√3 σµ, √3 σµ), σt² is drawn uniformly, and yti ∼ U[µt − √3 σt, µt + √3 σt]. Figure 1 caption: average (over 10000 random draws) percent change in risk versus σµ² (the variance of the true means), with panels for T = 2 and larger T, comparing Single-Task, James-Stein, constant MTA, minimax MTA, and the CV versions of the latter three.]

65 MTA or λ for James-Stein that resulted in the lowest average left-out risk compared to the sample mean estimated with all Nt samples. [sent-175, score-0.111]

66 Even when cross-validating, an advantage of using the proposed constant MTA or minimax MTA is that these estimators provide a data-adaptive scale for γ: γ = 1 sets the regularization to a* for constant MTA or to the minimax coefficient 2/(T(bu − bl)²) from (10), respectively. [sent-181, score-0.219]

67 For T = 5, constant MTA dominates in the Gaussian case, but in the uniform case does worse than single-task when the means are far apart. [sent-185, score-0.082]

68 Note that for all T > 2, minimax MTA almost always outperforms James-Stein and always outperforms single-task, which is to be expected as it was designed conservatively. [sent-186, score-0.137]

69 Since both constant MTA and minimax MTA use a similarity matrix of all ones scaled by a constant, cross-validating over a set of possible γ may result in nearly identical performance, and this can be seen in the Figure (i. [sent-189, score-0.218]

70 To conclude, when the tasks are close to each other compared to their variances, constant MTA is the best estimator to use by a wide margin. [sent-192, score-0.113]

71 When the tasks are farther apart, minimax MTA will provide a win over both James-Stein and maximum likelihood. [sent-193, score-0.171]

72 The first application parallels the simulations, estimating expected values of sales of related products. [sent-195, score-0.077]

73 The second application uses MTA for multi-task kernel density estimation, highlighting the applicability of MTA to any algorithm that uses sample averages. [sent-196, score-0.084]

74 1 Application: Estimating Product Sales We consider two multi-task problems using sales data over a certain time period supplied by Artifact Puzzles, a company that sells jigsaw puzzles online. [sent-198, score-0.096]

75 The first problem estimates the impact of a particular puzzle on repeat business: “Estimate how much a random customer will spend on an order on average, if on their last order they purchased the tth puzzle, for each of T = 77 puzzles. [sent-200, score-0.193]

76 ” The samples were the amounts different customers had spent on orders after buying each of the t puzzles, and ranged from 480 down to 0 for customers that had not re-ordered. [sent-201, score-0.134]

77 The number of samples for each puzzle ranged from Nt = 8 to Nt = 348. [sent-202, score-0.098]

78 The second problem estimates the expected order size of a particular customer: “Estimate how much the tth customer will spend on an order on average, for each of the T = 477 customers that ordered at least twice during the data timeframe. [sent-203, score-0.183]

79 The number of samples for each customer ranged from Nt = 2 to Nt = 17. [sent-206, score-0.089]

80 As a metric to compare the estimates, we treat each task’s sample average computed from all of the samples as the ground truth, and compare to estimates computed from a uniformly randomly chosen 50% of the samples. [sent-208, score-0.068]
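
A sketch of this evaluation protocol; the data loading, the estimator interface (matching the illustrative constant_mta and minimax_mta sketches above), and all names are placeholders rather than the paper's code. It treats the full-sample average as ground truth, re-estimates from a uniformly random 50% of the samples, and reports the percent change in summed squared error relative to the single-task half-sample means.

```python
import numpy as np

def percent_change_in_risk(samples_per_task, estimator, seed=0):
    """samples_per_task: list of 1-D arrays, one per task (e.g. order amounts per puzzle).
    estimator: maps (y_bar, sigma2, n) to a vector of T estimates."""
    rng = np.random.default_rng(seed)
    truth, y_bar, sigma2, n = [], [], [], []
    for y in samples_per_task:
        y = np.asarray(y, dtype=float)
        truth.append(y.mean())                             # ground truth: mean of all samples
        half = rng.permutation(len(y))[: len(y) // 2]      # uniformly random 50% of the samples
        y_half = y[half]
        y_bar.append(y_half.mean())
        sigma2.append(y_half.var())                        # plug-in variance estimate
        n.append(len(y_half))
    truth, y_bar = np.array(truth), np.array(y_bar)
    est = estimator(y_bar, np.array(sigma2), np.array(n))
    risk_est = np.sum((est - truth) ** 2)
    risk_single = np.sum((y_bar - truth) ** 2)             # single-task: the half-sample means
    return 100.0 * (risk_est - risk_single) / risk_single  # negative means the estimator helps
```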

81 Table 2: Percent change in average risk (for puzzle and buyer data, lower is better), and mean reciprocal rank (for terrorist data, higher is better). [sent-211, score-0.198]

82 Recall that for standard single-task kernel density estimation (KDE) [18], a set of random samples xi ∈ Rd , i ∈ {1, . [sent-238, score-0.108]

83 , N } are assumed to be iid from an unknown distribution pX , and the problem is to estimate the density for a query sample, z ∈ Rd . [sent-241, score-0.087]

84 Given a kernel function K(xi, xj), the un-normalized single-task KDE estimate is p̂(z) = (1/N) Σ_{i=1}^N K(xi, z), which is just a sample average. [sent-242, score-0.07]

85 When multiple kernel densities {p̂t(z)}, t = 1, …, T, are estimated for the same domain, we replace the multiple sample averages with MTA estimates, which we refer to as multi-task kernel density estimation (MT-KDE). [sent-243, score-0.206]
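
A sketch of MT-KDE under these definitions: at each query z, the T per-task kernel averages are treated as T task means and smoothed jointly with the MTA closed form. The Gaussian kernel, the plug-in variance of each kernel average, and the all-ones similarity are assumptions made for illustration (the paper also uses constant, minimax, and expert-provided A).

```python
import numpy as np

def gaussian_kernel(X, z, bandwidth=1.0):
    """Isotropic Gaussian kernel values K(x_i, z) for all rows x_i of X."""
    d2 = np.sum((np.asarray(X, dtype=float) - np.asarray(z, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mt_kde(task_samples, z, bandwidth=1.0, gamma=1.0):
    """Un-normalized MT-KDE at query z: the T per-task kernel averages (single-task KDE
    values) are smoothed jointly with the MTA closed form of Eq. (3)."""
    T = len(task_samples)
    y_bar, var_of_avg = [], []
    for X in task_samples:
        k = gaussian_kernel(X, z, bandwidth)
        y_bar.append(k.mean())                 # single-task KDE value: a sample average
        var_of_avg.append(k.var() / len(k))    # plug-in variance of that average
    Sigma = np.diag(var_of_avg)
    A = np.ones((T, T)) - np.eye(T)            # illustrative all-ones similarity
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L) @ np.array(y_bar)
```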

86 We compared KDE and MT-KDE on a problem of estimating the probability of terrorist events in Jerusalem using the Naval Research Laboratory’s Adversarial Modeling and Exploitation Database (NRL AMX-DB). [sent-244, score-0.117]

87 The NRL AMX-DB combined multiple open primary sources to create a rich representation of the geospatial features of urban Jerusalem and the surrounding region, and accurately geocoded locations of terrorist attacks. [sent-245, score-0.087]

88 In related work, [19] also used a Gaussian kernel density estimate to assess risk from past terrorism events. [sent-247, score-0.211]

89 The goal in this application is to estimate a risk density for 40,000 geographical locations (samples) in a 20km × 20km area of interest in Jerusalem. [sent-248, score-0.156]

90 Locations of past events are known for 17 suicide bombings. [sent-251, score-0.064]

91 All the events are attributed to one of seven terrorist groups. [sent-252, score-0.111]

92 The density estimates for these seven groups are expected to be related, and are treated as T = 7 tasks. [sent-253, score-0.088]

93 In addition to constant A and minimax A, we also obtained a side-information A from terrorism expert Mohammed M. [sent-255, score-0.259]

94 Hafez of the Naval Postgraduate School; he assessed the similarity between the seven groups during the Second Intifada (the time period of the data), providing similarities between 0 and 1. [sent-256, score-0.064]

95 After computing the KDE and MT-KDE density estimates using all but one of the training examples {xti } for each task, we sort the resulting 40,000 estimated probabilities for each of the seven tasks, and extract the rank of the left-out known event. [sent-258, score-0.107]

96 7 Summary Though perhaps unintuitive, we showed that both in theory and in practice, estimating multiple unrelated means using an MTL approach can improve the overall risk, even more so than James-Stein estimation. [sent-262, score-0.104]

97 Averaging is common, and MTA has potentially broad applicability as a subcomponent in many algorithms, such as k-means clustering, kernel density estimation, or non-local means denoising. [sent-263, score-0.117]

98 Stein, “Inadmissibility of the usual estimator for the mean of a multivariate distribution,” Proc. [sent-267, score-0.068]

99 Yeung, “A convex formulation for learning task relationships,” in Proc. [sent-352, score-0.069]

100 Hoyle, “Spatial forecast methods for terrorist events in urban environments,” Lecture Notes in Computer Science, vol. [sent-372, score-0.103]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mta', 0.906), ('minimax', 0.137), ('stein', 0.116), ('yti', 0.106), ('nt', 0.098), ('cv', 0.086), ('yt', 0.081), ('tth', 0.081), ('terrorism', 0.071), ('mse', 0.061), ('averages', 0.057), ('means', 0.055), ('risk', 0.054), ('task', 0.054), ('terrorist', 0.054), ('kde', 0.054), ('puzzles', 0.053), ('estimator', 0.052), ('sales', 0.043), ('puzzle', 0.041), ('nrl', 0.04), ('ntt', 0.04), ('density', 0.038), ('simulations', 0.037), ('similarity', 0.036), ('suicide', 0.035), ('bock', 0.035), ('tasks', 0.034), ('variance', 0.034), ('estimating', 0.034), ('james', 0.033), ('mrr', 0.033), ('ranged', 0.033), ('bu', 0.033), ('customer', 0.032), ('customers', 0.031), ('mtl', 0.031), ('estimators', 0.03), ('events', 0.029), ('variances', 0.029), ('seven', 0.028), ('constant', 0.027), ('bl', 0.027), ('jerusalem', 0.027), ('ars', 0.027), ('geographical', 0.027), ('gershgorin', 0.027), ('hafez', 0.027), ('mohammed', 0.027), ('paradox', 0.027), ('pontil', 0.026), ('averaging', 0.026), ('relatedness', 0.025), ('regularization', 0.025), ('iid', 0.025), ('estimate', 0.024), ('expert', 0.024), ('samples', 0.024), ('kernel', 0.024), ('violent', 0.024), ('casella', 0.024), ('lfp', 0.024), ('laplacian', 0.023), ('estimation', 0.022), ('sample', 0.022), ('estimates', 0.022), ('squared', 0.021), ('urban', 0.02), ('israel', 0.02), ('estimated', 0.019), ('yeung', 0.019), ('matrix', 0.018), ('change', 0.017), ('spend', 0.017), ('specify', 0.017), ('loss', 0.017), ('minimizes', 0.016), ('naval', 0.016), ('micchelli', 0.016), ('mean', 0.016), ('daily', 0.016), ('reciprocal', 0.016), ('bregman', 0.016), ('unrelated', 0.015), ('nity', 0.015), ('percent', 0.015), ('convex', 0.015), ('tt', 0.015), ('amounts', 0.015), ('differentiable', 0.015), ('optimal', 0.014), ('goes', 0.014), ('circle', 0.014), ('argyriou', 0.014), ('trying', 0.014), ('gaussian', 0.013), ('amount', 0.013), ('locations', 0.013), ('database', 0.013), ('eigenvalue', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 222 nips-2012-Multi-Task Averaging

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

2 0.081828415 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

3 0.077066906 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

4 0.070228435 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

5 0.067373663 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

6 0.06130999 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

7 0.057690896 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

8 0.055995669 187 nips-2012-Learning curves for multi-task Gaussian process regression

9 0.055417135 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

10 0.055403277 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

11 0.045359921 293 nips-2012-Relax and Randomize : From Value to Algorithms

12 0.045205522 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

13 0.043456428 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

14 0.043361541 16 nips-2012-A Polynomial-time Form of Robust Regression

15 0.040995643 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

16 0.039210264 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

17 0.038378365 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

18 0.033927754 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

19 0.033605754 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

20 0.032517489 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.098), (1, 0.0), (2, 0.053), (3, 0.033), (4, 0.034), (5, -0.022), (6, -0.011), (7, 0.036), (8, -0.026), (9, -0.033), (10, 0.034), (11, -0.035), (12, -0.028), (13, -0.05), (14, 0.014), (15, -0.037), (16, 0.064), (17, 0.015), (18, -0.018), (19, -0.028), (20, 0.031), (21, 0.01), (22, 0.096), (23, 0.04), (24, 0.01), (25, -0.066), (26, 0.078), (27, -0.009), (28, 0.012), (29, -0.043), (30, 0.048), (31, 0.002), (32, -0.045), (33, 0.029), (34, 0.04), (35, 0.001), (36, 0.041), (37, 0.025), (38, -0.024), (39, -0.078), (40, 0.007), (41, -0.012), (42, -0.044), (43, 0.075), (44, 0.003), (45, 0.018), (46, 0.025), (47, 0.034), (48, -0.028), (49, -0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88582689 222 nips-2012-Multi-Task Averaging

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

2 0.59311545 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

3 0.57007998 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

4 0.56157035 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

5 0.54050976 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

6 0.52359837 95 nips-2012-Density-Difference Estimation

7 0.50789595 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

8 0.48859543 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

9 0.48388168 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

10 0.44914746 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

11 0.4446319 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

12 0.43347859 32 nips-2012-Active Comparison of Prediction Models

13 0.43042028 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

14 0.42745635 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

15 0.41766629 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

16 0.41280046 225 nips-2012-Multi-task Vector Field Learning

17 0.41220355 338 nips-2012-The Perturbed Variation

18 0.40965089 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

19 0.40895 254 nips-2012-On the Sample Complexity of Robust PCA

20 0.40722302 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (11, 0.016), (16, 0.256), (21, 0.036), (38, 0.114), (39, 0.012), (42, 0.038), (53, 0.013), (54, 0.017), (55, 0.046), (74, 0.043), (76, 0.159), (80, 0.065), (92, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.75839096 329 nips-2012-Super-Bit Locality-Sensitive Hashing

Author: Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian

Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, ⇡/2]. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments. 1

same-paper 2 0.75250763 222 nips-2012-Multi-Task Averaging

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

3 0.74969071 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

Author: Jeremy Weiss, Sriraam Natarajan, David Page

Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.

4 0.71664369 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi

Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as √ O(N −1 + (N/m)−2 ). Whenever m ≤ N , this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on largescale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 108 samples and d ≥ 700, 000 dimensions. 1

5 0.6437726 215 nips-2012-Minimizing Uncertainty in Pipelines

Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi

Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1

6 0.64260155 188 nips-2012-Learning from Distributions via Support Measure Machines

7 0.6413967 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.64072436 298 nips-2012-Scalable Inference of Overlapping Communities

9 0.63953209 294 nips-2012-Repulsive Mixtures

10 0.63938636 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

11 0.63935751 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

12 0.63925195 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

13 0.63888693 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

14 0.63871902 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

15 0.63867277 95 nips-2012-Density-Difference Estimation

16 0.63778108 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

17 0.63766068 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.63722825 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

19 0.63644654 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

20 0.63630182 75 nips-2012-Collaborative Ranking With 17 Parameters