nips nips2013 nips2013-244 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. [sent-13, score-0.325]
2 Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. [sent-14, score-0.203]
3 Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. [sent-15, score-0.28]
4 We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. [sent-16, score-0.475]
5 1 Introduction Multi-task learning (MTL) has been studied for learning multiple related tasks simultaneously. [sent-18, score-0.049]
6 A key assumption behind MTL is that there exists a common shared representation across the tasks. [sent-19, score-0.083]
7 Many MTL algorithms attempt to find such a common representation and at the same time to learn multiple tasks under that shared representation. [sent-20, score-0.132]
8 For example, we can enforce all the tasks to share a common feature subspace or a common set of variables by using an algorithm introduced in [1, 2] that alternately optimizes the shared representation and the task-wise solutions. [sent-21, score-0.281]
9 Although the standard MTL formulation can handle only a finite number of tasks, it is sometimes more natural to consider infinitely many tasks parameterized by a continuous parameter, e. [sent-22, score-0.183]
10 g., in learning under non-stationarity [3] where learning problems change over continuous time, cost-sensitive learning [4] where loss functions are asymmetric with a continuous cost balance, and quantile regression [5] where the quantile is a continuous variable between zero and one. [sent-24, score-1.138]
11 In order to handle these infinitely many parametrized tasks, we propose in this paper an extended formulation of MTL called parametric-task learning (PTL). [sent-25, score-0.073]
12 The key contribution of this paper is to show that, for a certain class of PTL problems, the optimal common representation shared across infinitely many parameterized tasks can be obtainable. [sent-26, score-0.178]
13 Specifically, we develop an alternating minimization algorithm à la [1, 2] for finding the entire continuum of solutions and the common feature subspace (or the common set of variables) among infinitely many parameterized tasks. [sent-27, score-0.299]
14 Our algorithm exploits the fact that, for those classes of PTL problems, the path of task-wise solutions is piecewise-linear in the task parameter. [sent-28, score-0.131]
15 We use the parametric programming technique [6, 7, 8, 9] for computing those piecewise linear solutions. [sent-29, score-0.162]
16 Let {(xi , yi )}i∈Nn be the set of n training instances, where xi ∈ X ⊆ Rd is the input and yi ∈ Y is the output. [sent-36, score-0.52]
17 We define wi(t) ∈ [0, 1], t ∈ NT as the weight of the ith instance for the tth task, where T is the number of tasks. [sent-37, score-0.277]
18 It was shown [1] that the problem (1) is equivalent to min_{ {β̃t}t∈NT } ∑_{t∈NT} ∑_{i∈Nn} wi(t) ℓt(r(yi, β̃t⊤ x̃i)) + (γ/T) ||B||tr², where B is the d × T matrix whose tth column is given by the vector βt, and ||B||tr := tr((BB⊤)^{1/2}) is the trace norm of B. [sent-47, score-0.385]
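To make the objective above concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code): it evaluates the trace-norm-regularized MTL objective for a given matrix B of task-wise solutions, with the squared loss and the random toy data being assumptions made purely for the example.

```python
import numpy as np

def trace_norm(B):
    # ||B||_tr = tr((B B^T)^{1/2}) = sum of the singular values of B
    return np.linalg.svd(B, compute_uv=False).sum()

def mtl_objective(B, X, Y, W, gamma):
    """Weighted MTL objective with squared loss (an illustrative choice of loss).

    B: (d, T) matrix whose t-th column is beta_t
    X: (n, d) inputs, Y: (n,) outputs
    W: (n, T) instance weights w_i(t)
    """
    residuals = Y[:, None] - X @ B           # (n, T): y_i - beta_t^T x_i
    loss = np.sum(W * residuals ** 2)        # sum over tasks t and instances i
    T = B.shape[1]
    return loss + (gamma / T) * trace_norm(B) ** 2

# toy usage with random data
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 5)), rng.normal(size=20)
B, W = rng.normal(size=(5, 3)), rng.uniform(size=(20, 3))
print(mtl_objective(B, X, Y, W, gamma=1.0))
```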
19 As shown in [10], the trace norm is the convex upper envelope of the rank of B, and (1) can be interpreted as the problem of finding a common feature subspace across T tasks. [sent-48, score-0.143]
20 This problem is often referred to as multi-task feature learning. [sent-49, score-0.054]
21 If the matrix D is restricted to be diagonal, the formulation (1) is reduced to multi-task variable selection [11, 12]. [sent-50, score-0.053]
22 This algorithm alternately optimizes the task-wise solutions {β̃t}t∈NT and the common representation matrix D; in Step 1, each β̃t can be optimized independently. [sent-52, score-0.094]
23 3 Parametric-Task Learning (PTL) We consider the case where we have infinitely many tasks parametrized by a single continuous parameter. [sent-58, score-0.104]
24 Let θ ∈ [θL , θU ] be a continuous task parameter. [sent-59, score-0.066]
25 Instead of the set of weights wi (t), t ∈ NT , we consider a weight function wi : [θL , θU ] → [0, 1] for each instance i ∈ Nn . [sent-60, score-0.348]
26 In PTL, we learn a parameter vector β̃θ ∈ R^{d+1} as a continuous function of the task parameter θ: min_{ {β̃θ}θ∈[θL,θU], D∈S^d_{++}, tr(D)≤1 } ∫_{θL}^{θU} ∑_{i∈Nn} wi(θ) ℓθ(r(yi, β̃θ⊤ x̃i)) dθ + γ ∫_{θL}^{θU} βθ⊤ D^{-1} βθ dθ, (2) where the loss function ℓθ may itself depend on θ. [sent-61, score-0.364]
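For intuition about objective (2), the two integrals over θ can be approximated on a grid; the paper instead handles them exactly via parametric programming, so the sketch below is only an illustration, and the squared loss, the caller-supplied weight_fn, and the trapezoidal rule are assumptions made here rather than details taken from the paper.

```python
import numpy as np

def trapezoid(vals, xs):
    # simple trapezoidal rule for the integrals over theta
    vals, xs = np.asarray(vals, dtype=float), np.asarray(xs, dtype=float)
    return np.sum((vals[1:] + vals[:-1]) * np.diff(xs)) / 2.0

def ptl_objective_grid(betas, thetas, X, Y, weight_fn, D, gamma):
    """Grid approximation of the PTL objective (2), with squared loss.

    betas: (K, d) array, the solution beta_theta at each grid point thetas[k]
    weight_fn(i, theta): instance weight w_i(theta); D: (d, d) shared matrix
    """
    n = X.shape[0]
    D_inv = np.linalg.inv(D)
    loss_vals, reg_vals = [], []
    for beta, theta in zip(betas, thetas):
        w = np.array([weight_fn(i, theta) for i in range(n)])
        r = Y - X @ beta
        loss_vals.append(np.sum(w * r ** 2))      # sum_i w_i(theta) * loss
        reg_vals.append(beta @ D_inv @ beta)      # beta_theta^T D^{-1} beta_theta
    return trapezoid(loss_vals, thetas) + gamma * trapezoid(reg_vals, thetas)
```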
27 In the standard MTL setting, the weight wi(t) takes 1 only if the ith instance is used in the tth task. [sent-64, score-0.08]
28 We slightly generalize the setup so that each instance can be used in multiple tasks with different weights. [sent-65, score-0.049]
29 Algorithm 1 ALTERNATING MINIMIZATION ALGORITHM FOR MTL [1]
1: Input: Data {(xi, yi)}i∈Nn and weights {wi(t)}i∈Nn,t∈NT;
2: Initialize: D ← Id/d (Id is the d × d identity matrix)
3: while convergence condition is not true do
4:   Step 1: For t = 1, . . . , T do
       β̃t ← arg min_{β̃} ∑_{i∈Nn} wi(t) ℓt(r(yi, β̃⊤ x̃i)) + (γ/T) β̃⊤ D^{-1} β̃
5:   Step 2: D ← C^{1/2} / tr(C^{1/2}) = arg min_{D∈S^d_{++}, tr(D)≤1} ∑_{t∈NT} βt⊤ D^{-1} βt,
       where C := BB⊤ and its (j, k)th element is defined as Cj,k := ∑_{t∈NT} βtj βtk
6: end while
7: Output: {β̃t}t∈NT and D; [sent-66, score-0.203] [sent-69, score-0.349]
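A compact sketch of Algorithm 1 is given below. It is illustrative rather than the authors' implementation: the squared loss is assumed so that Step 1 has a closed form, a small eps keeps D invertible, and a fixed iteration count stands in for the convergence test.

```python
import numpy as np

def mtl_feature_learning(X, Y, W, gamma, n_iter=50, eps=1e-6):
    """Alternating minimization in the style of Algorithm 1 (squared-loss version).

    X: (n, d) inputs, Y: (n,) outputs, W: (n, T) weights w_i(t).
    Returns B (d, T) with the task-wise solutions as columns, and D (d, d).
    """
    n, d = X.shape
    T = W.shape[1]
    D = np.eye(d) / d                               # Initialize: D <- I_d / d
    B = np.zeros((d, T))
    for _ in range(n_iter):
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        # Step 1: each task is a weighted generalized ridge problem (closed form)
        for t in range(T):
            Xw = X * W[:, t:t + 1]                  # rows of X scaled by w_i(t)
            A = X.T @ Xw + (gamma / T) * D_inv
            B[:, t] = np.linalg.solve(A, Xw.T @ Y)
        # Step 2: D <- C^{1/2} / tr(C^{1/2}) with C = B B^T
        U, s, _ = np.linalg.svd(B, full_matrices=False)
        C_sqrt = (U * s) @ U.T                      # (B B^T)^{1/2}
        D = C_sqrt / (np.trace(C_sqrt) + eps)
    return B, D
```

With the squared loss, Step 1 is an independent weighted ridge regression per task, and Step 2 reduces to the matrix square root of BB⊤ normalized by its trace.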
31 However, at first glance, the PTL optimization problem (2) seems computationally intractable since we need to find infinitely many task-wise solutions as well as the common feature subspace (or the common set of variables if D is restricted to be diagonal) shared by infinitely many tasks. [sent-71, score-0.246]
32 Our key finding is that, for a certain class of PTL problems, when D is fixed, the optimal path of the task-wise solutions β̃θ is shown to be piecewise-linear in θ. [sent-72, score-0.1]
33 By exploiting this piecewise-linearity, we can efficiently handle infinitely many parameterized tasks, and the optimal solutions of this class of PTL problems can be computed exactly. [sent-73, score-0.138]
34 In the following theorem, we prove that the task-wise solution β̃θ is piecewise-linear in θ if the weight functions and the loss function satisfy certain conditions. [sent-74, score-0.149]
35 In the proof in Appendix A, we show that, if the weight functions and the loss function satisfy the conditions (a) or (b), the problem (3) is reformulated as a parametric quadratic program (parametric QP), where the parameter θ only appears in the linear term of the objective function. [sent-76, score-0.198]
36 As shown, for example, in [9], the optimal solution path of this class of parametric QP has a piecewise-linear form. [sent-77, score-0.149]
37 If β̃θ is piecewise-linear in θ, we can exactly compute the entire solution path by using parametric programming. [sent-78, score-0.149]
38 We start from the solution at θ = θL , and follow the path of the optimal solutions while θ is continuously increased. [sent-80, score-0.119]
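In practice such a path can be stored as an increasing sequence of breakpoints in θ together with the solutions at those breakpoints, with linear interpolation in between. The sketch below shows only that representation (the class name and breakpoint values are placeholders for illustration), not the parametric-programming routine that actually produces the breakpoints.

```python
import numpy as np

class PiecewiseLinearPath:
    """Solution path beta_theta stored at breakpoints, linear in between."""

    def __init__(self, breakpoints, solutions):
        self.t = np.asarray(breakpoints, dtype=float)   # (K+1,) increasing theta values
        self.b = np.asarray(solutions, dtype=float)     # (K+1, d) beta at each breakpoint

    def __call__(self, theta):
        k = np.clip(np.searchsorted(self.t, theta) - 1, 0, len(self.t) - 2)
        lam = (theta - self.t[k]) / (self.t[k + 1] - self.t[k])
        return (1 - lam) * self.b[k] + lam * self.b[k + 1]

# placeholder path with three breakpoints
path = PiecewiseLinearPath([0.0, 0.5, 1.0], [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(path(0.25))   # -> [0.75 0.25]
```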
39 Note that, by exploiting the piecewise linearity of βθ , we can compute the integral at Step 2 (Eq. [sent-83, score-0.075]
40 , λd) where λj = √(∫_{θL}^{θU} β_{θ,j}² dθ) / ∑_{j′∈Nd} √(∫_{θL}^{θU} β_{θ,j′}² dθ) for all j ∈ Nd, which can also be computed efficiently by exploiting the piecewise linearity of βθ. [sent-88, score-0.075]
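Because β_{θ,j} is linear in θ on each segment of the path, β_{θ,j}² is quadratic there, so each integral has a closed form that can be summed segment by segment. The sketch below follows the reconstruction written above (including its square-root normalization, which is part of that reconstruction rather than a quotation of the paper) and should be read as an illustration.

```python
import numpy as np

def integrate_beta_squared(breakpoints, solutions):
    """Exact per-dimension integral of beta_{theta,j}^2 over a piecewise-linear path.

    breakpoints: (K+1,) increasing theta values; solutions: (K+1, d) beta at breakpoints.
    On each segment beta_j is linear, so beta_j^2 is quadratic and Simpson's rule
    (b - a)/6 * (q(a) + 4 q((a+b)/2) + q(b)) integrates it exactly.
    """
    t = np.asarray(breakpoints, dtype=float)
    b = np.asarray(solutions, dtype=float)
    dt = np.diff(t)[:, None]                      # (K, 1) segment lengths
    left, right = b[:-1] ** 2, b[1:] ** 2
    mid = ((b[:-1] + b[1:]) / 2.0) ** 2
    return np.sum(dt / 6.0 * (left + 4.0 * mid + right), axis=0)   # (d,)

def diagonal_D_update(breakpoints, solutions):
    # lambda_j proportional to the square root of the integral, normalized to sum to one
    lam = np.sqrt(integrate_beta_squared(breakpoints, solutions))
    return np.diag(lam / lam.sum())
```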
41 Binary Classification Under Non-Stationarity Suppose that we observe n training instances sequentially, and denote them as {(xi , yi , τi )}i∈Nn , where xi ∈ Rd , yi ∈ {−1, 1}, and τi is the time when the ith instance is observed. [sent-90, score-0.561]
42 Under non-stationarity, if we are requested to learn a classifier to predict the output for a test input x observed at time τ , the training instances observed around time τ should have more influence on the classifier than others. [sent-95, score-0.118]
43 Let wi (τ ) denote the weight of the ith instance when training a classifier for a test point at time τ . [sent-96, score-0.282]
44 We can for example use the following triangular weight function (see Figure 1): wi(τ) = 1 + s^{-1}(τi − τ) if τ − s ≤ τi < τ; 1 − s^{-1}(τi − τ) if τ ≤ τi < τ + s; and 0 otherwise, (6) where s > 0 determines the width of the triangular time windows. [sent-97, score-0.269]
45 The problem of training a classifier for time τ is then formulated as min_{β̃} ∑_{i∈Nn} wi(τ) max(0, 1 − yi β̃⊤ x̃i) + γ||β̃||₂², where we used the hinge loss. [sent-98, score-0.532]
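For one fixed τ this is simply a weighted linear SVM. The sketch below pairs the triangular weights of (6) with a plain subgradient descent on the weighted hinge loss; the optimizer, its step size, and the iteration count are illustrative choices rather than the solver used in the paper, and both function names are introduced only for this example.

```python
import numpy as np

def triangular_weights(tau_i, tau, s):
    """Weight function (6): a triangle of half-width s centred at tau."""
    d = np.asarray(tau_i, dtype=float) - tau
    w = np.where(d < 0, 1.0 + d / s, 1.0 - d / s)
    return np.clip(w, 0.0, 1.0) * (np.abs(d) < s)

def weighted_hinge_svm(X, y, w, gamma, n_iter=2000, lr=1e-3):
    """min_beta  sum_i w_i * max(0, 1 - y_i beta^T x_i) + gamma * ||beta||^2."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ beta)
        active = (margins < 1).astype(float) * w      # weights of margin-violating instances
        grad = -(X * (active * y)[:, None]).sum(axis=0) + 2.0 * gamma * beta
        beta -= lr * grad                             # subgradient step
    return beta
```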
46 3 In regularization path-following, one computes the optimal solution path w.r.t. the regularization parameter, whereas we compute the optimal solution path w.r.t. the task parameter θ. [sent-99, score-0.092] [sent-102, score-0.092]
48 4 Figure 1: Examples of weight functions {wi (τ )}i∈Nn in non-stationary time-series learning. [sent-106, score-0.083]
49 Given training instances (xi, yi) at time τi for i = 1, . [sent-107, score-0.254]
50 , n under non-stationary condition, it is reasonable to use the weights {wi (τ )}i∈Nn as shown here when we learn a classifier to predict the output of a test input at time τ . [sent-110, score-0.048]
51 If we have the belief that a set of classifiers for different times should have some common structure, we can apply our PTL approach to this problem. [sent-111, score-0.04]
52 If we consider a time interval τ ∈ [τL, τU], the parametric-task feature learning problem is formulated as min_{ {β̃τ}τ∈[τL,τU], D } ∫_{τL}^{τU} ∑_{i∈Nn} wi(τ) max(0, 1 − yi β̃τ⊤ x̃i) dτ + γ ∫_{τL}^{τU} βτ⊤ D^{-1} βτ dτ. [sent-112, score-0.553]
53 When the costs of false positives and false negatives are unequal, or when the numbers of positive and negative training instances are highly imbalanced, it is effective to use the cost-sensitive learning approach [16]. [sent-115, score-0.168]
54 Suppose that we are given a set of training instances {(xi , yi )}i∈Nn with xi ∈ Rd and yi ∈ {−1, 1}. [sent-116, score-0.532]
55 When the exact false positive and false negative costs in the test scenario are unknown [4], it is often desirable to train several cost-sensitive SVMs with different values of θ. [sent-118, score-0.121]
56 If we consider an interval θ ∈ [θL, θU], 0 < θL < θU < 1, the parametric-task feature learning problem is formulated as min_{ {β̃θ}θ∈[θL,θU], D } ∫_{θL}^{θU} ∑_{i∈Nn} wi(θ) max(0, 1 − yi β̃θ⊤ x̃i) dθ + γ ∫_{θL}^{θU} βθ⊤ D^{-1} βθ dθ. [sent-120, score-0.553]
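The weight function wi(θ) used for cost-sensitive SVMs is not reproduced in the extracted text above, so the sketch below adopts one common convention, giving positive instances weight 1 − θ and negative instances weight θ; this is an assumption for illustration and should be checked against the paper's own definition.

```python
import numpy as np

def cost_sensitive_weights(y, theta):
    """One common cost-sensitive weighting: w_i = 1 - theta for y_i = +1, theta for y_i = -1."""
    y = np.asarray(y)
    return np.where(y > 0, 1.0 - theta, theta)

# e.g. theta = 0.9 makes errors on negative instances (false positives) more costly
print(cost_sensitive_weights([+1, -1, -1], 0.9))   # -> [0.1 0.9 0.9]
```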
57 Figure 2 shows an example of joint cost-sensitive learning applied to a toy 2D binary classification problem. [sent-122, score-0.078]
58 Jointly estimating multiple conditional quantile functions is often useful for exploring the stochastic relationship between X and Y (see Section 5 for an example of joint quantile regression problems). [sent-124, score-1.091]
59 Linear quantile regression along with L2 regularization [20] at order τ ∈ (0, 1) is formulated as min_{β̃} ∑_{i∈Nn} ρτ(yi − β̃⊤ x̃i) + γ||β̃||₂², where ρτ(r) := (1 − τ)|r| if r ≤ 0 and τ|r| if r > 0. [sent-125, score-0.693]
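A minimal sketch of the check (pinball) loss ρτ and of an L2-regularized linear quantile regression fitted by subgradient descent is shown below; the optimizer and its settings are illustrative stand-ins, not the procedure of [20] or of this paper.

```python
import numpy as np

def pinball_loss(r, tau):
    """rho_tau(r) = (1 - tau)|r| if r <= 0, tau*|r| if r > 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r > 0, tau * r, (tau - 1.0) * r)

def quantile_regression(X, y, tau, gamma, n_iter=5000, lr=1e-3):
    """min_beta  sum_i rho_tau(y_i - beta^T x_i) + gamma * ||beta||^2 (subgradient descent)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        r = y - X @ beta
        # subgradient: d/d beta sum_i rho_tau(r_i) = -X^T (tau - 1[r <= 0])
        g = -(tau - (r <= 0).astype(float))
        grad = X.T @ g + 2.0 * gamma * beta
        beta -= lr * grad
    return beta
```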
60 (a) The left plot shows the results obtained by independently training each cost-sensitive SVM. [sent-133, score-0.057]
61 (b) The right plot shows the results obtained by jointly training infinitely many cost-sensitive SVMs for the entire continuum of θ ∈ [0. [sent-134, score-0.083]
62 95] using the methodology we present in this paper (both are trained with the same regularization parameter γ). [sent-136, score-0.055]
63 We assume that our data generating mechanism produces the training set {(xi , yi , τi )}i∈Nn with n = 100 as follows. [sent-143, score-0.217]
64 , (n − 1)·2π/n}, the output yi is first determined as yi = 1 if i is odd, while yi = −1 if i is even. [sent-147, score-0.552]
65 Then, xi ∈ Rd is generated as xi1 ∼ N(yi cos τi, 1²), xi2 ∼ N(yi sin τi, 1²), xij ∼ N(0, 1²), ∀j ∈ {3, . [sent-148, score-0.094]
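This data-generating mechanism is easy to reproduce. The sketch below follows the description above (observation times on a uniform grid over [0, 2π), labels alternating with i, class-dependent means rotating with τi, unit variance); the function name and random seed are arbitrary choices for illustration.

```python
import numpy as np

def generate_nonstationary_data(n=100, d=10, seed=0):
    """Rotating two-class Gaussian data; only the first two dimensions are informative."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    tau = (i - 1) * 2.0 * np.pi / n                 # observation times on a grid in [0, 2*pi)
    y = np.where(i % 2 == 1, 1, -1)                 # odd i -> +1, even i -> -1
    X = rng.normal(size=(n, d))                     # noise dimensions ~ N(0, 1)
    X[:, 0] = rng.normal(y * np.cos(tau), 1.0)      # informative dimension 1
    X[:, 1] = rng.normal(y * np.sin(tau), 1.0)      # informative dimension 2
    return X, y, tau
```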
66 Namely, only the first two dimensions of x differ between the two classes, and the remaining d − 2 dimensions are considered as noise. [sent-152, score-0.066]
67 In addition, according to the value of τi , the means of the class-wise distributions in the first two dimensions gradually change. [sent-153, score-0.064]
68 Here, we applied our PT feature learning approach with triangular time windows in (6) with s = 0. [sent-157, score-0.09]
69 Figure 4 shows the mis-classification rate of PT feature learning (PTFL) and ordinary independent learning (IND) on a similarly generated test sample with size 1000. [sent-159, score-0.103]
70 When the input dimension d = 2, there is no advantage for learning common features since these two input dimensions are important for classification. [sent-160, score-0.123]
71 On the other hand, as d increases, PT feature learning becomes more and more advantageous. [sent-161, score-0.054]
72 6 Figure 3: The first 2 input dimensions of artificial example at τ = 0, 0. [sent-163, score-0.058]
73 The class-wise distributions in these two dimensions gradually change with τ ∈ [0, 2π]. [sent-166, score-0.092]
74 The red symbols indicate the results of our PT feature learning (PTFL) whereas the blue symbols indicate ordinary independent learning (IND). [sent-185, score-0.126]
75 Joint Cost-Sensitive SVM Learning on Benchmark Datasets Here, we report the experimental results on joint cost-sensitive SVM learning discussed in Section 4. [sent-189, score-0.058]
76 In PTFL and PTVS, we learned common feature subspaces and common sets of variables, respectively, shared across the continuum of cost-sensitive SVMs for θ ∈ [0. [sent-191, score-0.227]
77 Joint Quantile Regression Finally, we applied PT feature learning to joint quantile regression problems. [sent-210, score-0.607]
78 Given a training set {(xi, yi)}i∈Nn, we first estimated the conditional mean function E[Y|X = x] by least-squares regression, and computed the residuals ri := yi − Ê[Y|X = xi], where Ê is the estimated conditional mean function. [sent-212, score-0.626]
79 Then, we applied PT feature learning to {(xi, ri)}i∈Nn, and estimated the conditional τth quantile function as F̂_{Y|X=x}^{-1}(τ) := Ê[Y|X = x] + f̂res(x|τ), where f̂res(·|τ) is the estimated τth quantile regression fitted to the residuals. [sent-213, score-0.213]
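A sketch of this two-step construction is shown below: the conditional mean is fitted by ordinary least squares and the residual quantile function is supplied by the caller (for instance the illustrative quantile_regression sketch given earlier, or, in the homoscedastic case, simply the empirical quantile of the residuals); all names here are introduced for illustration only.

```python
import numpy as np

def two_step_conditional_quantile(X, y, tau, fit_quantile_regression):
    """Conditional tau-th quantile via mean regression plus a residual quantile fit.

    fit_quantile_regression(X, r, tau) must return a callable f_res such that
    f_res(Xnew) estimates the tau-th conditional quantile of the residuals
    (a caller-supplied stand-in for the PT feature learning fit used in the paper).
    """
    Xb = np.column_stack([X, np.ones(len(X))])            # add an intercept column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)         # least-squares conditional mean
    residuals = y - Xb @ coef
    f_res = fit_quantile_regression(X, residuals, tau)
    def quantile_fn(Xnew):
        Xnewb = np.column_stack([Xnew, np.ones(len(Xnew))])
        return Xnewb @ coef + f_res(Xnew)
    return quantile_fn

# homoscedastic stand-in: an unconditional empirical quantile of the residuals
empirical = lambda X, r, tau: (lambda Xnew: np.full(len(Xnew), np.quantile(r, tau)))
```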
80 When multiple quantile regressions with different values of τ are independently learned, we often encounter a notorious problem known as quantile crossing (see Section 2. [sent-214, score-0.983]
81 For example, in Figure 5(a), some of the estimated conditional quantile functions cross each other (which never happens in the true conditional quantile functions). [sent-216, score-1.039]
82 In the simplest case, if we assume that the data is homoscedastic (i. [sent-218, score-0.042]
83 , the conditional distribution P(Y|x) does not depend on x except its location), Table 1: Average (and standard deviation) of test errors obtained by joint cost-sensitive SVMs on benchmark datasets. [sent-220, score-0.133]
84 n is the sample size, d is the input dimension, Ind indicates the results when each cost-sensitive SVM was trained independently, while PTFL and PTVS indicate the results from PT feature learning and PT feature selection, respectively. [sent-221, score-0.154]
85 quantile regressions at different values of τ can be obtained by just vertically shifting another quantile regression function (see Figure 5(f)). [sent-283, score-1.005]
86 Our PT feature learning approach, when applied to the joint quantile regression problem, allows us to interpolate between these two extreme cases. [sent-284, score-0.607]
87 Figure 5 shows a joint QR example on the bone mineral density (BMD) data [21]. [sent-285, score-0.058]
88 When (a) γ → 0, our approach is identical to independently estimating each quantile regression, while it coincides with the homoscedastic case when (f) γ → ∞. [sent-287, score-0.515]
[Figure 5 panels (a) γ → 0 through (f) γ → ∞: estimated conditional quantile functions plotted against (Standardized) Age and (Standardized) Relative BMD Change.]
95 Figure 5: Joint quantile regression examples on BMD data [21] for six different γs. [sent-350, score-0.495]
96 6 Conclusions In this paper, we introduced the parametric-task learning (PTL) approach, which can systematically handle infinitely many tasks parameterized by a continuous parameter. [sent-351, score-0.177]
97 We illustrated the usefulness of this approach by providing three examples that can be naturally formulated as PTL. [sent-352, score-0.041]
98 An algorithm for the solution of the parametric quadratic programming problem. [sent-405, score-0.132]
99 Joint covariate selection and joint subspace selection for multiple classification problems. [sent-427, score-0.112]
100 The entire regularization path for the support vector machine. [sent-447, score-0.092]
wordName wordTfidf (topN-words)
[('ptl', 0.548), ('quantile', 0.449), ('nn', 0.266), ('mtl', 0.186), ('standardized', 0.184), ('yi', 0.184), ('bmd', 0.169), ('ptfl', 0.169), ('wi', 0.151), ('ind', 0.137), ('nt', 0.113), ('ptvs', 0.105), ('pt', 0.103), ('nitely', 0.095), ('xi', 0.094), ('parametric', 0.091), ('nagoya', 0.074), ('age', 0.068), ('regressions', 0.061), ('svms', 0.06), ('joint', 0.058), ('path', 0.058), ('feature', 0.054), ('conditional', 0.052), ('japan', 0.051), ('tth', 0.051), ('svm', 0.051), ('continuum', 0.05), ('tasks', 0.049), ('false', 0.049), ('parameterized', 0.046), ('weight', 0.046), ('dh', 0.046), ('regression', 0.046), ('fy', 0.044), ('classi', 0.044), ('shared', 0.043), ('rd', 0.042), ('fres', 0.042), ('homoscedastic', 0.042), ('inimization', 0.042), ('lternating', 0.042), ('solutions', 0.042), ('programming', 0.041), ('tokyo', 0.041), ('formulated', 0.041), ('common', 0.04), ('ch', 0.04), ('tr', 0.038), ('lgorithm', 0.037), ('ah', 0.037), ('bh', 0.037), ('mext', 0.037), ('instances', 0.037), ('functions', 0.037), ('triangular', 0.036), ('continuous', 0.035), ('id', 0.034), ('regularization', 0.034), ('bb', 0.033), ('dimensions', 0.033), ('training', 0.033), ('kakenhi', 0.032), ('takeuchi', 0.032), ('task', 0.031), ('gradually', 0.031), ('th', 0.031), ('piecewise', 0.03), ('min', 0.029), ('ith', 0.029), ('change', 0.028), ('alternately', 0.028), ('residual', 0.027), ('subspace', 0.027), ('er', 0.027), ('argyriou', 0.027), ('qp', 0.027), ('handle', 0.027), ('selection', 0.027), ('ordinary', 0.026), ('formulation', 0.026), ('breast', 0.026), ('input', 0.025), ('loss', 0.024), ('independently', 0.024), ('symbols', 0.023), ('exploiting', 0.023), ('test', 0.023), ('cancer', 0.023), ('arg', 0.023), ('trace', 0.022), ('linearity', 0.022), ('trained', 0.021), ('relative', 0.021), ('systematically', 0.02), ('af', 0.02), ('toy', 0.02), ('parametrized', 0.02), ('condition', 0.019), ('continuously', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 244 nips-2013-Parametric Task Learning
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
2 0.13558498 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch
Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1
3 0.081423931 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
4 0.070973143 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown H¨ lder-continuity of the regression function at x. The result holds with o high probability simultaneously at all points x in a general metric space X of unknown structure. 1
5 0.066063337 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.06185611 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
7 0.057228137 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
8 0.055639457 318 nips-2013-Structured Learning via Logistic Regression
9 0.054314476 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
10 0.054120891 158 nips-2013-Learning Multiple Models via Regularized Weighting
11 0.051799465 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
12 0.049113896 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
13 0.048663806 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
14 0.048459835 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
15 0.046481702 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
16 0.046439286 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
17 0.046338011 91 nips-2013-Dirty Statistical Models
18 0.045538485 65 nips-2013-Compressive Feature Learning
19 0.045481104 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
20 0.045002636 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
topicId topicWeight
[(0, 0.142), (1, 0.04), (2, 0.02), (3, -0.017), (4, 0.034), (5, 0.012), (6, -0.02), (7, 0.034), (8, -0.031), (9, 0.022), (10, 0.008), (11, -0.017), (12, -0.029), (13, -0.022), (14, 0.033), (15, 0.015), (16, 0.004), (17, 0.035), (18, -0.011), (19, -0.014), (20, -0.053), (21, 0.048), (22, 0.048), (23, 0.066), (24, -0.001), (25, -0.017), (26, 0.002), (27, -0.063), (28, 0.021), (29, -0.068), (30, -0.053), (31, 0.035), (32, -0.012), (33, 0.008), (34, 0.052), (35, 0.046), (36, -0.04), (37, -0.151), (38, -0.055), (39, 0.109), (40, 0.01), (41, 0.053), (42, -0.089), (43, 0.048), (44, -0.147), (45, -0.085), (46, 0.0), (47, -0.004), (48, 0.021), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.9088071 244 nips-2013-Parametric Task Learning
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
2 0.80270451 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
Author: Yu Zhang
Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the knearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1
3 0.65598541 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1
4 0.63817501 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch
Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1
5 0.5737896 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown H¨ lder-continuity of the regression function at x. The result holds with o high probability simultaneously at all points x in a general metric space X of unknown structure. 1
6 0.56644118 211 nips-2013-Non-Linear Domain Adaptation with Boosting
7 0.56161547 76 nips-2013-Correlated random features for fast semi-supervised learning
8 0.53838903 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
9 0.52805275 158 nips-2013-Learning Multiple Models via Regularized Weighting
10 0.52476555 202 nips-2013-Multiclass Total Variation Clustering
11 0.51667649 75 nips-2013-Convex Two-Layer Modeling
12 0.50711948 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
13 0.50062418 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
14 0.49778295 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
15 0.48248136 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
16 0.4813579 171 nips-2013-Learning with Noisy Labels
17 0.4779855 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
18 0.47469628 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
19 0.47176695 65 nips-2013-Compressive Feature Learning
20 0.46177658 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
topicId topicWeight
[(2, 0.014), (16, 0.023), (33, 0.158), (34, 0.099), (41, 0.043), (49, 0.027), (56, 0.081), (70, 0.036), (72, 0.292), (85, 0.024), (89, 0.034), (93, 0.061), (95, 0.011)]
simIndex simValue paperId paperTitle
1 0.81296819 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
Author: José Miguel Hernández-Lobato, James R. Lloyd, Daniel Hernández-Lobato
Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other timevarying copula methods. 1
2 0.80691403 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
same-paper 3 0.74737984 244 nips-2013-Parametric Task Learning
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
4 0.73108643 167 nips-2013-Learning the Local Statistics of Optical Flow
Author: Dan Rosenbaum, Daniel Zoran, Yair Weiss
Abstract: Motivated by recent progress in natural image statistics, we use newly available datasets with ground truth optical flow to learn the local statistics of optical flow and compare the learned models to prior models assumed by computer vision researchers. We find that a Gaussian mixture model (GMM) with 64 components provides a significantly better model for local flow statistics when compared to commonly used models. We investigate the source of the GMM’s success and show it is related to an explicit representation of flow boundaries. We also learn a model that jointly models the local intensity pattern and the local optical flow. In accordance with the assumptions often made in computer vision, the model learns that flow boundaries are more likely at intensity boundaries. However, when evaluated on a large dataset, this dependency is very weak and the benefit of conditioning flow estimation on the local intensity pattern is marginal. 1
5 0.68764299 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin
Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1
6 0.66253018 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
7 0.60122335 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
8 0.60046148 201 nips-2013-Multi-Task Bayesian Optimization
9 0.60010564 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
10 0.59955668 251 nips-2013-Predicting Parameters in Deep Learning
11 0.59765583 301 nips-2013-Sparse Additive Text Models with Low Rank Background
12 0.59649777 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
13 0.59562188 331 nips-2013-Top-Down Regularization of Deep Belief Networks
14 0.59550601 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
15 0.59514081 30 nips-2013-Adaptive dropout for training deep neural networks
16 0.5950743 99 nips-2013-Dropout Training as Adaptive Regularization
17 0.59493983 173 nips-2013-Least Informative Dimensions
18 0.59478867 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
19 0.59450984 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
20 0.59406817 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles