nips nips2013 nips2013-211 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. [sent-6, score-0.334]
2 In this paper we present a multitask learning algorithm for domain adaptation based on boosting. [sent-7, score-0.45]
3 Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. [sent-8, score-0.773]
4 This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. [sent-10, score-0.684]
5 A possible solution is to treat each acquisition as a separate, but related classification problem, and exploit their possible relationship to learn from the supervised data available across all of them. [sent-18, score-0.277]
6 In Fig. 1(a,b) the task is mitochondria segmentation in both acquisitions. [sent-21, score-0.44]
7 Techniques in domain adaptation [1] and more generally multi-task learning [2, 3] seek to leverage data from a set of different yet related tasks or domains to help learn a classifier in a seemingly new task. [sent-23, score-0.64]
8 In domain adaptation, it is typically assumed that there is a fairly large amount of labeled data in one domain, commonly referred to as the source domain, and that a limited amount of supervision is available in the other, often called the target domain. [sent-24, score-0.607]
9 Our goal is to exploit the labeled data in the source domain to learn an accurate classifier in the target domain despite having only a few labeled samples in the latter. [sent-25, score-1.059]
10 Figure 1: Mitochondria Segmentation (3D stacks): (a) Striatum, (b) Hippocampus; Path Classification (2D images to 3D stacks): (c) Aerial road images, (d) Neural Axons (OPF). (a,b) Slice cuts from two 3D Electron Microscopy acquisitions from different parts of the brain of a rat. [sent-27, score-0.387]
11 (c,d) 2D aerial road images and 3D neural axons from Olfactory Projection Fibers (OPF). [sent-28, score-0.391]
12 The data acquisition problem is unlike many multi-task learning problems, however, in that each task is in fact the same; what has changed is that the features across different acquisitions have undergone some unknown transformation. [sent-30, score-0.503]
13 That is to say, each task can be well described by a single decision boundary in some common feature space that preserves the task-relevant features and discards the domain-specific ones corresponding to unwanted acquisition artifacts. [sent-31, score-0.893]
14 This contrasts with the more general multi-task setting, where each task is comprised of both a common and a task-specific boundary, even when mapped to a common feature space, as illustrated in Fig. [sent-32, score-0.28]
15 A method that can jointly optimize over the common decision boundary and shared feature space is therefore desired. [sent-34, score-0.528]
16 Linear latent variable methods such as those based on Canonical Correlation Analysis (CCA) [4, 5] can be applied to learn a shared feature space across the different acquisitions. [sent-35, score-0.381]
17 In this paper we propose a solution to the data acquisition problem and devise a method that can jointly solve for the non-linear decision boundary and transformations across tasks. [sent-39, score-0.435]
18 We assume that only the mappings are task-dependent and that in the shared space the problem is linearly separable and the decision boundary is common to all tasks. [sent-42, score-0.503]
19 We use the boosting-trick [8, 9, 10] to simultaneously learn the non-linear task-specific mappings as well as the decision boundary, with no need for specific a-priori knowledge of their global analytical form. [sent-43, score-0.262]
20 This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. [sent-44, score-0.684]
21 We first consider the classification of curvilinear structures in 3D image stacks of Olfactory Projection Fibers (OPF) [11] using labeled 2D aerial road images. [sent-47, score-0.804]
22 We then perform mitochondria segmentation in large 3D Electron Microscopy (EM) stacks of neural rat tissue, demonstrating the ability of our algorithm to leverage labeled data from different data acquisitions on this challenging task. [sent-48, score-0.954]
23 On both datasets our approach obtains a significant improvement over using labeled data from either domain alone and outperforms recent multi-task learning baseline methods. [sent-49, score-0.372]
24 2 Related Work. Initial approaches to multi-task learning exploited supervised data from related tasks to define a form of regularization in the target problem [2, 12]. [sent-50, score-0.265]
25 MTL assumes a single, pre-defined transformation φ(x) : X → Z and learns shared and task-specific linear boundaries in Z, namely β_o, β_1 and β_2 ∈ Z. [sent-52, score-0.273]
26 In contrast, our DA approach learns a single linear boundary β in a common feature space Z, and task-specific mappings φ_1(x), φ_2(x) : X → Z. [sent-53, score-0.328]
27 as auxiliary problems [13], are used to learn a latent representation and find discriminative features shared across tasks. [sent-55, score-0.34]
28 This representation is then transferred to the target task to help regularize the solution and learn from fewer labeled examples. [sent-56, score-0.399]
29 More recent multi-task learning methods jointly optimize over both the shared and task-specific components of each task [3, 14, 10, 15]. [sent-62, score-0.22]
30 In particular, for each task their approach computes a linear decision boundary defined as a linear combination of a hyperplane shared across tasks and a task-specific one, in either the original or a kernelized feature space. [sent-64, score-0.759]
31 For each task they optimize for a shared and task-specific decision boundary similar to [3], except nonlinearities are modeled using a boosted feature space. [sent-67, score-0.678]
32 As with other methods, however, additional parameters, which can be difficult to set, are required to control the degree of sharing between tasks, especially when one or more tasks have only a few labeled samples. [sent-68, score-0.338]
33 For many problems, such as those common to domain adaptation [1], the decision problem is in fact the same across tasks; however, the features of each task have undergone some unknown transformation. [sent-69, score-0.857]
34 Feature-based approaches seek to uncover this transformation by learning a mapping between the features across tasks [18, 19, 7]. [sent-70, score-0.237]
35 A cross-domain Mahalanobis distance metric was introduced in [18] that leverages across-task correspondences to learn a transformation from the source to target domain. [sent-71, score-0.36]
36 Shared latent variable models have also been proposed to learn a shared representation across multiple feature sources or tasks [4, 19, 6, 7, 21]. [sent-73, score-0.484]
37 In this paper, we exploit the boosting-trick [10] to handle non-linearities and learn a shared representation across tasks, overcoming these limitations. [sent-75, score-0.262]
38 This results in a more parameter-free, scalable domain adaptation approach that can leverage learning on new tasks where labeled data is scarce. [sent-76, score-0.702]
39 3 Our Approach. We consider the problem of learning a binary decision function from supervised data collected across multiple tasks or domains. [sent-77, score-0.34]
40 In our setting, each task is an instance of the same underlying decision problem; however, its features are assumed to have undergone some unknown non-linear transformation. [sent-78, score-0.356]
41 More formally, we are given training data X^t = {(x_i^t, y_i^t)}_{i=1}^{N_t} for t = 1, . . . , T tasks, where x_i^t ∈ R^D represents a feature vector for sample i in task t and y_i^t ∈ {−1, 1} its label. [sent-82, score-0.33]
42 For each task we seek to learn a non-linear transformation, φ_t(x^t), that maps x^t to a common, task-independent feature space, Z, accounting for any unwanted feature shift. [sent-83, score-0.385]
43 In what follows, we set H to be the set of regression trees or stumps [8] that, in combination with τ^t, can be used to model highly complex, non-linear transformations. [sent-101, score-0.325]
44 Assuming that the problem is linearly separable in Z, the predictive function f_t(·) : R^D → R for each task can then be written as f_t(x^t) = β^⊤ φ_t(x^t) = Σ_{j=1}^{M} β_j h_j(x^t − τ_j^t), (2) where β ∈ R^M is a linear decision boundary in Z that is common to all tasks. [sent-102, score-0.74]
45 This contrasts with previous approaches to multi-task learning, such as [3, 10], that learn a separate decision boundary per task and, as we show later, is better suited for problems in domain adaptation. [sent-103, score-0.748]
46 We learn the functions f_t(·) by minimizing the exponential loss on the training data across each task, β*, Γ* = arg min_{β, Γ} Σ_{t=1}^{T} L(β, Γ^t; X^t), (3) where L(β, Γ^t; X^t) = Σ_{i=1}^{N_t} exp(−y_i^t f_t(x_i^t)) = Σ_{i=1}^{N_t} exp(−y_i^t Σ_{j=1}^{M} β_j h_j(x_i^t − τ_j^t)), (4) and Γ = [Γ^1, . . . , Γ^T] collects the task-specific parameters. [sent-104, score-0.736]
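To make Eqs. (2)-(4) concrete, the following is a minimal sketch of how the shared boundary and the task-specific mappings could be evaluated; the function and variable names are illustrative assumptions rather than the authors' implementation, and the weak learners h_j are left abstract.

```python
import numpy as np

def predict(weak_learners, betas, taus, X, t):
    """f_t(x) = sum_j beta_j * h_j(x - tau_j^t), as in Eq. (2).

    weak_learners : list of functions h_j mapping an (N, D) array to (N,) scores
    betas         : shared weights beta_j (one decision boundary for all tasks)
    taus          : taus[j][t] is the task-specific offset tau_j^t (length-D vector)
    """
    scores = np.zeros(X.shape[0])
    for h_j, beta_j, tau_j in zip(weak_learners, betas, taus):
        scores += beta_j * h_j(X - tau_j[t])
    return scores

def exponential_loss(weak_learners, betas, taus, data):
    """Sum of the per-task exponential losses of Eqs. (3)-(4).

    data : list of (X_t, y_t) pairs with labels y_t in {-1, +1}
    """
    return sum(np.exp(-y * predict(weak_learners, betas, taus, X, t)).sum()
               for t, (X, y) in enumerate(data))
```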
47 Luckily, this is a problem for which boosting is particularly well suited [8], as it has been demonstrated to be an effective method for constructing a highly accurate classifier from a possibly large collection of weak prediction functions. [sent-113, score-0.282]
48 We propose to use gradient boosting [8, 9] to solve for ft (·). [sent-116, score-0.311]
49 Applying gradient boosting to Eq. (3), the goal at each boosting iteration is to find the weak learner h̃ ∈ H and the set of task-specific offsets {τ̃^1, . . . , τ̃^T} [sent-120, score-0.241]
50 that minimize Σ_{t=1}^{T} Σ_{i=1}^{N_t} w_ik^t ( h̃(x_i^t − τ̃^t) − r_ik^t )^2, (5) where w_ik^t and r_ik^t can be computed by differentiating the loss of Eq. [sent-123, score-0.54]
51 (4), obtaining w_ik^t = e^{−y_i^t f_t(x_i^t)} and r_ik^t = y_i^t. [sent-124, score-0.264]
52 Once h̃ and {τ̃^1, . . . , τ̃^T} are found, the step size is obtained by line search and each f_t is updated, as summarized in Algorithm 1. [sent-128, score-0.339]
Algorithm 1 Non-Linear Domain Adaptation with Boosting
Input: training samples and labels for the T tasks, X^t = {(x_i^t, y_i^t)}_{i=1}^{N_t}; number of iterations K; shrinkage factor 0 < γ ≤ 1
1: Set f_t(·) = 0 ∀ t = 1, . . . , T
2: for k = 1 to K do [sent-131]
3: Let w_ik^t = e^{−y_i^t f_t(x_i^t)} and r_ik^t = y_i^t
4: Find (h̃(·), τ̃^1, . . . , τ̃^T) = argmin over h ∈ H and τ^1, . . . , τ^T of Σ_{t=1}^{T} Σ_{i=1}^{N_t} w_ik^t ( h(x_i^t − τ^t) − r_ik^t )^2 [sent-134]
5: Find α̃ through line search: α̃ = argmin_α Σ_{t=1}^{T} Σ_{i=1}^{N_t} exp( −y_i^t ( f_t(x_i^t) + α h̃(x_i^t − τ̃^t) ) )
6: Set β̃ = γ α̃
7: Update f_t(·) = f_t(·) + β̃ h̃(· − τ̃^t) ∀ t = 1, . . . , T [sent-137]
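Below is a compact sketch of Algorithm 1 using axis-aligned decision stumps (η_1, η_2 = ±1) as the weak learners; the per-feature threshold sweep and the grid-based line search are simplifications assumed here for readability and do not correspond to the authors' released code.

```python
import numpy as np

def best_task_threshold(x_n, w, r):
    """For one task and one feature, choose the threshold tau minimizing the weighted
    error of the stump h(x) = +1 if x - tau >= 0 else -1 against targets r in {-1, +1}
    with weights w (with eta_1 = +1, eta_2 = -1 this matches Eq. (6) up to a factor)."""
    order = np.argsort(x_n)
    xs, ws, rs = x_n[order], w[order], r[order]
    err = ws[rs == -1].sum()                 # threshold below all samples: everything predicted +1
    best_err, best_tau = err, xs[0] - 1e-9
    for i in range(len(xs) - 1):
        err += ws[i] if rs[i] == 1 else -ws[i]   # sample i moves to the -1 side of the split
        if err < best_err:
            best_err, best_tau = err, 0.5 * (xs[i] + xs[i + 1])
    return best_tau, best_err

def fit_shared_stumps(data, K=100, shrinkage=0.1):
    """data: list of (X_t, y_t), X_t of shape (N_t, D), y_t in {-1, +1}.
    Returns stumps (feature, per-task thresholds, beta) sharing one decision boundary."""
    D = data[0][0].shape[1]
    F = [np.zeros(len(y)) for _, y in data]              # current scores f_t(x_i^t)
    model = []
    for _ in range(K):
        W = [np.exp(-y * f) for (_, y), f in zip(data, F)]    # w_ik^t; r_ik^t = y_i^t
        best = None
        for n in range(D):                                     # step 4: joint stump search
            taus, err = [], 0.0
            for (X, y), w in zip(data, W):
                tau, e = best_task_threshold(X[:, n], w, y)
                taus.append(tau)
                err += e
            if best is None or err < best[2]:
                best = (n, taus, err)
        n, taus, _ = best
        H = [np.where(X[:, n] >= tau, 1.0, -1.0) for (X, _), tau in zip(data, taus)]
        grid = np.linspace(0.01, 2.0, 50)                      # step 5: crude line search
        losses = [sum(np.exp(-y * (f + a * h)).sum()
                      for (_, y), f, h in zip(data, F, H)) for a in grid]
        beta = shrinkage * grid[int(np.argmin(losses))]        # steps 6-7
        F = [f + beta * h for f, h in zip(F, H)]
        model.append((n, taus, beta))
    return model
```

Calling fit_shared_stumps([(X_source, y_source), (X_target, y_target)]) would jointly fit the shared boundary while letting each domain keep its own thresholds.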
56 Shrinkage may be applied to help regularize the solution, particularly when using powerful weak learners such as regression trees [8]. [sent-145, score-0.241]
57 In the next section we show that regression trees and boosted stumps can be used efficiently to minimize Eq. (5). [sent-154, score-0.436]
58 3.1 Weak Learners. Regression trees have proven very effective when used as weak learners with gradient boosting [23]. [sent-157, score-0.376]
59 Decision stumps represent a special case of single-level regression trees. [sent-159, score-0.241]
60 In cases where feature dimensionality D is very large, decision stumps may be preferred over regression trees to reduce training time. [sent-161, score-0.625]
61 Each split is found by minimizing, over {n, η_1, η_2, τ^1, . . . , τ^T}, the objective Σ_{t=1}^{T} Σ_{i=1}^{N_t} [ 1_{x_i^t[n] − τ^t} w_ik^t (η_1 − r_ik^t)^2 + 1̄_{x_i^t[n] − τ^t} w_ik^t (η_2 − r_ik^t)^2 ], (6) where x[n] ∈ R denotes the value of the nth dimension of x, 1_{·} is the indicator function, and 1̄_{·} = 1 − 1_{·}. [sent-169, score-0.54]
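One step left implicit in the extracted text: for a fixed attribute n and thresholds τ^1, . . . , τ^T, Eq. (6) is a weighted least-squares problem in η_1 and η_2, so the leaf values follow in closed form (a standard boosting-tree argument, sketched here rather than quoted from the paper):

```latex
\frac{\partial}{\partial \eta_1}\sum_{t=1}^{T}\sum_{i=1}^{N_t}
  \mathbf{1}_{\{x_i^t[n]-\tau^t\}}\, w_{ik}^{t}\,\bigl(\eta_1 - r_{ik}^{t}\bigr)^2 = 0
\;\Longrightarrow\;
\eta_1^{*}=\frac{\sum_{t,i}\mathbf{1}_{\{x_i^t[n]-\tau^t\}}\, w_{ik}^{t}\, r_{ik}^{t}}
               {\sum_{t,i}\mathbf{1}_{\{x_i^t[n]-\tau^t\}}\, w_{ik}^{t}},
\qquad
\eta_2^{*}=\frac{\sum_{t,i}\bar{\mathbf{1}}_{\{x_i^t[n]-\tau^t\}}\, w_{ik}^{t}\, r_{ik}^{t}}
               {\sum_{t,i}\bar{\mathbf{1}}_{\{x_i^t[n]-\tau^t\}}\, w_{ik}^{t}}.
```

That is, each leaf value is the weighted mean of the residuals falling on its side of the split, so the search reduces to choosing n and the per-task thresholds τ^t.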
62 The main difference with respect to classic regression trees is that, besides learning the values of η_1, η_2 and n, our approach requires the tree to also learn a threshold τ^t ∈ R per task. [sent-173, score-0.223]
63 Decision Stumps: Decision stumps consist of a single split and return values η_1, η_2 = ±1. [sent-177, score-0.241]
64 If also r_ik^t = ±1, which is true when boosting with the exponential loss, then it can be demonstrated that minimizing Eq. (6) can be separated into T independent minimization problems for each of the D attributes n. [sent-178, score-0.317]
65 This makes decision stumps feasible for large-scale applications with very high-dimensional feature spaces. [sent-181, score-0.423]
66 4 Evaluation. We evaluated our approach on two challenging domain adaptation problems for which annotation is very time-consuming, representative of the problems we seek to address. [sent-184, score-0.443]
67 We consider the detection of 3D curvilinear structures in 3D image stacks of Olfactory Projection Fibers (OPF) using 2D aerial road images (see Fig. 1(c,d)). [sent-188, score-0.739]
68 For this problem, the task is to predict whether a tubular path between two image locations belongs to a curvilinear structure. [sent-190, score-0.301]
69 We used a publicly available dataset [11] of 2D aerial images of road networks as the source domain and 3D stacks of Olfactory Projection Fibers (OPF) from the DIADEM challenge as the target domain. [sent-191, score-1.033]
70 The source domain consists of six fully-labeled 2D aerial road images and the target domain contains eight fully-labeled 3D stacks. [sent-192, score-1.028]
71 We aim at using large amounts of labeled data from 2D road images to leverage learning in the 3D stacks. [sent-193, score-0.383]
72 The goal of this task is to segment mitochondria from large 3D Electron Microscopy (EM) stacks of 5 nm voxel size, acquired from the brain of a rat. [sent-197, score-0.633]
73 As in the path classification problem, 3D annotations are time-consuming and exploiting already-annotated stacks is essential to speed up analysis. [sent-198, score-0.315]
74 The source domain is a fully-labeled EM stack from the Striatum region of 853x506x496 voxels with 39 labeled mitochondria. [sent-199, score-0.524]
75 The target domain consists of two stacks acquired from the Hippocampus, one a training volume of size 1024x653x165 voxels and the other a test volume that is 1024x883x165 voxels, with 10 and 42 labeled mitochondria in each respectively. [sent-200, score-1.195]
76 However, differences in the appearance and geometry of the structures may adversely affect classifier accuracy when classifiers trained on 2D images are applied to 3D stacks, which motivates domain adaptation. [sent-207, score-0.287]
77 We use half of the target domain for training and half for testing. [sent-208, score-0.441]
78 This results in balanced sets of 30k samples for training in the source domain, and 20k for training and 20k for testing in the target domain. [sent-210, score-0.411]
79 To simulate the lack of training data, we randomly pick an equal number of positive and negative samples for training from the target domain. [sent-211, score-0.327]
80 The HGD codewords are extracted from the road images and used for both domains to generate consistent feature vectors. [sent-212, score-0.31]
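The extracted text does not spell out how the codewords are applied; one common way to obtain consistent feature vectors across domains is a nearest-codeword histogram encoding, sketched below with purely hypothetical names.

```python
import numpy as np

def encode_with_codewords(descriptors, codewords):
    """Assign each local descriptor to its nearest codeword and return a normalized
    histogram of assignments as the sample's feature vector.

    descriptors : (N, d) array of local descriptors extracted along a path
    codewords   : (C, d) array of codewords learned from the road images
    """
    # squared Euclidean distances between every descriptor and every codeword
    d2 = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codewords)).astype(float)
    return hist / max(hist.sum(), 1.0)
```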
81 We employ gradient boosted trees, which in our experiments outperformed boosted stumps and kernel SVMs. [sent-213, score-0.503]
82 Figure 3: Path Classification: median, lower and upper quartiles of the test error as the number of training samples in the target domain (TD) is varied; the curves compare Pooling, TD only, and Full TD. [sent-215, score-0.252]
83 Our approach nears Full TD performance with as few as 70 training samples in the target domain and significantly outperforms the baseline methods. [sent-216, score-0.484]
84 For mitochondria segmentation we use the boosting-based method of [27], which is optimized for 3D stacks and whose source code is publicly available. [sent-222, score-0.69]
85 Similar to [27], we group voxels into supervoxels to reduce training and testing time, which yields 15k positive and 275k negative supervoxel samples in the source domain. [sent-224, score-0.323]
86 In the target domain this yields 12k negative training samples. [sent-225, score-0.441]
87 To simulate a real scenario, we create 10 different transfer learning problems using the samples from one mitochondrion at a time as positives, which translates into approximately 300 positive training supervoxels each. [sent-226, score-0.439]
88 We compare to linear Canonical Correlation Analysis (CCA) and Kernel CCA (KCCA) [4] for learning a shared latent space on the path classification dataset, using a Radial Basis Function (RBF) kernel for KCCA, a commonly used choice. [sent-232, score-0.33]
89 The data size and dimensionality of the mitochondria dataset are prohibitive for these methods; instead we compare to Mean-Variance Normalization (MVN) and Histogram Matching (HM), which are common normalizations one might apply to compensate for acquisition artifacts. [sent-234, score-0.42]
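Both baselines are standard intensity normalizations; a minimal sketch of what they might look like (assumed here for illustration, not the paper's exact preprocessing) is:

```python
import numpy as np

def mean_variance_normalize(volume):
    """MVN: rescale a stack's intensities to zero mean and unit variance."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def histogram_match(source, reference):
    """HM: remap source intensities so their histogram matches the reference's."""
    s_vals, s_idx, s_counts = np.unique(source.ravel(),
                                        return_inverse=True, return_counts=True)
    r_vals, r_counts = np.unique(reference.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_counts) / source.size
    r_cdf = np.cumsum(r_counts) / reference.size
    matched = np.interp(s_cdf, r_cdf, r_vals)   # invert the reference CDF
    return matched[s_idx].reshape(source.shape)
```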
90 The next best competitor is the multi-task method of [10], although it exhibits a much higher variance than our approach and performs poorly when only provided a few labeled target examples. [sent-241, score-0.25]
91 In contrast, our method yields a higher performance without the need for such priors and is able to faithfully leverage the source domain data to learn from relatively few examples in the target domain, outperforming the baseline methods. [sent-252, score-0.569]
92 Our approach is very close to Full TD in performance when using as few as 70 training samples, even though the Full TD classifier was trained with 20k samples from the target domain. [sent-255, score-0.244]
93 [10] suggests that modeling the domain shift using shared and task-specific boundaries, as is commonly done in multi-task learning methods, is not a good model for domain adaptation problems such as the ones shown in Fig. [sent-260, score-0.791]
94 This is accentuated by the parameter tuning required by [10], done through cross-validation, which performs poorly when afforded only a few labeled samples in the target domain and also lengthens training time. [sent-262, score-0.692]
95 The method of [10] took 25 minutes to train, while our approach only took between 2 and 15 minutes, depending on the amount of labeled target data. [sent-263, score-0.334]
96 5 Conclusion. In this paper we presented an approach for performing non-linear domain adaptation with boosting. [sent-272, score-0.41]
97 Our method learns a task-independent decision boundary in a common feature space, obtained via a non-linear mapping of the features in each task. [sent-273, score-0.454]
98 This contrasts with recent approaches that learn task-specific boundaries, and is better suited for problems in domain adaptation where each task is an instance of the same decision problem but whose features have undergone an unknown transformation. [sent-274, score-0.973]
99 We evaluated our approach on two challenging bio-medical datasets where it achieved a significant gain over using labeled data from either domain alone and outperformed recent multi-task learning methods. [sent-276, score-0.405]
100 [1] Jiang, J.: A literature survey on domain adaptation of statistical classifiers. [sent-278, score-0.41]
wordName wordTfidf (topN-words)
[('mitochondria', 0.268), ('stacks', 0.245), ('domain', 0.24), ('stumps', 0.206), ('td', 0.174), ('boosting', 0.17), ('adaptation', 0.17), ('aerial', 0.152), ('rik', 0.147), ('decision', 0.142), ('shared', 0.141), ('ft', 0.141), ('labeled', 0.132), ('boundary', 0.13), ('road', 0.127), ('acquisitions', 0.126), ('opf', 0.126), ('xt', 0.124), ('wik', 0.123), ('target', 0.118), ('acquisition', 0.112), ('boosted', 0.111), ('tasks', 0.103), ('mvn', 0.103), ('curvilinear', 0.101), ('fibers', 0.101), ('undergone', 0.101), ('baselines', 0.097), ('segmentation', 0.093), ('fua', 0.089), ('cca', 0.088), ('chapelle', 0.086), ('hm', 0.086), ('trees', 0.084), ('source', 0.084), ('training', 0.083), ('task', 0.079), ('classi', 0.078), ('pooling', 0.077), ('accentuated', 0.076), ('striatum', 0.076), ('ht', 0.076), ('kernel', 0.075), ('feature', 0.075), ('olfactory', 0.073), ('nt', 0.073), ('weak', 0.071), ('learn', 0.07), ('path', 0.07), ('jt', 0.068), ('voxels', 0.068), ('images', 0.067), ('hj', 0.067), ('er', 0.059), ('jaccard', 0.058), ('leverage', 0.057), ('electron', 0.055), ('hippocampus', 0.053), ('darrell', 0.053), ('microscopy', 0.053), ('yi', 0.052), ('learners', 0.051), ('across', 0.051), ('hgd', 0.051), ('tubular', 0.051), ('mappings', 0.05), ('boundaries', 0.05), ('transformation', 0.049), ('structures', 0.047), ('contrasts', 0.046), ('histogram', 0.045), ('axons', 0.045), ('supervoxels', 0.045), ('mtl', 0.045), ('latent', 0.044), ('supervised', 0.044), ('samples', 0.043), ('took', 0.042), ('sd', 0.042), ('codewords', 0.041), ('caruana', 0.041), ('saenko', 0.041), ('unwanted', 0.041), ('acquired', 0.041), ('jmlr', 0.041), ('suited', 0.041), ('multitask', 0.04), ('common', 0.04), ('leverages', 0.039), ('normalizes', 0.039), ('kcca', 0.037), ('regression', 0.035), ('becker', 0.035), ('grossly', 0.035), ('split', 0.035), ('features', 0.034), ('tree', 0.034), ('challenging', 0.033), ('supervision', 0.033), ('learns', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
2 0.19600844 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
Author: Boqing Gong, Kristen Grauman, Fei Sha
Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1
3 0.13978966 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
Author: Leonidas Lefakis, François Fleuret
Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1
4 0.13078447 89 nips-2013-Dimension-Free Exponentiated Gradient
Author: Francesco Orabona
Abstract: I present a new online learning algorithm that extends the exponentiated gradient framework to infinite dimensional spaces. My analysis shows that the algorithm is implicitly able to estimate the L2 norm of the unknown competitor, U, achieving a regret bound of the order of O(U log(U T + 1) √T), instead of the standard O((U^2 + 1) √T), achievable without knowing U. For this analysis, I introduce novel tools for algorithms with time-varying regularizers, through the use of local smoothness. Through a lower bound, I also show that the algorithm is optimal up to a log(U T) term for linear and Lipschitz losses. 1
5 0.12760559 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1
6 0.12118938 75 nips-2013-Convex Two-Layer Modeling
7 0.11857431 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
8 0.11705869 318 nips-2013-Structured Learning via Logistic Regression
9 0.11447948 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
10 0.096214287 149 nips-2013-Latent Structured Active Learning
11 0.092488445 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
12 0.091628477 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.089298695 201 nips-2013-Multi-Task Bayesian Optimization
14 0.086409517 5 nips-2013-A Deep Architecture for Matching Short Texts
15 0.084420912 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
16 0.083554119 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
17 0.081423931 244 nips-2013-Parametric Task Learning
18 0.078050055 269 nips-2013-Regression-tree Tuning in a Streaming Setting
19 0.077562377 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.076793492 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
topicId topicWeight
[(0, 0.234), (1, 0.035), (2, -0.046), (3, -0.108), (4, 0.106), (5, -0.053), (6, -0.032), (7, 0.066), (8, -0.037), (9, 0.005), (10, -0.124), (11, -0.114), (12, -0.028), (13, -0.052), (14, -0.01), (15, 0.016), (16, -0.018), (17, 0.066), (18, 0.036), (19, -0.04), (20, -0.024), (21, 0.119), (22, 0.095), (23, 0.02), (24, 0.007), (25, 0.0), (26, 0.139), (27, -0.111), (28, 0.002), (29, -0.087), (30, 0.017), (31, 0.053), (32, -0.065), (33, 0.065), (34, 0.115), (35, 0.008), (36, 0.018), (37, -0.087), (38, 0.109), (39, 0.115), (40, -0.136), (41, -0.016), (42, -0.059), (43, 0.084), (44, 0.028), (45, -0.096), (46, -0.091), (47, 0.125), (48, 0.001), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.95847887 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
2 0.78280127 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
Author: Boqing Gong, Kristen Grauman, Fei Sha
Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1
3 0.74432099 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1
4 0.65621251 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
Author: Leonidas Lefakis, François Fleuret
Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1
5 0.62024224 244 nips-2013-Parametric Task Learning
Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima
Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.
6 0.6135087 318 nips-2013-Structured Learning via Logistic Regression
7 0.61316431 337 nips-2013-Transportability from Multiple Environments with Limited Experiments
8 0.60437024 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
9 0.58372456 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
10 0.57875293 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
11 0.56300056 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
12 0.54160279 76 nips-2013-Correlated random features for fast semi-supervised learning
13 0.52295083 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation
14 0.51883715 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
15 0.51749772 335 nips-2013-Transfer Learning in a Transductive Setting
16 0.50068063 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
17 0.49249363 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
18 0.48614964 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
19 0.48484766 226 nips-2013-One-shot learning by inverting a compositional causal process
20 0.48097512 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
topicId topicWeight
[(16, 0.024), (33, 0.107), (34, 0.082), (49, 0.026), (56, 0.092), (70, 0.031), (85, 0.028), (89, 0.018), (93, 0.508), (95, 0.012)]
simIndex simValue paperId paperTitle
1 0.94381744 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
same-paper 2 0.89318895 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
3 0.88724542 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
4 0.85472 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
5 0.85087019 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
6 0.71501905 215 nips-2013-On Decomposing the Proximal Map
7 0.67198712 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.62779051 99 nips-2013-Dropout Training as Adaptive Regularization
9 0.61497372 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
10 0.61095625 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
11 0.59323239 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
12 0.58601266 30 nips-2013-Adaptive dropout for training deep neural networks
13 0.57888424 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
14 0.5652746 251 nips-2013-Predicting Parameters in Deep Learning
15 0.55774409 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.55259573 69 nips-2013-Context-sensitive active sensing in humans
17 0.53459597 64 nips-2013-Compete to Compute
18 0.53317231 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
19 0.52701622 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.52620095 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization