nips nips2010 nips2010-47 knowledge-graph by maker-knowledge-mining

47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation


Source: pdf

Author: Abhishek Kumar, Avishek Saha, Hal Daume

Abstract: This paper presents a co-regularization based approach to semi-supervised domain adaptation. Our proposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT (EA) [1]) and harnesses unlabeled data in the target domain to further assist the transfer of information from source to target. This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ shows that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as a few other representative baseline approaches.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our proposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT (EA) [1]) and harnesses unlabeled data in the target domain to further assist the transfer of information from source to target. [sent-8, score-0.793]

2 This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. [sent-9, score-0.266]

3 1 Introduction A domain adaptation approach for NLP tasks, termed EASYADAPT (EA), augments the source domain feature space using features from labeled data in the target domain [1]. [sent-12, score-0.97]

4 However, EA requires labeled data in both source and target, and hence applies to fully supervised domain adaptation settings only. [sent-14, score-0.519]

5 There exists prior work on supervised domain adaptation (and multi-task learning) that can be related to EASYADAPT. [sent-16, score-0.266]

6 Domain Adaptation Machine (DAM) [10] is a semi-supervised extension of SVMs for domain adaptation and presents extensive empirical results. [sent-25, score-0.256]

7 We define supervised domain adaptation as having labeled data in both source and target and unsupervised domain adaptation as having labeled data in only source. [sent-28, score-1.098]

8 In semi-supervised domain adaptation, we also have access to both labeled and unlabeled data in target. [sent-29, score-0.344]

9 Our generalization bounds also apply to the approach proposed in [3] for the domain adaptation setting, where we are only concerned with the error on the target domain. [sent-35, score-0.571]

10 Their paper investigates the necessity to combine supervised and unsupervised domain adaptation (which the authors refer to as labeled and unlabeled adaptation frameworks, respectively) and analyzes the combination using mistake bounds (which is limited to perceptron-based online scenarios). [sent-37, score-0.651]

11 Let Ds (x, y) be the source distribution and Dt (x, y) be the target distribution. [sent-43, score-0.457]

12 We have a set of source labeled examples Ls (∼ Ds (x, y)) and a set of target labeled examples Lt (∼ Dt (x, y)), where |Ls | = ls ≫ |Lt | = lt . [sent-44, score-0.726]

13 We also have target unlabeled data denoted by Ut (∼ Dt (x)), where |Ut | = ut . [sent-45, score-0.495]

14 Our goal is to learn a hypothesis h : X → Y having low expected error with respect to the target domain. [sent-46, score-0.439]

15 Source and target empirical errors for hypothesis h are denoted by ǫ̂s(h, fs) and ǫ̂t(h, ft) respectively, where fs and ft are the true source and target labeling functions. [sent-49, score-1.084]

16 EA operates in an augmented space denoted by X̆ ⊂ R^{3d} for a single pair of source and target domains (R^{(k+1)d} for k domains in general). [sent-54, score-0.521]

17 The augmented feature maps Φs, Φt : X → X̆ for the source and target domains are defined as Φs(x) = ⟨x, x, 0⟩ and Φt(x) = ⟨x, 0, x⟩, where x and 0 are vectors in R^d, and 0 denotes a zero vector of dimension d. [sent-56, score-0.557]

18 The first d-dimensional segment corresponds to commonality between source and target, the second d-dimensional segment corresponds to the source domain while the last segment corresponds to the target domain. [sent-57, score-0.759]

19 Source and target domain examples are transformed using these feature maps and the augmented features so constructed are passed onto the underlying supervised classifier. [sent-58, score-0.48]
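
To make the maps concrete, the following Python sketch (an illustration of the idea rather than the authors' implementation; the array shapes and names are assumptions) builds the augmented representations for a single source-target pair.

```python
import numpy as np

def augment_source(X):
    """EASYADAPT map for source examples: x -> <x, x, 0> in R^{3d}."""
    return np.hstack([X, X, np.zeros_like(X)])

def augment_target(X):
    """EASYADAPT map for target examples: x -> <x, 0, x> in R^{3d}."""
    return np.hstack([X, np.zeros_like(X), X])

# Example: 5 source and 3 target examples in d = 4 dimensions.
Xs = np.random.randn(5, 4)
Xt = np.random.randn(3, 4)
X_augmented = np.vstack([augment_source(Xs), augment_target(Xt)])  # shape (8, 12)
```

Any supervised learner can then be trained on the augmented features with the original labels, which is what makes EA a pre-processing step.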

20 Let us denote h̆ = ⟨gc, gs, gt⟩, where each of gc, gs, gt is of dimension d, and they represent the common, source-specific and target-specific components of h̆, respectively. [sent-63, score-0.606]

21 During prediction on target data, the incoming target sample x is transformed to obtain Φt(x), and h̆ is applied on this transformed sample. [sent-64, score-0.552]

22 Briefly, it can be thought of as simultaneously training two hypotheses: hs = (gc + gs) for the source domain and ht = (gc + gt) for the target domain. [sent-67, score-1.083]

23 The commonality between the domains is represented by gc whereas gs and gt capture the idiosyncrasies of the source and target domain, respectively. [sent-68, score-0.812]

24 One drawback of EASYADAPT is its inability to leverage unlabeled target data, which is usually available in large quantities in most practical scenarios. [sent-70, score-0.443]

25 Thereafter, unlabeled data is utilized to co-regularize these learned hypotheses by making them agree on unlabeled samples. [sent-74, score-0.447]

26 In domain adaptation, the source and target data come from two different distributions. [sent-75, score-0.562]

27 However, if the source and target domains are reasonably close, we can employ a similar form of regularization using unlabeled data. [sent-76, score-0.66]

28 A prior co-regularization based idea to harness unlabeled data in domain adaptation tasks demonstrated improved empirical results [10]. [sent-77, score-0.423]

29 2 EA++: EASYADAPT with unlabeled data In our proposed semi-supervised approach, the source and target hypotheses are made to agree on unlabeled data. [sent-80, score-0.904]

30 The hypothesis h̆ contains common, source-specific and target-specific sub-hypotheses and is expressed as h̆ = ⟨gc, gs, gt⟩. [sent-83, score-0.418]

31 As discussed earlier, this is equivalent to learning a source-specific hypothesis hs = (gc + gs) and a target-specific hypothesis ht = (gc + gt). [sent-86, score-1.193]

32 In EA++, we want the source hypothesis hs and the target hypothesis ht to agree on the unlabeled data. [sent-87, score-1.239]

33 For an unlabeled target sample xi ∈ Ut ⊂ R^d, the goal of EA++ is to make the predictions of hs and ht on xi agree with each other. [sent-88, score-0.828]

34 Formally, it aims to achieve the following condition: hs · xi ≈ ht · xi ⇐⇒ (gc + gs) · xi ≈ (gc + gt) · xi ⇐⇒ (gs − gt) · xi ≈ 0 ⇐⇒ ⟨gc, gs, gt⟩ · ⟨0, xi, −xi⟩ ≈ 0. [sent-89, score-0.877]

35 Every unlabeled target sample is transformed using the map Φu(x) = ⟨0, x, −x⟩. [sent-92, score-0.443]

36 The augmented feature space that results from the application of three feature maps, namely, Φs (·), Φt (·) and Φu (·) on source labeled samples, target labeled samples and target unlabeled samples is summarized in Figure 1(a). [sent-94, score-1.18]
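
A sketch of how the three maps might be combined into one EA++ training set is given below. Treating each transformed unlabeled sample as an extra example with a zero pseudo-label (so that a linear learner is pushed towards (gs − gt) · x ≈ 0) is one plausible implementation choice rather than a detail taken verbatim from the paper; the labels ys, yt and all names are hypothetical.

```python
import numpy as np

def augment_unlabeled(X):
    """EA++ map for target unlabeled examples: x -> <0, x, -x> in R^{3d}."""
    return np.hstack([np.zeros_like(X), X, -X])

# Hypothetical data: labeled source, labeled target, unlabeled target (d = 4).
Xs, ys = np.random.randn(50, 4), np.where(np.random.rand(50) > 0.5, 1.0, -1.0)
Xt, yt = np.random.randn(10, 4), np.where(np.random.rand(10) > 0.5, 1.0, -1.0)
Xu = np.random.randn(60, 4)

X_train = np.vstack([
    np.hstack([Xs, Xs, np.zeros_like(Xs)]),   # Phi_s on source labeled data
    np.hstack([Xt, np.zeros_like(Xt), Xt]),   # Phi_t on target labeled data
    augment_unlabeled(Xu),                    # Phi_u on target unlabeled data
])
y_train = np.concatenate([ys, yt, np.zeros(len(Xu))])  # zero pseudo-labels for Phi_u rows
```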

37 Theorem 4.1 relates empirical and expected error for the general case and hence applies to both the source and target domains. [sent-117, score-0.53]

38 Theorem 4.2 relates the expected target error to the expected source error. [sent-119, score-0.531]

39 Theorems 4.1 and 4.2 are then combined so as to relate the expected target error to empirical errors in source and target (which is the main goal of the generalization bounds presented in this paper). [sent-123, score-0.89]

40 The augmented hypothesis h̆ is trained using data from both domains, and the three d-dimensional sub-hypotheses gc, gs, gt are treated in a different manner for source and target data. [sent-128, score-0.787]

41 As discussed earlier, EA can be thought of as simultaneously training two hypotheses hs = (gc + gs) and ht = (gc + gt) for the source and target domains, respectively. [sent-131, score-1.061]

42 Minimizing over gc gives gc = (hs + ht)/3 at the minimum, and the regularizer reduces to (1/3)(||hs||^2 + ||ht||^2 + ||hs − ht||^2). [sent-137, score-0.663]

43 Thus, EA can be thought to be minimizing the sum of empirical source error on hs , empirical target error on ht and this regularizer. [sent-138, score-0.921]

44 (hs, ht) = arg min_{h1,h2} QEA(h1, h2), where QEA(h1, h2) = α ǫ̂s(h1) + (1 − α) ǫ̂t(h2) + λ1||h1||^2 + λ2||h2||^2 + λ||h1 − h2||^2 (4.1). The EA algorithm minimizes this cost function over h1 and h2 jointly to obtain hs and ht. [sent-140, score-0.542]

45 The EA++ algorithm uses target unlabeled data, and encourages hs and ht to agree on unlabeled samples (Eq. 4.2). [sent-141, score-1.031]

46 Both EA and EA++ give equal weights to source and target empirical errors, so α turns out to be 0.5. [sent-146, score-0.482]
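
The two cost functions can be written down directly for a linear predictor; the sketch below is only meant to make the objects in Eqs. 4.1-4.2 concrete. The choice of squared loss and the variable names are assumptions, and the co-regularizer is written as a sum over the unlabeled set, matching the definition of H++ given later.

```python
import numpy as np

def q_ea(h1, h2, Xs, ys, Xt, yt, lam1, lam2, lam, alpha=0.5):
    """EA-style joint cost (Eq. 4.1) with a squared empirical loss."""
    err_s = np.mean((Xs @ h1 - ys) ** 2)          # empirical source error of h1
    err_t = np.mean((Xt @ h2 - yt) ** 2)          # empirical target error of h2
    reg = lam1 * h1 @ h1 + lam2 * h2 @ h2 + lam * (h1 - h2) @ (h1 - h2)
    return alpha * err_s + (1 - alpha) * err_t + reg

def q_eapp(h1, h2, Xs, ys, Xt, yt, Xu, lam1, lam2, lam, lam_u, alpha=0.5):
    """EA++ cost (Eq. 4.2): EA cost plus a co-regularizer on unlabeled target data."""
    coreg = lam_u * np.sum((Xu @ (h1 - h2)) ** 2)  # penalize disagreement of h1, h2 on Ut
    return q_ea(h1, h2, Xs, ys, Xt, yt, lam1, lam2, lam, alpha) + coreg
```

Minimizing either cost jointly over (h1, h2), e.g. with any gradient-based optimizer, yields the pair (hs, ht).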

47 One possible way [13] of defining the hypothesis classes is to substitute the trivial hypotheses h1 = h2 = 0 in both cost functions, which makes all regularizers and co-regularizers equal to zero and thus bounds the cost functions QEA and Q++. [sent-157, score-0.25]

48 Without loss of generality, we also assume that the final source and target hypotheses can only reduce the cost function as compared to the zero hypotheses. [sent-159, score-0.574]

49 Hence, the final hypothesis pair (hs, ht) that minimizes the cost functions is contained in the following paired hypothesis classes for EA and EA++: H := {(h1, h2) : λ1||h1||^2 + λ2||h2||^2 + λ||h1 − h2||^2 ≤ 1} (4.3). [sent-160, score-0.447]

50 H++ := {(h1, h2) : λ1||h1||^2 + λ2||h2||^2 + λ||h1 − h2||^2 + λu Σ_{i∈Ut} (h1(xi) − h2(xi))^2 ≤ 1}. The source hypothesis class for EA is the set of all h1 such that the pair (h1, h2) is in H. [sent-161, score-0.321]

51 Similarly, the target hypothesis class for EA is the set of all h2 such that the pair (h1 , h2 ) is in H. [sent-162, score-0.416]

52 Consequently, the source and target hypothesis classes for EA can be defined as: J^s_EA := {h1 : X → R, (h1, h2) ∈ H} and J^t_EA := {h2 : X → R, (h1, h2) ∈ H} (4.4). [sent-163, score-0.602]

53 Similarly, the source and target hypothesis classes for EA++ are defined as: J^s_++ := {h1 : X → R, (h1, h2) ∈ H++} and J^t_++ := {h2 : X → R, (h1, h2) ∈ H++} (4.5). [sent-164, score-0.602]

54 Let us define the kernel matrix and partition it corresponding to source labeled, target labeled and target unlabeled data as shown below: K = [ A_{s×s}  C_{s×t}  D_{s×u} ; C'_{t×s}  B_{t×t}  E_{t×u} ; D'_{u×s}  E'_{u×t}  F_{u×u} ] (4.6), [sent-166, score-0.972]

55 where 's', 't' and 'u' indicate terms corresponding to source labeled, target labeled and target unlabeled data, respectively. [sent-167, score-0.805]
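
As an illustration of this partition (a hedged sketch with synthetic data; the block names follow Eq. 4.6), the kernel matrix over all samples can be sliced as follows.

```python
import numpy as np

def partition_kernel(K, ls, lt, lu):
    """Split a (ls+lt+lu) x (ls+lt+lu) kernel matrix into the blocks of Eq. (4.6).

    Rows/columns are ordered as: source labeled, target labeled, target unlabeled.
    """
    s = slice(0, ls)
    t = slice(ls, ls + lt)
    u = slice(ls + lt, ls + lt + lu)
    A, B, F = K[s, s], K[t, t], K[u, u]      # within-group blocks
    C, D, E = K[s, t], K[s, u], K[t, u]      # cross-group blocks
    return A, B, C, D, E, F

# Example with a linear kernel on random data (ls = 4, lt = 3, lu = 5).
X = np.random.randn(4 + 3 + 5, 2)
A, B, C, D, E, F = partition_kernel(X @ X.T, 4, 3, 5)
```

The traces of these blocks are what enter the complexity bounds discussed below, e.g. (C^s_EA)^2 = tr(A).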

56 2 Relate empirical and expected error (for both source and target) Having defined the hypothesis classes, we now proceed to obtain generalization bounds for EA and EA++. [sent-169, score-0.411]

57 If we can bound the complexity of the hypothesis classes J^s_EA and J^t_EA, we will have a uniform convergence bound on the difference of expected and empirical errors (|ǫt(h) − ǫ̂t(h)| and |ǫs(h) − ǫ̂s(h)|) using Theorem 4.1. [sent-184, score-0.298]

58 However, in the domain adaptation setting, we are also interested in bounds that relate the expected target error to the total empirical error on source and target samples. [sent-186, score-0.298]

59 3 Relate source expected error and target expected error The following theorem provides a bound on the difference of expected target error and expected source error. [sent-189, score-1.147]

60 The bound is in terms of ηs := ǫs(fs, ft), νs := ǫs(h*_t, ft) and νt := ǫt(h*_t, ft), where fs and ft are the source and target labeling functions, and h*_t is the optimal target hypothesis in the target hypothesis class. [sent-190, score-1.507]

61 On the contrary, if the domains are far apart, these terms will be big and the use of extra source samples may not help in learning a better target hypothesis. [sent-194, score-0.529]

62 For any two source and target hypotheses hs, ht (which belong to different hypothesis classes), we have ǫt(ht, ft) − ǫs(hs, fs) ≤ M ||ht − hs|| E_s[√k(x, x)] + (1/2) d_{Ht∆Ht}(Ds, Dt) + ηs + νs + νt, [sent-200, score-1.259]

63 where Ht is the target hypothesis class, and k(·, ·) is the reproducing kernel for the RKHS. [sent-201, score-0.391]

64 4 Relate target expected error with source and target empirical errors EA and EA++ learn source and target hypotheses jointly. [sent-206, score-1.365]

65 In this section, we aim to bound the target expected error in terms of source and target empirical errors. [sent-208, score-0.834]

66 Combining Theorems 4.1 and 4.2, with probability at least 1 − δ we have ǫt(ht, ft) ≤ (1/2)(ǫ̂s(hs, fs) + ǫ̂t(ht, ft)) + (1/2)(2M R̂m(Hs) + 2M R̂m(Ht)) + (1/2)(1/√ls + 1/√lt)(2 + 3√(ln(2/δ)/2)) + (1/2) M ||ht − hs|| E_s[√k(x, x)] + (1/4) d_{Ht∆Ht}(Ds, Dt) + (1/2)(ηs + νs + νt), for any hs and ht. [sent-214, score-0.809]

67 Hs and Ht are the source hypothesis class and the target hypothesis class, respectively. [sent-215, score-0.712]

68 This bound provides a better understanding of how the target expected error is governed by both the source and target empirical errors, and the hypothesis class complexities. [sent-221, score-0.942]

69 This behavior is expected since both EA and EA++ learn source and target hypotheses jointly. [sent-222, score-0.566]

70 The bound of Theorem 4.3 depends on ||hs − ht||, which might give the impression that the best possible thing to do is to make the source and target hypotheses equal. [sent-224, score-0.71]

71 However, the source and target hypotheses are learned jointly, by optimizing the cost function of Eq. 4.1. [sent-225, score-0.557]

72 Making the source and target hypotheses close will therefore increase the source empirical error, thus loosening the bound of Theorem 4.3. [sent-227, score-0.774]

73 Noticing that ||hs − ht||^2 ≤ 1/λ for both EA and EA++, the bound can be made independent of ||hs − ht||, although with a sacrifice in tightness. [sent-229, score-0.368]
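
For completeness, this observation follows in one line from the definition of H in Eq. 4.3 (the same argument applies to H++, whose constraint only adds non-negative terms):

```latex
% Since (h_s, h_t) \in H and every term in the constraint is non-negative:
\lambda \lVert h_s - h_t \rVert^2
  \;\le\; \lambda_1 \lVert h_s \rVert^2 + \lambda_2 \lVert h_t \rVert^2 + \lambda \lVert h_s - h_t \rVert^2
  \;\le\; 1
\quad\Longrightarrow\quad
\lVert h_s - h_t \rVert^2 \;\le\; \frac{1}{\lambda}.
```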

74 Theorem 4.1 can also be used to bound the target generalization error of EA and EA++ in terms of only the target empirical error. [sent-231, score-0.649]

75 However, if the number of labeled target samples is extremely low, this bound can be loose due to the inverse dependency on the number of target samples. [sent-232, score-0.688]

76 Theorem 4.3 bounds the target expected error using the averages of empirical errors, Rademacher complexities, and sample-dependent terms. [sent-234, score-0.369]

77 If the domains are reasonably close and the number of labeled source samples is much higher than the number of labeled target samples, this can provide a tighter bound compared to Theorem 4.1. [sent-235, score-0.629]

78 Finally, we need the Rademacher complexities of the source and target hypothesis classes (for both EA and EA++) to be able to use Theorem 4.3. [sent-237, score-0.602]

79 5 Bound the Complexity of EA and EA++ Hypothesis Classes The following theorems bound the Rademacher complexity of the target hypothesis classes for EA and EA++. [sent-240, score-0.476]

80 The complexity of the target class decreases with an increase in the values of the hyperparameters. [sent-252, score-0.328]

81 It decreases more rapidly with change in λ2 compared to λ and λ1 , which is also expected since λ2 is the hyperparameter directly influencing the target hypothesis. [sent-253, score-0.302]

82 To use Theorem 4.3, we need the Rademacher complexity of the source hypothesis class. [sent-258, score-0.323]

83 Since the paired hypothesis class of Eq. 4.3 is symmetric in h1 and h2 up to the scalar regularization parameters, the complexity of the source hypothesis class can be similarly bounded. [sent-261, score-0.321]

84 The complexity R̂m(J^s_EA) of the source class is then sandwiched between lower and upper bounds of the same form, scaling with C^s_EA/ls and a function of the regularization parameters, where (C^s_EA)^2 = tr(A), and A is the kernel block sub-matrix corresponding to source samples. [sent-263, score-0.306]

85 So, the unlabeled data results in a reduction of complexity over the labeled data case (Theorem 4.4). [sent-272, score-0.266]

86 Since Ei is the vector representing the inner product of the i-th target sample with all unlabeled samples, this means that the reduction in complexity is proportional to the similarity between target unlabeled samples and target labeled samples. [sent-275, score-1.297]

87 Theorem 4.1 gives a bound on the target generalization error in terms of the target empirical error. [sent-277, score-0.649]

88 To use Theorem 4.3, we need the Rademacher complexity of the source hypothesis class too. [sent-279, score-0.348]

89 The trace term can again be interpreted as before, which implies that the reduction in source class complexity is proportional to the similarity between source labeled samples and target unlabeled samples. [sent-284, score-0.965]

90 We quantify the domain divergences in terms of the A-distance [16], which is computed from finite samples of the source and target domains using the proxy A-distance [16, 17]. [sent-288, score-0.735]
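
The proxy A-distance is typically estimated by training a classifier to separate source samples from target samples and plugging its test error ε into 2(1 − 2ε); the sketch below follows that recipe (the use of a linear SVM, the 50/50 split and all names are assumptions rather than details taken from [16, 17]).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(Xs, Xt):
    """Estimate the proxy A-distance between two samples via a domain classifier."""
    X = np.vstack([Xs, Xt])
    y = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])   # 0 = source, 1 = target
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = LinearSVC().fit(X_tr, y_tr)
    err = np.mean(clf.predict(X_te) != y_te)                    # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)
```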

91 In all cases, the amount of unlabeled target data is equal to the total amount of labeled source and target data. [sent-294, score-1.008]

92 All these approaches were tested on the entire amount of available target test data. [sent-296, score-0.294]

93 For SOURCEONLY, TARGETONLY-FULL and TARGETONLY, it is just the corresponding number of labeled source or target samples, respectively. [sent-307, score-0.529]

94 For ALL and EA, it is the sum of labeled source and target samples. [sent-308, score-0.529]

95 For EA++, the x-value plotted denotes the amount of unlabeled target data used (in addition to an equal amount of source+target labeled data, as in ALL or EA). [sent-309, score-0.551]

96 We plot this number for EA++, just to compare its improvement over EA when using an additional (and equal) amount of unlabeled target data. [sent-310, score-0.461]

97 Hence, this domain pair is more amenable to domain adaptation, as demonstrated by the fact that the other approaches (SOURCEONLY, TARGETONLY, ALL) perform better than or at least as well as TARGETONLY-FULL. [sent-320, score-0.336]

98 With this extension, EA++ applies to both fully supervised and semi-supervised domain adaptation settings. [sent-330, score-0.266]

99 The difference is that, while in SSL one would try to make the two views (on unlabeled data) agree, in domain adaptation the aim is to make the two hypotheses in source and target agree. [sent-333, score-0.938]

100 If we have k sources and a single target domain then we can introduce a co-regularizer for each source-target pair. [sent-337, score-0.381]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ea', 0.735), ('target', 0.276), ('hs', 0.185), ('source', 0.181), ('ht', 0.17), ('unlabeled', 0.167), ('asya', 0.161), ('gc', 0.152), ('nly', 0.141), ('dapt', 0.131), ('adaptation', 0.126), ('hypothesis', 0.115), ('domain', 0.105), ('jea', 0.091), ('hypotheses', 0.083), ('gs', 0.083), ('ource', 0.081), ('labeled', 0.072), ('gt', 0.068), ('rademacher', 0.068), ('augmented', 0.064), ('ls', 0.064), ('lt', 0.061), ('ut', 0.052), ('ull', 0.05), ('ft', 0.048), ('fs', 0.048), ('apparel', 0.04), ('avishek', 0.04), ('dvd', 0.04), ('qea', 0.04), ('ds', 0.039), ('hal', 0.038), ('samples', 0.036), ('domains', 0.036), ('sentiment', 0.035), ('theorem', 0.035), ('supervised', 0.035), ('daum', 0.034), ('rm', 0.033), ('proxy', 0.032), ('ll', 0.032), ('books', 0.03), ('fernando', 0.03), ('kitchen', 0.03), ('agree', 0.03), ('abhishek', 0.03), ('classes', 0.03), ('classi', 0.03), ('bound', 0.028), ('june', 0.027), ('complexity', 0.027), ('blitzer', 0.026), ('expected', 0.026), ('empirical', 0.025), ('class', 0.025), ('kf', 0.024), ('koby', 0.024), ('acl', 0.023), ('dt', 0.023), ('er', 0.023), ('relate', 0.023), ('error', 0.022), ('tr', 0.022), ('generalization', 0.022), ('avrim', 0.022), ('bounds', 0.02), ('danlp', 0.02), ('dht', 0.02), ('srconly', 0.02), ('ssl', 0.02), ('tgtonly', 0.02), ('wenyuan', 0.02), ('yong', 0.02), ('conjunction', 0.02), ('errors', 0.019), ('vancouver', 0.019), ('regularizer', 0.019), ('amount', 0.018), ('copies', 0.018), ('finland', 0.018), ('helsinki', 0.018), ('cea', 0.018), ('dai', 0.018), ('republic', 0.018), ('please', 0.017), ('loss', 0.017), ('cost', 0.017), ('base', 0.017), ('dh', 0.017), ('cs', 0.017), ('july', 0.016), ('commonality', 0.016), ('czech', 0.016), ('dredze', 0.016), ('frustratingly', 0.016), ('saha', 0.016), ('thought', 0.015), ('nlp', 0.015), ('prague', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

Author: Abhishek Kumar, Avishek Saha, Hal Daume

Abstract: This paper presents a co-regularization based approach to semi-supervised domain adaptation. Our proposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT (EA) [1]) and harnesses unlabeled data in the target domain to further assist the transfer of information from source to target. This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ shows that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as a few other representative baseline approaches.

2 0.2700173 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions

Author: Peter Stobbe, Andreas Krause

Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efÄ?Ĺš ciently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classiÄ?Ĺš cation-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude. 1

3 0.19348519 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

Author: Alessandro Bergamo, Lorenzo Torresani

Abstract: Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of different object classes. In order to address this shortcoming, in recent years several authors have proposed to learn object classifiers from weakly-labeled Internet images, such as photos retrieved by keyword-based image search engines. While this strategy eliminates the need for human supervision, the recognition accuracies of these methods are considerably lower than those obtained with fully-supervised approaches, because of the noisy nature of the labels associated to Web data. In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search. We cast this as a domain adaptation problem: given a few strongly-labeled examples in a target domain (the manually annotated examples) and many source domain examples (the weakly-labeled Web photos), learn classifiers yielding small generalization error on the target domain. Our experiments demonstrate that, for the same number of strongly-labeled examples, our domain adaptation approach produces significant recognition rate improvements over the best published results (e.g., 65% better when using 5 labeled training examples per class) and that our classifiers are one order of magnitude faster to learn and to evaluate than the best competing method, despite our use of large weakly-labeled data sets.

4 0.10066566 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log 1/ǫ), contrasting to single-view setting where the polynomial improvement is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O(1/ǫ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise.

5 0.085533179 167 nips-2010-Mixture of time-warped trajectory models for movement decoding

Author: Elaine Corbett, Eric Perreault, Konrad Koerding

Abstract: Applications of Brain-Machine-Interfaces typically estimate user intent based on biological signals that are under voluntary control. For example, we might want to estimate how a patient with a paralyzed arm wants to move based on residual muscle activity. To solve such problems it is necessary to integrate obtained information over time. To do so, state of the art approaches typically use a probabilistic model of how the state, e.g. position and velocity of the arm, evolves over time – a so-called trajectory model. We wanted to further develop this approach using two intuitive insights: (1) At any given point of time there may be a small set of likely movement targets, potentially identified by the location of objects in the workspace or by gaze information from the user. (2) The user may want to produce movements at varying speeds. We thus use a generative model with a trajectory model incorporating these insights. Approximate inference on that generative model is implemented using a mixture of extended Kalman filters. We find that the resulting algorithm allows us to decode arm movements dramatically better than when we use a trajectory model with linear dynamics.

6 0.075944312 138 nips-2010-Large Margin Multi-Task Metric Learning

7 0.071589172 27 nips-2010-Agnostic Active Learning Without Constraints

8 0.071390025 25 nips-2010-Active Learning by Querying Informative and Representative Examples

9 0.070046216 243 nips-2010-Smoothness, Low Noise and Fast Rates

10 0.063898504 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

11 0.062569879 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

12 0.062314607 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

13 0.060509376 261 nips-2010-Supervised Clustering

14 0.056587968 23 nips-2010-Active Instance Sampling via Matrix Partition

15 0.055099044 142 nips-2010-Learning Bounds for Importance Weighting

16 0.053211551 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

17 0.049902219 46 nips-2010-Causal discovery in multiple models from different experiments

18 0.046093248 182 nips-2010-New Adaptive Algorithms for Online Classification

19 0.045150559 191 nips-2010-On the Theory of Learnining with Privileged Information

20 0.044782482 64 nips-2010-Distributionally Robust Markov Decision Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.025), (2, 0.058), (3, -0.027), (4, 0.049), (5, 0.051), (6, -0.09), (7, -0.059), (8, 0.089), (9, -0.052), (10, 0.104), (11, 0.029), (12, 0.011), (13, -0.11), (14, -0.09), (15, 0.148), (16, -0.028), (17, -0.057), (18, -0.014), (19, 0.003), (20, -0.079), (21, 0.09), (22, -0.021), (23, 0.014), (24, 0.002), (25, 0.022), (26, -0.095), (27, 0.066), (28, 0.14), (29, -0.083), (30, 0.079), (31, 0.122), (32, -0.159), (33, -0.071), (34, 0.012), (35, 0.024), (36, -0.03), (37, 0.281), (38, 0.067), (39, -0.035), (40, 0.093), (41, -0.07), (42, 0.027), (43, -0.02), (44, -0.023), (45, 0.107), (46, -0.088), (47, -0.104), (48, 0.048), (49, -0.188)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96357799 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

Author: Abhishek Kumar, Avishek Saha, Hal Daume

Abstract: This paper presents a co-regularization based approach to semi-supervised domain adaptation. Our proposed approach (EA++) builds on the notion of augmented space (introduced in EASYADAPT (EA) [1]) and harnesses unlabeled data in the target domain to further assist the transfer of information from source to target. This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ shows that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as a few other representative baseline approaches.

2 0.57488412 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

Author: Alessandro Bergamo, Lorenzo Torresani

Abstract: Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of different object classes. In order to address this shortcoming, in recent years several authors have proposed to learn object classifiers from weakly-labeled Internet images, such as photos retrieved by keyword-based image search engines. While this strategy eliminates the need for human supervision, the recognition accuracies of these methods are considerably lower than those obtained with fully-supervised approaches, because of the noisy nature of the labels associated to Web data. In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search. We cast this as a domain adaptation problem: given a few strongly-labeled examples in a target domain (the manually annotated examples) and many source domain examples (the weakly-labeled Web photos), learn classifiers yielding small generalization error on the target domain. Our experiments demonstrate that, for the same number of strongly-labeled examples, our domain adaptation approach produces significant recognition rate improvements over the best published results (e.g., 65% better when using 5 labeled training examples per class) and that our classifiers are one order of magnitude faster to learn and to evaluate than the best competing method, despite our use of large weakly-labeled data sets.

3 0.51515621 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

Author: Wei Wang, Zhi-hua Zhou

Abstract: The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log(1/ǫ)), contrasting to single-view setting where the polynomial improvement is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O(1/ǫ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of 1/ǫ is related to the parameter in Tsybakov noise.

4 0.48380536 2 nips-2010-A Bayesian Approach to Concept Drift

Author: Stephen Bach, Mark Maloof

Abstract: To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods. 1
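As a rough illustration of the drift-point posterior described in this abstract, the sketch below keeps a weight for every candidate "most-recent drift" time, multiplies each weight by the predictive probability its model assigns to the incoming observation, adds a fresh candidate at the current time, and prunes candidates with negligible mass. The model interface (`predictive_prob`, `partial_fit`), the prior mass given to a new drift point, and the pruning threshold are assumptions made for illustration, not details taken from the paper.

```python
class DriftPosterior:
    """Posterior over the most-recent drift point (illustrative sketch).

    Each candidate drift time c keeps a model fitted only to observations
    seen since c, plus a weight proportional to how well that model has
    predicted the stream so far.
    """

    def __init__(self, make_model, new_point_prior=0.05, prune_threshold=1e-3):
        self.make_model = make_model            # factory returning a fresh predictive model (assumed interface)
        self.new_point_prior = new_point_prior  # prior mass for "drift happened now"
        self.prune_threshold = prune_threshold
        self.candidates = {}                    # drift time -> [model, weight]
        self.t = 0

    def update(self, x, y):
        # Bayesian model comparison: reweight each candidate by the predictive
        # probability its model assigns to the new observation (x, y).
        for c in self.candidates:
            model, w = self.candidates[c]
            self.candidates[c] = [model, w * model.predictive_prob(x, y)]

        # Hypothesize that a drift occurred at the current time step.
        self.candidates[self.t] = [self.make_model(), self.new_point_prior]

        # Renormalize and prune candidates with negligible posterior mass.
        total = sum(w for _, w in self.candidates.values())
        self.candidates = {c: [m, w / total]
                           for c, (m, w) in self.candidates.items()
                           if w / total > self.prune_threshold}

        # Surviving models absorb the new observation.
        for model, _ in self.candidates.values():
            model.partial_fit(x, y)
        self.t += 1

    def map_drift_point(self):
        # Most probable location of the most-recent drift.
        return max(self.candidates, key=lambda c: self.candidates[c][1])
```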

5 0.43317384 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions

Author: Peter Stobbe, Andreas Krause

Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude.
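To make the "decomposable" class concrete: a set function F(S) = sum_j g_j(sum_{i in S} w_{ji}) with nonnegative weights w_{ji} and concave, nondecreasing g_j is submodular, and sums of such terms are exactly the sums of concave functions applied to modular functions mentioned above. The toy evaluator below (not the SLG solver) just computes F for a given set; the choice of sqrt as the concave function and the random weights are illustrative.

```python
import numpy as np

def decomposable_submodular(S, weights, concave=np.sqrt):
    """Evaluate F(S) = sum_j concave(sum_{i in S} weights[j, i]).

    Each inner sum is a nonnegative modular function of S, so applying a
    concave nondecreasing function and summing over j yields a submodular
    (decomposable) set function. Toy evaluator only.
    """
    idx = sorted(S)
    modular = weights[:, idx].sum(axis=1)   # one modular value per term j
    return float(concave(modular).sum())

# Tiny instance: 3 concave-of-modular terms over a ground set of 5 elements.
rng = np.random.default_rng(0)
W = rng.random((3, 5))
print(decomposable_submodular({0, 2}, W))      # F({0, 2})
print(decomposable_submodular({0, 2, 4}, W))   # adding elements gives diminishing returns
```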

6 0.42608351 167 nips-2010-Mixture of time-warped trajectory models for movement decoding

7 0.4132407 274 nips-2010-Trading off Mistakes and Don't-Know Predictions

8 0.36345938 27 nips-2010-Agnostic Active Learning Without Constraints

9 0.35925731 142 nips-2010-Learning Bounds for Importance Weighting

10 0.33606181 15 nips-2010-A Theory of Multiclass Boosting

11 0.33501586 25 nips-2010-Active Learning by Querying Informative and Representative Examples

12 0.33070537 64 nips-2010-Distributionally Robust Markov Decision Processes

13 0.32089889 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

14 0.29932395 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

15 0.29867294 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

16 0.29289955 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

17 0.28798521 114 nips-2010-Humans Learn Using Manifolds, Reluctantly

18 0.28746828 138 nips-2010-Large Margin Multi-Task Metric Learning

19 0.28703144 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

20 0.2805604 261 nips-2010-Supervised Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.284), (13, 0.057), (17, 0.012), (27, 0.029), (30, 0.067), (35, 0.023), (45, 0.169), (50, 0.034), (52, 0.037), (53, 0.011), (60, 0.024), (77, 0.055), (78, 0.054), (90, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74702632 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation

Author: Abhishek Kumar, Avishek Saha, Hal Daume

Abstract: This paper presents a co-regularization based approach to semi-supervised domain adaptation. Our proposed approach (EA++) builds on the notion of augmented space (introduced in E ASYA DAPT (EA) [1]) and harnesses unlabeled data in target domain to further assist the transfer of information from source to target. This semi-supervised approach to domain adaptation is extremely simple to implement and can be applied as a pre-processing step to any supervised learner. Our theoretical analysis (in terms of Rademacher complexity) of EA and EA++ show that the hypothesis class of EA++ has lower complexity (compared to EA) and hence results in tighter generalization bounds. Experimental results on sentiment analysis tasks reinforce our theoretical findings and demonstrate the efficacy of the proposed method when compared to EA as well as few other representative baseline approaches.

2 0.59851635 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

3 0.58269823 265 nips-2010-The LASSO risk: asymptotic results and real world examples

Author: Mohsen Bayati, José Pereira, Andrea Montanari

Abstract: We consider the problem of learning a coefficient vector x_0 ∈ R^N from noisy linear observation y = A x_0 + w ∈ R^n. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1-penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
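Since this abstract centers on the ℓ1-penalized least squares problem, here is a small proximal-gradient (ISTA) sketch that solves that objective on a toy instance. It is a generic LASSO solver for illustration, not the AMP algorithm whose analysis the paper develops; the step-size choice, regularization level, and toy problem sizes are arbitrary.

```python
import numpy as np

def ista_lasso(A, y, lam, n_iter=500):
    """Solve min_x 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by iterative
    soft-thresholding (ISTA), a standard proximal-gradient method for
    the BPDN/LASSO objective."""
    n, N = A.shape
    x = np.zeros(N)
    # Step size from the spectral norm of A (Lipschitz constant of the gradient).
    L = np.linalg.norm(A, 2) ** 2
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - grad / L
        # Soft-thresholding: the proximal operator of the l1 penalty.
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

# Toy instance: sparse x0, Gaussian A, noisy observations.
rng = np.random.default_rng(1)
n, N, k = 50, 100, 5
A = rng.normal(size=(n, N)) / np.sqrt(n)
x0 = np.zeros(N)
x0[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = A @ x0 + 0.01 * rng.normal(size=n)
x_hat = ista_lasso(A, y, lam=0.01)
print("recovered support:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
```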

4 0.58219337 117 nips-2010-Identifying graph-structured activation patterns in networks

Author: James Sharpnack, Aarti Singh

Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
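One simple instance of a Laplacian-eigenbasis estimator of the kind analyzed in this abstract is to project the noisy node measurements onto the k smoothest eigenvectors of the combinatorial graph Laplacian. The sketch below does this for a chain graph with a contiguous activated block; the value of k, the noise level, and the graph are chosen purely for illustration and are not taken from the paper.

```python
import numpy as np

def laplacian_denoise(adjacency, noisy_signal, k):
    """Project a node-level signal onto the first k eigenvectors of the
    combinatorial graph Laplacian, i.e. the smoothest basis functions of
    the graph, and return the projection as the denoised estimate."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the first k columns are the low-frequency eigenvectors.
    _, eigvecs = np.linalg.eigh(laplacian)
    basis = eigvecs[:, :k]
    return basis @ (basis.T @ noisy_signal)   # orthogonal projection

# Toy example: a chain graph with an activated contiguous block plus noise.
rng = np.random.default_rng(2)
p = 30
A = np.zeros((p, p))
for i in range(p - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
signal = np.zeros(p)
signal[10:20] = 1.0
denoised = laplacian_denoise(A, signal + 0.5 * rng.normal(size=p), k=5)
```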

5 0.58212698 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

Author: Alessandro Chiuso, Gianluigi Pillonetto

Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments regarding estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization. 1

6 0.58053964 243 nips-2010-Smoothness, Low Noise and Fast Rates

7 0.57898819 63 nips-2010-Distributed Dual Averaging In Networks

8 0.57850415 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

9 0.5776037 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

10 0.5775556 27 nips-2010-Agnostic Active Learning Without Constraints

11 0.57741982 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers

12 0.57692999 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

13 0.57689106 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

14 0.57665241 148 nips-2010-Learning Networks of Stochastic Differential Equations

15 0.57631201 282 nips-2010-Variable margin losses for classifier design

16 0.57580692 182 nips-2010-New Adaptive Algorithms for Online Classification

17 0.57525474 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

18 0.57466269 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

19 0.57425958 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

20 0.57397252 290 nips-2010-t-logistic regression