Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji
Abstract: In conventional approaches to supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method learning with target priors (LTP). Specifically, LTP learning seeks the parameter θ that maximizes the log likelihood of fθ(X) on an uncorresponded training set with respect to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video.
1 Introduction
One of the central problems in machine learning is prediction/inference: given an input datum X, we would like to predict or infer the value of a target variable of interest, y, assuming X and y have some intrinsic relationship.
The prediction/inference task in many practical applications involves high dimensional and structured data and target variables.
In the Bayesian approach, our knowledge about input and target variables, as well as their relationships, are all represented as probability distributions.
The posterior distribution can be explicitly constructed from the target prior, p(y), which encodes our knowledge on the internal structure of the target y, and the likelihood, p(X|y), which summarizes the process of generating X from y, as p(y|X) ∝ p(X|y)p(y).
Or it can be directly handled as in the conditional random fields [9] without referring to the target prior or the likelihood.
The advantage of the Bayesian approach is that it incorporates prior knowledge about data and target variables into the prediction/inference task in a principled manner.
In this work, we describe a new approach to learning a parametric regressor fθ(X), which we term learning with target priors (LTP).
In many practical applications, the target variables y follow some regular spatial and temporal patterns that can be described probabilistically, and the observed target variables are samples of such distributions.
Such regular patterns can benefit the task of decoding the finger movements from ECoG signals in a brain computer interface (BCI) system, Fig.
In LTP learning, we incorporate such spatial and temporal regular patterns of the target variables into the learning framework.
Specifically, we learn a probability distribution p(y) that captures the spatial and temporal regularities of the target variable y, then we estimate the function parameters θ by maximizing the log-likelihood of the output y = fθ(X) with respect to the prior distribution.
LTP learning can be applied both to unsupervised learning, in which no corresponded inputs and outputs are available, and to semi-supervised learning, in which part of the corresponding outputs are available.
We demonstrate the effectiveness of LTP learning in two problems: BCI decoding and pose estimation.
In Sections 4 and 5, details on deployment and experimental evaluation of this general framework in two applications, namely BCI decoding and pose estimation from video, are described.
The prior knowledge about the target variables in classification problems is exploited in recent works as learning with uncertain labels, in which the distribution over the target class labels for each data example is used in place of corresponding pairs of data/target variables [10].
Several works directly embed domain constraints about the target variables in learning.
For instance, constraint driven learning (CODL) [3] enforces task specific constraints on the target labels by appending a penalty term in the objective function.
Posterior regularization [5] directly imposes regularization on the posterior of the latent target variables, of which CODL can be seen as a special case with MAP approximation.
However, all these approaches have only been applied to problems with discrete outputs (classification or labeling) and may be difficult to extend to incorporate complex dependencies in high-dimensional continuous target variables.
Dependencies in the target variables can be directly modeled in conditional random fields (CRF) [9], as a probabilistic graphical model between the output components.
Some of the recent supervised parametric learning methods can take advantage of some structure constraints over the target variables.
These methods can be viewed as special cases of LTP learning, where general probabilistic models for target variables can be incorporated.
3 General Framework
In this section, we describe the general framework of learning with target priors.
Specifically, our task is to learn the parameter θ in a parametric family of functions of X, fθ(X), to best predict the corresponding target variable y.
Both the data and the target variable can be high-dimensional.
Knowledge about the target variable is provided through a target prior model in the form of a parametric probability distribution, pη(y), with model parameter η.
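As a compact restatement of the criterion described so far, the LTP estimate can be written as below. This is only a sketch of what the prose states (maximize the log-likelihood of the regressor outputs under the target prior). Any additional regularization that the full objective may contain is not shown, because it is not given in this text.

```latex
\theta^{*} = \arg\max_{\theta} \; \log p_{\eta}\bigl(f_{\theta}(X)\bigr)
```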
In the following, we apply the LTP learning to unsupervised learning in which no corresponded input and output are available, as well as semi-supervised learning in which part of corresponding outputs are available.
For the unsupervised learning, assume we are given a set of outputs $y \in \mathbb{R}^{Y \times m}$, as well as a set of uncorresponded inputs $X \in \mathbb{R}^{X \times n}$, where Y and X are the dimensionalities, and m and n are the temporal lengths of y and X respectively.
This is applicable to the case of BCI where it is easier to gather inputs X or structured targets y than it is to gather corresponded inputs and targets (X, y).
In many real BCI applications, the input brain signals X are collected while the subject only thinks about a movement, without any actual body movement y.
The body movements could be easily collected when the brain signals are not being recorded.
In both the finger movement decoding and pose estimation, y and X could be extracted from different subjects.
A prior model $p_\eta(y)$ is learned from $\{y_i\}_{i=1}^{m}$, where $y_i \in \mathbb{R}^{Y \times 1}$ and η is the parameter of the prior model.
In the semi-supervised case, the parameter θ is chosen so that the outputs not only minimize the loss function on the training data, but also make the predicted targets on the unlabeled data comply with the target prior.
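The trade-off between fitting the paired data and complying with the target prior can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the squared loss, the weight `lam`, the function names, and above all the independent Gaussian log-prior, which is only a stand-in for a learned prior model.

```python
import math


def gaussian_log_prior(y, mean=0.0, std=1.0):
    """Log-density of a scalar Gaussian (stand-in for a learned target prior)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)


def predict(theta, x):
    """Linear regressor: inner product of the parameter and feature vectors."""
    return sum(t * xi for t, xi in zip(theta, x))


def semi_supervised_objective(theta, labeled, unlabeled, lam=0.1,
                              log_prior=gaussian_log_prior):
    """Squared loss on paired data minus lam times the prior log-likelihood
    of the predictions on unpaired inputs (lower is better)."""
    loss = sum((predict(theta, x) - y) ** 2 for x, y in labeled)
    prior_term = sum(log_prior(predict(theta, x)) for x in unlabeled)
    return loss - lam * prior_term


labeled = [([1.0, 0.0], 2.0)]   # one paired (x, y) example
unlabeled = [[0.0, 1.0]]        # one input without a target
good = semi_supervised_objective([2.0, 0.0], labeled, unlabeled)
bad = semi_supervised_objective([0.0, 0.0], labeled, unlabeled)
```

In this toy setting the parameter that fits the paired example gets the lower objective value, while the prior term only constrains the prediction on the unpaired input.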
Next, we adapt unsupervised/semi-supervised learning with LTP to prediction/inference in two applications, namely decoding ECoG signals to predict finger movement in BCI and estimating body poses from videos, where state-of-the-art performance is achieved.
Many recent studies in neurobiology have suggested that electrocorticographic (ECoG) signals recorded near the brain surface show strong correlations with limb motions [2, 8].
ECoG signal decoding is the critical step in ECoG based BCI systems, the goal of which is to obtain a functional mapping between the ECoG signals and the kinematic variables (e.
The ECoG decoding problem has been widely solved with supervised parametric learning [26, 8, 25], where corresponded ECoG signals and target kinematic variables are collected from one subject and used to train a parametric regressor.
However, a decoder learned from data collected from one subject in a controlled experiment usually has trouble generalizing to the same subject over time and in an open environment (temporal generalization) [18], or to signals from other subjects (cross-subject generalization) [24].
To generalize better across subjects, a collaborative paradigm was proposed to integrate information from multiple subjects [24].
The work in [17] reports that certain spectral features of ECoG signals can be used across subjects to classify movements.
At the same time, in BCI it is typically much easier to gather samples of uncorresponded target variables, i.e., traces of finger movements recorded by digital gloves, than it is to gather corresponding pairs of training samples.
Thus in this work, we propose to improve the temporal and cross-subject generalization of BCI decoders with the learning with target priors framework.
In the first step, we obtain a parametric target prior model using uncorresponded samples of the target data, in this case, the traces of finger positions.
We first define the notation used subsequently. We use a linear decoding function, $f_\theta(X) = X^{T}\theta$, to predict the traces of finger movements y as the target variable.
Linear decoding functions are widely used in BCI decoding [1] for their simplicity and run-time efficiency in constructing hardware-based BCI systems.
1 Target Prior Model
We use the Gaussian-Bernoulli restricted Boltzmann machine (GB-RBM) [14] as the parametric target prior model: $p_\eta(y) = \frac{1}{Z} \sum_{h} e^{-E_\eta(y,h)}$, where Z is the normalizing constant and $h \in \{0, 1\}^{H}$ are binary hidden variables.
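For a GB-RBM, the sum over the binary hidden variables has a closed form, so log p(y) can be evaluated up to the constant log Z without enumerating hidden states. The sketch below assumes the commonly used unit-variance energy E(y, h) = ½‖y − b‖² − cᵀh − yᵀWh. The energy function is not spelled out in this text, so this parameterization and the names `W`, `b`, `c` are assumptions.

```python
import itertools
import math


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def energy(y, h, W, b, c):
    """Assumed unit-variance GB-RBM energy E(y, h)."""
    quad = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, b))
    hidden = sum(cj * hj for cj, hj in zip(c, h))
    inter = sum(y[i] * W[i][j] * h[j]
                for i in range(len(y)) for j in range(len(h)))
    return quad - hidden - inter


def unnormalized_log_prior(y, W, b, c):
    """log sum_h exp(-E(y, h)), with the hidden units summed out analytically."""
    quad = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, b))
    activations = [c[j] + sum(y[i] * W[i][j] for i in range(len(y)))
                   for j in range(len(c))]
    return -quad + sum(softplus(a) for a in activations)


def brute_force_log_prior(y, W, b, c):
    """Same quantity by enumerating all 2^H hidden states (small H only)."""
    states = itertools.product((0, 1), repeat=len(c))
    return math.log(sum(math.exp(-energy(y, h, W, b, c)) for h in states))


# Toy parameters: 2 visible units, 3 hidden units.
W = [[0.5, -0.2, 0.1], [0.4, 0.3, -0.7]]
b = [0.0, 0.1]
c = [0.2, -0.1, 0.0]
y = [0.3, -1.2]
closed_form = unnormalized_log_prior(y, W, b, c)
enumerated = brute_force_log_prior(y, W, b, c)
```

The closed form costs one pass over the hidden units, whereas the enumeration grows as 2^H and is included only as a check.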
The target variable y is normalized to have zero mean and unit standard deviation.
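The normalization step can be sketched as a per-dimension z-score. The use of the population standard deviation and the name `zscore` are assumptions; the text does not say which estimator was used.

```python
import statistics


def zscore(values):
    """Shift and scale a sequence to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        raise ValueError("a constant signal cannot be normalized")
    return [(v - mean) / std for v in values]


normalized = zscore([1.0, 2.0, 3.0, 4.0])
```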
2 Learning Regressor Parameter θ
With training data and the GB-RBM as the target prior model, we optimize the objective function of LTP in Eq.
3 Experimental Settings
The ECoG data and target finger movement variables were collected in a clinical setting from five subjects (A-E) who underwent brain surgery [8].
For each channel, features are extracted based on signal power of three bands (1-60Hz, 60-100Hz, 100-200Hz) [2], which results in 144 or 204 features for subjects with 48 or 64 channels, respectively.
4 Learning Target Prior Model and Decoding Function
The training data for the prior model pη(y) come either from other subjects or from the same subject at a different time, and they have no correspondence with the training input data.
5 Generalization Across Subjects
We learn the decoding function for new subjects by deploying the unsupervised LTP learning in Section 3.
Even though it is difficult to get the corresponded samples from new subjects, we always have the input ECoG signals, whose features will be used as the input of the unsupervised LTP learning.
We compare the unsupervised LTP learning with linear regression [2] in two ways: 1) linear regression (intra subject), in which the corresponded data and target variables are available.
The accuracy of linear regression is calculated based on five fold cross-validation, that is, 4/5 trials (25 trials) are used for training and 1/5 trials (5 trials) are used for testing.
2) linear regression (inter subjects), in which the regressor is trained on one subject and tested on the others.
Table 1: Results on thumb of subjects based on 2 fold cross validation (correlation coefficient). The results for inter subjects are calculated based on 5 fold cross-validation (each time one subject is used for training and the model is tested on the other four subjects).
Linear regression is trained on pairs of features and targets while LTP only uses the targets to train the prior model.
For the linear regression trained and tested on different subjects, the channels across subjects are aligned by the 3-d position of the sensors.
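The results are reported as correlation coefficients between decoded and recorded traces. A minimal sketch of that metric follows, assuming the Pearson correlation is meant; the function name and the toy traces are hypothetical.

```python
import math


def pearson_r(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


same_shape = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
mirrored = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient is invariant to offset and positive scaling, it measures whether the decoded trace follows the shape of the movement rather than its absolute amplitude.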
Note that the performance of the unsupervised LTP learning is on par with that of the linear regression (intra) on subjects A, B, C and D, which suggests that the decoder learned
On the other hand, not surprisingly, the performance of linear regression (inter subjects) suggests that it cannot be extended across subjects, which is due to the differences between the brains of different subjects, as stated above.
The generalization ability gained by unsupervised LTP learning is mainly because it directly learns decoding functions on the new subject without using brain signals from existing subjects, which are believed to change dramatically among subjects.
One thing we noticed is that the unsupervised LTP learning does not work well on subject E, because the thumb movement speed of subject E is much slower than that of subject A, on which the prior model is trained.
This suggests that the quality of the target prior model is critical for the performance.
Figure 3: (A) Comparison among three models across subjects; (B) Sample results for subject A; (C) Sample results for subject B.
6 Online Learning for Decoding Functions
In the next set of experiments, we use the learning with target priors framework for learning decoding functions that generalize over time.
The new samples come sequentially, and thus we want the decoding function to be updated online.
Then the decoding function with parameter θ is used to decode the first batch $\{X_j\}_{j=1}^{Y}$.
After the batch $\{X_j\}_{j=1}^{Y}$ is decoded, $\{X_j\}_{j=1}^{Y}$, not including the predicted target variables, is included as part of the unlabeled training data to update the parameter θ by the semi-supervised learning in Section 3.
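The decode-then-update cycle can be sketched as a short loop. The function `online_ltp` and the callables `decode` and `update` are hypothetical placeholders: `update` stands for whatever semi-supervised learner is used, and the toy versions below exist only to show the order of operations.

```python
def online_ltp(theta, labeled, batches, update, decode):
    """Decode each incoming batch with the current parameter, then add the
    batch inputs (without the predictions) to the unlabeled pool and update."""
    unlabeled = []
    predictions = []
    for batch in batches:
        # 1) Decode the new batch with the parameter learned so far.
        predictions.append([decode(theta, x) for x in batch])
        # 2) Keep only the inputs; predicted targets are not reused as labels.
        unlabeled.extend(batch)
        # 3) Re-estimate the parameter from paired data plus all seen inputs.
        theta = update(theta, labeled, unlabeled)
    return theta, predictions


# Toy placeholders, not the learner used in the experiments.
pool_sizes = []


def toy_decode(theta, x):
    return theta * x


def toy_update(theta, labeled, unlabeled):
    pool_sizes.append(len(unlabeled))
    return theta + 1.0


final_theta, decoded = online_ltp(1.0, [(1.0, 1.0)], [[1.0, 2.0], [3.0]],
                                  toy_update, toy_decode)
```

Each batch is decoded before it contributes to training, so the predictions always come from a parameter that has not yet seen that batch.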
Generally, we are trying to make maximal use of the "seen" data to prepare the decoding function for the "unseen" incoming samples.
The model is tested on the thumb of five subjects based on 2 fold cross validation, that is, we treat the first 15 trials as the paired data/target variables and then test online on the remaining trials.
This means that by regularizing the new features with the target prior, the semi-supervised learning in Section 3 successfully obtains information from the new features and adapts the decoders well to newly arriving samples.
5 Pose Estimation from Videos
In this section, we apply learning with target priors to the pose estimation problem, the goal of which is to extract 3D human pose from images or video sequences.
We will show that the algorithms learned by LTP generalize better both across subjects and over time on the same subject.
sGPLVM models a latent space shared by pose and image features through GPLVM, while FOLS-GPLVM models a shared latent space and a private latent space for each part.
The training of both sGPLVM and FOLS-GPLVM requires corresponded images and poses (X, y), while LTP does not.
For the unsupervised LTP learning, the target prior model is trained on the subspace of the joint angles $\{y_i\}_{i=1}^{n}$ on sequence 1 and tested on the features of all 6 sequences.
Ridge regression, sGPLVM and FOLS-GPLVM are trained on the first sequence with paired samples $\{X_i, y_i\}_{i=1}^{n}$ and tested on all 6 sequences.
We can see that when testing on the sequence from the same subject (sequence 2), unsupervised LTP learning is not the best.
In contrast, when testing on the sequences from subjects B and C, unsupervised LTP learning achieves the best results, which are slightly better than those of sGPLVM.
Considering that only linear dimension reduction and a linear function are assumed for unsupervised LTP learning and paired samples are not required, unsupervised LTP learning is even more competitive.
Thus the experiments demonstrate that the algorithm learned by unsupervised
Table 2: Mean absolute joint angle error when the prior model is trained on the first sequence and tested on all sequences.
The reason that ridge regression, sGPLVM and FOLS-GPLVM do not generalize well is that the relations between poses and images are learned solely from corresponded poses and images, and these relations may fail to hold for new subjects due to many factors (i.
LTP avoids this problem by learning the relations using the generalizable prior distribution over the targets and the images from the new subjects.
In this experiment, for each subject we treat the first sequence as the paired samples $\{X_i, y_i\}_{i=1}^{m}$ and estimate the 3-D pose of the second sequence $\{X_i\}_{i=1}^{n}$.
The prior model is trained on the joint angles of the first sequence $\{y_i\}_{i=1}^{m}$.
6 Conclusion and Discussion
In this work, we describe a new learning scheme for parametric learning, known as learning with target priors, that uses a prior model over the target variables and a set of uncorresponded data in training.
Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning.
Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical, such as on-line BCI signal decoding.
We demonstrate the effectiveness of the proposed approach in terms of generalization on parametric regression tasks for BCI signal decoding and pose estimation from video.
First, in the current work we only use a simple target prior model in the form of GB-RBM.
Anatomically constrained decoding of finger flexion from electrocorticographic signals.
