jmlr jmlr2007 jmlr2007-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
Reference: text
sentIndex sentText sentNum sentScore
1 The City College of The City University of New York, Convent Avenue @ 138th Street, New York, NY 10031, USA. Editor: Leslie Pack Kaelbling. Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. [sent-9, score-0.137]
2 A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. [sent-11, score-0.2]
3 The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. [sent-12, score-0.085]
4 Introduction The work presented in this paper is motivated by the analysis of functional brain imaging signals recorded with functional magnetic resonance imaging (fMRI) or electro- or magneto-encephalography (EEG/MEG). [sent-14, score-0.435]
5 These imaging modalities record brain activity across time at multiple locations, providing spatio-temporal data. [sent-15, score-0.262]
6 The design of a brain imaging experiment often includes multiple repetitions or trials. [sent-16, score-0.2]
7 Hence, brain imaging data is often given as a three-dimensional array including space, time, and trials. [sent-18, score-0.2]
8 These methods include principal component analysis (Squires et al. [sent-21, score-0.068]
9 Linear decomposition of a data matrix $X \in \mathbb{R}^{D \times T}$ using PCA or ICA involves estimation of the factors of the model $(X)_{ij} \approx \sum_{k=1}^{K} (a_k)_i (s_k)_j$, where $a_k \in \mathbb{R}^D$, $s_k \in \mathbb{R}^T$, and $K$ denotes the number of components in the model. [sent-28, score-0.161]
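To make the factor model concrete, here is a minimal NumPy sketch (not from the paper; sizes and variable names are illustrative) showing that the best rank-K least-squares factorization of a space-by-time matrix is obtained from a truncated SVD, which is what a PCA-style decomposition amounts to:

```python
import numpy as np

# Hypothetical sizes: D channels, T time samples, K components.
D, T, K = 64, 128, 3
X = np.random.randn(D, T)           # stand-in for a data matrix X in R^{D x T}

# Truncated SVD gives the best rank-K least-squares approximation,
# i.e., (X)_ij ~= sum_k (a_k)_i (s_k)_j with a_k = U[:, k] * S[k], s_k = Vt[k, :].
U, S, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :K] * S[:K]                # spatial factors a_k as columns
Smat = Vt[:K, :]                    # temporal factors s_k as rows

X_hat = A @ Smat
print("rank-K reconstruction error:", np.linalg.norm(X - X_hat))
```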
10 When applying PCA or ICA to brain imaging data, trials are often combined with time samples to form a single dimension, thereby ignoring the tensor structure of the data (see, e. [sent-32, score-0.303]
11 In PARAFAC the three-way data array $X \in \mathbb{R}^{D \times T \times N}$ is decomposed under the model $(X_n)_{ij} \approx \sum_{k=1}^{K} (a_k)_i (b_k)_j (c_k)_n$ (1), where $a_k \in \mathbb{R}^D$, $b_k \in \mathbb{R}^T$, $c_k \in \mathbb{R}^N$, and $\approx$ denotes least-squares approximation. [sent-36, score-0.186]
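As a sketch of what the PARAFAC forward model (1) computes (reconstruction only, not a fitting algorithm; shapes and names are assumed for illustration):

```python
import numpy as np

D, T, N, K = 64, 128, 300, 3        # assumed sizes: channels, samples, trials, components
A = np.random.randn(D, K)           # spatial factors a_k (columns)
B = np.random.randn(T, K)           # temporal factors b_k (columns)
C = np.random.randn(N, K)           # trial loadings c_k (columns)

def parafac_trial(n):
    """Reconstruct trial n under model (1): (X_n)_ij = sum_k (a_k)_i (b_k)_j (c_k)_n."""
    return sum(C[n, k] * np.outer(A[:, k], B[:, k]) for k in range(K))

X3 = np.stack([parafac_trial(n) for n in range(N)])   # three-way array, shape (N, D, T)
```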
12 Typically, the variability in the data due to noise and task irrelevant activity is quite large when compared to the signal that relates specifically to the experimental question under consideration. [sent-45, score-0.096]
13 In fMRI data, for instance, the activity that is extracted from the raw data is often only a small fraction of the total background BOLD signal. [sent-46, score-0.062]
14 A similar situation arises in EEG where often hundreds of trials have to be averaged to gain a significant difference between two experimental conditions. [sent-47, score-0.103]
15 , Beckmann and Smith, 2005), if the ICA decomposition is followed by component inspection in order to identify task specific components. [sent-56, score-0.068]
16 In this paper we propose an algorithm that includes the labels at the earliest stage in order to identify possible subspaces wherein the SNR of task specific activity is maximized. [sent-58, score-0.062]
17 We propose to find a subspace projection in which the dimensions sum up to an optimal classification of the trials, while each dimension contributes to this discriminant sum independently across trials. [sent-59, score-0.184]
18 The subspace will be restricted to a bilinear subspace to express the assumption that each contributing dimension should have a fixed spatial profile and an associated temporal profile. [sent-60, score-0.458]
19 The underlying task relevant activity can be expected to involve a number of interacting sources, and we therefore allow the bilinear subspace to be of rank-K where K > 1. [sent-63, score-0.309]
20 All this can be compactly represented in a factorization of the form $(X_n^{(W)})_{ij} = \sum_{k=1}^{K} (a_k)_i (b_k)_j (c_k)_n$ (2), where $X_n^{(W)}$ denotes the projected data in discriminant directions $\{a_k\}$ and $\{b_k\}$, with statistically independent $(c_k)_n$ across $n$. [sent-64, score-0.159]
21 As a first step, a manifold of possible {ak , bk } is identified using Bilinear Discriminant Analysis (BLDA), extending the work of Dyrholm and Parra (2006). [sent-65, score-0.066]
22 To select a specific {ak , bk } we require that the resulting {ck } are independent across n. [sent-67, score-0.066]
23 LDA is directly applicable in the form (3) by letting the data vector xn be a stacking of the elements of the data matrix Xn, but the data matrix structure could potentially be exploited to obtain a more parsimonious representation of the weight vector w. [sent-88, score-0.209]
24 In EEG for instance, an electrical current source which is spatially static in the brain will give a rank-one contribution to the spatiotemporal Xn (see also Makeig et al. [sent-90, score-0.106]
25 Let R denote the number of columns in both U and V; then (4) is equivalent to (3) but with a rank-R constraint on w, that is, $w^T x_n = \sum_{ij} (W)_{ij} (X_n)_{ij}$ (5), where $W = UV^T$. [sent-93, score-0.209]
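A minimal sketch (assumed shapes and names, not the authors' code) showing that the rank-R weight in (5) can be evaluated either with the full matrix W = UV^T or, equivalently, as the bilinear form trace(U^T X_n V):

```python
import numpy as np

D, T, R = 64, 128, 2
U = np.random.randn(D, R)                 # spatial projections u_r (columns)
V = np.random.randn(T, R)                 # temporal projections v_r (columns)
Xn = np.random.randn(D, T)                # one trial matrix

W = U @ V.T                               # rank-R weight matrix
score_full = np.sum(W * Xn)               # sum_ij (W)_ij (X_n)_ij
score_bilinear = np.trace(U.T @ Xn @ V)   # sum_r u_r^T X_n v_r
assert np.allclose(score_full, score_bilinear)
```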
26 In this paper we generalize maximum likelihood Logistic Regression to the case of a bilinear factorization of w. [sent-99, score-0.265]
27 The proposed algorithm has the advantage that it allows us to regularize the estimation and incorporate prior assumptions about smoothness as described in Section 2. [sent-101, score-0.097]
28 2 we extract components from the EEG of six human subjects in a rapid serial visual presentation paradigm. [sent-105, score-0.081]
29 For instance, if knowledge is available about the smoothness in the column space of Xn (e. [sent-108, score-0.097]
30 , temporal smoothness), such knowledge can be incorporated by declaring a prior p. [sent-112, score-0.077]
31 Spatial and temporal smoothness is typically a valid assumption in EEG and fMRI, see, for example, Penny et al. [sent-116, score-0.174]
32 Let uk denote the kth column of U, and let vk denote the kth column of V. [sent-120, score-0.123]
33 We declare Gaussian Process priors for uk and vk , that is, assume uk ∼ N (0, Ku ) and vk ∼ N (0, Kv ), where the covariance matrices Ku and Kv define the degree and form of smoothness of uk and vk respectively. [sent-121, score-0.275]
34 This is done through the choice of covariance function: let r be a spatial or temporal measure in the context of $X_n$. [sent-122, score-0.169]
35 For instance r is a measure of spatial distance between data acquisition sensors, or a measure of time difference between two samples in the data. [sent-123, score-0.092]
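For illustration, a member of the Matérn class mentioned later in the text could be used to build such a covariance from temporal latencies; the specific order (3/2), length scale, and normalization below are assumptions made for this sketch, not the authors' settings:

```python
import numpy as np

def matern32(r, lengthscale=0.1, variance=1.0):
    """Matérn 3/2 covariance as a function of distance r (illustrative choice
    within the Matérn class; order and parameters are assumptions)."""
    s = np.sqrt(3.0) * np.abs(r) / lengthscale
    return variance * (1.0 + s) * np.exp(-s)

# Temporal covariance K_v from latencies r_ij = |i - j|, normalized by T (assumed).
T = 64
idx = np.arange(T)
R_time = np.abs(idx[:, None] - idx[None, :]) / T
K_v = matern32(R_time)

# A smooth temporal prior draw v_k ~ N(0, K_v); jitter added for numerical stability.
v_k = np.random.multivariate_normal(np.zeros(T), K_v + 1e-8 * np.eye(T))
```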
36 We propose to enhance the task relevant activity by projecting the data using $\{\tilde{U}, \tilde{V}\}$ (similar to the argument of the Trace operator in Equation 7); hence we need a meaningful way to estimate G. [sent-137, score-0.062]
37 First, we define the projection that uses $\tilde{U}$ and $\tilde{V}$: define $W_r$ as the outer product of the rth columns of $\tilde{U}$ and $\tilde{V}$. [sent-140, score-0.076]
38 Our reasoning behind this choice is that different cortical processes might respond to the same stimuli, and are hence not temporally independent, but the trial-to-trial variability between the different cortical networks might satisfy the independence criterion better. [sent-145, score-0.142]
39 That is, by making the component activations as independent as possible across trials, we hope to be able to segregate activity arising from different cortical networks into separate sets of components. [sent-146, score-0.217]
40 The first experiment benchmarks the classification performance of our BLDA method with smoothness regularization against state-of-the-art methods in a published data set. [sent-155, score-0.097]
41 The 28 channel EEG was recorded from a single subject performing ‘self-paced key typing’, that is, pressing corresponding keys with the index and little fingers in a self-chosen order and timing. [sent-160, score-0.126]
42 Trial matrices were extracted by epoching the data starting 630ms before each key-press. [sent-162, score-0.069]
43 For the competition, the first 316 epochs were to be used for classifier training, while the remaining 100 epochs were to be used as a test set. [sent-164, score-0.066]
44 Data were recorded at 1000 Hz with a pass-band between 0. [sent-165, score-0.069]
45 1 T RAINING AND R ESULTS We tuned the smoothness prior parameters using cross validation on the training set using only a single component, that is, R=1. [sent-169, score-0.097]
46 We used the Matérn class of covariance functions for incorporating smoothness regularization in the model c. [sent-170, score-0.097]
47 Temporal smoothness was implemented by letting $r_{ij}$ equal the normalized temporal latency $|i - j|$ between samples i and j. [sent-174, score-0.174]
48 The number of misclassified trials in the test set was 21, which places our method at a new third place given the results of the competition, which can be found online at http://ida. [sent-197, score-0.175]
49 The achieved classification performance supports the validity of the bilinear weight space factorization in EEG. [sent-206, score-0.265]
50 2 Experiment II: Bilinear Discriminant Component Analysis of Real EEG We applied the BDCA method to real EEG data which was recorded while the human subjects were stimulated with a sequence of images presented at a rate of ten images per second. [sent-208, score-0.112]
51 Sixty-four EEG channels were recorded at 2048Hz, re-referenced to average reference (one channel removed to obtain full row-rank data), filtered (for anti-aliasing) and downsampled to 128Hz sampling rate, and filtered again with a pass-band between 0. [sent-217, score-0.101]
52 Trial matrices of dimension (D, T ) = (64, 64) were extracted by epoching (500ms per epoch) the data in alignment with image stimulus. [sent-220, score-0.069]
53 The number of recorded target/distractor trials was roughly 60/3000 for training and 40/2000 for testing, but varied slightly between subjects. [sent-221, score-0.172]
54 Figure 1 shows a single BDCA component for each subject. [sent-284, score-0.068]
55 Clearly, there is inter-subject variability in the spatial topographies and in the temporal profiles; however, all temporal profiles exhibit positive peaks at around 125ms and 300ms after target stimulus, and a negative peak at around 200ms. [sent-285, score-0.399]
56 The peak at 300ms in the temporal profile is in agreement with the conventional P300 which is typically observed with a rare target stimulus (Gerson et al. [sent-286, score-0.187]
57 The spatial topographies shown in Figure 1 are rather complex. [sent-291, score-0.156]
58 This may simply represent noise, but it is also possible that BDCA, using additional trials and R > 2, could decompose the complex rank-one patterns into more than two components with localized, that is, ‘simpler’, topography. [sent-292, score-0.141]
59 Two of the subjects (4 and 6) showed another interesting component with a broad spatial projection located slightly below the center on the scalp, see Figure 2. [sent-293, score-0.246]
60 The component time courses were dominated by a 20Hz rhythm which seemed to modulate in amplitude around 200–300ms. [sent-294, score-0.14]
61 To validate the new hypothesis, we measured the single-component classification performances on the test set for each subject, that is, performed classification based only on the (subject specific) component shown in Figure 2. [sent-296, score-0.068]
62 Figure 1: For each subject, one of the (spatial) ak vectors is shown topographically on a cartoon head, and the corresponding (temporal) bk vector is shown right next to it. [sent-322, score-0.167]
63 The sign ambiguity between component topographies and time courses has been set so that the P300 peak has a positive projection to the center in the back of the cartoon head. [sent-323, score-0.334]
64 All temporal profiles exhibit positive peaks at around 100ms and 350ms (P300) after target stimulus, and a negative peak at around 200ms. [sent-324, score-0.132]
65 Figure 2: Two of the subjects showed a component with a broad spatial projection located slightly below center on the scalp. [sent-327, score-0.246]
66 The component time courses shown here have in common that they were dominated by a 20Hz rhythm which seemed to modulate in amplitude around 200–300ms. [sent-328, score-0.14]
67 87 AUC for subject 6 which indicated that the newly identified components were indeed associated with target detection. [sent-330, score-0.095]
68 Finally we would like to point out that the resulting smoothness (e. [sent-331, score-0.097]
69 , the time courses in Figure 1) is not only affected by the choice of regularization parameters but also by the data and the number of trials. [sent-333, score-0.072]
70 If more data is available that supports a deviation from the prior smoothness assumption, the resulting time courses can and will be more punctuated in time. [sent-334, score-0.169]
71 Conclusion and Discussion Bilinear Discriminant Analysis (BLDA) can give better classification performance in situations where a bilinear decomposition of the parameter matrix can be assumed as in (5). [sent-336, score-0.205]
72 Such parameter matrix decompositions might prove reasonable in situations where component analysis according to model (2) is meaningful. [sent-337, score-0.068]
73 One such component contributes to a rank-one subspace in data Xn . [sent-339, score-0.11]
74 If separate spatial distributions have separate time courses, the model assumes that these contributions add up linearly. [sent-340, score-0.164]
75 We presented a method for BLDA which allows smoothness regularization for better generalization performance in data sets with limited examples. [sent-342, score-0.097]
76 This step was motivated by application to functional brain imaging where the number of examples is typically very limited compared to the data space dimensionality. [sent-343, score-0.2]
77 We showed that BLDA can yield a data subspace factorization which makes it a useful tool for supervised extraction of components as opposed to simply a tool for classifying data matrices. [sent-345, score-0.14]
78 We identified some essential ambiguities in such supervised subspace component decomposition and proposed to resolve them by assuming independence across the labeled mode (i. [sent-346, score-0.11]
79 Though the algorithm was motivated by functional brain imaging data (with space, time, and labeled trials as its dimensions) it should be applicable for any data set that records a matrix rather than a vector for every repetition. [sent-357, score-0.303]
80 When the dependent variables are continuous rather than discrete one can use a bilinear model with a unit link function to derive the corresponding bilinear regression. [sent-361, score-0.41]
81 Acknowledgments The EEG topographies in this paper were plotted using the Matlab toolbox “EEGLAB” (Delorme and Makeig, 2004). [sent-363, score-0.064]
82 Define $\pi(X_n) \equiv E[y_n] = \dfrac{1}{1 + e^{-(w_0 + \sum_{r=1}^{R} u_r^T X_n v_r)}}$. [sent-368, score-0.639]
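A hedged sketch (shapes and names assumed, not the authors' implementation) of this bilinear logistic forward model and the corresponding Bernoulli log likelihood l(w0, {ur, vr}):

```python
import numpy as np

def pi_bilinear(Xn, w0, U, V):
    """pi(X_n) = 1 / (1 + exp(-(w0 + sum_r u_r^T X_n v_r))), with u_r, v_r as columns of U, V."""
    z = w0 + np.trace(U.T @ Xn @ V)
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(trials, labels, w0, U, V):
    """Bernoulli log likelihood l(w0, {u_r, v_r}) summed over trials (labels in {0, 1})."""
    eps = 1e-12
    ll = 0.0
    for Xn, yn in zip(trials, labels):
        p = pi_bilinear(Xn, w0, U, V)
        ll += yn * np.log(p + eps) + (1.0 - yn) * np.log(1.0 - p + eps)
    return ll
```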
83 1 Smoothness Regularization with Gaussian Processes The log posterior is equal to the log likelihood plus evaluation of the log prior, that is, $\log p(w_0, \{u_r, v_r\} \mid \mathcal{X}) = l(w_0, \{u_r, v_r\}) + \log p(w_0, \{u_r, v_r\}) - \log p(\mathcal{X})$, where $\mathcal{X}$ denotes the data in all the trials available. [sent-372, score-1.309]
84 Here we consider the maximum of the posterior (MAP) estimate, that is, $(w_0, \{u_r, v_r\})_{\mathrm{MAP}} = \arg\max_{w_0, \{u_r, v_r\}} \; l(w_0, \{u_r, v_r\}) + \log p(w_0, \{u_r, v_r\})$, with independent priors $\log p(w_0, \{u_r, v_r\}) = \log p(w_0) + \sum_r \log p(u_r) + \sum_r \log p(v_r)$ (11). [sent-373, score-1.458]
85 For iterative MAP estimation, the terms for the Gaussian prior, to be inserted in (11), are (here shown for $u_r$): $\log p(u_r) = -\frac{\dim u_r}{2} \log(2\pi) - \frac{1}{2} \log(\det K) - \frac{1}{2} u_r^T K^{-1} u_r$, where $\dim u_r = D$ (or likewise $\dim v_r = T$ or $\dim w_0 = 1$) (see also Rasmussen and Williams, 2006). [sent-374, score-1.852]
86 The extra terms, to be added to the ML terms, are: for the gradient, $\partial \log p(u_r) / \partial u_r = -K^{-1} u_r$, and for the Hessian, $\partial^2 \log p(u_r) / \partial u_r \, \partial (u_r)_j = -K^{-1} e_j$, where $e_j$ is the jth unit vector. [sent-375, score-0.437]
87 These expressions (and similar for w0 and vr ) thus augment the terms in the maximum likelihood algorithm above. [sent-376, score-0.302]
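A minimal sketch (assumed shapes, not the authors' code) of these Gaussian-prior correction terms for a single u_r; the same form applies to each v_r with its own covariance:

```python
import numpy as np

def prior_grad_hess(ur, K):
    """Gradient and Hessian contributions of the Gaussian prior log p(u_r) with u_r ~ N(0, K).
    These are the -K^{-1} u_r and -K^{-1} terms added to the ML gradient and Hessian."""
    Kinv = np.linalg.inv(K)      # in practice a Cholesky solve would be preferable
    grad = -Kinv @ ur
    hess = -Kinv
    return grad, hess
```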
88 Equations for Maximum-Likelihood Estimation of G The log likelihood is given by $\log p(X_n \mid \hat{V}G^{-1}, \hat{U}G^{T}) = -\frac{N}{2} \log\det[A^T A] + \sum_n \log p(\hat{s}_n)$, see also Dyrholm et al. [sent-378, score-0.2]
89 We choose the component activation prior pdf p(·) = 1/[π cosh(·)], as proposed by Bell and Sejnowski (1995), which is appropriate for super-Gaussian independent activations (see also Lee et al. [sent-384, score-0.134]
90 This choice, however, might not fit the scaling of the data very well, so we parameterize the activation pdf and rewrite the likelihood as $\log p(X_n \mid \hat{V}G^{-1}, \hat{U}G^{T}, \alpha) = -\frac{N}{2} \log\det[A^T A] + \sum_n \log p(\hat{z}_n / \alpha^2)$, where $\hat{z}_k(n) = (\hat{s}_k(n) - E[\hat{s}_k(n)]) / \sqrt{\mathrm{var}[\hat{s}_k(n)]}$. [sent-386, score-0.183]
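For reference, a small sketch of the log of the 1/[π cosh(·)] activation prior used above, written in a numerically stable form (this is only the density evaluation, not the estimation of G or α):

```python
import numpy as np

def log_p_cosh(z):
    """log p(z) for p(z) = 1 / (pi * cosh(z)), a super-Gaussian activation prior.
    Uses log cosh(z) = logaddexp(z, -z) - log 2 to avoid overflow for large |z|."""
    return -np.log(np.pi) - (np.logaddexp(z, -z) - np.log(2.0))
```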
91 We do not actually compute the Hessian but use the outer product approximation to the Hessian given by averaging gradient products across trials (see also Bishop, 1996). [sent-388, score-0.103]
92 Tensorial extensions of independent component analysis for group fMRI data analysis. [sent-402, score-0.068]
93 The BCI competition 2003: progress and perspectives in detection and discrimination of EEG single trials. [sent-445, score-0.108]
94 Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. [sent-471, score-0.145]
95 Eeglab: an open source toolbox for analysis of single-trial eeg dynamics including independent component analysis. [sent-487, score-0.322]
96 Cortical origins of response time variability during rapid discrimination of visual objects. [sent-514, score-0.07]
97 Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. [sent-532, score-0.068]
98 Feature extraction using supervised independent component analysis by maximizing class distance. [sent-636, score-0.068]
99 Normalized radial basis function networks and bilinear discriminant analysis for face recognition. [sent-660, score-0.304]
100 Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. [sent-665, score-0.099]
wordName wordTfidf (topN-words)
[('ur', 0.337), ('vr', 0.302), ('eeg', 0.254), ('parra', 0.242), ('dyrholm', 0.216), ('xn', 0.209), ('bilinear', 0.205), ('parafac', 0.143), ('ica', 0.133), ('fmri', 0.132), ('bdca', 0.111), ('blda', 0.111), ('hristoforou', 0.111), ('ilinear', 0.111), ('ugt', 0.111), ('neuroimage', 0.108), ('brain', 0.106), ('trials', 0.103), ('discriminant', 0.099), ('smoothness', 0.097), ('vec', 0.095), ('imaging', 0.094), ('bci', 0.094), ('makeig', 0.094), ('omponent', 0.094), ('snr', 0.094), ('vg', 0.094), ('spatial', 0.092), ('ms', 0.082), ('gerson', 0.08), ('mat', 0.08), ('iscriminant', 0.077), ('temporal', 0.077), ('competition', 0.072), ('courses', 0.072), ('ak', 0.069), ('recorded', 0.069), ('component', 0.068), ('nalysis', 0.067), ('auc', 0.067), ('ut', 0.067), ('bk', 0.066), ('beckmann', 0.064), ('delorme', 0.064), ('rch', 0.064), ('rsvp', 0.064), ('topographies', 0.064), ('lda', 0.063), ('activity', 0.062), ('factorization', 0.06), ('subject', 0.057), ('peak', 0.055), ('stimulus', 0.055), ('sk', 0.054), ('blankertz', 0.054), ('cortical', 0.054), ('ird', 0.054), ('ck', 0.051), ('trace', 0.051), ('pca', 0.05), ('log', 0.05), ('miwakeichi', 0.048), ('thorpe', 0.048), ('wr', 0.048), ('vk', 0.047), ('hessian', 0.045), ('projection', 0.043), ('subjects', 0.043), ('subspace', 0.042), ('yn', 0.041), ('bishop', 0.041), ('mcculloch', 0.04), ('dim', 0.038), ('kth', 0.038), ('rasmussen', 0.038), ('components', 0.038), ('matrices', 0.037), ('trial', 0.037), ('det', 0.036), ('williams', 0.036), ('magnetic', 0.036), ('discrimination', 0.036), ('variability', 0.034), ('rth', 0.033), ('activation', 0.033), ('activations', 0.033), ('epochs', 0.033), ('andersen', 0.032), ('bullmore', 0.032), ('calhoun', 0.032), ('cartoon', 0.032), ('christoforos', 0.032), ('christoforou', 0.032), ('deferred', 0.032), ('downsampled', 0.032), ('eeglab', 0.032), ('electrode', 0.032), ('epoching', 0.032), ('harshman', 0.032), ('irt', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including $\ell_1$, $\ell_2$, $\ell_2^2$, and $\ell_1 + \ell_2^2$ style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
3 0.052895293 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
4 0.050010182 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan
Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations
5 0.048502292 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
Author: Masashi Sugiyama, Matthias Krauledat, Klaus-Robert Müller
Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when the outside of the training region is extrapolated. The situation where the training input points and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called the covariate shift. Under the covariate shift, standard model selection techniques such as cross validation do not work as desired since its unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV), for which we prove its unbiasedness even under the covariate shift. The IWCV procedure is the only one that can be applied for unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of our proposed method is illustrated by simulations, and furthermore demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions. Keywords: covariate shift, cross validation, importance sampling, extrapolation, brain-computer interface
6 0.047933511 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
7 0.04483778 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
8 0.043148499 70 jmlr-2007-Ranking the Best Instances
9 0.037738729 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
10 0.036281887 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
11 0.033929233 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
12 0.033830751 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
13 0.033307768 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
14 0.032747667 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
15 0.032157723 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
17 0.029882656 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
18 0.029734753 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
19 0.029515108 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
20 0.028973566 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
topicId topicWeight
[(0, 0.169), (1, 0.033), (2, 0.002), (3, 0.058), (4, -0.085), (5, -0.064), (6, -0.089), (7, -0.076), (8, 0.026), (9, -0.013), (10, -0.073), (11, -0.073), (12, -0.061), (13, -0.204), (14, -0.041), (15, 0.051), (16, -0.041), (17, -0.126), (18, -0.231), (19, -0.126), (20, 0.231), (21, 0.004), (22, -0.111), (23, -0.125), (24, -0.059), (25, -0.169), (26, 0.003), (27, -0.142), (28, 0.14), (29, -0.062), (30, 0.115), (31, -0.21), (32, -0.013), (33, -0.016), (34, 0.027), (35, -0.149), (36, 0.052), (37, -0.022), (38, 0.047), (39, -0.029), (40, -0.04), (41, 0.012), (42, -0.021), (43, 0.025), (44, 0.032), (45, -0.172), (46, -0.086), (47, 0.078), (48, 0.332), (49, -0.133)]
simIndex simValue paperId paperTitle
same-paper 1 0.95585388 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including $\ell_1$, $\ell_2$, $\ell_2^2$, and $\ell_1 + \ell_2^2$ style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
3 0.36078197 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
Author: Masashi Sugiyama, Matthias Krauledat, Klaus-Robert Müller
Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when the outside of the training region is extrapolated. The situation where the training input points and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called the covariate shift. Under the covariate shift, standard model selection techniques such as cross validation do not work as desired since its unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV), for which we prove its unbiasedness even under the covariate shift. The IWCV procedure is the only one that can be applied for unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of our proposed method is illustrated by simulations, and furthermore demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions. Keywords: covariate shift, cross validation, importance sampling, extrapolation, brain-computer interface
4 0.31160504 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
Author: Zoltán Szabó, Barnabás Póczos, András Lőrincz
Abstract: We introduce the blind subspace deconvolution (BSSD) problem, which is the extension of both the blind source deconvolution (BSD) and the independent subspace analysis (ISA) tasks. We examine the case of the undercomplete BSSD (uBSSD). Applying temporal concatenation we reduce this problem to ISA. The associated ‘high dimensional’ ISA problem can be handled by a recent technique called joint f-decorrelation (JFD). Similar decorrelation methods have been used previously for kernel independent component analysis (kernel-ICA). More precisely, the kernel canonical correlation (KCCA) technique is a member of this family, and, as is shown in this paper, the kernel generalized variance (KGV) method can also be seen as a decorrelation method in the feature space. These kernel based algorithms will be adapted to the ISA task. In the numerical examples, we (i) examine how efficiently the emerging higher dimensional ISA tasks can be tackled, and (ii) explore the working and advantages of the derived kernel-ISA methods. Keywords: undercomplete blind subspace deconvolution, independent subspace analysis, joint decorrelation, kernel methods
5 0.26678532 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan
Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations
6 0.24573754 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
7 0.22193851 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
8 0.21439835 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
9 0.20537499 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
10 0.1948964 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
11 0.19330475 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
12 0.17483675 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
13 0.17204992 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
14 0.16983831 44 jmlr-2007-Large Margin Semi-supervised Learning
15 0.15966778 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
16 0.15160443 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
17 0.15075544 90 jmlr-2007-Value Regularization and Fenchel Duality
18 0.1429414 70 jmlr-2007-Ranking the Best Instances
19 0.14018111 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
20 0.13905637 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
topicId topicWeight
[(1, 0.535), (4, 0.03), (8, 0.025), (10, 0.012), (12, 0.028), (15, 0.013), (28, 0.03), (40, 0.04), (45, 0.014), (48, 0.042), (60, 0.024), (85, 0.065), (98, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.78315163 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
2 0.2186617 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
Author: Masashi Sugiyama, Matthias Krauledat, Klaus-Robert Müller
Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when the outside of the training region is extrapolated. The situation where the training input points and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called the covariate shift. Under the covariate shift, standard model selection techniques such as cross validation do not work as desired since its unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV), for which we prove its unbiasedness even under the covariate shift. The IWCV procedure is the only one that can be applied for unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of our proposed method is illustrated by simulations, and furthermore demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions. Keywords: covariate shift, cross validation, importance sampling, extrapolation, brain-computer interface
3 0.20199947 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
Author: Zoltán Szabó, Barnabás Póczos, András Lőrincz
Abstract: We introduce the blind subspace deconvolution (BSSD) problem, which is the extension of both the blind source deconvolution (BSD) and the independent subspace analysis (ISA) tasks. We examine the case of the undercomplete BSSD (uBSSD). Applying temporal concatenation we reduce this problem to ISA. The associated ‘high dimensional’ ISA problem can be handled by a recent technique called joint f-decorrelation (JFD). Similar decorrelation methods have been used previously for kernel independent component analysis (kernel-ICA). More precisely, the kernel canonical correlation (KCCA) technique is a member of this family, and, as is shown in this paper, the kernel generalized variance (KGV) method can also be seen as a decorrelation method in the feature space. These kernel based algorithms will be adapted to the ISA task. In the numerical examples, we (i) examine how efficiently the emerging higher dimensional ISA tasks can be tackled, and (ii) explore the working and advantages of the derived kernel-ISA methods. Keywords: undercomplete blind subspace deconvolution, independent subspace analysis, joint decorrelation, kernel methods
4 0.19760606 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming
5 0.19528307 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
6 0.19481134 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
7 0.19389376 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
8 0.19361171 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.19283724 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
11 0.19238129 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
12 0.19206107 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
13 0.18995342 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
14 0.18951184 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.18641587 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
19 0.18602496 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection