jmlr jmlr2010 jmlr2010-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
Reference: text
sentIndex sentText sentNum sentScore
1 Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. [sent-13, score-0.615]
2 Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface 1. [sent-16, score-0.246]
3 Introduction The work presented in this paper is motivated by the analysis of functional brain imaging signals recorded via electroencephalography (EEG). [sent-17, score-0.114]
4 Many of these classification techniques were originally developed in the context of Brain Computer Interfaces (BCI) but are now more widely used to interpret activity associated with neural processing. [sent-24, score-0.141]
5 , 2002, 2003) the aim is to decode brain activity on a single-trial basis in order to provide a direct control pathway between a user’s intentions and a computer. [sent-28, score-0.212]
6 By extracting activity that differs maximally between two experimental conditions, the typically low signal-to-noise ratio of EEG can be overcome. [sent-35, score-0.141]
7 The typical approach in this case is to average the measured potentials following stimulus presentation, thereby canceling uncorrelated noise that is not reproducible from one trial to the next. [sent-44, score-0.134]
8 This averaged activity, called an event related potential (ERP), captures activity that is time-locked to the stimulus presentation but cancels induced oscillatory activity that is not locked in phase to the timing of the stimulus. [sent-45, score-0.483]
9 Alternatively, many studies compute the oscillatory activity in specific frequency bands by filtering and squaring the signal prior to averaging. [sent-46, score-0.413]
10 Induced changes in oscillatory activity are termed event related synchronization or desynchronization (ERS/ERD) (Pfurtscheller and da Silva, 1999). [sent-47, score-0.254]
11 First order methods include temporal filtering and thresholding (Birbaumer et al. [sent-49, score-0.157]
12 , 2007) and the well known common spatial patterns method (CSP) (Ramoser et al. [sent-56, score-0.283]
13 , 2005), and common sparse spectral spatial patterns (CSSSP) (Dornhege et al. [sent-58, score-0.283]
14 Instead, it would be desirable for the analysis method to extract the relevant neurophysiological activity de novo with minimal prior expectations. [sent-62, score-0.141]
15 Through a bilinear formulation, the method can simultaneously identify spatial linear components as well as temporal modulation of activity. [sent-64, score-0.648]
16 Further, through the bilinear formulation, the method exploits the spatio-temporal nature of the EEG signals and provides a reduced parametrization of the high dimensional data space. [sent-66, score-0.218]
17 We show that a broad set of state-of-the-art EEG analysis methods can be characterized as special cases under this bilinear framework. [sent-67, score-0.175]
18 We then present the bilinear model, discuss interpretation in the context of EEG, and establish a link to current analysis methods. [sent-73, score-0.175]
19 The goal will be to use the N examples to optimize these parameters such that the discriminant function takes on positive values for examples with yn = +1 and negative values for yn = −1. [sent-85, score-0.27]
20 3 Interpretation and Rationale of the Model The discriminant criterion defined in (1) is the sum of a linear and a quadratic term, each combining bilinear components of the EEG signal. [sent-90, score-0.298]
21 By re-writing the term as: Trace(U⊤ XV) = Trace(VU⊤ X) = Trace(W⊤ X) , where we defined, W = UV⊤ , it is easy to see that the bilinear projection is a linear combination of elements of X with a rank-R constraint on W. [sent-94, score-0.175]
22 In particular, the polarity of the signal (positive evoked response versus negative evoked response) will contribute to discrimination if it is consistent across trials. [sent-96, score-0.141]
23 This bilinear projection reduces the number of model parameters of W from D × T dimensions to R × (D + T ) which is a significant dimensionality reduction that alleviates the problem of over-fitting in parameters estimation given the small training set size. [sent-98, score-0.175]
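A minimal numerical sketch of this identity and of the parameter reduction (the sizes D, T, R below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sizes: D channels, T time samples, rank R.
D, T, R = 32, 100, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((D, T))   # one EEG trial
U = rng.standard_normal((D, R))   # spatial components
V = rng.standard_normal((T, R))   # temporal components

# Bilinear form Trace(U^T X V) equals the linear form Trace(W^T X)
# with W = U V^T, a D x T matrix of rank at most R.
W = U @ V.T
assert np.isclose(np.trace(U.T @ X @ V), np.trace(W.T @ X))

# Parameter count drops from D*T to R*(D+T).
print(D * T, R * (D + T))
```

With these sizes the count drops from 3200 free parameters for an unconstrained W to 396 for the rank-3 factorization, which is the over-fitting argument made above.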
24 The second term of Equation (1) is the power of spatially and temporally weighted signals and thus captures the second-order statistics of the signal. [sent-101, score-0.113]
25 7), then B defines a temporal filter on the signal and the model captures powers of the filtered signal. [sent-106, score-0.244]
26 Further, by allowing B to be learned from the data, we may be able to identify new frequency bands that have so far not been identified in novel experimental paradigms. [sent-107, score-0.108]
27 The spatial weights A together with the Trace operation ensure that the power is measured, not in individual electrodes, but in some component space that may reflect activity distributed across several electrodes. [sent-108, score-0.49]
28 The diagonal matrix Λ partitions the K spatial components (i. [sent-109, score-0.316]
29 , K columns of A) into those that contribute power positively and those that contribute power negatively to the total sum. [sent-111, score-0.102]
30 Since each column of A measures the power from different sources, then by multiplying the expression with Λ we capture the difference in power between different spatial components. [sent-112, score-0.351]
31 It is known that imagining a movement of the left hand reduces oscillatory activity over the motor cortex of the right hemisphere, while an imagined right-hand movement reduces oscillations over the left motor cortex. [sent-114, score-0.527]
32 Each of these cortical areas will be captured by a different spatial distribution in the EEG. [sent-115, score-0.283]
33 If we limit the columns of A to two, then these columns may capture the power of oscillatory activity over the right and left motor cortex respectively. [sent-116, score-0.464]
34 C = 1 indicates that the discriminant activity is dominated by the first-order features; C = 0 indicates that the activity is dominated by second-order features, and any value in between denotes the importance of one component versus the other. [sent-120, score-0.404]
, ur and ar map the sensor signal to a current-source space, vr are temporal weights on a source signal and br can be arranged to represent a temporal filter) it becomes intuitive for the experimenter to incorporate prior knowledge of an experimental setup in the model. [sent-124, score-0.698]
36 If the signal of interest is known to be in a specific frequency band, one can fix matrix B to capture only the desired frequency band. [sent-125, score-0.267]
37 For example, B can be fixed to a Toeplitz matrix with coefficients corresponding to an 8Hz-12Hz bandpass filter, then the second-order term is able to extract power in the alpha-band which is known to be modulated during motor related tasks. [sent-126, score-0.147]
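A rough sketch of this fixed-B variant (sampling rate, filter length, and data sizes are illustrative assumptions, not the paper's configuration):

```python
import numpy as np
from scipy.signal import firwin
from scipy.linalg import toeplitz

# Hypothetical setup: D channels, T samples at fs Hz, K = 2 spatial components.
fs, D, T, K = 100, 32, 200, 2

# FIR bandpass taps for the 8-12 Hz alpha band.
h = firwin(31, [8.0, 12.0], pass_zero=False, fs=fs)

# Toeplitz convolution matrix: each column of B applies the filter
# at one time shift, so X @ B is the bandpass-filtered trial.
col = np.r_[h, np.zeros(T - len(h))]
row = np.r_[h[0], np.zeros(T - 1)]
B = toeplitz(col, row)

rng = np.random.default_rng(1)
X = rng.standard_normal((D, T))      # one trial
A = rng.standard_normal((D, K))      # K spatial components
Lam = np.diag([1.0, -1.0])           # difference of band powers (Lambda)

# Second-order term: Trace(Lambda A^T X B B^T X^T A), the difference in
# alpha-band power between the two spatial components.
second_order = np.trace(Lam @ A.T @ X @ B @ B.T @ X.T @ A)
```

The diagonal of Λ set to [1, −1] realizes the power-difference interpretation discussed earlier: one component contributes power positively, the other negatively.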
38 In such a scenario the experimenter can fix the temporal profile parameter V to emphasize time samples around the expected location of the peak activity and optimize over the rest of the parameters. [sent-128, score-0.33]
39 fMRI has high spatial resolution and can provide locations within the brain that may be known to participate in the processing during a particular experimental paradigm. [sent-130, score-0.354]
40 This location information can be incorporated into the present model by fixing the spatial parameters ur and ar to reflect a localized source (often approximated as a current dipole). [sent-131, score-0.348]
41 The remaining temporal parameters of the model can then be optimized. [sent-132, score-0.157]
42 The following list identifies some of the algorithms and how they relate to the model used in the SOBDA framework: • Set C = 1, R = 1 and choose temporal component v to select a time window of interest (i. [sent-135, score-0.216]
43 (2002, 2005) • Set C = 1 and select some R > 1 and choose the component vectors vr to select multiple time windows of interest as in 1. [sent-140, score-0.122]
44 Learn for each temporal window the corresponding spatial vector ur from examples separately and then combine these components by learning a linear combination of the elements. [sent-141, score-0.565]
45 (2008) • Set C = 1, R = D while constraining U to be a diagonal matrix and select, separately for each channel, the time window vr which is most discriminative. [sent-144, score-0.117]
46 Then train the diagonal terms of U, resulting in a latency-dependent spatial filter (Luo and Sajda, 2006a). [sent-145, score-0.283]
47 Alternatively, in the first step, use feature selection to find the right set of time windows vr simultaneously for all channels (Luo and Sajda, 2006b). [sent-146, score-0.125]
48 • Set C = 1,R = 1 and learn the spatial and temporal components u, v simultaneously. [sent-147, score-0.473]
49 This reduces to the rank-one bilinear discriminant as in Dyrholm and Parra (2006). • Select C = 1 and some R > 1 and learn all columns of the spatial and temporal projection matrices U and V simultaneously. [sent-148, score-0.739]
50 • Set C = 0, K = 2 and fix B to a Toeplitz structure encoding a specific frequency band and set the diagonal of Λ to be [1 − 1]. [sent-151, score-0.197]
51 With this definition, the discriminant criterion given by the log-odds ratio of the posterior class probabilities, log[P(y = +1|X)/P(y = −1|X)] = f(X; θ), is simply the discriminant function which we chose to define in (1) as a sum of linear and quadratic terms. [sent-164, score-0.18]
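A minimal sketch of this logistic view of the discriminant: the log-odds of y = +1 is the weighted sum of the bilinear (first-order) and quadratic (second-order) terms. All shapes and the value of C below are illustrative assumptions:

```python
import numpy as np

def discriminant(X, U, V, A, Lam, B, C, w0=0.0):
    """f(X; theta): weighted sum of first- and second-order terms, Eq. (1)."""
    linear = np.trace(U.T @ X @ V)                            # first-order
    quadratic = np.trace(Lam @ A.T @ X @ B @ B.T @ X.T @ A)   # second-order
    return C * linear + (1.0 - C) * quadratic + w0

def p_positive(X, *theta):
    """Posterior P(y = +1 | X) under the logistic link."""
    return 1.0 / (1.0 + np.exp(-discriminant(X, *theta)))
```

By construction, log P(y = +1|X) − log P(y = −1|X) recovers f(X; θ), which is the log-odds identity stated above.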
52 In such a case we can re-write Equations (3) and (4) as ∂L(θ)/∂ar = 2(1 − C) ∑n yn πn Xn FH D DH F Xn⊤ ar. [sent-170, score-0.144]
53 Spatial and temporal smoothness is typically a valid assumption in EEG (Penny et al. [sent-183, score-0.187]
54 This is done through choice of covariance function: Let r be a spatial or temporal measure in context of X. [sent-188, score-0.44]
55 For instance r is a measure of spatial distance between data acquisition sensors, or a measure of time difference between two samples in the data. [sent-189, score-0.283]
56 Similarly, the Gaussian prior can be used on the columns of the temporal matrix V (i. [sent-195, score-0.191]
57 Following Rasmussen and Williams (2005) the shape parameter was chosen to be ν = 100 for the spatial components and ν = 2. [sent-204, score-0.316]
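A sketch of such a smoothness prior: a squared-exponential covariance K(r) = exp(−r²/(2ν²)) over the pairwise distance measure r, used as a Gaussian prior on the columns of U (spatial) or V (temporal). The sampling grid and the value of ν below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def se_prior_cov(r, nu):
    """Squared-exponential prior covariance from pairwise distances r."""
    return np.exp(-(r ** 2) / (2.0 * nu ** 2))

t = np.arange(100) / 100.0                 # 100 samples over one second
r = np.abs(t[:, None] - t[None, :])        # pairwise time differences
K_temporal = se_prior_cov(r, nu=0.05)      # nearby samples strongly coupled
```

The same construction applies spatially, with r replaced by inter-electrode distances and a correspondingly larger shape parameter ν.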
58 However, for the spectral parameter we found it important to initialize it to a frequency band that was expected to carry useful information, for example, 8Hz-30Hz. [sent-220, score-0.197]
59 The simulation aims to quantify the algorithm’s performance on a broad spectrum of conditions and various noise levels, as well as to compare the extracted spatial, temporal and frequency components with ground truth. [sent-225, score-0.359]
60 We used two spatial patterns (SP) and employed a logistic regression classifier on the resulting SP. [sent-239, score-0.314]
61 Since CSP, MLR and sMLR require the data to be band-pass filtered to the frequency of interest, data sets were filtered in the range of 8Hz-30Hz for these methods. [sent-243, score-0.108]
62 For the spatial parameter A we set K = 2 allowing for two spatial patterns, while we enforce a Toeplitz structure on B. [sent-245, score-0.566]
63 We used 3 dipoles at three different locations, with one dipole used to generate evoked response activity, one dipole to generate induced oscillatory activity, and one dipole to generate unrelated noise/interference. [sent-257, score-0.566]
64 We used a half-sinusoid lasting 125ms with the peak positioned at 300ms after trial-onset and a trial-to-trial Gaussian temporal jitter with standard deviation of 10ms. [sent-259, score-0.189]
65 The second dipole’s component simulates ERS/ERD in the frequency band of 8Hz to 30Hz. [sent-260, score-0.229]
66 A variable signal in this frequency band was generated by bandpass filtering an uncorrelated Gaussian process. [sent-261, score-0.293]
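This simulated ERS/ERD source can be sketched as follows (sampling rate, duration, and filter order are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Band-limited oscillatory activity: bandpass-filter an uncorrelated
# Gaussian process to the 8-30 Hz band.
fs = 100.0
rng = np.random.default_rng(2)
white = rng.standard_normal(500)                   # 5 s of white noise
b, a = butter(4, [8.0, 30.0], btype="bandpass", fs=fs)
oscillatory = filtfilt(b, a, white)                # zero-phase filtering
```

Zero-phase filtering (filtfilt) is used here only so the simulated source has no filter-induced delay; the resulting signal carries essentially all of its power inside the 8-30 Hz band.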
67 The third dipole was used to generate noise in the source space representing brain activity that is not related to the evoked/induced activity. [sent-262, score-0.348]
68 The SNR of the ERP component is in the range of -33dB to -13dB, and in the range of -22dB to -10dB for the oscillatory component. [sent-271, score-0.145]
69 As a decomposition method, SOBDA extracts spatial, temporal and frequency components. [sent-317, score-0.265]
70 The first row shows the extracted temporal component v and the frequency component d. [sent-320, score-0.39]
71 2 We can see that the method extracted a temporal component with a peak at 300ms which is exactly the signal used in the simulation data design. [sent-321, score-0.333]
72 Similarly, the frequency band extracted shows a higher amplitude in the range of 8Hz-30Hz which is the band used to generate the oscillatory component. [sent-322, score-0.508]
73 The spatial components extracted and the corresponding dipole used in the model generation are shown in rows two and three in the figure. [sent-323, score-0.513]
74 The last column of the figure captures the second-order oscillatory component and the dipole of the rank one noise. [sent-325, score-0.317]
75 Top row: Extracted temporal weight of linear term (left) and frequency weights of quadratic term (right). [sent-335, score-0.265]
76 Specifically, we evaluated the SOBDA algorithm on a simulated data set using the process described above, but this time we initialized matrix B to a high-pass filter with cut-off frequency at 1 Hz. [sent-343, score-0.149]
77 Figure 3 shows the temporal and frequency component obtained from SOBDA. [sent-345, score-0.297]
78 As it is evident from the figure, the resulting frequency component has higher weights for frequencies in the band 8Hz-30Hz, which is the band used to generate the power component in the simulated data. [sent-346, score-0.425]
79 Thus the proposed method is able to optimize the frequency band even in cases where we use a generic initialization of the matrix B. [sent-347, score-0.197]
80 (Figure panels: first-order temporal coefficient v; magnitude of FFT coefficients.) [sent-348, score-0.125]
81 The Fourier coefficients were initialized to a high-pass filter with cut-off frequency at 1 Hz. Left figure: extracted temporal weight of linear term. [sent-361, score-0.265]
82 Specifically, we are looking for one ERP component associated with the readiness potential, that is, the slow increase in amplitude before an actual hand movement. [sent-376, score-0.115]
83 In the case of the second-order term involving the parameter A we set K = 2 because we are interested in finding the modulation of oscillatory activity associated with the different movements of the hands. [sent-377, score-0.254]
84 Hands and fingers are represented in somato-sensory cortex covering different areas and will hence modulate activity in distinct spatial profiles. [sent-378, score-0.464]
85 The temporal filter was selected based on prior knowledge of the relevant frequency band. [sent-381, score-0.265]
86 We note that in all three cases the extracted components follow the general shape of the pre-motor or readiness potential (a. [sent-450, score-0.128]
87 In addition, for two of the data sets, the frequency weightings suggest that alpha band activity also provides discriminant information for this task. [sent-454, score-0.428]
88 This finding is consistent with the changes in the µ rhythm— that is, alpha-band activity localized over the motor cortex and associated with motor planning and execution. [sent-455, score-0.317]
89 Specifically, in the current experimental paradigm, we are looking for one ERP component associated with the readiness potential, that is, the slow increase in amplitude before an actual hand movement. [sent-533, score-0.115]
90 In the case of the second-order term involving the parameter A we set K = 2 because we are interested in finding two components corresponding to the two different spatial profiles of the two classes. [sent-535, score-0.316]
91 The parametrization… (Figure panels: temporal coefficient V; magnitude of FFT coefficients.) [sent-551, score-0.125]
92 (Figure panels: temporal coefficient V; magnitude of FFT coefficients; time axis in ms.) [sent-564, score-0.125]
93 Left: Temporal weights of linear component (first column) and frequency weights of quadratic component (second column). [sent-569, score-0.172]
94 Right: Spatial weights of linear component (third column) and two spatial weights for second-order spatial components (fourth and fifth column). [sent-570, score-0.631]
95 The gradient with respect to vr, the rth column of V, is: ∂{f(Xn; θ) + w0}/∂vr = C ∂{Trace(U⊤ Xn V)}/∂vr = C ∂{∑_{r′=1}^R u_{r′}⊤ Xn v_{r′}}/∂vr = C ur⊤ Xn. [sent-589, score-0.129]
96 Classifying single trial EEG: Towards brain computer interu facing. [sent-608, score-0.122]
97 The bci competition 2003: progress and perspectives in detection and discrimination of EEG single trials. [sent-648, score-0.106]
98 Combined optimization u of spatial and temporal filters for improving brain-computer interfacing. [sent-657, score-0.44]
99 Linear spatial integration for single-trial detection in encephalogra phy. [sent-745, score-0.283]
100 Optimal spatial filtering of single trial EEG u during imagined hand movement. [sent-804, score-0.379]
wordName wordTfidf (topN-words)
[('eeg', 0.428), ('sobda', 0.306), ('spatial', 0.283), ('snr', 0.211), ('erp', 0.204), ('parra', 0.192), ('bilinear', 0.175), ('temporal', 0.157), ('activity', 0.141), ('csp', 0.136), ('dipole', 0.136), ('aralick', 0.113), ('blankertz', 0.113), ('hristoforou', 0.113), ('ilinear', 0.113), ('oscillatory', 0.113), ('rder', 0.113), ('frequency', 0.108), ('db', 0.105), ('bdca', 0.102), ('iscriminant', 0.097), ('mlr', 0.091), ('yn', 0.09), ('discriminant', 0.09), ('vr', 0.09), ('band', 0.089), ('ajda', 0.087), ('econd', 0.087), ('toeplitz', 0.087), ('tomioka', 0.079), ('dyrholm', 0.079), ('nalysis', 0.071), ('brain', 0.071), ('erd', 0.07), ('hz', 0.068), ('motor', 0.068), ('coefficient', 0.068), ('bci', 0.068), ('xn', 0.067), ('ur', 0.065), ('gerson', 0.061), ('extracted', 0.061), ('fh', 0.058), ('birbaumer', 0.057), ('coefficients', 0.057), ('fft', 0.057), ('ramoser', 0.057), ('ar', 0.054), ('lter', 0.052), ('stimulus', 0.052), ('signal', 0.051), ('fourier', 0.051), ('trial', 0.051), ('smlr', 0.048), ('amplitude', 0.048), ('trials', 0.047), ('br', 0.047), ('aihara', 0.045), ('bandpass', 0.045), ('curio', 0.045), ('evoked', 0.045), ('imagined', 0.045), ('magnitute', 0.045), ('pfurtscheller', 0.045), ('wolpaw', 0.045), ('philiastides', 0.044), ('sajda', 0.044), ('trace', 0.043), ('signals', 0.043), ('simulated', 0.041), ('cortex', 0.04), ('christoforou', 0.039), ('rth', 0.039), ('competition', 0.038), ('captures', 0.036), ('luo', 0.035), ('channels', 0.035), ('guration', 0.035), ('power', 0.034), ('columns', 0.034), ('amplitute', 0.034), ('componet', 0.034), ('dornhege', 0.034), ('haralick', 0.034), ('lemm', 0.034), ('readiness', 0.034), ('components', 0.033), ('component', 0.032), ('peak', 0.032), ('logistic', 0.031), ('potentials', 0.031), ('smoothness', 0.03), ('df', 0.029), ('window', 0.027), ('ms', 0.026), ('movement', 0.026), ('crossvalidation', 0.026), ('sensor', 0.026), ('neuroimage', 0.024), ('fmri', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
2 0.086133383 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
3 0.056947071 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
Author: Valero Laparra, Juan Gutiérrez, Gustavo Camps-Valls, Jesús Malo
Abstract: A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding its straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. In the proposed scheme, training considers minimizing the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions (or histograms) of signal and noise in order to enforce similarity up to the higher (computationally estimable) order. Due to its non-parametric nature, the method can eventually cope with different noise sources without the need of an explicit re-formulation, as it is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence, (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian, and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered. 
Therefore, the proposed machine learning approach can be seen as a mor
4 0.056109536 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
5 0.040524159 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
6 0.040407561 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
7 0.038267463 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
8 0.038112298 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
9 0.034450993 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.031048547 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
11 0.030330135 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
12 0.030267948 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
13 0.029753778 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
14 0.029463395 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
15 0.028535502 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.027630256 48 jmlr-2010-How to Explain Individual Classification Decisions
17 0.026956528 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
18 0.025212457 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
19 0.024014326 22 jmlr-2010-Classification Using Geometric Level Sets
20 0.023538701 61 jmlr-2010-Learning From Crowds
topicId topicWeight
[(0, -0.127), (1, 0.0), (2, 0.011), (3, 0.015), (4, -0.044), (5, 0.059), (6, 0.081), (7, -0.053), (8, -0.019), (9, -0.014), (10, 0.01), (11, 0.035), (12, -0.046), (13, -0.065), (14, 0.001), (15, -0.047), (16, 0.131), (17, 0.109), (18, -0.081), (19, -0.112), (20, -0.049), (21, 0.021), (22, 0.016), (23, 0.026), (24, 0.074), (25, -0.002), (26, -0.063), (27, -0.007), (28, -0.035), (29, 0.133), (30, 0.106), (31, 0.359), (32, -0.34), (33, -0.066), (34, 0.152), (35, 0.155), (36, 0.053), (37, -0.073), (38, -0.177), (39, 0.027), (40, 0.122), (41, 0.034), (42, -0.155), (43, -0.035), (44, 0.084), (45, 0.104), (46, -0.134), (47, -0.099), (48, 0.038), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.93330568 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
2 0.42495647 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
Author: Valero Laparra, Juan Gutiérrez, Gustavo Camps-Valls, Jesús Malo
Abstract: A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding their straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. In the proposed scheme, training considers minimizing the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions (or histograms) of signal and noise in order to enforce similarity up to the higher (computationally estimable) order. Due to its non-parametric nature, the method can eventually cope with different noise sources without the need for an explicit re-formulation, as is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence, (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian, and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered.
Therefore, the proposed machine learning approach can be seen as a mor
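The core idea of the abstract above, a non-parametric regressor whose kernel encodes expected signal relations while regularization discards what does not fit them, can be sketched with kernel ridge regression standing in for SVR and raw sample positions standing in for the wavelet domain (both are simplifying assumptions for brevity, not the paper's method):

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Pairwise squared distances between 1-D sample locations.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-gamma * d2)

def kernel_smooth(x, y_noisy, gamma=50.0, alpha=0.1):
    """Kernel ridge regression of the noisy signal on sample position.

    The kernel plays the role of the signal-relations prior; the ridge
    term removes what does not fit those relations (the noise)."""
    K = rbf_kernel(x, x, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(x)), y_noisy)
    return K @ coef

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
clean = np.sin(2 * np.pi * x)
noisy = clean + 0.3 * rng.standard_normal(x.size)
denoised = kernel_smooth(x, noisy)
```

Swapping in SVR's epsilon-insensitive loss and the paper's mutual-information-derived anisotropic kernel is what distinguishes the actual method from this toy.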
3 0.40322793 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization confers attractive properties on the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm comprises a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton's method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale. © 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda.
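The shrinkage iteration underlying the fixed point continuation phase can be sketched as plain proximal gradient (ISTA) for ℓ1-regularized logistic regression; the continuation schedule and the interior-point phase of HIS are omitted, and the toy data, step size, and λ are illustrative choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1; it produces exact zeros,
    which is where the sparsity comes from."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(X, y, w, lam):
    z = X @ w
    # Mean logistic loss (labels y in {0, 1}) plus the l1 penalty.
    return np.mean(np.log1p(np.exp(z)) - y * z) + lam * np.abs(w).sum()

def sparse_logreg_ista(X, y, lam=0.1, step=0.1, iters=500):
    """Plain proximal-gradient (ISTA) iteration: a gradient step on the
    smooth logistic loss followed by soft thresholding."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / n             # gradient of mean logistic loss
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -2.0, 1.5]                # only 3 relevant features
y = (rng.random(50) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
w_hat = sparse_logreg_ista(X, y)
```

Matrix-vector products are the only expensive operations here, which is what makes this phase memory friendly on large-scale data.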
4 0.39617491 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well-known methods that consider dimensionality reduction and classification jointly. While these methods are widely deployed in practical problems, unresolved issues remain surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
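The FDA-ridge relationship the abstract describes can be checked numerically in the two-class case: ridge regression of centered inputs onto ±1 labels recovers, up to a positive scale factor, the regularized FDA direction. The data dimensions and λ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d = 40, 30, 5
X1 = rng.standard_normal((n1, d)) + 1.0   # class 1, shifted mean
X2 = rng.standard_normal((n2, d))         # class 2
lam = 0.5

# Regularized FDA direction: (S_w + lam*I)^{-1} (mu1 - mu2),
# where S_w is the within-class scatter matrix.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w_fda = np.linalg.solve(Sw + lam * np.eye(d), mu1 - mu2)

# Ridge regression of +/-1 class labels on globally centered inputs.
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(n1), -np.ones(n2)])
Xc = X - X.mean(axis=0)
w_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)

# The two directions coincide up to positive scaling (a consequence of
# the total scatter being S_w plus a rank-one term along mu1 - mu2).
cos = w_fda @ w_ridge / (np.linalg.norm(w_fda) * np.linalg.norm(w_ridge))
```

This is the binary special case only; the paper develops the relationship in full generality, including the kernel and pseudoinverse variants.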
5 0.33936444 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case in which the model includes the true posterior and the case in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
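The "thresholded ML" baseline the abstract analyzes amounts to the standard Bayes decision rule under asymmetric costs: with `cost_fp` the cost of a false positive and `cost_fn` the cost of a false negative, predict the positive class when P(y=1|x) exceeds cost_fp / (cost_fp + cost_fn). A minimal sketch (the argument names are ours):

```python
def bayes_threshold(cost_fp, cost_fn):
    """Risk-minimizing decision threshold on P(y=1|x) under asymmetric
    costs; with equal costs this reduces to the usual 0.5 threshold."""
    return cost_fp / (cost_fp + cost_fn)

def decide(p, cost_fp=1.0, cost_fn=1.0):
    """Thresholded decision on a posterior estimate p = P(y=1|x)."""
    return 1 if p > bayes_threshold(cost_fp, cost_fn) else 0
```

The paper's point is that this rule is only asymptotically optimal when the model set contains the true posterior; under misspecification, reweighting the likelihood itself is preferable to thresholding its estimate.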
6 0.31026012 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
7 0.26162842 77 jmlr-2010-Model-based Boosting 2.0
9 0.21469106 37 jmlr-2010-Evolving Static Representations for Task Transfer
10 0.17824623 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
11 0.1712807 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
12 0.16622598 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
13 0.165012 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
14 0.16324452 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
15 0.16226491 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
16 0.16193834 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
17 0.15772083 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.15177554 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
19 0.15078768 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
20 0.13636981 61 jmlr-2010-Learning From Crowds
topicId topicWeight
[(3, 0.015), (4, 0.011), (5, 0.448), (8, 0.014), (21, 0.014), (32, 0.096), (33, 0.014), (36, 0.038), (37, 0.048), (75, 0.123), (81, 0.015), (85, 0.047), (96, 0.018), (97, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.69262064 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
2 0.64815706 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. The critic uses linear function approximation to estimate the value function and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to the optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
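A tabular toy of the "same TD signal, same time scale" idea: one update in which a single temporal difference drives both the critic's value estimate and the actor's action preference. The tables, step size, and the simplified actor update (the softmax gradient factor is dropped) are illustrative assumptions; the paper works with linear function approximation and a convergence proof this sketch does not attempt:

```python
def td_error(r, v_s, v_next, gamma=0.9):
    """Temporal difference signal shared by actor and critic."""
    return r + gamma * v_next - v_s

def actor_critic_step(V, theta, s, a, r, s_next, alpha=0.05, gamma=0.9):
    """One update in which the SAME td signal and the SAME step size
    drive both the critic (value table V) and the actor (preference
    table theta), i.e. a single time scale."""
    delta = td_error(r, V[s], V[s_next], gamma)
    V[s] += alpha * delta            # critic update
    theta[(s, a)] += alpha * delta   # actor update (policy-gradient-style
                                     # nudge; softmax grad factor omitted)
    return delta
```

In two-time-scale schemes `alpha` would differ by orders of magnitude between the two updates; here it is deliberately shared.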
3 0.34597325 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
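The preselection step can be sketched as follows: score each candidate hidden cause cheaply for a data point, keep the top K, and restrict the E-step's sum to binary states supported on the selected set. Function names and the scoring input are placeholders, not the paper's notation:

```python
import numpy as np
from itertools import combinations

def preselect(scores, K):
    """Keep the K candidate hidden causes that score highest for a data
    point (the scores would come from a cheap per-cause relevance
    measure)."""
    return sorted(np.argsort(scores)[-K:].tolist())

def truncated_support(selected, max_active):
    """All binary hidden states whose active units lie in the selected
    set, with at most max_active units on: the truncated state space
    over which the approximate sufficient statistics are summed."""
    states = [frozenset()]
    for k in range(1, max_active + 1):
        states += [frozenset(c) for c in combinations(selected, k)]
    return states
```

The cost saving is combinatorial: the full E-step sums over 2^H hidden states, while the truncated sum is over at most sum_k C(K, k) states with K much smaller than H.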
4 0.34443688 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
5 0.3420215 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights.
The analysis of co-clustering easily
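As a minimal illustration of co-clustering-based prediction of missing entries (not the paper's PAC-Bayes algorithm or its iterative projections), one can predict a missing matrix entry by the mean of the observed entries in its co-cluster block, given fixed row and column cluster assignments:

```python
import numpy as np

def coclust_predict(M, row_c, col_c, i, j):
    """Predict entry (i, j) as the mean of the observed entries in its
    co-cluster block; NaN marks missing entries. Cluster assignments
    row_c and col_c are assumed given (e.g. from a learned model)."""
    rows = [r for r in range(M.shape[0]) if row_c[r] == row_c[i]]
    cols = [c for c in range(M.shape[1]) if col_c[c] == col_c[j]]
    block = M[np.ix_(rows, cols)]
    return np.nanmean(block)
```

The PAC-Bayes analysis adds what this toy lacks: a principled trade-off between fitting the observed block structure and the mutual information the cluster variables retain about row and column IDs.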
6 0.33953986 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.33778149 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.33760905 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
9 0.33563417 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
10 0.33415416 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.33353952 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
12 0.3334623 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.33235762 109 jmlr-2010-Stochastic Composite Likelihood
14 0.33180463 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
15 0.33167559 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
16 0.33129326 102 jmlr-2010-Semi-Supervised Novelty Detection
17 0.33020949 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
18 0.32990861 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
19 0.32981488 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
20 0.32922542 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices