jmlr jmlr2010 jmlr2010-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
Reference: text
sentIndex sentText sentNum sentScore
1 Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. [sent-13, score-0.615]
2 Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface 1. [sent-16, score-0.246]
3 Introduction The work presented in this paper is motivated by the analysis of functional brain imaging signals recorded via electroencephalography (EEG). [sent-17, score-0.114]
4 Many of these classification techniques were originally developed in the context of Brain Computer Interfaces (BCI) but are now more widely used to interpret activity associated with neural processing. [sent-24, score-0.141]
5 , 2002, 2003) the aim is to decode brain activity on a single-trial basis in order to provide a direct control pathway between a user’s intentions and a computer. [sent-28, score-0.212]
6 By extracting activity that differs maximally between two experimental conditions, the typically low signal-to-noise ratio of EEG can be overcome. [sent-35, score-0.141]
7 The typical approach in this case is to average the measured potentials following stimulus presentation, thereby canceling uncorrelated noise that is not reproducible from one trial to the next. [sent-44, score-0.134]
8 This averaged activity, called an event related potential (ERP), captures activity that is time-locked to the stimulus presentation but cancels induced oscillatory activity that is not locked in phase to the timing of the stimulus. [sent-45, score-0.483]
9 Alternatively, many studies compute the oscillatory activity in specific frequency bands by filtering and squaring the signal prior to averaging. [sent-46, score-0.413]
10 Induced changes in oscillatory activity are termed event related synchronization or desynchronization (ERS/ERD) (Pfurtscheller and da Silva, 1999). [sent-47, score-0.254]
11 First order methods include temporal filtering and thresholding (Birbaumer et al. [sent-49, score-0.157]
12 , 2007) and the well known common spatial patterns method (CSP) (Ramoser et al. [sent-56, score-0.283]
13 , 2005), and common sparse spectral spatial patterns (CSSSP) (Dornhege et al. [sent-58, score-0.283]
14 Instead, it would be desirable for the analysis method to extract the relevant neurophysiological activity de novo with minimal prior expectations. [sent-62, score-0.141]
15 Through a bilinear formulation, the method can simultaneously identify spatial linear components as well as temporal modulation of activity. [sent-64, score-0.648]
16 Further, through the bilinear formulation, the method exploits the spatio-temporal nature of the EEG signals and provides a reduced parametrization of the high dimensional data space. [sent-66, score-0.218]
17 We show that a broad set of state-of-the-art EEG analysis methods can be characterized as special cases under this bilinear framework. [sent-67, score-0.175]
18 We then present the bilinear model, discuss interpretation in the context of EEG, and establish a link to current analysis methods. [sent-73, score-0.175]
19 The goal will be to use the N examples to optimize these parameters such that the discriminant function takes on positive values for examples with yn = +1 and negative values for yn = −1. [sent-85, score-0.27]
20 3 Interpretation and Rationale of the Model The discriminant criterion defined in (1) is the sum of a linear and a quadratic term, each combining bilinear components of the EEG signal. [sent-90, score-0.298]
21 By re-writing the term as: Trace(U⊤ XV) = Trace(VU⊤ X) = Trace(W⊤ X) , where we defined, W = UV⊤ , it is easy to see that the bilinear projection is a linear combination of elements of X with a rank-R constraint on W. [sent-94, score-0.175]
22 In particular, the polarity of the signal (positive evoked response versus negative evoked response) will contribute to discrimination if it is consistent across trials. [sent-96, score-0.141]
23 This bilinear projection reduces the number of model parameters of W from D × T dimensions to R × (D + T ) which is a significant dimensionality reduction that alleviates the problem of over-fitting in parameters estimation given the small training set size. [sent-98, score-0.175]
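A minimal numerical sketch of this identity and of the parameter reduction (the sizes D, T, R below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sizes: D channels, T time samples, rank R.
D, T, R = 32, 100, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((D, T))   # one EEG trial
U = rng.standard_normal((D, R))   # spatial components
V = rng.standard_normal((T, R))   # temporal components

# Bilinear form Trace(U^T X V) equals the linear form Trace(W^T X)
# with W = U V^T, a D x T matrix of rank at most R.
W = U @ V.T
assert np.isclose(np.trace(U.T @ X @ V), np.trace(W.T @ X))

# Parameter count drops from D*T to R*(D+T).
print(D * T, R * (D + T))
```

With these sizes the count drops from 3200 free parameters for an unconstrained W to 396 for the rank-3 factorization, which is the over-fitting argument made above.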
24 The second term of Equation (1) is the power of spatially and temporally weighted signals and thus captures the second-order statistics of the signal. [sent-101, score-0.113]
25 7), then B defines a temporal filter on the signal and the model captures powers of the filtered signal. [sent-106, score-0.244]
26 Further, by allowing B to be learned from the data, we may be able to identify new frequency bands that have so far not been identified in novel experimental paradigms. [sent-107, score-0.108]
27 The spatial weights A together with the Trace operation ensure that the power is measured, not in individual electrodes, but in some component space that may reflect activity distributed across several electrodes. [sent-108, score-0.49]
28 The diagonal matrix Λ partitions the K spatial components (i. [sent-109, score-0.316]
29 , K columns of A) into those that contribute power positively and those that contribute power negatively to the total sum. [sent-111, score-0.102]
30 Since each column of A measures the power from different sources, then by multiplying the expression with Λ we capture the difference in power between different spatial components. [sent-112, score-0.351]
31 It is known that imagining a movement of the left hand reduces oscillatory activity over the motor cortex of the right hemisphere, while an imagined right-hand movement reduces oscillations over the left motor cortex. [sent-114, score-0.527]
32 Each of these cortical areas will be captured by a different spatial distribution in the EEG. [sent-115, score-0.283]
33 If we limit the columns of A to two, then these columns may capture the power of oscillatory activity over the right and left motor cortex respectively. [sent-116, score-0.464]
34 C = 1 indicates that the discriminant activity is dominated by the first-order features; C = 0 indicates that the activity is dominated by second-order features, and any value in between denotes the importance of one component versus the other. [sent-120, score-0.404]
, ur and ar map the sensor signal to a current-source space, vr are temporal weights on a source signal and br can be arranged to represent a temporal filter) it becomes intuitive for the experimenter to incorporate prior knowledge of an experimental setup in the model. [sent-124, score-0.698]
36 If the signal of interest is known to be in a specific frequency band, one can fix matrix B to capture only the desired frequency band. [sent-125, score-0.267]
37 For example, B can be fixed to a Toeplitz matrix with coefficients corresponding to an 8Hz-12Hz bandpass filter, then the second-order term is able to extract power in the alpha-band which is known to be modulated during motor related tasks. [sent-126, score-0.147]
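A rough sketch of this fixed-B variant (sampling rate, filter length, and data sizes are illustrative assumptions, not the paper's configuration):

```python
import numpy as np
from scipy.signal import firwin
from scipy.linalg import toeplitz

# Hypothetical setup: D channels, T samples at fs Hz, K = 2 spatial components.
fs, D, T, K = 100, 32, 200, 2

# FIR bandpass taps for the 8-12 Hz alpha band.
h = firwin(31, [8.0, 12.0], pass_zero=False, fs=fs)

# Toeplitz convolution matrix: each column of B applies the filter
# at one time shift, so X @ B is the bandpass-filtered trial.
col = np.r_[h, np.zeros(T - len(h))]
row = np.r_[h[0], np.zeros(T - 1)]
B = toeplitz(col, row)

rng = np.random.default_rng(1)
X = rng.standard_normal((D, T))      # one trial
A = rng.standard_normal((D, K))      # K spatial components
Lam = np.diag([1.0, -1.0])           # difference of band powers (Lambda)

# Second-order term: Trace(Lambda A^T X B B^T X^T A), the difference in
# alpha-band power between the two spatial components.
second_order = np.trace(Lam @ A.T @ X @ B @ B.T @ X.T @ A)
```

The diagonal of Λ set to [1, −1] realizes the power-difference interpretation discussed earlier: one component contributes power positively, the other negatively.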
38 In such a scenario the experimenter can fix the temporal profile parameter V to emphasize time samples around the expected location of the peak activity and optimize over the rest of the parameters. [sent-128, score-0.33]
39 fMRI has high spatial resolution and can provide locations within the brain that may be known to participate in the processing during a particular experimental paradigm. [sent-130, score-0.354]
40 This location information can be incorporated into the present model by fixing the spatial parameters ur and ar to reflect a localized source (often approximated as a current dipole). [sent-131, score-0.348]
41 The remaining temporal parameters of the model can then be optimized. [sent-132, score-0.157]
42 The following list identifies some of the algorithms and how they relate to the model used in the SOBDA framework: • Set C = 1, R = 1 and choose temporal component v to select a time window of interest (i. [sent-135, score-0.216]
43 (2002, 2005) • Set C = 1 and select some R > 1 and choose the component vectors vr to select multiple time windows of interest as in 1. [sent-140, score-0.122]
44 Learn for each temporal window the corresponding spatial vector ur from examples separately and then combine these components by learning a linear combination of the elements. [sent-141, score-0.565]
45 (2008) • Set C = 1, R = D while constraining U to be a diagonal matrix and select, separately for each channel, the time window vr which is most discriminative. [sent-144, score-0.117]
46 Then train the diagonal terms of U, resulting in a latency-dependent spatial filter (Luo and Sajda, 2006a). [sent-145, score-0.283]
47 Alternatively, in the first step, use feature selection to find the right set of time windows vr simultaneously for all channels (Luo and Sajda, 2006b). [sent-146, score-0.125]
48 • Set C = 1,R = 1 and learn the spatial and temporal components u, v simultaneously. [sent-147, score-0.473]
49 This reduces to the rank-one bilinear discriminant as in Dyrholm and Parra (2006). • Select C = 1 and some R > 1 and learn all columns of the spatial and temporal projection matrices U and V simultaneously. [sent-148, score-0.739]
50 • Set C = 0, K = 2 and fix B to a Toeplitz structure encoding a specific frequency band and set the diagonal of Λ to be [1 − 1]. [sent-151, score-0.197]
51 With this definition, the discriminant criterion given by the log-odds ratio of the posterior class probabilities, log[P(y = +1|X)/P(y = −1|X)] = f(X; θ), is simply the discriminant function which we chose to define in (1) as a sum of linear and quadratic terms. [sent-164, score-0.18]
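A minimal sketch of this logistic view of the discriminant: the log-odds of y = +1 is the weighted sum of the bilinear (first-order) and quadratic (second-order) terms. All shapes and the value of C below are illustrative assumptions:

```python
import numpy as np

def discriminant(X, U, V, A, Lam, B, C, w0=0.0):
    """f(X; theta): weighted sum of first- and second-order terms, Eq. (1)."""
    linear = np.trace(U.T @ X @ V)                            # first-order
    quadratic = np.trace(Lam @ A.T @ X @ B @ B.T @ X.T @ A)   # second-order
    return C * linear + (1.0 - C) * quadratic + w0

def p_positive(X, *theta):
    """Posterior P(y = +1 | X) under the logistic link."""
    return 1.0 / (1.0 + np.exp(-discriminant(X, *theta)))
```

By construction, log P(y = +1|X) − log P(y = −1|X) recovers f(X; θ), which is the log-odds identity stated above.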
52 In such a case we can re-write Equations (3) and (4) as ∂L(θ)/∂ar = 2(1 − C) ∑n yn πn Xn FH D DH F Xn⊤ ar. [sent-170, score-0.144]
53 Spatial and temporal smoothness is typically a valid assumption in EEG (Penny et al. [sent-183, score-0.187]
54 This is done through choice of covariance function: Let r be a spatial or temporal measure in context of X. [sent-188, score-0.44]
55 For instance r is a measure of spatial distance between data acquisition sensors, or a measure of time difference between two samples in the data. [sent-189, score-0.283]
56 Similarly, the Gaussian prior can be used on the columns of the temporal matrix V (i. [sent-195, score-0.191]
57 Following Rasmussen and Williams (2005) the shape parameter was chosen to be ν = 100 for the spatial components and ν = 2. [sent-204, score-0.316]
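A sketch of such a smoothness prior: a squared-exponential covariance K(r) = exp(−r²/(2ν²)) over the pairwise distance measure r, used as a Gaussian prior on the columns of U (spatial) or V (temporal). The sampling grid and the value of ν below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def se_prior_cov(r, nu):
    """Squared-exponential prior covariance from pairwise distances r."""
    return np.exp(-(r ** 2) / (2.0 * nu ** 2))

t = np.arange(100) / 100.0                 # 100 samples over one second
r = np.abs(t[:, None] - t[None, :])        # pairwise time differences
K_temporal = se_prior_cov(r, nu=0.05)      # nearby samples strongly coupled
```

The same construction applies spatially, with r replaced by inter-electrode distances and a correspondingly larger shape parameter ν.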
58 However, for the spectral parameter we found it important to initialize it to a frequency band that was expected to carry useful information, for example, 8Hz-30Hz. [sent-220, score-0.197]
59 The simulation aims to quantify the algorithm’s performance on a broad spectrum of conditions and various noise levels, as well as to compare the extracted spatial, temporal and frequency components with ground truth. [sent-225, score-0.359]
60 We used two spatial patterns (SP) and employed a logistic regression classifier on the resulting SP. [sent-239, score-0.314]
61 Since CSP, MLR and sMLR require the data to be band-pass filtered to the frequency of interest, data sets were filtered in the range of 8Hz-30Hz for these methods. [sent-243, score-0.108]
62 For the spatial parameter A we set K = 2 allowing for two spatial patterns, while we enforce a Toeplitz structure on B. [sent-245, score-0.566]
63 We used 3 dipoles at three different locations, with one dipole used to generate evoked response activity, one dipole to generate induced oscillatory activity, and one dipole to generate unrelated noise/interference. [sent-257, score-0.566]
64 We used a half-sinusoid lasting 125ms with the peak positioned at 300ms after trial-onset and a trial-to-trial Gaussian temporal jitter with standard deviation of 10ms. [sent-259, score-0.189]
65 The second dipole’s component simulates ERS/ERD in the frequency band of 8Hz to 30Hz. [sent-260, score-0.229]
66 A variable signal in this frequency band was generated by bandpass filtering an uncorrelated Gaussian process. [sent-261, score-0.293]
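This simulated ERS/ERD source can be sketched as follows (sampling rate, duration, and filter order are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Band-limited oscillatory activity: bandpass-filter an uncorrelated
# Gaussian process to the 8-30 Hz band.
fs = 100.0
rng = np.random.default_rng(2)
white = rng.standard_normal(500)                   # 5 s of white noise
b, a = butter(4, [8.0, 30.0], btype="bandpass", fs=fs)
oscillatory = filtfilt(b, a, white)                # zero-phase filtering
```

Zero-phase filtering (filtfilt) is used here only so the simulated source has no filter-induced delay; the resulting signal carries essentially all of its power inside the 8-30 Hz band.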
67 The third dipole was used to generate noise in the source space representing brain activity that is not related to the evoked/induced activity. [sent-262, score-0.348]
68 The SNR of the ERP component is in the range of -33dB to -13dB, and in the range of -22dB to -10dB for the oscillatory component. [sent-271, score-0.145]
69 As a decomposition method, SOBDA extracts spatial, temporal and frequency components. [sent-317, score-0.265]
70 The first row shows the extracted temporal component v and the frequency component d. [sent-320, score-0.39]
71 2 We can see that the method extracted a temporal component with a peak at 300ms which is exactly the signal used in the simulation data design. [sent-321, score-0.333]
72 Similarly, the frequency band extracted shows a higher amplitude in the range of 8Hz-30Hz which is the band used to generate the oscillatory component. [sent-322, score-0.508]
73 The spatial components extracted and the corresponding dipole used in the model generation are shown in rows two and three in the figure. [sent-323, score-0.513]
74 The last column of the figure captures the second-order oscillatory component and the dipole of the rank one noise. [sent-325, score-0.317]
75 Top row: Extracted temporal weight of linear term (left) and frequency weights of quadratic term (right). [sent-335, score-0.265]
76 Specifically, we evaluated the SOBDA algorithm on a simulated data set using the process described above, but this time we initialized matrix B to a high-pass filter with cut-off frequency at 1 Hz. [sent-343, score-0.149]
77 Figure 3 shows the temporal and frequency component obtained from SOBDA. [sent-345, score-0.297]
78 As it is evident from the figure, the resulting frequency component has higher weights for frequencies in the band 8Hz-30Hz, which is the band used to generate the power component in the simulated data. [sent-346, score-0.425]
79 Thus the proposed method is able to optimize the frequency band even in cases where we use a generic initialization of the matrix B. [sent-347, score-0.197]
80 (Figure panels: first-order temporal coefficient v; magnitude of FFT coefficients.) [sent-348, score-0.125]
81 The Fourier coefficients were initialized to a high-pass filter with cut-off frequency at 1 Hz. Left figure: extracted temporal weight of linear term. [sent-361, score-0.265]
82 Specifically, we are looking for one ERP component associated with the readiness potential, that is, the slow increase in amplitude before an actual hand movement. [sent-376, score-0.115]
83 In the case of the second-order term involving the parameter A we set K = 2 because we are interested in finding the modulation of oscillatory activity associated with the different movements of the hands. [sent-377, score-0.254]
84 Hands and fingers are represented in somato-sensory cortex covering different areas and will hence modulate activity in distinct spatial profiles. [sent-378, score-0.464]
85 The temporal filter was selected based on prior knowledge of the relevant frequency band. [sent-381, score-0.265]
86 We note that in all three cases the extracted components follow the general shape of the pre-motor or readiness potential (a. [sent-450, score-0.128]
87 In addition, for two of the data sets, the frequency weightings suggest that alpha band activity also provides discriminant information for this task. [sent-454, score-0.428]
88 This finding is consistent with the changes in the µ rhythm— that is, alpha-band activity localized over the motor cortex and associated with motor planning and execution. [sent-455, score-0.317]
89 Specifically, in the current experimental paradigm, we are looking for one ERP component associated with the readiness potential, that is, the slow increase in amplitude before an actual hand movement. [sent-533, score-0.115]
90 In the case of the second-order term involving the parameter A we set K = 2 because we are interested in finding two components corresponding to the two different spatial profiles of the two classes. [sent-535, score-0.316]
91 The parametrization… (Figure panels: temporal coefficient V; magnitude of FFT coefficients.) [sent-551, score-0.125]
92 (Figure panels: temporal coefficient V; magnitude of FFT coefficients; time axis in ms.) [sent-564, score-0.125]
93 Left: Temporal weights of linear component (first column) and frequency weights of quadratic component (second column). [sent-569, score-0.172]
94 Right: Spatial weights of linear component (third column) and two spatial weights for second-order spatial components (fourth and fifth column). [sent-570, score-0.631]
95 The gradient with respect to vr, the rth column of V, is: ∂{f(Xn; θ) + w0}/∂vr = C ∂{Trace(U⊤ Xn V)}/∂vr = C ∂{∑_{r′=1}^R u_{r′}⊤ Xn v_{r′}}/∂vr = C ur⊤ Xn. [sent-589, score-0.129]
96 Classifying single trial EEG: Towards brain computer interu facing. [sent-608, score-0.122]
97 The bci competition 2003: progress and perspectives in detection and discrimination of EEG single trials. [sent-648, score-0.106]
98 Combined optimization u of spatial and temporal filters for improving brain-computer interfacing. [sent-657, score-0.44]
99 Linear spatial integration for single-trial detection in encephalogra phy. [sent-745, score-0.283]
100 Optimal spatial filtering of single trial EEG u during imagined hand movement. [sent-804, score-0.379]
wordName wordTfidf (topN-words)
[('eeg', 0.428), ('sobda', 0.306), ('spatial', 0.283), ('snr', 0.211), ('erp', 0.204), ('parra', 0.192), ('bilinear', 0.175), ('temporal', 0.157), ('activity', 0.141), ('csp', 0.136), ('dipole', 0.136), ('aralick', 0.113), ('blankertz', 0.113), ('hristoforou', 0.113), ('ilinear', 0.113), ('oscillatory', 0.113), ('rder', 0.113), ('frequency', 0.108), ('db', 0.105), ('bdca', 0.102), ('iscriminant', 0.097), ('mlr', 0.091), ('yn', 0.09), ('discriminant', 0.09), ('vr', 0.09), ('band', 0.089), ('ajda', 0.087), ('econd', 0.087), ('toeplitz', 0.087), ('tomioka', 0.079), ('dyrholm', 0.079), ('nalysis', 0.071), ('brain', 0.071), ('erd', 0.07), ('hz', 0.068), ('motor', 0.068), ('coefficient', 0.068), ('bci', 0.068), ('xn', 0.067), ('ur', 0.065), ('gerson', 0.061), ('extracted', 0.061), ('fh', 0.058), ('birbaumer', 0.057), ('coefficients', 0.057), ('fft', 0.057), ('ramoser', 0.057), ('ar', 0.054), ('lter', 0.052), ('stimulus', 0.052), ('signal', 0.051), ('fourier', 0.051), ('trial', 0.051), ('smlr', 0.048), ('amplitude', 0.048), ('trials', 0.047), ('br', 0.047), ('aihara', 0.045), ('bandpass', 0.045), ('curio', 0.045), ('evoked', 0.045), ('imagined', 0.045), ('magnitute', 0.045), ('pfurtscheller', 0.045), ('wolpaw', 0.045), ('philiastides', 0.044), ('sajda', 0.044), ('trace', 0.043), ('signals', 0.043), ('simulated', 0.041), ('cortex', 0.04), ('christoforou', 0.039), ('rth', 0.039), ('competition', 0.038), ('captures', 0.036), ('luo', 0.035), ('channels', 0.035), ('guration', 0.035), ('power', 0.034), ('columns', 0.034), ('amplitute', 0.034), ('componet', 0.034), ('dornhege', 0.034), ('haralick', 0.034), ('lemm', 0.034), ('readiness', 0.034), ('components', 0.033), ('component', 0.032), ('peak', 0.032), ('logistic', 0.031), ('potentials', 0.031), ('smoothness', 0.03), ('df', 0.029), ('window', 0.027), ('ms', 0.026), ('movement', 0.026), ('crossvalidation', 0.026), ('sensor', 0.026), ('neuroimage', 0.024), ('fmri', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
2 0.086133383 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
3 0.056947071 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
Author: Valero Laparra, Juan Gutiérrez, Gustavo Camps-Valls, Jesús Malo
Abstract: A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding its straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. In the proposed scheme, training considers minimizing the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions (or histograms) of signal and noise in order to enforce similarity up to the higher (computationally estimable) order. Due to its non-parametric nature, the method can eventually cope with different noise sources without the need of an explicit re-formulation, as it is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence, (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian, and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered. 
Therefore, the proposed machine learning approach can be seen as a mor
4 0.056109536 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
5 0.040524159 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
6 0.040407561 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
7 0.038267463 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
8 0.038112298 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
9 0.034450993 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.031048547 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
11 0.030330135 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
12 0.030267948 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
13 0.029753778 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
14 0.029463395 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
15 0.028535502 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.027630256 48 jmlr-2010-How to Explain Individual Classification Decisions
17 0.026956528 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
18 0.025212457 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
19 0.024014326 22 jmlr-2010-Classification Using Geometric Level Sets
20 0.023538701 61 jmlr-2010-Learning From Crowds
topicId topicWeight
[(0, -0.127), (1, 0.0), (2, 0.011), (3, 0.015), (4, -0.044), (5, 0.059), (6, 0.081), (7, -0.053), (8, -0.019), (9, -0.014), (10, 0.01), (11, 0.035), (12, -0.046), (13, -0.065), (14, 0.001), (15, -0.047), (16, 0.131), (17, 0.109), (18, -0.081), (19, -0.112), (20, -0.049), (21, 0.021), (22, 0.016), (23, 0.026), (24, 0.074), (25, -0.002), (26, -0.063), (27, -0.007), (28, -0.035), (29, 0.133), (30, 0.106), (31, 0.359), (32, -0.34), (33, -0.066), (34, 0.152), (35, 0.155), (36, 0.053), (37, -0.073), (38, -0.177), (39, 0.027), (40, 0.122), (41, 0.034), (42, -0.155), (43, -0.035), (44, 0.084), (45, 0.104), (46, -0.134), (47, -0.099), (48, 0.038), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.93330568 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the-art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
2 0.42495647 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
Author: Valero Laparra, Juan Gutiérrez, Gustavo Camps-Valls, Jesús Malo
Abstract: A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding their straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations in the estimated signal. Since relations among the coefficients are specific to the signal, the regularization property of SVR is exploited to remove the noise, which does not share this feature. The specific signal relations are encoded in an anisotropic kernel obtained from mutual information measures computed on a representative image database. In the proposed scheme, training considers minimizing the Kullback-Leibler divergence (KLD) between the estimated and actual probability functions (or histograms) of signal and noise in order to enforce similarity up to the higher (computationally estimable) order. Due to its non-parametric nature, the method can eventually cope with different noise sources without the need for an explicit re-formulation, as is strictly necessary under parametric Bayesian formalisms. Results under several noise levels and noise sources show that: (1) the proposed method outperforms conventional wavelet methods that assume coefficient independence, (2) it is similar to state-of-the-art methods that do explicitly include these relations when the noise source is Gaussian, and (3) it gives better numerical and visual performance when more complex, realistic noise sources are considered.
Therefore, the proposed machine learning approach can be seen as a mor
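The core idea of the abstract above, a non-parametric regressor whose kernel encodes expected signal relations while regularization discards what does not fit them, can be sketched with kernel ridge regression standing in for SVR and raw sample positions standing in for the wavelet domain (both are simplifying assumptions for brevity, not the paper's method):

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Pairwise squared distances between 1-D sample locations.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-gamma * d2)

def kernel_smooth(x, y_noisy, gamma=50.0, alpha=0.1):
    """Kernel ridge regression of the noisy signal on sample position.

    The kernel plays the role of the signal-relations prior; the ridge
    term removes what does not fit those relations (the noise)."""
    K = rbf_kernel(x, x, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(x)), y_noisy)
    return K @ coef

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
clean = np.sin(2 * np.pi * x)
noisy = clean + 0.3 * rng.standard_normal(x.size)
denoised = kernel_smooth(x, noisy)
```

Swapping in SVR's epsilon-insensitive loss and the paper's mutual-information-derived anisotropic kernel is what distinguishes the actual method from this toy.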
3 0.40322793 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization confers attractive properties on the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm comprises a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton's method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale. © 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda.
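The shrinkage iteration underlying the fixed point continuation phase can be sketched as plain proximal gradient (ISTA) for ℓ1-regularized logistic regression; the continuation schedule and the interior-point phase of HIS are omitted, and the toy data, step size, and λ are illustrative choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1; it produces exact zeros,
    which is where the sparsity comes from."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(X, y, w, lam):
    z = X @ w
    # Mean logistic loss (labels y in {0, 1}) plus the l1 penalty.
    return np.mean(np.log1p(np.exp(z)) - y * z) + lam * np.abs(w).sum()

def sparse_logreg_ista(X, y, lam=0.1, step=0.1, iters=500):
    """Plain proximal-gradient (ISTA) iteration: a gradient step on the
    smooth logistic loss followed by soft thresholding."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / n             # gradient of mean logistic loss
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -2.0, 1.5]                # only 3 relevant features
y = (rng.random(50) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
w_hat = sparse_logreg_ista(X, y)
```

Matrix-vector products are the only expensive operations here, which is what makes this phase memory friendly on large-scale data.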
4 0.39617491 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well-known methods that consider dimensionality reduction and classification jointly. While these methods are widely deployed in practical problems, unresolved issues remain surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
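The FDA-ridge relationship the abstract describes can be checked numerically in the two-class case: ridge regression of centered inputs onto ±1 labels recovers, up to a positive scale factor, the regularized FDA direction. The data dimensions and λ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d = 40, 30, 5
X1 = rng.standard_normal((n1, d)) + 1.0   # class 1, shifted mean
X2 = rng.standard_normal((n2, d))         # class 2
lam = 0.5

# Regularized FDA direction: (S_w + lam*I)^{-1} (mu1 - mu2),
# where S_w is the within-class scatter matrix.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w_fda = np.linalg.solve(Sw + lam * np.eye(d), mu1 - mu2)

# Ridge regression of +/-1 class labels on globally centered inputs.
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(n1), -np.ones(n2)])
Xc = X - X.mean(axis=0)
w_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)

# The two directions coincide up to positive scaling (a consequence of
# the total scatter being S_w plus a rank-one term along mu1 - mu2).
cos = w_fda @ w_ridge / (np.linalg.norm(w_fda) * np.linalg.norm(w_ridge))
```

This is the binary special case only; the paper develops the relationship in full generality, including the kernel and pseudoinverse variants.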
5 0.33936444 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case in which the model includes the true posterior and the case in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
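The "thresholded ML" baseline the abstract analyzes amounts to the standard Bayes decision rule under asymmetric costs: with `cost_fp` the cost of a false positive and `cost_fn` the cost of a false negative, predict the positive class when P(y=1|x) exceeds cost_fp / (cost_fp + cost_fn). A minimal sketch (the argument names are ours):

```python
def bayes_threshold(cost_fp, cost_fn):
    """Risk-minimizing decision threshold on P(y=1|x) under asymmetric
    costs; with equal costs this reduces to the usual 0.5 threshold."""
    return cost_fp / (cost_fp + cost_fn)

def decide(p, cost_fp=1.0, cost_fn=1.0):
    """Thresholded decision on a posterior estimate p = P(y=1|x)."""
    return 1 if p > bayes_threshold(cost_fp, cost_fn) else 0
```

The paper's point is that this rule is only asymptotically optimal when the model set contains the true posterior; under misspecification, reweighting the likelihood itself is preferable to thresholding its estimate.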
6 0.31026012 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
7 0.26162842 77 jmlr-2010-Model-based Boosting 2.0
9 0.21469106 37 jmlr-2010-Evolving Static Representations for Task Transfer
10 0.17824623 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
11 0.1712807 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
12 0.16622598 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
13 0.165012 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
14 0.16324452 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
15 0.16226491 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
16 0.16193834 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
17 0.15772083 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.15177554 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
19 0.15078768 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
20 0.13636981 61 jmlr-2010-Learning From Crowds
topicId topicWeight
[(3, 0.015), (4, 0.011), (5, 0.448), (8, 0.014), (21, 0.014), (32, 0.096), (33, 0.014), (36, 0.038), (37, 0.048), (75, 0.123), (81, 0.015), (85, 0.047), (96, 0.018), (97, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.69262064 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
2 0.64815706 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. The critic uses linear function approximation to estimate the value function and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to the optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
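A tabular toy of the "same TD signal, same time scale" idea: one update in which a single temporal difference drives both the critic's value estimate and the actor's action preference. The tables, step size, and the simplified actor update (the softmax gradient factor is dropped) are illustrative assumptions; the paper works with linear function approximation and a convergence proof this sketch does not attempt:

```python
def td_error(r, v_s, v_next, gamma=0.9):
    """Temporal difference signal shared by actor and critic."""
    return r + gamma * v_next - v_s

def actor_critic_step(V, theta, s, a, r, s_next, alpha=0.05, gamma=0.9):
    """One update in which the SAME td signal and the SAME step size
    drive both the critic (value table V) and the actor (preference
    table theta), i.e. a single time scale."""
    delta = td_error(r, V[s], V[s_next], gamma)
    V[s] += alpha * delta            # critic update
    theta[(s, a)] += alpha * delta   # actor update (policy-gradient-style
                                     # nudge; softmax grad factor omitted)
    return delta
```

In two-time-scale schemes `alpha` would differ by orders of magnitude between the two updates; here it is deliberately shared.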
3 0.34597325 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
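The preselection step can be sketched as follows: score each candidate hidden cause cheaply for a data point, keep the top K, and restrict the E-step's sum to binary states supported on the selected set. Function names and the scoring input are placeholders, not the paper's notation:

```python
import numpy as np
from itertools import combinations

def preselect(scores, K):
    """Keep the K candidate hidden causes that score highest for a data
    point (the scores would come from a cheap per-cause relevance
    measure)."""
    return sorted(np.argsort(scores)[-K:].tolist())

def truncated_support(selected, max_active):
    """All binary hidden states whose active units lie in the selected
    set, with at most max_active units on: the truncated state space
    over which the approximate sufficient statistics are summed."""
    states = [frozenset()]
    for k in range(1, max_active + 1):
        states += [frozenset(c) for c in combinations(selected, k)]
    return states
```

The cost saving is combinatorial: the full E-step sums over 2^H hidden states, while the truncated sum is over at most sum_k C(K, k) states with K much smaller than H.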
4 0.34443688 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
5 0.3420215 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights.
The analysis of co-clustering easily
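As a minimal illustration of co-clustering-based prediction of missing entries (not the paper's PAC-Bayes algorithm or its iterative projections), one can predict a missing matrix entry by the mean of the observed entries in its co-cluster block, given fixed row and column cluster assignments:

```python
import numpy as np

def coclust_predict(M, row_c, col_c, i, j):
    """Predict entry (i, j) as the mean of the observed entries in its
    co-cluster block; NaN marks missing entries. Cluster assignments
    row_c and col_c are assumed given (e.g. from a learned model)."""
    rows = [r for r in range(M.shape[0]) if row_c[r] == row_c[i]]
    cols = [c for c in range(M.shape[1]) if col_c[c] == col_c[j]]
    block = M[np.ix_(rows, cols)]
    return np.nanmean(block)
```

The PAC-Bayes analysis adds what this toy lacks: a principled trade-off between fitting the observed block structure and the mutual information the cluster variables retain about row and column IDs.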
6 0.33953986 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.33778149 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.33760905 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
9 0.33563417 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
10 0.33415416 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.33353952 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
12 0.3334623 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.33235762 109 jmlr-2010-Stochastic Composite Likelihood
14 0.33180463 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
15 0.33167559 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
16 0.33129326 102 jmlr-2010-Semi-Supervised Novelty Detection
17 0.33020949 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
18 0.32990861 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
19 0.32981488 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
20 0.32922542 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices