nips nips2001 nips2001-50 knowledge-graph by maker-knowledge-mining

50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing


Source: pdf

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variability).

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i. [sent-9, score-0.132]

2 , systems that enable human subjects to control a computer only by means of their brain signals. [sent-11, score-0.243]

3 In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. [sent-12, score-0.259]

4 This can be done on average 100–230 ms before the respective key is actually pressed, i. [sent-13, score-0.186]

5 1 Introduction The online analysis of single-trial electroencephalogram (EEG) measurements is a challenge for signal processing and machine learning. [sent-18, score-0.068]

6 Once the high inter-trial variability (see Figure 1) of this complex multivariate signal can be reliably processed, the next logical step is to make use of the brain activities for real-time control of, e. [sent-19, score-0.168]

7 In this work we study a pseudo-online evaluation of single-trial EEGs from voluntary self-paced finger movements and exploit the laterality of the left/right hand signal as one bit of information for later control. [sent-22, score-0.207]

8 Two key issues to start with when conceiving a BCI are (1) the definition of a behavioral context in which a subject’s brain signals will be monitored and used eventually as surrogate for a bodily, e. [sent-31, score-0.234]

9 , manual, input of computer commands, and (2) the choice of brain signals which are optimally capable to convey the subject’s intention to the computer. [sent-33, score-0.234]

10 Concerning the behavioral context, typewriting on a computer keyboard is a highly overlearned motor competence. [sent-34, score-0.331]

11 Accordingly, a natural first choice is a BCI-situation which induces the subject to arrive at a particular decision that is coupled to a predefined (learned) motor output. [sent-35, score-0.274]

12 This approach is well known as a two alternative forced choice-reaction task (2AFC) where one out of two stimuli (visual, auditory or somatosensory) has to be detected, categorised and responded to by issuing one out of two alternative motor commands, e. [sent-36, score-0.217]

13 , pushing a button with either the left or right hand. [sent-38, score-0.046]

14 A task variant without explicit sensory input is the voluntary, endogenous generation of a ›go‹ command involving the deliberate choice between the two possible motor outputs at a self-paced rate. [sent-39, score-0.368]

15 Using multi-channel EEG-mapping it has been repeatedly demonstrated that several highly localised brain areas contribute to cerebral motor command processes. [sent-42, score-0.469]

16 Specifically, a negative ›Bereitschaftspotential‹ (BP) precedes the voluntary initiation of the movement. [sent-43, score-0.094]

17 Because one potential BCI application is with paralysed patients, one might consider mimicking the ›no-motor-output‹ of these individuals by having healthy experimental subjects intend a movement but withhold its execution (motor imagery). [sent-45, score-0.474]

18 While it is true that brain potentials comparable to BP are associated with an imagination of hand movements, which indeed is consistent with the assumption that the primary motor cortex is active with motor imagery, actual motor performance significantly increased these potentials [10]. [sent-46, score-0.993]

19 investigate slow cortical potentials (SCP) and how they can be selfregulated in a feedback scenario. [sent-55, score-0.102]

20 In their thought translation device [2] patients learn to produce cortical negativity or positivity at a central scalp location at will, which is fed back to the user. [sent-56, score-0.192]

21 After some training patients are able to transmit binary decisions in a 4 sec periodicity with accuracy levels up to 85% and therewith control a language support program or an internet browser. [sent-57, score-0.12]

22 built a BCI system based on event-related (de-)synchronisation (ERD/ERS, typically of the µ and central β rhythm) for online classification of movement imaginations or preparations into 2–4 classes (e. [sent-59, score-0.424]

23 Typical preprocessing techniques are adaptive autoregressive parameters, common spatial patterns (after band pass filtering) and band power in subject specific frequency bands. [sent-62, score-0.215]

24 Classification is done by Fisher discriminant analysis, multi-layer neural networks or LVQ variants. [sent-63, score-0.07]

25 In classification of exogenous movement preparations, rates of 98%, 96% and 75% (for the three subjects, respectively) are obtained before movement onset in a 3-class task with trials of 8 sec [3]. [sent-64, score-0.874]

26 Only selected, artifact-free trials (less than 40%) were used. [sent-65, score-0.096]

27 study EEG-based cursor control [4], translating the power in subject specific frequency bands, or autoregressive parameters, from two spatially filtered scalp locations over sensorimotor cortex into vertical cursor movement. [sent-68, score-0.411]

28 Users initially gain control by various kinds of motor imagery (the setting favours ›movement‹ vs. [sent-69, score-0.361]

29 In cursor control trials of at least 4 sec duration trained subjects reach accuracies of over 90%. [sent-72, score-0.356]

30 Some subjects also acquired considerable control in a 2-d setup. [sent-73, score-0.111]

31 3 Acquisition and preprocessing of brain signals Experimental setup. [sent-74, score-0.314]

32 The subject sat in a normal chair, relaxed arms resting on the table, fingers in the standard typing position at the computer keyboard. [sent-75, score-0.132]

33 The task was to press with the index and little fingers the corresponding keys in a self-chosen order and timing (›self-paced key typing‹). [sent-76, score-0.052]

34 Typing of a total of 516 keystrokes was done at an average speed of 1 key every 2. [sent-79, score-0.065]

35 Brain activity was measured with 27 Ag/AgCl electrodes at positions of the extended international 10-20 system, 21 mounted over motor and somatosensory cortex, 5 frontal and one occipital, referenced to nasion (sampled at 1000 Hz, band-pass filtered 0. [sent-81, score-0.288]

36 In an event channel the timing of keystrokes was stored along with the EEG signal. [sent-84, score-0.15]

37 All data were recorded with a NeuroScan device and converted to Matlab format for further analysis. [sent-85, score-0.039]

38 The signals were downsampled to 100 Hz by picking every 10th sample. [sent-86, score-0.102]
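
The decimation described here is a one-line operation; a minimal sketch in Python/NumPy, assuming the recording is stored as a (samples × channels) array and taking the sentence literally (no extra anti-aliasing step beyond the hardware band-pass filter):

```python
import numpy as np

def downsample(eeg: np.ndarray, factor: int = 10) -> np.ndarray:
    """Reduce 1000 Hz recordings to 100 Hz by keeping every `factor`-th sample.

    `eeg` is assumed to have shape (n_samples, n_channels); this layout is an
    assumption, not something specified in the summary above.
    """
    return eeg[::factor, :]
```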

39 In a moderate rejection we sorted out only 3 out of 516 trials due to heavy measurement artifacts, while keeping trials that are contaminated by less serious artifacts or eye blinks. [sent-87, score-0.24]

40 3 values (marked by circles) of smoothed signals are taken as features in each channel. [sent-96, score-0.102]

41 Figure 2: Sparse Fisher Discriminant Analysis selected 68 features (shaded) from 405 input dimensions (27 channels × 15 samples [150 ms]) of raw EEG data. [sent-97, score-0.156]
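
For the raw-data case quoted in the Figure 2 caption, a feature vector is just the concatenation of 15 consecutive samples (150 ms at 100 Hz) from each of the 27 channels. A small illustrative sketch, with the (samples, channels) layout as an assumption:

```python
import numpy as np

def raw_feature_vector(epoch: np.ndarray) -> np.ndarray:
    """Flatten a 150 ms raw EEG epoch into one feature vector.

    epoch: array of shape (15, 27) -- 15 samples at 100 Hz by 27 channels.
    Returns a vector with 15 * 27 = 405 entries, matching the 405 input
    dimensions mentioned for Figure 2.
    """
    assert epoch.shape == (15, 27)
    return epoch.reshape(-1)
```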

42 We investigated the alternatives of taking all 27 channels, or only the 21 located over motor and sensorimotor cortex. [sent-100, score-0.254]

43 The 6 frontal and occipital channels are expected not to give strong contributions to the classification task. [sent-101, score-0.134]

44 Hence a comparison shows whether a classifier is disturbed by low-information channels or whether it even manages to extract information from them. [sent-102, score-0.091]

45 Figure 1 depicts two single trial EEG signals at scalp location C3 for right finger movements. [sent-103, score-0.371]

46 These two single trials are very well-shaped and were selected for resembling the grand average over all 241 right finger movements, which is drawn as a thick line. [sent-104, score-0.142]

47 Usually the BP of a single trial is much more obscured by non task-related brain activity and noise. [sent-105, score-0.247]

48 The goal of preprocessing is to reveal task-related components to a degree that they can be detected by a classifier. [sent-106, score-0.08]

49 Figure 1 also shows the feature vectors due to preprocessing (<5 Hz), calculated from the depicted raw single-trial signals. [sent-107, score-0.26]
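
The <5 Hz preprocessing amounts to low-pass smoothing of each channel before the feature values are read off. The summary does not specify the filter, so the sketch below uses a zero-phase Butterworth filter purely as a stand-in; filter family and order are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_below_5hz(signal: np.ndarray, fs: float = 100.0) -> np.ndarray:
    """Zero-phase low-pass filtering (<5 Hz) of a continuous 100 Hz recording.

    signal: (n_samples, n_channels). A 4th-order Butterworth filter is an
    arbitrary choice for illustration; apply it to the continuous signal,
    then cut out the epochs and pick the feature values per channel.
    """
    b, a = butter(4, 5.0 / (fs / 2.0), btype="low")
    return filtfilt(b, a, signal, axis=0)
```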

50 4 From response-aligned to online classification We investigate some linear classification methods. [sent-108, score-0.068]

51 If no a priori knowledge on the probability distribution of the data is available, a typical objective is to minimize a combination of an empirical risk function and some regularization term that restrains the algorithm from overfitting to the training set $\{(\mathbf{x}_k, y_k) \mid k = 1, \dots, K\}$. [sent-110, score-0.059]

52 Taking a soft margin loss function [11] yields the empirical risk function $\sum_{k=1}^{K} \max(0,\, 1 - y_k(\mathbf{w}^\top \mathbf{x}_k + b))$. [sent-114, score-0.113]
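
As a concrete reading of that objective, the following sketch evaluates the regularized soft-margin risk for a linear classifier; the weighting between the two terms (the factor C/K) is taken from the formulations further below and should be treated as an assumption here:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 + (C/K) * sum_k max(0, 1 - y_k(w'x_k + b)) for labels
    y in {-1, +1}; X holds one example per row."""
    K = len(y)
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return 0.5 * float(w @ w) + (C / K) * hinge
```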

53 Fisher Discriminant (FD) is a well known classification method, in which a projection vector is determined to maximize the distance between the projected means of the two classes while minimizing the variance of the projected data within each class [13]. [sent-116, score-0.072]

54 Regularized Fisher Discriminant (RFD) can be obtained via a mathematical programming approach [14]: $\min_{\mathbf{w},b,\boldsymbol{\xi}} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert_2^2 + \tfrac{C}{K}\lVert\boldsymbol{\xi}\rVert_2^2$ subject to $y_k(\mathbf{w}^\top \mathbf{x}_k + b) = 1 - \xi_k$ for $k = 1, \dots, K$. [sent-118, score-0.17]
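
Because the constraints are equalities, the RFD program can be reduced to a regularized least-squares fit onto the ±1 labels (substitute ξ_k = y_k − (wᵀx_k + b), using y_k² = 1). A minimal sketch of that reduction, assuming labels in {−1, +1}; this is one convenient way to solve the program, not necessarily the solver used by the authors:

```python
import numpy as np

def train_rfd(X, y, C=1.0):
    """Regularized Fisher Discriminant via regularized least squares.

    Minimizes 0.5*||w||^2 + (C/K)*||xi||^2 with y_k(w'x_k + b) = 1 - xi_k,
    which is equivalent to ridge regression of the labels y in {-1, +1}.
    """
    K, d = X.shape
    A = np.hstack([X, np.ones((K, 1))])        # last column carries the bias b
    reg = np.eye(d + 1) * (K / (2.0 * C))      # balances the two objective terms
    reg[-1, -1] = 0.0                          # do not regularize the bias
    theta = np.linalg.solve(A.T @ A + reg, A.T @ np.asarray(y, dtype=float))
    return theta[:-1], theta[-1]               # w, b
```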

55 Table 3 header: filter ›<5 Hz / <5 Hz / none / none‹, ch’s ›mc / all / mc / all‹; first classifier row FD, first entry 3.8. [sent-122, score-0.196]

56 Table 3: Test set error (± std) for classification at 120 ms before keystroke; ›mc‹ refers to the 21 channels over (sensori) motor cortex, ›all‹ refers to all 27 channels. [sent-162, score-0.494]

57 The constraint $y_k(\mathbf{w}^\top \mathbf{x}_k + b) = 1 - \xi_k$ ensures that the class means are projected to the corresponding class labels, i. [sent-164, score-0.149]

58 This choice favours solutions with sparse vectors w, so that this method also yields some feature selection (in input space). [sent-173, score-0.08]

59 When applied to our raw EEG signals SFD selects 68 out of 405 input dimensions that allow for a left vs. [sent-174, score-0.167]

60 , high loadings for electrodes close to left and right hemisphere motor cortices which increase prior to the keystroke, cf. [sent-178, score-0.337]

61 Here we only consider linear SVMs: $\min_{\mathbf{w},b,\boldsymbol{\xi}} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert_2^2 + \tfrac{C}{K}\lVert\boldsymbol{\xi}\rVert_1$ subject to $y_k(\mathbf{w}^\top \mathbf{x}_k + b) \ge 1 - \xi_k$ and $\xi_k \ge 0$. The choice of regularization keeps a bound on the Vapnik-Chervonenkis dimension small. [sent-183, score-0.17]
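
In practice a linear SVM of this form can be trained with any off-the-shelf solver; the sketch below uses scikit-learn with hypothetical trial data (the array sizes and the value of C are placeholders, not settings from the paper):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical data: one feature vector per trial (e.g. 513 trials x 405 raw
# dimensions) with left/right labels in {-1, +1}; random stand-ins here.
X = np.random.randn(513, 405)
y = np.where(np.random.rand(513) < 0.5, -1, 1)

clf = LinearSVC(C=1.0, loss="hinge")   # linear SVM with the hinge loss above
clf.fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]
```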

62 The value of k chosen by model selection was around 15 for processed and around 25 for unprocessed data. [sent-187, score-0.074]

63 In the first phase we make full use of the information that we have regarding the timing of the keystrokes. [sent-189, score-0.052]

64 For each single trial we calculate a feature vector as described above with respect to a fixed timing relative to the key trigger (›response-aligned‹). [sent-190, score-0.167]

65 Table 3 reports the mean error on test sets in a 10×10-fold cross-validation for classifying into ›left‹ and ›right‹ at 120 ms prior to keypress. [sent-191, score-0.186]
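
A 10×10-fold cross-validation simply repeats a 10-fold split with ten different random partitions and averages the test error over the 100 resulting splits. A generic sketch; the train/predict callables stand for any of the classifiers above:

```python
import numpy as np

def ten_by_ten_fold_error(X, y, train_fn, predict_fn, seed=0):
    """Mean and std of the test error over 10 repetitions of 10-fold CV."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(10):
        folds = np.array_split(rng.permutation(len(y)), 10)
        for k in range(10):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(10) if j != k])
            model = train_fn(X[train], y[train])
            errors.append(np.mean(predict_fn(model, X[test]) != y[test]))
    return float(np.mean(errors)), float(np.std(errors))
```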

66 It is more or less at chance level up to 120 ms before the keystroke. [sent-194, score-0.186]

67 Based upon this observation we chose t =−120 ms for investigating EEG-based classification. [sent-196, score-0.186]

68 The right panel gives a detailed view: −230 to 50 ms. [sent-205, score-0.046]

69 For the classification of raw data the error is roughly twice as high. [sent-207, score-0.065]

70 The concept of seeking sparse solution vectors allows SFD to cope very well with the high dimensional raw data. [sent-208, score-0.065]

71 So the SFD approach may be highly useful for online situations, when no precursory experiments are available for tuning the preprocessing. [sent-210, score-0.102]

72 The comparison of EEG- and EMG-based classification in Figure 4 demonstrates the rapid response capability of our system: 230 ms before the actual keystroke the classification rate exceeds 90%. [sent-211, score-0.251]

73 To assess this result it has to be recalled that movements were performed spontaneously. [sent-212, score-0.113]

74 The second phase is an important step towards online classification of endogenous brain signals. [sent-220, score-0.265]

75 We have to refrain from using event timing information (e. [sent-221, score-0.052]

76 Accordingly, classification has to be performed in sliding windows and the classifier does not know in what time relation the given signals are to the event—maybe there is even no event. [sent-224, score-0.208]

77 But in practice this is very likely to lead to unreliable results since those classifiers are highly specialized to signals that have a certain time relation to the response. [sent-226, score-0.136]

78 The typical way to make classification more robust to time shifted signals is jittered training. [sent-228, score-0.102]

79 In our case we used 4 windows for each trial, ending at -240, -160, -80 and 0 ms relative to the response (i. [sent-229, score-0.236]
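
Jittered training then amounts to cutting several training windows per trial at fixed offsets before the keypress. A sketch of that windowing, assuming 100 Hz data in (samples × channels) layout and 150 ms windows:

```python
import numpy as np

def jittered_windows(trial, keypress_idx, ends_ms=(-240, -160, -80, 0),
                     win_ms=150, fs=100.0):
    """Return one window per offset, each win_ms long and ending at the given
    time (in ms) relative to the keypress sample index."""
    win = int(round(win_ms * fs / 1000.0))                 # 15 samples
    windows = []
    for end_ms in ends_ms:
        end = keypress_idx + int(round(end_ms * fs / 1000.0))
        windows.append(trial[end - win:end, :])
    return windows
```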

80 Detecting upcoming events is a crucial point in online analysis of brain signals in an unforced condition. [sent-233, score-0.397]

81 To accomplish this, we employ a second classifier that distinguishes movement events from the ›rest‹. [sent-234, score-0.374]

82 For Figure 5, a classifier was trained as described above and subsequently applied to windows sliding over unseen test samples yielding ›traces‹ of graded classifier outputs. [sent-236, score-0.17]
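
Computing such traces is a matter of sliding a window over the continuous test recording and evaluating the trained linear classifier on each window. A sketch, with window length, step size and the flattened-window features as assumptions:

```python
import numpy as np

def classifier_trace(cont_eeg, w, b, win=15, step=1):
    """Graded output w'x + b of a trained linear classifier for every window
    sliding over a continuous (n_samples, n_channels) recording."""
    outputs = []
    for start in range(0, cont_eeg.shape[0] - win + 1, step):
        x = cont_eeg[start:start + win, :].reshape(-1)
        outputs.append(float(x @ w + b))
    return np.asarray(outputs)
```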

83 Figure 5 panel labels: ›right‹, ›left‹. [sent-241, score-0.046]

84 Figure 6: Graded classifier output for movement detection in endogenous brain signals. [sent-243, score-0.445]

85 At t =−100 ms the median for right events in Figure 5 is approximately 0. [sent-245, score-0.329]

86 , applying the classifier to right events from the test set yielded in 50% of the cases an output greater than 0. [sent-248, score-0.107]

87 25 which means that the output to 90% of the right events was greater than 0. [sent-252, score-0.107]

88 The second classifier (Figure 6) was trained for class ›movement‹ on all trials with jitters as described above and for class ›rest‹ in multiple windows between the keystrokes. [sent-254, score-0.146]

89 The preprocessing and classification procedure was the same as for left vs. [sent-255, score-0.08]

90 The classifier in Figure 5 shows a pronounced separation during the movement (preparation and execution) period. [sent-257, score-0.347]

91 From Figure 5 we observe that the left/right classifier alone does not distinguish reliably between ›movement‹ and ›no movement‹ by the magnitude of its output, which explains the need for a movement detector. [sent-259, score-0.313]

92 2 for right events) which is probably due to the fact that the subject is right-handed. [sent-263, score-0.103]

93 The movement detector in Figure 6 brings up the movement phase while giving (mainly) negative output to the post movement period. [sent-264, score-0.939]

94 5 Concluding discussion We gave an outline of our BCI system in the experimental context of voluntary self-paced movements. [sent-268, score-0.094]

95 Our approach has the potential for high bit rates, since (1) it works at a high trial frequency, and (2) classification errors are very low. [sent-269, score-0.115]

96 , improvement can come from appropriate training schemes to shape the brain signals. [sent-272, score-0.132]

97 The two-stage process, first a meta-classification of whether a movement is about to take place and then a decision between left and right finger movement, is very natural and an important new feature of the proposed system. [sent-273, score-0.626]
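
Conceptually, the two stages combine into a simple gated decision rule; the sketch below is only an illustration of that idea, and the detector threshold as well as the sign convention of the left/right output are assumptions:

```python
def two_stage_decision(x, w_det, b_det, w_lr, b_lr, threshold=0.0):
    """Stage 1: movement detector; stage 2: left/right classifier, evaluated
    only when the detector fires. Inputs are NumPy arrays; all parameters
    are placeholders, not values from the paper."""
    if x @ w_det + b_det <= threshold:
        return None                      # no upcoming movement detected
    return "left" if x @ w_lr + b_lr < 0 else "right"
```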

98 6% of the trials due to artifacts, so our approach seems ideally suited for the true, highly noisy feedback BCI scenario. [sent-275, score-0.167]

99 Deecke, “Neuroimage of voluntary movement: topography of the Bereitschaftspotential, a 64-channel DC current source density study”, Neuroimage, 9(1): 124–134, 1999. [sent-375, score-0.094]

100 Brain potentials associated with imagination of hand movements”, Electroencephalogr. [sent-382, score-0.102]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('classi', 0.331), ('movement', 0.313), ('bci', 0.301), ('motor', 0.217), ('eeg', 0.21), ('nger', 0.194), ('ms', 0.186), ('cation', 0.156), ('sfd', 0.151), ('hz', 0.144), ('brain', 0.132), ('er', 0.125), ('trial', 0.115), ('movements', 0.113), ('scalp', 0.108), ('emg', 0.102), ('signals', 0.102), ('trials', 0.096), ('voluntary', 0.094), ('channels', 0.091), ('command', 0.086), ('preprocessing', 0.08), ('fisher', 0.076), ('subjects', 0.075), ('typing', 0.075), ('bp', 0.073), ('discriminant', 0.07), ('online', 0.068), ('potentials', 0.065), ('raw', 0.065), ('cursor', 0.065), ('endogeneous', 0.065), ('imagery', 0.065), ('keystroke', 0.065), ('keystrokes', 0.065), ('lang', 0.065), ('rfd', 0.065), ('graded', 0.064), ('events', 0.061), ('tsch', 0.06), ('yk', 0.059), ('subject', 0.057), ('mc', 0.057), ('sliding', 0.056), ('xk', 0.054), ('timing', 0.052), ('fd', 0.051), ('windows', 0.05), ('artifacts', 0.048), ('commands', 0.048), ('ller', 0.047), ('right', 0.046), ('patients', 0.045), ('accuracies', 0.045), ('ers', 0.045), ('bereitschaftspotential', 0.043), ('birbaumer', 0.043), ('deecke', 0.043), ('favours', 0.043), ('ilog', 0.043), ('lindinger', 0.043), ('neuroimage', 0.043), ('ngers', 0.043), ('occipital', 0.043), ('paralysed', 0.043), ('pfurtscheller', 0.043), ('preparations', 0.043), ('typewriting', 0.043), ('wolpaw', 0.043), ('cortex', 0.043), ('execution', 0.043), ('mika', 0.043), ('none', 0.041), ('preparation', 0.041), ('device', 0.039), ('sec', 0.039), ('band', 0.039), ('onset', 0.038), ('cortices', 0.037), ('imagination', 0.037), ('electrodes', 0.037), ('keyboard', 0.037), ('sensorimotor', 0.037), ('unprocessed', 0.037), ('feedback', 0.037), ('selection', 0.037), ('projected', 0.036), ('svms', 0.036), ('median', 0.036), ('control', 0.036), ('differentiation', 0.034), ('somatosensory', 0.034), ('pronounced', 0.034), ('sessions', 0.034), ('upcoming', 0.034), ('ltered', 0.034), ('percentile', 0.034), ('preprocessed', 0.034), ('highly', 0.034), ('channel', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999975 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variability).

2 0.22743773 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

Author: Michael Kositsky, Andrew G. Barto

Abstract: Tangential hand velocity profiles of rapid human arm movements often appear as sequences of several bell-shaped acceleration-deceleration phases called submovements or movement units. This suggests how the nervous system might efficiently control a motor plant in the presence of noise and feedback delay. Another critical observation is that stochasticity in a motor control problem makes the optimal control policy essentially different from the optimal control policy for the deterministic case. We use a simplified dynamic model of an arm and address rapid aimed arm movements. We use reinforcement learning as a tool to approximate the optimal policy in the presence of noise and feedback delay. Using a simplified model we show that multiple submovements emerge as an optimal policy in the presence of noise and feedback delay. The optimal policy in this situation is to drive the arm’s end point close to the target by one fast submovement and then apply a few slow submovements to accurately drive the arm’s end point into the target region. In our simulations, the controller sometimes generates corrective submovements before the initial fast submovement is completed, much like the predictive corrections observed in a number of psychophysical experiments.

3 0.20731124 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field

Author: O. Donchin, Reza Shadmehr

Abstract: Reaching movements require the brain to generate motor commands that rely on an internal model of the task’s dynamics. Here we consider the errors that subjects make early in their reaching trajectories to various targets as they learn an internal model. Using a framework from function approximation, we argue that the sequence of errors should reflect the process of gradient descent. If so, then the sequence of errors should obey hidden state transitions of a simple dynamical system. Fitting the system to human data, we find a surprisingly good fit accounting for 98% of the variance. This allows us to draw tentative conclusions about the basis elements used by the brain in transforming sensory space to motor commands. To test the robustness of the results, we estimate the shape of the basis elements under two conditions: in a traditional learning paradigm with a consistent force field, and in a random sequence of force fields where learning is not possible. Remarkably, we find that the basis remains invariant. 1

4 0.20423438 46 nips-2001-Categorization by Learning and Combining Object Parts

Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio

Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.

5 0.19657078 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.

6 0.16646616 60 nips-2001-Discriminative Direction for Kernel Classifiers

7 0.1571243 43 nips-2001-Bayesian time series classification

8 0.13765864 139 nips-2001-Online Learning with Kernels

9 0.13622423 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

10 0.13360605 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

11 0.13244584 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.12401985 144 nips-2001-Partially labeled classification with Markov random walks

13 0.12202999 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

14 0.11363334 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

15 0.11291362 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

16 0.11110738 105 nips-2001-Kernel Machines and Boolean Functions

17 0.10688245 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

18 0.099313967 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

19 0.085273743 8 nips-2001-A General Greedy Approximation Algorithm with Applications

20 0.085088432 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.272), (1, 0.024), (2, -0.116), (3, 0.249), (4, -0.101), (5, 0.096), (6, -0.083), (7, 0.072), (8, 0.224), (9, 0.059), (10, 0.012), (11, -0.172), (12, -0.302), (13, 0.02), (14, -0.052), (15, 0.074), (16, -0.164), (17, -0.02), (18, 0.118), (19, -0.11), (20, 0.002), (21, -0.08), (22, -0.059), (23, 0.003), (24, -0.077), (25, -0.11), (26, 0.013), (27, 0.098), (28, -0.075), (29, 0.046), (30, 0.04), (31, -0.048), (32, 0.04), (33, 0.016), (34, -0.028), (35, 0.042), (36, -0.065), (37, 0.065), (38, -0.0), (39, 0.037), (40, 0.069), (41, -0.032), (42, -0.006), (43, 0.036), (44, -0.049), (45, -0.04), (46, -0.046), (47, 0.018), (48, 0.03), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95276469 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variability).

2 0.70214027 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field

Author: O. Donchin, Reza Shadmehr

Abstract: Reaching movements require the brain to generate motor commands that rely on an internal model of the task’s dynamics. Here we consider the errors that subjects make early in their reaching trajectories to various targets as they learn an internal model. Using a framework from function approximation, we argue that the sequence of errors should reflect the process of gradient descent. If so, then the sequence of errors should obey hidden state transitions of a simple dynamical system. Fitting the system to human data, we find a surprisingly good fit accounting for 98% of the variance. This allows us to draw tentative conclusions about the basis elements used by the brain in transforming sensory space to motor commands. To test the robustness of the results, we estimate the shape of the basis elements under two conditions: in a traditional learning paradigm with a consistent force field, and in a random sequence of force fields where learning is not possible. Remarkably, we find that the basis remains invariant. 1

3 0.68409181 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

Author: Michael Kositsky, Andrew G. Barto

Abstract: Tangential hand velocity profiles of rapid human arm movements often appear as sequences of several bell-shaped acceleration-deceleration phases called submovements or movement units. This suggests how the nervous system might efficiently control a motor plant in the presence of noise and feedback delay. Another critical observation is that stochasticity in a motor control problem makes the optimal control policy essentially different from the optimal control policy for the deterministic case. We use a simplified dynamic model of an arm and address rapid aimed arm movements. We use reinforcement learning as a tool to approximate the optimal policy in the presence of noise and feedback delay. Using a simplified model we show that multiple submovements emerge as an optimal policy in the presence of noise and feedback delay. The optimal policy in this situation is to drive the arm’s end point close to the target by one fast submovement and then apply a few slow submovements to accurately drive the arm’s end point into the target region. In our simulations, the controller sometimes generates corrective submovements before the initial fast submovement is completed, much like the predictive corrections observed in a number of psychophysical experiments.

4 0.60331756 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

Author: A. D'avella, M. C. Tresch

Abstract: The question of whether the nervous system produces movement through the combination of a few discrete elements has long been central to the study of motor control. Muscle synergies, i.e. coordinated patterns of muscle activity, have been proposed as possible building blocks. Here we propose a model based on combinations of muscle synergies with a specific amplitude and temporal structure. Time-varying synergies provide a realistic basis for the decomposition of the complex patterns observed in natural behaviors. To extract time-varying synergies from simultaneous recording of EMG activity we developed an algorithm which extends existing non-negative matrix factorization techniques.

5 0.60203838 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz

Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider.

6 0.59280854 60 nips-2001-Discriminative Direction for Kernel Classifiers

7 0.59145164 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

8 0.53025609 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

9 0.52432495 46 nips-2001-Categorization by Learning and Combining Object Parts

10 0.51503265 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

11 0.46013206 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

12 0.45115423 43 nips-2001-Bayesian time series classification

13 0.43616852 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

14 0.40541634 144 nips-2001-Partially labeled classification with Markov random walks

15 0.38377723 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

16 0.37151229 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

17 0.33364978 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

18 0.33169335 139 nips-2001-Online Learning with Kernels

19 0.3311418 105 nips-2001-Kernel Machines and Boolean Functions

20 0.32816547 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.261), (14, 0.05), (17, 0.015), (19, 0.031), (20, 0.023), (27, 0.123), (30, 0.106), (36, 0.01), (38, 0.026), (59, 0.018), (72, 0.07), (79, 0.036), (83, 0.038), (91, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82909495 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).

2 0.77660841 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

Author: Si Wu, Shun-ichi Amari

Abstract: This study investigates a population decoding paradigm, in which the estimation of stimulus in the previous step is used as prior knowledge for consecutive decoding. We analyze the decoding accuracy of such a Bayesian decoder (Maximum a Posteriori Estimate), and show that it can be implemented by a biologically plausible recurrent network, where the prior knowledge of stimulus is conveyed by the change in recurrent interactions as a result of Hebbian learning.
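To make the decoding scheme described in this abstract concrete, here is a minimal sketch of MAP population decoding in which the previous estimate enters as a Gaussian prior. The Poisson spiking model, Gaussian tuning curves, and all parameter values are illustrative assumptions, not taken from the paper itself.

```python
import numpy as np

def map_decode(spike_counts, preferred, prev_estimate,
               sigma_tc=0.5, sigma_prior=0.3, gain=10.0):
    """Grid-search MAP estimate of a 1-D stimulus from Poisson spike counts,
    with Gaussian tuning curves and a Gaussian prior on the previous estimate."""
    grid = np.linspace(-np.pi, np.pi, 2001)
    # Expected firing rate of each neuron at each candidate stimulus value.
    rates = gain * np.exp(-(grid[:, None] - preferred[None, :]) ** 2
                          / (2 * sigma_tc ** 2)) + 1e-12
    # Poisson log-likelihood (constant terms dropped) plus Gaussian log-prior.
    log_lik = (spike_counts[None, :] * np.log(rates) - rates).sum(axis=1)
    log_prior = -(grid - prev_estimate) ** 2 / (2 * sigma_prior ** 2)
    return grid[np.argmax(log_lik + log_prior)]
```

In this sketch, shrinking sigma_prior means trusting the previous estimate more, which is the role the abstract assigns to the learned change in recurrent interactions.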

3 0.63180459 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
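A minimal sketch of the cascade idea this abstract describes: each stage applies a boosted combination of weak classifiers and rejects a window as soon as its score falls below that stage's threshold, so most negatives are discarded cheaply. The stage structure, weak classifiers, and thresholds below are placeholders, not the authors' actual detector.

```python
def cascade_predict(x, stages):
    """Run x through a detector cascade of (weak_classifiers, alphas, threshold)
    stages; return 1 only if every stage accepts."""
    for weak_clfs, alphas, threshold in stages:
        # Boosted vote of this stage's weak classifiers (each returns 0 or 1).
        score = sum(a * clf(x) for clf, a in zip(weak_clfs, alphas))
        if score < threshold:
            return 0   # early rejection: most negative windows exit here cheaply
    return 1           # accepted by every stage
```

Roughly speaking, the asymmetric AdaBoost variant changes how the alphas and per-stage thresholds are trained, penalising missed positives more heavily than false positives; the prediction path stays the same.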

4 0.62977493 60 nips-2001-Discriminative Direction for Kernel Classifiers

Author: Polina Golland

Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.

5 0.62884641 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama

Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs).
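For intuition about a time-alignment kernel of this kind, here is a minimal sketch that accumulates frame-level dot products along the best warping path between two variable-length sequences. The exact DTAK recursion, weighting, and normalisation in the paper may differ, so treat this purely as an illustration.

```python
import numpy as np

def alignment_similarity(X, Y):
    """Dynamic-programming similarity between two sequences of frame vectors,
    accumulating dot products along the best warping path."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), -np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = float(np.dot(X[i - 1], Y[j - 1]))      # frame-level similarity
            D[i, j] = sim + max(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)                             # length-normalised score
```

Note that a similarity defined this way is not automatically a valid (positive semi-definite) kernel; a kernel formulation has to address that before the standard SVM machinery applies.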

6 0.62805426 56 nips-2001-Convolution Kernels for Natural Language

7 0.62770009 13 nips-2001-A Natural Policy Gradient

8 0.62499082 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

9 0.62410462 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

10 0.62409562 149 nips-2001-Probabilistic Abstraction Hierarchies

11 0.62391478 46 nips-2001-Categorization by Learning and Combining Object Parts

12 0.62173009 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

13 0.62166661 8 nips-2001-A General Greedy Approximation Algorithm with Applications

14 0.62130195 185 nips-2001-The Method of Quantum Clustering

15 0.62024736 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

16 0.62008727 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

17 0.619784 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

18 0.6186341 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

19 0.61699468 22 nips-2001-A kernel method for multi-labelled classification

20 0.61689144 190 nips-2001-Thin Junction Trees