nips nips2004 nips2004-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: N. J. Hill, Thomas N. Lal, Karin Bierig, Niels Birbaumer, Bernhard Schölkopf
Abstract: Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a brain-computer interface that uses auditory stimuli. We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged event-related potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a brain-computer interface that uses auditory stimuli. [sent-9, score-0.264]
2 We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. [sent-10, score-0.404]
3 Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged event-related potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. [sent-11, score-0.082]
4 This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI. [sent-12, score-0.434]
5 1 Introduction The aim of research into brain-computer interfaces (BCIs) is to allow a person to control a computer using signals from the brain, without the need for any muscular movement—for example, to allow a completely paralysed patient to communicate. [sent-13, score-0.438]
6 Total or near-total paralysis can result in cases of brain-stem stroke, cerebral palsy, and Amyotrophic Lateral Sclerosis (ALS, also known as Lou Gehrig’s disease). [sent-14, score-0.038]
7 It has been shown that some patients in a “locked-in” state, in which most cognitive functions are intact despite complete paralysis, can learn to communicate via an interface that interprets electrical signals from the brain, measured externally by electro-encephalogram (EEG) [1]. [sent-15, score-0.346]
8 The experience of clinical groups applying BCI is that different paradigms work to varying degrees with different patients. [sent-17, score-0.038]
9 For some patients, long immobility and the degeneration of the pyramidal cells of the motor cortex may make it difficult to produce imagined-movement signals. [sent-18, score-0.079]
10 Thus, there is considerable motivation to add to the palette of available BCI paradigms by exploring EEG signals that occur in response to auditory stimuli—a patient’s sense of hearing is often uncompromised by their condition. [sent-20, score-0.48]
11 Here, we report the results of an experiment on healthy subjects, designed to develop a BCI paradigm in which a user can make a binary choice. [sent-21, score-0.127]
12 We attempt to classify EEG signals that occur in response to two simultaneous auditory stimulus streams. [sent-22, score-0.543]
13 To communicate a binary decision, the subject focuses attention on one of the two streams, left or right. [sent-23, score-0.197]
14 Hillyard et al. [6] and others reported in the 60’s and 70’s that selective attention in a dichotic listening task caused a measurable modulation of EEG signals (see [7, 8] for a review). [sent-25, score-0.343]
15 This modulation was significant when signals were averaged over a large number of instances, but our aim is to discover whether single trials are classifiable, using machine-learning algorithms, with a high enough accuracy to be useful in a BCI. [sent-26, score-0.46]
16 2 Stimuli and methods EEG signals were recorded from 15 healthy untrained subjects (9 female, 6 male) between the ages of 20 and 38, using 39 silver chloride electrodes, referenced to the ears. [sent-27, score-0.49]
17 An additional EOG electrode was positioned lateral to and slightly below the left eye, to record eye movement artefacts—blinks and horizontal and vertical saccades all produced clearly identifiable signals on the EOG channel. [sent-28, score-0.494]
18 The signals were filtered by an analog band-pass filter between 0. [sent-29, score-0.231]
19 On each trial, the appearance of a fixation point on screen was followed after 1 sec by an arrow pointing left or right (25 left, 25 right in each block, in random order). [sent-33, score-0.084]
20 The arrow disappeared after 500 msec, after which there was a pause of 500 msec, and then the auditory stimulus was presented, lasting 4 seconds. [sent-34, score-0.395]
21 500 msec after the end of the auditory stimulus, the fixation point disappeared and there was a pause of between 2 and 4 seconds for the subject to relax. [sent-35, score-0.471]
22 The auditory stimulus consisted of two periodic sequences of 50-msec-long square-wave beeps, one presented from a speaker to the left of the subject, and the other from a speaker to the right. [sent-37, score-0.279]
23 The right-hand sequence consisted of eight beeps of frequencies 1500 Hz (non-target) and 1650 Hz (target), repeating with a period of 490 msec. [sent-40, score-0.242]
24 The left-hand sequence consisted of seven beeps of frequencies 800 Hz (non-target) and 880 Hz (target), starting 70 msec after the start of the right-hand sequence and repeating with a period of 555 msec. [sent-41, score-0.382]
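The stimulus timing described above can be made concrete with a short sketch. This is not the authors' stimulus-generation code; the 44.1 kHz sample rate, the square-wave synthesis, and the particular target-beep positions are assumptions for illustration, while the frequencies, periods, beep duration and 70-msec offset come from the text.

```python
import numpy as np

FS = 44100           # sample rate in Hz (assumed)
BEEP_DUR = 0.050     # 50-msec square-wave beeps
TOTAL_DUR = 4.0      # stimulus length in seconds

def square_beep(freq_hz, dur_s=BEEP_DUR, fs=FS):
    """One 50-msec square wave at the given frequency."""
    t = np.arange(int(dur_s * fs)) / fs
    return np.sign(np.sin(2 * np.pi * freq_hz * t))

def beep_sequence(n_beeps, period_s, onset_s, std_hz, tgt_hz, target_idx, fs=FS):
    """Periodic beep train; the beep at target_idx uses the target frequency."""
    out = np.zeros(int(TOTAL_DUR * fs))
    for i in range(n_beeps):
        start = int((onset_s + i * period_s) * fs)
        freq = tgt_hz if i == target_idx else std_hz
        beep = square_beep(freq, fs=fs)
        out[start:start + len(beep)] = beep
    return out

# Right speaker: eight beeps, 490-msec period, 1500 Hz standard / 1650 Hz target.
right = beep_sequence(8, 0.490, 0.000, 1500, 1650, target_idx=3)
# Left speaker: seven beeps, 555-msec period, starting 70 msec later,
# 800 Hz standard / 880 Hz target.
left = beep_sequence(7, 0.555, 0.070, 800, 880, target_idx=5)

stereo = np.stack([left, right], axis=1)   # one column per speaker
```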
25 According to the direction of the arrow on each trial, subjects were instructed to count the number of target beeps in either the left or right sequence. [sent-42, score-0.547]
26 In the pause between trials, they were instructed to report the number of target beeps using a numeric keypad. [sent-43, score-0.357]
[Figure residue: deviant tones (longer duration, or absent); time axis in seconds; concatenated averaged signal of the LEFT and RIGHT sound signals, which have different periods.] [sent-44, score-0.375] [sent-49, score-0.412]
29 Comparison of the average response to a left beat with the average response to a right beat, on a single trial, should thus emphasize any modulating effect of the direction of attention on the ERP, of the kind described by Hillyard et al. [sent-53, score-0.354]
30 1 In order to avoid contamination of the EEG signals with movement artefacts, a few practice trials were performed before the first block, so that subjects learned to wait until the fixation point was out before looking at the keypad or beginning the hand movement toward it. [sent-55, score-0.727]
31 2 Although a paralysed patient would clearly be unable to give responses in this way, it is hoped that this extra motivation would not be necessary. [sent-56, score-0.165]
32 An additional stimulus feature was designed to investigate whether mismatch negativity (MMN) could form a useful basis for a BCI. [sent-57, score-0.167]
33 Mismatch negativity is a difference between the ERP to standard stimuli and the ERP to deviant stimuli, i.e. rare stimulus events (with a probability of occurrence typically around 0. [sent-58, score-0.512] [sent-60, score-0.101]
35 Thus there is the possibility that a deviant stimulus (say, a longer beep) inserted into the sequence to which the subject is attending (same side, same pitch) might elicit a larger MMN signal than a deviant in the unattended sequence. [sent-64, score-1.03]
36 To explore this, after at least two standard beats on each trial, one of the beats (randomly chosen, with the constraint that the epoch following the deviant on the left should not overlap with the epoch following the deviant on the right) was made to deviate. [sent-65, score-1.186]
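Reading the constraint as implying one deviant in each stream per trial, the placement can be sketched with a simple rejection step. The epoch length is taken to equal one stimulus period, which is an assumption consistent with the 142- and 125-sample epochs used later in the analysis.

```python
import random

RIGHT_PERIOD, LEFT_PERIOD = 0.490, 0.555   # sec
RIGHT_ONSET, LEFT_ONSET = 0.000, 0.070     # sec
N_RIGHT, N_LEFT = 8, 7

def pick_deviant_positions(rng=random):
    """Choose one deviant beat per stream, skipping the first two standards,
    and reject draws whose post-deviant epochs would overlap."""
    while True:
        r = rng.randrange(2, N_RIGHT)
        l = rng.randrange(2, N_LEFT)
        r_start = RIGHT_ONSET + r * RIGHT_PERIOD
        l_start = LEFT_ONSET + l * LEFT_PERIOD
        # the epoch following each deviant spans one period
        if r_start + RIGHT_PERIOD <= l_start or l_start + LEFT_PERIOD <= r_start:
            return r, l
```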
37 For 8 subjects, the deviant beat was simply a silent beat—a disruptive pause in the otherwise regular sequence. [sent-67, score-0.582]
38 A sixteenth subject, in the long-deviant condition, had to be eliminated because of poor signal quality. [sent-69, score-0.103]
39 3 Analysis As a first step in analyzing the data, the raw EEG signals were examined by eye for each of the 400 trials of each of the subjects. [sent-70, score-0.453]
40 Trials were rejected if they contained obvious large artefact signals caused by blinks or saccades (visible in the EOG and across most of the frontal positions), small periodic eye movements, or other muscle movements (neck and brow, judged from electrode positions O9 and O10, Fp1, Fpz and Fp2). [sent-71, score-0.674]
41 Between 6 and 228 trials had to be rejected out of 400, depending on the subject. [sent-72, score-0.191]
42 One of two alternative preprocessing methods was then used. [sent-73, score-0.074]
43 In order to look for effects of the attention-modulation reported by Hillyard et al., method (A) took the average ERP in response to standard beats (discarding the first beat). [sent-74, score-0.175]
44 In order to look for possible attention-modulation of MMN, method (B) subtracted the average response to standards from the response to the deviant beat. [sent-75, score-0.617]
45 In both methods, the average ERP signal to beats on the left was concatenated with the average ERP signal following beats on the right, as depicted in figure 2 (for illustrative purposes the figure uses the sound signal itself, rather than an ERP). [sent-76, score-0.453]
46 For each trial, either preprocessing method resulted in a signal of 142 (left) + 125 (right) = 267 time samples for each of 40 channels (39 EEG channels plus one EOG), for a total of 10680 input dimensions to the classifier. [sent-77, score-0.128]
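A minimal sketch of preprocessing method (A), assuming the per-beat epochs in response to standard beats have already been cut from the continuous recording at roughly 256 Hz (an inference from the 142- and 125-sample epoch lengths):

```python
import numpy as np

def preprocess_trial_A(left_epochs, right_epochs):
    """left_epochs:  array (n_left_beats, 40, 142)  -- standard-beat epochs, 40 channels
       right_epochs: array (n_right_beats, 40, 125)
       Returns a flat vector of 40 * (142 + 125) = 10680 features for one trial."""
    left_erp = left_epochs[1:].mean(axis=0)     # average, discarding the first beat
    right_erp = right_epochs[1:].mean(axis=0)
    return np.concatenate([left_erp, right_erp], axis=1).ravel()
```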
47 To evaluate its performance, the trials from a single subject were split into ten nonoverlapping partitions of equal size: each such partition was used in turn as a test set for evaluating the performance of the classifier trained on the other 90% of the trials. [sent-79, score-0.267]
48 For the purposes of the ICA, the concatenation of all the preprocessed signals from one EEG channel, from all trials in the training partition, was treated as a single mixture signal. [sent-81, score-0.38]
49 This unmixing matrix, computed only from the training set, was then used to separate the signals in both the training set and the test set. [sent-84, score-0.231]
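The per-fold ICA step can be sketched as follows, with scikit-learn's FastICA standing in for whichever ICA implementation the authors actually used (an assumption):

```python
import numpy as np
from sklearn.decomposition import FastICA

def fit_unmixing(train_trials):
    """train_trials: array (n_trials, 40, 267). Concatenate all training trials
    along time, treating each channel as one long mixture signal, and fit ICA."""
    mixtures = np.concatenate(list(train_trials), axis=1)      # (40, n_trials * 267)
    ica = FastICA(n_components=40, whiten="unit-variance", random_state=0)
    ica.fit(mixtures.T)                                         # samples x channels
    return ica.components_                                      # 40 x 40 unmixing matrix

def apply_unmixing(W, trials):
    """Separate every trial (training or test) with the training-set matrix W."""
    return np.einsum("ij,njt->nit", W, trials)
```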
50 Then, the signals were centered and normalized: for each averaged (unmixed) ERP in each of the 40 ICs of each trial, the mean was subtracted, and the signal was divided by its 2-norm. [sent-85, score-0.329]
51 Thus the entry Kij in the kernel matrix of the SVM was proportional to the sum of the coefficients of correlation between corresponding epochs in trials i and j. [sent-86, score-0.149]
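One reading of the normalization and kernel described above is sketched below: each IC's averaged ERP is mean-centred and scaled to unit 2-norm, so a plain linear kernel between two trials sums the correlation coefficients of corresponding signals. Whether the left and right averages were normalized separately is not fully specified; here each IC's concatenated signal is treated as one unit.

```python
import numpy as np

def normalize_trial(trial):
    """trial: array (40 ICs, 267 samples); centre each IC and scale it to unit 2-norm."""
    centred = trial - trial.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    return centred / np.where(norms > 0, norms, 1.0)

def linear_kernel(trials_a, trials_b):
    """K[i, j] = sum over ICs of the correlation between trial i and trial j."""
    A = np.array([normalize_trial(t).ravel() for t in trials_a])
    B = np.array([normalize_trial(t).ravel() for t in trials_b])
    return A @ B.T
```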
52 Single-trial error rate was estimated as the mean proportion of misclassified test trials across the ten folds. [sent-88, score-0.186]
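The ten-fold evaluation loop can then be assembled from the pieces above; it reuses fit_unmixing, apply_unmixing and linear_kernel from the earlier sketches, and scikit-learn's SVC with a precomputed kernel is a stand-in for whatever SVM package was actually employed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def cross_validated_error(trials, labels, n_folds=10):
    """trials: array (n_trials, 40, 267) of preprocessed trials; labels: left/right classes."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                     random_state=0).split(trials):
        W = fit_unmixing(trials[train_idx])              # ICA on the training partition only
        train_ics = apply_unmixing(W, trials[train_idx])
        test_ics = apply_unmixing(W, trials[test_idx])
        K_train = linear_kernel(train_ics, train_ics)
        K_test = linear_kernel(test_ics, train_ics)
        svm = SVC(kernel="precomputed").fit(K_train, labels[train_idx])
        errors.append(1.0 - svm.score(K_test, labels[test_idx]))
    return float(np.mean(errors))                        # estimated single-trial error rate
```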
53 For comparison, the classification was also performed on the mixture signals without ICA, and with and without the normalizing step. [sent-89, score-0.231]
54 It can be seen that the best error rate obtainable with a given subject varies according to the subject, between 3% and 37%, in a way that is not entirely explained by the differences in the numbers of good (artefact-free) trials available. [sent-94, score-0.23]
55 ICA generally improved the results, by anything up to 14%. [sent-95, score-0.037]
56 Preprocessing method (B) generally performed poorly (minimum 19% error, and generally over 35%). [sent-96, score-0.074]
57 For preprocessing method A, normalization generally produced a small improvement. [sent-98, score-0.111]
58 Note that, despite the fact that this method does not use the ERPs that occur in response to deviant beats, the results for subjects in the silent-deviant condition were generally better than for those in the long-deviant condition. [sent-102, score-0.566]
59 In order to examine the extent to which the dimensionality of the classification problem could be reduced, recursive feature elimination [15] was performed (limited now to preprocessing method A with ICA and normalization). [sent-104, score-0.275]
60 For each of ten folds, ICA and normalization were performed, then an SVM was trained and tested. [sent-105, score-0.037]
61 For each independent component j, an elimination criterion value c_j = Σ_{i ∈ F_j} w_i² was computed, where w is the hyperplane normal vector of the trained SVM, and F_j is the set of indices of features that are part of component j. [sent-106, score-0.14]
62 The IC with the lowest criterion score c_j was deemed to be the least influential for classification. [sent-107, score-0.074]
Table 1: SVM classification error rates; the best rates for each of the preprocessing methods, A and B (see text), are in bold.
Subject:                  CM   CN   GH   JH   KT   KW   TD   TT   AH   AK   CG   CH   DK   KB   SK
deviant duration (msec):   0    0    0    0    0    0    0    0  100  100  100  100  100  100  100
# good trials:           326  250  198  348  380  394  371  367  353  172  271  375  241  363  239
[The Method A (no ICA / ICA) and Method B error-rate entries of the table are truncated in the source.] [sent-110, score-0.585]
64 Then the SVM was re-trained and re-tested, and the elimination process iterated until one channel remained. [sent-231, score-0.196]
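The recursive elimination loop sketched below follows the criterion c_j defined above; a linear-kernel SVC stands in for the actual SVM implementation, and the feature layout (all 267 samples of IC 0, then IC 1, and so on) is an assumption about how the vectors are arranged.

```python
import numpy as np
from sklearn.svm import SVC

def recursive_ic_elimination(X_train, y_train, X_test, y_test,
                             n_components=40, n_samples=267):
    """X_*: arrays (n_trials, n_components * n_samples), features grouped by IC.
    Returns the test error after each elimination step."""
    remaining = list(range(n_components))
    errors = []
    while remaining:
        cols = np.concatenate([np.arange(j * n_samples, (j + 1) * n_samples)
                               for j in remaining])
        svm = SVC(kernel="linear").fit(X_train[:, cols], y_train)
        errors.append(1.0 - svm.score(X_test[:, cols], y_test))
        if len(remaining) == 1:
            break
        w = svm.coef_.ravel()                                          # hyperplane normal vector
        c = (w.reshape(len(remaining), n_samples) ** 2).sum(axis=1)    # c_j per remaining IC
        del remaining[int(np.argmin(c))]                               # drop the least influential IC
    return errors
```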
65 Results for the two subject groups are plotted in the left and right panels of figure 3, showing estimated error rates averaged over ten folds against the number of ICs used for classification. [sent-234, score-0.202]
66 Each subject’s initials, together with the number of useable trials that subject performed, are printed to the right of the corresponding curve. [sent-235, score-0.23]
67 A possible exception is KT, whose performance may improve by 2–3% after elimination of 20 components, and a clearer exception is CG, for whom elimination of 25 components yields an improvement of roughly 10%. 3 RICE was also carried out using the full 400 trials for each subject (results not shown). [sent-239, score-0.442]
68 Despite the (sometimes drastic) reduction in the number of trials, rejection by eye of artefact trials did not raise the classification error rate by an appreciable amount. [sent-240, score-0.288]
69 The one exception was subject SK, for whom the probability of mis-classification increased by about 0.1 when 161 trials containing strong movement signals were removed—clearly this subject’s movements were classifiably dependent on whether he was attending to the left or to the right. [sent-241, score-0.117] [sent-242, score-0.588]
[Figure 2 residue: estimated error rate against the number of ICs retained, one panel per deviant-duration group (deviant duration = 0 and deviant duration = 100 msec); per-subject curves labelled with initials and trial counts, e.g. KB (363).] Figure 2: Results of recursive independent component elimination. [sent-244, score-1.012] [sent-253, score-0.341]
73 A thorough analysis is not possible here—however, with the mixture weightings for many ICs spread very widely around the electrode array, we found no strong evidence for or against the particular involvement of muscle movement artefact signals in the classification. [sent-255, score-0.523]
74 Separability of EEG signals recorded during right and left motor imagery using adaptive autoregressive parameters. [sent-280, score-0.306]
75 Electrical signs of selective attention in the human brain. [sent-320, score-0.076]
76 The role of attention in auditory information processing as revealed by event-related potentials and other brain measures of cognitive function. [sent-328, score-0.364]
77 Behavioral and electrophysiological effects of task-irrelevant sound change: a new distraction paradigm. [sent-337, score-0.113]
wordName wordTfidf (topN-words)
[('deviant', 0.375), ('eeg', 0.277), ('beeps', 0.242), ('erp', 0.242), ('signals', 0.231), ('subjects', 0.183), ('trials', 0.149), ('msec', 0.14), ('elimination', 0.14), ('auditory', 0.138), ('ics', 0.134), ('mmn', 0.132), ('ica', 0.13), ('trial', 0.107), ('beats', 0.102), ('stimulus', 0.101), ('di', 0.1), ('artefacts', 0.096), ('epoch', 0.096), ('beat', 0.092), ('eog', 0.088), ('hillyard', 0.088), ('paralysed', 0.088), ('bci', 0.087), ('subject', 0.081), ('patient', 0.077), ('patients', 0.077), ('pause', 0.077), ('attention', 0.076), ('preprocessing', 0.074), ('eye', 0.073), ('response', 0.073), ('stimuli', 0.071), ('muscle', 0.07), ('xation', 0.07), ('brain', 0.069), ('artefact', 0.066), ('beep', 0.066), ('distraction', 0.066), ('negativity', 0.066), ('hz', 0.065), ('movement', 0.063), ('movements', 0.061), ('duration', 0.061), ('recursive', 0.061), ('cg', 0.058), ('modulate', 0.058), ('rehabilitation', 0.058), ('erent', 0.057), ('channel', 0.056), ('signal', 0.054), ('rice', 0.052), ('standards', 0.052), ('classi', 0.052), ('paradigm', 0.05), ('electrode', 0.049), ('eliminated', 0.049), ('schr', 0.049), ('sound', 0.047), ('attending', 0.044), ('blinks', 0.044), ('erps', 0.044), ('immobility', 0.044), ('involvement', 0.044), ('jh', 0.044), ('wol', 0.044), ('arrow', 0.044), ('subtracted', 0.044), ('averaged', 0.044), ('svm', 0.042), ('rejected', 0.042), ('pitch', 0.042), ('interfaces', 0.042), ('potentials', 0.041), ('aa', 0.04), ('left', 0.04), ('kt', 0.039), ('bingen', 0.039), ('user', 0.039), ('conscious', 0.038), ('contamination', 0.038), ('healthy', 0.038), ('instructed', 0.038), ('kw', 0.038), ('listen', 0.038), ('paradigms', 0.038), ('paralysis', 0.038), ('saccades', 0.038), ('silent', 0.038), ('untrained', 0.038), ('interface', 0.038), ('ten', 0.037), ('generally', 0.037), ('modulation', 0.036), ('exception', 0.036), ('periods', 0.036), ('sk', 0.035), ('motor', 0.035), ('kb', 0.035), ('disappeared', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
2 0.24659073 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
Author: Thomas N. Lal, Thilo Hinterberger, Guido Widman, Michael Schröder, N. J. Hill, Wolfgang Rosenstiel, Christian E. Elger, Niels Birbaumer, Bernhard Schölkopf
Abstract: During the last ten years there has been growing interest in the development of Brain Computer Interfaces (BCIs). The field has mainly been driven by the needs of completely paralyzed patients to communicate. With a few exceptions, most human BCIs are based on extracranial electroencephalography (EEG). However, reported bit rates are still low. One reason for this is the low signal-to-noise ratio of the EEG [16]. We are currently investigating if BCIs based on electrocorticography (ECoG) are a viable alternative. In this paper we present the method and examples of intracranial EEG recordings of three epilepsy patients with electrode grids placed on the motor cortex. The patients were asked to repeatedly imagine movements of two kinds, e.g., tongue or finger movements. We analyze the classifiability of the data using Support Vector Machines (SVMs) [18, 21] and Recursive Channel Elimination (RCE) [11]. 1
3 0.24459934 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
Author: Pradeep Shenoy, Rajesh P. Rao
Abstract: We describe an approach to building brain-computer interfaces (BCI) based on graphical models for probabilistic inference and learning. We show how a dynamic Bayesian network (DBN) can be used to infer probability distributions over brain- and body-states during planning and execution of actions. The DBN is learned directly from observed data and allows measured signals such as EEG and EMG to be interpreted in terms of internal states such as intent to move, preparatory activity, and movement execution. Unlike traditional classification-based approaches to BCI, the proposed approach (1) allows continuous tracking and prediction of internal states over time, and (2) generates control signals based on an entire probability distribution over states rather than binary yes/no decisions. We present preliminary results of brain- and body-state estimation using simultaneous EEG and EMG signals recorded during a self-paced left/right hand movement task. 1
4 0.10150805 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
Author: Hyun J. Park, Te W. Lee
Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1
5 0.099896356 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
Author: Rasmus K. Olsson, Lars K. Hansen
Abstract: We discuss an identification framework for noisy speech mixtures. A block-based generative model is formulated that explicitly incorporates the time-varying harmonic plus noise (H+N) model for a number of latent sources observed through noisy convolutive mixtures. All parameters including the pitches of the source signals, the amplitudes and phases of the sources, the mixing filters and the noise statistics are estimated by maximum likelihood, using an EM-algorithm. Exact averaging over the hidden sources is obtained using the Kalman smoother. We show that pitch estimation and source separation can be performed simultaneously. The pitch estimates are compared to laryngograph (EGG) measurements. Artificial and real room mixtures are used to demonstrate the viability of the approach. Intelligible speech signals are re-synthesized from the estimated H+N models.
6 0.095199294 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
7 0.090144522 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
8 0.084305517 132 nips-2004-Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis
9 0.084075049 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity
10 0.081832781 29 nips-2004-Beat Tracking the Graphical Model Way
11 0.079600997 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
12 0.079200521 21 nips-2004-An Information Maximization Model of Eye Movements
13 0.077313967 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
14 0.076850384 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
15 0.075416707 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
16 0.074083589 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
17 0.071250424 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
18 0.070333533 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
19 0.069745041 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
20 0.068332382 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
topicId topicWeight
[(0, -0.175), (1, -0.074), (2, -0.013), (3, -0.094), (4, -0.123), (5, -0.009), (6, 0.353), (7, 0.015), (8, 0.095), (9, -0.082), (10, -0.06), (11, 0.045), (12, -0.071), (13, -0.214), (14, 0.3), (15, -0.023), (16, 0.024), (17, -0.075), (18, -0.14), (19, 0.101), (20, 0.03), (21, -0.116), (22, -0.039), (23, 0.118), (24, -0.011), (25, 0.145), (26, 0.065), (27, 0.09), (28, -0.146), (29, 0.031), (30, -0.055), (31, -0.03), (32, 0.053), (33, 0.002), (34, -0.061), (35, 0.018), (36, -0.029), (37, 0.121), (38, -0.046), (39, 0.051), (40, -0.003), (41, 0.036), (42, 0.0), (43, 0.035), (44, -0.083), (45, -0.016), (46, 0.028), (47, -0.037), (48, -0.005), (49, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.96089232 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
2 0.86368173 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
3 0.79836166 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
4 0.45334783 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
Author: Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, Yoram Singer
Abstract: We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system’s input are the instantaneous spike rates. The system’s state dynamics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter. 1
5 0.40129089 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose output are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. 1
6 0.39859778 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
7 0.34940633 29 nips-2004-Beat Tracking the Graphical Model Way
8 0.33935592 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
9 0.3259894 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
10 0.31191474 21 nips-2004-An Information Maximization Model of Eye Movements
11 0.2897006 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging
12 0.28783205 193 nips-2004-Theories of Access Consciousness
13 0.28392059 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
15 0.26927811 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
16 0.25901401 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
17 0.25761622 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
18 0.2575061 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
19 0.24531211 74 nips-2004-Harmonising Chorales by Probabilistic Inference
20 0.24113227 109 nips-2004-Mass Meta-analysis in Talairach Space
topicId topicWeight
[(13, 0.098), (15, 0.066), (18, 0.059), (26, 0.046), (31, 0.034), (33, 0.151), (35, 0.039), (39, 0.012), (50, 0.038), (52, 0.014), (63, 0.282), (71, 0.013), (76, 0.017), (89, 0.011), (94, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.79792047 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
2 0.74920732 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
Author: Philip M. Long, Xinyu Wu
Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1
3 0.61574358 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
Author: Balázs Kégl
Abstract: We have recently proposed an extension of ADABOOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for ADABOOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1
4 0.59719151 117 nips-2004-Methods Towards Invasive Human Brain Computer Interfaces
5 0.5937807 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
Author: Peter M. Todd, Anja Dieckmann
Abstract: Simple lexicographic decision heuristics that consider cues one at a time in a particular order and stop searching for cues as soon as a decision can be made have been shown to be both accurate and frugal in their use of information. But much of the simplicity and success of these heuristics comes from using an appropriate cue order. For instance, the Take The Best heuristic uses validity order for cues, which requires considerable computation, potentially undermining the computational advantages of the simple decision mechanism. But many cue orders can achieve good decision performance, and studies of sequential search for data records have proposed a number of simple ordering rules that may be of use in constructing appropriate decision cue orders as well. Here we consider a range of simple cue ordering mechanisms, including tallying, swapping, and move-to-front rules, and show that they can find cue orders that lead to reasonable accuracy and considerable frugality when used with lexicographic decision heuristics. 1 O ne -Re ason De c i si on M aki ng and O r de r e d Se ar c h How do we know what information to consider when making a decision? Imagine the problem of deciding which of two objects or options is greater along some criterion, such as which of two cities is larger. We may know various facts about each city, such as whether they have a major sports team or a university or airport. To decide between them, we could weight and sum all the cues we know, or we could use a simpler lexicographic rule to look at one cue at a time in a particular order until we find a cue that discriminates between the options and indicates a choice [1]. Such lexicographic rules are used by people in a variety of decision tasks [2]-[4], and have been shown to be both accurate in their inferences and frugal in the amount of information they consider before making a decision. For instance, Gigerenzer and colleagues [5] demonstrated the surprising performance of several decision heuristics that stop information search as soon as one discriminating cue is found; because only that cue is used to make the decision, and no integration of information is involved, they called these heuristics “one-reason” decision mechanisms. Given some set of cues that can be looked up to make the decision, these heuristics differ mainly in the search rule that determines the order in which the information is searched. But then the question of what information to consider becomes, how are these search orders determined? Particular cue orders make a difference, as has been shown in research on the Take The Best heuristic (TTB) [6], [7]. TTB consists of three building blocks. (1) Search rule: Search through cues in the order of their validity, a measure of accuracy equal to the proportion of correct decisions made by a cue out of all the times that cue discriminates between pairs of options. (2) Stopping rule: Stop search as soon as one cue is found that discriminates between the two options. (3) Decision rule: Select the option to which the discriminating cue points, that is, the option that has the cue value associated with higher criterion values. The performance of TTB has been tested on several real-world data sets, ranging from professors’ salaries to fish fertility [8], in cross-validation comparisons with other more complex strategies. 
Across 20 data sets, TTB used on average only a third of the available cues (2.4 out of 7.7), yet still outperformed multiple linear regression in generalization accuracy (71% vs. 68%). The even simpler Minimalist heuristic, which searches through available cues in a random order, was more frugal (using 2.2 cues on average), yet still achieved 65% accuracy. But the fact that the accuracy of Minimalist lagged behind TTB by 6 percentage points indicates that part of the secret of TTB’s success lies in its ordered search. Moreover, in laboratory experiments [3], [4], [9], people using lexicographic decision strategies have been shown to employ cue orders based on the cues’ validities or a combination of validity and discrimination rate (proportion of decision pairs on which a cue discriminates between the two options). Thus, the cue order used by a lexicographic decision mechanism can make a considerable difference in accuracy; the same holds true for frugality, as we will see.

But constructing an exact validity order, as used by Take The Best, takes considerable information and computation [10]. If there are N known objects to make decisions over, and C cues known for each object, then each of the C cues must be evaluated for whether it discriminates correctly (counting up R right decisions), incorrectly (W wrong decisions), or does not discriminate between each of the N·(N-1)/2 possible object pairs, yielding C·N·(N-1)/2 checks to perform to gather the information needed to compute cue validities (v = R/(R+W)) in this domain. But a decision maker typically does not know all of the objects to be decided upon, nor even all the cue values for those objects, ahead of time—is there any simpler way to find an accurate and frugal cue order? In this paper, we address this question through simulation-based comparison of a variety of simple cue-order-learning rules. Hope comes from two directions: first, there are many cue orders besides the exact validity ordering that can yield good performance; and second, research in computer science has demonstrated the efficacy of a range of simple ordering rules for a closely related search problem. Consequently, we find that simple mechanisms at the cue-order-learning stage can enable simple mechanisms at the decision stage, such as lexicographic one-reason decision heuristics, to perform well.

2 Simple approaches to constructing cue search orders

To compare different cue ordering rules, we evaluate the performance of different cue orders when used by a one-reason decision heuristic within a particular well-studied sample domain: large German cities, compared on the criterion of population size using 9 cues ranging from having a university to the presence of an intercity train line [6], [7]. Examining this domain makes it clear that there are many good possible cue orders. When used with one-reason stopping and decision building blocks, the mean accuracy of the 362,880 (9!) cue orders is 70%, equivalent to the performance expected from Minimalist. The accuracy of the validity order, 74.2%, falls toward the upper end of the accuracy range (62-75.8%), but there are still 7421 cue orders that do better than the validity order. The frugality of the search orders ranges from 2.53 cues per decision to 4.67, with a mean of 3.34 corresponding to using Minimalist; TTB has a frugality of 4.23, implying that most orders are more frugal.
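The bookkeeping behind an exact validity order can be written out in a few lines. The sketch below, an illustration under the same hypothetical data layout as above rather than the authors’ code, performs the C·N·(N-1)/2 pair checks and computes v = R/(R+W) for each cue; pairs tied on the criterion are simply skipped here, which is one possible convention.

from itertools import combinations

def cue_validities(objects, cues):
    """objects: list of (cue_dict, criterion_value) pairs; returns {cue: validity}."""
    validity = {}
    for cue in cues:
        right = wrong = 0
        for (ca, ya), (cb, yb) in combinations(objects, 2):   # all N*(N-1)/2 object pairs
            if ca.get(cue, 0) == cb.get(cue, 0) or ya == yb:
                continue                                      # cue does not discriminate, or criterion tie
            picks_a = ca.get(cue, 0) > cb.get(cue, 0)
            if picks_a == (ya > yb):
                right += 1                                    # R: correct discriminations
            else:
                wrong += 1                                    # W: incorrect discriminations
        validity[cue] = right / (right + wrong) if (right + wrong) else 0.0
    return validity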
Thus, there are many accurate and frugal cue orders that could be found—a satisficing decision maker not requiring optimal performance need only land on one.

An ordering problem of this kind has been studied in computer science for nearly four decades, and can provide us with a set of potential heuristics to test. Consider the case of a set of data records arranged in a list, each of which will be required during a set of retrievals with a particular probability pi. On each retrieval, a key is given (e.g. a record’s title) and the list is searched from the front to the end until the desired record, matching that key, is found. The goal is to minimize the mean search time for accessing the records in this list, for which the optimal ordering is in decreasing order of pi. But if these retrieval probabilities are not known ahead of time, how can the list be ordered after each successive retrieval to achieve fast access? This is the problem of self-organizing sequential search [11], [12]. A variety of simple sequential search heuristics have been proposed for this problem, centering on three main approaches: (1) transpose, in which a retrieved record is moved one position closer to the front of the list (i.e., swapping with the record in front of it); (2) move-to-front (MTF), in which a retrieved record is put at the front of the list, and all other records remain in the same relative order; and (3) count, in which a tally is kept of the number of times each record is retrieved, and the list is reordered in decreasing order of this tally after each retrieval. Because count rules require storing additional information, more attention has focused on the memory-free transposition and MTF rules. Analytic and simulation results (reviewed in [12]) have shown that while transposition rules can come closer to the optimal order asymptotically, in the short run MTF rules converge more quickly (as can count rules). This may make MTF (and count) rules more appealing as models of cue order learning by humans facing small numbers of decision trials. Furthermore, MTF rules are more responsive to local structure in the environment (e.g., clumped retrievals over time of a few records), and transposition can result in very poor performance under some circumstances (e.g., when neighboring pairs of “popular” records get trapped at the end of the list by repeatedly swapping places).

It is important to note that there are important differences between the self-organizing sequential search problem and the cue-ordering problem we address here. In particular, when a record is sought that matches a particular key, search proceeds until the correct record is found. In contrast, when a decision is made lexicographically and the list of cues is searched through, there is no one “correct” cue to find—each cue may or may not discriminate (allow a decision to be made). Furthermore, once a discriminating cue is found, it may not even make the right decision. Thus, given feedback about whether a decision was right or wrong, a discriminating cue could potentially be moved up or down in the ordered list. This dissociation between making a decision or not (based on the cue discrimination rates), and making a right or wrong decision (based on the cue validities), means that there are two ordering criteria in this problem—frugality and accuracy—as opposed to the single order—search time—for records based on their retrieval probability pi.
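The three classic list-update approaches just described have equally short renderings. The following Python sketch is illustrative only and is not tied to any particular implementation from the literature; it assumes records is a list of hashable items and i is the index of the record just retrieved.

def transpose(records, i):
    """Move the retrieved record one position closer to the front."""
    if i > 0:
        records[i - 1], records[i] = records[i], records[i - 1]

def move_to_front(records, i):
    """Put the retrieved record at the front, keeping all other records in relative order."""
    records.insert(0, records.pop(i))

def count_reorder(records, i, tallies):
    """Tally retrievals and keep the list sorted by decreasing retrieval count."""
    tallies[records[i]] = tallies.get(records[i], 0) + 1
    records.sort(key=lambda r: tallies.get(r, 0), reverse=True)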
Because record search time corresponds to cue frugality, the heuristics that work well for the self-organizing sequential search task are likely to produce orders that emphasize frugality (reflecting cue discrimination rates) over accuracy in the cue-ordering task. Nonetheless, these heuristics offer a useful starting point for exploring cue-ordering rules.

2.1 The cue-ordering rules

We focus on search order construction processes that are psychologically plausible by being frugal both in terms of information storage and in terms of computation. The decision situation we explore is different from the one assumed by Juslin and Persson [10], who strongly differentiate learning about objects from later making decisions about them. Instead we assume a learning-while-doing situation, consisting of tasks that have to be done repeatedly with feedback after each trial about the adequacy of one’s decision. For instance, we can observe on multiple occasions which of two supermarket checkout lines, the one we have chosen or (more likely) another one, is faster, and associate this outcome with cues including the lines’ lengths and the ages of their respective cashiers. In such situations, decision makers can learn about the differential usefulness of cues for solving the task via the feedback received over time.

We compare several explicitly defined ordering rules that construct cue orders for use by lexicographic decision mechanisms applied to a particular probabilistic inference task: forced choice paired comparison, in which a decision maker has to infer which of two objects, each described by a set of binary cues, is “bigger” on a criterion—just the task for which TTB was formulated. After an inference has been made, feedback is given about whether a decision was right or wrong. Therefore, the order-learning algorithm has information about which cues were looked up, whether a cue discriminated, and whether a discriminating cue led to the right or wrong decision. The rules we propose differ in which pieces of information they use and how they use them. We classify the learning rules based on their memory requirement—high versus low—and their computational requirements in terms of full or partial reordering (see Table 1).

Table 1: Learning rules classified by memory and computational requirements

High memory load, complete reordering:
  Validity: reorders cues based on their current validity
  Tally: reorders cues by number of correct minus incorrect decisions made so far
  Associative/delta rule: reorders cues by learned association strength

High memory load, local reordering:
  Tally swap: moves cue up (down) one position if it has made a correct (incorrect) decision and its tally of correct minus incorrect decisions is greater (smaller) than or equal to that of the next higher (lower) cue

Low memory load, local reordering:
  Simple swap: moves cue up one position after a correct decision, and down after an incorrect decision
  Move-to-front (2 forms): Take The Last (TTL): moves discriminating cue to front; TTL-correct: moves cue to front only if it correctly discriminates

The validity rule, a type of count rule, is the most demanding of the rules we consider in terms of both memory requirements and computational complexity. It keeps a count of all discriminations made by a cue so far (in all the times that the cue was looked up) and a separate count of all the correct discriminations. Therefore, memory load is comparatively high.
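As a concrete illustration of the validity rule’s bookkeeping (a sketch, not the authors’ implementation), the update performed after each decision might look as follows; counts stores (correct, incorrect) discrimination counts per cue, and cues that have never discriminated are treated as validity 0 by assumption.

def validity_rule_update(order, cue, correct, counts):
    """Update the used cue's counts and re-sort the whole order by current validity."""
    r, w = counts.get(cue, (0, 0))
    counts[cue] = (r + 1, w) if correct else (r, w + 1)
    def validity(c):
        rc, wc = counts.get(c, (0, 0))
        return rc / (rc + wc) if (rc + wc) else 0.0
    order.sort(key=validity, reverse=True)      # complete reordering after every decision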
The validity of each cue is determined by dividing its current correct discrimination count by its total discrimination count. Based on these values computed after each decision, the rule reorders the whole set of cues from highest to lowest validity.

The tally rule only keeps one count per cue, storing the number of correct decisions made by that cue so far minus the number of incorrect decisions. If a cue discriminates correctly on a given trial, one point is added to its tally; if it leads to an incorrect decision, one point is subtracted. The tally rule is less demanding in terms of memory and computation: only one count is kept, and no division is required.

The simple swap rule uses the transposition rather than the count approach. This rule has no memory of cue performance other than an ordered list of all cues, and just moves a cue up one position in this list whenever it leads to a correct decision, and down if it leads to an incorrect decision. In other words, a correctly deciding cue swaps positions with its nearest neighbor upwards in the cue order, and an incorrectly deciding cue swaps positions with its nearest neighbor downwards.

The tally swap rule is a hybrid of the simple swap rule and the tally rule. It keeps a tally of correct minus incorrect discriminations per cue so far (so memory load is high) but only locally swaps cues: when a cue makes a correct decision and its tally is greater than or equal to that of its upward neighbor, the two cues swap positions. When a cue makes an incorrect decision and its tally is smaller than or equal to that of its downward neighbor, the two cues also swap positions.

We also evaluate two types of move-to-front rules. First, the Take The Last (TTL) rule moves the last discriminating cue (that is, whichever cue was found to discriminate for the current decision) to the front of the order. This is equivalent to the Take The Last heuristic [6], [7], which uses a memory of cues that discriminated in the past to determine cue search order for subsequent decisions. Second, TTL-correct moves the last discriminating cue to the front of the order only if it correctly discriminated; otherwise, the cue order remains unchanged. This rule thus takes accuracy as well as frugality into account.

Finally, we include an associative learning rule that uses the delta rule to update cue weights according to whether they make correct or incorrect discriminations, and then reorders all cues in decreasing order of this weight after each decision. This corresponds to a simple network with nine input units encoding the difference in cue value between the two objects (A and B) being decided on (i.e., in_i = 1 if cue_i(A) > cue_i(B), in_i = -1 if cue_i(A) < cue_i(B), and in_i = 0 if cue_i(A) = cue_i(B) or cue_i was not checked) and with one output unit whose target value encodes the correct decision (t = 1 if criterion(A) > criterion(B), otherwise -1), and with the weights between inputs and output updated according to Δw_i = lr · (t - in_i·w_i) · in_i with learning rate lr = 0.1. We expect this rule to behave similarly to Oliver’s rule initially (moving a cue to the front of the list by giving it the largest weight when weights are small) and more like a swap rule later on (moving cues only a short distance once weights are larger).
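The remaining rules can be sketched just as compactly. The Python below is an illustrative rendering of the descriptions above, not the authors’ code: order is a mutable list of cue names, cue is the discriminating cue just used, correct says whether the resulting decision was right, and tally / weights are dicts with an entry (initially 0) for every cue; the delta-rule function follows the update formula exactly as stated in the text.

def simple_swap(order, cue, correct):
    """Move the cue up one position after a correct decision, down after an incorrect one."""
    i = order.index(cue)
    if correct and i > 0:
        order[i - 1], order[i] = order[i], order[i - 1]
    elif not correct and i < len(order) - 1:
        order[i + 1], order[i] = order[i], order[i + 1]

def tally_swap(order, cue, correct, tally):
    """Like simple swap, but only swap if the tally comparison with the neighbor allows it."""
    tally[cue] += 1 if correct else -1
    i = order.index(cue)
    if correct and i > 0 and tally[cue] >= tally[order[i - 1]]:
        order[i - 1], order[i] = order[i], order[i - 1]
    elif not correct and i < len(order) - 1 and tally[cue] <= tally[order[i + 1]]:
        order[i + 1], order[i] = order[i], order[i + 1]

def take_the_last(order, cue, correct, require_correct=False):
    """Move the discriminating cue to the front (TTL); TTL-correct only does so if it was right."""
    if require_correct and not correct:
        return
    order.insert(0, order.pop(order.index(cue)))

def delta_rule_update(order, weights, in_vec, target, lr=0.1):
    """Apply the stated weight update, then re-sort all cues by decreasing weight."""
    for c, x in in_vec.items():                 # x is +1, -1, or 0 for each cue
        weights[c] += lr * (target - x * weights[c]) * x
    order.sort(key=lambda c: weights[c], reverse=True)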
3 Simulation Study of Simple Ordering Rules

To test the performance of these order learning rules, we use the German cities data set [6], [7], consisting of the 83 largest-population German cities (those with more than 100,000 inhabitants), described on 9 cues that give some information about population size. Discrimination rate and validity of the cues are negatively correlated (r = -.47). We present results averaged over 10,000 learning trials for each rule, starting from random initial cue orders. Each trial consisted of 100 decisions between randomly selected decision pairs. For each decision, the current cue order was used to look up cues until a discriminating cue was found, which was used to make the decision (employing a one-reason or lexicographic decision strategy). After each decision, the cue order was updated using the particular order-learning rule.

We start by considering the cumulative accuracies (i.e., online or amortized performance [12]) of the rules, defined as the total percentage of correct decisions made so far at any point in the learning process. The contrasting measure of offline accuracy—how well the current learned cue order would do if it were applied to the entire test set—will be subsequently reported (see Figure 1). For all but the move-to-front rules, cumulative accuracies soon rise above that of the Minimalist heuristic (proportion correct = .70), which looks up cues in random order and thus serves as a lower benchmark. However, at least throughout the first 100 decisions, cumulative accuracies stay well below the (offline) accuracy that would be achieved by using TTB for all decisions (proportion correct = .74), looking up cues in the true order of their ecological validities. Except for the move-to-front rules, whose cumulative accuracies are very close to Minimalist (mean proportion correct in 100 decisions: TTL: .701; TTL-correct: .704), all learning rules perform on a surprisingly similar level, with less than one percentage point difference in favor of the most demanding rule (i.e., delta rule: .719) compared to the least (i.e., simple swap: .711; for comparison: tally swap: .715; tally: .716; validity learning rule: .719). Offline accuracies are slightly higher, again with the exception of the move-to-front rules (TTL: .699; TTL-correct: .702; simple swap: .714; tally swap: .719; tally: .721; validity learning rule: .724; delta rule: .725; see Figure 1). In longer runs (10,000 decisions) the validity learning rule is able to converge on TTB’s accuracy, but the tally rule’s performance changes little (to .73).

Figure 1: Mean offline accuracy of order learning rules
Figure 2: Mean offline frugality of order learning rules

All learning rules are, however, more frugal than TTB, and even more frugal than Minimalist, both in terms of online as well as offline frugality. Let us focus on their offline frugality (see Figure 2): on average, the rules look up fewer cues than Minimalist before reaching a decision. There is little difference between the associative rule, the tallying rules and the swapping rules (mean number of cues looked up in 100 decisions: delta rule: 3.20; validity learning rule: 3.21; tally: 3.01; tally swap: 3.04; simple swap: 3.13). Most frugal are the two move-to-front rules (TTL-correct: 2.87; TTL: 2.83).
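The trial structure described here is easy to reproduce in outline. The sketch below shows one learning trial, reusing lexicographic_decide from the first sketch; objects, cues, and update_rule are hypothetical stand-ins for the data set and one of the ordering rules above (rules that need extra state, such as tally_swap, can be wrapped in a closure). It illustrates the cumulative-accuracy bookkeeping, not the authors’ exact simulation code.

import random

def run_trial(objects, cues, update_rule, n_decisions=100, seed=None):
    """One trial: n_decisions paired comparisons starting from a random initial cue order."""
    rng = random.Random(seed)
    order = cues[:]
    rng.shuffle(order)                                   # random initial cue order
    n_correct = 0
    for _ in range(n_decisions):
        (ca, ya), (cb, yb) = rng.sample(objects, 2)      # randomly selected decision pair
        choice = lexicographic_decide(ca, cb, order)
        if choice == 'guess':
            choice, cue_used = rng.choice(['A', 'B']), None
        else:
            cue_used = next(c for c in order if ca.get(c, 0) != cb.get(c, 0))
        correct = (choice == 'A') == (ya > yb)
        n_correct += correct
        if cue_used is not None:
            update_rule(order, cue_used, correct)        # e.g. simple_swap or take_the_last
    return order, n_correct / n_decisions                # learned order, cumulative accuracy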
Consistent with this finding, all of the learning rules lead to cue orders that show positive correlations with the discrimination rate cue order (reaching the following values after 100 decisions: validity learning rule: r = .18; tally: r = .29; tally swap: r = .24; simple swap: r = .18; TTL-correct: r = .48; TTL: r = .56). This means that cues that often lead to discriminations are more likely to end up in the first positions of the order. This is especially true for the move-to-front rules. In contrast, the cue orders resulting from all learning rules but the validity learning rule do not correlate or correlate negatively with the validity cue order, and even the correlations of the cue orders resulting from the validity learning rule after 100 decisions only reach an average r = .12.

But why would the discrimination rates of cues exert more of a pull on cue order than validity, even when the validity learning rule is applied? As mentioned earlier, this is what we would expect for the move-to-front rules, but it was unexpected for the other rules. Part of the explanation comes from the fact that in the city data set we used for the simulations, validity and discrimination rate of cues are negatively correlated. Having a low discrimination rate means that a cue has little chance to be used and hence to demonstrate its high validity. Whatever learning rule is used, if such a cue is displaced downward to the lower end of the order by other cues, it may have few chances to escape to the higher ranks where it belongs. The problem is that when a decision pair is finally encountered for which that cue would lead to a correct decision, it is unlikely to be checked because other, more discriminating although less valid, cues are looked up before it and already bring about a decision. Thus, because one-reason decision making is intertwined with the learning mechanism and so influences which cues can be learned about, what mainly makes a cue come early in the order is producing a high number of correct decisions and not so much a high ratio of correct discriminations to total discriminations regardless of base rates.

This argument indicates that performance may differ in environments where cue validities and discrimination rates correlate positively. We tested the learning rules on one such data set (r = .52) of mammal species life expectancies, predicted from 9 cues. It also differs from the cities environment in having a greater difference between TTB’s and Minimalist’s performance (6.5 vs. 4 percentage points). In terms of offline accuracy, the validity learning rule now indeed more closely approaches TTB’s accuracy after 100 decisions (.773 vs. .782). The tally rule, in contrast, behaves very much as in the cities environment, reaching an accuracy of .752, halfway between TTB and Minimalist (accuracy = .716). Thus only some learning rules can profit from the positive correlation.

4 Discussion

Most of the simpler cue order learning rules we have proposed do not fall far behind a validity learning rule in accuracy, and although the move-to-front rules cannot beat the accuracy achieved if cues were selected randomly, they compensate for this failure by being highly frugal. Interestingly, the rules that do achieve higher accuracy than Minimalist also beat random cue selection in terms of frugality. On the other hand, all rules, even the delta rule and the validity learning rule, stay below TTB’s accuracy across a relatively high number of decisions.
But often it is necessary to make good decisions without much experience. Therefore, learning rules should be preferred that quickly lead to orders with good performance. The relatively complex rules with relatively high memory requirements, i.e., the delta and the validity learning rule, but also the tally learning rule, rise in accuracy more quickly than the rules with lower requirements. The tally rule in particular thus represents a good compromise between cost, correctness and psychological plausibility considerations.

Remember that the rules based on tallies assume full memory of all correct minus incorrect decisions made by a cue so far. But this does not make the rule implausible, at least from a psychological perspective, even though computer scientists were reluctant to adopt such counting approaches because of their extra memory requirements. There is considerable evidence that people are actually very good at remembering the frequencies of events. Hasher and Zacks [13] conclude from a wide range of studies that frequencies are encoded in an automatic way, implying that people are sensitive to this information without intention or special effort. Estes [14] pointed out the role frequencies play in decision making as a shortcut for probabilities. Further, the tally rule and the tally swap rule are comparatively simple, not having to keep track of base rates or perform divisions as does the validity rule. On the other hand, the simple swap and move-to-front rules may not be much simpler, because storing a cue order may be about as demanding as storing a set of tallies. We have run experiments (reported elsewhere) in which the tally swap rule indeed best accounts for people’s actual processes of ordering cues.

Our goal in this paper was to explore how well simple cue-ordering rules could work in conjunction with lexicographic decision strategies. This is important because it is necessary to take into account the set-up costs of a heuristic in addition to its application costs when considering the mechanism’s overall simplicity. As the example of the validity search order of TTB shows, what is easy to apply may not necessarily be so easy to set up. But simple rules can also be at work in the construction of a heuristic’s building blocks. We have proposed such rules for the construction of one building block, the search order. Simple learning rules inspired by research in computer science can enable a one-reason decision heuristic to perform only slightly worse than if it had full knowledge of cue validities from the very beginning. Giving up the assumption of full a priori knowledge for the slight decrease in accuracy seems like a reasonable bargain: through the addition of learning rules, one-reason decision heuristics might lose some of their appeal to decision theorists who were surprised by the performance of such simple mechanisms compared to more complex algorithms, but they gain psychological plausibility and so become more attractive as explanations for human decision behavior.

References
[1] Fishburn, P.C. (1974). Lexicographic orders, utilities and decision rules: A survey. Management Science, 20, 1442-1471.
[2] Payne, J.W., Bettman, J.R., & Johnson, E.J. (1993). The adaptive decision maker. New York: Cambridge University Press.
[3] Bröder, A. (2000). Assessing the empirical validity of the “Take-The-Best” heuristic as a model of human probabilistic inference. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(5), 1332-1346.
[4] Bröder, A. (2003). Decision making with the “adaptive toolbox”: Influence of environmental structure, intelligence, and working memory load. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 611-625.
[5] Gigerenzer, G., Todd, P.M., & The ABC Research Group (1999). Simple heuristics that make us smart. New York: Oxford University Press.
[6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103(4), 650-669.
[7] Gigerenzer, G., & Goldstein, D.G. (1999). Betting on one good reason: The Take The Best heuristic. In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[8] Czerlinski, J., Gigerenzer, G., & Goldstein, D.G. (1999). How good are simple heuristics? In G. Gigerenzer, P.M. Todd & The ABC Research Group, Simple heuristics that make us smart. New York: Oxford University Press.
[9] Newell, B.R., & Shanks, D.R. (2003). Take the best or look at the rest? Factors influencing ‘one-reason’ decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 53-65.
[10] Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): a “lazy” algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607.
[11] Rivest, R. (1976). On self-organizing sequential search heuristics. Communications of the ACM, 19(2), 63-67.
[12] Bentley, J.L., & McGeoch, C.C. (1985). Amortized analyses of self-organizing sequential search heuristics. Communications of the ACM, 28(4), 404-411.
[13] Hasher, L., & Zacks, R.T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372-1388.
[14] Estes, W.K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.
6 0.57616168 131 nips-2004-Non-Local Manifold Tangent Learning
7 0.568151 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
8 0.56667966 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
9 0.55959064 102 nips-2004-Learning first-order Markov models for control
10 0.55812943 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
11 0.5571388 163 nips-2004-Semi-parametric Exponential Family PCA
12 0.55640966 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
13 0.55597025 69 nips-2004-Fast Rates to Bayes for Kernel Machines
14 0.55519724 64 nips-2004-Experts in a Markov Decision Process
15 0.55490291 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
16 0.55428571 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
17 0.55355263 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
18 0.55292213 116 nips-2004-Message Errors in Belief Propagation
19 0.55220473 124 nips-2004-Multiple Alignment of Continuous Time Series
20 0.55211627 207 nips-2004-ℓ₀-norm Minimization for Basis Selection