nips nips2003 nips2003-175 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Virginia Sa
Abstract: Why are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the “visual input”. We explain this finding in terms of the statistical structure in sensory inputs. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Why are sensory modalities segregated the way they are? [sent-3, score-0.398]
2 In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. [sent-4, score-0.379]
3 Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. [sent-5, score-0.829]
4 It is better to throw them away than to consider them part of the “visual input”. [sent-6, score-0.035]
5 We explain this finding in terms of the statistical structure in sensory inputs. [sent-7, score-0.111]
6 1 Introduction In previous work [1, 2] we developed a simple neural network algorithm that learned categories from co-occurrences of patterns presented to different sensory modalities. [sent-8, score-0.194]
7 Using only the co-occurring patterns of lip motion and acoustic signal, the network learned separate visual and auditory networks (subnets) to distinguish 5 consonant-vowel utterances. [sent-9, score-0.474]
8 In this paper we show that the success of this biologically motivated algorithm depends crucially on the statistics of features derived from different sensory modalities. [sent-11, score-0.09]
9 We do this by examining the performance when the two “network-modalities”, or pseudo-modalities, are made up of different combinations of inputs from the two sensory modalities. [sent-12, score-0.149]
10 The modalities are essentially trained by running Kohonen’s LVQ2.1 algorithm [3], but with the target class set by the output of the subnet of the other modality (receiving a co-occurring pattern) rather than by an external supervisory signal. [sent-14, score-0.289] [sent-15, score-0.778]
12 Figure 1: The network for the Minimizing-Disagreement algorithm. Each of the two subnets, Modality/Network 1 (Visual) and Modality/Network 2 (Auditory), maps its input through hidden units to “Class” units in a multi-sensory object area; during training, the class picked by each modality is fed back as the target for the other. [sent-17, score-0.441]
13 The weights from the hidden units to the output units determine the “labels” of the hidden units. [sent-18, score-0.372]
14 These weights are updated throughout training to allow hidden units to change classes if needed. [sent-19, score-0.208]
15 During training each modality creates an output label for the other as shown on the right side of the figure. [sent-20, score-0.58]
16 1. Initialize hidden unit weight vectors in each modality (unsupervised clustering). [sent-23, score-0.662]
17 2. Initialize hidden unit labels using unsupervised clustering of the activity patterns across the hidden units from both modalities. [sent-24, score-0.68]
18 The label of a hidden unit is the output unit to which it projects most strongly. [sent-26, score-0.266]
19 3. For each co-occurring pattern pair: • For each modality, update the hidden unit weight vectors according to the LVQ2.1 rule (only the update for modality 1 is described; see the sketch below). Updates are performed only if the current pattern X1(n) falls within c(n) of the border between two hidden units of different classes (one of them agreeing with the output from the other modality). [sent-27, score-0.662] [sent-28, score-0.708]
21 • Update the labeling weights using Hebbian learning between the winning hidden unit and the output of the other modality. [sent-30, score-0.206]
22 In order to discourage runaway to one of the trivial global minima of disagreement, where both modalities only ever output one class, weights to the output class neurons are renormalized at each step. [sent-31, score-0.499]
23 This normalization means that the algorithm is not modifying the output weights to minimize the disagreement but instead clustering the hidden unit representation using the output class given by the other modality. [sent-32, score-0.445]
24 This objective is better for these weights as it balances the goal of agreement with the desire to avoid the trivial solution of all hidden units having the same label. [sent-33, score-0.203]
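The explicit update equations do not survive in this excerpt, so the following is a minimal numpy sketch of one training step under the description above: an LVQ2.1-style prototype update driven by the class chosen by the other modality, followed by a Hebbian update and renormalization of the labeling weights. The function and variable names, the learning rates, and the relative-distance window standing in for the border criterion c(n) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def md_train_step(x, prototypes, label_w, other_class,
                  lr=0.05, lr_label=0.01, window=0.3):
    """One Minimizing-Disagreement step for a single modality (sketch).

    x           : input pattern presented to this modality
    prototypes  : (n_hidden, dim) float array of hidden-unit weight vectors
    label_w     : (n_classes, n_hidden) weights from hidden to output units;
                  a hidden unit's label is the output unit it projects to most strongly
    other_class : class index output by the other modality's subnet
                  (the self-supervised teaching signal)
    """
    labels = label_w.argmax(axis=0)              # current hidden-unit labels
    d = np.linalg.norm(prototypes - x, axis=1)
    i, j = np.argsort(d)[:2]                     # the two nearest hidden units

    # LVQ2.1 rule: the two nearest units must carry different labels, one of them
    # must agree with the other modality, and x must fall near their border
    # (a relative-distance window stands in for the border criterion c(n)).
    s = (1.0 - window) / (1.0 + window)
    if (labels[i] != labels[j]
            and other_class in (labels[i], labels[j])
            and min(d[i] / d[j], d[j] / d[i]) > s):
        win, lose = (i, j) if labels[i] == other_class else (j, i)
        prototypes[win] += lr * (x - prototypes[win])    # pull the agreeing unit in
        prototypes[lose] -= lr * (x - prototypes[lose])  # push the disagreeing unit away

    # Hebbian update of the labeling weights between the winning hidden unit and the
    # class output by the other modality, then renormalize the weights into each
    # output unit to discourage the trivial one-class solution.
    winner = int(np.argmin(d))
    label_w[other_class, winner] += lr_label
    label_w /= np.linalg.norm(label_w, axis=1, keepdims=True) + 1e-12
    return prototypes, label_w
```

Running this step on each co-occurring pattern pair, once per modality with the roles of learner and teacher swapped, reproduces the structure of step 3 above.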
25 Figure 2: An example auditory and visual pattern vector (Vx, Vy: motion vectors over time × image areas; Ax, Ay: frequency vectors over time × frequency channels). [sent-34, score-0.176]
26 The figure shows which dimensions went into each of Ax, Ay, Vx, and Vy. [sent-35, score-0.151]
27 2.1 Creation of Sub-Modalities The original auditory and visual data were collected using an 8mm camcorder and a directional microphone. [sent-37, score-0.293]
28 The speaker spoke 118 repetitions of /ba/, /va/, /da/, /ga/, and /wa/. [sent-38, score-0.019]
29 The first 98 samples of each utterance class formed the training set and the remaining 20 the test set. [sent-39, score-0.093]
30 The auditory feature vector was encoded using a 24-channel mel code over 20 msec windows overlapped by 10 msec. [sent-40, score-0.227]
31 This is a coarse short-time frequency encoding, which crudely approximates peripheral auditory processing. [sent-41, score-0.208]
32 Each feature vector was linearly scaled so that all dimensions lie in the range [-1,1]. [sent-42, score-0.156]
33 The final auditory code is a (24 × 9) = 216-dimensional vector for each utterance. [sent-43, score-0.184]
34 An example auditory feature vector is shown in Figure 2 (bottom). [sent-44, score-0.208]
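As a rough illustration of this encoding (not the authors' exact pipeline), the sketch below computes a 24-channel mel spectrogram over 20 ms windows with a 10 ms hop, keeps 9 frames per utterance, and scales each dimension into [-1, 1]. It assumes librosa for the mel filterbank; the sampling rate, the log compression, the frame selection around the utterance centre, and the per-dimension scaling are simplifying assumptions.

```python
import numpy as np
import librosa

def auditory_features(utterances, sr=16000, n_frames=9, n_mels=24):
    """Encode each utterance as a (24 x 9) = 216-dimensional mel vector (sketch)."""
    feats = []
    for y in utterances:
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr,
            n_fft=int(0.020 * sr),        # 20 ms analysis windows
            hop_length=int(0.010 * sr),   # 10 ms hop, i.e. windows overlap by 10 ms
            n_mels=n_mels)
        mel = np.log(mel + 1e-8)          # coarse short-time frequency encoding
        mid = mel.shape[1] // 2           # keep 9 frames around the utterance centre
        frames = mel[:, mid - n_frames // 2: mid + n_frames // 2 + 1]
        feats.append(frames.flatten())    # assumes each utterance spans >= 9 frames
    feats = np.array(feats)
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return 2 * (feats - lo) / (hi - lo + 1e-12) - 1   # linear scaling into [-1, 1]
```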
35 The visual data were processed using software designed and written by Ramprasad Polana [4]. [sent-45, score-0.109]
36 Visual frames were digitized as 64 × 64, 8-bit gray-level images using the Datacube MaxVideo system. [sent-46, score-0.052]
37 Segments were taken as 6 frames before the acoustically determined utterance offset and 4 after. [sent-47, score-0.093]
38 Each pair of frames was then averaged, these averaged frames were divided into 25 equal areas (5 × 5), and the motion magnitudes within each area were averaged. [sent-49, score-0.27]
39 The final visual feature vector, of dimension (5 frames × 25 areas) = 125, was linearly normalized as for the auditory vectors. [sent-50, score-0.369]
40 An example visual feature vector is shown in Figure 2 (top). [sent-51, score-0.133]
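A comparable sketch of the visual encoding: given precomputed per-pixel motion magnitudes for the 10-frame segment (here assumed to come from a simple frame-differencing or optical-flow step rather than Polana's software), consecutive frame pairs are averaged and each averaged frame is reduced to a 5 × 5 grid of mean motion magnitudes, giving the 125-dimensional vector. The trimming to a grid-divisible size and the omission of the final [-1, 1] scaling are simplifications.

```python
import numpy as np

def visual_features(motion_frames, grid=5):
    """motion_frames: (10, 64, 64) per-pixel motion magnitudes for one segment.
    Returns a (5 frames x 25 areas) = 125-dimensional feature vector (sketch)."""
    f = np.asarray(motion_frames, dtype=float)
    pairs = 0.5 * (f[0::2] + f[1::2])            # average each pair of frames -> (5, H, W)
    n, h, w = pairs.shape
    h, w = (h // grid) * grid, (w // grid) * grid
    pairs = pairs[:, :h, :w]                     # trim so each frame divides into the 5 x 5 grid
    blocks = pairs.reshape(n, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
    return blocks.reshape(-1)                    # scale into [-1, 1] across the dataset afterwards
```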
41 The original auditory and visual feature vectors were divided into two parts (called Ax, Ay and Vx,Vy as shown in Figure 2). [sent-52, score-0.37]
42 The partition was arbitrarily determined as a compromise between wanting a similar number of dimensions and similar information content in each part. [sent-53, score-0.153]
43 Our goal is to combine them in different ways and observe the performance of the minimizing-disagreement algorithm. [sent-55, score-0.037]
44 We first benchmarked the divided “sub-modalities” to see how useful they were for the task. [sent-56, score-0.031]
45 For this, we ran a supervised algorithm on each subset. [sent-57, score-0.085]
46 Table 1: Supervised performance of each of the sub-modalities. Ax: 89 ± 2, Ay: 91 ± 2, Vx: 83 ± 2, Vy: 77 ± 3. [sent-60, score-0.037]
47 The idea is to test all possible combinations of pseudo-modalities and compare the resulting performance of the final individual subnets with what a supervised algorithm could do with the same dimensions. [sent-64, score-0.187]
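One way to make "all possible combinations of pseudo-modalities" concrete is to enumerate the unordered splits of the four sub-modalities into two non-empty groups, each group becoming one side of an M-D network. The helper below is a hypothetical illustration, not code from the paper.

```python
from itertools import combinations

SUBMODALITIES = ("Ax", "Ay", "Vx", "Vy")

def pseudo_modality_divisions():
    """Enumerate unordered splits of the sub-modalities into two non-empty
    pseudo-modalities; each split defines one candidate M-D architecture."""
    divisions = []
    for k in range(1, len(SUBMODALITIES)):
        for side1 in combinations(SUBMODALITIES, k):
            side2 = tuple(s for s in SUBMODALITIES if s not in side1)
            if (side2, side1) not in divisions:   # count each split only once
                divisions.append((side1, side2))
    return divisions

# 7 divisions in total, e.g. (('Ax',), ('Ay', 'Vx', 'Vy')) and (('Ax', 'Ay'), ('Vx', 'Vy'))
```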
48 3 Pseudo-Modality Experiments In order to allow fair comparison, appropriate parameters were found for each modality division. [sent-66, score-0.473]
49 The data were divided into 75% training and 25% test data. [sent-67, score-0.031]
50 Optimal parameters were selected by observing performance on the training data, and performance is reported on the test data. [sent-68, score-0.074]
51 The results for all possible divisions are presented in Figure 3. [sent-69, score-0.038]
52 The light gray bar and number represent the test-set performance of the pseudo-modality consisting of the sub-modalities listed below it. [sent-71, score-0.063]
53 The darker bar and number represent the test-set performance of the other pseudo-modality. [sent-72, score-0.063]
54 The black outlines (and numbers above the outlines) give the performance of the corresponding supervised algorithm (LVQ2.1). [sent-73, score-0.122]
55 Thus, the empty area between the shaded area and black outline represents the loss from lack of supervision. [sent-75, score-0.064]
56 For each submodality, we can ask: to get the best performance of a subnet using those dimensions, where should one put the other sub-modalities in an M-D network? [sent-77, score-0.255]
57 For instance, to answer that question for Ax, one would compare the performance of the Ax subnet in the Ax/Ay+Vx+Vy network with that of the Ax+Ay subnet in the Ax+Ay/Vx+Vy network, with that of the Ax+Vx+Vy subnet in the Ax+Vx+Vy/Ay network, etc. [sent-78, score-0.87]
58 The subnet containing Ax that performs the best is the Ax+Ay subnet (trained with co-modality Vx+Vy). [sent-79, score-0.436]
59 In fact, it turns out that for each submodality, the architecture giving optimal post-training performance of the subnet containing that submodality is to put the dimensions from the same “real” modality on the same side and those from the other modality on the other side. [sent-80, score-1.371]
60 Is it better to throw away dimensions from the other sensory modality than to add them to the wrong side of the network? We can answer this question by comparing the performance of the Ax/Ay+Vx+Vy network with that of the Ax/Vx+Vy network, as shown in Figure 4. [sent-82, score-0.216]
61 For that particular division, the results are not significantly different (even though we have removed the most useful dimensions), but for all the other divisions, performance is improved when dimensions are removed so that only dimensions from one “real” sensory modality are on one side. [sent-83, score-0.904]
62 Standard errors for the self-supervised performance means are ±1. [sent-85, score-0.037]
63 Note that this is true even though a supervised network with Ax+Vx+Vy does much better than a supervised network with Vx+Vy — this is not a simple feature selection result. [sent-89, score-0.346]
64 4 Correlational structure is important Why do we get these results? [sent-91, score-0.021]
65 The answer is that the results are very dependent on the statistical structure between dimensions within and between different sensory modalities. [sent-92, score-0.309]
66 Consider a simpler system of two 1-Dimensional modalities and two classes of objects. [sent-93, score-0.314]
67 Assume that the sensation detected by each modality has a probability density given by a Gaussian of different mean for each class. [sent-94, score-0.516]
68 The densities seen by each modality are shown in Figure 5. [sent-95, score-0.531]
69 In part A) of the Figure, the joint density for the stimuli to both modalities is shown for the case of conditionally uncorrelated stimuli (within each class, the inputs are uncorrelated). [sent-96, score-0.505]
70 Parts C) and D) show the changing joint density as the sensations to the two modalities become more correlated within each class. [sent-97, score-0.37]
71 Notice that the density changes from a “two blob” structure to more of a “ridge” structure. [sent-98, score-0.064]
72 As it does this, the projection of the joint density gives less indication of the underlying bimodal structure, and the local minimum of the Minimizing-Disagreement energy function gets shallower and narrower. [sent-99, score-0.102]
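This progression is easy to reproduce: sample two 1-D modalities whose class-conditional densities are Gaussians with class-dependent means and a within-class correlation ρ, and the joint density moves from two blobs (ρ = 0) toward a ridge as ρ grows. The means, variance, and class priors below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def sample_two_modalities(n=5000, rho=0.0, mu=(-1.0, 1.0), sigma=0.5, seed=0):
    """Sample co-occurring 1-D stimuli (x1, x2) for two classes.

    Within each class both modalities have mean mu[c] and the same variance;
    rho is the within-class correlation between the two modalities."""
    rng = np.random.default_rng(seed)
    classes = rng.integers(0, 2, size=n)                  # equal class priors
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])   # within-class covariance
    x = np.empty((n, 2))
    for c in (0, 1):
        idx = classes == c
        x[idx] = rng.multivariate_normal([mu[c], mu[c]], cov, size=idx.sum())
    return x, classes

# rho = 0 gives the "two blob" joint density; rho near 1 gives a "ridge" whose
# marginal projections no longer reveal the bimodal class structure.
x_uncorr, _ = sample_two_modalities(rho=0.0)
x_corr, _ = sample_two_modalities(rho=0.9)
```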
73 In the figure imagine that there are two classes of objects, with densities given by the thick curve and the thin curve and that this marginal density is the same in each one-dimensional modality. [sent-102, score-0.152]
74 In the top case, the modalities are conditionally independent. [sent-104, score-0.385]
75 Given that a “thick” object is present, the particular pattern to each modality is independent. [sent-105, score-0.494]
76 The lines represent a possible sampling of data (where points are joined if they co-occurred). [sent-106, score-0.063]
77 The minimizing-disagreement algorithm wants to find a line from top to bottom that crosses the fewest lines – within the pattern space, disagreement is minimized for the dashed line shown. [sent-107, score-0.399]
78 Standard errors for the self-supervised performance means are ±1. [sent-109, score-0.037]
79 Figure 5: Example densities: the class-conditional densities seen in one modality and the joint densities over both modalities for increasing within-class correlation ρ (A: ρ = 0). [sent-112, score-0.498]
80 The M-D algorithm wants to find a partition that crosses the fewest lines. [sent-134, score-0.119]
81 Figure 7: Statistical structure of our data: conditional information I(X;Y|Class) (with diagonal zeroed) and within-class correlation coefficients (averaged over each class). In the bottom case, the modalities are strongly dependent. [sent-135, score-0.289]
82 In this case there are many local minima of the disagreement that are not closely related to the class boundary. [sent-136, score-0.072]
83 It is easy for the networks to minimize the disagreement between the outputs of the modalities, without paying attention to the class. [sent-137, score-0.121]
84 Having two very strongly dependent variables, one on each side of the network, means that the network can minimize disagreement by simply listening to those units. [sent-138, score-0.274]
85 To verify that our auditory-visual results were due to statistical differences between the dimensions, we examined the statistical structure of our data. [sent-139, score-0.021]
86 It turns out that, within a class, the correlation coefficient between most pairs of dimensions is fairly low. [sent-140, score-0.132]
87 However, correlations are high for related auditory features (similar time and frequency band), and also for related visual features. [sent-141, score-0.317]
88 We also computed the conditional mutual information between each pair of features given the class I(x; y|Class). [sent-143, score-0.052]
89 This value is 0 if and only if the two features are conditionally independent given the class. [sent-145, score-0.077]
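A sketch of these two diagnostics, written for binned features and a simple plug-in estimator; the bin count and the estimator are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def within_class_correlation(X, y):
    """Average over classes of the feature-by-feature correlation matrix."""
    return np.mean([np.corrcoef(X[y == c], rowvar=False) for c in np.unique(y)], axis=0)

def conditional_mutual_info(xi, xj, y, bins=8):
    """Plug-in estimate of I(xi; xj | Class) from a 2-D histogram per class.
    Zero (up to estimation error) iff xi and xj are conditionally independent
    given the class."""
    cmi = 0.0
    for c in np.unique(y):
        m = (y == c)
        joint, _, _ = np.histogram2d(xi[m], xj[m], bins=bins)
        p = joint / joint.sum()
        pi = p.sum(axis=1, keepdims=True)        # marginal of xi within class c
        pj = p.sum(axis=0, keepdims=True)        # marginal of xj within class c
        nz = p > 0
        cmi += m.mean() * np.sum(p[nz] * np.log(p[nz] / (pi @ pj)[nz]))
    return cmi
```

Applied to every pair of the 216 auditory and 125 visual dimensions, functions like these would yield matrices analogous to those summarized in Figure 7.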
90 The graphs show that many of the auditory dimensions are highly dependent on each other (even given the class), as are many of the visual dimensions. [sent-146, score-0.464]
91 This makes them unsuitable for serving on the other side of an M-D network. [sent-147, score-0.038]
92 5 Discussion The minimizing-disagreement algorithm was initially developed as a model of self-supervised cortical learning, and the importance of conditionally uncorrelated structure was mentioned in [5]. [sent-149, score-0.134]
93 However, in co-training-style algorithms, inputs that are conditionally dependent are not helpful, but they are also not as harmful. [sent-152, score-0.138]
94 Because the self-supervised algorithm is dependent on the class structure being evident in the joint space as its only source of supervision, it is very sensitive to conditionally dependent relationships between the modalities. [sent-153, score-0.285]
95 We have shown that different sensory modalities are ideally suited for teaching each other. [sent-154, score-0.379]
96 The same reasoning applies to distinct feature dimensions within a modality (e.g. color and motion for the visual modality), which are also likely to be conditionally independent (and indeed may be actively kept so [8, 9, 10]). [sent-157, score-0.246]
97 We suggest that brain connectivity may be constrained not only due to volume limits, but because limiting connectivity may be beneficial for learning. [sent-158, score-0.046]
98 Acknowledgements A preliminary version of this work appeared as a chapter [5] in the book Psychology of Learning and Motivation. [sent-159, score-0.044]
wordName wordTfidf (topN-words)
[('modality', 0.473), ('vy', 0.37), ('ax', 0.33), ('vx', 0.303), ('modalities', 0.289), ('ay', 0.242), ('subnet', 0.218), ('auditory', 0.184), ('dimensions', 0.132), ('disagreement', 0.121), ('visual', 0.109), ('sensory', 0.09), ('virginia', 0.087), ('hidden', 0.087), ('supervised', 0.085), ('conditionally', 0.077), ('network', 0.076), ('units', 0.067), ('submodality', 0.065), ('subnets', 0.065), ('densities', 0.058), ('unit', 0.055), ('sa', 0.052), ('frames', 0.052), ('class', 0.052), ('book', 0.044), ('fewest', 0.044), ('harmful', 0.044), ('joined', 0.044), ('ramprasad', 0.044), ('performances', 0.043), ('density', 0.043), ('utterance', 0.041), ('dependent', 0.039), ('motion', 0.039), ('joint', 0.038), ('divisions', 0.038), ('side', 0.038), ('performance', 0.037), ('unsupervised', 0.036), ('uncorrelated', 0.036), ('output', 0.035), ('creation', 0.035), ('throw', 0.035), ('dana', 0.035), ('label', 0.034), ('clustering', 0.031), ('divided', 0.031), ('pseudo', 0.03), ('weights', 0.029), ('wants', 0.029), ('patterns', 0.028), ('diego', 0.028), ('answer', 0.027), ('thick', 0.026), ('bar', 0.026), ('crosses', 0.025), ('weight', 0.025), ('classes', 0.025), ('initialize', 0.025), ('frequency', 0.024), ('areas', 0.024), ('averaged', 0.024), ('feature', 0.024), ('outline', 0.024), ('connectivity', 0.023), ('unlabeled', 0.023), ('inputs', 0.022), ('vectors', 0.022), ('partition', 0.021), ('structure', 0.021), ('pattern', 0.021), ('psychology', 0.021), ('color', 0.021), ('removed', 0.02), ('trivial', 0.02), ('minima', 0.02), ('gure', 0.02), ('area', 0.02), ('bene', 0.019), ('relationships', 0.019), ('lines', 0.019), ('top', 0.019), ('segregated', 0.019), ('lip', 0.019), ('blob', 0.019), ('minton', 0.019), ('discourage', 0.019), ('throwing', 0.019), ('correlational', 0.019), ('medin', 0.019), ('cowan', 0.019), ('disagree', 0.019), ('prof', 0.019), ('alspector', 0.019), ('went', 0.019), ('mel', 0.019), ('consonant', 0.019), ('psychonomic', 0.019), ('spoke', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 175 nips-2003-Sensory Modality Segregation
Author: Virginia Sa
Abstract: Why are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the “visual input”. We explain this finding in terms of the statistical structure in sensory inputs. 1
2 0.087903753 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
3 0.063265644 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen
Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1
4 0.058896713 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
5 0.05027505 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.
6 0.050060436 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
7 0.049895287 7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd
8 0.049877834 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
9 0.043965325 73 nips-2003-Feature Selection in Clustering Problems
10 0.043258891 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
11 0.039592486 43 nips-2003-Bounded Invariance and the Formation of Place Fields
12 0.039362613 121 nips-2003-Log-Linear Models for Label Ranking
13 0.039263848 113 nips-2003-Learning with Local and Global Consistency
14 0.03824681 176 nips-2003-Sequential Bayesian Kernel Regression
15 0.037355848 111 nips-2003-Learning the k in k-means
16 0.036340039 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
17 0.036001567 46 nips-2003-Clustering with the Connectivity Kernel
18 0.035511583 37 nips-2003-Automatic Annotation of Everyday Movements
19 0.035157904 92 nips-2003-Information Bottleneck for Gaussian Variables
20 0.034451317 119 nips-2003-Local Phase Coherence and the Perception of Blur
topicId topicWeight
[(0, -0.124), (1, -0.006), (2, 0.076), (3, 0.009), (4, -0.066), (5, 0.014), (6, 0.055), (7, 0.028), (8, -0.035), (9, 0.033), (10, 0.075), (11, 0.01), (12, 0.043), (13, 0.008), (14, 0.009), (15, 0.025), (16, -0.031), (17, -0.04), (18, 0.01), (19, -0.007), (20, -0.037), (21, -0.092), (22, -0.004), (23, 0.089), (24, 0.007), (25, -0.065), (26, 0.038), (27, 0.055), (28, -0.011), (29, -0.047), (30, 0.07), (31, 0.063), (32, 0.076), (33, -0.072), (34, 0.081), (35, -0.014), (36, 0.045), (37, -0.042), (38, -0.066), (39, 0.012), (40, -0.089), (41, 0.047), (42, -0.059), (43, 0.069), (44, -0.001), (45, 0.119), (46, 0.016), (47, 0.05), (48, -0.107), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.9249689 175 nips-2003-Sensory Modality Segregation
Author: Virginia Sa
Abstract: Why are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the “visual input”. We explain this finding in terms of the statistical structure in sensory inputs. 1
2 0.52474242 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
3 0.50160384 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen
Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1
4 0.4854016 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
Author: Woojae Kim, Daniel J. Navarro, Mark A. Pitt, In J. Myung
Abstract: Despite the popularity of connectionist models in cognitive science, their performance can often be difficult to evaluate. Inspired by the geometric approach to statistical model selection, we introduce a conceptually similar method to examine the global behavior of a connectionist model, by counting the number and types of response patterns it can simulate. The Markov Chain Monte Carlo-based algorithm that we constructed Þnds these patterns efficiently. We demonstrate the approach using two localist network models of speech perception. 1
5 0.43425843 45 nips-2003-Circuit Optimization Predicts Dynamic Networks for Chemosensory Orientation in Nematode C. elegans
Author: Nathan A. Dunn, John S. Conery, Shawn R. Lockery
Abstract: The connectivity of the nervous system of the nematode Caenorhabditis elegans has been described completely, but the analysis of the neuronal basis of behavior in this system is just beginning. Here, we used an optimization algorithm to search for patterns of connectivity sufficient to compute the sensorimotor transformation underlying C. elegans chemotaxis, a simple form of spatial orientation behavior in which turning probability is modulated by the rate of change of chemical concentration. Optimization produced differentiator networks with inhibitory feedback among all neurons. Further analysis showed that feedback regulates the latency between sensory input and behavior. Common patterns of connectivity between the model and biological networks suggest new functions for previously identified connections in the C. elegans nervous system. 1
6 0.42962462 5 nips-2003-A Classification-based Cocktail-party Processor
7 0.42290115 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel
8 0.40807882 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
9 0.39179128 187 nips-2003-Training a Quantum Neural Network
10 0.38460815 7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd
11 0.38292426 196 nips-2003-Wormholes Improve Contrastive Divergence
12 0.38288775 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
13 0.36694074 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
14 0.35825494 56 nips-2003-Dopamine Modulation in a Basal Ganglio-Cortical Network of Working Memory
15 0.35634717 185 nips-2003-The Doubly Balanced Network of Spiking Neurons: A Memory Model with High Capacity
16 0.35056114 130 nips-2003-Model Uncertainty in Classical Conditioning
17 0.34956464 43 nips-2003-Bounded Invariance and the Formation of Place Fields
18 0.34904325 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
19 0.29967839 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
20 0.29149771 113 nips-2003-Learning with Local and Global Consistency
topicId topicWeight
[(0, 0.027), (11, 0.033), (30, 0.03), (35, 0.032), (53, 0.121), (69, 0.018), (71, 0.046), (76, 0.024), (79, 0.367), (85, 0.067), (91, 0.129)]
simIndex simValue paperId paperTitle
1 0.79048204 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
Author: Yu Zhou, Steven G. Mason, Gary E. Birch
Abstract: This paper presents an energy normalization transform as a method to reduce system errors in the LF-ASD brain-computer interface. The energy normalization transform has two major benefits to the system performance. First, it can increase class separation between the active and idle EEG data. Second, it can desensitize the system to the signal amplitude variability. For four subjects in the study, the benefits resulted in the performance improvement of the LF-ASD in the range from 7.7% to 18.9%, while for the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. 1 In trod u ction In an effort to provide alternative communication channels for people who suffer from severe loss of motor function, several researchers have worked over the past two decades to develop a direct Brain-Computer Interface (BCI). Since electroencephalographic (EEG) signal has good time resolution and is non-invasive, it is commonly used for data source of a BCI. A BCI system converts the input EEG into control signals, which are then used to control devices like computers, environmental control system and neuro-prostheses. Mason and Birch [1] proposed the Low-Frequency Asynchronous Switch Design (LF-ASD) as a BCI which detected imagined voluntary movement-related potentials (IVMRPs) in spontaneous EEG. The principle signal processing components of the LF-ASD are shown in Figure 1. sIN Feature Extractor sLPF LPF Feature Classifier sFE sFC Figure 1: The original LF-ASD design. The input to the low-pass filter (LPF), denoted as SIN in Figure 1, are six bipolar EEG signals recorded from F1-FC1, Fz-FCz, F2-FC2, FC1-C1, FCz-Cz and FC2-C2 sampled at 128 Hz. The cutoff frequency of the LPF implemented by Mason and Birch was 4 Hz. The Feature Extractor of the LF-ASD extracts custom features related to IVMRPs. The Feature Classifier implements a one-nearest-neighbor (1NN) classifier, which determines if the input signals are related to a user state of voluntary movement or passive (idle) observation. The LF-ASD was able to achieve True Positive (TP) values in the range of 44%-81%, with the corresponding False Positive (FP) values around 1% [1]. Although encouraging, the current error rates of the LF-ASD are insufficient for real-world applications. This paper proposes a method to improve the system performance. 2 Design and Rationale The improved design of the LF-ASD with the Energy Normalization Transform (ENT) is provided in Figure 2. SIN ENT SN SNLPF LPF Feature Extractor SNFE SNFC Feature Classifier Figure 2: The improved LF-ASD with the Energy Normalization Transform. The design of the Feature Extractor and Feature Classifier were the same as shown in Figure 1. The Energy Normalization Transform (ENT) is implemented as S N (n ) = S s=( w s= − ( N ∑ w IN S IN −1) / 2 N −1) / 2 (n ) 2 (n − s) w N where W N (normalization window size) is the only parameter in the equation. The optimal parameter value was obtained by exhaustive search for the best class separation between active and idle EEG data. The method of obtaining the active and idle EEG data is provided in Section 3.1. The idea to use energy normalization to improve the LF-ASD design was based primarily on an observation that high frequency power decreases significantly around movement. For example, Jasper and Penfield [3] and Pfurtscheller et al, [4] reported EEG power decrease in the mu (8-12 Hz) and beta rhythm (18-26 Hz) when people are involved in motor related activity. 
Mason [5] also found that power in frequency components above 4 Hz decreased significantly during movement-related potential periods, while power in components below 4 Hz did not. Thus energy normalization, which would increase the low-frequency power level, would strengthen the 0-4 Hz features used in the LF-ASD and hence reduce errors. In addition, as a side benefit, it can automatically adjust the mean scale of the input signal and desensitize the system to changes in EEG power, which is known to vary over time [2]. Therefore, it was postulated that adding the ENT to the design would have two major benefits. First, it can increase the EEG power around motor potentials, consequently increasing the class separation and feature strength. Second, it can desensitize the system to amplitude variance of the input signal. In addition, since the system components of the modified LF-ASD after the ENT are the same as in the original design, a major concern was whether the ENT distorted the features used by the LF-ASD. Because the features used by the LF-ASD are generated from the 0-4 Hz band, if the ENT does not distort the phase and magnitude spectra in this band, it will not distort the features relevant to movement-potential detection.

3 Evaluation

3.1 Test data

Two types of EEG data, Active and Idle, were pre-recorded from five able-bodied individuals, as shown in Figure 3. Active Data was recorded during repeated right index finger flexions alternating with periods of no motor activity; Idle Data was recorded during extended periods of passive observation.

Figure 3: Data definition of M1, M2, Idle1 and Idle2.

Observation windows centered at the times of the finger-switch activations (as shown in Figure 4) were imposed on the active data to separate data related to movements from data during idle periods. For the purposes of this study, data in the front part of the observation window was defined as M1 and data in the rear part of the window was defined as M2. Data falling outside the observation window was defined as Idle2. All data of the Idle type was defined as Idle1 for comparison with Idle2.

Figure 4: Ensemble average of EEG centered on finger activations.

Figure 5: Density distributions of Idle1, Idle2, M1 and M2.

In terms of the density distributions of the active and idle data, the separation between M2 and Idle2 was the largest, and Idle1 and Idle2 were nearly identical (see Figure 5). For this study, M2 and Idle2 were therefore chosen to represent the active and idle data classes, and the separation between the M2 and Idle2 data was defined by the difference of means (DOM) scaled by the amplitude range of Idle2.

3.2 Optimal parameter determination

The optimal combination of the normalization window size WN and the observation window size WO was selected as the one that achieved the maximal DOM value. It was determined by exhaustive search, as discussed in Section 4.1.

3.3 Effect of ENT on the Low Pass Filter output

As mentioned previously, it was postulated that the ENT had two major effects: increasing the class separation between active and idle EEG and desensitizing the system to signal amplitude variance. This hypothesis was evaluated by comparing the characteristics of SNLPF (Figure 2) and SLPF (Figure 1). The DOM was used to measure the class separation; a larger DOM indicates a larger class separation.
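The DOM measure used above can be written compactly. The following is an illustrative sketch; the function name and the use of the peak-to-peak amplitude as the "range of Idle2" are assumptions, since the paper does not give an explicit formula, and the data below are synthetic.

```python
import numpy as np

def difference_of_means(active, idle):
    """Difference of means (DOM) between active (M2) and idle (Idle2) values,
    scaled by the amplitude range of the idle data."""
    active = np.asarray(active, dtype=float)
    idle = np.asarray(idle, dtype=float)
    return (active.mean() - idle.mean()) / np.ptp(idle)

# Synthetic illustration only (not the subjects' data): a larger DOM value
# corresponds to a larger separation between the two class distributions.
rng = np.random.default_rng(0)
idle2 = rng.normal(0.0, 1.0, size=1000)
m2 = rng.normal(1.5, 1.0, size=1000)
print("DOM:", difference_of_means(m2, idle2))
```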
In addition, a signal with a smaller standard deviation may yield a more stable feature set.

3.4 Effect of ENT on the LF-ASD output

The performance of the original and improved designs was evaluated by comparing the signal characteristics of SNFC in Figure 2 with SFC in Figure 1. A Receiver Operating Characteristic (ROC) curve [6] was generated for each design. The ROC curve characterizes system performance over a range of TP vs. FP values; a larger area under the ROC curve indicates better system performance. In real applications, a BCI with high FP rates could cause frustration for subjects; therefore, only the LF-ASD performance at FP values below 1% was studied in this work.

4 Results

4.1 Optimal normalization window size (WN)

The optimal WN was chosen by exhaustive search for the maximal DOM between the active and idle classes. This method could in principle depend on the observation window size (WO); however, as shown in Figure 6a, the optimal WN was found to be independent of WO. Experimentally, the WO values were selected in the range of 50-60 samples, which corresponded to the largest DOM between the non-normalized active and idle data. The optimal WN was then obtained by exhaustive search for the largest DOM over the normalized active and idle data. The DOM vs. WN profile for Subject 1 is shown in Figure 6b.

Figure 6: Optimal parameter determination for Subject 1 in Channel 1. a) DOM vs. WO; b) DOM vs. WN.

When using the ENT, a small WN value may distort the feature set used by the LF-ASD, so the optimal WN was not selected in that range (< 40 samples). When WN is greater than 200, the ENT loses its ability to increase class separation and the DOM curve gradually approaches the best separation achievable without normalization. Thus, the optimal WN should correspond to the maximal DOM value for WN in the range from 40 to 200. In Figure 6b, the optimal WN is around 51.

4.2 Effect of ENT on the Low Pass Filter output

With the ENT, the standard deviation of the low-frequency EEG signal decreased from around 1.90 to 1.30 across the six channels and the five subjects. This change resulted in more stable feature sets; thus, the ENT desensitizes the system to input signal variance.

Figure 7: Density distributions of the active vs. idle class without (a) and with (b) the ENT, for Subject 1 in Channel 1.

As shown in Figure 7, by increasing the EEG power around motor potentials, the ENT can increase the class separation between active and idle EEG data. The class separation in the (frontal) Channels 1-3 increased consistently with the proposed ENT across all subjects. The same was true for the (midline) Channels 4-6 for all subjects except Subject 5, whose DOM in Channels 5 and 6 decreased by 2.3% and 3.4% respectively with normalization; this is consistent with the fact that his EEG power in Channels 4-6 does not decrease. On average, across all five subjects, the DOM increased with normalization to about 28.8%, 26.4%, 39.4%, 20.5%, 17.8% and 22.5% in the six channels respectively. In addition, the magnitude and phase spectra of the EEG signal before and after the ENT are shown in Figure 8. The ENT causes no visible distortion of the signal in the low-frequency band (0-4 Hz) used by the LF-ASD; therefore, the ENT does not distort the features used by the LF-ASD.

Figure 8: Magnitude and phase spectra of the EEG signal before and after the ENT.
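The parameter selection of Section 4.1 amounts to a one-dimensional grid search over WN. The sketch below illustrates that search under the constraints stated above (WN restricted to 40-200 samples, DOM as the criterion); it reuses the energy_normalize and difference_of_means sketches from earlier. The function name and the per-trial data layout are assumptions, and for brevity the DOM is computed directly on the normalized samples rather than on the extracted low-pass features as in the paper.

```python
import numpy as np

# Assumes energy_normalize and difference_of_means from the sketches above.
def best_normalization_window(active_trials, idle_trials,
                              candidates=range(41, 201, 2)):
    """Exhaustive search for the WN (40-200 samples) that maximizes the DOM
    between energy-normalized active (M2) and idle (Idle2) data."""
    best_wn, best_dom = None, -np.inf
    for w_n in candidates:                       # odd sizes keep windows centered
        active = np.concatenate([energy_normalize(tr, w_n) for tr in active_trials])
        idle = np.concatenate([energy_normalize(tr, w_n) for tr in idle_trials])
        dom = difference_of_means(active, idle)
        if dom > best_dom:
            best_wn, best_dom = w_n, dom
    return best_wn, best_dom
```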
4.3 Effect of ENT on the LF-ASD output

The two major benefits of the ENT for the low-frequency EEG data result in improved LF-ASD performance. Subject 1's ROC curves with and without the ENT are shown in Figure 9; the ROC curve with the ENT at the optimal parameter value lies above the ROC curve without the ENT, indicating that the improved LF-ASD performs better. Table I compares the system performance with and without the ENT in terms of the TP rate at a corresponding FP rate of 1%, across all five subjects.

Figure 9: The ROC curves (in the region of interest) of Subject 1 with different WN values and the corresponding ROC curve without the ENT.

Table I: Performance of the LF-ASD with and without the ENT, in terms of the True Positive rate at a corresponding False Positive rate of 1%.

                          Subject 1   Subject 2   Subject 3   Subject 4   Subject 5
  TP without ENT            66.1%       82.7%       79.7%       79.3%       90.5%
  TP with ENT               85.0%       90.4%       88.0%       87.8%       88.7%
  Performance Improvement   18.9%        7.7%        8.3%        8.5%       -1.8%

For 4 of the 5 subjects, at an FP rate of 1%, the improved system with the ENT increased the TP value by 7.7%, 8.3%, 8.5% and 18.9% respectively. Thus, for these subjects, the range of TP at 1% FP improved from 66.1%-82.7% to 85.0%-90.4% with the ENT. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance remained around 90% with the ENT. In addition, this evaluation is conservative: since the codebook in the Feature Classifier and the parameters in the Feature Extractor of the LF-ASD were derived from non-normalized EEG, they favor the non-normalized condition. Therefore, if the parameters and the codebook of the modified LF-ASD are regenerated from normalized EEG in the future, the modified LF-ASD may perform better than in this evaluation.

5 Conclusion

The evaluation with data from five able-bodied subjects indicates that the proposed system with the Energy Normalization Transform (ENT) performs better than the original. This study verified the original hypotheses that the improved design with the ENT has two major benefits: it increases the class separation between active and idle EEG and desensitizes the system performance to input amplitude variance. As a side benefit, the ENT also makes the design less sensitive to the mean input scale. Over the broad band, the Energy Normalization Transform is a non-linear transform; however, it causes no visible distortion of the signal in the 0-4 Hz band and therefore does not distort the features used by the LF-ASD. For 4 of the 5 subjects, at a corresponding False Positive rate of 1%, the proposed transform increased the True Positive rate by 7.7%, 8.3%, 8.5% and 18.9% respectively. Thus, the overall performance of the LF-ASD for these subjects improved from 66.1%-82.7% to 85.0%-90.4%. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. In the future, with the codebook derived from normalized data, the performance could be improved further.

References

[1] Mason, S. G. and Birch, G. E. (2000) A Brain-Controlled Switch for Asynchronous Control Applications. IEEE Trans Biomed Eng, 47(10):1297-1307.
[2] Vaughan, T. M., Wolpaw, J. R., and Donchin, E. (1996) EEG-Based Communication: Prospects and Problems. IEEE Trans Reh Eng, 4(4):425-430.
[3] Jasper, H. and Penfield, W. (1949) Electrocorticograms in man: Effect of voluntary movement upon the electrical activity of the precentral gyrus.
Arch. Psychiat. Nervenkr., 183:163-174.
[4] Pfurtscheller, G., Neuper, C., and Flotzinger, D. (1997) EEG-based discrimination between imagination of right and left hand movement. Electroencephalography and Clinical Neurophysiology, 103:642-651.
[5] Mason, S. G. (1997) Detection of single trial index finger flexions from continuous, spatiotemporal EEG. PhD Thesis, UBC, January.
[6] Green, D. M. and Swets, J. A. (1996) Signal Detection Theory and Psychophysics. New York: John Wiley and Sons, Inc.
same-paper 2 0.78247285 175 nips-2003-Sensory Modality Segregation
Author: Virginia Sa
Abstract: Why are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the “visual input”. We explain this finding in terms of the statistical structure in sensory inputs. 1
3 0.67400807 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
4 0.46825513 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
5 0.46335304 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generative-modelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
6 0.4623701 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
7 0.4602755 73 nips-2003-Feature Selection in Clustering Problems
8 0.46015131 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
9 0.45892626 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
10 0.45892483 113 nips-2003-Learning with Local and Global Consistency
11 0.4584482 81 nips-2003-Geometric Analysis of Constrained Curves
12 0.45837355 30 nips-2003-Approximability of Probability Distributions
13 0.4578734 68 nips-2003-Eye Movements for Reward Maximization
14 0.45777535 126 nips-2003-Measure Based Regularization
15 0.45642021 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
16 0.45620024 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
17 0.4559876 143 nips-2003-On the Dynamics of Boosting
18 0.45294741 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
19 0.45269182 78 nips-2003-Gaussian Processes in Reinforcement Learning
20 0.452618 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks