nips nips2003 nips2003-162 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. [sent-4, score-1.059]
2 Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. [sent-5, score-0.969]
3 In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. [sent-6, score-0.674]
4 We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. [sent-7, score-0.274]
5 1 Introduction. Working with a time-frequency representation of speech can have many advantages over processing the raw amplitude samples of the signal directly. [sent-8, score-0.642]
6 Much of the structure in speech and other audio signals manifests itself through simultaneous common onset, offset or co-modulation of energy in multiple frequency bands, as harmonics or as coloured noise bursts. [sent-9, score-0.643]
7 Furthermore, there are many important high-level operations which are much easier to perform in a short-time multiband spectral representation than on the time domain signal. [sent-10, score-0.145]
8 For example, time-scale modification algorithms attempt to lengthen or shorten a signal without affecting its frequency content. [sent-11, score-0.258]
9 The main idea is to upsample or downsample the spectrogram of the signal along the time axis while leaving the frequency axis unwarped. [sent-12, score-0.964]
10 Source separation or denoising algorithms often work by identifying certain time-frequency regions as having high signal-to-noise or as belonging to the source of interest and “masking-out” others. [sent-13, score-0.098]
11 Of course, there are many clever and efficient speech processing algorithms for pitch tracking[6], denoising[7], and even timescale modification[4] that do operate directly on the signal samples, but the spectral domain certainly has its advantages. [sent-15, score-0.707]
12 Figure 1: In the generative model, the spectrogram is obtained by taking overlapping windows of length n from the time-domain speech signal, and computing the energy spectrum. [sent-16, score-1.226]
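As a concrete illustration of the windowing scheme in Figure 1, the sketch below computes an energy spectrogram from overlapping windows of length n spaced n/2 apart; this is our own minimal NumPy reading of the figure, not the authors' code, and the rectangular window and function name are assumptions.

```python
import numpy as np

def spectrogram(s, n=256):
    """Energy spectrogram: overlapping windows of length n, spaced n/2 apart,
    each mapped to the squared magnitudes of its FFT."""
    hop = n // 2
    num_windows = (len(s) - n) // hop + 1
    M = np.empty((num_windows, n))
    for k in range(num_windows):
        frame = s[k * hop : k * hop + n]        # window k of the time-domain signal
        M[k] = np.abs(np.fft.fft(frame)) ** 2   # energy spectrum of that window
    return M
```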
13 In order to reap the benefits of working with a spectrogram of the audio, it is often important to “invert” the spectral representation back into a time domain signal which is consistent with a new time-frequency representation we obtain after processing. [sent-17, score-0.938]
14 For example, we may mask out certain cells in the spectrogram after determining that they represent energy from noise signals, or we may drop columns of the spectrogram to modify the timescale. [sent-18, score-1.251]
15 How do we recover the denoised or sped up speech signal? [sent-19, score-0.379]
16 In this paper we study this inversion and present an efficient algorithm for recovering signals from their overlapping short-time spectral magnitudes using maximum a posteriori inference in a simple probability model. [sent-20, score-0.349]
17 This is essentially a problem of phase recovery, although with the important constraint that overlapping analysis windows must agree with each other about their estimates of the underlying waveform. [sent-21, score-0.239]
18 The standard approach, exemplified by the classic paper of Griffin and Lim [1], is to alternate between estimating the time domain signal given a current estimate of the phase and the observed spectrogram, and estimating the phase given the hypothesized signal and the observed spectrogram. [sent-22, score-0.774]
19 Unfortunately, at any iteration, this technique maintains inconsistent estimates of the signal and the phase. [sent-23, score-0.327]
20 Our algorithm maximizes the a posteriori probability of the estimated speech signal by adjusting the estimated signal samples directly, thus avoiding inconsistent phase estimates. [sent-24, score-1.122]
21 At each step of iterative optimization, the method is guaranteed to reduce the discrepancy between the observed spectrogram and the spectrogram of the estimated waveform. [sent-25, score-1.259]
22 Further, by jointly optimizing all samples simultaneously, the method can make global changes in the waveform, so as to better match all short-time spectral magnitudes. [sent-26, score-0.206]
23 2 A Generative Model of Speech Signals and Spectrograms. An advantage of viewing phase recovery as a problem of probabilistic inference of the speech signal is that a prior distribution over time-domain speech signals can be used to improve performance. [sent-27, score-1.277]
24 For example, if the identity of the speaker that produced the spectrogram is known, a speaker-specific speech model can be used to obtain a higher-quality reconstruction of the time-domain signal. [sent-28, score-1.099]
25 However, it is important to point out that when prior knowledge of the speaker is not available, our technique works well using a uniform prior. [sent-29, score-0.137]
26 For a time-domain signal with N samples, let s be a column vector containing samples $s_1, \ldots, s_N$. [sent-30, score-0.263]
27 We define the spectrogram of a signal as the magnitude of its windowed short-time Fourier transform. [sent-34, score-0.846]
28 Let $M = \{m_1, m_2, \ldots\}$ denote the spectrogram of s; $m_k$ is the magnitude spectrum of the kth window and $m_k^f$ is the magnitude of the $f$th frequency component. [sent-39, score-1.029]
29 Further, let n be the width of the window used to obtain the short-time transform. [sent-40, score-0.078]
30 We assume the windows are spaced at intervals of n/2, although this assumption is easy to relax. [sent-41, score-0.067]
31 As shown in Fig. 1, a particular time-domain sample $s_t$ contributes to exactly two windows in the spectrogram. [sent-43, score-0.067]
32 The joint distribution over the speech signal s and the spectrogram M is $P(s, M) = P(s)\,P(M|s)$. (1) [sent-44, score-1.2]
33 We use an Rth-order autoregressive model for the prior distribution over time-domain speech signals: $P(s) \propto \exp\big(-\tfrac{1}{2\rho^2} \sum_{t=1}^{N} \big(\sum_{r=1}^{R} a_r s_{t-r} - s_t\big)^2\big)$. (2) [sent-45, score-0.85]
34 The autoregressive model can be estimated beforehand, using training data for a specific speaker or a general class of speakers. [sent-47, score-0.237]
35 Although this model is overly simple for general speech signals, it is useful for avoiding discontinuities introduced at window boundaries by mis-matched phase components in neighboring frames. [sent-48, score-0.607]
36 To avoid artifacts at frame boundaries, the variance of the prior can be set to low values at frame boundaries, enabling the prior to “pave over” the artifacts. [sent-49, score-0.164]
37 Note that the magnitude spectra are independent given the time domain signal. [sent-51, score-0.183]
38 The likelihood in (3) favors configurations of s that match the observed spectrogram, while the prior in (2) places more weight on configurations that match the autoregressive model. [sent-52, score-0.255]
39 2.1 Making the speech signal explicit in the model. We can simplify the functional form $\hat{m}_k(s)$ by introducing the n×n Fourier transform matrix, F. [sent-54, score-0.749]
40 Let sk be an n-vector containing the samples from the kth window. [sent-55, score-0.147]
41 Using the fact that the squared magnitude of a complex number c is $cc^{*}$, where $^{*}$ denotes complex conjugation, we have $\hat{m}_k(s) = (F s_k) \circ (F s_k)^{*} = (F s_k) \circ (F^{*} s_k)$, where $\circ$ indicates element-wise product. [sent-56, score-0.237]
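A quick numerical check of this identity: for a real window s_k, the element-wise product (F s_k) ◦ (F* s_k) is real and equals the squared FFT magnitudes. The snippet below is only an illustrative sanity check with an arbitrary window size.

```python
import numpy as np

n = 8
F = np.fft.fft(np.eye(n))            # n x n Fourier transform matrix (symmetric)
s_k = np.random.randn(n)             # one real-valued analysis window

m_hat = (F @ s_k) * (F.conj() @ s_k)                            # (F s_k) o (F* s_k)
assert np.allclose(m_hat.imag, 0)                               # product is real
assert np.allclose(m_hat.real, np.abs(np.fft.fft(s_k)) ** 2)    # equals |FFT|^2
```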
42 The joint distribution in (1) can now be written $P(s, M) \propto \exp\big(-\tfrac{1}{2\sigma^2} \sum_k \|(F s_k)\circ(F^{*} s_k) - m_k\|^2\big)\, \exp\big(-\tfrac{1}{2\rho^2} \sum_t \big(\sum_{r=1}^{R} a_r s_{t-r} - s_t\big)^2\big)$. (4) [sent-57, score-0.284]
43 For clarity, we have used a 3rd order autoregressive model and a window length of 4. [sent-60, score-0.211]
44 In this graphical model, function nodes are represented by black disks and each function node corresponds to a term in the joint distribution. [sent-61, score-0.056]
45 There is one function node connecting each observed short-time energy spectrum to the set of n time-domain samples from which it was possibly derived, and one function node connecting each time-domain sample to its R predecessors in the autoregressive model. [sent-62, score-0.313]
46 Taking the logarithm of the joint distribution in (4) and expanding the norm, we obtain $\log P(s, M) \propto -\tfrac{1}{2\sigma^2} \sum_k \sum_i \big(\sum_{j=1}^{n} \sum_{l=1}^{n} F_{ij} F^{*}_{il}\, s_{nk/2-n/2+j}\, s_{nk/2-n/2+l} - m_k^i\big)^2 - \tfrac{1}{2\rho^2} \sum_t \big(\sum_{r=1}^{R} a_r s_{t-r} - s_t\big)^2$. (5) [sent-63, score-0.436]
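The two terms of (5) translate directly into a spectrogram-mismatch penalty plus an AR penalty. The sketch below is a hypothetical transcription of that objective (rectangular windows, our own choice of variable names and of how σ² and ρ² enter), not the authors' implementation.

```python
import numpy as np

def neg_log_joint(s, M, a, sigma2=1.0, rho2=1.0):
    """-log P(s, M) up to a constant: spectrogram mismatch plus AR prior penalty."""
    num_windows, n = M.shape
    hop = n // 2
    # Spectrogram term: sum over windows of || |FFT(s_k)|^2 - m_k ||^2
    spec_err = 0.0
    for k in range(num_windows):
        frame = s[k * hop : k * hop + n]
        spec_err += np.sum((np.abs(np.fft.fft(frame)) ** 2 - M[k]) ** 2)
    # AR prior term: sum over t of (sum_r a_r s_{t-r} - s_t)^2
    R = len(a)
    pred = np.zeros_like(s)
    for r in range(1, R + 1):
        pred[r:] += a[r - 1] * s[:-r]
    ar_err = np.sum((pred[R:] - s[R:]) ** 2)
    return spec_err / (2.0 * sigma2) + ar_err / (2.0 * rho2)
```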
47 Figure 2: Factor graph for the model in (4) using a 3rd order autoregressive model, a window length of 4, and an overlap of 2 samples. [sent-64, score-0.211]
48 The log-probability is quartic in the unknown speech samples, $s_1, \ldots, s_N$. [sent-66, score-0.451]
49 For simplicity of presentation above, we implicitly assumed a rectangular window for computing the spectrogram. [sent-70, score-0.078]
50 3 Inference Algorithms. The goal of probabilistic inference is to compute the posterior distribution over speech waveforms and output a typical sample or a mode of the posterior as an estimate of the reconstructed speech signal. [sent-73, score-0.895]
51 To find a mode of the posterior, we have explored the use of iterative conditional modes (ICM) [8], Markov chain Monte Carlo methods [9], variational techniques [10], and direct application of numerical optimization methods for finding the maximum a posteriori speech signal. [sent-74, score-0.494]
52 This technique is guaranteed to increase the joint probability of the speech waveform and the observed spectrum, at each step. [sent-77, score-0.505]
53 At every stage we set $s_t$ to its most probable value, given the other speech samples and the observed spectrogram: $s_t^{*} = \arg\max_{s_t} P(s_t \mid M, s \setminus s_t) = \arg\max_{s_t} P(s, M)$. [sent-78, score-0.797]
54 This value can be found by extracting the terms in (5) that depend on $s_t$ and optimizing the resulting quartic equation with complex coefficients. [sent-79, score-0.201]
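As an illustration of the ICM sweep, the sketch below updates each sample in turn by a one-dimensional numerical minimization of the objective, reusing the hypothetical neg_log_joint above; the paper instead solves the resulting quartic in closed form, so this is only a stand-in for that update.

```python
from scipy.optimize import minimize_scalar

def icm_sweep(s, M, a, sigma2=1.0, rho2=1.0, step=1.0):
    """One ICM pass: set each sample to (approximately) its most probable value
    given all other samples and the observed spectrogram."""
    s = s.copy()
    for t in range(len(s)):
        def objective(x, t=t):
            s[t] = x                                    # try a candidate value for s_t
            return neg_log_joint(s, M, a, sigma2, rho2)
        res = minimize_scalar(objective, bounds=(s[t] - step, s[t] + step),
                              method="bounded")
        s[t] = res.x
    return s
```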
55 To select an initial configuration of s, we applied an inverse Fourier transform to the observed magnitude spectra M, assuming a random phase. [sent-80, score-0.167]
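A minimal sketch of this random-phase initialization, assuming the spectrogram rows hold energy spectra as in the earlier sketches: attach a random phase to each window, inverse-FFT, and overlap-add. The overlap-add normalization is our own simplification.

```python
import numpy as np

def init_random_phase(M, seed=0):
    """Initial waveform from observed spectra M (one window per row) with random phase."""
    rng = np.random.default_rng(seed)
    num_windows, n = M.shape
    hop = n // 2
    s0 = np.zeros(hop * (num_windows - 1) + n)
    weight = np.zeros_like(s0)
    for k in range(num_windows):
        phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))
        frame = np.fft.ifft(np.sqrt(M[k]) * phase).real   # M holds energy spectra here
        s0[k * hop : k * hop + n] += frame                # overlap-add the frames
        weight[k * hop : k * hop + n] += 1.0
    return s0 / weight
```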
56 We also implemented an inference algorithm that directly searches for a maximum of log P(s, M) w.r.t. s. [sent-82, score-0.093]
57 The same derivatives used to find the ICM updates were used in a conjugate gradient optimizer, which is capable of finding search directions in the vector space s, and jointly adjusting all speech samples simultaneously. [sent-86, score-0.578]
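The joint optimization could be sketched with an off-the-shelf nonlinear conjugate-gradient routine, as below; for brevity this hypothetical version relies on SciPy's numerical gradients (far too slow for real signals), whereas the paper uses analytic derivatives, and map_estimate and the reused helpers are our own names.

```python
from scipy.optimize import minimize

def map_estimate(M, a, sigma2=1.0, rho2=1.0, seed=0):
    """MAP speech signal: jointly adjust all samples to minimize -log P(s, M)."""
    s0 = init_random_phase(M, seed)                        # sketch from above
    result = minimize(
        lambda s: neg_log_joint(s, M, a, sigma2, rho2),    # sketch from above
        s0,
        method="CG",            # nonlinear conjugate gradient over the whole vector s
        options={"maxiter": 200},
    )                           # no analytic jac supplied, so SciPy falls back on
    return result.x             # slow finite-difference gradients (illustration only)
```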
58 Figure 3: Reconstruction results for an utterance from the WSJ database. [sent-87, score-0.076]
59 The spectrogram of the reconstruction fails to capture the finer details in the original signal. [sent-90, score-0.647]
60 The spectrogram captures most of the fine details in the original signal. [sent-92, score-0.582]
61 We initialized the conjugate gradient optimizer using the same procedure as described above for ICM. [sent-93, score-0.165]
62 For all experiments we used a (Hamming) window of length 256 with an overlap of 128 samples. [sent-95, score-0.078]
63 Where possible, we trained a 12th order AR model of the speaker using an utterance different from the one used to create the spectrogram. [sent-96, score-0.149]
64 For convergence to a good local minimum, it is important to down-weight the contribution of the AR model for the first several iterations of conjugate gradient optimization. [sent-97, score-0.089]
65 This way, the AR model operates on the signal with very little error in the estimated spectrogram. [sent-99, score-0.267]
66 Along the frame boundaries, the variance of the prior (AR model) was set to a small value to smooth down spikes that are not very probable a priori. [sent-100, score-0.11]
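One way such down-weighting and boundary damping might be realized, purely as an illustration with made-up schedules and values, is a per-iteration weight on the AR term and a per-sample prior variance vector:

```python
import numpy as np

def ar_weight(iteration, warmup=50):
    """Down-weight the AR term early on, then ramp its weight linearly up to 1."""
    return min(1.0, iteration / float(warmup))

def boundary_variance(num_samples, n=256, rho2=1.0, boundary_rho2=0.01):
    """Per-sample prior variance: small near frame boundaries to 'pave over' artifacts."""
    rho = np.full(num_samples, rho2)
    hop = n // 2
    for edge in range(0, num_samples, hop):
        lo, hi = max(0, edge - 2), min(num_samples, edge + 3)
        rho[lo:hi] = boundary_rho2
    return rho
```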
67 Further, we also tried using a cubic spline smoother along the boundaries as a post processing step for better sound quality. [sent-101, score-0.132]
68 4.1 Evaluation. The quality of sound in the estimated signal is an important factor in determining the effectiveness of the algorithm. [sent-103, score-0.365]
69 To demonstrate improvement in the perceptual quality of sound we have placed audio files on the web; for demonstrations please check, http://www. [sent-104, score-0.288]
70 Our algorithm consistently outperformed the algorithm proposed in [1] both in terms of sound quality and in matching the observed spectrogram. [sent-108, score-0.741]
71 Fig. 3 shows reconstruction results for an utterance from WSJ data. [sent-110, score-0.141]
72 The graph on the right compares the log probability under ICM to that under our algorithm. Analysis of the signal-to-noise ratio of the true and estimated signals can be used to measure the quality of the estimated signal, with high dB gain indicating good reconstruction. [sent-119, score-0.639]
73 As the input to our model does not include a phase component, we cannot measure SNR by comparing the recovered signal to any true time domain signal. [sent-120, score-0.377]
74 Instead, we define the following approximation $\mathrm{SNR}^{*} = 10 \log \frac{\sum_u \sum_w \tfrac{1}{E_u} \sum_f |s_{u,w}(f)|^2}{\sum_u \sum_w \sum_f \big(\tfrac{1}{E_u} |\hat{s}_{u,w}(f)| - \tfrac{1}{E_u} |s_{u,w}(f)|\big)^2}$ (6), where $E_u = \sum_t s_t^2$ is the total energy in utterance u. [sent-121, score-0.172]
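A literal implementation of our reading of (6) is sketched below; since the original typesetting of the formula is partly garbled, the exact placement of the per-utterance energy normalization is an assumption, and s_list / s_hat_list are hypothetical lists of original and reconstructed utterances.

```python
import numpy as np

def snr_star(s_list, s_hat_list, n=256):
    """Approximate SNR (dB) between spectrograms of original and estimated utterances."""
    num, den = 0.0, 0.0
    hop = n // 2
    for s, s_hat in zip(s_list, s_hat_list):
        E_u = np.sum(s ** 2)                     # total energy of utterance u
        for k in range((len(s) - n) // hop + 1):
            S = np.abs(np.fft.fft(s[k * hop : k * hop + n]))
            S_hat = np.abs(np.fft.fft(s_hat[k * hop : k * hop + n]))
            num += np.sum(S ** 2) / E_u
            den += np.sum((S_hat / E_u - S / E_u) ** 2)
    return 10.0 * np.log10(num / den)
```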
75 Summations over u, w and f are over all utterances, windows and frequencies, respectively. [sent-122, score-0.047]
76 Fig. 4 reports the dB gain averaged over several utterances for [1] and for our algorithm with and without an AR model. [sent-124, score-0.072]
77 Moving the summation over w in (6) outside the log produces similar quality estimates. [sent-126, score-0.096]
78 4.2 Time Scale Modification. As an example to show the potential utility of spectrogram inversion, we investigated an extremely simple approach to time scale modification of speech signals. [sent-128, score-0.989]
79 Starting from the original signal we form the spectrogram (or else we may start with the spectrogram directly), and upsample or downsample it along the time axis. [sent-129, score-1.499]
80 (For example, to speed up the speech by a factor of two we can discard every second column of the spectrogram.) [sent-130, score-0.379]
81 In spite of the fact that this approach does not use any phase information from the original signal, it produces results with good perceptual sound quality. [sent-131, score-0.211]
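A minimal sketch of this speed-up, reusing the hypothetical map_estimate from the earlier sketches (integer factors only; a real system would resample the spectrogram along time more carefully):

```python
def speed_up(M, a, factor=2):
    """Keep every `factor`-th spectrogram column (factor=2 discards every second one),
    then invert the shortened spectrogram back to a waveform."""
    M_fast = M[::factor]
    return map_estimate(M_fast, a)
```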
82 (Audio demonstrations are available on the web site given earlier.) [sent-132, score-0.072]
83 5 Variational Inference. The framework described so far focuses on obtaining fixed-point estimates for the time-domain signal by maximizing the joint log probability of the model in (5). [sent-133, score-0.397]
84 As exact inference of P(s|M) is intractable, we approximate it using a fully factored distribution Q(s), where $Q(s) = \prod_i q_i(s_i)$. (7) Here we assume $q_i(s_i) \sim \mathcal{N}(\mu_i, \eta_i)$. [sent-135, score-0.218]
85 The goal of variational approximation is to infer the parameters {µi , ηi }, ∀i by minimizing the KL divergence between the approximating Q distribution and the true posterior P (s|M). [sent-136, score-0.068]
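One crude way to fit the factored Gaussian, shown below purely as an illustration, is to minimize a Monte Carlo estimate of E_Q[−log P(s, M)] minus the entropy of Q, which equals the KL divergence up to an additive constant; the paper instead derives the closed-form expression in (9), and the helper neg_log_joint is the hypothetical sketch from above.

```python
import numpy as np

def kl_objective(mu, log_eta, M, a, num_samples=8, seed=0):
    """Monte Carlo estimate of KL(Q || P(.|M)) up to an additive constant."""
    rng = np.random.default_rng(seed)
    eta = np.exp(log_eta)                          # per-sample variances of the q_i
    entropy = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * eta))
    expected_energy = 0.0
    for _ in range(num_samples):
        s = mu + np.sqrt(eta) * rng.standard_normal(mu.shape)   # draw s ~ Q
        expected_energy += neg_log_joint(s, M, a) / num_samples
    return expected_energy - entropy
```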
86 Simplifying and rearranging terms we get $D = -\sum_i H(q_i) + \sum_k \sum_i \Big(\eta_i^2\, G_i(\mu, \eta) + \big(\sum_{j=1}^{n} \sum_{l=1}^{n} F_{ij} F^{*}_{il}\, \mu_{nk/2-n/2+j}\, \mu_{nk/2-n/2+l} - m_k^i\big)^2\Big)$. (9) $G_i(\mu, \eta)$ accounts for uncertainty in s. [sent-138, score-0.057]
87 6 Conclusion. In this paper, we have introduced a simple probabilistic model of noisy spectrograms in which the samples of the unknown time-domain signal are represented directly as hidden variables. [sent-142, score-0.493]
88 But using a continuous gradient optimizer on these quantities, we are able to accurately estimate the full speech signal from only the short time spectral magnitudes taken in overlapping windows. [sent-143, score-0.887]
89 Our algorithm’s reconstructions are substantially better, both in terms of informal perceptual quality and measured signal to noise ratio, than the standard approach of Griffin and Lim[1]. [sent-144, score-0.351]
90 Furthermore, in our setting, it is easy to incorporate an a priori model of gross speech structure in the form of an AR model, whose influence on the reconstruction is user-tunable. [sent-145, score-0.444]
91 Spectrogram inversion has many potential applications; as an example we have demonstrated an extremely simple but nonetheless effective time scale modification algorithm which subsamples the spectrogram of the original utterance and then inverts it. [sent-146, score-0.718]
92 In addition to improved experimental results, our approach highlights two important lessons from the point of view of statistical signal processing algorithms. [sent-147, score-0.211]
93 In other words, practitioners often try to solve for as many variables as possible conditioned on quantities that have just been updated. [sent-151, score-0.05]
94 Our experience in this model has shown that direct continuous optimization using gradient techniques allows all quantities to adjust simultaneously and ultimately finds far superior solutions. [sent-152, score-0.085]
95 Because of its probabilistic nature, our model can easily be extended to include other pieces of prior information, or to deal with missing or noisy spectrogram frames. [sent-153, score-0.639]
96 This opens the door to unified phase recovery and denoising algorithms, and to the possibility of performing sophisticated speech separation or denoising inside the pipeline of a standard speech recognition system, in which typically only short time spectral magnitudes are available. [sent-154, score-1.206]
97 Acknowledgments We thank Carl Rasmussen for his conjugate gradient optimizer. [sent-155, score-0.089]
98 Griffin, D. and Lim, J. S. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 1984, 32/2. [2] Kschischang, F. [sent-161, score-0.054]
99 Nelson Removal of noise from speech using the dual EKF algorithm in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, May, 1998 [8] Besag, J On the statistical analysis of dirty pictures Journal of the Royal Statistical Society B vol. [sent-194, score-0.413]
100 Saul An introduction to variational methods for graphical models Learning in Graphical Models, edited by M. [sent-203, score-0.066]
wordName wordTfidf (topN-words)
[('spectrogram', 0.582), ('speech', 0.379), ('icm', 0.239), ('signal', 0.211), ('ar', 0.205), ('mk', 0.133), ('autoregressive', 0.133), ('spectrograms', 0.125), ('st', 0.103), ('fsk', 0.096), ('grif', 0.088), ('phase', 0.088), ('qi', 0.084), ('window', 0.078), ('optimizer', 0.076), ('utterance', 0.076), ('speaker', 0.073), ('sn', 0.072), ('quartic', 0.072), ('sound', 0.07), ('audio', 0.07), ('denoising', 0.07), ('windows', 0.067), ('spectral', 0.067), ('reconstruction', 0.065), ('eu', 0.063), ('str', 0.062), ('boundaries', 0.062), ('signals', 0.06), ('mki', 0.057), ('conjugate', 0.054), ('quality', 0.053), ('energy', 0.053), ('recovery', 0.053), ('magnitude', 0.053), ('perceptual', 0.053), ('db', 0.052), ('spectra', 0.052), ('frame', 0.052), ('samples', 0.052), ('sk', 0.051), ('domain', 0.05), ('inference', 0.05), ('quantities', 0.05), ('posteriori', 0.049), ('argmaxst', 0.048), ('bjf', 0.048), ('downsample', 0.048), ('fil', 0.048), ('snk', 0.048), ('upsample', 0.048), ('wsj', 0.048), ('utterances', 0.047), ('frequency', 0.047), ('overlapping', 0.047), ('lim', 0.047), ('fourier', 0.047), ('inconsistent', 0.045), ('acoustics', 0.045), ('magnitudes', 0.044), ('kth', 0.044), ('si', 0.043), ('log', 0.043), ('demonstrations', 0.042), ('fij', 0.042), ('spectrum', 0.039), ('variational', 0.038), ('hamming', 0.038), ('estimates', 0.037), ('observed', 0.036), ('modi', 0.035), ('gradient', 0.035), ('technique', 0.034), ('noise', 0.034), ('snr', 0.033), ('indexes', 0.033), ('jointly', 0.033), ('inversion', 0.032), ('estimated', 0.031), ('prior', 0.03), ('web', 0.03), ('posterior', 0.03), ('nk', 0.029), ('iterative', 0.028), ('match', 0.028), ('joint', 0.028), ('separation', 0.028), ('time', 0.028), ('waveform', 0.028), ('probable', 0.028), ('graphical', 0.028), ('seconds', 0.027), ('probabilistic', 0.027), ('optimizing', 0.026), ('alternate', 0.026), ('transform', 0.026), ('generative', 0.026), ('adjusting', 0.025), ('gain', 0.025), ('operates', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
2 0.17769498 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals
Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita
Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1
3 0.17075066 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. 
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage. 3 B i n a ry ma s k est i mat i on The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITDi as the peak location of the cross-correlation in the range 2π / ω i centered at the target ITD, where ω i indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows: ∑ t s i2 (t ) Ri = ∑ ∑ t li2 (t ) , t s i2 (t ) + ∑ ∑ t ri2 (t ) t ni2 (t ) IIDi = 20 log10 where l i and ri refer to the left and right peripheral output of the ith channel, respectively, s i refers to the output for the target signal, and ni that for the acoustic interference. In computing IIDi , we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit. B 1 1 R R A theoretical empirical 0 -1 theoretical empirical 0 -15 1 ITD (ms) 15 IID (dB) Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. 
The solid curve shows the theoretical mean and the dash curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz. A B 1 C 10 1 IID s) 0.5 0 -10 IID (d B) 10 ) (dB R R 0 -0.5 m ITD ( -10 -0.5 m ITD ( s) 0.5 Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space. By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H 1 : target is dominant or Ri > 0.5 , and the second one is H 2 : interference is dominant or Ri < 0.5 . Based on the estimates of the bivariate densities p( x | H 1 ) and p( x | H 2 ) the classification is done by the maximum a posteriori decision rule: p( H 1 ) p( x | H 1 ) > p( H 2 ) p( x | H 2 ) . There exist a plethora of techniques for probability density estimation ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares crossvalidation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED and have found no benefit for using IED in our system after incorporating ITD and IID [14]. 4 Pe rfo rmanc e an d c omp arison The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. 
To quantitatively assess system performance, we measure the SNR using the original target speech as signal: ∑ t 2 s o (t ) ∑ SNR = 10 log 10 (s o (t ) − s e (t ))2 t where s o (t ) represents the resynthesized original speech and s e (t ) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with s N (t ) , the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden’s cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. Shown in Fig. 5, our model shows a considerable improvement over the Bodden system, producing a 3.5 dB average improvement. A B C 20 20 10 10 10 0 0 0 -10 SNR (dB) 20 -10 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°. 20 0 SNR (dB) SNR (dB) 5 -5 -10 -15 -20 10 0 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 4. 
Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model. N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture. For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame-rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. A 100 B 100 Correctness (%) Correctness (%) Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition. 80 60 40 20 5 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB 80 60 40 20 5 dB −10 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB −10 dB Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case. Finally we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically. 80 80 Keyword score (%) B100 Keyword score (%) A 100 60 40 20 0 0 dB −5 dB −10 dB 60 40 20 0 Figure 7. 
Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°). Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear. 5 C onclu si on We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two and three source conditions. A c k n ow l e d g me n t s This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented in 2002 ICASSP. References [1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT press, 1990. [2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT press, 1997. [3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000. [4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993. [5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994. [6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002. [7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993. [8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 32183230, 2001. [9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003. [10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267285, 2001. [11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999. [12] A. 
Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000. [13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994. [14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003. [15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of HearingImpaired Children, London: Academic press, 1979.
4 0.11832587 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
5 0.11442865 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos
Abstract: Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM’s and the generative classifiers in speaker identification and image classification. 1
6 0.10652877 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
7 0.094671048 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
8 0.085288852 119 nips-2003-Local Phase Coherence and the Perception of Blur
9 0.073592722 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.073238179 193 nips-2003-Variational Linear Response
11 0.072645523 157 nips-2003-Plasticity Kernels and Temporal Statistics
12 0.068607442 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
13 0.068343826 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
14 0.06598229 115 nips-2003-Linear Dependent Dimensionality Reduction
15 0.062032394 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis
16 0.057112508 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
17 0.055574086 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
18 0.054807689 88 nips-2003-Image Reconstruction by Linear Programming
19 0.054052971 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
20 0.053931139 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
topicId topicWeight
[(0, -0.19), (1, -0.001), (2, 0.06), (3, 0.051), (4, -0.062), (5, 0.05), (6, 0.203), (7, -0.062), (8, -0.103), (9, 0.154), (10, 0.169), (11, -0.118), (12, -0.069), (13, -0.05), (14, -0.11), (15, -0.209), (16, 0.156), (17, -0.08), (18, -0.039), (19, -0.019), (20, -0.104), (21, 0.046), (22, 0.074), (23, -0.0), (24, -0.031), (25, 0.048), (26, 0.034), (27, 0.036), (28, -0.025), (29, 0.005), (30, 0.032), (31, 0.079), (32, -0.027), (33, -0.082), (34, 0.051), (35, 0.011), (36, -0.034), (37, 0.054), (38, -0.111), (39, 0.089), (40, 0.05), (41, 0.039), (42, 0.002), (43, 0.004), (44, 0.043), (45, -0.085), (46, -0.016), (47, 0.005), (48, 0.126), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.96095103 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
2 0.86363435 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals
Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita
Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1
3 0.65041023 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin
Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%. 1 In trod u ction A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one: 1 20 AI = (1) ∑ TI i , 20 i =1 th where TIi (Transmission Index i) is the normalized intelligibility in the i band. The TI per band is a function of the signal to noise ratio or: (2) SNRi + 12 30 for SNRs between –12 dB and 18 dB. A SNR of greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under –12 dB means that a band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal. TI i = Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different number of bands - 5,6,15 or the original 20 - as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities. 
The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good. However, the STI does not accurately model intraband masker non-linearities, phase distortions or the underlying auditory mechanisms (outside of independent frequency bands) We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neuronal Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (control case) is compared to the neural spiking over time for the same signal after undergoing some distortion (test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. Then the NAI is calculated with a weighted average of NDs at different Best Frequencies (BFs). In general detection theory terms, the control neuronal response sets some locus in a high dimensional space, then the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above it is predicted that a measure of the neural coding error will be a better predictor than SNR for human intelligibility word-scores. Our method also has the potential to shed light on the underlying neurobiological mechanisms. 2 2.1 Meth o d Model The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1. Figure 1 Block diagram of the computational model of the auditory periphery from the middle ear to the Auditory Nerve. Reprinted from Fig. 1 of [7] with permission from the Acoustical Society of America © (2003). The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of the cat auditory periphery function. The first section models middle ear filtering. The second section, labeled the “control path,” captures the Outer Hair Cells (OHC) modulatory function, and includes a wideband, nonlinear, time varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the “signal path”, describes the filter properties and traveling wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, CIHC and COHC are scaling constants that control IHC and OHC status, respectively. 
The parameters of the synapse section of the model are set to produce adaptation and discharge-rate versus level behavior appropriate for a high-spontaneous-rate/low-threshold auditory nerve fiber. In order to avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness.

2.2 Neural articulation index

These results emulate most of the simulations described in Chapter 2 of Steeneken's thesis [8], as it describes the full development of an STI metric from inception to end. For those interested, the following simulations try to map most of the second chapter, but instead of basing the distortion metric on an SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in Section 3.1, deals with applying a frequency weighting structure to combine the band distortion values, while Section 3.2 also introduces redundancy factors. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this is:

NAI = α_1·NTI_1 + α_2·NTI_2 + ... + α_7·NTI_7,   (3)

where α_i is the i-th band's contribution and NTI_i is the Neural Transmission Index in the i-th band. Here all the αs sum to one, so each α factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI is between [0,1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate:

ND = 1 − (Test · Controlᵀ) / (Control · Controlᵀ),   (4)

where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady-state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI of equations (1)-(2), after normalizing each of the seven bands to zero mean and unit standard deviation. The NTI in the i-th band was calculated as

NTI_i = m·(ND_i − μ_i)/σ_i + b.   (5)

NTI_i is then thresholded to be no less than 0 and no greater than 1, following the TI thresholding. In equation (5) the factors m = 2.5 and b = -1 were the best linear fit such that bands with SNR greater than 15 dB produce NTI_i of 1, bands with 7.5 dB SNR produce NTI_i of 0.75, and bands with 0 dB SNR produce NTI_i of 0.5. This closely followed the procedure outlined in Section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility. The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found; 80 total words is enough to produce meaningful frequency weighting factors. There were 26 frequency channel distortion conditions used for male speakers, 17 for female, and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups, given in Tables 1 through 4 for males; since females have negligible signal in the 125 Hz band, they used a subset, marked with an asterisk in Tables 1 through 4.
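A sketch of the per-band pipeline in equations (3)-(5): spike-rate vectors give ND, band-normalized ND gives NTI, and the weighted sum gives the NAI. The band statistics mu and sigma are assumed to have been collected beforehand, and the function names are illustrative.

```python
import numpy as np

def neural_distortion(test, control):
    """ND (equation 4): one minus the projection of the test rate vector onto the control."""
    return 1.0 - np.dot(test, control) / np.dot(control, control)

def neural_transmission_index(nd, mu, sigma, m=2.5, b=-1.0):
    """NTI (equation 5): linear fit of the band-normalized ND, thresholded to [0, 1]."""
    return float(np.clip(m * (nd - mu) / sigma + b, 0.0, 1.0))

def neural_articulation_index(nti_bands, alpha):
    """NAI (equation 3): weighted sum of per-band NTIs, with weights summing to one."""
    alpha = np.asarray(alpha, dtype=float)
    return float(np.dot(alpha / alpha.sum(), np.asarray(nti_bands, dtype=float)))
```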
Table 1: Rippled Envelope (columns are octave-band centre frequencies)
ID #   125  250  500  1K  2K  4K  8K
1*      1    1    1   1   0   0   0
2*      0    0    0   0   1   1   1
3*      1    1    0   0   0   1   1
4*      0    0    1   1   1   0   0
5*      1    1    0   0   1   1   0
6*      0    0    1   1   0   0   1
7*      1    0    1   0   1   0   1
8*      0    1    0   1   0   1   0

Table 2: Adjacent Triplets
ID #   125  250  500  1K  2K  4K  8K
9       1    1    1   0   0   0   0
10      0    1    1   1   0   0   0
11*     0    0    0   1   1   1   0

Table 3: Isolated Triplets
ID #   125  250  500  1K  2K  4K  8K
12      1    0    1   0   1   0   0
13      1    0    1   0   0   1   0
14      1    0    0   1   0   1   0
15*     0    1    0   1   0   0   1
16*     0    1    0   0   1   0   1
17      0    0    1   0   1   0   1

Table 4: Contiguous Bands
ID #   125  250  500  1K  2K  4K  8K
18*     0    1    1   1   1   0   0
19*     0    0    1   1   1   1   0
20*     0    0    0   1   1   1   1
21      1    1    1   1   1   0   0
22*     0    1    1   1   1   1   0
23*     0    0    1   1   1   1   1
24      1    1    1   1   1   1   0
25      0    1    1   1   1   1   1
26*     1    1    1   1   1   1   1

In the above tables a one represents a passband and a zero a stop band. A 1353-tap FIR filter was designed for each envelope condition. The female envelopes are a subset of these because they have no appreciable speech energy in the 125 Hz octave band. Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following equation (3) produces a value in [0,1]. To produce a word-score intelligibility prediction between zero and 100 percent, the NAI value was fit to a third-order polynomial that produced the lowest standard deviation of error from the empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third-order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8].

3 Results

3.1 Determining frequency weighting structure

For the first tests, the optimal frequency weights (the values of α_i from equation 3) were designed by minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the α_i was normalized to one. This is very similar to [5], whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI's final standard deviation of prediction error for males was 8.9%, and 7.1% for females.

Figure 2: Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best-fitting polynomial for males has an s.d. of 8.9% versus the STI [5] s.d. of 12.8%; for females the fit has an s.d. of 7.1% versus the STI [5] s.d. of 8.8%.

The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%), and the NAI factors are plotted in Figure 3.

Figure 3: Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8].

As one can see, the low-frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimulus corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high-frequency signal, they may explain this discrepancy at the cost of the low-frequency weights. [8] does state that these frequency weights are dependent upon the conditions used for evaluation.
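The dither-and-renormalize weight search of Section 3.1 might be sketched as follows; the step size, iteration count, greedy acceptance rule and the predict() mapping from NAI to word score are assumptions not specified above.

```python
import numpy as np

def fit_band_weights(nti_per_band, empirical_scores, predict, n_iter=2000, step=0.01, seed=0):
    """Greedy dither-and-renormalize search for the band weights alpha (Section 3.1).

    nti_per_band:     (n_conditions, 7) NTI values per distortion condition
    empirical_scores: (n_conditions,) measured word scores
    predict:          callable mapping NAI values to word scores (e.g. a fixed cubic fit)
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(7, 1.0 / 7.0)
    best_err = np.std(predict(nti_per_band @ alpha) - empirical_scores)
    for _ in range(n_iter):
        trial = alpha.copy()
        i = rng.integers(7)
        trial[i] = max(trial[i] + rng.choice([-step, step]), 0.0)
        trial /= trial.sum()                        # weights are renormalized to sum to one
        err = np.std(predict(nti_per_band @ trial) - empirical_scores)
        if err < best_err:                          # keep the dithered weights only if error drops
            alpha, best_err = trial, err
    return alpha, best_err
```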
3.2 Determining frequency weighting with redundancy factors

In experiment two, rather than using equation (3), which assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlated signals between bands. Equation (3) then becomes:

NAI_r = α_1·NTI_1 − β_1·NTI_1·NTI_2 + α_2·NTI_2 − β_2·NTI_2·NTI_3 + ... + α_7·NTI_7,   (6)

where the r subscript denotes a redundant NAI and β is the correlation factor. Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using equation (6). The same testing and adaptation strategy from Section 3.1 was used to find the optimal αs and βs.

Figure 4: Relation between NAI_r and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise with redundancy factors. The vertical spread from the best-fitting polynomial for males has an s.d. of 6.9% versus the STI_r [8] s.d. of 4.7%; for females the best-fitting polynomial has an s.d. of 5.4% versus the STI_r [8] s.d. of 4.0%.

The frequency weighting and redundancy factors given as optimal in Steeneken, versus those calculated by optimizing the NAI_r, are given in Figure 5.

Figure 5: Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAI_r and published in [8].

The frequency weights for the NAI_r and STI_r are more similar than in Section 3.1. The redundancy factors are very different, though. The NAI redundancy factors show no real frequency dependence, unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8].

Table 5: Standard Deviation of Prediction Error
              NAI     STI [5]   STI [8]
MALE EQ. 3    8.9%    12.8%     6.8%
FEMALE EQ. 3  7.1%    8.8%      6.0%
MALE EQ. 6    6.9%    n/a       4.7%
FEMALE EQ. 6  5.4%    n/a       4.0%

The mean difference in error between the STI_r, as given in [8], and the NAI_r is 1.7%. This difference may be from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAI_r are comparable to the STI and STI_r in predicting speech intelligibility.

4 Conclusions

These results are very encouraging. The NAI provides a modest improvement over the STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics, since the NAI is much more computationally complex than the STI. The NAI's end applications are in predicting hearing-impairment intelligibility and using statistical decision theory to describe the auditory system's feature extractors, tasks which the STI cannot do but which are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing-impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13]. By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model, one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response, and provide feedback for preprocessing via a hearing aid before the impairment.
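Equation (6) is a small change to the weighted sum of equation (3); a sketch with illustrative argument shapes follows.

```python
import numpy as np

def nai_with_redundancy(nti, alpha, beta):
    """NAI_r (equation 6): weighted NTIs minus adjacent-band redundancy terms.

    nti:   length-7 array of per-band NTIs
    alpha: length-7 band weights
    beta:  length-6 redundancy factors, one per adjacent band pair
    """
    nti = np.asarray(nti, dtype=float)
    return float(np.dot(alpha, nti) - np.dot(beta, nti[:-1] * nti[1:]))
```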
The NAI may also give insight into how the ear codes stimuli for the very robust human auditory system.

References

[1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119.
[2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:1698-1702.
[3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697.
[4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73.
[5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67(1):318-326.
[6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York.
[7] Bruce, I.C., Sachs, M.B., Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113(1):369-388.
[8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. Dissertation, University of Amsterdam.
[9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch "open source" speech database. Eurospeech 2001 Poster http://145.18.230.99/corpus/index.html
[10] Fletcher, H., & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151.
[11] Houtgast, T., & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova:285-288.
[12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 110(1):529-542.
[13] Sachs, M.B., Bruce, I.C., Miller, R.L., & Young, E. D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157-168.
4 0.63064051 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
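The ideal time-frequency binary mask described above has a direct expression in code; this sketch assumes access to the premixed target and interference energies in each T-F unit, which is only possible when the mixtures are constructed synthetically.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """Ideal binary mask: 1 where the target energy exceeds the interference energy
    in a time-frequency unit. Both inputs are (channels, frames) arrays of local
    energies computed from the premixed signals."""
    return (target_energy > interference_energy).astype(np.float32)
```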
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. 
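A minimal sketch of the per-channel normalized cross-correlation used for ITD estimation; the equal-length window assumption, the lag-sign convention and the small stabilizing constant are ours, and the skeleton-cross-correlogram sharpening step is omitted.

```python
import numpy as np

def normalized_cross_correlation(left, right, fs=44100, max_lag_ms=1.0):
    """Normalized cross-correlation of one channel's left/right peripheral outputs
    over lags of +/- 1 ms; the peak lag is the ITD estimate for that frame.

    left, right: equal-length arrays covering one 20-ms integration window.
    """
    max_lag = int(fs * max_lag_ms / 1000.0)
    full = np.correlate(left, right, mode="full")        # lags -(N-1) .. (N-1)
    mid = len(left) - 1                                   # index of zero lag
    corr = full[mid - max_lag: mid + max_lag + 1]
    corr = corr / (np.sqrt(np.sum(left**2) * np.sum(right**2)) + 1e-12)
    lags_ms = np.arange(-max_lag, max_lag + 1) * 1000.0 / fs
    return lags_ms, corr

# For each channel and frame, the ITD estimate is lags_ms[np.argmax(corr)].
```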
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage.

3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus into two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITD_i as the peak location of the cross-correlation in the range 2π/ω_i centered at the target ITD, where ω_i indicates the center frequency of the i-th channel. On the other hand, IID and R are computed as follows:

R_i = Σ_t s_i²(t) / (Σ_t s_i²(t) + Σ_t n_i²(t)),

IID_i = 20 log10( Σ_t l_i²(t) / Σ_t r_i²(t) ),

where l_i and r_i refer to the left and right peripheral output of the i-th channel, respectively, s_i refers to the output for the target signal, and n_i that for the acoustic interference. In computing IID_i, we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibit a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0°, the target location. However, the data in Fig. 2 have a low probability for this case and show instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit.

Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.
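During training, R and IID are computed per channel and frame from the premixed signals and the binaural mixture, respectively; a sketch follows (the epsilon guards against division by zero are an assumption).

```python
import numpy as np

def relative_strength_and_iid(s, n, l, r):
    """R and IID for one channel and one 20-ms frame.

    s, n: peripheral outputs of the premixed target and interference (training only)
    l, r: left and right peripheral outputs of the binaural mixture
    """
    R = np.sum(s**2) / (np.sum(s**2) + np.sum(n**2) + 1e-12)
    IID = 20.0 * np.log10((np.sum(l**2) + 1e-12) / (np.sum(r**2) + 1e-12))
    return R, IID
```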
Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the i-th channel, we test the following two hypotheses. The first one is H1: target is dominant, or R_i > 0.5; the second one is H2: interference is dominant, or R_i < 0.5. Based on the estimates of the bivariate densities p(x|H1) and p(x|H2), the classification is done by the maximum a posteriori decision rule: choose H1 if p(H1)p(x|H1) > p(H2)p(x|H2). There exists a plethora of techniques for probability density estimation, ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares cross-validation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues, ITD, IID and IED, and have found no benefit for using IED in our system after incorporating ITD and IID [14].
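A sketch of the per-channel MAP classifier in the joint ITD-IID space, using SciPy's Gaussian kernel density estimator; note that this uses Scott's-rule bandwidths rather than the least-squares cross-validation mentioned above, and assumes equal priors by default.

```python
import numpy as np
from scipy.stats import gaussian_kde

def train_channel_classifier(features, R, prior_target=0.5):
    """Per-channel MAP classifier in the joint ITD-IID feature space.

    features: (2, n_samples) rows are [ITD; IID] for one frequency channel
    R:        (n_samples,) relative strength; R > 0.5 labels the target-dominant class
    """
    p1 = gaussian_kde(features[:, R > 0.5])    # p(x | H1), target dominant
    p2 = gaussian_kde(features[:, R <= 0.5])   # p(x | H2), interference dominant

    def classify(x):
        # x: (2, n_units) ITD-IID features of new T-F units; returns the estimated binary mask
        return prior_target * p1(x) > (1.0 - prior_target) * p2(x)

    return classify
```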
4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise ratio (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

SNR = 10 log10( Σ_t s_o(t)² / Σ_t (s_o(t) − s_e(t))² ),

where s_o(t) represents the resynthesized original speech and s_e(t) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with Σ_t s_N(t)², where s_N(t) is the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with the target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden's cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition, and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model produces a considerable improvement over the Bodden system, 3.5 dB on average.

Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: 'cocktail party'; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.

Figure 4.
Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model. N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture. For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame-rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. A 100 B 100 Correctness (%) Correctness (%) Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition. 80 60 40 20 5 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB 80 60 40 20 5 dB −10 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB −10 dB Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case. Finally we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically. 80 80 Keyword score (%) B100 Keyword score (%) A 100 60 40 20 0 0 dB −5 dB −10 dB 60 40 20 0 Figure 7. 
Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°). Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear. 5 C onclu si on We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two and three source conditions. A c k n ow l e d g me n t s This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented in 2002 ICASSP. References [1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT press, 1990. [2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT press, 1997. [3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000. [4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993. [5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994. [6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002. [7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993. [8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 32183230, 2001. [9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003. [10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267285, 2001. [11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999. [12] A. 
Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000. [13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994. [14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003. [15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.
5 0.55002135 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
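As a loose illustration of the term-frequency idea in this abstract, the sketch below builds bag-of-phone-bigram vectors and trains a plain linear SVM; the phone strings are invented, and the linear kernel is a stand-in for the likelihood-ratio-derived kernel the authors describe.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented phone sequences (space-separated phone labels) for a target speaker and impostors;
# in practice these would come from a phone recognizer run over conversational speech.
target_seqs = ["sil dh ax k ae t sil", "sil ay s iy dh ax d ao g sil"]
impostor_seqs = ["sil hh eh l ow w er l d sil", "sil g uh d m ao r n ih ng sil"]

# Term-frequency vectors over phone bigrams (an illustrative feature choice).
vec = CountVectorizer(token_pattern=r"(?u)\S+", ngram_range=(2, 2))
X = vec.fit_transform(target_seqs + impostor_seqs)
y = np.array([1] * len(target_seqs) + [0] * len(impostor_seqs))

clf = LinearSVC(C=1.0).fit(X, y)
print(clf.decision_function(vec.transform(["sil dh ax d ao g sil"])))
```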
6 0.46162909 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel
7 0.46124816 157 nips-2003-Plasticity Kernels and Temporal Statistics
8 0.4399333 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
9 0.43002185 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis
10 0.38902763 119 nips-2003-Local Phase Coherence and the Perception of Blur
11 0.37830743 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
12 0.35776266 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
13 0.34796265 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
14 0.28965983 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
15 0.28810993 123 nips-2003-Markov Models for Automated ECG Interval Analysis
16 0.28627095 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
17 0.28301737 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
18 0.27576706 115 nips-2003-Linear Dependent Dimensionality Reduction
19 0.2657795 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
20 0.26229024 107 nips-2003-Learning Spectral Clustering
topicId topicWeight
[(0, 0.037), (11, 0.023), (29, 0.03), (30, 0.087), (35, 0.056), (53, 0.106), (69, 0.019), (71, 0.066), (76, 0.059), (79, 0.255), (85, 0.084), (91, 0.066)]
simIndex simValue paperId paperTitle
1 0.82710874 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
Author: Yu Zhou, Steven G. Mason, Gary E. Birch
Abstract: This paper presents an energy normalization transform as a method to reduce system errors in the LF-ASD brain-computer interface. The energy normalization transform has two major benefits to the system performance. First, it can increase class separation between the active and idle EEG data. Second, it can desensitize the system to the signal amplitude variability. For four subjects in the study, the benefits resulted in the performance improvement of the LF-ASD in the range from 7.7% to 18.9%, while for the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. 1 In trod u ction In an effort to provide alternative communication channels for people who suffer from severe loss of motor function, several researchers have worked over the past two decades to develop a direct Brain-Computer Interface (BCI). Since electroencephalographic (EEG) signal has good time resolution and is non-invasive, it is commonly used for data source of a BCI. A BCI system converts the input EEG into control signals, which are then used to control devices like computers, environmental control system and neuro-prostheses. Mason and Birch [1] proposed the Low-Frequency Asynchronous Switch Design (LF-ASD) as a BCI which detected imagined voluntary movement-related potentials (IVMRPs) in spontaneous EEG. The principle signal processing components of the LF-ASD are shown in Figure 1. sIN Feature Extractor sLPF LPF Feature Classifier sFE sFC Figure 1: The original LF-ASD design. The input to the low-pass filter (LPF), denoted as SIN in Figure 1, are six bipolar EEG signals recorded from F1-FC1, Fz-FCz, F2-FC2, FC1-C1, FCz-Cz and FC2-C2 sampled at 128 Hz. The cutoff frequency of the LPF implemented by Mason and Birch was 4 Hz. The Feature Extractor of the LF-ASD extracts custom features related to IVMRPs. The Feature Classifier implements a one-nearest-neighbor (1NN) classifier, which determines if the input signals are related to a user state of voluntary movement or passive (idle) observation. The LF-ASD was able to achieve True Positive (TP) values in the range of 44%-81%, with the corresponding False Positive (FP) values around 1% [1]. Although encouraging, the current error rates of the LF-ASD are insufficient for real-world applications. This paper proposes a method to improve the system performance. 2 Design and Rationale The improved design of the LF-ASD with the Energy Normalization Transform (ENT) is provided in Figure 2. SIN ENT SN SNLPF LPF Feature Extractor SNFE SNFC Feature Classifier Figure 2: The improved LF-ASD with the Energy Normalization Transform. The design of the Feature Extractor and Feature Classifier were the same as shown in Figure 1. The Energy Normalization Transform (ENT) is implemented as S N (n ) = S s=( w s= − ( N ∑ w IN S IN −1) / 2 N −1) / 2 (n ) 2 (n − s) w N where W N (normalization window size) is the only parameter in the equation. The optimal parameter value was obtained by exhaustive search for the best class separation between active and idle EEG data. The method of obtaining the active and idle EEG data is provided in Section 3.1. The idea to use energy normalization to improve the LF-ASD design was based primarily on an observation that high frequency power decreases significantly around movement. For example, Jasper and Penfield [3] and Pfurtscheller et al, [4] reported EEG power decrease in the mu (8-12 Hz) and beta rhythm (18-26 Hz) when people are involved in motor related activity. 
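A sketch of the ENT as it can be reconstructed from the definition above: each sample is divided by the root-mean-square of the signal over a centered window of W_N samples. The reflection padding at the edges and the small constant in the denominator are assumptions; W_N = 51 is the value reported as optimal for Subject 1 later in the paper.

```python
import numpy as np

def energy_normalization_transform(x, w_n=51):
    """ENT: divide each sample by the RMS of the signal in a centered window of w_n samples."""
    x = np.asarray(x, dtype=float)
    half = (w_n - 1) // 2
    padded = np.pad(x, half, mode="reflect")                       # edge handling: an assumption
    local_energy = np.convolve(padded**2, np.ones(w_n) / w_n, mode="valid")
    return x / (np.sqrt(local_energy) + 1e-12)
```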
Also Mason [5] found that the power in the frequency components greater than 4Hz decreased significantly during movement-related potential periods, while power in the frequency components less than 4Hz did not. Thus energy normalization, which would increase the low frequency power level, would strengthen the 0-4 Hz features used in the LF-ASD and hence reduce errors. In addition, as a side benefit, it can automatically adjust the mean scale of the input signal and desensitize the system to change in EEG power, which is known to vary over time [2]. Therefore, it was postulated that the addition of ENT into the improved design would have two major benefits. First, it can increase the EEG power around motor potentials, consequently increasing the class separation and feature strength. Second, it can desensitize the system to amplitude variance of the input signal. In addition, since the system components of the modified LF-ASD after the ENT were the same as in the original design, a major concern was whether or not the ENT distorted the features used by the LF-ASD. Since the features used by the LFASD are generated from the 0-4 Hz band, if the ENT does not distort the phase and magnitude spectrum in this specific band, it would not distort the features related to movement potential detection in the application. 3 3.1 Evaluation Test data Two types of EEG data were pre-recorded from five able-bodied individuals as shown in Figure 3. Active Data Type and Idle Data Type. Active Data was recorded during repeated right index finger flexions alternating with periods of no motor activity; Idle Data was recorded during extended periods of passive observation. Figure 3: Data Definition of M1, M2, Idle1 and Idle2. Observation windows centered at the time of the finger switch activations (as shown in Figure 4) were imposed in the active data to separate data related to movements from data during periods of idleness. For purpose of this study, data in the front part of the observation window was defined as M1 and data in the rear part of the window was defined as M2. Data falling out of the observation window was defined as Idle2. All the data in the Idle Data Type was defined as Idle1 for comparison with Idle2. Figure 4: Ensemble Average of EEG centered on finger activations. Figure 5: Density distribution of Idle1, Idle2, M1 and M2. It was noted, in terms of the density distribution of active and idle data, the separation between M2 and Idle2 was the largest and Idle1 and Idle2 were nearly identical (see Figure 5). For the study, M2 and Idle2 were chosen to represent the active and idle data classes and the separation between M2 and Idle2 data was defined by the difference of means (DOM) scaled by the amplitude range of Idle2. 3.2 Optimal parameter determination The optimal combination of normalization window size, W N, and observation window size, W O was selected to be that which achieved the maximal DOM value. This was determined by exhaustive search, and discussed in Section 4.1. 3.3 Effect of ENT on the Low Pass Filter output As mentioned previously, it was postulated that the ENT had two major impacts: increasing the class separation between active and idle EEG and desensitizing the system to the signal amplitude variance. The hypothesis was evaluated by comparing characteristics of SNLPF and SLPF in Figure 1 and Figure 2. DOM was applied to measure the increased class separation. The signal with the larger DOM meant larger class separation. 
In addition, the signal with smaller standard deviation may result in a more stable feature set. 3.4 Effect of ENT on the LF-ASD output The performances of the original and improved designs were evaluated by comparing the signal characteristics of SNFC in Figure 2 to SFC in Figure 1. A Receiver Operating Characteristic Curve (ROC Curve) [6] was generated for the original and improved designs. The ROC Curve characterizes the system performance over a range of TP vs. FP values. The larger area under ROC Curve indicates better system performance. In real applications, a BCI with high-level FP rates could cause frustration for subjects. Therefore, in this work only the LF-ASD performance when the FP values are less than 1% were studied. 4 4.1 Results Optimal normalization window size (WN) The method to choose optimal WN was an exhaustive search for maximal DOM between active and idle classes. This method was possibly dependent on the observation window size (W O). However, as shown in Figure 6a, the optimal WN was found to be independent of WO. Experimentally, the W O values were selected in the range of 50-60 samples, which corresponded to largest DOM between nonnormalized active and idle data. The optimal WN was obtained by exhaustive search for the largest DOM through normalized active and idle data. The DOM vs. WN profile for Subject 1 is shown in Figure 6b. a) b) Figure 6: Optimal parameter determination for Subject 1 in Channel 1 a) DOM vs. WO; b) DOM vs. WN. When using ENT, a small W N value may cause distortion to the feature set used by the LF-ASD. Thus, the optimal W N was not selected in this range (< 40 samples). When W N is greater than 200, the ENT has lost its capability to increase class separation and the DOM curve gradually goes towards the best separation without normalization. Thus, the optimal W N should correspond to the maximal DOM value when W N is in the range from 40 to 200. In Figure 6b, the optimal WN is around 51. 4.2 Effect of ENT on the Low Pass Filter output With ENT, the standard deviation of the low frequency EEG signal decreased from around 1.90 to 1.30 over the six channels and over the five subjects. This change resulted in more stable feature sets. Thus, the ENT desensitizes the system to input signal variance. a) b) Figure 7: Density distribution of the active vs. idle class without (a) and with (b) ENT, for Subject 1 in Channel 1. As shown in Figure 7, by increasing the EEG power around motor potentials, ENT can increase class separations between active and idle EEG data. The class separation in (frontal) Channels 1-3 across all subjects increased consistently with the proposed ENT. The same was true for (midline) Channels 4-6, for all subjects except Subject 5, whose DOM in channel 5-6 decreased by 2.3% and 3.4% respectively with normalization. That is consistent with the fact that his EEG power in Channels 4-6 does not decrease. On average, across all five subjects, DOM increases with normalization to about 28.8%, 26.4%, 39.4%, 20.5%, 17.8% and 22.5% over six channels respectively. In addition, the magnitude and phase spectrums of the EEG signal before and after ENT is provided in Figure 8. The ENT has no visible distortion to the signal in the low frequency band (0-4 Hz) used by the LF-ASD. Therefore, the ENT does not distort the features used by the LF-ASD. (a) (b) Figure 8: Magnitude and phase spectrum of the EEG signal before and after ENT. 
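The DOM criterion and the exhaustive W_N search of Section 4.1 might look like the following sketch, which reuses the energy_normalization_transform sketch given earlier; the candidate range and the use of boolean masks to select M2 and Idle2 samples are assumptions.

```python
import numpy as np

def dom(active, idle):
    """Difference of means between active (M2) and idle (Idle2) values,
    scaled by the amplitude range of the idle data."""
    return (np.mean(active) - np.mean(idle)) / (np.max(idle) - np.min(idle))

def best_window_size(eeg, m2_mask, idle2_mask, candidates=range(41, 201, 2)):
    """Exhaustive search for the normalization window size W_N that maximizes DOM.
    Odd window sizes between 41 and 199 cover the 40-200 sample range discussed above."""
    scores = {}
    for w_n in candidates:
        normalized = energy_normalization_transform(eeg, w_n)
        scores[w_n] = dom(normalized[m2_mask], normalized[idle2_mask])
    return max(scores, key=scores.get), scores
```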
4.3 Effect of ENT on the LF-ASD output The two major benefits of the ENT to the low frequency EEG data result in the performance improvement of the LF-ASD. Subject 1’s ROC Curves with and without ENT is shown in Figure 9, where the ROC-Curve with ENT of optimal parameter value is above the ROC Curve without ENT. This indicates that the improved LF-ASD performs better. Table I compares the system performance with and without ENT in terms of TP with corresponding FP at 1% across all the 5 subjects. Figure 9: The ROC Curves (in the section of interest) of Subject 1 with different WN values and the corresponding ROC Curve without ENT. Table I: Performance of the LF-ASD with and without LF-ASD in terms of the True Positive rate with corresponding False Positive at 1%. Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 TP without ENT 66.1% 82.7% 79.7% 79.3% 90.5% TP with ENT 85.0% 90.4% 88.0% 87.8% 88.7% Performance Improvement 18.9% 7.7% 8.3% 8.5% -1.8% For 4 out of 5 subjects, corresponding with the FP at 1%, the improved system with ENT increased the TP value by 7.7%, 8.3%, 8.5% and 18.9% respectively. Thus, for these subjects, the range of TP with FP at 1% was improved from 66.1%-82.7% to 85.0%-90.4% with ENT. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance remained around 90% with ENT. In addition, this evaluation is conservative. Since the codebook in the Feature Classifier and the parameters in the Feature Extractor of the LF-ASD were derived from nonnormalized EEG, they work in favor of the non-normalized EEG. Therefore, if the parameters and the codebook of the modified LF-ASD are generated from the normalized EEG in the future, the modified LF-ASD may show better performance than this evaluation. 5 Conclusion The evaluation with data from five able-bodied subjects indicates that the proposed system with Energy Normalization Transform (ENT) has better performance than the original. This study has verified the original hypotheses that the improved design with ENT might have two major benefits: increased the class separation between active and idle EEG and desensitized the system performance to input amplitude variance. As a side benefit, the ENT can also make the design less sensitive to the mean input scale. In the broad band, the Energy Normalization Transform is a non-linear transform. However, it has no visible distortion to the signal in the 0-4 Hz band. Therefore, it does not distort the features used by the LF-ASD. For 4 out of 5 subjects, with the corresponding False Positive rate at 1%, the proposed transform increased the system performance by 7.7%, 8.3%, 8.5% and 18.9% respectively in terms of True Positive rate. Thus, the overall performance of the LF-ASD for these subjects was improved from 66.1%-82.7% to 85.0%-90.4%. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. In the future with the codebook derived from the normalized data, the performance could be further improved. References [1] Mason, S. G. and Birch, G. E., (2000) A Brain-Controlled Switch for Asynchronous Control Applications. IEEE Trans Biomed Eng, 47(10):1297-1307. [2] Vaughan, T. M., Wolpaw, J. R., and Donchin, E. (1996) EEG-Based Communication: Prospects and Problems. IEEE Trans Reh Eng, 4(4):425-430. [3] Jasper, H. and Penfield, W. (1949) Electrocortiograms in man: Effect of voluntary movement upon the electrical activity of the precentral gyrus. 
Arch. Psychiat. Nervenkr., 183:163-174.
[4] Pfurtscheller, G., Neuper, C., and Flotzinger, D. (1997) EEG-based discrimination between imagination of right and left hand movement. Electroencephalography and Clinical Neurophysiology, 103:642-651.
[5] Mason, S. G. (1997) Detection of single trial index finger flexions from continuous, spatiotemporal EEG. PhD Thesis, UBC, January.
[6] Green, D. M. and Swets, J. A. (1996) Signal Detection Theory and Psychophysics. New York: John Wiley and Sons, Inc.
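Because every comparison above is reported as a true-positive rate at a fixed 1% false-positive rate, a small sketch of how that operating point can be read off a detector's scored output may be useful. The score array, label encoding, and threshold sweep below are generic illustrations, not the LF-ASD's actual feature classifier.

```python
import numpy as np

def tp_at_fp(scores, labels, fp_target=0.01):
    """Sweep a decision threshold over the detector scores and return the
    best true-positive rate whose false-positive rate stays at or below
    `fp_target` (e.g. 1%). `labels` is True for active trials, False for idle."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_tp = 0.0
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.mean(pred[labels])    # true-positive rate on active trials
        fp = np.mean(pred[~labels])   # false-positive rate on idle trials
        if fp <= fp_target:
            best_tp = max(best_tp, tp)
    return best_tp

# Hypothetical usage: compare detector outputs with and without normalization.
# tp_plain = tp_at_fp(scores_without_ent, labels)
# tp_ent   = tp_at_fp(scores_with_ent, labels)
```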
same-paper 2 0.81617695 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
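For context on the standard baseline this abstract contrasts itself with, the sketch below shows the classic alternating procedure (in the spirit of Griffin and Lim, 1984): repeatedly impose the target magnitude spectrogram and re-estimate a time-domain signal, letting a phase emerge from the inconsistency between the two. This is only the baseline style of approach, not the paper's direct MAP inference; the function name, sampling rate, window length, overlap, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def alternating_phase_reconstruction(target_mag, fs=16000, nperseg=256,
                                     noverlap=192, n_iter=100, seed=0):
    """Recover a time-domain signal whose magnitude spectrogram matches
    `target_mag` by alternating projections (Griffin-Lim style baseline).

    target_mag: non-negative array of shape (n_freqs, n_frames), laid out
    like the output of scipy.signal.stft with the same parameters.
    """
    rng = np.random.default_rng(seed)
    # Start from a random phase estimate.
    spec = target_mag * np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        # Project onto signals consistent with some spectrogram ...
        _, x = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
        # ... then back onto spectrograms with the target magnitude.
        _, _, est = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        frames = min(est.shape[1], target_mag.shape[1])  # guard frame-count drift
        spec = target_mag[:, :frames] * np.exp(1j * np.angle(est[:, :frames]))
    _, x = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```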
3 0.80163753 175 nips-2003-Sensory Modality Segregation
Author: Virginia Sa
Abstract: Why are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the “visual input”. We explain this finding in terms of the statistical structure in sensory inputs. 1
4 0.57415521 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals
Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita
Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1
5 0.57330602 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
Author: John C. Platt
Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijkstra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijkstra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1
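The rectangular-slice idea in this abstract is concrete enough to sketch: run Dijkstra from a handful of landmark vertices of the sparse similarity graph, do classical MDS on the landmark-to-landmark block, and place every other vertex by the Nyström-style triangulation of Landmark MDS. The landmark count, target dimensionality, and random landmark selection below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def landmark_mds(graph: csr_matrix, n_landmarks=50, n_dims=2, seed=0):
    """Embed the vertices of a sparse distance-weighted graph into R^n_dims
    using a rectangular slice of the distance matrix (Landmark MDS)."""
    n = graph.shape[0]
    rng = np.random.default_rng(seed)
    landmarks = rng.choice(n, size=min(n_landmarks, n), replace=False)

    # Rectangular slice: shortest paths from each landmark to every vertex.
    D = dijkstra(graph, directed=False, indices=landmarks)   # shape (k, n)
    D2 = D ** 2

    # Classical MDS on the landmark-to-landmark block.
    D2_ll = D2[:, landmarks]                                  # shape (k, k)
    k = len(landmarks)
    J = np.eye(k) - np.ones((k, k)) / k
    B = -0.5 * J @ D2_ll @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_dims]

    # Nystrom-style triangulation places all vertices (landmarks included).
    mean_d2 = D2_ll.mean(axis=0)                              # shape (k,)
    pseudo = evecs[:, order] / np.sqrt(np.maximum(evals[order], 1e-12))
    X = -0.5 * (D2 - mean_d2[:, None]).T @ pseudo             # shape (n, n_dims)
    return X, landmarks
```

The slice `D` is the only distance information ever computed, which is what makes this family of algorithms practical on a graph with hundreds of thousands of vertices.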
6 0.57233846 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
7 0.56850713 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
8 0.56738555 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
9 0.56646144 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
10 0.56330574 115 nips-2003-Linear Dependent Dimensionality Reduction
11 0.56124526 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
12 0.56117916 126 nips-2003-Measure Based Regularization
13 0.56052732 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
14 0.56006277 113 nips-2003-Learning with Local and Global Consistency
15 0.55999535 189 nips-2003-Tree-structured Approximations by Expectation Propagation
16 0.55879939 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
17 0.55756623 66 nips-2003-Extreme Components Analysis
18 0.55750519 78 nips-2003-Gaussian Processes in Reinforcement Learning
19 0.55656713 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.55573553 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach