nips nips2003 nips2003-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Title: Predicting Speech Intelligibility from a Population of Neurons
Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin
Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time produced when a signal is processed by an auditory neural model. By using a well-developed model of the auditory periphery and detection theory, we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI's prediction error is 8% versus the STI's prediction error of 10.8%.

1 Introduction

A wide range of intelligibility measures in current use rest on the assumption that the intelligibility of a speech signal is the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg's original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one:

    AI = (1/20) Σ_{i=1}^{20} TI_i,    (1)

where TI_i (Transmission Index i) is the normalized intelligibility in the i-th band. The TI per band is a function of the signal-to-noise ratio:

    TI_i = (SNR_i + 12) / 30    (2)

for SNRs between -12 dB and 18 dB. An SNR greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under -12 dB means that the band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal.

Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different numbers of bands (5, 6, 15, or the original 20) as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved valid over a larger range of environments and interferences. The STI test signal is a Gaussian random signal with a long-term average speech spectrum, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI frequency-band-weighted SNR effects, MTF time-domain effects, and some limited measurements of nonlinearities.
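As a concrete illustration of Equations (1) and (2), the following sketch (not taken from the cited papers; the band SNRs are invented values) computes the classical 20-band AI:

```python
import numpy as np

def transmission_index(snr_db):
    """Equation (2): map a band SNR in dB to a Transmission Index in [0, 1].
    SNRs of 18 dB or more saturate at TI = 1; SNRs of -12 dB or less give TI = 0."""
    return np.clip((np.asarray(snr_db, dtype=float) + 12.0) / 30.0, 0.0, 1.0)

def articulation_index(band_snrs_db):
    """Equation (1): equally weighted average of the 20 per-band TIs."""
    return float(transmission_index(band_snrs_db).mean())

# Invented SNRs for 20 equally contributing bands, purely for illustration.
example_snrs = np.linspace(-15.0, 20.0, 20)
print(articulation_index(example_snrs))  # a score between 0 and 1
```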
The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good. However, the STI does not accurately model intraband masker nonlinearities, phase distortions, or the underlying auditory mechanisms (outside of independent frequency bands). We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neural Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (the control case) is compared to the neural spiking over time for the same signal after it undergoes some distortion (the test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. The NAI is then calculated as a weighted average of NDs at different Best Frequencies (BFs). In general detection-theory terms, the control neuronal response sets some locus in a high-dimensional space; the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above, we predict that a measure of neural coding error will be a better predictor of human intelligibility word-scores than SNR. Our method also has the potential to shed light on the underlying neurobiological mechanisms.

2 Method

2.1 Model

The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1.

Figure 1: Block diagram of the computational model of the auditory periphery from the middle ear to the auditory nerve. Reprinted from Fig. 1 of [7] with permission from the Acoustical Society of America © (2003).

The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of cat auditory periphery function. The first section models middle-ear filtering. The second section, labeled the "control path," captures the Outer Hair Cells' (OHC) modulatory function, and includes a wideband, nonlinear, time-varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the "signal path," describes the filter properties and traveling-wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, C_IHC and C_OHC are scaling constants that control IHC and OHC status, respectively. The parameters of the synapse section of the model are set to produce adaptation and discharge-rate-versus-level behavior appropriate for a high-spontaneous-rate/low-threshold auditory nerve fiber. To avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness.
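The per-band comparison of control and test responses described in this section can be summarized in the following skeleton. This is a sketch only: periphery_model is a placeholder for the auditory-periphery model of [7] (not reproduced here), band_distortion stands for the Neural Distortion defined in Section 2.2, and the toy stand-ins at the bottom are invented for illustration.

```python
import numpy as np

def neural_band_distortions(control_signal, test_signal, best_frequencies,
                            periphery_model, band_distortion):
    """For each best frequency (BF), drive the periphery model with the clean
    (control) and distorted (test) signals and score the mismatch between the
    two estimated instantaneous discharge rates."""
    distortions = []
    for bf in best_frequencies:
        control_rate = periphery_model(control_signal, bf)
        test_rate = periphery_model(test_signal, bf)
        distortions.append(band_distortion(test_rate, control_rate))
    return np.array(distortions)

# Toy stand-ins (the real model of [7] is far more elaborate and is BF dependent).
toy_model = lambda sig, bf: np.abs(np.convolve(sig, np.ones(8) / 8.0, mode="same"))
toy_distortion = lambda test, ctrl: 1.0 - np.dot(test, ctrl) / np.dot(ctrl, ctrl)
bfs = [125, 250, 500, 1000, 2000, 4000, 8000]
clean = np.random.randn(22050)
distorted = clean + 0.3 * np.random.randn(22050)
print(neural_band_distortions(clean, distorted, bfs, toy_model, toy_distortion))
```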
2.2 Neural articulation index

These results emulate most of the simulations described in Chapter 2 of Steeneken's thesis [8], which describes the full development of an STI metric from inception to end. The following simulations map most of that second chapter, but instead of basing the distortion metric on an SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in Section 3.1, applies a frequency weighting structure to combine the band distortion values, while Section 3.2 also introduces redundancy factors. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this case is:

    NAI = α_1·NTI_1 + α_2·NTI_2 + ... + α_7·NTI_7,    (3)

where α_i is the i-th band's contribution and NTI_i is the Neural Transmission Index in the i-th band. All the αs sum to one, so each α factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI lies in [0, 1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate:

    ND = 1 − (Test·Control^T) / (Control·Control^T),    (4)

where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady-state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI of Equations (1)-(2), after normalizing each of the seven bands to have zero mean and unit standard deviation. The NTI in the i-th band was calculated as

    NTI_i = m·(ND_i − μ_i)/σ_i + b.    (5)

NTI_i is then thresholded to be no less than 0 and no greater than 1, following the TI thresholding. In Equation (5) the factors m = 2.5 and b = -1 were the best linear fit such that bands with SNRs of 15 dB or greater produce NTI_i's of 1, bands with 7.5 dB SNR produce NTI_i's of 0.75, and bands with 0 dB SNR produce NTI_i's of 0.5. This closely followed the procedure outlined in Section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility.
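To make Equations (3)-(5) concrete, here is a minimal sketch. The numbers at the bottom (band NDs, normalization statistics, and uniform α weights) are invented placeholders; the fitted constants m = 2.5 and b = -1 are the ones quoted above.

```python
import numpy as np

def neural_distortion(test_rate, control_rate):
    """Equation (4): projection of the distorted (Test) instantaneous rate
    against the clean (Control) instantaneous rate."""
    return 1.0 - np.dot(test_rate, control_rate) / np.dot(control_rate, control_rate)

def neural_transmission_index(nd, band_mean, band_std, m=2.5, b=-1.0):
    """Equation (5): normalize the band ND, apply the linear fit, clamp to [0, 1]."""
    return float(np.clip(m * (nd - band_mean) / band_std + b, 0.0, 1.0))

def neural_articulation_index(ntis, alphas):
    """Equation (3): weighted sum of the seven band NTIs (the alphas sum to one)."""
    return float(np.dot(np.asarray(alphas), np.asarray(ntis)))

# Invented example values for the seven octave bands.
nds = np.array([0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10])
band_means = np.full(7, nds.mean())          # placeholder normalization statistics
band_stds = np.full(7, nds.std())
ntis = [neural_transmission_index(nd, mu, sd)
        for nd, mu, sd in zip(nds, band_means, band_stds)]
alphas = np.full(7, 1.0 / 7.0)               # uniform weights, for illustration only
print(neural_articulation_index(ntis, alphas))
```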
The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found. Eighty total words is enough to produce meaningful frequency weighting factors. There were 26 frequency-channel distortion conditions used for male speakers, 17 for females, and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups, given in Tables 1 through 4, for males; since females have negligible signal in the 125 Hz band, they used a subset, marked with an asterisk in Tables 1 through 4.

Table 1: Rippled Envelope
        Octave-band centre frequency
ID #    125  250  500  1K   2K   4K   8K
1*       1    1    1    1    0    0    0
2*       0    0    0    0    1    1    1
3*       1    1    0    0    0    1    1
4*       0    0    1    1    1    0    0
5*       1    1    0    0    1    1    0
6*       0    0    1    1    0    0    1
7*       1    0    1    0    1    0    1
8*       0    1    0    1    0    1    0

Table 2: Adjacent Triplets
        Octave-band centre frequency
ID #    125  250  500  1K   2K   4K   8K
9        1    1    1    0    0    0    0
10       0    1    1    1    0    0    0
11*      0    0    0    1    1    1    0

Table 3: Isolated Triplets
        Octave-band centre frequency
ID #    125  250  500  1K   2K   4K   8K
12       1    0    1    0    1    0    0
13       1    0    1    0    0    1    0
14       1    0    0    1    0    1    0
15*      0    1    0    1    0    0    1
16*      0    1    0    0    1    0    1
17       0    0    1    0    1    0    1

Table 4: Contiguous Bands
        Octave-band centre frequency
ID #    125  250  500  1K   2K   4K   8K
18*      0    1    1    1    1    0    0
19*      0    0    1    1    1    1    0
20*      0    0    0    1    1    1    1
21       1    1    1    1    1    0    0
22*      0    1    1    1    1    1    0
23*      0    0    1    1    1    1    1
24       1    1    1    1    1    1    0
25       0    1    1    1    1    1    1
26*      1    1    1    1    1    1    1

In the above tables a one represents a passband and a zero a stop band. A 1353-tap FIR filter was designed for each envelope condition. The female envelopes are a subset of these because females have no appreciable speech energy in the 125 Hz octave band.

Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following Equation (3) produces only a value between [0, 1]. To produce a word-score intelligibility prediction between zero and 100 percent, the NAI value was fit to a third-order polynomial that produced the lowest standard deviation of error from the empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third-order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8].

3 Results

3.1 Determining frequency weighting structure

For the first tests, the optimal frequency weights (the values of α_i from Equation (3)) were designed by minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the α_i was normalized to one (a sketch of this search is given at the end of this subsection). This is very similar to [5], whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI's final standard deviation of prediction error for males was 8.9%, and 7.1% for females.

Figure 2: Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best-fitting polynomial for males has a s.d. of 8.9% versus the STI [5] s.d. of 12.8%; for females the fit has a s.d. of 7.1% versus the STI [5] s.d. of 8.8%.

The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%), and the NAI factors are plotted in Figure 3.

Figure 3: Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8].

As one can see, the low-frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimulus corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high-frequency signal, they may explain this discrepancy at the cost of the low-frequency weights. [8] does state that these frequency weights are dependent upon the conditions used for evaluation.
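The following is a minimal sketch of the coordinate-dither weight search described in Section 3.1. The step size, the random choice of which weight to perturb, the stopping rule, and the refitting of the third-order polynomial at every step are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def fit_band_weights(ntis, word_scores, n_iter=5000, step=0.02, seed=0):
    """Coordinate-dither search for the alpha weights of Equation (3).

    ntis: (n_conditions x 7) matrix of band NTIs.
    word_scores: empirical word-score intelligibility per condition.
    One weight is nudged up or down per iteration, the weights are renormalized
    to sum to one, and the change is kept only if the prediction error shrinks."""
    rng = np.random.default_rng(seed)
    alphas = np.full(ntis.shape[1], 1.0 / ntis.shape[1])

    def prediction_error(weights):
        nai = ntis @ weights
        coeffs = np.polyfit(nai, word_scores, 3)     # third-order polynomial map
        return np.std(word_scores - np.polyval(coeffs, nai))

    best = prediction_error(alphas)
    for _ in range(n_iter):
        trial = alphas.copy()
        i = rng.integers(len(trial))
        trial[i] = max(trial[i] + rng.choice([-step, step]), 0.0)
        trial /= trial.sum()
        err = prediction_error(trial)
        if err < best:
            alphas, best = trial, err
    return alphas, best

# Usage with made-up data: 78 distortion/SNR conditions, 7 bands.
demo_ntis = np.random.rand(78, 7)
demo_scores = 100.0 * demo_ntis.mean(axis=1)
print(fit_band_weights(demo_ntis, demo_scores))
```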
3.2 Determining frequency weighting with redundancy factors

In experiment two, rather than using Equation (3), which assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlated signal between bands. Equation (3) then becomes:

    NAI_r = α_1·NTI_1 − β_1·NTI_1·NTI_2 + α_2·NTI_2 − β_2·NTI_2·NTI_3 + ... + α_7·NTI_7,    (6)

where the r subscript denotes a redundant NAI and the β_i are the correlation factors (a sketch of this computation appears at the end of this subsection). Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using Equation (6). The same testing and adaptation strategy from Section 3.1 was used to find the optimal αs and βs.

Figure 4: Relation between NAI_r and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise with redundancy factors. The vertical spread from the best-fitting polynomial for males has a s.d. of 6.9% versus the STI_r [8] s.d. of 4.7%; for females the best-fitting polynomial has a s.d. of 5.4% versus the STI_r [8] s.d. of 4.0%.

The frequency weighting and redundancy factors given as optimal in Steeneken, versus those calculated by optimizing the NAI_r, are given in Figure 5.

Figure 5: Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAI_r and published in [8].

The frequency weights for the NAI_r and STI_r are more similar than in Section 3.1. The redundancy factors are very different, though. The NAI redundancy factors show no real frequency dependence, unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8].

Table 5: Standard Deviation of Prediction Error
                  NAI      STI [5]   STI [8]
MALE, EQ. 3       8.9 %    12.8 %    6.8 %
FEMALE, EQ. 3     7.1 %    8.8 %     6.0 %
MALE, EQ. 6       6.9 %    -         4.7 %
FEMALE, EQ. 6     5.4 %    -         4.0 %

The mean difference in error between the STI_r, as given in [8], and the NAI_r is 1.7%. This difference may stem from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAI_r are comparable to the STI and STI_r in predicting speech intelligibility.
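A sketch of Equation (6), under the same caveats as before (the α and β values shown are placeholders, and only adjacent-band products are included, as in the text):

```python
import numpy as np

def redundant_nai(ntis, alphas, betas):
    """Equation (6): NAI with redundancy corrections for adjacent band pairs.

    ntis, alphas: length-7 arrays; betas: length-6 array of redundancy factors,
    one per adjacent pair of bands."""
    ntis, alphas, betas = map(np.asarray, (ntis, alphas, betas))
    weighted_sum = np.dot(alphas, ntis)
    redundancy = np.dot(betas, ntis[:-1] * ntis[1:])
    return float(weighted_sum - redundancy)

# Placeholder values, purely for illustration.
ntis = [0.2, 0.4, 0.6, 0.8, 0.9, 0.7, 0.5]
alphas = np.full(7, 1.0 / 7.0)
betas = np.full(6, 0.05)
print(redundant_nai(ntis, alphas, betas))
```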
4 Conclusions

These results are very encouraging. The NAI provides a modest improvement over the STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics, since the NAI is much more computationally complex than the STI. The NAI's end applications are in predicting intelligibility for the hearing impaired and in using statistical decision theory to describe the auditory system's feature extractors - tasks which the STI cannot do, but which are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing-impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13]. By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model, one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response, and provide feedback for preprocessing via a hearing aid before the impairment. The NAI may also give insight into how the ear codes stimuli for the very robust human auditory system.

References

[1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119.
[2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:1698-1702.
[3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697.
[4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73.
[5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67(1):318-326.
[6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York.
[7] Bruce, I.C., Sachs, M.B. & Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113(1):369-388.
[8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. dissertation, University of Amsterdam.
[9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch "open source" speech database. Eurospeech 2001 Poster. http://145.18.230.99/corpus/index.html
[10] Fletcher, H. & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151.
[11] Houtgast, T. & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova, pp. 285-288.
[12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 110(1):529-542.
[13] Sachs, M.B., Bruce, I.C., Miller, R.L. & Young, E.D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157-168.
2 0.25149751 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

1 Introduction

The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input; however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands.
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITD-IID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey, to a large extent, disjoint orthogonality in time and frequency. That is, at most one source has nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings, and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized, since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper.

2 Model architecture and azimuth localization

Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress's cross-correlation mechanism. For all frequency channels, we use the normalized cross-correlation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis, resulting in a cross-correlogram structure. In addition, a 'skeleton' cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage.
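As an illustration of this localization front end, here is a minimal Python sketch of the normalized cross-correlation over lags of ±1 ms for a single frequency channel. The signal names and the 500-Hz test tone are hypothetical, and the frequency-dependent azimuth mapping and skeleton sharpening described above are omitted.

    import numpy as np

    def normalized_cross_correlation(left, right, fs=44100, max_lag_ms=1.0):
        # Cross-correlate one channel's left/right outputs over +/-1 ms lags.
        # A positive peak lag means the right-ear signal lags the left-ear signal.
        max_lag = int(round(max_lag_ms * 1e-3 * fs))
        lags = np.arange(-max_lag, max_lag + 1)
        denom = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
        ncc = np.empty(len(lags))
        for k, lag in enumerate(lags):
            if lag >= 0:
                ncc[k] = np.sum(right[lag:] * left[:len(left) - lag]) / denom
            else:
                ncc[k] = np.sum(right[:lag] * left[-lag:]) / denom
        return lags / fs, ncc

    # Example: the lag of the peak gives an ITD estimate for one 20-ms frame of one gammatone channel.
    fs = 44100
    t = np.arange(int(0.02 * fs)) / fs
    left = np.sin(2 * np.pi * 500 * t)
    right = np.sin(2 * np.pi * 500 * (t - 0.0004))     # right-ear signal delayed by 0.4 ms
    lags, ncc = normalized_cross_correlation(left, right, fs)
    print("estimated ITD (ms):", 1e3 * lags[np.argmax(ncc)])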
3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITDi as the peak location of the cross-correlation in the range 2π/ωi centered at the target ITD, where ωi indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows:

Ri = Σt si²(t) / ( Σt si²(t) + Σt ni²(t) ),   IIDi = 20 log10 ( Σt li²(t) / Σt ri²(t) ),

where li and ri refer to the left and right peripheral output of the ith channel, respectively, si refers to the output for the target signal, and ni that for the acoustic interference. In computing IIDi, we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit.

Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.
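These per-channel, per-frame statistics can be sketched as follows, assuming the per-channel target, interference, left-ear and right-ear waveforms from the gammatone front end are available; the function and variable names are illustrative only.

    import numpy as np

    def frame_energy(x, frame_len, hop):
        # Sum of squares in overlapping frames.
        n_frames = 1 + (len(x) - frame_len) // hop
        return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                         for i in range(n_frames)])

    def relative_strength_and_iid(target_i, noise_i, left_i, right_i, fs=44100):
        # Per-frame R_i and IID_i for the i-th channel, following the equations above.
        frame_len, hop = int(0.02 * fs), int(0.01 * fs)    # 20-ms frames, 10-ms hop
        es = frame_energy(target_i, frame_len, hop)
        en = frame_energy(noise_i, frame_len, hop)
        el = frame_energy(left_i, frame_len, hop)
        er = frame_energy(right_i, frame_len, hop)
        R = es / (es + en + 1e-12)
        IID = 20.0 * np.log10((el + 1e-12) / (er + 1e-12))  # factor 20 compensates the square-root nonlinearity
        return R, IID

    # Example with random placeholder signals for one channel.
    rng = np.random.default_rng(0)
    sig = [rng.standard_normal(44100) for _ in range(4)]
    R, IID = relative_strength_and_iid(*sig)
    print(R[:3], IID[:3])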
Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H1: target is dominant, or Ri > 0.5, and the second one is H2: interference is dominant, or Ri < 0.5. Based on the estimates of the bivariate densities p(x | H1) and p(x | H2), the classification is done by the maximum a posteriori decision rule: p(H1) p(x | H1) > p(H2) p(x | H2). There exists a plethora of techniques for probability density estimation, ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares cross-validation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED, and have found no benefit for using IED in our system after incorporating ITD and IID [14].
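A minimal sketch of the per-channel MAP decision is given below. It uses scipy's gaussian_kde with its default bandwidth in place of the least-squares cross-validated kernel estimator used here, and synthetic ITD/IID clusters, so it illustrates the decision rule rather than reproducing the authors' exact estimator.

    import numpy as np
    from scipy.stats import gaussian_kde

    def train_channel_classifier(features, R):
        # Fit p(x|H1) and p(x|H2) over (ITD, IID) pairs for one channel; R > 0.5 marks target-dominant frames.
        features = np.asarray(features)          # shape (n_frames, 2): columns are ITD and IID
        kde_h1 = gaussian_kde(features[R > 0.5].T)
        kde_h2 = gaussian_kde(features[R <= 0.5].T)
        prior_h1 = float(np.mean(R > 0.5))
        return kde_h1, kde_h2, prior_h1

    def estimate_mask(features, kde_h1, kde_h2, prior_h1):
        # MAP rule: label a T-F unit 1 when p(H1) p(x|H1) > p(H2) p(x|H2).
        x = np.asarray(features).T
        return (prior_h1 * kde_h1(x)) > ((1.0 - prior_h1) * kde_h2(x))

    # Toy example with synthetic ITD (ms) / IID (dB) clusters for one channel.
    rng = np.random.default_rng(1)
    target = np.column_stack([rng.normal(0.0, 0.05, 500), rng.normal(0.0, 1.0, 500)])
    interf = np.column_stack([rng.normal(0.3, 0.05, 500), rng.normal(8.0, 1.0, 500)])
    feats = np.vstack([target, interf])
    R = np.concatenate([np.full(500, 0.9), np.full(500, 0.1)])
    k1, k2, p1 = train_channel_classifier(feats, R)
    mask = estimate_mask(feats, k1, k2, p1)
    print("fraction labelled target-dominant:", mask.mean())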
4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation, a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

SNR = 10 log10 ( Σt so²(t) / Σt (so(t) − se(t))² ),

where so(t) represents the resynthesized original speech and se(t) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with Σt sN²(t), where sN(t) is the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden's cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model achieves a considerable improvement over the Bodden system, producing a 3.5 dB average improvement.

Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: 'cocktail party'; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.
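This SNR measure translates directly into code; the sketch below assumes the resynthesized target, the reconstructed output and the resynthesized interference are available as arrays, with a small constant added to avoid division by zero.

    import numpy as np

    def snr_db(resynth_target, reconstructed):
        # SNR = 10 log10( sum s_o^2 / sum (s_o - s_e)^2 ), both signals resynthesized as described above.
        num = np.sum(resynth_target ** 2)
        den = np.sum((resynth_target - reconstructed) ** 2) + 1e-12
        return 10.0 * np.log10(num / den)

    def initial_snr_db(resynth_target, resynth_interference):
        # The initial SNR uses the resynthesized interference in the denominator instead.
        return 10.0 * np.log10(np.sum(resynth_target ** 2) /
                               (np.sum(resynth_interference ** 2) + 1e-12))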
Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame-rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database.

Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case.
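A sketch of the missing-data front-end features described above (smoothed Hilbert envelope, 10-ms frames, log compression) might look as follows; the exact smoother used in [10] is not specified here, so the 8-ms first-order filter below is an assumption based on the description.

    import numpy as np
    from scipy.signal import hilbert, lfilter

    def missing_data_features(channel_out, fs=44100, tau_ms=8.0, frame_ms=10.0):
        # Log-compressed, smoothed Hilbert envelope of one gammatone channel, one value per 10-ms frame.
        env = np.abs(hilbert(channel_out))               # Hilbert envelope
        a = np.exp(-1.0 / (tau_ms * 1e-3 * fs))          # first-order smoothing, roughly 8-ms time constant
        smoothed = lfilter([1.0 - a], [1.0, -a], env)
        hop = int(frame_ms * 1e-3 * fs)
        sampled = smoothed[::hop]                        # one sample per 10-ms frame
        return np.log(sampled + 1e-12)                   # log compression

    # Example on a random placeholder channel output.
    feat = missing_data_features(np.random.default_rng(2).standard_normal(44100))
    print(feat.shape)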
Finally we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically.

Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two and three source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented in 2002 ICASSP.

References
[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.
[3] A. Bronkhorst, "The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions," Acustica, vol. 86, pp. 117-128, 2000.
[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[6] G. Hu and D. L. Wang, "Monaural speech separation," Proc. NIPS, 2002.
[7] M. Bodden, "Modeling human sound-source localization and the cocktail-party-effect," Acta Acoustica, vol. 1, pp. 43-55, 1993.
[8] C. Liu et al., "A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers," J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.
[9] T. Whittkop and V. Hohmann, "Strategy-selective noise reduction for binaural digital hearing aids," Speech Comm., vol. 39, pp. 111-138, 2003.
[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Comm., vol. 34, pp. 267-285, 2001.
[11] H. Glotin, F. Berthommier and E. Tessier, "A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition," Proc. EUROSPEECH, pp. 2351-2354, 1999.
[12] A. Jourjine, S. Rickard and O. Yilmaz, "Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures," Proc. ICASSP, 2000.
[13] W. G. Gardner and K. D. Martin, "HRTF measurements of a KEMAR dummy-head microphone," MIT Media Lab Technical Report #280, 1994.
[14] N. Roman, D. L. Wang and G. J. Brown, "Speech segregation based on sound localization," J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.
[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.
3 0.13169664 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
4 0.10652877 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
5 0.090156212 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals
Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita
Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1
6 0.070817463 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
7 0.056243636 61 nips-2003-Entrainment of Silicon Central Pattern Generators for Legged Locomotory Control
8 0.052400894 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel
9 0.050385963 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
10 0.050060436 175 nips-2003-Sensory Modality Segregation
11 0.047464635 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
12 0.043257564 100 nips-2003-Laplace Propagation
13 0.042623892 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
14 0.041592255 157 nips-2003-Plasticity Kernels and Temporal Statistics
15 0.039265659 140 nips-2003-Nonlinear Processing in LGN Neurons
16 0.038552262 18 nips-2003-A Summating, Exponentially-Decaying CMOS Synapse for Spiking Neural Systems
17 0.034959629 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
18 0.034743782 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
19 0.033770099 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
20 0.032779962 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
simIndex simValue paperId paperTitle
same-paper 1 0.97521442 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin
Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%. 1 In trod u ction A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one: 1 20 AI = (1) ∑ TI i , 20 i =1 th where TIi (Transmission Index i) is the normalized intelligibility in the i band. The TI per band is a function of the signal to noise ratio or: (2) SNRi + 12 30 for SNRs between –12 dB and 18 dB. A SNR of greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under –12 dB means that a band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal. TI i = Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different number of bands - 5,6,15 or the original 20 - as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities. 
The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good. However, the STI does not accurately model intraband masker non-linearities, phase distortions or the underlying auditory mechanisms (outside of independent frequency bands) We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neuronal Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (control case) is compared to the neural spiking over time for the same signal after undergoing some distortion (test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. Then the NAI is calculated with a weighted average of NDs at different Best Frequencies (BFs). In general detection theory terms, the control neuronal response sets some locus in a high dimensional space, then the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above it is predicted that a measure of the neural coding error will be a better predictor than SNR for human intelligibility word-scores. Our method also has the potential to shed light on the underlying neurobiological mechanisms. 2 2.1 Meth o d Model The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1. Figure 1 Block diagram of the computational model of the auditory periphery from the middle ear to the Auditory Nerve. Reprinted from Fig. 1 of [7] with permission from the Acoustical Society of America © (2003). The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of the cat auditory periphery function. The first section models middle ear filtering. The second section, labeled the “control path,” captures the Outer Hair Cells (OHC) modulatory function, and includes a wideband, nonlinear, time varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the “signal path”, describes the filter properties and traveling wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, CIHC and COHC are scaling constants that control IHC and OHC status, respectively. 
The parameters of the synapse section of the model are set to produce adaptation and discharge-rate versus level behavior appropriate for a high-spontaneous- rate/low-threshold auditory nerve fiber. In order to avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness. 2.2 Neural articulation index These results emulate most of the simulations described in Chapter 2 of Steeneken’s thesis [8], as it describes the full development of an STI metric from inception to end. For those interested, the following simulations try to map most of the second chapter, but instead of basing the distortion metric on a SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in section 3.1, deals with applying a frequency weighting structure to combine the band distortion values, while section 3.2 introduces redundancy factors also. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this is: NAI = α 1 ⋅ NTI1 + α 2 ⋅ NTI2 + ... + α 7 ⋅ NTI7 , (3) th where •i is the i bands contribution and NTIi is the Neural Transmission Index in th the i band. Here all the •s sum to one, so each • factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI is between [0,1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate. ND = 1 − Test ⋅ Control T , Control ⋅ Control T (4) where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI equation 1-2, after normalizing each of the seven bands to have zero means and unit standard deviations across each of the seven bands. The NTI in the th i band was calculated as NDi − µ i (5) NTIi = m +b. σi NTIi is then thresholded to be no less then 0 and no greater then 1, following the TI thresholding. In equation (5) the factors, m = 2.5, b = -1, were the best linear fit to produce NTIi’s in bands with SNR greater then 15 dB of 1, bands with 7.5 dB SNR produce NTIi’s of 0.75, and bands with 0 dB SNR produced NTI i’s of 0.5. This closely followed the procedure outlined in section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility. The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found. 80 total words is enough to produce meaningful frequency weighting factors. There were 26 frequency channel distortion conditions used for male speakers, 17 for female and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups given in Tables 1 through 4 for males, since females have negligible signal in the 125 Hz band, they used a subset, marked with an asterisk in Table 1 through Table 4. 
Table 1: Rippled Envelope ID # 1* 2* 3* 4* 5* 6* 7* 8* 125 1 0 1 0 1 0 1 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 Table 2: Adjacent Triplets ID # 9 10 11* 125 1 0 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 Table 3: Isolated Triplets ID # 12 13 14 15* 16* 17 125 1 1 1 0 0 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 Table 4: Contiguous Bands OCTAVE-BAND CENTRE FREQUENCY ID # 18* 19* 20* 21 22* 23* 24 25 26* 125 250 500 1K 2K 4K 8K 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1 In the above tables a one represents a passband and a zero a stop band. A 1353 tap FIR filter was designed for each envelope condition. The female envelopes are a subset of these because they have no appreciable speech energy in the 125 Hz octave band. Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following equation (3) produces only a value between [0,1]. To produce a word-score intelligibility prediction between zero and 100 percent the NAI value was fit to a third order polynomial that produced the lowest standard deviation of error from empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8]. 3 3.1 R esu lts Determining frequency weighting structure For the first tests, the optimal frequency weights (the values of •i from equation 3) were designed through minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the • i was normalized to one. This is very similar to [5] whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI’s final standard deviation of prediction error for males was 8.9%, and 7.1% for females. Figure 2 Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best fitting polynomial for males has a s.d. = 8.9% versus the STI [5] s.d. = 12.8%, for females the fit has a s.d. = 7.1% versus the STI [5] s.d. = 8.8% The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%) and the NAI are plotted in Figure 3. Figure 3 Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8]. As one can see, the low frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimuli corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high frequency signal they may explain this discrepancy at the cost of the low frequency weights. [8] does state that these frequency weights are dependant upon the conditions used for evaluation. 
3.2 Determining frequency weighting with redundancy factors In experiment two, rather then using equation (3) that assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlate signals between bands. Equation (3) then becomes: NAIr = α 1 ⋅ NTI1 − β 1 NTI1 ⋅ NTI2 + α 2 ⋅ NTI2 − β 1 NTI2 ⋅ NTI3 + ... + α 7 ⋅ NTI7 , (6) where the r subscript denotes a redundant NAI and • is the correlation factor. Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using equation 6. The same testing, and adaptation strategy from Section 3.1 was used to find the optimal •s and •s. Figure 4 Relation between NAIr and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise with Redundancy Factors. The vertical spread from the best fitting polynomial for males has a s.d. = 6.9% versus the STIr [8] s.d. = 4.7%, for females the best fitting polynomial has a s.d. = 5.4% versus the STIr [8] s.d. = 4.0%. The frequency weighting and redundancy factors given as optimal in Steeneken, versus calculated through optimizing the NAIr are given in Figure 5. Figure 5 Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAIr and published in [8]. The frequency weights for the NAIr and STIr are more similar than in Section 3.1. The redundancy factors are very different though. The NAI redundancy factors show no real frequency dependence unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8]. Table 5: Standard Deviation of Prediction Error NAI STI [5] STI [8] MALE EQ. 3 8.9 % 12.8 % 6.8 % FEMALE EQ. 3 7.1 % 8.8 % 6.0 % MALE EQ. 6 6.9 % 4.7 % FEMALE EQ. 6 5.4 % 4.0 % The mean difference in error between the STI r, as given in [8], and the NAIr is 1.7%. This difference may be from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAIr are comparable to the STI and STI r in predicting speech intelligibility. 4 Conclusions These results are very encouraging. The NAI provides a modest improvement over STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics since the NAI is much more computationally complex then the STI. The NAI’s end applications are in predicting hearing impairment intelligibility and using statistical decision theory to describe the auditory systems feature extractors - tasks which the STI cannot do, but are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13]. By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response and provide feedback for preprocessing via a hearing aid before the impairment. 
The NAI may also give insight into how the ear codes stimuli for the very robust, human auditory system. References [1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119. [2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:16981702. [3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697. [4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73. [5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speechtransmission quality. J. Acoust. Soc. Am. 67(1):318-326. [6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York. [7] Bruce, I.C., Sachs, M.B., Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am., 113(1):369-388. [8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. Dissertation, University of Amsterdam. [9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch “open source” speech database. Eurospeech 2001 Poster http://145.18.230.99/corpus/index.html [10] Fletcher, H., & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151. [11] Houtgast, T., & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova:285-288. [12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearingimpaired listeners. J. Acoust. Soc. Am.110(1):529-542. [13] Sachs, M.B., Bruce, I.C., Miller, R.L., & Young, E. D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157–168.
2 0.90099519 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. 
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage.

3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITD_i as the peak location of the cross-correlation in the range 2π/ω_i centered at the target ITD, where ω_i indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows:

R_i = Σ_t s_i²(t) / [Σ_t s_i²(t) + Σ_t n_i²(t)],   IID_i = 20 log10 [Σ_t l_i²(t) / Σ_t r_i²(t)],

where l_i and r_i refer to the left and right peripheral output of the ith channel, respectively, s_i refers to the output for the target signal, and n_i that for the acoustic interference. In computing IID_i, we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0°, the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit.

Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve the data mean. A: Scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.
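As an illustration of how these per-frame statistics could be computed, the following sketch evaluates R_i and IID_i for one channel and one 20-ms frame directly from the formulas above; it assumes the premixed target/interference and left/right peripheral outputs are available, and all names are hypothetical.

```python
import numpy as np

def frame_indices(n_samples, fs, frame_ms=20.0, hop_ms=10.0):
    """20-ms frames with 10-ms overlap between consecutive frames."""
    frame = int(round(frame_ms * 1e-3 * fs))
    hop = int(round(hop_ms * 1e-3 * fs))
    return [(start, start + frame) for start in range(0, n_samples - frame + 1, hop)]

def frame_R_and_IID(s_i, n_i, l_i, r_i):
    """Relative strength R and IID for one frequency channel and one frame.

    s_i, n_i: peripheral outputs for the premixed target and interference;
    l_i, r_i: left/right peripheral outputs of the mixture.
    The factor 20 (rather than 10) compensates for the square-root
    compression applied in the peripheral model."""
    eps = 1e-12
    R = np.sum(s_i ** 2) / (np.sum(s_i ** 2) + np.sum(n_i ** 2) + eps)
    IID = 20.0 * np.log10((np.sum(l_i ** 2) + eps) / (np.sum(r_i ** 2) + eps))
    return R, IID
```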
Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H1: target is dominant, or R_i > 0.5; the second one is H2: interference is dominant, or R_i < 0.5. Based on the estimates of the bivariate densities p(x | H1) and p(x | H2), the classification is done by the maximum a posteriori decision rule: p(H1) p(x | H1) > p(H2) p(x | H2). A plethora of techniques exists for probability density estimation, ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares cross-validation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED, and have found no benefit for using IED in our system after incorporating ITD and IID [14].

4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5].
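A possible per-channel implementation of this MAP rule over kernel density estimates is sketched below; the Gaussian kernel with a single fixed bandwidth is a simplification of the least-squares cross-validated estimator described above, and the class interface is illustrative.

```python
import numpy as np

class ChannelMaskClassifier:
    """MAP decision between H1 (target dominant, R > 0.5) and H2
    (interference dominant) in the joint ITD-IID space, using Gaussian
    kernel density estimates for each hypothesis in one frequency channel."""

    def __init__(self, bandwidth=1.0):
        self.h = bandwidth

    def fit(self, features, R):
        # features: (n, 2) array of [ITD, IID]; R: relative strength per sample.
        # Assumes both hypotheses are represented in the training data.
        self.x1 = features[R > 0.5]               # samples under H1
        self.x2 = features[R <= 0.5]              # samples under H2
        self.p1 = len(self.x1) / len(features)    # prior p(H1)
        self.p2 = 1.0 - self.p1
        return self

    def _kde(self, train, x):
        # 2-D Gaussian kernel density estimate evaluated at the points x
        d = (x[:, None, :] - train[None, :, :]) / self.h
        k = np.exp(-0.5 * np.sum(d ** 2, axis=-1))
        return k.mean(axis=1) / (2.0 * np.pi * self.h ** 2)

    def predict_mask(self, features):
        # 1 where the target is estimated to dominate the T-F unit
        s1 = self.p1 * self._kde(self.x1, features)
        s2 = self.p2 * self._kde(self.x2, features)
        return (s1 > s2).astype(np.uint8)
```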
To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

SNR = 10 log10 [ Σ_t s_o²(t) / Σ_t (s_o(t) − s_e(t))² ],

where s_o(t) represents the resynthesized original speech and s_e(t) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with s_N(t), the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden’s cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. Shown in Fig. 5, our model shows a considerable improvement over the Bodden system, producing a 3.5 dB average improvement.

Figure 3. Systematic results for the two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.
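This SNR measure translates almost directly into code; the helpers below assume the resynthesized target, the mask-reconstructed signal and the resynthesized interference are time-aligned arrays of equal length.

```python
import numpy as np

def segregation_snr(s_o, s_e):
    """SNR (dB) of a mask-reconstructed signal s_e against the
    resynthesized original target s_o, as in the evaluation above."""
    s_o = np.asarray(s_o, dtype=float)
    s_e = np.asarray(s_e, dtype=float)
    num = np.sum(s_o ** 2)
    den = np.sum((s_o - s_e) ** 2) + 1e-12
    return 10.0 * np.log10(num / den)

def initial_snr(s_o, s_n):
    """Initial SNR, with the resynthesized interference s_n in the denominator."""
    s_o = np.asarray(s_o, dtype=float)
    s_n = np.asarray(s_n, dtype=float)
    return 10.0 * np.log10(np.sum(s_o ** 2) / (np.sum(s_n ** 2) + 1e-12))
```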
Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing-data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case.

Finally, we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and the results are presented diotically.
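The envelope-based feature extraction for the missing-data recognizer described above might be approximated as follows; the use of scipy.signal.hilbert and a one-pole smoother for the 8-ms first-order filter are implementation assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def missing_data_features(channel_outputs, fs, tau_ms=8.0, frame_ms=10.0):
    """Per-channel log-compressed envelope features at a 10-ms frame rate.

    channel_outputs: (n_channels, n_samples) gammatone filter outputs."""
    hop = int(round(frame_ms * 1e-3 * fs))
    a = np.exp(-1.0 / (tau_ms * 1e-3 * fs))          # one-pole smoothing coefficient
    feats = []
    for x in channel_outputs:
        env = np.abs(hilbert(x))                      # Hilbert envelope
        smoothed = lfilter([1.0 - a], [1.0, -a], env) # 8-ms first-order smoothing
        sampled = smoothed[::hop]                     # sample at the 10-ms frame rate
        feats.append(np.log(sampled + 1e-12))         # log compression
    return np.array(feats)                            # (n_channels, n_frames)
```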
Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A: Two-source condition (0° and 5°). B: Three-source condition (0°, 30° and -30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions, and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two- and three-source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented at the 2002 ICASSP.

References

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.
[3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000.
[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002.
[7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993.
[8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.
[9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003.
[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267-285, 2001.
[11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999.
[12] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000.
[13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994.
[14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.
[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.
3 0.71842062 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
4 0.68512416 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals
Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita
Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1
5 0.55859619 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
6 0.41852561 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel
7 0.31274965 175 nips-2003-Sensory Modality Segregation
8 0.29036102 140 nips-2003-Nonlinear Processing in LGN Neurons
9 0.27163377 85 nips-2003-Human and Ideal Observers for Detecting Image Curves
10 0.23303945 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
11 0.23179522 157 nips-2003-Plasticity Kernels and Temporal Statistics
12 0.22312461 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
13 0.20277409 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
14 0.17141856 10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector
15 0.16861144 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
16 0.16487516 61 nips-2003-Entrainment of Silicon Central Pattern Generators for Legged Locomotory Control
17 0.16199228 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
18 0.16051316 18 nips-2003-A Summating, Exponentially-Decaying CMOS Synapse for Spiking Neural Systems
19 0.15952295 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
20 0.15862139 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
topicId topicWeight
[(0, 0.02), (6, 0.42), (11, 0.023), (29, 0.016), (30, 0.055), (35, 0.024), (47, 0.013), (53, 0.093), (71, 0.051), (76, 0.04), (85, 0.058), (91, 0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.83412975 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons
Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin
Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%. 1 In trod u ction A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one: 1 20 AI = (1) ∑ TI i , 20 i =1 th where TIi (Transmission Index i) is the normalized intelligibility in the i band. The TI per band is a function of the signal to noise ratio or: (2) SNRi + 12 30 for SNRs between –12 dB and 18 dB. A SNR of greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under –12 dB means that a band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal. TI i = Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different number of bands - 5,6,15 or the original 20 - as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities. 
The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good. However, the STI does not accurately model intraband masker non-linearities, phase distortions or the underlying auditory mechanisms (outside of independent frequency bands). We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neuronal Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (control case) is compared to the neural spiking over time for the same signal after undergoing some distortion (test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. Then the NAI is calculated with a weighted average of NDs at different Best Frequencies (BFs). In general detection theory terms, the control neuronal response sets some locus in a high dimensional space, then the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above it is predicted that a measure of the neural coding error will be a better predictor than SNR for human intelligibility word-scores. Our method also has the potential to shed light on the underlying neurobiological mechanisms.

2 Method

2.1 Model

The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1.

Figure 1. Block diagram of the computational model of the auditory periphery from the middle ear to the Auditory Nerve. Reprinted from Fig. 1 of [7] with permission from the Acoustical Society of America © (2003).

The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of the cat auditory periphery function. The first section models middle ear filtering. The second section, labeled the “control path,” captures the Outer Hair Cells (OHC) modulatory function, and includes a wideband, nonlinear, time varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the “signal path”, describes the filter properties and traveling wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, CIHC and COHC are scaling constants that control IHC and OHC status, respectively.
The parameters of the synapse section of the model are set to produce adaptation and discharge-rate versus level behavior appropriate for a high-spontaneous-rate/low-threshold auditory nerve fiber. In order to avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness.

2.2 Neural articulation index

These results emulate most of the simulations described in Chapter 2 of Steeneken’s thesis [8], as it describes the full development of an STI metric from inception to end. For those interested, the following simulations try to map most of the second chapter, but instead of basing the distortion metric on an SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in section 3.1, deals with applying a frequency weighting structure to combine the band distortion values, while section 3.2 introduces redundancy factors also. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this is:

NAI = α1·NTI1 + α2·NTI2 + ... + α7·NTI7, (3)

where αi is the ith band's contribution and NTIi is the Neural Transmission Index in the ith band. Here all the αi sum to one, so each αi factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI is between [0,1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate:

ND = 1 − (Test·Control^T) / (Control·Control^T), (4)

where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady-state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI of equations 1-2, after normalizing each of the seven bands to have zero mean and unit standard deviation. The NTI in the ith band was calculated as

NTIi = m·(NDi − µi)/σi + b. (5)

NTIi is then thresholded to be no less than 0 and no greater than 1, following the TI thresholding. In equation (5) the factors m = 2.5 and b = -1 were the best linear fit such that bands with an SNR greater than 15 dB produce NTIi's of 1, bands with 7.5 dB SNR produce NTIi's of 0.75, and bands with 0 dB SNR produce NTIi's of 0.5. This closely followed the procedure outlined in section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility. The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found. 80 total words is enough to produce meaningful frequency weighting factors. There were 26 frequency channel distortion conditions used for male speakers, 17 for female, and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups, given in Tables 1 through 4 for males; since females have negligible signal in the 125 Hz band, they used a subset, marked with an asterisk in Tables 1 through 4.
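A compact sketch of equations (3)-(5) is given below, assuming per-band instantaneous-rate vectors from the Auditory Model are already available; the band weights and the normalization statistics passed in are placeholders, not the fitted values reported later.

```python
import numpy as np

def neural_distortion(control, test):
    """Eq. (4): ND = 1 - (Test . Control) / (Control . Control)."""
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    return 1.0 - np.dot(test, control) / (np.dot(control, control) + 1e-12)

def neural_ai(control_rates, test_rates, alpha, mu, sigma, m=2.5, b=-1.0):
    """Eqs. (3)-(5): per-band NTI from normalized neural distortion,
    then a weighted sum across the seven octave bands.

    control_rates, test_rates: lists of instantaneous-rate vectors, one per band;
    alpha: band weights summing to one; mu, sigma: normalization statistics of
    ND per band (placeholders here, not the values fitted in the paper)."""
    nai = 0.0
    for i, (c, t) in enumerate(zip(control_rates, test_rates)):
        nd = neural_distortion(c, t)
        nti = m * (nd - mu[i]) / sigma[i] + b      # Eq. (5)
        nti = float(np.clip(nti, 0.0, 1.0))        # threshold to [0, 1]
        nai += alpha[i] * nti                      # Eq. (3)
    return nai
```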
Table 1: Rippled Envelope ID # 1* 2* 3* 4* 5* 6* 7* 8* 125 1 0 1 0 1 0 1 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 Table 2: Adjacent Triplets ID # 9 10 11* 125 1 0 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 Table 3: Isolated Triplets ID # 12 13 14 15* 16* 17 125 1 1 1 0 0 0 OCTAVE-BAND CENTRE FREQUENCY 250 500 1K 2K 4K 8K 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 Table 4: Contiguous Bands OCTAVE-BAND CENTRE FREQUENCY ID # 18* 19* 20* 21 22* 23* 24 25 26* 125 250 500 1K 2K 4K 8K 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1 In the above tables a one represents a passband and a zero a stop band. A 1353 tap FIR filter was designed for each envelope condition. The female envelopes are a subset of these because they have no appreciable speech energy in the 125 Hz octave band. Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following equation (3) produces only a value between [0,1]. To produce a word-score intelligibility prediction between zero and 100 percent the NAI value was fit to a third order polynomial that produced the lowest standard deviation of error from empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8]. 3 3.1 R esu lts Determining frequency weighting structure For the first tests, the optimal frequency weights (the values of •i from equation 3) were designed through minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the • i was normalized to one. This is very similar to [5] whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI’s final standard deviation of prediction error for males was 8.9%, and 7.1% for females. Figure 2 Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best fitting polynomial for males has a s.d. = 8.9% versus the STI [5] s.d. = 12.8%, for females the fit has a s.d. = 7.1% versus the STI [5] s.d. = 8.8% The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%) and the NAI are plotted in Figure 3. Figure 3 Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8]. As one can see, the low frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimuli corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high frequency signal they may explain this discrepancy at the cost of the low frequency weights. [8] does state that these frequency weights are dependant upon the conditions used for evaluation. 
3.2 Determining frequency weighting with redundancy factors

In experiment two, rather than using equation (3), which assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlated signals between bands. Equation (3) then becomes:

NAIr = α1·NTI1 − β1·NTI1·NTI2 + α2·NTI2 − β2·NTI2·NTI3 + ... + α7·NTI7, (6)

where the r subscript denotes a redundant NAI and βi is the correlation factor. Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using equation 6. The same testing and adaptation strategy from Section 3.1 was used to find the optimal αs and βs.

Figure 4. Relation between NAIr and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise, with redundancy factors. The vertical spread from the best fitting polynomial for males has a s.d. = 6.9% versus the STIr [8] s.d. = 4.7%; for females the best fitting polynomial has a s.d. = 5.4% versus the STIr [8] s.d. = 4.0%.

The frequency weighting and redundancy factors given as optimal in Steeneken, versus those calculated through optimizing the NAIr, are given in Figure 5.

Figure 5. Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAIr and published in [8].

The frequency weights for the NAIr and STIr are more similar than in Section 3.1. The redundancy factors are very different though. The NAI redundancy factors show no real frequency dependence, unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8].

Table 5: Standard Deviation of Prediction Error
              NAI     STI [5]   STI [8]
MALE, EQ. 3   8.9 %   12.8 %    6.8 %
FEMALE, EQ. 3 7.1 %    8.8 %    6.0 %
MALE, EQ. 6   6.9 %              4.7 %
FEMALE, EQ. 6 5.4 %              4.0 %

The mean difference in error between the STIr, as given in [8], and the NAIr is 1.7%. This difference may be from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAIr are comparable to the STI and STIr in predicting speech intelligibility.

4 Conclusions

These results are very encouraging. The NAI provides a modest improvement over the STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics, since the NAI is much more computationally complex than the STI. The NAI’s end applications are in predicting hearing impairment intelligibility and using statistical decision theory to describe the auditory system's feature extractors - tasks which the STI cannot do, but which are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing-impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13]. By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response and provide feedback for preprocessing via a hearing aid before the impairment.
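Equation (6) could likewise be sketched as below; the weights are placeholders, and indexing the redundancy factor β per adjacent band pair is an assumption about the intended form of the equation.

```python
import numpy as np

def neural_ai_redundant(nti, alpha, beta):
    """Eq. (6): NAIr = sum_i alpha_i*NTI_i - sum_{i<7} beta_i*NTI_i*NTI_{i+1}.

    nti: per-band Neural Transmission Index values (7 octave bands);
    alpha: band weights; beta: redundancy factors for the 6 adjacent band pairs."""
    nti = np.asarray(nti, dtype=float)
    nai_r = float(np.dot(alpha, nti))                 # independent-band terms
    nai_r -= float(np.dot(beta, nti[:-1] * nti[1:]))  # adjacent-band redundancy terms
    return nai_r

# Toy usage with placeholder weights (not the fitted values from the paper).
nti = [1.0, 0.75, 0.5, 1.0, 0.75, 0.5, 1.0]
alpha = np.full(7, 1.0 / 7.0)
beta = np.full(6, 0.05)
print(round(neural_ai_redundant(nti, alpha, beta), 3))
```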
The NAI may also give insight into how the ear codes stimuli for the very robust, human auditory system. References [1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119. [2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:16981702. [3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697. [4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73. [5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speechtransmission quality. J. Acoust. Soc. Am. 67(1):318-326. [6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York. [7] Bruce, I.C., Sachs, M.B., Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am., 113(1):369-388. [8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. Dissertation, University of Amsterdam. [9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch “open source” speech database. Eurospeech 2001 Poster http://145.18.230.99/corpus/index.html [10] Fletcher, H., & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151. [11] Houtgast, T., & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova:285-288. [12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearingimpaired listeners. J. Acoust. Soc. Am.110(1):529-542. [13] Sachs, M.B., Bruce, I.C., Miller, R.L., & Young, E. D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157–168.
2 0.49950653 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
3 0.37732506 5 nips-2003-A Classification-based Cocktail-party Processor
Author: Nicoleta Roman, Deliang Wang, Guy J. Brown
Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. 
Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. 
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage. 3 B i n a ry ma s k est i mat i on The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITDi as the peak location of the cross-correlation in the range 2π / ω i centered at the target ITD, where ω i indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows: ∑ t s i2 (t ) Ri = ∑ ∑ t li2 (t ) , t s i2 (t ) + ∑ ∑ t ri2 (t ) t ni2 (t ) IIDi = 20 log10 where l i and ri refer to the left and right peripheral output of the ith channel, respectively, s i refers to the output for the target signal, and ni that for the acoustic interference. In computing IIDi , we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit. B 1 1 R R A theoretical empirical 0 -1 theoretical empirical 0 -15 1 ITD (ms) 15 IID (dB) Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. 
The solid curve shows the theoretical mean and the dash curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz. A B 1 C 10 1 IID s) 0.5 0 -10 IID (d B) 10 ) (dB R R 0 -0.5 m ITD ( -10 -0.5 m ITD ( s) 0.5 Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space. By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H 1 : target is dominant or Ri > 0.5 , and the second one is H 2 : interference is dominant or Ri < 0.5 . Based on the estimates of the bivariate densities p( x | H 1 ) and p( x | H 2 ) the classification is done by the maximum a posteriori decision rule: p( H 1 ) p( x | H 1 ) > p( H 2 ) p( x | H 2 ) . There exist a plethora of techniques for probability density estimation ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares crossvalidation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED and have found no benefit for using IED in our system after incorporating ITD and IID [14]. 4 Pe rfo rmanc e an d c omp arison The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. 
To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

SNR = 10 log10 [ Σ_t s_o^2(t) / Σ_t (s_o(t) - s_e(t))^2 ],

where s_o(t) represents the resynthesized original speech and s_e(t) the speech reconstructed from an estimated mask. The initial SNR can be measured by replacing the denominator with Σ_t s_N^2(t), where s_N(t) is the resynthesized original interference.

Fig. 3 shows systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane, for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When the spatial separation increases, performance improves even for side targets, reaching an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2].

Fig. 4 illustrates the performance in a three-source scenario with the target in the median plane and two interfering sources at -30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set; the second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB, but the average SNR gain obtained is still approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis.

In order to draw a quantitative comparison, we have implemented Bodden's cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition, and it adapts to HRTFs. The separation stage of the model estimates the weights of a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible in that it incorporates aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model yields a considerable improvement over the Bodden system, averaging 3.5 dB.

[Figure 3: three bar-plot panels (A, B, C), SNR (dB) for intrusion types N0-N9.]
Figure 3. Systematic results for the two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: 'cocktail party'; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.
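As a concrete illustration of the SNR measure defined at the start of this subsection, the short sketch below computes the initial SNR, the output SNR and the SNR gain from resynthesized signals. The resynthesis step of [5] is assumed to be available elsewhere, and the variable names are illustrative.

```python
import numpy as np

def snr_db(signal, error):
    """10 log10 of signal energy over error energy."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(error ** 2))

def evaluate_mask(s_o, s_e, s_n):
    """SNR evaluation following the measure used in the paper.

    s_o : resynthesized original target speech
    s_e : speech reconstructed from the estimated binary mask
    s_n : resynthesized original interference
    Returns (initial SNR, output SNR, SNR gain) in dB.
    """
    snr_out = snr_db(s_o, s_o - s_e)   # output SNR: target vs. residual error
    snr_in = snr_db(s_o, s_n)          # initial SNR: target vs. interference
    return snr_in, snr_out, snr_out - snr_in
```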
[Figure 4: bar plot, SNR (dB) for intrusion types N0-N9.]
Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at -30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

[Figure 5: bar plot, SNR (dB) for intrusion types N0-N9.]
Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique described in [10]. In this approach, a continuous-density hidden Markov model recognizer is modified so that only the acoustic features indicated as reliable in a binary mask are used during decoding; hence, it works seamlessly with the output of our speech segregation system. We have implemented the missing-data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained from the Hilbert envelope at the output of each gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope with an 8-ms first-order filter, sampling at a frame rate of 10 ms and finally log-compressing (a sketch of this feature extraction is given below, just before Fig. 7). We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left-ear signal using the male speaker dataset of the TIDigits database.

Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal mask for four SNR levels: 5 dB, 0 dB, -5 dB and -10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only a slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

[Figure 6: two panels (A, B), Correctness (%) versus SNR (5, 0, -5, -10 dB) for baseline, ideal mask and estimated mask.]
Figure 6. Recognition performance at different SNR values for the original mixture (dotted line), the ideal binary mask (dashed line) and the estimated mask (solid line). A: Correctness score for a two-source case. B: Correctness score for a three-source case.

Finally, we evaluate our model on speech intelligibility with normal-hearing listeners. We use the Bamford-Kowal-Bench sentence database, which contains short, semantically predictable sentences [15]. The score is the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects, we randomly swap the locations of target and interference across trials. In the unprocessed condition, binaural signals are produced by convolving the original signals with the corresponding HRTFs, and the signals are presented to the listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear, and the result is presented diotically.
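To make the missing-data front-end concrete, here is a minimal sketch of the feature extraction described above: the Hilbert envelope of each gammatone channel, smoothed with an 8-ms first-order filter, sampled at a 10-ms frame rate and log-compressed. It assumes the gammatone filterbank output is available from elsewhere; the function name, framing details and log floor are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def missing_data_features(channel_out, fs, frame_ms=10.0, tau_ms=8.0):
    """Spectral feature vectors for the missing-data recognizer.

    channel_out : (C, T) array, output of a C-channel gammatone filterbank
    fs          : sampling rate in Hz
    Returns an (n_frames, C) array of log-compressed, smoothed envelopes.
    """
    # Hilbert envelope of each channel
    env = np.abs(hilbert(channel_out, axis=1))

    # First-order (one-pole) low-pass smoothing with an 8-ms time constant
    alpha = np.exp(-1.0 / (tau_ms * 1e-3 * fs))
    env = lfilter([1.0 - alpha], [1.0, -alpha], env, axis=1)

    # Sample at a 10-ms frame rate and log-compress
    hop = int(round(frame_ms * 1e-3 * fs))
    frames = env[:, ::hop].T                 # (n_frames, C)
    return np.log(frames + 1e-12)            # small floor avoids log(0)
```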
[Figure 7: two panels (A, B), Keyword score (%) at 0 dB, -5 dB and -10 dB, before and after processing.]
Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before processing (white bars) and after processing (black bars). A: Two-source condition (0° and 5°). B: Three-source condition (0°, 30° and -30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and -10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score in all tested conditions, and the improvement grows as the SNR decreases (61% at -10 dB). Our informal observations suggest, as expected, that the intelligibility of unprocessed mixtures improves when the two sources are separated by more than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of -10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where the interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two- and three-source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented at ICASSP 2002.

References
[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.
[3] A. Bronkhorst, "The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions," Acustica, vol. 86, pp. 117-128, 2000.
[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[6] G. Hu and D. L. Wang, "Monaural speech separation," Proc. NIPS, 2002.
[7] M. Bodden, "Modeling human sound-source localization and the cocktail-party-effect," Acta Acoustica, vol. 1, pp. 43-55, 1993.
[8] C. Liu et al., "A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers," J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.
[9] T. Whittkop and V. Hohmann, "Strategy-selective noise reduction for binaural digital hearing aids," Speech Comm., vol. 39, pp. 111-138, 2003.
[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Comm., vol. 34, pp. 267-285, 2001.
[11] H. Glotin, F. Berthommier and E. Tessier, "A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition," Proc. EUROSPEECH, pp. 2351-2354, 1999.
[12] A. Jourjine, S. Rickard and O. Yilmaz, "Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures," Proc. ICASSP, 2000.
[13] W. G. Gardner and K. D. Martin, "HRTF measurements of a KEMAR dummy-head microphone," MIT Media Lab Technical Report #280, 1994.
[14] N. Roman, D. L. Wang and G. J. Brown, "Speech segregation based on sound localization," J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.
[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.
4 0.34092599 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers.
5 0.33892721 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
Author: Konrad P. Körding, Daniel M. Wolpert
Abstract: When we learn a new motor skill, we have to contend with both the variability inherent in our sensors and the task. The sensory uncertainty can be reduced by using information about the distribution of previously experienced tasks. Here we impose a distribution on a novel sensorimotor task and manipulate the variability of the sensory feedback. We show that subjects internally represent both the distribution of the task as well as their sensory uncertainty. Moreover, they combine these two sources of information in a way that is qualitatively predicted by optimal Bayesian processing. We further analyze if the subjects can represent multimodal distributions such as mixtures of Gaussians. The results show that the CNS employs probabilistic models during sensorimotor learning even when the priors are multimodal.
6 0.33867329 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
7 0.33797354 107 nips-2003-Learning Spectral Clustering
8 0.33528054 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
9 0.33484125 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
10 0.33372822 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
11 0.33372414 115 nips-2003-Linear Dependent Dimensionality Reduction
12 0.33316028 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
13 0.33306849 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
14 0.33295131 30 nips-2003-Approximability of Probability Distributions
15 0.33190003 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
16 0.33156264 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
17 0.33151388 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
18 0.33108494 113 nips-2003-Learning with Local and Global Consistency
19 0.33095184 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
20 0.33020189 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model