nips nips2003 nips2003-144 knowledge-graph by maker-knowledge-mining

144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals


Source: pdf

Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita

Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from large amounts of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation, especially when the reverberation time is long.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One microphone blind dereverberation based on quasi-periodicity of speech signals Tomohiro Nakatani, Masato Miyoshi, and Keisuke Kinoshita Speech Open Lab. [sent-1, score-1.087]

2 Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. [sent-7, score-0.899]

3 Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. [sent-9, score-0.732]

4 We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. [sent-10, score-0.897]

5 The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. [sent-11, score-2.145]

6 1 Introduction: Although numerous studies have been undertaken on robust automatic speech recognition (ASR) in the real world, long reverberation is still a serious problem that severely degrades the ASR performance [1]. [sent-12, score-0.586]

7 One simple way to overcome this problem is to dereverberate the speech signals prior to ASR, but this is also a challenging problem, especially when using a single microphone. [sent-13, score-0.462]

8 For example, certain blind equalization methods, including independent component analysis (ICA), can estimate the inverse filter of an unknown impulse response convolved with target signals when the signals are statistically independent and identically distributed sequences [2]. [sent-14, score-0.58]

9 However, these methods cannot appropriately deal with speech signals because speech signals have inherent properties, such as periodicity and formant structure, making their sequences statistically dependent. [sent-15, score-0.894]

10 Another approach that uses the properties of speech has also been proposed [3]. [sent-17, score-0.242]

11 The basic idea involves adaptively detecting time regions in which signal-to-reverberation ratios become small, and attenuating speech signals in those regions. [sent-18, score-0.444]

12 However, the precise separation of the signal and reverberation durations is difficult, therefore, this approach has achieved only moderate results so far. [sent-19, score-0.454]

13 In this paper, we propose a new principle for estimating an inverse filter by using an essential property of speech signals, namely quasi-periodicity, as a clue. [sent-20, score-0.276]

14 Therefore, when a long reverberation is added to a speech signal, signals in different time regions with different periods are mixed, thus degrading the periodicity of the signals in local time regions. [sent-22, score-0.988]

15 By contrast, we show that we can estimate an inverse filter for dereverberating a signal by enhancing the periodicity of the signal in each local time region. [sent-23, score-0.368]

16 The estimated filter can dereverberate both the periodic and non-periodic parts of speech signals with no prior knowledge of the target signals, even though only the periodic parts of the signals are used for the estimation. [sent-24, score-0.831]

17 1 Features of quasi-periodic signals: When a source signal s(n) is recorded in a reverberant room, the obtained signal x(n) is represented as x(n) = h(n)∗s(n), where h(n) is the impulse response of the room and “∗” is a convolution operation. [sent-28, score-0.849]

18 The goal of the dereverberation is to estimate a dereverberation filter, w(n), for −N < n < N that dereverberates x(n), and to obtain the dereverberated signal y(n) by: y(n) = w(n) ∗ x(n) = (w(n) ∗ h(n)) ∗ s(n) = q(n) ∗ s(n). [sent-29, score-1.684]

19 where q(n) = w(n) ∗ h(n) is referred to as the dereverberated impulse response (eq. 1). [sent-30, score-0.412]
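
For readers who want to experiment with the convolutional model above, here is a minimal numpy sketch of eq. (1). The source, room response, and filter are synthetic placeholders of our own choosing, not the paper's learned filter or measured data.

```python
import numpy as np

def apply_dereverberation_filter(x, w):
    """y(n) = w(n) * x(n), as in eq. (1); with x(n) = h(n) * s(n),
    q(n) = w(n) * h(n) plays the role of the dereverberated impulse response."""
    return np.convolve(x, w)[: len(x)]

# Synthetic illustration (placeholder signals, not the paper's data):
rng = np.random.default_rng(0)
s = rng.standard_normal(12000)                 # stand-in source s(n)
h = np.zeros(2000)
h[0] = 1.0                                     # direct path
h[500:] = 0.01 * rng.standard_normal(1500)     # synthetic late reverberation
x = np.convolve(s, h)[: len(s)]                # reverberant observation x(n)
w = np.zeros(2000)
w[0] = 1.0                                     # identity placeholder; HERB would learn w(n)
y = apply_dereverberation_filter(x, w)         # dereverberated signal y(n)
```

With the identity placeholder, y simply reproduces x; HERB's contribution is precisely the estimation of a non-trivial w(n) from many utterances.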

20 In each local time region around n0 (n0 − δ < n < n0 + δ for ∀ n0 ), s(n) is approximately a periodic signal whose period is T (n0 ). [sent-32, score-0.263]

21 Outside the region (|n − n0 | > δ), s(n ) is also a periodic signal within its neighboring time region, but often has another period that is different from T (n0 ). [sent-34, score-0.263]

22 These features make x(n) a non-periodic signal even within local time regions when h(m) contains non-zero values for |m| > δ. [sent-35, score-0.163]

23 Conversely, the goal of our dereverberation is to estimate a w(n) that makes y(n) a periodic signal in each local time region. [sent-37, score-0.865]

24 Once such a filter is obtained, q(m) must have zero values for |m| > δ, and thus, reverberant components longer than δ are eliminated from y(n). [sent-38, score-0.348]

25 An important additional feature of a quasi-periodic signal is that quasi-periodic components in a source signal can be enhanced by an adaptive harmonic filter. [sent-39, score-0.52]

26 An adaptive harmonic filter is a time-varying linear filter that enhances frequency components whose frequencies correspond to multiples of the fundamental frequency (F0 ) of the target signal, while preserving their phases and amplitudes. [sent-40, score-0.441]
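
As a concrete, if crude, stand-in for such an adaptive harmonic filter, the sketch below keeps only the spectral bins near integer multiples of a given F0 within one analysis frame. The frame length, window, and bandwidth are illustrative assumptions; the paper's filter additionally adapts F0 over time.

```python
import numpy as np

def harmonic_filter_frame(frame, f0, fs, bandwidth=40.0):
    """Pass only components near multiples of f0 (a rough harmonic filter).

    frame: one short-time frame of the signal; f0: fundamental frequency in Hz;
    fs: sampling rate in Hz; bandwidth: width in Hz kept around each harmonic.
    """
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Distance from each bin to its nearest harmonic of f0.
    dist = np.abs(freqs - f0 * np.round(freqs / f0))
    mask = dist <= bandwidth / 2.0
    return np.fft.irfft(spec * mask, n)
```

Because the mask multiplies the complex bins directly, the surviving harmonics keep their phases and amplitudes, which is the property the operator estimation below relies on.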

27 Later, this assumption is extended so that s(n) is composed of quasi-periodic components and non-periodic components in the case of speech signals. [sent-44, score-0.389]

28 Even when x(n) contains a long reverberation, the reverberant components that have different frequencies from s(n) are reduced by the harmonic filter, and thus, the quasi-periodic components can be enhanced. [sent-46, score-0.544]

29 2 ATF-HERB (average transfer function based dereverberation): Figure 1 is a diagram of ATF-HERB, which uses the average transfer function from reverberant signals to quasi-periodic signals. [sent-48, score-1.137]

30 A speech signal, S(ω), can be modeled by the sum of the quasi-periodic components, or voiced components, Sh (ω), and non-periodic components, or unvoiced components, Sn (ω), as eq. [sent-49, score-0.266]

31 The reverberant observed signal, X(ω), is then represented by the product of S and the transfer function, H(ω), of a room as eq. [sent-51, score-0.303]

32 The former transforms S into the direct signal, DS, and the latter into the reverberation part, RS, as shown in eq. [sent-54, score-0.368]

33 X is also represented by the sum of the direct signal of the quasi-periodic components, DSh , and the other components as eq. [sent-56, score-0.213]

34 Of these components, DSh can approximately be extracted from X by harmonic filtering. [sent-59, score-0.14]

35 Although the frequencies of quasi-periodic components change dynamically according to the changes in their fundamental frequency (F0), their reverberation remains at the original frequency. [sent-60, score-0.486]

36 Therefore, direct quasi-periodic components, DSh , can be enhanced by extracting frequency components located at multiples of its F0 . [sent-61, score-0.181]

37 This is because the signal DS + R̂S, which can be obtained by multiplying O(R̂) by X, becomes in a sense a dereverberated signal. [sent-66, score-0.437]

38 O(R̂(ω))X(ω) = D(ω)S(ω) + R̂(ω)S(ω). (9) Strictly speaking, R̂ cannot be represented as a linear transformation because the reverberation included in X̂ depends on the time pattern of X̂. [sent-67, score-0.379]

39 The rest of the reverberation included in X (= DS + RS), or (R − R̂)S, is eliminated by the dereverberation operator. [sent-71, score-1.023]

40 To estimate the dereverberation operator, we use the output of the harmonic filter, X̂. [sent-72, score-0.779]

41 Then, the dereverberation operator, O(R̂), can be approximated as the average of X̂/X, or W(ω) = E(X̂/X). [sent-74, score-0.639]
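
A minimal sketch of this averaging, assuming each utterance is represented by a single long-window spectrum X(ω) of common length and its harmonic-filtered version X̂(ω); the representation and the regularization constant are our assumptions:

```python
import numpy as np

def estimate_atf_herb_operator(X_list, Xhat_list, eps=1e-12):
    """Approximate W(w) = E(Xhat/X) by averaging per-utterance spectral ratios."""
    ratios = [Xhat / (X + eps) for X, Xhat in zip(X_list, Xhat_list)]
    return np.mean(ratios, axis=0)

def dereverberate_spectrum(X, W):
    """Apply the operator: W(w) X(w) approximates D(w)S(w) + Rhat(w)S(w)."""
    return W * X
```

In practice the expectation would run over a large corpus, such as the 5240-word training set used in the paper's experiments.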

42 The N̂ term in eq. (11) approximately equals zero because N̂ is a non-periodic component that the harmonic filter unexpectedly extracts, and thus the magnitude of N̂ is almost always negligible if a sufficiently long analysis window is used. [sent-83, score-0.173]

43 That is, by eq. (12), W(ω) has the value of the dereverberation operator multiplied by the probability that the harmonic components of speech have a larger magnitude than the non-periodic components. [sent-85, score-1.182]

44 Once the dereverberation operator is calculated from the periodic parts of speech signals for almost all the frequency ranges, it can dereverberate both the periodic and non-periodic parts of the signals because the inverse transfer function is independent of the source signal characteristics. [sent-86, score-1.795]

45 This is because the magnitudes of the non-periodic components relative to the periodic components tend to increase with frequency for a speech signal, and thus the P (|Sh | > |Sn |) value becomes smaller as ω increases. [sent-88, score-0.511]

46 To compensate for this decreasing gain, it may be useful to exploit the average attributes of speech when estimating the probability P(|Sh| > |Sn|). [sent-89, score-0.242]

47 In our experiments in section 4, however, W (ω) itself was used as the dereverberation operator without any compensation. [sent-90, score-0.713]

48 3 MMSE-HERB (minimum mean squared error criterion based dereverberation): As discussed in section 2.1, [sent-92, score-0.658]

49 quasi-periodic signals can be dereverberated simply by enhancing their quasi-periodicity. [sent-93, score-0.46]

50 The values of the dereverberated impulse response, q(n), are expected to become zero using this method where |n| > δ; however, the values are not specifically determined where |n| < δ. [sent-102, score-0.392]

51 This may cause unexpected spectral modification of the dereverberated signal. [sent-103, score-0.278]

52 The computational complexity of minimizing the cost function by repetitive estimation increases as the dereverberation filter becomes longer. [sent-110, score-0.685]

53 The longer the reverberation becomes, the longer the dereverberation filter should be. [sent-111, score-1.009]

54 The new cost function evaluates the quasi-periodicity not in the time domain but in the frequency domain, and uses a fixed quasi-periodic signal X̂(ω) as the desired signal, instead of the non-fixed quasi-periodic signal F(f0(n))[y(n)]. [sent-114, score-0.282]

55 The use of the fixed desired signals specifically provides the dereverberated impulse response, q(n), with the desired values, even in the time region, |n| < δ. [sent-116, score-0.574]
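
One simplified reading of this frequency-domain criterion is an ordinary per-frequency least-squares fit of W(ω) to the fixed desired spectra X̂(ω); the closed form below is a sketch under that reading, not the paper's exact derivation.

```python
import numpy as np

def estimate_mmse_herb_operator(X_list, Xhat_list, eps=1e-12):
    """Minimize E|Xhat - W X|^2 independently at each frequency bin.

    The closed-form solution is W(w) = E(Xhat X*) / E(|X|^2); under the
    paper's assumptions it approaches the dereverberation operator scaled
    by the power ratio of the quasi-periodic components (cf. eqs. 16-18).
    """
    num = np.mean([Xh * np.conj(X) for X, Xh in zip(X_list, Xhat_list)], axis=0)
    den = np.mean([np.abs(X) ** 2 for X in X_list], axis=0)
    return num / (den + eps)
```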

56 When calculating F(f0(n))[x(n − l)], x(n) is time-shifted by l points while f0(n) of the adaptive harmonic filter is not time-shifted. [sent-122, score-0.165]

57 The solution of eq. (16) again approaches the dereverberation operator, O(R̂), presented in section 2. [sent-126, score-0.639]

58 W(ω) ≈ [E(Sh Sh∗) / (E(Sh Sh∗) + E(Sn Sn∗))] O(R̂(ω)) (17)–(18). Because N̂ represents non-periodic components that are included unexpectedly and at random in the output of the harmonic filter, the absolute value of the second term in eq. [sent-128, score-0.273]

59 The solution of eq. (16) becomes the dereverberation operator multiplied by the ratio of the expected power of the quasi-periodic components in the signals to that of the whole signals. [sent-131, score-0.981]

60 As with the speech signals discussed in section 2. [sent-132, score-0.405]

61 The compensation discussed in section 2.2 may again be useful for the MMSE based dereverberation scheme. [sent-135, score-0.639]

62 3 Processing flow: Based on the above two methods, we constructed a dereverberation algorithm composed of two steps as shown in Fig. [sent-136, score-0.654]

63 Both methods are implemented in the same processing flow; they differ only in how the dereverberation operator is calculated. [sent-138, score-0.713]

64 In the first step, F0 is estimated from the reverberant signal, X. [sent-140, score-0.236]

65 Then the harmonic components included in X are estimated as X̂1 based on adaptive harmonic filtering. [sent-141, score-0.42]

66 The dereverberation operator O(R̂1) is then calculated by ATF-HERB or MMSE-HERB for a number of reverberant speech signals. [sent-142, score-1.191]

67 Finally, the dereverberated signal is obtained by multiplying O(R̂1) by X. [sent-143, score-0.159]

68 The second step employs almost the same procedures as the first step except that the speech data dereverberated by the first step are used as the input signal. [sent-145, score-0.52]

69 The use of this dereverberated input signal means that reverberant components, R̂2 X̂2, inevitably included in eq. [sent-146, score-0.698]

70 Therefore, a more effective dereverberation can be achieved in step 2. [sent-148, score-0.639]

71 In our preliminary experiments, however, repeating step 2 did not always improve the quality of the dereverberated signals. [sent-149, score-0.301]

72 This is because the estimation error of the dereverberation operators accumulates in the dereverberated signals when the signals are multiplied by more than one dereverberation operator. [sent-150, score-1.926]
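
The two-step flow can be summarized as below; the helper functions (F0 estimation, harmonic filtering, operator estimation, operator application) are assumed interfaces of our own naming, not the paper's implementations.

```python
def herb_two_step(spectra, estimate_f0, harmonic_filter, estimate_operator, apply_operator):
    """Sketch of the two-step HERB flow over a corpus of utterance spectra.

    Step 1 estimates the operator from reverberant input; step 2 repeats the
    procedure on the step-1 outputs. Further repetition is avoided because
    operator-estimation errors accumulate in the dereverberated signals.
    """
    signals = spectra
    for _ in range(2):                                   # exactly two steps
        f0s = [estimate_f0(x) for x in signals]
        xhats = [harmonic_filter(x, f0) for x, f0 in zip(signals, f0s)]
        W = estimate_operator(signals, xhats)            # ATF-HERB or MMSE-HERB
        signals = [apply_operator(x, W) for x in signals]
    return signals, W
```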

73 Figure 4: Reverberation curves of the original impulse responses (thin line) and dereverberated impulse responses (male: thick dashed line, female: thick solid line) for different reverberation times (rtime). [sent-181, score-0.928]

74 Accurate F0 estimation is very important for achieving effective dereverberation with our methods in this processing flow. [sent-182, score-0.677]

75 However, this is a difficult task for existing F0 estimators, especially for speech with long reverberation. [sent-183, score-0.583]

76 To cope with this problem, we designed a simple filter that attenuates signal components that continue at the same frequency, and used it as a preprocessor for the F0 estimation [5]. [sent-184, score-0.175]

77 In addition, the dereverberation operator, O(R̂1), itself is a very effective preprocessor for an F0 estimator because the reverberation of the speech can be directly reduced by the operator. [sent-185, score-1.246]

78 This mechanism is already included in step 2 of the dereverberation procedure, that is, F0 estimation is applied to O(R̂1)X. [sent-186, score-0.696]

79 4 Experimental results: We examined the performance of the proposed dereverberation methods. [sent-188, score-0.639]

80 We used 5240 Japanese word utterances provided by a male and a female speaker (MAU and FKM, 12 kHz sampling) included in the ATR database as source signals, S(ω). [sent-190, score-0.197]

81 We used four impulse responses measured in a reverberant room whose reverberation times were about 0. [sent-191, score-0.738]

82 Figure 4 depicts the reverberation curves of the original impulse responses and the dereverberated impulse responses obtained with ATF-HERB. [sent-197, score-0.91]

83 The figure shows that the proposed methods could effectively reduce the reverberation in the impulse responses for the female speaker when the reverberation time (rtime) was longer than 0. [sent-198, score-0.909]

84 For the male speaker, the reverberation effect in the lower time region was also effectively reduced. [sent-200, score-0.406]

85 This means that strong reverberant components were eliminated, and we can expect the intelligibility of the signals to be improved [6]. [sent-201, score-0.487]

86 Figure 5 shows spectrograms of reverberant and dereverberated speech signals when rtime was 1. [sent-202, score-1.016]

87 As shown in the figure, the reverberation of the signal was effectively reduced, and the formant structure of the signal was restored. [sent-204, score-0.61]

88 Similar spectrogram features were observed under other reverberation conditions, and an improvement in sound quality could clearly be recognized by listening to the dereverberated signals [7]. [sent-205, score-0.83]

89 We also evaluated the quality of the dereverberated speech in terms of speaker dependent word recognition rates [6]. [sent-206, score-0.621]

90 (Footnote 5: The reverberation curve shows the reduction in the energy of a room impulse response over time.) [sent-207, score-0.506]
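
The paper does not spell out how its reverberation curves were computed; a standard way to obtain such an energy decay curve from an impulse response is Schroeder backward integration, sketched here as an assumption.

```python
import numpy as np

def reverberation_curve_db(h, floor_db=-80.0):
    """Energy decay curve of an impulse response h(n), in dB.

    Schroeder backward integration: the value at time n is the energy of h
    remaining from n onward, normalized by the total energy; floor_db keeps
    the logarithm bounded for near-zero tails.
    """
    decay = np.cumsum((h ** 2)[::-1])[::-1]
    return 10.0 * np.log10(decay / decay[0] + 10.0 ** (floor_db / 10.0))
```

Reading the time at which this curve crosses -60 dB gives the familiar RT60 reverberation time.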

91 Figure 5: Spectrograms of reverberant (left) and dereverberated (right) speech of a male speaker uttering “ba-ku-da-i”. [sent-215, score-0.837]

92 With an ASR system, we could achieve more than 95% recognition rates under all the reverberation conditions with acoustic models trained using dereverberated speech signals. [sent-216, score-0.881]

93 5 Conclusion: A new blind dereverberation principle based on the quasi-periodicity of speech signals was proposed. [sent-218, score-1.103]

94 We showed that ATF-HERB and a simplified version of MMSE-HERB are both capable of learning the dereverberation operator that can reduce reverberant components in speech signals. [sent-220, score-1.257]

95 Experimental results showed that a dereverberation operator trained with 5240 Japanese word utterances could achieve very high quality speech dereverberation. [sent-221, score-1.014]

96 Future work will include an investigation of how such high quality speech dereverberation can be achieved with less speech data. [sent-222, score-1.146]

97 , “Enhancement of reverberant speech using LP residual signal,” IEEE Trans. [sent-243, score-0.478]

98 , “Implementation and effects of single channel dereverberation based on the harmonic structure of speech,” Proc. [sent-251, score-0.779]

99 , “Blind dereverberation of single channel speech signal based on harmonic structure,” Proc. [sent-256, score-1.149]

100 , “Intelligibility of speech under nonexponential decay conditions,” JASA, vol. [sent-264, score-0.242]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dereverberation', 0.639), ('reverberation', 0.326), ('sh', 0.286), ('dereverberated', 0.278), ('speech', 0.242), ('reverberant', 0.236), ('signals', 0.163), ('harmonic', 0.14), ('signal', 0.128), ('sn', 0.125), ('impulse', 0.114), ('filter', 0.109), ('rtime', 0.097), ('periodic', 0.079), ('operator', 0.074), ('components', 0.066), ('asr', 0.06), ('mmse', 0.06), ('frequency', 0.058), ('dsh', 0.056), ('periodicity', 0.056), ('blind', 0.043), ('speaker', 0.042), ('dereverberate', 0.042), ('equalization', 0.042), ('miyoshi', 0.042), ('nakatani', 0.042), ('male', 0.039), ('evaluates', 0.036), ('transfer', 0.036), ('included', 0.034), ('ds', 0.034), ('unexpectedly', 0.033), ('room', 0.031), ('responses', 0.031), ('khz', 0.031), ('harmonics', 0.031), ('db', 0.03), ('female', 0.029), ('formant', 0.028), ('harmonicity', 0.028), ('herb', 0.028), ('kinoshita', 0.028), ('rsh', 0.028), ('yegnanarayana', 0.028), ('diagram', 0.027), ('flow', 0.025), ('adaptive', 0.025), ('rs', 0.024), ('eliminated', 0.024), ('voiced', 0.024), ('preprocessor', 0.024), ('spectrogram', 0.024), ('quality', 0.023), ('cost', 0.023), ('transforms', 0.023), ('estimation', 0.023), ('intelligibility', 0.022), ('inevitably', 0.022), ('multiples', 0.022), ('longer', 0.022), ('region', 0.022), ('multiplied', 0.021), ('frequencies', 0.021), ('parts', 0.021), ('target', 0.021), ('adaptively', 0.02), ('ntt', 0.02), ('referred', 0.02), ('time', 0.019), ('enhancing', 0.019), ('japanese', 0.019), ('criterion', 0.019), ('direct', 0.019), ('word', 0.018), ('utterances', 0.018), ('domain', 0.018), ('power', 0.018), ('inverse', 0.018), ('recognition', 0.018), ('source', 0.017), ('acoustic', 0.017), ('thick', 0.017), ('obtained', 0.016), ('response', 0.016), ('therefore', 0.016), ('principle', 0.016), ('inversely', 0.016), ('enhanced', 0.016), ('features', 0.016), ('especially', 0.015), ('calculated', 0.015), ('composed', 0.015), ('japan', 0.015), ('multiplying', 0.015), ('fundamental', 0.015), ('period', 0.015), ('reduced', 0.015), ('phases', 0.015), ('achieving', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals

Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita

Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from large amounts of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation, especially when the reverberation time is long.

2 0.17769498 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey

Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1

3 0.14169259 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

Author: Felix A. Wichmann, Arnulf B. Graf

Abstract: We attempt to understand visual classification in humans using both psychophysical and machine learning techniques. Frontal views of human faces were used for a gender classification task. Human subjects classified the faces and their gender judgment, reaction time and confidence rating were recorded. Several hyperplane learning algorithms were used on the same classification task using the Principal Components of the texture and shape representation of the faces. The classification performance of the learning algorithms was estimated using the face database with the true gender of the faces as labels, and also with the gender estimated by the subjects. We then correlated the human responses to the distance of the stimuli to the separating hyperplane of the learning algorithms. Our results suggest that human classification can be modeled by some hyperplane algorithms in the feature space we used. For classification, the brain needs more processing for stimuli close to that hyperplane than for those further away. 1

4 0.12216837 5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

5 0.090156212 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons

Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin

Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%. 1 In trod u ction A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one: 1 20 AI = (1) ∑ TI i , 20 i =1 th where TIi (Transmission Index i) is the normalized intelligibility in the i band. The TI per band is a function of the signal to noise ratio or: (2) SNRi + 12 30 for SNRs between –12 dB and 18 dB. A SNR of greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under –12 dB means that a band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal. TI i = Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different number of bands - 5,6,15 or the original 20 - as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities. 
The STI shows a high correlation with empirical tests, and has been codified as an ANSI standard [6]. For general acoustics it is very good. However, the STI does not accurately model intraband masker non-linearities, phase distortions or the underlying auditory mechanisms (outside of independent frequency bands) We therefore sought to extend the AI/STI concepts to predict intelligibility, on the assumption that the closest physical variable we have to the perceptual variable of intelligibility is the auditory nerve response. Using a spiking model of the auditory periphery [7] we form the Neuronal Articulation Index (NAI) by describing distortions in the spike trains of different frequency bands. The spiking over time of an auditory nerve fiber for an undistorted speech signal (control case) is compared to the neural spiking over time for the same signal after undergoing some distortion (test case). The difference in the estimated instantaneous discharge rate for the two cases is used to calculate a neural equivalent to the TI, the Neural Distortion (ND), for each frequency band. Then the NAI is calculated with a weighted average of NDs at different Best Frequencies (BFs). In general detection theory terms, the control neuronal response sets some locus in a high dimensional space, then the distorted neuronal response will project near that locus if it is perceptually equivalent, or very far away if it is not. Thus, the distance between the control neuronal response and the distorted neuronal response is a function of intelligibility. Due to the limitations of the STI mentioned above it is predicted that a measure of the neural coding error will be a better predictor than SNR for human intelligibility word-scores. Our method also has the potential to shed light on the underlying neurobiological mechanisms. 2 2.1 Meth o d Model The auditory periphery model used throughout (and hereafter referred to as the Auditory Model) is from [7]. The system is shown in Figure 1. Figure 1 Block diagram of the computational model of the auditory periphery from the middle ear to the Auditory Nerve. Reprinted from Fig. 1 of [7] with permission from the Acoustical Society of America © (2003). The auditory periphery model comprises several sections, each providing a phenomenological description of a different part of the cat auditory periphery function. The first section models middle ear filtering. The second section, labeled the “control path,” captures the Outer Hair Cells (OHC) modulatory function, and includes a wideband, nonlinear, time varying, band-pass filter followed by an OHC nonlinearity (NL) and low-pass (LP) filter. This section controls the time-varying, nonlinear behavior of the narrowband signal-path basilar membrane (BM) filter. The control-path filter has a wider bandwidth than the signal-path filter to account for wideband nonlinear phenomena such as two-tone rate suppression. The third section of the model, labeled the “signal path”, describes the filter properties and traveling wave delay of the BM (time-varying, narrowband filter); the nonlinear transduction and low-pass filtering of the Inner Hair Cell (IHC NL and LP); spontaneous and driven activity and adaptation in synaptic transmission (synapse model); and spike generation and refractoriness in the auditory nerve (AN). In this model, CIHC and COHC are scaling constants that control IHC and OHC status, respectively. 
The parameters of the synapse section of the model are set to produce adaptation and discharge-rate versus level behavior appropriate for a high-spontaneous- rate/low-threshold auditory nerve fiber. In order to avoid having to generate many spike trains to obtain a reliable estimate of the instantaneous discharge rate over time, we instead use the synaptic release rate as an approximation of the discharge rate, ignoring the effects of neural refractoriness. 2.2 Neural articulation index These results emulate most of the simulations described in Chapter 2 of Steeneken’s thesis [8], as it describes the full development of an STI metric from inception to end. For those interested, the following simulations try to map most of the second chapter, but instead of basing the distortion metric on a SNR calculation, we use the neural distortion. There are two sets of experiments. The first, in section 3.1, deals with applying a frequency weighting structure to combine the band distortion values, while section 3.2 introduces redundancy factors also. The bands, chosen to match [8], are octave bands centered at [125, 250, 500, 1000, 2000, 4000, 8000] Hz. Only seven bands are used here. The Neural AI (NAI) for this is: NAI = α 1 ⋅ NTI1 + α 2 ⋅ NTI2 + ... + α 7 ⋅ NTI7 , (3) th where •i is the i bands contribution and NTIi is the Neural Transmission Index in th the i band. Here all the •s sum to one, so each • factor can be thought of as the percentage contribution of a band to intelligibility. Since NTI is between [0,1], it can also be thought of as the percentage of acoustic features that are intelligible in a particular band. The ND per band is the projection of the distorted (Test) instantaneous spike rate against the clean (Control) instantaneous spike rate. ND = 1 − Test ⋅ Control T , Control ⋅ Control T (4) where Control and Test are vectors of the instantaneous spike rate over time, sampled at 22050 Hz. This type of error metric can only deal with steady state channel distortions, such as the ones used in [8]. ND was then linearly fit to resemble the TI equation 1-2, after normalizing each of the seven bands to have zero means and unit standard deviations across each of the seven bands. The NTI in the th i band was calculated as NDi − µ i (5) NTIi = m +b. σi NTIi is then thresholded to be no less then 0 and no greater then 1, following the TI thresholding. In equation (5) the factors, m = 2.5, b = -1, were the best linear fit to produce NTIi’s in bands with SNR greater then 15 dB of 1, bands with 7.5 dB SNR produce NTIi’s of 0.75, and bands with 0 dB SNR produced NTI i’s of 0.5. This closely followed the procedure outlined in section 2.3.3 of [8]. As the TI is a best linear fit of SNR to intelligibility, the NTI is a best linear fit of neural distortion to intelligibility. The input stimuli were taken from a Dutch corpus [9], and consisted of 10 Consonant-Vowel-Consonant (CVC) words, each spoken by four males and four females and sampled at 44100 Hz. The Steeneken study had many more, but the exact corpus could not be found. 80 total words is enough to produce meaningful frequency weighting factors. There were 26 frequency channel distortion conditions used for male speakers, 17 for female and three SNRs (+15 dB, +7.5 dB and 0 dB). The channel conditions were split into four groups given in Tables 1 through 4 for males, since females have negligible signal in the 125 Hz band, they used a subset, marked with an asterisk in Table 1 through Table 4. 
Table 1: Rippled Envelope

        OCTAVE-BAND CENTRE FREQUENCY
ID #    125  250  500  1K   2K   4K   8K
1*       1    1    1    1    0    0    0
2*       0    0    0    0    1    1    1
3*       1    1    0    0    0    1    1
4*       0    0    1    1    1    0    0
5*       1    1    0    0    1    1    0
6*       0    0    1    1    0    0    1
7*       1    0    1    0    1    0    1
8*       0    1    0    1    0    1    0

Table 2: Adjacent Triplets

        OCTAVE-BAND CENTRE FREQUENCY
ID #    125  250  500  1K   2K   4K   8K
9        1    1    1    0    0    0    0
10       0    1    1    1    0    0    0
11*      0    0    0    1    1    1    0

Table 3: Isolated Triplets

        OCTAVE-BAND CENTRE FREQUENCY
ID #    125  250  500  1K   2K   4K   8K
12       1    0    1    0    1    0    0
13       1    0    1    0    0    1    0
14       1    0    0    1    0    1    0
15*      0    1    0    1    0    0    1
16*      0    1    0    0    1    0    1
17       0    0    1    0    1    0    1

Table 4: Contiguous Bands

        OCTAVE-BAND CENTRE FREQUENCY
ID #    125  250  500  1K   2K   4K   8K
18*      0    1    1    1    1    0    0
19*      0    0    1    1    1    1    0
20*      0    0    0    1    1    1    1
21       1    1    1    1    1    0    0
22*      0    1    1    1    1    1    0
23*      0    0    1    1    1    1    1
24       1    1    1    1    1    1    0
25       0    1    1    1    1    1    1
26*      1    1    1    1    1    1    1

In the above tables a one represents a passband and a zero a stop band. A 1353-tap FIR filter was designed for each envelope condition (one possible construction is sketched below). The female envelopes are a subset of these because they have no appreciable speech energy in the 125 Hz octave band. Using the 40 male utterances and 40 female utterances under distortion and calculating the NAI following equation (3) produces only a value between [0, 1]. To produce a word-score intelligibility prediction between zero and 100 percent, the NAI value was fit to a third-order polynomial that produced the lowest standard deviation of error from the empirical data. While Fletcher and Galt [10] state that the relation between AI and intelligibility is exponential, [8] fits with a third-order polynomial, and we have chosen to compare to [8]. The empirical word-score intelligibility was from [8].

3 Results

3.1 Determining frequency weighting structure

For the first tests, the optimal frequency weights (the values of α_i from equation (3)) were designed through minimizing the difference between the predicted intelligibility and the empirical intelligibility. At each iteration one of the values was dithered up or down, and then the sum of the α_i was normalized to one (a search loop of this kind is sketched below). This is very similar to [5], whose final standard deviation of prediction error for males was 12.8%, and 8.8% for females. The NAI’s final standard deviation of prediction error for males was 8.9%, and 7.1% for females.

Figure 2. Relation between NAI and empirical word-score intelligibility for male (left) and female (right) speech with bandpass limiting and noise. The vertical spread from the best fitting polynomial for males has an s.d. = 8.9% versus the STI [5] s.d. = 12.8%; for females the fit has an s.d. = 7.1% versus the STI [5] s.d. = 8.8%.

The frequency weighting factors are similar for the NAI and the STI. The STI weighting factors from [8], which produced the optimal prediction of empirical data (male s.d. = 6.8%, female s.d. = 6.0%), and the NAI are plotted in Figure 3.

Figure 3. Frequency weighting factors for the optimal predictor of male and female intelligibility calculated with the NAI and published by Steeneken [8].

As one can see, the low frequency information is tremendously suppressed in the NAI, while the high frequencies are emphasized. This may be an effect of the stimulus corpus. The corpus has a high percentage of stops and fricatives in the initial and final consonant positions. Since these have a comparatively large amount of high frequency signal, they may explain this discrepancy at the cost of the low frequency weights. [8] does state that these frequency weights are dependent upon the conditions used for evaluation.
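Returning to the envelope conditions of Tables 1 through 4: one possible construction of the corresponding filters, assuming octave band edges at centre/√2 and centre·√2 (the exact design used in [8] is not stated here), uses scipy's firwin2 to build the 1353-tap linear-phase FIR:

import numpy as np
from scipy.signal import firwin2

FS = 44100
# assumed octave band edges (Hz) around the centres 125 ... 8000
OCTAVE_EDGES = [88, 177, 354, 707, 1414, 2828, 5657, 11314]

def envelope_fir(passbands, numtaps=1353, fs=FS):
    # passbands: seven 0/1 flags, one row of Tables 1-4.
    freq, gain = [0.0], [0.0]
    for (lo, hi), g in zip(zip(OCTAVE_EDGES[:-1], OCTAVE_EDGES[1:]), passbands):
        freq += [lo, hi]
        gain += [float(g), float(g)]
    freq += [fs / 2]
    gain += [0.0]
    return firwin2(numtaps, freq, gain, fs=fs)

h = envelope_fir([1, 0, 1, 0, 1, 0, 1])  # condition 7 of Table 1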
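And for the weight search of Section 3.1, a sketch of the coordinate-dither loop, with the third-order polynomial map from NAI to word score refit at each step; the step size, iteration count and greedy acceptance rule are illustrative assumptions:

import numpy as np

def fit_frequency_weights(nti_matrix, word_scores, iters=5000, step=0.01, seed=0):
    # nti_matrix: (conditions x 7) per-band NTIs; word_scores: empirical
    # intelligibility for the same conditions.
    rng = np.random.default_rng(seed)
    alphas = np.full(7, 1.0 / 7.0)

    def prediction_sd(a):
        nai_values = nti_matrix @ a
        poly = np.polyfit(nai_values, word_scores, 3)  # NAI -> word score
        return float(np.std(np.polyval(poly, nai_values) - word_scores))

    best = prediction_sd(alphas)
    for _ in range(iters):
        candidate = alphas.copy()
        i = rng.integers(len(candidate))
        candidate[i] = max(candidate[i] + rng.choice((-step, step)), 0.0)
        candidate /= candidate.sum()  # renormalize to sum to one
        sd = prediction_sd(candidate)
        if sd < best:
            alphas, best = candidate, sd
    return alphas, best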
3.2 Determining frequency weighting with redundancy factors

In experiment two, rather than using equation (3), which assumes each frequency band contributes independently, we introduce redundancy factors. There is correlation between the different frequency bands of speech [11], which tends to make the STI over-predict intelligibility. The redundancy factors attempt to remove correlated signals between bands. Equation (3) then becomes:

NAI_r = α_1 ⋅ NTI_1 − β_1 ⋅ NTI_1 ⋅ NTI_2 + α_2 ⋅ NTI_2 − β_2 ⋅ NTI_2 ⋅ NTI_3 + ... + α_7 ⋅ NTI_7,   (6)

where the r subscript denotes a redundant NAI and the β_i are the correlation factors. Only adjacent bands are used here to reduce complexity. We replicated Section 3.1 except using equation (6). The same testing and adaptation strategy from Section 3.1 was used to find the optimal α's and β's (sketched below).

Figure 4. Relation between NAI_r and empirical word-score intelligibility for male speech (right) and female speech (left) with bandpass limiting and noise with redundancy factors. The vertical spread from the best fitting polynomial for males has an s.d. = 6.9% versus the STI_r [8] s.d. = 4.7%; for females the best fitting polynomial has an s.d. = 5.4% versus the STI_r [8] s.d. = 4.0%.

The frequency weighting and redundancy factors given as optimal in Steeneken, versus those calculated through optimizing the NAI_r, are given in Figure 5.

Figure 5. Frequency and redundancy factors for the optimal predictor of male and female intelligibility calculated with the NAI_r and published in [8].

The frequency weights for the NAI_r and STI_r are more similar than in Section 3.1. The redundancy factors are very different, though. The NAI redundancy factors show no real frequency dependence, unlike the convex STI redundancy factors. This may be due to differences in optimization that were not clear in [8].

Table 5: Standard Deviation of Prediction Error

                NAI      STI [5]   STI [8]
MALE EQ. 3      8.9 %    12.8 %    6.8 %
FEMALE EQ. 3    7.1 %    8.8 %     6.0 %
MALE EQ. 6      6.9 %    -         4.7 %
FEMALE EQ. 6    5.4 %    -         4.0 %

The mean difference in error between the STI_r, as given in [8], and the NAI_r is 1.7%. This difference may be from the limited CVC word choice. It is well within the range of normal speaker variation, about 2%, so we believe that the NAI and NAI_r are comparable to the STI and STI_r in predicting speech intelligibility.

4 Conclusions

These results are very encouraging. The NAI provides a modest improvement over the STI in predicting intelligibility. We do not propose this as a replacement for the STI for general acoustics, since the NAI is much more computationally complex than the STI. The NAI’s end applications are in predicting hearing impairment intelligibility and using statistical decision theory to describe the auditory system’s feature extractors - tasks which the STI cannot do, but which are available to the NAI. While the AI and STI can take into account threshold shifts in a hearing impaired individual, neither can account for sensorineural, suprathreshold degradations [12]. The accuracy of this model, based on cat anatomy and physiology, in predicting human speech intelligibility provides strong validation of attempts to design hearing aid amplification schemes based on physiological data and models [13]. By quantifying the hearing impairment in an intelligibility metric by way of a damaged auditory model, one can provide a more accurate assessment of the distortion, probe how the distortion is changing the neuronal response, and provide feedback for preprocessing via a hearing aid before the impairment.
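A minimal sketch of equation (6), assuming betas holds the six adjacent-band correlation factors; in the experiments above the α's and β's were found with the same dither loop sketched earlier:

import numpy as np

def nai_r(ntis, alphas, betas):
    # Equation (6): independent band terms minus adjacent-band
    # redundancy terms (betas has six entries, alphas seven).
    ntis = np.asarray(ntis, dtype=float)
    return float(np.dot(alphas, ntis) - np.dot(betas, ntis[:-1] * ntis[1:]))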
The NAI may also give insight into how the ear codes stimuli for the very robust human auditory system.

References

[1] French, N.R. & Steinberg, J.C. (1947) Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19:90-119.
[2] Kryter, K.D. (1962) Validation of the articulation index. J. Acoust. Soc. Am. 34:1698-1702.
[3] Kryter, K.D. (1962b) Methods for the calculation and use of the articulation index. J. Acoust. Soc. Am. 34:1689-1697.
[4] Houtgast, T. & Steeneken, H.J.M. (1973) The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28:66-73.
[5] Steeneken, H.J.M. & Houtgast, T. (1980) A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67(1):318-326.
[6] ANSI (1997) ANSI S3.5-1997 Methods for calculation of the speech intelligibility index. American National Standards Institute, New York.
[7] Bruce, I.C., Sachs, M.B. & Young, E.D. (2003) An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113(1):369-388.
[8] Steeneken, H.J.M. (1992) On measuring and predicting speech intelligibility. Ph.D. Dissertation, University of Amsterdam.
[9] van Son, R.J.J.H., Binnenpoorte, D., van den Heuvel, H. & Pols, L.C.W. (2001) The IFA corpus: a phonemically segmented Dutch “open source” speech database. Eurospeech 2001 Poster. http://145.18.230.99/corpus/index.html
[10] Fletcher, H. & Galt, R.H. (1950) The perception of speech and its relation to telephony. J. Acoust. Soc. Am. 22:89-151.
[11] Houtgast, T. & Verhave, J. (1991) A physical approach to speech quality assessment: correlation patterns in the speech spectrogram. Proc. Eurospeech 1991, Genova:285-288.
[12] van Schijndel, N.H., Houtgast, T. & Festen, J.M. (2001) Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 110(1):529-542.
[13] Sachs, M.B., Bruce, I.C., Miller, R.L. & Young, E.D. (2002) Biological basis of hearing-aid design. Ann. Biomed. Eng. 30:157-168.

6 0.063306332 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

7 0.062754281 14 nips-2003-A Nonlinear Predictive State Representation

8 0.058557887 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

9 0.054163273 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

10 0.048104104 113 nips-2003-Learning with Local and Global Consistency

11 0.047507852 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

12 0.042090978 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

13 0.041908726 157 nips-2003-Plasticity Kernels and Temporal Statistics

14 0.03546486 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression

15 0.033224367 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

16 0.0323519 115 nips-2003-Linear Dependent Dimensionality Reduction

17 0.032114502 88 nips-2003-Image Reconstruction by Linear Programming

18 0.030612117 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

19 0.030380324 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

20 0.029430117 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.097), (1, 0.006), (2, 0.072), (3, -0.026), (4, -0.07), (5, 0.049), (6, 0.136), (7, -0.059), (8, -0.088), (9, 0.18), (10, 0.184), (11, -0.088), (12, -0.016), (13, -0.038), (14, -0.07), (15, -0.143), (16, 0.142), (17, -0.04), (18, -0.014), (19, -0.03), (20, -0.048), (21, 0.107), (22, 0.043), (23, -0.058), (24, -0.01), (25, 0.103), (26, 0.028), (27, -0.07), (28, 0.109), (29, 0.051), (30, -0.013), (31, 0.068), (32, -0.058), (33, -0.004), (34, 0.062), (35, -0.068), (36, 0.073), (37, 0.055), (38, -0.119), (39, 0.093), (40, 0.01), (41, 0.103), (42, -0.102), (43, 0.022), (44, 0.042), (45, 0.04), (46, -0.034), (47, -0.045), (48, 0.11), (49, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95983297 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals

Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita

Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1

2 0.75567788 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey

Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1

3 0.66035038 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons

Author: Jeff Bondy, Ian Bruce, Suzanna Becker, Simon Haykin

Abstract: A major issue in evaluating speech enhancement and hearing compensation algorithms is to come up with a suitable metric that predicts intelligibility as judged by a human listener. Previous methods such as the widely used Speech Transmission Index (STI) fail to account for masking effects that arise from the highly nonlinear cochlear transfer function. We therefore propose a Neural Articulation Index (NAI) that estimates speech intelligibility from the instantaneous neural spike rate over time, produced when a signal is processed by an auditory neural model. By using a well developed model of the auditory periphery and detection theory we show that human perceptual discrimination closely matches the modeled distortion in the instantaneous spike rates of the auditory nerve. In highly rippled frequency transfer conditions the NAI’s prediction error is 8% versus the STI’s prediction error of 10.8%.

1 Introduction

A wide range of intelligibility measures in current use rest on the assumption that intelligibility of a speech signal is based upon the sum of contributions of intelligibility within individual frequency bands, as first proposed by French and Steinberg [1]. This basic method applies a function of the Signal-to-Noise Ratio (SNR) in a set of bands, then averages across these bands to come up with a prediction of intelligibility. French and Steinberg’s original Articulation Index (AI) is based on 20 equally contributing bands, and produces an intelligibility score between zero and one:

AI = (1/20) ∑_{i=1}^{20} TI_i,   (1)

where TI_i (Transmission Index i) is the normalized intelligibility in the i-th band. The TI per band is a function of the signal-to-noise ratio:

TI_i = (SNR_i + 12) / 30,   (2)

for SNRs between -12 dB and 18 dB. An SNR greater than 18 dB means that the band has perfect intelligibility and TI equals 1, while an SNR under -12 dB means that a band is not contributing at all, and the TI of that band equals 0. The overall intelligibility is then a function of the AI, but this function changes depending on the semantic context of the signal. Kryter validated many of the underlying AI principles [2]. Kryter also presented the mechanics for calculating the AI for different numbers of bands - 5, 6, 15 or the original 20 - as well as important correction factors [3]. Some of the most important correction factors account for the effects of modulated noise, peak clipping, and reverberation. Even with the application of various correction factors, the AI does not predict intelligibility in the presence of some time-domain distortions. Consequently, the Modulation Transfer Function (MTF) has been utilized to measure the loss of intelligibility due to echoes and reverberation [4]. Steeneken and Houtgast later extended this approach to include nonlinear distortions, giving a new name to the predictor: the Speech Transmission Index (STI) [5]. These metrics proved more valid for a larger range of environments and interferences. The STI test signal is a long-term average speech spectrum, gaussian random signal, amplitude modulated by a 0.63 Hz to 12.5 Hz tone. Acoustic components within different frequency bands are switched on and off over the testing sequence to come up with an intelligibility score between zero and one. Interband intermodulation sources can be discerned, as long as the product does not fall into the testing band. Therefore, the STI allows for standard AI-frequency band weighted SNR effects, MTF-time domain effects, and some limited measurements of nonlinearities.
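For concreteness, equations (1)-(2) amount to the following few lines (a sketch assuming the per-band SNRs in dB have already been measured):

import numpy as np

def articulation_index(band_snrs_db):
    # Equation (2) per band, clipped to [0, 1], then equation (1): the
    # average over the 20 equally contributing bands.
    ti = np.clip((np.asarray(band_snrs_db, dtype=float) + 12.0) / 30.0, 0.0, 1.0)
    return float(np.mean(ti))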

4 0.60940695 5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

1 Introduction

The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITD-IID feature space that correspond to an optimal estimate for an ideal mask.
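A minimal sketch of the ideal binary mask construction, assuming the premixed target and interference are available as T-F energy maps (an oracle quantity, available here by construction):

import numpy as np

def ideal_binary_mask(target_tf, interference_tf):
    # 1 where the target carries more energy than the interference in a
    # local time-frequency unit, 0 otherwise.
    return (np.abs(target_tf) ** 2 > np.abs(interference_tf) ** 2).astype(float)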
Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings, and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized, since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper.

2 Model architecture and azimuth localization

Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s cross-correlation mechanism. For all frequency channels, we use the normalized cross-correlation computed at lags equally distributed in the plausible range from -1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis, resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition.
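A sketch of the normalized cross-correlation ITD estimate for one 20-ms frame of a gammatone channel, with lags restricted to the plausible [-1 ms, +1 ms] range; the global normalization is a simplifying assumption:

import numpy as np

def itd_estimate(left, right, fs=44100, max_lag_ms=1.0):
    max_lag = int(fs * max_lag_ms / 1000.0)
    lags = np.arange(-max_lag, max_lag + 1)
    denom = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    corr = [np.dot(left[max(0, -l):len(left) - max(0, l)],
                   right[max(0, l):len(right) - max(0, -l)]) / denom
            for l in lags]
    return lags[int(np.argmax(corr))] / fs  # ITD in seconds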
Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage.

3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band, the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus into two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITD_i as the peak location of the cross-correlation in the range 2π/ω_i centered at the target ITD, where ω_i indicates the center frequency of the i-th channel. On the other hand, IID and R are computed as follows:

R_i = ∑_t s_i²(t) / ( ∑_t s_i²(t) + ∑_t n_i²(t) ),

IID_i = 20 log_10 ( ∑_t l_i²(t) / ∑_t r_i²(t) ),

where l_i and r_i refer to the left and right peripheral output of the i-th channel, respectively, s_i refers to the output for the target signal, and n_i to that for the acoustic interference. In computing IID_i, we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit.

Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.
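The two per-band statistics R_i and IID_i above reduce to a few lines per channel and time frame (a sketch; s, n, l, r are the peripheral outputs for target, interference, left ear and right ear over one frame):

import numpy as np

def band_statistics(s, n, l, r):
    R = np.sum(s ** 2) / (np.sum(s ** 2) + np.sum(n ** 2))
    IID = 20.0 * np.log10(np.sum(l ** 2) / np.sum(r ** 2))
    return R, IID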
Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels, given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the i-th channel, we test the following two hypotheses. The first one is H1: target is dominant, or R_i > 0.5; the second one is H2: interference is dominant, or R_i < 0.5. Based on the estimates of the bivariate densities p(x|H1) and p(x|H2), the classification is done by the maximum a posteriori decision rule: p(H1)p(x|H1) > p(H2)p(x|H2). There exists a plethora of techniques for probability density estimation, ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data, we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares cross-validation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range, where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED, and have found no benefit for using IED in our system after incorporating ITD and IID [14].

4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation, a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5].
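A sketch of the per-channel MAP classifier; scipy's gaussian_kde with its default bandwidth rule stands in here for the least-squares cross-validation bandwidth selection used above:

import numpy as np
from scipy.stats import gaussian_kde

def train_channel_classifier(features_h1, features_h2, prior_h1=0.5):
    # features_h1, features_h2: 2 x N arrays of (ITD, IID) training samples
    # collected under H1 (target dominant) and H2 (interference dominant).
    kde1, kde2 = gaussian_kde(features_h1), gaussian_kde(features_h2)

    def classify(x):
        # True (mask = 1) where p(H1)p(x|H1) > p(H2)p(x|H2)
        return prior_h1 * kde1(x) > (1.0 - prior_h1) * kde2(x)

    return classify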
To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

SNR = 10 log_10 ( ∑_t s_o²(t) / ∑_t (s_o(t) - s_e(t))² ),

where s_o(t) represents the resynthesized original speech and s_e(t) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with ∑_t s_N²(t), where s_N(t) is the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When spatial separation increases, the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at -30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden’s cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition, and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model achieves a considerable improvement over the Bodden system, producing an average gain of 3.5 dB.

Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.
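The SNR measure above, as a sketch over the resynthesized signals:

import numpy as np

def snr_db(s_original, s_estimate):
    # original resynthesized target as signal, reconstruction error as noise
    err = s_original - s_estimate
    return 10.0 * np.log10(np.sum(s_original ** 2) / np.sum(err ** 2))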
Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at -30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing-data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and -10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case.

Finally, we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database, which contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects, we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically.
Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and -10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions, and the improvement becomes larger as the SNR decreases (61% at -10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of -10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two- and three-source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented at the 2002 ICASSP.

References

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.
[3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000.
[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.
[5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002.
[7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993.
[8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.
[9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003.
[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267-285, 2001.
[11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999.
[12] A. Jourjine, S. Rickard and O. Yilmaz, "Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures," Proc. ICASSP, 2000.
[13] W. G. Gardner and K. D. Martin, "HRTF measurements of a KEMAR dummy-head microphone," MIT Media Lab Technical Report #280, 1994.
[14] N. Roman, D. L. Wang and G. J. Brown, "Speech segregation based on sound localization," J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.
[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.

5 0.47495258 14 nips-2003-A Nonlinear Predictive State Representation

Author: Matthew R. Rudary, Satinder P. Singh

Abstract: Predictive state representations (PSRs) use predictions of a set of tests to represent the state of controlled dynamical systems. One reason why this representation is exciting as an alternative to partially observable Markov decision processes (POMDPs) is that PSR models of dynamical systems may be much more compact than POMDP models. Empirical work on PSRs to date has focused on linear PSRs, which have not allowed for compression relative to POMDPs. We introduce a new notion of tests which allows us to define a new type of PSR that is nonlinear in general and allows for exponential compression in some deterministic dynamical systems. These new tests, called e-tests, are related to the tests used by Rivest and Schapire [1] in their work with the diversity representation, but our PSR avoids some of the pitfalls of their representation—in particular, its potential to be exponentially larger than the equivalent POMDP. 1
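To ground the terminology, below is a minimal sketch of the state update in a linear PSR, the baseline the abstract contrasts with; the parameter vectors (one m_ao per action-observation pair, plus a row m_aoq for each one-step extension of the core tests) are assumed to have been learned already.

```python
import numpy as np

def psr_update(state, m_ao, m_aoq):
    """One state update of a linear PSR after taking action a and seeing
    observation o. `state` (k,) holds the predictions of the k core tests;
    `m_ao` (k,) predicts the probability of o, and row i of `m_aoq` (k, k)
    predicts the one-step extended test (a, o, q_i)."""
    p_ao = float(state @ m_ao)          # P(o | history, a)
    new_state = (m_aoq @ state) / p_ao  # P(q_i | history, a, o) for each i
    return new_state, p_ao
```

The nonlinear e-test PSRs introduced in the paper replace this purely linear update, which is what permits the exponential compression mentioned above.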

6 0.42789778 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification

7 0.42694372 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

8 0.38139307 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

9 0.31280854 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

10 0.31022483 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface

11 0.30662525 157 nips-2003-Plasticity Kernels and Temporal Statistics

12 0.25664565 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

13 0.24790561 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

14 0.24203217 115 nips-2003-Linear Dependent Dimensionality Reduction

15 0.23694654 140 nips-2003-Nonlinear Processing in LGN Neurons

16 0.23531422 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

17 0.22986713 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

18 0.22619587 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

19 0.21202578 119 nips-2003-Local Phase Coherence and the Perception of Blur

20 0.19855121 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (11, 0.012), (30, 0.473), (35, 0.036), (53, 0.091), (69, 0.011), (71, 0.05), (76, 0.024), (79, 0.011), (85, 0.083), (91, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9696995 157 nips-2003-Plasticity Kernels and Temporal Statistics

Author: Peter Dayan, Michael Häusser, Michael London

Abstract: Computational mysteries surround the kernels relating the magnitude and sign of changes in efficacy as a function of the time difference between pre- and post-synaptic activity at a synapse. One important idea [3, 4] is that kernels result from filtering, i.e., an attempt by synapses to eliminate noise corrupting learning. This idea has hitherto been applied to trace learning rules; we apply it to experimentally-defined kernels, using it to reverse-engineer assumed signal statistics. We also extend it to consider the additional goal for filtering of weighting learning according to statistical surprise, as in the Z-score transform. This provides a fresh view of observed kernels and can lead to different, and more natural, signal statistics.
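For orientation, such a kernel is commonly parameterized as a two-sided exponential window over the pre/post spike-time difference; the sketch below is illustrative only, with parameter values that are assumptions rather than values from the paper.

```python
import numpy as np

def stdp_kernel(dt, a_plus=1.0, a_minus=1.05, tau_plus=0.020, tau_minus=0.020):
    """Two-sided exponential STDP window over dt = t_post - t_pre:
    potentiation when the presynaptic spike leads (dt > 0), depression
    otherwise. Times in seconds; amplitudes are illustrative."""
    dt = np.asarray(dt, dtype=float)
    return np.where(dt > 0,
                    a_plus * np.exp(-dt / tau_plus),
                    -a_minus * np.exp(dt / tau_minus))
```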

2 0.91771173 62 nips-2003-Envelope-based Planning in Relational MDPs

Author: Natalia H. Gardiol, Leslie P. Kaelbling

Abstract: A mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelope-based Planning (REBP), tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and judiciously expand its envelope as resources permit. 1
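As a schematic of the envelope idea (a generic sketch, not the paper's REBP algorithm), the loop below plans within a restricted state set and grows it as resources permit; plan and most_likely_fallout are hypothetical helpers.

```python
def envelope_planning(mdp, initial_states, budget):
    """Schematic envelope loop: solve a restricted MDP, find the most
    probable state outside the envelope under the current policy, add it,
    and re-plan while resources remain. All helpers are hypothetical."""
    envelope = set(initial_states)
    policy = plan(mdp, envelope)              # solve the restricted MDP
    for _ in range(budget):
        s = most_likely_fallout(mdp, policy, envelope)
        if s is None:                         # no probable off-envelope state
            break
        envelope.add(s)                       # judiciously expand the envelope
        policy = plan(mdp, envelope)
    return policy
```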

3 0.89240927 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

Author: John C. Platt

Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijkstra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijkstra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1
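To illustrate the rectangular-Dijkstra idea, the sketch below computes a dense k-by-n distance slice with SciPy's Dijkstra and embeds the graph with Landmark (Nystrom-style) MDS. It is a simplified stand-in under stated assumptions (connected, undirected graph; landmark indices supplied by the caller), not the paper's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def landmark_mds(graph, landmarks, dim=2):
    """Landmark MDS sketch. `graph`: scipy.sparse adjacency matrix of edge
    lengths (assumed connected, undirected); `landmarks`: k vertex indices."""
    # "Rectangular Dijkstra": one shortest-path pass per landmark gives a
    # dense k-by-n slice of the full distance matrix.
    D = dijkstra(graph, directed=False, indices=landmarks)
    k = len(landmarks)
    Dl2 = D[:, landmarks] ** 2                    # squared landmark distances
    # Classical MDS on the k-by-k landmark block.
    J = np.eye(k) - np.ones((k, k)) / k
    w, V = np.linalg.eigh(-0.5 * J @ Dl2 @ J)
    top = np.argsort(w)[::-1][:dim]
    L = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))  # landmark coordinates
    # Triangulate every vertex from its distances to the landmarks.
    delta_mu = Dl2.mean(axis=1)
    X = -0.5 * np.linalg.pinv(L) @ (D ** 2 - delta_mu[:, None])
    return X.T                                    # (n, dim) embedding
```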

same-paper 4 0.89046204 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals

Author: Tomohiro Nakatani, Masato Miyoshi, Keisuke Kinoshita

Abstract: Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long. 1

5 0.5482896 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey

Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1
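The alternating baseline that this abstract contrasts with is essentially the Griffin-Lim procedure; below is a minimal sketch using SciPy's STFT/ISTFT, where the window settings are assumptions and mag holds the target magnitude spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, fs, nperseg=256, n_iter=50, seed=0):
    """Alternate between imposing the target magnitude `mag`
    (nperseg//2 + 1 bins x frames) and re-estimating a consistent phase
    by resynthesizing and re-analyzing the signal."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
        _, _, S = stft(x, fs=fs, nperseg=nperseg)
        T = min(S.shape[1], mag.shape[1])   # guard against frame-count drift
        phase = np.ones_like(phase)
        phase[:, :T] = np.exp(1j * np.angle(S[:, :T]))
    _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
    return x
```

In contrast, the method summarized above infers the time-domain signal directly, optimizing the phase and a spectrally-consistent signal jointly.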

6 0.53708333 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

7 0.45215276 5 nips-2003-A Classification-based Cocktail-party Processor

8 0.42701748 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

9 0.42474091 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

10 0.42079082 159 nips-2003-Predicting Speech Intelligibility from a Population of Neurons

11 0.40225011 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

12 0.38218489 115 nips-2003-Linear Dependent Dimensionality Reduction

13 0.38141251 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

14 0.37598586 128 nips-2003-Minimax Embeddings

15 0.37051579 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

16 0.36890566 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

17 0.36822143 27 nips-2003-Analytical Solution of Spike-timing Dependent Plasticity Based on Synaptic Biophysics

18 0.36678165 119 nips-2003-Local Phase Coherence and the Perception of Blur

19 0.36518267 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

20 0.36089316 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks