nips nips2002 nips2002-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and high-frequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range.

1 Introduction

In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering requires multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference.
Monaural separation has also been studied using phase-based decomposition [3] and statistical learning [17], but with only limited evaluation.

While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity.

In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems.

We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2].
Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech.

Section 2 describes the overall system. In Section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper.

2 Model description

Our model is a multistage system, as shown in Fig. 1. Each stage is described below.

Figure 1. Schematic diagram of the proposed multistage system. Stages, in order: mixture, peripheral and mid-level processing, initial segregation, pitch tracking, unit labeling, final segregation, resynthesis, segregated speech.

2.1 Initial processing

First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of time-frequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with passband [0, 1 kHz] and a Kaiser window of 18.25 ms.
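The framing step can be illustrated with a short sketch. This is not the authors' code: it assumes one filter-channel response is already available as a sequence of samples, and the function name `frame_channel` is hypothetical. It slices the channel into 20 ms frames advanced by 10 ms, so that each returned slice plays the role of one T-F unit in that channel.

```python
def frame_channel(response, fs, frame_ms=20.0, hop_ms=10.0):
    """Split one filter-channel response into overlapping frames.

    response: samples of a single channel (assumed to be precomputed).
    fs: sampling frequency in Hz.
    Returns a list of frames; frame j starts at sample j * hop.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame and hop lengths must be positive")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(response):
        frames.append(list(response[start:start + frame_len]))
        start += hop
    return frames
```

With an assumed 16 kHz sampling rate, each frame holds 320 samples and consecutive frames share 160 of them. Dropping the trailing partial frame is a choice made for this sketch, not something the paper specifies.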
Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close.

Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high cross-channel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18].

Fig. 2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments.

2.2 Pitch tracking

In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream.
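As a rough sketch of how a pitch period can be read off a summary correlogram, the code below sums per-channel autocorrelations over one analysis window and returns the lag with the largest sum. Several things here are assumptions rather than details from the paper: the plain unnormalized autocorrelation over a fixed window, the function names, and the requirement that each channel holds at least `start + window + max_lag` samples. The default search range is the plausible pitch range of 2 ms to 12.5 ms used in this section.

```python
import math

def autocorrelation(x, start, window, max_lag):
    """A(lag) = sum over a fixed window of x[t] * x[t + lag], lag = 0..max_lag."""
    return [sum(x[start + t] * x[start + t + lag] for t in range(window))
            for lag in range(max_lag + 1)]

def summary_pitch_lag(channels, fs, start, window, min_ms=2.0, max_ms=12.5):
    """Pick the lag (in samples) that maximizes the summed autocorrelation.

    channels: one sample sequence per selected filter channel.
    Returns (best_lag, summary), where summary[lag] is the channel sum.
    """
    lo = int(math.ceil(fs * min_ms / 1000.0))
    hi = int(math.floor(fs * max_ms / 1000.0))
    acfs = [autocorrelation(ch, start, window, hi) for ch in channels]
    summary = [sum(acf[lag] for acf in acfs) for lag in range(hi + 1)]
    best_lag = max(range(lo, hi + 1), key=lambda lag: summary[lag])
    return best_lag, summary
```

Passing all channels gives a global estimate, while passing only the channels of units in the foreground stream corresponds to the target-pitch estimate described in this section. Dividing the returned lag by `fs` converts it to seconds.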
At each frame, the autocorrelation functions of T-F units in the foreground stream are summed. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability.

First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ(j) represent the estimated pitch period, and A(i, j, τ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ(j) if

$$ \frac{A(i, j, \tau(j))}{A(i, j, \tau_m)} > \theta_d \tag{1} $$

Here, θd = 0.95, the same threshold used in [18], and τm is the lag corresponding to the maximum of A(i, j, τ) within [2 ms, 12.5 ms]. τ(j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it.

Figure 2. Results of initial segregation for a speech and cocktail-party mixture (axes: time in seconds, frequency in Hz). (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream.

Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate that the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ(j) is reliable while τ(j+1) is not. All the channels corresponding to the T-F units agreeing with τ(j) are selected. τ(j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig. 3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well.

Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x” (axes: time in seconds, pitch period in ms). The solid line indicates the pitch contour obtained from clean speech.

2.3 Unit labeling

With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred to as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation.

However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred to as the AM criterion.

Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz (axis: time in ms). (b) Corresponding autocorrelation (axis: lag in ms). The vertical line marks the position corresponding to the pitch period of target speech.

To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically,

$$ (f_{ij}, \phi_{ij}) = \arg\min_{f, \phi} \sum_{k=1}^{M} \left[ \hat{r}(i, jT - k) - \sin(2\pi k f / f_S + \phi) \right]^2, \quad f \in [80\ \mathrm{Hz}, 500\ \mathrm{Hz}], \tag{2} $$

where a square error measure is used. r̂(i, t) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T = 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]).

The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch:

$$ \left| f_{ij}\, \tau(j) - 1 \right| < \theta_f \tag{3} $$

Psychoacoustic evidence suggests that separating sounds with overlapping spectra requires a 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12.
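The two labeling criteria can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper solves (2) by iterative gradient descent, whereas the sketch substitutes an exhaustive grid search over frequency and phase. The grid resolution (`f_step`, `n_phase`) and all function names are hypothetical. The input `x` is assumed to be the normalized response of one T-F unit with samples already ordered by k = 1..M, and `pitch_period_s` is τ(j) expressed in seconds.

```python
import math

def agrees_with_pitch(acf, pitch_lag, lo, hi, theta_d=0.95):
    """Periodicity criterion (1): A(tau(j)) / A(tau_m) > theta_d.

    acf: autocorrelation of one T-F unit, indexed by lag in samples.
    lo, hi: plausible pitch range in samples (inclusive).
    """
    peak = max(acf[lo:hi + 1])
    return peak > 0 and acf[pitch_lag] / peak > theta_d

def fit_am_rate(x, fs, f_lo=80.0, f_hi=500.0, f_step=1.0, n_phase=64):
    """Grid-search stand-in for (2). Returns (f, phi, squared_error)."""
    m = len(x)
    energy = sum(v * v for v in x)
    best = (float("inf"), None, None)
    n_f = int(round((f_hi - f_lo) / f_step)) + 1
    for n in range(n_f):
        f = f_lo + n * f_step
        w = 2.0 * math.pi * f / fs
        s = [math.sin(w * k) for k in range(1, m + 1)]
        c = [math.cos(w * k) for k in range(1, m + 1)]
        xs = sum(a * b for a, b in zip(x, s))
        xc = sum(a * b for a, b in zip(x, c))
        ss = sum(a * a for a in s)
        cc = sum(a * a for a in c)
        sc = sum(a * b for a, b in zip(s, c))
        for p in range(n_phase):
            phi = 2.0 * math.pi * p / n_phase
            cp, sp = math.cos(phi), math.sin(phi)
            # Squared error expanded via sin(wk + phi) = sin(wk)cos(phi) + cos(wk)sin(phi).
            err = (energy - 2.0 * (cp * xs + sp * xc)
                   + cp * cp * ss + 2.0 * cp * sp * sc + sp * sp * cc)
            if err < best[0]:
                best = (err, f, phi)
    err, f, phi = best
    return f, phi, err

def am_criterion(x, fs, pitch_period_s, theta_f=0.12):
    """Label a unit as target if the fit is good and (3) holds."""
    f, _, err = fit_am_rate(x, fs)
    energy = sum(v * v for v in x)
    return err < 0.5 * energy and abs(f * pitch_period_s - 1.0) < theta_f
```

The grid search trades accuracy for simplicity: its frequency resolution is limited to `f_step`, which is small compared with the 12% tolerance in (3), but a gradient-based refinement as in [9] would be needed for finer estimates.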
2.4 Final segregation and resynthesis

For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity.

The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided.

Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. Individual units that do not belong to either stream are grouped into SF iteratively if they are labeled as target speech and lie in the neighborhood of SF. The resulting SF is the final segregated stream of target speech.

Fig. 5(a) shows the new segments generated in this process for the speech and cocktail-party mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger.
In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it.

A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19].

3 Evaluation and comparison

Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech.

Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks are also well suited to situations where more than one target needs to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5].

The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). The speech waveform resynthesized from the ideal binary mask is denoted by I(t).
Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows:

$$ R_{EL} = \frac{\sum_t e_1^2(t)}{\sum_t I^2(t)} \tag{4a} $$

$$ R_{NR} = \frac{\sum_t e_2^2(t)}{\sum_t S^2(t)} \tag{4b} $$

Figure 5. Results of final segregation for the speech and cocktail-party mixture (axes: time in seconds, frequency in Hz). (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion.

Table 1: REL and RNR

            Proposed model        Wang-Brown model
Intrusion   REL (%)   RNR (%)     REL (%)   RNR (%)
N0          2.12      0.02        6.99      0
N1          4.66      3.55        28.96     1.61
N2          1.38      1.30        5.77      0.71
N3          3.83      2.72        21.92     1.92
N4          4.00      2.27        10.22     1.41
N5          2.83      0.10        7.47      0
N6          1.61      0.30        5.99      0.48
N7          3.21      2.18        8.61      4.23
N8          1.82      1.48        7.27      0.48
N9          8.57      19.33       15.81     33.03
Average     3.40      3.32        11.91     4.39

Figure 6. SNR results for segregated speech (axes: intrusion type N0 to N9, SNR in dB). White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures.

The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system.
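The ideal binary mask and the two error measures can be written compactly. The sketch below makes several assumptions: per-unit target and intrusion energies are already available as nested lists indexed by channel and frame, the waveforms I(t), S(t), e1(t), and e2(t) have already been resynthesized elsewhere, and all function names are hypothetical. It shows only the mask rule from this section and the ratios in (4a) and (4b).

```python
def ideal_binary_mask(target_energy, intrusion_energy):
    """Assign 1 to a T-F unit when target energy exceeds intrusion energy, else 0.

    Both arguments are lists of channels, each a list of per-frame energies.
    """
    return [[1 if t > n else 0 for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target_energy, intrusion_energy)]

def energy(x):
    """Sum of squared samples of a waveform."""
    return sum(v * v for v in x)

def relative_energy_loss(e1, ideal):
    """REL as in (4a): energy of e1(t) relative to the energy of I(t)."""
    return energy(e1) / energy(ideal)

def relative_noise_residue(e2, segregated):
    """RNR as in (4b): energy of e2(t) relative to the energy of S(t)."""
    return energy(e2) / energy(segregated)
```

Multiplying either ratio by 100 gives the percentages reported in Table 1. Units with equal target and intrusion energy receive 0 here, which follows the "greater than" wording of the mask definition.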
Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth:

$$ SNR = 10 \log_{10} \left[ \frac{\sum_t I^2(t)}{\sum_t \left( I(t) - S(t) \right)^2} \right] \tag{5} $$

The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB and a 5 dB improvement over the Wang-Brown model.

4 Discussion

The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved.

The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited.

Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study.

Acknowledgments

We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP.
This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027).

References

[1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990.
[2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994.
[3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999.
[4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993.
[5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001.
[6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995.
[7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996.
[8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.)
[9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu)
[10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001.
[11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989.
[12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988.
[13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997.
[14] D. O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000.
[15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988.
[16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968.
[17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001.
[18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999.
[19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. [sent-5, score-0.57]
2 A major problem for these systems is their inability to deal with speech in the highfrequency range. [sent-6, score-0.354]
3 Motivated by this, we propose a model for monaural separation that deals with low-frequency and highfrequency signals differently. [sent-8, score-0.317]
4 For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. [sent-9, score-0.485]
5 For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. [sent-10, score-0.707]
6 Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. [sent-11, score-0.638]
7 1 In t rod u ct i on In a natural environment, speech usually occurs simultaneously with acoustic interference. [sent-13, score-0.439]
8 An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. [sent-14, score-0.511]
9 In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. [sent-16, score-0.579]
10 Various algorithms have been proposed for monaural speech enhancement [14]. [sent-17, score-0.507]
11 While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. [sent-20, score-0.944]
12 According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. [sent-21, score-0.443]
13 Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). [sent-23, score-0.465]
14 In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. [sent-25, score-0.328]
15 In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. [sent-26, score-1.218]
16 However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. [sent-28, score-0.468]
17 In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. [sent-29, score-0.318]
18 We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. [sent-30, score-0.599]
19 For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. [sent-31, score-0.405]
20 It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2]. [sent-32, score-0.51]
21 Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. [sent-33, score-0.63]
22 More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. [sent-34, score-1.027]
23 For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. [sent-35, score-0.855]
24 Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. [sent-36, score-0.707]
25 The segments are grouped according to AM repetition rates. [sent-37, score-0.47]
26 We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. [sent-38, score-0.787]
27 The envelope of the response is obtained by a lowpass filter with Segregated Speech Mixture Peripheral and Initial Pitch mid-level segregation tracking processing Unit Final Resynthesis labeling segregation Figure 1. [sent-50, score-0.406]
28 The global pitch is obtained from the summary correlogram. [sent-56, score-0.459]
29 For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. [sent-57, score-0.978]
30 With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. [sent-58, score-0.73]
31 Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. [sent-59, score-0.621]
32 Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. [sent-62, score-0.499]
33 Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. [sent-63, score-0.946]
34 Since the intrusion is not strongly structured, most segments correspond to target speech. [sent-70, score-0.603]
35 The initial foreground stream successfully groups most of the major segments. [sent-72, score-0.344]
36 2 P i t c h tr a c k i n g In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. [sent-74, score-0.766]
37 To obtain accurate pitch information, target pitch is first estimated from the foreground stream. [sent-75, score-1.305]
38 At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. [sent-76, score-0.533]
39 The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12. [sent-77, score-1.019]
40 First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. [sent-80, score-1.068]
41 At frame j, let τ ( j) represent the estimated pitch period, and A(i, j,τ ) the autocorrelation function of uij, the unit in channel i. [sent-81, score-0.695]
42 Results of initial segregation for a speech and cocktail-party mixture. [sent-87, score-0.425]
43 τ ( j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it. [sent-94, score-0.576]
44 Second, pitch periods in natural speech vary smoothly in time [11]. [sent-95, score-0.841]
45 We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. [sent-96, score-1.505]
46 Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. [sent-97, score-1.067]
47 Then the re-estimated pitch is further verified with the second constraint. [sent-101, score-0.459]
48 3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well. [sent-104, score-1.747]
49 3 U n i t l a be l i n g With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. [sent-106, score-1.239]
50 This criterion compares an estimated pitch period with the periodicity of the unit response. [sent-107, score-0.715]
51 It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. [sent-109, score-0.514]
52 However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. [sent-110, score-0.557]
53 4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. [sent-113, score-0.676]
54 Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. [sent-115, score-0.482]
55 Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. [sent-116, score-0.701]
56 To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible harmonics except for the F0 component. [sent-118, score-0.376]
57 Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. [sent-120, score-0.927]
58 The solid line indicates the pitch contour obtained from clean speech. [sent-121, score-0.532]
59 The vertical line marks the position corresponding to the pitch period of target speech. [sent-127, score-0.676]
60 Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. [sent-130, score-1.019]
61 However, we are interested in finding a frequency within the plausible pitch range. [sent-135, score-0.499]
62 The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. [sent-138, score-0.593]
63 Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch: $|f_{ij}\,\tau(j) - 1| < \theta_f$. [sent-139, score-1.065]
64 2.4 Final segregation and resynthesis. For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. [sent-144, score-0.367]
65 Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. [sent-145, score-0.679]
66 The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. [sent-146, score-1.817]
67 Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. [sent-147, score-0.634]
68 First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. [sent-149, score-1.085]
69 Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. [sent-150, score-0.483]
70 Individual units that do not belong to either stream are grouped into SF iteratively if they are labeled as target speech and lie in the neighborhood of SF. [sent-155, score-0.786]
71 The resulting SF is the final segregated stream of target speech. [sent-156, score-0.454]
72 5(a) shows the new segments generated in this process for the speech and cocktail-party mixture. [sent-158, score-0.572]
73 5(c) shows all the units where target speech is stronger than intrusion. [sent-162, score-0.622]
74 The foreground stream generated by our algorithm contains most of the units where target speech is stronger. [sent-163, score-0.934]
75 In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. [sent-164, score-0.427]
76 A speech waveform is resynthesized from the final foreground stream. [sent-165, score-0.693]
77 Here, the foreground stream works as a binary mask. [sent-166, score-0.344]
78 Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. [sent-172, score-0.381]
79 The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. [sent-173, score-0.753]
80 Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. [sent-175, score-0.506]
81 Ideal masks are also well suited to situations where more than one target needs to be segregated or the target changes dynamically. [sent-176, score-0.426]
82 The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. [sent-177, score-0.303]
83 In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5]. [sent-178, score-0.372]
84 The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). [sent-179, score-0.836]
85 The speech waveform resynthesized from the ideal binary mask is denoted by I(t). [sent-180, score-0.609]
86 Results of final segregation for the speech and cocktail-party mixture. [sent-187, score-0.496]
87 (c) Units where target speech is stronger than the intrusion. [sent-190, score-0.5]
88 60% of target speech energy, and the relative residual noise is kept at 3. [sent-240, score-0.468]
89 Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth: $\mathrm{SNR} = 10 \log_{10} \left[ \sum_t I^2(t) \big/ \sum_t (I(t) - S(t))^2 \right]$ (5). [sent-245, score-0.303]
90 The SNR for each intrusion averaged across 10 target utterances is shown in Fig. [sent-246, score-0.349]
91 As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved. [sent-250, score-1.222]
92 The proposed system considers the pitch contour of the target source only. [sent-251, score-0.653]
93 However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. [sent-252, score-0.748]
94 With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. [sent-253, score-0.613]
95 In our view, unvoiced speech poses the biggest challenge for monaural speech separation. [sent-260, score-0.868]
96 Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms? [sent-277, score-0.725]
97 Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm. [sent-297, score-0.439]
98 Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. [sent-325, score-0.884]
99 Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. [sent-377, score-0.351]
100 Weintraub, A theory and computational model of auditory monaural sound separation, Ph. [sent-382, score-0.308]
wordName wordTfidf (topN-words)
[('pitch', 0.459), ('speech', 0.318), ('segments', 0.254), ('harmonics', 0.214), ('foreground', 0.201), ('intrusion', 0.199), ('monaural', 0.189), ('unresolved', 0.158), ('target', 0.15), ('stream', 0.143), ('repetition', 0.142), ('units', 0.122), ('acoustic', 0.121), ('auditory', 0.119), ('mask', 0.115), ('resolved', 0.108), ('segregation', 0.107), ('rel', 0.094), ('separation', 0.092), ('intrusions', 0.09), ('rnr', 0.09), ('segregated', 0.09), ('snr', 0.08), ('ms', 0.079), ('filter', 0.077), ('frame', 0.076), ('grouped', 0.074), ('ideal', 0.073), ('resynthesized', 0.072), ('continuity', 0.072), ('interference', 0.072), ('envelope', 0.072), ('voiced', 0.072), ('final', 0.071), ('period', 0.067), ('autocorrelation', 0.067), ('periodicity', 0.067), ('periods', 0.064), ('sf', 0.063), ('casa', 0.063), ('unit', 0.057), ('grouping', 0.055), ('asa', 0.054), ('gammatone', 0.054), ('ohio', 0.054), ('energy', 0.051), ('temporal', 0.051), ('sec', 0.051), ('wang', 0.048), ('uij', 0.047), ('harmonic', 0.046), ('contour', 0.044), ('signal', 0.043), ('psychoacoustic', 0.043), ('cooke', 0.043), ('unvoiced', 0.043), ('response', 0.043), ('scene', 0.041), ('specifically', 0.041), ('khz', 0.041), ('frequency', 0.04), ('responses', 0.038), ('hz', 0.037), ('channels', 0.037), ('columbus', 0.036), ('dissertation', 0.036), ('ellis', 0.036), ('highfrequency', 0.036), ('masks', 0.036), ('periodicities', 0.036), ('resynthesis', 0.036), ('son', 0.036), ('undecided', 0.036), ('hu', 0.036), ('estimated', 0.036), ('lag', 0.034), ('reliable', 0.034), ('streams', 0.033), ('sounds', 0.033), ('stronger', 0.032), ('carlyon', 0.031), ('envelopes', 0.031), ('filtered', 0.031), ('asr', 0.031), ('autocorrelations', 0.031), ('oh', 0.031), ('waveform', 0.031), ('mechanisms', 0.03), ('label', 0.03), ('consecutive', 0.03), ('responding', 0.03), ('brown', 0.03), ('adjacent', 0.03), ('addition', 0.03), ('criterion', 0.029), ('clean', 0.029), ('rectified', 0.029), ('hearing', 0.029), ('multistage', 0.029), 
('cocktail', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 147 nips-2002-Monaural Speech Separation
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and high-frequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range.

1 Introduction

In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference.
Monaural separation has also been studied using phase-based decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2].
Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper.

2 Model description

Our model is a multistage system, as shown in Fig. 1. A description of each stage is given below.

[Figure 1. Schematic diagram of the proposed multistage system. Blocks: Mixture; Peripheral and mid-level processing; Initial segregation; Pitch tracking; Unit labeling; Final segregation; Resynthesis; Segregated speech.]

2.1 Initial processing

First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of time-frequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with passband [0, 1 kHz] and a Kaiser window of 18.25 ms.
Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high cross-channel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18]. Fig. 2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments.

2.2 Pitch tracking

In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream.
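As a rough illustration of how a pitch period can be read off a summary correlogram, the sketch below sums per-channel autocorrelations for one frame and picks the lag with the largest sum inside the plausible pitch range of 2 ms to 12.5 ms. The function name, the list-of-lists layout, and the integer-lag indexing are assumptions made for this example; they are not taken from the paper.

```python
from typing import Sequence


def estimate_pitch_period(
    autocorrs: Sequence[Sequence[float]],
    sample_rate_hz: float,
    min_period_s: float = 0.002,
    max_period_s: float = 0.0125,
) -> float:
    """Return a pitch period in seconds for a single frame.

    autocorrs[c][lag] is assumed to hold the autocorrelation of channel c
    at an integer lag measured in samples (a hypothetical layout).
    """
    if not autocorrs:
        raise ValueError("need at least one channel")
    n_lags = min(len(a) for a in autocorrs)
    # Convert the plausible pitch range from seconds to lag indices.
    lo = max(1, round(min_period_s * sample_rate_hz))
    hi = min(n_lags - 1, round(max_period_s * sample_rate_hz))
    if lo > hi:
        raise ValueError("autocorrelations are too short for the pitch range")
    # Summary correlogram: sum the autocorrelations across channels.
    summary = [sum(a[lag] for a in autocorrs) for lag in range(n_lags)]
    # The pitch period is the lag of the largest summary value in range.
    best_lag = max(range(lo, hi + 1), key=lambda lag: summary[lag])
    return best_lag / sample_rate_hz
```

Passing only the channels of foreground-stream units, rather than all channels, would correspond to estimating the target pitch instead of the global pitch.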
At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability. First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ(j) represent the estimated pitch period, and A(i, j, τ) the autocorrelation function of u_ij, the unit in channel i. u_ij agrees with τ(j) if

$$\frac{A(i, j, \tau(j))}{A(i, j, \tau_m)} > \theta_d. \tag{1}$$

Here, θ_d = 0.95, the same threshold used in [18], and τ_m is the lag corresponding to the maximum of A(i, j, τ) within [2 ms, 12.5 ms]. τ(j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it.

[Figure 2. Results of initial segregation for a speech and cocktail-party mixture. (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream. Axes: time (s), frequency (Hz).]

Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ(j) is reliable while τ(j+1) is not. All the channels corresponding to the T-F units agreeing with τ(j) are selected. τ(j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig.
3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well.

2.3 Unit labeling

With estimated pitch periods, Eq. (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred to as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred to as the AM criterion.

[Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. The solid line indicates the pitch contour obtained from clean speech. Axes: time (s), pitch period (ms).]

[Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz. (b) Corresponding autocorrelation. The vertical line marks the position corresponding to the pitch period of target speech. Axes: (a) time (ms); (b) lag (ms).]

To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible
harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically,

$$(f_{ij}, \phi_{ij}) = \arg\min_{f, \phi} \sum_{k=1}^{M} \left[ \hat{r}(i, jT - k) - \sin(2\pi k f / f_S + \phi) \right]^2, \quad f \in [80\ \mathrm{Hz}, 500\ \mathrm{Hz}], \tag{2}$$

where a square error measure is used. $\hat{r}(i, t)$ is the normalized filter response, $f_S$ is the sampling frequency, M spans a frame, and T = 10 ms is the progressing period from one frame to the next. In the above equation, $f_{ij}$ gives the AM repetition rate for unit u_ij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]). The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit u_ij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch:

$$|f_{ij}\,\tau(j) - 1| < \theta_f. \tag{3}$$

Psychoacoustic evidence suggests that separating sounds with overlapping spectra requires a 6-12% difference in F0 [6]. Accordingly, we choose θ_f to be 0.12.
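The two labeling criteria can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names and the data layout are hypothetical, and the sinusoid fit of Eq. (2) is done here by a plain grid search over frequency and phase, swapped in for the iterative gradient descent described in [9].

```python
import math
from typing import Sequence, Tuple


def agrees_with_pitch(autocorr: Sequence[float], pitch_lag: int,
                      lo_lag: int, hi_lag: int, theta_d: float = 0.95) -> bool:
    """Periodicity criterion, Eq. (1), for one T-F unit (lags in samples)."""
    peak = max(autocorr[lo_lag:hi_lag + 1])  # A(i, j, tau_m)
    return peak > 0.0 and autocorr[pitch_lag] / peak > theta_d


def pitch_is_reliable(autocorrs: Sequence[Sequence[float]], pitch_lag: int,
                      lo_lag: int, hi_lag: int) -> bool:
    """tau(j) counts as reliable if more than half of the units agree."""
    votes = sum(agrees_with_pitch(a, pitch_lag, lo_lag, hi_lag)
                for a in autocorrs)
    return 2 * votes > len(autocorrs)


def fit_am_rate(r: Sequence[float], sample_rate_hz: float,
                f_lo: float = 80.0, f_hi: float = 500.0,
                f_step: float = 1.0, n_phases: int = 32) -> Tuple[float, float]:
    """Grid-search stand-in for Eq. (2).

    r[k - 1] plays the role of r_hat(i, jT - k) for k = 1..M.
    Returns (best frequency in Hz, squared error at that frequency).
    """
    best_f, best_err = f_lo, math.inf
    n_freqs = int(math.floor((f_hi - f_lo) / f_step)) + 1
    for fi in range(n_freqs):
        f = f_lo + fi * f_step
        w = 2.0 * math.pi * f / sample_rate_hz
        for p in range(n_phases):
            phi = 2.0 * math.pi * p / n_phases
            err = sum((x - math.sin(w * k + phi)) ** 2
                      for k, x in enumerate(r, start=1))
            if err < best_err:
                best_f, best_err = f, err
    return best_f, best_err


def label_by_am(r: Sequence[float], sample_rate_hz: float,
                pitch_period_s: float, theta_f: float = 0.12,
                **fit_kwargs) -> bool:
    """AM criterion: small fitting error and Eq. (3) both hold."""
    f, err = fit_am_rate(r, sample_rate_hz, **fit_kwargs)
    energy = sum(x * x for x in r)
    return err < 0.5 * energy and abs(f * pitch_period_s - 1.0) < theta_f
```

The grid search is slow for long frames and fine grids; it is used here only because it is easy to verify, and a gradient-based refinement would be the natural replacement.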
2.4 Final segregation and resynthesis

For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided. Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. Individual units that do not belong to either stream are grouped into SF iteratively if they are labeled as target speech and lie in the neighborhood of SF. The resulting SF is the final segregated stream of target speech. Fig. 5(a) shows the new segments generated in this process for the speech and cocktail-party mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger.
In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19].

3 Evaluation and comparison

Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech. Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks are also well suited to situations where more than one target needs to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5]. The speech waveform resynthesized from the ideal binary mask is denoted by I(t), and the speech waveform resynthesized from the final foreground stream, which is used for evaluation, is denoted by S(t).
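The construction of the ideal binary mask is simple enough to write out. The sketch below assumes that per-unit target and intrusion energies are available as channel-by-frame nested lists, which is only possible when the premixed signals are known; the function names and the layout are hypothetical. Applying the mask to per-unit energies is a simplification, since actual resynthesis operates on the filter responses (see [19]).

```python
from typing import List, Sequence


def ideal_binary_mask(
    target_energy: Sequence[Sequence[float]],
    intrusion_energy: Sequence[Sequence[float]],
) -> List[List[int]]:
    """Assign 1 to a T-F unit if target energy exceeds intrusion energy, else 0.

    Both inputs are indexed as [channel][frame] and must have equal shapes.
    """
    if len(target_energy) != len(intrusion_energy):
        raise ValueError("channel counts differ")
    mask: List[List[int]] = []
    for t_row, n_row in zip(target_energy, intrusion_energy):
        if len(t_row) != len(n_row):
            raise ValueError("frame counts differ")
        mask.append([1 if t > n else 0 for t, n in zip(t_row, n_row)])
    return mask


def apply_mask(mask: Sequence[Sequence[int]],
               mixture_energy: Sequence[Sequence[float]]) -> List[List[float]]:
    """Keep mixture energy where the mask is 1 and reject it where it is 0."""
    return [[m * e for m, e in zip(m_row, e_row)]
            for m_row, e_row in zip(mask, mixture_energy)]
```

Units with equal target and intrusion energy receive 0 here, following the "greater than" wording of the definition.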
Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows:

$$R_{EL} = \frac{\sum_t e_1^2(t)}{\sum_t I^2(t)}, \tag{4a}$$

$$R_{NR} = \frac{\sum_t e_2^2(t)}{\sum_t S^2(t)}. \tag{4b}$$

[Figure 5. Results of final segregation for the speech and cocktail-party mixture. (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion. Axes: time (s), frequency (Hz).]

Table 1: REL and RNR

            Proposed model         Wang-Brown model
Intrusion   REL (%)   RNR (%)      REL (%)   RNR (%)
N0           2.12      0.02         6.99      0
N1           4.66      3.55        28.96      1.61
N2           1.38      1.30         5.77      0.71
N3           3.83      2.72        21.92      1.92
N4           4.00      2.27        10.22      1.41
N5           2.83      0.10         7.47      0
N6           1.61      0.30         5.99      0.48
N7           3.21      2.18         8.61      4.23
N8           1.82      1.48         7.27      0.48
N9           8.57     19.33        15.81     33.03
Average      3.40      3.32        11.91      4.39

[Figure 6. SNR results for segregated speech. White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures. Axes: intrusion type (N0 to N9), SNR (dB).]

The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system.
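For readers who want to reproduce the measures, the following sketch computes REL and RNR from Eqs. (4a) and (4b), together with the SNR measure that the paper defines in Eq. (5). It assumes that I(t), S(t), e1(t), and e2(t) are already available as equal-length sample sequences; obtaining e1 and e2 requires a separate resynthesis step that is not shown, and the function names are illustrative only.

```python
import math
from typing import Sequence


def energy(x: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(v * v for v in x)


def relative_energy_loss(e1: Sequence[float], ideal: Sequence[float]) -> float:
    """REL, Eq. (4a): energy of e1(t) divided by energy of I(t)."""
    return energy(e1) / energy(ideal)


def relative_noise_residue(e2: Sequence[float],
                           segregated: Sequence[float]) -> float:
    """RNR, Eq. (4b): energy of e2(t) divided by energy of S(t)."""
    return energy(e2) / energy(segregated)


def snr_db(ideal: Sequence[float], segregated: Sequence[float]) -> float:
    """SNR in dB, Eq. (5), with I(t) as ground truth."""
    if len(ideal) != len(segregated):
        raise ValueError("signals must have the same length")
    noise = energy([i - s for i, s in zip(ideal, segregated)])
    if noise == 0.0:
        return math.inf  # S(t) matches I(t) exactly
    return 10.0 * math.log10(energy(ideal) / noise)
```

Multiplying the two ratios by 100 gives percentages comparable to those in Table 1.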
Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth:

$$\mathrm{SNR} = 10 \log_{10} \left[ \frac{\sum_t I^2(t)}{\sum_t (I(t) - S(t))^2} \right]. \tag{5}$$

The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB and 5 dB improvement over the Wang-Brown model.

4 Discussion

The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved. The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited. Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study.

Acknowledgments

We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP.
This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027).

References

[1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990.
[2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994.
[3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999.
[4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993.
[5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001.
[6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995.
[7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996.
[8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.)
[9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu)
[10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001.
[11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989.
[12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988.
[13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997.
[14] D.
O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000. [15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988. [16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968. [17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001. [18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999. [19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
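The SNR measure of Eq. (5) in the evaluation above can be sketched in a few lines. This is a minimal sketch rather than the authors' code: the function name `snr_db` and the list-of-floats signal representation are assumptions, with `ideal` standing for the ground-truth signal I(t) and `estimate` for the signal S(t) being scored.

```python
import math


def snr_db(ideal, estimate):
    """SNR in dB as in Eq. (5): 10*log10(sum I^2 / sum (I - S)^2)."""
    if len(ideal) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(i * i for i in ideal)
    error_energy = sum((i - s) ** 2 for i, s in zip(ideal, estimate))
    if error_energy == 0.0:
        return math.inf       # estimate matches the ground truth exactly
    if signal_energy == 0.0:
        return -math.inf      # silent ground truth, non-zero error
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical toy signals: a 10% amplitude error gives about 20 dB.
print(snr_db([1.0, 0.0, -1.0, 0.0], [0.9, 0.0, -0.9, 0.0]))
```

Because the ideal-mask resynthesis is the reference, this measure scores only how close the system output is to the best result a binary mask can achieve.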
2 0.47549543 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann L. Cun
Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.
3 0.17326669 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
Author: Gil-jin Jang, Te-Won Lee
Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.
4 0.17158899 12 nips-2002-A Neural Edge-Detection Model for Enhanced Auditory Sensitivity in Modulated Noise
Author: Alon Fishbach, Bradford J. May
Abstract: Psychophysical data suggest that temporal modulations of stimulus amplitude envelopes play a prominent role in the perceptual segregation of concurrent sounds. In particular, the detection of an unmodulated signal can be significantly improved by adding amplitude modulation to the spectral envelope of a competing masking noise. This perceptual phenomenon is known as “Comodulation Masking Release” (CMR). Despite the obvious influence of temporal structure on the perception of complex auditory scenes, the physiological mechanisms that contribute to CMR and auditory streaming are not well known. A recent physiological study by Nelken and colleagues has demonstrated an enhanced cortical representation of auditory signals in modulated noise. Our study evaluates these CMR-like response patterns from the perspective of a hypothetical auditory edge-detection neuron. It is shown that this simple neural model for the detection of amplitude transients can reproduce not only the physiological data of Nelken et al., but also, in light of previous results, a variety of physiological and psychoacoustical phenomena that are related to the perceptual segregation of concurrent sounds.

1 Introduction

The temporal structure of a complex sound exerts strong influences on auditory physiology (e.g. [10, 16]) and perception (e.g. [9, 19, 20]). In particular, studies of auditory scene analysis have demonstrated the importance of the temporal structure of amplitude envelopes in the perceptual segregation of concurrent sounds [2, 7]. Common amplitude transitions across frequency serve as salient cues for grouping sound energy into unified perceptual objects. Conversely, asynchronous amplitude transitions enhance the separation of competing acoustic events [3, 4]. These general principles are manifested in perceptual phenomena as diverse as comodulation masking release (CMR) [13], modulation detection interference [22] and synchronous onset grouping [8].
Despite the obvious importance of timing information in psychoacoustic studies of auditory masking, the way in which the CNS represents the temporal structure of an amplitude envelope is not well understood. Certainly many physiological studies have demonstrated neural sensitivities to envelope transitions, but this sensitivity is only beginning to be related to the variety of perceptual experiences that are evoked by signals in noise. Nelken et al. [15] have suggested a correspondence between neural responses to time-varying amplitude envelopes and psychoacoustic masking phenomena. In their study of neurons in primary auditory cortex (A1), adding temporal modulation to background noise lowered the detection thresholds of unmodulated tones. This enhanced signal detection is similar to the perceptual phenomenon that is known as comodulation masking release [13]. Fishbach et al. [11] have recently proposed a neural model for the detection of “auditory edges” (i.e., amplitude transients) that can account for numerous physiological [14, 17, 18] and psychoacoustical [3, 21] phenomena. The encompassing utility of this edge-detection model suggests a common mechanism that may link the auditory processing and perception of auditory signals in a complex auditory scene. Here, it is shown that the auditory edge detection model can accurately reproduce the cortical CMR-like responses previously described by Nelken and colleagues.

2 The Model

The model is described in detail elsewhere [11]. In short, the basic operation of the model is the calculation of the first-order time derivative of the log-compressed envelope of the stimulus. A computational model [23] is used to convert the acoustic waveform to a physiologically plausible auditory nerve representation (Fig. 1a). The simulated neural response has a medium spontaneous rate and a characteristic frequency that is set to the frequency of the target tone.
To allow computation of the time derivative of the stimulus envelope, we hypothesize the existence of a temporal delay dimension, along which the stimulus is progressively delayed. The intermediate delay layer (Fig. 1b) is constructed from an array of neurons with ascending membrane time constants (τ); each neuron is modeled by a conventional integrate-and-fire model (I&F, [12]). A higher membrane time constant induces a greater delay in the neuron’s response [1]. The output of the delay layer converges to a single output neuron (Fig. 1c) via a set of connections with various efficacies that reflect a receptive field of a gaussian derivative. This combination of excitatory and inhibitory connections carries out the time-derivative computation. Implementation details and parameters are given in [11]. The model has 2 adjustable and 6 fixed parameters; the former were used to fit the responses of the model to single unit responses to a variety of stimuli [11]. The results reported here are not sensitive to these parameters.

[Figure 1 panels: (a) AN model; (b) delay layer; (c) edge-detector neuron.]

Figure 1: Schematic diagram of the model and a block diagram of the basic operation of each model component (shaded area). The stimulus is converted to a neural representation (a) that approximates the average firing rate of a medium spontaneous-rate AN fiber [23]. The operation of this stage can be roughly described as the log-compressed rms output of a bandpass filter. The neural representation is fed to a series of neurons with ascending membrane time constant (b). The kernel functions that are used to simulate these neurons are plotted for a few neurons along with the time constants used. The output of the delay-layer neurons converge to a single I&F neuron (c) using a set of connections with weights that reflect a shape of a gaussian derivative. Solid arrows represent excitatory connections and white arrows represent inhibitory connections.
The absolute efficacy is represented by the width of the arrows.

3 Results

Nelken et al. [15] report that amplitude modulation can substantially modify the noise-driven discharge rates of A1 neurons in halothane-anesthetized cats. Many cortical neurons show only a transient onset response to unmodulated noise but fire in synchrony (“lock”) to the envelope of modulated noise. A significant reduction in envelope-locked discharge rates is observed if an unmodulated tone is added to modulated noise. As summarized in Fig. 2, this suppression of envelope locking can reveal the presence of an auditory signal at sound pressure levels that are not detectable in unmodulated noise. It has been suggested that this pattern of neural responding may represent a physiological equivalent of CMR. Reproduction of CMR-like cortical activity can be illustrated by a simplified case in which the analytical amplitude envelope of the stimulus is used as the input to the edge-detector model. In keeping with the actual physiological approach of Nelken et al., the noise envelope is shaped by a trapezoid modulator for these simulations. Each cycle of modulation, $E_N(t)$, is given by:

$$E_N(t) = \begin{cases} \frac{P}{D}\, t, & 0 \le t < D \\ P, & D \le t < 3D \\ P - \frac{P}{D}\,(t - 3D), & 3D \le t < 4D \\ 0, & 4D \le t < 8D \end{cases}$$

where P is the peak pressure level and D is set to 12.5 ms.

[Figure 2 panels: (a) unmodulated noise; (b) modulated noise. Axes: time (ms) against tone level (dB SPL); responses in spikes/sec.]

Figure 2: Responses of an A1 unit to a combination of noise and tone at many tone levels, replotted from Nelken et al. [15]. (a) Unmodulated noise and (b) modulated noise. The noise envelope is illustrated by the thick line above each figure. Each row shows the response of the neuron to the noise plus the tone at the level specified on the ordinate. The dashed line in (b) indicates the detection threshold level for the tone. The detection threshold (as defined and calculated by Nelken et al.) in the unmodulated noise was not reached.
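The trapezoid modulator described above can be written as a small function. This is a sketch under stated assumptions: the name `trapezoid_envelope` is hypothetical, time is taken in seconds, the cycle is assumed to repeat with period 8D, and the plateau between D and 3D is inferred from the surrounding text rather than quoted from it.

```python
def trapezoid_envelope(t, peak, d=0.0125):
    """Trapezoid noise envelope E_N(t); t and d in seconds, period 8*d."""
    t = t % (8.0 * d)                 # assumed periodic repetition of one cycle
    if t < d:                         # rising ramp
        return peak * t / d
    if t < 3.0 * d:                   # plateau at the peak pressure
        return peak
    if t < 4.0 * d:                   # falling ramp
        return peak - (peak / d) * (t - 3.0 * d)
    return 0.0                        # silent half of the cycle


# Sample one 100 ms cycle every 12.5 ms with a unit peak.
print([round(trapezoid_envelope(k * 0.0125, 1.0), 3) for k in range(8)])
```

With D = 12.5 ms the cycle lasts 100 ms, so the envelope repeats at 10 Hz.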
Since the basic operation of the model is the calculation of the rectified time-derivative of the log-compressed envelope of the stimulus, the expected noise-driven rate of the model can be approximated by:

$$M_N(t) = \max\left(0,\ \frac{d}{dt}\, A \ln\left(1 + \frac{E_N(t)}{P_0}\right)\right)$$

where $A = 20/\ln(10)$ and $P_0 = 2 \times 10^{-5}$ Pa. The expected firing rate in response to the noise plus an unmodulated signal (tone) can be similarly approximated by:

$$M_{N+S}(t) = \max\left(0,\ \frac{d}{dt}\, A \ln\left(1 + \frac{E_N(t) + P_S}{P_0}\right)\right)$$

where $P_S$ is the peak pressure level of the tone. Clearly, both $M_N(t)$ and $M_{N+S}(t)$ are identically zero outside the interval [0, D]. Within this interval it holds that:

$$M_N(t) = \frac{A\,(P/D)}{P_0 + (P/D)\,t}, \qquad 0 \le t < D$$

Clearly, $M_{N+S} < M_N$ for the interval [0, D] of each modulation cycle. That is, the addition of a tone reduces the responses of the model to the rising part of the modulated envelope. Higher tone levels ($P_S$) cause greater reduction in the model’s firing rate.

[Figure 3 panels (a)-(d). Axes: time (ms) against level (dB SPL) and level derivative (dB SPL/ms).]

Figure 3: An illustration of the basic operation of the model on various amplitude envelopes. The simplified operation of the model includes log compression of the amplitude envelope (a and c) and rectified time-derivative of the log-compressed envelope (b and d). (a) A 30 dB SPL tone is added to a modulated envelope (peak level of 70 dB SPL) 300 ms after the beginning of the stimulus (as indicated by the horizontal line). The addition of the tone causes a great reduction in the time derivative of the log-compressed envelope (b). When the envelope of the noise is unmodulated (c), the time-derivative of the log-compressed envelope (d) shows a tiny spike when the tone is added (marked by the arrow).

Fig. 3 demonstrates the effect of a low-level tone on the time-derivative of the log-compressed envelope of a noise. When the envelope is modulated (Fig. 3a) the addition of the tone greatly reduces the derivative of the rising part of the modulation (Fig. 3b).
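The closed-form rate on the rising ramp, and the claim that adding a tone lowers it, can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, the dB SPL to pascal conversion is the standard definition rather than something stated in the text, and rates come out in dB per second because time is in seconds.

```python
import math

P0 = 2e-5                   # reference pressure in Pa, as given in the text
A = 20.0 / math.log(10.0)   # makes A * ln(x) equal to 20 * log10(x)


def spl_to_pa(level_db):
    """Convert a level in dB SPL to pressure in Pa (standard definition)."""
    return P0 * 10.0 ** (level_db / 20.0)


def compressed_level(pressure):
    """Log-compressed envelope A * ln(1 + E / P0)."""
    return A * math.log(1.0 + pressure / P0)


def rising_response(t, peak, d, tone=0.0):
    """Closed-form derivative on the rising ramp, valid for 0 <= t < d."""
    slope = peak / d
    return A * slope / (P0 + tone + slope * t)


def numeric_response(t, peak, d, tone=0.0, h=1e-7):
    """Central-difference estimate of the same rectified derivative."""
    slope = peak / d
    up = compressed_level(slope * (t + h) + tone)
    down = compressed_level(slope * (t - h) + tone)
    return max(0.0, (up - down) / (2.0 * h))


peak, tone, d = spl_to_pa(70.0), spl_to_pa(30.0), 0.0125
t = d / 2.0
print(rising_response(t, peak, d), rising_response(t, peak, d, tone))
```

The tone enters only through the denominator, so a larger tone pressure always gives a smaller rate on the ramp, and the relative reduction is largest at the start of the ramp where the noise envelope is still small.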
In the absence of modulations (Fig. 3c), the tone presentation produces a negligible effect on the level derivative (Fig. 3d). Model simulations of neural responses to the stimuli used by Nelken et al. are plotted in Fig. 4. As illustrated schematically in Fig. 3(d), the presence of the tone does not cause any significant change in the responses of the model to the unmodulated noise (Fig. 4a). In the modulated noise, however, tones of relatively low levels reduce the responses of the model to the rising part of the envelope modulations.

[Figure 4 panels: (a) unmodulated noise; (b) modulated noise. Axes: time (ms) against tone level (dB SPL); responses in spikes/sec.]

Figure 4: Simulated responses of the model to a combination of a tone and unmodulated noise (a) and modulated noise (b). All conventions are as in Fig. 2.

4 Discussion

This report uses an auditory edge-detection model to simulate the actual physiological consequences of amplitude modulation on neural sensitivity in cortical area A1. The basic computational operation of the model is the calculation of the smoothed time-derivative of the log-compressed stimulus envelope. The ability of the model to reproduce cortical response patterns in detail across a variety of stimulus conditions suggests similar time-sensitive mechanisms may contribute to the physiological correlates of CMR. These findings augment our previous observations that the simple edge-detection model can successfully predict a wide range of physiological and perceptual phenomena [11]. Former applications of the model to perceptual phenomena have been mainly related to auditory scene analysis, or more specifically the ability of the auditory system to distinguish multiple sound sources. In these cases, a sharp amplitude transition at stimulus onset (“auditory edge”) was critical for sound segregation.
Here, it is shown that the detection of acoustic signals also may be enhanced through the suppression of ongoing responses to the concurrent modulations of competing background sounds. Interestingly, these temporal fluctuations appear to be a common property of natural soundscapes [15]. The model provides testable predictions regarding how signal detection may be influenced by the temporal shape of amplitude modulation. Carlyon et al. [6] measured CMR in human listeners using three types of noise modulation: square wave, sine wave and multiplied noise. From the perspective of the edge-detection model, these psychoacoustic results are intriguing because the different modulator types represent manipulations of the time derivative of masker envelopes. Square-wave modulation had the most sharply edged time derivative and produced the greatest masking release. Fig. 5 plots the responses of the model to a pure-tone signal in square-wave and sine-wave modulated noise. As in the psychoacoustical data of Carlyon et al., the simulated detection threshold was lower in the context of square-wave modulation. Our modeling results suggest that the sharply edged square wave evoked higher levels of noise-driven activity and therefore created a sensitive background for the suppressing effects of the unmodulated tone.

[Figure 5 panels (a) and (b). Axes: time (ms) against tone level (dB SPL); responses in spikes/sec.]

Figure 5: Simulated responses of the model to a combination of a tone at various levels and a sine-wave modulated noise (a) or a square-wave modulated noise (b). Each row shows the response of the model to the noise plus the tone at the level specified on the ordinate. The shape of the noise modulator is illustrated above each figure. The 100 ms tone starts 250 ms after the noise onset.
Note that the tone detection threshold (marked by the dashed line) is 10 dB lower for the square-wave modulator than for the sine-wave modulator, in accordance with the psychoacoustical data of Carlyon et al. [6].

Although the physiological basis of our model was derived from studies of neural responses in the cat auditory system, the key psychoacoustical observations of Carlyon et al. have been replicated in recent behavioral studies of cats (Budelis et al. [5]). These data support the generalization of human perceptual processing to other species and enhance the possible correspondence between the neuronal CMR-like effect and the psychoacoustical masking phenomena. Clearly, the auditory system relies on information other than the time derivative of the stimulus envelope for the detection of auditory signals in background noise. Further physiological and psychoacoustic assessments of CMR-like masking effects are needed not only to refine the predictive abilities of the edge-detection model but also to reveal the additional sources of acoustic information that influence signal detection in constantly changing natural environments.

Acknowledgments

This work was supported in part by a NIDCD grant R01 DC004841.

References

[1] Agmon-Snir H., Segev I. (1993). “Signal delay and input synchronization in passive dendritic structure”, J. Neurophysiol. 70, 2066-2085. [2] Bregman A.S. (1990). “Auditory scene analysis: The perceptual organization of sound”, MIT Press, Cambridge, MA. [3] Bregman A.S., Ahad P.A., Kim J., Melnerich L. (1994) “Resetting the pitch-analysis system. 1. Effects of rise times of tones in noise backgrounds or of harmonics in a complex tone”, Percept. Psychophys. 56 (2), 155-162. [4] Bregman A.S., Ahad P.A., Kim J. (1994) “Resetting the pitch-analysis system. 2. Role of sudden onsets and offsets in the perception of individual components in a cluster of overlapping tones”, J. Acoust. Soc. Am. 96 (5), 2694-2703.
[5] Budelis J., Fishbach A., May B.J. (2002) “Behavioral assessments of comodulation masking release in cats”, Abst. Assoc. for Res. in Otolaryngol. 25. [6] Carlyon R.P., Buus S., Florentine M. (1989) “Comodulation masking release for three types of modulator as a function of modulation rate”, Hear. Res. 42, 37-46. [7] Darwin C.J. (1997) “Auditory grouping”, Trends in Cog. Sci. 1(9), 327-333. [8] Darwin C.J., Ciocca V. (1992) “Grouping in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component”, J. Acoust. Soc. Am. 91, 3381-3390. [9] Drullman R., Festen H.M., Plomp R. (1994) “Effect of temporal envelope smearing on speech reception”, J. Acoust. Soc. Am. 95 (2), 1053-1064. [10] Eggermont J.J. (1994). “Temporal modulation transfer functions for AM and FM stimuli in cat auditory cortex. Effects of carrier type, modulating waveform and intensity”, Hear. Res. 74, 51-66. [11] Fishbach A., Nelken I., Yeshurun Y. (2001) “Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients”, J. Neurophysiol. 85, 2303–2323. [12] Gerstner W. (1999) “Spiking neurons”, in Pulsed Neural Networks, edited by W. Maass, C. M. Bishop, (MIT Press, Cambridge, MA). [13] Hall J.W., Haggard M.P., Fernandes M.A. (1984) “Detection in noise by spectrotemporal pattern analysis”, J. Acoust. Soc. Am. 76, 50-56. [14] Heil P. (1997) “Auditory onset responses revisited. II. Response strength”, J. Neurophysiol. 77, 2642-2660. [15] Nelken I., Rotman Y., Bar-Yosef O. (1999) “Responses of auditory cortex neurons to structural features of natural sounds”, Nature 397, 154-157. [16] Phillips D.P. (1988). “Effect of Tone-Pulse Rise Time on Rate-Level Functions of Cat Auditory Cortex Neurons: Excitatory and Inhibitory Processes Shaping Responses to Tone Onset”, J. Neurophysiol. 59, 1524-1539. [17] Phillips D.P., Burkard R. (1999). “Response magnitude and timing of auditory response initiation in the inferior colliculus of the awake chinchilla”, J. Acoust. Soc. Am. 105, 2731-2737. [18] Phillips D.P., Semple M.N., Kitzes L.M. (1995). “Factors shaping the tone level sensitivity of single neurons in posterior field of cat auditory cortex”, J. Neurophysiol. 73, 674-686. [19] Rosen S. (1992) “Temporal information in speech: acoustic, auditory and linguistic aspects”, Phil. Trans. R. Soc. Lond. B 336, 367-373. [20] Shannon R.V., Zeng F.G., Kamath V., Wygonski J., Ekelid M. (1995) “Speech recognition with primarily temporal cues”, Science 270, 303-304. [21] Turner C.W., Relkin E.M., Doucet J. (1994). “Psychophysical and physiological forward masking studies: probe duration and rise-time effects”, J. Acoust. Soc. Am. 96 (2), 795-800. [22] Yost W.A., Sheft S. (1994) “Modulation detection interference – across-frequency processing and auditory grouping”, Hear. Res. 79, 48-58. [23] Zhang X., Heinz M.G., Bruce I.C., Carney L.H. (2001). “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression”, J. Acoust. Soc. Am. 109 (2), 648-670.
5 0.15527566 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
Author: Christopher Williams, Michalis K. Titsias
Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.
6 0.13022596 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
7 0.098054044 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
8 0.096404962 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
9 0.087124042 103 nips-2002-How Linear are Auditory Cortical Responses?
10 0.086629115 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
11 0.086258285 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
12 0.077277094 67 nips-2002-Discriminative Binaural Sound Localization
13 0.071640611 43 nips-2002-Binary Coding in Auditory Cortex
14 0.07083071 29 nips-2002-Analysis of Information in Speech Based on MANOVA
15 0.065698616 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
16 0.063579671 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
17 0.056950256 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
18 0.049704362 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
19 0.048306547 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex
20 0.046496 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex
topicId topicWeight
[(0, -0.153), (1, 0.108), (2, 0.005), (3, 0.08), (4, -0.049), (5, -0.055), (6, -0.181), (7, 0.018), (8, 0.308), (9, -0.145), (10, 0.223), (11, 0.202), (12, -0.413), (13, -0.074), (14, 0.069), (15, -0.018), (16, -0.042), (17, -0.078), (18, -0.003), (19, -0.069), (20, -0.167), (21, 0.125), (22, 0.077), (23, -0.1), (24, 0.059), (25, -0.031), (26, 0.051), (27, 0.054), (28, -0.143), (29, 0.149), (30, -0.008), (31, 0.137), (32, 0.017), (33, -0.091), (34, 0.088), (35, 0.019), (36, 0.07), (37, -0.077), (38, 0.007), (39, -0.026), (40, 0.031), (41, -0.038), (42, 0.01), (43, -0.057), (44, -0.109), (45, -0.014), (46, 0.053), (47, 0.02), (48, -0.089), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.9818508 147 nips-2002-Monaural Speech Separation
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and high-frequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range.

1 Introduction

In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference.
Monaural separation has also been studied using phase-based decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2].
Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper.

2 Model description

Our model is a multistage system, as shown in Fig. 1. Description for each stage is given below.

[Figure 1. Schematic diagram of the proposed multistage system. Labels in the diagram: mixture, peripheral and mid-level processing, initial segregation, pitch tracking, unit labeling, final segregation, resynthesis, segregated speech.]

2.1 Initial processing

First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of time-frequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with passband [0, 1 kHz] and a Kaiser window of 18.25 ms.
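The framing step that turns one filter channel into T-F units can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `frame_signal`, the list-based signal and the 16 kHz sampling rate in the usage line are all assumptions, and only the 20 ms frame length and 10 ms overlap come from the text.

```python
def frame_signal(samples, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split one channel response into overlapping frames (one per T-F unit).

    A 20 ms frame advanced in 10 ms steps gives the 10 ms overlap between
    consecutive frames described in the text. Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


# 50 ms of a hypothetical channel response sampled at an assumed 16 kHz.
channel = [float(k) for k in range(800)]
frames = frame_signal(channel, 16000)
print(len(frames), len(frames[0]))
```

Applying the same framing to each of the 128 channels yields a grid in which unit (i, j) holds the response of channel i during frame j.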
Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close. Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high cross-channel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18]. Fig. 2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and a cocktail party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments.

2.2 Pitch tracking

In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream.
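The summary-correlogram idea described above (sum the per-channel autocorrelations, then pick the strongest lag) can be sketched as follows. This is an illustrative sketch under assumptions: it uses a plain unnormalized autocorrelation over a single frame, the function names and the 8 kHz sampling rate are hypothetical, and the 2 ms to 12.5 ms search range is taken as the default plausible pitch range.

```python
import math


def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the frame length")
    return [sum(frame[k] * frame[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def summary_pitch_period(channel_frames, sample_rate,
                         min_period=0.002, max_period=0.0125):
    """Pitch period (seconds) at the peak of the summed autocorrelations."""
    lo = int(round(min_period * sample_rate))
    hi = int(round(max_period * sample_rate))
    acfs = [autocorrelation(frame, hi) for frame in channel_frames]
    summary = [sum(acf[lag] for acf in acfs) for lag in range(hi + 1)]
    best_lag = max(range(lo, hi + 1), key=lambda lag: summary[lag])
    return best_lag / sample_rate


# Two hypothetical channels carrying the 2nd and 3rd harmonics of 100 Hz.
rate, n = 8000, 160
chans = [[math.sin(2 * math.pi * f * k / rate) for k in range(n)]
         for f in (200.0, 300.0)]
print(summary_pitch_period(chans, rate) * 1000.0, "ms")
```

Neither channel alone peaks only at 10 ms, but the lag shared by both harmonics dominates once the autocorrelations are summed, which is why the summary correlogram exposes the common pitch period.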
At each frame, the autocorrelation functions of T-F units in the foreground stream are summated. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability. First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ(j) represent the estimated pitch period, and A(i, j, τ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ(j) if

$$\frac{A(i, j, \tau(j))}{A(i, j, \tau_m)} > \theta_d \tag{1}$$

Figure 2. Results of initial segregation for a speech and cocktail-party mixture (time in seconds against frequency in Hz). (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream.

Here, θd = 0.95, the same threshold used in [18], and τm is the lag corresponding to the maximum of A(i, j, τ) within [2 ms, 12.5 ms]. τ(j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it. Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, justified from pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ(j) is reliable while τ(j+1) is not. All the channels corresponding to the T-F units agreeing with τ(j) are selected. τ(j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig.
3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well.

Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x” (time in seconds against pitch period in ms). The solid line indicates the pitch contour obtained from clean speech.

2.3 Unit labeling

With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred to as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation. However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred to as the AM criterion.

Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz (time in ms). (b) Corresponding autocorrelation (lag in ms). The vertical line marks the position corresponding to the pitch period of target speech.

To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert Transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically,

$$(f_{ij}, \phi_{ij}) = \arg\min_{f, \phi} \sum_{k=1}^{M} \left[ \hat{r}(i, jT - k) - \sin(2\pi k f / f_S + \phi) \right]^2, \quad f \in [80\ \text{Hz}, 500\ \text{Hz}], \tag{2}$$

where a square error measure is used. r̂(i, t) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T = 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]).

The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch:

$$| f_{ij}\, \tau(j) - 1 | < \theta_f. \tag{3}$$

Psychoacoustic evidence suggests that to separate sounds with overlapping spectra requires 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12.
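The single-sinusoid fit in (2) and the labeling rule in (3) can be sketched as follows. This is a hedged illustration rather than the authors' method: the paper solves (2) by iterative gradient descent, whereas this sketch swaps in an exhaustive grid search over frequency and phase. The function names, the 1 Hz frequency step, and the 64-point phase grid are all assumptions, and the input is taken to be the normalized response already ordered by the index k.

```python
import numpy as np

def am_repetition_rate(r_hat, fs, f_min=80.0, f_max=500.0,
                       f_step=1.0, n_phase=64):
    """Grid-search fit of a unit-amplitude sinusoid, as in Eq. (2).

    Returns (frequency in Hz, phase in rad, squared error).
    Grid resolution is an assumption; the paper uses gradient descent.
    """
    r_hat = np.asarray(r_hat, dtype=float)
    k = np.arange(1, len(r_hat) + 1)
    freqs = np.arange(f_min, f_max + 0.5 * f_step, f_step)
    phases = np.linspace(0.0, 2.0 * np.pi, n_phase, endpoint=False)
    best_f, best_phi, best_err = float(freqs[0]), float(phases[0]), np.inf
    for f in freqs:
        # Rows are candidate phases, columns are samples k = 1..M.
        model = np.sin(2.0 * np.pi * k * f / fs + phases[:, None])
        err = np.sum((r_hat - model) ** 2, axis=1)
        p = int(np.argmin(err))
        if err[p] < best_err:
            best_f, best_phi, best_err = float(f), float(phases[p]), float(err[p])
    return best_f, best_phi, best_err

def is_target_by_am(r_hat, fs, pitch_period, theta_f=0.12):
    """AM criterion: small fit error and AM rate close to the pitch, Eq. (3)."""
    f, _, err = am_repetition_rate(r_hat, fs)
    energy = float(np.sum(np.asarray(r_hat, dtype=float) ** 2))
    return bool(err < 0.5 * energy and abs(f * pitch_period - 1.0) < theta_f)
```

A grid search is slower than gradient descent but needs no initial values, which makes it a convenient reference when checking a faster solver.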
2.4 Final segregation and resynthesis

For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity. The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided. Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. Individual units that do not belong to either stream are grouped into SF iteratively if they are labeled as target speech and lie in the neighborhood of SF. The resulting SF is the final segregated stream of target speech.

Fig. 5(a) shows the new segments generated in this process for the speech and cocktail-party mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger.
In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it. A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19].

3 Evaluation and comparison

Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have a considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech.

Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks are also well suited to situations where more than one target needs to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to a prior mask used in a recent ASR study that yields excellent recognition performance [5].

The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). The speech waveform resynthesized from the ideal binary mask is denoted by I(t).
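The construction of the ideal binary mask described above can be sketched in a few lines. This is a minimal sketch, assuming the per-unit energies of the premixed target and intrusion are available as arrays of shape (channels, frames); the function name and the array layout are illustrative, not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(target_energy, intrusion_energy):
    """Assign 1 to T-F units where target energy exceeds intrusion energy, else 0."""
    target_energy = np.asarray(target_energy, dtype=float)
    intrusion_energy = np.asarray(intrusion_energy, dtype=float)
    return (target_energy > intrusion_energy).astype(int)

# Hypothetical energies for a 2-channel, 2-frame example.
target = np.array([[4.0, 1.0], [0.5, 2.0]])
intrusion = np.array([[1.0, 3.0], [0.5, 1.0]])
mask = ideal_binary_mask(target, intrusion)
```

Units where the two energies are equal receive zero, following the strict "greater than" wording of the definition.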
Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows:

$$R_{EL} = \frac{\sum_t e_1^2(t)}{\sum_t I^2(t)}, \tag{4a}$$

$$R_{NR} = \frac{\sum_t e_2^2(t)}{\sum_t S^2(t)}. \tag{4b}$$

Figure 5. Results of final segregation for the speech and cocktail-party mixture (time in seconds against frequency in Hz). (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion.

Table 1: REL and RNR

             Proposed model         Wang-Brown model
Intrusion    REL (%)    RNR (%)     REL (%)    RNR (%)
N0            2.12       0.02        6.99       0
N1            4.66       3.55       28.96       1.61
N2            1.38       1.30        5.77       0.71
N3            3.83       2.72       21.92       1.92
N4            4.00       2.27       10.22       1.41
N5            2.83       0.10        7.47       0
N6            1.61       0.30        5.99       0.48
N7            3.21       2.18        8.61       4.23
N8            1.82       1.48        7.27       0.48
N9            8.57      19.33       15.81      33.03
Average       3.40       3.32       11.91       4.39

Figure 6. SNR results for segregated speech (SNR in dB for each intrusion type, N0 to N9). White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures.

The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system.
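The two measures in (4a) and (4b) are simple energy ratios. A minimal sketch, assuming the four waveforms e1(t), e2(t), I(t), and S(t) are already available as equal-rate sample arrays (the function names are illustrative):

```python
import numpy as np

def relative_energy_loss(e1, ideal):
    """REL, Eq. (4a): energy of e1(t) relative to the energy of I(t)."""
    e1 = np.asarray(e1, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    return float(np.sum(e1 ** 2) / np.sum(ideal ** 2))

def relative_noise_residue(e2, segregated):
    """RNR, Eq. (4b): energy of e2(t) relative to the energy of S(t)."""
    e2 = np.asarray(e2, dtype=float)
    segregated = np.asarray(segregated, dtype=float)
    return float(np.sum(e2 ** 2) / np.sum(segregated ** 2))
```

Multiplying either result by 100 gives the percentages reported in Table 1.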
Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth:

$$SNR = 10 \log_{10} \left[ \frac{\sum_t I^2(t)}{\sum_t (I(t) - S(t))^2} \right]. \tag{5}$$

The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB and a 5 dB improvement over the Wang-Brown model.

4 Discussion

The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved.

The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited.

Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study.

Acknowledgments

We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP.
This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027). References [1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990. [2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994. [3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999. [4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993. [5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001. [6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995. [7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996. [8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.) [9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu) [10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001. [11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989. [12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988. [13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997. [14] D. 
O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000. [15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988. [16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968. [17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001. [18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999. [19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
2 0.95187777 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
Author: Lawrence K. Saul, Daniel D. Lee, Charles L. Isbell, Yann L. Cun
Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.
3 0.43141273 12 nips-2002-A Neural Edge-Detection Model for Enhanced Auditory Sensitivity in Modulated Noise
Author: Alon Fishbach, Bradford J. May
Abstract: Psychophysical data suggest that temporal modulations of stimulus amplitude envelopes play a prominent role in the perceptual segregation of concurrent sounds. In particular, the detection of an unmodulated signal can be significantly improved by adding amplitude modulation to the spectral envelope of a competing masking noise. This perceptual phenomenon is known as “Comodulation Masking Release” (CMR). Despite the obvious influence of temporal structure on the perception of complex auditory scenes, the physiological mechanisms that contribute to CMR and auditory streaming are not well known. A recent physiological study by Nelken and colleagues has demonstrated an enhanced cortical representation of auditory signals in modulated noise. Our study evaluates these CMR-like response patterns from the perspective of a hypothetical auditory edge-detection neuron. It is shown that this simple neural model for the detection of amplitude transients can reproduce not only the physiological data of Nelken et al., but also, in light of previous results, a variety of physiological and psychoacoustical phenomena that are related to the perceptual segregation of concurrent sounds.

1 Introduction

The temporal structure of a complex sound exerts strong influences on auditory physiology (e.g. [10, 16]) and perception (e.g. [9, 19, 20]). In particular, studies of auditory scene analysis have demonstrated the importance of the temporal structure of amplitude envelopes in the perceptual segregation of concurrent sounds [2, 7]. Common amplitude transitions across frequency serve as salient cues for grouping sound energy into unified perceptual objects. Conversely, asynchronous amplitude transitions enhance the separation of competing acoustic events [3, 4]. These general principles are manifested in perceptual phenomena as diverse as comodulation masking release (CMR) [13], modulation detection interference [22] and synchronous onset grouping [8].
Despite the obvious importance of timing information in psychoacoustic studies of auditory masking, the way in which the CNS represents the temporal structure of an amplitude envelope is not well understood. Certainly many physiological studies have demonstrated neural sensitivities to envelope transitions, but this sensitivity is only beginning to be related to the variety of perceptual experiences that are evoked by signals in noise. Nelken et al. [15] have suggested a correspondence between neural responses to time-varying amplitude envelopes and psychoacoustic masking phenomena. In their study of neurons in primary auditory cortex (A1), adding temporal modulation to background noise lowered the detection thresholds of unmodulated tones. This enhanced signal detection is similar to the perceptual phenomenon that is known as comodulation masking release [13].

Fishbach et al. [11] have recently proposed a neural model for the detection of “auditory edges” (i.e., amplitude transients) that can account for numerous physiological [14, 17, 18] and psychoacoustical [3, 21] phenomena. The encompassing utility of this edge-detection model suggests a common mechanism that may link the auditory processing and perception of auditory signals in a complex auditory scene. Here, it is shown that the auditory edge detection model can accurately reproduce the cortical CMR-like responses previously described by Nelken and colleagues.

2 The Model

The model is described in detail elsewhere [11]. In short, the basic operation of the model is the calculation of the first-order time derivative of the log-compressed envelope of the stimulus. A computational model [23] is used to convert the acoustic waveform to a physiologically plausible auditory nerve representation (Fig. 1a). The simulated neural response has a medium spontaneous rate and a characteristic frequency that is set to the frequency of the target tone.
To allow computation of the time derivative of the stimulus envelope, we hypothesize the existence of a temporal delay dimension, along which the stimulus is progressively delayed. The intermediate delay layer (Fig. 1b) is constructed from an array of neurons with ascending membrane time constants (τ); each neuron is modeled by a conventional integrate-and-fire model (I&F, [12]). A higher membrane time constant induces a greater delay in the neuron’s response [1]. The output of the delay layer converges to a single output neuron (Fig. 1c) via a set of connections with various efficacies that reflect a receptive field of a gaussian derivative. This combination of excitatory and inhibitory connections carries out the time-derivative computation. Implementation details and parameters are given in [11]. The model has 2 adjustable and 6 fixed parameters; the former were used to fit the responses of the model to single-unit responses to a variety of stimuli [11]. The results reported here are not sensitive to these parameters.

Figure 1: Schematic diagram of the model and a block diagram of the basic operation of each model component (shaded area). The stimulus is converted to a neural representation (a) that approximates the average firing rate of a medium spontaneous-rate AN fiber [23]. The operation of this stage can be roughly described as the log-compressed rms output of a bandpass filter. The neural representation is fed to a series of neurons with ascending membrane time constant (b). The kernel functions that are used to simulate these neurons are plotted for a few neurons along with the time constants used. The output of the delay-layer neurons converges to a single I&F neuron (c) using a set of connections with weights that reflect the shape of a gaussian derivative. Solid arrows represent excitatory connections and white arrows represent inhibitory connections.
The absolute efficacy is represented by the width of the arrows.

3 Results

Nelken et al. [15] report that amplitude modulation can substantially modify the noise-driven discharge rates of A1 neurons in halothane-anesthetized cats. Many cortical neurons show only a transient onset response to unmodulated noise but fire in synchrony (“lock”) to the envelope of modulated noise. A significant reduction in envelope-locked discharge rates is observed if an unmodulated tone is added to modulated noise. As summarized in Fig. 2, this suppression of envelope locking can reveal the presence of an auditory signal at sound pressure levels that are not detectable in unmodulated noise. It has been suggested that this pattern of neural responding may represent a physiological equivalent of CMR.

Reproduction of CMR-like cortical activity can be illustrated by a simplified case in which the analytical amplitude envelope of the stimulus is used as the input to the edge-detector model. In keeping with the actual physiological approach of Nelken et al., the noise envelope is shaped by a trapezoid modulator for these simulations. Each cycle of modulation, E_N(t), is given by:

$$E_N(t) = \begin{cases} \dfrac{P}{D}\, t & 0 \le t < D \\ P & D \le t < 3D \\ P - \dfrac{P}{D}(t - 3D) & 3D \le t < 4D \\ 0 & 4D \le t < 8D \end{cases}$$

where P is the peak pressure level and D is set to 12.5 ms.

Figure 2: Responses of an A1 unit to a combination of noise and tone at many tone levels, replotted from Nelken et al. [15]. (a) Unmodulated noise and (b) modulated noise. The noise envelope is illustrated by the thick line above each figure. Each row shows the response of the neuron to the noise plus the tone at the level specified on the ordinate. The dashed line in (b) indicates the detection threshold level for the tone. The detection threshold (as defined and calculated by Nelken et al.) in the unmodulated noise was not reached.
Since the basic operation of the model is the calculation of the rectified time-derivative of the log-compressed envelope of the stimulus, the expected noise-driven rate of the model can be approximated by:

$$M_N(t) = \max\left(0,\ \frac{d}{dt}\, A \ln\left(1 + \frac{E_N(t)}{P_0}\right)\right)$$

where A = 20/ln(10) and P0 = 2e-5 Pa. The expected firing rate in response to the noise plus an unmodulated signal (tone) can be similarly approximated by:

$$M_{N+S}(t) = \max\left(0,\ \frac{d}{dt}\, A \ln\left(1 + \frac{E_N(t) + P_S}{P_0}\right)\right)$$

where PS is the peak pressure level of the tone. Clearly, both M_N(t) and M_{N+S}(t) are identically zero outside the interval [0, D]. Within this interval it holds that:

$$M_N(t) = \frac{A P / D}{P_0 + (P/D)\, t}, \quad 0 \le t < D$$

Clearly, M_{N+S} < M_N for the interval [0, D] of each modulation cycle. That is, the addition of a tone reduces the responses of the model to the rising part of the modulated envelope. Higher tone levels (PS) cause greater reduction in the model’s firing rate.

Figure 3: An illustration of the basic operation of the model on various amplitude envelopes. The simplified operation of the model includes log compression of the amplitude envelope (a and c) and rectified time-derivative of the log-compressed envelope (b and d). (a) A 30 dB SPL tone is added to a modulated envelope (peak level of 70 dB SPL) 300 ms after the beginning of the stimulus (as indicated by the horizontal line). The addition of the tone causes a great reduction in the time derivative of the log-compressed envelope (b). When the envelope of the noise is unmodulated (c), the time-derivative of the log-compressed envelope (d) shows a tiny spike when the tone is added (marked by the arrow).

Fig. 3 demonstrates the effect of a low-level tone on the time-derivative of the log-compressed envelope of a noise. When the envelope is modulated (Fig. 3a) the addition of the tone greatly reduces the derivative of the rising part of the modulation (Fig. 3b).
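The expected rates M_N(t) and M_{N+S}(t) defined above can be evaluated numerically to see the suppression on the rising flank. The following is a rough sketch under stated assumptions, not the full spiking model of [11]: the trapezoid is assumed to rise linearly over [0, D), hold at P until 3D, fall linearly until 4D, and stay at zero until 8D; the function and parameter names are hypothetical; and the derivative is taken by finite differences.

```python
import numpy as np

A = 20.0 / np.log(10.0)   # makes A*ln(x) equal to 20*log10(x), i.e. decibels
P0 = 2e-5                 # reference pressure in Pa

def trapezoid_envelope(t, peak, d=0.0125):
    """Assumed trapezoid cycle of length 8*d: rise, plateau, fall, silence."""
    tc = np.mod(t, 8 * d)
    rise = peak / d * tc
    fall = peak - peak / d * (tc - 3 * d)
    return np.where(tc < d, rise,
                    np.where(tc < 3 * d, peak,
                             np.where(tc < 4 * d, fall, 0.0)))

def expected_rate(envelope, dt, tone_pressure=0.0):
    """Rectified time derivative of the log-compressed envelope (dB per second)."""
    level = A * np.log(1.0 + (envelope + tone_pressure) / P0)
    return np.maximum(0.0, np.gradient(level, dt))

dt = 1e-5
t = np.arange(0.0, 0.1, dt)                  # one modulation cycle
peak = P0 * 10 ** (70 / 20)                  # 70 dB SPL noise peak
tone = P0 * 10 ** (30 / 20)                  # 30 dB SPL tone
env = trapezoid_envelope(t, peak)
m_n = expected_rate(env, dt)                 # noise alone
m_ns = expected_rate(env, dt, tone)          # noise plus tone
```

Comparing `m_ns` with `m_n` over the first 12.5 ms shows the reduced response to the rising part of the envelope; the 70 dB and 30 dB levels follow the example in Fig. 3.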
In the absence of modulations (Fig. 3c), the tone presentation produces a negligible effect on the level derivative (Fig. 3d). Model simulations of neural responses to the stimuli used by Nelken et al. are plotted in Fig. 4. As illustrated schematically in Fig. 3d, the presence of the tone does not cause any significant change in the responses of the model to the unmodulated noise (Fig. 4a). In the modulated noise, however, tones of relatively low levels reduce the responses of the model to the rising part of the envelope modulations.

Figure 4: Simulated responses of the model to a combination of a tone and unmodulated noise (a) and modulated noise (b). All conventions are as in Fig. 2.

4 Discussion

This report uses an auditory edge-detection model to simulate the actual physiological consequences of amplitude modulation on neural sensitivity in cortical area A1. The basic computational operation of the model is the calculation of the smoothed time-derivative of the log-compressed stimulus envelope. The ability of the model to reproduce cortical response patterns in detail across a variety of stimulus conditions suggests similar time-sensitive mechanisms may contribute to the physiological correlates of CMR. These findings augment our previous observations that the simple edge-detection model can successfully predict a wide range of physiological and perceptual phenomena [11]. Former applications of the model to perceptual phenomena have been mainly related to auditory scene analysis, or more specifically the ability of the auditory system to distinguish multiple sound sources. In these cases, a sharp amplitude transition at stimulus onset (“auditory edge”) was critical for sound segregation.
Here, it is shown that the detection of acoustic signals also may be enhanced through the suppression of ongoing responses to the concurrent modulations of competing background sounds. Interestingly, these temporal fluctuations appear to be a common property of natural soundscapes [15].

The model provides testable predictions regarding how signal detection may be influenced by the temporal shape of amplitude modulation. Carlyon et al. [6] measured CMR in human listeners using three types of noise modulation: square wave, sine wave and multiplied noise. From the perspective of the edge-detection model, these psychoacoustic results are intriguing because the different modulator types represent manipulations of the time derivative of masker envelopes. Square-wave modulation had the most sharply edged time derivative and produced the greatest masking release. Fig. 5 plots the responses of the model to a pure-tone signal in square-wave and sine-wave modulated noise. As in the psychoacoustical data of Carlyon et al., the simulated detection threshold was lower in the context of square-wave modulation. Our modeling results suggest that the sharply edged square wave evoked higher levels of noise-driven activity and therefore created a sensitive background for the suppressing effects of the unmodulated tone.

Figure 5: Simulated responses of the model to a combination of a tone at various levels and a sine-wave modulated noise (a) or a square-wave modulated noise (b). Each row shows the response of the model to the noise plus the tone at the level specified on the abscissa. The shape of the noise modulator is illustrated above each figure. The 100 ms tone starts 250 ms after the noise onset.
Note that the tone detection threshold (marked by the dashed line) is 10 dB lower for the square-wave modulator than for the sine-wave modulator, in accordance with the psychoacoustical data of Carlyon et al. [6].

Although the physiological basis of our model was derived from studies of neural responses in the cat auditory system, the key psychoacoustical observations of Carlyon et al. have been replicated in recent behavioral studies of cats (Budelis et al. [5]). These data support the generalization of human perceptual processing to other species and enhance the possible correspondence between the neuronal CMR-like effect and the psychoacoustical masking phenomena. Clearly, the auditory system relies on information other than the time derivative of the stimulus envelope for the detection of auditory signals in background noise. Further physiological and psychoacoustic assessments of CMR-like masking effects are needed not only to refine the predictive abilities of the edge-detection model but also to reveal the additional sources of acoustic information that influence signal detection in constantly changing natural environments.

Acknowledgments

This work was supported in part by a NIDCD grant R01 DC004841.

References

[1] Agmon-Snir H., Segev I. (1993). “Signal delay and input synchronization in passive dendritic structure”, J. Neurophysiol. 70, 2066-2085. [2] Bregman A.S. (1990). “Auditory scene analysis: The perceptual organization of sound”, MIT Press, Cambridge, MA. [3] Bregman A.S., Ahad P.A., Kim J., Melnerich L. (1994) “Resetting the pitch-analysis system. 1. Effects of rise times of tones in noise backgrounds or of harmonics in a complex tone”, Percept. Psychophys. 56 (2), 155-162. [4] Bregman A.S., Ahad P.A., Kim J. (1994) “Resetting the pitch-analysis system. 2. Role of sudden onsets and offsets in the perception of individual components in a cluster of overlapping tones”, J. Acoust. Soc. Am. 96 (5), 2694-2703.
[5] Budelis J., Fishbach A., May B.J. (2002) “Behavioral assessments of comodulation masking release in cats”, Abst. Assoc. for Res. in Otolaryngol. 25. [6] Carlyon R.P., Buus S., Florentine M. (1989) “Comodulation masking release for three types of modulator as a function of modulation rate”, Hear. Res. 42, 37-46. [7] Darwin C.J. (1997) “Auditory grouping”, Trends in Cog. Sci. 1(9), 327-333. [8] Darwin C.J., Ciocca V. (1992) “Grouping in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component”, J. Acoust. Soc. Am. 91 , 33813390. [9] Drullman R., Festen H.M., Plomp R. (1994) “Effect of temporal envelope smearing on speech reception”, J. Acoust. Soc. Am. 95 (2), 1053-1064. [10] Eggermont J J. (1994). “Temporal modulation transfer functions for AM and FM stimuli in cat auditory cortex. Effects of carrier type, modulating waveform and intensity”, Hear. Res. 74, 51-66. [11] Fishbach A., Nelken I., Yeshurun Y. (2001) “Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients”, J. Neurophysiol. 85, 2303–2323. [12] Gerstner W. (1999) “Spiking neurons”, in Pulsed Neural Networks , edited by W. Maass, C. M. Bishop, (MIT Press, Cambridge, MA). [13] Hall J.W., Haggard M.P., Fernandes M.A. (1984) “Detection in noise by spectrotemporal pattern analysis”, J. Acoust. Soc. Am. 76, 50-56. [14] Heil P. (1997) “Auditory onset responses revisited. II. Response strength”, J. Neurophysiol. 77, 2642-2660. [15] Nelken I., Rotman Y., Bar-Yosef O. (1999) “Responses of auditory cortex neurons to structural features of natural sounds”, Nature 397, 154-157. [16] Phillips D.P. (1988). “Effect of Tone-Pulse Rise Time on Rate-Level Functions of Cat Auditory Cortex Neurons: Excitatory and Inhibitory Processes Shaping Responses to Tone Onset”, J. Neurophysiol. 59, 1524-1539. [17] Phillips D.P., Burkard R. (1999). 
“Response magnitude and timing of auditory response initiation in the inferior colliculus of the awake chinchilla”, J. Acoust. Soc. Am. 105, 27312737. [18] Phillips D.P., Semple M.N., Kitzes L.M. (1995). “Factors shaping the tone level sensitivity of single neurons in posterior field of cat auditory cortex”, J. Neurophysiol. 73, 674-686. [19] Rosen S. (1992) “Temporal information in speech: acoustic, auditory and linguistic aspects”, Phil. Trans. R. Soc. Lond. B 336, 367-373. [20] Shannon R.V., Zeng F.G., Kamath V., Wygonski J, Ekelid M. (1995) “Speech recognition with primarily temporal cues”, Science 270, 303-304. [21] Turner C.W., Relkin E.M., Doucet J. (1994). “Psychophysical and physiological forward masking studies: probe duration and rise-time effects”, J. Acoust. Soc. Am. 96 (2), 795-800. [22] Yost W.A., Sheft S. (1994) “Modulation detection interference – across-frequency processing and auditory grouping”, Hear. Res. 79, 48-58. [23] Zhang X., Heinz M.G., Bruce I.C., Carney L.H. (2001). “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression”, J. Acoust. Soc. Am. 109 (2), 648-670.
4 0.43105948 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
Author: Patrick J. Wolfe, Simon J. Godsill
Abstract: The Bayesian paradigm provides a natural and effective means of exploiting prior knowledge concerning the time-frequency structure of sound signals such as speech and music—something which has often been overlooked in traditional audio signal processing approaches. Here, after constructing a Bayesian model and prior distributions capable of taking into account the time-frequency characteristics of typical audio waveforms, we apply Markov chain Monte Carlo methods in order to sample from the resultant posterior distribution of interest. We present speech enhancement results which compare favourably in objective terms with standard time-varying filtering techniques (and in several cases yield superior performance, both objectively and subjectively); moreover, in contrast to such methods, our results are obtained without an assumption of prior knowledge of the noise power.
5 0.37855646 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
Author: Hagai Attias
Abstract: Source separation is an important problem at the intersection of several fields, including machine learning, signal processing, and speech technology. Here we describe new separation algorithms which are based on probabilistic graphical models with latent variables. In contrast with existing methods, these algorithms exploit detailed models to describe source properties. They also use subband filtering ideas to model the reverberant environment, and employ an explicit model for background and sensor noise. We leverage variational techniques to keep the computational complexity per EM iteration linear in the number of frames.
1 The Source Separation Problem
Fig. 1 illustrates the problem of source separation with a sensor array. In this problem, signals from K independent sources are received by each of L ≥ K sensors. The task is to extract the sources from the sensor signals. It is a difficult task, partly because the received signals are distorted versions of the originals. There are two types of distortions. The first type arises from propagation through a medium, and is approximately linear but also history dependent. This type is usually termed reverberations. The second type arises from background noise and sensor noise, which are assumed additive. Hence, the actual task is to obtain an optimal estimate of the sources from data. The task is difficult for another reason, which is lack of advance knowledge of the properties of the sources, the propagation medium, and the noises. This difficulty gave rise to adaptive source separation algorithms, where parameters that are related to those properties are adjusted to optimize a chosen cost function. Unfortunately, the intense activity this problem has attracted over the last several years [1–9] has not yet produced a satisfactory solution. In our opinion, the reason is that existing techniques fail to address three major factors.
The first is noise robustness: algorithms typically ignore background and sensor noise, sometimes assuming they may be treated as additional sources. It seems plausible that to produce a noise robust algorithm, noise signals and their properties must be modeled explicitly, and these models should be exploited to compute optimal source estimators. The second factor is mixing filters: algorithms typically seek, and directly optimize, a transformation that would unmix the sources. However, in many situations, the filters describing medium propagation are non-invertible, or have an unstable inverse, or have a stable inverse that is extremely long. It may hence be advantageous to estimate the mixing filters themselves, then use them to estimate the sources. The third factor is source properties: algorithms typically use a very simple source model (e.g., a one time point histogram). But in many cases one may easily obtain detailed models of the source signals. This is particularly true for speech sources, where large datasets exist and much modeling expertise has developed over decades of research. Separation of speakers is also one of the major potential commercial applications of source separation algorithms. It seems plausible that incorporating strong source models could improve performance. Such models may potentially have two more advantages: first, they could help limit the range of possible mixing filters by constraining the optimization problem. Second, they could help avoid whitening the extracted signals by effectively limiting their spectral range to the range characteristic of the source model.
Figure 1: The source separation problem. Signals from K = 2 speakers propagate toward L = 2 sensors. Each sensor receives a linear mixture of the speaker signals, distorted by multipath propagation, medium response, and background and sensor noise. The task is to infer the original signals from sensor data.
This paper makes several contributions to the problem of real world source separation. In the following, we present new separation algorithms that are the first to address all three factors. We work in the framework of probabilistic graphical models. This framework allows us to construct models for sources and for noise, combine them with the reverberant mixing transformation in a principled manner, and compute parameter and source estimates from data which are Bayes optimal. We identify three technical ideas that are key to our approach: (1) a strong speech model, (2) subband filtering, and (3) variational EM.
2 Frames, Subband Signals, and Subband Filtering
We start with the concept of subband filtering. This is also a good point to define our notation. Let $x_m$ denote a time domain signal, e.g., the value of a sound pressure waveform at time point $m = 0, 1, 2, \dots$. Let $X_n[k]$ denote the corresponding subband signal at time frame $n$ and subband frequency $k$. The subband signals are obtained from the time domain signal by imposing an $N$-point window $w_m$, $m = 0 : N-1$, on that signal at equally spaced points $nJ$, $n = 0, 1, 2, \dots$, and FFT-ing the windowed signal,
$$X_n[k] = \sum_{m=0}^{N-1} e^{-i\omega_k m}\, w_m\, x_{nJ+m}, \tag{1}$$
where $\omega_k = 2\pi k/N$ and $k = 0 : N-1$. The subband signals are also termed frames. Notice the difference in time scale between the time frame index $n$ in $X_n[k]$ and the time point index $n$ in $x_n$. The chosen value of the spacing $J$ depends on the window length $N$. For $J \le N$ the original signal $x_m$ can be synthesized exactly from the subband signals (synthesis formula omitted). An important consideration for selecting $J$, as well as the window shape, is behavior under filtering. Consider a filter $h_m$ applied to $x_m$, and denote by $y_m$ the filtered signal. In the simple case $h_m = h\,\delta_{m,0}$ (no filtering), the subband signals keep the same dependence as the time domain ones, $y_n = h x_n \longrightarrow Y_n[k] = h X_n[k]$.
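As an illustration of Eq. (1), the following minimal Python sketch computes subband signals with a direct (slow) DFT. The function names and the Hann window default are assumptions made here for illustration; the paper does not specify the window shape or any implementation.

```python
import cmath
import math


def hann_window(N):
    """Periodic Hann window of length N (an assumed choice of w_m, not from the paper)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * m / N) for m in range(N)]


def subband_signals(x, N, J, window=None):
    """Return frames[n][k] = sum_m exp(-i*w_k*m) * w[m] * x[n*J + m], with w_k = 2*pi*k/N.

    x is a real time domain signal, N the window length, J the frame spacing.
    Only frames that fit entirely inside x are produced.
    """
    w = window if window is not None else hann_window(N)
    n_frames = (len(x) - N) // J + 1 if len(x) >= N else 0
    frames = []
    for n in range(n_frames):
        # Windowed segment starting at time point n*J.
        seg = [w[m] * x[n * J + m] for m in range(N)]
        # Direct DFT of the windowed segment, one value per subband k.
        frames.append([
            sum(seg[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)
        ])
    return frames
```

With a rectangular window (`window=[1.0] * N`) and a sinusoid that completes a whole number of cycles per window, all energy falls into the matching subband and its mirror image. In practice a library FFT would replace the inner sum.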
For an arbitrary filter $h_m$, we use the relation
$$y_n = \sum_m h_m x_{n-m} \longrightarrow Y_n[k] = \sum_m H_m[k]\, X_{n-m}[k], \tag{2}$$
with complex coefficients $H_m[k]$ for each $k$. This relation between the subband signals is termed subband filtering, and the $H_m[k]$ are termed subband filters. Unlike the simple case of non-filtering, the relation (2) holds approximately, but quite accurately using an appropriate choice of $J$ and $w_m$; see [13] for details on accuracy. Throughout this paper, we will assume that an arbitrary filter $h_m$ can be modeled by the subband filters $H_m[k]$ to a sufficient accuracy for our purposes. One advantage of subband filtering is that it replaces a long filter $h_m$ by a set of short independent filters $H_m[k]$, one per frequency. This will turn out to decompose the source separation problem into a set of small (albeit coupled) problems, one per frequency. Another advantage is that this representation allows using a detailed speech model on the same footing with the filter model. This is because a speech model is defined on the time scale of a single frame, whereas the original filter $h_m$, in contrast with $H_m[k]$, is typically as long as 10 or more frames.
As a final point on notation, we define a Gaussian distribution over a complex number $Z$ by $p(Z) = \mathcal{N}(Z \mid \mu, \nu) = \frac{\nu}{\pi} \exp(-\nu\, |Z - \mu|^2)$. Notice that this is a joint distribution over the real and imaginary parts of $Z$. The mean is $\mu = \langle Z \rangle$ and the precision (inverse variance) $\nu$ satisfies $\nu^{-1} = \langle |Z|^2 \rangle - |\mu|^2$.
3 A Model for Speech Signals
We assume independent sources, and model the distribution of source $j$ by a mixture model over its subband signals $X_{jn}$,
$$p(X_{jn} \mid S_{jn} = s) = \prod_{k=1}^{N/2-1} \mathcal{N}(X_{jn}[k] \mid 0, A_{js}[k]), \qquad p(S_{jn} = s) = \pi_{js}, \qquad p(X, S) = \prod_{jn} p(X_{jn} \mid S_{jn})\, p(S_{jn}), \tag{3}$$
where the components are labeled by $S_{jn}$. Component $s$ of source $j$ is a zero mean Gaussian with precision $A_{js}$. The mixing proportions of source $j$ are $\pi_{js}$. The DAG representing this model is shown in Fig. 2.
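To make the speech model (3) concrete, here is a small Python sketch of the complex Gaussian log density and of the exact component posterior p(S = s | X) for a single clean frame. It illustrates the mixture model only, not the variational E-step of the paper; the function names and argument layout (`X[k]`, `A[s][k]`, `mix[s]`) are assumptions made for this example, with the subband index already restricted to k = 1 : N/2 − 1.

```python
import math


def log_complex_gaussian(z, mu, nu):
    """log N(z | mu, nu) = log(nu / pi) - nu * |z - mu|^2, with nu a precision."""
    return math.log(nu / math.pi) - nu * abs(z - mu) ** 2


def component_posterior(X, A, mix):
    """Posterior over mixture components for one frame of one source.

    X[k]    : complex subband values of the frame
    A[s][k] : precision (inverse spectrum) of component s at subband k
    mix[s]  : mixing proportion of component s
    """
    log_joint = [
        math.log(mix[s])
        + sum(log_complex_gaussian(X[k], 0.0, A[s][k]) for k in range(len(X)))
        for s in range(len(mix))
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]
```

A frame whose energy is concentrated where a component has low precision (high expected energy) is assigned mostly to that component, which is the sense in which each component carries a characteristic spectrum.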
A similar model was used in [10] for one microphone speech enhancement for recognition (see also [11]). Here are several things to note about this model. (1) Each component has a characteristic spectrum, which may describe a particular part of a speech phoneme. This is because the precision corresponds to the inverse spectrum: the mean energy (w.r.t. the above distribution) of source $j$ at frequency $k$, conditioned on label $s$, is $\langle |X_{jn}[k]|^2 \rangle = A_{js}[k]^{-1}$. (2) A zero mean model is appropriate given the physics of the problem, since the mean of a sound pressure waveform is zero. (3) $k$ runs from 1 to $N/2-1$, since for $k > N/2$, $X_{jn}[k] = X_{jn}[N-k]^*$; the subbands $k = 0, N/2$ are real and are omitted from the model, a common practice in speech recognition engines. (4) Perhaps most importantly, for each source the subband signals are correlated via the component label $s$, as $p(X_{jn}) = \sum_s p(X_{jn}, S_{jn} = s) \neq \prod_k p(X_{jn}[k])$. Hence, when the source separation problem decomposes into one problem per frequency, these problems turn out to be coupled (see below), and independent frequency permutations are avoided. (5) To increase model accuracy, a state transition matrix $p(S_{jn} = s \mid S_{j,n-1} = s')$ may be added for each source. The resulting HMM models are straightforward to incorporate without increasing the algorithm complexity.
Figure 2: Graphical model describing speech signals in the subband domain. The model assumes i.i.d. frames; only the frame at time $n$ is shown. The node $X_n$ represents a complex $(N/2-1)$-dimensional vector $X_n[k]$, $k = 1 : N/2-1$.
There are several modes of using the speech model in the algorithms below. In one mode, the sources are trained online using the sensor data. In a second mode, source models are trained offline using available data on each source in the problem. A third mode corresponds to separation of sources known to be speech but whose speakers are unknown.
In this case, all sources have the same model, which is trained offline on a large dataset of speech signals, including 150 male and female speakers reading sentences from the Wall Street Journal (see [10] for details). This is the case presented in this paper. The training algorithm used was standard EM (omitted) using 256 clusters, initialized by vector quantization.
4 Separation of Non-Reverberant Mixtures
We now present a source separation algorithm for the case of non-reverberant (or instantaneous) mixing. Whereas many algorithms exist for this case, our contribution here is an algorithm that is significantly more robust to noise. Its robustness results, as indicated in the introduction, from three factors: (1) explicitly modeling the noise in the problem, (2) using a strong source model, in particular modeling the temporal statistics (over $N$ time points) of the sources, rather than one time point statistics, and (3) extracting each source signal from data by a Bayes optimal estimator obtained from $p(X \mid Y)$. A more minor point is handling the case of fewer sources than sensors in a principled way. The mixing situation is described by $y_{in} = \sum_j h_{ij} x_{jn} + u_{in}$, where $x_{jn}$ is source signal $j$ at time point $n$, $y_{in}$ is sensor signal $i$, $h_{ij}$ is the instantaneous mixing matrix, and $u_{in}$ is the noise corrupting sensor $i$'s signal. The corresponding subband signals satisfy $Y_{in}[k] = \sum_j h_{ij} X_{jn}[k] + U_{in}[k]$. To turn the last equation into a probabilistic graphical model, we assume that noise $i$ has precision (inverse spectrum) $B_i[k]$, and that noises at different sensors are independent (the latter assumption is often inaccurate but can be easily relaxed). This yields
$$p(Y_{in} \mid X) = \prod_k \mathcal{N}\Big(Y_{in}[k] \,\Big|\, \sum_j h_{ij} X_{jn}[k],\, B_i[k]\Big), \qquad p(Y \mid X) = \prod_{in} p(Y_{in} \mid X), \tag{4}$$
which together with the speech model (3) forms a complete model $p(Y, X, S)$ for this problem. The DAG representing this model for the case K = L = 2 is shown in Fig. 3. Notice that this model generalizes [4] to the subband domain.
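The observation model (4) can be sketched for a single frame as follows. This is a hedged illustration in plain Python; the names `mix_frame` and `log_likelihood_frame` and the nested-list layout (`h[i][j]`, `X[j][k]`, `Y[i][k]`, `B[i][k]`) are assumptions introduced here, not part of the paper.

```python
import math


def mix_frame(h, X):
    """Noise-free sensor frame: mean[i][k] = sum_j h[i][j] * X[j][k]."""
    n_sensors, n_sources, n_sub = len(h), len(X), len(X[0])
    return [
        [sum(h[i][j] * X[j][k] for j in range(n_sources)) for k in range(n_sub)]
        for i in range(n_sensors)
    ]


def log_likelihood_frame(Y, X, h, B):
    """log p(Y_n | X_n) for one frame under independent sensor noise.

    Each term is log(B_i[k] / pi) - B_i[k] * |Y_i[k] - sum_j h_ij X_j[k]|^2,
    summed over sensors i and subbands k.
    """
    mean = mix_frame(h, X)
    total = 0.0
    for i in range(len(Y)):
        for k in range(len(Y[i])):
            total += math.log(B[i][k] / math.pi) - B[i][k] * abs(Y[i][k] - mean[i][k]) ** 2
    return total
```

Because the noise precisions enter per sensor and per subband, a sensor with a noisy band simply contributes less to the likelihood in that band; no separate noise "source" is needed.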
Figure 3: Graphical model for noisy, non-reverberant 2 × 2 mixing, showing a 3 frame-long sequence. All nodes $Y_{in}$ and $X_{jn}$ represent complex $(N/2-1)$-dimensional vectors (see Fig. 2). While $Y_{1n}$ and $Y_{2n}$ have the same parents, $X_{1n}$ and $X_{2n}$, the arcs from the parents to $Y_{2n}$ are omitted for clarity.
The model parameters $\theta = \{h_{ij}, B_i[k], A_{js}[k], \pi_{js}\}$ are estimated from data by an EM algorithm. However, as the number of speech components $M$ or the number of sources $K$ increases, the E-step becomes computationally intractable, as it requires summing over all $O(M^K)$ configurations of $(S_{1n}, \dots, S_{Kn})$ at each frame. We approximate the E-step using a variational technique: focusing on the posterior distribution $p(X, S \mid Y)$, we compute an optimal tractable approximation $q(X, S \mid Y) \approx p(X, S \mid Y)$, which we use to compute the sufficient statistics (SS). We choose
$$q(X, S \mid Y) = \prod_{jn} q(X_{jn} \mid S_{jn}, Y)\, q(S_{jn} \mid Y), \tag{5}$$
where the hidden variables are factorized over the sources, and also over the frames (the latter factorization is exact in this model, but is an approximation for reverberant mixing). This posterior maintains the dependence of $X$ on $S$, and thus the correlations between different subbands $X_{jn}[k]$. Notice also that this posterior implies a multimodal $q(X_{jn})$ (i.e., a mixture distribution), which is more accurate than unimodal posteriors often employed in variational approximations (e.g., [12]), but is also harder to compute. A slightly more general form which allows inter-frame correlations by employing $q(S \mid Y) = \prod_{jn} q(S_{jn} \mid S_{j,n-1}, Y)$ may also be used, without increasing complexity. By optimizing in the usual way (see [12,13]) a lower bound on the likelihood w.r.t. $q$, we obtain
$$q(X_{jn}, S_{jn} = s \mid Y) = \Big[ \prod_k q(X_{jn}[k] \mid S_{jn} = s, Y) \Big]\, q(S_{jn} = s \mid Y), \tag{6}$$
where $q(X_{jn}[k] \mid S_{jn} = s, Y) = \mathcal{N}(X_{jn}[k] \mid \rho_{jns}[k], \nu_{js}[k])$ and $q(S_{jn} = s \mid Y) = \gamma_{jns}$.
Both the factorization over $k$ of $q(X_{jn} \mid S_{jn})$ and its Gaussian functional form fall out from the optimization under the structural restriction (5) and need not be specified in advance. The variational parameters $\{\rho_{jns}[k], \nu_{js}[k], \gamma_{jns}\}$, which depend on the data $Y$, constitute the SS and are computed in the E-step. The DAG representing this posterior is shown in Fig. 4.
Figure 4: Graphical model describing the variational posterior distribution applied to the model of Fig. 3. In the non-reverberant case, the components of this posterior at time frame $n$ are conditioned only on the data $Y_{in}$ at that frame; in the reverberant case, the components at frame $n$ are conditioned on the data $Y_{im}$ at all frames $m$. For clarity and space reasons, this distinction is not made in the figure.
After learning, the sources are extracted from data by a variational approximation of the minimum mean squared error estimator,
$$\hat{X}_{jn}[k] = E(X_{jn}[k] \mid Y) = \int dX\, q(X \mid Y)\, X_{jn}[k], \tag{7}$$
i.e., the posterior mean, where $q(X \mid Y) = \sum_S q(X, S \mid Y)$. The time domain waveform $\hat{x}_{jm}$ is then obtained by appropriately patching together the subband signals.
M-step. The update rule for the mixing matrix $h_{ij}$ is obtained by solving the linear equation
$$\sum_k B_i[k]\, \eta_{ij,0}[k] = \sum_{j'} h_{ij'} \sum_k B_i[k]\, \lambda_{j'j,0}[k]. \tag{8}$$
The update rule for the noise precisions $B_i[k]$ is omitted. The quantities $\eta_{ij,m}[k]$ and $\lambda_{j'j,m}[k]$ are computed from the SS; see [13] for details.
E-step. The posterior means of the sources (7) are obtained by solving
$$\hat{X}_{jn}[k] = \hat{\nu}_{jn}[k]^{-1} \sum_i B_i[k]\, h_{ij} \Big( Y_{in}[k] - \sum_{j' \neq j} h_{ij'}\, \hat{X}_{j'n}[k] \Big) \tag{9}$$
for $\hat{X}_{jn}[k]$, which is a K × K linear system for each frequency $k$ and frame $n$. The equations for the SS are given in [13], which also describes experimental results.
5 Separation of Reverberant Mixtures
In this section we extend the algorithm to the case of reverberant mixing.
In that case, due to signal propagation in the medium, each sensor signal at time frame $n$ depends on the source signals not just at the same time but also at previous times. To describe this mathematically, the mixing matrix $h_{ij}$ must become a matrix of filters $h_{ij,m}$, and $y_{in} = \sum_{jm} h_{ij,m}\, x_{j,n-m} + u_{in}$. It may seem straightforward to extend the algorithm derived above to the present case. However, this appearance is misleading, because we have a time scale problem. Whereas our speech model $p(X, S)$ is frame based, the filters $h_{ij,m}$ are generally longer than the frame length $N$, typically 10 frames long and sometimes longer. It is unclear how one can work with both $X_{jn}$ and $h_{ij,m}$ on the same footing (and, it is easy to see that straightforward windowed FFT cannot solve this problem). This is where the idea of subband filtering becomes very useful. Using (2) we have $Y_{in}[k] = \sum_{jm} H_{ij,m}[k]\, X_{j,n-m}[k] + U_{in}[k]$, which yields the probabilistic model
$$p(Y_{in} \mid X) = \prod_k \mathcal{N}\Big(Y_{in}[k] \,\Big|\, \sum_{jm} H_{ij,m}[k]\, X_{j,n-m}[k],\, B_i[k]\Big). \tag{10}$$
Hence, both $X$ and $Y$ are now frame based. Combining this equation with the speech model (3), we now have a complete model $p(Y, X, S)$ for the reverberant mixing problem. The DAG describing this model is shown in Fig. 5.
Figure 5: Graphical model for noisy, reverberant 2 × 2 mixing, showing a 3 frame-long sequence. Here we assume 2 frame-long filters, i.e., $m = 0, 1$ in Eq. (10), where the solid arcs from $X$ to $Y$ correspond to $m = 0$ (as in Fig. 3) and the dashed arcs to $m = 1$. While $Y_{1n}$ and $Y_{2n}$ have the same parents, $X_{1n}$ and $X_{2n}$, the arcs from the parents to $Y_{2n}$ are omitted for clarity.
The model parameters $\theta = \{H_{ij,m}[k], B_i[k], A_{js}[k], \pi_{js}\}$ are estimated from data by a variational EM algorithm, whose derivation generally follows the one outlined in the previous section.
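The mean of the reverberant observation model in Eq. (10) is a per-frequency convolution over frames. The sketch below, in plain Python, shows that structure under the assumption that frames before n = 0 are zero; the function names and the nested-list layout (`H[i][j][m][k]`, `X[j][n][k]`) are hypothetical choices for this example.

```python
def subband_filter(H, X):
    """Y[n][k] = sum_m H[m][k] * X[n - m][k], with frames before n = 0 treated as zero.

    H[m][k] : subband filter tap m at subband k
    X[n][k] : source subband signal at frame n, subband k
    """
    n_frames, n_sub = len(X), len(X[0])
    Y = [[0j] * n_sub for _ in range(n_frames)]
    for n in range(n_frames):
        for m in range(min(len(H), n + 1)):
            for k in range(n_sub):
                Y[n][k] += H[m][k] * X[n - m][k]
    return Y


def reverberant_mix(H, X):
    """Noise-free sensor frames Y[i][n][k] = sum_j sum_m H[i][j][m][k] * X[j][n - m][k]."""
    n_sensors, n_sources = len(H), len(X)
    n_frames, n_sub = len(X[0]), len(X[0][0])
    Y = [[[0j] * n_sub for _ in range(n_frames)] for _ in range(n_sensors)]
    for i in range(n_sensors):
        for j in range(n_sources):
            part = subband_filter(H[i][j], X[j])
            for n in range(n_frames):
                for k in range(n_sub):
                    Y[i][n][k] += part[n][k]
    return Y
```

Each subband k is filtered independently of the others, which is why a long time domain filter can be represented by a handful of taps per frequency.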
Notice that the exact E-step here is even more intractable, due to the history dependence introduced by the filters.
M-step. The update rule for $H_{ij,m}$ is obtained by solving the Toeplitz system
$$\sum_{j'm'} H_{ij',m'}[k]\, \lambda_{j'j,m-m'}[k] = \eta_{ij,m}[k], \tag{11}$$
where the quantities $\lambda_{j'j,m}[k]$, $\eta_{ij,m}[k]$ are computed from the SS (see [12]). The update rule for the $B_i[k]$ is omitted.
E-step. The posterior means of the sources (7) are obtained by solving
$$\hat{X}_{jn}[k] = \hat{\nu}_{jn}[k]^{-1} \sum_{im} B_i[k]\, H_{ij,m-n}[k] \Big( Y_{im}[k] - \sum_{(j',m') \neq (j,n)} H_{ij',m-m'}[k]\, \hat{X}_{j'm'}[k] \Big) \tag{12}$$
for $\hat{X}_{jn}[k]$. Assuming P frames long filters $H_{ij,m}$, $m = 0 : P-1$, this is a KP × KP linear system for each frequency $k$. The equations for the SS are given in [13], which also describes experimental results.
6 Extensions
An alternative technique we have been pursuing for approximating EM in our models is Sequential Rao-Blackwellized Monte Carlo. There, we sample state sequences $S$ from the posterior $p(S \mid Y)$ and, for a given sequence, perform exact inference on the source signals $X$ conditioned on that sequence (observe that given $S$, the posterior $p(X \mid S, Y)$ is Gaussian and can be computed exactly). In addition, we are extending our speech model to include features such as pitch [7] in order to improve separation performance, especially in cases with fewer sensors than sources [7–9]. Yet another extension is applying model selection techniques to infer the number of sources from data in a dynamic manner.
Acknowledgments
I thank Te-Won Lee for extremely valuable discussions.
References
[1] A.J. Bell, T.J. Sejnowski (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
[2] B.A. Pearlmutter, L.C. Parra (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. Proc. NIPS-96.
[3] A. Cichocki, S.-I. Amari (2002). Adaptive Blind Signal and Image Processing. Wiley.
[4] H. Attias (1999). Independent Factor Analysis. Neural Computation 11, 803-851.
[5] T.-W. Lee et al. (Eds.) (2001). Proc. ICA 2001.
[6] S. Griebel, M. Brandstein (2001). Microphone array speech dereverberation using coarse channel modeling. Proc. ICASSP 2001.
[7] J. Hershey, M. Casey (2002). Audiovisual source separation via hidden Markov models. Proc. NIPS 2001.
[8] S. Roweis (2001). One Microphone Source Separation. Proc. NIPS-00, 793-799.
[9] G.-J. Jang, T.-W. Lee, Y.-H. Oh (2003). A probabilistic approach to single channel blind signal separation. Proc. NIPS 2002.
[10] H. Attias, L. Deng, A. Acero, J.C. Platt (2001). A new method for speech denoising using probabilistic models for clean speech and for noise. Proc. Eurospeech 2001.
[11] Y. Ephraim (1992). Statistical model based speech enhancement systems. Proc. IEEE 80(10), 1526-1555.
[12] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods in graphical models. Machine Learning 37, 183-233.
[13] H. Attias (2003). New EM algorithms for source separation and deconvolution with a microphone array. Proc. ICASSP 2003.
6 0.36067116 29 nips-2002-Analysis of Information in Speech Based on MANOVA
7 0.34027064 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
8 0.30024996 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
9 0.26893887 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
10 0.23951091 67 nips-2002-Discriminative Binaural Sound Localization
11 0.23316973 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
12 0.21838969 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
13 0.20524392 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
14 0.17546339 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities
15 0.16998127 103 nips-2002-How Linear are Auditory Cortical Responses?
16 0.16804896 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model
17 0.1671651 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
18 0.16112104 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
19 0.1603919 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
20 0.15938117 167 nips-2002-Rational Kernels
topicId topicWeight
[(11, 0.114), (23, 0.021), (41, 0.026), (42, 0.046), (49, 0.276), (54, 0.103), (55, 0.039), (64, 0.013), (65, 0.024), (68, 0.044), (74, 0.107), (92, 0.014), (98, 0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.82538742 147 nips-2002-Monaural Speech Separation
Author: Guoning Hu, Deliang Wang
Abstract: Monaural speech separation has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with speech in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. Motivated by this, we propose a model for monaural separation that deals with low-frequency and high-frequency signals differently. For resolved harmonics, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the model generates segments based on amplitude modulation (AM) in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the separation process is a pitch contour obtained according to psychoacoustic constraints. Our model is systematically evaluated, and it yields substantially better performance than previous systems, especially in the high-frequency range.
1 Introduction
In a natural environment, speech usually occurs simultaneously with acoustic interference. An effective system for attenuating acoustic interference would greatly facilitate many applications, including automatic speech recognition (ASR) and speaker identification. Blind source separation using independent component analysis [10] or sensor arrays for spatial filtering require multiple sensors. In many situations, such as telecommunication and audio retrieval, a monaural (one microphone) solution is required, in which intrinsic properties of speech or interference must be considered. Various algorithms have been proposed for monaural speech enhancement [14]. These methods assume certain properties of interference and have difficulty in dealing with general acoustic interference.
Monaural separation has also been studied using phase-based decomposition [3] and statistical learning [17], but with only limited evaluation. While speech enhancement remains a challenge, the auditory system shows a remarkable capacity for monaural speech separation. According to Bregman [1], the auditory system separates the acoustic signal into streams, corresponding to different sources, based on auditory scene analysis (ASA) principles. Research in ASA has inspired considerable work to build computational auditory scene analysis (CASA) systems for sound separation [19] [4] [7] [18]. Such systems generally approach speech separation in two main stages: segmentation (analysis) and grouping (synthesis). In segmentation, the acoustic input is decomposed into sensory segments, each of which is likely to originate from a single source. In grouping, those segments that likely come from the same source are grouped together, based mostly on periodicity. In a recent CASA model by Wang and Brown [18], segments are formed on the basis of similarity between adjacent filter responses (cross-channel correlation) and temporal continuity, while grouping among segments is performed according to the global pitch extracted within each time frame. In most situations, the model is able to remove intrusions and recover low-frequency (below 1 kHz) energy of target speech. However, this model cannot handle high-frequency (above 1 kHz) signals well, and it loses much of target speech in the high-frequency range. In fact, the inability to deal with speech in the high-frequency range is a common problem for CASA systems. We study monaural speech separation with particular emphasis on the high-frequency problem in CASA. For voiced speech, we note that the auditory system can resolve the first few harmonics in the low-frequency range [16]. It has been suggested that different perceptual mechanisms are used to handle resolved and unresolved harmonics [2].
Consequently, our model employs different methods to segregate resolved and unresolved harmonics of target speech. More specifically, our model generates segments for resolved harmonics based on temporal continuity and cross-channel correlation, and these segments are grouped according to common periodicity. For unresolved harmonics, it is well known that the corresponding filter responses are strongly amplitude-modulated and the response envelopes fluctuate at the fundamental frequency (F0) of target speech [8]. Therefore, our model generates segments for unresolved harmonics based on common AM in addition to temporal continuity. The segments are grouped according to AM repetition rates. We calculate AM repetition rates via sinusoidal modeling, which is guided by target pitch estimated according to characteristics of natural speech. Section 2 describes the overall system. In section 3, systematic results and a comparison with the Wang-Brown system are given. Section 4 concludes the paper.
2 Model description
Our model is a multistage system, as shown in Fig. 1. Description for each stage is given below.
Figure 1. Schematic diagram of the proposed multistage system. [Block labels: Mixture; Peripheral and mid-level processing; Initial segregation; Pitch tracking; Unit labeling; Final segregation; Resynthesis; Segregated Speech.]
2.1 Initial processing
First, an acoustic input is analyzed by a standard cochlear filtering model with a bank of 128 gammatone filters [15] and subsequent hair cell transduction [12]. This peripheral processing is done in time frames 20 ms long with 10 ms overlap between consecutive frames. As a result, the input signal is decomposed into a group of time-frequency (T-F) units. Each T-F unit contains the response from a certain channel at a certain frame. The envelope of the response is obtained by a lowpass filter with passband [0, 1 kHz] and a Kaiser window of 18.25 ms.
Mid-level processing is performed by computing a correlogram (autocorrelation function) of the individual responses and their envelopes. These autocorrelation functions reveal response periodicities as well as AM repetition rates. The global pitch is obtained from the summary correlogram. For clean speech, the autocorrelations generally have peaks consistent with the pitch and their summation shows a dominant peak corresponding to the pitch period. With acoustic interference, a global pitch may not be an accurate description of the target pitch, but it is reasonably close.

Because a harmonic extends for a period of time and its frequency changes smoothly, target speech likely activates contiguous T-F units. This is an instance of the temporal continuity principle. In addition, since the passbands of adjacent channels overlap, a resolved harmonic usually activates adjacent channels, which leads to high cross-channel correlations. Hence, in initial segregation, the model first forms segments by merging T-F units based on temporal continuity and cross-channel correlation. Then the segments are grouped into a foreground stream and a background stream by comparing the periodicities of unit responses with global pitch. A similar process is described in [18].

Fig. 2(a) and Fig. 2(b) illustrate the segments and the foreground stream. The input is a mixture of a voiced utterance and cocktail-party noise (see Sect. 3). Since the intrusion is not strongly structured, most segments correspond to target speech. In addition, most segments are in the low-frequency range. The initial foreground stream successfully groups most of the major segments.

2.2 Pitch tracking

In the presence of acoustic interference, the global pitch estimated in mid-level processing is generally not an accurate description of target pitch. To obtain accurate pitch information, target pitch is first estimated from the foreground stream.
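The summary-correlogram pitch estimate used in mid-level processing can be sketched as follows: compute an autocorrelation per channel over a 20-ms window, sum across channels, and take the lag with the largest sum inside [2 ms, 12.5 ms]. This is a minimal sketch under stated assumptions. The backward-looking window indexing, the function names, and the two synthetic "channels" (pure tones at 100 Hz and 200 Hz sampled at an assumed 16 kHz) are illustrative choices, not the authors' implementation.

```python
import math


def unit_autocorrelation(response, end, window, max_lag):
    """A(lag) = sum over k of response[end - k] * response[end - k - lag].

    The sum runs over `window` samples ending at index `end`
    (assumed indexing; requires end >= window - 1 + max_lag).
    """
    return [
        sum(response[end - k] * response[end - k - lag] for k in range(window))
        for lag in range(max_lag + 1)
    ]


def summary_pitch_period(responses, end, fs,
                         frame_ms=20.0, min_ms=2.0, max_ms=12.5):
    """Pitch period in seconds from the summed autocorrelations."""
    window = int(round(fs * frame_ms / 1000.0))
    min_lag = int(round(fs * min_ms / 1000.0))
    max_lag = int(round(fs * max_ms / 1000.0))
    summary = [0.0] * (max_lag + 1)
    for response in responses:
        acf = unit_autocorrelation(response, end, window, max_lag)
        for lag, value in enumerate(acf):
            summary[lag] += value
    # Search only the plausible pitch range.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: summary[lag])
    return best_lag / fs


# Toy input: two harmonics of a 100 Hz source, one per "channel".
fs = 16000
tones = [[math.sin(2.0 * math.pi * f0 * t / fs) for t in range(600)]
         for f0 in (100.0, 200.0)]
period = summary_pitch_period(tones, end=599, fs=fs)
print(f"estimated pitch period: {1000.0 * period:.2f} ms")
```

For this toy input the second harmonic alone peaks at both 5 ms and 10 ms, while the sum peaks only at 10 ms, which is the ambiguity that summation across channels is meant to resolve.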
At each frame, the autocorrelation functions of T-F units in the foreground stream are summed. The pitch period is the lag corresponding to the maximum of the summation in the plausible pitch range: [2 ms, 12.5 ms]. Then we employ the following two constraints to check its reliability.

First, an accurate pitch period at a frame should be consistent with the periodicity of the T-F units at this frame in the foreground stream. At frame j, let τ(j) represent the estimated pitch period, and A(i, j, τ) the autocorrelation function of uij, the unit in channel i. uij agrees with τ(j) if

$$ A(i, j, \tau(j)) \,/\, A(i, j, \tau_m) > \theta_d \qquad (1) $$

Here, θd = 0.95, the same threshold used in [18], and τm is the lag corresponding to the maximum of A(i, j, τ) within [2 ms, 12.5 ms]. τ(j) is considered reliable if more than half of the units in the foreground stream at frame j agree with it.

Figure 2. Results of initial segregation for a speech and cocktail-party mixture. (a) Segments formed. Each segment corresponds to a contiguous black region. (b) Foreground stream. (Axes: time in seconds, 0 to 1.5; frequency in Hz, 80 to 5000.)

Second, pitch periods in natural speech vary smoothly in time [11]. We stipulate that the difference between reliable pitch periods at consecutive frames be smaller than 20% of the pitch period, as justified by pitch statistics. Unreliable pitch periods are replaced by new values extrapolated from reliable pitch points using temporal continuity. As an example, suppose at two consecutive frames j and j+1 that τ(j) is reliable while τ(j+1) is not. All the channels corresponding to the T-F units agreeing with τ(j) are selected. τ(j+1) is then obtained from the summation of the autocorrelations for the units at frame j+1 in those selected channels. Then the re-estimated pitch is further verified with the second constraint. For more details, see [9]. Fig.
3 illustrates the estimated pitch periods from the speech and cocktail-party mixture, which match the pitch periods obtained from clean speech very well.

Figure 3. Estimated target pitch for the speech and cocktail-party mixture, marked by “x”. The solid line indicates the pitch contour obtained from clean speech. (Axes: time in seconds; pitch period in ms.)

2.3 Unit labeling

With estimated pitch periods, (1) provides a criterion to label T-F units according to whether target speech dominates the unit responses or not. This criterion compares an estimated pitch period with the periodicity of the unit response. It is referred to as the periodicity criterion. It works well for resolved harmonics, and is used to label the units of the segments generated in initial segregation.

However, the periodicity criterion is not suitable for units responding to multiple harmonics because unit responses are amplitude-modulated. As shown in Fig. 4, for a filter response that is strongly amplitude-modulated (Fig. 4(a)), the target pitch corresponds to a local maximum, indicated by the vertical line, in the autocorrelation instead of the global maximum (Fig. 4(b)). Observe that for a filter responding to multiple harmonics of a harmonic source, the response envelope fluctuates at the rate of F0 [8]. Hence, we propose a new criterion for labeling the T-F units corresponding to unresolved harmonics by comparing AM repetition rates with estimated pitch. This criterion is referred to as the AM criterion.

Figure 4. AM effects. (a) Response of a filter with center frequency 2.6 kHz. (b) Corresponding autocorrelation. The vertical line marks the position corresponding to the pitch period of target speech. (Axes: (a) time in ms; (b) lag in ms.)

To obtain an AM repetition rate, the entire response of a gammatone filter is half-wave rectified and then band-pass filtered to remove the DC component and other possible
harmonics except for the F0 component. The rectified and filtered signal is then normalized by its envelope to remove the intensity fluctuations of the original signal, where the envelope is obtained via the Hilbert transform. Because the pitch of natural speech does not change noticeably within a single frame, we model the corresponding normalized signal within a T-F unit by a single sinusoid to obtain the AM repetition rate. Specifically,

$$ (f_{ij}, \phi_{ij}) = \arg\min_{f,\phi} \sum_{k=1}^{M} \left[ \hat{r}(i, jT - k) - \sin(2\pi k f / f_S + \phi) \right]^2, \quad \text{for } f \in [80\ \text{Hz}, 500\ \text{Hz}], \qquad (2) $$

where a square error measure is used. r̂(i, t) is the normalized filter response, fS is the sampling frequency, M spans a frame, and T = 10 ms is the progressing period from one frame to the next. In the above equation, fij gives the AM repetition rate for unit uij. Note that in the discrete case, a single sinusoid with a sufficiently high frequency can always match these samples perfectly. However, we are interested in finding a frequency within the plausible pitch range. Hence, the solution does not reduce to a degenerate case. With appropriately chosen initial values, this optimization problem can be solved effectively using iterative gradient descent (see [9]).

The AM criterion is used to label T-F units that do not belong to any segments generated in initial segregation; such segments, as discussed earlier, tend to miss unresolved harmonics. Specifically, unit uij is labeled as target speech if the final square error is less than half of the total energy of the corresponding signal and the AM repetition rate is close to the estimated target pitch:

$$ | f_{ij}\, \tau(j) - 1 | < \theta_f. \qquad (3) $$

Psychoacoustic evidence suggests that separating sounds with overlapping spectra requires a 6-12% difference in F0 [6]. Accordingly, we choose θf to be 0.12.
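To make the AM criterion concrete, the sketch below fits a unit-amplitude sinusoid to the normalized samples of one T-F unit as in Eq. (2) and then applies the two conditions stated above, including Eq. (3). Note that it swaps in a coarse grid search over f and φ in place of the gradient descent described in [9], purely to keep the example short. The function names, grid resolutions, and the 4 kHz sampling rate in the toy input are assumptions for illustration only.

```python
import math


def fit_am_rate(samples, fs, f_min=80.0, f_max=500.0,
                f_step=2.0, n_phases=32):
    """Grid search for the (f, phi) minimising the squared error of Eq. (2).

    samples[k - 1] holds the normalised response r_hat(i, jT - k),
    k = 1..M. Returns (f, phi, squared_error).
    """
    best = (f_min, 0.0, float("inf"))
    n_freqs = int(round((f_max - f_min) / f_step)) + 1
    for a in range(n_freqs):
        f = f_min + a * f_step
        for b in range(n_phases):
            phi = 2.0 * math.pi * b / n_phases
            err = 0.0
            for k, r in enumerate(samples, start=1):
                d = r - math.sin(2.0 * math.pi * k * f / fs + phi)
                err += d * d
                if err >= best[2]:
                    break  # already no better than the best candidate
            if err < best[2]:
                best = (f, phi, err)
    return best


def is_target_by_am(samples, fs, pitch_period_s, theta_f=0.12):
    """Label a unit as target if the fit is good and Eq. (3) holds."""
    f, _phi, err = fit_am_rate(samples, fs)
    energy = sum(r * r for r in samples)
    good_fit = err < 0.5 * energy
    rate_matches = abs(f * pitch_period_s - 1.0) < theta_f
    return good_fit and rate_matches


# Toy unit: a clean 200 Hz modulation, 20 ms at an assumed 4 kHz rate.
fs = 4000
unit = [math.sin(2.0 * math.pi * k * 200.0 / fs) for k in range(1, 81)]
print(is_target_by_am(unit, fs, pitch_period_s=1.0 / 200.0))
print(is_target_by_am(unit, fs, pitch_period_s=1.0 / 250.0))
```

The first call pairs the unit with a pitch estimate of 200 Hz, so the product f·τ equals 1 and the unit is accepted; the second uses 250 Hz, giving a 20% mismatch that exceeds θf = 0.12. A grid search is much slower than gradient descent at finer resolutions and is used here only because it needs no initial values.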
2.4 Final segregation and resynthesis

For adjacent channels responding to unresolved harmonics, although their responses may be quite different, they exhibit similar AM patterns and their response envelopes are highly correlated. Therefore, for T-F units labeled as target speech, segments are generated based on cross-channel envelope correlation in addition to temporal continuity.

The spectra of target speech and intrusion often overlap and, as a result, some segments generated in initial segregation contain both units where target speech dominates and those where intrusion dominates. Given unit labels generated in the last stage, we further divide the segments in the foreground stream, SF, so that all the units in a segment have the same label. Then the streams are adjusted as follows. First, since segments for speech usually are at least 50 ms long, segments with the target label are retained in SF only if they are no shorter than 50 ms. Second, segments with the intrusion label are added to the background stream, SB, if they are no shorter than 50 ms. The remaining segments are removed from SF, becoming undecided.

Finally, other units are grouped into the two streams by temporal and spectral continuity. First, SB expands iteratively to include undecided segments in its neighborhood. Then, all the remaining undecided segments are added back to SF. Individual units that do not belong to either stream are grouped into SF iteratively if they are labeled as target speech and lie in the neighborhood of SF. The resulting SF is the final segregated stream of target speech.

Fig. 5(a) shows the new segments generated in this process for the speech and cocktail-party mixture. Fig. 5(b) illustrates the segregated stream from the same mixture. Fig. 5(c) shows all the units where target speech is stronger than intrusion. The foreground stream generated by our algorithm contains most of the units where target speech is stronger.
In addition, only a small number of units where intrusion is stronger are incorrectly grouped into it.

A speech waveform is resynthesized from the final foreground stream. Here, the foreground stream works as a binary mask. It is used to retain the acoustic energy from the mixture that corresponds to 1’s and reject the mixture energy corresponding to 0’s. For more details, see [19].

3 Evaluation and comparison

Our model is evaluated with a corpus of 100 mixtures composed of 10 voiced utterances mixed with 10 intrusions collected by Cooke [4]. The intrusions have considerable variety. Specifically, they are: N0 - 1 kHz pure tone, N1 - white noise, N2 - noise bursts, N3 - “cocktail party” noise, N4 - rock music, N5 - siren, N6 - trill telephone, N7 - female speech, N8 - male speech, and N9 - female speech.

Given our decomposition of an input signal into T-F units, we suggest the use of an ideal binary mask as the ground truth for target speech. The ideal binary mask is constructed as follows: a T-F unit is assigned one if the target energy in the corresponding unit is greater than the intrusion energy and zero otherwise. Theoretically speaking, an ideal binary mask gives a performance ceiling for all binary masks. Figure 5(c) illustrates the ideal mask for the speech and cocktail-party mixture. Ideal masks are also well suited to situations where more than one target needs to be segregated or the target changes dynamically. The use of ideal masks is supported by the auditory masking phenomenon: within a critical band, a weaker signal is masked by a stronger one [13]. In addition, an ideal mask gives excellent resynthesis for a variety of sounds and is similar to the a priori mask used in a recent ASR study that yields excellent recognition performance [5].

The speech waveform resynthesized from the final foreground stream is used for evaluation, and it is denoted by S(t). The speech waveform resynthesized from the ideal binary mask is denoted by I(t).
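The ideal binary mask defined above reduces to an element-wise comparison of per-unit energies. The sketch below assumes the target and intrusion energies have already been computed for every T-F unit from the premixed signals and are stored as nested lists indexed by channel and frame; that storage format and the function name are hypothetical.

```python
def ideal_binary_mask(target_energy, intrusion_energy):
    """Return 1 where target energy exceeds intrusion energy, else 0.

    Both arguments are lists of rows (one row per channel, one entry
    per frame) with matching shapes. Ties are assigned 0 because the
    definition requires the target energy to be strictly greater.
    """
    return [
        [1 if t > n else 0 for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target_energy, intrusion_energy)
    ]


# Toy example with two channels and two frames.
target = [[4.0, 1.0], [0.5, 2.0]]
intrusion = [[1.0, 1.0], [2.0, 0.1]]
mask = ideal_binary_mask(target, intrusion)
```

In this toy case the unit with equal energies is masked out, so the result keeps one unit per channel.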
Furthermore, let e1(t) denote the signal present in I(t) but missing from S(t), and e2(t) the signal present in S(t) but missing from I(t). Then, the relative energy loss, REL, and the relative noise residue, RNR, are calculated as follows:

$$ R_{EL} = \frac{\sum_t e_1^2(t)}{\sum_t I^2(t)}, \qquad (4a) $$

$$ R_{NR} = \frac{\sum_t e_2^2(t)}{\sum_t S^2(t)}. \qquad (4b) $$

Figure 5. Results of final segregation for the speech and cocktail-party mixture. (a) New segments formed in the final segregation. (b) Final foreground stream. (c) Units where target speech is stronger than the intrusion. (Axes: time in seconds, 0 to 1; frequency in Hz, 80 to 5000.)

Table 1: REL and RNR

             Proposed model        Wang-Brown model
Intrusion    REL (%)   RNR (%)     REL (%)   RNR (%)
N0            2.12      0.02        6.99      0
N1            4.66      3.55       28.96      1.61
N2            1.38      1.30        5.77      0.71
N3            3.83      2.72       21.92      1.92
N4            4.00      2.27       10.22      1.41
N5            2.83      0.10        7.47      0
N6            1.61      0.30        5.99      0.48
N7            3.21      2.18        8.61      4.23
N8            1.82      1.48        7.27      0.48
N9            8.57     19.33       15.81     33.03
Average       3.40      3.32       11.91      4.39

Figure 6. SNR results for segregated speech. White bars show the results from the proposed model, gray bars those from the Wang-Brown system, and black bars those of the mixtures. (Axes: intrusion type, N0 to N9; SNR in dB.)

The results from our model are shown in Table 1. Each value represents the average of one intrusion with 10 voiced utterances. A further average across all intrusions is also shown in the table. On average, our system retains 96.60% of target speech energy, and the relative residual noise is kept at 3.32%. As a comparison, Table 1 also shows the results from the Wang-Brown model [18], whose performance is representative of current CASA systems. As shown in the table, our model reduces REL significantly. In addition, REL and RNR are balanced in our system.
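Eqs. (4a) and (4b) are two energy ratios, which the following sketch computes from sampled waveforms. It assumes that e1(t) and e2(t) have already been obtained and are passed in as lists of samples; how they are resynthesized is outside this example, and the function names and toy numbers are illustrative only.

```python
def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def relative_energy_loss(e1, ideal):
    """REL of Eq. (4a): energy of e1(t) relative to I(t)."""
    return energy(e1) / energy(ideal)


def relative_noise_residue(e2, segregated):
    """RNR of Eq. (4b): energy of e2(t) relative to S(t)."""
    return energy(e2) / energy(segregated)


# Toy waveforms (arbitrary values, not data from the paper).
ideal = [1.0, 1.0, 1.0, 0.0]        # I(t)
segregated = [0.0, 1.0, 1.0, 1.0]   # S(t)
e1 = [1.0, 0.0, 0.0, 0.0]           # in I(t) but missing from S(t)
e2 = [0.0, 0.0, 0.0, 1.0]           # in S(t) but missing from I(t)
rel = relative_energy_loss(e1, ideal)
rnr = relative_noise_residue(e2, segregated)
```

Both ratios come out to one third for these toy values. Because REL is normalized by the ideal-mask signal and RNR by the segregated signal, the two measures are not symmetric in general.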
Finally, to compare waveforms directly we measure a form of signal-to-noise ratio (SNR) in decibels using the resynthesized signal from the ideal binary mask as ground truth:

$$ SNR = 10 \log_{10} \left[ \frac{\sum_t I^2(t)}{\sum_t \left( I(t) - S(t) \right)^2} \right]. \qquad (5) $$

The SNR for each intrusion averaged across 10 target utterances is shown in Fig. 6, together with the results from the Wang-Brown system and the SNR of the original mixtures. Our model achieves an average SNR gain of around 12 dB, and an improvement of about 5 dB over the Wang-Brown model.

4 Discussion

The main feature of our model lies in using different mechanisms to deal with resolved and unresolved harmonics. As a result, our model is able to recover target speech and reduce noise interference in the high-frequency range where harmonics of target speech are unresolved.

The proposed system considers the pitch contour of the target source only. However, it is possible to track the pitch contour of the intrusion if it has a harmonic structure. With two pitch contours, one could label a T-F unit more accurately by comparing whether its periodicity is more consistent with one or the other. Such a method is expected to lead to better performance for the two-speaker situation, e.g. N7 through N9. As indicated in Fig. 6, the performance gain of our system for such intrusions is relatively limited.

Our model is limited to separation of voiced speech. In our view, unvoiced speech poses the biggest challenge for monaural speech separation. Other grouping cues, such as onset, offset, and timbre, have been demonstrated to be effective for human ASA [1], and may play a role in grouping unvoiced speech. In addition, one should consider the acoustic and phonetic characteristics of individual unvoiced consonants. We plan to investigate these issues in future study.

Acknowledgments

We thank G. J. Brown and M. Wu for helpful comments. Preliminary versions of this work were presented in 2001 IEEE WASPAA and 2002 IEEE ICASSP.
This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F4962001-1-0027).

References

[1] A. S. Bregman, Auditory scene analysis, Cambridge MA: MIT Press, 1990.
[2] R. P. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms?” J. Acoust. Soc. Am., Vol. 95, pp. 3541-3554, 1994.
[3] G. Cauwenberghs, “Monaural separation of independent acoustical components,” In Proc. of IEEE Symp. Circuit & Systems, 1999.
[4] M. Cooke, Modeling auditory processing and organization, Cambridge U.K.: Cambridge University Press, 1993.
[5] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., Vol. 34, pp. 267-285, 2001.
[6] C. J. Darwin and R. P. Carlyon, “Auditory grouping,” in Hearing, B. C. J. Moore, Ed., San Diego CA: Academic Press, 1995.
[7] D. P. W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science, 1996.
[8] H. Helmholtz, On the sensations of tone, Braunschweig: Vieweg & Son, 1863. (A. J. Ellis, English Trans., Dover, 1954.)
[9] G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” Technical Report TR6, Ohio State University Department of Computer and Information Science, 2002. (available at www.cis.ohio-state.edu/~hu)
[10] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, New York: Wiley, 2001.
[11] W. J. M. Levelt, Speaking: From intention to articulation, Cambridge MA: MIT Press, 1989.
[12] R. Meddis, “Simulation of auditory-neural transduction: further studies,” J. Acoust. Soc. Am., Vol. 83, pp. 1056-1063, 1988.
[13] B. C. J. Moore, An Introduction to the psychology of hearing, 4th Ed., San Diego CA: Academic Press, 1997.
[14] D. O’Shaughnessy, Speech communications: human and machine, 2nd Ed., New York: IEEE Press, 2000.
[15] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC, Applied Psychology Unit, Cambridge U.K., 1988.
[16] R. Plomp and A. M. Mimpen, “The ear as a frequency analyzer II,” J. Acoust. Soc. Am., Vol. 43, pp. 764-767, 1968.
[17] S. Roweis, “One microphone source separation,” In Advances in Neural Information Processing Systems 13 (NIPS’00), 2001.
[18] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Networks, Vol. 10, pp. 684-697, 1999.
[19] M. Weintraub, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering, 1985.
2 0.77524281 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
Author: Dmitry Y. Pavlov, David M. Pennock
Abstract: We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic (conditions typical of many recommendation applications). We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the a priori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
3 0.58646798 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations
Author: Elzbieta Pekalska, David Tax, Robert Duin
Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.
4 0.5810082 174 nips-2002-Regularized Greedy Importance Sampling
Author: Finnegan Southey, Dale Schuurmans, Ali Ghodsi
Abstract: Greedy importance sampling is an unbiased estimation technique that reduces the variance of standard importance sampling by explicitly searching for modes in the estimation objective. Previous work has demonstrated the feasibility of implementing this method and proved that the technique is unbiased in both discrete and continuous domains. In this paper we present a reformulation of greedy importance sampling that eliminates the free parameters from the original estimator, and introduces a new regularization strategy that further reduces variance without compromising unbiasedness. The resulting estimator is shown to be effective for difficult estimation problems arising in Markov random field inference. In particular, improvements are achieved over standard MCMC estimators when the distribution has multiple peaked modes.
5 0.57764012 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results.
7 0.54564488 163 nips-2002-Prediction and Semantic Association
8 0.53528941 41 nips-2002-Bayesian Monte Carlo
9 0.53383029 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
10 0.52737349 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
11 0.52683353 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
12 0.52086991 2 nips-2002-A Bilinear Model for Sparse Coding
13 0.52085447 169 nips-2002-Real-Time Particle Filters
14 0.52018046 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
15 0.51551408 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
16 0.51452559 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
17 0.51435697 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
18 0.51364583 39 nips-2002-Bayesian Image Super-Resolution
19 0.51332748 124 nips-2002-Learning Graphical Models with Mercer Kernels
20 0.51291358 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering