nips nips2000 nips2000-96 knowledge-graph by maker-knowledge-mining

96 nips-2000-One Microphone Source Separation


Source: pdf

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. [sent-5, score-0.592]

2 Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. [sent-6, score-0.497]

3 I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function . [sent-7, score-1.06]

4 I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering. [sent-8, score-0.72]

5 1 Learning from data in computational auditory scene analysis Imagine listening to many pianos being played simultaneously. [sent-9, score-0.25]

6 But if each were playing a coherent song, separation would be much easier because of the structure of music. [sent-11, score-0.195]

7 Now imagine teaching a computer to do the separation by showing it many musical scores as "training data". [sent-12, score-0.33]

8 Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. [sent-13, score-0.373]

9 Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. [sent-14, score-0.194]

10 Bregman [1] called this process auditory scene analysis in analogy to vision. [sent-15, score-0.25]

11 Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings and is very similar to the musical task described above. [sent-16, score-0.476]

12 These "unmixing" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. [sent-23, score-0.24]

13 The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem - isolating acoustic objects from only a single microphone recording. [sent-25, score-0.645]

14 Unmixing algorithms reweight multiple simultaneous recordings mk(t) (generically called microphones) to form a new source object s(t): s(t) = α1 m1(t) + α2 m2(t) + ... + αK mK(t)   (1), where s(t) is the estimated source and the mk(t) are the microphone signals. [sent-27, score-0.389]

15 The unmixing coefficients αi are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. [sent-45, score-0.646]
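As an aside for readers who want to experiment, equation (1) amounts to a few lines of NumPy. The sketch below uses hypothetical names and a toy mixing matrix; real unmixing algorithms such as ICA estimate the coefficients from the recordings rather than from a known mixing matrix.

```python
import numpy as np

def unmix(mics, alphas):
    """Form one estimated source as a time-invariant weighted sum of microphone signals.

    mics   : (K, T) array, one row per microphone recording m_k(t)
    alphas : length-K vector of constant unmixing coefficients alpha_k
    """
    return np.asarray(alphas) @ np.asarray(mics)   # s(t) = sum_k alpha_k * m_k(t)

# toy usage: two microphones observing two sources through a known mixing matrix A
T = 8000
sources = np.random.laplace(size=(2, T))           # heavy-tailed, speech-like stand-ins
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
mics = A @ sources
s_hat = unmix(mics, np.linalg.inv(A)[0])           # ideal weights recover the first source
```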

16 The intuition is that unmixing algorithms are finding spikes (or dents for low kurtosis sources) in the marginal amplitude histogram. [sent-46, score-0.2]

17 Humans, on the other hand, cannot hear histogram spikes¹ and perform well on many monaural separation tasks. [sent-49, score-0.337]

18 There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. [sent-52, score-0.206]

19 This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. [sent-53, score-0.187]

20 To generate these parts, one can perform multiband analysis - break the original signal y(t) into many subband signals bi(t), each filtered to contain only energy from a small portion of the spectrum. [sent-54, score-0.545]
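A rough sketch of this multiband analysis, assuming SciPy is available; the band edges, filter order, and sampling rate below are arbitrary illustrative choices, not the filterbank used in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def multiband(y, fs, edges, order=4):
    """Split y(t) into subband signals b_i(t), one per adjacent pair of band edges (in Hz)."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
        bands.append(sosfilt(sos, y))
    return np.array(bands)                       # shape (num_bands, len(y))

fs = 22050
y = np.random.randn(fs)                          # stand-in for one second of audio
edges = np.geomspace(100.0, 8000.0, 20)          # 19 roughly log-spaced bands
b_i = multiband(y, fs, edges)
```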

21 The results of such an analysis are often displayed as a spectrogram which shows energy (using colour or grayscale) as a function of time (ordinate) and frequency (abscissa). [sent-55, score-0.363]

22 In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key. [sent-57, score-0.493]

23 The basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals bi(t). [sent-58, score-0.774]

24 Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. [sent-59, score-0.405]

25 Given a set of masking signals, denoted αi(t), a source s(t) can be recovered by modulating the corresponding subband signals from the original input and summing: [sent-60, score-0.825]

26 s(t) = α1(t) b1(t) + α2(t) b2(t) + ... + αK(t) bK(t)   (2), where s(t) is the estimated source and the bi(t) are the sub-band signals. The αi(t) are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. [sent-74, score-0.498]
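Equation (2) is straightforward once the subbands and masks are in hand. A minimal sketch, assuming the masks are computed at a frame rate and held constant over `hop` audio samples (names are hypothetical):

```python
import numpy as np

def refilter(subbands, frame_masks, hop):
    """Recover a source by modulating each subband b_i(t) with its masking signal alpha_i(t).

    subbands    : (K, T) array of subband signals b_i(t)
    frame_masks : (K, num_frames) array of masks (binary or in [0, 1]) at the frame rate,
                  with num_frames * hop >= T
    hop         : number of audio samples per mask frame
    """
    T = subbands.shape[1]
    masks = np.repeat(frame_masks, hop, axis=1)[:, :T]   # piecewise-constant alpha_i(t)
    return np.sum(masks * subbands, axis=0)              # s(t) = sum_i alpha_i(t) * b_i(t)
```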

27 For any specific choice of masking signals αi(t), refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. [sent-80, score-1.343]

28 Different sources can be isolated by choosing different masking signals. [sent-81, score-0.655]

29 This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. [sent-83, score-0.145]

30 (Think of ignoring collisions by assuming separate piano players do not often hit the same note at the same time.) [sent-85, score-0.163]

31 ¹Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back. [sent-86, score-0.3]
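The footnote's experiment is easy to reproduce; a small sketch, assuming a stereo mixture has already been loaded as a (T, 2) NumPy array (audio file I/O omitted). The permutation leaves the joint amplitude histogram unchanged, which is precisely why histogram-based criteria alone cannot explain monaural stream perception.

```python
import numpy as np

def permute_samples(stereo, seed=0):
    """Randomly permute the time order of samples, identically in both channels."""
    idx = np.random.default_rng(seed).permutation(stereo.shape[0])
    return stereo[idx]   # same joint amplitude histogram; the auditory streams are destroyed
```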

32 Figure 1: The refiltering approach to one microphone source separation. [sent-88, score-0.596]

33 Multiband analysis of the original signal y(t) gives sub-band signals bi(t) which are modulated by masking signals ai(t) (binary or real valued between 0 and 1) and recombined to give the estimated source or object s(t). [sent-89, score-0.963]

34 Refiltering can also be thought of as a highly nonstationary Wiener filter in which both the signal and noise spectra are re-estimated at a rate 1/T; the binary assumption is equivalent to assuming that over a timescale T the signal and noise spectra are nonoverlapping. [sent-90, score-0.47]

35 It is a fortunate empirical fact that refiltering, even with binary masking signals, can cleanly separate sources from a single mixed recording. [sent-91, score-0.756]

36 This can be demonstrated by taking several isolated sources or noises and mixing them in a controlled way. [sent-92, score-0.362]

37 Since the original components are known, an "optimal" set of masking signals can be computed. [sent-93, score-0.558]

38 For example, we might set αi(t) equal to the ratio of energy from one source in band i around times t ± T to the sum of energies from all sources in the same band at that time (as recommended by the Wiener filter) or to a binary version which thresholds this ratio. [sent-94, score-0.608]
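When the isolated components are available, such "optimal" masks can be computed directly from the component spectrograms. A sketch using scipy.signal.stft; the FFT length and the 0.5 threshold for the binary version are illustrative assumptions, and all signals are assumed to have the same length.

```python
import numpy as np
from scipy.signal import stft

def oracle_masks(target, others, fs, nperseg=512, thresh=0.5):
    """Wiener-style ratio mask (and its binary threshold) for one known source in a known mixture."""
    _, _, S = stft(target, fs=fs, nperseg=nperseg)
    num = np.abs(S) ** 2                                   # energy from the target source per cell
    den = num.copy()
    for other in others:
        _, _, O = stft(other, fs=fs, nperseg=nperseg)
        den = den + np.abs(O) ** 2                         # energy from all sources in the same cell
    ratio = num / (den + 1e-12)
    return ratio, (ratio > thresh).astype(float)
```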

39 3 Multiband grouping as a statistical pattern recognition problem Since one-microphone source separation using refiltering is possible if the masking signals are well chosen, the essential problem becomes: how can the αi(t) be computed automatically from a single mixed recording? [sent-96, score-1.499]

40 The goal is to group or "tag" together regions of the spectrogram that belong to the same auditory object. [sent-97, score-0.346]

41 Thus, frequencies which exhibit common onsets, offsets, or upward/downward sweeps are more likely to be grouped into the same stream (figure 2). [sent-101, score-0.114]

42 Also, many real world sounds have harmonic spectra; so frequencies which lie exactly on a harmonic "stack" are often perceptually grouped together. [sent-102, score-0.329]

43 (Musically, piano players do not hit keys randomly, but instead use chords and repeated melodies. [sent-103, score-0.204]

44 Figure 2: Examples of three common grouping cues for energy which often comes from a single source. [sent-107, score-0.306]

45 (left) Frequencies which lie exactly on harmonic multiples of a single base frequency. [sent-108, score-0.13]

46 Luckily it is very easy to generate such data by mixing isolated sources in a controlled way, although the subsequent supervised learning can be difficult. [sent-113, score-0.452]

47 Figure 3: Each point represents the energy from one source versus another in a narrow frequency band over a 32ms window. [sent-114, score-0.391]

48 The plot shows all frequencies over a 2 second period from a speech mixture. [sent-115, score-0.176]

49 Typically when one source has large energy the other does not. [sent-116, score-0.28]

50 4 Results using factorial-max HMMs Here, I will describe one (purely unsupervised) method I have pursued for automatically generating masking signals from a single microphone. [sent-119, score-0.584]

51 The approach first trains speaker dependent hidden Markov models (HMMs) on isolated data from single talkers. [sent-120, score-0.285]

52 These pre-trained models are then combined in a particular way to build a separation system. [sent-121, score-0.234]

53 HMM training was initialized by first training a mixture of Gaussians on each speaker's data (with a single shared covariance matrix) independent of time order. [sent-124, score-0.113]
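This initialization step can be sketched with scikit-learn, where covariance_type='tied' gives a mixture of Gaussians sharing a single covariance matrix; the feature array and component count below are placeholder assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

features = np.random.randn(5000, 64)        # stand-in for one speaker's spectrogram-derived frames
gmm = GaussianMixture(n_components=64, covariance_type="tied").fit(features)

emission_means = gmm.means_                 # one mean per HMM state, used to seed the emission model
shared_cov = gmm.covariances_               # the single covariance matrix shared by all states
```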

54 Next, to separate a new single recording which is a mixture of known speakers, these pretrained models are combined into a factorial hidden Markov model (FHMM) architecture [5]. [sent-127, score-0.245]

55 A simple way to model this dependence is to have each chain c independently propose an output y^c and then combine them to generate the observation according to some rule y_t = Q(y_t^1, y_t^2, ..., y_t^C). [sent-130, score-0.209]

56 At each time, one chain proposes an output vector a_{x_t} and the other proposes b_{z_t}. [sent-134, score-0.112]

57 The key part of the model is the function Q: observations are generated by taking the elementwise maximum of the proposals and adding noise. [sent-135, score-0.126]

58 This maximum operation reflects the observation that the log magnitude spectrogram of a mixture of sources is very nearly the elementwise maximum of the individual spectrograms. [sent-136, score-0.636]
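This log-max property is easy to check numerically; a small sketch comparing the log magnitude spectrum of a summed pair of signals with the elementwise maximum of their individual log magnitude spectra (noise is used here only as a stand-in for real audio):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1024), rng.standard_normal(1024)

log_mix = np.log(np.abs(np.fft.rfft(a + b)) + 1e-12)
log_max = np.maximum(np.log(np.abs(np.fft.rfft(a)) + 1e-12),
                     np.log(np.abs(np.fft.rfft(b)) + 1e-12))

# the approximation is tight wherever one component dominates a frequency bin
print(float(np.mean(np.abs(log_mix - log_max))))
```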

59 [Model equations (3)-(5); the emission is y_t ~ N(max[a_{x_t}, b_{z_t}], R).] ³Recall that refiltering can only isolate one auditory stream at a time from the scene (we are always separating "a source" from "the background"). [sent-138, score-0.599]

60 This makes learning the masking signals an unusual problem because for any input (spectrogram) there are as many correct answers as objects in the scene. [sent-139, score-0.493]

61 Such a highly multimodal distribution on outputs given inputs means that the mapping from auditory input to masking signals cannot be learned using backprop or other single-valued function approximators which take the average of the possible maskings present in the training data. [sent-140, score-0.683]

62 ⁴The observations are created by concatenating the values of 2 adjacent columns of the log magnitude periodogram into a single vector. [sent-141, score-0.135]

63 Average signal energy was normalized across the most recent 8 frames before computing each DFT. [sent-145, score-0.171]
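A sketch of that feature construction, following the footnote's description; the framing, windowing, and exact normalization details are simplified assumptions for illustration.

```python
import numpy as np

def fhmm_features(frames):
    """frames: (N, L) array of windowed audio frames in time order; returns one observation per adjacent frame pair."""
    logmags, obs = [], []
    for n in range(frames.shape[0]):
        recent = frames[max(0, n - 7):n + 1]
        gain = np.sqrt(np.mean(recent ** 2)) + 1e-12        # average energy of the most recent 8 frames
        logmags.append(np.log(np.abs(np.fft.rfft(frames[n] / gain)) + 1e-12))
        if n > 0:
            obs.append(np.concatenate([logmags[n - 1], logmags[n]]))   # 2 adjacent columns per observation
    return np.array(obs)
```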

64 Here N(μ, Σ) denotes a Gaussian with mean μ and covariance Σ, and max[·,·] is the elementwise maximum operation on two vectors. [sent-148, score-0.12]

65 It ignores two aspects of the spectrogram data: first, Gaussian noise is used although the observations are nonnegative; second, the probability factor requiring the non-maximum output proposal to be less than the maximum proposal is missing. [sent-150, score-0.328]

66 Observations y_t are the elementwise max of the individual emission vectors, max[a_{x_t}, b_{z_t}], plus Gaussian noise. [sent-154, score-0.181]

67 In the experiment presented below, each chain represents a speaker dependent HMM (one male and one female). [sent-155, score-0.271]

68 The emission and transition probabilities from each speaker's pretrained HMM were used as the parameters for the combined FHMM. [sent-156, score-0.132]

69 Since the hidden chains share a single visible output variable, naive inference in the FHMM graphical model yields an intractable amount of work exponential in the size of the state space of each submodel. [sent-163, score-0.182]

70 However, because all of the observations are nonnegative and the max operation is used to combine output proposals, there is an efficient trick for computing the best joint state trajectory. [sent-164, score-0.175]

71 At each time, we can upper bound the log-probability of generating the observation vector if one chain is in state i, no matter what state the other chain is in. [sent-165, score-0.204]

72 With these bounds in hand, each time we evaluate the probability of a specific pair of states we can eliminate from consideration all state settings of either chain whose bounds are worse than the achieved probability. [sent-167, score-0.146]
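The pruning idea can be sketched for a single time step as a best-first search over state pairs. The sketch below treats `bound_x` and `bound_z` as given per-state upper bounds on the observation log-likelihood; how those bounds are computed (the paper exploits the nonnegative observations and the max rule) and how this interacts with the transition terms of the full Viterbi recursion are not reproduced here.

```python
import numpy as np

def best_pair(log_like_pair, bound_x, bound_z):
    """Find the most likely (x, z) state pair without evaluating every combination.

    log_like_pair(i, j) : exact observation log-likelihood for the pair of chain states (i, j)
    bound_x[i]          : upper bound on that value over all states j of the other chain
    bound_z[j]          : upper bound over all states i
    """
    best, argbest = -np.inf, None
    order_z = np.argsort(-bound_z)                  # most promising states first
    for i in np.argsort(-bound_x):
        if bound_x[i] <= best:
            break                                   # no remaining state of chain x can beat `best`
        for j in order_z:
            if bound_z[j] <= best:
                break
            val = log_like_pair(i, j)
            if val > best:
                best, argbest = val, (i, j)
    return argbest, best
```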

73 The training data for the model consists only of spectrograms of isolated examples of each speaker but inference can be done on test data which is a spectrogram of a single mixture of known speakers. [sent-170, score-0.612]

74 The results of separating a simple two speaker mixture are shown below. [sent-171, score-0.182]

75 The test utterance was formed by linearly mixing two out-of-sample utterances (one male and one female) from the same speakers as the models were trained on. [sent-172, score-0.184]

76 Figure 5 shows the original mixed spectrogram (top left) as well as the sequence of outputs a_{x_t} (bottom left) and b_{z_t} (bottom right) from each chain. [sent-173, score-0.389]

77 The FHMM system achieves good separation from only a single microphone (see figure 6). [sent-175, score-0.348]

78 Figure 5: (top left) Original spectrogram of mixed utterance. [sent-177, score-0.324]

79 (bottom) Male and female spectrograms predicted by factorial HMM and used to compute refiltering masks. [sent-178, score-0.537]

80 (top right) Masking signals αi(t), computed by comparing the magnitudes of each model's predictions. [sent-179, score-0.144]
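Turning the two chains' predicted spectrograms into masking signals is then a per-cell comparison; a minimal sketch with hypothetical names, using hard (binary) masks:

```python
import numpy as np

def masks_from_predictions(pred_male, pred_female):
    """Binary masks from the two models' predicted log magnitude spectrograms (equal shapes)."""
    mask_male = (pred_male >= pred_female).astype(float)   # cells the male model accounts for
    return mask_male, 1.0 - mask_male                      # remaining cells go to the female model
```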

81 5 Conclusions In this paper I have argued for the marriage of learning algorithms with the refiltering approach to CASA. [sent-180, score-0.308]

82 I have presented results from a simple factorial HMM system on a speaker dependent separation problem which indicate that automatically learned one-microphone separation systems may be possible. [sent-181, score-0.646]

83 In the machine learning community, the one-microphone separation problem has received much less attention than unmixing problems, while CASA researchers have not employed automatic learning techniques to full effect. [sent-182, score-0.358]

84 Scene analysis is an interesting and challenging learning problem with exciting and practical applications, and the refiltering setup has many nice properties. [sent-183, score-0.308]

85 First, it can work if the masking signals are chosen properly. [sent-184, score-0.493]

86 Third, a good learning algorithm, when presented with enough data, should automatically discover the sorts of grouping cues which have been built into existing systems by hand. [sent-186, score-0.202]

87 Furthermore, in the refiltering paradigm there is no need to make a hard decision about the number of sources present in an input. [sent-187, score-0.498]

88 Each proposed masking has an associated score or probability; groupings with high scores can be considered "sources", while ones with low scores might be parts of the background or mixtures of other faint sources. [sent-188, score-0.455]

89 CASA returns a collection of candidate maskings and their associated scores, and then it is up to the user to decide - based on the range of scores - the number of sources in the scene. [sent-189, score-0.237]

90 Many existing approaches to speech and audio processing have the potential to be applied to the monaural source separation problem. [sent-190, score-0.598]

91 Wan and Nelson have developed dual EKF methods [8] and applied them to speech denoising, but have also informally demonstrated their potential application to monaural source separation. [sent-192, score-0.485]

92 Attias and colleagues [9] developed a fully probabilistic model of speech in noise and used variational Bayesian techniques to perform inference and learning allowing denoising and dereverberation; their approach clearly has the potential to be applied to the separation problem as well. [sent-193, score-0.441]

93 Cauwenberghs [10] has a very promising approach to the problem for purely harmonic signals that takes advantage of powerful phase constraints which are ignored by other algorithms. [sent-194, score-0.224]

94 Learning models of isolated sounds may be useful for developing feature detectors; conjunctions of such feature detectors can then be trained in a supervised fashion using labeled data. [sent-196, score-0.224]

95 Figure 6: Test separation results, using a 2-chain speaker dependent factorial-max HMM, followed by refiltering. [sent-197, score-0.314]

96 Moore (1990) Hidden Markov model decomposition of speech and noise, IEEE Conf. [sent-234, score-0.123]

97 Young (1996) Robust continuous speech recognition using parallel model combination, IEEE Trans. [sent-240, score-0.123]

98 Nelson (1998) Removal of noise from speech using the dual EKF algorithm, IEEE Conf. [sent-246, score-0.164]

99 Acero (2001) Speech denoising and dereverberation using probabilistic models, this volume. [sent-252, score-0.123]

100 Cauwenberghs (1999) Monaural separation of independent acoustical components, IEEE Symp. [sent-254, score-0.195]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('masking', 0.349), ('refiltering', 0.308), ('spectrogram', 0.203), ('separation', 0.195), ('sources', 0.19), ('source', 0.185), ('unmixing', 0.163), ('signals', 0.144), ('hmm', 0.143), ('auditory', 0.143), ('casa', 0.142), ('speech', 0.123), ('mixed', 0.121), ('speaker', 0.119), ('isolated', 0.116), ('acoustic', 0.112), ('scene', 0.107), ('grouping', 0.106), ('microphone', 0.103), ('factorial', 0.096), ('fhmm', 0.095), ('monaural', 0.095), ('energy', 0.095), ('chains', 0.092), ('elementwise', 0.082), ('subband', 0.082), ('musical', 0.082), ('denoising', 0.082), ('zt', 0.08), ('male', 0.08), ('harmonic', 0.08), ('signal', 0.076), ('female', 0.072), ('chain', 0.072), ('mic', 0.071), ('multiband', 0.071), ('piano', 0.071), ('frequency', 0.065), ('original', 0.065), ('mixture', 0.063), ('reweighting', 0.061), ('timescale', 0.061), ('spectrograms', 0.061), ('grouped', 0.061), ('observation', 0.06), ('mixing', 0.056), ('xt', 0.055), ('hit', 0.055), ('yt', 0.055), ('cues', 0.055), ('sounds', 0.055), ('scores', 0.053), ('max', 0.053), ('frequencies', 0.053), ('supervised', 0.053), ('markov', 0.052), ('yw', 0.051), ('single', 0.05), ('brown', 0.049), ('hz', 0.049), ('microphones', 0.048), ('attias', 0.048), ('speakers', 0.048), ('hear', 0.047), ('maskings', 0.047), ('pretrained', 0.047), ('band', 0.046), ('binary', 0.046), ('emission', 0.046), ('spectra', 0.046), ('bands', 0.046), ('recording', 0.046), ('observations', 0.044), ('recordings', 0.041), ('noise', 0.041), ('automatically', 0.041), ('nelson', 0.041), ('isolate', 0.041), ('keys', 0.041), ('cas', 0.041), ('periodogram', 0.041), ('wiener', 0.041), ('dereverberation', 0.041), ('ekf', 0.041), ('oscillatory', 0.041), ('wan', 0.041), ('output', 0.04), ('combined', 0.039), ('oi', 0.038), ('operation', 0.038), ('generate', 0.037), ('bounds', 0.037), ('amplitude', 0.037), ('community', 0.037), ('lca', 0.037), ('sw', 0.037), ('players', 0.037), ('cauwenberghs', 0.037), ('nonstationary', 0.037), ('cli', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording.

2 0.23587736 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

3 0.20449844 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

4 0.17210715 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola

Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a lowlevel, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.

5 0.16351025 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

6 0.15302882 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

7 0.15032962 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

8 0.14709416 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

9 0.1230951 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

10 0.12220111 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

11 0.11905964 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

12 0.077318974 51 nips-2000-Factored Semi-Tied Covariance Matrices

13 0.076064594 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

14 0.073008828 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

15 0.071528144 138 nips-2000-The Use of Classifiers in Sequential Inference

16 0.06861411 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

17 0.068433046 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

18 0.063233621 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

19 0.059319735 137 nips-2000-The Unscented Particle Filter

20 0.059233725 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.257), (1, -0.164), (2, 0.156), (3, 0.166), (4, -0.127), (5, -0.211), (6, -0.286), (7, -0.091), (8, 0.002), (9, 0.033), (10, 0.083), (11, -0.022), (12, -0.044), (13, -0.059), (14, 0.041), (15, -0.063), (16, 0.015), (17, -0.033), (18, -0.05), (19, -0.108), (20, -0.115), (21, 0.017), (22, 0.065), (23, -0.006), (24, 0.122), (25, -0.041), (26, -0.015), (27, -0.069), (28, -0.142), (29, 0.117), (30, 0.015), (31, -0.021), (32, 0.055), (33, 0.073), (34, -0.035), (35, -0.009), (36, -0.034), (37, -0.082), (38, 0.019), (39, -0.103), (40, -0.011), (41, -0.06), (42, -0.018), (43, 0.047), (44, -0.017), (45, 0.033), (46, -0.057), (47, 0.015), (48, -0.052), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96244431 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording.

2 0.78045285 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

3 0.67850775 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Author: Jürgen Tchorz, Michael Kleinschmidt, Birger Kollmeier

Abstract: A novel noise suppression scheme for speech signals is proposed which is based on a neurophysiologically-motivated estimation of the local signal-to-noise ratio (SNR) in different frequency channels. For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1

4 0.67631906 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

5 0.63663173 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

6 0.62251163 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

7 0.56727177 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

8 0.50578052 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

9 0.42827189 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

10 0.41381589 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

11 0.37484682 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

12 0.36459735 138 nips-2000-The Use of Classifiers in Sequential Inference

13 0.27176714 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

14 0.26602897 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity

15 0.25083661 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

16 0.23103519 109 nips-2000-Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features

17 0.23063761 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

18 0.22706446 80 nips-2000-Learning Switching Linear Models of Human Motion

19 0.22685504 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

20 0.22477126 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.318), (4, 0.033), (8, 0.02), (10, 0.028), (17, 0.095), (32, 0.017), (33, 0.055), (42, 0.01), (54, 0.013), (55, 0.037), (62, 0.044), (65, 0.021), (67, 0.036), (75, 0.015), (76, 0.028), (79, 0.023), (81, 0.045), (90, 0.039), (91, 0.034), (97, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84365237 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording.

2 0.8007232 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System

Author: Odelia Schwartz, Eero P. Simoncelli

Abstract: We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings. In previous work (NIPS, 1998) we demonstrated that an analogous model derived from the statistics of natural images accounts for non-linear properties of neurons in primary visual cortex. Thus, divisive normalization appears to be a generic mechanism for eliminating a type of statistical dependency that is prevalent in natural signals of different modalities. Signals in the real world are highly structured. For example, natural sounds typically contain both harmonic and rythmic structure. It is reasonable to assume that biological auditory systems are designed to represent these structures in an efficient manner [e.g., 1,2]. Specifically, Barlow hypothesized that a role of early sensory processing is to remove redundancy in the sensory input, resulting in a set of neural responses that are statistically independent. Experimentally, one can test this hypothesis by examining the statistical properties of neural responses under natural stimulation conditions [e.g., 3,4], or the statistical dependency of pairs (or groups) of neural responses. Due to their technical difficulty, such multi-cellular experiments are only recently becoming possible, and the earliest reports in vision appear consistent with the hypothesis [e.g., 5]. An alternative approach, which we follow here, is to develop a neural model from the statistics of natural signals and show that response properties of this model are similar to those of biological sensory neurons. A number of researchers have derived linear filter models using statistical criterion. For visual images, this results in linear filters localized in frequency, orientation and phase [6, 7]. Similar work in audition has yielded filters localized in frequency and phase [8]. Although these linear models provide an important starting point for neural modeling, sensory neurons are highly nonlinear. In addition, the statistical properties of natural signals are too complex to expect a linear transformation to result in an independent set of components. Recent results indicate that nonlinear gain control plays an important role in neural processing. Ruderman and Bialek [9] have shown that division by a local estimate of standard deviation can increase the entropy of responses of center-surround filters to natural images. Such a model is consistent with the properties of neurons in the retina and lateral geniculate nucleus. Heeger and colleagues have shown that the nonlinear behaviors of neurons in primary visual cortex may be described using a form of gain control known as divisive normalization [10], in which the response of a linear kernel is rectified and divided by the sum of other rectified kernel responses and a constant. 
We have recently shown that the responses of oriented linear filters exhibit nonlinear statistical dependencies that may be substantially reduced using a variant of this model, in which the normalization signal is computed from a weighted sum of other rectified kernel responses [11, 12]. The resulting model, with weighting parameters determined from image statistics, accounts qualitatively for physiological nonlinearities observed in primary visual cortex. In this paper, we demonstrate that the responses of bandpass linear filters to natural sounds exhibit striking statistical dependencies, analogous to those found in visual images. A divisive normalization procedure can substantially remove these dependencies. We show that this model, with parameters optimized for a collection of natural sounds, can account for nonlinear behaviors of neurons at the level of the auditory nerve. Specifically, we show that: 1) the shape offrequency tuning curves varies with sound pressure level, even though the underlying linear filters are fixed; and 2) superposition of a non-optimal tone suppresses the response of a linear filter in a divisive fashion, and the amount of suppression depends on the distance between the frequency of the tone and the preferred frequency of the filter. 1 Empirical observations of natural sound statistics The basic statistical properties of natural sounds, as observed through a linear filter, have been previously documented by Attias [13]. In particular, he showed that, as with visual images, the spectral energy falls roughly according to a power law, and that the histograms of filter responses are more kurtotic than a Gaussian (i.e., they have a sharp peak at zero, and very long tails). Here we examine the joint statistical properties of a pair of linear filters tuned for nearby temporal frequencies. We choose a fixed set of filters that have been widely used in modeling the peripheral auditory system [14]. Figure 1 shows joint histograms of the instantaneous responses of a particular pair of linear filters to five different types of natural sound, and white noise. First note that the responses are approximately decorrelated: the expected value of the y-axis value is roughly zero for all values of the x-axis variable. The responses are not, however, statistically independent: the width of the distribution of responses of one filter increases with the response amplitude of the other filter. If the two responses were statistically independent, then the response of the first filter should not provide any information about the distribution of responses of the other filter. We have found that this type of variance dependency (sometimes accompanied by linear correlation) occurs in a wide range of natural sounds, ranging from animal sounds to music. We emphasize that this dependency is a property of natural sounds, and is not due purely to our choice of linear filters. For example, no such dependency is observed when the input consists of white noise (see Fig. 1). The strength of this dependency varies for different pairs of linear filters . In addition, we see this type of dependency between instantaneous responses of a single filter at two Speech o -1 Drums • Monkey Cat White noise Nocturnal nature I~ ~; ~ • Figure 1: Joint conditional histogram of instantaneous linear responses of two bandpass filters with center frequencies 2000 and 2840 Hz. 
Pixel intensity corresponds to frequency of occurrence of a given pair of values, except that each column has been independently rescaled to fill the full intensity range. For the natural sounds, responses are not independent: the standard deviation of the ordinate is roughly proportional to the magnitude of the abscissa. Natural sounds were recorded from CDs and converted to sampling frequency of 22050 Hz. nearby time instants. Since the dependency involves the variance of the responses, we can substantially reduce it by dividing. In particular, the response of each filter is divided by a weighted sum of responses of other rectified filters and an additive constant. Specifically: L2 Ri = 2: (1) 12 j WjiLj + 0'2 where Li is the instantaneous linear response of filter i, strength of suppression of filter i by filter j. 0' is a constant and Wji controls the We would like to choose the parameters of the model (the weights Wji, and the constant 0') to optimize the independence of the normalized response to an ensemble of natural sounds. Such an optimization is quite computationally expensive. We instead assume a Gaussian form for the underlying conditional distribution, as described in [15]: P (LiILj,j E Ni ) '

3 0.45790178 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

Author: Hagai Attias, John C. Platt, Alex Acero, Li Deng

Abstract: This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.

4 0.43918604 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Author: Jürgen Tchorz, Michael Kleinschmidt, Birger Kollmeier

Abstract: A novel noise suppression scheme for speech signals is proposed which is based on a neurophysiologically-motivated estimation of the local signal-to-noise ratio (SNR) in different frequency channels. For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1

5 0.43461287 131 nips-2000-The Early Word Catches the Weights

Author: Mark A. Smith, Garrison W. Cottrell, Karen L. Anderson

Abstract: The strong correlation between the frequency of words and their naming latency has been well documented. However, as early as 1973, the Age of Acquisition (AoA) of a word was alleged to be the actual variable of interest, but these studies seem to have been ignored in most of the literature. Recently, there has been a resurgence of interest in AoA. While some studies have shown that frequency has no effect when AoA is controlled for, more recent studies have found independent contributions of frequency and AoA. Connectionist models have repeatedly shown strong effects of frequency, but little attention has been paid to whether they can also show AoA effects. Indeed, several researchers have explicitly claimed that they cannot show AoA effects. In this work, we explore these claims using a simple feed forward neural network. We find a significant contribution of AoA to naming latency, as well as conditions under which frequency provides an independent contribution. 1 Background Naming latency is the time between the presentation of a picture or written word and the beginning of the correct utterance of that word. It is undisputed that there are significant differences in the naming latency of many words, even when controlling word length, syllabic complexity, and other structural variants. The cause of differences in naming latency has been the subject of numerous studies. Earlier studies found that the frequency with which a word appears in spoken English is the best determinant of its naming latency (Oldfield & Wingfield, 1965). More recent psychological studies, however, show that the age at which a word is learned, or its Age of Acquisition (AoA), may be a better predictor of naming latency. Further, in many multiple regression analyses, frequency is not found to be significant when AoA is controlled for (Brown & Watson, 1987; Carroll & White, 1973; Morrison et al. 1992; Morrison & Ellis, 1995). These studies show that frequency and AoA are highly correlated (typically r =-.6) explaining the confound of older studies on frequency. However, still more recent studies question this finding and find that both AoA and frequency are significant and contribute independently to naming latency (Ellis & Morrison, 1998; Gerhand & Barry, 1998,1999). Much like their psychological counterparts, connectionist networks also show very strong frequency effects. However, the ability of a connectionist network to show AoA effects has been doubted (Gerhand & Barry, 1998; Morrison & Ellis, 1995). Most of these claims are based on the well known fact that connectionist networks exhibit

6 0.43048686 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

7 0.41246614 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

8 0.40775687 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

9 0.40326515 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

10 0.40244019 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

11 0.39464521 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

12 0.39157748 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

13 0.39147088 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

14 0.38906649 122 nips-2000-Sparse Representation for Gaussian Process Models

15 0.38856208 74 nips-2000-Kernel Expansions with Unlabeled Examples

16 0.38799295 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

17 0.38626426 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

18 0.38272369 146 nips-2000-What Can a Single Neuron Compute?

19 0.38263404 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script

20 0.38236448 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure