nips nips2001 nips2001-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
Reference: text
sentIndex sentText sentNum sentScore
1 It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. [sent-4, score-0.553]
2 This suggests the utility of audio-visual information for the task of speech enhancement. [sent-5, score-0.581]
3 We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. [sent-6, score-0.729]
4 We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. [sent-7, score-1.387]
5 We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. [sent-8, score-0.369]
6 We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information. [sent-9, score-0.955]
7 In speech perception, vision often plays a crucial role, because we can follow, in the lips and face, the very mechanisms that modulate the sound, even when the sound is obscured by acoustic noise. [sent-14, score-0.758]
8 It has been demonstrated that the addition of visual cues can enhance speech recognition as much as removing 15 dB of noise [1]. [sent-15, score-0.768]
9 Vision provides speech cues that are complementary to audio cues such as components of consonants and vowels that are likely to be obscured by acoustic noise [2]. [sent-16, score-1.059]
10 Visual information is demonstrably beneficial to HMM-based automatic speech recognition (ASR) systems, which typically suffer tremendously under moderate acoustical noise [3]. [sent-17, score-0.799]
11 We introduce a method of audio-visual speech enhancement using factorial hidden Markov models (fHMMs). [sent-18, score-0.973]
12 We focus on speech enhancement rather than speech recognition for two reasons: first, speech conveys useful paralinguistic information, such as prosody, emotion, and speaker identity, and second, speech contains useful cues for separation from noise, such as pitch. [sent-19, score-2.728]
13 In automatic speech recognition (ASR) systems, these cues are typically discarded in an effort to reduce irrelevant variance among speakers and utterances within a phonetic class. [sent-20, score-0.822]
14 Whereas the benefit of vision to speech recognition is well known, we may well wonder if visual input offers similar benefits to speech enhancement. [sent-21, score-1.286]
15 In [4] a nonparametric density estimator was used to adapt audio and video transforms to maximize the mutual information between the face of a target speaker and an audio mixture containing both the target voice and a distracter voice. [sent-22, score-0.977]
16 In [5] a multi-layer perceptron is trained to map noisy estimates of formants to clean ones, employing lip parameters (width, height and area of the lip opening) extracted from video as additional input. [sent-24, score-0.566]
17 The re-estimated formant contours were used to filter the speech to enhance the signal. [sent-25, score-0.713]
18 In speech recognition, HMMs are commonly used because of the advantages of modeling signal dynamics. [sent-28, score-0.652]
19 This suggests the following strategy: train an audiovisual HMM on clean speech, infer the likelihoods of its state sequences, and use the inferred state probabilities of the signal and noise to estimate a sequence of filters to clean the data. [sent-29, score-0.478]
20 Ephraim [6] first proposed an approach to factorially combining two HMMs in such an enhancement system. [sent-31, score-0.233]
21 In [7] an efficient variational learning rule for the factorial HMM is formulated, and in [8, 9] fHMM speech enhancement was recently revived using some clever tricks to allow more complex models. [sent-32, score-0.94]
22 The fHMM approach is amenable to audio-visual speech enhancement in many different forms. [sent-33, score-0.716]
23 In the simplest formulation, which we pursue here, the signal observation model includes visual features. [sent-34, score-0.305]
24 These visual inputs constrain the signal HMM and produce more accurate filters. [sent-35, score-0.175]
25 1 Factorial Speech Models. One of the challenges of using speech HMMs for enhancement is to model speech in sufficient detail. [sent-38, score-1.269]
26 Typically, speech models, following the practice in ASR, ignore narrow-band spectral details (corresponding to upper cepstral components), which carry pitch information, because they tend to vary across speakers and utterances for the same word or phoneme. [sent-39, score-0.873]
27 Such wide-band spectral patterns loosely represent formant patterns, a well-known cue for vowel discrimination. [sent-41, score-0.233]
28 In cases where the pitch or other narrow-band properties of the background signals differ from the foreground speech, and have predictable dynamics, such as with two simultaneous speech signals, these components may be helpful in separating the two signals. [sent-42, score-0.195]
29 (Footnote: We defer a detailed mathematical development to subsequent publications; contact the authors for further information.) [sent-45, score-0.622]
30 "one" " two" Full band: Narrow band: Wide band: Figure 1: full-band, narrow-band, and wide-band log spectrograms of two words. [sent-47, score-0.193]
31 The wide-band log spectrograms (bottom) are derived by low-pass filtering the log spectra (across the frequency domain), and the narrow-band log spectrograms (middle) are derived by high-pass filtering the log spectra. The full log spectrogram (top) is the sum of the two. [sent-48, score-0.770]
32 However, the wide-band and narrow-band variations in speech are only loosely coupled. [sent-49, score-0.599]
33 For instance, a given formant is likely to be uttered with many different pitches and a given pitch may be used to utter any formant. [sent-50, score-0.269]
34 Thus a model of the full spectrum of speech would have to have enough states to represent every combination of pitches and formants. [sent-51, score-0.782]
35 When combined with a similarly complex noise model, the composite model has 64 million states. [sent-54, score-0.248]
36 To parsimoniously model the complexity of speech, we employ a factorial HMM for a single speech signal, in which wide-band and narrow-band components are represented in sub-models with independent dynamics. [sent-56, score-0.883]
37 Figure 2(a) shows the observation p.d.f. for a combination of a wide-band and a narrow-band state, over the log-spectrum of speech prior to liftering. [sent-65, score-0.648]
38 Because the observation densities of each component are Gaussian, and the log-spectra of the wide-band and narrow-band components add in the log spectrum, the composite state has a Gaussian observation p.d.f. [sent-66, score-0.649]
39 whose mean and variance are the sums of the component observation means and variances. [sent-69, score-0.165]
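As a minimal sketch of how such a composite observation density could be formed (our own illustration; the variable names are ours, and diagonal covariances are assumed, as the text states later):

import numpy as np

def composite_gaussian(mu_wide, var_wide, mu_narrow, var_narrow):
    # Wide-band and narrow-band log-spectra add, so the composite state's
    # Gaussian has summed means and summed variances (diagonal covariances).
    return mu_wide + mu_narrow, var_wide + var_narrow

def composite_loglik(x, mu, var):
    # Log-likelihood of a log-spectral frame x under the composite Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

For example, composite_gaussian(mu_w[i], var_w[i], mu_n[j], var_n[j]) would give the observation density for the composite state formed by wide-band state i and narrow-band state j.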
40 Although the states of the two sub-models are marginally independent they are typically conditionally dependent given the observation sequence. [sent-70, score-0.219]
41 In other words we assume that the state dependencies between the sub-models for a given speech signal can be explained entirely via the observations. [sent-71, score-0.759]
42 Figure 2(b) depicts the combination of the wide-band and narrow-band models and the resulting observation p.d.f. [sent-72, score-0.275]
43 When combining the signal and noise models (or two different speech models), in contrast, the signals add in the frequency domain, and hence in the log spectral domain they no longer simply add. [sent-76, score-0.988]
44 In the spectral domain the amplitudes of the two signals have log-normal distributions, and the relative phases are unknown. [sent-77, score-0.216]
45 Disregarding phase differences, we apply a well-known approximation to the sum of two lognormal random variables, in which we match the mean and variance of a lognormal random variable to the sum of the means and variances of the two component lognormal random variables [10]. [sent-79, score-0.365]
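A sketch of that moment-matching step, assuming diagonal covariances and ignoring phase as described above; the exact bookkeeping in the paper may differ:

import numpy as np

def lognormal_sum_moments(mu1, var1, mu2, var2):
    # mu*, var* are the parameters of the underlying Gaussians in the log
    # (log-spectral) domain.  Means and variances of the two lognormals add
    # in the linear spectral domain; the result is mapped back to log-domain
    # Gaussian parameters.
    m1 = np.exp(mu1 + 0.5 * var1)                        # linear-domain means
    m2 = np.exp(mu2 + 0.5 * var2)
    v1 = (np.exp(var1) - 1.0) * np.exp(2 * mu1 + var1)   # linear-domain variances
    v2 = (np.exp(var2) - 1.0) * np.exp(2 * mu2 + var2)
    m, v = m1 + m2, v1 + v2                              # moments of the sum
    var = np.log(1.0 + v / m ** 2)                       # back to log-domain parameters
    mu = np.log(m) - 0.5 * var
    return mu, var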
46 Figure 3(a) depicts the combination of two factorial speech models and the resulting observation p.d.f. [sent-81, score-0.985]
47 Figure 3: combining two speech fHMMs (a) and adding video observations to a speech fHMM (b). [sent-88, score-2.385]
48 Using the log-normal observation distribution of the composite model, we can estimate the likelihood of the speech and noise states for each frame using the well-known forward-backward recursion. [sent-89, score-1.083]
49 (Footnote: The uncertainty of the phase differences can be incorporated by modeling the sum as a mixture of lognormals that uniformly samples phase differences. [sent-91, score-0.18]
50 Each mixture element is approximated by taking as its mean the length of the sum of the mean amplitudes when added in the complex plane according to a particular phase difference, and as its variance the sum of the two variances.) [sent-92, score-0.217]
51 This estimation is facilitated by the assumption of diagonal covariances in the log spectral domain. [sent-93, score-0.203]
52 Taking the expected value of the signal in the numerator and the expected value of the signal plus noise in the denominator yields a Wiener filter, which is applied to the original noisy signal, enhancing the desired component. [sent-94, score-0.4]
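A rough per-frame sketch of this filtering step; the variable names, the amplitude-domain choice, and the small regularizer are our assumptions rather than details from the paper:

import numpy as np

def wiener_filter_frame(noisy_spectrum, post_speech, means_speech, post_noise, means_noise):
    # post_* are the posterior state probabilities for this frame, shape (K,);
    # means_* are the corresponding state mean spectra, shape (K, num_bins).
    expected_speech = post_speech @ means_speech
    expected_noise = post_noise @ means_noise
    gain = expected_speech / (expected_speech + expected_noise + 1e-12)
    return gain * noisy_spectrum   # enhanced spectrum for this frame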
53 When we have two speech signals, one person's noise is another's signal, and we can separate both by the same method. [sent-95, score-0.791]
54 2 Incorporating vision. We incorporate vision after training the audio models in order to test the improvement yielded by visual input while holding the audio model constant. [sent-96, score-0.613]
55 Figure 3(b) depicts the structure of the resulting speech model. [sent-100, score-0.603]
56 Such a method in which audio and visual features are integrated early in processing is only one of several approaches. [sent-101, score-0.289]
57 We envision other late integration approaches in which audio and visual dynamics are more loosely coupled. [sent-102, score-0.335]
58 3 Efficient inference. In the models described above, in which we factorially combine two speech models, each of which is itself factorial, the complexity of inference in the composite model, using the forward-backward recursion, can easily become unmanageable. [sent-104, score-1.068]
59 If K is the number of states in each subcomponent, then K^4 is the number of states in the composite HMM. [sent-105, score-0.333]
60 In our experiments K is on the order of 40 states, so there are 2,560,000 states in the composite model. [sent-106, score-0.244]
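For concreteness, a back-of-the-envelope count (our own arithmetic, not taken from the paper):

K = 40, \qquad K^4 = 40^4 = 2{,}560{,}000 \ \text{composite states}.

\text{Naive forward-backward over the composite chain costs on the order of } O(T K^8) \approx T \times 6.6 \times 10^{12},
\text{while one sweep of the sequential approximation described below costs roughly } O(T \cdot 4K^2) \approx T \times 6{,}400
\text{ (plus observation evaluations).}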
61 Naively each composite state must be searched when computing the probabilities of state sequences necessary for inference. [sent-107, score-0.327]
62 Rather than computing the forward-backward procedure on the composite HMM, we compute it sequentially on each sub-HMM to derive the probability of each state in each frame. [sent-110, score-0.22]
63 Of course, in order to evaluate the observation probabilities of the current sub-HMM for a given frame, we need to consider the state probabilities of the other three sub-HMMs, because their means and variances are combined in the observation model. [sent-111, score-0.478]
64 These state probabilities and their associated observation probabilities comprise a mixture model for a given frame. [sent-112, score-0.336]
65 Thus we only have to consider the K states of the current model, and use the summarized means and variances of the other three HMMs as auxiliary inputs to the observation model. [sent-114, score-0.26]
66 We initialize the state probabilities in each frame with the equilibrium distribution for each sub-HMM. [sent-115, score-0.198]
67 In our experiments, after a handful of iterations, the composite state probabilities tend to converge. [sent-116, score-0.262]
68 This method is closely related to a structured variational approximation for factorial HMMs [7] and can also be seen as an approximate belief propagation or sum-product algorithm [11]. [sent-117, score-0.224]
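A compact sketch of this sequential scheme, under our own simplifying assumptions: the sub-models are hypothetical objects with fields .trans, .prior, .means, .vars, and the other chains are summarized additively in the log-spectral domain (combining a speech chain with a noise chain would instead use the lognormal moment-matching approximation sketched earlier).

import numpy as np
from scipy.special import logsumexp

def equilibrium(trans, n_iters=200):
    # Approximate stationary distribution of a row-stochastic transition matrix.
    p = np.full(trans.shape[0], 1.0 / trans.shape[0])
    for _ in range(n_iters):
        p = p @ trans
    return p

def state_logliks(x, means, variances):
    # Log-likelihood of frame x (D,) under each of K diagonal Gaussians (K, D).
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)

def forward_backward(log_obs, trans, prior):
    # Posterior state probabilities for one sub-HMM (log-domain recursion).
    T, K = log_obs.shape
    log_trans, log_prior = np.log(trans + 1e-12), np.log(prior + 1e-12)
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    alpha[0] = log_prior + log_obs[0]
    for t in range(1, T):
        alpha[t] = log_obs[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(log_trans + (log_obs[t + 1] + beta[t + 1])[None, :], axis=1)
    gamma = alpha + beta
    return np.exp(gamma - logsumexp(gamma, axis=1, keepdims=True))

def sequential_inference(sub_hmms, frames, n_iters=5):
    # Cycle over sub-HMMs, summarizing the others by their expected means and
    # variances under the current posteriors, instead of searching the
    # composite state space.  frames: (T, D) log-spectral features.
    T = frames.shape[0]
    posts = [np.tile(equilibrium(m.trans), (T, 1)) for m in sub_hmms]  # equilibrium init
    for _ in range(n_iters):
        for i, m in enumerate(sub_hmms):
            other_mu = sum(posts[j] @ sub_hmms[j].means for j in range(len(sub_hmms)) if j != i)
            other_var = sum(posts[j] @ sub_hmms[j].vars for j in range(len(sub_hmms)) if j != i)
            log_obs = np.stack([state_logliks(frames[t], m.means + other_mu[t], m.vars + other_var[t])
                                for t in range(T)])
            posts[i] = forward_backward(log_obs, m.trans, m.prior)
    return posts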
69 4 Data. We used a small-vocabulary audio-visual speech database developed by Fu Jie Huang at Carnegie Mellon University [12]. [sent-118, score-0.553]
70 These data consist of audio and video recordings of 10 subjects (7 males and 3 females) saying 78 isolated words commonly used for numbers and time, such as "one", "Monday", "February", "night", etc. [sent-119, score-0.534]
71 The data set included outer lip parameters extracted from video using an automatic lip tracker, including the height of the upper and lower lips relative to the corners and the width from corner to corner. [sent-122, score-0.571]
72 We interpolated these lip parameters to match the audio frame rate, and calculated time derivatives. [sent-123, score-0.426]
73 The audio was framed at 60 frames per second, with an overlap of 50%, yielding 264 samples per frame. [sent-126, score-0.263]
74 The frames were analyzed into cepstra: the wide-band log spectrum is derived from the lower 20 cepstral components and the narrow-band log spectrum from the upper cepstra. [sent-127, score-0.391]
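A sketch of this cepstral split for a single frame; the Hann window, the FFT-based implementation, and the numerical floor are illustrative assumptions, and the paper's exact analysis may differ:

import numpy as np
from scipy.fft import dct, idct

def split_log_spectrum(frame, n_wide=20):
    # Lower n_wide cepstral coefficients reconstruct the smooth wide-band
    # (formant) envelope; the upper coefficients give the narrow-band
    # (pitch-related) detail.  The two parts sum to the full log spectrum.
    log_spec = np.log(np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-10)
    cepstrum = dct(log_spec, type=2, norm='ortho')
    wide = cepstrum.copy(); wide[n_wide:] = 0.0        # keep lower cepstra
    narrow = cepstrum.copy(); narrow[:n_wide] = 0.0    # keep upper cepstra
    wide_band = idct(wide, type=2, norm='ortho')
    narrow_band = idct(narrow, type=2, norm='ortho')
    return wide_band, narrow_band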
75 A PCA basis was used to reduce the log spectrograms to a more manageable size of 30 dimensions during training. [sent-129, score-0.193]
76 This resulted in some non-zero covariances near the diagonal in the learned observation covariance matrices, which we discarded. [sent-130, score-0.168]
77 The narrow-band model learned states that represented different pitches and had transition probabilities that were non-zero mainly between neighboring pitches. [sent-132, score-0.192]
78 The narrow-band model's video observation probability distributions were largely overlapping, reflecting the fact that video tells us little about pitch. [sent-133, score-0.665]
79 The wide-band model learned states that represented different formant structures. [sent-134, score-0.211]
80 The video observation distributions for several states in the wide-band model were clearly separated, reflecting the information that video provides about the formant structure. [sent-135, score-0.876]
81 Subjectively the enhanced signals sound well separated from each other for the most part. [sent-136, score-0.226]
82 Figure 4(a) (bottom) shows the estimated spectrograms for a mixture of two different words spoken by the same speaker - an extremely difficult task. [sent-137, score-0.452]
83 To quantify these results we evaluate the system using a speech recognizer, on the slightly easier task of separating the speech of the two different speakers, whose voices were in different but overlapping pitch ranges. [sent-138, score-1.339]
84 (Footnote: Estimation of the SNR is necessary in practice; however, this subject has been treated in speech enhancement systems in the literature.) The separated test sounds were estimated by the system under two conditions: with and without the use of video information. [sent-148, score-0.351]
85 We evaluated the estimates on the test set using a speech recognition system developed by Bhiksha Raj, using the CMU Sphinx ASR engine. [sent-149, score-0.618]
86 Existing speech HMMs trained on 60 hours of broadcast news data were used for recognition. [sent-150, score-0.553]
87 The models were adapted in an unsupervised manner to clean speech from each speaker, by learning a single affine transformation of all the state means, using a maximum likelihood linear regression procedure [14]. [sent-151, score-0.722]
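A minimal sketch of applying such a single affine transform to the Gaussian state means; estimating A and b by maximum likelihood from the adaptation data, as in [14], is not shown, and the shapes are our assumptions:

import numpy as np

def apply_mllr(state_means, A, b):
    # state_means: (num_states, D); A: (D, D); b: (D,).
    # Every Gaussian state mean is mapped through the shared transform A x + b.
    return state_means @ A.T + b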
88 The recognizer adapted to each speaker was tested with the enhanced speech produced by the speech model for that speaker, as well as with no enhancement. [sent-152, score-1.367]
89 It is somewhat surprising that the gains for video occur mostly in areas of higher SNR, whereas in human speech perception they occur under lower SNR. [sent-155, score-0.833]
90 Little subjective difference was noted with the use of video in the case of two speakers. [sent-156, score-0.251]
91 However in other experiments, when both voices came from the same speaker, the video was crucial in disambiguating which signal came from which voice. [sent-157, score-0.533]
92 We introduced a factorial HMM to track both formant and pitch information, as well as video, in a unified probabilistic model, and demonstrated its effectiveness in speech enhancement. [sent-159, score-0.985]
93 These models represented every combination of three phones (triphones) using 6000 states tied across triphone models, with a 16-element Gaussian mixture observation model for each state. [sent-166, score-0.337]
94 The data were processed at 8 kHz in 25ms windows overlapped by 15ms, with a frame rate of 100 frames per second, and analyzed into 31 Mel frequency components from which 13 cepstral coefficients were derived. [sent-167, score-0.378]
95 These coefficients, with the mean vector removed and supplemented with their time differences, comprised the observed features. [sent-168, score-0.716]
96 The results are tentative given the small sample of voices used; however, they suggest that further study with a larger sample of voices is warranted. [sent-169, score-0.234]
97 It would be useful to compare the performance of a factorial speech model to that of each factor in isolation, as well as to a full-spectrum model. [sent-170, score-0.777]
98 Measures of quality and intelligibility by human listeners in terms of speech and emotion recognition, as well as speaker identity, will also be helpful in further demonstrating the utility of these techniques. [sent-171, score-0.963]
99 Special thanks to Bhiksha Raj for devising and producing the evaluation using speech recognition, and to Matt Brand for his entropic HMM toolkit. [sent-174, score-0.667]
100 Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models. [sent-248, score-0.177]
wordName wordTfidf (topN-words)
[('speech', 0.553), ('video', 0.251), ('factorial', 0.224), ('audio', 0.213), ('speaker', 0.177), ('enhancement', 0.163), ('composite', 0.155), ('observation', 0.13), ('spectrograms', 0.13), ('formant', 0.122), ('lip', 0.122), ('voices', 0.117), ('hmms', 0.112), ('hmm', 0.112), ('signal', 0.099), ('fhmm', 0.094), ('frame', 0.091), ('states', 0.089), ('pitch', 0.086), ('acoustical', 0.081), ('asr', 0.081), ('snr', 0.081), ('visual', 0.076), ('cues', 0.074), ('cepstral', 0.074), ('signals', 0.074), ('clean', 0.071), ('band', 0.071), ('factorially', 0.07), ('intelligibility', 0.07), ('lognormal', 0.07), ('wide', 0.067), ('state', 0.065), ('spectral', 0.065), ('noise', 0.065), ('recognition', 0.065), ('log', 0.063), ('spectra', 0.062), ('speakers', 0.062), ('pitches', 0.061), ('mixture', 0.057), ('america', 0.056), ('sound', 0.055), ('separated', 0.054), ('spectrum', 0.051), ('frames', 0.05), ('depicts', 0.05), ('entropic', 0.049), ('bhiksha', 0.047), ('casey', 0.047), ('fhmms', 0.047), ('jhershey', 0.047), ('journ', 0.047), ('overlapped', 0.047), ('spoken', 0.046), ('sounds', 0.046), ('amplitudes', 0.046), ('loosely', 0.046), ('phase', 0.044), ('enhanced', 0.043), ('probabilities', 0.042), ('words', 0.042), ('windows', 0.041), ('recognizer', 0.041), ('lips', 0.041), ('emotion', 0.041), ('huang', 0.041), ('matt', 0.041), ('mitsubishi', 0.041), ('multimedia', 0.041), ('obscured', 0.041), ('raj', 0.041), ('variances', 0.041), ('components', 0.039), ('vision', 0.039), ('filter', 0.038), ('covariances', 0.038), ('separation', 0.037), ('facilitated', 0.037), ('voice', 0.037), ('khz', 0.037), ('frequency', 0.036), ('sum', 0.035), ('automatic', 0.035), ('background', 0.035), ('electric', 0.035), ('schwartz', 0.035), ('utterances', 0.033), ('came', 0.033), ('reflecting', 0.033), ('models', 0.033), ('phases', 0.031), ('separating', 0.03), ('el', 0.03), ('human', 0.029), ('face', 0.029), ('recordings', 0.028), ('utility', 0.028), ('combination', 0.028), ('combined', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
2 0.40075579 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
Author: Brendan J. Frey, Trausti T. Kristjansson, Li Deng, Alex Acero
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.
3 0.22320317 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
4 0.19343275 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectraltemporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
5 0.18833555 172 nips-2001-Speech Recognition using SVMs
Author: N. Smith, Mark Gales
Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1
6 0.18010297 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
7 0.16882305 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
8 0.15803149 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
9 0.10066237 183 nips-2001-The Infinite Hidden Markov Model
10 0.10025846 115 nips-2001-Linear-time inference in Hierarchical HMMs
11 0.099811867 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
12 0.095994644 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
13 0.092619538 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
14 0.087418318 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
15 0.076807939 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
16 0.075954974 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement
17 0.074810065 43 nips-2001-Bayesian time series classification
18 0.073310643 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
19 0.070891641 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
20 0.07086733 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
topicId topicWeight
[(0, -0.223), (1, -0.017), (2, -0.065), (3, -0.092), (4, -0.382), (5, 0.158), (6, 0.227), (7, -0.122), (8, 0.033), (9, -0.158), (10, -0.021), (11, 0.102), (12, -0.043), (13, 0.32), (14, -0.099), (15, 0.037), (16, -0.101), (17, -0.035), (18, -0.279), (19, 0.063), (20, -0.002), (21, -0.025), (22, 0.033), (23, 0.012), (24, 0.024), (25, -0.057), (26, 0.039), (27, -0.046), (28, 0.043), (29, -0.031), (30, 0.022), (31, 0.043), (32, 0.054), (33, 0.023), (34, -0.007), (35, 0.001), (36, -0.043), (37, -0.009), (38, 0.04), (39, 0.001), (40, -0.008), (41, 0.002), (42, 0.011), (43, 0.005), (44, 0.005), (45, 0.019), (46, 0.008), (47, -0.02), (48, 0.005), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.97944784 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
2 0.91280878 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
Author: Brendan J. Frey, Trausti T. Kristjansson, Li Deng, Alex Acero
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.
3 0.81653285 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
Author: K. Yao, S. Nakamura
Abstract: We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and noise prior evolved from previous estimation. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimation of the noise parameter, and prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusion the estimation of samples according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and highly non-stationary Machinegun noise. In all the experiments carried out, we observed that the method can have significant recognition performance improvement, over that achieved by noise compensation with stationary noise assumption. 1
4 0.76886773 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectraltemporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
5 0.51915216 172 nips-2001-Speech Recognition using SVMs
Author: N. Smith, Mark Gales
Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1
6 0.50173074 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
7 0.45328209 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.43245152 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
9 0.33674556 14 nips-2001-A Neural Oscillator Model of Auditory Selective Attention
10 0.31724808 115 nips-2001-Linear-time inference in Hierarchical HMMs
11 0.30859151 3 nips-2001-ACh, Uncertainty, and Cortical Inference
12 0.29410979 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
13 0.29275748 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
14 0.28146401 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
15 0.27482668 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
16 0.26987854 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
17 0.26608527 183 nips-2001-The Infinite Hidden Markov Model
18 0.26164854 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
19 0.25067446 43 nips-2001-Bayesian time series classification
20 0.24467465 71 nips-2001-Estimating the Reliability of ICA Projections
topicId topicWeight
[(13, 0.016), (14, 0.025), (17, 0.018), (19, 0.029), (27, 0.114), (30, 0.13), (38, 0.015), (55, 0.277), (59, 0.035), (72, 0.053), (79, 0.052), (83, 0.022), (91, 0.118)]
simIndex simValue paperId paperTitle
same-paper 1 0.82122844 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
2 0.77444309 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
Author: Stefan Harmeling, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: In kernel based learning the data is mapped to a kernel feature space of a dimension that corresponds to the number of training data points. In practice, however, the data forms a smaller submanifold in feature space, a fact that has been used e.g. by reduced set techniques for SVMs. We propose a new mathematical construction that permits to adapt to the intrinsic dimension and to find an orthonormal basis of this submanifold. In doing so, computations get much simpler and more important our theoretical framework allows to derive elegant kernelized blind source separation (BSS) algorithms for arbitrary invertible nonlinear mixings. Experiments demonstrate the good performance and high computational efficiency of our kTDSEP algorithm for the problem of nonlinear BSS.
3 0.62270647 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
4 0.61917681 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
Author: Dieter Fox
Abstract: Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
5 0.6164434 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
6 0.61258221 71 nips-2001-Estimating the Reliability of ICA Projections
7 0.6085673 46 nips-2001-Categorization by Learning and Combining Object Parts
8 0.60718191 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
9 0.6034317 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
10 0.60228467 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
11 0.60062557 60 nips-2001-Discriminative Direction for Kernel Classifiers
12 0.60025948 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
13 0.59915733 56 nips-2001-Convolution Kernels for Natural Language
14 0.59889281 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
15 0.5981369 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
16 0.59674984 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
17 0.59654862 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
18 0.59579957 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
19 0.5941658 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
20 0.5923804 163 nips-2001-Risk Sensitive Particle Filters