nips nips2001 nips2001-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. [sent-12, score-0.65]
2 In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. [sent-14, score-0.952]
3 In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. [sent-15, score-0.622]
4 We report encouraging results on an isolated digit recognition task. [sent-16, score-0.227]
5 Introduction Automatic Speech Recognition systems perform reasonably well in controlled and matched training and recognition conditions. [sent-18, score-0.15]
6 However, performance deteriorates when there is a mismatch between training and testing conditions, caused for instance by additive noise (Lippmann, 1997). [sent-19, score-0.052]
7 Missing data techniques, which make minimal assumptions about the nature of the noise, provide an alternative solution for speech corrupted by additive noise. [sent-21, score-0.308]
8 They are based on identifying uncorrupted, reliable regions in the frequency domain and adapting recognition algorithms so that classification is based on these regions. [sent-22, score-0.231]
9 Present missing data techniques developed at Sheffield (Barker et al. [sent-23, score-0.649]
10 Neural Networks, unlike HMMs, are discriminative models which do give direct estimates of posterior probabilities and have been used with success in hybrid ANN/HMM speech recognition systems (Bourlard et al. [sent-30, score-0.425]
11 In this paper, we adapt a recurrent neural network architecture introduced by Gingras and Bengio (1998) for robust ASR with missing data. [sent-32, score-0.78]
12 1 Missing data masks Speech recognition with missing data is based on the assumption that some regions in time/frequency remain uncorrupted for speech with added noise. [sent-35, score-1.065]
13 Initial processes, based on local signal-to-noise estimates, on auditory grouping cues, or a combination (Barker et al. [sent-38, score-0.086]
14 , 2001) define a binary ‘missing data mask’: ones in the mask indicate reliable (or ‘present’) features and zeros indicate unreliable (or ‘missing’) features. [sent-39, score-0.221]
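As an illustration of how such a mask might be constructed from local signal-to-noise estimates, here is a minimal sketch (not taken from the paper: the SNR threshold, array shapes, function name and use of numpy are all assumptions):

```python
import numpy as np

def snr_mask(noisy_energy, noise_estimate, threshold_db=0.0):
    """Binary missing-data mask: 1 = reliable ('present'), 0 = unreliable ('missing').

    noisy_energy, noise_estimate: (frames, channels) arrays of spectral energy for the
    noisy speech and for a (hypothetical) running estimate of the noise floor.
    """
    # Simplified local-SNR criterion; a real front end would use a more careful estimate.
    local_snr_db = 10.0 * np.log10(noisy_energy / np.maximum(noise_estimate, 1e-10))
    return (local_snr_db > threshold_db).astype(np.float32)
```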
15 2 Classification with missing data Techniques for classification with incomplete data can be divided into imputation and marginalisation. [sent-41, score-1.01]
16 Imputation is a technique in which missing features are replaced by estimated values to allow the recognition process to proceed in the normal way. [sent-42, score-0.785]
17 If the missing values are replaced by either zeros, random values or their means based on training data, the approach is called unconditional imputation. [sent-43, score-0.649]
18 On the other hand, in conditional imputation, conditional statistics are used to estimate the missing values given the present values. [sent-44, score-0.938]
19 In the marginalisation approach missing values are ignored (by integrating over their possible ranges) and recognition is performed with the reduced data vector which is considered reliable. [sent-45, score-0.798]
20 For the multivariate mixture Gaussian distributions used in CDHMMs, marginalisation and conditional imputation can be formulated analytically (Cooke et al. [sent-46, score-0.505]
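For a single Gaussian component, the two analytic formulations can be sketched as follows (a simplified illustration rather than the CDHMM mixture case described in the cited work; the variable names and the use of numpy/scipy are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_likelihood(x, mask, mu, Sigma):
    """Marginalisation: score only the reliable ('present') dimensions of x."""
    p = mask.astype(bool)
    return multivariate_normal.pdf(x[p], mean=mu[p], cov=Sigma[np.ix_(p, p)])

def conditional_impute(x, mask, mu, Sigma):
    """Conditional imputation: replace missing dimensions by E[x_missing | x_present]."""
    p = mask.astype(bool)
    m = ~p
    Spp = Sigma[np.ix_(p, p)]   # covariance of the present dimensions
    Smp = Sigma[np.ix_(m, p)]   # cross-covariance missing/present
    x_hat = x.copy()
    x_hat[m] = mu[m] + Smp @ np.linalg.solve(Spp, x[p] - mu[p])
    return x_hat
```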
21 For missing data ASR, further improvements in both techniques follow from using the knowledge that for spectral energy features the unreliable data is bounded between zero and the energy in the speech+noise mixture (Vizinho et al. [sent-48, score-0.922]
22 These techniques are referred to as bounded marginalisation and bounded imputation. [sent-51, score-0.092]
23 3 Why recurrent neural nets for missing data robust ASR? [sent-55, score-0.755]
24 Several neural net architectures have been proposed to deal with the missing data problem in general (Ahmed & Tresp, 1993; Ghahramani & Jordan, 1994). [sent-56, score-0.588]
25 The problem in using neural networks with missing data is to compute the output of a node/unit when some of its input values are unavailable. [sent-57, score-0.652]
26 For marginalisation, this involves finding a way of integrating over the range of the missing values. [sent-58, score-0.525]
27 A robust ASR system to deal with missing data using neural networks has recently been proposed by (Morris et al. [sent-59, score-0.702]
28 This is basically a radial basis function neural network with the hidden units associated with a diagonal-covariance Gaussian. [sent-61, score-0.12]
29 The marginal over the missing values can be computed in this case and hence the resulting system is equivalent to the HMM-based missing data speech recognition system using marginalisation. [sent-62, score-1.477]
30 Reported performance is also comparable to that of the HMM-based speech recognition system. [sent-63, score-0.365]
31 In this paper missing data is dealt with by imputation. [sent-64, score-0.558]
32 We use recurrent neural networks to estimate missing values in the input vector. [sent-65, score-0.691]
33 RNNs have the potential to capture long-term contextual effects over time, and hence to use temporal context to compensate for missing data, which CDHMM-based missing data techniques do not do. [sent-66, score-1.147]
34 RNNs also allow a single net to perform both imputation and classification, with the potential of combining these processes to mutual benefit. [sent-68, score-0.414]
35 The architecture of Gingras and Bengio (1998) is based on a fully-connected feedforward network with input, hidden and output layers using hyperbolic tangent activation functions. [sent-70, score-0.136]
36 The output layer has one unit for each class and the network is trained with the correct classification as target. [sent-71, score-0.141]
37 Recurrent links are added to the feedforward net with a unit delay from the output to the hidden units, as in Jordan networks (Jordan, 1988). [sent-72, score-0.371]
38 There are also recurrent links with unit delay from hidden units to missing input units to impute missing features. [sent-73, score-1.525]
39 In addition, there are self-delayed terms with a fixed weight for each unit which basically serve to stabilise RNN behaviour over time and help in imputation as well. [sent-74, score-0.469]
40 Gingras and Bengio used this RNN both for a pattern classification task with static data (one input vector for each example) and for sequential data (a sequence of input values for each example). [sent-76, score-0.167]
41 Our aim is to adapt this architecture for robust ASR with missing data. [sent-77, score-0.667]
42 Some preliminary static classification experiments were performed on vowel spectra (individual spectral slices excised from the TIMIT database). [sent-78, score-0.057]
43 RNN performance on this task with missing data was better than standard MLP and Gaussian classifiers. [sent-79, score-0.558]
44 In the next section we show how the net can be adapted for dynamic classification of the spectral sequences constituting words. [sent-80, score-0.087]
45 RNN architecture for robust ASR with missing data Figure 1 illustrates our modified version of the Gingras and Bengio architecture. [sent-82, score-0.677]
46 Instead of taking feedback from the output to the hidden layer we have chosen a fully connected or Elman RNN (Elman, 1990) where there are full recurrent links from the past hidden layer to the present hidden layer (figure 1). [sent-83, score-0.557]
47 We have observed that these links produce faster convergence, in agreement with Pedersen (1997). [sent-84, score-0.067]
48 The number of input units depends on the size of the feature vector. [sent-85, score-0.099]
49 The number of hidden units is determined by experimentation. [sent-88, score-0.12]
50 In our case the classes are taken to be whole words, so in the isolated digit recognition experiments we report, there are eleven output units, for ‘1’ - ‘9’, ‘zero’ and ‘oh’. [sent-90, score-0.268]
51 In training, missing inputs are initialised with their unconditional means. [sent-91, score-0.566]
52 The RNN is then allowed to impute missing values for the next frame through the recurrent links, after a feedforward pass. [sent-92, score-0.771]
53 $X(m,t) = (1-\gamma)\,X(m,t-1) + \sum_{j=1}^{H} v_{jm}\, f(\mathrm{hid}(j,t-1))$, where $X(m,t)$ is the missing feature $m$ at time $t$, $\gamma$ is the learning rate, $v_{jm}$ is the recurrent link from hidden unit $j$ to the missing input, $\mathrm{hid}(j,t-1)$ is the activation of hidden unit $j$ at time $t-1$, and $H$ is the number of hidden units. [sent-93, score-1.606]
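A minimal code sketch of this imputation update (the function and variable names, the default value of gamma, and the choice of tanh as the activation f are assumptions; only the update rule itself comes from the equation above):

```python
import numpy as np

def impute_inputs(x_obs, mask, x_prev, hidden_prev, V, gamma=0.1):
    """Fill the input vector at time t: keep reliable features, impute missing ones.

    x_obs       : (n_features,) observed features at time t
    mask        : (n_features,) 1 = reliable, 0 = missing
    x_prev      : (n_features,) imputed input vector at time t-1
    hidden_prev : (n_hidden,)   hidden-unit values at time t-1
    V           : (n_hidden, n_features) recurrent weights from hidden units to inputs
    """
    # X(m,t) = (1 - gamma) * X(m,t-1) + sum_j v_jm * f(hid(j, t-1)), with f taken as tanh
    proposal = (1.0 - gamma) * x_prev + np.tanh(hidden_prev) @ V
    return mask * x_obs + (1.0 - mask) * proposal
```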
54 The average of the RNN output over all the frames of an example is taken after these frames have gone through a forward pass. [sent-94, score-0.063]
55 The sum squared error between the correct targets and the RNN output for each frame is back-propagated through time and RNN weights are updated until a stopping criterion is reached. [sent-95, score-0.119]
56 The recognition phase consists of a forward pass to produce RNN output for unseen data and imputation of missing features at each time step. [sent-96, score-1.211]
57 The highest value in the averaged output vector is taken as the correct class. [sent-97, score-0.041]
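Putting these steps together, the recognition pass might look like the sketch below (the `rnn.step` interface and all attribute names are hypothetical; the text above specifies only the per-frame forward pass with imputation, output averaging, and taking the highest-scoring class):

```python
import numpy as np

def classify_utterance(frames, masks, rnn, feature_means):
    """Forward pass over one utterance; returns the index of the recognised class.

    `rnn` is a hypothetical object exposing step(x, h) -> (output, new_hidden),
    recurrent weights rnn.V of shape (n_hidden, n_features), rnn.gamma and rnn.n_hidden.
    """
    h = np.zeros(rnn.n_hidden)
    x_prev = feature_means.copy()      # missing inputs are initialised with their means
    outputs = []
    for x_obs, mask in zip(frames, masks):
        # impute missing features from the previous input vector and hidden state
        x = mask * x_obs + (1.0 - mask) * ((1.0 - rnn.gamma) * x_prev + np.tanh(h) @ rnn.V)
        y, h = rnn.step(x, h)          # feedforward pass for this frame
        outputs.append(y)
        x_prev = x
    return int(np.argmax(np.mean(outputs, axis=0)))   # highest averaged output wins
```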
58 [Figure 1 schematic: reliable input features, hidden layer, and output units labelled ‘one’, ‘two’, ‘three’, … ‘nine’, ‘zero’, ‘oh’.] Figure 1: RNN architecture for robust ASR with the missing data technique. [sent-98, score-0.778]
59 Solid arrows show full forward and recurrent connections between two layers. [sent-99, score-0.161]
60 Shaded blocks in the input layer indicate missing inputs which keep changing at every time step. [sent-100, score-0.592]
61 Missing inputs are fully connected (solid arrows) to the hidden layer with a unit delay, in addition to a delayed self-connection (thin arrows) with a fixed weight. [sent-101, score-0.206]
62 Isolated word recognition experiments Continuous pattern classification experiments were performed using data from 30 male speakers in the isolated digits section of the TIDIGIT database (Leonard, 1984). [sent-103, score-0.283]
63 220 examples were chosen from a subset of 10 speakers for training. [sent-107, score-0.035]
64 Features were extracted from Hamming-windowed speech with a window size of 25 msec and 50% overlap. [sent-110, score-0.215]
65 In the initial experiments we report, the missing data masks were formed by deleting spectral energy features at random. [sent-112, score-0.769]
66 This allows comparison with early results with HMM-based missing data recognition (Cooke et al. [sent-113, score-0.768]
67 Recognition performance was evaluated with 0% to 80% missing features, in increments of 10%. [sent-116, score-0.581]
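A sketch of how such random-deletion masks might be generated for the 0% to 80% conditions (the function name, RNG handling and example shapes are assumptions):

```python
import numpy as np

def random_deletion_mask(n_frames, n_features, frac_missing, rng=None):
    """Binary mask with roughly `frac_missing` of the features deleted at random."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random((n_frames, n_features)) >= frac_missing).astype(np.float32)

# Example: masks for one utterance (35 frames, 20 features) at each missing-data level.
masks = {frac: random_deletion_mask(35, 20, frac) for frac in np.arange(0.0, 0.81, 0.1)}
```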
68 1 RNN performance as a classifier An RNN with 20 inputs, 65 hidden units and 11 output units was chosen for recognition and imputation with 20 features per time frame. [sent-119, score-0.751]
69 Its performance on various amounts of missing features from 0% to 80%, shown in Figure 2 (the ‘RNN imputation’ curve), is much better than that of a standard Elman RNN trained on clean speech only for the classification task and tested with mean imputation. [sent-120, score-0.882]
70 Use of the self-delayed term in addition to the recurrent links for imputation of missing features contributes positively in the case of sequential data. [sent-121, score-1.196]
71 Results resemble those reported for HMMs in (Cooke et al. [sent-122, score-0.06]
72 We also show that results are superior to ‘last reliable imputation’ in which the imputed value of a feature is the last reliable value for that feature. [sent-124, score-0.25]
73 2 RNN performance on pattern completion Imputation, or pattern completion, performance was observed for an RNN trained with 4 features per frame of speech, and is shown in Figure 3. [sent-127, score-0.426]
74 The RNN for this task had 4 input, 45 hidden and 11 output units. [sent-128, score-0.11]
75 In figure 3(a), solid curves show the true values of the feature in each frequency band at every frame for an example of a spoken ‘9’, the horizontal lines are mean feature values, and the circles are the missing values imputed by the RNN. [sent-129, score-0.796]
76 For this network, classification error for recognition was 10. [sent-131, score-0.15]
77 The bottom curve is the average pattern completion error with missing features imputed by the network. [sent-138, score-0.768]
78 This demonstrates the clear advantage of using the RNN for both imputation and classification. [sent-139, score-0.384]
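One way such an average imputation error could be computed for the comparison between mean imputation and RNN imputation (the exact error measure is not stated in the text, so the metric and names here are assumptions):

```python
import numpy as np

def average_imputation_error(true_feats, imputed_feats, mask):
    """Mean absolute error computed over the missing entries only."""
    missing = (mask == 0)
    return float(np.abs(true_feats - imputed_feats)[missing].mean())

# e.g. err_rnn  = average_imputation_error(X_true, X_rnn_imputed,  mask)
#      err_mean = average_imputation_error(X_true, X_mean_imputed, mask)
```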
79 5 0 5 10 15 20 25 30 35 frame number (a) 0. [sent-144, score-0.043]
80 2 10 20 30 40 50 60 70 80 % missing (b) Figure 3: (a) Missing values for digit 9 imputed by an RNN (b) Average imputation errors for mean imputation and RNN imputation 6. [sent-146, score-1.864]
81 Our next step will be to extend this recognition system to the connected digits recognition task with missing data, following the Aurora standard for robust ASR (Pearce et al. [sent-148, score-0.969]
82 This will provide a direct comparison with HMM-based missing data recognition (Barker et al. [sent-150, score-0.768]
83 In this case we will need to introduce ‘silence’ as an additional recognition class, and the training targets will be obtained by forced-alignment on clean speech with an existing recogniser. [sent-152, score-0.463]
84 We will use realistic missing data masks, rather than random deletions. [sent-153, score-0.558]
85 This is known to be a more demanding condition (Cooke et al. [sent-154, score-0.06]
86 When we are training using clean speech with added noise, another possibility is to use the true values of the corrupted features as training targets for imputation. [sent-156, score-0.42]
87 Use of actual targets for missing values has been reported by Seung (1997), but the RNN architecture in the latter work supports only pattern completion. [sent-157, score-0.648]
88 Some solutions to the missing feature problem in vision. [sent-162, score-0.549]
89 Linking auditory scene analysis and robust ASR by missing data techniques. [sent-177, score-0.668]
90 Soft decisions in missing data techniques for robust automatic speech recognition. [sent-186, score-0.914]
91 Decoding speech in the presence of other sound sources. [sent-195, score-0.215]
92 Hybrid HMM/ANN systems for speech recognition: Overview and new research directions. [sent-199, score-0.215]
93 Robust automatic speech recognition with missing and unreliable acoustic data. [sent-212, score-0.951]
94 Speaker verification in noisy environment with combined spectral subtraction and missing data theory. [sent-225, score-0.641]
95 Supervised learning from incomplete data via an EM approach. [sent-251, score-0.068]
96 State based imputation of missing data for robust speech recognition and speech enhancement. [sent-269, score-1.606]
97 A neural network for classification with incomplete data: application to robust ASR. [sent-299, score-0.119]
98 The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. [sent-305, score-0.395]
99 Reconstruction of damaged spectrographic features for robust speech recognition. [sent-318, score-0.355]
100 Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study. [sent-331, score-0.2]
wordName wordTfidf (topN-words)
[('missing', 0.525), ('rnn', 0.424), ('imputation', 0.384), ('speech', 0.215), ('cooke', 0.212), ('asr', 0.182), ('recognition', 0.15), ('barker', 0.14), ('imputed', 0.122), ('recurrent', 0.113), ('josifovski', 0.105), ('gingras', 0.087), ('shef', 0.087), ('robust', 0.084), ('classi', 0.075), ('green', 0.072), ('vizinho', 0.07), ('hidden', 0.069), ('links', 0.067), ('clean', 0.063), ('marginalisation', 0.061), ('et', 0.06), ('spectral', 0.057), ('features', 0.056), ('cation', 0.054), ('reliable', 0.052), ('beijing', 0.052), ('bourlard', 0.052), ('masks', 0.052), ('elman', 0.051), ('units', 0.051), ('energy', 0.046), ('rnns', 0.045), ('oh', 0.045), ('icslp', 0.045), ('layer', 0.043), ('frame', 0.043), ('output', 0.041), ('unconditional', 0.041), ('completion', 0.041), ('isolated', 0.041), ('morris', 0.039), ('hz', 0.036), ('digit', 0.036), ('architecture', 0.035), ('targets', 0.035), ('incomplete', 0.035), ('ahmed', 0.035), ('cdhmms', 0.035), ('furui', 0.035), ('hid', 0.035), ('impute', 0.035), ('leonard', 0.035), ('pedersen', 0.035), ('uncorrupted', 0.035), ('speakers', 0.035), ('unreliable', 0.035), ('unit', 0.034), ('data', 0.033), ('eld', 0.033), ('bengio', 0.032), ('jordan', 0.032), ('techniques', 0.031), ('delay', 0.031), ('pearce', 0.03), ('aurora', 0.03), ('raj', 0.03), ('budapest', 0.03), ('cdhmm', 0.03), ('net', 0.03), ('values', 0.029), ('noise', 0.029), ('delayed', 0.029), ('adapting', 0.029), ('hearing', 0.028), ('automatic', 0.026), ('arrows', 0.026), ('auditory', 0.026), ('jm', 0.026), ('eurospeech', 0.026), ('subtraction', 0.026), ('accepted', 0.026), ('lippmann', 0.026), ('feedforward', 0.026), ('replaced', 0.025), ('pattern', 0.024), ('feature', 0.024), ('input', 0.024), ('trained', 0.023), ('adapt', 0.023), ('communication', 0.023), ('zeros', 0.023), ('mismatch', 0.023), ('forward', 0.022), ('self', 0.022), ('mask', 0.022), ('speaker', 0.022), ('added', 0.022), ('uk', 0.022), ('workshop', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
2 0.19343275 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information. 1
3 0.18340524 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
Author: Brendan J. Frey, Trausti T. Kristjansson, Li Deng, Alex Acero
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples. 1
4 0.11924254 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
5 0.10456865 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
6 0.092347205 190 nips-2001-Thin Junction Trees
7 0.082024284 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
8 0.081560463 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
9 0.079603337 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
10 0.077553049 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
11 0.071326964 172 nips-2001-Speech Recognition using SVMs
12 0.061673176 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
13 0.059239198 46 nips-2001-Categorization by Learning and Combining Object Parts
14 0.057378564 167 nips-2001-Semi-supervised MarginBoost
15 0.057124507 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
16 0.056121543 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
17 0.055389628 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
18 0.052700419 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
19 0.051152129 43 nips-2001-Bayesian time series classification
20 0.048450559 193 nips-2001-Unsupervised Learning of Human Motion Models
topicId topicWeight
[(0, -0.157), (1, 0.01), (2, -0.065), (3, 0.026), (4, -0.229), (5, 0.079), (6, 0.086), (7, -0.108), (8, -0.001), (9, -0.067), (10, -0.013), (11, 0.049), (12, -0.02), (13, 0.214), (14, -0.082), (15, 0.086), (16, -0.05), (17, -0.059), (18, -0.122), (19, -0.001), (20, -0.037), (21, -0.035), (22, 0.05), (23, 0.125), (24, -0.048), (25, 0.054), (26, 0.151), (27, -0.052), (28, 0.085), (29, 0.061), (30, 0.037), (31, 0.083), (32, -0.073), (33, -0.103), (34, -0.059), (35, 0.087), (36, -0.047), (37, -0.027), (38, -0.073), (39, -0.04), (40, 0.002), (41, -0.008), (42, 0.061), (43, 0.018), (44, 0.041), (45, 0.023), (46, -0.015), (47, 0.008), (48, -0.005), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.96589547 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
2 0.73783052 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
Author: Brendan J. Frey, Trausti T. Kristjansson, Li Deng, Alex Acero
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples. 1
3 0.73686147 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models
Author: John R. Hershey, Michael Casey
Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information. 1
4 0.66201937 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
Author: K. Yao, S. Nakamura
Abstract: We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and noise prior evolved from previous estimation. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimation of the noise parameter, and prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusion the estimation of samples according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and highly non-stationary Machinegun noise. In all the experiments carried out, we observed that the method can have significant recognition performance improvement, over that achieved by noise compensation with stationary noise assumption. 1
5 0.52099884 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
6 0.424932 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
7 0.42240632 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
8 0.40039653 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
9 0.38581672 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
10 0.35294977 91 nips-2001-Improvisation and Learning
11 0.32523191 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
12 0.31235191 172 nips-2001-Speech Recognition using SVMs
13 0.30224809 14 nips-2001-A Neural Oscillator Model of Auditory Selective Attention
14 0.29648638 190 nips-2001-Thin Junction Trees
15 0.2905674 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
16 0.27306485 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding
17 0.26759869 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
18 0.25893012 193 nips-2001-Unsupervised Learning of Human Motion Models
19 0.25835216 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
20 0.25426626 76 nips-2001-Fast Parameter Estimation Using Green's Functions
topicId topicWeight
[(13, 0.014), (14, 0.018), (19, 0.015), (20, 0.01), (27, 0.07), (30, 0.577), (38, 0.021), (59, 0.025), (72, 0.043), (79, 0.03), (83, 0.011), (91, 0.066)]
simIndex simValue paperId paperTitle
Author: Takashi Morie, Tomohiro Matsuura, Makoto Nagata, Atsushi Iwata
Abstract: This paper describes a clustering algorithm for vector quantizers using a “stochastic association model”. It offers a new simple and powerful softmax adaptation rule. The adaptation process is the same as the on-line K-means clustering method except for adding random fluctuation in the distortion error evaluation process. Simulation results demonstrate that the new algorithm can achieve efficient adaptation as high as the “neural gas” algorithm, which is reported as one of the most efficient clustering methods. It is a key to add uncorrelated random fluctuation in the similarity evaluation process for each reference vector. For hardware implementation of this process, we propose a nanostructure, whose operation is described by a single-electron circuit. It positively uses fluctuation in quantum mechanical tunneling processes.
same-paper 2 0.96119928 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
3 0.9446795 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections
Author: Xiaohui Xie, Martin A. Giese
Abstract: Asymmetric lateral connections are one possible mechanism that can account for the direction selectivity of cortical neurons. We present a mathematical analysis for a class of these models. Contrasting with earlier theoretical work that has relied on methods from linear systems theory, we study the network’s nonlinear dynamic properties that arise when the threshold nonlinearity of the neurons is taken into account. We show that such networks have stimulus-locked traveling pulse solutions that are appropriate for modeling the responses of direction selective cortical neurons. In addition, our analysis shows that outside a certain regime of stimulus speeds the stability of this solutions breaks down giving rise to another class of solutions that are characterized by specific spatiotemporal periodicity. This predicts that if direction selectivity in the cortex is mainly achieved by asymmetric lateral connections lurching activity waves might be observable in ensembles of direction selective cortical neurons within appropriate regimes of the stimulus speed.
4 0.93044734 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
Author: B. Zadrozny
Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.
5 0.90397882 163 nips-2001-Risk Sensitive Particle Filters
Author: Sebastian Thrun, John Langford, Vandi Verma
Abstract: We propose a new particle filter that incorporates a model of costs when generating particles. The approach is motivated by the observation that the costs of accidentally not tracking hypotheses might be significant in some areas of state space, and next to irrelevant in others. By incorporating a cost model into particle filtering, states that are more critical to the system performance are more likely to be tracked. Automatic calculation of the cost model is implemented using an MDP value function calculation that estimates the value of tracking a particular state. Experiments in two mobile robot domains illustrate the appropriateness of the approach.
6 0.88888961 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
7 0.7533744 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
8 0.75072104 149 nips-2001-Probabilistic Abstraction Hierarchies
9 0.73449868 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity
10 0.69916314 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
11 0.65484875 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
12 0.61707878 46 nips-2001-Categorization by Learning and Combining Object Parts
13 0.60987276 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
14 0.60825652 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
15 0.59826183 60 nips-2001-Discriminative Direction for Kernel Classifiers
16 0.59725791 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
17 0.59284693 34 nips-2001-Analog Soft-Pattern-Matching Classifier using Floating-Gate MOS Technology
18 0.59216511 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines
19 0.58757699 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
20 0.58675164 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field