
39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models


Source: pdf

Author: John R. Hershey, Michael Casey

Abstract: It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information, and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
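The factorial combination of signal and noise models is the crux of the abstract. As a rough illustration (not the authors' algorithm), the numpy sketch below shows how a single frame's mixture log-spectrum can be scored against every pairing of speech and noise states using the log-max approximation standard in model-based enhancement [6, 8], and how a clean-speech estimate falls out of the state-pair posterior. All dimensions, the flat state prior, and the unit-variance Gaussians are assumptions made for the example; the paper's HMM dynamics, visual stream, and fast approximate inference are omitted.

```python
import numpy as np

# Toy dimensions (assumptions for this sketch, not from the paper).
F = 8          # frequency bins in the log-spectral feature
Ns, Nn = 3, 2  # speech / noise model state counts

rng = np.random.default_rng(0)
mu_s = rng.normal(0.0, 1.0, (Ns, F))   # speech state means (log-spectra)
mu_n = rng.normal(-1.0, 1.0, (Nn, F))  # noise state means (log-spectra)
y = rng.normal(0.0, 1.0, F)            # observed mixture log-spectrum

# Factorial combination: each (speech, noise) state pair predicts the
# mixture via the log-max approximation log(a + b) ~= max(log a, log b).
pred = np.maximum(mu_s[:, None, :], mu_n[None, :, :])   # (Ns, Nn, F)

# Gaussian log-likelihood of the observation under each state pair
# (unit variances assumed for brevity).
loglik = -0.5 * np.sum((y - pred) ** 2, axis=-1)        # (Ns, Nn)

# Posterior over state pairs; a flat prior stands in for the HMM
# transition dynamics (and visual evidence) used in the paper.
post = np.exp(loglik - loglik.max())
post /= post.sum()

# Refiltering-style estimate: where the speech model dominates the
# prediction, trust the observation; elsewhere fall back to the
# speech state mean. Average over state pairs with the posterior.
mask = (mu_s[:, None, :] >= mu_n[None, :, :]).astype(float)  # (Ns, Nn, F)
est = mask * y + (1.0 - mask) * mu_s[:, None, :]             # (Ns, Nn, F)
s_hat = np.einsum('ij,ijf->f', post, est)
print("estimated clean log-spectrum:", s_hat)
```

The combinatorial cost is visible even here: the likelihood table grows as Ns x Nn per frame, which is why the paper resorts to approximate inference rather than exact evaluation over all state pairs.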


reference text

[1] W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212-215, 1954.

[2] Jordi Robert-Ribes, Jean-Luc Schwartz, Tahar Lallouache, and Pierre Escudier. Complementarity and synergy in bimodal speech. Journal of the Acoustical Society of America, 103(6):3677-3689, 1998.

[3] Stéphane Dupont and Juergen Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141-151, 2000.

[4] John W. Fisher, Trevor Darrell, William T. Freeman, and Paul Viola. Learning joint statistical models for audio-visual fusion and segregation. In Advances in Neural Information Processing Systems 13, 2001.

[5] Laurent Girin, Jean-Luc Schwartz, and Gang Feng. Audio-visual enhancement of speech in noise. Journal of the Acoustical Society of America, 109(6):3007-3019, 2001.

[6] Yariv Ephraim. Statistical-model based speech enhancement systems. Proceedings of the IEEE, 80(10):1526-1554, 1992.

[7] Z. Ghahramani and M. Jordan. Factorial hidden Markov models. In David S. Touretzky, Michael C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996.

[8] Sam T. Roweis. One microphone source separation. In Advances in Neural Information Processing Systems 13, 2001.

[9] Hagai Attias, John C. Platt, Alex Acero, and Li Deng. Speech denoising and dereverberation using probabilistic models. In Advances in Neural Information Processing Systems 13, 2001.

[10] M. J. F. Gales. Model-Based Techniques for Noise Robust Speech Recognition. PhD thesis, Cambridge University, 1996.

[11] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory, 47(2):498-519, 2001.

[12] F. J. Huang and T. Chen. Real-time lip-synch face animation driven by human voice. In IEEE Workshop on Multimedia Signal Processing, Los Angeles, California, Dec 1998.

[13] Matt Brand. Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation, 11(5):1155-1182, 1999.

[14] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models. Computer Speech and Language, 9:171-185, 1995.