72 nips-2012-Cocktail Party Processing via Structured Prediction


Source: pdf

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture the complex, nonlinear relationship between input and output, both state and transition feature functions in the CRFs are learned by deep neural networks. Formulating the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.
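
For concreteness, the model the abstract describes can be sketched as a linear-chain CRF over the speech-dominance labels y = (y_1, ..., y_T) of the time-frequency units along one frequency channel, conditioned on acoustic features x. This is a rough sketch in our own notation; f_s, f_tr, \theta, and \phi are illustrative symbols, not the paper's exact parameterization:

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \big[ f_s(y_t, x; \theta) + f_{tr}(y_{t-1}, y_t, x; \phi) \big] \Big)

Here the state function f_s and the transition function f_tr are outputs of deep neural networks with weights \theta and \phi, rather than the usual linear feature functions, and Z(x) is the partition function computed by the forward algorithm. Labeling each unit as speech-dominant or noise-dominant is then binary sequence labeling on this chain.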


reference text

[1] R. Hendriks, R. Heusdens, and J. Jensen, “MMSE based noise PSD tracking with low complexity,” in ICASSP, 2010.

[2] P. Scalart and J. Filho, “Speech enhancement based on a priori signal to noise estimation,” in ICASSP, 1996.

[3] S. Roweis, “One microphone source separation,” in NIPS, 2001.

[4] D. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken, NJ: Wiley-IEEE Press, 2006.

[5] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press, 1994.

[6] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA: Kluwer Academic, 2005, pp. 181–197.

[7] D. Brungart, P. Chang, B. Simpson, and D. Wang, “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Am., vol. 120, pp. 4007–4018, 2006.

[8] M. Anzalone, L. Calandruccio, K. Doherty, and L. Carney, “Determination of the potential benefit of time-frequency gain manipulation,” Ear and Hearing, vol. 27, no. 5, pp. 480–492, 2006.

[9] N. Li and P. Loizou, “Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction,” J. Acoust. Soc. Am., vol. 123, no. 3, pp. 1673–1682, 2008.

[10] G. Kim, Y. Lu, Y. Hu, and P. Loizou, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. Acoust. Soc. Am., vol. 126, pp. 1486–1494, 2009.

[11] K. Han and D. Wang, “An SVM based classification approach to speech separation,” in ICASSP, 2011.

[12] Y. Wang, K. Han, and D. Wang, “Exploring monaural features for classification-based speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., in press, 2012.

[13] G. Mysore and P. Smaragdis, “A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics,” in ICASSP, 2011.

[14] J. Hershey, T. Kristjansson, S. Rennie, and P. Olsen, “Single channel speech separation using factorial dynamics,” in NIPS, 2007.

[15] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML, 2001.

[16] J. Nocedal and S. Wright, Numerical Optimization. New York: Springer-Verlag, 1999.

[17] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[18] L. van der Maaten, M. Welling, and L. Saul, “Hidden-unit conditional random fields,” in AISTATS, 2011.

[19] L.-P. Morency, A. Quattoni, and T. Darrell, “Latent-dynamic discriminative models for continuous gesture recognition,” in CVPR, 2007.

[20] J. Peng, L. Bo, and J. Xu, “Conditional neural fields,” in NIPS, 2009.

[21] A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Speech Recognition and Related Applications, 2009.

[22] T. Do and T. Artières, “Neural conditional random fields,” in AISTATS, 2010.

[23] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[24] IEEE, “IEEE Recommended Practice for Speech Quality Measurements,” IEEE Trans. Audio Electroacoust., vol. 17, pp. 225–246, 1969.

[25] G. Hu and D. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135–1150, 2004.

[26] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vector machine learning for interdependent and structured output spaces,” in ICML, 2004.

[27] J. Garofolo, DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, NIST, 1993.