nips2010-28 reference knowledge-graph by maker-knowledge-mining

28 nips-2010-An Alternative to Low-level-Synchrony-Based Methods for Speech Detection


Source: pdf

Author: Javier R. Movellan, Paul L. Ruvolo

Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human-computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further, we show that this person-independent visual-only detector can be used to train very accurate audio-based person-dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.
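The abstract mentions two ingredients that are easy to illustrate in code: a synchrony score (does the visual signal around a candidate speaker correlate with the audio?) and a simple score-level fusion of visual and auditory evidence. The sketch below is a minimal NumPy illustration of these two ideas only; the function names, the equal-weight convex combination, and the sigmoid squashing of an audio log-likelihood ratio are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def synchrony_score(audio_energy, visual_motion):
    """Pearson correlation between an audio energy envelope and a
    per-frame visual activity signal (both sampled at the video rate).
    High correlation is taken as evidence that the candidate is talking."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    v = (visual_motion - visual_motion.mean()) / (visual_motion.std() + 1e-8)
    return float(np.mean(a * v))

def fused_score(visual_prob, audio_log_lik_ratio, w=0.5):
    """Toy score-level fusion: convex combination of a visual talking
    probability and a sigmoid-squashed audio log-likelihood ratio.
    The weight w = 0.5 is an arbitrary illustrative choice."""
    audio_prob = 1.0 / (1.0 + np.exp(-audio_log_lik_ratio))
    return w * visual_prob + (1.0 - w) * audio_prob

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.random(300)                     # audio energy per video frame
    lips = 0.8 * audio + 0.2 * rng.random(300)  # visual motion correlated with audio
    print("synchrony:", synchrony_score(audio, lips))
    print("fused:", fused_score(visual_prob=0.9, audio_log_lik_ratio=1.2))
```

In the paper's setting the visual evidence would come from facial-muscle (action unit) features rather than raw pixel motion, and the audio evidence from a person-dependent voice model; the sketch only shows where such scores would plug into a correlation test and a fusion rule.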


reference text

[1] M. S. Bartlett, G. Littlewort, C. Lainscsek, I. Fasel, and J. Movellan. Recognition of facial actions in spontaneous expressions. Journal of Multimedia, 2006. 7

[2] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. R. Fasel, and J. R. Movellan. Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6):22, 2006. 3

[3] J. Bridle and M. Brown. An experimental automatic word recognition system. JSRU Report, 1003, 1974. 5

[4] A. Declercq and J. Piater. Online learning of Gaussian mixture models: a two-level approach. In Int'l Conf. Comp. Vis., Imaging and Comp. Graph. Theory and Applications, pages 605–611, 2008. 8

[5] A. DeMaris. A tutorial in logistic regression. Journal of Marriage and the Family, pages 956–968, 1995. 3

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977. 5

[7] P. Ekman, W. Friesen, and J. Hager. Facial Action Coding System (FACS): Manual and Investigator’s Guide. A Human Face, Salt Lake City, UT, 2002. 2

[8] J. Fisher and T. Darrell. Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia, 6(3):406–413, 2004. 1, 7

[9] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004. 2

[10] J. Hershey and J. Movellan. Audio-vision: Using audio-visual synchrony to locate sounds. Advances in Neural Information Processing Systems, 12:813–819, 2000. 1, 2

[11] D. Hosmer and S. Lemeshow. Applied logistic regression. Wiley-Interscience, 2000. 3

[12] E. Kidron, Y. Schechner, and M. Elad. Pixels that sound. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 88, 2005. 1, 2, 7

[13] G. Littlewort, M. Bartlett, and K. Lee. Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain. In Proceedings of the 9th international conference on Multimodal interfaces, pages 15–21. ACM, 2007. 3

[14] B. Logan. Mel frequency cepstral coefficients for music modeling. In International Symposium on Music Information Retrieval, volume 28, 2000. 5

[15] J. Movellan and P. Mineiro. Robust sensor fusion: Analysis and application to audio visual speech recognition. Machine Learning, 32(2):85–100, 1998. 6

[16] A. Noulas and B. Krose. On-line multi-modal speaker diarization. In Proceedings of the 9th international conference on Multimodal interfaces, pages 350–357. ACM, 2007. 1, 7

[17] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. CUAVE: A new audio-visual database for multimodal human-computer interface research. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, 2002. 7

[18] J. Rehg, K. Murphy, and P. Fieguth. Vision-based speaker detection using Bayesian networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 110–116, 1999. 7

[19] D. Reynolds. Experimental evaluation of features for robust speaker identification. IEEE Transactions on Speech and Audio Processing, 2(4):639–643, 1994. 5

[20] D. Reynolds, T. Quatieri, and R. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000. 3

[21] M. Slaney and M. Covell. FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. Advances in Neural Information Processing Systems, pages 814–820, 2001. 1, 2, 7

[22] E. Vural, M. Cetin, A. Ercil, G. Littlewort, M. Bartlett, and J. Movellan. Drowsy driver detection through facial movement analysis. Lecture Notes in Computer Science, 4796:6–18, 2007. 3

[23] J. Whitehill, M. Bartlett, and J. Movellan. Automatic facial expression recognition for intelligent tutoring systems. Computer Vision and Pattern Recognition, 2008. 3

[24] J. Whitehill, M. S. Bartlett, and J. R. Movellan. Measuring the difficulty of a lecture using automatic facial expression recognition. In Intelligent Tutoring Systems, 2008. 3