
159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models


Source: pdf

Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla

Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
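The face model referenced throughout is an active appearance model (AAM) in the sense of Cootes et al. [7]: shape and appearance are each generated as a mean plus a linear combination of learned PCA modes. As a minimal illustrative sketch of that generative step (not the paper's implementation; the dimensions and the random stand-in model below are hypothetical), synthesis from a parameter vector looks like:

import numpy as np

# Minimal AAM synthesis sketch (illustrative only; all sizes hypothetical).
# An AAM generates shape s and appearance a as linear models:
#   s = s_mean + S_modes @ p      (2D landmark coordinates)
#   a = a_mean + A_modes @ lam    (pixel intensities in the mean-shape frame)

rng = np.random.default_rng(0)
n_landmarks, n_pixels = 34, 10_000        # hypothetical model sizes
n_shape_modes, n_app_modes = 15, 40       # hypothetical mode counts

# Stand-ins for a trained model; in practice these come from PCA on
# labelled training frames.
s_mean = rng.standard_normal(2 * n_landmarks)
S_modes = rng.standard_normal((2 * n_landmarks, n_shape_modes))
a_mean = rng.standard_normal(n_pixels)
A_modes = rng.standard_normal((n_pixels, n_app_modes))

def synthesize(p, lam):
    """Generate landmarks and shape-free texture from AAM parameters."""
    shape = s_mean + S_modes @ p          # landmark positions
    texture = a_mean + A_modes @ lam      # texture in the mean-shape frame
    # A full system would now warp `texture` from the mean shape to
    # `shape` (e.g. a piecewise-affine warp over a triangulation).
    return shape.reshape(-1, 2), texture

shape, texture = synthesize(np.zeros(n_shape_modes), np.zeros(n_app_modes))
print(shape.shape, texture.shape)         # (34, 2) (10000,)

A real system would fit such a model to training footage and, as the abstract notes, add pose and blink normalization on top of this basic generative model.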


reference text

[1] I. Albrecht, M. Schröder, J. Haber, and H. Seidel. Mixed feelings: expression of non-basic emotions in a muscle-based talking head. Virt. Real., 8(4):201–212, 2005. 1, 2

[2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. SIGGRAPH, pages 187–194, 1999. 4

[3] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution passive facial performance capture. ACM TOG, 29(4), 2010. 2

[4] M. Brand. Voice puppetry. In SIGGRAPH, pages 21–28, 1999. 2

[5] Y. Cao, W. Tien, P. Faloutsos, and F. Pighin. Expressive speech-driven facial animation. ACM TOG, 24(4):1283–1302, 2005. 1, 2, 7

[6] Y. Chang and T. Ezzat. Transferable videorealistic speech animation. In SIGGRAPH, pages 143–151, 2005. 1, 2, 7

[7] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE PAMI, 23(6):681–685, 2001. 1, 2, 3, 5

[8] D. Cosker, S. Paddock, D. Marshall, P. Rosin, and S. Rushton. Towards perceptually realistic talking heads: models, methods and McGurk. In Symp. on Applied Perception in Graphics and Visualization, pages 151–157, 2004. 2

[9] F. De la Torre and M. Black. Robust parameterized component analysis: theory and applications to 2D facial appearance models. Computer Vision and Image Understanding, 91(1):53–71, 2003. 2, 4

[10] S. Deena, S. Hou, and A. Galata. Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model. In ICMI-MLMI, pages 1–8, 2010. 1, 2, 7

[Figure 4 caption, from page layout: Example synthesis for (a) neutral, (b) tender, (c) happy, (d) sad, (e) afraid and (f) angry; (g) modification of hair and (h) of teeth, close-up of the same angry frame (top) before modification and (bottom) after.]

[Figure 5 caption, from page layout: Emotion recognition for (a) real video cropped to face, (b) synthetic audio and video, (c) synthetic video only and (d) synthetic audio only. In each case 10 sentences in each emotion were evaluated by 20 different people. (e) gives the recognition rate for each emotion along with the 95% confidence interval.]

[11] G. Edwards, A. Lanitis, C. Taylor, and T. Cootes. Statistical models of face images - improving specificity. Image and Vision Computing, 16(3):203–211, 1998. 2, 3

[12] T. Ezzat and T. Poggio. MikeTalk: A talking facial display based on morphing visemes. In Proceedings of the Computer Animation Conference, pages 96–102, 1998. 2

[13] X. Gao, Y. Su, X. Li, and D. Tao. A review of active appearance models. IEEE Transactions on Systems, Man, and Cybernetics, 40(2):145–158, 2010. 2

[14] J. Gonzalez-Mora, F. De la Torre, R. Murthi, N. Guil, and E. Zapata. Bilinear active appearance models. ICCV, pages 1–8, 2007. 2

[15] J. Latorre, V. Wan, M. J. F. Gales, L. Chen, K. Chin, K. Knill, and M. Akamine. Speech factorization for HMM-TTS based on cluster adaptive training. In Interspeech, 2012. 5

[16] K. Liu and J. Ostermann. Realistic facial expression synthesis for an image-based talking head. In International Conference on Multimedia & Expo, pages 1–6, 2011. 1, 2, 7

[17] W. Ma, A. Jones, J. Chiang, T. Hawkins, S. Frederiksen, P. Peers, M. Vukovic, M. Ouhyoung, and P. Debevec. Facial performance synthesis using deformation-driven polynomial displacement maps. SIGGRAPH, 27(5), 2008. 2

[18] J. Melenchón, E. Martínez, F. De la Torre, and J. Montero. Emphatic visual speech synthesis. Trans. Audio, Speech and Lang. Proc., 17(3):459–468, 2009. 7

[19] I. Pandzic, J. Ostermann, and D. Millen. User evaluation: synthetic talking faces for interactive services. The Visual Computer, 15(7):330–340, 1999. 1

[20] E. Sifakis, A. Selle, A. Robinson-Mosher, and R. Fedkiw. Simulating speech with a physics-based facial muscle model. In SCA ACM/Eurographics, pages 261–270, 2006. 2

[21] S. Taylor, M. Mahler, B. Theobald, and I. Matthews. Dynamic units of visual speech. In Eurographics Symposium on Computer Animation, pages 275–284, 2012. 2

[22] J. Tena, F. De la Torre, and I. Matthews. Interactive region-based linear 3D face models. ACM TOG, 30(4):76, 2011. 4

[23] B. Theobald, J. Bangham, I. Matthews, and G. Cawley. Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Comm., 44(1–4):127–140, 2004. 1, 2

[24] K. Wampler, D. Sasaki, L. Zhang, and Z. Popović. Dynamic, expressive speech animation from a single mesh. In SCA ACM/Eurographics, pages 53–62, 2007. 2

[25] L. Wang, W. Han, X. Qian, and F. Soong. Photo-real lips synthesis with trajectory-guided sample selection. In Speech Synth. Workshop, Int. Speech Comm. Assoc., 2010. 1, 2

[26] L. Wang, W. Han, F. Soong, and Q. Huo. Text driven 3D photo-realistic talking head. In Interspeech, pages 3307–3308, 2011. 2, 7

[27] K. Waters and T. Levergood. DECface: A system for synthetic face applications. Multimedia Tools and Applications, 1(4):349–366, 1995. 1

[28] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulović, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(5), 2012. 5

[29] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, November 2009. 1, 5