437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading

Source: pdf

Author: Yuru Pei, Tae-Kyun Kim, Hongbin Zha

Abstract: Lipreading from visual channels remains a challenging topic, given the wide variation in speaking characteristics. In this paper, we present an efficient lipreading approach based on unsupervised random forest manifold alignment (RFMA). A density random forest is employed to estimate the affinity of patch trajectories in videos of speaking faces. We propose novel node-splitting criteria to avoid rank deficiency when learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently and are then used to find embeddings of the speaking video clips with a graph-based algorithm. Lipreading is formulated as matching between the manifolds of query and reference video clips. We employ a manifold alignment technique for matching, in which an L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lipreading can be performed effectively and that the approach outperforms state-of-the-art methods.
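As a rough illustration of two ideas named in the abstract, the sketch below computes a forest-based affinity and an L∞ distance. It is not the authors' method: it assumes the common leaf co-occurrence definition of forest affinity (the fraction of trees in which two samples land in the same leaf) and a plain L∞ distance between two index-matched point sets. The paper's node-splitting criteria and its exact manifold-to-manifold distance are not given here. The names leaf_affinity and linf_distance are hypothetical.

```python
from typing import List, Sequence


def leaf_affinity(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Leaf co-occurrence affinity (an assumed, generic definition).

    leaves[i][t] is the index of the leaf that sample i reaches in tree t.
    The affinity of samples i and j is the fraction of trees in which
    both samples reach the same leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            affinity[i][j] = shared / n_trees
    return affinity


def linf_distance(x: Sequence[Sequence[float]],
                  y: Sequence[Sequence[float]]) -> float:
    """L-infinity distance between two index-matched point sets.

    Returns the largest absolute coordinate difference over all matched
    points. The correspondence x[k] <-> y[k] is assumed to be known.
    """
    return max(abs(a - b) for p, q in zip(x, y) for a, b in zip(p, q))


if __name__ == "__main__":
    # Three samples routed through three trees (made-up leaf indices).
    toy_leaves = [[0, 1, 2], [0, 1, 3], [4, 5, 3]]
    for row in leaf_affinity(toy_leaves):
        print(row)
    # Two made-up 2-D embeddings with matched points.
    print(linf_distance([[0.0, 1.0], [2.0, 2.0]], [[0.5, 1.0], [2.0, 4.0]]))
```

In a pipeline of the kind the abstract describes, an affinity matrix like this would feed a graph-based embedding step, and the distance would compare the embedded query and reference clips. Both steps are omitted here.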

reference text

[1] M. Aharon and R. Kimmel. Representation analysis and synthesis of lip images using dimensionality reduction. IJCV, 67(3):297–312, 2006.

[2] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[3] J. Chen, H.-r. Fang, and Y. Saad. Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. The Journal of Machine Learning Research, 10:1989–2012, 2009.

[4] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. on PAMI, 23(6):681–685, 2001.

[5] S. Cox, R. Harvey, Y. Lan, J. Newman, and B. Theobald. The challenge of multispeaker lip-reading. In International Conference on Auditory-Visual Speech Processing, pages 179–184, 2008.

[6] A. Criminisi. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2-3):81–227, 2011.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.

[8] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586. ACM, 2011.

[9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Trans. on PAMI, 33(11):2188–2202, 2011.

[10] K. Gray, P. Aljabar, R. Heckemann, A. Hammers, and D. Rueckert. Random forest-based manifold learning for classification of imaging data in dementia. Machine Learning in Medical Imaging, pages 159–166, 2011.

[11] J. Ham, D. Lee, and L. Saul. Semisupervised Alignment of Manifolds. In Proc. of the Tenth Int’l Workshop on Artificial Intelligence and Statistics, pages 120–127, 2005.

[12] S. Lafon, Y. Keller, and R. Coifman. Data Fusion and Multicue Data Matching by Diffusion Maps. IEEE Trans. on PAMI, 28(11):1784–1797, 2006.

[13] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Trans. on PAMI, 24(2):198–213, 2002.

[14] F. Moosmann, B. Triggs, F. Jurie, et al. Fast discriminative visual codebooks using randomized clustering forests. In NIPS’06, pages 985–992, 2006.

[15] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.

[16] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.

[17] Y. Pei, F. Huang, F. Shi, and H. Zha. Unsupervised image matching based on manifold alignment. IEEE Trans. on PAMI, 34(8):1658–1664, 2012.

[18] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior. Recent advances in the automatic recognition of audiovisual speech. Proc. the IEEE, 91(9):1306–1326, 2003.

[19] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell. Visual speech recognition with loosely synchronized feature streams. In ICCV, pages 1424–1431, 2005.

[20] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Conf. on CVPR, volume 2, page 7, 2011.

[21] R. Sim and N. Roy. Global a-optimal robot exploration in slam. In IEEE Conf. on ICRA, pages 661–666, 2005.

[22] C. Wang and S. Mahadevan. Manifold Alignment without Correspondence. In Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI), pages 1273–1278, 2009.

[23] G. Yu, J. Yuan, and Z. Liu. Unsupervised random forest indexing for fast action search. In IEEE Conf. on CVPR, pages 865–872, 2011.

[24] G. Zhao, M. Barnard, and M. Pietikainen. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7):1254–1265, 2009.

[25] Z. Zhou, G. Zhao, and M. Pietikainen. Lipreading: a graph embedding approach. In ICPR, pages 523–526, 2010.

[26] Z. Zhou, G. Zhao, and M. Pietikainen. Towards a practical lipreading system. In IEEE Conf. on CVPR, pages 137–144, 2011.