acl acl2013 acl2013-175 acl2013-175-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts si- multaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate description of new video.
A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, N. Siddharth, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. 2012a. Video in sentences out. In Proceedings of the Twenty-Eighth Conference on Un- certainty in Artificial Intelligence, pages 102–1 12. A. Barbu, N. Siddharth, A. Michaux, and J. M. Siskind. 2012b. Simultaneous object detection, tracking, and event recognition. Advances in Cognitive Systems, 2:203–220, December. L. E. Baum and T. Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37: 1554–1563. L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occuring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1): 164– 171. L. E. Baum. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3: 1–8. J. Bilmes. 1997. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, ICSI. M. Brand, N. Oliver, and A. Pentland. 1997. Coupled hidden Markov models for complex action recog- nition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 994–999. D. L. Chen and R. J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. W. Dinkelbach. 1967. On nonlinear fractional programming. Management Science, 13(7):492–498. P. F. Dominey and J.-D. Boucher. 2005. Learning to talk about events from narrated video in a construction grammar framework. Artificial Intelligence, 167(12):31–61. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. 2010a. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9): 1627–1645. P. F. Felzenszwalb, R. B. Girshick, and D. A. McAllester. 2010b. Cascade object detection with deformable part models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2241–2248. G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1601–1608. P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. 2012. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 359–368. T. Kwiatkowski, S. Goldwater, L. Zettlemoyer, and M. Steedman. 2012. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234– 244. V. Ordonez, G. Kulkarni, and T. L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Proceedings of Neural Information Processing Systems. D. Roy. 2002. Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language, 16:353–385. S. Sadanand and J. J. Corso. 2012. Action bank: A high-level representation of activity in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1234–1241 . J. M. Siskind and Q. Morris. 1996. A maximumlikelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347–360. J. M. Siskind. 1996. A computational study of crosssituational techniques for learning word-to-meaning mappings. Cognition, 61:39–91. T. Starner, J. Weaver, and A. Pentland. 1998. Realtime American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12): 1371–1375. A. J. Viterbi. 1967. Error bounds for convolutional codes and an asymtotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260–267. A. Viterbi. 1971 . Convolutional codes and their performance in communication systems. IEEE Transactions on Communication Technology, 19(5):75 1– 772. J. Yamoto, J. Ohya, and K. Ishii. 1992. Recognizing human action in time-sequential images using hidden Markov model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–385. 62 B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. 2010. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8): 1485–1508, August. C. Yu and D. H. Ballard. 2004. On the integration of grounding language and learning objects. In Proceedings of the 19th National Conference on Artifical intelligence, pages 488–493. S. Zhong and J. Ghosh. 2001. A new formulation of coupled hidden Markov models. Technical report, Department of Electrical and Computer Engineering, The University of Texas at Austin. 63