
51 ACL 2012: Collective Generation of Natural Image Descriptions

Source: pdf

Author: Polina Kuznetsova ; Vicente Ordonez ; Alexander Berg ; Tamara Berg ; Yejin Choi

Abstract: We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those phrases to generate a novel description for the query image. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization and discourse structure. Evaluation by human annotators indicates that our final system generates more semantically correct and linguistically appealing descriptions than two nontrivial baselines.
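The "selectively combine those phrases" step can be illustrated with a toy example. The sketch below is not the paper's formulation: the abstract only says generation is cast as constraint optimization, and gives no variables, objective, or solver. Here exhaustive search stands in for an optimizer, and the phrase slots, scores, cohesion bonuses, and the `compose` function are all hypothetical, invented for illustration.

```python
from itertools import product

# Hypothetical retrieved phrases per slot: (text, relevance score).
CANDIDATES = {
    "NP": [("a black cat", 0.9), ("the small dog", 0.4)],
    "VP": [("sleeping", 0.7), ("running fast", 0.5)],
    "PP": [("on the sofa", 0.8), ("in the park", 0.6)],
}

# Hypothetical cohesion bonuses for adjacent phrase pairs.
COHESION = {
    ("a black cat", "sleeping"): 0.3,
    ("sleeping", "on the sofa"): 0.4,
    ("the small dog", "running fast"): 0.3,
    ("running fast", "in the park"): 0.4,
}

SLOT_ORDER = ["NP", "VP", "PP"]


def compose(candidates, cohesion, max_words):
    """Choose at most one phrase per slot, maximising relevance plus
    cohesion, subject to a word budget and a mandatory noun phrase."""
    best_score, best_choice = float("-inf"), []
    # None stands for "leave this slot empty".
    options = [[None] + candidates[slot] for slot in SLOT_ORDER]
    for choice in product(*options):
        if choice[0] is None:
            continue  # constraint: a description needs a noun phrase
        chosen = [c for c in choice if c is not None]
        n_words = sum(len(text.split()) for text, _ in chosen)
        if n_words > max_words:
            continue  # constraint: length budget
        score = sum(s for _, s in chosen)
        score += sum(
            cohesion.get((a[0], b[0]), 0.0) for a, b in zip(chosen, chosen[1:])
        )
        if score > best_score:
            best_score, best_choice = score, chosen
    return " ".join(text for text, _ in best_choice), best_score


if __name__ == "__main__":
    print(compose(CANDIDATES, COHESION, max_words=7))
```

Tightening `max_words` forces the search to drop a slot, which shows how content selection and length constraints interact in this toy setting. Exhaustive search only works at this scale; a realistic candidate pool would need a proper solver.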

reference text

Ahmet Aker and Robert Gaizauskas. 2010. Generating image descriptions using dependency relational patterns. In ACL.
Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1. In Linguistic Data Consortium.
James Clarke and Mirella Lapata. 2006. Constraint-based sentence compression: An integer programming approach. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 144–151, Sydney, Australia, July. Association for Computational Linguistics.
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’05) - Volume 1 - Volume 01, CVPR ’05, pages 886–893, Washington, DC, USA. IEEE Computer Society.
Haris Dindo and Daniele Zambuto. 2010. A probabilistic approach to learning a visually grounded language model through human-robot interaction. In IROS, pages 790–796. IEEE.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences for images. In ECCV.
Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part based models. tPAMI, Sept.
Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth? Automatic caption generation for news images. In ACL.
Fateh Muhammad Hafiz and Ian Tudor. 1989. Extensive reading and the development of language skills. ELT Journal, 43(1):4–13.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2011. Babytalk: Understanding and generating simple image descriptions. In CVPR.
Thomas K. Leung and Jitendra Malik. 1999. Recognizing surfaces using three-dimensional textons. In ICCV.
Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale ngrams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 220–228, Portland, Oregon, USA, June. Association for Computational Linguistics.
David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60:91–110, November.
Andre Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 1–9, Boulder, Colorado, June. Association for Computational Linguistics.
Derek D. Monner and James A. Reggia. 2011. Systematically grounding language through vision in a deep, recurrent neural network. In Proceedings of the 4th international conference on Artificial general intelligence, AGI’11, pages 112–121, Berlin, Heidelberg. Springer-Verlag.
Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In ACL.
Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL.
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING/ACL.
Deb K. Roy. 2002. Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language, In review.
Wai-King Tsang. 1996. Comparing the effects of reading and writing on writing performance. Applied Linguistics, 17(2):210–233.
Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 565–574, Uppsala, Sweden, July. Association for Computational Linguistics.
Kristian Woodsend, Yansong Feng, and Mirella Lapata. 2010. Title generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 513–523, Stroudsburg, PA, USA. Association for Computational Linguistics.
Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR.
Yezhou Yang, Ching Teo, Hal Daume III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 444–454, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.
Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2t: Image parsing to text description. Proc. IEEE, 98(8).