
98 emnlp-2013-Image Description using Visual Dependency Representations


Source: pdf

Author: Desmond Elliott ; Frank Keller

Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
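
The abstract describes two technical ingredients: a visual dependency representation (VDR), in which annotated image regions are nodes and the spatial relationships between them are labelled, directed edges; and template-based generation models that read that structure off to produce a sentence. The sketch below is a rough, non-authoritative illustration only; the relation label, object classes, template wording, and all function names are assumptions for exposition and do not reproduce the paper's actual annotation scheme or templates.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """An annotated image region; label is its object class (e.g. "man")."""
    label: str

# One dependency arc in a toy VDR: (head region, spatial relation, dependent region).
# The relation vocabulary here is an assumption, not the paper's inventory.
Arc = Tuple[Region, str, Region]

def describe(vdr: List[Arc]) -> str:
    """Fill a fixed template from the arc taken to encode the main event."""
    if not vdr:
        return "An image."
    head, relation, dependent = vdr[0]  # assume the first arc is the main event
    return f"The {head.label} is {relation} the {dependent.label}."

if __name__ == "__main__":
    man, bicycle = Region("man"), Region("bicycle")
    toy_vdr = [(man, "beside", bicycle)]
    print(describe(toy_vdr))  # -> "The man is beside the bicycle."

The point of the sketch is the division of labour the abstract implies: the VDR carries the predicted object-to-object relations, and the generation step is a deterministic template fill over that structure rather than over an unstructured bag of regions.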


reference text

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2011. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV ’10, pages 15–29, Heraklion, Crete, Greece.

P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. 2010. Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In CVPR ’11, pages 1601–1608, Colorado Springs, Colorado, U.S.A.

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. 2012. Collective Generation of Natural Image Descriptions. In ACL ’12, pages 359–368, Jeju Island, South Korea.

Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In CoNLL ’11, pages 220–228, Portland, Oregon, U.S.A.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL ’04, pages 605–612, Barcelona, Spain.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In ACL ’05, pages 91–98, University of Michigan, U.S.A.

Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Alyssa Mensch, Alex Berg, Tamara Berg, and Hal Daumé III. 2012. Midge: Generating Image Descriptions From Computer Vision Detections. In EACL ’12, pages 747–756, Avignon, France.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In AKBC-WEKEX Workshop at NAACL-HLT ’12, Montreal, Canada.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In NIPS 24, Granada, Spain.

Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. IJCV, 77(1-3):157–173.

Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-Guided Sentence Generation of Natural Images. In EMNLP ’11, pages 444–454, Edinburgh, Scotland, UK.