
246 iccv-2013-Learning the Visual Interpretation of Sentences


Source: pdf

Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende

Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
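As a concrete illustration of the modeling idea summarized above, the sketch below shows a toy version of a CRF over scene objects whose pairwise potentials are driven by extracted (noun, relation, noun) tuples, with a Gibbs sampler drawing an object layout. This is a minimal, hypothetical Python sketch: the grid discretization, the hand-set potential functions, and the sampler settings are assumptions made here for illustration, not the paper's learned potentials or implementation.

# Illustrative sketch (not the authors' code): a toy CRF over scene objects
# whose pairwise potentials come from (noun, relation, noun) tuples, with
# Gibbs sampling used to draw a scene layout. The grid size, relations, and
# potential functions are assumptions for illustration only.
import math
import random

GRID = 10  # discretize the scene into a GRID x GRID layout

# Predicate tuples as might be extracted from a sentence such as
# "The girl is next to the dog under the tree."
TUPLES = [("girl", "next to", "dog"), ("dog", "under", "tree")]
OBJECTS = sorted({t[0] for t in TUPLES} | {t[2] for t in TUPLES})

def pairwise_potential(relation, pos_a, pos_b):
    """Score a pair of positions under a relation (hand-set here;
    in the paper such potentials are learned from annotated scenes)."""
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    if relation == "next to":   # prefer small separation in both axes
        return math.exp(-(dx * dx + dy * dy) / 4.0)
    if relation == "under":     # prefer b above a (smaller y is higher)
        return math.exp(-((dy + 3) ** 2 + dx * dx) / 4.0)
    return 1.0                  # unknown relation: uniform potential

def scene_score(layout):
    """Unnormalized CRF score: product of pairwise potentials over tuples."""
    score = 1.0
    for a, rel, b in TUPLES:
        score *= pairwise_potential(rel, layout[a], layout[b])
    return score

def gibbs_sample(iters=500):
    """Resample each object's position conditioned on all the others."""
    layout = {o: (random.randrange(GRID), random.randrange(GRID)) for o in OBJECTS}
    cells = [(x, y) for x in range(GRID) for y in range(GRID)]
    for _ in range(iters):
        for o in OBJECTS:
            weights = []
            for c in cells:
                layout[o] = c
                weights.append(scene_score(layout))
            layout[o] = random.choices(cells, weights=weights, k=1)[0]
    return layout

if __name__ == "__main__":
    print(gibbs_sample())

The same unnormalized scoring function could also be used in the retrieval direction described in the abstract, ranking candidate scenes by how well their layouts satisfy the tuples extracted from a query.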


reference text

[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1250–1258. Association for Computational Linguistics, 2010.

[2] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135, 2003.

[3] K. Barnard and Q. Fan. Reducing correspondence ambiguity in loosely labeled training data. In CVPR, 2007.

[4] T. Berg, A. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In ECCV, 2010.

[5] I. Biederman, R. Mezzanotte, and J. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive psychology, 14(2), 1982.

[6] B. Coyne and R. Sproat. WordsEye: an automatic text-to-scene conversion system. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 487–496. ACM, 2001.

[7] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In CVPR, 2011.

[8] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In CVPR, 2010.

[9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.

[10] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences for images. In ECCV, 2010.

[11] B. Fasel and J. Luettin. Automatic facial expression analysis: a survey. Pattern Recognition, 36(1), 2003.

[12] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9), 2010.

[13] Y. Feng and M. Lapata. How many words is a picture worth? Automatic caption generation for news images. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1239–1249. Association for Computational Linguistics, 2010.

[14] S. Gobron, J. Ahn, G. Paltoglou, M. Thelwall, and D. Thalmann. From sentence to emotion: a real-time three-dimensional graphics metaphor of emotions extracted from text. The Visual Computer, 26(6-8):505–519, 2010.

[15] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In Int. Workshop OntoImage, 2006.

[16] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.

[17] D. Joshi, J. Z. Wang, and J. Li. The story picturing engine: a system for automatic text illustration. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1):68–89, 2006.

[18] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: Image search with relative attribute feedback. In CVPR, 2012.

[19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.

[20] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. In ECCV, 2010.

[21] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

[22] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.

[23] V. Lavrenko, R. Manmatha, and J. Jeon. A model for learning the semantics of pictures. In NIPS, 2003.

[24] M. Naphade, J. Smith, J. Tesic, S. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE Multimedia, 2006.

[25] V. Ordonez, G. Kulkarni, and T. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.

[26] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.

[27] K. Perlin and A. Goldberg. Improv: A system for scripting interactive actors in virtual worlds. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 205–216. ACM, 1996.

[28] C. Quirk, P. Choudhury, J. Gao, H. Suzuki, K. Toutanova, M. Gamon, W.-t. Yih, L. Vanderwende, and C. Cherry. MSR SPLAT, a language analysis toolkit. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstration Session, pages 21–24. Association for Computational Linguistics, 2012.

[29] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.

[30] N. Rasiwasia, P. Moreno, and N. Vasconcelos. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 2007.

[31] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.

[32] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.

[33] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.

[34] J. Smith, M. Naphade, and A. Natsev. Multimedia semantic indexing using model vectors. In ICME, 2003.

[35] X. Wang, K. Liu, and X. Tang. Query-specific visual semantic spaces for web image re-ranking. In CVPR, 2011.

[36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR. IEEE, 2010.

[37] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.

[38] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.

[39] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.

[40] E. Zavesky and S.-F. Chang. CuZero: Embracing the frontier of interactive visual search for informed users. In Proceedings of ACM Multimedia Information Retrieval, 2008.

[41] C. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In CVPR, 2013.