cvpr cvpr2013 cvpr2013-73 cvpr2013-73-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: C. Lawrence Zitnick, Devi Parikh
Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
[1] A. Berg, T. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos, et al. Understanding and predicting importance in images. In CVPR, 2012.
[2] T. Berg, A. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. ECCV, 2010.
[3] I. Biederman, R. Mezzanotte, and J. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive psychology, 14(2), 1982.
[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[5] L. Elazary and L. Itti. Interesting objects are visually salient. J. of Vision, 8(3), 2008.
[6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[7] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. ECCV, 2010.
[8] B. Fasel and J. Luettin. Automatic facial expression analysis: a survey. Pattern Recognition, 36(1), 2003.
[9] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9), 2010.
[10] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008.
[11] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In Int. Workshop OntoImage, 2006.
[12] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. ECCV, 2008.
[13] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. ECCV, 2010.
[14] F. Heider and M. Simmel. An experimental study of apparent behavior. The American Journal of Psychology, 1944.
[15] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
[16] S. Hwang and K. Grauman. Learning the relative importance of objects from tagged images for retrieval and cross-modal search. IJCV, 2011.
[17] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11), 1998.
[18] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[19] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[20] W.-H. Lin and A. Hauptmann. Which thousand words are worth a picture? experiments on video retrieval using a thousand concepts. In ICME, 2006.
[21] K. Oatley and N. Yuill. Perception of personal and interpersonal action in a cartoon film. British J. of Social Psychology, 24(2), 2011.
[22] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3), 2001.
[23] A. Oliva, A. Torralba, et al. The role of context in object recognition. Trends in cognitive sciences, 11(12), 2007.
[24] V. Ordonez, G. Kulkarni, and T. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[26] C. Privitera and L. Stark. Algorithms for defining visual regions-of-interest: Comparison with eye fixations. PAMI, 22(9), 2000.
[27] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[28] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using amazon's mechanical turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's MT, 2010.
[29] B. Russell, A. Torralba, K. Murphy, and W. Freeman. Labelme: a database and web-based tool for image annotation. IJCV, 2008.
[30] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
[31] M. Spain and P. Perona. Measuring and predicting object importance. IJCV, 91(1), 2011.
[32] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2004.
[33] P. Tseng, R. Carmi, I. Cameron, D. Munoz, and L. Itti. Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision, 9(7), 2009.
[34] S. Ullman, M. Vidal-Naquet, E. Sali, et al. Visual features of intermediate complexity and their use in classification. Nature neuroscience, 5(7), 2002.
[35] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
[36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR. IEEE, 2010.
[37] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
[38] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
[39] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.