cvpr cvpr2013 cvpr2013-416 cvpr2013-416-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg
Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images, as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.
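The abstract mentions prototype applications for gaze-enabled detection. As a rough illustration of the general idea only (not the paper's actual formulation), the Python sketch below rescores hypothetical detector boxes by the fraction of recorded fixations that land inside each box; the box format, the additive boost, and its weight are all assumptions made for this example.

# Minimal sketch of gaze-enabled detection rescoring (illustrative assumption,
# not the method described in the paper).
def rescore_with_gaze(detections, fixations, boost=0.5):
    """detections: list of (x1, y1, x2, y2, score); fixations: list of (x, y)."""
    rescored = []
    for (x1, y1, x2, y2, score) in detections:
        # Count fixations that fall inside this candidate box.
        inside = sum(1 for (fx, fy) in fixations
                     if x1 <= fx <= x2 and y1 <= fy <= y2)
        frac = inside / len(fixations) if fixations else 0.0
        # Boost detector confidence for boxes that attracted gaze.
        rescored.append((x1, y1, x2, y2, score + boost * frac))
    return rescored

# Example: two candidate boxes, with fixations concentrated on the first one.
dets = [(10, 10, 100, 100, 0.40), (150, 20, 220, 90, 0.45)]
fixs = [(40, 50), (55, 60), (70, 45), (200, 200)]
print(rescore_with_gaze(dets, fixs))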
[1] P. F. Baldi and L. Itti. Of bits and wows: A bayesian theory of surprise with applications to attention. Neural Networks, 23(5):649–666, 2010. 2
[2] A. C. Berg, T. L. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos, et al. Understanding and predicting importance in images. In CVPR, pages 3562–3569. IEEE, 2012. 5
[3] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010. 2, 3
[4] T. De Campos, G. Csurka, and F. Perronnin. Images as sets of locally weighted features. Computer Vision and Image Understanding, 116(1):68–85, 2012. 2
[5] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and F.-F. Li. Large scale visual recognition challenge. In http://www.image-net.org/challenges/LSVRC/2012/index, 2012. 1
[6] J. Deng, A. C. Berg, K. Li, and F.-F. Li. What does classifying more than 10,000 image categories tell us? In ECCV, 2010. 1
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009. 3
[8] K. A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva. Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17(6):945–978, 2009. 2
[9] W. Einhäuser, M. Spain, and P. Perona. Objects predict fixations better than early saliency. Journal of Vision, 8(14), 2008. 2
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. 2, 3
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. 1
[12] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, pages 1627–1645, 2009. 1, 3, 6
[13] J. M. Henderson. Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11):498–504, 2003. 6
[14] L. Itti. Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition, 12:1093–1123, 2005. 2
[15] L. Itti and P. F. Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, 2009. 2
Figure 8: Detection results on the PASCAL dataset. Left: baseline detection; Right: gaze-enabled detection. Gaze-enabled detection improves over the baseline for objects that people often fixate on (e.g. cat and person in the top three rows). Gaze also sometimes helps remove false positives (e.g. tv in Figure 1), but sometimes hurts performance by enhancing detector confusion (e.g. cow versus person in the 4th row, and bicycle and motorbike in the 5th row). Moreover, gaze sometimes adds additional false positives (e.g. plant in the 5th row).
[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489–1506, 2000. 2
[17] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001. 2
[18] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1
[19] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: Fast feature extraction and svm training. In CVPR, 2011. 1
[20] N. Mackworth and A. Morandi. The gaze selects informative details within pictures. Perception and Psychophysics, 2:547–552, 1967. 2
[21] M. B. Neider and G. J. Zelinsky. Scene context guides eye movements during search. Vision Research, pages 614–621, 2006. 2
[22] M. B. Neider and G. J. Zelinsky. Searching for camouflaged targets: Effects of target-background similarity on visual search. Vision Research, 46:2217–2235, 2006. 2
[23] D. J. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt visual selective attention. Vision Research, 42:107–123, 2002. 2
[24] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, 2012. 1
[25] X.-H. Phan. CRFTagger: CRF English POS Tagger. http://crftagger.sourceforge.net/, 2006. 5
[26] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010. 3
[27] L. W. Renninger, P. Verghese, and J. Coughlan. Where to look next? Eye movements reduce local uncertainty. Journal of Vision, 3:1–17, 2007. 2
[28] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14):1–17, 2007. 2
[29] B. W. Tatler, R. J. Baddeley, and B. T. Vincent. The long and the short of it: Spatial statistics at fixation vary with saccade amplitude and task. Vision Research, 46:1857–1862, 2006. 2
[30] J. Theeuwes. Stimulus-driven capture and attentional set: selective search for color and visual abrupt onsets. Journal of Experimental Psychology: Human Perception and Performance, 20:799–806, 1994. 2
[31] J. Theeuwes, A. Kramer, S. Hahn, D. Irwin, and G. Zelinsky. Influence of attentional capture on oculomotor control. Journal of Experimental Psychology: Human Perception and Performance, 25:1595–1608, 1999. 2
[32] Z. Wu and M. Palmer. Verb semantics and lexical selection. In ACL, pages 133–138, Stroudsburg, PA, USA, 1994. 5
[33] A. Yarbus. Eye movements and vision. Plenum Press, 1967. 2
[34] G. J. Zelinsky. A theory of eye movements during target acquisition. Psychological Review, 115(4):787–835, 2008. 2
[35] G. J. Zelinsky and J. Schmidt. An effect of referential scene constraint on search implies scene segmentation. Visual Cognition, 17(6):1004–1028, 2009. 2