
180 iccv-2013-From Where and How to What We See


Source: pdf

Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Eckstein, B.S. Manjunath

Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) from eye tracking data alone, without utilizing any image information. The proposed algorithm spatially clusters the eye tracking data obtained on an image into coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it reliably predicts potential face/head (humans, dogs and cats) and text locations. Furthermore, the approach can be used to select regions of interest for further analysis by face and text detectors. The hybrid eye-position/object-detector approach achieves better detection performance and reduced computation time compared to using the object detection algorithm alone. We also present a new eye tracking dataset, collected from 15 subjects, on 300 images selected from the ICDAR, Street View, Flickr and Oxford-IIIT Pet datasets.
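To make the pipeline in the abstract concrete, below is a minimal sketch of its first stage: spatially clustering fixation points pooled across subjects, here using the mean shift procedure of Comaniciu and Meer [10] via scikit-learn. The function name, bandwidth heuristic and synthetic fixations are illustrative assumptions, not the paper's actual parameters; the paper's subsequent step, scoring each cluster's face/text/background likelihood with the fully connected MRF, is not shown.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def cluster_fixations(fixations, quantile=0.1):
    # Group (x, y) fixation points, pooled over subjects for one image,
    # into spatially coherent clusters with mean shift (cf. [10]).
    # `quantile` controls the kernel bandwidth estimate (assumed value,
    # not taken from the paper).
    bandwidth = estimate_bandwidth(fixations, quantile=quantile)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    labels = ms.fit_predict(fixations)
    return labels, ms.cluster_centers_

# Hypothetical example: 60 synthetic fixations around two hotspots,
# e.g. one face region and one text region in the same image.
rng = np.random.default_rng(0)
fixations = np.vstack([
    rng.normal([120, 80], 10.0, (30, 2)),   # fixations near a "face"
    rng.normal([400, 300], 12.0, (30, 2)),  # fixations near "text"
])
labels, centers = cluster_fixations(fixations)
print(len(centers), "clusters found at", centers.round(1))

In the paper, each such cluster would then be assigned a semantic likelihood by the MRF before driving the face and text detectors on the selected regions.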


reference text

[1] B. Alexe et al. Searching for objects driven by context. Advances in Neural Information Processing Systems (NIPS), 2012.

[2] Y. Boykov et al. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.

[3] A. Bulling et al. Eye movement analysis for activity recognition using electrooculography. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[4] A. Bulling et al. Wearable EOG goggles: eye-based interaction in everyday environments. ACM, 2009.

[5] A. Bulling and H. Gellersen. Toward mobile eye-based human-computer interaction. IEEE Pervasive Computing, 2010.

[6] M. Cerf et al. Decoding what people see from where they look: Predicting visual stimuli from scanpaths. Attention in Cognitive Systems, 2009.

[7] M. Cerf et al. Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 2009.

[8] H. Chen et al. Robust text detection in natural images with edge-enhanced maximally stable extremal regions. IEEE International Conference on Image Processing (ICIP), 2011.

[9] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. CVPR, 2004.

[10] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[11] C. Desai et al. Discriminative models for multi-class object layout. ICCV, 2009.

[12] S. K. Divvala et al. An empirical study of context in object detection. CVPR, 2009.

[13] B. Epshtein et al. Detecting text in natural scenes with stroke width transform. CVPR, 2010.

[14] P. Felzenszwalb et al. A discriminatively trained, multiscale, deformable part model. CVPR, 2008.

[15] S. Goferman et al. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[16] V. Hedau et al. Thinking inside the box: Using appearance models and context based on room geometry. ECCV, 2010.

[17] L. Itti et al. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.

[18] T. Judd et al. Learning to predict where humans look. ICCV, 2009.

[19] S. Karthikeyan et al. Learning bottom-up text attention maps for text detection using stroke width transform. ICIP, 2013.

[20] S. Karthikeyan et al. Learning top-down scene context for visual attention modeling in natural images. ICIP, 2013.

[21] S. M. Lucas et al. ICDAR 2003 robust reading competitions. ICDAR, 2003.

[22] A. Mishra et al. Active segmentation with fixation. ICCV, 2009.

[23] O. M. Parkhi et al. Cats and dogs. CVPR, 2012.

[24] O. M. Parkhi et al. The truth about cats and dogs. ICCV, 2011.

[25] C. Rother et al. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 2004.

[26] P. Shivakumara et al. A Laplacian approach to multi-oriented text detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[27] R. Subramanian et al. Can computers learn from humans to see better? Inferring scene semantics from viewers' eye movements. ACM International Conference on Multimedia, 2011.

[28] A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 2003.

[29] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR, 2001.

[30] K. Wang and S. Belongie. Word spotting in the wild. ECCV, 2010.

[31] Y. Zhong et al. Automatic caption localization in compressed video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.