nips nips2010 nips2010-256 nips2010-256-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino
Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks, capturing an image every 20 seconds on average. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that automatically discover recurrent scenes, familiar faces, or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in lighting conditions. As a remedy to these problems, we introduce a novel image representation, the “structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, maps the image coordinates to coordinates in the large “all-I-have-seen” epitome matrix. The limited epitome real estate forces the mappings of different images to overlap, and this overlap indicates image similarity. However, image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on the spatial configuration of scene or object parts, because the model is based on palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, which tend to uniformly affect all pixels belonging to a single scene or object part.
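To make the generative picture above concrete, here is a minimal numpy sketch of how a single image could be scored against a stel epitome. Everything in it is an illustrative assumption rather than the authors' implementation: the epitome stores, at each coordinate, a distribution over S stel indices; the palette is a per-image Gaussian (mean color plus isotropic variance) per stel; and the hidden mapping T is restricted to translations of an N x N window.

```python
import numpy as np

# Minimal sketch of a stel-epitome likelihood (hypothetical shapes and names,
# not the paper's implementation). The epitome is an E x E grid that stores,
# at every coordinate, a distribution over S structural elements (stels).
# Appearance is image-specific: each image carries its own palette, here a
# Gaussian mean color plus an isotropic variance for every stel.

rng = np.random.default_rng(0)
E, S, N = 32, 4, 8                                  # epitome side, #stels, image side
epitome = rng.dirichlet(np.ones(S), size=(E, E))    # (E, E, S): p(s | epitome coord)
image = rng.random((N, N, 3))                       # one RGB image, stand-in data

palette_mu = rng.random((S, 3))                     # per-stel mean color
palette_var = np.full(S, 0.05)                      # per-stel isotropic variance

def log_lik_at(ty, tx):
    """Log-likelihood of the image under the epitome window whose top-left
    corner is (ty, tx), marginalizing the stel index at every pixel."""
    window = epitome[ty:ty + N, tx:tx + N]          # (N, N, S) stel priors
    diff = image[:, :, None, :] - palette_mu[None, None, :, :]
    # Gaussian log-density of each pixel under each stel's palette entry.
    log_px_s = (-0.5 * (diff ** 2).sum(-1) / palette_var
                - 1.5 * np.log(2 * np.pi * palette_var))
    # log p(x_i) = log sum_s p(s) p(x_i | s), computed stably.
    m = log_px_s.max(-1, keepdims=True)
    log_px = m[..., 0] + np.log((window * np.exp(log_px_s - m)).sum(-1))
    return log_px.sum()

# Posterior over the mapping T, here restricted to all N x N translations.
scores = np.array([[log_lik_at(ty, tx) for tx in range(E - N + 1)]
                   for ty in range(E - N + 1)])
post_T = np.exp(scores - scores.max())
post_T /= post_T.sum()
print("MAP window:", np.unravel_index(post_T.argmax(), post_T.shape))
```

The division of labor the abstract describes is visible in the sketch: color lives only in the image-specific palette, while the epitome holds just the spatial layout of stels, so two images of the same scene under very different lighting can still score highly at the same window.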
[1] B. Frey and N. Jojic, “Transformation-invariant clustering using the EM algorithm,” TPAMI 2003, vol. 25, no. 1, pp. 1-17.
[2] N. Jojic, B. Frey, A. Kannan, “Epitomic analysis of appearance and shape,” ICCV 2003.
[3] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” IJCV, 2004, vol. 60, no. 2, pp. 91-110.
[4] L. Fei-Fei, P. Perona, “A Bayesian Hierarchical Model for Learning Natural Scene Categories,” IEEE CVPR 2005, pp. 524-531.
[5] S. Lazebnik, C. Schmid, J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” IEEE CVPR, 2006, pp. 2169-2178.
[6] N. Jojic and Y. Caspi, “Capturing image structure with probabilistic index maps,” IEEE CVPR 2004, pp. 212-219.
[7] J. Winn and N. Jojic, “LOCUS: Learning Object Classes with Unsupervised Segmentation,” ICCV 2005.
[8] N. Jojic, A. Perina, M. Cristani, V. Murino and B. Frey, “Stel component analysis: modeling spatial correlation in image class structure,” IEEE CVPR 2009.
[9] K. Ni, A. Kannan, A. Criminisi and J. Winn, “Epitomic Location Recognition,” IEEE CVPR 2008.
[10] A. Perina, M. Cristani, U. Castellani, V. Murino and N. Jojic, “Free energy score space,” NIPS 2009.
[11] A. Torralba, K.P. Murphy, W.T. Freeman and M.A. Rubin, “Context-based vision system for place and object recognition,” ICCV 2003, pp. 273-280.
[12] C. Stauffer, E. Miller, and K. Tieu, “Transform-invariant image decomposition with similarity templates,” NIPS 2003.
[13] V. Ferrari and A. Zisserman, “Learning Visual Attributes,” NIPS 2007.
[14] B. Russell, A. Efros, J. Sivic, W. T. Freeman and A. Zisserman, “Segmenting Scenes by Matching Image Composites,” NIPS 2009.
[15] G. Bell and J. Gemmell, Total Recall, Dutton Adult, 2009.