iccv iccv2013 iccv2013-210 iccv2013-210-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method that does not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as on three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.
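The query-driven idea in the abstract (approximate character locations, then spatial constraints, then a ranked list) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Detection` tuple format, the use of a simple left-to-right ordering as the only spatial constraint, and the averaged-confidence score. The names `score_image` and `rank_images` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical detection format: (character, x_center, confidence in [0, 1]).
Detection = Tuple[str, float, float]


def score_image(query: str, detections: List[Detection]) -> float:
    """Score one image for a query word.

    Finds the highest total confidence of query characters that appear
    among the detections in left-to-right order, then divides by the
    query length. Undetected characters contribute 0, so partial matches
    still receive a (lower) score.
    """
    query = query.lower()
    dets = sorted(detections, key=lambda d: d[1])  # order by x position
    n, m = len(query), len(dets)
    if n == 0:
        return 0.0
    # best[i][j]: best total confidence for query[:i] using only dets[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ch, _, conf = dets[j - 1]
            skip_det = best[i][j - 1]   # ignore this detection
            skip_char = best[i - 1][j]  # treat query[i-1] as undetected
            match = best[i - 1][j - 1] + conf if ch.lower() == query[i - 1] else 0.0
            best[i][j] = max(skip_det, skip_char, match)
    return best[n][m] / n


def rank_images(query: str, database: Dict[str, List[Detection]]) -> List[Tuple[str, float]]:
    """Return (image_id, score) pairs sorted from best to worst match."""
    scored = [(image_id, score_image(query, dets)) for image_id, dets in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    toy_db = {
        "img_motel": [("m", 10, 0.9), ("o", 20, 0.8), ("t", 30, 0.9), ("e", 40, 0.7), ("l", 50, 0.8)],
        "img_hotel": [("h", 10, 0.9), ("o", 20, 0.8), ("t", 30, 0.9), ("e", 40, 0.7), ("l", 50, 0.8)],
        "img_scrambled": [("l", 10, 0.9), ("e", 20, 0.9), ("t", 30, 0.9), ("o", 40, 0.9), ("m", 50, 0.9)],
    }
    for image_id, score in rank_images("motel", toy_db):
        print(image_id, round(score, 2))
```

In this toy setting an image containing a near-identical word still scores fairly high because most query characters are found in the right order, while an image with the same characters in the wrong order scores low. A real system would need a richer spatial model than a one-dimensional ordering.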
[1] http://textspotter.org.
[2] http://www.eng.tau.ac.il/~talib/RBNR.html.
[3] http://algoval.essex.ac.uk/icdar/.
[4] http://vision.ucsd.edu/~kai/svt/.
[5] D. Chen, J.-M. Odobez, and H. Bourlard. Text detection, recognition in images and video frames. Pattern Recognition, 2004.
[6] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In CVPR, 2004.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In VISAPP, 2009.
[9] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In CVPR, 2010.
[10] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[11] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[12] A. Mishra, K. Alahari, and C. V. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
Table 5. Quantitative analysis of retrieval results on video datasets. We choose 10 and 20 query words for Sports-10K and TV series-1M, respectively. We use top-n retrieval to compute precision at n (denoted by P@n).
[13] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012.
[14] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector machines. In NIPS, 1997.
[15] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, 2012.
[16] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers. MIT Press, 1999.
[17] A. Shahab, F. Shafait, and A. Dengel. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In ICDAR, 2011.
[18] C. Shi. Scene text recognition using part-based tree-structured character detections. In CVPR, 2013.
[19] P. Shivakumara, T. Q. Phan, and C. L. Tan. A Laplacian approach to multi-oriented text detection in video. TPAMI, 2011.
[20] P. Simard, B. Victorri, Y. LeCun, and J. S. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In NIPS, 1991.
[21] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[22] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. TPAMI, 2012.
[23] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[24] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
[25] K. Wang and S. Belongie. Word spotting in the wild. In ECCV, 2010.
Figure 4. Text query example: Top-10 retrievals of our method on SVT and IIIT STR are shown. (a) Text query: “restaurant”. There are in all 8 occurrences of this query in the SVT dataset. The proposed scheme retrieves them all. The ninth and the tenth results contain many characters from the query, such as R, E, S, T, A, N. (b) Text query: “motel”. There are in all 39 occurrences of this query in the IIIT STR dataset, with large variations in fonts, e.g. the first and the tenth retrievals. A failure case of our approach is when a highly similar word (“hotel” in this case) is well-ranked. (c) Text query: “department”. The top retrievals for this query are significant. The fourth, sixth and seventh results are images of the same building with the query word appearing in different views. These results support our claim of instance as well as category retrieval.
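The precision at n (P@n) measure named in the Table 5 caption can be written down in a few lines. This is a minimal sketch of the standard definition (the fraction of the top-n retrieved items that are relevant), not code from the paper. The function name, the argument names and the choice to divide by n even when fewer than n items are returned are assumptions.

```python
from typing import Sequence, Set


def precision_at_n(ranked_ids: Sequence[str], relevant_ids: Set[str], n: int) -> float:
    """Fraction of the top-n ranked items that are relevant.

    Divides by n even if fewer than n items were retrieved, so missing
    results count against the score (an assumed convention).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_ids[:n]
    hits = sum(1 for image_id in top if image_id in relevant_ids)
    return hits / n


if __name__ == "__main__":
    ranking = ["img3", "img7", "img1", "img9"]  # hypothetical ranked list
    relevant = {"img3", "img1"}                 # hypothetical ground truth
    print(precision_at_n(ranking, relevant, 2))
```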