iccv iccv2013 iccv2013-210 knowledge-graph by maker-knowledge-mining

210 iccv-2013-Image Retrieval Using Textual Cues


Source: pdf

Author: Anand Mishra, Karteek Alahari, C.V. Jawahar

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. [sent-4, score-1.242]

2 We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. [sent-6, score-0.736]

3 The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M. [sent-7, score-1.099]

4 One approach to retrieval uses text as a query, with applications such as Google image search, which relies on cues from meta tags or text available in the context of the image. [sent-11, score-1.008]

5 On the other hand, the text “restaurant” appearing on the banner/awning is an indispensable cue for retrieval. [sent-21, score-0.451]

6 In this work, we aim to fill this gap in image retrieval with text as a query, and develop an image search based on the textual content present in images. [sent-24, score-0.676]

7 The problem of recognizing text in images or videos has gained considerable attention in the computer vision community in recent years [5, 9, 13, 18, 19, 24, 25]. [sent-25, score-0.39]

8 Although exact localization and recognition of text in the wild is far from being a solved problem, there have been notable successes. [sent-26, score-0.49]

9 We take this problem one step further and ask the question: Can we search for query text in a large collection of images and videos, and retrieve all occurrences of the query text? [sent-27, score-1.147]

10 Note that, unlike approaches such as Video Google [21], which retrieve only similar instances of the queried content, our goal is to retrieve instances (text appearing in different places or view points), as well as categories (text in different font styles). [sent-28, score-0.408]

11 One approach for addressing the text-to-image retrieval problem is based on text localization, followed by text recognition. [sent-30, score-0.975]

12 Once the text is recognized, the retrieval task becomes equivalent to that of text retrieval. [sent-31, score-0.975]

13 Many methods have been proposed to solve the text localization and recognition problems [6, 9, 12, 13, 15]. [sent-32, score-0.466]

14 We transformed the visual text content in the image into text, either with [15] directly, or by localizing with [9], and then recognizing with [13]. [sent-34, score-0.434]

15 In summary, we recognize the text contained in all images in the database, search for the query text, and then rank the images based on minimum edit distance between the query and the recognized text. [sent-35, score-0.924]
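
This baseline reduces to ranking images by the minimum edit distance between the query and the words recognized in each image. A minimal sketch of that ranking step in Python; the recognized-word mapping and function names are illustrative, not from the paper's code:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def rank_by_edit_distance(query, recognized):
    """Rank image ids by the minimum edit distance between the query
    and any word recognized in that image (smaller is better)."""
    scores = {img: min(edit_distance(query.upper(), w.upper()) for w in words)
              for img, words in recognized.items() if words}
    return sorted(scores, key=scores.get)

# Toy usage with hypothetical recognition output.
recognized = {"im1": ["RESTAURANT", "OPEN"], "im2": ["MOTEL"]}
print(rank_by_edit_distance("restaurant", recognized))  # ['im1', 'im2']
```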

16 Table 1 shows the results of these two approaches on the street view text (SVT) dataset. [sent-36, score-0.416]

17 In contrast, we aim to spot query words in millions of images, and efficiently retrieve all occurrences of the query. [sent-44, score-0.645]

18 (ii) Using their character detection, and then applying our indexing and re-ranking schemes, we obtain an mAP of 52. [sent-50, score-0.534]

19 It is not clear how well text queries can be used in combination with such methods to retrieve scene text appearing in a variety of styles. [sent-56, score-1.061]

20 We take an alternate approach, and do not rely on an accurate text localization and recognition pipeline. [sent-58, score-0.466]

21 Rather, we do a query-driven search on images and spot the characters of the words of a vocabulary in the image database (Section 2). [sent-59, score-0.441]

22 We then compute a score characterizing the presence of characters of a vocabulary word in every image. [sent-61, score-0.644]

23 We demonstrate the performance of our approach on publicly available scene text datasets. [sent-66, score-0.438]

24 Results on the street view text dataset [24] are shown as mean average precision (mAP) scores. [sent-72, score-0.437]

25 The first two methods, based on the state-of-the-art text localization and recognition schemes, perform poorly. [sent-74, score-0.466]

26 We need not only a dataset with diversity, but also a dataset containing multiple occurrences of text in different fonts, view points and illumination conditions. [sent-79, score-0.556]

27 To this end, we introduce two video datasets, namely Sports-10K and TV series-1M, with more than 1 million frames, and an image dataset, IIIT scene text retrieval (STR). [sent-80, score-0.659]

28 Scene Text Indexing and Retrieval. Our retrieval scheme works as follows: we begin by detecting characters in all the images in the database. [sent-83, score-0.537]

29 We then spot characters of the vocabulary words in the images and compute a score based on the presence of these characters. [sent-86, score-0.629]

30 To achieve this, we create an inverted index file containing image id and a score indicating the presence of characters of the vocabulary words in the image. [sent-88, score-0.735]

31 We then re-rank the top-n initial retrievals by imposing constraints on the order and the location of characters from the query text. [sent-90, score-0.663]

32 We do not expect ideal character detection from this stage, but instead obtain many potential character windows, which are likely to include false positives. [sent-95, score-1.005]

33 We then use a sliding window based detection to obtain character locations and their likelihoods. [sent-97, score-0.555]
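
A minimal sketch of such a sliding-window scoring pass, assuming HOG features and a probabilistic classifier. The window size and stride are assumptions; the 63-way class layout follows the text. The paper trains a linear SVM, while this sketch substitutes logistic regression so that the per-window class-probability vector is directly available:

```python
import numpy as np
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression

def detect_characters(image, clf, win=48, stride=16):
    """Slide a square window over a grayscale image and score each
    position, returning (x, y, probs) per window, where probs is a
    63-way vector over 62 character classes plus background."""
    detections = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            feat = hog(image[y:y + win, x:x + win], orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            detections.append((x, y, clf.predict_proba([feat])[0]))
    return detections

# Toy usage: a stand-in classifier trained on random features.
rng = np.random.default_rng(0)
dim = hog(np.zeros((48, 48)), orientations=9, pixels_per_cell=(8, 8),
          cells_per_block=(2, 2)).size
X, y = rng.normal(size=(63 * 5, dim)), np.repeat(np.arange(63), 5)
clf = LogisticRegression(max_iter=200).fit(X, y)
print(len(detect_characters(rng.random((96, 160)), clf)), "windows scored")
```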

34 The character localization process is illustrated in Figure 3. [sent-98, score-0.564]

35 We then compute a score indicating the presence of characters from the vocabulary words (vocabulary presence score), and create an inverted index file with this score and image id. [sent-103, score-0.802]

36 (b) After character detection, an image Im is represented as a graph Gm, where nodes correspond to potential character detections and edges model the spatial relation between two detections. [sent-106, score-1.097]

37 The nodes are characterized by their character likelihood vector U, and the edges by their character pair priors V. [sent-107, score-1.037]

38 For a robust localization of characters using sliding windows, we need a strong character classifier. [sent-110, score-0.878]

39 For example, the corner of a door can be detected as the character ‘L’. [sent-116, score-0.488]

40 To deal with these issues, we add more examples to the training set by applying small affine transformations to the original character images. [sent-117, score-0.54]
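
A minimal sketch of this augmentation step using skimage; the rotation, shear and scale ranges are illustrative, not the paper's settings:

```python
import numpy as np
from skimage.transform import AffineTransform, warp

def augment_with_affine(char_img, n=10, rng=None):
    """Generate n jittered copies of a character image by applying
    small random affine transformations (rotation, shear, scale)."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n):
        tf = AffineTransform(rotation=rng.uniform(-0.1, 0.1),   # radians
                             shear=rng.uniform(-0.1, 0.1),
                             scale=(rng.uniform(0.9, 1.1),) * 2)
        out.append(warp(char_img, tf.inverse, mode="edge"))
    return out

samples = augment_with_affine(np.random.default_rng(1).random((48, 48)))
print(len(samples), "augmented samples")
```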

41 With this strategy, we achieve a significant boost in character classification. [sent-121, score-0.488]

42 This results in a 63-dimensional vector for every window, which indicates the presence of a character or background in that window. [sent-129, score-0.523]

43 Indexing. Once the characters are detected, we index the database for a set of vocabulary words. [sent-135, score-0.413]

44 To do so, we construct a graph, where each character detection is represented as a node. [sent-139, score-0.488]

45 This step essentially removes many false character windows scattered in the image. [sent-145, score-0.533]

46 Further, assuming these likelihoods are independent, we compute the joint probabilities of character pairs for every edge. [sent-147, score-0.488]

47 In other words, we associate a 36 × 36 dimensional matrix Vij containing joint probabilities of character pairs to the edge connecting nodes i and j (see Figure 2(b)). [sent-148, score-0.541]
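
A minimal sketch of building this graph from the character detections; the neighbourhood thresholds are illustrative, and each detection is assumed to carry a 36-way case-folded likelihood vector u (how the 63-way classifier output is folded is omitted):

```python
import numpy as np

def build_graph(detections, max_dx=60, max_dy=20):
    """detections: list of (x, y, u), u being a 36-way character
    likelihood vector with background dropped.  Nodes keep the
    detections; each edge links a detection to a spatially
    neighbouring one on its right and carries the 36 x 36 matrix
    V = outer(u_i, u_j), i.e. the joint probabilities of character
    pairs under the independence assumption in the text."""
    nodes = list(detections)
    edges = {}
    for i, (xi, yi, ui) in enumerate(nodes):
        for j, (xj, yj, uj) in enumerate(nodes):
            if 0 < xj - xi <= max_dx and abs(yj - yi) <= max_dy:
                edges[(i, j)] = np.outer(ui, uj)  # V_ij
    return nodes, edges
```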

48 Now, consider a word from the vocabulary ωk = ωk1ωk2 · · · ωkp, represented by its characters ωkl, 1 ≤ l ≤ p, where p is the length of the word. [sent-149, score-0.53]

49 A linear SVM trained on affine transformed (AT) training samples is used to obtain potential character windows. [sent-155, score-0.547]

50 We compute a score denoting the presence of characters from the query in these horizontal strips. [sent-157, score-0.690]

51 The vocabulary presence score is computed as S(Im, ωk) = max_h (1/p) Σ_{l=1}^{p} max_j P(ωkl | hogj), (1) where j varies over all the bounding boxes representing potential characters whose top-left coordinate falls in the horizontal strip and h varies over all the horizontal strips in the image. [sent-161, score-0.504]
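
A minimal sketch of this vocabulary presence score, assuming the (x, y, u) detections above with 36-way case-folded likelihood vectors; the character-to-index mapping and the default image height are illustrative:

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 case-folded classes

def presence_score(detections, word, strip_height=30, img_height=240):
    """Eq. (1)-style score: within each horizontal strip, take the best
    likelihood per query character among boxes whose top-left corner
    falls in the strip, average over the query characters, and keep
    the best-scoring strip."""
    best = 0.0
    for top in range(0, img_height, strip_height):
        strip = [u for _, y, u in detections if top <= y < top + strip_height]
        if not strip:
            continue
        s = float(np.mean([max(u[CHARS.index(c)] for u in strip)
                           for c in word.lower()]))
        best = max(best, s)
    return best
```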

52 Once these scores are computed for all the words in the vocabulary and all the images in the database, we create an inverted index file [11] containing image id, the vocabulary word and its score. [sent-164, score-0.601]
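
A minimal sketch of the index construction and lookup; an in-memory dict stands in for the index file, and the scoring function is passed in (for instance the presence score sketched above):

```python
def build_inverted_index(db_detections, vocabulary, score_fn):
    """db_detections: {image_id: detections}.  For every vocabulary
    word, store (image_id, score) pairs sorted by decreasing score,
    mirroring the index layout described in the text."""
    return {w: sorted(((img, score_fn(dets, w))
                       for img, dets in db_detections.items()),
                      key=lambda t: -t[1])
            for w in vocabulary}

def retrieve(index, query, topn=100):
    """Initial ranked list for a query word; the top-n entries are
    then re-ranked with the spatial scores downstream."""
    return index.get(query.lower(), [])[:topn]

# Toy usage with a stand-in scorer.
idx = build_inverted_index({"im1": [], "im2": []}, ["motel"],
                           score_fn=lambda dets, w: len(dets))
print(retrieve(idx, "motel"))
```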

53 We also store the image and its corresponding graph (representing character detections) in the indexed database. [sent-165, score-0.488]

54 This ensures that images in which characters from the query text have a high likelihood within a relatively small area (a horizontal strip of height H) get a higher rank. [sent-170, score-1.121]

55 Character spotting does not ensure that characters are spotted in the same order as in the query word. [sent-174, score-0.669]

56 Let {ωk1ωk2, ωk2ωk3, . . . , ωk(p−1)ωkp} be the set of all the bi-grams present in the query word ωk. [sent-178, score-0.415]

57 The score Sso(Im, ωk) = 1 when all the characters in the query word are present in the image and have the same spatial order as the query word. [sent-182, score-1.059]
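
A minimal sketch of a re-ranking score in this spirit, where each spotted character is an (x, char) pair; the score is the fraction of query bi-grams whose two characters are spotted in left-to-right order, so it equals 1 exactly in the case described above. The normalization is an assumption:

```python
def spatial_order_score(spotted, word):
    """spotted: list of (x, char) pairs for characters spotted in an
    image.  Returns the fraction of consecutive query bi-grams
    (c1, c2) for which some spotted c1 lies left of some spotted c2."""
    word = word.lower()
    bigrams = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
    if not bigrams:
        return 1.0
    xs = {}
    for x, c in spotted:
        xs.setdefault(c.lower(), []).append(x)
    ok = sum(1 for c1, c2 in bigrams
             if c1 in xs and c2 in xs and min(xs[c1]) < max(xs[c2]))
    return ok / len(bigrams)

spots = [(10, "m"), (30, "o"), (55, "t"), (70, "e"), (90, "l")]
print(spatial_order_score(spots, "motel"))  # 1.0
```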

58 The re-ranking scheme based on spatial ordering does not account for spotted characters being in the correct spatial position. [sent-185, score-0.471]

59 To address this, we use the graphs representing the character detections in the images, the associated U vectors, and the matrix V to compute a new score. [sent-187, score-0.518]

60 We define a new score characterizing the spatial positioning of characters of the query word in the image as Ssp(Im, ωk) = (1/(2p−1)) [ Σ_{l=1}^{p} max_i Ui(ωkl) + Σ_{l=1}^{p−1} max_{(i,j)} Vij(ωkl, ωk(l+1)) ]. (3) [sent-188, score-0.905]

61 This new score is high when all the characters and bi-grams are present in the graph in the same order as in the query word and with a high likelihood. [sent-191, score-0.758]
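
A minimal sketch of an Ssp-style score over the graph built earlier; taking the best unary term per character and the best pairwise term per bi-gram, with the 2p − 1 normalization, follows the reconstruction of Eq. (3) above and is an assumption:

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

def spatial_position_score(nodes, edges, word):
    """nodes: [(x, y, u)] with u a 36-way likelihood vector; edges:
    {(i, j): V} with V a 36 x 36 joint-probability matrix, as built
    above.  Sums the best unary likelihood per query character and
    the best pairwise likelihood per query bi-gram, normalized by
    2p - 1 (the number of unary plus pairwise terms)."""
    idx = [CHARS.index(c) for c in word.lower()]
    p = len(idx)
    unary = sum(max(u[k] for _, _, u in nodes) for k in idx)
    pair = sum(max((V[idx[l], idx[l + 1]] for V in edges.values()),
                   default=0.0) for l in range(p - 1))
    return (unary + pair) / (2 * p - 1)
```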

62 The character classifiers are trained on the train sets of ICDAR 2003 character [3] and Chars74K [8] datasets. [sent-200, score-0.976]

63 We then apply affine transformations to all the character images, resize them to 48 × 48, and compute HOG features. [sent-202, score-0.540]

64 We divide the images into horizontal strips of height 30 pixels and spot characters from a set of character bounding boxes, as described in Section 2. [sent-207, score-0.984]

65 The idea here is to find images where the characters of the vocabulary word have a high likelihood in a relatively small area. [sent-209, score-0.563]

66 Datasets. We evaluate our approach on three scene text (SVT, ICDAR 2011 and IIIT scene text retrieval) and two video (Sports-10K and TV series-1M) datasets. [sent-213, score-0.902]

67 These two datasets were originally introduced for scene text localization and recognition. [sent-216, score-0.542]

68 The SVT and ICDAR 2011 datasets, in addition to being relatively small, contain many scene text words occurring only once. [sent-220, score-0.536]

69 To analyze our text-to-image retrieval method in a more challenging setting, we introduce the IIIT scene text retrieval (STR) dataset. [sent-221, score-0.828]

70 Each image is then annotated manually to indicate whether it contains a query text or not. [sent-226, score-0.657]

71 It is intended for category retrieval (text appearing in different fonts or styles), instance retrieval (text imaged from a different view point), and retrieval in the presence of distractors (images without any text). [sent-228, score-0.816]

72 To analyze the scalability of our retrieval approach, we need a large dataset, where query words appear in a few hundred images. [sent-230, score-0.585]

73 We introduce an image (IIIT scene text retrieval) and two video (Sports-10K and TV series-1M) datasets to test the scalability of our proposed approach. [sent-231, score-0.517]

74 All the image frames extracted from this dataset are manually annotated with the query text they may contain. [sent-239, score-0.657]

75 We use 10 and 20 query words to demonstrate the retrieval performance on the Sports-10K and the TV series-1M datasets respectively. [sent-241, score-0.588]

76 Experimental Analysis. Given a text query, our goal is to retrieve all images where it appears. [sent-244, score-0.765]

77 This includes instance retrieval, i.e., text appearing in different view points, as well as category retrieval, i.e., text appearing in different fonts or styles. [sent-247, score-0.477]

78 Character classification results. We analyze the performance of one of the basic modules of our system on scene character datasets. [sent-253, score-0.536]

79 Table 3 compares our character classification performance with recent works [8, 13, 24]. [sent-254, score-0.488]

80 We observe that selecting training data and features appropriately improves the character classification accuracies significantly on the ICDAR-char [3], Chars74K [8], and SVT-char [13] datasets. [sent-255, score-0.488]

81 A strong classifier is key to better character classification. [sent-259, score-0.488]

82 H-13 + AT + linear takes less than 50% of the time compared to other methods, with a minor reduction in accuracy, and hence we use it for our character detection module. [sent-264, score-0.488]

83 We observe that the performance of our initial naive character spotting method is comparable to the baselines in Table 1. [sent-271, score-0.557]

84 Another baseline we compare with uses character detections from [24] in combination with our spatial positioning re-ranking scheme, which achieves 52. [sent-293, score-0.643]

85 The sliding window based character detection step and the computation of the index file are performed offline. [sent-295, score-0.64]

86 Qualitative results of the proposed method are shown in Figure 4 for the query words restaurant on SVT, motel and department on IIIT STR. [sent-297, score-0.654]

87 We retrieve all the occurrences of the query restaurant from SVT. [sent-298, score-0.598]

88 Our top retrievals for this query are noteworthy; for instance, in the tenth retrieval the query word appears in a very different font. [sent-300, score-0.835]

89 The query word department has 20 occurrences in the dataset. [sent-301, score-0.569]

90 Figure 5(a) shows precision-recall curves for two text queries: department and motel on IIIT STR. [sent-304, score-0.571]

91 The method tends to fail in cases where most of the characters in the word are not detected correctly, or when the query text appears vertically. [sent-309, score-1.091]

92 (ii) It is query-driven; thus, even in cases where only the second or third best prediction for a character bounding box is correct, it can retrieve the correct result. [sent-316, score-0.596]

93 (a) Precision-Recall curves for two queries on the IIIT scene text retrieval dataset. [sent-326, score-0.697]

94 (b) A few failure cases are shown as cropped images, where our approach fails to retrieve these images for the text queries: Galaxy, India and Dairy. [sent-328, score-0.498]

95 Detecting text in natural scenes with stroke width transform. [sent-379, score-0.43]

96 ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. [sent-446, score-0.472]

97 A laplacian approach to multi-oriented text detection in video. [sent-458, score-0.39]

98 Video google: A text retrieval approach to object matching in videos. [sent-471, score-0.585]

99 The ninth and the tenth results contain many characters from the query like R, E, S, T, A, N. [sent-499, score-0.596]

100 The fourth, sixth and seventh results are images of the same building with the query word appearing in different views. [sent-507, score-0.476]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('character', 0.488), ('text', 0.39), ('characters', 0.286), ('query', 0.267), ('iiit', 0.249), ('retrieval', 0.195), ('svt', 0.172), ('word', 0.148), ('motel', 0.142), ('occurrences', 0.115), ('retrievals', 0.11), ('retrieve', 0.108), ('restaurant', 0.108), ('icdar', 0.101), ('words', 0.098), ('vocabulary', 0.096), ('positioning', 0.091), ('tv', 0.084), ('fonts', 0.08), ('localization', 0.076), ('str', 0.071), ('spotting', 0.069), ('strips', 0.066), ('queries', 0.064), ('font', 0.063), ('mhaxl', 0.061), ('restaurants', 0.061), ('kl', 0.061), ('appearing', 0.061), ('score', 0.057), ('spot', 0.057), ('file', 0.054), ('india', 0.053), ('inverted', 0.053), ('neighbouring', 0.049), ('im', 0.049), ('scene', 0.048), ('textual', 0.047), ('spotted', 0.047), ('indexing', 0.046), ('windows', 0.045), ('mishra', 0.045), ('horizontal', 0.045), ('content', 0.044), ('tenth', 0.043), ('height', 0.042), ('hogi', 0.041), ('hogj', 0.041), ('width', 0.04), ('lexicon', 0.04), ('department', 0.039), ('window', 0.039), ('ordering', 0.037), ('presence', 0.035), ('hog', 0.035), ('alahari', 0.035), ('reading', 0.034), ('spatial', 0.034), ('scheme', 0.033), ('meta', 0.033), ('strip', 0.033), ('likelihood', 0.033), ('advertisement', 0.031), ('karteek', 0.031), ('google', 0.031), ('index', 0.031), ('detections', 0.03), ('affine', 0.03), ('potential', 0.029), ('anand', 0.029), ('galaxy', 0.029), ('distractors', 0.029), ('nodes', 0.028), ('sliding', 0.028), ('retrieving', 0.028), ('schemes', 0.028), ('datasets', 0.028), ('style', 0.028), ('enrich', 0.028), ('view', 0.026), ('ranked', 0.026), ('kp', 0.026), ('video', 0.026), ('truncation', 0.025), ('sso', 0.025), ('texts', 0.025), ('scalability', 0.025), ('containing', 0.025), ('photos', 0.025), ('english', 0.025), ('notable', 0.024), ('explicit', 0.023), ('detecting', 0.023), ('transformations', 0.022), ('map', 0.022), ('dist', 0.022), ('styles', 0.022), ('characterizing', 0.022), ('instances', 0.021), ('precision', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 210 iccv-2013-Image Retrieval Using Textual Cues

Author: Anand Mishra, Karteek Alahari, C.V. Jawahar

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.

2 0.48748538 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes

Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan

Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.

3 0.46186852 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions

Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven

Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

4 0.45496801 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection

Author: Lukáš Neumann, Jiri Matas

Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.

5 0.25715134 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation

Author: Basura Fernando, Tinne Tuytelaars

Abstract: In this paper we present a new method for object retrieval starting from multiple query images. The use of multiple queries allows for a more expressive formulation of the query object including, e.g., different viewpoints and/or viewing conditions. This, in turn, leads to more diverse and more accurate retrieval results. When no query images are available to the user, they can easily be retrieved from the internet using a standard image search engine. In particular, we propose a new method based on pattern mining. Using the minimal description length principle, we derive the most suitable set of patterns to describe the query object, with patterns corresponding to local feature configurations. This results in a powerful object-specific mid-level image representation. The archive can then be searched efficiently for similar images based on this representation, using a combination of two inverted file systems. Since the patterns already encode local spatial information, good results on several standard image retrieval datasets are obtained even without costly re-ranking based on geometric verification.

6 0.2528488 180 iccv-2013-From Where and How to What We See

7 0.24440426 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes

8 0.20090429 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors

9 0.17935194 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval

10 0.17367655 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint

11 0.16341215 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing

12 0.1589454 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search

13 0.15533218 334 iccv-2013-Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval

14 0.14372055 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching

15 0.11565512 166 iccv-2013-Finding Actors and Actions in Movies

16 0.11476678 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval

17 0.11158045 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection

18 0.10911629 444 iccv-2013-Viewing Real-World Faces in 3D

19 0.10397247 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies

20 0.10151362 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.17), (1, 0.092), (2, -0.037), (3, -0.132), (4, 0.095), (5, 0.188), (6, 0.053), (7, -0.067), (8, -0.155), (9, 0.036), (10, 0.537), (11, -0.152), (12, 0.234), (13, 0.175), (14, 0.02), (15, 0.096), (16, -0.017), (17, 0.051), (18, -0.16), (19, 0.056), (20, 0.103), (21, 0.065), (22, 0.031), (23, -0.002), (24, 0.031), (25, -0.035), (26, -0.025), (27, 0.02), (28, 0.061), (29, -0.008), (30, -0.008), (31, 0.042), (32, -0.012), (33, -0.006), (34, 0.042), (35, 0.018), (36, 0.021), (37, 0.042), (38, 0.027), (39, 0.011), (40, 0.024), (41, -0.045), (42, 0.003), (43, 0.007), (44, 0.019), (45, 0.041), (46, -0.01), (47, 0.015), (48, -0.0), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97169906 210 iccv-2013-Image Retrieval Using Textual Cues

Author: Anand Mishra, Karteek Alahari, C.V. Jawahar

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.

2 0.89290828 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes

Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan

Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.

3 0.87311488 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection

Author: Lukáš Neumann, Jiri Matas

Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.

4 0.8515209 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions

Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven

Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

5 0.81900567 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors

Author: Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang

Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and textline levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.

6 0.64900875 180 iccv-2013-From Where and How to What We See

7 0.64386457 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes

8 0.52884024 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching

9 0.49254769 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint

10 0.4874672 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences

11 0.4714486 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation

12 0.46259803 334 iccv-2013-Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval

13 0.4286716 446 iccv-2013-Visual Semantic Complex Network for Web Images

14 0.42298892 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval

15 0.41638067 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval

16 0.41634116 419 iccv-2013-To Aggregate or Not to aggregate: Selective Match Kernels for Image Search

17 0.39825726 221 iccv-2013-Joint Inverted Indexing

18 0.39315709 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search

19 0.3754769 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection

20 0.37471139 368 iccv-2013-SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.11), (7, 0.043), (12, 0.018), (26, 0.057), (27, 0.017), (31, 0.164), (35, 0.011), (42, 0.073), (47, 0.138), (64, 0.051), (73, 0.017), (89, 0.197), (98, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88771617 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes

Author: Dahua Lin, Jianxiong Xiao

Abstract: In this paper, we develop a generative model to describe the layouts of outdoor scenes, i.e., the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts. We demonstrate the practical utility of the proposed model by testing it on scene classification, semantic segmentation, and layout hallucination.

2 0.88494611 38 iccv-2013-Action Recognition with Actons

Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu

Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.

same-paper 3 0.88281184 210 iccv-2013-Image Retrieval Using Textual Cues

Author: Anand Mishra, Karteek Alahari, C.V. Jawahar

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.

4 0.87192488 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting

Author: Dingzeyu Li, Qifeng Chen, Chi-Keung Tang

Abstract: This paper demonstrates how the nonlocal principle benefits video matting via the KNN Laplacian, which comes with a straightforward implementation using motion-aware K nearest neighbors. In hindsight, the fundamental problem to solve in video matting is to produce spatio-temporally coherent clusters of moving foreground pixels. When used as described, the motion-aware KNN Laplacian is effective in addressing this fundamental problem, as demonstrated by sparse user markups typically on only one frame in a variety of challenging examples featuring ambiguous foreground and background colors, changing topologies with disocclusion, significant illumination changes, fast motion, and motion blur. When working with existing Laplacian-based systems, our Laplacian is expected to benefit them immediately with improved clustering of moving foreground pixels.

5 0.86631757 357 iccv-2013-Robust Matrix Factorization with Unknown Noise

Author: Deyu Meng, Fernando De_La_Torre

Abstract: Many problems in computer vision can be posed as recovering a low-dimensional subspace from high-dimensional visual data. Factorization approaches to low-rank subspace estimation minimize a loss function between an observed measurement matrix and a bilinear factorization. Most popular loss functions include the L2 and L1 losses. L2 is optimal for Gaussian noise, while L1 is for Laplacian distributed noise. However, real data is often corrupted by an unknown noise distribution, which is unlikely to be purely Gaussian or Laplacian. To address this problem, this paper proposes a low-rank matrix factorization problem with a Mixture of Gaussians (MoG) noise model. The MoG model is a universal approximator for any continuous distribution, and hence is able to model a wider range of noise distributions. The parameters of the MoG model can be estimated with a maximum likelihood method, while the subspace is computed with standard approaches. We illustrate the benefits of our approach in extensive synthetic and real-world experiments including structure from motion, face modeling and background subtraction.

6 0.86597574 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes

7 0.85616958 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures

8 0.85559839 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization

9 0.85430121 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification

10 0.84869635 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection

11 0.84584999 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction

12 0.83997208 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification

13 0.83920622 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions

14 0.83037603 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors

15 0.82665741 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

16 0.82015336 180 iccv-2013-From Where and How to What We See

17 0.81432432 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes

18 0.81008649 426 iccv-2013-Training Deformable Part Models with Decorrelated Features

19 0.80949759 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation

20 0.80738455 173 iccv-2013-Fluttering Pattern Generation Using Modified Legendre Sequence for Coded Exposure Imaging