iccv iccv2013 iccv2013-315 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe PhotoOCR, a system for text extraction from images. [sent-2, score-0.509]
2 Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. [sent-3, score-1.011]
3 Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. [sent-5, score-0.565]
4 Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. [sent-7, score-0.381]
5 We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. [sent-9, score-0.588]
6 Introduction Extraction of text from uncontrolled images is a challenging problem with many practical applications. [sent-12, score-0.381]
7 Reliable text recognition would provide a useful input modality for smartphones, particularly in applications such as translation where the text may be difficult for a user to input by other means. [sent-13, score-0.851]
8 Traditional OCR systems typically rely on brittle techniques such as binarization, where the first stage of processing is a simple thresholding operation used to divide an image into text and non-text pixels [19]. [sent-16, score-0.381]
9 Our target images include both scene text (such as Figure 1) and more document-like text that suffers from blur, low resolution or other degradations that are common in smartphone imagery (see Figure 2). [sent-19, score-0.915]
10 In particular, our deep neural network character classifier is trained on up to 2 million manually labelled examples, and our language model is learned on a corpus of more than a trillion tokens. [sent-24, score-1.117]
11 Many publications address sub-tasks such as text detection and isolated character classification. [sent-29, score-0.864]
12 One of the best performing text detection methods is the stroke width transform of [8]. [sent-30, score-0.494]
13 Isolated character classification is widely used as a machine learning benchmark [17]. [sent-31, score-0.403]
14 However, the mid-level problem of fusing character classifier and language model signals for complete text extraction is less commonly addressed. [sent-32, score-1.307]
15 Application papers often perform text detection and preprocessing before applying a commercial OCR system designed for printed documents, as for example in [4]. [sent-33, score-0.49]
16 Among complete systems for the scene text extraction problem, language modelling is often less developed than the image processing components. [sent-34, score-0.795]
17 For example the method of [18] uses a bigram language model together with a set of hand-designed image features and a support vector machine classifier for text detection and recognition. [sent-35, score-0.872]
18 In the method of [27], text detection is assumed and recognition is performed by fusing appearance, self-similarity, lexicon and bigram language signals in a sparse belief propagation framework. [sent-36, score-0.97]
19 The system of [20] describes a large-lexicon design, using weighted finite state transducers to perform joint inference over appearance, self-consistency and language signals. [sent-39, score-0.435]
20 One such method is [25], where detection and character classification are performed in a single step using randomized ferns. [sent-41, score-0.433]
21 System Design In general outline, our system takes a conventional multistage approach to text extraction, similar to designs such as [2]. [sent-46, score-0.49]
22 We begin by performing text detection on the input image. [sent-47, score-0.411]
23 Candidate text lines from the detection stage are processed for text recognition. [sent-50, score-0.836]
24 Recognition begins with a 1D oversegmentation of the text line to identify candidate character regions. [sent-51, score-0.819]
25 We then search through the space of segmentations to maximize a score which combines the character classifier and language model likelihoods. [sent-52, score-0.913]
26 The reason for this staged approach is computational: it would be prohibitively expensive to apply full inference at all locations and scales in the input image, or to apply our character classifier at all locations in each candidate text region. [sent-54, score-0.885]
27 Our primary intended application is text extraction as an input modality for smartphone users, which limits total acceptable processing time to at most one or two seconds per image. [sent-55, score-0.57]
28 Text Detection A detailed description of the text detection portion of our system is outside the scope of this paper. [sent-60, score-0.49]
29 Briefly, we combine the output of three different text detection approaches. [sent-61, score-0.411]
30 This portion of the system also deals with splitting text regions into individual lines and correcting orientation to horizontal, both of which are relatively trivial. [sent-65, score-0.504]
31 For the remainder of the paper we will focus on extracting text from the horizontal line region candidates. [sent-66, score-0.416]
32 Over-Segmentation The over-segmentation step divides the text line into segments which should contain no more than one character (but characters may be split into multiple segments). [sent-69, score-0.962]
33 The segmentation stage outputs a vector B containing the positions of the detected segmentation points, including the start and end points of the text detection box. [sent-83, score-0.509]
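The segmenter itself is trained and is not detailed in these excerpts; as a rough illustration of its interface only, the sketch below produces the vector B from a binarized line by splitting at blank columns of the projection profile. The function name, the projection-profile heuristic, and the min_gap parameter are assumptions, not the paper's method.

```python
import numpy as np

def oversegment(line_img, min_gap=1):
    """Minimal over-segmentation sketch: split a binarized text line at
    columns with no ink (projection-profile valleys). The paper's trained
    segmenter is more sophisticated; this only illustrates the output
    format: a vector B of candidate split positions that includes the
    start and end of the text detection box."""
    ink_per_column = (line_img > 0).sum(axis=0)   # line_img: HxW, text=1
    blank = ink_per_column == 0
    B = [0]
    run_start = None
    for x, is_blank in enumerate(blank):
        if is_blank and run_start is None:
            run_start = x
        elif not is_blank and run_start is not None:
            if x - run_start >= min_gap:
                B.append((run_start + x) // 2)    # split mid-gap
            run_start = None
    B.append(line_img.shape[1])
    return np.array(B)
```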
34 Φ(ci | ci−1, . . . , c1) is the language model probability for the ith class assignment given the previous class assignments in the line. [sent-96, score-0.378]
35 α controls the relative strength of the character classifier and language model. [sent-97, score-0.812]
36 The total score is thus the average per-character log-likelihood of the text line under the classifier and language model. [sent-98, score-0.867]
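Equation 1 itself is not reproduced in these excerpts. A reconstruction consistent with the surrounding description (per-character classifier log-likelihood plus an α-weighted language model term, averaged over the n characters of the line) would be:

```latex
S(c_1,\ldots,c_n) \;=\; \frac{1}{n}\sum_{i=1}^{n}
  \Big( \log p(c_i \mid x_i) \;+\; \alpha \, \log \Phi(c_i \mid c_{i-1},\ldots,c_1) \Big)
```

where x_i denotes the image evidence for the ith candidate character region; this should be read as a plausible reconstruction rather than the paper's exact equation.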
37 We perform this maximization using beam search [22], which is the typical approach for similar problems in speech recognition. [sent-101, score-0.364]
38 Beam search is a best-first search through the graph of partial recognition hypotheses; it relies on the fact that each node in the graph is a partial result (corresponding to the recognition of part of the text line) which can be scored by our scoring function. [sent-103, score-0.589]
39 At each step of the beam search, all successors of the current search nodes are scored, but only a fixed number of top scoring nodes (the beam width) are retained for the next search step. [sent-104, score-0.653]
40 We initialize the search simply at the left edge of the text detection box. [sent-105, score-0.47]
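The decoding loop can be sketched as follows. This is a minimal illustration, not the production implementation: classifier_logprobs and lm_logprob are hypothetical callables standing in for the character classifier and the character ngram model, and max_span (how many segments one character may cover) and the default beam width are assumed values.

```python
import heapq

def beam_search(B, classifier_logprobs, lm_logprob, alpha, beam_width=20):
    """Hedged sketch of the decoding loop. B is the vector of candidate
    segmentation points; classifier_logprobs(i, j) yields (char, logp)
    pairs for the patch between B[i] and B[j]; lm_logprob(char, history)
    is the character language model. Hypotheses are pruned by mean
    per-character log-likelihood, mirroring the scoring function above."""
    beam = [(0, [], 0.0)]        # (position index into B, chars, log score)
    finished = []
    max_span = 3                 # assume one char covers at most 3 segments
    while beam:
        candidates = []
        for pos, chars, total in beam:
            if pos == len(B) - 1:              # reached the line's right edge
                finished.append((total / max(len(chars), 1), ''.join(chars)))
                continue
            for nxt in range(pos + 1, min(pos + 1 + max_span, len(B))):
                for ch, logp in classifier_logprobs(pos, nxt):
                    lm = lm_logprob(ch, chars)
                    candidates.append(
                        (nxt, chars + [ch], total + logp + alpha * lm))
        # keep only the beam_width best partial results for the next step
        beam = heapq.nlargest(beam_width, candidates,
                              key=lambda h: h[2] / len(h[1]))
    return [text for _, text in sorted(finished, reverse=True)]
```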
41 The practical bottleneck on recognition performance appears to come from classifier and language model quality, rather than failure to find the solution which maximizes the score function. [sent-112, score-0.508]
42 Character Classifier We use a deep neural network for character classification. [sent-115, score-0.539]
43 The output layer is a softmax over 99 character classes plus a noise class. [sent-125, score-0.403]
44 This provides a more tightly cropped character to the classifier. [sent-130, score-0.403]
45 We compute two HOG features on this character patch. [sent-131, score-0.403]
46 The character patch is normalized to 65x65 pixels for computing this second feature. [sent-134, score-0.403]
47 The three geometry features used in addition to HOG encode the original aspect ratio and the position of the top and bottom edge of the pixels relative to the height of the overall text detection. [sent-135, score-0.381]
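A sketch of this featurization using scikit-image is given below. Only the 65x65 normalization of the second HOG feature and the three geometry features are taken from the text; the first patch resolution, the HOG parameters, and the function signature are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def character_features(patch, box_top, box_bottom, line_height):
    """Sketch of the classifier input: two HOG descriptors at different
    resolutions plus three geometry features. Parameter values other than
    the 65x65 normalization are assumptions, not the paper's settings."""
    h, w = patch.shape
    f1 = hog(resize(patch, (32, 32)), orientations=8,
             pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    f2 = hog(resize(patch, (65, 65)), orientations=8,
             pixels_per_cell=(13, 13), cells_per_block=(2, 2))
    geometry = np.array([
        w / h,                     # original aspect ratio
        box_top / line_height,     # top edge relative to detection height
        box_bottom / line_height,  # bottom edge relative to detection height
    ])
    return np.concatenate([f1, f2, geometry])
```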
48 Language Model In structured classification tasks such as OCR and speech recognition, a strong language prior makes a major contribution to final performance. [sent-140, score-0.361]
49 We use a standard ngram approach for language modelling. [sent-142, score-0.453]
50 Because our system is designed for use in datacenter environments, we adopt a two-level language model design. [sent-143, score-0.411]
51 This model provides the beam search language score, Φ(ci | ci−1, . . . , c1). [sent-145, score-0.642]
52 Our second level language model is a much larger distributed word-ngram model using the design of [1]. [sent-150, score-0.407]
53 Consequently we allocate only 60 MB of RAM per language for the character ngram model. [sent-157, score-0.856]
54 In addition to the character ngram model, we also maintain a simple dictionary of the top 100k words per language. [sent-163, score-0.524]
55 We use this as an additional signal in our language score; it provides a small performance increase over the character ngram model alone. [sent-164, score-0.856]
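A toy version of the first-level character ngram model with the dictionary signal might look like the following. The ngram order, the add-one smoothing, and the dictionary bonus weight are assumptions; the actual system uses a compressed model fitting in the 60 MB per-language RAM budget noted above.

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Sketch of the first-level character ngram model with a simple
    dictionary signal. Order, smoothing, and the dictionary bonus
    weight are illustrative assumptions."""
    def __init__(self, order=4, dict_bonus=1.0):
        self.order, self.dict_bonus = order, dict_bonus
        self.counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocab = set()
        self.dictionary = set()          # top-100k words per language

    def train(self, corpus):
        for text in corpus:
            padded = ' ' * (self.order - 1) + text
            for i in range(self.order - 1, len(padded)):
                ctx, ch = padded[i - self.order + 1:i], padded[i]
                self.counts[(ctx, ch)] += 1
                self.context_counts[ctx] += 1
                self.vocab.add(ch)

    def logprob(self, ch, history):
        ctx = (' ' * self.order + ''.join(history))[-(self.order - 1):]
        # add-one smoothing over the character vocabulary
        num = self.counts[(ctx, ch)] + 1
        den = self.context_counts[ctx] + len(self.vocab)
        return math.log(num / den)

    def score_line(self, text):
        """Mean per-character log-likelihood plus a bonus for each word
        found in the dictionary (the additional signal described above)."""
        lp = sum(self.logprob(c, text[:i]) for i, c in enumerate(text))
        bonus = sum(self.dict_bonus for w in text.split()
                    if w.lower() in self.dictionary)
        return lp / max(len(text), 1) + bonus
```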
56 Reranking The beam search terminates with a ranked list of recognition hypotheses, of beam width size. [sent-176, score-0.648]
57 Punctuation is recognized in the same way as any other character class in the initial beam search, but recall is comparatively low since it can be difficult to distinguish small punctuation characters from background clutter. [sent-178, score-1.004]
58 As described above, we use a distributed word-level language model which cannot be accessed during the beam search for latency reasons. [sent-181, score-0.83]
59 As in Equation 1, we use mean per-character log-likelihood for the language model score, in order to make it comparable between lines of different lengths. [sent-184, score-0.376]
60 The shape model computes the expected relative size and position of character bounding boxes for the recognized text in 20 common fonts, and scores the line based on the deviation from the best matching font. [sent-187, score-0.862]
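The reranking pass can be sketched as a weighted combination of the beam score with the word-level language model and shape scores. The names word_lm_logprob and shape_score and the weights beta are hypothetical placeholders, not the paper's exact formulation:

```python
def rerank(hypotheses, word_lm_logprob, shape_score, beta=(1.0, 0.5, 0.2)):
    """Sketch of reranking the beam-search output. hypotheses is a list
    of (text, beam_score) pairs; word_lm_logprob(text) queries the
    distributed word-ngram model; shape_score(text) measures bounding-box
    agreement with the best of 20 common fonts. The weights beta are
    hypothetical; the paper tunes such weights on a validation set."""
    def combined(hyp):
        text, beam_score = hyp
        n = max(len(text), 1)
        # mean per-character log-likelihood keeps the language score
        # comparable between lines of different lengths
        word_score = word_lm_logprob(text) / n
        return (beta[0] * beam_score
                + beta[1] * word_score
                + beta[2] * shape_score(text))
    return sorted(hypotheses, key=combined, reverse=True)
```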
61 Training We train our neural network character classifier using stochastic gradient descent with Adagrad [7] and dropout [10], using the distributed training design described in [6]. [sent-193, score-0.756]
62 Text lines in the imagery are manually annotated with the class of each character and the segmentation points between characters. [sent-204, score-0.616]
63 The data is used to train both our segmenter and character classifier. [sent-206, score-0.478]
64 To train the character classifier, we do not use the manually segmented characters directly since they may be dissimilar to what our automatic segmentation produces. [sent-208, score-0.593]
65 Instead we run our segmenter on the annotated lines and choose segments visited by the beam search which have high overlap with a manually labelled character as positive training examples. [sent-209, score-1.017]
66 We also select training examples for a “noise class” consisting of segments visited by the beam search which overlap partial/multiple ground truth characters or background clutter. [sent-210, score-0.52]
67 The final training set consists of 45% noise examples and 55% character class examples, for 3. [sent-211, score-0.468]
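The selection of positive and noise-class examples from beam-visited segments might be sketched as below; the IoU thresholds and the multiple-overlap test for the noise class are assumptions:

```python
def label_visited_segments(visited_boxes, gt_chars,
                           pos_iou=0.8, noise_iou=0.2):
    """Sketch of training-set construction: segments visited by the beam
    search become positive examples when they overlap one labelled
    character strongly, and noise-class examples when they cover
    partial/multiple characters or background. Thresholds are assumed."""
    def iou_1d(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    examples = []
    for box in visited_boxes:                 # box: (x_start, x_end)
        overlaps = [(iou_1d(box, g['box']), g['label']) for g in gt_chars]
        best_iou, best_label = max(overlaps, default=(0.0, None))
        if best_iou >= pos_iou:
            examples.append((box, best_label))     # positive example
        elif best_iou <= noise_iou or sum(o > 0.1 for o, _ in overlaps) > 1:
            examples.append((box, 'NOISE'))        # noise-class example
    return examples
```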
68 For the Latin alphabet version of our system, we learn 99 character classes plus the noise class. [sent-213, score-0.403]
69 The character classes cover upper and lowercase letters, punctuation and some additional common characters such as currency symbols. [sent-214, score-0.687]
70 For training we consider accented variants of a character together as a single class, and rely on the language model to select the correct variant at recognition time. [sent-216, score-0.405]
71 Our final output is thus from a larger space of several hundred possible character classes. [sent-217, score-0.403]
72 Finally we set free parameters of the system, such as the language model weights α and β, by optimizing end-to-end system performance on a validation set using Powell’s method. [sent-218, score-0.411]
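With SciPy, such tuning could be sketched as follows; end_to_end_wer is a hypothetical function that runs the full pipeline with the given weights on the validation set and returns its word error rate:

```python
import numpy as np
from scipy.optimize import minimize

def tune_weights(end_to_end_wer, x0=np.array([1.0, 1.0])):
    """Sketch of free-parameter tuning with Powell's method.
    end_to_end_wer(params) evaluates the full system with language-model
    weights params = [alpha, beta] and returns the validation WER."""
    result = minimize(end_to_end_wer, x0, method='Powell')
    return result.x  # optimized [alpha, beta]
```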
73 Much of the text in these real-world images also exists verbatim somewhere on the web. [sent-226, score-0.509]
74 If we achieve partially correct text extraction from an image, we can often locate the source text by issuing multiple web queries. [sent-227, score-0.811]
75 The extracted character bounding boxes come from our OCR system, but any errors in the character labels are corrected by alignment against the source text. [sent-231, score-0.832]
76 The majority of matches come from images of dense text such as newspaper articles. [sent-233, score-0.407]
77 Figure 3: Some examples of text correctly read by our system. [sent-239, score-0.381]
78 Table 1: Results on the ICDAR 2013 Robust Reading Competition scene text test set [12], showing closest competitors to PhotoOCR on recognition rate and edit distance metrics. [sent-244, score-0.496]
79 The most suitable public benchmark for unconstrained OCR is the ICDAR 2013 Robust Reading Competition scene text test set [12]. [sent-246, score-0.381]
80 We have not designed our system for this task, but we can make simple use of the lexicon by performing unaided OCR and then selecting the lexicon word with the smallest edit distance as the final result. [sent-262, score-0.432]
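A minimal sketch of this lexicon strategy:

```python
def constrain_to_lexicon(ocr_output, lexicon):
    """Run unaided OCR first, then return the lexicon word with the
    smallest edit distance to the raw result, as described above."""
    def edit_distance(a, b):
        # standard dynamic-programming Levenshtein distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]
    return min(lexicon, key=lambda w: edit_distance(ocr_output, w))
```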
81 Figure 4: Some examples of images where the extracted text is incorrect (e.g., recognized “THUNDERBU” for ground truth “THUNDERBALL”). [sent-273, score-0.381]
82 The captions show the recognized text versus the ground truth. [sent-274, score-0.45]
83 A further salient difference in the tasks is that ICDAR tests isolated word recognition, which limits the performance gains available from language modelling. [sent-277, score-0.469]
84 For comparability to ICDAR results, we measure performance given ground truth text detection. [sent-279, score-0.381]
85 We also analysed the impact of the language models on the word error rate of the overall system. [sent-289, score-0.484]
86 A baseline system without any language model (character classifier alone) recognizes only 39. [sent-290, score-0.488]
87 Adding the character-level language model reduces the word error rate by 39. [sent-292, score-0.461]
88 The additional gain from the word-level model seems small relative to the impact of the character language model, but it represents resolving many harder examples that are difficult to disambiguate in any other way. [sent-305, score-0.783]
89 Finally, we estimated the performance impact of the beam search by computing results with a beam width 1000x larger than used in our baseline configuration. [sent-306, score-0.64]
90 This includes both text detection and recognition for a full image containing multiple text lines; note however that we parallelize the computation by sending each line of text to a different OCR worker. [sent-311, score-1.239]
91 In this configuration no text detection is required, and recognition alone took 1. [sent-313, score-0.473]
92 Conclusion We have presented a complete system for text extraction from challenging images. [sent-316, score-0.542]
93 Our system is built on recent machine learning approaches for improved classifier performance, combined with large training sets and distributed language modelling. [sent-317, score-0.575]
94 The system achieves record performance on all major text recognition benchmarks, and high quality text extraction from typical smartphone imagery with sub-second latency. [sent-318, score-1.099]
95 Detecting text in natural scenes with stroke width transform. [sent-481, score-0.464]
96 Whole is greater than sum of parts: Recognizing scene text words. [sent-488, score-0.381]
97 Improvement of handwritten Japanese character recognition using weighted direction code histogram. [sent-524, score-0.434]
98 A method for text localization and recognition in real-world images. [sent-556, score-0.412]
99 Enforcing similarity constraints with integer programming for better scene text recognition. [sent-584, score-0.381]
100 Scene text recognition using similarity and a lexicon with sparse belief propagation. [sent-608, score-0.524]
wordName wordTfidf (topN-words)
[('character', 0.403), ('text', 0.381), ('ocr', 0.376), ('language', 0.332), ('beam', 0.251), ('punctuation', 0.169), ('icdar', 0.158), ('ngram', 0.121), ('characters', 0.115), ('latency', 0.112), ('lexicon', 0.112), ('word', 0.087), ('smartphone', 0.082), ('system', 0.079), ('classifier', 0.077), ('segmenter', 0.075), ('imagery', 0.071), ('reading', 0.071), ('languages', 0.069), ('network', 0.066), ('translate', 0.066), ('networks', 0.065), ('google', 0.065), ('labelled', 0.064), ('photoocr', 0.063), ('search', 0.059), ('modality', 0.058), ('width', 0.056), ('dropout', 0.056), ('million', 0.055), ('latin', 0.052), ('commercially', 0.052), ('bigram', 0.052), ('isolated', 0.05), ('segmentation', 0.049), ('extraction', 0.049), ('competition', 0.047), ('binarized', 0.047), ('distributed', 0.045), ('lines', 0.044), ('document', 0.044), ('recognized', 0.043), ('algorithmrrecawotgeon', 0.042), ('logs', 0.042), ('netzer', 0.042), ('ngrams', 0.042), ('wdch', 0.042), ('rate', 0.042), ('score', 0.042), ('edit', 0.042), ('training', 0.042), ('reranking', 0.041), ('ci', 0.038), ('halving', 0.037), ('forever', 0.037), ('neural', 0.037), ('hog', 0.037), ('bi', 0.037), ('line', 0.035), ('bissacco', 0.035), ('junk', 0.035), ('complete', 0.033), ('deep', 0.033), ('scoring', 0.033), ('svt', 0.033), ('prentice', 0.033), ('coates', 0.033), ('signals', 0.032), ('convolutional', 0.031), ('recognition', 0.031), ('accessed', 0.031), ('secs', 0.031), ('mishra', 0.031), ('ram', 0.031), ('configuration', 0.031), ('detection', 0.03), ('design', 0.03), ('regime', 0.03), ('designs', 0.03), ('speech', 0.029), ('blur', 0.029), ('segments', 0.028), ('fonts', 0.028), ('binarization', 0.027), ('stroke', 0.027), ('manually', 0.026), ('versus', 0.026), ('come', 0.026), ('scored', 0.026), ('typical', 0.025), ('alahari', 0.025), ('permits', 0.025), ('visited', 0.025), ('resolving', 0.025), ('trained', 0.024), ('inference', 0.024), ('impact', 0.023), ('hypotheses', 0.023), ('modern', 0.023), ('class', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
2 0.46186852 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.4262712 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
4 0.39557177 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
5 0.24236608 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
6 0.21411064 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
7 0.15329808 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
8 0.151737 428 iccv-2013-Translating Video Content to Natural Language Descriptions
9 0.135208 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.12184771 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
11 0.11337462 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
12 0.09332376 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
13 0.088982105 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
14 0.075577997 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
15 0.075343184 44 iccv-2013-Adapting Classification Cascades to New Domains
16 0.074314214 166 iccv-2013-Finding Actors and Actions in Movies
17 0.068134576 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
18 0.066934556 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
19 0.063883595 379 iccv-2013-Semantic Segmentation without Annotating Segments
20 0.063646339 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
topicId topicWeight
[(0, 0.18), (1, 0.065), (2, -0.012), (3, -0.076), (4, 0.082), (5, 0.055), (6, 0.025), (7, -0.036), (8, -0.076), (9, -0.042), (10, 0.428), (11, -0.162), (12, 0.188), (13, 0.107), (14, 0.035), (15, 0.109), (16, -0.118), (17, 0.142), (18, -0.278), (19, 0.139), (20, 0.117), (21, 0.094), (22, 0.039), (23, 0.013), (24, 0.006), (25, -0.051), (26, -0.03), (27, 0.06), (28, -0.006), (29, -0.003), (30, -0.01), (31, 0.005), (32, 0.015), (33, -0.008), (34, -0.013), (35, 0.031), (36, -0.003), (37, 0.025), (38, 0.006), (39, -0.001), (40, -0.025), (41, -0.015), (42, 0.009), (43, -0.012), (44, 0.012), (45, 0.011), (46, 0.029), (47, 0.002), (48, -0.031), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.9547959 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
2 0.95468348 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
3 0.9399268 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
4 0.91211146 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
Author: Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang
Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and textline levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.
5 0.83844733 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
6 0.70977277 180 iccv-2013-From Where and How to What We See
7 0.57063532 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
9 0.47344413 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
10 0.43453425 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
11 0.32518104 428 iccv-2013-Translating Video Content to Natural Language Descriptions
13 0.31888232 166 iccv-2013-Finding Actors and Actions in Movies
14 0.30462301 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
15 0.29548857 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
16 0.2898837 246 iccv-2013-Learning the Visual Interpretation of Sentences
17 0.28211987 44 iccv-2013-Adapting Classification Cascades to New Domains
18 0.27324581 416 iccv-2013-The Interestingness of Images
19 0.26252872 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
20 0.26203209 112 iccv-2013-Detecting Irregular Curvilinear Structures in Gray Scale and Color Imagery Using Multi-directional Oriented Flux
topicId topicWeight
[(2, 0.078), (7, 0.048), (12, 0.024), (15, 0.145), (26, 0.073), (31, 0.127), (40, 0.013), (42, 0.082), (48, 0.016), (64, 0.046), (73, 0.029), (78, 0.015), (89, 0.14), (95, 0.01), (97, 0.028), (98, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.85048771 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
2 0.84240431 312 iccv-2013-Perceptual Fidelity Aware Mean Squared Error
Author: Wufeng Xue, Xuanqin Mou, Lei Zhang, Xiangchu Feng
Abstract: How to measure the perceptual quality of natural images is an important problem in low level vision. It is known that the Mean Squared Error (MSE) is not an effective index to describe the perceptual fidelity of images. Numerous perceptual fidelity indices have been developed, while the representatives include the Structural SIMilarity (SSIM) index and its variants. However, most of those perceptual measures are nonlinear, and they cannot be easily adopted as an objective function to minimize in various low level vision tasks. Can MSE be perceptual fidelity aware after some minor adaptation? In this paper we propose a simple framework to enhance the perceptual fidelity awareness of MSE by introducing an l2-norm structural error term to it. Such a Structural MSE (SMSE) can lead to very competitive image quality assessment (IQA) results. More surprisingly, we show that by using certain structure extractors, SMSE can be further turned into a Gaussian smoothed MSE (i.e., the Euclidean distance between the original and distorted images after Gaussian smooth filtering), which is much simpler to calculate but achieves rather better IQA performance than SSIM. The so-called Perceptual-fidelity Aware MSE (PAMSE) can have great potential in applications such as perceptual image coding and perceptual image restoration.
3 0.8238005 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
4 0.81158578 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
Author: Dahua Lin, Jianxiong Xiao
Abstract: In this paper, we develop a generative model to describe the layouts of outdoor scenes, that is, the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts. We demonstrate the practical utility of the proposed model by testing it on scene classification, semantic segmentation, and layout hallucination.
5 0.80983371 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
Author: Deyu Meng, Fernando De_La_Torre
Abstract: Many problems in computer vision can be posed as recovering a low-dimensional subspace from high-dimensional visual data. Factorization approaches to low-rank subspace estimation minimize a loss function between an observed measurement matrix and a bilinear factorization. Most popular loss functions include the L2 and L1 losses. L2 is optimal for Gaussian noise, while L1 is for Laplacian distributed noise. However, real data is often corrupted by an unknown noise distribution, which is unlikely to be purely Gaussian or Laplacian. To address this problem, this paper proposes a low-rank matrix factorization problem with a Mixture of Gaussians (MoG) noise model. The MoG model is a universal approximator for any continuous distribution, and hence is able to model a wider range of noise distributions. The parameters of the MoG model can be estimated with a maximum likelihood method, while the subspace is computed with standard approaches. We illustrate the benefits of our approach in extensive synthetic and real-world experiments including structure from motion, face modeling and background subtraction.
6 0.80333549 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
7 0.79720926 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
8 0.79718131 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
9 0.79613817 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization
10 0.79565412 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting
11 0.79556507 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
12 0.79162771 180 iccv-2013-From Where and How to What We See
13 0.78631532 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
14 0.78326124 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
15 0.77193987 210 iccv-2013-Image Retrieval Using Textual Cues
16 0.76309645 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
17 0.76309371 173 iccv-2013-Fluttering Pattern Generation Using Modified Legendre Sequence for Coded Exposure Imaging
18 0.76093137 349 iccv-2013-Regionlets for Generic Object Detection
19 0.75855863 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
20 0.75631732 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors