iccv iccv2013 iccv2013-345 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper presents an approach to text recognition in natural scene images. [sent-6, score-0.239]
2 Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. [sent-7, score-1.009]
3 For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. [sent-8, score-0.58]
4 We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. [sent-10, score-0.836]
5 Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. [sent-11, score-0.4]
6 Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations. [sent-12, score-0.518]
7 Introduction Reading text in natural scene images refers to the problem of recognizing the words that appear in the scene. [sent-14, score-0.278]
8 Partly due to this reason, scene text recognition has received increasing interest from the community. [sent-18, score-0.239]
9 In this paper, we focus on text recognition in street images, which facilitates the application of business name search on online maps [1]. [sent-21, score-0.355]
10 The large-scale nature of street image data provides an exciting opportunity to benefit millions of users. [sent-27, score-0.097]
11 However, scene text recognition is very challenging due to three main problems. [sent-28, score-0.239]
12 First, the appearances of scene characters are almost unconstrained. [sent-29, score-0.32]
13 Second, scene characters often suffer from various deformations such as uneven illumination, blurring and perspective distortion. [sent-33, score-0.602]
14 Third, in complex scenes such as street images, text may not be the main object. [sent-34, score-0.266]
15 As an illustration of the complexity of street images, the recognition accuracy of Optical Character Recognition (OCR) engines on words cropped from these images is as low as 35% [1]. [sent-36, score-0.275]
16 Although there are existing works to recognize text in natural scene images, e.g., [1–4], their scopes are limited to horizontal texts which are frontal parallel to the image plane. [sent-38, score-0.249] [sent-40, score-0.445]
18 However, in practice, scene texts can appear in any orientation, and with perspective distortion. [sent-41, score-0.513]
19 Thus, the important issue of handling perspective texts has been neglected by previous works. [sent-42, score-0.513]
20 In this paper, we attempt to address the recognition of perspective texts of arbitrary orientations in complex scenes (such as street images). [sent-43, score-0.679]
21 Using a traditional visual feature such as Histogram of Oriented Gradients (HOG) (as employed in [1, 2]) would lead to a low accuracy on perspective texts. [sent-44, score-0.176]
22 The reason is that the feature is not able to handle the different character poses. [sent-45, score-0.517]
23 A possible alternative is to train the classifier on perspective character samples. However, the major drawback of this approach is that it is labor-intensive and time-consuming to collect enough training samples for a large number of character classes (62 classes for English characters and digits), each with, say, 10 discrete poses. [sent-47, score-0.803]
24 In addition, when collecting character samples from natural scenes, it is difficult to control the character poses accurately. [sent-48, score-1.093]
25 Because SIFT is robust to both rotation and viewpoint change, our system is trained on only frontal characters (from commonly used datasets in the literature such as ICDAR 2003 [5]). [sent-50, score-0.411]
26 Our extensive experiments show that this approach achieves good accuracies, while avoiding the high cost of collecting samples of perspective characters. [sent-51, score-0.235]
27 Following recent works [1, 2], the scope of this paper is limited to cropped word recognition with a lexicon, i.e., a list of candidate words is provided for each word image. [sent-52, score-0.365]
28 The lexicon serves as a form of context information, and is especially relevant for the application of business name search. [sent-55, score-0.223]
29 Given a street image and its address, the lexicon can be built by collecting the shop names around the address via a search engine [1]. [sent-56, score-0.368]
30 Figure 1: The problem of cropped word recognition (lexicon: SALE, YOGA, BARGE, …; a text detector produces the cropped word, which is then recognized using the lexicon). [sent-57, score-0.329]
31 A “cropped word” refers to the region cropped from the original image based on the word bounding box returned by a text detector. [sent-58, score-0.559]
32 Given a cropped word image, the task is to recognize the word using the provided lexicon. [sent-61, score-0.611]
33 [6] used a list of soccer players’ names for text recognition in sports videos. [sent-64, score-0.284]
34 Our main contributions are as follows. (1) We present an approach to recognize perspective scene texts of arbitrary orientations. [sent-70, score-0.598]
35 (2) Our system is trained on only frontal characters, which drastically reduces the cost of collecting training data. [sent-72, score-0.154]
36 (3) For performance evaluation, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a variety of viewpoints. [sent-73, score-0.4]
37 Related work A comprehensive review of text extraction methods is provided in [8]. [sent-76, score-0.169]
38 In general, there are two main steps: text detection and text recognition. [sent-77, score-0.338]
39 The first step aims to locate the text positions in an image, usually by drawing a bounding box around each word. [sent-78, score-0.204]
40 In the second step, the detected words are recognized as text strings. [sent-80, score-0.276]
41 A building block for word recognition is individual character recognition. [sent-83, score-0.789]
42 Previous methods typically rely on features such as HOG for character recognition. However, since these features are not robust to rotation and viewpoint change, they may not work well for perspective characters of arbitrary orientations. [sent-86, score-0.531]
43 [3] proposed a novel similarity constraint to force characters which were visually similar to take the same label. [sent-90, score-0.286]
44 However, these methods were only tested on simple sign images where most of the words appeared on plain backgrounds. [sent-91, score-0.106]
45 [1, 16] adopted an object recognition framework for word recognition. [sent-93, score-0.272]
46 These methods require all characters of a word to be correctly recognized and thus cannot handle cases where one or more characters are occluded. [sent-94, score-0.58]
47 Figure 2: The flowchart of the proposed method. [sent-95, score-0.286]
48 Recent works formulate word recognition as an optimization problem by using Conditional Random Field (CRF) [2, 17], Viterbi alignment [15] and weighted finite-state transducers [4]. [sent-96, score-0.343]
49 One approach is to rectify perspective texts prior to recognition. [sent-99, score-0.479]
50 However, these methods rely heavily on the quality of the binarized character shapes. [sent-102, score-0.517]
51 Thus, although they work for texts on plain backgrounds, it is unclear whether they can handle texts with cluttered backgrounds (as in street images). [sent-103, score-0.761]
52 [20] rectified perspective texts in image sequences by utilizing the motion information. [sent-105, score-0.479]
53 However, this work only focused on character recognition, and did not address word recognition. [sent-109, score-0.753]
54 Therefore, despite its importance, the recognition of perspective texts has not been adequately addressed. [sent-111, score-0.515]
55 Character detection and recognition An overview of our approach to perspective text recognition is shown in Figure 2. [sent-113, score-0.417]
56 We describe the detection and recognition of characters below. [sent-114, score-0.322]
57 The optimized alignment of the recognized characters with the lexicon will be discussed in the next section. [sent-115, score-0.585]
58 Detection of character candidates In the first step, we use MSERs [22] to detect the potential character locations in a cropped word image (hereafter referred to as character candidates). [sent-118, score-1.987]
59 It has been shown that scene characters can be extracted as MSERs [11, 12]. [sent-120, score-0.346]
60 However, not all the MSERs extracted from a cropped word correspond to characters. [sent-123, score-0.355]
61 Thus, we classify them into text MSERs and non-text MSERs using four features: the relative height, the aspect ratio, the number of holes and the number of horizontal crossings [11, 12]. [sent-124, score-0.216]
62 The text MSERs are retained while the non-text MSERs are discarded. [sent-125, score-0.169]
63 In [11, 12], the text MSERs were directly used for text detection. [sent-126, score-0.338]
64 Figure 3 shows an example. Figure 3: Character detection based on MSERs. (a) Cropped word image; (b) MSERs; (c) Character candidates based on MSER bounding boxes. [sent-129, score-0.378]
65 Therefore, using the MSER bounding boxes as character candidates helps to recover some of the missing parts (if any) of the characters. [sent-133, score-0.659]
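To make the candidate-detection step concrete, here is a minimal sketch (not the authors' code) using OpenCV's MSER detector: regions are extracted from the grayscale cropped word, then classified as text/non-text with the four features named above. All thresholds and the helper name are illustrative assumptions.

```python
# A minimal sketch of MSER-based character-candidate detection, assuming
# hypothetical acceptance ranges for the four features (relative height,
# aspect ratio, number of holes, horizontal crossings).
import cv2
import numpy as np

def detect_character_candidates(word_gray):
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(word_gray)
    img_h = word_gray.shape[0]
    candidates = []
    for pts, (x, y, w, h) in zip(regions, bboxes):
        relative_height = h / float(img_h)        # height relative to the word image
        aspect_ratio = w / float(h)
        # Rasterize the region to count holes and horizontal crossings.
        mask = np.zeros((h, w), np.uint8)
        mask[pts[:, 1] - y, pts[:, 0] - x] = 255
        contours, hierarchy = cv2.findContours(
            mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
        holes = sum(1 for i in range(len(contours))
                    if hierarchy[0][i][3] != -1)  # contours with a parent are holes
        mid_row = mask[h // 2]
        crossings = int(np.count_nonzero(mid_row[1:] != mid_row[:-1]) // 2)
        # Hypothetical thresholds; the paper does not report its values here.
        if (0.3 < relative_height <= 1.0 and 0.1 < aspect_ratio < 2.0
                and holes <= 2 and crossings <= 4):
            candidates.append((x, y, w, h))       # keep the MSER bounding box
    return candidates
```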
66 Estimation of character probabilities For each character candidate, we estimate the probability that it belongs to each character class.
67 This requires the features extracted from the character candidates to be robust to rotation and viewpoint change. [sent-221, score-0.68]
68 SIFT has been explored for text recognition in [24, 25] and for word spotting in [26, 27]. [sent-223, score-0.441]
69 However, these works extracted SIFT descriptors only at sparse interest points, which is not sufficient for perspective characters (to be explained later). [sent-226, score-0.495]
70 The last two works were only tested on frontal scanned document images. [sent-227, score-0.125]
71 In contrast, we adopt dense SIFT (which was used for scene classification in [28]) for perspective character recognition. [sent-228, score-0.763]
72 More specifically, the patch inside a character candidate is normalized to a fixed size of 48 × 48. [sent-229, score-0.554]
73 (In the literature, the term “dense SIFT” sometimes refers to an extraction scheme where the orientations of the dense interest points are fixed.) [sent-234, score-0.123]
74 The rationale for using dense SIFT is that it provides more information to discriminate among a large number of classes (62 character classes). [sent-236, score-0.553]
75 With the original SIFT, the descriptors are only extracted at sparse interest points. [sent-237, score-0.096]
76 Scene characters often suffer from degradations, e.g., blurring and uneven illumination, which reduce the number of detected interest points. [sent-240, score-0.113]
77 In contrast, dense SIFT provides more information for character recognition. [sent-245, score-0.553]
78 K-means clustering is used to build a vocabulary of 3,000 visual words from a random subset of (dense) SIFT descriptors extracted from training samples. [sent-252, score-0.141]
79 With this vocabulary, the descriptors of a character candidate are assigned to the nearest clusters. [sent-255, score-0.591]
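The bag-of-keypoints pipeline just described can be sketched as follows. The 48 × 48 patch size and the 3,000-word vocabulary come from the text; the grid step, keypoint size, and use of MiniBatchKMeans are assumptions for illustration.

```python
# A minimal sketch of dense SIFT + vocabulary quantization, assuming a
# hypothetical grid step and keypoint size inside the normalized patch.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

PATCH, STEP, KP_SIZE, VOCAB = 48, 4, 8, 3000
sift = cv2.SIFT_create()

def dense_sift(patch_gray):
    patch_gray = cv2.resize(patch_gray, (PATCH, PATCH))
    grid = [cv2.KeyPoint(float(x), float(y), KP_SIZE)
            for y in range(STEP, PATCH, STEP)
            for x in range(STEP, PATCH, STEP)]
    _, desc = sift.compute(patch_gray, grid)       # one 128-d descriptor per point
    return desc

def build_vocabulary(training_patches):
    descs = np.vstack([dense_sift(p) for p in training_patches])
    return MiniBatchKMeans(n_clusters=VOCAB, random_state=0).fit(descs)

def bag_of_keypoints(patch_gray, kmeans):
    words = kmeans.predict(dense_sift(patch_gray))  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=VOCAB).astype(np.float32)
    return hist / max(hist.sum(), 1.0)              # L1-normalized histogram
```

The resulting histogram is the feature fed to the character classifier; how the classifier turns it into class probabilities is not detailed in this excerpt.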
80 The rectification process is error-prone due to the challenges of scene characters, including blurring and cluttered backgrounds. [sent-283, score-0.135]
81 Non-maximal suppression Since multiple MSERs may be detected for the same character [12], we perform non-maximal suppression on the set of character candidates. [sent-286, score-1.086]
82 A character candidate is suppressed if it has a significant overlap with another character candidate and the latter has a higher confidence. [sent-287, score-1.108]
83 The confidence of a character candidate is defined as the maximum of its character class probabilities. [sent-289, score-0.554]
84 The remaining character candidates are fed into the next step for word recognition. [sent-338, score-0.86]
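A minimal sketch of the suppression rule above, assuming each candidate is a tuple (x, y, w, h, probs) with probs its 62 class probabilities; the 0.5 threshold and the use of intersection-over-union as the overlap measure are assumptions.

```python
# Non-maximal suppression: a candidate is suppressed when it overlaps
# significantly with a more confident candidate (thresholds hypothetical).
import numpy as np

def confidence(cand):
    return float(np.max(cand[4]))                  # maximum class probability

def iou(a, b):
    ax, ay, aw, ah = a[:4]
    bx, by, bw, bh = b[:4]
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(aw * ah + bw * bh - inter)

def non_max_suppress(cands, thresh=0.5):
    cands = sorted(cands, key=confidence, reverse=True)
    kept = []
    for c in cands:
        # Keep c only if no more confident kept candidate overlaps it strongly.
        if all(iou(c, k) < thresh for k in kept):
            kept.append(c)
    return kept
```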
85 MSER and SIFT have been used separately for character detection and character recognition in previous works. [sent-339, score-1.07]
86 Figure 5: A sample alignment between a set of 6 character candidates (shown in yellow) and the word “PIONEER”. [sent-342, score-0.343]
87 However, to the best of our knowledge, this paper is the first attempt to combine them in a coherent way to recognize perspective characters while using only frontal training data. [sent-344, score-0.603]
88 Word recognition The recognition of perspective texts of arbitrary orientations is much more difficult than that of frontal, horizontal texts due to additional challenges. [sent-346, score-0.968]
89 With arbitrary orientation, it is difficult to distinguish characters such as ‘6’ and ‘9’, and ‘u’ and ‘n’, unless there is context information. [sent-347, score-0.325]
90 Furthermore, some characters may be hard to read (due to severe distortions) or even occluded. [sent-348, score-0.286]
91 To deal with these problems, we use a lexicon as the context information. [sent-349, score-0.17]
92 Equations (2) and (3) mean that for each word in the lexicon, we compute its optimal alignment score with the character candidates. [sent-713, score-0.261]
93 Then, among all the lexicon words, the one with the highest maximum alignment score is returned as the optimal word. [sent-715, score-0.241]
94 Ordering of character candidates Our alignment algorithm requires the character candidates to be ordered into a sequence. [sent-718, score-1.353]
95 For simplicity, we assume that text is written from left to right or from top to bottom. [sent-719, score-0.169]
96 If a word is nearer to the horizontal orientation, the character candidates are ordered by the x-coordinates; otherwise, they are ordered by the y-coordinates. [sent-720, score-1.027]
97 A word is classified as either nearer to the horizontal orientation or nearer to the vertical orientation based on the angle of the major axis of its bounding quadrilateral. [sent-722, score-0.55]
98 (For perspective words, we use quadrilaterals to mark the word locations (Section 5).) [sent-723, score-0.437]
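The ordering rule can be sketched as below, assuming the word location is a 4 × 2 array of quadrilateral corners and candidates are the (x, y, w, h, probs) tuples from the previous step; the 45-degree boundary between near-horizontal and near-vertical, and the SVD-based major axis, are assumptions.

```python
# Ordering of character candidates: sort by x for near-horizontal words,
# by y for near-vertical ones, based on the quadrilateral's major axis.
import numpy as np

def order_candidates(cands, quad):
    quad = np.asarray(quad, dtype=np.float64)
    centered = quad - quad.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)             # principal direction of corners
    dx, dy = vt[0]
    angle = abs(np.degrees(np.arctan2(dy, dx)))
    nearer_horizontal = min(angle, 180.0 - angle) < 45.0
    key = (lambda c: c[0]) if nearer_horizontal else (lambda c: c[1])
    return sorted(cands, key=key)
```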
99 A character candidate that is not aligned to any letter takes the empty label and incurs a penalty score. The purpose of the penalty score is to discourage character candidates with high confidence from taking the empty label. [sent-816, score-0.65]
100 The alignment score of the whole word is the sum of the individual alignment scores. [sent-817, score-0.307]
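Since the paper's alignment equations (2)-(3) are not reproduced in this excerpt, the following is only a plausible dynamic-programming sketch of the described behavior: each ordered candidate either matches the next letter of the lexicon word (scoring its class probability) or takes the empty label at a confidence-weighted penalty, and a letter with no candidate (e.g., occluded) may be skipped. The weights LAMBDA and SKIP are hypothetical.

```python
# A plausible DP sketch of candidate-to-word alignment (not the paper's
# exact formulation); returns the best total score for one lexicon word.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
LAMBDA, SKIP = 0.5, 1.0  # hypothetical empty-label and letter-skip penalties

def align_score(cand_probs, word):
    """cand_probs: (n_candidates, 62) class probabilities, in reading order."""
    n, m = len(cand_probs), len(word)
    letters = [ALPHABET.index(ch) for ch in word]
    dp = np.full((n + 1, m + 1), -1e9)
    dp[0, 0] = 0.0
    for j in range(1, m + 1):                      # letters with no candidate at all
        dp[0, j] = dp[0, j - 1] - SKIP
    for i in range(1, n + 1):
        conf = float(cand_probs[i - 1].max())
        for j in range(m + 1):
            best = dp[i - 1, j] - LAMBDA * conf    # candidate i takes the empty label
            if j > 0:
                best = max(best,
                           dp[i - 1, j - 1] + float(cand_probs[i - 1][letters[j - 1]]),
                           dp[i, j - 1] - SKIP)    # letter j skipped (occluded)
            dp[i, j] = best
    return dp[n, m]

def recognize(cand_probs, lexicon):
    # The lexicon word with the highest alignment score is returned.
    return max(lexicon, key=lambda w: align_score(cand_probs, w))
```

Note the confidence-weighted empty-label penalty: a high-confidence candidate pays more to be skipped, matching the stated purpose of the penalty score.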
wordName wordTfidf (topN-words)
[('character', 0.517), ('msers', 0.352), ('texts', 0.303), ('characters', 0.286), ('word', 0.236), ('perspective', 0.176), ('lexicon', 0.17), ('text', 0.169), ('mser', 0.155), ('candidates', 0.107), ('sift', 0.101), ('street', 0.097), ('frontal', 0.095), ('cropped', 0.093), ('nearer', 0.086), ('alignment', 0.071), ('collecting', 0.059), ('recognized', 0.058), ('bexe', 0.057), ('descriptormatchingusing', 0.057), ('naelidg', 0.057), ('pthlaei', 0.057), ('roat', 0.057), ('slactoerre', 0.057), ('business', 0.053), ('iag', 0.051), ('words', 0.049), ('horizontal', 0.047), ('recognize', 0.046), ('blurring', 0.042), ('eet', 0.042), ('ec', 0.042), ('names', 0.042), ('eh', 0.041), ('arbitrary', 0.039), ('ae', 0.039), ('tc', 0.039), ('ht', 0.038), ('uneven', 0.038), ('ocr', 0.038), ('candidate', 0.037), ('descriptors', 0.037), ('list', 0.037), ('lea', 0.037), ('dense', 0.036), ('recognition', 0.036), ('dc', 0.035), ('english', 0.035), ('tto', 0.035), ('bounding', 0.035), ('ha', 0.034), ('neglected', 0.034), ('ordered', 0.034), ('scene', 0.034), ('interest', 0.033), ('al', 0.033), ('rectification', 0.032), ('xb', 0.032), ('plain', 0.031), ('hl', 0.031), ('orientation', 0.03), ('viewpoint', 0.03), ('wo', 0.03), ('lf', 0.03), ('scanned', 0.03), ('eu', 0.029), ('vocabulary', 0.029), ('orientations', 0.028), ('cluttered', 0.027), ('od', 0.027), ('grid', 0.027), ('ee', 0.027), ('empty', 0.026), ('sign', 0.026), ('ate', 0.026), ('suppression', 0.026), ('el', 0.026), ('suffer', 0.026), ('refers', 0.026), ('extracted', 0.026), ('ttoa', 0.025), ('aotfe', 0.025), ('eler', 0.025), ('tnod', 0.025), ('adf', 0.025), ('quadrilaterals', 0.025), ('acu', 0.025), ('indi', 0.025), ('rch', 0.025), ('ioofn', 0.025), ('ility', 0.025), ('tohneta', 0.025), ('tna', 0.025), ('eei', 0.025), ('tfhoartm', 0.025), ('earn', 0.025), ('hdee', 0.025), ('tehceo', 0.025), ('eexr', 0.025), ('lpf', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
2 0.48748538 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-artmethods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, we introduce.
3 0.39557177 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
4 0.38833797 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using an euclidian nearestneighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
5 0.22991341 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
Author: Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to an unified representation of word images and strings, which seamlessly allows one to indistinctly perform queryby-example, where the query is an image, and query-bystring, where the query is a string. We also propose a calibration scheme to correct the attributes scores based on Canonical Correlation Analysis that greatly improves the results on a challenging dataset. We test our approach on two public datasets showing state-of-the-art results.
6 0.12761192 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
7 0.1262922 180 iccv-2013-From Where and How to What We See
9 0.10237628 166 iccv-2013-Finding Actors and Actions in Movies
10 0.074064828 44 iccv-2013-Adapting Classification Cascades to New Domains
11 0.061167944 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
12 0.057953686 450 iccv-2013-What is the Most EfficientWay to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search?
13 0.057687372 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
14 0.054181747 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
15 0.052339025 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
16 0.051429734 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
17 0.050817885 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
18 0.049879517 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
19 0.048718225 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
20 0.04841071 368 iccv-2013-SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor
topicId topicWeight
[(0, 0.141), (1, 0.033), (2, -0.018), (3, -0.06), (4, 0.053), (5, 0.066), (6, 0.037), (7, -0.041), (8, -0.076), (9, -0.02), (10, 0.422), (11, -0.135), (12, 0.168), (13, 0.111), (14, 0.031), (15, 0.087), (16, -0.078), (17, 0.166), (18, -0.232), (19, 0.095), (20, 0.113), (21, 0.087), (22, 0.042), (23, 0.017), (24, 0.057), (25, -0.043), (26, -0.048), (27, 0.089), (28, 0.037), (29, 0.009), (30, -0.01), (31, 0.02), (32, -0.019), (33, 0.012), (34, 0.005), (35, 0.024), (36, 0.015), (37, -0.021), (38, 0.07), (39, 0.009), (40, 0.014), (41, -0.018), (42, 0.007), (43, 0.003), (44, 0.011), (45, -0.011), (46, -0.005), (47, -0.002), (48, -0.004), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96827084 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
2 0.94908488 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using an euclidian nearestneighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
3 0.90660411 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
4 0.89271802 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
Author: Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang
Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and textline levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text stokes. Finally, text regions are located by simply thresholding the text-line confident map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F- , measure values are 0. 72 and 0. 73, respectively, surpassing previous methods in accuracy by a large margin.
5 0.84699422 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-artmethods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, we introduce.
6 0.69414288 180 iccv-2013-From Where and How to What We See
7 0.58837187 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
9 0.42453527 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
10 0.37438452 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
11 0.27042466 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
12 0.26470131 166 iccv-2013-Finding Actors and Actions in Movies
13 0.2628963 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
14 0.24571636 346 iccv-2013-Rectangling Stereographic Projection for Wide-Angle Image Visualization
15 0.2325474 365 iccv-2013-SIFTpack: A Compact Representation for Efficient SIFT Matching
16 0.23190038 388 iccv-2013-Shape Index Descriptors Applied to Texture-Based Galaxy Analysis
17 0.23009816 44 iccv-2013-Adapting Classification Cascades to New Domains
18 0.22747299 416 iccv-2013-The Interestingness of Images
19 0.22691853 368 iccv-2013-SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor
topicId topicWeight
[(2, 0.046), (7, 0.022), (26, 0.045), (31, 0.527), (42, 0.066), (48, 0.02), (64, 0.036), (73, 0.025), (89, 0.126)]
simIndex simValue paperId paperTitle
same-paper 1 0.87223005 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
2 0.82865965 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization
Author: Carlos Fernandez-Granda, Emmanuel J. Candès
Abstract: We present a framework to super-resolve planar regions found in urban scenes and other man-made environments by taking into account their 3D geometry. Such regions have highly structured straight edges, but this prior is challenging to exploit due to deformations induced by the projection onto the imaging plane. Our method factors out such deformations by using recently developed tools based on convex optimization to learn a transform that maps the image to a domain where its gradient has a simple group-sparse structure. This allows to obtain a novel convex regularizer that enforces global consistency constraints between the edges of the image. Computational experiments with real images show that this data-driven approach to the design of regularizers promoting transform-invariant group sparsity is very effective at high super-resolution factors. We view our approach as complementary to most recent superresolution methods, which tend to focus on hallucinating high-frequency textures.
3 0.80007845 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
Author: Dahua Lin, Jianxiong Xiao
Abstract: In this paper, we develop a generative model to describe the layouts of outdoor scenes the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts. We demonstrate the practical utility of the proposed model by testing it on scene classification, semantic segmentation, and layout hallucination. –
4 0.77504647 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
Author: Deyu Meng, Fernando De_La_Torre
Abstract: Many problems in computer vision can be posed as recovering a low-dimensional subspace from highdimensional visual data. Factorization approaches to lowrank subspace estimation minimize a loss function between an observed measurement matrix and a bilinear factorization. Most popular loss functions include the L2 and L1 losses. L2 is optimal for Gaussian noise, while L1 is for Laplacian distributed noise. However, real data is often corrupted by an unknown noise distribution, which is unlikely to be purely Gaussian or Laplacian. To address this problem, this paper proposes a low-rank matrix factorization problem with a Mixture of Gaussians (MoG) noise model. The MoG model is a universal approximator for any continuous distribution, and hence is able to model a wider range of noise distributions. The parameters of the MoG model can be estimated with a maximum likelihood method, while the subspace is computed with standard approaches. We illustrate the benefits of our approach in extensive syn- thetic and real-world experiments including structure from motion, face modeling and background subtraction.
5 0.77406991 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton ” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe theproperties ofbeing compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness ofapplying the learned actons in our two-layer structure, and show the state-ofthe-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
6 0.67212147 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting
7 0.62674737 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
8 0.57270825 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
9 0.56253886 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.53467047 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
11 0.53383809 210 iccv-2013-Image Retrieval Using Textual Cues
12 0.52746075 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
13 0.50706679 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
14 0.50188005 180 iccv-2013-From Where and How to What We See
15 0.49533901 173 iccv-2013-Fluttering Pattern Generation Using Modified Legendre Sequence for Coded Exposure Imaging
16 0.48452514 19 iccv-2013-A Learning-Based Approach to Reduce JPEG Artifacts in Image Matting
17 0.48373348 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
18 0.4767648 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors
19 0.46927485 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
20 0.46787351 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions