iccv iccv2013 iccv2013-192 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform either query-by-example, where the query is an image, or query-by-string, where the query is a string. We also propose a calibration scheme based on Canonical Correlation Analysis that corrects the attribute scores and greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. [sent-2, score-0.474]
2 This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform either query-by-example, where the query is an image, or query-by-string, where the query is a string. [sent-3, score-0.62]
3 We also propose a calibration scheme to correct the attributes scores based on Canonical Correlation Analysis that greatly improves the results on a challenging dataset. [sent-4, score-0.383]
4 Introduction This paper addresses the problem of multi-writer word spotting. [sent-7, score-0.437]
5 The objective of word spotting is to find all instances of a given word in a potentially large dataset of document images. [sent-8, score-1.12]
6 This is typically done in a query-by-example (QBE) scenario, where the query is an image of a handwritten word and it is assumed that the transcriptions of the dataset images and of the query word are not available. [sent-9, score-1.252]
7 In a multi-writer setting, the writers of the dataset documents may have completely different writing styles than the writer of the query. [sent-10, score-0.367]
8 There is a very large intra-class variability: different writers may have completely different writing styles, making the same word look completely different (cf. [sent-18, score-0.634]
9 This huge variability in styles makes this a much more difficult problem than typeset or single-writer handwritten word spotting. [sent-21, score-0.636]
10 Because of this complexity, most popular techniques are based on describing word images as sequences of features of variable length and using techniques such as Dynamic Time Warping (DTW) or Hidden Markov Models (HMM) to classify them. [sent-22, score-0.437]
11 Variable-length features are more flexible than feature vectors and have been known to lead to superior results in difficult word-spotting tasks, since they can adapt better to the different variations of style and word length [7, 9, 22, 24, 25]. [sent-23, score-0.476]
12 Indeed, with the steady increase in dataset sizes there has been a renewed interest in compact, fast-to-compare word representations. [sent-31, score-0.437]
13 [28], where word images are represented with SIFT descriptors aggregated using the bag of visual words framework [4], or the work of Almazán et al. [sent-33, score-0.619]
14 In particular, we adopt the Fisher vector (FV) [29] representation computed over SIFT descriptors extracted densely from the word image. [sent-38, score-0.504]
15 In this paper we propose to use labeled training data to learn how to embed our fixed-length descriptor in a more discriminative, low-dimensional space, where similarities between words are preserved independently of the writing style. [sent-42, score-0.268]
16 Indeed, we believe that learning robust models at the word level is an extremely difficult task due to the intrinsic variation of writing styles, and its adaptation to new, unseen words at test time usually yields poor results. [sent-44, score-0.656]
17 The use of attributes is, arguably, the most popular approach to achieve these goals. [sent-46, score-0.182]
18 As our first contribution, we propose an embedding approach that encodes word strings as a pyramidal histogram of characters, which we dub PHOC, inspired by the bag-of-characters string kernels used, for example, in the machine learning and biocomputing communities [16, 17]. [sent-47, score-0.934]
19 In a nutshell, this binary histogram encodes whether a particular character appears in the represented word or not. [sent-48, score-0.603]
20 E.g., this character appears in the first half of the word, or this character appears in the last quarter of the word (see Fig. [sent-51, score-0.713]
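To make this concrete, below is a minimal Python sketch (a hypothetical helper, not the authors' code) of the first two levels of such a histogram; for simplicity it assigns each character to a half by the midpoint of its position, whereas the paper uses the normalized occupancy intervals defined in Section 2 (sketched further below).

```python
import string

ALPHABET = string.ascii_lowercase  # the 26-letter English alphabet

def phoc_levels_1_2(word):
    """Levels 1 and 2 of a PHOC-style binary histogram: 'word contains c'
    plus 'c appears in the first/second half'. Assumes lowercase a-z input."""
    word = word.lower()
    n = len(word)
    h = [0] * (3 * 26)
    for k, c in enumerate(word):
        i = ALPHABET.index(c)
        h[i] = 1  # level 1: the word contains character c
        # Simplified half assignment by position midpoint; the paper uses
        # normalized occupancy intervals instead (see Section 2).
        if (k + 0.5) / n < 0.5:
            h[26 + i] = 1   # first half
        else:
            h[52 + i] = 1   # second half
    return h

assert len(phoc_levels_1_2("place")) == 78
assert sum(phoc_levels_1_2("place")[:26]) == 5  # five distinct characters
```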
21 During the learning of these attributes we use a wide variety of writers and characters, and so the adaptation to new, unseen words is almost seamless. [sent-56, score-0.374]
22 A naive implementation of this attributes representation greatly outperforms the direct use of FVs. [sent-57, score-0.219]
23 We found that accurately calibrating the attribute scores can have a large impact on the accuracy of the method. [sent-63, score-0.261]
24 We believe that calibrating the scores jointly can lead to large improvements, since the information of different attributes is shared. [sent-65, score-0.353]
25 This is particularly true in the case of pyramidal histograms, where the same character may be simultaneously represented by various attributes depending on its position inside the word. [sent-66, score-0.358]
26 This motivates our second contribution, a scheme to calibrate all the attribute scores jointly by means of Canonical Correlation Analysis (CCA) and its kernelized version (KCCA), where the main idea is to correlate the predicted attribute scores with their ground truth values. [sent-67, score-0.442]
27 This calibration method can noticeably outperform the standard Platt's scaling while, at the same time, performing a dimensionality reduction of the attribute space. [sent-68, score-0.232]
28 We believe that the uses of this calibration scheme are not limited to word image representation and can also be used in other attributebased tasks. [sent-69, score-0.616]
29 Like in [23], we learn a joint representation for word images and text. [sent-71, score-0.505]
30 In Section 2 we review the literature on fixed-length representations for word images and describe our baseline FV representation as well as the proposed attributes-based representation. [sent-74, score-0.528]
31 Word Representation In this section we describe how we obtain the representation of a word image. [sent-79, score-0.474]
32 First we review fixed-length word image representations and introduce the FV as our reference representation. [sent-80, score-0.491]
33 In [18], a distance between binary word images is defined based on the result of XORing the images. [sent-87, score-0.437]
34 This representation has a fixed length and can be used for efficient spotting tasks, although the paper focuses on only 10 different keywords. [sent-91, score-0.283]
35 These fast-to-compare representations allow them to perform word spotting using a sliding window over the whole document without segmenting it into individual words. [sent-97, score-0.773]
36 Here we adopt a similar approach and represent word images using the FV framework. [sent-98, score-0.437]
37 To (weakly) capture the structure of the word image, we use a spatial pyramid of 2 × 6, leading to a final descriptor of approximately 25,000 dimensions. [sent-101, score-0.493]
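As a rough illustration of this encoding step, the sketch below (our own naming, a simplification rather than the paper's exact pipeline) computes the FV of a set of local descriptors under a diagonal-covariance GMM, with the power and L2 normalizations commonly used with FVs; in the full pipeline one such vector would be computed per spatial-pyramid cell and the results concatenated.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """FV of local descriptors X (N x d) under a diagonal-covariance GMM
    with K components given by weights (K,), means (K, d), sigmas (K, d)."""
    N, d = X.shape
    K = len(weights)
    # Posteriors gamma (N x K) from the log Gaussian densities.
    logp = np.stack([
        np.log(weights[k]) - 0.5 * np.sum(
            ((X - means[k]) / sigmas[k]) ** 2
            + np.log(2 * np.pi * sigmas[k] ** 2), axis=1)
        for k in range(K)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    parts = []
    for k in range(K):
        g = gamma[:, k:k + 1]
        u = (X - means[k]) / sigmas[k]
        # Gradients w.r.t. the means and the standard deviations.
        parts.append((g * u).sum(axis=0) / (N * np.sqrt(weights[k])))
        parts.append((g * (u ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * weights[k])))
    fv = np.concatenate(parts)                 # length 2 * K * d
    fv = np.sign(fv) * np.sqrt(np.abs(fv))     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)   # L2 normalization
```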
38 Supervised Word Representation with PHOC Attributes One of the most popular approaches to perform supervised learning for word spotting is to learn models for particular keywords. [sent-104, score-0.714]
39 At test time, it is possible to compute the probability of a given word being generated by that keyword model, and that can be used as a score. [sent-106, score-0.489]
40 One disadvantage of these approaches that learn at the word level is that information is not shared between similar words. [sent-113, score-0.468]
41 We believe that sharing information between words is extremely important to learn good discriminative representations, and that the use of attributes is one way to achieve this goal. [sent-115, score-0.338]
42 The selection of these attributes is commonly a task-dependent process, so for their application to word spotting we should define them as word-discriminative and writer-independent properties. [sent-118, score-0.899]
43 We define attributes such as "word contains an a" or "word contains a k", leading to a histogram of 26 dimensions when using the English alphabet. [sent-120, score-0.238]
44 Then, at training time, we learn models for each of the attributes using the image representation of the words (FVs in our case) as data, and set their labels as positive or negative according to whether those images contain that particular character or not (see Figure 2). [sent-121, score-0.477]
45 Then, at testing time, given the FV of a word, we can compute its attribute representation simply by concatenating the scores that those models yield on that particular sample. [sent-123, score-0.258]
46 After a calibration of the scores (e.g., Platt's scaling), these attribute representations can be compared using measures such as the Euclidean distance or the cosine similarity. [sent-126, score-0.18]
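A minimal sketch of this training and calibration stage (hypothetical names and data; the paper trains with an SGD solver, whereas scikit-learn's LinearSVC with sigmoid calibration stands in here for the SVM and Platt's scaling):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def train_attribute_models(fvs, phocs):
    """One calibrated one-vs-rest linear classifier per PHOC attribute.
    fvs: (N, D) Fisher vectors; phocs: (N, A) binary attribute labels.
    Assumes both labels occur for every attribute column."""
    models = []
    for a in range(phocs.shape[1]):
        base = LinearSVC(C=1.0)  # C would be validated on held-out data
        clf = CalibratedClassifierCV(base, method="sigmoid", cv=3)  # Platt
        clf.fit(fvs, phocs[:, a])
        models.append(clf)
    return models

def attribute_representation(models, fv):
    """Concatenate the calibrated attribute scores for one word image."""
    fv = np.asarray(fv).reshape(1, -1)
    return np.array([m.predict_proba(fv)[0, 1] for m in models])
```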
47 Figure 2 (caption fragment): labeled word images are used together with the PHOC representation as label. [sent-131, score-0.474]
48 At level 2, we define attributes such as "word contains character x in the first half of the word" and "word contains character x in the second half of the word". [sent-135, score-0.458]
49 Level 3 splits the word into 3 parts, level 4 into 4, etc. [sent-136, score-0.437]
50 Using levels 2, 3, and 4 yields (2 + 3 + 4) × 26 = 234 attributes. Finally, we also add the 75 most common English bigrams at level 2, leading to 150 extra attributes for a total of 384 attributes. [sent-138, score-0.279]
51 In the context of attributes-based representations, the spatially-aware attributes allow one to ask more precise questions about the location of the characters, while the spatial pyramid on the image representation allows one to answer those questions. [sent-140, score-0.247]
52 Given a transcription of a word we need to determine the regions of the pyramid where we assign each character. [sent-141, score-0.564]
53 For that, we first define the normalized occupancy of the k-th character of a word of length n as the interval Occ(k, n) = [k/n, (k+1)/n], where the position k is zero-based. [sent-142, score-0.605]
54 Note that this information is extracted from the word transcription, not from the word image. [sent-143, score-0.874]
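The following sketch (our own helper names) computes the normalized occupancy and assigns characters to pyramid regions; the 50%-overlap assignment criterion is an assumption on our part, since the exact rule is not spelled out here:

```python
def occupancy(k, n):
    """Normalized occupancy Occ(k, n) = [k/n, (k+1)/n] of the k-th
    character (zero-based) of a word of length n."""
    return (k / n, (k + 1) / n)

def region(level, r):
    """The r-th region (zero-based) at a given pyramid level."""
    return (r / level, (r + 1) / level)

def overlap(a, b):
    """Length of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_characters(word, level):
    """Characters assigned to each region at one pyramid level. Assumed
    rule: a character belongs to a region when the overlap covers at
    least half of the character's own occupancy interval."""
    n = len(word)
    regions = [[] for _ in range(level)]
    for k, c in enumerate(word):
        occ = occupancy(k, n)
        for r in range(level):
            if overlap(occ, region(level, r)) >= 0.5 * (occ[1] - occ[0]):
                regions[r].append(c)
    return regions

print(assign_characters("beyond", 2))  # [['b', 'e', 'y'], ['o', 'n', 'd']]
```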
55 Indeed, assuming perfect attribute models and calibration, both representations would be identical. [sent-157, score-0.18]
56 This leads to a very clean model to perform query-by-string (QBS, sometimes referred to as query-by-text or QBT), where, instead of having a word image as a query, we have its transcription. [sent-158, score-0.437]
57 Since attribute scores and PHOCs lie in the same space, we can simply compute the PHOC representation of the text and directly compare it against the dataset word images represented with attribute scores. [sent-159, score-0.851]
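A sketch of this QBS comparison, where build_phoc is a hypothetical PHOC builder such as the ones sketched above:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def query_by_string(query_string, dataset_scores, build_phoc):
    """Rank dataset word images for a text query: the binary PHOC of the
    transcription is compared directly against the attribute scores."""
    q = np.asarray(build_phoc(query_string), dtype=float)
    sims = np.array([cosine(q, s) for s in dataset_scores])
    return np.argsort(-sims)  # dataset indices, best match first
```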
58 To the best of our knowledge, we are the first to provide a unified framework for performing OOV QBE and QBS, and for querying text datasets using word images without an OCR transcription of the query word. [sent-161, score-0.712]
59 Calibration of scores In the previous section we presented an attributes-based representation of the word images. [sent-163, score-0.569]
60 Although this representation is writer-independent, special care must be taken when comparing different words, since the scores of one attribute may dominate over the scores of other attributes. [sent-164, score-0.353]
61 Therefore, some calibration of the attribute scores is necessary. [sent-165, score-0.327]
62 This is particularly true when performing QBS, since otherwise attribute scores are not comparable to the binary PHOC representations. [sent-166, score-0.221]
63 Although the similarity measure involves all the attributes, the calibration of each attribute is done individually. [sent-170, score-0.232]
64 Here we propose to perform the calibration of the scores jointly (see Figure 3). [sent-171, score-0.296]
65 Figure 3 (caption): projection of predicted attribute scores and attribute ground truth into a more correlated subspace with CCA. [sent-172, score-0.403]
66 To achieve this goal, we make use of Canonical Correlation Analysis to embed the attribute scores and the binary attributes in a common subspace where they are maximally correlated (Fig. [sent-174, score-0.457]
67 Let Wb = {wb1, . . . , wbk} be the projection vectors that project the attributes B into the k-dimensional common subspace. [sent-208, score-0.182]
68 This CCA embedding can be seen as a way to exploit the correlation between different attributes to correct the scores predicted by the model. [sent-214, score-0.338]
69 Furthermore, after CCA the attribute scores and binary attributes lie in a more correlated space, which makes the comparison between the scores and the PHOCs for our QBS problem more principled. [sent-215, score-0.498]
70 One may also note that the relation between the attribute scores and the binary attributes may not be linear, and that a kernelized CCA could yield larger improvements. [sent-218, score-0.403]
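A sketch of both variants using scikit-learn, with several stand-ins of our own: the library's unregularized CCA replaces the regularized CCA of the paper, random Fourier features approximate the kernelized version, and the subspace dimension, bandwidth, and number of projections are placeholders to be validated:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.kernel_approximation import RBFSampler

def l2n(X):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

def fit_cca(scores, phocs, k=96):
    """Linear CCA between predicted attribute scores (N, A) and binary
    PHOCs (N, A); both views are L2-normalized, centering is internal."""
    cca = CCA(n_components=k, max_iter=1000)
    return cca.fit(l2n(scores), l2n(phocs.astype(float)))

def embed_scores(cca, scores):
    """Project new attribute scores; compare embeddings with the cosine."""
    return l2n(cca.transform(l2n(scores)))

def fit_kcca(scores, phocs, gamma=0.5, n_rff=2000, k=96):
    """Approximate KCCA: lift both views with random Fourier features,
    then run linear CCA in the lifted space."""
    rff_s = RBFSampler(gamma=gamma, n_components=n_rff, random_state=0)
    rff_p = RBFSampler(gamma=gamma, n_components=n_rff, random_state=1)
    Zs = rff_s.fit_transform(l2n(scores))
    Zp = rff_p.fit_transform(l2n(phocs.astype(float)))
    return rff_s, rff_p, CCA(n_components=k, max_iter=1000).fit(Zs, Zp)
```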
71 The document images are annotated at word level and contain the transcriptions of more than 115,000 words. [sent-224, score-0.473]
72 Images are also annotated at word level and contain approximately 5,000 words. [sent-227, score-0.437]
73 When computing the attribute representation, we use levels 2, 3, and 4, as well as 75 common bigrams at level 2, leading to 384 dimensions. [sent-245, score-0.223]
74 When learning and projecting with CCA and KCCA, the representations (both attribute scores and PHOCs) are first L2-normalized and mean centered. [sent-246, score-0.262]
75 We used the first partition to learn the attributes representation, the second partition to learn the calibration as well as for validation purposes, and the third partition for testing purposes. [sent-253, score-0.428]
76 We use the “calibration” partition to validate the parameters of the attribute classifiers, and a small subset of it to validate the calibration (the regularization ρ for CCA, plus the bandwidth γ and the number of random projections for KCCA). [sent-254, score-0.307]
77 To train the attributes we use a one-versus-rest linear SVM with a SGD solver inspired in the implementation of L. [sent-256, score-0.182]
78 At testing time, we use each word of the test dataset as a query and use it to rank the rest of the dataset using the cosine similarity between representations. [sent-258, score-0.51]
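A sketch of this evaluation protocol (assumptions: representations are L2-normalized so the dot product equals the cosine similarity, and queries with no relevant item contribute zero AP, a convention that may differ from the paper's):

```python
import numpy as np

def average_precision(ranked_labels):
    """AP of a single query given binary relevance in ranked order."""
    rel = np.asarray(ranked_labels, dtype=float)
    if rel.sum() == 0:
        return 0.0  # assumed convention for queries with no relevant item
    prec_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float((prec_at_hits * rel).sum() / rel.sum())

def qbe_mean_ap(reps, transcriptions):
    """Leave-one-out QBE: every test word queries the rest of the set,
    ranked by cosine similarity (reps assumed L2-normalized, one row each)."""
    sims = reps @ reps.T
    trans = np.asarray(transcriptions)
    aps = []
    for q in range(len(reps)):
        s = np.delete(sims[q], q)                 # drop the query itself
        labels = np.delete(trans == trans[q], q)  # same transcription = relevant
        aps.append(average_precision(labels[np.argsort(-s)]))
    return float(np.mean(aps))
```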
79 Since there is no clear writer separation, we split it at word level. [sent-265, score-0.498]
80 First, due to the simplicity of GW, the attribute scores are already very good (notice the 70% QBE mAP with no calibration at all, compared to the 34% on IAM), and so they may not require a complex calibration. [sent-282, score-0.327]
81 It is also interesting to check how the learning performed on the IAM dataset (where all the writers had a "modern" writing style) adapts to a dataset with a very different (250-year-old) calligraphic style. [sent-284, score-0.197]
82 We learn the attributes, the Platt's scaling weights, and the CCA and KCCA projections on the IAM dataset as before, and apply them directly to the FVs extracted from the GW dataset. [sent-285, score-0.262]
83 This is not surprising due to the large differences in style, but also because the attributes learned on the GW are specialized to that particular writing style and so perform better when only that style is present at test time. [sent-288, score-0.354]
84 The results on QBS show that indeed we are learning attributes correctly and not simply projecting on a different space completely uncorrelated with the transcription of the words. [sent-290, score-0.307]
85 The character HMM seems to perform well on the IAM dataset, precisely because it exploits the relations between characters of different words during training. [sent-319, score-0.305]
86 By contrast, comparing the same queries using our attributes embedded with CCA took less than 3 seconds on the same machine. [sent-324, score-0.261]
87 Note, however, that this comparison must be made with caution: although we use similar set partitions for both the GW and IAM datasets, [9] performs line spotting rather than word spotting. [sent-327, score-0.729]
88 On IAM, Frinken [9] reports a 79% mAP using as queries the most common non-stop words appearing in the training set, while we obtain 71% using all the non-stop words appearing in the test set, whether or not they appear in the training set. [sent-330, score-0.232]
89 Finally, in Figure 4 we show some qualitative results on the IAM dataset, where we observe that words whose styles differ greatly from the query are still retrieved successfully. [sent-333, score-0.228]
90 Conclusions This paper proposes a method for multi-writer word spotting in handwritten documents. [sent-335, score-0.816]
91 We show how an attributes-based approach based on a pyramidal histogram of characters can be used to learn how to embed the word images in a more discriminative space, where the similarity between words is independent of the writing style. [sent-336, score-0.849]
92 This attributes representation leads to a unified representation of word images and strings, resulting in a method that allows one to perform either query-by-example or query-by-string searches. [sent-337, score-0.693]
93 We show how to jointly calibrate all the attribute scores by means of CCA and KCCA, outperforming standard calibration methods. [sent-338, score-0.383]
94 We compare our method on two public datasets, outperforming state-of-the-art approaches and showing that the proposed attribute-based representation is well-suited for word searches, whether they are images or strings, in handwritten documents. [sent-339, score-0.607]
95 In the future we plan to explore the use of these approaches in a segmentation-free context, where the word images are not segmented. [sent-341, score-0.437]
96 HMM-based word spotting in handwritten documents using subword models. [sent-390, score-0.859]
97 A novel word spotting method based on recurrent neural networks. [sent-404, score-0.683]
98 Local gradient histogram features for word spotting in unconstrained handwritten documents. [sent-493, score-0.844]
99 Browsing heterogeneous document collections by a segmentation-free word spotting method. [sent-515, score-0.719]
100 Offline cursive word recognition using continuous density hidden Markov models trained with PCA or ICA features. [sent-551, score-0.471]
wordName wordTfidf (topN-words)
[('word', 0.437), ('iam', 0.396), ('spotting', 0.246), ('cca', 0.225), ('gw', 0.19), ('attributes', 0.182), ('qbs', 0.172), ('fv', 0.161), ('kcca', 0.153), ('character', 0.138), ('phoc', 0.138), ('platts', 0.138), ('handwritten', 0.133), ('attribute', 0.126), ('qbe', 0.12), ('calibration', 0.106), ('frinken', 0.103), ('writers', 0.103), ('transcription', 0.099), ('scores', 0.095), ('writing', 0.094), ('words', 0.089), ('hmm', 0.089), ('string', 0.088), ('almaz', 0.086), ('phocs', 0.086), ('characters', 0.078), ('query', 0.073), ('bigrams', 0.069), ('rusi', 0.069), ('rodr', 0.067), ('styles', 0.066), ('writer', 0.061), ('queries', 0.054), ('representations', 0.054), ('embed', 0.054), ('strings', 0.053), ('manmatha', 0.053), ('keyword', 0.052), ('forn', 0.052), ('keywords', 0.05), ('projections', 0.049), ('dtw', 0.047), ('partitions', 0.046), ('fisher', 0.046), ('oov', 0.046), ('documents', 0.043), ('fischer', 0.042), ('english', 0.042), ('bag', 0.04), ('calibrating', 0.04), ('style', 0.039), ('pyramidal', 0.038), ('representation', 0.037), ('gmm', 0.036), ('document', 0.036), ('believe', 0.036), ('wb', 0.035), ('historical', 0.035), ('biocomputing', 0.034), ('cursive', 0.034), ('gatos', 0.034), ('ijdar', 0.034), ('keaton', 0.034), ('listen', 0.034), ('rath', 0.034), ('silent', 0.034), ('stopwords', 0.034), ('wak', 0.034), ('worddiscriminative', 0.034), ('retrieval', 0.033), ('ol', 0.033), ('embedding', 0.033), ('icdar', 0.032), ('learn', 0.031), ('rff', 0.031), ('vinciarelli', 0.031), ('descriptors', 0.03), ('occupancy', 0.03), ('text', 0.03), ('pyramid', 0.028), ('histogram', 0.028), ('xyz', 0.028), ('slant', 0.028), ('occ', 0.028), ('correlation', 0.028), ('leading', 0.028), ('dubbed', 0.027), ('gordo', 0.027), ('fvs', 0.027), ('george', 0.027), ('partition', 0.026), ('projecting', 0.026), ('keller', 0.025), ('embedded', 0.025), ('barcelona', 0.024), ('canonical', 0.024), ('pca', 0.024), ('wa', 0.024), ('aggregated', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
Author: Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform either query-by-example, where the query is an image, or query-by-string, where the query is a string. We also propose a calibration scheme based on Canonical Correlation Analysis that corrects the attribute scores and greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
2 0.24440426 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.22991341 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
4 0.17677322 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
Author: Xiaoyang Wang, Qiang Ji
Abstract: This paper proposes a unified probabilistic model to model the relationships between attributes and objects for attribute prediction and object recognition. As a list of semantically meaningful properties of objects, attributes generally relate to each other statistically. In this paper, we propose a unified probabilistic model to automatically discover and capture both the object-dependent and objectindependent attribute relationships. The model utilizes the captured relationships to benefit both attribute prediction and object recognition. Experiments on four benchmark attribute datasets demonstrate the effectiveness of the proposed unified model for improving attribute prediction as well as object recognition in both standard and zero-shot learning cases.
5 0.16526024 52 iccv-2013-Attribute Adaptation for Personalized Image Search
Author: Adriana Kovashka, Kristen Grauman
Abstract: Current methods learn monolithic attribute predictors, with the assumption that a single model is sufficient to reflect human understanding of a visual attribute. However, in reality, humans vary in how they perceive the association between a named property and image content. For example, two people may have slightly different internal models for what makes a shoe look “formal”, or they may disagree on which of two scenes looks “more cluttered”. Rather than discount these differences as noise, we propose to learn user-specific attribute models. We adapt a generic model trained with annotations from multiple users, tailoring it to satisfy user-specific labels. Furthermore, we propose novel techniques to infer user-specific labels based on transitivity and contradictions in the user’s search history. We demonstrate that adapted attributes improve accuracy over both existing monolithic models as well as models that learn from scratch with user-specific data alone. In addition, we show how adapted attributes are useful to personalize image search, whether with binary or relative attributes.
6 0.15329808 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
7 0.12537447 53 iccv-2013-Attribute Dominance: What Pops Out?
8 0.12511985 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.11736801 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
11 0.11525264 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
12 0.11428449 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
13 0.10753414 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
14 0.10561946 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
15 0.097600006 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
16 0.095019951 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
17 0.090601876 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
18 0.090390407 238 iccv-2013-Learning Graphs to Match
19 0.089151762 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation
20 0.088085338 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
topicId topicWeight
[(0, 0.153), (1, 0.13), (2, -0.048), (3, -0.127), (4, 0.088), (5, 0.073), (6, -0.031), (7, -0.117), (8, 0.062), (9, 0.073), (10, 0.205), (11, -0.064), (12, 0.101), (13, 0.072), (14, -0.019), (15, 0.03), (16, 0.003), (17, 0.065), (18, -0.066), (19, -0.001), (20, 0.055), (21, 0.052), (22, 0.012), (23, 0.001), (24, -0.013), (25, 0.001), (26, -0.009), (27, 0.029), (28, 0.029), (29, 0.078), (30, 0.017), (31, 0.015), (32, -0.026), (33, 0.015), (34, -0.03), (35, -0.019), (36, -0.003), (37, -0.081), (38, 0.02), (39, 0.015), (40, 0.024), (41, -0.011), (42, -0.044), (43, 0.063), (44, -0.028), (45, -0.01), (46, -0.002), (47, -0.001), (48, 0.046), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.94741172 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
Author: Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform either query-by-example, where the query is an image, or query-by-string, where the query is a string. We also propose a calibration scheme based on Canonical Correlation Analysis that corrects the attribute scores and greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
2 0.74842674 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.74272293 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
4 0.67689681 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
5 0.66514772 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
6 0.62417305 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
8 0.61756194 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
9 0.60511053 53 iccv-2013-Attribute Dominance: What Pops Out?
10 0.5967958 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
11 0.58361095 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
12 0.57739091 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
13 0.57330704 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
14 0.55287981 52 iccv-2013-Attribute Adaptation for Personalized Image Search
15 0.55173624 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context
16 0.5286026 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
17 0.51438981 180 iccv-2013-From Where and How to What We See
18 0.49803722 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
19 0.48550707 419 iccv-2013-To Aggregate or Not to aggregate: Selective Match Kernels for Image Search
20 0.48360041 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
topicId topicWeight
[(2, 0.078), (7, 0.028), (12, 0.017), (13, 0.025), (26, 0.057), (31, 0.09), (34, 0.029), (35, 0.012), (40, 0.016), (42, 0.071), (64, 0.041), (72, 0.282), (73, 0.017), (89, 0.136), (98, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.70942354 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
Author: Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform either query-by-example, where the query is an image, or query-by-string, where the query is a string. We also propose a calibration scheme based on Canonical Correlation Analysis that corrects the attribute scores and greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
2 0.62253106 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution
Author: Martin Kiechle, Simon Hawe, Martin Kleinsteuber
Abstract: High-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis model, which is able to capture the interdependency of registered intensity and depth information. This model is based on the assumption that the co-supports of corresponding bimodal image structures are aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exists, and we propose a method for learning them from a set of registered training signals. This learning process is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes. We use this to exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which leads to an efficient algorithm for depth map super-resolution.
3 0.57704163 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level "acton" representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
4 0.57573283 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
Author: Mandar Dixit, Nikhil Rasiwasia, Nuno Vasconcelos
Abstract: An extension of the latent Dirichlet allocation (LDA), denoted class-specific-simplex LDA (css-LDA), is proposed for image classification. An analysis of the supervised LDA models currently used for this task shows that the impact of class information on the topics discovered by these models is very weak in general. This implies that the discovered topics are driven by general image regularities, rather than the semantic regularities of interest for classification. To address this, we introduce a model that induces supervision in topic discovery, while retaining the original flexibility of LDA to account for unanticipated structures of interest. The proposed css-LDA is an LDA model with class supervision at the level of image features. In css-LDA topics are discovered per class, i.e. a single set of topics shared across classes is replaced by multiple class-specific topic sets. This model can be used for generative classification using the Bayes decision rule or even extended to discriminative classification with support vector machines (SVMs). A css-LDA model can endow an image with a vector of class and topic specific count statistics that are similar to the Bag-of-words (BoW) histogram. SVM-based discriminants can be learned for classes in the space of these histograms. The effectiveness of css-LDA model in both generative and discriminative classification frameworks is demonstrated through an extensive experimental evaluation, involving multiple benchmark datasets, where it is shown to outperform all existing LDA based image classification approaches.
5 0.57527721 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
Author: Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, Nigel Crook
Abstract: Detecting visually salient regions in images is one of the fundamental problems in computer vision. We propose a novel method to decompose an image into large scale perceptually homogeneous elements for efficient salient region detection, using a soft image abstraction representation. By considering both appearance similarity and spatial distribution of image pixels, the proposed representation abstracts out unnecessary image details, allowing the assignment of comparable saliency values across similar regions, and producing perceptually accurate salient region detection. We evaluate our salient region detection approach on the largest publicly available dataset with pixel accurate annotations. The experimental results show that the proposed method outperforms 18 alternate methods, reducing the mean absolute error by 25.2% compared to the previous best result, while being computationally more efficient.
6 0.57296622 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
7 0.5713197 180 iccv-2013-From Where and How to What We See
8 0.56980437 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting
9 0.56864172 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
10 0.56790459 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
11 0.56658757 210 iccv-2013-Image Retrieval Using Textual Cues
12 0.56557441 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
13 0.56277359 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
14 0.55878174 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
15 0.55773163 349 iccv-2013-Regionlets for Generic Object Detection
16 0.55713183 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
17 0.55660045 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
18 0.55556405 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
19 0.55554152 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors
20 0.55539334 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition