cvpr cvpr2013 cvpr2013-382 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, Zhong Zhang
Abstract: Scene text recognition has inspired great interest from the computer vision community in recent years. In this paper, we propose a novel scene text recognition method using part-based tree-structured character detection. Different from the conventional multi-scale sliding window character detection strategy, which does not make use of character-specific structure information, we use a part-based tree structure to model each type of character so as to detect and recognize the characters at the same time. For word recognition, we build a Conditional Random Field model on the potential character locations to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework. The final word recognition result is obtained by minimizing the cost function defined on the random field. Experimental results on a range of challenging public datasets (ICDAR 2003, ICDAR 2011, SVT) demonstrate that the proposed method significantly outperforms state-of-the-art methods for both character detection and word recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose a novel scene text recognition method using part-based tree-structured character detection. [sent-10, score-0.94]
2 Different from conventional multi-scale sliding window character detection strategy, which does not make use of the character-specific structure information, we use part-based tree-structure to model each type of character so as to detect and recognize the characters at the same time. [sent-11, score-2.069]
3 For word recognition, we build a Conditional Random Field model on the potential character locations to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework. [sent-12, score-1.177]
4 The final word recognition result is obtained by minimizing the cost function defined on the random field. [sent-13, score-0.302]
5 Experimental results on a range of challenging public datasets (ICDAR 2003, ICDAR 2011, SVT) demonstrate that the proposed method significantly outperforms state-of-the-art methods for both character detection and word recognition. [sent-14, score-0.97]
6 [7], given an image containing text and other objects, viewers tend to fixate on text, suggesting the importance of text to humans. [sent-19, score-0.288]
7 In fact, text recognition is indispensable for a lot of applications such as automatic sign reading, language translation, navigation and so on. [sent-20, score-0.337]
8 Most of the previous work on scene text recognition could be roughly classified into two categories: tradition- Figure 1. [sent-22, score-0.265]
9 Given a text image, we first use tree-structured models to get the character detection results, based on which we get the potential character locations. [sent-24, score-1.703]
10 Character detection scores are used to define the unary cost and language model is used to define the pairwise cost. [sent-26, score-0.393]
11 We finally infer each label of the node and the word by minimizing the cost function. [sent-27, score-0.291]
12 However, since text in natural images differs from text in traditional scanned documents in terms of resolution, illumination conditions, size and font style, the binarization result is usually unsatisfactory. [sent-30, score-0.43]
13 Moreover, the loss of information during the binarization process is almost unrecoverable, which means if the binarization result is poor, the chance of correctly recognizing the text is quite small. [sent-31, score-0.387]
14 On the other hand, object recognition based methods assume that scene character recognition is quite simi- Figure 2. [sent-33, score-0.872]
15 For scene character recognition, these methods [4, 13, 19, 18] directly extract features from original image and use various classifiers to recognize the character. [sent-37, score-0.839]
16 While for scene text recognition, since there are no binarization and segmentation stages, most existing methods [19, 18, 11, 10] adopt multi-scale sliding window strategy to get the candidate character detection results. [sent-38, score-1.255]
17 Thus, these methods heavily rely on the postprocessing methods such as pictorial structures [19, 18] or CRF [11, 10] to choose the final word from piles of candidate detections. [sent-40, score-0.348]
18 When humans try to recognize scene characters with distortions and complex background, the detection of the character from complex background and the recognition of the character are somehow interdependent. [sent-41, score-1.925]
19 On one hand, the unique structure of each character helps us to detect the characters from complex background and on the other hand, detecting the character-specific structure from complex background also helps us to recognize the character. [sent-42, score-1.198]
20 In other words, humans naturally combine detection and recognition together when recognizing characters from scene images. [sent-43, score-0.423]
21 Thus, in this paper, we try to imitate human perceptual ability and propose to recognize characters by detecting character-specific part-based structures, which seamlessly combine detection and recognition together. [sent-44, score-0.523]
22 To recognize the scene text, we build the CRF model on the potential character locations. [sent-46, score-0.926]
23 Character detection scores, spatial constraints and linguistic knowledge are used to define the unary and pairwise cost function. [sent-47, score-0.32]
24 The final word recognition result is acquired by minimizing the cost function. [sent-48, score-0.32]
25 Section 2 details the proposed method, including the model for character detection and word recognition. [sent-52, score-0.988]
26 First, we use part-based tree-structured models to detect characters, based on which we get the potential character locations. [sent-58, score-0.8]
27 We use character detection scores, spatial constraints and language model to define the unary and pairwise cost function. [sent-60, score-1.074]
28 Finally we get the word recognition result by minimizing the cost function. [sent-61, score-0.32]
29 Next, we will detail the character detection method and word recognition model. [sent-62, score-1.027]
30 Structure information is even more important to characters, since characters are designed by humans and each type of character has a unique structure that represents it. [sent-68, score-1.039]
31 To utilize the unique structure information of characters, we model each character as a tree whose nodes correspond to parts of the character. [sent-69, score-0.84]
32 Each rectangle corresponds to a part-based filter of the character and the red lines illustrate the topological relations of the parts. [sent-76, score-0.759]
33 Model for each character: We represent each character by a tree Tk = (Vk , Ek), where k is the index of the model for different structures, Vk represents the nodes and Ek specifies the topological relations of nodes [20]. [sent-77, score-0.864]
34 By incorporating the elastic structure information, the model could detect characters with contamination or deformation as shown in Figure 4(a). [sent-99, score-0.397]
35 The character in the green rectangle labels the type of the structure. [sent-103, score-0.758]
36 (a) Detection results of characters with contamination and deformation. [sent-104, score-0.282]
37 Re-scoring: Apart from unique structure, different parts of a character tend to have similar intensity, which we could utilize to further improve the performance. [sent-112, score-0.746]
38 To learn the model, we assume a fully-supervised paradigm, where we are provided positive images with characters as well as part labels, and negative images without characters. [sent-125, score-0.253]
39 We design the tree structure for each type of character based on our experience, and the experimental results show that these designs perform quite well. [sent-128, score-0.75]
40 The Word Recognition Model Although the character detection step provides us with a set of windows containing characters with high confidence as shown in Figure 4(b), inevitably it also produces some false positives and ambiguities between similar characters. [sent-143, score-1.095]
41 We make use of character detection scores, spatial constraints, and linguistic knowledge to define the cost function. [sent-146, score-0.909]
42 Finally, the word recognition result is acquired by minimizing the cost function. [sent-147, score-0.32]
43 For a given scene text image, there are several potential character locations. [sent-148, score-0.929]
44 Each position, which might have several character detection results, is represented by a random variable Xi. [sent-150, score-0.776]
45 1 Graph Construction After applying Non-Maximum Suppression (NMS) [20] on the original character detection results, the remaining detection windows constitute the potential locations. [sent-176, score-0.924]
46 Then, for each location, we choose those detection windows which are close to this location as the candidate characters for this location. [sent-179, score-0.409]
47 2 Cost Function The unary cost E(xi) represents the penalty of assigning label cj to node xi. [sent-186, score-0.235]
48 In this case, if the detection score for a certain type of character model cj is very high, the cost of labeling the node cj should be small and vice versa. [sent-187, score-1.11]
49 where P(ci, cj) refers to the bi-gram language model learnt from the lexicon, Dij is the relative distance of the two nodes, Si and Sj represent the maximum character detection scores at the corresponding locations, Si,j is the larger one of Si and Sj, and μ is set to 1. [sent-205, score-0.967]
50 We use the SRI Language Modeling Toolkit [17] to learn the probability of joint occurrences of characters in a large English dictionary with around 0. [sent-207, score-0.253]
51 The pairwise cost function means that if the probability of joint occurrence of a character pair (ci , cj) is large, the cost of nodes (xi, xj) taking labels (ci, cj) should be small. [sent-209, score-0.887]
52 Moreover, if the relative distance of the two nodes is small, and the maximum score of the node is low, the cost of the node taking a null label should be small. [sent-210, score-0.22]
53 3 Inference After computing the unary and pairwise cost, we use the sequential tree-reweighted message passing (TRW-S) algorithm [8] to minimize the cost function in (8), due to its efficiency and accuracy on our recognition problem. [sent-213, score-0.224]
54 Experimental Results In this section, we give detailed evaluation of the proposed character detection and word recognition method. [sent-219, score-1.027]
55 We compare the detection based character recognition method with conventional HOG+NN. [sent-220, score-0.858]
56 We also compare the proposed character detection method with conventional sliding window strategy, SYNTH+FERNS proposed by Wang et al. [sent-221, score-0.896]
57 For word recognition task, we compare our results with state-of-the-art methods [19, 18, 10, 11] as well as commercial OCR engines ABBYY FineReader 9. [sent-223, score-0.251]
58 To evaluate the performance of the proposed detection based character recognition method, we test the recognition rate on two public datasets: Chars74k [4] and ICDAR 2003 robust character recognition dataset (ICDAR03-CH) [9]. [sent-228, score-1.614]
59 However, since we focus on detecting and recognizing characters with certain structures, characters with similar structures such as ’0’, ’O’ and ’o’, ’P’ and ’p’, ’K’ and ’k’, ’X’ and ’x’, should belong to the same class. [sent-229, score-0.572]
60 We use the challenging public datasets Street View Text (SVT) [19], ICDAR 2003 robust word recognition [9] and ICDAR 2011 word recognition datasets [16] to evaluate the performance of the overall word recognition method. [sent-238, score-0.559]
61 Since we focus on the word recognition task, we use the SVT-WORD dataset following the experimental protocol of [19, 18]. [sent-240, score-0.251]
62 For ICDAR 2003 and ICDAR 2011 datasets, similar to [18], we ignore words with fewer than two characters or with non-alphanumeric characters. [sent-241, score-0.298]
63 Detection Based Character Recognition To recognize characters using the detection model, we apply each character-specific tree-structured model (TSM) on the image and choose the structure with the highest score as the recognition result. [sent-244, score-0.587]
64 Since we focus on detecting characters with unique structures, we only train 49 types of character model whose structures are different from each other. [sent-252, score-1.057]
65 The great improvement suggests (1) the effectiveness of the tree-structured models, as they tend to detect and recognize characters with certain structures, and thus (2) the high possibility of achieving better recognition result if we postprocess the result to deal with similar structures. [sent-255, score-0.445]
66 Character Detection To evaluate the superiority of the proposed character detection method over conventional multi-scale sliding window detection strategy for word recognition, we test the word recognition result using the word spotting strategy PLEX from [18]. [sent-263, score-1.762]
67 In this case, based on the character detection results of the proposed TSM and the SYNTH+FERNS proposed by Wang et al. [sent-264, score-0.776]
68 In the SVT-WD case, a lexicon of about 50 words is provided with each image as part of the dataset. [sent-267, score-0.236]
69 The word recognition results are shown in Table 1. [sent-268, score-0.251]
70 Since we use the same word spotting strategy PLEX, the only difference between the two methods lies in the character detection method. [sent-270, score-1.085]
71 For FERNS+PLEX, multi-scale sliding window strategy is used to detect characters and FERNS classifier is used to recognize the characters. [sent-275, score-0.52]
72 While for TSM, tree-structured models are used to detect and recognize the characters at the same time. [sent-276, score-0.388]
73 strategy to detect and recognize characters, which does not make use of the character-specific global structure information. [sent-278, score-0.208]
74 Thus, there are many false positives, which would disturb the word spotting stage. [sent-279, score-0.293]
75 While for the proposed character detection method, since we make use of both global structure information and local appearance information, the detection results are more reliable and representative. [sent-280, score-0.907]
76 Word Recognition To recognize the word, we build the CRF model on the character detection results as discussed in Section 2. [sent-283, score-0.917]
77 We use ICDAR 2003, ICDAR 2011 and SVT datasets to evaluate the proposed word recognition method. [sent-286, score-0.251]
78 Same bigram language model learnt from the lexicon with 0. [sent-287, score-0.344]
79 Similar to the evaluation scheme in [18] and [11], we use the inferred result to retrieve the word with the smallest edit distance in the lexicon. [sent-289, score-0.218]
80 For ICDAR datasets, we measure performance using a lexicon created from all the words in the test set (ICDAR03(FULL), ICDAR11(FULL)), and with a lexicon consisting of the ground truth words plus 50 random words from the test set (ICDAR03(50), ICDAR11(50)). [sent-290, score-0.517]
81 The results on ICDAR03(50), ICDAR11(50), SVT are acquired by retrieving the ones with the smallest edit distance in the lexicon of 50 words whereas for ICDAR03(FULL) and ICDAR11(FULL), the lexicon contains all the ground truth words in the test set. [sent-302, score-0.514]
82 The proposed method outperforms TSM+PLEX by 6%-9%, showing the effectiveness of the CRF model which incorporates detection scores, linguistic knowledge and spatial constraints, since both methods adopt the same character detection method. [sent-305, score-0.951]
83 also used the CRF model to encode character detection results and language model. [sent-309, score-0.911]
84 However, they used the multi-scale sliding window strategy to get the candidate character locations and SVM to classify these characters. [sent-310, score-0.906]
85 The detection method is not as good as the proposed tree-structured character detection method which makes use of the intrinsic global structure information. [sent-311, score-0.887]
86 Furthermore, they built the CRF model on all the detection windows as long as their spatial distance and overlap ratio satisfy a certain condition, which makes the CRF model more complex than ours since we only use the potential character locations to define the nodes. [sent-312, score-0.905]
87 [11] computed the node-specific lexicon prior for each text image from their corresponding lexicon, which means (1) the lexicon priors heavily rely on the lexicon for that image and (2) the computation cost is increased since the lexicon prior should be recomputed for each image. [sent-314, score-0.959]
88 Compared to [11], the recognition rates on SVT do not improve much, mainly because some of the scene text images in SVT are difficult to recognize even for humans as shown in Figure 7. [sent-320, score-0.339]
89 As we can see, our method could recognize scene text with low resolution, different fonts and distortions. [sent-323, score-0.343]
90 Both the character detection and word recognition are implemented in Matlab. [sent-325, score-1.027]
91 The average processing time to recognize a scene text image is about 3 seconds on an Intel(R) Core(TM) i7-2600 CPU 3. [sent-326, score-0.282]
92 Since the character detectors are independent from each other, the implementation could be much faster using parallel processing. [sent-328, score-0.727]
93 Conclusion In this paper, we propose an effective scene text recognition method using the CRF model to incorporate tree-structure based character detection and linguistic knowledge into one framework. [sent-330, score-1.115]
94 Different from the conventional multi-scale sliding window character detection strategy, which does not make use of the intrinsic global structure information, we propose to learn a part-based tree-structured model for each type of character to detect and recognize the characters simultaneously. [sent-331, score-2.069]
95 Based on these detection results, we build a CRF model on the potential character locations to integrate detection scores, spatial constraints and language model. [sent-332, score-1.093]
96 The experimental results show that our method could recognize text in unconstrained scene images with a high accuracy. [sent-335, score-0.308]
97 Multiscale histogram of oriented gradient descriptors for robust character recognition. [sent-420, score-0.701]
98 Icdar 2011 robust reading competition challenge 2: Reading text in scene images. [sent-435, score-0.25]
99 In Proceedings of the international conference on spoken language processing, volume 2, pages 901–904, 2002. [sent-441, score-0.211]
100 Segmentation and recognition of characters in scene images using selective binarization in color space and gat correlation. [sent-467, score-0.46]
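The lexicon lookup mentioned in sentence 79 (retrieving the lexicon word with the smallest edit distance to the inferred string) can be illustrated with a minimal sketch. This is not the authors' code: the paper reports a Matlab implementation, and the function names `edit_distance` and `nearest_lexicon_word`, the unit edit costs, and the case-insensitive comparison are all assumptions made here for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption), computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def nearest_lexicon_word(inferred: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon entry closest to the inferred string.

    Hypothetical helper: the comparison is case-insensitive, and ties go to
    the entry that appears first in the lexicon.
    """
    query = inferred.lower()
    return min(lexicon, key=lambda w: edit_distance(query, w.lower()))


if __name__ == "__main__":
    # A CRF output with one misread character is mapped back to a lexicon word.
    print(nearest_lexicon_word("H0TEL", ["motel", "hotel", "hostel"]))
```

With a 50-word lexicon per image, as in the ICDAR03(50) and SVT settings described above, a linear scan like this is cheap; a larger lexicon might call for an index, which is outside the scope of this sketch.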
wordName wordTfidf (topN-words)
[('character', 0.701), ('icdar', 0.273), ('characters', 0.253), ('word', 0.194), ('lexicon', 0.191), ('svt', 0.176), ('text', 0.144), ('tsm', 0.123), ('language', 0.117), ('binarization', 0.112), ('plex', 0.11), ('mishra', 0.106), ('recognize', 0.1), ('crf', 0.091), ('linguistic', 0.082), ('cj', 0.08), ('spotting', 0.078), ('ferns', 0.078), ('detection', 0.075), ('reading', 0.068), ('sliding', 0.06), ('unary', 0.058), ('recognition', 0.057), ('cost', 0.051), ('ocr', 0.05), ('vk', 0.049), ('structures', 0.048), ('nodes', 0.048), ('wik', 0.047), ('node', 0.046), ('potential', 0.046), ('conference', 0.045), ('words', 0.045), ('scores', 0.038), ('scene', 0.038), ('ek', 0.038), ('strategy', 0.037), ('structure', 0.036), ('pairwise', 0.036), ('xi', 0.036), ('candidate', 0.035), ('finereader', 0.035), ('fonts', 0.035), ('wikj', 0.035), ('window', 0.035), ('detect', 0.035), ('chunheng', 0.031), ('abbyy', 0.031), ('topological', 0.031), ('document', 0.03), ('type', 0.03), ('score', 0.029), ('postprocessing', 0.029), ('synth', 0.029), ('contamination', 0.029), ('sstr', 0.029), ('international', 0.028), ('ci', 0.027), ('disappointing', 0.027), ('windows', 0.027), ('rectangle', 0.027), ('knn', 0.026), ('could', 0.026), ('hog', 0.025), ('nms', 0.025), ('conventional', 0.025), ('proceedings', 0.024), ('edit', 0.024), ('xj', 0.023), ('pictorial', 0.023), ('build', 0.023), ('nition', 0.023), ('judd', 0.023), ('message', 0.022), ('million', 0.022), ('zhong', 0.022), ('full', 0.022), ('eij', 0.022), ('pages', 0.021), ('false', 0.021), ('root', 0.02), ('appearance', 0.02), ('locations', 0.02), ('seamlessly', 0.02), ('alahari', 0.019), ('navigation', 0.019), ('choose', 0.019), ('sapp', 0.019), ('unique', 0.019), ('quite', 0.019), ('lj', 0.019), ('tree', 0.018), ('model', 0.018), ('positives', 0.018), ('acquired', 0.018), ('learnt', 0.018), ('constraints', 0.018), ('detecting', 0.018), ('get', 0.018), ('ieee', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
2 0.12540483 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data
Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen
Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.
3 0.089607857 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue on its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function, the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
4 0.089252383 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
Author: Neill D.F. Campbell, Kartic Subr, Jan Kautz
Abstract: Conditional Random Fields (CRFs) are used for diverse tasks, ranging from image denoising to object recognition. For images, they are commonly defined as a graph with nodes corresponding to individual pixels and pairwise links that connect nodes to their immediate neighbors. Recent work has shown that fully-connected CRFs, where each node is connected to every other node, can be solved efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. In this paper, we generalize the pairwise terms to a non-linear dissimilarity measure that is not required to be a distance metric. To this end, we propose a density estimation technique to derive conditional pairwise potentials in a nonparametric manner. We then use an efficient embedding technique to estimate an approximate Euclidean feature space for these potentials, in which the pairwise term can still be expressed as a Gaussian kernel. We demonstrate that the use of non-parametric models for the pairwise interactions, conditioned on the input data, greatly increases expressive power whilst maintaining efficient inference.
5 0.08549244 456 cvpr-2013-Visual Place Recognition with Repetitive Structures
Author: Akihiko Torii, Josef Sivic, Tomáš Pajdla, Masatoshi Okutomi
Abstract: Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. In this work we show that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval. It is based on robust detection of repeated image structures and a simple modification of weights in the bag-of-visual-word model. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline and more recently proposed burstiness weighting.
7 0.081065319 22 cvpr-2013-A Non-parametric Framework for Document Bleed-through Removal
8 0.078564711 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
9 0.07728228 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
10 0.070351049 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
11 0.069039896 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
12 0.068545736 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
13 0.066713706 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
14 0.062226012 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
15 0.06219019 340 cvpr-2013-Probabilistic Label Trees for Efficient Large Scale Image Classification
16 0.059377037 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
17 0.056776125 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings
18 0.056226902 146 cvpr-2013-Enriching Texture Analysis with Semantic Data
19 0.055964209 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
20 0.053311408 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
topicId topicWeight
[(0, 0.14), (1, -0.04), (2, -0.002), (3, -0.008), (4, 0.06), (5, 0.017), (6, 0.017), (7, 0.033), (8, -0.009), (9, -0.022), (10, 0.011), (11, 0.015), (12, -0.001), (13, -0.009), (14, -0.022), (15, 0.001), (16, 0.02), (17, 0.034), (18, 0.053), (19, -0.077), (20, -0.007), (21, 0.01), (22, 0.005), (23, 0.036), (24, 0.024), (25, -0.017), (26, 0.053), (27, 0.076), (28, -0.062), (29, -0.022), (30, 0.005), (31, 0.034), (32, 0.019), (33, 0.053), (34, 0.048), (35, 0.038), (36, -0.077), (37, 0.058), (38, -0.088), (39, -0.017), (40, -0.008), (41, -0.033), (42, -0.082), (43, 0.031), (44, -0.001), (45, 0.013), (46, -0.057), (47, -0.059), (48, 0.077), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.89145672 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
2 0.67688346 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue on its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function, the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
3 0.67606926 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun
Abstract: We are interested in holistic scene understanding where images are accompanied by text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach on the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual-only model and detection improvements of 5% AP over deformable part-based models [8].
Author: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso
Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.
5 0.59475106 183 cvpr-2013-GRASP Recurring Patterns from a Single View
Author: Jingchen Liu, Yanxi Liu
Abstract: We propose a novel unsupervised method for discovering recurring patterns from a single view. A key contribution of our approach is the formulation and validation of a joint assignment optimization problem where multiple visual words and object instances of a potential recurring pattern are considered simultaneously. The optimization is achieved by a greedy randomized adaptive search procedure (GRASP) with moves specifically designed for fast convergence. We have quantified systematically the performance of our approach under stressed conditions of the input (missing features, geometric distortions). We demonstrate that our proposed algorithm outperforms state of the art methods for recurring pattern discovery on a diverse set of 400+ real world and synthesized test images.
6 0.57722545 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
7 0.57318681 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
8 0.56975794 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
9 0.5566541 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
10 0.55461246 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling
11 0.53435951 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
13 0.52997696 134 cvpr-2013-Discriminative Sub-categorization
14 0.52761126 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
15 0.52483296 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
16 0.51619619 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
17 0.50800323 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
18 0.50185829 24 cvpr-2013-A Principled Deep Random Field Model for Image Segmentation
19 0.50085819 22 cvpr-2013-A Non-parametric Framework for Document Bleed-through Removal
20 0.49909478 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
topicId topicWeight
[(10, 0.11), (16, 0.03), (26, 0.028), (33, 0.294), (64, 0.225), (67, 0.108), (69, 0.035), (87, 0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.87169063 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
Author: Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, Zhong Zhang
Abstract: Scene text recognition has attracted great interest from the computer vision community in recent years. In this paper, we propose a novel scene text recognition method using part-based tree-structured character detection. Unlike the conventional multi-scale sliding-window character detection strategy, which does not make use of character-specific structure information, we use a part-based tree structure to model each type of character so as to detect and recognize the characters at the same time. For word recognition, we build a Conditional Random Field model on the potential character locations to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework. The final word recognition result is obtained by minimizing the cost function defined on the random field. Experimental results on a range of challenging public datasets (ICDAR 2003, ICDAR 2011, SVT) demonstrate that the proposed method significantly outperforms state-of-the-art methods for both character detection and word recognition.
2 0.85553616 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
3 0.83956307 262 cvpr-2013-Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
Author: Aurélien Lucchi, Yunpeng Li, Pascal Fua
Abstract: We propose a working set based approximate subgradient descent algorithm to minimize the margin-sensitive hinge loss arising from the soft constraints in max-margin learning frameworks, such as the structured SVM. We focus on the setting of general graphical models, such as loopy MRFs and CRFs commonly used in image segmentation, where exact inference is intractable and the most violated constraints can only be approximated, voiding the optimality guarantees of the structured SVM’s cutting plane algorithm as well as reducing the robustness of existing subgradient based methods. We show that the proposed method obtains better approximate subgradients through the use of working sets, leading to improved convergence properties and increased reliability. Furthermore, our method allows new constraints to be randomly sampled instead of computed using the more expensive approximate inference techniques such as belief propagation and graph cuts, which can be used to reduce learning time at only a small cost of performance. We demonstrate the strength of our method empirically on the segmentation of a new publicly available electron microscopy dataset as well as the popular MSRC data set and show state-of-the-art results.
4 0.83037037 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen
Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image, where a graphlet is a small-sized graph that has superpixels as its nodes and encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to those of the fully supervised segmentation models.
5 0.82941622 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
Author: Guang Chen, Yuanyuan Ding, Jing Xiao, Tony X. Han
Abstract: Context has been playing an increasingly important role in improving object detection performance. In this paper we propose an effective representation, Multi-Order Contextual co-Occurrence (MOCO), to implicitly model the high level context using solely detection responses from a baseline object detector. The so-called (1st-order) context feature is computed as a set of randomized binary comparisons on the response map of the baseline object detector. The statistics of the 1st-order binary context features are further calculated to construct a high order co-occurrence descriptor. Combining the MOCO feature with the original image feature, we can evolve the baseline object detector to a stronger context aware detector. With the updated detector, we can continue the evolution till the contextual improvements saturate. Using the successful deformable-part-model detector [13] as the baseline detector, we test the proposed MOCO evolution framework on the PASCAL VOC 2007 dataset [8] and Caltech pedestrian dataset [7]: The proposed MOCO detector outperforms all known state-of-the-art approaches, contextually boosting deformable part models (ver.5) [13] by 3.3% in mean average precision on the PASCAL 2007 dataset. For the Caltech pedestrian dataset, our method further reduces the log-average miss rate from 48% to 46% and the miss rate at 1 FPPI from 25% to 23%, compared with the best prior art [6].
6 0.82931262 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
7 0.82745367 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
8 0.82694393 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
9 0.82561791 438 cvpr-2013-Towards Pose Robust Face Recognition
10 0.82515317 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
11 0.82505083 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
12 0.82444704 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
13 0.82399976 204 cvpr-2013-Histograms of Sparse Codes for Object Detection
14 0.82346648 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation
15 0.82340735 202 cvpr-2013-Hierarchical Saliency Detection
16 0.82294422 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints
17 0.82270521 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
18 0.82248914 249 cvpr-2013-Learning Compact Binary Codes for Visual Tracking
19 0.82237816 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
20 0.82230949 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels