iccv iccv2013 iccv2013-180 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
Reference: text
sentIndex sentText sentNum sentScore
1 Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. [sent-8, score-0.885]
2 In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. [sent-9, score-1.342]
3 The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. [sent-10, score-0.656]
4 The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). [sent-11, score-1.397]
5 Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. [sent-12, score-1.203]
6 The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. [sent-14, score-0.638]
7 Towards this, we propose a technique to obtain image-level scene semantic priors from eye tracking data, which will reduce the search space for multimedia annotation tasks. [sent-20, score-0.618]
8 The eye tracking regions identified by the proposed algorithm as faces (blue) and text (green). [sent-25, score-1.307]
9 The final detection outputs of the face and text detectors focusing on the priors provided by eye tracking. [sent-26, score-1.283]
10 top-down task is biased towards faces and text [7]. [sent-28, score-0.65]
11 The first step towards obtaining scene semantic prior from eye tracking information alone is to build models that predict face and text regions in images, which is the primary focus of the paper. [sent-29, score-1.342]
12 We note that the performance of state-of-the-art cat and dog detectors [24] in turn depends on a head (face) detection algorithm, which can be enhanced using eye movement information. [sent-31, score-0.896]
13 [27] extract high-level information from images and verbal cues (faces, face parts and person) and model their interrelationships using eye movement fixations and saccades across these detections. [sent-38, score-1.159]
14 We propose an algorithm to localize face and text regions in images using eye tracking data alone. [sent-51, score-1.342]
15 The algorithm basically clusters the eye tracking data into meaningful regions using mean-shift clustering. [sent-52, score-0.742]
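For illustration, a minimal sketch of this clustering step using scikit-learn's MeanShift is given below; the bandwidth value and the array layout are assumptions for illustration, not the authors' exact settings.

```python
# Minimal sketch: spatially cluster pooled eye tracking fixation samples with mean-shift.
# The bandwidth (in pixels) is an assumed value, not the paper's exact setting.
import numpy as np
from sklearn.cluster import MeanShift

def cluster_fixation_samples(samples_xy, bandwidth=50.0):
    """samples_xy: (N, 2) array of fixation sample coordinates pooled over all subjects."""
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    labels = ms.fit_predict(samples_xy)
    clusters = [samples_xy[labels == k] for k in np.unique(labels)]
    return labels, clusters

# Example with synthetic fixation samples around two attended regions
samples = np.vstack([np.random.randn(200, 2) * 15 + [300, 200],
                     np.random.randn(150, 2) * 10 + [600, 450]])
labels, clusters = cluster_fixation_samples(samples)
print(len(clusters), "clusters found")
```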
16 The final cluster labels are inferred using a fully connected MRF, by learning the unary and interaction potentials for faces and text from these statistics. [sent-54, score-0.99]
17 We demonstrate the ability of these face and text priors to improve the speed and precision of state-of-the-art text [13] and cat and dog detection [24] algorithms. [sent-56, score-1.606]
18 We also present a new eye tracking dataset, collected on images from various text, dogs and cats datasets. [sent-58, score-0.86]
19 Faces and Text Eye Tracking Database We collected an eye tracking dataset, with primary focus on faces (humans, dogs and cats) and text, using an Eyelink 1000 eye tracking device. [sent-63, score-1.933]
20 The Flickr images provide sufficient representation for images without text or faces (including dogs and cats) in both indoor and outdoor scenes. [sent-66, score-0.777]
21 The overall image dataset consists of 61 dogs, 61 cats, 35 human faces, 246 text lines and 63 images without any text or faces. [sent-67, score-1.046]
22 Also, eye tracking calibration was performed every 50 images and the entire data was collected in two sessions (150 images each). [sent-73, score-0.634]
23 edu/~karthikeyan/facesTextEyetrackingDataset/ faces, dogs, cats and other background objects. Human eye movement scanpaths typically consist of alternating fixations and saccades. [sent-77, score-1.052]
24 The eye tracking host computer samples the gaze information at 1000 Hz and automatically detects fixations and saccades in the data. [sent-79, score-1.15]
25 In our analysis we only use the fixation samples and henceforth refer to these fixation samples as the eye tracking samples. [sent-82, score-0.979]
26 The eye tracking device also clusters the fixation samples and identifies fixation and saccade points. [sent-83, score-1.098]
27 In our experiments, the first fixation and saccade were removed to avoid the initial eye position bias due to the transition gray slide in the experimental setup. [sent-86, score-0.654]
28 In text regions consisting of a single word, the subjects typically fixate around the center of the word and the different fixations take a nearly elliptical shape. [sent-98, score-0.973]
29 Faces and Text Localization from Eye Tracking Data The aim is to identify face and text regions in images by analyzing eye tracking information from multiple subjects, without utilizing any image features. [sent-103, score-1.39]
30 The text and face region detection problem is mapped to a cluster labeling problem. [sent-109, score-0.991]
31 The 2D eye tracking samples (fixation samples) within the cluster are represented by Ei. [sent-112, score-0.804]
32 Finally, the fixations provided by every individual person k in cluster i are augmented giving Fik, and the corresponding times (0-4 seconds) representing the beginning of the fixations in cluster i are given by Tik. [sent-115, score-0.742]
33 Number of fixations and eye tracking samples: |Fi|, |Ei| b. [sent-120, score-0.772]
34 Standard deviation of each dimension of the eye tracking samples Ei c. [sent-121, score-0.628]
35 The ratio of the eye tracking sample density in the cluster compared to its background. [sent-126, score-0.753]
36 These features essentially aim to capture the eye movement attributes typical of face and text regions described in Section 2. [sent-163, score-1.265]
37 Features a, b, c, d and e are important basic features for which text and face regions exhibit characteristic responses. [sent-164, score-0.765]
38 Features f and g are more characteristic of text regions with multiple words as nearly horizontal inter-word saccades are prevalent. [sent-165, score-0.94]
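To make the per-cluster statistics concrete, the sketch below computes features a-c (fixation/sample counts, per-dimension standard deviation, and a cluster-to-background density ratio); the exact background-density definition is an assumption, since the paper's formula is not reproduced in this summary.

```python
# Sketch of intra-cluster features a-c for a single cluster; the background-density
# definition is an assumed simplification of the paper's feature c.
import numpy as np

def intra_cluster_features(E_i, F_i, all_samples, image_area):
    """E_i: (n, 2) eye tracking samples in the cluster; F_i: (m, 2) fixations in it."""
    n_samples, n_fixations = len(E_i), len(F_i)              # feature a
    std_x, std_y = E_i.std(axis=0)                           # feature b
    w = np.ptp(E_i[:, 0]) + 1e-6                             # cluster extent in x
    h = np.ptp(E_i[:, 1]) + 1e-6                             # cluster extent in y
    cluster_density = n_samples / (w * h)
    bg_density = (len(all_samples) - n_samples) / max(image_area - w * h, 1e-6)
    density_ratio = cluster_density / (bg_density + 1e-6)    # feature c (assumed form)
    return np.array([n_fixations, n_samples, std_x, std_y, density_ratio])
```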
39 Inter-cluster features In addition to intra-cluster features, pairwise inter-cluster features also provide useful information to identify face and text regions. [sent-170, score-0.685]
40 Moreover, in text images with multiple words, inter-word saccadic activity is quite common. [sent-172, score-0.618]
41 Figure 3: Shows examples of faces and text in two scenarios each. [sent-174, score-0.65]
42 The number of saccades, horizontal saccades and percentage of horizontal saccades from the left cluster to the right cluster. [sent-188, score-0.85]
43 Also, feature 4 is targeted to capture text regions as subjects typically read text from left to right. [sent-193, score-1.156]
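A hedged sketch of these pairwise saccade statistics between a left cluster i and a right cluster j follows; the angular threshold used to call a saccade "horizontal" is an assumption.

```python
# Sketch of inter-cluster saccade features between cluster i and cluster j.
# The angular threshold for a "horizontal" saccade is an assumed value.
import numpy as np

def inter_cluster_saccade_features(saccades, label_i, label_j, angle_thresh_deg=20.0):
    """saccades: iterable of (start_xy, end_xy, start_cluster, end_cluster) tuples."""
    n_total = n_horizontal = 0
    for start, end, c_start, c_end in saccades:
        if c_start == label_i and c_end == label_j:           # saccade from i to j
            n_total += 1
            dx, dy = end[0] - start[0], end[1] - start[1]
            angle = np.degrees(np.arctan2(abs(dy), abs(dx) + 1e-6))
            if angle < angle_thresh_deg:
                n_horizontal += 1
    pct_horizontal = n_horizontal / n_total if n_total else 0.0
    return n_total, n_horizontal, pct_horizontal
```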
44 (Center) Clustered eye tracking fixation samples from multiple subjects overlaid on the image. [sent-198, score-0.888]
45 (Right) Visualizing the unary and interaction potentials of the clusters for the text MRF. [sent-199, score-0.732]
46 The unary potential is color coded in red, with bright values indicating a high unary potential for a cluster belonging to the text class. [sent-200, score-0.84]
47 2, we propose a probabilistic model based technique to label the clusters provided by mean-shift clustering of the eye tracking samples. [sent-207, score-0.662]
48 Algorithm 1: Proposed method to detect face and text regions in images by analyzing eye tracking samples. [sent-218, score-1.366]
49 In order to allow overlapping text and face regions (in watermarked images), cope with limited availability of data with face-text co-occurrence, and speed up inference, we resort to separately tackling the face/non-face and text/non-text problems using two distinct MRFs. [sent-232, score-0.765]
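The sketch below illustrates one of these two binary cluster-labeling problems as a fully connected pairwise MRF solved with iterated conditional modes (ICM); ICM is an assumed inference choice for illustration, and the potentials would come from the learned unary and interaction models.

```python
# Sketch: label clusters with a fully connected binary MRF (e.g. text vs. non-text).
# ICM is an assumed inference method; the potentials come from learned models.
import numpy as np

def label_clusters(unary, pairwise, n_iters=10):
    """unary: (C, 2) costs per cluster for labels {0: background, 1: text};
       pairwise: (C, C, 2, 2) interaction costs for every ordered cluster pair."""
    C = unary.shape[0]
    labels = unary.argmin(axis=1)            # initialize from the unary terms alone
    for _ in range(n_iters):
        for i in range(C):
            costs = unary[i].copy()
            for j in range(C):
                if j != i:                   # fully connected: every other cluster votes
                    costs += pairwise[i, j, :, labels[j]]
            labels[i] = costs.argmin()
    return labels
```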
50 Performance of Face and Text Localization from the Eye Tracking Samples In this section we analyze the performance of the cluster-level classification of faces and text regions in images. [sent-235, score-0.73]
51 The cluster labels are defined as the label of the class (face, text or background) which has the most representation among the cluster samples. [sent-237, score-0.875]
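As a small sketch of this labeling rule, a cluster's ground-truth label can be taken as the majority class among its samples:

```python
# Sketch: assign a cluster the class with the most representation among its samples.
from collections import Counter

def cluster_ground_truth(sample_labels):
    """sample_labels: per-sample labels, e.g. ['text', 'text', 'background', ...]"""
    return Counter(sample_labels).most_common(1)[0][0]
```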
52 For this experiment we fix the bandwidth of both the face and text MRFs to 50. [sent-240, score-0.685]
53 In addition, clusters which have less than 1% of the total number of eye tracking samples are automatically marked as background to avoid trivial cases. [sent-242, score-0.658]
54 Figure 5 : Left: Input image with the ground truth for face (blue) and text (green). [sent-244, score-0.685]
55 Right: Face (blue) and text (green) cluster labels propagated from ground truth. [sent-246, score-0.699]
56 The performance of the cluster detection problem is evaluated using a precision-recall approach for face and text detection. [sent-248, score-0.942]
57 In order to utilize these cluster labels to enhance text and cat and dog detection algorithms, we require high recall under reasonable precision. [sent-251, score-1.031]
58 This ensures most of the regions containing faces and text are presented to the detector, which will enhance the overall performance. [sent-252, score-0.73]
59 Red fixation points correspond to text and blue corresponds to background. [sent-259, score-0.7]
60 Figure 8: Example scenario where the proposed approach fails to detect a face (left) and a text word (right). [sent-263, score-0.763]
61 The eye tracking samples detected as face in (a) and text in (b) are shown in red and the samples detected as background (both (a) and (b)) are indicated in blue. [sent-264, score-1.364]
62 The performance of the face and text detector MRFs is shown in Table 1. [sent-266, score-0.742]
63 We notice that the recall is high for both face and text detection. [sent-270, score-0.825]
64 In the text region as well, we observe that the precision is fairly high, indicating the excellent localization ability of our algorithm. [sent-272, score-0.623]
65 Figure 8 also highlights a few failure cases where both the face and text localization fail. [sent-278, score-0.742]
66 In addition, the text cluster detection fails as the allocated time (4 seconds) was insufficient to scan the entire text content. [sent-280, score-1.334]
67 Table 1: Performance of cluster- and image-level face and text detection from the eye tracking samples. [sent-290, score-1.519]
68 We notice that the recall (marked in bold) is high suggesting that the proposed approach seldom misses face and text detections in images. [sent-291, score-0.777]
69 This is achieved at a sufficiently good precision ensuring that this method can be valuable to localize ROI to reduce the search space for computationally expensive face and text detectors. [sent-292, score-0.739]
70 The proposed face and text eye tracking priors can be an extremely useful alternative source of context to improve detection. [sent-295, score-1.268]
71 Therefore, we investigate the utility of these priors for text detection in natural scenes as well as cat and dog detection in images. [sent-296, score-0.948]
72 The proposed eye tracking based face detection prior can significantly reduce the search space for cat and dog faces/heads in images. [sent-306, score-1.042]
73 As human fixations are typically focused towards the eyes and nose of the animals, we construct a bounding box around the face clusters to localize the cat head. [sent-307, score-0.619]
74 When the cluster is approximated by a rectangular bounding box R with width w and length l containing all the eye tracking samples, an outer bounding box B centered around R of size 2. [sent-308, score-0.784]
75 3% of the entire dataset (image area) using the proposed eye tracking based face detection model. [sent-312, score-0.851]
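A hedged sketch of this ROI construction is given below; the outer-box scale factor is truncated in the text above ("of size 2..."), so the value used here is a placeholder assumption.

```python
# Sketch: build a detector search ROI around a cluster predicted as a face/head.
# The outer-box scale factor is a placeholder; its exact value is truncated in the text.
import numpy as np

def head_roi_from_cluster(E_i, scale=2.5, image_shape=None):
    """E_i: (n, 2) eye tracking samples of a cluster predicted as face/head."""
    x_min, y_min = E_i.min(axis=0)
    x_max, y_max = E_i.max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w, half_l = scale * (x_max - x_min) / 2.0, scale * (y_max - y_min) / 2.0
    x0, y0, x1, y1 = cx - half_w, cy - half_l, cx + half_w, cy + half_l
    if image_shape is not None:              # clip the outer box to the image bounds
        H, W = image_shape[:2]
        x0, y0 = max(0.0, x0), max(0.0, y0)
        x1, y1 = min(W - 1.0, x1), min(H - 1.0, y1)
    return x0, y0, x1, y1
```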
76 Text detection in natural scenes is challenging as text is present in a wide variety of styles, fonts and shapes coupled with geometric distortions. Figure 10: Plotting Average Precision (AP) of Cat head (a), Dog head (b), Cat Body (c) and Dog Body (d). [sent-335, score-0.756]
77 Texture-based approaches typically learn the properties of text and background texture [9, 31] and classify image regions into text and non-text using sliding windows. [sent-342, score-1.126]
78 Stroke width transform (SWT) [13] is an elegant connected component based approach which groups pixels based on the properties of the potential text stroke they belong to. [sent-346, score-0.641]
79 We utilize SWT as the baseline text detection algorithm as it obtained state-of-the-art results in the text detection datasets [21, 30] from which we obtained the images. [sent-347, score-1.253]
80 The first step of SWT is edge detection, and the quality of the edges primarily determines the final text detection performance [19]. [sent-348, score-0.685]
81 The presence of several false edges, especially in highly textured objects, leads to false detections, and therefore we propose an edge subset selection procedure based on text priors obtained by labeling the eye tracking samples. [sent-349, score-1.284]
82 This is implemented by convolving the eye tracking samples using a Gaussian filter of variance 150 pixels (conservative selection) and obtaining a binary text attention map in the image plane by selecting regions which are above a threshold (0. [sent-351, score-1.305]
83 In the following step, connected components of the edges which have an average text attention > 0. [sent-353, score-0.637]
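The two-step procedure above can be sketched as follows; the smoothing parameter and both thresholds are assumptions here (the text states a variance of 150 pixels, and the threshold values are truncated in this summary).

```python
# Sketch: build a binary text attention map from fixation samples labeled as text,
# then keep only edge connected components whose mean attention exceeds a threshold.
# The smoothing parameter and both thresholds are assumed values.
import numpy as np
from scipy.ndimage import gaussian_filter, label

def text_attention_map(text_samples_xy, image_shape, sigma=np.sqrt(150.0), thresh=0.3):
    H, W = image_shape[:2]
    density = np.zeros((H, W), dtype=np.float64)
    for x, y in text_samples_xy.astype(int):
        if 0 <= y < H and 0 <= x < W:
            density[y, x] += 1.0                       # accumulate fixation samples
    smooth = gaussian_filter(density, sigma=sigma)     # conservative spatial smoothing
    smooth /= smooth.max() + 1e-12
    return smooth > thresh                             # binary attention map

def select_edge_components(edge_map, attention, min_mean_attention=0.5):
    """Keep edge connected components whose average attention is above the threshold."""
    components, n = label(edge_map)
    keep = np.zeros(edge_map.shape, dtype=bool)
    for k in range(1, n + 1):
        mask = components == k
        if attention[mask].mean() > min_mean_attention:
            keep |= mask
    return keep
```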
84 Table 2: Comparison of the performance of the proposed text detector with eye tracking prior and baseline SWT. [sent-358, score-1.202]
85 The performance of the text detection is validated using standard precision-recall metrics popular in text detection. Figure 12: Examples of images where the proposed text detection approach performs reliably. [sent-361, score-1.245]
86 We note that the timing of the proposed algorithm used for comparison includes the text cluster labeling overhead as well. [sent-368, score-0.725]
87 Figure 13 compares some results of the proposed approach to the baseline SWT and indicates the utility of the text attention map to limit the ROI for text detection. [sent-372, score-1.165]
88 Consequently, we generate semantic priors by analyzing eye tracking samples without image information. [sent-376, score-0.693]
89 We focused on two semantic categories, faces and text, and collected a new eye tracking dataset. [sent-377, score-0.73]
90 The dataset consists of 300 images viewed by 15 subjects, with specific focus on humans, dogs, cats and text in natural scenes. [sent-378, score-0.787]
91 The eye tracking fixation samples are clustered using mean-shift. [sent-379, score-0.803]
92 The proposed approach obtains promising results in classifying face and text regions from background by only analyzing eye tracking samples. [sent-381, score-1.366]
93 This information provides a very useful prior for challenging problems which require robust face and text detection. [sent-382, score-0.685]
94 The attention regions ((c) and (f)) show the eye tracking samples classified as text in red and the ROI used by the text detector in blue. [sent-384, score-1.885]
95 Therefore, as the false positive portion in SWT (red boxes in (a) and (d)) is removed by the generated text attention region, we obtain better detector precision in these images. [sent-385, score-0.734]
96 This semantic prior in conjunction with state-of-the-art detectors obtains faster detections and higher precision results for dog, cat and text detection problems compared to the baseline. [sent-386, score-0.846]
97 Furthermore, if the image has a large number of text lines, the subjects do not have sufficient viewing time to gather all the information presented. [sent-389, score-0.689]
98 In addition, we will explore better localization of face and text regions for the detectors from the eye tracking information. [sent-392, score-1.394]
99 Additionally, an edge learning technique from the cluster labels for the text class could improve the proposed text detection algorithm. [sent-394, score-1.303]
100 Learning bottom-up text attention maps for text detection using stroke width transform. [sent-479, score-1.279]
wordName wordTfidf (topN-words)
[('text', 0.523), ('eye', 0.419), ('saccades', 0.302), ('fixations', 0.195), ('cluster', 0.176), ('face', 0.162), ('tracking', 0.158), ('cats', 0.154), ('fixation', 0.15), ('faces', 0.127), ('cat', 0.126), ('subjects', 0.11), ('dogs', 0.103), ('swt', 0.098), ('dog', 0.096), ('saccade', 0.085), ('clusters', 0.085), ('detection', 0.081), ('movement', 0.081), ('regions', 0.08), ('attention', 0.074), ('bulling', 0.074), ('karthikeyan', 0.074), ('saccadic', 0.071), ('head', 0.064), ('detector', 0.057), ('semantics', 0.055), ('precision', 0.054), ('viewed', 0.051), ('samples', 0.051), ('unary', 0.049), ('stroke', 0.047), ('baseline', 0.045), ('cerf', 0.045), ('potentials', 0.043), ('word', 0.041), ('priors', 0.041), ('scanpaths', 0.041), ('connected', 0.04), ('roi', 0.038), ('mediank', 0.037), ('renuka', 0.037), ('theprop', 0.037), ('horizontal', 0.035), ('mrf', 0.035), ('parkhi', 0.035), ('humans', 0.035), ('icdar', 0.034), ('viewing', 0.034), ('highlights', 0.034), ('attract', 0.034), ('detections', 0.033), ('tpt', 0.033), ('fik', 0.033), ('pet', 0.033), ('presence', 0.032), ('interaction', 0.032), ('entire', 0.031), ('intelligence', 0.031), ('reading', 0.031), ('width', 0.031), ('notice', 0.03), ('marked', 0.03), ('recall', 0.029), ('body', 0.029), ('detectors', 0.029), ('subramanian', 0.028), ('nose', 0.028), ('saliency', 0.028), ('scanpath', 0.027), ('mishra', 0.027), ('blue', 0.027), ('office', 0.027), ('transactions', 0.027), ('collected', 0.026), ('labeling', 0.026), ('false', 0.026), ('detects', 0.025), ('clustered', 0.025), ('vij', 0.025), ('detecting', 0.025), ('international', 0.025), ('icip', 0.025), ('activity', 0.024), ('analyzing', 0.024), ('elliptical', 0.024), ('fonts', 0.024), ('et', 0.024), ('ei', 0.024), ('utilizing', 0.024), ('flickr', 0.024), ('wearable', 0.024), ('eyes', 0.023), ('localization', 0.023), ('gathering', 0.023), ('outgoing', 0.023), ('region', 0.023), ('ieee', 0.023), ('gather', 0.022), ('conservative', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
2 0.2528488 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.25023872 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
Author: Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang
Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and text-line levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.
4 0.24236608 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
5 0.23628114 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Author: Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
6 0.23551716 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
7 0.21108083 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
8 0.19022463 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
9 0.14751144 157 iccv-2013-Fast Face Detector Training Using Tailored Views
10 0.13240454 67 iccv-2013-Calibration-Free Gaze Estimation Using Human Gaze Patterns
11 0.1262922 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
12 0.12434669 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
13 0.12065308 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
14 0.11916434 369 iccv-2013-Saliency Detection: A Boolean Map Approach
15 0.1152136 335 iccv-2013-Random Faces Guided Sparse Many-to-One Encoder for Pose-Invariant Face Recognition
16 0.11265825 97 iccv-2013-Coupling Alignments with Recognition for Still-to-Video Face Recognition
17 0.10227272 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
18 0.1012753 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
19 0.10012322 444 iccv-2013-Viewing Real-World Faces in 3D
20 0.098309338 318 iccv-2013-PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects
topicId topicWeight
[(0, 0.218), (1, 0.019), (2, 0.1), (3, -0.139), (4, 0.034), (5, -0.068), (6, 0.12), (7, 0.072), (8, -0.066), (9, 0.06), (10, 0.251), (11, -0.119), (12, 0.077), (13, 0.088), (14, -0.027), (15, 0.133), (16, -0.16), (17, 0.088), (18, -0.15), (19, 0.021), (20, 0.018), (21, 0.056), (22, 0.055), (23, -0.032), (24, 0.075), (25, 0.028), (26, -0.029), (27, -0.005), (28, 0.029), (29, -0.0), (30, 0.011), (31, -0.054), (32, -0.001), (33, -0.039), (34, 0.037), (35, -0.015), (36, 0.074), (37, -0.025), (38, -0.034), (39, 0.002), (40, 0.036), (41, -0.007), (42, 0.044), (43, -0.032), (44, -0.0), (45, -0.015), (46, -0.059), (47, 0.022), (48, -0.026), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.96313697 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using afully connectedMarkov Random Field (MRF). Given the eye tracking datafrom a test image, itpredicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
2 0.79057294 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
Author: Lukáš Neumann, Jiri Matas
Abstract: An unconstrained end-to-end text localization and recognition method is presented. The method introduces a novel approach for character detection and recognition which combines the advantages of sliding-window and connected component methods. Characters are detected and recognized as image regions which contain strokes of specific orientations in a specific relative position, where the strokes are efficiently detected by convolving the image gradient field with a set of oriented bar filters. Additionally, a novel character representation efficiently calculated from the values obtained in the stroke detection phase is introduced. The representation is robust to shift at the stroke level, which makes it less sensitive to intra-class variations and the noise induced by normalizing character size and positioning. The effectiveness of the representation is demonstrated by the results achieved in the classification of real-world characters using a Euclidean nearest-neighbor classifier trained on synthetic data in a plain form. The method was evaluated on a standard dataset, where it achieves state-of-the-art results in both text localization and recognition.
3 0.78683072 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
Author: Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang
Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and text-line levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.
4 0.76414764 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
5 0.76401764 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
6 0.68067944 210 iccv-2013-Image Retrieval Using Textual Cues
8 0.48742029 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
9 0.47939384 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
10 0.47467148 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
11 0.45858422 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
12 0.45536211 157 iccv-2013-Fast Face Detector Training Using Tailored Views
13 0.44404501 195 iccv-2013-Hidden Factor Analysis for Age Invariant Face Recognition
14 0.44214693 325 iccv-2013-Predicting Primary Gaze Behavior Using Social Saliency Fields
15 0.43170843 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
16 0.4267672 272 iccv-2013-Modifying the Memorability of Face Photographs
17 0.42673799 335 iccv-2013-Random Faces Guided Sparse Many-to-One Encoder for Pose-Invariant Face Recognition
18 0.4205814 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
19 0.41446462 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
20 0.41171578 393 iccv-2013-Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos
topicId topicWeight
[(2, 0.103), (7, 0.045), (12, 0.034), (26, 0.083), (31, 0.092), (34, 0.019), (40, 0.011), (42, 0.119), (48, 0.016), (64, 0.06), (73, 0.029), (77, 0.143), (84, 0.013), (89, 0.119), (97, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.87687618 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
2 0.85658324 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
Author: Quanfu Fan, Prasad Gabbur, Sharath Pankanti
Abstract: Effective reduction of false alarms in large-scale video surveillance is rather challenging, especially for applications where abnormal events of interest rarely occur, such as abandoned object detection. We develop an approach to prioritize alerts by ranking them, and demonstrate its great effectiveness in reducing false positives while keeping good detection accuracy. Our approach benefits from a novel representation of abandoned object alerts by relative attributes, namely staticness, foregroundness and abandonment. The relative strengths of these attributes are quantified using a ranking function [19] learnt on suitably designed low-level spatial and temporal features. These attributes of varying strengths are not only powerful in distinguishing abandoned objects from false alarms such as people and light artifacts, but also computationally efficient for large-scale deployment. With these features, we apply a linear ranking algorithm to sort alerts according to their relevance to the end-user. We test the effectiveness of our approach on both public data sets and large ones collected from the real world.
3 0.84367734 83 iccv-2013-Complementary Projection Hashing
Author: Zhongming Jin, Yao Hu, Yue Lin, Debing Zhang, Shiding Lin, Deng Cai, Xuelong Li
Abstract: Recently, hashing techniques have been widely applied to solve the approximate nearest neighbors search problem in many vision applications. Generally, these hashing approaches generate 2^c buckets, where c is the length of the hash code. A good hashing method should satisfy the following two requirements: 1) mapping the nearby data points into the same bucket or nearby (measured by the Hamming distance) buckets. 2) all the data points are evenly distributed among all the buckets. In this paper, we propose a novel algorithm named Complementary Projection Hashing (CPH) to find the optimal hashing functions which explicitly considers the above two requirements. Specifically, CPH aims at sequentially finding a series of hyperplanes (hashing functions) which cross the sparse region of the data. At the same time, the data points are evenly distributed in the hypercubes generated by these hyperplanes. The experiments comparing with the state-of-the-art hashing methods demonstrate the effectiveness of the proposed method.
4 0.82096082 142 iccv-2013-Ensemble Projection for Semi-supervised Image Classification
Author: Dengxin Dai, Luc Van_Gool
Abstract: This paper investigates the problem of semi-supervised classification. Unlike previous methods to regularize classifying boundaries with unlabeled data, our method learns a new image representation from all available data (labeled and unlabeled) and performs plain supervised learning with the new feature. In particular, an ensemble of image prototype sets are sampled automatically from the available data, to represent a rich set of visual categories/attributes. Discriminative functions are then learned on these prototype sets, and images are represented by the concatenation of their projected values onto the prototypes (similarities to them) for further classification. Experiments on four standard datasets show three interesting phenomena: (1) our method consistently outperforms previous methods for semi-supervised image classification; (2) our method lets itself combine well with these methods; and (3) our method works well for self-taught image classification where unlabeled data are not coming from the same distribution as labeled ones, but rather from a random collection of images.
5 0.81126285 181 iccv-2013-Frustratingly Easy NBNN Domain Adaptation
Author: Tatiana Tommasi, Barbara Caputo
Abstract: Over the last years, several authors have signaled that state of the art categorization methods fail to perform well when trained and tested on data from different databases. The general consensus in the literature is that this issue, known as domain adaptation and/or dataset bias, is due to a distribution mismatch between data collections. Methods addressing it go from max-margin classifiers to learning how to modify the features and obtain a more robust representation. The large majority of these works use BOW feature descriptors, and learning methods based on imageto-image distance functions. Following the seminal work of [6], in this paper we challenge these two assumptions. We experimentally show that using the NBNN classifier over existing domain adaptation databases achieves always very strong performances. We build on this result, and present an NBNN-based domain adaptation algorithm that learns iteratively a class metric while inducing, for each sample, a large margin separation among classes. To the best of our knowledge, this is the first work casting the domain adaptation problem within the NBNN framework. Experiments show that our method achieves the state of the art, both in the unsupervised and semi-supervised settings.
6 0.79967874 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
7 0.79250801 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
8 0.79158938 38 iccv-2013-Action Recognition with Actons
9 0.78775686 227 iccv-2013-Large-Scale Image Annotation by Efficient and Robust Kernel Metric Learning
10 0.78744864 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
11 0.78741825 20 iccv-2013-A Max-Margin Perspective on Sparse Representation-Based Classification
12 0.78723961 80 iccv-2013-Collaborative Active Learning of a Kernel Machine Ensemble for Recognition
13 0.78327346 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
14 0.78323305 277 iccv-2013-Multi-channel Correlation Filters
15 0.78260303 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
16 0.78106391 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
17 0.78070021 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
18 0.77756166 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
19 0.77674419 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
20 0.77623606 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition