cvpr cvpr2013 cvpr2013-416 knowledge-graph by maker-knowledge-mining

416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision


Source: pdf

Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance. [sent-7, score-0.29]

2 In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. [sent-8, score-0.692]

3 This creates the unprecedented opportunity to harness these devices and use information about eye, head, and body movements to inform intelligent systems about the content that we find interesting and the tasks that we are trying to perform. [sent-15, score-0.286]

4 This is particularly true in the case of gaze behavior, which provides direct insight into a person’s interests and intent. [sent-16, score-0.583]

5 We envision a day when reliable eye tracking can be performed using standard front facing cameras, making it possible for visual imagery to be tagged with individualized interpretations of content, each a unique “story” simply through the act of a person viewing their favorite images and videos. [sent-17, score-0.333]

6 Bottom: objects described by people and detected objects from each method (green - correct, blue - incorrect). [sent-20, score-0.189]

7 symbiotic relationship might be exploited to better analyze and index content that people find important. [sent-21, score-0.294]

8 Recent advances have started to look at problems of recognition at a human scale, classifying or localizing thousands of object categories with reasonable accuracy [19, 24, 5, 6, 18]. [sent-26, score-0.263]

9 Information from Gaze It has long been known that eye movements are not directly determined by an image, but are also influenced by task [33]. [sent-34, score-0.297]

10 The clearest examples of this come from the extensive literature on eye movements during visual search [8, 21, 34, 35]; specifying different targets yields different patterns of eye movements even for the same image. [sent-35, score-0.654]

11 However, clear relationships also exist between the properties of an image and the eye movements that people make during free viewing. [sent-36, score-0.469]

12 For example, when presented with a complex scene, people overwhelmingly choose to direct their initial fixations toward the center of the image [27], probably in an attempt to maximize extraction of information from the scene [27]. [sent-37, score-0.464]

13 Figure/ground relationships play a role as well; people prefer to look at objects even when the background is made more relevant to the task [22]. [sent-38, score-0.31]

14 All things being equal, eye movements also tend to be directed to corners and regions of high feature density [20, 29], sudden onsets [30, 31], object motion [14, 15], and regions of brightness, texture, and color contrast [16, 17, 23]. [sent-39, score-0.417]

15 The focus of our experiments is on less well explored semantic factors: how categories of objects or events might influence gaze [9], and how we can use gaze to predict semantic categories. [sent-41, score-1.322]

16 Second, the patterns of saccades and fixations made during image viewing might be used as a direct indication of content information. [sent-44, score-0.593]

17 To the extent that gaze is drawn to oddities and inconsistencies in a scene [28], fixations might also serve to predict unusual events [1]. [sent-45, score-0.961]

18 There are many recognition tasks that could benefit from gaze information. [sent-49, score-0.583]

19 Rather than applying object detectors at every location in an image arbitrarily, they could be more intelligently applied only at important locations as indicated by gaze fixations. [sent-52, score-0.665]

20 Humans can provide: • Passive indications of content through gaze patterns. [sent-56, score-0.731]

21 In this paper we describe several combined behavioral-computational experiments aimed at exploring the relationships between the pixels in an image, the eye movements that people make while viewing that image, and the words that they produce when asked to describe it. [sent-66, score-0.589]

22 For these experiments we have collected gaze fixations and some descriptions for images from two commonly used computer vision datasets. [sent-68, score-0.999]

23 Dataset & Experimental Settings We investigate the relationships between eye movements, description, image content, and computational recognition algorithms using images from two standard computer vision datasets, the Pascal VOC dataset [10] and the SUN 2009 dataset [3]. [sent-74, score-0.25]

24 These descriptions generally describe the main image content (objects), relationships, and sometimes the overall scene. [sent-80, score-0.226]

25 We train 22 deformable part model object detectors [12] using images with associated bounding boxes from ImageNet [7]. [sent-83, score-0.23]

26 These categories were selected to cover, as much as possible, the main object content of our selected scene images. [sent-84, score-0.283]

27 Eye movements were recorded during this time using a remote eye tracker (EL1000) sampling at 1000 Hz. [sent-88, score-0.32]

28 Image descriptions were not collected from observers during the experiment, as we wanted to examine the general relationships between gaze and description that hold across different people. [sent-89, score-0.811]

29 Figure 2 shows an example gaze pattern and description. [sent-96, score-0.583]

30 2), and 3) What is the relationship between what people look at and what they describe? [sent-103, score-0.259]

31 To verify that the selected object categories (20 PASCAL categories and 22 classes from SUN09) represent the interesting content of these images, we first need to validate to what degree people actually look at these objects. [sent-119, score-0.321]

32 Hence, we first compute how many fixations fall into the image regions corresponding to selected object categories. [sent-122, score-0.439]

33 57% of fixations fall into selected object category bounding boxes for the PASCAL and Sun09 datasets respectively. [sent-125, score-0.638]

34 Therefore, while these objects do reasonably cover human fixation locations, they do not represent all of the fixated image content. [sent-126, score-0.711]

35 Gaze vs Object Type: Here we explore which objects tend to attract the most human attention by computing the rate of fixation for each category. [sent-127, score-0.402]

36 NF(I, b) denotes the normalized percentage of fixations that fall into bounding box b in image I. [sent-129, score-0.54]
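
A minimal sketch of how such a per-box fixation rate could be computed. The exact form of Equation 3 is not reproduced in this summary, so normalizing by the image's total fixation count is an assumption, and the box/fixation formats are illustrative.

    import numpy as np

    def normalized_fixation_percentage(fixations, box):
        # fixations: (N, 2) array of (x, y) fixation coordinates for one image
        # box: (xmin, ymin, xmax, ymax) bounding box b
        if len(fixations) == 0:
            return 0.0
        x, y = fixations[:, 0], fixations[:, 1]
        xmin, ymin, xmax, ymax = box
        inside = (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)
        # NF(I, b): share of the image's fixations that land inside box b
        return float(inside.sum()) / len(fixations)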

37 In the SUN dataset, people are more likely to look at content elements like televisions (if they are on), people, and ovens than objects like rugs or cabinets. [sent-134, score-0.359]

38 We also study the overall fixation rate for each category (results are shown in Figure 4). [sent-135, score-0.312]

39 We evaluate this in two ways: 1) by computing the average percentage of fixated instances for each category (blue bars), and 2) by computing the percentage of images where at least one instance of a category was fixated when present (red bars). [sent-136, score-0.974]
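
A small illustrative sketch of these two statistics; the per-image data structure below is hypothetical, and pooling all instances of a category (rather than averaging per image) is an assumed reading of "average percentage of fixated instances".

    from collections import defaultdict

    def category_fixation_rates(images):
        # images: list of dicts mapping category -> list of booleans, one per
        # instance, saying whether that instance received at least one fixation
        per_instance = defaultdict(list)   # blue bars: fraction of fixated instances
        per_image = defaultdict(list)      # red bars: images with >= 1 fixated instance
        for img in images:
            for category, flags in img.items():
                per_instance[category].extend(flags)
                per_image[category].append(any(flags))
        blue = {c: sum(v) / float(len(v)) for c, v in per_instance.items()}
        red = {c: sum(v) / float(len(v)) for c, v in per_image.items()}
        return blue, red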

40 While viewers will probably not take the time to look at every single sheep in the image, if sheep are important then they are likely to look at at least one sheep in the image. [sent-140, score-0.371]

41 We find that while only 45% of all sheep in images are fixated, at least one sheep is fixated in 97% of images containing sheep. [sent-141, score-0.488]

42 We also find that object categories like person, cat, or dog are nearly always fixated on while more common scene elements like curtains or potted plants are fixated on much less frequently. [sent-142, score-0.964]

43 Gaze vs Location on Objects: Here we explore the gaze patterns people produce for different object categories, examining how the patterns vary across categories, and whether bounding boxes are a reasonable representation for object localization (as indicated by gaze patterns on objects). [sent-143, score-1.647]

44 To analyze location information from fixations, we first transform fixations into a density map. [sent-144, score-0.401]

45 For a given image, a two-dimensional Gaussian distribution that models the human visual system with appropriately chosen parameters is centered at each fixation point. [sent-145, score-0.326]

46 Then, a fixation density map is calculated by summing the Gaussians over the entire image. [sent-148, score-0.311]
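
One plausible implementation of this density map: place a (optionally duration-weighted) impulse at each fixation and blur with an isotropic Gaussian, which is equivalent to summing Gaussians centered at the fixations. The sigma value is a stand-in for the paper's "appropriately chosen parameters", which are not given in this summary.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def fixation_density_map(fixations, image_shape, sigma=25.0, durations=None):
        # fixations: list of (x, y) points; image_shape: (height, width)
        # sigma (pixels) is an assumed stand-in for a visual-angle-based value
        h, w = image_shape
        impulses = np.zeros((h, w), dtype=np.float64)
        if durations is None:
            durations = [1.0] * len(fixations)
        for (x, y), d in zip(fixations, durations):
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < h and 0 <= xi < w:
                impulses[yi, xi] += d  # optionally weight by fixation duration
        # Blurring the impulse map sums a Gaussian centered at each fixation
        return gaussian_filter(impulses, sigma=sigma)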

47 For each category, we average the fixation density maps. Figure 4 ((a) PASCAL, (b) SUN09): blue bars show the average percentage of fixated instances per category. [sent-149, score-0.823]

48 Red bars show the percentage of images where a category was fixated when present (at least one fixated instance in an image). [sent-150, score-0.937]

49 Figure 5: Examples of average fixation density maps for (a) person, (b) horse, (c) tvmonitor, (d) bicycle, (e) chair, and (f) diningtable. [sent-151, score-0.311]

50 over the ground truth bounding boxes to create an “average” fixation density map for that category. [sent-153, score-0.459]

51 Figure 5 shows how gaze patterns differ for example object categories. [sent-154, score-0.65]

52 We find that when people look at an animal such as a person or horse (5a, 5b), they tend to look near the animal’s head. [sent-155, score-0.463]

53 For some categories such as bicycle or chair (5d, 5e), which tend to have people sitting on them, we find that fixations are pulled toward the top/middle of the bounding box. [sent-156, score-0.74]

54 For other categories like tv monitor (5c), people tend to look at the center of the object. [sent-158, score-0.34]

55 This observation suggests that designing or training different gaze models for different categories could potentially be useful for recognizing what someone is looking at. [sent-159, score-0.726]

56 We compute the percentage of fixations that fall into the true object segmentations. [sent-161, score-0.442]

57 We measure what percentage of the bounding box is part of the segmented object, and what percentage of the human fixations in that bounding box fall in the segmented object. [sent-170, score-0.796]
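
A rough sketch of these two percentages, assuming a binary segmentation mask and integer box coordinates (both hypothetical conveniences, not specified in the summary).

    import numpy as np

    def box_and_fixation_coverage(seg_mask, box, fixations):
        # seg_mask: HxW boolean array, True on the segmented object
        # box: (xmin, ymin, xmax, ymax); fixations: list of (x, y)
        xmin, ymin, xmax, ymax = box
        crop = seg_mask[ymin:ymax, xmin:xmax]
        area_pct = crop.mean() if crop.size else 0.0  # fraction of box that is object
        in_box = [(x, y) for x, y in fixations
                  if xmin <= x < xmax and ymin <= y < ymax]
        if not in_box:
            return area_pct, 0.0
        on_object = sum(bool(seg_mask[int(y), int(x)]) for x, y in in_box)
        # fraction of the box's fixations that land on the segmented object
        return area_pct, on_object / float(len(in_box))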

58 We compare the extracted nouns to our selected object categories using WordNet distance [32] and keep nouns with small WordNet distance. [sent-178, score-0.204]
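
A hedged sketch of this noun-to-category matching using NLTK's WordNet interface; path similarity and the 0.5 cutoff are assumptions, since the exact distance measure [32] and threshold are not given in this summary (requires the NLTK WordNet corpus to be installed).

    from nltk.corpus import wordnet as wn

    def match_noun_to_category(noun, categories, min_similarity=0.5):
        # Return the first detector category the described noun is close to in
        # WordNet, or None. Measure and threshold are illustrative assumptions.
        noun_synsets = wn.synsets(noun, pos=wn.NOUN)
        for category in categories:
            for cat_synset in wn.synsets(category, pos=wn.NOUN):
                for noun_synset in noun_synsets:
                    sim = noun_synset.path_similarity(cat_synset)
                    if sim is not None and sim >= min_similarity:
                        return category
        return None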

59 Previous work has shown that object categories are described preferentially [2]. [sent-184, score-0.184]

60 We examine the relationship between gaze and description by studying: 1) whether subjects look at the objects they describe, and 2) whether subjects describe the objects they look at. [sent-189, score-1.093]

61 We find that there is a strong relationship between gaze and description in both datasets. [sent-194, score-0.686]

62 In the PASCAL dataset, for categories aeroplane, bus, cat, cow, horse, motorbike, person, and sofa, people tend to look much more within the detection boxes with high scores. [sent-202, score-0.427]

63 For other categories, people tend to fixate evenly across detection boxes. [sent-203, score-0.23]

64 Gaze-Enabled Computer Vision: In this section, we discuss the implications of human gaze as a potential signal for two computer vision tasks: object detection and image annotation. [sent-206, score-0.729]

65 Analysis of human gaze with object detectors: We first examine correlations between the confidence of visual detection systems and fixation. [sent-209, score-0.78]

66 Positive or negative correlations give us insight into whether fixations have the potential to improve detection performance. [sent-210, score-0.449]

67 In this experiment, we compute detection score versus fixation rate (Equation 3). [sent-211, score-0.311]

68 In general, we find that observers look at bounding boxes with high confidence scores more often, but that detections with lower confidence scores are also sometimes fixated. [sent-213, score-0.365]

69 As indicated by our previous studies, in general some categories are fixated more often than others, suggesting that we might focus on integrating gaze and computer vision predictions in a category specific manner. [sent-214, score-1.126]

70 Given these observations, we also measure for what percentage of cases fixations could provide useful or detrimental evidence for object detection. [sent-215, score-0.447]

71 In this experiment, we select the bounding boxes output by the detectors at their selected default thresholds. [sent-216, score-0.252]

72 For these cases, gaze cannot possibly help to improve the result. 2) There are both true positive (TP) and false positive (FP) boxes. [sent-218, score-0.611]

73 Figure 7: Analysis of where gaze could potentially decrease (yellow), increase (pink), or not affect (green & blue) performance of detection. [sent-315, score-0.583]

74 In some of these cases there will be more fixations falling into an FP box than into a TP box. [sent-317, score-0.403]

75 In these cases it is likely that adding gaze information could hurt object detection performance (yellow bars). [sent-318, score-0.704]

76 3) In other cases, where we have more fixations in a TP box than in any other FP box, gaze has the potential to improve object detection (pink bars). [sent-319, score-1.094]

77 4) Green bars show detections where the object detector already provides the correct answer and no FP boxes overlap with the ground truth (therefore adding gaze will neither hurt nor help these cases). [sent-320, score-0.843]

78 We first consider the simplest possible algorithm: filter out all detected bounding boxes that do not contain any fixations (or, conversely, run object detectors only on parts of the image containing fixations). [sent-324, score-0.581]
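
A minimal sketch of this filtering baseline; the detection and box formats are assumed for illustration.

    def filter_detections_by_gaze(detections, fixations):
        # detections: list of (box, score) with box = (xmin, ymin, xmax, ymax)
        # fixations: list of (x, y); keep only boxes containing >= 1 fixation
        kept = []
        for box, score in detections:
            xmin, ymin, xmax, ymax = box
            if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in fixations):
                kept.append((box, score))
        return kept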

79 At the same time, it also removes a lot of true positive boxes for objects that are less likely to be fixated, such as bottle and plant, resulting in improvements for some categories but overall decreased detection performance (Table 3 shows detection performance on the 20 PASCAL categories). [sent-326, score-0.585]

80 For gaze features, we first create a fixation density map for each image (as described in Section 3. [sent-329, score-0.894]

81 To remove outliers, fixation density maps are weighted by fixation duration [13]. [sent-331, score-0.572]

82 Then, we compute the average fixation density map per image across viewers. [sent-332, score-0.311]

83 To compute gaze features of each detection box, we calculate the average and the maximum of the fixation density map inside of the detection box. [sent-333, score-0.994]

84 Then, the final gaze feature of each box is a three-dimensional feature vector (e.g. [sent-334, score-0.635]

85 detection score, and the average and maximum of the fixation density map). [sent-335, score-0.361]
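
A sketch of this gaze feature and of a classifier trained on it over candidate boxes. The density map is assumed to be the viewer-averaged, duration-weighted map described above; a linear SVM is an illustrative choice, since this summary does not name the classifier.

    import numpy as np
    from sklearn.svm import LinearSVC

    def gaze_feature(box, score, density_map):
        # Feature: [detection score, mean density in box, max density in box]
        xmin, ymin, xmax, ymax = [int(v) for v in box]
        patch = density_map[ymin:ymax, xmin:xmax]
        if patch.size == 0:
            return np.array([score, 0.0, 0.0])
        return np.array([score, patch.mean(), patch.max()])

    def train_gaze_classifier(features, labels):
        # features: list of 3-d vectors for candidate boxes of one category;
        # labels: 1 = true positive, 0 = false positive
        clf = LinearSVC(C=1.0)
        clf.fit(np.vstack(features), np.asarray(labels))
        return clf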

86 However, for training our gaze classifier, we also consider bounding boxes with detection scores somewhat lower than the default threshold, using a more generous criterion (i.e. [sent-339, score-0.812]

87 We generally find gaze helps improve object detection on categories that are usually fixated, while it can hurt those that are not fixated (e.g. [sent-350, score-1.543]

88 Since people often look at planes, gaze-enabled classifiers could increase this confusion. [sent-355, score-0.213]

89 Annotation Prediction: We evaluate the applicability of gaze to another end-user application, image annotation, i.e., outputting a set of object tags for an image. [sent-360, score-0.654]

90 Here, we consider a successful annotation to be one that matches the set of objects a person describes when viewing the image. [sent-361, score-0.203]

91 2), we find gaze to be a useful cue for annotation. [sent-364, score-0.583]

92 Overall, both simple filtering and classification improve average annotation performance (Table 4), and are especially helpful for those categories that tend to draw fixations and description, e. [sent-365, score-0.515]
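
Building on the gaze_feature and per-category classifier sketches above, one plausible way to turn gaze-aware detections into an annotation tag set; the acceptance rule (tag a category if any of its boxes is accepted) is an assumption.

    def predict_annotation(detections_by_category, density_map, classifiers, gaze_feature):
        # detections_by_category: {category: [(box, score), ...]} for one image
        # classifiers: {category: trained classifier over the 3-d gaze feature}
        tags = set()
        for category, detections in detections_by_category.items():
            clf = classifiers[category]
            for box, score in detections:
                feat = gaze_feature(box, score, density_map).reshape(1, -1)
                if clf.predict(feat)[0] == 1:  # box accepted -> tag the category
                    tags.add(category)
                    break
        return tags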

93 Conclusion and Future Work: In this paper, through a series of behavioral studies and experimental evaluations, we explored the information contained in eye movements and description and analyzed their relationship with image content. [sent-372, score-0.4]

94 We also examined the complex relationships between human gaze and outputs of current visual detection methods. [sent-373, score-0.757]

95 Modelling search for people in 900 scenes: A combined source model of eye guidance. [sent-455, score-0.258]

96 Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. [sent-495, score-0.335]

97 Gaze-enabled detection improves over the baseline for objects that people often fixate on (e. [sent-510, score-0.232]

98 The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. [sent-605, score-0.335]

99 The long and the short of it: Spatial statistics at fixation vary with saccade amplitude and task. [sent-614, score-0.261]

100 A theory of eye movements during target acquisition. [sent-640, score-0.297]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gaze', 0.583), ('fixated', 0.374), ('fixations', 0.351), ('fixation', 0.261), ('movements', 0.152), ('eye', 0.145), ('pascal', 0.119), ('people', 0.113), ('content', 0.108), ('look', 0.1), ('categories', 0.091), ('bars', 0.076), ('bounding', 0.075), ('viewing', 0.074), ('boxes', 0.073), ('descriptions', 0.065), ('percentage', 0.062), ('relationships', 0.059), ('preferentially', 0.059), ('sheep', 0.057), ('description', 0.057), ('rashtchian', 0.056), ('person', 0.054), ('potted', 0.052), ('someone', 0.052), ('cat', 0.052), ('box', 0.052), ('category', 0.051), ('density', 0.05), ('detection', 0.05), ('detectors', 0.048), ('observers', 0.047), ('relationship', 0.046), ('chair', 0.043), ('wordnet', 0.04), ('indications', 0.04), ('sec', 0.04), ('plant', 0.04), ('detections', 0.04), ('gazeenabled', 0.039), ('inanimate', 0.039), ('iox', 0.039), ('jenc', 0.039), ('neider', 0.039), ('timeframe', 0.039), ('cognition', 0.039), ('dog', 0.039), ('objects', 0.038), ('human', 0.038), ('annotation', 0.037), ('hurt', 0.037), ('tend', 0.036), ('humans', 0.036), ('object', 0.034), ('fp', 0.034), ('itti', 0.034), ('horse', 0.033), ('patterns', 0.033), ('imagery', 0.033), ('brook', 0.032), ('onybrook', 0.032), ('stony', 0.032), ('psychology', 0.031), ('bicycle', 0.031), ('default', 0.031), ('fixate', 0.031), ('overt', 0.031), ('sometimes', 0.03), ('subjects', 0.03), ('cow', 0.03), ('nf', 0.029), ('deng', 0.029), ('fall', 0.029), ('imagenet', 0.029), ('explore', 0.029), ('amazon', 0.029), ('false', 0.028), ('positives', 0.027), ('collaborative', 0.027), ('might', 0.027), ('visual', 0.027), ('animal', 0.027), ('pink', 0.027), ('dining', 0.027), ('nouns', 0.027), ('dte', 0.027), ('attentional', 0.027), ('berg', 0.027), ('voc', 0.026), ('language', 0.026), ('inform', 0.026), ('selected', 0.025), ('motorbike', 0.025), ('potential', 0.024), ('whether', 0.024), ('amt', 0.023), ('remote', 0.023), ('surprise', 0.023), ('describe', 0.023), ('dataset', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.

2 0.43014577 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection

Author: Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, Lihi Zelnik-Manor

Abstract: During recent years remarkable progress has been made in visual saliency modeling. Our interest is in video saliency. Since videos are fundamentally different from still images, they are viewed differently by human observers. For example, the time each video frame is observed is a fraction of a second, while a still image can be viewed leisurely. Therefore, video saliency estimation methods should differ substantially from image saliency methods. In this paper we propose a novel method for video saliency estimation, which is inspired by the way people watch videos. We explicitly model the continuity of the video by predicting the saliency map of a given frame, conditioned on the map from the previous frame. Furthermore, accuracy and computation speed are improved by restricting the salient locations to a carefully selected candidate set. We validate our method using two gaze-tracked video datasets and show we outperform the state-of-the-art.

3 0.18038489 214 cvpr-2013-Image Understanding from Experts' Eyes by Modeling Perceptual Skill of Diagnostic Reasoning Processes

Author: Rui Li, Pengcheng Shi, Anne R. Haake

Abstract: Eliciting and representing experts' remarkable perceptual capability of locating, identifying and categorizing objects in images specific to their domains of expertise will benefit image understanding in terms of transferring human domain knowledge and perceptual expertise into image-based computational procedures. In this paper, we present a hierarchical probabilistic framework to summarize the stereotypical and idiosyncratic eye movement patterns shared within 11 board-certified dermatologists while they are examining and diagnosing medical images. Each inferred eye movement pattern characterizes the similar temporal and spatial properties of its corresponding segments of the experts' eye movement sequences. We further discover a subset of distinctive eye movement patterns which are commonly exhibited across multiple images. Based on the combinations of the exhibitions of these eye movement patterns, we are able to categorize the images from the perspective of experts' viewing strategies. In each category, images share similar lesion distributions and configurations. The performance of our approach shows that modeling physicians' diagnostic viewing behaviors informs about medical images' understanding to correct diagnosis.

4 0.11444117 364 cvpr-2013-Robust Object Co-detection

Author: Xin Guo, Dong Liu, Brendan Jou, Mojun Zhu, Anni Cai, Shih-Fu Chang

Abstract: Object co-detection aims at simultaneous detection of objects of the same category from a pool of related images by exploiting consistent visual patterns present in candidate objects in the images. The related image set may contain a mixture of annotated objects and candidate objects generated by automatic detectors. Co-detection differs from the conventional object detection paradigm in which detection over each test image is determined one-by-one independently without taking advantage of common patterns in the data pool. In this paper, we propose a novel, robust approach to dramatically enhance co-detection by extracting a shared low-rank representation of the object instances in multiple feature spaces. The idea is analogous to that of the well-known Robust PCA [28], but has not been explored in object co-detection so far. The representation is based on a linear reconstruction over the entire data set and the low-rank approach enables effective removal of noisy and outlier samples. The extracted low-rank representation can be used to detect the target objects by spectral clustering. Extensive experiments over diverse benchmark datasets demonstrate consistent and significant performance gains of the proposed method over the state-of-the-art object codetection method and the generic object detection methods without co-detection formulations.

5 0.11144113 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

6 0.1097272 103 cvpr-2013-Decoding Children's Social Behavior

7 0.10881326 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

8 0.10237738 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

9 0.094276249 273 cvpr-2013-Looking Beyond the Image: Unsupervised Learning for Object Saliency and Detection

10 0.085029103 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

11 0.082420766 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

12 0.081347018 325 cvpr-2013-Part Discovery from Partial Correspondence

13 0.078516252 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)

14 0.078412697 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

15 0.076524615 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

16 0.072908834 163 cvpr-2013-Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

17 0.072489619 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

18 0.072377615 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels

19 0.072033308 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes

20 0.07183437 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, -0.084), (2, 0.097), (3, -0.021), (4, 0.028), (5, 0.018), (6, 0.012), (7, 0.044), (8, 0.048), (9, 0.041), (10, -0.043), (11, -0.049), (12, 0.037), (13, -0.059), (14, -0.013), (15, 0.011), (16, 0.052), (17, 0.067), (18, -0.049), (19, -0.014), (20, -0.059), (21, -0.001), (22, 0.076), (23, -0.001), (24, 0.023), (25, -0.028), (26, 0.071), (27, 0.012), (28, 0.018), (29, -0.093), (30, -0.057), (31, -0.026), (32, -0.035), (33, 0.014), (34, 0.027), (35, 0.034), (36, -0.004), (37, 0.16), (38, -0.123), (39, 0.04), (40, -0.108), (41, -0.002), (42, -0.079), (43, 0.138), (44, 0.019), (45, 0.022), (46, 0.049), (47, -0.158), (48, -0.019), (49, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89974123 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.

2 0.75251764 214 cvpr-2013-Image Understanding from Experts' Eyes by Modeling Perceptual Skill of Diagnostic Reasoning Processes

Author: Rui Li, Pengcheng Shi, Anne R. Haake

Abstract: Eliciting and representing experts' remarkable perceptual capability of locating, identifying and categorizing objects in images specific to their domains of expertise will benefit image understanding in terms of transferring human domain knowledge and perceptual expertise into image-based computational procedures. In this paper, we present a hierarchical probabilistic framework to summarize the stereotypical and idiosyncratic eye movement patterns shared within 11 board-certified dermatologists while they are examining and diagnosing medical images. Each inferred eye movement pattern characterizes the similar temporal and spatial properties of its corresponding segments of the experts' eye movement sequences. We further discover a subset of distinctive eye movement patterns which are commonly exhibited across multiple images. Based on the combinations of the exhibitions of these eye movement patterns, we are able to categorize the images from the perspective of experts' viewing strategies. In each category, images share similar lesion distributions and configurations. The performance of our approach shows that modeling physicians' diagnostic viewing behaviors informs about medical images' understanding to correct diagnosis.

3 0.68545842 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

4 0.59945744 103 cvpr-2013-Decoding Children's Social Behavior

Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye

Abstract: We introduce a new problem domain for activity recognition: the analysis of children ’s social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.

5 0.58137935 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

Author: Yun Jiang, Hema Koppula, Ashutosh Saxena

Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human object relationships. We then test our algorithm on the task ofattribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.

6 0.56866044 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

7 0.5646385 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection

8 0.51959854 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

9 0.51414675 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

10 0.51246601 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection

11 0.50186372 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

12 0.4995465 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling

13 0.48976383 417 cvpr-2013-Subcategory-Aware Object Classification

14 0.48958939 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

15 0.47788081 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

16 0.47284564 163 cvpr-2013-Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

17 0.4713819 440 cvpr-2013-Tracking People and Their Objects

18 0.46429068 72 cvpr-2013-Boundary Detection Benchmarking: Beyond F-Measures

19 0.46082091 364 cvpr-2013-Robust Object Co-detection

20 0.45993444 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.097), (16, 0.033), (26, 0.053), (28, 0.018), (33, 0.221), (60, 0.211), (67, 0.089), (69, 0.066), (72, 0.015), (77, 0.013), (87, 0.073), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87930459 279 cvpr-2013-Manhattan Scene Understanding via XSlit Imaging

Author: Jinwei Ye, Yu Ji, Jingyi Yu

Abstract: A Manhattan World (MW) [3] is composed of planar surfaces and parallel lines aligned with three mutually orthogonal principal axes. Traditional MW understanding algorithms rely on geometry priors such as the vanishing points and reference (ground) planes for grouping coplanar structures. In this paper, we present a novel single-image MW reconstruction algorithm from the perspective of non-pinhole cameras. We show that by acquiring the MW using an XSlit camera, we can instantly resolve coplanarity ambiguities. Specifically, we prove that parallel 3D lines map to 2D curves in an XSlit image and they converge at an XSlit Vanishing Point (XVP). In addition, if the lines are coplanar, their curved images will intersect at a second common pixel that we call Coplanar Common Point (CCP). CCP is a unique image feature in XSlit cameras that does not exist in pinholes. We present a comprehensive theory to analyze XVPs and CCPs in a MW scene and study how to recover 3D geometry in a complex MW scene from XVPs and CCPs. Finally, we build a prototype XSlit camera by using two layers of cylindrical lenses. Experimental results on both synthetic and real data show that our new XSlit-camera-based solution provides an effective and reliable solution for MW understanding.

same-paper 2 0.82723373 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.

3 0.77660346 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

4 0.77268809 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu

Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods[24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplarbased face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other facerelated tasks, such as attribute recognition, as well as general object detection.

5 0.7694025 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

Author: Jianguo Li, Yimin Zhang

Abstract: This paper presents a novel learning framework for training boosting cascade based object detector from large scale dataset. The framework is derived from the wellknown Viola-Jones (VJ) framework but distinguished by three key differences. First, the proposed framework adopts multi-dimensional SURF features instead of single dimensional Haar features to describe local patches. In this way, the number of used local patches can be reduced from hundreds of thousands to several hundreds. Second, it adopts logistic regression as weak classifier for each local patch instead of decision trees in the VJ framework. Third, we adopt AUC as a single criterion for the convergence test during cascade training rather than the two trade-off criteria (false-positive-rate and hit-rate) in the VJ framework. The benefit is that the false-positive-rate can be adaptive among different cascade stages, and thus yields much faster convergence speed of SURF cascade. Combining these points together, the proposed approach has three good properties. First, the boosting cascade can be trained very efficiently. Experiments show that the proposed approach can train object detectors from billions of negative samples within one hour even on personal computers. Second, the built detector is comparable to the stateof-the-art algorithm not only on the accuracy but also on the processing speed. Third, the built detector is small in model-size due to short cascade stages.

6 0.7693187 325 cvpr-2013-Part Discovery from Partial Correspondence

7 0.76914191 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

8 0.76809752 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

9 0.76776636 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

10 0.76637673 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

11 0.76631337 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

12 0.76602232 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

13 0.76579028 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

14 0.7657215 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

15 0.76551914 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

16 0.76545602 414 cvpr-2013-Structure Preserving Object Tracking

17 0.76512277 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

18 0.76495254 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

19 0.76388872 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection

20 0.76372367 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence