cvpr cvpr2013 cvpr2013-73 knowledge-graph by maker-knowledge-mining

73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction


Source: pdf

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. [sent-3, score-0.313]

2 The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. [sent-4, score-0.384]

3 In this paper, we propose studying semantic information in abstract images created from collections of clip art. [sent-6, score-0.651]

4 Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. [sent-9, score-0.343]

5 Finding analogous sets of semantically similar real images would be nearly impossible. [sent-10, score-0.389]

6 We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. [sent-11, score-0.646]

7 We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity. [sent-12, score-0.756]

8 Introduction A fundamental goal of computer vision is to discover the semantically meaningful information contained within an image. [sent-14, score-0.348]

9 Similarly humans may deem two images as semantically similar, even though the arrangement or even the presence of objects may vary dramatically. [sent-17, score-0.401]

10 Discovering the subset of image specific information that is semantically meaningful remains a challenging area of research. [sent-18, score-0.348]

11 Numerous works have explored related areas, including predicting the salient locations in an image [17, 26], ranking the relative importance of visible objects [1, 5, 16, 31] and semantically interpreting images [7, 18, 24, 38]. [sent-19, score-0.5]

12 Unlike traditional approaches that use real images, we hypothesize that the same information can be learned from abstract images rendered from a collection of clip art, as shown in Figure 1. [sent-30, score-0.473]

13 Even with a limited set of clip art, the variety and complexity of semantic information that can be conveyed … [sent-31, score-0.564]

14 Figure 2. An illustration of the clip art used to create the children (left) and the other available objects (right). [sent-32, score-0.735]

15 For instance, clip art can correspond to different attributes of an object, such as a person’s pose, facial expression or clothing. [sent-34, score-0.715]

16 Using abstract images, even complex relation information can be easily computed given the relative placement of the clip art, such as “Is the person holding an object?” [sent-40, score-0.563]

17 We accomplish this by first asking human subjects to generate novel scenes and corresponding written descriptions. [sent-43, score-0.4]

18 Next, multiple human subjects are asked to generate scenes depicting the same written description without any knowledge of the original scene’s appearance. [sent-44, score-0.578]

19 The result is a set of different scenes with similar semantic meaning, as shown in Figure 1. [sent-45, score-0.348]

20 Collecting analogous sets of semantically similar real images would be prohibitively difficult. [sent-46, score-0.389]

21 We envision this to be useful for studying a wide variety of tasks, such as generating semantic descriptions of images, or text-based image search. [sent-48, score-0.345]

22 We measure the mutual information between visual features and the semantic classes to discover which visual features are most semantically meaningful. [sent-50, score-0.534]

23 Our semantic classes are defined using sets of semantically similar scenes depicting the same written description. [sent-51, score-0.859]

24 We show the relative importance of various features, such as the high importance of a person’s facial expression or the occurrence of a dog, and the relatively low importance of some spatial relations. [sent-52, score-0.516]

25 We compute semantically similar nearest neighbors using a metric learning approach [35]. [sent-58, score-0.307]
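The exact metric learning method [35] is not detailed in this summary; the sketch below is a generic stand-in that learns a linear metric so that scenes from the same semantically similar set become nearest neighbors, using scikit-learn's NeighborhoodComponentsAnalysis. The variable names and toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact method [35]): learn a linear metric so that
# scenes from the same semantically similar set become nearest neighbors.
# X is an (n_scenes, n_features) array of scene features; y gives the id of the
# semantically similar set each scene belongs to. All data here is random placeholder data.
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis, NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 50))            # placeholder scene features
y = np.repeat(np.arange(20), 10)     # 20 sets of 10 "semantically similar" scenes

nca = NeighborhoodComponentsAnalysis(n_components=20, random_state=0)
X_embedded = nca.fit_transform(X, y)

# Retrieve semantically similar scenes as nearest neighbors in the learned space.
nn = NearestNeighbors(n_neighbors=6).fit(X_embedded)
_, neighbors = nn.kneighbors(X_embedded[:1])
print(neighbors)                     # indices of the query scene and its 5 nearest scenes
```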

26 Through our various experiments, we study what aspects of the scenes are semantically important. [sent-59, score-0.534]

27 We hypothesize that by analyzing the set of semantically important features in abstract images, we may better understand what information needs to be gathered for semantic understanding in all types of visual data, including real images. [sent-60, score-0.784]

28 For instance, methods generating novel sentences rely on the automatic detection of objects [9] and attributes [2, 6, 25], and use language statistics [38] or spatial relationships [18] for verb prediction. [sent-65, score-0.33]

29 Works in learning semantic attributes [2, 6, 25] are becoming popular for enabling humans and machines to communicate using natural language. [sent-67, score-0.328]

30 The use of semantic concepts such as scenes and objects has also been shown to be effective for video retrieval [20]. [sent-68, score-0.409]

31 However, our dataset has the unique property of having sets of semantically similar images, i.e. … [sent-70, score-0.343]

32 The works on attributes described above include the use of adjectives as well as nouns relating to parts of objects. [sent-78, score-0.357]

33 With this in mind, we choose to create abstract scenes of children playing outside. [sent-92, score-0.385]

34 Our goal is to create a set of scenes that are semantically similar. [sent-96, score-0.539]

35 First, we ask subjects on Amazon’s Mechanical Turk (AMT) to create scenes from a collection of clip art. [sent-98, score-0.368]

36 Next, a new set of subjects is asked to describe the scenes using a one- or two-sentence description. [sent-99, score-0.469]

37 Finally, semantically similar scenes are generated by asking multiple subjects to create scenes depicting the same written description. [sent-100, score-1.036]

38 Initial scene creation: Our scenes are created from a collection of 80 pieces of clip art created by an artist, as shown in Figure 2. [sent-102, score-0.852]

39 Clip art depicting a boy and a girl is created from seven different poses and five different facial expressions, resulting in 35 possible combinations for each (Figure 2, left). [sent-103, score-0.952]

40 56 pieces of clip art represent the other objects in the scene, including trees, toys, hats, animals, etc. [sent-104, score-0.604]

41 The subjects were given five pieces of clip art for both the boy and girl assembled randomly from the different facial expressions and poses. [sent-105, score-1.374]

42 The subjects were instructed to “create an illustration for a children’s story book by creating a realistic scene from the clip art below”. [sent-111, score-0.713]

43 At least six pieces of clip art were required to be used, and each piece of clip art could only be used once. [sent-112, score-1.034]

44 At most one boy and one girl could be added to the scene. [sent-113, score-0.579]

45 Each piece of clip art could be scaled using three fixed sizes and flipped horizontally. [sent-114, score-0.491]

46 The depth ordering was automatically computed using the type of clip art, e.g. [sent-115, score-0.394]

47 a hat should appear on top of the girl, and using the clip art scale. [sent-117, score-0.491]
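The exact layering rule is not spelled out beyond the type and scale cues mentioned above; the following is a hypothetical sketch of how such a rule could be implemented. The type names, priority values, and scale convention are assumptions chosen purely for illustration.

```python
# Hypothetical sketch of an automatic depth-ordering rule based on clip art type and
# scale, as described above: smaller scale (farther away) is drawn first, and within a
# depth layer a per-type priority puts wearables above people. Priorities are assumed.
TYPE_PRIORITY = {"background": 0, "animal": 1, "person": 1, "toy": 2, "wearable": 3}

def render_order(clip_art):
    """clip_art: list of dicts with 'name', 'type', 'scale' (0 = large/near, 2 = small/far)."""
    return sorted(clip_art, key=lambda c: (-c["scale"], TYPE_PRIORITY[c["type"]]))

scene = [
    {"name": "girl", "type": "person", "scale": 0},
    {"name": "hat", "type": "wearable", "scale": 0},
    {"name": "tree", "type": "background", "scale": 2},
]
print([c["name"] for c in render_order(scene)])  # ['tree', 'girl', 'hat']
```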

48 A simple interface was created that showed a single scene, and the subjects were asked to describe the scene using one or two sentences. [sent-122, score-0.347]

49 For those subjects who wished to use proper names in their descriptions, we provided the names “Mike” and “Jenny” for the boy and girl. [sent-123, score-0.461]

50 Generating semantically similar scenes: Finally, we generated sets of semantically similar scenes. [sent-126, score-0.65]

51 For this task, we asked subjects to generate scenes depicting the written descriptions. [sent-127, score-0.536]

52 By having multiple subjects generate scenes for each description, we can create sets of semantically similar scenes. [sent-128, score-0.711]

53 First, the subjects were given a written description of a scene and asked to create a scene depicting it. [sent-131, score-0.6]

54 Second, the clip art was randomly chosen as above, except we enforced that any clip art used in the original scene was also included. [sent-132, score-1.037]

55 As a result, on average about 25% of the clip art was from the original scene used to create the written description. [sent-133, score-0.689]

56 It is critical to ensure that objects mentioned in the written description are available to the subjects generating the new scenes. [sent-134, score-0.369]

57 However, this does introduce a bias, since subjects always have the option of choosing the clip art present in the original scene even if it is not described in the scene description. [sent-135, score-0.737]

58 Thus it is critical that a significant portion of the clip art remains randomly chosen. [sent-136, score-0.491]

59 That is, we have 1,002 sets of 10 scenes that are known to be semantically similar. [sent-140, score-0.503]

60 Semantic importance of visual features In this section, we examine the relative semantic importance of various scene properties or features. [sent-144, score-0.536]

61 For instance, the study of abstract scenes may help research in semantic scene understanding in real images by suggesting to researchers which properties are important to reliably detect. [sent-146, score-0.565]

62 To study the semantic importance of features, we need a quantitative measure of semantic importance. [sent-147, score-0.502]

63 In this paper, we use the mutual information shared between a specified feature and a set of classes representing semantically similar scenes. [sent-148, score-0.432]

64 In our dataset, we have 1,002 sets of semantically similar scenes, resulting in 1,002 classes. [sent-149, score-0.343]

65 For instance, if the MI between a feature and the classes is small, it indicates that the feature provides minimal information for determining whether scenes are semantically similar. [sent-151, score-0.508]

66 Next, we describe various sets of features and analyze their semantic importance using Equations (1) and (2). [sent-166, score-0.334]
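Equations (1) and (2) are not reproduced in this extracted summary. For reference, the standard definitions of mutual information (MI) and conditional mutual information (CMI), which they presumably correspond to, are given below for a discrete feature X, the class label Y indexing the 1,002 sets, and a conditioning variable Z.

```latex
% Standard definitions of MI and CMI for a discrete feature X, class label Y, and
% conditioning variable Z (presumably what Equations (1) and (2) of the paper compute).
\begin{align}
I(X;Y)        &= \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \\
I(X;Y \mid Z) &= \sum_{z} p(z) \sum_{x}\sum_{y} p(x,y \mid z)\,
                 \log\frac{p(x,y \mid z)}{p(x \mid z)\,p(y \mid z)}
\end{align}
```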

67 Occurrence: We begin by analyzing the simple features corresponding to the occurrence of the various objects that may exist in the scene. [sent-167, score-0.3]

68 In our dataset, there exist 58 object instances, since we group all of the variations of the boy together in one instance, and similarly for the girl. [sent-170, score-0.364]
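A minimal sketch of ranking occurrence features by their MI with the class labels is shown below; it uses scikit-learn's mutual_info_score on toy data, and the array names and shapes are assumptions rather than the authors' implementation.

```python
# Minimal sketch: score each binary occurrence feature by its mutual information with
# the class label that indexes the set of semantically similar scenes (toy data only).
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
occurrence = rng.integers(0, 2, size=(10020, 58))   # 58 grouped object instances (placeholder)
classes = np.repeat(np.arange(1002), 10)             # 1,002 sets of 10 scenes

mi_scores = np.array([mutual_info_score(classes, occurrence[:, j])
                      for j in range(occurrence.shape[1])])
ranking = np.argsort(-mi_scores)                      # most semantically informative objects first
print(ranking[:5], mi_scores[ranking[:5]])
```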

69 For instance, objects such as the bear, dog, girl or boy are more semantically meaningful than background objects such as trees or hats. [sent-175, score-1.008]

70 …3%) occur frequently but are less semantically important, whereas bears (11.… [sent-180, score-0.34]

71 Interestingly, the individual occurrences of the boy and girl have higher scores than the category people. [sent-183, score-0.777]

72 Person attributes: Since the occurrence of the boy and girl is semantically meaningful, it is likely their attributes are also semantically relevant. [sent-186, score-1.449]

73 The boy and girl clip art have five different facial expressions and seven different poses. [sent-187, score-1.186]

74 We compute the CMI of the person attributes conditioned upon the boy or girl being present. [sent-189, score-0.84]
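One simple way to realize this conditioning is to restrict the MI computation to scenes in which the person is present and weight by the probability of presence; the sketch below follows that assumption and is not necessarily the paper's exact formulation.

```python
# Sketch of conditional MI for a person attribute (e.g., the boy's facial expression),
# conditioned on the boy being present: restrict to scenes containing the boy and measure
# MI between the attribute and the class label there; the absent case carries no
# attribute information, so it contributes zero. This decomposition is an assumption.
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mi(attribute, classes, present):
    mask = present.astype(bool)
    if not mask.any():
        return 0.0
    # p(present) * I(attribute; class | present)
    return mask.mean() * mutual_info_score(classes[mask], attribute[mask])

rng = np.random.default_rng(0)
classes = np.repeat(np.arange(100), 10)               # toy class labels
boy_present = rng.integers(0, 2, size=classes.size)   # toy presence indicator
expression = rng.integers(0, 5, size=classes.size)    # 5 facial expressions
print(conditional_mi(expression, classes, boy_present))
```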

75 Interestingly, features that include combinations of the boy, girl and animals provide significant additional information. [sent-196, score-0.353]

76 Other features such as girl and balloons actually have high MI but low CMI, since balloons almost always occur with the girl in our dataset. [sent-197, score-0.69]

77 The mutual information measuring the dependence between classes of semantically similar scenes and the (left) occurrence of objects, (top) co-occurrence, relative depth and position, (middle) person attributes and (bottom) the position relative to the head and hand, and absolute position. [sent-202, score-1.292]

78 The pie chart shows the sum of the mutual information or conditional mutual information scores for all features. [sent-204, score-0.333]

79 The probability of occurrence of each piece of clip art is shown to the left. [sent-205, score-0.64]

80 Intuitively, the position of the boy and girl provide the most additional information, whereas the location of toys and hats matters less. [sent-208, score-0.716]

81 For instance, a boy holding a hamburger implies eating, whereas a hamburger sitting on a table does not. [sent-212, score-0.459]

82 As shown in Figure 4, the relative positions of the boy and girl provide the most information. [sent-215, score-0.652]

83 CMI scores were conditioned on both the object and the boy or girl. [sent-224, score-0.46]

84 The average results for the boy and girl are shown in Figure 4. [sent-225, score-0.579]

85 The depth ordering of the objects also provides important semantic information. [sent-229, score-0.308]

86 The absolute depth features are conditioned on the object appearing while the relative depth features are conditioned on the corresponding pair co-occurring. [sent-234, score-0.523]

87 Measuring the semantic similarity of images The semantic similarity of images depends on various characteristics of an image, such as the objects present, their attributes and relations. [sent-248, score-0.483]

88 In this section, we explore the use of visual features for measuring semantic similarity. [sent-249, score-0.33]

89 For ground truth, we assume a set of 10 scenes generated using the same sentence are members of the same semantically similar class (Section 3). [sent-250, score-0.568]
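Given this ground truth, a common recall-style protocol is to check, for each scene, how many of its nearest neighbors under a candidate feature representation fall in the same 10-scene class. The sketch below follows that protocol as an assumption; the paper's exact evaluation metric is not reproduced in this summary.

```python
# Sketch of a recall-style evaluation: for each scene, retrieve its k nearest neighbors
# under some feature representation and count how many belong to the same set of 10
# semantically similar scenes. The protocol details and toy data are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_recall_at_k(X, classes, k=9):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each query returns itself
    _, idx = nn.kneighbors(X)
    same = (classes[idx[:, 1:]] == classes[:, None])    # drop the self-match in column 0
    return same.mean()                                  # fraction of retrieved scenes in-class

rng = np.random.default_rng(0)
X = rng.random((1000, 40))                              # placeholder scene features
classes = np.repeat(np.arange(100), 10)                 # 100 sets of 10 scenes
print(mean_recall_at_k(X, classes))
```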

90 This is not surprising since semantically important information is commonly quite subtle, and scenes with very different object arrangements might be semantically similar. [sent-262, score-0.815]

91 For instance the combination of occurrence and person attributes provides a very effective set of features. [sent-264, score-0.366]

92 In fact, occurrence with person attributes has nearly identical results to using the top 200 features overall. [sent-265, score-0.375]

93 For instance, occurrence features are informative of nouns, while relative position features are more predictive of verbs, adverbs and prepositions. [sent-278, score-0.467]

94 Notice how the relative positions and orientations of the clip art can dramatically alter the words with highest score. [sent-280, score-0.627]

95 Discussion The potential of using abstract images to study the high-level semantic understanding of visual data is especially promising. [sent-282, score-0.355]

96 Abstract images allow for the creation of huge datasets of semantically similar scenes that would be impossible with real images. [sent-283, score-0.513]

97 High-level semantic visual features can be learned or designed that better predict not only nouns, but other more complex phenomena represented by verbs, adverbs and prepositions. [sent-286, score-0.403]

98 Finally, we hypothesize that the study of high-level semantic information using abstract scenes will provide insights into methods for semantically understanding real images. [sent-288, score-0.909]

99 To simulate detections in real images, artificial noise may be added to the visual features to study the effect of noise on inferring semantic information. [sent-291, score-0.403]
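A minimal sketch of such a noise simulation is given below: binary occurrence features are flipped with some probability and continuous position features are jittered with Gaussian noise. The noise model, rates, and feature layout are illustrative assumptions.

```python
# Sketch of simulating imperfect detectors on clean abstract-scene features: flip binary
# occurrence features with some probability and jitter continuous position features with
# Gaussian noise. The noise model and rates are illustrative assumptions only.
import numpy as np

def corrupt_features(occurrence, positions, flip_prob=0.1, pos_sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    flips = rng.random(occurrence.shape) < flip_prob
    noisy_occurrence = np.where(flips, 1 - occurrence, occurrence)
    noisy_positions = positions + rng.normal(0.0, pos_sigma, size=positions.shape)
    return noisy_occurrence, noisy_positions

occ = np.random.default_rng(1).integers(0, 2, size=(5, 58))   # toy occurrence features
pos = np.random.default_rng(2).random((5, 58, 2))             # toy normalized positions
noisy_occ, noisy_pos = corrupt_features(occ, pos)
print((noisy_occ != occ).mean())                              # fraction of flipped entries
```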

100 Finally by removing the dependence on varying sets of noisy automatic detectors, abstract scenes allow for more direct comparison between competing methods for extraction of semantic information from visual information. [sent-292, score-0.514]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clip', 0.335), ('boy', 0.325), ('cmi', 0.307), ('semantically', 0.307), ('girl', 0.254), ('semantic', 0.188), ('scenes', 0.16), ('art', 0.156), ('occurrence', 0.149), ('subjects', 0.136), ('children', 0.111), ('attributes', 0.107), ('nouns', 0.101), ('sentence', 0.101), ('depicting', 0.097), ('prepositions', 0.088), ('conditioned', 0.086), ('mutual', 0.084), ('adjectives', 0.077), ('verbs', 0.077), ('adverbs', 0.074), ('relative', 0.073), ('facial', 0.073), ('relating', 0.072), ('asked', 0.072), ('create', 0.072), ('written', 0.071), ('mi', 0.071), ('sentences', 0.07), ('person', 0.068), ('study', 0.067), ('words', 0.063), ('objects', 0.061), ('toys', 0.061), ('depth', 0.059), ('importance', 0.059), ('generating', 0.059), ('descriptions', 0.058), ('absolute', 0.058), ('relations', 0.056), ('scene', 0.055), ('pieces', 0.052), ('visual', 0.051), ('features', 0.051), ('hypothesize', 0.051), ('balloons', 0.049), ('heider', 0.049), ('scores', 0.049), ('understanding', 0.049), ('animals', 0.048), ('gaze', 0.048), ('created', 0.047), ('holding', 0.046), ('real', 0.046), ('biederman', 0.044), ('hamburger', 0.044), ('jenny', 0.044), ('expression', 0.044), ('expressions', 0.043), ('playing', 0.042), ('instance', 0.042), ('berg', 0.042), ('description', 0.042), ('information', 0.041), ('convey', 0.041), ('spm', 0.041), ('hats', 0.041), ('measuring', 0.04), ('studying', 0.04), ('thousand', 0.039), ('psychology', 0.039), ('phenomena', 0.039), ('exist', 0.039), ('dependence', 0.038), ('interface', 0.037), ('food', 0.037), ('sets', 0.036), ('amazon', 0.036), ('rashtchian', 0.035), ('meanings', 0.035), ('tagged', 0.035), ('worn', 0.035), ('position', 0.035), ('gist', 0.034), ('galleguillos', 0.034), ('rabinovich', 0.034), ('kulkarni', 0.034), ('pie', 0.034), ('informative', 0.034), ('numerous', 0.034), ('meaning', 0.033), ('occur', 0.033), ('oliva', 0.033), ('language', 0.033), ('asking', 0.033), ('humans', 0.033), ('parikh', 0.032), ('sadeghi', 0.031), ('story', 0.031), ('xp', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

2 0.15090367 146 cvpr-2013-Enriching Texture Analysis with Semantic Data

Author: Tim Matthews, Mark S. Nixon, Mahesan Niranjan

Abstract: We argue for the importance of explicit semantic modelling in human-centred texture analysis tasks such as retrieval, annotation, synthesis, and zero-shot learning. To this end, low-level attributes are selected and used to define a semantic space for texture. 319 texture classes varying in illumination and rotation are positioned within this semantic space using a pairwise relative comparison procedure. Low-level visual features used by existing texture descriptors are then assessed in terms of their correspondence to the semantic space. Textures with strong presence of attributes connoting randomness and complexity are shown to be poorly modelled by existing descriptors. In a retrieval experiment semantic descriptors are shown to outperform visual descriptors. Semantic modelling of texture is thus shown to provide considerable value in both feature selection and in analysis tasks.

3 0.14929472 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

4 0.1488438 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation

Author: Heesoo Myeong, Kyoung Mu Lee

Abstract: We propose a novel nonparametric approach for semantic segmentation using high-order semantic relations. Conventional context models mainly focus on learning pairwise relationships between objects. Pairwise relations, however, are not enough to represent high-level contextual knowledge within images. In this paper, we propose semantic relation transfer, a method to transfer high-order semantic relations of objects from annotated images to unlabeled images analogous to label transfer techniques where label information is transferred. We first define semantic tensors representing high-order relations of objects. Semantic relation transfer problem is then formulated as semi-supervised learning using a quadratic objective function of the semantic tensors. By exploiting low-rank property of the semantic tensors and employing Kronecker sum similarity, an efficient approximation algorithm is developed. Based on the predicted high-order semantic relations, we reason semantic segmentation and evaluate the performance on several challenging datasets.

5 0.14308287 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug-in” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.

6 0.14229769 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

7 0.13704009 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition

8 0.1356526 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes

9 0.12431314 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context

10 0.11039451 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes

11 0.10881326 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

12 0.10237218 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

13 0.10140507 325 cvpr-2013-Part Discovery from Partial Correspondence

14 0.097703658 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images

15 0.096122593 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition

16 0.089733556 161 cvpr-2013-Facial Feature Tracking Under Varying Facial Expressions and Face Poses Based on Restricted Boltzmann Machines

17 0.088910215 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?

18 0.083683826 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

19 0.081957348 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes

20 0.0819498 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, -0.06), (2, 0.021), (3, -0.047), (4, 0.053), (5, 0.04), (6, -0.127), (7, 0.086), (8, 0.046), (9, 0.048), (10, 0.005), (11, 0.012), (12, 0.003), (13, 0.059), (14, 0.03), (15, 0.004), (16, 0.019), (17, 0.084), (18, -0.037), (19, -0.024), (20, 0.034), (21, -0.027), (22, 0.046), (23, 0.024), (24, -0.048), (25, -0.017), (26, 0.034), (27, 0.001), (28, -0.006), (29, -0.083), (30, -0.088), (31, -0.066), (32, -0.064), (33, -0.0), (34, -0.053), (35, 0.022), (36, -0.038), (37, 0.153), (38, -0.109), (39, 0.014), (40, -0.067), (41, -0.04), (42, -0.122), (43, 0.129), (44, 0.028), (45, 0.006), (46, -0.029), (47, -0.071), (48, 0.063), (49, -0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93248492 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

2 0.68193007 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

3 0.68064141 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg

Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.

4 0.6776371 146 cvpr-2013-Enriching Texture Analysis with Semantic Data

Author: Tim Matthews, Mark S. Nixon, Mahesan Niranjan

Abstract: We argue for the importance of explicit semantic modelling in human-centred texture analysis tasks such as retrieval, annotation, synthesis, and zero-shot learning. To this end, low-level attributes are selected and used to define a semantic space for texture. 319 texture classes varying in illumination and rotation are positioned within this semantic space using a pairwise relative comparison procedure. Low-level visual features used by existing texture descriptors are then assessed in terms of their correspondence to the semantic space. Textures with strong presence of attributes connoting randomness and complexity are shown to be poorly modelled by existing descriptors. In a retrieval experiment semantic descriptors are shown to outperform visual descriptors. Semantic modelling of texture is thus shown to provide considerable value in both feature selection and in analysis tasks.

5 0.63874954 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

Author: Yun Jiang, Hema Koppula, Ashutosh Saxena

Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.

6 0.63479269 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

7 0.63049066 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling

8 0.60641849 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation

9 0.60615575 214 cvpr-2013-Image Understanding from Experts' Eyes by Modeling Perceptual Skill of Diagnostic Reasoning Processes

10 0.58413935 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

11 0.56421685 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning

12 0.56274921 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection

13 0.52967995 463 cvpr-2013-What's in a Name? First Names as Facial Attributes

14 0.52870011 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images

15 0.52084744 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes

16 0.50815582 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes

17 0.50701755 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

18 0.49812928 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

19 0.49458539 78 cvpr-2013-Capturing Layers in Image Collections with Componential Models: From the Layered Epitome to the Componential Counting Grid

20 0.48638114 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.104), (16, 0.018), (26, 0.064), (33, 0.248), (59, 0.011), (67, 0.088), (69, 0.051), (72, 0.019), (77, 0.014), (80, 0.015), (87, 0.075), (99, 0.229)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85609889 181 cvpr-2013-Fusing Depth from Defocus and Stereo with Coded Apertures

Author: Yuichi Takeda, Shinsaku Hiura, Kosuke Sato

Abstract: In this paper we propose a novel depth measurement method by fusing depth from defocus (DFD) and stereo. One of the problems of passive stereo method is the difficulty of finding correct correspondence between images when an object has a repetitive pattern or edges parallel to the epipolar line. On the other hand, the accuracy of DFD method is inherently limited by the effective diameter of the lens. Therefore, we propose the fusion of stereo method and DFD by giving different focus distances for left and right cameras of a stereo camera with coded apertures. Two types of depth cues, defocus and disparity, are naturally integrated by the magnification and phase shift of a single point spread function (PSF) per camera. In this paper we give the proof of the proportional relationship between the diameter of defocus and disparity which makes the calibration easy. We also show the outstanding performance of our method which has both advantages of two depth cues through simulation and actual experiments.

2 0.84103131 297 cvpr-2013-Multi-resolution Shape Analysis via Non-Euclidean Wavelets: Applications to Mesh Segmentation and Surface Alignment Problems

Author: Won Hwa Kim, Moo K. Chung, Vikas Singh

Abstract: The analysis of 3-D shape meshes is a fundamental problem in computer vision, graphics, and medical imaging. Frequently, the needs of the application require that our analysis take a multi-resolution view of the shape ’s local and global topology, and that the solution is consistent across multiple scales. Unfortunately, the preferred mathematical construct which offers this behavior in classical image/signal processing, Wavelets, is no longer applicable in this general setting (data with non-uniform topology). In particular, the traditional definition does not allow writing out an expansion for graphs that do not correspond to the uniformly sampled lattice (e.g., images). In this paper, we adapt recent results in harmonic analysis, to derive NonEuclidean Wavelets based algorithms for a range of shape analysis problems in vision and medical imaging. We show how descriptors derived from the dual domain representation offer native multi-resolution behavior for characterizing local/global topology around vertices. With only minor modifications, the framework yields a method for extracting interest/key points from shapes, a surprisingly simple algorithm for 3-D shape segmentation (competitive with state of the art), and a method for surface alignment (without landmarks). We give an extensive set of comparison results on a large shape segmentation benchmark and derive a uniqueness theorem for the surface alignment problem.

same-paper 3 0.82970357 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

4 0.81570357 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

Author: Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, Cunzhao Shi

Abstract: In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. However, these infinite-dimensional features cannot be used directly. Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. Thus, we present a constraint strategy to explore the information contained in the unlabeled samples. The rationality behind the constraint is that any action video belongs to only one class. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that our method achieves better performance than the state-of-the-art methods.

5 0.80509478 130 cvpr-2013-Discriminative Color Descriptors

Author: Rahat Khan, Joost van_de_Weijer, Fahad Shahbaz Khan, Damien Muselet, Christophe Ducottet, Cecile Barat

Abstract: Color description is a challenging task because of large variations in RGB values which occur due to scene accidental events, such as shadows, shading, specularities, illuminant color changes, and changes in viewing geometry. Traditionally, this challenge has been addressed by capturing the variations in physics-based models, and deriving invariants for the undesired variations. The drawback of this approach is that sets of distinguishable colors in the original color space are mapped to the same value in the photometric invariant space. This results in a drop of discriminative power of the color description. In this paper we take an information theoretic approach to color description. We cluster color values together based on their discriminative power in a classification problem. The clustering has the explicit objective to minimize the drop of mutual information of the final representation. We show that such a color description automatically learns a certain degree of photometric invariance. We also show that a universal color representation, which is based on other data sets than the one at hand, can obtain competing performance. Experiments show that the proposed descriptor outperforms existing photometric invariants. Furthermore, we show that combined with shape description these color descriptors obtain excellent results on four challenging datasets, namely, PASCAL VOC 2007, Flowers-102, Stanford dogs-120 and Birds-200.

6 0.79730254 126 cvpr-2013-Diffusion Processes for Retrieval Revisited

7 0.78796184 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

8 0.78636968 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

9 0.78213632 325 cvpr-2013-Part Discovery from Partial Correspondence

10 0.78193402 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

11 0.7805171 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

12 0.77954382 311 cvpr-2013-Occlusion Patterns for Object Class Detection

13 0.77854747 414 cvpr-2013-Structure Preserving Object Tracking

14 0.77844858 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

15 0.77832711 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

16 0.77793872 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

17 0.77731878 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

18 0.77616531 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

19 0.77610701 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

20 0.77545118 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection