iccv iccv2013 iccv2013-246 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
Reference: text
sentIndex sentText sentNum sentScore
1 Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. [sent-6, score-0.29]
2 Specifically, we extract predicate tuples that contain two nouns and a relation. [sent-8, score-0.79]
3 We determine the potentials of the CRF using the tuples extracted from the sentences. [sent-11, score-0.525]
4 We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. [sent-12, score-0.281]
5 Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. [sent-14, score-0.461]
6 That is, an image may be given as input and a sentence produced [8, 1, 13, 25, 38, 19], or a scene or animation may be generated from a sentence description [27, 6, 17, 14]. [sent-24, score-0.67]
7 For the latter, the quality of the generated scenes may be studied to determine whether the meaning of the sentence was correctly interpreted. [sent-25, score-0.544]
8 Notice how subtle changes in the wording of the sentences lead to different visual interpretations. [sent-27, score-0.481]
9 [31] proposed a novel approach to discovering the visual meaning of semantic phrases using training datasets created from text-based image search. [sent-31, score-0.363]
10 We demonstrate the effectiveness of our approach by both generating novel scenes from sentence descriptions, and by enabling sentence-based search of abstract scenes [41]. [sent-33, score-0.572]
11 However, unlike previous papers [6, 19] we automatically discover the relation between semantic and visual information. [sent-34, score-0.284]
12 That is, we not only learn the mapping of nouns to the occurrence of various objects, but also the visual meaning of many verbs, prepositions and adjectives. [sent-35, score-0.579]
13 We explore this problem within the methodology of Zitnick and Parikh [41], which proposed the use of abstract scenes generated from clip art to study semantic scene understanding (Figure 1). [sent-36, score-0.467]
14 First, by construction we know the visual arrangement and attributes of the objects in the scene, but not their semantic meaning. [sent-38, score-0.313]
15 This allows us to focus on the core problem of semantic scene understanding, while avoiding problems that arise with the use of noisy automatic object and attribute detectors in real images. [sent-39, score-0.297]
16 Second, while real image datasets may be quite large, they contain a very diverse set of scenes resulting in a sparse sampling of many semantic concepts [15, 29, 8, 36]. [sent-40, score-0.317]
17 Using abstract scenes we may densely sample to learn subtle nuances in semantic meaning. [sent-41, score-0.317]
18 While the sentences “Jenny is next to Mike”, “Jenny ran after Mike” and “Jenny ran to Mike” are similar, each has distinctly different visual interpretations. [sent-43, score-0.629]
19 For each scene, we gathered two sets of three sentences describing different aspects of the scene. [sent-50, score-0.448]
20 The result is a dataset containing 60,000 sentences that is publicly available on the authors’ website. [sent-51, score-0.448]
21 Unary potentials model the position, occurrence and attributes of an object. [sent-55, score-0.308]
22 For each sentence, we extract a set of predicate tuples containing a primary object, a relation, and a secondary object. [sent-57, score-0.814]
23 We use the predicate tuples and visual features (trivially) extracted from the training scenes to determine the CRF’s unary and pairwise potentials. [sent-58, score-0.841]
24 Given a novel sentence, a scene depicting the meaning of the sentence may be generated by sampling from the CRF. [sent-59, score-0.444]
25 Results show that our approach can generate (and retrieve) scenes that convey the desired semantic meaning, even when scenes are described by multiple sentences. [sent-61, score-0.461]
26 Related work Images to sentences: Several works have looked at the task of annotating images with tags, be it nouns [23, 3] or adjectives (attributes) [9, 26]. [sent-64, score-0.283]
27 More recently, efforts are being made to predict entire sentences from image features [8, 25, 38, 19]. [sent-65, score-0.448]
28 Some methods generate novel sentences by leveraging existing object detectors [12], attribute predictors [9, 4, 26], language statistics [38] or spatial relationships [19]. [sent-66, score-0.689]
29 Of course at the core, our approach learns the visual meaning of semantically rich text that may also help produce more grounded textual descriptions of images. [sent-73, score-0.313]
30 Some approaches build intermediate representations of images that capture mid-level semantic concepts [34, 30, 24, 40, 7, 35] and help bridge the well known semantic gap. [sent-75, score-0.27]
31 These semantic concepts or attributes can also be used to pose queries for image search [20, 33]. [sent-76, score-0.267]
32 Their “meaning” representation only involved tuples of the (object, action, scene) form. [sent-83, score-0.402]
33 As will be evident later, we allow for a significantly larger variety of textual phrases and learn their visual meaning from data. [sent-84, score-0.288]
34 We demonstrate this by generating novel scenes in addition to retrieving scenes as in [10]. [sent-85, score-0.335]
35 Text to images (generation): In computer graphics, sentence descriptions have been used as an intuitive method for generating static scenes [6] and animations [27]. [sent-86, score-0.498]
36 [16] explore the use of both prepositions and adjectives to build better object models, while [31] and [39] study relations that convey information related to active verbs, such as “riding” or “playing”. [sent-96, score-0.339]
37 We extend this work to learn the meaning of semantic phrases extracted from a sentence, which convey complex information related to numerous visual features. [sent-98, score-0.424]
38 Our approach to sentence parsing and how the parsed sentences are used to determine the CRF potentials are described in the following sections. [sent-101, score-0.808]
39 We use scenes that are created from 80 pieces of clip art representing 58 different objects [41], such as people, animals, toys, trees, etc. [sent-102, score-0.347]
40 Turkers were allowed … Figure 2: Example tuples extracted from sentences. [sent-104, score-0.402]
41 Correct tuples are shown in blue, and incorrect or incomplete tuples are shown in red (darker). [sent-105, score-0.804]
42 While we currently only model attributes for people, the model may handle attributes for other objects as well. [sent-115, score-0.306]
43 We define the conditional probability of the scene {c, Φ, Ψ} given a set of sentences S as log P(c, Φ, Ψ|S, θ) = ? [sent-116, score-0.588]
44 Occurrence: We compute the unary occurrence potential using ψi(ci, S; θc) = log θψ(ci, i, S) (2), where the parameters θψ(ci, i, S) encode the likelihood of observing or not observing object i given the sentences S. [sent-140, score-0.687]
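A minimal sketch of how such an occurrence potential might be tabulated from the extracted tuples is given below; the noun-to-object mapping, the per-object priors, and the smoothing constant `eps` are illustrative assumptions rather than the authors' exact estimator.

```python
import math

def occurrence_potentials(tuples, noun_to_object, prior_prob, num_objects, eps=1e-6):
    """Log-likelihood of each clip-art object occurring, given the tuples.

    tuples: list of (primary_noun, relation, secondary_noun_or_None)
    noun_to_object: dict mapping a noun to a clip-art object index (assumed)
    prior_prob: per-object prior occurrence probabilities (assumed)
    """
    mentioned = set()
    for primary, _, secondary in tuples:
        for noun in (primary, secondary):
            if noun is not None and noun in noun_to_object:
                mentioned.add(noun_to_object[noun])

    potentials = []
    for i in range(num_objects):
        # Objects referenced by any tuple are (almost) forced to occur;
        # all other objects fall back on their prior probability.
        p = 1.0 - eps if i in mentioned else prior_prob[i]
        potentials.append(math.log(max(p, eps)))
    return potentials
```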
45 Attributes: The attribute potential encodes the likelihood of observing the attributes given the sentences if the object is a person, and is 0 otherwise: πi(Ψi, S; θπ) = ? [sent-151, score-0.701]
46 We include the direction di that object i is facing when computing Δx so that we may determine whether object i is facing object j. [sent-182, score-0.28]
47 The parameters θφ,z(i, j, ·, S) used by the relative depth potential encode the probability of the depth ordering of two objects given the sentences, P(Δz|i, j, S). [sent-196, score-0.542]
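A sketch of the relative-offset features described above; the object representation (x, y, depth layer z, and a ±1 facing direction) is an assumed simplification of the clip-art state, not the paper's exact encoding.

```python
def pairwise_offset_features(obj_i, obj_j):
    """Relative offsets between two objects, with Δx flipped by facing direction.

    Each object is a dict with keys 'x', 'y', 'z' (depth layer) and
    'facing' (+1 facing right, -1 facing left) -- an assumed representation.
    """
    # A positive dx then means object i is facing towards object j.
    dx = obj_i['facing'] * (obj_j['x'] - obj_i['x'])
    dy = obj_j['y'] - obj_i['y']
    # Relative depth ordering Δz in {-1, 0, +1}.
    dz = (obj_j['z'] > obj_i['z']) - (obj_j['z'] < obj_i['z'])
    return dx, dy, dz
```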
48 A majority of the model’s parameters for computing the unary and pairwise potentials are dependent on the set of given sentences S. [sent-202, score-0.614]
49 In this section we describe how we parse a set of sentences into a set of predicate tuples. [sent-203, score-0.612]
50 A set of predicate tuples is a common method for encoding the information contained in a sentence. [sent-205, score-0.566]
51 Several papers have also explored the use of various forms of tuples for semantic scene understanding [8, 31, 19]. [sent-206, score-0.624]
52 In our representation a tuple contains a primary object, a relation, and an optional secondary object. [sent-207, score-0.398]
53 The primary and secondary object are both represented as nouns, where the relation may take on several forms. [sent-208, score-0.42]
54 Examples of sentences and tuples are shown in Figure 2. [sent-210, score-0.85]
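A minimal sketch of this tuple representation in code; the class name and the example mapping of "Jenny ran after Mike" are illustrative, not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredicateTuple:
    """Predicate tuple: primary noun, relation, and an optional secondary noun."""
    primary: str                      # e.g. "jenny"
    relation: str                     # verb, preposition, adjective, or a combination
    secondary: Optional[str] = None   # e.g. "mike"; may be absent

# Illustrative example: "Jenny ran after Mike"
t = PredicateTuple(primary="jenny", relation="run after", secondary="mike")
```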
55 Note that each sentence can … Figure 4: Probability of expression (red) and pose (blue) for the primary object for several predicate relations; a larger circle implies greater probability. [sent-211, score-0.588]
56 The tuples are found using a technique called semantic roles analysis [28] that allows for the unpacking of a sentence’s semantic roles into a set of tuples. [sent-213, score-0.762]
57 Note that the words in the sentences are represented using their lemma or “dictionary look-up form” so that different forms of the same word are mapped together. [sent-214, score-0.448]
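The semantic role analysis step [28] is not reproduced here, but the lemma mapping can be sketched with an off-the-shelf lemmatizer; the choice of NLTK's WordNet lemmatizer (which requires the `wordnet` corpus to be downloaded) is an assumption, not the authors' tool.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_relation(words):
    # Treat relation words as verbs so that "ran" -> "run", "kicking" -> "kick".
    return " ".join(lemmatizer.lemmatize(w.lower(), pos="v") for w in words)

print(lemmatize_relation(["ran", "after"]))   # -> "run after"
```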
58 Finally, while we model numerous relationships within a sentence, there are many we do not model, and semantic roles analysis often misses tuples and may contain errors. [sent-217, score-0.639]
59 In our experiments we use a set of 10,000 clip art scenes provided by [41]. [sent-222, score-0.277]
60 For each of these scenes we gathered two sets of descriptions, each containing three sentences using AMT. [sent-223, score-0.587]
61 The turkers were instructed to “Please write three simple sentences describing different parts of the scene. [sent-224, score-0.514]
62 These sentences should range from 4 to 8 words in length using basic words that would appear in a book for young children ages 4-6.” [sent-225, score-0.478]
63 9,000 scenes and their 54,000 sentences were used for training, and 1,000 descriptions (3,000 sentences) for the remaining 1,000 scenes were used for testing. [sent-226, score-0.791]
64 319 nouns (objects) and 445 relations were found at least 8 times in the sentences. [sent-227, score-0.309]
65 We now describe how the tuples are used to compute the CRF's parameters for scene generation, followed by how we generate scenes using the CRF. [sent-235, score-0.61]
66 Each tuple contains one or two nouns and a relation, as shown in Figure 5. [sent-236, score-0.472]
67 We assume the nouns only provide information related to the occurrence of objects. [sent-243, score-0.321]
68 We make the simplifying assumption that each noun used by the primary or secondary object only refers to a single object in the scene. [sent-244, score-0.402]
69 We perform this mapping by finding the clip art object that has the highest mutual information for each noun over all of the training scenes, M(i) = max_j I(i; j), where I(i; j) is the mutual information between noun i and clip art object j. [sent-246, score-0.699]
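A sketch of this mutual-information mapping over binary presence indicators (noun mentioned in a scene's sentences vs. object present in the scene); the binary encoding is an assumption made here for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI between two equal-length binary sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def map_nouns_to_objects(noun_mentions, object_presence):
    """noun_mentions[noun] and object_presence[obj] are 0/1 lists over training scenes.

    Returns, for each noun, the clip-art object with the highest mutual
    information, i.e. M(i) = max_j I(i; j).
    """
    return {noun: max(object_presence,
                      key=lambda obj: mutual_information(xs, object_presence[obj]))
            for noun, xs in noun_mentions.items()}
```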
70 The nouns with highest mutual information are shown for several objects in Figure 3. [sent-247, score-0.298]
71 The main failures were for ambiguous nouns such as “it” and nouns for which no distinct clip art exists, such as “ground”, “sky”, or “hand”. [sent-249, score-0.586]
72 Intuitively, we compute the potentials such that all objects in the tuples are contained in the scene and otherwise their occurrence is based on their prior probability. [sent-252, score-0.702]
73 For each relation l ∈ R, we compute a set of values corresponding to the likelihood of the primary object's attributes Pp(k|l) and the secondary object's attributes Ps(k|l) given the relation l, where k is an index over the set of attributes. [sent-259, score-0.512]
74 If an object is a member of multiple tuples, the average values of Pp(k|ri) or Ps (k|ri) across all tuples are used. [sent-265, score-0.447]
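A sketch of this averaging step; the uniform fallback for objects that appear in no tuple is an assumption added here for completeness.

```python
def attribute_distribution(relations_for_object, attr_given_relation, num_attrs):
    """Average P(attribute k | relation r_i) over all tuples the object belongs to.

    relations_for_object: relations r_i in which the object is, e.g., the primary object
    attr_given_relation: dict mapping a relation to a list of P(k | relation)
    """
    if not relations_for_object:
        return [1.0 / num_attrs] * num_attrs   # assumed uniform fallback
    dists = [attr_given_relation[r] for r in relations_for_object]
    return [sum(d[k] for d in dists) / len(dists) for k in range(num_attrs)]
```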
75 The tree mentioned in the description is missing in our scene because the tuple extraction missed it. [sent-281, score-0.54]
76 Random scenes from the dataset (Random) are also shown to demonstrate the variety of scenes in the dataset. [sent-284, score-0.278]
77 Intuitively, this makes sense since each relation implies the primary object is looking at the secondary object, but the secondary object may be facing either direction. [sent-294, score-0.662]
78 If a tuple does not contain a secondary object, the tuple is not used to update the parameters for the pairwise potentials. [sent-296, score-0.487]
79 Generating scenes using the CRF: After the potentials of the CRF have been computed given the tuples extracted from the sentences, we generate a scene using a combination of sampling and iterated conditional modes. [sent-299, score-0.745]
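A sketch of this generation loop, combining an initial sample from the unary potentials with iterated conditional modes (ICM); the discretized candidate states and the scoring interfaces are illustrative assumptions, not the authors' exact procedure.

```python
import math
import random

def generate_scene(objects, candidate_states, unary, pairwise, iters=20):
    """Sample an initial configuration, then refine it with ICM.

    unary(i, s): log-potential for object i in state s.
    pairwise(i, si, j, sj): log-potential between objects i and j.
    candidate_states[i]: a list of discretized states for object i (assumed).
    """
    # Sample each object's initial state in proportion to exp(unary).
    state = {}
    for i in objects:
        weights = [math.exp(unary(i, s)) for s in candidate_states[i]]
        state[i] = random.choices(candidate_states[i], weights=weights)[0]

    # ICM: repeatedly move each object to its locally best state.
    for _ in range(iters):
        changed = False
        for i in objects:
            def score(s):
                return unary(i, s) + sum(
                    pairwise(i, s, j, state[j]) for j in objects if j != i)
            best = max(candidate_states[i], key=score)
            if best != state[i]:
                state[i], changed = best, True
        if not changed:    # reached a local maximum
            break
    return state
```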
80 If the object is something that may be worn, such as a hat or glasses, its position is determined by the person whose attributes indicate they are most likely to be wearing it. [sent-307, score-0.317]
81 The subjects find our scenes (Full-CRF) better represent the input sentences than all baseline approaches. [sent-320, score-0.654]
82 (middle) Subjects were asked to score how well a scene depicted a set of three sentences from 1 (very poor) to 5 (very well). [sent-322, score-0.517]
83 GT: The ground truth uses the original scenes that the mechanical turkers viewed while writing their sentence descriptions. [sent-327, score-0.413]
84 BoW: We build a bag-of-words representation for the input description that captures whether a word (primary object, secondary object or relation) is present in the description or not. [sent-329, score-0.337]
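A sketch of this bag-of-words baseline; the binary encoding and the dot-product similarity used for matching are assumptions about the exact scoring, which the text does not fully specify.

```python
def bow_vector(tuples, vocabulary):
    """Binary bag-of-words over tuple words (primary, relation, secondary)."""
    words = set()
    for primary, relation, secondary in tuples:
        words.update([primary, relation])
        if secondary is not None:
            words.add(secondary)
    return [1 if w in words else 0 for w in vocabulary]

def nearest_scene(query_vec, scene_vecs):
    """Index of the scene whose BoW vector best matches the query
    (dot-product similarity; the paper's matching score may differ)."""
    return max(range(len(scene_vecs)),
               key=lambda i: sum(q * s for q, s in zip(query_vec, scene_vecs[i])))
```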
85 Noun-CRF: This baseline generates a scene using the CRF, but only based on the primary and secondary object nouns present in the predicate tuples. [sent-334, score-0.784]
86 Subjects were shown the input description and asked which one of the two scenes matched the description better, or if both equally matched. [sent-339, score-0.277]
87 The fact that our approach significantly outperforms the bag-of-words nearest neighbor baseline… (Footnote 1: Entire tuples occur too rarely to be used as “words”.) [sent-349, score-0.402]
88 …03) shows that it is essential to learn the semantic meaning of complex language structures that encode the relationships among objects in the scene. [sent-352, score-0.377]
89 …11), but it demonstrates that the dataset is challenging in that random scenes rarely convey the same semantic meaning. [sent-354, score-0.322]
90 We compare results for varying values of K to the BoW baseline that only matches tuple objects and relations extracted from a separate set of 3,000 training sentences on the 1000 test scenes. [sent-363, score-0.758]
91 This helps demonstrate that obtaining a deeper visual interpretation of a sentence significantly improves the quality of descriptive text-based queries. [sent-369, score-0.299]
92 Furthermore, previous works learn relations using only occurrence [2, 16] and relative position [16]. [sent-374, score-0.3]
93 As we demonstrate, complex semantic phrases are dependent on a large variety of visual features including object occurrence, relative position, facial expression, pose, gaze, etc. [sent-375, score-0.323]
94 One critical aspect of our approach is the extraction of predicate tuples from sentences. [sent-377, score-0.566]
95 For instance, sentences with ambiguous phrase attachment, such as “Jenny ran after the bear with a bat.” … [sent-382, score-0.522]
96 While we study the problem of scene generation and retrieval in this paper, the features we learn may also be useful for generating rich and descriptive sentences from scenes. [sent-386, score-0.681]
97 In conclusion, we demonstrate a method for automatically inferring the visual meaning of predicate tuples extracted from sentences. [sent-388, score-0.708]
98 The tuples relate one or two nouns using a combination of verbs, prepositions and adjectives. [sent-389, score-0.714]
99 Every picture tells a story: Generating sentences for images. [sent-458, score-0.448]
100 From sentence to emotion: a real-time three-dimensional graphics metaphor of emotions extracted from text. [sent-485, score-0.266]
wordName wordTfidf (topN-words)
[('sentences', 0.448), ('tuples', 0.402), ('sentence', 0.237), ('nouns', 0.224), ('predicate', 0.164), ('crf', 0.163), ('secondary', 0.154), ('tuple', 0.15), ('jenny', 0.14), ('scenes', 0.139), ('semantic', 0.121), ('attributes', 0.118), ('meaning', 0.109), ('mike', 0.1), ('relation', 0.098), ('occurrence', 0.097), ('primary', 0.094), ('potentials', 0.093), ('clip', 0.091), ('prepositions', 0.088), ('relations', 0.085), ('linguistics', 0.082), ('ran', 0.074), ('phrases', 0.071), ('description', 0.069), ('scene', 0.069), ('turkers', 0.066), ('descriptions', 0.065), ('noun', 0.064), ('convey', 0.062), ('attribute', 0.062), ('happy', 0.062), ('verbs', 0.062), ('ri', 0.06), ('roles', 0.059), ('adjectives', 0.059), ('generating', 0.057), ('adjective', 0.053), ('relative', 0.053), ('language', 0.05), ('generation', 0.05), ('expression', 0.048), ('textual', 0.047), ('art', 0.047), ('association', 0.046), ('object', 0.045), ('bow', 0.044), ('worn', 0.044), ('facing', 0.043), ('conditional', 0.042), ('annual', 0.041), ('objects', 0.041), ('unary', 0.04), ('attributive', 0.04), ('gobron', 0.04), ('laugh', 0.04), ('lucy', 0.04), ('scare', 0.04), ('vanderwende', 0.04), ('parikh', 0.038), ('ball', 0.038), ('amazon', 0.038), ('mechanical', 0.037), ('position', 0.037), ('story', 0.036), ('multimedia', 0.036), ('farhadi', 0.035), ('rela', 0.035), ('nlp', 0.035), ('preposition', 0.035), ('sadeghi', 0.035), ('pi', 0.034), ('baseline', 0.034), ('pairwise', 0.033), ('mutual', 0.033), ('visual', 0.033), ('pp', 0.033), ('subjects', 0.033), ('proceedings', 0.033), ('papers', 0.032), ('zitnick', 0.032), ('gmm', 0.031), ('verb', 0.031), ('absolute', 0.031), ('determine', 0.03), ('ci', 0.03), ('young', 0.03), ('text', 0.03), ('kicking', 0.029), ('naphade', 0.029), ('boy', 0.029), ('emotions', 0.029), ('may', 0.029), ('interpretation', 0.029), ('created', 0.029), ('log', 0.029), ('relationships', 0.028), ('learn', 0.028), ('likelihood', 0.028), ('concepts', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 246 iccv-2013-Learning the Visual Interpretation of Sentences
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
2 0.30013892 428 iccv-2013-Translating Video Content to Natural Language Descriptions
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
4 0.13543576 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
Author: Xiaoyang Wang, Qiang Ji
Abstract: This paper proposes a unified probabilistic model to model the relationships between attributes and objects for attribute prediction and object recognition. As a list of semantically meaningful properties of objects, attributes generally relate to each other statistically. In this paper, we propose a unified probabilistic model to automatically discover and capture both the object-dependent and objectindependent attribute relationships. The model utilizes the captured relationships to benefit both attribute prediction and object recognition. Experiments on four benchmark attribute datasets demonstrate the effectiveness of the proposed unified model for improving attribute prediction as well as object recognition in both standard and zero-shot learning cases.
5 0.13111952 53 iccv-2013-Attribute Dominance: What Pops Out?
Author: Naman Turakhia, Devi Parikh
Abstract: When we look at an image, some properties or attributes of the image stand out more than others. When describing an image, people are likely to describe these dominant attributes first. Attribute dominance is a result of a complex interplay between the various properties present or absent in the image. Which attributes in an image are more dominant than others reveals rich information about the content of the image. In this paper we tap into this information by modeling attribute dominance. We show that this helps improve the performance of vision systems on a variety of human-centric applications such as zero-shot learning, image search and generating textual descriptions of images.
6 0.12225998 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
7 0.12172323 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
8 0.12054215 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation
9 0.119187 52 iccv-2013-Attribute Adaptation for Personalized Image Search
10 0.11694828 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
11 0.11500888 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
12 0.10790375 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
13 0.10587525 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
14 0.096827514 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
15 0.095140509 46 iccv-2013-Allocentric Pose Estimation
16 0.089385763 234 iccv-2013-Learning CRFs for Image Parsing with Adaptive Subgradient Descent
17 0.087432384 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
18 0.085572928 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
19 0.084023319 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
20 0.082990631 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
topicId topicWeight
[(0, 0.186), (1, 0.088), (2, -0.015), (3, -0.069), (4, 0.128), (5, -0.005), (6, -0.032), (7, -0.109), (8, 0.07), (9, 0.031), (10, 0.041), (11, -0.03), (12, -0.014), (13, 0.058), (14, -0.029), (15, 0.019), (16, -0.065), (17, -0.006), (18, -0.065), (19, -0.061), (20, -0.071), (21, 0.019), (22, 0.04), (23, 0.013), (24, 0.01), (25, -0.086), (26, 0.125), (27, -0.074), (28, -0.106), (29, 0.006), (30, 0.056), (31, -0.17), (32, 0.065), (33, -0.053), (34, -0.038), (35, -0.048), (36, -0.115), (37, 0.074), (38, -0.204), (39, 0.027), (40, -0.076), (41, 0.111), (42, -0.041), (43, -0.013), (44, 0.013), (45, -0.001), (46, 0.093), (47, -0.062), (48, 0.02), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.90787786 246 iccv-2013-Learning the Visual Interpretation of Sentences
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
2 0.86026353 428 iccv-2013-Translating Video Content to Natural Language Descriptions
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
4 0.61459106 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
Author: Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg
Abstract: Entry level categories the labels people will use to name an object were originally defined and studied by psychologists in the 1980s. In this paper we study entrylevel categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word “naturalness ” mined from the enormous amounts of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people. We also learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. – –
5 0.60754162 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal
Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.
6 0.52260578 53 iccv-2013-Attribute Dominance: What Pops Out?
7 0.51562399 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
8 0.5130589 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
9 0.49955219 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
10 0.49256197 446 iccv-2013-Visual Semantic Complex Network for Web Images
11 0.4771753 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
12 0.47386387 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
13 0.47245252 416 iccv-2013-The Interestingness of Images
14 0.46917138 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
15 0.4633196 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
16 0.445519 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
17 0.44254699 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context
18 0.44194096 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
19 0.43113846 145 iccv-2013-Estimating the Material Properties of Fabric from Video
20 0.42896268 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
topicId topicWeight
[(2, 0.08), (7, 0.014), (8, 0.28), (12, 0.045), (26, 0.062), (31, 0.056), (34, 0.016), (42, 0.1), (48, 0.011), (64, 0.032), (73, 0.028), (89, 0.161)]
simIndex simValue paperId paperTitle
same-paper 1 0.76360655 246 iccv-2013-Learning the Visual Interpretation of Sentences
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
2 0.73132563 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
Author: Yen-Liang Lin, Cheng-Yu Huang, Hao-Jeng Wang, Winston Hsu
Abstract: We propose a 3D sub-query expansion approach for boosting sketch-based multi-view image retrieval. The core idea of our method is to automatically convert two (guided) 2D sketches into an approximated 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve the retrieval performance. To learn the weights among synthesized views (sub-queries), we present a new multi-query feature to model the similarity between subqueries and dataset images, and formulate it into a convex optimization problem. Our approach shows superior performance compared with the state-of-the-art approach on a public multi-view image dataset. Moreover, we also conduct sensitivity tests to analyze the parameters of our approach based on the gathered user sketches.
3 0.7240513 272 iccv-2013-Modifying the Memorability of Face Photographs
Author: Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, Aude Oliva
Abstract: Contemporary life bombards us with many new images of faces every day, which poses non-trivial constraints on human memory. The vast majority of face photographs are intended to be remembered, either because of personal relevance, commercial interests or because the pictures were deliberately designed to be memorable. Can we make aportrait more memorable or more forgettable automatically? Here, we provide a method to modify the memorability of individual face photographs, while keeping the identity and other facial traits (e.g. age, attractiveness, and emotional magnitude) of the individual fixed. We show that face photographs manipulated to be more memorable (or more forgettable) are indeed more often remembered (or forgotten) in a crowd-sourcing experiment with an accuracy of 74%. Quantifying and modifying the ‘memorability ’ of a face lends itself to many useful applications in computer vision and graphics, such as mnemonic aids for learning, photo editing applications for social networks and tools for designing memorable advertisements.
4 0.6947968 186 iccv-2013-GrabCut in One Cut
Author: Meng Tang, Lena Gorelick, Olga Veksler, Yuri Boykov
Abstract: Among image segmentation algorithms there are two major groups: (a) methods assuming known appearance models and (b) methods estimating appearance models jointly with segmentation. Typically, the first group optimizes appearance log-likelihoods in combination with some spacial regularization. This problem is relatively simple and many methods guarantee globally optimal results. The second group treats model parameters as additional variables transforming simple segmentation energies into highorder NP-hard functionals (Zhu-Yuille, Chan-Vese, GrabCut, etc). It is known that such methods indirectly minimize the appearance overlap between the segments. We propose a new energy term explicitly measuring L1 distance between the object and background appearance models that can be globally maximized in one graph cut. We show that in many applications our simple term makes NP-hard segmentation functionals unnecessary. Our one cut algorithm effectively replaces approximate iterative optimization techniques based on block coordinate descent.
5 0.66134048 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency
Author: Jiongxin Liu, Peter N. Belhumeur
Abstract: In this paper, we propose a novel approach for bird part localization, targeting fine-grained categories with wide variations in appearance due to different poses (including aspect and orientation) and subcategories. As it is challenging to represent such variations across a large set of diverse samples with tractable parametric models, we turn to individual exemplars. Specifically, we extend the exemplarbased models in [4] by enforcing pose and subcategory consistency at the parts. During training, we build posespecific detectors scoring part poses across subcategories, and subcategory-specific detectors scoring part appearance across poses. At the testing stage, likely exemplars are matched to the image, suggesting part locations whose pose and subcategory consistency are well-supported by the image cues. From these hypotheses, part configuration can be predicted with very high accuracy. Experimental results demonstrate significantperformance gainsfrom our method on an extensive dataset: CUB-200-2011 [30], for both localization and classification tasks.
6 0.65622282 428 iccv-2013-Translating Video Content to Natural Language Descriptions
7 0.63943326 238 iccv-2013-Learning Graphs to Match
9 0.62338984 180 iccv-2013-From Where and How to What We See
10 0.62121272 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
11 0.61968434 349 iccv-2013-Regionlets for Generic Object Detection
12 0.6188128 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
13 0.61757517 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
14 0.61742651 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
15 0.61697942 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
16 0.61685508 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
17 0.6161145 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
18 0.61606193 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
19 0.61594236 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
20 0.61567456 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification