cvpr cvpr2013 cvpr2013-25 knowledge-graph by maker-knowledge-mining

25 cvpr-2013-A Sentence Is Worth a Thousand Pixels


Source: pdf

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. [sent-3, score-0.559]

2 We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. [sent-4, score-0.588]

3 We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. [sent-5, score-0.678]

4 We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8]. [sent-6, score-0.382]

5 However, very few approaches [18, 26] try to use text to improve semantic understanding of images beyond simple image classification [21], or tag generation [6, 2]. [sent-18, score-0.308]

6 If we were able to retrieve the objects and stuff present in the scene, their relations and the actions they perform from textual descriptions, we should be able to do a much better job at automatically parsing those images. [sent-24, score-0.292]

7 Here we are interested in exploiting textual information for semantic scene understanding. [sent-25, score-0.26]

8 In particular, our goal is to reason jointly about the scene type, objects, their location and spatial extent in an image, while exploiting textual information in the form of complex sentential image descriptions generated by humans. [sent-26, score-0.414]

9 Being able to extract semantic information from text does not entirely solve the image parsing problem, as we cannot expect sentences to describe everything that is happening in the image, or to do so in great detail. [sent-27, score-0.73]

10 Furthermore, not all information in the descriptions may be visually relevant, thus textual information also contains considerable noise for the task of interest, which needs to be properly handled. [sent-29, score-0.26]

11 In this paper we propose a holistic model for semantic parsing which employs text as well as image information. [sent-31, score-0.542]

12 Our model is a conditional random field (CRF) which employs complex sentential image descriptions to jointly reason about multiple scene recognition tasks. [sent-32, score-0.292]

13 We automatically parse the sentences and extract objects and their relationships, and incorporate those into the model, both via potentials as well as by re-ranking the candidate bounding boxes. [sent-33, score-0.678]

14 The generated sentences are semantically richer than a list of words, as the former describe objects, their actions and inter-object relations. [sent-41, score-0.335]

15 [7] generated sentences by learning a meaning space which is shared between text and image domains. [sent-42, score-0.564]

16 In [1], sentences were generated by semantically parsing videos into objects and their relations defined by actions. [sent-52, score-0.464]

17 Topic models have been widely employed to model text and images, particularly for retrieval tasks. [sent-53, score-0.267]

18 Corr-LDA [4] was proposed to capture the relations between images and text annotations, but assumes a one-to-one correspondence between text and image topics. [sent-54, score-0.504]

19 They used sentences generated by humans describing the images in order to analyze what humans think is important. [sent-64, score-0.335]

20 Despite the large body of work in generating descriptions or word annotations from images, there is little work that tries to use text to improve recognition. [sent-72, score-0.326]

21 Towards this goal, we utilize semantic labelings and rich textual descriptions of images to learn powerful holistic models. [sent-86, score-0.49]

22 Automatic Text Extraction In this section we show how to extract meaningful information from complex sentences for our image parsing task. [sent-91, score-0.455]

23 We extract part-of-speech (POS) tags for all sentences using the Stanford POS Tagger for English [28]. [sent-92, score-0.452]

24 We syntactically parse the sentences and obtain a parse tree using the Stanford Parser with the factored model [13]. [sent-93, score-0.517]

25 Given the POS tags, parse trees and typed dependencies, we would like to extract whether an object class was mentioned as well as its cardinality (i.e., how many instances are mentioned). [sent-105, score-0.529]

26 Object Cardinality: The object cardinality can appear in the sentence in two different forms. [sent-116, score-0.417]

27 For this purpose, we parse the entire sentence from left to right and, with each mention of a class noun, increase the count by 1 or 2, depending on whether the word used is singular or plural. [sent-122, score-0.348]
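
A minimal sketch of this mention-and-cardinality counting step, assuming spaCy's English model as a stand-in for the Stanford POS tagger used in the paper, and a hypothetical synonym table mapping noun lemmas to object classes:

```python
# Sketch only: spaCy replaces the Stanford POS tagger; the synonym table is
# illustrative and far smaller than the paper's class/synonym lists.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

CLASS_SYNONYMS = {  # noun lemma -> object class (illustrative)
    "person": "person", "man": "person", "woman": "person", "child": "person",
    "plane": "aeroplane", "airplane": "aeroplane",
    "car": "car", "cow": "cow", "sheep": "sheep", "dog": "dog",
}

def text_cardinality(sentence):
    """Scan left to right; each class-noun mention adds 1 if singular (NN)
    and 2 if plural (NNS/NNPS), following the counting rule above."""
    counts = defaultdict(int)
    for tok in nlp(sentence):
        cls = CLASS_SYNONYMS.get(tok.lemma_.lower())
        if cls is not None and tok.tag_.startswith("NN"):
            counts[cls] += 2 if tok.tag_ in ("NNS", "NNPS") else 1
    return dict(counts)

print(text_cardinality("Two planes are parked near a car and a person."))
# expected along the lines of {'aeroplane': 2, 'car': 1, 'person': 1}
```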

28 Object Relations: We extract object relations by identifying prepositions and the objects they modify. [sent-125, score-0.315]

29 Towards this goal, we first locate the prepositions of interest in the sentence (i.e., near, in, on, in front of). [sent-126, score-0.382]

30 For each preposition we utilize the parse tree in order to locate the objects modified by the preposition. [sent-129, score-0.358]

31 Note that for a given preposition there could be more than one tuple of this form as well as more than one preposition in each sentence. [sent-131, score-0.452]

32 For example, in the sentence ”two planes are parked near a car and a person”, we can extract (plane, near, car) and (plane, near, person). [sent-132, score-0.29]
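
A sketch of this relation-tuple extraction, substituting spaCy's dependency parse for the Stanford parse trees and typed dependencies used in the paper; the preposition list and the subject/object handling are simplified assumptions:

```python
# Sketch only: spaCy dependency labels stand in for the Stanford parse trees.
import spacy

nlp = spacy.load("en_core_web_sm")
PREPOSITIONS = {"near", "in", "on", "with", "under", "behind"}  # illustrative set

def relation_tuples(sentence):
    """Return (governor_noun, preposition, object_noun) tuples.
    The governor is the subject of the verb the preposition attaches to,
    or the noun it directly modifies."""
    tuples = []
    for tok in nlp(sentence):
        if tok.dep_ != "prep" or tok.lemma_.lower() not in PREPOSITIONS:
            continue
        pobjs = [c for c in tok.children if c.dep_ == "pobj"]
        head = tok.head
        if head.pos_ == "VERB":
            governors = [c for c in head.children if c.dep_ in ("nsubj", "nsubjpass")]
        else:
            governors = [head]
        for g in governors:
            for o in pobjs:
                # A pobj with conjuncts ("a car and a person") yields one tuple each.
                for obj in [o] + list(o.conjuncts):
                    tuples.append((g.lemma_, tok.lemma_, obj.lemma_))
    return tuples

print(relation_tuples("Two planes are parked near a car and a person."))
# e.g. [('plane', 'near', 'car'), ('plane', 'near', 'person')]
```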

33 Holistic Scene Understanding In this section we describe our approach to holistic scene understanding that uses text as well as image information in order to reason about multiple recognition tasks, i. [sent-140, score-0.456]

34 We employ potentials which utilize the image I, text T, as well as statistics of the variables of interest. [sent-164, score-0.49]

35 When a class is mentioned, we use the average cardinality (across all sentences) for each class: φ_class^ment(z_i | T) = C̄ard(i) if z_i = 1 and class i is mentioned, and 0 otherwise. [sent-276, score-0.385]

36 When a class is not mentioned, we simply use a bias: φ_class^not-ment(z_i | T) = 1 if z_i = 1 and class i is not mentioned, and 0 otherwise. [sent-278, score-0.27]

37 We learn a separate weight for each of these two features per class, capturing how often each class is “on” or “off” depending on whether it is mentioned. [sent-280, score-0.285]
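
A minimal sketch of the two class-presence features as reconstructed above; the feature values follow the definitions in sentences 35-36, while the per-class weights would be learned (the numbers below are only illustrative):

```python
# Sketch of the two text-based class-presence features; the weights that
# multiply them are learned in the CRF and are not shown here.
def class_presence_features(z_i, avg_cardinality, mentioned):
    """z_i in {0, 1}: is class i 'on' in the scene.
    avg_cardinality: average text cardinality of class i over the image's sentences.
    mentioned: whether class i is mentioned at all."""
    phi_ment = avg_cardinality if (z_i == 1 and mentioned) else 0.0
    phi_not_ment = 1.0 if (z_i == 1 and not mentioned) else 0.0
    return phi_ment, phi_not_ment

print(class_presence_features(1, 2.0, True))   # (2.0, 0.0): class mentioned twice on average
print(class_presence_features(1, 0.0, False))  # (0.0, 1.0): class on but never mentioned
```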

38 In this case, we add as many boxes as dictated by the extracted object cardinality C̄ard(cls). [sent-302, score-0.322]

39 We utilize both text and images to compute the score for each detection. [sent-303, score-0.27]

40 In particular, for each box we compute a feature vector composed of the original detector’s score, the average cardinality for that class extracted from text, as well as object size relative to the image size. [sent-304, score-0.351]
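
A sketch of the per-box feature vector described in sentence 40; the re-scoring classifier trained on top of it is not shown, and the features simply mirror the three quantities listed (detector score, text cardinality, relative size):

```python
# Sketch of the per-box features used for text-based re-scoring of detections.
def box_rescoring_features(det_score, avg_text_cardinality, box, image_size):
    """box = (x1, y1, x2, y2); image_size = (width, height)."""
    box_area = max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])
    image_area = float(image_size[0] * image_size[1])
    relative_size = box_area / image_area
    return [det_score, avg_text_cardinality, relative_size]

print(box_rescoring_features(-0.3, 2.0, (50, 60, 250, 220), (640, 480)))
```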

41 In this potential we would like to encode that if the text refers to two cars being in the image, then our model should expect at least two cars to be on. [sent-420, score-0.297]

42 Thus our potential should penalize all car box configurations that have cardinality smaller than the estimated cardinality. [sent-421, score-0.383]
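
A sketch of such a cardinality potential under the reading of sentences 41-42: configurations with fewer active boxes of a class than the text cardinality are penalized (the penalty weight is an illustrative stand-in for a learned one):

```python
# Sketch of a cardinality potential: penalise configurations that switch on
# fewer boxes of a class than the cardinality extracted from text.
def cardinality_potential(box_indicators, text_cardinality, penalty=1.0):
    """box_indicators: list of 0/1 flags for the class's candidate boxes."""
    n_on = sum(box_indicators)
    deficit = max(0, text_cardinality - n_on)
    return -penalty * deficit  # higher (less negative) is better

print(cardinality_potential([1, 0, 0], text_cardinality=2))  # -1.0: one car missing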

43 In order to exploit this fact, we extract prepositions from text and use them to score pairs of boxes. [sent-433, score-0.466]

44 We extract prepositions of interest (e.g., near, in, on, in front of) from text using the algorithm proposed in the previous section. [sent-436, score-0.266]

45 We train a classifier for each preposition that uses features defined over pairs of bounding boxes of the referred object classes, e. [sent-438, score-0.349]

46 For each preposition, we train an SVM classifier with a polynomial kernel using these features. [sent-443, score-0.294]

47 We compute the new score for each box using the preposition classifier on the pairwise features as follows: r̂_i = max_{j, prep} score(r_{i,j,prep}). [sent-445, score-0.301]

48 where r_{i,j,prep} is the output of the classifier for preposition prep between boxes i and j. [sent-452, score-0.397]

49 While the order of the classes (left or right of the preposition) might matter for certain prepositions (such as “on top of”), in our experiments ignoring the position of the box yielded the best results. [sent-454, score-0.276]
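
A sketch of the pairwise preposition scoring of sentences 43-49, assuming scikit-learn's SVC with a polynomial kernel in place of the paper's per-preposition SVMs; the pairwise box features (relative offset, relative scale) and the tiny training set are illustrative, not the paper's exact setup:

```python
# Sketch only: one SVM per preposition, scored over pairs of boxes, then a
# max over pairs and prepositions gives the re-ranked detection score.
import numpy as np
from sklearn.svm import SVC

def pair_features(box_i, box_j):
    """Relative offset and relative scale of box_j w.r.t. box_i (x1, y1, x2, y2)."""
    cx_i, cy_i = (box_i[0] + box_i[2]) / 2.0, (box_i[1] + box_i[3]) / 2.0
    cx_j, cy_j = (box_j[0] + box_j[2]) / 2.0, (box_j[1] + box_j[3]) / 2.0
    w_i, h_i = box_i[2] - box_i[0], box_i[3] - box_i[1]
    w_j, h_j = box_j[2] - box_j[0], box_j[3] - box_j[1]
    return [(cx_j - cx_i) / w_i, (cy_j - cy_i) / h_i, w_j / w_i, h_j / h_i]

# Toy training data for a single "near" classifier (one positive, one negative pair).
X = np.array([pair_features((0, 0, 10, 10), (12, 0, 20, 8)),
              pair_features((0, 0, 10, 10), (80, 80, 90, 95))])
y = np.array([1, 0])
clf_near = SVC(kernel="poly", degree=3).fit(X, y)

def rescore(box_i, other_boxes, classifiers):
    """r_hat_i = max over pairs j and prepositions of the classifier score."""
    scores = [clf.decision_function([pair_features(box_i, b)])[0]
              for b in other_boxes for clf in classifiers]
    return max(scores) if scores else 0.0

print(rescore((0, 0, 10, 10), [(12, 0, 20, 8), (80, 80, 90, 95)], [clf_near]))
```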

50 Following [29], we exploit this by defining potentials φ_cls^sh(x_i, b_l | I) which utilize a shape prior for each component. [sent-463, score-0.264]

51 Scene Potentials. Text Scene Potential: We first extract a vocabulary of words appearing in the full text corpus, resulting in 1793 words. [sent-468, score-0.266]

52 We then define the unary potential for the scene node as φ_scene(s = u | T) = σ(t_u), where t_u denotes the classifier score for scene class u and σ is again the logistic function. [sent-477, score-0.346]
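
A sketch of this text scene potential, assuming a scikit-learn bag-of-words classifier in place of the paper's classifier over its 1793-word vocabulary; the toy corpus and scene labels are illustrative:

```python
# Sketch only: bag-of-words text classifier -> per-scene score t_u -> sigma(t_u).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["cows grazing in a field",
               "passengers boarding a train at a station",
               "a sail boat gliding on the water"]
train_scenes = ["pasture", "station", "harbour"]

vectorizer = CountVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(train_texts), train_scenes)

def scene_potential(text):
    """phi_scene(s = u | T) = sigma(t_u), with t_u the per-scene classifier score."""
    t_u = clf.decision_function(vectorizer.transform([text]))[0]
    return dict(zip(clf.classes_, 1.0 / (1.0 + np.exp(-t_u))))

print(scene_potential("two cows grazing near a gate"))
```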

53 Scene-class compatibility: Following [29], we define a pairwise compatibility potential between the scene and the class labels as φ_SC(s, z_k) = −τ_{s,z_k} if z_k = 1, and 0 otherwise. [sent-478, score-0.286]
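
A minimal sketch of the reconstructed scene-class compatibility potential; the τ values below are illustrative, whereas in the model they are learned:

```python
# Sketch only: tau encodes how costly it is for class k to be on under scene s.
TAU = {"pasture": {"cow": 0.1, "train": 2.0},
       "station": {"cow": 2.0, "train": 0.1}}

def scene_class_compatibility(scene, cls, z_k):
    """phi_SC(s, z_k) = -tau[s][cls] if class k is on (z_k = 1), else 0."""
    return -TAU[scene][cls] if z_k == 1 else 0.0

print(scene_class_compatibility("pasture", "train", 1))  # a large penalty: -2.0
```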

54 We hand-annotated GT cardinality and prepositions for each sentence, which we call textGT. [sent-498, score-0.435]

55 For example, for “man is walking with child”, the textGT cardinality for person is two even though there may be more people in the image. [sent-501, score-0.326]

56 However, despite the rather low prediction accuracy, our experiments show that cardinality is a very strong textual cue. [sent-509, score-0.398]

57 Holistic parsing using text: We next evaluate our holistic model for semantic segmentation. [sent-517, score-0.272]

58 Our baselines consist of TextonBoost [25] (which our model employs to form the segmentation unary potentials) as well as the holistic model of [29], which only employs visual information. [sent-519, score-0.336]

59 Note that GT cardinality only affects the potential on z and the potential on the number of detection boxes. [sent-539, score-0.371]

60 “GT noneg” denotes the experiment where we encourage at least as many boxes as dictated by the cardinality to be on in the image. [sent-541, score-0.322]

61 Rescoring the DPM detections via text gives significant improvement in AP, as shown in Table 4. [sent-549, score-0.264]

62 Adding the text-based cardinality and scene potentials boosts the performance by another 5. [sent-553, score-0.318]

63 Amount of Text: We also experimented with the number of sentences per image. [sent-558, score-0.335]

64 We tried two settings: (a) using all available sentences per image in training but a different number in test, and (b) also varying the number of training sentences. [sent-559, score-0.335]

65 Since individual sentences may not be directly related to the visual content of the image, having more of them increases performance. [sent-562, score-0.335]

66 Results: Detector’s AP (%) using text-based re-scoring for different numbers of sentences used per image in train and test. [sent-581, score-0.369]

67 Conclusions In this paper, we tackled the problem of holistic scene understanding in scenarios where visual imagery is accompanied by text in the form of complex sentential descriptions or short image captions. [sent-583, score-0.656]

68 We proposed a CRF model that employs visual and textual data to reason jointly about objects, segmentation and scene classification. [sent-584, score-0.302]

69 Every picture tells a story: Generating sentences for images. [sent-662, score-0.335]

70 sent 1: “A dog herding two sheep. [sent-689, score-0.549]

71 ” sent 2: “A sheep dog and two sheep walking in a field. [sent-690, score-0.604]

72 ” sent 3: “Black dog herding sheep in grassy field. [sent-691, score-0.597]

73 ” sent 1: “Passengers at a station waiting to board a train pulled by a green locomotive engine. [sent-692, score-0.726]

74 ” sent 2: “Passengers loading onto a train with a green and black steam engine. [sent-693, score-0.58]

75 ” sent 3: “Several people waiting to board the train. [sent-694, score-0.618]

76 ” sent 3: “Five cows grazing in a snow covered field. [sent-697, score-0.652]

77 ” sent 4: “Three black cows and one brown cow stand in a snowy field. [sent-698, score-0.585]

78 ” sent 1: “A yellow and white sail boat glides between the shore and a yellow buoy. [sent-699, score-0.51]

79 ” sent 2: “Sail boat on water with two people riding inside. [sent-700, score-0.51]

80 ” sent 3: “Small sailboat with spinnaker passing a buoy. [sent-701, score-0.476]

81 ” sent 1: “A table is set with wine and dishes for two people. [sent-702, score-0.476]

82 ” sent 3: “A wooden table is set with candles, wine, and a purple plastic bowl. [sent-704, score-0.476]

83 ” sent 1: “An old fashioned passenger bus with open windows. [sent-705, score-0.509]

84 ” sent 2: “Bus with yellow flag sticking out window. [sent-706, score-0.476]

85 ” sent 3: “The front of a red, blue, and yellow bus. [sent-707, score-0.513]

86 ” sent 4: “The idle tourist bus awaits its passengers. [sent-708, score-0.509]

87 A few example images, sentences per image and the final segmentation. [sent-710, score-0.335]

88 [figure label residue omitted: columns for image, Yao [29], one sent, two sent, three sent; class labels person, train, cow, sofa] sent 1: “Passengers at a station waiting to board a train pulled by a green locomotive engine. [sent-711, score-2.558]

89 ” sent 2: “Passengers loading onto a train with a green and black steam engine. [sent-712, score-0.58]

90 ” sent 3: “Several people waiting to board the train. [sent-713, score-0.618]

91 ” sent 1: “Black and white cows grazing in a pen. [sent-714, score-0.652]

92 ’ sent 2: “The black and white cows pause in front of the gate. [sent-715, score-0.622]

93 ” sent 3: “Two cows in a field grazing near a gate. [sent-716, score-0.684]

94 ” [figure label residue omitted: columns for image, Yao [29], one sent, two sent, three sent; class labels person, aeroplane] sent 1: “Two men on a plane, the closer one with a suspicious look on his face. [sent-717, score-0.476]

95 ” sent 2: “A wide-eyed blonde man sits in an airplane next to an Asian ...” “Up close photo of man with short blonde hair on airplane. [sent-719, score-0.672]

96 ” sent 3: [figure label residue omitted] sent 1: “Man using computer on a table. [sent-721, score-1.034]

97 ” sent 2: “The man sitting at a messy table and using a laptop. [sent-722, score-0.569]

98 ” sent 3: “Young man sitting at messy table staring at laptop. [sent-723, score-0.569]

99 Results as a function of the number of sentences employed. [sent-725, score-0.335]

100 Connecting modalities: Semisupervised segmentation and annotation of images using unaligned text corpora. [sent-825, score-0.276]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sent', 0.476), ('sentences', 0.335), ('cardinality', 0.235), ('text', 0.229), ('preposition', 0.226), ('prepositions', 0.2), ('potentials', 0.183), ('sentence', 0.182), ('textual', 0.163), ('holistic', 0.143), ('cows', 0.109), ('sentential', 0.103), ('descriptions', 0.097), ('parse', 0.091), ('nouns', 0.084), ('parsing', 0.083), ('passengers', 0.082), ('prep', 0.082), ('textgt', 0.082), ('class', 0.075), ('potential', 0.068), ('grazing', 0.067), ('unary', 0.067), ('gt', 0.065), ('pos', 0.064), ('waiting', 0.064), ('srn', 0.062), ('mentioned', 0.06), ('compatibility', 0.058), ('man', 0.057), ('uiuc', 0.056), ('boxes', 0.055), ('synonyms', 0.055), ('ype', 0.055), ('fidler', 0.054), ('scene', 0.051), ('sheep', 0.048), ('segmentation', 0.047), ('relations', 0.046), ('semantic', 0.046), ('tags', 0.044), ('textonboost', 0.044), ('men', 0.044), ('board', 0.044), ('relation', 0.042), ('box', 0.041), ('employs', 0.041), ('blonde', 0.041), ('gatbelc', 0.041), ('herding', 0.041), ('rtanip', 0.041), ('utilize', 0.041), ('voc', 0.041), ('bl', 0.04), ('car', 0.039), ('zk', 0.039), ('crf', 0.038), ('employed', 0.038), ('front', 0.037), ('child', 0.037), ('extract', 0.037), ('employ', 0.037), ('duygulu', 0.036), ('cls', 0.036), ('locomotive', 0.036), ('messy', 0.036), ('nps', 0.036), ('prepositional', 0.036), ('pulled', 0.036), ('rps', 0.036), ('senses', 0.036), ('station', 0.036), ('stats', 0.036), ('steam', 0.036), ('webpages', 0.036), ('ap', 0.036), ('language', 0.036), ('detections', 0.035), ('classes', 0.035), ('dpm', 0.035), ('wi', 0.034), ('train', 0.034), ('sail', 0.034), ('loading', 0.034), ('classifier', 0.034), ('people', 0.034), ('understanding', 0.033), ('bus', 0.033), ('dog', 0.032), ('boosts', 0.032), ('near', 0.032), ('candidate', 0.032), ('extracting', 0.032), ('adjectives', 0.032), ('ard', 0.032), ('dictated', 0.032), ('psr', 0.032), ('type', 0.031), ('ioft', 0.031), ('rescoring', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

2 0.23076862 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug-in ” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room ” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted in depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MRSC dataset.

3 0.21735467 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

Author: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso

Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.

4 0.14929472 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

5 0.14121701 5 cvpr-2013-A Bayesian Approach to Multimodal Visual Dictionary Learning

Author: Go Irie, Dong Liu, Zhenguo Li, Shih-Fu Chang

Abstract: Despite significant progress, most existing visual dictionary learning methods rely on image descriptors alone or together with class labels. However, Web images are often associated with text data which may carry substantial information regarding image semantics, and may be exploited for visual dictionary learning. This paper explores this idea by leveraging relational information between image descriptors and textual words via co-clustering, in addition to information of image descriptors. Existing co-clustering methods are not optimal for this problem because they ignore the structure of image descriptors in the continuous space, which is crucial for capturing visual characteristics of images. We propose a novel Bayesian co-clustering model to jointly estimate the underlying distributions of the continuous image descriptors as well as the relationship between such distributions and the textual words through a unified Bayesian inference. Extensive experiments on image categorization and retrieval have validated the substantial value of the proposed joint modeling in improving visual dictionary learning, where our model shows superior performance over several recent methods.

6 0.13601924 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

7 0.13125247 156 cvpr-2013-Exploring Compositional High Order Pattern Potentials for Structured Output Learning

8 0.11583998 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

9 0.10194194 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

10 0.099484771 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors

11 0.09925469 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes

12 0.091367051 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential

13 0.090130247 24 cvpr-2013-A Principled Deep Random Field Model for Image Segmentation

14 0.088573813 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters

15 0.088046275 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation

16 0.087978445 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation

17 0.085361496 364 cvpr-2013-Robust Object Co-detection

18 0.080523774 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels

19 0.080014758 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

20 0.079481617 187 cvpr-2013-Geometric Context from Videos


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, -0.061), (2, 0.016), (3, -0.045), (4, 0.102), (5, 0.046), (6, 0.026), (7, 0.127), (8, -0.04), (9, 0.009), (10, 0.041), (11, 0.004), (12, -0.023), (13, 0.015), (14, -0.046), (15, 0.06), (16, 0.069), (17, 0.087), (18, -0.014), (19, -0.072), (20, -0.011), (21, -0.036), (22, 0.086), (23, 0.033), (24, 0.013), (25, -0.059), (26, 0.019), (27, 0.086), (28, -0.007), (29, -0.138), (30, -0.095), (31, -0.042), (32, -0.037), (33, 0.069), (34, -0.029), (35, -0.004), (36, -0.034), (37, 0.099), (38, -0.117), (39, 0.025), (40, -0.008), (41, 0.051), (42, -0.099), (43, -0.001), (44, -0.06), (45, 0.036), (46, -0.086), (47, 0.014), (48, 0.129), (49, -0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91610008 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

2 0.76790887 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug-in ” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room ” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted in depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MRSC dataset.

3 0.68282151 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

Author: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso

Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.

4 0.67351979 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction

Author: C. Lawrence Zitnick, Devi Parikh

Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

5 0.66722989 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection

Author: Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, Zhong Zhang

Abstract: Scene text recognition has inspired great interests from the computer vision community in recent years. In this paper, we propose a novel scene text recognition method using part-based tree-structured character detection. Different from conventional multi-scale sliding window character detection strategy, which does not make use of the character-specific structure information, we use part-based tree-structure to model each type of character so as to detect and recognize the characters at the same time. While for word recognition, we build a Conditional Random Field model on the potential character locations to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework. The final word recognition result is obtained by minimizing the cost function defined on the random field. Experimental results on a range of challenging public datasets (ICDAR 2003, ICDAR 2011, SVT) demonstrate that the proposed method outperforms stateof-the-art methods significantly bothfor character detection and word recognition.

6 0.61732155 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

7 0.60349482 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations

8 0.58403367 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation

9 0.56778163 24 cvpr-2013-A Principled Deep Random Field Model for Image Segmentation

10 0.56533992 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

11 0.56170154 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

12 0.55494004 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation

13 0.54022515 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

14 0.53829825 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior

15 0.53506392 156 cvpr-2013-Exploring Compositional High Order Pattern Potentials for Structured Output Learning

16 0.52672642 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters

17 0.51417458 406 cvpr-2013-Spatial Inference Machines

18 0.51260114 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

19 0.49608529 78 cvpr-2013-Capturing Layers in Image Collections with Componential Models: From the Layered Epitome to the Componential Counting Grid

20 0.4926976 278 cvpr-2013-Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.089), (16, 0.018), (26, 0.053), (28, 0.013), (33, 0.236), (39, 0.013), (62, 0.252), (67, 0.078), (69, 0.052), (80, 0.013), (87, 0.061), (99, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81445789 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].

2 0.80689949 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data

Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen

Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.

3 0.80512732 286 cvpr-2013-Mirror Surface Reconstruction from a Single Image

Author: Miaomiao Liu, Richard Hartley, Mathieu Salzmann

Abstract: This paper tackles the problem of reconstructing the shape of a smooth mirror surface from a single image. In particular, we consider the case where the camera is observing the reflection of a static reference target in the unknown mirror. We first study the reconstruction problem given dense correspondences between 3D points on the reference target and image locations. In such conditions, our differential geometry analysis provides a theoretical proof that the shape of the mirror surface can be uniquely recovered if the pose of the reference target is known. We then relax our assumptions by considering the case where only sparse correspondences are available. In this scenario, we formulate reconstruction as an optimization problem, which can be solved using a nonlinear least-squares method. We demonstrate the effectiveness of our method on both synthetic and real images.

4 0.732391 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

5 0.73213428 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu

Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods[24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplarbased face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other facerelated tasks, such as attribute recognition, as well as general object detection.

6 0.73078239 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

7 0.72963065 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

8 0.72905731 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

9 0.72902972 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

10 0.72876841 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

11 0.72874171 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

12 0.7281028 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

13 0.72799289 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

14 0.72788411 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

15 0.72686547 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

16 0.72672814 325 cvpr-2013-Part Discovery from Partial Correspondence

17 0.72620869 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

18 0.72602296 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

19 0.72585714 311 cvpr-2013-Occlusion Patterns for Object Class Detection

20 0.72553635 414 cvpr-2013-Structure Preserving Object Tracking