nips nips2011 nips2011-126 knowledge-graph by maker-knowledge-mining

126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs


Source: pdf

Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We develop and demonstrate automatic image description methods using a large captioned photo collection. [sent-3, score-0.636]

2 One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. [sent-4, score-0.389]

3 We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. [sent-6, score-0.54]

4 1 Introduction: Producing a relevant and accurate caption for an arbitrary image is an extremely challenging problem, perhaps nearly as difficult as the underlying general image understanding task. [sent-8, score-0.861]

5 In this paper, we present a method to effectively skim the top of the image understanding problem to caption photographs by collecting and utilizing the large body of images on the internet with associated visually descriptive text. [sent-11, score-1.032]

6 image localization [13], retrieving photos with specific content [27], or image parsing [26] – much more bite-sized and amenable to very simple non-parametric matching methods. [sent-14, score-0.89]

7 In our case, with a large captioned photo collection we can describe an image surprisingly well even with basic global image representations for retrieval and caption transfer. [sent-15, score-1.327]

8 In addition, we show that it is possible to make use of large numbers of state of the art, but fairly noisy estimates of image content to produce more pleasing and relevant results. [sent-16, score-0.591]

9 Studying collections of existing natural image descriptions and how to compose descriptions for novel queries will help advance progress toward more complex human recognition goals, such as how to tell the story behind an image. [sent-19, score-0.69]

10 These goals include determining what content people judge to be most important in images and what factors they use to construct natural language to describe imagery. [sent-20, score-0.538]

11 Example caption: "Emma in her hat looking super cute." Figure 1: SBU Captioned Photo Dataset: Photographs with user-associated captions from our web-scale captioned photo collection. [sent-27, score-0.657]

12 We collect a large number of photos from Flickr and filter them to produce a data collection containing over 1 million well captioned pictures. [sent-28, score-0.452]

13 Often a variety of features related to document content [23], surface [25], events [19] or feature combinations [28] are used in the selection process to produce sentences that reflect the most significant concepts in the document. [sent-30, score-0.383]

14 In our photo captioning problem, we would like to generate a caption for a query picture that summarizes the salient image content. [sent-31, score-1.054]

15 We do this by considering a large relevant document set constructed from related image captions and then use extractive methods to select the best caption(s) for the image. [sent-32, score-0.776]

16 In this way we implicitly make use of human judgments of content importance during description generation, by directly transferring human made annotations from one image to another. [sent-33, score-0.614]

17 This paper presents two extractive approaches for image description generation. [sent-34, score-0.415]

18 The first uses global image representations to select relevant captions (Sec 3). [sent-35, score-0.656]

19 The second incorporates features derived from noisy estimates of image content (Sec 5). [sent-36, score-0.509]

20 Therefore, to enable our approach we build a web-scale collection of images with associated descriptions (i.e., captions) to serve as our document for relevant caption extraction. [sent-38, score-0.915]

21 Some small collections of captioned images have been created by hand in the past. [sent-40, score-0.476]

22 The ImageClef2 image retrieval challenge contains 10k images with associated human descriptions. [sent-42, score-0.449]

23 However neither of these collections is large enough to facilitate reasonable image based matching necessary for our goals, as demonstrated by our experiments on captioning with varying collection size (Sec 3). [sent-43, score-0.497]

24 In addition this is the first – to our knowledge – attempt to mine the internet for general captioned images on a web scale! [sent-44, score-0.536]

25 In summary, our contributions are: • A large novel data set containing images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content. [sent-45, score-0.771]

26 • A description generation method that utilizes global image representations to retrieve and transfer captions from our data set to a query image. [sent-46, score-0.961]

27 • A description generation method that utilizes both global representations and direct estimates of image content (objects, actions, stuff, attributes, and scenes) to produce relevant image descriptions. [sent-47, score-1.026]

28 This results in descriptions for images that are usually closely related to image content, but that are also often quite verbose and non-humanlike. [sent-52, score-0.616]

29 Feng and Lapata [11] generate captions for images using extractive and abstractive generation methods, but assume relevant documents are provided as input, whereas our generation method requires only an image as input. [sent-66, score-1.088]

30 In this work the authors produce image descriptions via a retrieval method, by translating both images and text descriptions to a shared meaning space represented by a single < object, action, scene > tuple. [sent-68, score-0.977]

31 A description for a query image is produced by retrieving whole image descriptions via this meaning space from a set of image descriptions (the UIUC Pascal Sentence data set). [sent-69, score-1.373]

32 This results in descriptions that are very human – since they were written by humans – but which may not be relevant to the specific image content. [sent-70, score-0.513]

33 This limited relevancy often occurs because of problems of sparsity, both in the data collection – 1000 images is too few to guarantee similar image matches – and in the representation – only a few categories for 3 types of image content are considered. [sent-71, score-1.076]

34 In contrast, we attack the caption generation problem for much more general images (images found via thousands of Flickr queries compared to 1000 images from Pascal) and a larger set of object categories (89 vs 20). [sent-72, score-0.935]

35 In addition to extending the object category list considered, we also include a wider variety of image content aspects, including: non-part based stuff categories, attributes of objects, person specific action models, and a larger number of common scene classes. [sent-73, score-1.042]

36 We also generate our descriptions via an extractive method with access to a much larger and more general set of captioned photographs from the web (1 million vs 1 thousand). [sent-74, score-0.776]

37 Captions can also be generated after step 2 from descriptions associated with top globally matched images. [sent-78, score-0.489]

38 In the rest of the paper, we describe collecting a web-scale data set of captioned images from the internet (Sec 2. [sent-79, score-0.49]

39 1), caption generation using a global representation (Sec 3), content estimation for various content types (Sec 4), and finally present an extension to our generation method that incorporates content estimates (Sec 5). [sent-80, score-1.379]

40 To achieve the first requirement we query Flickr using a huge number of pairs of query terms (objects, attributes, actions, stuff, and scenes). [sent-84, score-0.434]
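
As a rough illustration of this querying step, the sketch below builds paired query strings from small term lists; the term lists and the build_queries helper are illustrative assumptions, not the authors' actual query vocabulary.

    # Minimal sketch (assumed term lists, not the paper's actual vocabulary):
    # build Flickr-style query strings from pairs of content terms.
    from itertools import product

    objects = ["dog", "boat", "chair"]          # hypothetical object terms
    modifiers = ["beach", "grass", "running"]   # hypothetical scene/stuff/action terms

    def build_queries(objects, modifiers):
        """Return 'object modifier' query strings for every pairing."""
        return [f"{o} {m}" for o, m in product(objects, modifiers)]

    queries = build_queries(objects, modifiers)
    # Each query string would then be submitted to the Flickr search API
    # and the returned photos pooled before caption-based filtering.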

41 To achieve our second requirement [Figure 3: Size Matters – example matches to a query image for varying data set sizes: 1k, 10k, 100k, and 1 million matches] [sent-86, score-0.49]

42 we filter this set of photos so that the descriptions attached to a picture are relevant and visually descriptive. [sent-87, score-0.432]

43 To encourage visual descriptiveness in our collection, we select only those images with descriptions of satisfactory length based on observed lengths in visual descriptions. [sent-88, score-0.455]
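
One plausible reading of this length filter is sketched below; the specific word-count bounds are assumptions for illustration, since the text only states that satisfactory lengths are chosen from observed visual descriptions.

    # Hedged sketch of length-based caption filtering; the bounds are assumed,
    # not the thresholds actually used to build the SBU collection.
    MIN_WORDS, MAX_WORDS = 4, 40  # assumed bounds on a "satisfactory" caption length

    def keep_caption(caption: str) -> bool:
        """Keep captions whose word count falls inside the assumed range."""
        n_words = len(caption.split())
        return MIN_WORDS <= n_words <= MAX_WORDS

    photos = [("im1.jpg", "my chair by the river"),
              ("im2.jpg", "IMG_0042")]  # toy examples
    filtered = [(im, cap) for im, cap in photos if keep_caption(cap)]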

44 This results in a final collection of over 1 million images with associated text descriptions – the SBU Captioned Photo Dataset. [sent-92, score-0.517]

45 These text descriptions generally function in a similar manner to image captions, and usually directly refer to some aspects of the visual image content (see fig 1 for examples). [sent-93, score-1.024]

46 Hereafter, we will refer to this web based collection of captioned images as C. [sent-94, score-0.552]

47 As is usually the case with web photos, the photos in this set display a wide range of difficulty for visual recognition algorithms and captioning, from images that depict scenes (e. [sent-96, score-0.372]

48 We achieve this by computing the global similarity of a query image to our large web-collection of captioned images, C. [sent-106, score-0.817]

49 We find the closest matching image (or images) and simply transfer over the description from the matching image to the query image. [sent-107, score-0.842]

50 We also collect the 100 most similar images to a query – our matched set of images Im ∈ M – for use in our content-based description generation method (Sec 5). [sent-108, score-1.254]

51 The first descriptor is the well-known gist feature, a global image descriptor related to perceptual dimensions – naturalness, roughness, ruggedness, etc. – of scenes. [sent-110, score-0.428]

52 The second descriptor is also a global image descriptor, computed by resizing the image into a “tiny image”, essentially a thumbnail of size 32x32. [sent-111, score-0.601]

53 To find visually relevant images we compute the similarity of the query image to images in C using a sum of gist similarity and tiny image color similarity (equally weighted). [sent-113, score-1.33]
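
The retrieval step can be pictured as below, a minimal sketch assuming precomputed gist vectors and 32x32 tiny-image color vectors; the Euclidean distances and equal weighting follow the description above, but the exact feature normalization is not specified here.

    # Sketch of global matching and caption transfer (assumes precomputed descriptors).
    import numpy as np

    def global_distance(q_gist, q_tiny, gists, tinys):
        """Equally weighted sum of gist distance and tiny-image color distance."""
        d_gist = np.linalg.norm(gists - q_gist, axis=1)
        d_tiny = np.linalg.norm(tinys - q_tiny, axis=1)
        return d_gist + d_tiny  # equal weights, as stated above

    def retrieve(q_gist, q_tiny, gists, tinys, captions, k=100):
        """Return the top-k captioned matches; the top-1 caption is the global result."""
        d = global_distance(q_gist, q_tiny, gists, tinys)
        order = np.argsort(d)[:k]
        return [(int(i), captions[int(i)]) for i in order]

    # matched_set = retrieve(q_gist, q_tiny, gists, tinys, captions, k=100)
    # transferred_caption = matched_set[0][1]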

54 Our global caption generation method is illustrated in the first 2 panes and the first 2 resulting captions of Fig 2. [sent-115, score-0.788]

55 4 Image Content Estimation: Given an initial matched set of images Im ∈ M based on global descriptor similarity, we would like to re-rank the selected captions by incorporating estimates of image content. [sent-120, score-1.117]

56 For a query image, Iq and images in its matched set we extract and compare 5 kinds of image content: • Objects (e. [sent-121, score-0.902]

57 Each type of content is used to compute the similarity between matched images (and captions) and the query image. [sent-134, score-0.976]

58 We then rank the matched images (and captions) according to each content measure and combine their results into an overall relevancy ranking (Sec 5). [sent-135, score-0.716]
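
A simple way to picture the rank combination is averaging the per-measure ranks, as sketched below; the paper actually learns the combination (Sec 5), so the uniform averaging here is only an illustrative stand-in, and the measure names are assumptions.

    # Illustrative rank aggregation over per-content-measure similarity scores.
    # 'scores' maps each content measure name to a list of similarities,
    # one value per matched image (higher = more similar).
    import numpy as np

    def combined_ranking(scores: dict) -> list:
        """Average the per-measure ranks and return matched-image indices, best first."""
        measures = list(scores.values())
        n = len(measures[0])
        ranks = np.zeros(n)
        for s in measures:
            # argsort of negated scores gives rank 0 to the most similar image
            order = np.argsort(-np.asarray(s))
            r = np.empty(n)
            r[order] = np.arange(n)
            ranks += r
        return list(np.argsort(ranks))

    # e.g. combined_ranking({"objects": [...], "stuff": [...], "scene": [...]})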

59 As the number of object detectors increases this becomes even more of an obstacle to content prediction. [sent-139, score-0.437]

60 In our web collection, C, there are strong indicators of content in the form of caption words – if an object is described in the text associated with an image then it is likely to be depicted. [sent-141, score-1.029]

61 Therefore, for the images, Im ∈ M , in our matched set we run only those detectors for objects (or stuff) that are mentioned in the associated caption. [sent-142, score-0.384]
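
The caption-driven detector gating can be sketched as follows; the keyword-to-detector mapping and the detector interface are assumptions made for illustration.

    # Sketch: only run detectors whose object (or stuff) name appears in the caption.
    # 'detectors' is an assumed mapping from category name to a detector callable.
    def detect_mentioned(image, caption, detectors):
        """Run a detector only if its category word occurs in the caption text."""
        words = set(caption.lower().split())
        detections = {}
        for category, detector in detectors.items():
            if category in words:
                detections[category] = detector(image)
        return detections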

62 For a query image we can essentially perform detection verification against the relatively clean matched image detections. [sent-147, score-1.014]

63 This also aids similarity computation between a query and a matched image because objects can be matched at an action level. [sent-153, score-1.128]

64 Training images for the attribute classifiers come from Flickr, Google, the attribute dataset provided by Farhadi et al [8], and ImageNet [5]. [sent-170, score-0.376]

65 While the low level features are useful for discriminating stuff by their appearance, the scene layout maps introduce a soft preference for certain spatial locations dependent on stuff type. [sent-180, score-0.479]
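
One way to read this combination is a per-pixel blend of an appearance score map and a stuff-specific location prior, as in the sketch below; the mixing rule and weight are assumptions, since the exact formulation is not given in this summary.

    # Hedged sketch: combine a stuff appearance score map with a spatial layout prior.
    import numpy as np

    def stuff_score_map(appearance_map, layout_prior, alpha=0.5):
        """
        appearance_map : HxW array of per-pixel classifier scores for one stuff type
        layout_prior   : HxW array encoding where this stuff type tends to occur
        alpha          : assumed mixing weight between appearance and layout
        """
        prior = layout_prior / (layout_prior.max() + 1e-8)   # soft spatial preference in [0, 1]
        return alpha * appearance_map + (1 - alpha) * prior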

66 Example caption: "my chair by the river." Figure 4: Results: Some good captions selected by our system for query images. [sent-187, score-0.526]

67 4 Scenes: The last commonly described kind of image content relates to the general scene where an image was captured. [sent-199, score-0.841]

68 This often occurs when examining captioned photographs of vacation snapshots or general outdoor settings, e. [sent-200, score-0.373]

69 Example captions from the figure: "water under the bridge"; "the water the boat was in"; "girl in a box"; "that is a train"; "walking the dog in the primeval forest"; "small dog in the grass"; "shadows in the blue sky"; "I tried to cross the street to get in my car but you can see that I failed LOL." [sent-214, score-0.509]

70 5 TFIDF Measures: For a query image, Iq, we wish to select the best caption from the matched set, Im ∈ M. [sent-217, score-0.813]

71 For all of the content measures described so far, we have computed the similarity of the query image content to the content of each matched image independently. [sent-218, score-1.812]

72 We would also like to use information from the entire matched set of images and associated captions to predict importance. [sent-219, score-0.755]

73 We calculate this weighting both in the standard sense for matched caption document words and for detection category frequencies (to compensate for more prolific object detectors). [sent-222, score-0.801]

74 We define our matched set of captions (images for detector-based tfidf) to be our document j and compute the tfidf score tfidf_{i,j} = (n_{i,j} / Σ_k n_{k,j}) · log( |D| / |{j : t_i ∈ d_j}| ), where n_{i,j} represents the frequency of term i in the matched set of captions (the number of detections for detector-based tfidf). [sent-223, score-1.299]
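
Read literally, the matched captions form one document and D is a larger collection of caption documents used for the document frequencies; the sketch below follows that reading, with the whitespace tokenization and the background collection as assumptions.

    # Sketch of the tfidf weighting above: the matched captions form one document,
    # and document frequencies come from an assumed background collection of captions.
    import math
    from collections import Counter

    def tfidf_weights(matched_captions, background_docs):
        """Return a term -> tfidf weight map for the matched-caption document."""
        doc = " ".join(matched_captions).lower().split()
        tf = Counter(doc)
        n_total = sum(tf.values())
        n_docs = len(background_docs)
        weights = {}
        for term, n_ij in tf.items():
            df = sum(1 for d in background_docs if term in d.lower().split())
            idf = math.log(n_docs / (1 + df))          # +1 avoids division by zero
            weights[term] = (n_ij / n_total) * idf
        return weights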

75 5 Content Based Description Generation: For a query image, Iq, with global-descriptor-based matched images, Im ∈ M, we want to re-rank the matched images according to the similarity of their content to the query. [sent-225, score-1.363]

76 We perform this re-ranking individually for each of our content measures: object shape, object attributes, people actions, stuff classification, and scene type (Sec 4). [sent-226, score-0.842]

77 The second method divides our training set into two classes, positive images consisting of the top 50% of the training set by BLEU score, and negative images from the bottom 50%. [sent-229, score-0.394]
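
The second re-ranking method might look like the sketch below, assuming scikit-learn's LinearSVC and a per-image feature vector of content-similarity scores; the feature construction and the median split are illustrative assumptions.

    # Hedged sketch of the SVM-based re-ranker: label training images as positive
    # or negative by whether their BLEU score falls in the top or bottom half.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_reranker(features, bleu_scores):
        """features: NxD matrix of per-measure similarity scores; bleu_scores: length N."""
        bleu = np.asarray(bleu_scores)
        median = np.median(bleu)
        labels = (bleu >= median).astype(int)     # top 50% -> positive class
        clf = LinearSVC()
        return clf.fit(np.asarray(features), labels)

    # At test time the classifier's decision_function re-ranks the matched images:
    # order = np.argsort(-clf.decision_function(query_match_features))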

78 For a novel query image, we return the captions from the top ranked image(s) as our result. [sent-232, score-0.556]

79 For an example matched caption like “The little boy sat in the grass with a ball”, several types of content will be used to score the goodness of the caption. [sent-233, score-0.982]

80 This will be computed based on words in the caption for which we have trained content models. [sent-234, score-0.602]

81 For example, for the word “ball” both the object shape and attributes will be used to compute the best similarity between a ball detection in the query image and a ball detection in the matched image. [sent-235, score-1.055]

82 For the word “boy” an action descriptor will be used to compare the activity in which the boy is occupied between the query and the matched image. [sent-236, score-0.689]

83 For the word “grass” stuff classifications will be used to compare detections between the query and the matched image. [sent-237, score-0.751]

84 For each word in the caption, tfidf overlap (the sum of tfidf scores for the caption) is also used, as well as detector-based tfidf for those words referring to objects. [sent-238, score-0.404]

85 In the event that multiple objects (or stuff, people, or scenes) are mentioned in a matched image caption, the object (or stuff, people, or scene) based similarity measures will be a sum over the set of described terms. [sent-239, score-1.117]

86 For the case where a matched image caption contains a word, but there is no corresponding detection in the query image, the similarity is not incorporated. [sent-240, score-1.15]
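
Putting sentences 79-86 together, a matched caption's score might be assembled as sketched below; the function names and the way detections are looked up are assumptions, but the two stated rules are followed: similarities are summed over described terms, and words with no corresponding detection in the query image contribute nothing.

    # Hedged sketch of scoring one matched caption against a query image.
    def score_caption(caption, query_detections, match_detections,
                      similarity, tfidf):
        """
        caption          : text of the matched image's caption
        query_detections : dict word -> detection in the query image (assumed)
        match_detections : dict word -> detection in the matched image (assumed)
        similarity       : function(det_a, det_b) -> content similarity (assumed)
        tfidf            : dict word -> tfidf weight over the matched caption set
        """
        score = 0.0
        for word in caption.lower().split():
            score += tfidf.get(word, 0.0)                  # tfidf overlap term
            if word in match_detections and word in query_detections:
                # sum content similarities over all described terms
                score += similarity(query_detections[word], match_detections[word])
            # if the query image has no corresponding detection, nothing is added
        return score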

87 Results & Evaluation: Our content-based captioning method often produces reasonable results (examples are shown in Fig 4). [sent-241, score-0.383]

88 Other captions can be quite poetic (Fig 5) – a picture of a derelict boat captioned “The water the boat was in”, a picture of monstrous tree roots captioned “Walking the dog in the primeval forest”. [sent-254, score-1.199]

89 [Table 1: Automatic Evaluation – BLEU score measured at 1.] As can be seen in Table 1, data set size has a significant effect on BLEU score; more data provides more similar and relevant matched images (and captions). [sent-280, score-0.497]
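
For reference, BLEU measured at 1 reduces to clipped unigram precision with a brevity penalty; the sketch below is a minimal version of that measure against a single reference and is not the exact evaluation script used in the paper.

    # Minimal BLEU-1 sketch: clipped unigram precision times a brevity penalty.
    import math
    from collections import Counter

    def bleu_1(candidate: str, reference: str) -> float:
        cand, ref = candidate.lower().split(), reference.lower().split()
        if not cand:
            return 0.0
        ref_counts = Counter(ref)
        clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
        precision = clipped / len(cand)
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
        return bp * precision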

90 Local content matching also improves BLEU score somewhat over purely global matching. [sent-281, score-0.376]

91 The user must assign the caption to the most relevant image (care is taken to remove biases due to placement). [sent-283, score-0.653]

92 For evaluation we use a query image and caption generated by our method. [sent-284, score-0.818]

93 As a sanity check of our evaluation measure we also evaluate how well a user can discriminate between the original ground truth image that a caption was written about and a random image. [sent-287, score-0.632]

94 We perform this evaluation on 100 images from our web-collection using Amazon’s mechanical turk service, and find that users are able to select the ground truth image 96% of the time. [sent-288, score-0.451]

95 Considering the top retrieved caption produced by our final method – global plus local content matching with a linear SVM classifier – we find that users are able to select the correct image 66. [sent-290, score-1.011]

96 Because the top caption is not always visually relevant to the query image even when the method is capturing some information, we also perform an evaluation considering the top 4 captions produced by our method. [sent-292, score-1.296]

97 This demonstrates the strength of our content based method to produce relevant captions for images. [sent-295, score-0.661]

98 6 Conclusion: We have described an effective caption generation method for general web images. [sent-296, score-0.468]

99 This method relies on collecting and filtering a large data set of images from the internet to produce a novel web-scale captioned photo collection. [sent-297, score-0.65]

100 We present two variations on our approach, one that uses only global image descriptors to compose captions, and one that incorporates estimates of image content for caption generation. [sent-298, score-1.137]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('caption', 0.332), ('captions', 0.309), ('content', 0.27), ('matched', 0.264), ('captioned', 0.261), ('image', 0.239), ('query', 0.217), ('descriptions', 0.195), ('stuff', 0.193), ('images', 0.182), ('bleu', 0.137), ('extractive', 0.127), ('captioning', 0.113), ('photographs', 0.112), ('sec', 0.109), ('object', 0.1), ('dog', 0.099), ('scene', 0.093), ('generation', 0.09), ('photo', 0.087), ('people', 0.086), ('attribute', 0.082), ('df', 0.072), ('detectors', 0.067), ('descriptor', 0.066), ('picture', 0.066), ('sky', 0.063), ('collection', 0.063), ('photos', 0.062), ('boy', 0.062), ('flickr', 0.062), ('summarization', 0.061), ('visually', 0.058), ('global', 0.057), ('oq', 0.057), ('detection', 0.055), ('grass', 0.054), ('objects', 0.053), ('relevant', 0.051), ('document', 0.05), ('tower', 0.05), ('iq', 0.05), ('attributes', 0.05), ('description', 0.049), ('categories', 0.049), ('matching', 0.049), ('person', 0.049), ('action', 0.048), ('internet', 0.047), ('water', 0.047), ('web', 0.046), ('detections', 0.045), ('actions', 0.044), ('scenes', 0.043), ('farhadi', 0.043), ('similarity', 0.043), ('tfidf', 0.042), ('webscale', 0.042), ('im', 0.042), ('text', 0.042), ('cvpr', 0.041), ('pascal', 0.04), ('detector', 0.04), ('visual', 0.039), ('berg', 0.039), ('beach', 0.037), ('girl', 0.037), ('pq', 0.036), ('million', 0.035), ('lq', 0.035), ('bourdev', 0.034), ('maji', 0.034), ('matches', 0.034), ('retrieved', 0.034), ('sq', 0.034), ('collections', 0.033), ('tiny', 0.033), ('descriptive', 0.032), ('sentences', 0.032), ('word', 0.032), ('street', 0.032), ('fig', 0.032), ('user', 0.031), ('parsing', 0.031), ('produce', 0.031), ('sentence', 0.031), ('boat', 0.031), ('evaluation', 0.03), ('al', 0.03), ('top', 0.03), ('cloud', 0.029), ('human', 0.028), ('depictions', 0.028), ('idf', 0.028), ('lijiang', 0.028), ('morocco', 0.028), ('ourika', 0.028), ('pasture', 0.028), ('poetic', 0.028), ('prepositional', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999881 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1

2 0.18068655 154 nips-2011-Learning person-object interactions for action recognition in still images

Author: Vincent Delaitre, Josef Sivic, Ivan Laptev

Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1

3 0.16138963 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

4 0.14881973 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon

Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.

5 0.14258587 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

Author: Bin Zhao, Fei Li, Eric P. Xing

Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1

6 0.13987486 293 nips-2011-Understanding the Intrinsic Memorability of Images

7 0.12512954 231 nips-2011-Randomized Algorithms for Comparison-based Search

8 0.11456724 229 nips-2011-Query-Aware MCMC

9 0.11194799 168 nips-2011-Maximum Margin Multi-Instance Learning

10 0.1082734 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

11 0.10442711 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

12 0.10268914 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

13 0.098508142 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

14 0.0984843 35 nips-2011-An ideal observer model for identifying the reference frame of objects

15 0.097298689 22 nips-2011-Active Ranking using Pairwise Comparisons

16 0.09205737 193 nips-2011-Object Detection with Grammar Models

17 0.089317426 165 nips-2011-Matrix Completion for Multi-label Image Classification

18 0.087298289 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

19 0.08712431 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

20 0.087111302 127 nips-2011-Image Parsing with Stochastic Scene Grammar


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.15), (2, -0.137), (3, 0.235), (4, 0.109), (5, 0.023), (6, 0.03), (7, -0.012), (8, 0.049), (9, 0.092), (10, 0.034), (11, 0.026), (12, -0.037), (13, 0.072), (14, 0.091), (15, -0.047), (16, 0.025), (17, 0.062), (18, -0.048), (19, 0.043), (20, 0.02), (21, -0.004), (22, 0.05), (23, 0.0), (24, 0.006), (25, 0.045), (26, 0.103), (27, 0.023), (28, 0.061), (29, 0.02), (30, -0.003), (31, 0.059), (32, -0.051), (33, 0.009), (34, -0.035), (35, -0.022), (36, 0.058), (37, -0.03), (38, -0.066), (39, -0.005), (40, 0.003), (41, 0.098), (42, 0.012), (43, -0.016), (44, -0.069), (45, 0.005), (46, 0.025), (47, -0.007), (48, 0.049), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97519976 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1

2 0.77221513 154 nips-2011-Learning person-object interactions for action recognition in still images

Author: Vincent Delaitre, Josef Sivic, Ivan Laptev

Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1

3 0.76322138 293 nips-2011-Understanding the Intrinsic Memorability of Images

Author: Phillip Isola, Devi Parikh, Antonio Torralba, Aude Oliva

Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subjects’ contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision. 1

4 0.74617624 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

5 0.72573555 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

Author: Joel Z. Leibo, Jim Mutch, Tomaso Poggio

Abstract: Many studies have uncovered evidence that visual cortex contains specialized regions involved in processing faces but not other object classes. Recent electrophysiology studies of cells in several of these specialized regions revealed that at least some of these regions are organized in a hierarchical manner with viewpointspecific cells projecting to downstream viewpoint-invariant identity-specific cells [1]. A separate computational line of reasoning leads to the claim that some transformations of visual inputs that preserve viewed object identity are class-specific. In particular, the 2D images evoked by a face undergoing a 3D rotation are not produced by the same image transformation (2D) that would produce the images evoked by an object of another class undergoing the same 3D rotation. However, within the class of faces, knowledge of the image transformation evoked by 3D rotation can be reliably transferred from previously viewed faces to help identify a novel face at a new viewpoint. We show, through computational simulations, that an architecture which applies this method of gaining invariance to class-specific transformations is effective when restricted to faces and fails spectacularly when applied to other object classes. We argue here that in order to accomplish viewpoint-invariant face identification from a single example view, visual cortex must separate the circuitry involved in discounting 3D rotations of faces from the generic circuitry involved in processing other objects. The resulting model of the ventral stream of visual cortex is consistent with the recent physiology results showing the hierarchical organization of the face processing network. 1

6 0.71786737 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

7 0.70011848 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

8 0.68989569 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

9 0.66850245 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

10 0.65917408 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

11 0.65278381 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

12 0.63311726 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

13 0.62125725 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

14 0.61705267 35 nips-2011-An ideal observer model for identifying the reference frame of objects

15 0.59873688 231 nips-2011-Randomized Algorithms for Comparison-based Search

16 0.57724458 168 nips-2011-Maximum Margin Multi-Instance Learning

17 0.57270688 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors

18 0.57180077 193 nips-2011-Object Detection with Grammar Models

19 0.56262529 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows

20 0.54636651 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (4, 0.056), (20, 0.058), (26, 0.03), (31, 0.058), (33, 0.078), (43, 0.036), (45, 0.137), (57, 0.037), (59, 0.011), (60, 0.016), (73, 0.259), (74, 0.035), (83, 0.018), (84, 0.021), (99, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82328063 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1

2 0.6379838 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

Author: Ioannis A. Gkioulekas, Todd Zickler

Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1

3 0.59356523 168 nips-2011-Maximum Margin Multi-Instance Learning

Author: Hua Wang, Heng Huang, Farhad Kamangar, Feiping Nie, Chris H. Ding

Abstract: Multi-instance learning (MIL) considers input as bags of instances, in which labels are assigned to the bags. MIL is useful in many real-world applications. For example, in image categorization semantic meanings (labels) of an image mostly arise from its regions (instances) instead of the entire image (bag). Existing MIL methods typically build their models using the Bag-to-Bag (B2B) distance, which are often computationally expensive and may not truly reflect the semantic similarities. To tackle this, in this paper we approach MIL problems from a new perspective using the Class-to-Bag (C2B) distance, which directly assesses the relationships between the classes and the bags. Taking into account the two major challenges in MIL, high heterogeneity on data and weak label association, we propose a novel Maximum Margin Multi-Instance Learning (M3 I) approach to parameterize the C2B distance by introducing the class specific distance metrics and the locally adaptive significance coefficients. We apply our new approach to the automatic image categorization tasks on three (one single-label and two multilabel) benchmark data sets. Extensive experiments have demonstrated promising results that validate the proposed method.

4 0.592444 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

5 0.58848381 165 nips-2011-Matrix Completion for Multi-label Image Classification

Author: Ricardo S. Cabral, Fernando Torre, Joao P. Costeira, Alexandre Bernardino

Abstract: Recently, image categorization has been an active research topic due to the urgent need to retrieve and browse digital images via semantic keywords. This paper formulates image categorization as a multi-label classification problem using recent advances in matrix completion. Under this setting, classification of testing data is posed as a problem of completing unknown label entries on a data matrix that concatenates training and testing features with training labels. We propose two convex algorithms for matrix completion based on a Rank Minimization criterion specifically tailored to visual data, and prove its convergence properties. A major advantage of our approach w.r.t. standard discriminative classification methods for image categorization is its robustness to outliers, background noise and partial occlusions both in the feature and label space. Experimental validation on several datasets shows how our method outperforms state-of-the-art algorithms, while effectively capturing semantic concepts of classes. 1

6 0.58646947 154 nips-2011-Learning person-object interactions for action recognition in still images

7 0.58539706 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

8 0.57733077 227 nips-2011-Pylon Model for Semantic Segmentation

9 0.57648373 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

10 0.57510644 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

11 0.57353991 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

12 0.57020736 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

13 0.56998616 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

14 0.56896728 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

15 0.56892395 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

16 0.56789881 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

17 0.56671345 127 nips-2011-Image Parsing with Stochastic Scene Grammar

18 0.56628317 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

19 0.56593359 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

20 0.56527048 231 nips-2011-Randomized Algorithms for Comparison-based Search