Author: Jie Luo, Barbara Caputo, Vittorio Ferrari

Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images.

1 associate names and action verbs in the captions to the face and body pose of the persons in the images. [sent-8, score-1.485]

2 We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. [sent-9, score-0.638]

3 1 Introduction A huge number of images with accompanying text captions are available on the Internet. [sent-12, score-0.371]

4 The learned models could then be used in a variety of Computer Vision applications, including face recognition, image search engines, and to annotate new images for which no caption is available. [sent-17, score-0.563]

5 Previous work on news items has focused on associating names in the captions to faces in the images [5, 6, 16, 21]. [sent-21, score-0.696]

6 This is difficult due to the correspondence ambiguity problem: multiple persons appear in the image and the caption. [sent-22, score-0.357]

7 Moreover, persons in the image are not always mentioned in the caption, and not all names in the caption appear in the image. [sent-23, score-0.69]

8 As a result, these methods work well for frequently occurring persons (typical for famous people) appearing in datasets with thousands of news items. [sent-25, score-0.274]

9 In this paper we propose to go beyond the above works, by modeling both names and action verbs jointly. [sent-26, score-0.399]

10 These correspond to faces and body poses in the images (figure 3). [sent-27, score-0.281]

11 The connections between the subject (name) and verb in a caption can be found by well established language analysis techniques [1, 8]. [sent-28, score-0.685]

12 We present a new generative model where the observed variables are names and verbs in the caption as well as detected persons in the image. [sent-30, score-0.829]

13 The image-caption correspondences are carried by latent variables, while the visual appearance of face and pose classes corresponding to different names and verbs are model parameters. [sent-31, score-1.035]

14 The face and upper body of the persons in the image are marked by bounding-boxes. [sent-45, score-0.569]

15 We stress a caption might contain names and/or verbs not visible in the image, and vice versa. [sent-46, score-0.562]

16 In our joint model, the correspondence ambiguity is reduced because the face and pose information help each other. [sent-47, score-0.604]

17 This paper is most closely related to works on associating names and faces, which we discussed above. [sent-52, score-0.248]

18 There are also works on associating nouns with image regions [2, 3, 10], starting from images annotated with a list of nouns indicating the objects they contain (typical datasets contain natural scenes and objects such as ‘water’ and ‘tiger’). [sent-53, score-0.255]

19 2 Generative model for faces and body poses The news item corpus used to train our face and pose model consists of still images of person(s) performing some action(s). [sent-58, score-0.867]

20 Each image is annotated with a caption describing “who’s doing what” in the image (figure 1). [sent-59, score-0.344]

21 Some names from the caption might not appear in the image, and vice versa some imaged persons might not be mentioned in the caption. [sent-60, score-0.625]

22 The basic units in our model are persons in the image, consisting of their face and upper body. [sent-61, score-0.45]

23 Our system automatically detects them by bounding-boxes in the image using a face detector [23] and an upper body detector [14]. [sent-62, score-0.34]

24 In the rest of the paper, we say “person” to indicate a detected face and the upper body associated with it (including false positive detections). [sent-63, score-0.336]

25 A face and an upper-body are considered to belong to the same person if the face lies near the center of the upper body bounding-box. [sent-64, score-0.601]
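
The face/upper-body pairing rule can be illustrated with a minimal sketch. The `Box` type, the `same_person` helper, and the tolerance `tol` are hypothetical names and values chosen for illustration; the text only states that the face must lie "near the center" of the upper-body bounding-box and does not give a threshold.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box: top-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float

    def center(self):
        return (self.x + self.w / 2.0, self.y + self.h / 2.0)


def same_person(face: Box, upper_body: Box, tol: float = 0.25) -> bool:
    """Return True if the face center lies near the upper-body box center.

    `tol` is an assumed tolerance, expressed as a fraction of the
    upper-body box width and height; it is not a value from the paper.
    """
    fx, fy = face.center()
    ux, uy = upper_body.center()
    return (abs(fx - ux) <= tol * upper_body.w
            and abs(fy - uy) <= tol * upper_body.h)
```

A detected face that passes this check for some upper-body detection would form one "person" unit; a stricter or looser `tol` trades missed pairings against spurious ones.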

26 For each person, we obtain a pose estimate using [11] (figure 3(right)). [sent-65, score-0.32]

27 Our goals are to: (i) associate the persons in the images to the name-verb pairs in the captions, and (ii) learn visual appearance models corresponding to names and verbs. [sent-67, score-0.639]

28 …, $D^M$} with each document $D^i$ consisting of an image $I^i$ and its caption $C^i$. [sent-76, score-0.283]

29 These captions implicitly provide the labels of the person(s)’ name(s) and pose(s) in the corresponding images. [sent-77, score-0.262]

30 For each caption C i , we consider only the name-verb pairs ni returned by a language parser [1, 8] and ignore other words. [sent-78, score-0.343]

31 We make the same assumptions as for the name-face problem [5, 6, 16, 21] that the labels can only come from the name-verb pairs in the captions or null (for persons not mentioned in the caption). [sent-79, score-0.766]

$Y^i = (y^{i,1}, \ldots, y^{i,P^i})$: $y^{i,p}$ is the assignment of the pth person in the ith image
$A^i = \{a^i_1, \ldots, a^i_{L^i}\}$: set of possible assignments for document i
$L^i$: number of possible assignments for document $D^i$
$a^i_l = \{a^{i,1}_l, \ldots, a^{i,P^i}_l\}$: the lth assignment, where $a^{i,p}_l$ is the label for the pth person
$\Theta = (\theta_{name}, \theta_{verb})$: appearance models for face and pose classes
$V$: number of different verbs
$U$: number of different names
$\theta^k$: set of class representative vectors for class k
$\theta^v_{verb} = \{\mu^{v,1}_{pose}, \ldots, \mu^{v,R^v}_{pose}\}$
$\theta^u_{name} = \{\mu^{u,1}_{face}, \ldots, \mu^{u,R^u}_{face}\}$

36 Table I: The mathematical notation used in the paper. Figure 2: Graphical plate representation of the generative model. [sent-101, score-0.442]

37 Hence, we replace the captions by the sets of possible assignments A = {A1 , . [sent-104, score-0.358]

38 …, $y^{i,P^i}$) be the assignment for the $P^i$ persons in the ith image. [sent-116, score-0.256]

39 Each $y^{i,p} = (y^{i,p}_{face}, y^{i,p}_{pose})$ is a pair of indices defining the assignment of a person’s face to a name and pose to a verb. [sent-117, score-0.836]

40 N/V is the number of different names/verbs over all the captions and null represents unknown names/verbs and false positive person detections. [sent-125, score-0.623]

41 Assuming independence between multiple persons in an image, the likelihood of an image can be expressed as the product over the likelihood of each person: $P(I^i \mid Y^i, \Theta) = \prod_{I^{i,p} \in I^i} P(I^{i,p} \mid y^{i,p}, \Theta)$ (2), where $y^{i,p}$ defines the name-verb indices of the pth person in the image. [sent-134, score-0.450]
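
Because the image likelihood is a product over persons, it is usually convenient to work with its logarithm, which turns the product into a sum. The sketch below assumes the per-person likelihoods have already been computed by some appearance model; the function name `image_log_likelihood` is illustrative and not from the paper.

```python
import math


def image_log_likelihood(person_likelihoods):
    """Log of the product of per-person likelihoods for one image.

    `person_likelihoods` holds one strictly positive value per detected
    person, each standing for the likelihood of that person under the
    name-verb indices assigned to it.
    """
    return sum(math.log(p) for p in person_likelihoods)
```

Summing logs avoids numerical underflow when an image contains many persons with small individual likelihoods.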

42 A person $I^{i,p} = (I^{i,p}_{face}, I^{i,p}_{pose})$ is represented by the appearance of her face $I^{i,p}_{face}$ and pose $I^{i,p}_{pose}$. [sent-135, score-0.743]

43 …, $\theta^V_{verb}$, $\beta_{verb}$), where $\theta^v_{verb}$ is a set of representative vectors modeling the variability within the pose class corresponding to a verb v. [sent-140, score-0.801]

44 For example, the verb “serve” in tennis could correspond to different poses such as holding the ball on the racket, tossing the ball and hitting it. [sent-141, score-0.542]

45 Analogously, $\theta^u_{name}$ models the variability within the face class corresponding to a name u. [sent-142, score-0.396]

46 2 Face and pose descriptors and similarity measures After detecting faces from the images with the multi-view algorithm [23], we use [12] to detect nine distinctive feature points within the face bounding box (figure 3(left)). [sent-144, score-0.711]

47 A pose E consists of a distribution over the position (x, y and orientation) for each of 6 body parts (head, torso, upper/lower … Figure 3: Example images with facial features and pose estimates superimposed. [sent-149, score-0.760]

48 Left: facial features (left and right corners of each eye, two nostrils, tip of the nose, and the left and right corners of the mouth) located using [12] in the detected face bounding-box. [sent-150, score-0.259]

49 The pose estimator factors out variations due to clothing and background, so E conveys purely spatial arrangements of body parts. [sent-156, score-0.374]

50 We derive three relatively low-dimensional pose descriptors from E, as proposed in [13]. [sent-157, score-0.349]

51 These descriptors represent pose in different ways, such as the relative position between pairs of body parts, and part-specific soft-segmentations of the image (i. [sent-158, score-0.51]

52 3 Appearance model The appearance model for a pose class (corresponding to a verb) is defined as: $P(I^{i,p}_{pose} \mid y^{i,p}_{pose}, \theta_{verb}) = \sum_{k \in \{1, \ldots, V, null\}} \delta(y^{i,p}_{pose}, k) \cdot P(I^{i,p}_{pose} \mid \theta^k_{verb})$ (4) [sent-165, score-0.417]

53 where $\theta^k_{verb}$ are the parameters of the kth pose class (or $\beta_{verb}$ if k = null). [sent-168, score-0.320]

54 We only explain here the model for a pose class, as the face model is derived analogously. [sent-170, score-0.541]

55 Some previous works on names-faces used a Gaussian mixture model [6, 21]: each name is associated with a Gaussian density, plus an additional Gaussian to model the null class. [sent-172, score-0.429]

56 Problems such as face and pose recognition are particularly challenging because they involve complex non-Gaussian multimodal distributions. [sent-175, score-0.566]

57 Figure 3(right) shows a few examples of the variance within the pose class for a verb. [sent-176, score-0.32]

58 Moreover, we cannot easily employ existing pose similarity measures [13]. [sent-177, score-0.32]

59 …, $\mu^{k,R^k}_{pose}$}, where $R^k$ is the number of representative poses for verb k. [sent-181, score-0.887]

60 The scalar $\beta_{verb}$ represents the null model, thus poses assigned to null have likelihood $\frac{1}{Z_{\theta}} e^{-\beta_{verb}}$. [sent-183, score-0.552]

61 It is important to have this null model, as some detected persons might not correspond to any verb in the caption or they might be false detections. [sent-184, score-1.626]
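
To make the role of the representative vectors and the null model concrete, here is a rough sketch. It assumes, purely for illustration, that the unnormalised likelihood of a pose under class k is exp(-d), where d is the distance to the closest representative pose of that class; the text above only gives the null likelihood, proportional to exp(-beta), so the class form, the Euclidean distance, and all function names are assumptions rather than the paper's definitions.

```python
import math


def euclidean(a, b):
    """Stand-in distance; the paper uses dedicated pose and face measures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_score(descriptor, representatives, distance=euclidean):
    """Assumed unnormalised class likelihood: exp(-distance to nearest representative)."""
    d = min(distance(descriptor, mu) for mu in representatives)
    return math.exp(-d)


def best_label(descriptor, classes, beta, distance=euclidean):
    """Pick the class (or None for null) with the highest unnormalised score.

    `classes` maps a verb to its list of representative vectors.
    The null class scores exp(-beta); the shared normaliser 1/Z
    cancels in the comparison and is therefore omitted.
    """
    scores = {k: class_score(descriptor, reps, distance)
              for k, reps in classes.items()}
    scores[None] = math.exp(-beta)
    return max(scores, key=scores.get)
```

Under this reading, beta acts as a distance threshold: a detection farther than beta from every representative of every class is labelled null.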

62 4 Name-verb assignments The name-verb pairs $n^i$ for a document are observed in its caption $C^i$. [sent-190, score-0.356]

63 …, $a^i_{L^i}$} of name-verb pairs to persons in the image. [sent-194, score-0.379]

64 The number of possible assignments $L^i$ depends both on the number of persons and of name-verb pairs. [sent-195, score-0.325]

65 Therefore, given a document with $P^i$ persons and $W^i$ name-verb pairs, the number of possible assignments is $L^i = \sum_{j=0}^{\min(P^i, W^i)} \binom{P^i}{j} \cdot \binom{W^i}{j}$, where j is the number of persons assigned to a name-verb pair instead of null. [sent-198, score-0.581]
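
As a quick check on how fast the assignment set grows, the count can be evaluated directly. This sketch reads the count as a sum over j (the number of persons given a name-verb pair rather than null) of the product of two binomial coefficients, "P choose j" times "W choose j"; that reading is reconstructed from the extracted text, and the function name is illustrative.

```python
from math import comb


def num_assignments(num_persons, num_pairs):
    """Sum over j of C(num_persons, j) * C(num_pairs, j).

    j ranges from 0 (everyone assigned to null) up to
    min(num_persons, num_pairs).
    """
    upper = min(num_persons, num_pairs)
    return sum(comb(num_persons, j) * comb(num_pairs, j)
               for j in range(upper + 1))
```

For example, `num_assignments(2, 2)` evaluates to 6 and `num_assignments(1, 3)` to 4, so even small captions yield several candidate assignments per image.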

66 The reported results are based on automatically parsed captions for learning. [sent-274, score-0.282]

67 For each different name/verb, we select all captions containing only this name/verb. [sent-280, score-0.262]

68 If a name/verb only appears in captions with multiple names/verbs or if the corresponding images always contain multiple persons (e. [sent-283, score-0.557]

69 Each point $I^{i,p}$ in a cluster is given a weight $w^{i,p}_{Y^i} = \frac{P(Y^i \mid I^{i,p}, A^i, \Theta)}{\sum_{Y^j \in A^i} P(Y^j \mid I^{i,p}, A^i, \Theta)}$ (11), which represents the likelihood that $I^{i,p}_{face}$ and $I^{i,p}_{pose}$ belong to the name and verb defined by $Y^i$. [sent-298, score-0.631]
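
The weight is a normalisation of one assignment's score by the total over all candidate assignments of the same document. A minimal sketch, assuming the per-assignment scores are already available as non-negative numbers (the function name and the uniform fallback for an all-zero input are illustrative choices, not from the paper):

```python
def assignment_weights(scores):
    """Normalise per-assignment scores so they sum to one.

    `scores` holds one non-negative value per candidate assignment in
    the document's assignment set. If every score is zero, fall back to
    uniform weights (an assumption made to keep the sketch well defined).
    """
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]
```

With more candidate assignments competing for the same total mass, each individual weight tends to shrink, which matches the observation that images with many detections contribute less to the cluster centers.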

70 Therefore, faces and poses from images with many detections have lower weights and contribute less to the cluster centers, reflecting the larger uncertainty in their assignments. [sent-299, score-0.249]

71 Faces often occupy most of the image so the body pose is not visible. [sent-302, score-0.439]

72 Second, the captions frequently describe the event at an abstract level, rather than using a verb to describe the actions of the persons in the image (compare figure 1 to the figures in [6, 16]). [sent-303, score-1.012]

73 Therefore, we collected a new dataset by querying Google-images using a combination of names and verbs (from sports and social interactions), corresponding to distinct upper body poses. [sent-304, score-0.425]

74 Our dataset contains 1610 images, each with at least one person whose face occupies less than 5% of the image, and with the accompanying snippet of text returned by Google-images. [sent-306, score-0.369]

78 Figure 5: Examples of when modeling pose improves the results at learning time. [sent-353, score-0.320]

79 Below the images we report the name-verb pairs (C) from the caption as returned by the automatic parser and compare the association recovered by a model using only faces (F) and using both faces and poses (FP). [sent-354, score-0.607]

80 The assigned names (left to right) correspond to the detected face bounding-boxes (left to right). [sent-355, score-0.464]

81 Figure 6: Recognition results on images without text captions (using models learned from automatically parsed captions). [sent-362, score-0.369]

82 Left: face annotation accuracy for different models and scenarios (see main text). Right: a few examples of the labels predicted by the joint face and pose model (without using captions). [sent-363, score-0.806]

83 extend these snippets into realistic captions when necessary, with varied long sentences, mentioning the action of the persons in the image as well as names/verbs not appearing in the image (as ‘noise’, figure 1). [sent-364, score-0.649]

84 Moreover, they also annotated the ground-truth name-verb pairs mentioned in the captions as well as the location of the target persons in the images, enabling quantitative evaluation of the results. [sent-365, score-0.556]

85 In our experiments we only consider names and verbs occurring in at least 3 captions for a name, and 20 captions for a verb. [sent-367, score-0.895]

86 This leaves 69 names corresponding to 69 face classes and 20 verbs corresponding to 20 pose classes. [sent-368, score-0.912]

87 We used an open source Named Entity recognizer [1] to detect names in the captions and a language parser [8] to find name-verb pairs (or name-null if the language parser could not find a verb associated with a name). [sent-369, score-1.185]

88 Using simple stemming rules, different forms of the same verb (varying in tense or in the possessive adjective used) were merged together. [sent-370, score-0.479]

89 For instance “shake their hands”, “is shaking hands” and “shakes hands” all correspond to the action verb “shake hands”. [sent-371, score-0.484]
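
The kind of merging described here can be sketched with a few hand-written rules. The word lists, the tiny verb lexicon `KNOWN_VERBS`, and both function names are hypothetical and only cover the example phrases; the paper does not specify its actual stemming rules.

```python
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being"}
POSSESSIVES = {"his", "her", "their", "its", "my", "your", "our"}
# Hypothetical lexicon of base verb forms used to validate stems.
KNOWN_VERBS = {"shake", "wave", "hold", "embrace"}


def stem_verb(word):
    """Map an inflected verb to a base form found in KNOWN_VERBS, if any."""
    candidates = [word]
    if word.endswith("ing"):
        candidates += [word[:-3], word[:-3] + "e"]
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    for candidate in candidates:
        if candidate in KNOWN_VERBS:
            return candidate
    return word  # unknown verb: leave unchanged


def normalize_action(phrase):
    """Drop auxiliaries and possessive adjectives, then stem the leading verb."""
    skip = AUXILIARIES | POSSESSIVES
    tokens = [t for t in phrase.lower().split() if t not in skip]
    if not tokens:
        return ""
    return " ".join([stem_verb(tokens[0])] + tokens[1:])
```

All three example phrases from the text collapse to the same string "shake hands" under these rules, so they would be counted as one pose class.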

90 By discarding infrequent names and verbs as explained above, we retain 85 names and 20 verbs to be learned by our model (recall that some of these are false positives rather than actual person names and verbs). [sent-375, score-1.075]

91 We compare experimentally our face and pose model to stripped-down versions using only face or pose information. [sent-377, score-1.082]

92 The accuracy is defined as the percentage of correct assignments over all detected persons, including assignments to null, as in [5, 16]. [sent-384, score-0.23]
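
The accuracy measure can be written down in a few lines. In this sketch each detected person carries either a (name, verb) label or `None` for null, and null assignments count toward the total like any other label; the function name and the label encoding are illustrative assumptions.

```python
def assignment_accuracy(predicted, ground_truth):
    """Percentage of detected persons whose predicted label is correct.

    Both arguments are equal-length lists with one entry per detected
    person; an entry is a (name, verb) tuple or None for null.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("one label per detected person is required")
    if not predicted:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(predicted)
```

Counting null assignments means a model is also rewarded for correctly rejecting false detections and persons not mentioned in the caption.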

93 As the figure shows, our joint ‘face and pose’ model outperforms both models using face or pose alone in all setups. [sent-385, score-0.541]

94 As a second point, our model with face alone also outperforms the baseline approach using Gaussian mixture appearance models (e. [sent-389, score-0.318]

95 Figure 5 shows a few examples of how including pose improves the learning results and solves some of the correspondence ambiguities. [sent-392, score-0.362]

96 We collected a new set of 100 images and captions from Google-images using five keywords based on names and verbs from the training dataset. [sent-399, score-0.72]

97 Here we run inference on the model, recovering the best assignment Y from the set of possible assignments generated from the captions; (b) the same test images are used but the captions are not given, so the problem degenerates to a standard face and pose recognition task. [sent-401, score-1.017]

98 Figure 6(left) reports face annotation accuracy for three methods using captions (scenario (a)): (⋄) a baseline which randomly assigns a name (or null) from the caption to each face in the image; (x) our face and pose model; ( ) our model using only faces. [sent-402, score-1.655]

99 On scenario (a) all models outperform the baseline, and our joint face and pose model improves significantly on the face-only model for all keywords, especially when there are multiple persons in the image. [sent-404, score-0.77]

100 We present an approach for the joint modeling of faces and poses in images and their association to names and action verbs in accompanying text captions. [sent-406, score-0.669]

