cvpr cvpr2013 cvpr2013-389 knowledge-graph by maker-knowledge-mining

389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data


Source: pdf

Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen

Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We address the problem of person identification in TV series. [sent-3, score-0.24]

2 We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. [sent-4, score-0.289]

3 We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. [sent-5, score-0.368]

4 The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. [sent-6, score-0.547]

5 We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance. [sent-7, score-0.375]

6 Introduction Automatic identification of characters in TV series and movies is both an important and challenging problem. [sent-9, score-0.486]

7 Recently, multimedia content providers have started to offer information on cast and characters for TV series and movies during playback, presumably via a combination of face tracking, automatic identification and crowd sourcing. [sent-11, score-0.83]

8 In this paper, we approach the problem of naming characters in TV series as a transductive learning problem with constraints. [sent-12, score-0.463]

9 Our goal is to automatically identify all characters by training discriminative multi-class classifiers from (i) weakly-supervised track labels, (ii) additional unlabeled data and (iii) automatically generated constraints between tracks. [sent-13, score-0.676]

10 For example, the identification can be performed offline beforehand if the goal is to display additional information on characters during the playback of a TV episode. [sent-26, score-0.34]

11 We apply the proposed learning framework to the task of character naming in TV series (Sec. [sent-31, score-0.34]

12 Related work Automatic naming of characters in TV series has received increasing attention in recent years. [sent-38, score-0.463]

13 While most work is focused on naming face tracks [5, 10, 13, 14], the problem has recently been extended to person tracks both to increase coverage and performance [15]. [sent-39, score-1.112]

14 [5] proposed an automatic method to weakly label some track identities by detecting speakers, and aligning subtitles and transcripts to obtain identities. [sent-41, score-0.543]

15 We use a similar method in this work to automatically obtain labels for those tracks which can be detected as speaking. [sent-43, score-0.381]

16 , usually only about 20–30% of the tracks can be assigned a name). [sent-46, score-0.308]

17 In order to increase the coverage of the weak labeling, one can treat the names from transcripts as ambiguous labels, i. [sent-47, score-0.244]

18 , assign multiple possible names to a face track when the speaking face cannot be reliably detected (e. [sent-49, score-0.66]

19 [10] further take into account unlabeled data with a cross entropy loss between the expected prior distribution of identities and the model. [sent-54, score-0.388]

20 [1] make use of must-link and cannot-link constraints in order to learn a face- and cast-specific metric in order to improve face clustering and identification. [sent-56, score-0.27]

21 [16] identify persons in a camera network and integrate must-link and cannot-link constraints in an empirical loss in their learning framework. [sent-60, score-0.21]

22 In this work, we bring together learning from weakly labeled data, unlabeled data and constraints in a common framework. [sent-69, score-0.317]

23 Semi-supervised learning with constraints Let $X_l = \{(x_i, y_i)\}_{i=1}^{N}$ denote training data $x_i$ with associated labels $y_i \in Y$. [sent-71, score-0.196]

24 The problem of character naming is inherently a multi-class problem. [sent-72, score-0.249]

25 In this paper, we propose a combined loss function that takes into account (i) labeled data $X_l$, (ii) unlabeled data $X_u$ and (iii) constraints $C$: $L(X; \theta) = L(y_l, y_c; X_l, X_u, C, \theta) = L_l(y_l; X_l, \theta) + L_u(X_u, \theta) + L_c(y_c; C, \theta)$. (3) [sent-85, score-0.384]
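
The three-term structure of Eq. 3 can be illustrated with a small sketch. It is not the authors' implementation: the linear model, the function names `posteriors` and `combined_loss`, and in particular the form of the constraint term are assumptions made for this example, since the summary does not spell out how Lc is defined.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(theta, x):
    """Class posteriors of a linear multinomial logistic regression.

    theta holds one weight vector per class; x is a feature vector.
    """
    return softmax([sum(w * v for w, v in zip(row, x)) for row in theta])

def combined_loss(theta, labeled, unlabeled, cannot_link):
    """L = L_l + L_u + L_c, mirroring the structure of Eq. 3."""
    # L_l: negative log-likelihood of the (weakly) labeled samples.
    loss_l = -sum(math.log(posteriors(theta, x)[y]) for x, y in labeled)
    # L_u: entropy of the posteriors on the unlabeled samples.
    loss_u = 0.0
    for x in unlabeled:
        p = posteriors(theta, x)
        loss_u -= sum(pk * math.log(pk) for pk in p if pk > 0.0)
    # L_c (assumed form): penalise cannot-link pairs that are likely to
    # share a class, using -log(1 - sum_k p_i(k) * p_j(k)).
    loss_c = 0.0
    for xi, xj in cannot_link:
        pi, pj = posteriors(theta, xi), posteriors(theta, xj)
        same = sum(a * b for a, b in zip(pi, pj))
        loss_c -= math.log(max(1.0 - same, 1e-12))
    return loss_l + loss_u + loss_c
```

With all-zero weights and two classes, every posterior is (0.5, 0.5), so one labeled sample, one unlabeled sample and one cannot-link pair each contribute ln 2 to the total.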

26 Figure 2 (panels: supervised; semi-supervised; semi-supervised + constraints; unsupervised + constraints): Visualization of the effect of the different parts of the loss function on a toy example. [sent-109, score-0.347]

27 The denoted error is the joint error on labeled and unlabeled data. [sent-110, score-0.208]

28 (b) Additionally taking unlabeled data (black ×) into account fits the ... [sent-114, score-0.194]

29 Entropy loss for unlabeled data While the unlabeled data Xu does not carry information about its class membership, it can be informative about the distribution of data points in regions without labels. [sent-142, score-0.421]

30 Instead of placing decision boundaries as far as possible between labeled samples, we desire that the decision boundaries also respect the distribution of unlabeled data. [sent-143, score-0.338]

31 A common way to achieve this is to include an entropy term into the loss function in order to encourage uniformly distributed class membership across the unlabeled data [10, 17]. [sent-146, score-0.356]

32 Instead, we use the entropy function as a penalty on having the decision boundaries close to unlabeled data points (see Fig. [sent-147, score-0.293]
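
A toy example may help show why an entropy term pushes decision boundaries away from unlabeled points. The sketch below assumes a one-dimensional, two-class logistic model; the function names and the data are invented for illustration and do not come from the paper.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a two-class posterior (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def posterior(w, b, x):
    """Logistic posterior for a 1-D feature x with weight w and bias b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def entropy_loss(w, b, unlabeled):
    """Sum of posterior entropies over the unlabeled points."""
    return sum(binary_entropy(posterior(w, b, x)) for x in unlabeled)

# Two unlabeled clusters around x = -2 and x = +2 (made-up data).
unlabeled = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]

# Boundary at x = 0 (between the clusters) versus x = 2 (through a cluster).
loss_between = entropy_loss(w=3.0, b=0.0, unlabeled=unlabeled)
loss_through = entropy_loss(w=3.0, b=-6.0, unlabeled=unlabeled)
```

The boundary that cuts through a cluster leaves several points with posteriors near 0.5, so its entropy loss is clearly larger than that of the boundary placed in the gap between the clusters.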

33 , two tracks which temporally overlap cannot belong to the same person, and can be automatically generated without manual effort. [sent-171, score-0.369]

34 Automatic character naming We apply the proposed learning framework to the task of character naming in videos. [sent-201, score-0.498]

35 We consider only face tracks for identification similar to [5, 10, 14], in contrast to our previous work [15] which builds on person tracks. [sent-202, score-0.737]

36 However, since [15] relies on identities from face recognition as input, we can directly improve those results by providing improved facial identities. [sent-203, score-0.254]

37 Pre-processing Face Tracking For tracking faces, we employ a detector-based face tracker based on the Modified Census Transform [6]. [sent-209, score-0.189]

38 Our tracker is able to track faces over a wide range of pose angles (including profile faces and in-plane rotations of up to 45 degrees), which results in a large number of tracks in non-frontal poses. [sent-210, score-0.514]

39 Following [5, 10, 14], we align subtitles with transcripts from the web in order to combine the timing component of subtitles with the identities from the transcripts. [sent-212, score-0.504]
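
The idea of combining subtitle timing with transcript identities can be sketched as a word-level alignment. The example below uses Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment method the cited works use; the function names, the data layout and the dialogue lines are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

PUNCT = ".,!?\"'"

def tokenize(text):
    """Lower-case words with surrounding punctuation removed."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def align(subtitles, transcript):
    """Attach a speaker name to each subtitle.

    subtitles: list of (start_sec, end_sec, text), carrying the timing.
    transcript: list of (speaker, text), carrying the identities.
    Returns a list of (start_sec, end_sec, speaker or None).
    """
    sub_words, sub_ids = [], []
    for idx, (_, _, text) in enumerate(subtitles):
        for word in tokenize(text):
            sub_words.append(word)
            sub_ids.append(idx)
    tr_words, tr_speakers = [], []
    for speaker, text in transcript:
        for word in tokenize(text):
            tr_words.append(word)
            tr_speakers.append(speaker)
    # Each matched word votes for the speaker of its transcript line.
    votes = [dict() for _ in subtitles]
    matcher = SequenceMatcher(None, sub_words, tr_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            idx = sub_ids[block.a + k]
            speaker = tr_speakers[block.b + k]
            votes[idx][speaker] = votes[idx].get(speaker, 0) + 1
    return [(start, end, max(v, key=v.get) if v else None)
            for (start, end, _), v in zip(subtitles, votes)]

# Hypothetical dialogue, for illustration only.
subtitles = [(1.0, 2.5, "I need your help."), (3.0, 4.0, "With what?")]
transcript = [("Leonard", "I need your help."), ("Sheldon", "With what?")]
named = align(subtitles, transcript)
```

Subtitles whose words find no match in the transcript keep `None` as speaker, which corresponds to leaving the affected tracks unlabeled.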

40 Using the 9-point facial feature model from [5], we estimate the locations of eyes, nose and mouth in each face track. [sent-213, score-0.229]

41 Based on the estimated mouth position, we determine for each face track whether the person is speaking or not: we follow [5, 14] and compute for each frame the minimum nearest neighbor distance of the (gray scale, histogram equalized) mouth region to the previous frame. [sent-214, score-0.616]

42 By thresholding the distances, we determine whether a person is speaking or not. [sent-215, score-0.245]
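
The speaking test described above can be sketched as follows. This is a simplified illustration, not the procedure of [5, 14]: frames are plain nested lists of gray values, histogram equalization is omitted, and the search radius, the track-level rule (mean distance above a threshold) and all function names are assumptions.

```python
def patch_distance(a, b):
    """Sum of squared differences between two equally sized gray patches."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def crop(frame, top, left, height, width):
    """Cut a height x width patch out of a frame given as a list of rows."""
    return [row[left:left + width] for row in frame[top:top + height]]

def mouth_motion(prev_frame, frame, top, left, height, width, radius=1):
    """Minimum distance of the current mouth patch to nearby patches in
    the previous frame (search window of +/- radius pixels)."""
    current = crop(frame, top, left, height, width)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            t, l0 = top + dy, left + dx
            inside = (t >= 0 and l0 >= 0
                      and t + height <= len(prev_frame)
                      and l0 + width <= len(prev_frame[0]))
            if not inside:
                continue
            d = patch_distance(current, crop(prev_frame, t, l0, height, width))
            best = d if best is None else min(best, d)
    return best

def is_speaking(frames, box, threshold):
    """Assumed track-level rule: mean frame-to-frame mouth distance
    above the threshold. Needs at least two frames."""
    top, left, height, width = box
    dists = [mouth_motion(prev, cur, top, left, height, width)
             for prev, cur in zip(frames, frames[1:])]
    return sum(dists) / len(dists) > threshold
```

Taking the minimum over a small search window makes the measure tolerant to slight head motion, so that only changes in mouth appearance produce large distances.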

43 First, the face is aligned (warped and cropped) to a size of 48 × 64 pixels. [sent-217, score-0.189]

44 Training Given the face tracks, speaking information and subtitles associated with names, we obtain three different types of data from the given videos. [sent-223, score-0.466]

45 “#speaking tracks” denotes the number of tracks which were determined as speaking, which is usually less than 30% of the tracks (not all characters speak at the same time). [sent-228, score-0.86]

46 On average, we associate a name to about 22% of the tracks with a precision of 87%, which is similar to the reported performances of [5, 10, 14]. [sent-229, score-0.37]

47 Unlabeled data With only 22% of the face tracks labeled by the previous method, we are left with around 78% of the data that has no labels associated with it. [sent-232, score-0.583]

48 We take all features of the unlabeled tracks as Xu. [sent-233, score-0.471]

49 Constraints We can automatically deduce constraints between data points from face tracks. [sent-234, score-0.302]

50 Negative constraints are formed when two tracks overlap temporally, based on the assumption that the same person cannot appear twice at the same time. [sent-235, score-0.49]

51 This poses a problem if there actually are two tracks of the same (or very similar looking) person at the same time. [sent-237, score-0.409]
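
Generating negative constraints from temporal overlap requires only the frame range of each track. The sketch below assumes that tracks are given as inclusive `(first_frame, last_frame)` tuples; the representation and the function names are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def overlaps(a, b):
    """True if two inclusive (first_frame, last_frame) intervals share a frame."""
    return a[0] <= b[1] and b[0] <= a[1]

def cannot_link_pairs(tracks):
    """Index pairs of tracks that co-occur in at least one frame."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(tracks), 2)
            if overlaps(a, b)]

# Four hypothetical tracks: 0 and 1 overlap, as do 2 and 3.
tracks = [(0, 50), (30, 80), (90, 120), (100, 110)]
pairs = cannot_link_pairs(tracks)
```

Each returned pair would contribute one term to the constraint loss. Pairs that violate the uniqueness assumption, such as two simultaneous tracks of the same person, would still be emitted by this simple rule.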

52 Table 1: Statistics across all videos in the data set showing the number of characters, face tracks and speaker assignment performance. [sent-249, score-0.672]

53 Training We first collect training data from all available episodes, and train one joint multi-class classifier from supervised data, unsupervised data and constraints by minimization of the joint loss function (Eq. [sent-254, score-0.26]

54 Taking into account all available training data from multiple episodes at the same time is unfortunately computationally infeasible, especially for the kernelized version of the multinomial logistic regression. [sent-256, score-0.461]

55 Identification For determining the identity $y_t$ of a face track $t$ with features $\{x_i^{(t)}\}_{i=1}^{|t|}$, we apply the learned classifier frame-wise according to Eq. [sent-263, score-0.343]

56 6 and compute a class score for the track having identity $k$ as $p_t(k) = \frac{1}{|t|} \sum_{i=1}^{|t|} p(k \mid x_i^{(t)})$. [sent-264, score-0.187]

57 (16) The track is assigned the identity of the most likely class over all frames: $y_t = \arg\max_k p_t(k)$. [sent-268, score-0.187]
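
The track-level decision can be sketched in a few lines: average the frame-wise class posteriors over the track and take the arg max. The function names and the posterior values below are made up for illustration; the per-frame posteriors would come from the trained classifier.

```python
def track_scores(frame_posteriors):
    """Average the per-frame class posteriors over all frames of a track."""
    n = len(frame_posteriors)
    num_classes = len(frame_posteriors[0])
    return [sum(p[k] for p in frame_posteriors) / n
            for k in range(num_classes)]

def identify(frame_posteriors, names):
    """Return the most likely identity of a track and its averaged score."""
    scores = track_scores(frame_posteriors)
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best], scores[best]

# Hypothetical posteriors for a three-frame track.
names = ["Leonard", "Sheldon", "unknown"]
frame_posteriors = [[0.6, 0.3, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.3, 0.6, 0.1]]
identity, score = identify(frame_posteriors, names)
```

Although the first frame favors a different class, averaging over the whole track lets the two consistent frames decide, which is the motivation for scoring tracks rather than single frames.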

58 Assignment to “unknown” Usually some unknowns have small speaking roles, and therefore we can automatically collect some training samples for them. [sent-270, score-0.293]

59 We model unknown characters as one joint class in the model, i. [sent-271, score-0.276]

60 Thus, no special handling for the unknown class is required: a new track is assigned the “unknown” identity, when it is the most likely class according to Eq. [sent-274, score-0.21]

61 Data set and experimental setup Our data set4 consists of 12 full episodes from two TV series. [sent-279, score-0.284]

62 We select episodes 1–6 from season 1 of The Big Bang Theory (BBT-1 to BBT-6) (as used in [15]), and episodes 1–6 from season 5 of Buffy the Vampire Slayer (BF-1 to BF-6) (as used in [5, 10, 14]). [sent-280, score-0.666]

63 It includes many full-view shots which contain multiple people at a time; however, the faces are rather small (the average face size is around 75px). [sent-283, score-0.345]

64 On the other hand, Buffy has an average length of ∼40 minutes per episode, with a main cast size around 12, while in specific episodes there are up to 18 important characters. [sent-284, score-0.284]

65 However, it also contains a sizable number of face close-up shots (the average face size is around 116px). [sent-286, score-0.482]

66 Buffy episodes contain on average less than double the amount of face tracks compared to BBT due to the above mentioned higher number of close-up shots in Buffy. [sent-288, score-0.885]

67 Speaking-face detection and naming performs equally well on both series, with on average around 22% recall (of all face tracks) and around 87% precision. [sent-289, score-0.394]

68 Table 2 shows the number of face tracks for each identity accumulated over the six episodes of BBT. [sent-290, score-0.833]

69 The precision and recall of the speaking-face naming from subtitles and transcripts reveal that there is a large variation in available training data across the main cast of Leonard, Sheldon, Penny, Howard and Raj. [sent-291, score-0.578]

70 Table 2: The cast list of BBT, the face tracks across all episodes, and the performance of tagging speaking face tracks automatically. [sent-313, score-1.174]

71 For example, in BBT there are four minor named characters with less than 35 tracks. [sent-316, score-0.23]

72 We require all characters to be identified correctly, even when the automatic speaker assignment does not provide any training data for them. [sent-322, score-0.446]

73 Baseline results In order to establish a baseline, and also compare with previous approaches, we use the automatically generated weak face labels data to train different supervised-only classifiers. [sent-329, score-0.262]

74 Figure 3: Confusion matrix over all 6 episodes of BBT for MLR + Lu + Lc. [sent-335, score-0.284]

75 For Doug and Summer, the automatic labeling did not find any tracks for training (c. [sent-336, score-0.381]

76 Note that in [15], where also SVMs are used, face labels were manually supplied, whereas we obtain them automatically from the transcripts. [sent-342, score-0.262]

77 When using our SVM results as input to [15] (“SVM+MRF” in Table 3), we obtain a significant improvement to about 82% accuracy in face recognition. [sent-343, score-0.189]

78 Since we do not have person tracks for Buffy, we perform this evaluation only for BBT. [sent-344, score-0.409]

79 SS+Constraints MLR We evaluate our method starting with the supervised loss only and then add the other loss terms for incremental improvement. [sent-345, score-0.243]

80 The big drop in accuracy in BBT-6 can be explained by the large number of unknowns present in that episode (195 tracks, see Table 1), which are harder to identify because there is usually no training data for them. [sent-355, score-0.338]

81 Figure 3 shows the confusion matrix over all 6 episodes of BBT, which confirms the difficulty in identifying unknowns. [sent-357, score-0.284]

82 The first line shows the accuracy that could be achieved by assigning each track the most frequently appearing person in the series (Leonard for BBT, and Buffy for Buffy). [sent-403, score-0.294]

83 MLR denotes the basic supervised multinomial logistic regression classifier, and Lu and Lc denote the additionally incorporated loss terms. [sent-406, score-0.327]

84 The importance of constraints in BBT can be explained from the fact that BBT contains many shots with multiple faces, thus allowing constraints such as uniqueness to be useful. [sent-408, score-0.334]

85 On the other hand, Buffy favors close-up face shots, which also results in much fewer and less diverse constraints. [sent-409, score-0.189]

86 The lack of influence of unlabeled data in BBT can be explained by the relatively small cast compared to Buffy, while at the same time having many training samples for each of the main characters. [sent-410, score-0.261]

87 Finally, if we use the face identification results from our best-performing method as input to the clothing-based MRF model of [15], we can further increase the performance to 83. [sent-411, score-0.328]

88 Failure analysis We already identified the naming of unknowns as one of the error sources (see also Fig. [sent-413, score-0.285]

89 We analyze the identification accuracy depending on the mean pan-angle of the face tracks (see Fig. [sent-417, score-0.636]

90 Pose independent face recognition has been an active research area for many years, and a more robust feature should directly have an impact on our recognition performance. [sent-421, score-0.189]

91 This can be explained by a similar drop in speaker assignment recall for those pose angles (see Fig. [sent-423, score-0.24]

92 Similar to the dependency on the average pan angle, there is a dependency on track length and average face size. [sent-425, score-0.383]

93 We observe that the identification accuracy decreases for shorter tracks or tracks with a small average face size. [sent-426, score-0.944]

94 Finally, for some minor characters we are unable to find any speaking tracks, e. [sent-427, score-0.345]

95 Therefore, we are unable to correctly identify any face track that belongs to these characters (see Fig. [sent-431, score-0.526]

96 Conclusion In this paper, we address the problem of person identification in multimedia data. [sent-436, score-0.286]

97 We propose to use a unified learning framework combining both labeled and unlabeled data, along with their constraints in a principled manner, and apply it to train multinomial logistic regression classifiers. [sent-437, score-0.468]

98 The methods are tested on six episodes each of two TV series – The Big Bang Theory and Buffy the Vampire Slayer – and we obtain state-of-the-art results for person identification. [sent-439, score-0.476]

99 Unsupervised metric learning for face identification in TV video. [sent-447, score-0.328]

100 One can see that our system not only correctly identifies unlabeled characters, but also is able to correct wrong speaking labels. [sent-468, score-0.307]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bbt', 0.367), ('tracks', 0.308), ('episodes', 0.284), ('buffy', 0.21), ('characters', 0.201), ('mlr', 0.195), ('face', 0.189), ('transcripts', 0.173), ('naming', 0.171), ('unlabeled', 0.163), ('speaking', 0.144), ('identification', 0.139), ('speaker', 0.134), ('subtitles', 0.133), ('tv', 0.124), ('episode', 0.12), ('shots', 0.104), ('track', 0.102), ('person', 0.101), ('loss', 0.095), ('pan', 0.092), ('series', 0.091), ('unknowns', 0.086), ('xl', 0.085), ('constraints', 0.081), ('multinomial', 0.079), ('character', 0.078), ('bang', 0.073), ('doug', 0.073), ('logistic', 0.067), ('entropy', 0.065), ('identities', 0.065), ('name', 0.062), ('summer', 0.06), ('leonard', 0.06), ('howard', 0.06), ('movies', 0.055), ('supervised', 0.053), ('xi', 0.053), ('faces', 0.052), ('identity', 0.052), ('yl', 0.05), ('ekt', 0.049), ('rainer', 0.049), ('season', 0.049), ('sheldon', 0.049), ('lr', 0.047), ('multimedia', 0.046), ('labeled', 0.045), ('ez', 0.043), ('speak', 0.043), ('knock', 0.043), ('tapaswi', 0.043), ('guest', 0.043), ('karlsruhe', 0.043), ('curiously', 0.043), ('ostinger', 0.043), ('penny', 0.043), ('automatic', 0.042), ('unknown', 0.042), ('labels', 0.041), ('assignment', 0.041), ('mouth', 0.04), ('slayer', 0.04), ('vampire', 0.04), ('cinbis', 0.04), ('lu', 0.039), ('ci', 0.038), ('oseo', 0.038), ('toy', 0.037), ('uniqueness', 0.037), ('ln', 0.037), ('names', 0.036), ('census', 0.036), ('big', 0.036), ('cast', 0.036), ('xu', 0.035), ('coverage', 0.035), ('semi', 0.035), ('identify', 0.034), ('lc', 0.034), ('recall', 0.034), ('boundaries', 0.033), ('regression', 0.033), ('yc', 0.033), ('class', 0.033), ('frontal', 0.032), ('decision', 0.032), ('automatically', 0.032), ('training', 0.031), ('explained', 0.031), ('presumably', 0.031), ('story', 0.031), ('dea', 0.031), ('yi', 0.031), ('named', 0.029), ('temporally', 0.029), ('identified', 0.028), ('played', 0.028), ('weakly', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data

Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen

Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.

2 0.24260911 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification

Author: Enrique G. Ortiz, Alan Wright, Mubarak Shah

Abstract: This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ℓ1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ℓ1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.

3 0.17531095 92 cvpr-2013-Constrained Clustering and Its Application to Face Clustering in Videos

Author: Baoyuan Wu, Yifan Zhang, Bao-Gang Hu, Qiang Ji

Abstract: In this paper, we focus on face clustering in videos. Given the detected faces from real-world videos, we partition all faces into K disjoint clusters. Different from clustering on a collection of facial images, the faces from videos are organized as face tracks and the frame index of each face is also provided. As a result, many pairwise constraints between faces can be easily obtained from the temporal and spatial knowledge of the face tracks. These constraints can be effectively incorporated into a generative clustering model based on the Hidden Markov Random Fields (HMRFs). Within the HMRF model, the pairwise constraints are augmented by label-level and constraint-level local smoothness to guide the clustering process. The parameters for both the unary and the pairwise potential functions are learned by the simulated field algorithm, and the weights of constraints can be easily adjusted. We further introduce an efficient clustering framework specially for face clustering in videos, considering that faces in adjacent frames of the same face track are very similar. The framework is applicable to other clustering algorithms to significantly reduce the computational cost. Experiments on two face data sets from real-world videos demonstrate the significantly improved performance of our algorithm over state-of-the-art algorithms.

4 0.15189129 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification

Author: Haoxiang Li, Gang Hua, Zhe Lin, Jonathan Brandt, Jianchao Yang

Abstract: Pose variation remains to be a major challenge for real-world face recognition. We approach this problem through a probabilistic elastic matching method. We take a part based representation by extracting local features (e.g., LBP or SIFT) from densely sampled multi-scale image patches. By augmenting each feature with its location, a Gaussian mixture model (GMM) is trained to capture the spatial-appearance distribution of all face images in the training corpus. Each mixture component of the GMM is confined to be a spherical Gaussian to balance the influence of the appearance and the location terms. Each Gaussian component builds correspondence of a pair of features to be matched between two faces/face tracks. For face verification, we train an SVM on the vector concatenating the difference vectors of all the feature pairs to decide if a pair of faces/face tracks is matched or not. We further propose a joint Bayesian adaptation algorithm to adapt the universally trained GMM to better model the pose variations between the target pair of faces/face tracks, which consistently improves face verification accuracy. Our experiments show that our method outperforms the state-of-the-art in the most restricted protocol on Labeled Face in the Wild (LFW) and the YouTube video face database by a significant margin.

5 0.13592219 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu

Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods [24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplar-based face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other face-related tasks, such as attribute recognition, as well as general object detection.

6 0.12640902 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories

7 0.12540483 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection

8 0.12432677 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation

9 0.11752222 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes

10 0.1144708 438 cvpr-2013-Towards Pose Robust Face Recognition

11 0.1143818 34 cvpr-2013-Adaptive Active Learning for Image Classification

12 0.11189335 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation

13 0.10771754 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints

14 0.10414089 430 cvpr-2013-The SVM-Minus Similarity Score for Video Face Recognition

15 0.10361493 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking

16 0.10333371 182 cvpr-2013-Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild

17 0.10331795 276 cvpr-2013-MKPLS: Manifold Kernel Partial Least Squares for Lipreading and Speaker Identification

18 0.097229451 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

19 0.096895874 463 cvpr-2013-What's in a Name? First Names as Facial Attributes

20 0.095985904 152 cvpr-2013-Exemplar-Based Face Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.189), (1, -0.075), (2, -0.066), (3, -0.022), (4, 0.058), (5, 0.004), (6, -0.006), (7, -0.094), (8, 0.182), (9, -0.078), (10, 0.066), (11, -0.021), (12, 0.014), (13, 0.051), (14, -0.082), (15, 0.003), (16, 0.02), (17, -0.056), (18, -0.037), (19, -0.073), (20, -0.076), (21, 0.034), (22, -0.075), (23, 0.036), (24, 0.051), (25, 0.007), (26, -0.024), (27, 0.06), (28, 0.002), (29, -0.069), (30, -0.031), (31, 0.088), (32, 0.035), (33, 0.07), (34, 0.061), (35, 0.046), (36, -0.047), (37, -0.004), (38, 0.034), (39, -0.132), (40, -0.06), (41, 0.011), (42, 0.027), (43, 0.001), (44, 0.014), (45, 0.024), (46, -0.055), (47, -0.029), (48, 0.073), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96090597 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data

Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen

Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.

2 0.81051278 92 cvpr-2013-Constrained Clustering and Its Application to Face Clustering in Videos

Author: Baoyuan Wu, Yifan Zhang, Bao-Gang Hu, Qiang Ji

Abstract: In this paper, we focus on face clustering in videos. Given the detected faces from real-world videos, we partition all faces into K disjoint clusters. Different from clustering on a collection of facial images, the faces from videos are organized as face tracks and the frame index of each face is also provided. As a result, many pairwise constraints between faces can be easily obtained from the temporal and spatial knowledge of the face tracks. These constraints can be effectively incorporated into a generative clustering model based on the Hidden Markov Random Fields (HMRFs). Within the HMRF model, the pairwise constraints are augmented by label-level and constraint-level local smoothness to guide the clustering process. The parameters for both the unary and the pairwise potential functions are learned by the simulated field algorithm, and the weights of constraints can be easily adjusted. We further introduce an efficient clustering framework specially for face clustering in videos, considering that faces in adjacent frames of the same face track are very similar. The framework is applicable to other clustering algorithms to significantly reduce the computational cost. Experiments on two face data sets from real-world videos demonstrate the significantly improved performance of our algorithm over state-of-the-art algorithms.

3 0.7821691 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification

Author: Enrique G. Ortiz, Alan Wright, Mubarak Shah

Abstract: This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ℓ1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ℓ1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.

4 0.75368142 430 cvpr-2013-The SVM-Minus Similarity Score for Video Face Recognition

Author: Lior Wolf, Noga Levy

Abstract: Face recognition in unconstrained videos requires specialized tools beyond those developed for still images: the fact that the confounding factors change state during the video sequence presents a unique challenge, but also an opportunity to eliminate spurious similarities. Luckily, a major source of confusion in visual similarity of faces is the 3D head orientation, for which image analysis tools provide an accurate estimation. The method we propose belongs to a family of classifier-based similarity scores. We present an effective way to discount pose-induced similarities within such a framework, which is based on a newly introduced classifier called SVM-minus. The presented method is shown to outperform existing techniques on the most challenging and realistic publicly available video face recognition benchmark, both by itself, and in concert with other methods.

5 0.74159318 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification

Author: Haoxiang Li, Gang Hua, Zhe Lin, Jonathan Brandt, Jianchao Yang

Abstract: Pose variation remains a major challenge for real-world face recognition. We approach this problem through a probabilistic elastic matching method. We take a part-based representation by extracting local features (e.g., LBP or SIFT) from densely sampled multi-scale image patches. By augmenting each feature with its location, a Gaussian mixture model (GMM) is trained to capture the spatial-appearance distribution of all face images in the training corpus. Each mixture component of the GMM is confined to be a spherical Gaussian to balance the influence of the appearance and the location terms. Each Gaussian component builds correspondence of a pair of features to be matched between two faces/face tracks. For face verification, we train an SVM on the vector concatenating the difference vectors of all the feature pairs to decide if a pair of faces/face tracks is matched or not. We further propose a joint Bayesian adaptation algorithm to adapt the universally trained GMM to better model the pose variations between the target pair of faces/face tracks, which consistently improves face verification accuracy. Our experiments show that our method outperforms the state-of-the-art in the most restricted protocol on Labeled Faces in the Wild (LFW) and the YouTube video face database by a significant margin.

6 0.72549105 261 cvpr-2013-Learning by Associating Ambiguously Labeled Images

7 0.69764292 182 cvpr-2013-Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild

8 0.66074383 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification

9 0.65604675 438 cvpr-2013-Towards Pose Robust Face Recognition

10 0.64456874 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

11 0.63119656 463 cvpr-2013-What's in a Name? First Names as Facial Attributes

12 0.59151953 271 cvpr-2013-Locally Aligned Feature Transforms across Views

13 0.58343601 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

14 0.57977307 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment

15 0.573681 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation

16 0.57183212 34 cvpr-2013-Adaptive Active Learning for Image Classification

17 0.56818002 399 cvpr-2013-Single-Sample Face Recognition with Image Corruption and Misalignment via Sparse Illumination Transfer

18 0.56332511 390 cvpr-2013-Semi-supervised Node Splitting for Random Forest Construction

19 0.55777055 152 cvpr-2013-Exemplar-Based Face Parsing

20 0.55775529 220 cvpr-2013-In Defense of Sparsity Based Face Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.103), (16, 0.023), (26, 0.05), (28, 0.015), (33, 0.226), (59, 0.017), (62, 0.23), (67, 0.124), (69, 0.051), (87, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82707858 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data

Author: Martin Bäuml, Makarand Tapaswi, Rainer Stiefelhagen

Abstract: We address the problem of person identification in TV series. We propose a unified learning framework for multiclass classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.

2 0.81747389 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun

Abstract: We are interested in holistic scene understanding where images are accompanied by text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach on the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual-only model and detection improvements of 5% AP over deformable part-based models [8].

3 0.81256407 286 cvpr-2013-Mirror Surface Reconstruction from a Single Image

Author: Miaomiao Liu, Richard Hartley, Mathieu Salzmann

Abstract: This paper tackles the problem of reconstructing the shape of a smooth mirror surface from a single image. In particular, we consider the case where the camera is observing the reflection of a static reference target in the unknown mirror. We first study the reconstruction problem given dense correspondences between 3D points on the reference target and image locations. In such conditions, our differential geometry analysis provides a theoretical proof that the shape of the mirror surface can be uniquely recovered if the pose of the reference target is known. We then relax our assumptions by considering the case where only sparse correspondences are available. In this scenario, we formulate reconstruction as an optimization problem, which can be solved using a nonlinear least-squares method. We demonstrate the effectiveness of our method on both synthetic and real images.

4 0.77133441 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation

Author: Magnus Burenius, Josephine Sullivan, Stefan Carlsson

Abstract: We consider the problem of automatically estimating the 3D pose of humans from images, taken from multiple calibrated views. We show that it is possible and tractable to extend the pictorial structures framework, popular for 2D pose estimation, to 3D. We discuss how to use this framework to impose view, skeleton, joint angle and intersection constraints in 3D. The 3D pictorial structures are evaluated on multiple view data from a professional football game. The evaluation is focused on computational tractability, but we also demonstrate how a simple 2D part detector can be plugged into the framework.

5 0.76860112 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen

Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image, where a graphlet is a small-sized graph consisting of superpixels as its nodes that encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to those of the fully supervised segmentation models.

6 0.76765847 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

7 0.76731795 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

8 0.76655221 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

9 0.76182246 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers

10 0.76083678 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

11 0.76078635 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification

12 0.75958294 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence

13 0.75814915 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection

14 0.75725222 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

15 0.75641924 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search

16 0.75456852 414 cvpr-2013-Structure Preserving Object Tracking

17 0.75442266 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

18 0.75405753 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking

19 0.75393343 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

20 0.75384206 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation