cvpr cvpr2013 cvpr2013-120 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vineet Gandhi, Remi Ronfard
Abstract: We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
Reference: text
sentIndex sentText sentNum sentScore
1 We demonstrate the model on the task of localizing and naming actors in long video sequences. [sent-4, score-0.467]
2 We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. [sent-7, score-0.508]
3 We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming). [sent-8, score-1.586]
4 Introduction. Detecting and naming actors in movies is important for content-based indexing and retrieval of movie scenes and can also be used to support statistical analysis of film style. [sent-10, score-0.571]
5 Additionally, detecting and naming actors in unedited footage can be useful for post-production. [sent-11, score-0.527]
6 Methods for learning such actor-specific features are desirable for improving the recall and precision of actor detection in long movie scenes, where the appearance of actors is consistent over time. [sent-13, score-1.289]
7 Contributions. We propose a complete framework to learn view-independent actor models using maximally stable color regions (MSCR) [10] with a novel clustering algorithm. [sent-14, score-0.853]
8 The actor’s head and shoulders are represented as constellations of color blobs, where the appearance of each blob is represented in a 9-dimensional space combining color, size, shape and position relative to the actor’s coordinate system, together with a frequency term. [sent-15, score-0.714]
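As an illustration, a minimal Python sketch of packing one blob into this space is given below. The text lists position, size, color and shape but not the exact split of the nine dimensions, so the 2 + 1 + 3 + 3 layout (and the square-root size encoding) used here is an assumption.

import numpy as np

def blob_descriptor(cx, cy, area, mean_color, second_moments):
    """Pack one color blob into the 9-D appearance space.
    Assumed layout: 2 position dims (normalized actor coordinates),
    1 size dim (sqrt of area, so it scales linearly with the window),
    3 color dims, and 3 shape dims (entries of the symmetric 2x2
    second-moment matrix of the fitted ellipse)."""
    mxx, mxy, myy = second_moments
    return np.array([cx, cy, np.sqrt(area), *mean_color, mxx, mxy, myy],
                    dtype=np.float64)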
9 The first stage reduces the search space by finding the k-nearest neighbours corresponding to the actor, using only the appearance of the blobs in the model. [sent-19, score-1.072]
10 The second stage is a sliding window search for the best localization of the actor in position and scale. [sent-20, score-0.893]
11 By repeating those two steps for all actors at all sizes, we obtain detection windows and actor names that maximize the posterior likelihood of each video frame. [sent-21, score-1.222]
12 First, we briefly review related work in generic and specific actor detection in movies. [sent-23, score-0.756]
13 Related work. Much previous work on actor detection and recognition has been based on face detection. [sent-29, score-0.781]
14 In one of their examples, they report that 42% of actor appearances are frontal, 21% profile and 37% are actors facing away from the camera. [sent-34, score-1.176]
15 Indeed, generic methods for detecting upper-body or full-body actors have been proposed by Dalal et al. [sent-36, score-0.412]
16 While such methods can potentially increase the coverage of actor detectors by detecting actors in profile and back views, they also suffer from the higher variability of actor appearances in such views. [sent-39, score-2.002]
17 (c) The blobs chosen for training shown with the head and the torso partition. [sent-46, score-0.44]
18 (d) The color blobs scaled and shifted to the normalized actor coordinate system which is represented by the red axis. [sent-47, score-1.051]
19 While the work in [18] and [15] focuses on first obtaining tracks and later classifying or hand labeling them, our method can directly perform actor specific detections on individual frames. [sent-49, score-0.771]
20 Our actor model is simpler since we only model two body parts (head and shoulders) but each part is an arbitrarily complex constellation model of color blobs. [sent-54, score-0.824]
21 It is difficult to apply such models to actor appearances because interest points and their image features are typically unstable under viewpoint and pose changes. [sent-58, score-0.772]
22 Figure 2: Training images from different viewpoints are merged to obtain an actor model. [sent-59, score-0.75]
23 We resolve this issue by representing parts of actors with color regions rather than local features. [sent-61, score-0.449]
24 We extend such previous work to the more difficult problem of "re-detecting" actors, where the generic detectors fail due to variations in pose and viewpoint, partial occlusions, etc. [sent-63, score-0.4]
25 Generative model. In this section, we introduce our generative model for the appearance of actors and describe a method for learning the model from a small number of individual keyframes or short video tracks. [sent-67, score-0.517]
26 Our model is designed to incorporate the costume of the actor and to be robust to changes in viewpoint and pose. [sent-68, score-0.76]
27 We make one important assumption that the actor is in an upright position and that both the head and the shoulders are visible. [sent-69, score-0.87]
28 As a result, we model the actor with two image windows for the head and shoulders, in a normalized coordinate system with the origin at the actor’s neck, and with unit size set to twice the height of the actor’s eyes relative to the origin. [sent-70, score-0.834]
29 Figure 3: Appearance models for all 8 actors in the movie "Rope" [11]. The head region extends from (−1, −1) to (1, 0) and the shoulder region extends from (−1, 0) to (1, 3). [sent-75, score-0.523]
30 More specifically, we associate with each actor a visual vocabulary of color blobs Ci described in terms of their normalized coordinates xi, yi, sizes si, colors ci and shapes mi, and their frequencies Hi. [sent-77, score-1.072]
31 Color blobs above the actor’s origin are labeled as "head" features and color blobs under the origin are labeled as "shoulder" features. [sent-78, score-0.68]
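A minimal sketch of the normalization and head/shoulder partition described above, assuming image coordinates with y growing downwards (consistent with the head region spanning negative y in Fig 3); the neck position and eye height come from the labeled keyframe:

import numpy as np

def to_actor_coords(blob_xy, neck_xy, eye_height):
    """Map an image-space blob center into normalized actor coordinates:
    origin at the neck, unit length twice the eye height above it."""
    unit = 2.0 * eye_height
    return (np.asarray(blob_xy, float) - np.asarray(neck_xy, float)) / unit

def part_label(actor_xy):
    """Blobs above the origin are 'head' features, blobs below 'shoulder'."""
    return "head" if actor_xy[1] < 0 else "shoulder"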
32 Formally, our generative model for each actor consists of the following three steps: 1. [sent-81, score-0.749]
33 Choose screen location and window size for the actor on the screen, using the detections in the previous frame as a prior. [sent-82, score-0.873]
34 Maximally stable color regions. The maximally stable color regions (MSCR) feature is a color extension of the maximally stable extremal region (MSER) feature [10]. [sent-91, score-0.355]
35 These approximated ellipses are termed color blobs in the rest of the text. [sent-98, score-0.34]
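OpenCV ships no MSCR implementation, so the sketch below is a stand-in rather than the authors' detector: it runs MSER (the grayscale relative of MSCR) on each color channel and fits an ellipse to every stable region from its first and second moments, mirroring the ellipse approximation described above.

import cv2
import numpy as np

def color_blobs(image_bgr):
    """Approximate MSCR blobs with per-channel MSER regions."""
    mser = cv2.MSER_create()
    blobs = []
    for channel in cv2.split(image_bgr):
        regions, _ = mser.detectRegions(channel)
        for pts in regions:
            pts = pts.reshape(-1, 2).astype(np.float64)
            center = pts.mean(axis=0)              # first moment: position
            shape = np.cov(pts.T)                  # second moments: ellipse
            color = image_bgr[pts[:, 1].astype(int),
                              pts[:, 0].astype(int)].mean(axis=0)
            blobs.append({"center": center, "area": len(pts),
                          "color": color, "shape": shape})
    return blobs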
36 The actual choice of samples is not very important, as long as we cover the entire range of appearances of the actor (we try to keep the training set equally sampled across different views, i.e., frontal, profile and back). [sent-104, score-0.811]
37 Ideally, a sequence of each actor performing a 360-degree turn would be sufficient to build such models. [sent-107, score-0.711]
38 We then collect the color blobs in all training windows, center and resize them, and assign them to clusters. [sent-110, score-0.34]
39 Figure 4 (a), (b): Example of two independent, randomly generated blob images given the actor and background models. [sent-111, score-0.894]
40 We cluster the blobs for all actors using a constrained agglomerative clustering. [sent-114, score-0.753]
41 For every actor with n training images we get n sets of blobs (f1, f2, . . . [sent-115, score-1.011]
42 , fn), with a varying number of blobs in each set, where each blob is represented as a 9-dimensional vector in normalized actor coordinates. [sent-119, score-1.194]
43 We then compute pairwise matching between those clusters and the blobs in the next set f2. [sent-122, score-0.334]
44 Note that the number of clusters per actor is variable. [sent-132, score-0.745]
45 As a result, actors with more complex appearances can be represented with a larger number of clusters. [sent-133, score-0.443]
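A sketch of this constrained agglomerative clustering follows. The paper does not name the matcher used for the pairwise matching step, so the Hungarian algorithm (scipy's linear_sum_assignment) and the gating threshold max_dist below are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def sequential_cluster(blob_sets, max_dist=1.0):
    """Seed clusters from the first training set, then match every later
    set one-to-one against the running cluster means; matched blobs update
    their cluster, unmatched blobs open new (possibly singleton) clusters."""
    centers = [b.copy() for b in blob_sets[0]]
    members = [[b] for b in blob_sets[0]]
    for blobs in blob_sets[1:]:
        blobs = list(blobs)
        cost = cdist(np.array(centers), np.array(blobs))
        rows, cols = linear_sum_assignment(cost)
        matched = set()
        for r, c in zip(rows, cols):
            if cost[r, c] <= max_dist:             # accept only close pairs
                members[r].append(blobs[c])
                centers[r] = np.mean(members[r], axis=0)
                matched.add(c)
        for c, b in enumerate(blobs):              # leftovers seed clusters
            if c not in matched:
                centers.append(b.copy())
                members.append([b])
    freq = [len(m) / len(blob_sets) for m in members]   # frequency term Hi
    return np.array(centers), freq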
46 The appearance models for eight different actors are shown in Fig 3. [sent-134, score-0.414]
47 Our framework searches for actors over a variety of scales, from foreground (larger scales) to background (smaller scales). [sent-137, score-0.382]
48 For each actor we first perform a search space reduction using kNN-search. [sent-138, score-0.731]
49 2: for each actor a do; 3: for each scale s do; 4: normalize image features w.r.t. the current scale. [sent-146, score-0.731]
50 7: for each position (x, y) do; 8: find blob indices Jhead and Jshoulders; 9: compute the assignment mij using the blob indices and the inverted index; score(x, y, s, a) = Σj P(Bj, mij, a). [sent-154, score-0.644]
51 This gives us an initial set of blobs B over which we perform a refinement step using kNN search, given the actor model and the particular scale. [sent-159, score-1.056]
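The refinement can be sketched as a nearest-neighbour gate in the appearance part of the descriptor (blob positions are only meaningful once a window is fixed, per the first-stage description above); the radius value is an assumed setting.

import numpy as np
from scipy.spatial import cKDTree

def prune_blobs(blob_desc, cluster_desc, radius=2.0):
    """Keep only image blobs whose appearance (size, color, shape; the
    descriptor minus its two position dims) lies near some cluster center."""
    tree = cKDTree(np.asarray(cluster_desc)[:, 2:])
    dists, _ = tree.query(np.asarray(blob_desc)[:, 2:], k=1)
    return np.asarray(blob_desc)[dists <= radius]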
52 First, for each blob within the sliding window, we only need to compare it with its corresponding entries in the inverted index table instead of performing an exhaustive search. [sent-169, score-0.373]
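A minimal sketch of such an inverted index, built once per scale so the sliding window only scores precomputed blob-to-cluster pairs; the radius gate is again an assumed value.

import numpy as np
from collections import defaultdict

def build_inverted_index(blob_desc, cluster_desc, radius=2.0):
    """Map each surviving blob to the cluster indices it could match."""
    index = defaultdict(list)
    blobs = np.asarray(blob_desc)[:, 2:]           # appearance dims only
    centers = np.asarray(cluster_desc)[:, 2:]
    for j, b in enumerate(blobs):
        for i, c in enumerate(centers):
            if np.linalg.norm(b - c) <= radius:
                index[j].append(i)
    return index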
53 (d) Refined set of blobs for the given actor at the given scale, based on appearance. [sent-178, score-1.011]
54 (g) The corresponding matched cluster centers in the given actor model. [sent-181, score-0.781]
55 Sliding window search. We now proceed to define the detection scores for all actors at all positions and scales using a sliding window approach. [sent-189, score-0.635]
56 Each actor detection score is based on the likelihood that the image in the sliding window was generated by the actor model using the previous frame detections as prior information. [sent-190, score-1.699]
57 In practice we compute MSCR features at the best available scale and then shift and scale the blobs accordingly while searching at different scales. [sent-191, score-0.34]
58 During recognition, we similarly normalize the size, shape and position of blobs relative to the sliding window. [sent-193, score-0.405]
59 This ensures that all computations are performed in reduced actor coordinates. [sent-194, score-0.711]
60 We represent B as the set of all blobs detected in the image and Ca as the set of cluster centers in the model for a given actor a. [sent-196, score-1.095]
61 Given a sliding window at position (x, y) and scale s, we find all blobs centered within the sliding window and assign them the index sets Jhead and Jshoulders. [sent-197, score-0.953]
62 The term P(Bj, mij, a) is the similarity function between the model cluster Cia and the corresponding matched blob Bj in the nine-dimensional space (position, size, color and shape), which is defined as follows: P(Bj, mij, a) = N(Bj; Cia, Σia), [sent-204, score-0.385]
63 where Cia is the center for cluster i in the actor model and Σia is its covariance matrix. [sent-207, score-0.783]
64 A distinctive feature of our detection framework is that it requires us to find a partial assignment mij between blobs in the sliding window and clusters in the model. [sent-208, score-0.602]
65 More precisely, we compute mij such that each blob in the sliding window is assigned to at most one cluster, and each cluster is assigned to at most one blob. [sent-209, score-0.366]
66 This method produces significantly better results than computing the average score over all possible blob-to-cluster assignments, where the same blob may be assigned to multiple clusters and the same cluster to multiple blobs, which is prone to detection errors. [sent-219, score-0.312]
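The scoring step can be sketched as follows: the Gaussian similarity above becomes a squared Mahalanobis cost, and the one-to-one constraint is enforced with the Hungarian algorithm. The miss_cost cap on unmatched items is an assumption; the paper only states that the assignment is partial.

import numpy as np
from scipy.optimize import linear_sum_assignment

def window_score(window_blobs, centers, covs, miss_cost=25.0):
    """Score one sliding window against one actor model."""
    nb, nc = len(window_blobs), len(centers)
    cost = np.full((nb, nc), miss_cost)
    for j, b in enumerate(window_blobs):
        for i, (mu, cov) in enumerate(zip(centers, covs)):
            d = b - mu                              # 9-D residual
            cost[j, i] = min(miss_cost, d @ np.linalg.solve(cov, d))
    rows, cols = linear_sum_assignment(cost)
    # Sum only accepted pairs; unmatched blobs/clusters contribute nothing.
    return -sum(cost[r, c] for r, c in zip(rows, cols)
                if cost[r, c] < miss_cost)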
67 where l1,1 and l1,0 measure the probability that the same actor is observed in consecutive frames. [sent-223, score-0.711]
68 When the actor is not present in the previous frame, all positions in the next frame are equally probable. [sent-224, score-0.743]
69 When the actor is present in both frames, we assume the new position to be close to the previous position, within some covariance term Σpos. [sent-225, score-0.762]
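In log-space this prior can be sketched as below; sigma_pos (an isotropic stand-in for Σpos) and p_stay (standing in for l1,1) are assumed values.

import numpy as np

def position_log_prior(pos, prev_pos, sigma_pos=0.05, p_stay=0.9):
    """Gaussian prior around the previous detection; uniform (a constant
    log-prior) when the actor was not present in the previous frame."""
    if prev_pos is None:
        return 0.0
    d2 = np.sum((np.asarray(pos) - np.asarray(prev_pos)) ** 2)
    return np.log(p_stay) - d2 / (2.0 * sigma_pos ** 2)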
70 Comparison of recall and precision for actor detection using the Upper Body Detector (UBD), the Color Blob Detector (CBD) and the combined method (UBD-CBD), which are in fact independent of the choice of actor or movie. [sent-227, score-1.539]
71 Benefiting from the fact that an actor can only appear once in a frame, we then search for the best positions and scales [x∗(a), y∗(a), s∗(a)] which maximize the total score over all actors. [sent-234, score-0.763]
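A greedy simplification of this per-frame selection (the text maximizes the total score over all actors jointly; picking the per-actor argmax as below assumes the actors' windows are scored independently):

def best_window_per_actor(score_maps):
    """score_maps[a] maps (x, y, s) to the posterior score for actor a;
    exploiting that each actor appears at most once per frame, keep the
    single best window per actor."""
    detections = {}
    for actor, scores in score_maps.items():
        (x, y, s), val = max(scores.items(), key=lambda kv: kv[1])
        detections[actor] = (x, y, s, val)
    return detections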
72 Dataset. As mentioned earlier, previous work on actor-specific models has focused on obtaining tracks and then performing classification on the obtained tracks. [sent-241, score-0.734]
73 Our method, on the other hand, can perform direct detection using actor-specific models even on individual keyframes, which is a much harder task than classifying tracks; we target the scenario where it is difficult to obtain face or upper-body tracks. [sent-243, score-0.916]
74 Comparison of recall with an increasing number of actors for UBD, CBD and the combined case. The dataset consists of keyframes sampled at equal intervals, one frame every 10 seconds, from the movie "Rope" [11] by Alfred Hitchcock. [sent-247, score-0.579]
75 This dataset presents significant scale and viewpoint variations for each actor, together with motion and focus blur. [sent-248, score-0.731]
76 The lighting changes considerably during the movie while the clothing of all the actors remains consistent, which makes it suitable for our experiments. [sent-250, score-0.526]
77 There are 8 different actors in the entire movie (not counting the initial victim and Alfred Hitchcock himself). [sent-252, score-0.469]
78 The number of appearances per actor varies between 38 and 275. [sent-254, score-0.772]
79 All 443 frames were hand labeled with the names and screen locations for all actors to serve as ground truth. [sent-255, score-0.458]
80 Actor identification results for all 8 actors in the Rope dataset (percentages). [sent-259, score-0.382]
81 We ran our detection and recognition algorithm, using the built actor models, on all 443 frames and compared the results with the ground truth. [sent-261, score-0.756]
82 Fig 6 shows recall and precision of actor detection using the proposed method (CBD), compared with the state-of-the-art Upper Body Detector (UBD) and with the combined case, where we merge the detections obtained from both methods. [sent-262, score-0.844]
83 Results demonstrate an increase in recall from about 57 percent with UBD, to 70 percent with the proposed Color Blob Detector (CBD), to about 81 percent with the combined approach, at a similar precision. [sent-263, score-0.323]
84 Some detection results on the Rope dataset using the proposed method CBD (in red, with the recognized actor names at top left), UBD (in yellow), and cases missed by both (in green). [sent-270, score-0.828]
85 Notice how our method is able to detect and identify the actors in the presence of multiple actors with partial occlusions [e.g. [sent-271, score-0.782]
86 (e)], merging of foreground blobs with the background due to low illumination [e.g. [sent-283, score-0.3]
87 In Fig 7 we plot recall rates for UBD, CBD and the combined case with different numbers of actors present in the frame; the proposed method gives consistent results with a varying number of actors. [sent-293, score-0.446]
88 Recognition results on the detected actors are presented in Table 5. [sent-294, score-0.414]
89 As can be seen, our method not only increases the average recall rate for all actors, but also correctly names all actors with an average precision of 89 percent despite the large number of back views and partial occlusions. [sent-296, score-0.549]
90 Some example detection results are shown in Fig 8; they demonstrate how our method performs well even with severe occlusions and viewpoint, scale and pose variations in a multi-actor scenario. [sent-297, score-0.768]
91 In 9(a) the blobs in the torso region of the undetected actor get merged with the background; heavy blur causes a mis-detection in 9(b); the torso is largely occluded in 9(c). [sent-300, score-1.173]
92 In the fourth instance, 9(d), the hand gets detected as the head and the blobs below it as the torso, leading to a false detection. [sent-301, score-0.41]
93 Conclusions. We have presented a generative appearance model for detecting and naming actors in movies that can be learned from a small number of training examples. [sent-307, score-0.584]
94 We have shown that low-dimensional features like MSCR that were previously used for actor re-identification can also support actor detection, even in difficult multiple-actor scenarios. [sent-308, score-1.804]
95 Results show a significant increase in coverage (recall) for actor detection while maintaining high precision. [sent-309, score-0.803]
96 To our knowledge, this is the first time that a generative appearance model is demonstrated on the task of detecting and recognizing actors from arbitrary viewpoints. [sent-310, score-0.482]
97 Our method also appears to be a good candidate for tracking multiple actors that constantly change viewpoint and occlude each other in long video sequences such as "Rope"; important application scenarios include unedited raw video footage and recordings of live performances. [sent-311, score-0.46]
98 We also plan to investigate weakly supervised methods by extracting actor labels from temporally aligned movie scripts [17, 2]. [sent-312, score-0.825]
99 One obvious limitation of our method is that it only handles cases where the appearance of actors does not change much over time. [sent-313, score-0.414]
100 In future work, we are planning to investigate extensions with mutually-exclusive appearances per actor, so that actors can change their appearances and costumes over time. [sent-314, score-0.504]
wordName wordTfidf (topN-words)
[('actor', 0.711), ('actors', 0.382), ('blobs', 0.3), ('blob', 0.183), ('mscr', 0.12), ('rope', 0.114), ('fig', 0.098), ('ubd', 0.098), ('mij', 0.092), ('movie', 0.087), ('cbd', 0.087), ('head', 0.078), ('sliding', 0.074), ('naming', 0.066), ('torso', 0.062), ('appearances', 0.061), ('idx', 0.058), ('window', 0.057), ('shoulder', 0.054), ('ronfard', 0.053), ('cluster', 0.052), ('shoulders', 0.05), ('costume', 0.049), ('bj', 0.049), ('cia', 0.048), ('coverage', 0.047), ('keyframes', 0.046), ('detection', 0.045), ('maximally', 0.044), ('names', 0.04), ('color', 0.04), ('brandon', 0.04), ('singleton', 0.04), ('knn', 0.04), ('indices', 0.04), ('body', 0.04), ('views', 0.039), ('generative', 0.038), ('detections', 0.037), ('screen', 0.036), ('percent', 0.036), ('movies', 0.036), ('inverted', 0.035), ('optional', 0.035), ('clusters', 0.034), ('constellation', 0.033), ('alfred', 0.033), ('janet', 0.033), ('jhead', 0.033), ('ljk', 0.033), ('remi', 0.033), ('appearance', 0.032), ('detected', 0.032), ('recall', 0.032), ('score', 0.032), ('frame', 0.032), ('stable', 0.031), ('position', 0.031), ('detecting', 0.03), ('neighbours', 0.029), ('unedited', 0.029), ('sivic', 0.028), ('regions', 0.027), ('scripts', 0.027), ('vineet', 0.027), ('pt', 0.026), ('upper', 0.026), ('clothing', 0.025), ('benefiting', 0.025), ('grenoble', 0.025), ('windows', 0.025), ('face', 0.025), ('refinement', 0.025), ('exhaustive', 0.024), ('tracks', 0.023), ('person', 0.022), ('ca', 0.022), ('pictorial', 0.022), ('profile', 0.022), ('ci', 0.021), ('coherence', 0.021), ('back', 0.02), ('covariance', 0.02), ('shots', 0.02), ('eichner', 0.02), ('footage', 0.02), ('inria', 0.02), ('scale', 0.02), ('search', 0.02), ('failure', 0.02), ('origin', 0.02), ('viewpoints', 0.02), ('agglomerative', 0.019), ('merged', 0.019), ('video', 0.019), ('leonardis', 0.019), ('blur', 0.019), ('matched', 0.018), ('occlusions', 0.018), ('detectors', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
Author: Vineet Gandhi, Remi Ronfard
Abstract: We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
2 0.17448376 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
Author: Nikolaos Kyriazis, Antonis Argyros
Abstract: In several hand-object(s) interaction scenarios, the change in the objects ’ state is a direct consequence of the hand’s motion. This has a straightforward representation in Newtonian dynamics. We present the first approach that exploits this observation to perform model-based 3D tracking of a table-top scene comprising passive objects and an active hand. Our forward modelling of 3D hand-object(s) interaction regards both the appearance and the physical state of the scene and is parameterized over the hand motion (26 DoFs) between two successive instants in time. We demonstrate that our approach manages to track the 3D pose of all objects and the 3D pose and articulation of the hand by only searching for the parameters of the hand motion. In the proposed framework, covert scene state is inferred by connecting it to the overt state, through the incorporation of physics. Thus, our tracking approach treats a variety of challenging observability issues in a principled manner, without the need to resort to heuristics.
3 0.14251798 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and minefor hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
4 0.091508612 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
Author: Enrique G. Ortiz, Alan Wright, Mubarak Shah
Abstract: This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ?1minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ?1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our methodmatches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.
5 0.07982672 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery
Author: Rikke Gade, Anders Jørgensen, Thomas B. Moeslund
Abstract: This paper presents a robust occupancy analysis system for thermal imaging. Reliable detection of people is very hard in crowded scenes, due to occlusions and segmentation problems. We therefore propose a framework that optimises the occupancy analysis over long periods by including information on the transition in occupancy, whenpeople enter or leave the monitored area. In stable periods, with no activity close to the borders, people are detected and counted which contributes to a weighted histogram. When activity close to the border is detected, local tracking is applied in order to identify a crossing. After a full sequence, the number of people during all periods are estimated using a probabilistic graph search optimisation. The system is tested on a total of 51,000 frames, captured in sports arenas. The mean error for a 30-minute period containing 3-13 people is 4.44 %, which is a half of the error percentage optained by detection only, and better than the results of comparable work. The framework is also tested on a public available dataset from an outdoor scene, which proves the generality of the method.
6 0.074610248 440 cvpr-2013-Tracking People and Their Objects
7 0.071149871 100 cvpr-2013-Crossing the Line: Crowd Counting by Integer Programming with Local Features
8 0.068064362 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
9 0.06716378 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
10 0.063474417 334 cvpr-2013-Pose from Flow and Flow from Pose
11 0.062759846 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data
12 0.062185071 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
13 0.060961932 335 cvpr-2013-Poselet Conditioned Pictorial Structures
14 0.05876305 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
15 0.057608809 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
16 0.055364255 134 cvpr-2013-Discriminative Sub-categorization
17 0.055300426 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
18 0.054928415 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
19 0.054369621 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
20 0.054338988 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
topicId topicWeight
[(0, 0.128), (1, -0.021), (2, 0.008), (3, -0.056), (4, -0.018), (5, 0.009), (6, 0.031), (7, -0.01), (8, 0.042), (9, -0.025), (10, -0.019), (11, 0.005), (12, 0.006), (13, -0.003), (14, 0.029), (15, -0.0), (16, 0.025), (17, 0.006), (18, -0.009), (19, -0.034), (20, 0.023), (21, 0.042), (22, -0.037), (23, 0.054), (24, -0.003), (25, -0.014), (26, 0.012), (27, -0.021), (28, -0.02), (29, -0.049), (30, 0.027), (31, 0.043), (32, 0.011), (33, 0.029), (34, 0.026), (35, -0.005), (36, 0.001), (37, 0.043), (38, 0.018), (39, -0.054), (40, -0.027), (41, 0.02), (42, -0.03), (43, -0.002), (44, -0.004), (45, -0.033), (46, -0.002), (47, -0.019), (48, 0.038), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.87441713 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
Author: Vineet Gandhi, Remi Ronfard
Abstract: We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
2 0.67593974 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery
Author: Rikke Gade, Anders Jørgensen, Thomas B. Moeslund
Abstract: This paper presents a robust occupancy analysis system for thermal imaging. Reliable detection of people is very hard in crowded scenes, due to occlusions and segmentation problems. We therefore propose a framework that optimises the occupancy analysis over long periods by including information on the transition in occupancy, whenpeople enter or leave the monitored area. In stable periods, with no activity close to the borders, people are detected and counted which contributes to a weighted histogram. When activity close to the border is detected, local tracking is applied in order to identify a crossing. After a full sequence, the number of people during all periods are estimated using a probabilistic graph search optimisation. The system is tested on a total of 51,000 frames, captured in sports arenas. The mean error for a 30-minute period containing 3-13 people is 4.44 %, which is a half of the error percentage optained by detection only, and better than the results of comparable work. The framework is also tested on a public available dataset from an outdoor scene, which proves the generality of the method.
3 0.6564936 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
Author: Ishani Chakraborty, Hui Cheng, Omar Javed
Abstract: We present a unified framework for detecting and classifying people interactions in unconstrained user generated images. 1 Unlike previous approaches that directly map people/face locations in 2D image space into features for classification, we first estimate camera viewpoint and people positions in 3D space and then extract spatial configuration features from explicit 3D people positions. This approach has several advantages. First, it can accurately estimate relative distances and orientations between people in 3D. Second, it encodes spatial arrangements of people into a richer set of shape descriptors than afforded in 2D. Our 3D shape descriptors are invariant to camera pose variations often seen in web images and videos. The proposed approach also estimates camera pose and uses it to capture the intent of the photo. To achieve accurate 3D people layout estimation, we develop an algorithm that robustly fuses semantic constraints about human interpositions into a linear camera model. This enables our model to handle large variations in people size, heights (e.g. age) and poses. An accurate 3D layout also allows us to construct features informed by Proxemics that improves our semantic classification. To characterize the human interaction space, we introduce visual proxemes; a set of prototypical patterns that represent commonly occurring social interactions in events. We train a discriminative classifier that classifies 3D arrangements of people into visual proxemes and quantitatively evaluate the performance on a large, challenging dataset.
4 0.64926028 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
Author: Varun Ramakrishna, Takeo Kanade, Yaser Sheikh
Abstract: The human body is structurally symmetric. Tracking by detection approaches for human pose suffer from double counting, where the same image evidence is used to explain two separate but symmetric parts, such as the left and right feet. Double counting, if left unaddressed can critically affect subsequent processes, such as action recognition, affordance estimation, and pose reconstruction. In this work, we present an occlusion aware algorithm for tracking human pose in an image sequence, that addresses the problem of double counting. Our key insight is that tracking human pose can be cast as a multi-target tracking problem where the ”targets ” are related by an underlying articulated structure. The human body is modeled as a combination of singleton parts (such as the head and neck) and symmetric pairs of parts (such as the shoulders, knees, and feet). Symmetric body parts are jointly tracked with mutual exclusion constraints to prevent double counting by reasoning about occlusion. We evaluate our algorithm on an outdoor dataset with natural background clutter, a standard indoor dataset (HumanEva-I), and compare against a state of the art pose estimation algorithm.
Author: Alessandro Perina, Nebojsa Jojic
Abstract: Recently, the Counting Grid (CG) model [5] was developed to represent each input image as a point in a large grid of feature counts. This latent point is a corner of a window of grid points which are all uniformly combined to match the (normalized) feature counts in the image. Being a bag of word model with spatial layout in the latent space, the CG model has superior handling of field of view changes in comparison to other bag of word models, but with the price of being essentially a mixture, mapping each scene to a single window in the grid. In this paper we introduce a family of componential models, dubbed the Componential Counting Grid, whose members represent each input image by multiple latent locations, rather than just one. In this way, we make a substantially more flexible admixture model which captures layers or parts of images and maps them to separate windows in a Counting Grid. We tested the models on scene and place classification where their com- ponential nature helped to extract objects, to capture parallax effects, thus better fitting the data and outperforming Counting Grids and Latent Dirichlet Allocation, especially on sequences taken with wearable cameras.
6 0.62154055 264 cvpr-2013-Learning to Detect Partially Overlapping Instances
7 0.60216963 440 cvpr-2013-Tracking People and Their Objects
9 0.58838302 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification
10 0.58173651 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data
11 0.57875913 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
12 0.57861292 282 cvpr-2013-Measuring Crowd Collectiveness
13 0.57479393 100 cvpr-2013-Crossing the Line: Crowd Counting by Integer Programming with Local Features
14 0.55486721 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
15 0.55479664 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
16 0.55385584 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
17 0.55246395 172 cvpr-2013-Finding Group Interactions in Social Clutter
18 0.54351389 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
19 0.53342301 299 cvpr-2013-Multi-source Multi-scale Counting in Extremely Dense Crowd Images
20 0.53114218 118 cvpr-2013-Detecting Pulse from Head Motions in Video
topicId topicWeight
[(10, 0.076), (16, 0.034), (26, 0.053), (28, 0.018), (33, 0.235), (62, 0.011), (67, 0.105), (69, 0.046), (80, 0.023), (87, 0.059), (93, 0.251)]
simIndex simValue paperId paperTitle
1 0.84054101 208 cvpr-2013-Hyperbolic Harmonic Mapping for Constrained Brain Surface Registration
Author: Rui Shi, Wei Zeng, Zhengyu Su, Hanna Damasio, Zhonglin Lu, Yalin Wang, Shing-Tung Yau, Xianfeng Gu
Abstract: Automatic computation of surface correspondence via harmonic map is an active research field in computer vision, computer graphics and computational geometry. It may help document and understand physical and biological phenomena and also has broad applications in biometrics, medical imaging and motion capture. Although numerous studies have been devoted to harmonic map research, limited progress has been made to compute a diffeomorphic harmonic map on general topology surfaces with landmark constraints. This work conquer this problem by changing the Riemannian metric on the target surface to a hyperbolic metric, so that the harmonic mapping is guaranteed to be a diffeomorphism under landmark constraints. The computational algorithms are based on the Ricci flow method and the method is general and robust. We apply our algorithm to study constrained human brain surface registration problem. Experimental results demonstrate that, by changing the Riemannian metric, the registrations are always diffeomorphic, and achieve relative high performance when evaluated with some popular cortical surface registration evaluation standards.
same-paper 2 0.80914605 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
Author: Vineet Gandhi, Remi Ronfard
Abstract: We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
3 0.7761566 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation
Author: Kota Hara, Rama Chellappa
Abstract: We present a hierarchical method for human pose estimation from a single still image. In our approach, a dependency graph representing relationships between reference points such as bodyjoints is constructed and thepositions of these reference points are sequentially estimated by a successive application of multidimensional output regressions along the dependency paths, starting from the root node. Each regressor takes image features computed from an image patch centered on the current node ’s position estimated by the previous regressor and is specialized for estimating its child nodes ’ positions. The use of the dependency graph allows us to decompose a complex pose estimation problem into a set of local pose estimation problems that are less complex. We design a dependency graph for two commonly used human pose estimation datasets, the Buffy Stickmen dataset and the ETHZ PASCAL Stickmen dataset, and demonstrate that our method achieves comparable accuracy to state-of-the-art results on both datasets with significantly lower computation time than existing methods. Furthermore, we propose an importance weighted boosted re- gression trees method for transductive learning settings and demonstrate the resulting improved performance for pose estimation tasks.
4 0.76310921 24 cvpr-2013-A Principled Deep Random Field Model for Image Segmentation
Author: Pushmeet Kohli, Anton Osokin, Stefanie Jegelka
Abstract: We discuss a model for image segmentation that is able to overcome the short-boundary bias observed in standard pairwise random field based approaches. To wit, we show that a random field with multi-layered hidden units can encode boundary preserving higher order potentials such as the ones used in the cooperative cuts model of [11] while still allowing for fast and exact MAP inference. Exact inference allows our model to outperform previous image segmentation methods, and to see the true effect of coupling graph edges. Finally, our model can be easily extended to handle segmentation instances with multiple labels, for which it yields promising results.
5 0.76178831 376 cvpr-2013-Salient Object Detection: A Discriminative Regional Feature Integration Approach
Author: Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, Shipeng Li
Abstract: Salient object detection has been attracting a lot of interest, and recently various heuristic computational models have been designed. In this paper, we regard saliency map computation as a regression problem. Our method, which is based on multi-level image segmentation, uses the supervised learning approach to map the regional feature vector to a saliency score, and finally fuses the saliency scores across multiple levels, yielding the saliency map. The contributions lie in two-fold. One is that we show our approach, which integrates the regional contrast, regional property and regional backgroundness descriptors together to form the master saliency map, is able to produce superior saliency maps to existing algorithms most of which combine saliency maps heuristically computed from different types of features. The other is that we introduce a new regional feature vector, backgroundness, to characterize the background, which can be regarded as a counterpart of the objectness descriptor [2]. The performance evaluation on several popular benchmark data sets validates that our approach outperforms existing state-of-the-arts.
6 0.74999571 449 cvpr-2013-Unnatural L0 Sparse Representation for Natural Image Deblurring
7 0.73748809 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
8 0.73701173 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
10 0.73284382 44 cvpr-2013-Area Preserving Brain Mapping
11 0.73200005 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
12 0.73176235 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
13 0.73043585 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
14 0.7292735 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
15 0.72865772 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
16 0.72837991 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
17 0.72772241 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
18 0.72698855 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
19 0.72673488 438 cvpr-2013-Towards Pose Robust Face Recognition
20 0.72636771 389 cvpr-2013-Semi-supervised Learning with Constraints for Person Identification in Multimedia Data