cvpr cvpr2013 cvpr2013-172 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ruonan Li, Parker Porfilio, Todd Zickler
Abstract: We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. Our approach represents group behaviors by dichotomous collections of descriptors for (a) individual actions, and (b) pairwise interactions; and it includes efficient algorithms for optimally distinguishing participants from by-standers in every temporal unit and for temporally localizing the extent of the group interaction. Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. We evaluate the approach using three different video collections, two that involve humans and one that involves mice.
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. [sent-6, score-0.955]
2 Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. [sent-7, score-1.505]
3 The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. [sent-8, score-1.329]
4 Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. [sent-10, score-0.333]
5 Introduction. Social interactions are common, but they rarely take place in isolation. [sent-13, score-0.265]
6 Conversations and other group interactions occur on busy streets, in crowded cafes, in conference halls, and in other types of social gatherings. [sent-14, score-0.613]
7 In these situations, before a computer vision system can recognize distinctive group interactions, it must first detect them by distinguishing between participants and by-standers and by localizing them in time. [sent-15, score-0.506]
8 This paper addresses this spatiotemporal detection problem for cases in which the agents in a large gathering can be reasonably detected and tracked. [sent-16, score-0.38]
9 We consider group interactions broadly as distinctive space-time structural co-occurrence of individual actions. [sent-17, score-0.457]
10 Given an exemplar video of an N-person social interaction, we seek to find similar interactions in a long input video with M > N approximately-tracked people. [sent-21, score-0.9]
11 For each temporal frame in the exemplar, the N best-matching participants are identified separately in each temporal unit of the input, and the matches are assigned scores. [sent-22, score-0.823]
12 Their interaction is then localized in time using an efficient branch-and-bound search. [sent-24, score-0.263]
13 In a collection of hockey games, we might want all instances of a “three-on-one”, and in nature we might be interested in localizing instances of distinctive group interactions among populations of animals, insects, or bacteria. [sent-27, score-0.52]
14 Given an exemplar video of a distinctive group interaction involving a small handful of N agents, we detect and localize instances of similar interactions within a long video of a larger gathering of M ≥ N agents. [sent-32, score-1.119]
15 Matching an exemplar interaction amounts to searching through space and time for ensembles that are similar in some sense. [sent-34, score-0.661]
16 To use our matching approach for recognition, we simply match an input video against a labeled gallery of exemplars and then extract a class label or ranked list of labels from the resulting scored matches. [sent-36, score-0.334]
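The recognition step described above can be illustrated with a minimal sketch. It assumes each scored match is a `(label, dissimilarity)` pair where lower is better; the function name `rank_labels` and that score convention are illustrative assumptions, not taken from the paper.

```python
def rank_labels(scored_matches):
    """Rank class labels by their best (lowest) match dissimilarity.

    scored_matches: iterable of (label, dissimilarity) pairs, one per
    gallery exemplar matched against the input (hypothetical format).
    """
    best = {}
    for label, score in scored_matches:
        # Keep the best-scoring exemplar for each class label.
        if label not in best or score < best[label]:
            best[label] = score
    # Labels sorted from most to least similar.
    return sorted(best, key=best.get)
```

The first element of the returned list serves as the class label; the full list is the ranked list of labels.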
17 Second, we expect that the same type of interaction can occur over different temporal extents and at variable rates within its temporal extent, so we want an approach insensitive to these “within-class” variations. [sent-40, score-0.721]
18 First, the social descriptor-ensemble at each exemplar time unit is compared separately to each time unit of the input video, and the best-matching N participants in each unit are identified along with their matching score (yellow and gray lines in Fig. [sent-43, score-1.059]
19 Third and finally, the temporal extent of the interaction is determined through an efficient branch-and-bound search. [sent-46, score-0.46]
20 For analyzing interacting groups, previous approaches have considered cases in which: 1) there are no bystanders [11, 10, 3, 19, 21]; 2) the interaction of interest is a priori localized in time [17, 4]; or 3) both of these simultaneously [12, 20, 15]. [sent-50, score-0.366]
21 A notable exception is [1], which, like us, addresses the problem of localizing interactions in long videos that contain bystanders, albeit with a less flexible representation (more on this in Sec. [sent-51, score-0.391]
22 Matching and Localizing Interactions. We consider a video as a sequence of T temporal units that occur at a frequency equal to or less than the frame-rate of the raw video data. [sent-57, score-0.445]
23 The duration of these T units is typically between one and a few raw video frames, and it is determined by the application-appropriate choice for temporal resolution of atomic action descriptors (e. [sent-58, score-0.484]
24 Due to agent entry and exit, occlusions, and other tracking errors, not all M tracks will persist over all T frames, and some of the M tracks may correspond to short-lived false detections. [sent-62, score-0.511]
25 There are TM per-time-unit d_I-dimensional descriptors {f_{m,t}}, where f_{m,t} encodes the mth agent's activity at time unit t ∈ [1, T]; and TM(M − 1) pairwise d_P-dimensional descriptors {g_{m,m',t}}, [sent-65, score-0.355]
26 where g_{m,m',t} encodes the motion and/or appearance of agent m relative to agent m' [sent-67, score-0.384]
27 Each exemplar video is processed in the very same way as the input video, so that an exemplar of N ≤ M participants over S time units is represented at each time s ∈ [1, S] by the ensemble Ds [sent-83, score-1.006]
28 Given a collection of exemplars and an input video, our matching strategy is as follows. [sent-88, score-0.26]
29 For each exemplar D, we search through the input Q for [sent-89, score-0.348]
30 the optimal match, identifying the set of N participants and localizing their interaction in time. [sent-90, score-0.371]
31 Matching between Temporal Units. The first step in our framework is to separately compute the correspondence between the N exemplar agents at each time s ∈ [1, S] and the optimal subset of N ≤ M input agents at each time t ∈ [1, T]. [sent-98, score-0.908]
32 We represent this N-to-M correspondence by a binary matrix W, where the nm-th entry wnm is one only when the nth exemplar agent is matched to the mth input agent. [sent-100, score-0.694]
33 The quality of a correspondence is measured by the similarity between the individual and pairwise descriptors of the N selected input agents and those of the N exemplar agents. [sent-107, score-0.895]
34 (1) to be the dissimilarity between two instantaneous ensembles under a particular matching matrix W. [sent-118, score-0.341]
35 wnm ∈ {0, 1}, W1 = 1, W^T 1 ≤ 1, (2) where c is an MN × 1 vector of distances between individual descriptors, d_I(f_{m,t}, f^D_{n,s}), and H is an MN × MN matrix of distances d_P between pairwise descriptors. [sent-134, score-0.413]
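To make the constraints on W concrete, here is a minimal brute-force sketch for small N and M. It assumes the objective is the sum of individual-descriptor distances plus the sum of pairwise-descriptor distances over the selected agents; the exact objective is not recoverable from this excerpt, and the callables `ind_cost` and `pair_cost` are hypothetical stand-ins for the entries of c and H. The source's word list mentions CVX, so the authors presumably use an optimization solver rather than enumeration.

```python
from itertools import permutations

def best_match(ind_cost, pair_cost, n_exemplar, n_input):
    """Exhaustively search one-to-one maps from exemplar to input agents.

    ind_cost(n, m): distance between exemplar agent n and input agent m.
    pair_cost(n, n2, m, m2): distance between the pairwise descriptor of
        exemplar agents (n, n2) and that of input agents (m, m2).
    Returns (cost, assignment), where assignment[n] is the input agent
    matched to exemplar agent n.
    """
    best_cost, best_assign = float("inf"), None
    # Each permutation selects N distinct input agents, so every exemplar
    # agent is matched exactly once (W1 = 1) and every input agent at
    # most once (W^T 1 <= 1).
    for assign in permutations(range(n_input), n_exemplar):
        cost = sum(ind_cost(n, assign[n]) for n in range(n_exemplar))
        cost += sum(pair_cost(n, n2, assign[n], assign[n2])
                    for n in range(n_exemplar)
                    for n2 in range(n_exemplar) if n2 != n)
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign
```

Enumeration grows as M!/(M−N)!, so it is only useful for checking a solver on toy inputs.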
36 We achieve this through voting, with the intuition being that the optimal matching W∗ will occur relatively frequently among the instantaneous matches {Wt,s}. [sent-159, score-0.272]
37 The first is the dissimilarity between the descriptor-ensemble of the exemplar and that of the matched input agents, D(Qt, Ds). [sent-165, score-0.713]
38 The second is a measure of temporal consistency, with the intuition being that if the N-subset of agents matched at temporal pair (t, s) is correct, the same N-subset of agents should be matched for other pairs (t', s') [sent-166, score-1.04]
39 in small temporal neighborhoods of the exemplar and input video. [sent-168, score-0.546]
40 ∈ N(t,s), (4) where N(t,s) is a temporal neighborhood of (t,s) in which we enforce the consistency, and it is depicted in Fig. [sent-184, score-0.233]
41 As a result, the voting procedure is shown in Algorithm 1, where in the last two steps we find among those matching matrices which receive a substantial number of supports from instantaneous matchings the best matching W∗ with the lowest average dissimilarity to the exemplar. [sent-189, score-0.369]
42 1, where a thick matching line indicates a strong similarity (low weight v), and the agents receiving the lowest average weight are selected as participants. [sent-191, score-0.385]
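The voting idea can be sketched as follows. Algorithm 1 itself is not reproduced in this excerpt, so this is a simplified illustration under stated assumptions: each instantaneous match is an `(assignment, weight)` pair, "substantial support" is taken to mean a fixed fraction of the largest vote count, and `min_support_ratio` is a hypothetical parameter.

```python
from collections import defaultdict

def vote_for_matching(instant_matches, min_support_ratio=0.5):
    """Pick a matching that is both frequent and low in average weight.

    instant_matches: list of (assignment, weight) pairs, one per (t, s)
        pair, where assignment is a tuple of matched input-agent indices
        and a lower weight means a better match.
    """
    votes = defaultdict(list)
    for assignment, weight in instant_matches:
        votes[assignment].append(weight)
    max_votes = max(len(w) for w in votes.values())
    # Keep matchings supported by a substantial share of the votes.
    candidates = {a: w for a, w in votes.items()
                  if len(w) >= min_support_ratio * max_votes}
    # Among those, return the one with the lowest average weight.
    return min(candidates, key=lambda a: sum(candidates[a]) / len(candidates[a]))
```

The support filter keeps a rarely seen matching with one unusually low weight from being selected over a consistently supported one.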
43 For this purpose, after the participants are determined through the best matching W∗, we recompute for all (t, s) pairs the dissimilarities under this best matching, Dˆ(Qt, Ds, W∗), between the interaction of the individuals selected by W∗ at time t and the exemplar at time s. [sent-195, score-0.975]
44 We then compute D∗(t) = min_s Dˆ(Qt, Ds, W∗), the minimal dissimilarity of the input interaction by the selected participants at time t to the entire exemplar, and s∗(t) = arg min_s Dˆ(Qt, Ds, W∗), the time in the exemplar at which the input at time t exhibits this maximum similarity. [sent-196, score-0.781]
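The two quantities defined above reduce to a row-wise minimum and argmin over a T × S dissimilarity table. A minimal sketch, assuming the table is a list of rows indexed from zero (the function name is illustrative):

```python
def best_exemplar_alignment(dissim):
    """Compute D*(t) and s*(t) from a T x S dissimilarity table.

    dissim[t][s]: dissimilarity between the selected participants at
        input time t and the exemplar at time s, under the matching W*.
    Returns (d_star, s_star): per input time t, the minimal
    dissimilarity and the zero-based exemplar time that attains it.
    """
    d_star, s_star = [], []
    for row in dissim:
        s_best = min(range(len(row)), key=row.__getitem__)
        d_star.append(row[s_best])
        s_star.append(s_best)
    return d_star, s_star
```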
45 As interactions occur at variable rates within their temporal extent, we use a temporal pyramid to efficiently measure alignment in a way that also respects these variations. [sent-199, score-0.795]
46 1 Let (ts , te) be the true, unknown starting and ending times of the detected interaction in the input video, and suppose that the input descriptor-ensemble over this interval exactly matches that of the exemplar. [sent-209, score-0.49]
47 To determine good estimates for the interval (ts , te) we define a cost that is a product of the temporal alignment and visual similarity summed over the candidate interval: ? [sent-210, score-0.305]
48 This means that the summand in (6), considered as a function of t, assumes a negative value in the desired interval ts ≤ t ≤ te and a positive value otherwise, as denoted by q(t) and depicted in the bottom of Fig. [sent-216, score-0.414]
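Because q(t) is negative inside the desired interval and positive outside it, a simplified version of the localization problem is to find the contiguous interval with the smallest sum of q(t). The sketch below uses a single linear scan (a Kadane-style minimum-sum subarray), not the branch-and-bound search the paper describes. It assumes the cost is a plain sum of q(t) and ignores the temporal pyramid.

```python
def min_sum_interval(q):
    """Find the contiguous interval of q with the smallest total.

    q: non-empty list of per-time-unit values (negative inside the
       interaction, positive outside).
    Returns (start, end, total) with inclusive zero-based indices.
    """
    best_sum, best_start, best_end = q[0], 0, 0
    run_sum, run_start = q[0], 0
    for t in range(1, len(q)):
        if run_sum > 0:
            # A positive running total only hurts, so restart at t.
            run_sum, run_start = q[t], t
        else:
            run_sum += q[t]
        if run_sum < best_sum:
            best_sum, best_start, best_end = run_sum, run_start, t
    return best_start, best_end, best_sum
```

For example, `min_sum_interval([5, 3, -4, -6, 1, -2, 7])` returns `(2, 5, -11)`: the small positive value at index 4 is absorbed because extending through it still lowers the total.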
49 4, the process can also handle moderately broken tracks by setting the descriptor values of missing temporal units to be sufficiently large (or small) so as not to be matched with any exemplar agents. [sent-223, score-0.769]
50 Each row is an annotated two-cell exemplar with markers representing instantaneous descriptor-ensembles at each time unit. [sent-233, score-0.459]
51 For discrimination between interaction categories, distances between ensembles of the same class (red circles and red squares) should be small whenever they occur in the same cell number; and distances for different classes (red vs. [sent-234, score-0.625]
52 For effective and efficient temporal localization, distances between ensembles at labeled times and unlabeled “background” times (black circles) should be large, and all distances should be offset by −1. [sent-236, score-0.451]
53 localization by ensuring that distances between labeled ensembles and unlabeled “background” ensembles are large. [sent-237, score-0.362]
54 The combination of 1) and 2) leads to more accurate spatial localizations of participants (i. [sent-238, score-0.323]
55 2), and induces the “quality function” conditions required for efficient temporal localization by branch-and-bound (Sec. [sent-242, score-0.246]
56 For each application scenario, we use a training set of exemplar videos (possibly having varying numbers of agents N) that are annotated with start/end times, category labels, and N-agent correspondences between exemplars of the same category. [sent-246, score-0.789]
57 This figure depicts three different exemplar videos in which a subset of time units have been labeled as being distinctive interactions of two different classes. [sent-250, score-0.726]
58 In this example, each labeled exemplar is shown as being divided into two cells; these correspond to the lowest level of the temporal pyramid described in Sec. [sent-251, score-0.587]
59 The first three constraints in the list enhance discrimination between categories, while the last three enhance the accuracy of temporal localization. [sent-254, score-0.303]
60 3) and occur roughly in the same temporal location within the interaction instances (i. [sent-259, score-0.475]
61 , in the same cell of the lowest level of the temporal pyramids), together with their “ground-truth” matchings. [sent-261, score-0.277]
62 For the classroom dataset, pairwise descriptors for groups comprised of (a) three or more participants, and (b) two participants. [sent-263, score-0.409]
63 The datasets are very different from one another, with distinct types of individual and pairwise descriptors that are appropriate for that environment. [sent-276, score-0.267]
64 In all experiments we use four-level temporal pyramids for the interactions and we set the time unit to be half the duration of the cells in the lowest level. [sent-277, score-0.603]
65 The classroom is “interactive” because at various times throughout the lecture students are invited to engage in ad-hoc group discussions about problems provided by the instructor (see, e. [sent-282, score-0.297]
66 (a)(b)(c) ROC curves for identifying the participants of two-person, three-person, and four-person interactions using the proposed approach and baselines. [sent-290, score-0.549]
67 (d) Temporal localization accuracies using the proposed approach with and without metric learning, using individual and/or pairwise descriptors. [sent-291, score-0.255]
68 seating rows, and detecting them is a challenge because the number of by-standers is much larger than the number of participants (M is between 10 and 20 while N is between 2 and 4), video quality is limited (low light, 15fps), and the visual cues for interaction are quite subtle. [sent-292, score-0.601]
69 The ability to automatically detect such interactions is important for education researchers, however, since it can help in understanding how students self-organize into groups, and which geometric configurations of groups lead to improved educational outcomes [5]. [sent-293, score-0.4]
70 In consultation with education experts, we manually identified the participants and start/end times of all two-person, three-person, and four-person interactions, obtaining 254 two-person, 112 three-person, and 16 four-person interactions in total. [sent-295, score-0.593]
71 We defined interaction categories based on the geometric configurations of the participants: three categories for 2-person interactions (same row; different rows with left agent in front; different rows with right agent in front) and four categories for 3-person interactions. [sent-296, score-0.993]
72 The annotated interactions range from a few seconds to tens of seconds in length. [sent-299, score-0.3]
73 Also, for each split of the data we manually eliminate the false detections and tracks two-person and three-person interactions (Individual and/or pairwise descriptors, with or without metric learning (ML)). [sent-301, score-0.566]
74 We begin by looking at accuracy of detection, where we ignore the inferred interaction categories and simply measure the system's ability to detect when an interaction has occurred. [sent-320, score-0.488]
75 Using all parts of the system yields the best results, and we note that performance improves as the number of participants N increases. [sent-324, score-0.284]
76 The latter is due to the fact that interaction patterns are more salient when more pairwise information is available. [sent-325, score-0.331]
77 Examples of social interaction detection and matching on the classroom interaction database. [sent-327, score-0.783]
78 Each row is an example of detecting a salient interaction from an input. [sent-328, score-0.255]
79 (a) the input; (b) detected social interaction; (c-1) to (c-3) top three associated database exemplars that support the detection. [sent-329, score-0.334]
80 (Due to the small number of 4-person interactions in our dataset, we did not define categories for them. [sent-332, score-0.305]
81 6 shows the average true positive rates versus false positives when further classifying detected interactions into the three or four categories. [sent-334, score-0.406]
82 Finally, we investigate the temporal localization performance, for which we compute the ratio of the intersection to the union of the estimated interval and the annotated interval, and we show the averages in Fig. [sent-341, score-0.388]
83 In the fourth row, a three-person interaction is correctly identified even though the third associated exemplar is from a different category (two looking right). [sent-346, score-0.612]
84 In the other rows, two-person and three-person interactions are correctly detected and matched with exemplars. [sent-347, score-0.348]
85 We follow the protocol defined in previous work [21, 1]: 20% of available interaction annotations are used as exemplars for training, and the remaining (non-annotated) sequences are used for testing. [sent-351, score-0.354]
86 For our system, we consider one database exemplar at a time, compute its maximal response over the input video, and claim a true positive only when both the class-label and the identified participants are simultaneously correct. [sent-365, score-0.676]
87 Otherwise a false positive is indicated for that exemplar class. [sent-366, score-0.364]
88 Next we study detection in terms of both temporal localization and participant identification. [sent-368, score-0.323]
89 For temporal localization, we follow the protocol of [1] by indicating a true-positive when there is correct classification and more than a 50% ratio between the intersection and union of the estimated temporal interval and the ground-truth. [sent-369, score-0.503]
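The true-positive rule above can be written down directly. This sketch assumes intervals are `(start, end)` pairs on a shared time axis with `end >= start`; the function names and the label strings in the usage note are illustrative only.

```python
def temporal_iou(est, gt):
    """Intersection-over-union of two temporal intervals (start, end)."""
    inter = max(0.0, min(est[1], gt[1]) - max(est[0], gt[0]))
    union = (est[1] - est[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(est, gt, pred_label, gt_label, thresh=0.5):
    """True only if the class is correct and the temporal IoU exceeds thresh."""
    return pred_label == gt_label and temporal_iou(est, gt) > thresh
```

For instance, an estimate of (0, 10) against a ground truth of (2, 10) has an IoU of 0.8 and counts as a true positive when the labels agree, whereas (0, 10) against (5, 15) has an IoU of 1/3 and does not.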
90 We achieve a slightly smaller area under ROC curve than the two baselines, as shown in Table 2, but point out that differences are hard to interpret because the temporal boundaries are somewhat ambiguous for the consecutively-executed interactions in the dataset. [sent-370, score-0.463]
91 We attribute this to the fact that we explicitly discriminate interactions and participants in the form of tracks of bounding boxes, while [1] does not do so but simply explains an input using a non-discriminative generative model. [sent-372, score-0.714]
92 It is interesting to see the pairwise descriptor plays a more crucial role for this dataset: A significant performance drop arises when we only consider individual action descriptors. [sent-380, score-0.263]
93 Classification accuracies and false positive (FP) rates comparison on UT-Interaction dataset for evaluating the effectiveness of different components of the proposed approach: Individual and/or pairwise descriptors, with or without metric learning (ML). [sent-382, score-0.25]
94 As an application to agents other than humans, we also evaluate our approach on the mouse dataset of [2]. [sent-390, score-0.338]
95 We introduced a voting-based approach for detecting and localizing small-group interactions within larger social gatherings. [sent-393, score-0.546]
96 Since it operates on agent tracks, it is also quite flexible and can be applied in many different multi-agent scenarios, provided that the environment-specific individual descriptor and the environment-specific pairwise descriptor are properly defined. [sent-395, score-0.452]
97 We represent group interactions as collections of individual and pairwise descriptors (1st and 2nd order), and our results suggest that this is effective for groups of up to four agents. [sent-397, score-0.73]
98 Higher-order interaction descriptors may play a more important role for larger interacting groups, and this may be a useful future research direction as new datasets become available. [sent-398, score-0.38]
99 We use a simple combination of descriptor collection and temporal pyramid, but one could imagine using a (learned) tree of space-time parts, analogous to how spatial parts-based models are used for object detection. [sent-400, score-0.311]
100 A chains model for localizing participants of group activities in videos. [sent-406, score-0.466]
wordName wordTfidf (topN-words)
[('exemplar', 0.312), ('participants', 0.284), ('agents', 0.28), ('interactions', 0.265), ('interaction', 0.224), ('temporal', 0.198), ('agent', 0.192), ('qt', 0.171), ('social', 0.163), ('ts', 0.161), ('exemplars', 0.13), ('ensembles', 0.125), ('instantaneous', 0.112), ('te', 0.111), ('classroom', 0.111), ('interval', 0.107), ('pairwise', 0.107), ('descriptors', 0.103), ('tracks', 0.099), ('group', 0.095), ('students', 0.091), ('localizing', 0.087), ('ds', 0.085), ('participant', 0.077), ('wnm', 0.075), ('harvard', 0.074), ('units', 0.07), ('cvx', 0.066), ('distances', 0.064), ('lmnn', 0.064), ('video', 0.062), ('matching', 0.061), ('gathering', 0.059), ('collections', 0.059), ('mouse', 0.058), ('individual', 0.057), ('occur', 0.053), ('interacting', 0.053), ('unit', 0.053), ('false', 0.052), ('action', 0.051), ('bystanders', 0.05), ('porfilio', 0.05), ('localization', 0.048), ('rates', 0.048), ('voting', 0.048), ('descriptor', 0.048), ('dp', 0.046), ('matches', 0.046), ('roc', 0.045), ('gallery', 0.045), ('conversations', 0.044), ('lowest', 0.044), ('groups', 0.044), ('identified', 0.044), ('comprised', 0.044), ('mn', 0.043), ('cells', 0.043), ('opencv', 0.043), ('metric', 0.043), ('dissimilarity', 0.043), ('matched', 0.042), ('tw', 0.042), ('minw', 0.041), ('detected', 0.041), ('distinctive', 0.04), ('categories', 0.04), ('videos', 0.039), ('localized', 0.039), ('permuted', 0.039), ('localizations', 0.039), ('triples', 0.039), ('extent', 0.038), ('turned', 0.038), ('entry', 0.037), ('busy', 0.037), ('enhance', 0.037), ('individuals', 0.036), ('input', 0.036), ('mins', 0.035), ('cell', 0.035), ('annotated', 0.035), ('depicted', 0.035), ('head', 0.034), ('vote', 0.033), ('boxes', 0.033), ('pyramid', 0.033), ('dissimilarities', 0.033), ('collection', 0.033), ('array', 0.033), ('tracking', 0.032), ('category', 0.032), ('analogous', 0.032), ('detecting', 0.031), ('discrimination', 0.031), ('wt', 0.03), ('bounding', 0.03), ('dh', 0.03), ('fragmented', 0.03), 
('circles', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 172 cvpr-2013-Finding Group Interactions in Social Clutter
2 0.218779 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment
Author: Suha Kwak, Bohyung Han, Joon Hee Han
Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.
3 0.20338495 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu
Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods[24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplarbased face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other facerelated tasks, such as attribute recognition, as well as general object detection.
4 0.19812562 402 cvpr-2013-Social Role Discovery in Human Events
Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei
Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.
5 0.18120486 152 cvpr-2013-Exemplar-Based Face Parsing
Author: Brandon M. Smith, Li Zhang, Jonathan Brandt, Zhe Lin, Jianchao Yang
Abstract: In this work, we propose an exemplar-based face image segmentation algorithm. We take inspiration from previous works on image parsing for general scenes. Our approach assumes a database of exemplar face images, each of which is associated with a hand-labeled segmentation map. Given a test image, our algorithm first selects a subset of exemplar images from the database, Our algorithm then computes a nonrigid warp for each exemplar image to align it with the test image. Finally, we propagate labels from the exemplar images to the test image in a pixel-wise manner, using trained weights to modulate and combine label maps from different exemplars. We evaluate our method on two challenging datasets and compare with two face parsing algorithms and a general scene parsing algorithm. We also compare our segmentation results with contour-based face alignment results; that is, we first run the alignment algorithms to extract contour points and then derive segments from the contours. Our algorithm compares favorably with all previous works on all datasets evaluated.
6 0.15836537 440 cvpr-2013-Tracking People and Their Objects
7 0.13936734 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences
8 0.13860925 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
9 0.13460454 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
10 0.13063289 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
11 0.12723722 103 cvpr-2013-Decoding Children's Social Behavior
12 0.12582998 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
13 0.1243679 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition
14 0.12057802 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
15 0.11838978 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
16 0.11545405 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
17 0.11483506 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
18 0.11320576 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
20 0.10964483 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
topicId topicWeight
[(0, 0.25), (1, -0.077), (2, -0.013), (3, -0.123), (4, -0.035), (5, 0.001), (6, -0.048), (7, -0.041), (8, 0.036), (9, -0.012), (10, 0.077), (11, -0.041), (12, 0.095), (13, 0.021), (14, 0.017), (15, -0.043), (16, 0.064), (17, 0.059), (18, 0.048), (19, -0.107), (20, -0.045), (21, 0.023), (22, 0.034), (23, 0.036), (24, 0.022), (25, -0.018), (26, 0.051), (27, -0.079), (28, -0.011), (29, -0.01), (30, 0.013), (31, -0.034), (32, 0.056), (33, -0.015), (34, 0.023), (35, -0.044), (36, 0.13), (37, 0.002), (38, 0.071), (39, 0.034), (40, 0.052), (41, -0.023), (42, -0.071), (43, 0.036), (44, 0.233), (45, -0.088), (46, -0.1), (47, -0.107), (48, -0.061), (49, 0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.94685119 172 cvpr-2013-Finding Group Interactions in Social Clutter
2 0.79557699 402 cvpr-2013-Social Role Discovery in Human Events
Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei
Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.
3 0.71187377 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment
Author: Suha Kwak, Bohyung Han, Joon Hee Han
Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.
4 0.65695405 103 cvpr-2013-Decoding Children's Social Behavior
Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye
Abstract: We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
6 0.54095036 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
7 0.53776699 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
8 0.49396059 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences
9 0.49021858 440 cvpr-2013-Tracking People and Their Objects
10 0.4824751 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
11 0.47686014 152 cvpr-2013-Exemplar-Based Face Parsing
12 0.47422737 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
13 0.45815977 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
14 0.45232981 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition
15 0.44983876 463 cvpr-2013-What's in a Name? First Names as Facial Attributes
16 0.44404826 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery
17 0.44258177 234 cvpr-2013-Joint Spectral Correspondence for Disparate Image Matching
18 0.43832082 184 cvpr-2013-Gauging Association Patterns of Chromosome Territories via Chromatic Median
19 0.43611073 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles
20 0.43063968 280 cvpr-2013-Maximum Cohesive Grid of Superpixels for Fast Object Localization
topicId topicWeight
[(10, 0.098), (16, 0.02), (26, 0.058), (33, 0.215), (67, 0.076), (69, 0.396), (87, 0.071)]
simIndex simValue paperId paperTitle
1 0.88594776 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability
Author: Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen
Abstract: 3D volumetric reasoning is important for truly understanding a scene. Humans are able to both segment each object in an image, and perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks, and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene based on jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting (i.e., one that does not topple) arrangement of objects. We experiment on several datasets including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.
same-paper 2 0.86947912 172 cvpr-2013-Finding Group Interactions in Social Clutter
Author: Ruonan Li, Parker Porfilio, Todd Zickler
Abstract: We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. Our approach represents group behaviors by dichotomous collections of descriptors for (a) individual actions, and (b) pairwise interactions; and it includes efficient algorithms for optimally distinguishing participants from by-standers in every temporal unit and for temporally localizing the extent of the group interaction. Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. We evaluate the approach using three different video collections, two that involve humans and one that involves mice.
3 0.84924531 114 cvpr-2013-Depth Acquisition from Density Modulated Binary Patterns
Author: Zhe Yang, Zhiwei Xiong, Yueyi Zhang, Jiao Wang, Feng Wu
Abstract: This paper proposes novel density modulated binary patterns for depth acquisition. Similar to Kinect, the illumination patterns do not need a projector for generation and can be emitted by infrared lasers and diffraction gratings. Our key idea is to use the density of light spots in the patterns to carry phase information. Two technical problems are addressed here. First, we propose an algorithm to design the patterns to carry more phase information without compromising the depth reconstruction from a single captured image as with Kinect. Second, since the carried phase is not strictly sinusoidal, the depth reconstructed from the phase contains a systematic error. We further propose a pixel-based phase matching algorithm to reduce the error. Experimental results show that the depth quality can be greatly improved using the phase carried by the density of light spots. Furthermore, our scheme can achieve 20 fps depth reconstruction with GPU assistance.
4 0.83880818 135 cvpr-2013-Discriminative Subspace Clustering
Author: Vasileios Zografos, Liam Ellis, Rudolf Mester
Abstract: We present a novel method for clustering data drawn from a union of arbitrary dimensional subspaces, called Discriminative Subspace Clustering (DiSC). DiSC solves the subspace clustering problem by using a quadratic classifier trained from unlabeled data (clustering by classification). We generate labels by exploiting the locality of points from the same subspace and a basic affinity criterion. A number of classifiers are then diversely trained from different partitions of the data, and their results are combined together in an ensemble, in order to obtain the final clustering result. We have tested our method with 4 challenging datasets and compared against 8 state-of-the-art methods from literature. Our results show that DiSC is a very strong performer in both accuracy and robustness, and also of low computational complexity.
5 0.8340891 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
Author: Fuxin Li, Joao Carreira, Guy Lebanon, Cristian Sminchisescu
Abstract: In this paper we present an inference procedure for the semantic segmentation of images. Different from many CRF approaches that rely on dependencies modeled with unary and pairwise pixel or superpixel potentials, our method is entirely based on estimates of the overlap between each of a set of mid-level object segmentation proposals and the objects present in the image. We define continuous latent variables on superpixels obtained by multiple intersections of segments, then output the optimal segments from the inferred superpixel statistics. The algorithm can recombine and refine initial mid-level proposals, as well as handle multiple interacting objects, even from the same class, all in a consistent joint inference framework by maximizing the composite likelihood of the underlying statistical model using an EM algorithm. In the PASCAL VOC segmentation challenge, the proposed approach obtains high accuracy and successfully handles images of complex object interactions.
6 0.82320732 231 cvpr-2013-Joint Detection, Tracking and Mapping by Semantic Bundle Adjustment
7 0.80893368 392 cvpr-2013-Separable Dictionary Learning
9 0.7420457 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment
10 0.70418775 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
11 0.68950874 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
12 0.68233079 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
13 0.6755023 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
14 0.66903156 282 cvpr-2013-Measuring Crowd Collectiveness
15 0.66750413 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
16 0.66633123 402 cvpr-2013-Social Role Discovery in Human Events
17 0.6656009 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models
18 0.66136974 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
19 0.65838796 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
20 0.6540519 364 cvpr-2013-Robust Object Co-detection
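The listing does not say how each simValue is computed from the topic weights. As a rough illustration only, the sketch below assumes the score is a cosine similarity between sparse (topicId, topicWeight) vectors. The function name `cosine_similarity` and the vector `other_paper` are hypothetical; `this_paper` is copied from the seven-topic vector shown above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (topicId, weight) pairs.

    Assumption: the page does not state its similarity measure; cosine
    similarity is used here only as a common choice for topic vectors.
    """
    weights_a, weights_b = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * weights_b.get(topic, 0.0) for topic, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Topic weights for this paper, copied from the listing above.
this_paper = [(10, 0.098), (16, 0.02), (26, 0.058), (33, 0.215),
              (67, 0.076), (69, 0.396), (87, 0.071)]

# Hypothetical vector for another paper, invented for illustration.
other_paper = [(33, 0.3), (69, 0.2), (90, 0.1)]

score = cosine_similarity(this_paper, other_paper)
```

Ranking a collection would then amount to sorting the other papers by this score in descending order, which is the shape of the simIndex lists above.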