cvpr cvpr2013 cvpr2013-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye
Abstract: We introduce a new problem domain for activity recognition: the analysis of children’s social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semi-structured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 We specifically target interactions between children aged 1–2 years and an adult. [sent-7, score-0.206]
2 Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. [sent-8, score-0.233]
3 In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. [sent-10, score-1.078]
4 Introduction: There has been a long history of work in activity recognition, but for the most part it has focused on single individuals engaged in task-oriented activities or short interactions between multiple actors. [sent-14, score-0.264]
5 The goal of this paper is to introduce a novel problem domain for activity recognition, which consists of the decoding of dyadic social interactions between young children and adults. [sent-15, score-0.709]
6 Nonetheless, these interactions have a detailed structure defined by the patterning of behavior of both participants. [sent-20, score-0.191]
7 Our goal is to go beyond the simple classification of actions and activities, and address the challenges of parsing an extended interaction into its constituent elements, and producing ratings of the level of engagement that characterizes the quality of the interaction. [sent-21, score-0.383]
8 We refer to these problems collectively as decoding the dyadic social interaction. [sent-22, score-0.404]
9 The problem of decoding dyadic social interactions arises naturally in the diagnosis and treatment of developmental and behavioral disorders. [sent-24, score-0.627]
10 Research utilizing video-based micro-coding of the behavior of young children engaged in social interactions has revealed a number of clear behavioral “red flags” for autism in the first two years of life [20], specifically in the areas of social, communication, and play skills. [sent-26, score-0.79]
11 There is much potential for activity recognition to scale early screening and treatment efforts by bringing reliable, rich measurement of child behavior to real-world settings. [sent-28, score-0.535]
12 We present an approach to decoding social interactions in the context of a semi-structured play interaction between a child and an adult, called the Rapid-ABC [14]. [sent-29, score-0.878]
13 This protocol is a brief (3 to 5 minute) interactive assessment designed to elicit social attention, back-and-forth interaction, and nonverbal communication. [sent-30, score-0.321]
14 The contribution of this paper is the introduction of the Multimodal Dyadic Behavior (MMDB) dataset which contains this interaction data, along with an initial series of single mode and multimodal analyses to segment, classify and measure relevant behaviors across numerous play interactions. [sent-32, score-0.429]
15 However, most of these works are focused either on the actions of a single adult subject, or on relatively brief interactions between a pair of subjects, such as the “hug” action in [10] or the fighting activities in [16]. [sent-36, score-0.264]
16 In the case of single person activities such as meal preparation [3], or structured group activities [13], activities can be complex and can take place over a significant temporal duration. [sent-37, score-0.191]
17 In contrast to these prior works, the domain of social interactions between adults and children poses significant new challenges, since they are inherently dyadic, loosely structured, and multi-modal. [sent-39, score-0.392]
18 Recently, several authors have addressed the problem of recognizing social interactions between groups of people [15, 4, 1, 9]. [sent-40, score-0.275]
19 In particular, our earlier work on categorizing social games in YouTube videos [15] includes many examples of adult-child dyadic interactions. [sent-41, score-0.343]
20 In contrast, our goal is to produce fine-grained descriptions of social interactions, including the assessment of gaze and facial affect and the strength of engagement. [sent-43, score-0.373]
21 Our approach to analyzing dyadic social interactions is based on the explicit identification of “mid-level” behavioral cues. [sent-44, score-0.481]
22 We extract these cues by employing a variety of video and audio analysis modules, such as the tracking of head pose and arm positions in RGBD video and the detection of keywords in adult speech. [sent-45, score-0.346]
23 In this context our contribution is twofold: We show how existing analysis methods can be combined to construct a layered description of an extended, structured social interaction, and we assess the effectiveness of these standard methods in analyzing children’s behavior. [sent-47, score-0.215]
24 Challenges: From an activity recognition perspective, the analysis of social interactions introduces a number of challenges which do not commonly arise in existing datasets. [sent-52, score-0.342]
25 First, the dyadic nature of the interaction makes it necessary to explicitly model the interplay between agents. [sent-53, score-0.25]
26 Second, social behavior is inherently multimodal, and requires the integration of video, audio, and other modalities in order to achieve a complete portrait of behavior. [sent-55, score-0.326]
27 Third, social interactions are often defined by the strength of the engagement and the reciprocity between the participants, not by the performance of a particular task. [sent-56, score-0.496]
28 For example, detecting whether a child’s gestures, affective expressions, and vocalizations are coordinated with gaze to the adult’s face is critical in identifying whether the child’s behaviors are socially directed and intentional. [sent-59, score-0.462]
29 When a child is using vocalizations or gestures, is their intention (a) to request that their partner give them an object or perform an action; (b) to direct the partner’s attention to an interesting object; or simply (c) to maintain an ongoing social interaction? [sent-61, score-0.684]
30 Finally, advances in wearable technology have made it possible to go beyond visible behaviors and measure the activity of the autonomic nervous system, for example via respiration or heart-rate. [sent-63, score-0.276]
31 These physiological signals can be combined with audio and video streams in order to interpret the meaning and function of expressed behaviors [8]. [sent-66, score-0.337]
32 The interaction follows the Rapid-ABC play protocol, which was developed in collaboration with clinical psychologists who specialize in the diagnosis of developmental delay [14]. [sent-72, score-0.261]
33 This play protocol is a brief (3–5 minute) interactive assessment, in which a trained examiner elicits social attention, back-and-forth interaction, and nonverbal communication from the child. [sent-73, score-0.828]
34 These behaviors reflect key socio-communicative milestones in the first two years of life, and their diminished occurrence and qualitative difference in expression have been found to represent early markers of autism spectrum disorders. [sent-74, score-0.223]
35 During the play interaction, the child sits in a parent’s lap across a small table from an adult examiner. [sent-75, score-0.57]
36 The behavior of the examiner is structured both in terms of specific gestures (i.e., [sent-78, score-0.714]
37 how the materials are presented to the child) and the language the examiner uses to introduce the various activities (e. [sent-80, score-0.569]
38 Additional presses to elicit specific behaviors are built into the assessment. [sent-84, score-0.21]
39 For example, the examiner silently holds up the ball and the book when they are first presented to see whether the child will shift attention from the objects to her face (exhibiting joint attention). [sent-85, score-1.384]
40 She also introduces deliberate pauses into the interaction to gauge whether and how the child re-establishes the interaction. [sent-86, score-0.484]
41 For example, the ball stage consists of the substages “Ball Present,” “Ball Play,” and “Ball Pause.” [sent-88, score-0.387]
42 The examiner scores seventeen such behaviors as present or absent at the substage level, immediately following the completion of the assessment. [sent-90, score-0.763]
43 In addition, for each stage of the protocol, she rates the effort required to engage the child using a 3-point Likert scale, with a score of 0 indicating that the child was easily engaged and a score of 2 indicating that significant effort was required. [sent-91, score-0.891]
44 The ratings attempt to capture an overall measure of the child’s social engagement, which relates to a core aspect of the behavior of children who may be at risk for an Autism Spectrum Disorder (ASD). [sent-92, score-0.436]
45 In addition to the scoring sheet, the MMDB dataset also includes frame-level, continuous annotation of relevant child behaviors that occur during the assessment. [sent-93, score-0.511]
46 , gaze to the examiner’s or parent’s face, ball, book), vocalizations and verbalizations (words and phrases), vocal affect (laughing and crying), and communicative gestures (e. [sent-97, score-0.311]
47 In the following sections, we describe our analysis methods in more detail and present our experimental findings from an initial set of child recordings. [sent-105, score-0.366]
48 The ability to parse video and audio records into these major stages and their substages makes it possible to focus subsequent analysis on the appropriate intervals of time. [sent-109, score-0.289]
49 The examiner follows a pre-defined language protocol in which key phrases are used to guide the child into and through each stage. [sent-111, score-0.986]
50 This is an example of a more general property of many standard protocols for assessment and therapy: By leveraging the statistical regularities of adult speech (and other modalities), we can obtain valuable information about the state of the dyad. [sent-113, score-0.241]
51 The Nexidia tool takes as input an audio clip and a phrase of interest. [sent-115, score-0.22]
52 It detects instances of the phrase in the audio stream and outputs the time-stamp locations of the detected phrases and their confidence scores. [sent-116, score-0.246]
53 In a second experiment, we predicted the start times for the substages ball present and book present, which occur within the Ball and Book stages, respectively. [sent-133, score-0.495]
54 Detecting Discrete Behaviors: We have described a procedure for parsing a continuous interaction into its constituent stages and substages. [sent-140, score-0.19]
55 Within each substage, the examiner assesses whether or not the child produced a set of key behaviors (see Section 4), including smiling and making eye contact. [sent-141, score-1.14]
56 The primary challenge stems from the fact that the examiner produces a rating for an entire substage, based on whether or not the behavior occurred at least once. [sent-143, score-0.642]
57 Smile Detection: Given a segmented video clip corresponding to a substage in the interaction, our goal is to predict a binary smile label. [sent-148, score-0.295]
58 We employed commercial software from Omron, the OKAO Vision Library, to detect and track the child’s face, and obtain measures of face detection confidence, smile degree (the strength of the smile), and smile confidence for each detected face. [sent-149, score-0.308]
59 We used the joint time series of smile degree and smile confidence in each frame as the feature data for smile detection. [sent-150, score-0.405]
60 We used a training set of 39 child participants and a testing set of 17 additional participants. [sent-161, score-0.366]
61 Next, we present the results of combining our smile detection system with the parsing result from Section 5, which yields a fully automated smile detection system. [sent-164, score-0.288]
62 We note that children are difficult subjects for automated face analysis, as they are more likely to move rapidly and turn their heads away from the examiner (and therefore the camera). [sent-168, score-0.686]
63 Gaze Detection: Gaze is a fundamentally important element in understanding social interactions, and the automatic non-intrusive measurement of children’s gaze remains a challenging unsolved problem. [sent-172, score-0.318]
64 In previous work [21], we used a wearable camera on the examiner to obtain a more consistent viewing angle, but this adds additional complexity to the recording process. [sent-175, score-0.553]
65 Our multimodal approach exploits the structured nature of the Rapid-ABC interaction and does not require an active camera system or the need to wear additional hardware. [sent-179, score-0.19]
66 Given a particular substage within the interaction, our goal is to predict whether the child made eye contact with the examiner at least once. [sent-180, score-1.078]
67 Our goal was to differentiate gaze directed up at the examiner’s face from gaze directed down at hands or objects on the table. [sent-187, score-0.292]
68 Experimental Results: We performed our analysis on 20 hand-picked sessions in which the child remained at the table throughout and the tracker worked successfully. [sent-191, score-0.532]
69 It would be interesting to extend this approach to detect moments when the child is looking at targets other than faces, such as the objects used in the interaction. [sent-198, score-0.406]
70 Predicting Child Engagement: For each of the five stages of the play protocol, the examiner assessed the difficulty of engaging the child on a scale from 0 (easily engaged) to 2 (very difficult to engage). [sent-202, score-1.023]
71 In this section, we describe methods for predicting the engagement score based on video and audio features. [sent-203, score-0.377]
72 Engagement Prediction in Ball and Book Play: In this vision-based approach to predicting engagement, we designed engagement features and trained a binary classifier to estimate if the child was easy to engage or not. [sent-207, score-0.652]
73 Object Detection and Tracking: We track the objects (ball and book) and the heads of the child and of the examiner using the overhead Kinect camera view. [sent-209, score-0.907]
74 To detect the ball and the book, we use region covariance templates [19] over the RGB-D channels. [sent-212, score-0.273]
75 The examiner shows the ball to the child by holding it high and near her head. [sent-217, score-1.154]
76 We detect this event by measuring the relative position of the ball with respect to the head. [sent-218, score-0.273]
77 In order to detect moments during the ball game when the partners touch the ball, we collected a training set of example templates in which the ball is touched and partially occluded. [sent-219, score-0.677]
78 During tracking, we detect the ball region, extract its descriptor, and compare it to the top two template descriptors using the Affine Invariance Riemann Metric (AIRM) distance [19]. [sent-221, score-0.273]
79 We can further classify into “touched by child” and “touched by examiner” by examining the ball location. [sent-223, score-0.273]
80 Feature Extraction and Engagement Prediction: To estimate the engagement level, we designed and extracted features that intuitively reflect the effort of the examiner to get the attention of the child, and the degree to which the child is participating in the interaction. [sent-224, score-1.131]
81 If the child is easy to engage, we can expect that the examiner will spend less time in prompting the child, and the child will quickly respond to the examiner’s initiating behaviors while interacting with the objects. [sent-225, score-1.421]
82 For the Ball stage, based on the detected ball shown and ball touched events, we extracted the raw features in Table 4, and split them in groups of no more than three. [sent-226, score-0.637]
83 Finally, the margin from each SVM was treated as a mid-level feature and used to train a decision tree to predict whether the child was easy to engage or not. [sent-228, score-0.456]
84 During testing, the ball tracker failed in one sequence, and we omitted it from the ball results. [sent-237, score-0.595]
85 The overall accuracy in predicting engagement for the Ball and Book stages was 92. [sent-238, score-0.28]
86 The book interaction involved a deformable object and more complex patterns of occlusion, and was therefore more difficult to analyze. [sent-245, score-0.241]
87 Audio-Visual Prediction of Engagement: We have demonstrated that visual features can be used to accurately predict engagement in the Ball and Book stages. [sent-248, score-0.221]
88 The first step in our approach is to automatically segment the speech portions of the audio input. [sent-250, score-0.192]
89 The VAD was applied to both the child and the examiner’s lapel-mounted wireless microphones, thereby identifying the start and end of speech segments and extracting them. [sent-252, score-0.431]
90 Table 5 gives the statistics for the duration and number of extracted speech segments for the examiner (E) and child (C). [sent-253, score-0.979]
91 In addition to acoustic features, we added event-based features such as the duration of cross-talk between child and examiner, the number of turns taken (C-to-E and E-to-C), and the number of speech segments. [sent-258, score-0.51]
92 Using the 14 overlapping play interaction sessions in the test set, the joint classifier resulted in 10 true positives, 3 true negatives, and 1 false positive. [sent-280, score-0.293]
93 Conclusion: We introduced a new and challenging domain for activity recognition, the analysis of dyadic social interactions between children and adults. [sent-284, score-0.616]
94 We created a new Multimodal Dyadic Behavior (MMDB) dataset containing more than 160 examples of structured adult-child social interactions, which were captured using multiple sensor modalities and contain rich annotation. [sent-285, score-0.253]
95 We presented baseline analyses which are a first attempt to decode children’s social behavior by determining whether they produce key behaviors, such as looks to their partner, smiles, and gestures, during specific moments of an interaction, and by assessing the degree of engagement. [sent-286, score-0.393]
96 Our long-term goal is to develop a rich, fine-grained computational understanding of child behavior in these settings. [sent-287, score-0.468]
97 , is the child combining affect, vocalizations, and gestures with looks to the examiner’s face), timing (e. [sent-291, score-0.434]
98 , how does the child time their response to the examiner’s social bids), and function (e. [sent-293, score-0.552]
99 , is the child directing the examiner’s attention to an object to share their interest in the object, or only to request it). [sent-295, score-0.395]
100 Skin conductance responses to another person’s gaze in children with autism. [sent-360, score-0.249]
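Sentences 51 and 52 above describe a phrase detector that returns time stamps and confidence scores, and sentence 53 mentions predicting substage start times. The rule that turns detections into start times is not part of this summary, so the following is only a minimal sketch under stated assumptions: each stage is tied to one or more key phrases, and the start time is taken from the most confident detection above a threshold. The `Detection` type, the `estimate_stage_starts` function, the example phrases, and the 0.5 threshold are all hypothetical and are not taken from the paper or from the Nexidia tool.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Detection(NamedTuple):
    """One keyword-spotting hit (hypothetical record layout)."""
    phrase: str        # phrase that was searched for
    time: float        # time stamp of the hit, in seconds
    confidence: float  # detector confidence, assumed to lie in [0, 1]


def estimate_stage_starts(
    detections: Iterable[Detection],
    stage_phrases: Dict[str, Set[str]],
    min_confidence: float = 0.5,  # assumed threshold, not from the source
) -> Dict[str, float]:
    """Pick, per stage, the time of the most confident matching detection.

    Stages with no detection above the threshold are left out of the
    result rather than guessed.
    """
    detections = list(detections)
    starts: Dict[str, float] = {}
    for stage, phrases in stage_phrases.items():
        candidates: List[Detection] = [
            d for d in detections
            if d.phrase in phrases and d.confidence >= min_confidence
        ]
        if candidates:
            best = max(candidates, key=lambda d: d.confidence)
            starts[stage] = best.time
    return starts


# Made-up example data: the phrases below are placeholders, not the
# key phrases of the actual protocol.
example_detections = [
    Detection("look at my ball", 31.2, 0.91),
    Detection("look at my ball", 75.0, 0.40),
    Detection("look at my book", 80.4, 0.77),
]
example_phrases = {
    "ball": {"look at my ball"},
    "book": {"look at my book"},
    "hat": {"where is my hat"},
}
example_starts = estimate_stage_starts(example_detections, example_phrases)
```

Leaving undetected stages out of the result keeps a missed phrase visible to later processing instead of hiding it behind a default time.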
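Sentences 74 and 78 above mention region covariance templates compared with the AIRM distance. As a minimal sketch, and not the authors' implementation, the affine-invariant Riemannian distance between two symmetric positive-definite matrices A and B can be written as sqrt(sum_i ln^2(lambda_i)), where lambda_i are the eigenvalues of A^{-1} B. The code below assumes NumPy is available; the function names, the choice of per-pixel features, and the `eps` regularizer are illustrative assumptions.

```python
import numpy as np


def region_covariance(features, eps=1e-6):
    """Covariance descriptor of an image region.

    `features` is an (n_pixels, d) array of per-pixel feature vectors;
    which features to stack is left to the caller (an assumption here).
    A small multiple of the identity (eps, a hypothetical choice) keeps
    the result positive definite.
    """
    features = np.asarray(features, dtype=float)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return cov + eps * np.eye(cov.shape[0])


def airm_distance(a, b):
    """Affine-invariant Riemannian distance between SPD matrices a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    chol = np.linalg.cholesky(a)               # a = chol @ chol.T
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ b @ chol_inv.T       # same eigenvalues as a^{-1} b
    whitened = 0.5 * (whitened + whitened.T)   # remove round-off asymmetry
    eigvals = np.linalg.eigvalsh(whitened)
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))


def nearest_template(descriptor, templates):
    """Index of the template descriptor closest to `descriptor`."""
    distances = [airm_distance(descriptor, t) for t in templates]
    return int(np.argmin(distances))
```

Whitening with the Cholesky factor keeps the eigenvalue problem symmetric, which avoids the complex-valued round-off that a direct eigendecomposition of A^{-1} B can produce.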
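Sentence 91 above lists event-based audio features: the duration of cross-talk between child and examiner, the number of turns taken (C-to-E and E-to-C), and the number of speech segments. The exact definitions are not given in this summary, so the sketch below makes its own assumptions: speech segments are (start, end) pairs in seconds that do not overlap within one speaker, cross-talk is the total time both speakers are active, and a turn is counted whenever consecutive segment onsets belong to different speakers. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def cross_talk_duration(child: List[Segment], examiner: List[Segment]) -> float:
    """Total time during which child and examiner speech overlap."""
    total = 0.0
    for c_start, c_end in child:
        for e_start, e_end in examiner:
            total += max(0.0, min(c_end, e_end) - max(c_start, e_start))
    return total


def count_turns(child: List[Segment], examiner: List[Segment]) -> Dict[str, int]:
    """Count speaker changes between consecutive segment onsets."""
    onsets = sorted(
        [(start, "C") for start, _ in child]
        + [(start, "E") for start, _ in examiner]
    )
    turns = {"C-to-E": 0, "E-to-C": 0}
    for (_, previous), (_, current) in zip(onsets, onsets[1:]):
        if previous != current:
            turns[f"{previous}-to-{current}"] += 1
    return turns


def event_features(child: List[Segment], examiner: List[Segment]) -> Dict[str, float]:
    """Bundle the event-based features into one dictionary."""
    turns = count_turns(child, examiner)
    return {
        "cross_talk_s": cross_talk_duration(child, examiner),
        "turns_c_to_e": turns["C-to-E"],
        "turns_e_to_c": turns["E-to-C"],
        "n_child_segments": len(child),
        "n_examiner_segments": len(examiner),
    }


# Made-up segments for illustration only.
child_segments = [(0.0, 1.0), (3.0, 4.0)]
examiner_segments = [(0.5, 2.0), (5.0, 6.0)]
features = event_features(child_segments, examiner_segments)
```

With these made-up segments the only overlap is between 0.5 s and 1.0 s, and the onset order C, E, C, E gives two C-to-E turns and one E-to-C turn.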
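Sentence 92 above reports the joint classifier's outcome on 14 test sessions as 10 true positives, 3 true negatives, and 1 false positive. Since the four confusion-matrix cells must sum to 14, the number of false negatives follows as 0. The short sketch below works through the standard summary metrics from those counts; the sentence does not say which class is treated as positive, and the metric names and helper function are illustrative rather than taken from the paper.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Counts reported in sentence 92; fn is derived from the session total.
n_sessions = 14
tp, tn, fp = 10, 3, 1
fn = n_sessions - (tp + tn + fp)

metrics = binary_metrics(tp, tn, fp, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The accuracy works out to 13/14, the precision to 10/11, the recall to 10/10, and the specificity to 3/4.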
wordName wordTfidf (topN-words)
[('examiner', 0.515), ('child', 0.366), ('ball', 0.273), ('engagement', 0.221), ('social', 0.186), ('dyadic', 0.157), ('book', 0.148), ('behaviors', 0.145), ('gaze', 0.132), ('audio', 0.127), ('smile', 0.125), ('adult', 0.121), ('children', 0.117), ('sessions', 0.117), ('mmdb', 0.103), ('substage', 0.103), ('behavior', 0.102), ('interaction', 0.093), ('touched', 0.091), ('interactions', 0.089), ('play', 0.083), ('autism', 0.078), ('substages', 0.074), ('gestures', 0.068), ('multimodal', 0.068), ('activity', 0.067), ('engage', 0.065), ('speech', 0.065), ('decoding', 0.061), ('phrases', 0.061), ('stages', 0.059), ('disorders', 0.059), ('nexidia', 0.059), ('vocalizations', 0.059), ('developmental', 0.057), ('assessment', 0.055), ('activities', 0.054), ('engaged', 0.054), ('communicative', 0.052), ('georgia', 0.049), ('behavioral', 0.049), ('tracker', 0.049), ('affective', 0.048), ('acoustic', 0.046), ('smiling', 0.046), ('basler', 0.044), ('formant', 0.044), ('partner', 0.044), ('protocol', 0.044), ('eye', 0.043), ('moments', 0.04), ('head', 0.04), ('analyses', 0.04), ('stage', 0.04), ('cbi', 0.039), ('greeting', 0.039), ('modalities', 0.038), ('clip', 0.038), ('wearable', 0.038), ('parsing', 0.038), ('elicit', 0.036), ('physiological', 0.036), ('duration', 0.033), ('young', 0.032), ('kinect', 0.032), ('sheet', 0.031), ('ratings', 0.031), ('confidence', 0.03), ('video', 0.029), ('abowd', 0.029), ('electrodermal', 0.029), ('eyben', 0.029), ('greetballbookhattickle', 0.029), ('initiating', 0.029), ('lapel', 0.029), ('microphones', 0.029), ('okao', 0.029), ('omron', 0.029), ('ousley', 0.029), ('presses', 0.029), ('smiles', 0.029), ('tickle', 0.029), ('vad', 0.029), ('yaw', 0.029), ('attention', 0.029), ('structured', 0.029), ('fathi', 0.028), ('face', 0.028), ('phrase', 0.028), ('diagnosis', 0.028), ('session', 0.028), ('tool', 0.027), ('prosodic', 0.026), ('autonomic', 0.026), ('engages', 0.026), ('infants', 0.026), ('contact', 0.026), ('heads', 0.026), ('hat', 0.026), 
('whether', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 103 cvpr-2013-Decoding Children's Social Behavior
2 0.15174016 402 cvpr-2013-Social Role Discovery in Human Events
Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei
Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.
3 0.12972802 340 cvpr-2013-Probabilistic Label Trees for Efficient Large Scale Image Classification
Author: Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir, Ce Liu
Abstract: Large-scale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows. The label tree model integrates classification with the traversal of the tree so that complexity grows logarithmically. In this paper, we show how the parameters of the label tree can be found using maximum likelihood estimation. This new probabilistic learning technique produces a label tree with significantly improved recognition accuracy.
4 0.12723722 172 cvpr-2013-Finding Group Interactions in Social Clutter
Author: Ruonan Li, Parker Porfilio, Todd Zickler
Abstract: We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. Our approach represents group behaviors by dichotomous collections of descriptors for (a) individual actions, and (b) pairwise interactions; and it includes efficient algorithms for optimally distinguishing participants from by-standers in every temporal unit and for temporally localizing the extent of the group interaction. Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. We evaluate the approach using three different video collections, two that involve humans and one that involves mice.
5 0.1097272 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg
Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.
6 0.10081573 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
7 0.099792317 440 cvpr-2013-Tracking People and Their Objects
8 0.093560733 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
9 0.08831168 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
10 0.085816294 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection
11 0.082980089 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
12 0.076991163 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
13 0.07409858 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
14 0.072980247 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
15 0.071177594 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
16 0.061623704 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
17 0.060736585 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
18 0.059266482 296 cvpr-2013-Multi-level Discriminative Dictionary Learning towards Hierarchical Visual Categorization
19 0.057232328 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
20 0.056297723 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
topicId topicWeight
[(0, 0.132), (1, -0.04), (2, -0.005), (3, -0.065), (4, -0.035), (5, -0.005), (6, -0.009), (7, -0.012), (8, 0.039), (9, 0.024), (10, 0.024), (11, -0.037), (12, 0.027), (13, 0.006), (14, -0.018), (15, 0.022), (16, 0.03), (17, 0.089), (18, 0.009), (19, -0.098), (20, -0.075), (21, 0.068), (22, 0.037), (23, -0.008), (24, -0.027), (25, 0.011), (26, 0.041), (27, -0.008), (28, 0.028), (29, 0.034), (30, -0.054), (31, 0.048), (32, -0.048), (33, -0.036), (34, 0.053), (35, -0.003), (36, -0.01), (37, 0.063), (38, 0.003), (39, 0.101), (40, -0.043), (41, 0.029), (42, -0.04), (43, 0.086), (44, 0.11), (45, 0.006), (46, 0.04), (47, -0.088), (48, 0.037), (49, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.91577971 103 cvpr-2013-Decoding Children's Social Behavior
2 0.67193544 402 cvpr-2013-Social Role Discovery in Human Events
Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei
Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.
3 0.60361344 172 cvpr-2013-Finding Group Interactions in Social Clutter
Author: Ruonan Li, Parker Porfilio, Todd Zickler
Abstract: We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. Our approach represents group behaviors by dichotomous collections of descriptors for (a) individual actions, and (b) pairwise interactions; and it includes efficient algorithms for optimally distinguishing participants from by-standers in every temporal unit and for temporally localizing the extent of the group interaction. Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. We evaluate the approach using three different video collections, two that involve humans and one that involves mice.
Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
5 0.55263543 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
Author: Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J. Zelinsky, Tamara L. Berg
Abstract: We posit that user behavior during natural viewing of images contains an abundance of information about the content of images as well as information related to user intent and user defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues with outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.
6 0.52970934 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
7 0.52617514 214 cvpr-2013-Image Understanding from Experts' Eyes by Modeling Perceptual Skill of Diagnostic Reasoning Processes
8 0.51709044 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
9 0.50895452 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
10 0.47905067 72 cvpr-2013-Boundary Detection Benchmarking: Beyond F-Measures
11 0.47172773 440 cvpr-2013-Tracking People and Their Objects
12 0.47092459 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
13 0.46902084 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
14 0.4660764 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
15 0.46086618 118 cvpr-2013-Detecting Pulse from Head Motions in Video
16 0.4540697 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
17 0.4536114 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
18 0.45301971 11 cvpr-2013-A Genetic Algorithm-Based Solver for Very Large Jigsaw Puzzles
19 0.44007242 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery
20 0.43525821 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment
topicId topicWeight
[(10, 0.063), (16, 0.027), (26, 0.026), (33, 0.156), (67, 0.511), (69, 0.052), (72, 0.013), (76, 0.013), (87, 0.038)]
simIndex simValue paperId paperTitle
1 0.86229539 142 cvpr-2013-Efficient Detector Adaptation for Object Detection in a Video
Author: Pramod Sharma, Ram Nevatia
Abstract: In this work, we present a novel and efficient detector adaptation method which improves the performance of an offline trained classifier (baseline classifier) by adapting it to new test datasets. We address two critical aspects of adaptation methods: generalizability and computational efficiency. We propose an adaptation method, which can be applied to various baseline classifiers and is computationally efficient also. For a given test video, we collect online samples in an unsupervised manner and train a random fern adaptive classifier. The adaptive classifier improves precision of the baseline classifier by validating the obtained detection responses from baseline classifier as correct detections or false alarms. Experiments demonstrate generalizability, computational efficiency and effectiveness of our method, as we compare our method with state of the art approaches for the problem of human detection and show good performance with high computational efficiency on two different baseline classifiers.
same-paper 2 0.85138315 103 cvpr-2013-Decoding Children's Social Behavior
Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye
Abstract: We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
3 0.81020832 398 cvpr-2013-Single-Pedestrian Detection Aided by Multi-pedestrian Detection
Author: Wanli Ouyang, Xiaogang Wang
Abstract: In this paper, we address the challenging problem of detecting pedestrians who appear in groups and have interaction. A new approach is proposed for single-pedestrian detection aided by multi-pedestrian detection. A mixture model of multi-pedestrian detectors is designed to capture the unique visual cues which are formed by nearby multiple pedestrians but cannot be captured by single-pedestrian detectors. A probabilistic framework is proposed to model the relationship between the configurations estimated by single- and multi-pedestrian detectors, and to refine the single-pedestrian detection result with multi-pedestrian detection. It can integrate with any single-pedestrian detector without significantly increasing the computation load. 15 state-of-the-art single-pedestrian detection approaches are investigated on three widely used public datasets: Caltech, TUD-Brussels and ETH. Experimental results show that our framework significantly improves all these approaches. The average improvement is 9% on the Caltech-Test dataset, 11% on the TUD-Brussels dataset and 17% on the ETH dataset in terms of average miss rate. The lowest average miss rate is reduced from 48% to 43% on the Caltech-Test dataset, from 55% to 50% on the TUD-Brussels dataset and from 51% to 41% on the ETH dataset.
4 0.74162883 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
Author: Enrique G. Ortiz, Alan Wright, Mubarak Shah
Abstract: This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ℓ1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ℓ1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.
5 0.73549426 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
Author: Georgia Gkioxari, Pablo Arbeláez, Lubomir Bourdev, Jitendra Malik
Abstract: We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
6 0.72589016 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
7 0.7129209 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
8 0.71035129 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
9 0.69266307 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking
10 0.67125976 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
11 0.64199758 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
12 0.63536531 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
13 0.62262213 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
14 0.61741138 363 cvpr-2013-Robust Multi-resolution Pedestrian Detection in Traffic Scenes
15 0.60266513 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
16 0.59025371 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
17 0.56267035 438 cvpr-2013-Towards Pose Robust Face Recognition
18 0.55985147 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
19 0.55703425 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification
20 0.55605781 63 cvpr-2013-Binary Code Ranking with Weighted Hamming Distance