jmlr jmlr2013 jmlr2013-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been shown to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
Reference: text
sentIndex sentText sentNum sentScore
1 We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. [sent-11, score-0.337]
2 The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. [sent-12, score-0.367]
3 We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. [sent-13, score-0.646]
4 The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. [sent-14, score-0.571]
5 Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction 1. [sent-17, score-0.953]
6 Introduction Action recognition as a general problem is a very fertile research theme due to its strong applicability in several real world domains, ranging from video-surveillance to content-based video retrieval and video classification. [sent-18, score-0.499]
7 This paper refers specifically to action recognition in the context of Human-Machine Interaction (HMI), and therefore it focuses on whole-body actions performed by a human who is standing at a short distance from the sensor. [sent-19, score-0.805]
8 , 2011) by Microsoft); these depth-based sensors are drastically changing the field of action recognition, enabling the achievement of high performance using fast algorithms. [sent-31, score-0.321]
9 Following this recent trend we propose a complete system based on RGBD video sequences, which models actions from one example only. [sent-32, score-0.414]
10 Subsequently, we summarize the action within adjacent frames by building feature vectors that describe the feature evolution over time. [sent-36, score-0.371]
11 Finally, we train a Support Vector Machine (SVM) for each action class. [sent-37, score-0.321]
12 Furthermore, thanks to the simultaneous appearance and motion description complemented by the sparse coding stage, the method provides a one-shot learning procedure. [sent-39, score-0.497]
13 Our objective in designing this interaction game is to stress the effectiveness of our gesture recognition system in complex and uncontrolled settings. [sent-43, score-0.615]
14 In particular, some approaches are based on machine learning techniques, where each action is described as a complex structure; in this class we find methods based on Hidden Markov Models (Malgireddy et al. [sent-61, score-0.321]
15 , 2012), Coupled Hidden Semi-Markov models (Natarajan and Nevatia, 2007), action graphs (Li et al. [sent-62, score-0.321]
16 Other methods are based on matching: the recognition of actions is carried out through a similarity match with all the available data, and the most similar datum dictates the estimated class (Seo and Milanfar, 2012; Mahbub et al. [sent-65, score-0.4]
17 Each action is then modeled as a multi channel Hidden Markov Model (mcHMM). [sent-72, score-0.321]
18 Another recent method following the trend of matching-based action recognition algorithms is Mahbub et al. [sent-75, score-0.51]
19 An alternative way to classify gesture recognition algorithms is based on the data representation of gesture models. [sent-81, score-0.699]
20 Unlike these works, our approach aims specifically at accurate real-time recognition from one video example only. [sent-103, score-0.344]
21 We conclude the section with a reference to some works focusing on continuous action or activity recognition (Ali and Aggarwal, 2001; Green and Guan, 2004; Liao et al. [sent-104, score-0.541]
22 Our work deals with continuous action recognition as well; indeed, the proposed framework includes a novel and robust temporal segmentation algorithm. [sent-108, score-0.63]
23 In this work we rely on sparse coding to obtain a compact descriptor with a good discriminative power even if it is derived from very small data sets. [sent-112, score-0.393]
24 We refer to adaptive sparse coding when the coding is driven by data. [sent-114, score-0.351]
25 The motivations behind the use of image coding arise from biology: there is evidence that similar signal coding happens in the neurons of the primary visual cortex (V1), which produce sparse and overcomplete activations (Olshausen and Field, 1997). [sent-118, score-0.433]
26 The coding step is followed by a pooling stage, whose purpose is to aggregate multiple local descriptors into a single global one. [sent-145, score-0.717]
27 The final descriptor of the image is the concatenation of the descriptors ps among all the regions. [sent-157, score-0.361]
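As a concrete illustration of the coding-plus-pooling pipeline just described, the sketch below max-pools per-patch sparse codes within spatial regions and concatenates the pooled vectors p_s into one global descriptor. This is a minimal sketch, not the paper's implementation: the function name, the choice of max pooling, and the toy region assignment are assumptions.

```python
import numpy as np

def pool_and_concatenate(codes, region_ids, n_regions):
    """Max-pool sparse codes within each spatial region, then concatenate.

    codes      : (n_patches, d) array of sparse codes, one row per local patch
    region_ids : (n_patches,) array assigning each patch to a spatial region
    n_regions  : number of regions (e.g., the cells of a 2x2 grid over the ROI)
    Returns a global descriptor of length n_regions * d (the concatenated p_s).
    """
    d = codes.shape[1]
    pooled = np.zeros((n_regions, d))
    for r in range(n_regions):
        mask = region_ids == r
        if np.any(mask):
            pooled[r] = codes[mask].max(axis=0)   # max pooling within region r
    return pooled.ravel()

# toy usage: 10 patches with 8-dimensional codes, 4 spatial regions
rng = np.random.default_rng(0)
desc = pool_and_concatenate(rng.random((10, 8)), rng.integers(0, 4, 10), 4)
print(desc.shape)  # (32,)
```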
28 Action Recognition System In this section we describe the versatile real-time action recognition system we propose. [sent-161, score-0.558]
29 The resultant 3DHOF+GHOG descriptor is processed via a sparse coding step to compute a compact and meaningful representation of the performed action. [sent-165, score-0.384]
30 A novel on-line video segmentation algorithm is proposed which allows isolating different actions while recognizing the action sequence. [sent-167, score-0.778]
31 1 Region Of Interest Segmentation The first step of each action recognition system is to identify correctly where in the image the action is occurring. [sent-169, score-0.931]
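The extraction step itself is not spelled out in this excerpt; since the sensor provides depth and the user stands close to it, a simple depth-threshold segmentation conveys the idea. The threshold value, the function name, and the bounding-box cropping below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def depth_roi(depth, max_distance_mm=2000):
    """Segment a region of interest from a depth map (a minimal sketch).

    Keeps pixels closer than `max_distance_mm`, assuming the person is the
    closest object to the sensor, and returns the mask plus the bounding box
    of the retained pixels.
    """
    mask = (depth > 0) & (depth < max_distance_mm)   # 0 means missing depth
    if not mask.any():
        return mask, None
    ys, xs = np.nonzero(mask)
    bbox = (ys.min(), ys.max() + 1, xs.min(), xs.max() + 1)
    return mask, bbox

# toy usage on a synthetic 480x640 depth map in millimetres
depth = np.full((480, 640), 4000, dtype=np.uint16)
depth[100:400, 200:450] = 1500                       # the "person"
mask, bbox = depth_roi(depth)
print(bbox)  # (100, 400, 200, 450)
```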
32 Many studies suggest that motion alone can be used to recognize actions (Bisio et al. [sent-187, score-0.414]
33 In artificial systems this developmental-scale experience is typically not available, although actions can still be represented from two main cues: motion and appearance (Giese and Poggio, 2003). [sent-189, score-0.506]
34 Although many variants of complex features describing human actions have been proposed, most of them involve computationally expensive routines. [sent-190, score-0.337]
35 2 GLOBAL HISTOGRAM OF ORIENTED GRADIENT In specific contexts, motion information is not sufficient to discriminate actions, and information on the pose or appearance becomes crucial. [sent-226, score-0.334]
36 Thus we extend the motion descriptor with a shape feature computed on the depth map. [sent-229, score-0.434]
37 This appearance descriptor produces an overall description of the appearance of the ROI without splitting the image into cells. [sent-235, score-0.438]
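A minimal sketch of a Global Histogram of Oriented Gradient computed on the depth ROI as described above: a single magnitude-weighted orientation histogram over the whole region, with no cell grid. The bin range, the weighting, and the L1 normalisation are assumptions; the paper's exact GHOG computation may differ in these details.

```python
import numpy as np

def ghog(depth_roi, n_bins=64):
    """Global Histogram of Oriented Gradient over a depth ROI (a sketch)."""
    gy, gx = np.gradient(depth_roi.astype(np.float64))
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.arctan2(gy, gx)                      # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist            # L1-normalised histogram

print(ghog(np.random.default_rng(0).random((120, 90)), n_bins=64).shape)  # (64,)
```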
38 3 SPARSE CODING At this stage, each frame Ft is represented by two global descriptors: z(t) ∈ R^n1 for the motion component and h(t) ∈ R^n2 for the appearance component. [sent-241, score-0.408]
39 , z(K)], where K is the total number of frames in the training data, our goal is to learn one motion dictionary DM (an n1 × d1 matrix, with d1 the dictionary size and n1 the motion vector size) and the codes UM (a d1 × K matrix) that minimize Equation 1, so that z(t) ∼ DM uM(t). [sent-246, score-0.5]
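Assuming that Equation 1 is the usual l1-regularised reconstruction objective with unit-norm atoms, a dictionary DM and codes UM of this kind could be obtained with off-the-shelf tools; the sketch below uses scikit-learn with small toy sizes (the paper reports d1 = d2 = 256 atoms). Variable names mirror the text; everything else, including the regularisation weight, is illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

# Z holds one n1-dimensional 3DHOF per training frame; sklearn stores samples
# as rows, so Z is (K, n1) and the learned dictionary is transposed relative
# to the text's n1 x d1 convention.
rng = np.random.default_rng(0)
n1, d1, K = 125, 64, 300                     # toy sizes; the paper uses d1 = 256
Z = rng.random((K, n1))

dl = DictionaryLearning(n_components=d1, alpha=0.1, max_iter=5,
                        transform_algorithm="lasso_lars", random_state=0)
U_M = dl.fit_transform(Z)                    # codes for the K training frames
D_M = dl.components_.T                       # n1 x d1 motion dictionary

# at test time, a new frame descriptor z(t) is encoded against the fixed D_M
u_t = sparse_encode(Z[:1], dl.components_, algorithm="lasso_lars", alpha=0.1)
print(D_M.shape, u_t.shape)                  # (125, 64) (1, 64) with toy sizes
```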
40 The local minima of the standard deviation function are break points that define the end of an action and the beginning of another one. [sent-249, score-0.321]
41 Therefore, after the Sparse Coding stage, we can describe a frame as a code u(i), which is the concatenation of the motion and appearance codes: u(i) = [uM (i), uG (i)]. [sent-257, score-0.408]
42 3 Learning and Recognition The goal of this phase is to learn a model of a given action from data. [sent-260, score-0.321]
43 Since we are implementing a one-shot action recognition system, the available training data amounts to one training sequence for each action of interest. [sent-261, score-0.831]
44 In order to model the temporal extent of an action we extract sets of sub-sequences from a sequence, each one containing T adjacent frames. [sent-262, score-0.35]
45 The remainder of the section describes in detail the two phases of action learning and action recognition. [sent-277, score-0.642]
46 Blue dots are the break points computed by the video segmentation algorithm that indicate the end of an action and the beginning of a new one. [sent-281, score-0.567]
47 1 ACTION LEARNING Given a video Vs of ts frames, containing only one action As, we compute a set of descriptors [u(1), . [sent-284, score-0.641]
48 Then, action learning is carried out on a set of data that are descriptions of a frame buffer BT (t), where T is its length: BT (t) = (u(t − T ), . [sent-289, score-0.516]
49 , BT (ts )] computed from the single video Vs of the class As are used as positive examples for the action As . [sent-300, score-0.476]
50 Although we use only one example for each class, we benefit from the chosen representation: indeed, descriptors are computed per frame, so a single video of length ts provides a number of examples equal to ts − T, where T is the buffer size. [sent-302, score-0.402]
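The sketch below shows how buffer descriptors BT(t) could be assembled from a single training video and used to train one linear SVM per action in a one-vs-all fashion. The data are synthetic, and using the buffers of the other classes as negatives is an assumption made for illustration; the excerpt does not state how negatives are chosen.

```python
import numpy as np
from sklearn.svm import LinearSVC

def frame_buffers(U, T):
    """Stack the codes u(t-T), ..., u(t) into one buffer descriptor BT(t).

    U is (ts, d): one sparse code per frame of a training video. Returns
    (ts - T, (T + 1) * d), matching the ts - T examples mentioned in the text.
    """
    ts, d = U.shape
    return np.stack([U[t - T:t + 1].ravel() for t in range(T, ts)])

# toy one-vs-all setup: one synthetic 120-frame video per action, 512-d codes
rng = np.random.default_rng(0)
videos = {a: rng.random((120, 512)) for a in range(3)}
T = 10
buffers = {a: frame_buffers(U, T) for a, U in videos.items()}

svms = {}
for a in buffers:
    pos = buffers[a]
    neg = np.vstack([buffers[b] for b in buffers if b != a])
    data = np.vstack([pos, neg])
    labels = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    svms[a] = LinearSVC(C=1.0, max_iter=10000).fit(data, labels)  # one SVM per action
```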
51 As the scores evolve, we need to predict (on-line) when an action ends and another one begins; this is achieved by computing the standard deviation σ(H), for a fixed t, over all the scores H_i^t (Figure 4, right chart). [sent-317, score-0.385]
52 When an action ends we can expect all the SVM output scores to be similar, because no model should be predominant with respect to idle states; this leads to a local minimum in the function σ(H). [sent-318, score-0.353]
53 Therefore, each local minimum corresponds to the end of an action and the beginning of a new one. [sent-319, score-0.321]
54 When the standard deviation trend is below the mean, all the SVM scores take similar values; hence it is likely that an action has just ended. [sent-322, score-0.353]
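A compact sketch of this segmentation heuristic: track the standard deviation of the per-class SVM scores over time and flag frames where it is a local minimum lying below its mean. For clarity the sketch runs offline over a full score matrix, whereas the actual system operates on-line; the toy data and the strict local-minimum test are assumptions.

```python
import numpy as np

def segment_boundaries(scores):
    """Detect action break points from per-frame SVM scores (a sketch).

    scores : (n_frames, n_classes) matrix H of one-vs-all SVM outputs.
    A frame is a break point when the standard deviation of its class scores
    is a local minimum and lies below the mean of the standard deviation.
    """
    sigma = scores.std(axis=1)
    mean_sigma = sigma.mean()
    return [t for t in range(1, len(sigma) - 1)
            if sigma[t] < sigma[t - 1] and sigma[t] < sigma[t + 1]
            and sigma[t] < mean_sigma]

# toy usage: two "actions" separated by a flat, ambiguous gap (frames 40-44)
rng = np.random.default_rng(0)
H = np.vstack([rng.normal([2, 0, 0], 0.02, (40, 3)),
               rng.normal([0.5, 0.5, 0.5], 0.02, (5, 3)),
               rng.normal([0, 2, 0], 0.02, (40, 3))])
print(segment_boundaries(H))   # detected break points fall inside the gap
```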
55 On the right the overall Levenshtein Distance computed in 20 batches with respect to the buffer size parameter is depicted for both 3DHOF+GHOG features and descriptors processed with sparse coding. [sent-340, score-0.408]
56 We empirically choose a quantization parameter n1 = 5 for the 3DHOF, n2 = 64 bins for the GHOG descriptor, and dictionary sizes d1 = d2 = 256 for the motion and appearance components. [sent-341, score-0.346]
57 This leads to a frame descriptor of size 189 for simple descriptors (5^3 = 125 motion bins plus 64 appearance bins), which increases to 512 (the concatenation of two 256-dimensional codes) after the sparse coding processing. [sent-342, score-0.459]
58 The data set is organized in batches, where each batch includes 100 recorded gestures grouped in sequences of 1 to 5 gestures arbitrarily performed at different speeds. [sent-348, score-0.426]
59 The gestures are drawn from a small vocabulary of 8 to 15 unique gestures, called a lexicon, which is defined within a batch. [sent-349, score-0.426]
60 11% for features processed with sparse coding, whereas simple 3DHOF+GHOG descriptors without sparse coding lead to a performance of 43. [sent-356, score-0.462]
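Performance on ChaLearn is reported through the Levenshtein Distance between the recognised gesture sequence and the true one; the standard dynamic-programming computation is sketched below. The benchmark's normalisation (summing the distances and dividing by the total number of true gestures) is our reading and stated here as an assumption.

```python
def levenshtein(predicted, truth):
    """Edit distance between a recognised gesture sequence and the true one."""
    m, n = len(predicted), len(truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all predictions
    for j in range(n + 1):
        d[0][j] = j                               # insert all true gestures
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

print(levenshtein([3, 7, 7, 1], [3, 7, 1]))  # 1: one spurious detection
```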
61 In the recognition phase we classify each slice of the video by comparing it with all the templates. [sent-385, score-0.344]
62 (2012) has a training computational complexity of O(n × k^2) for each action class, where k is the number of HMM states and n the number of examples, while the testing computational complexity for a video frame is O(k^2). [sent-398, score-0.589]
63 Furthermore our on-line video segmentation algorithm shows excellent results with respect to the temporal segmentation used in the compared frameworks; in fact it is worth noting that the proposed algorithm leads to an action detection error rate TeLen = FP+FN equal to 5. [sent-404, score-0.687]
64 In general we notice that the combination of both motion and appearance descriptors leads to the best results when the lexicon is composed of actions where both motion and appearance are equally important. [sent-410, score-1.048]
65 The error obtained using only the 3DHOF descriptors was expected, due to the nature of the lexicons chosen: indeed in most gestures the motion component has little significance. [sent-416, score-0.552]
66 Considering instead batch devel 01, where motion is an important component in the gesture vocabulary, we have that 3DHOF descriptors lead to a Levenshtein Distance equal to 29. [sent-417, score-0.618]
67 2 LINEAR VS NON-LINEAR CLASSIFIERS In this section we compare the performance of linear and non-linear SVMs for the action recognition task. [sent-424, score-0.51]
68 For this experiment we used coded features where both motion and appearance are employed. [sent-429, score-0.337]
69 1, we noted that the resolution of the proposed appearance descriptor is quite low and may not be ideal when actions differ by small details, especially on the hands; therefore, a localization of the interesting parts to model would be effective. [sent-436, score-0.476]
70 The simplest way to build in this specific information is to resort to a body part tracker; indeed, if a body tracker were available, it would be easy to extract descriptors from different limbs and then concatenate all the features to obtain the final frame representation. [sent-437, score-0.519]
71 2; these gestures are difficult to model without a proper body tracker; indeed, most of the GHOG contribution comes from the body shape rather than the hands. [sent-443, score-0.387]
72 Then, we slightly modify the approach, computing 3DHOF and GHOG descriptors on three different body parts (left/right hand and whole body shape); the final frame representation becomes the concatenation of all the part descriptors. [sent-449, score-0.46]
73 In Figure 7 (left) the overall accuracy is shown; using sparse-coded descriptors computed only on the body shape, we obtain a Levenshtein Distance of around 30%. [sent-451, score-0.469]
74 By concatenating descriptors extracted from the hands the system achieves 10% for features enhanced with sparse coding and 20% for normal descriptors. [sent-452, score-0.457]
75 3 Human-Robot Interaction The action recognition system has been implemented and tested on the iCub, a 53 degrees of freedom humanoid robot developed by the RobotCub Consortium (Metta et al. [sent-458, score-0.788]
76 In this setting the action recognition system can be used for more general purposes such as Human-Robot Interaction (HRI) or learning-by-imitation tasks. [sent-462, score-0.558]
77 Each action is modeled using only the motion component (3DHOF), since we want the descriptor to be independent of the particular object shape used. [sent-468, score-0.669]
78 Our game takes inspiration from the classic “Simon” game; nevertheless, since the original version has often been described as “visually boring”, we developed a revisited version, based on gesture recognition, which involves a “less boring” opponent: the iCub (Metta et al. [sent-475, score-0.338]
79 Both the human and the robot take turns and perform the longest possible sequence of gestures by adding one gesture at each turn: one player starts by performing a gesture; the opponent has to recognize the gesture, imitate it, and add another gesture to the sequence. [sent-477, score-0.932]
80 The game continues until one of the two players loses: the human player can lose because of limited memory skills, whereas the robot can lose because the gesture recognition system fails. [sent-478, score-0.871]
81 The typical game setting is shown in Figure 10: the player stays in front of the robot while performing gestures that are recognized with Kinect. [sent-481, score-0.449]
82 Importantly, hand gestures cannot be learned by exploiting the Kinect Skeleton Data: the body tracker detects the position of the hand, which is not enough to discriminate more complicated actions (for example, see gesture classes 1 and 5 or 2 and 6 in Figure 9). [sent-482, score-0.576]
83 The vision system has been trained using 8 different actors, each performing every gesture class 3 times. [sent-484, score-0.392]
84 There are three main modules that take care of recognizing the action sequence, defining the game rules, and executing the robot's gestures. [sent-486, score-0.557]
85 This result indicates that the recognition system is also robust to different players performing variable gestures at various speeds. [sent-495, score-0.488]
86 Discussion This paper presented the design and implementation of a complete action recognition system to be used in real world applications such as HMI. [sent-500, score-0.558]
87 Left: the human player performs the first gesture of the sequence. [sent-513, score-0.32]
88 One-Shot Learning: one example is sufficient to teach a new action to the system; this is mainly due to the effective per-frame representation. [sent-521, score-0.353]
89 Sparse Frame Representation: starting from a simple and computationally inexpensive description that combines global motion (3DHOF) and appearance (GHOG) information over a ROI, subsequently filtered through sparse coding, we obtained a sparse representation at each frame. [sent-523, score-0.439]
90 We showed that these global descriptors are appropriate to model actions of the upper body of a person. [sent-524, score-0.448]
91 On-line Video Segmentation: we propose a new, effective, reliable and on-line video segmentation algorithm that achieved a 5% error rate on action detection on a set of 2000 actions grouped in sequences of 1 to 5 gestures. [sent-526, score-0.778]
92 This segmentation procedure works concurrently with the recognition process; thus, a sequence of actions is simultaneously segmented and recognized. [sent-527, score-0.491]
93 For testing purposes, we proposed a memory game, called “All Gestures You Can”, where a person can challenge the iCub robot on action recognition and sequencing. [sent-533, score-0.684]
94 The approach is competitive against many of the state-of-the-art methods for action recognition. [sent-541, score-0.321]
95 We are currently working on a more precise appearance description at frame level still under the severe constraint of real-time performance; this would enable the use of more complex actions even when the body tracker is not available. [sent-542, score-0.572]
96 A unified framework for gesture recognition and spatiotemporal gesture segmentation. [sent-564, score-0.661]
97 All gestures you can: a memory game against a humanoid robot. [sent-693, score-0.451]
98 Continuous human action segmentation and recognition using a spatio-temporal probabilistic framework. [sent-774, score-0.685]
99 Single view human action recognition using key pose matching and viterbi path searching. [sent-784, score-0.633]
100 Real-time human pose recognition in parts from a single depth image. [sent-895, score-0.398]
wordName wordTfidf (topN-words)
[('action', 0.321), ('gesture', 0.236), ('gestures', 0.213), ('actions', 0.211), ('fanello', 0.203), ('recognition', 0.189), ('motion', 0.174), ('descriptors', 0.165), ('levenshtein', 0.156), ('video', 0.155), ('coding', 0.149), ('descriptor', 0.144), ('ghog', 0.139), ('robot', 0.134), ('etta', 0.128), ('icub', 0.128), ('chalearn', 0.124), ('appearance', 0.121), ('eep', 0.118), ('imple', 0.118), ('frame', 0.113), ('roi', 0.11), ('ori', 0.11), ('vision', 0.108), ('kinect', 0.107), ('game', 0.102), ('eal', 0.101), ('ecognition', 0.099), ('gori', 0.096), ('humanoid', 0.096), ('metta', 0.096), ('segmentation', 0.091), ('depth', 0.086), ('human', 0.084), ('ime', 0.084), ('lexicon', 0.082), ('buffer', 0.082), ('pooling', 0.082), ('histograms', 0.079), ('malgireddy', 0.074), ('body', 0.072), ('parse', 0.07), ('batches', 0.066), ('scene', 0.064), ('tracker', 0.055), ('mahbub', 0.053), ('sparse', 0.053), ('image', 0.052), ('dictionary', 0.051), ('frames', 0.05), ('system', 0.048), ('histogram', 0.048), ('flow', 0.047), ('discriminative', 0.047), ('oriented', 0.046), ('svm', 0.044), ('svms', 0.044), ('devel', 0.043), ('ghogs', 0.043), ('mhi', 0.043), ('odone', 0.043), ('rgbd', 0.043), ('features', 0.042), ('stage', 0.042), ('invariance', 0.041), ('laptev', 0.041), ('workshops', 0.041), ('ow', 0.041), ('memory', 0.04), ('interaction', 0.04), ('pose', 0.039), ('representation', 0.038), ('players', 0.038), ('pattern', 0.038), ('wu', 0.037), ('ngers', 0.037), ('pipeline', 0.036), ('videos', 0.036), ('bt', 0.034), ('ft', 0.033), ('scores', 0.032), ('adhd', 0.032), ('bobick', 0.032), ('cit', 0.032), ('comoldi', 0.032), ('disorder', 0.032), ('francesca', 0.032), ('giorgio', 0.032), ('hmi', 0.032), ('hri', 0.032), ('ilaria', 0.032), ('shotton', 0.032), ('teach', 0.032), ('telen', 0.032), ('telev', 0.032), ('activity', 0.031), ('shape', 0.03), ('visual', 0.03), ('recognize', 0.029), ('temporal', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
2 0.35482365 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju
Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models
3 0.32999665 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner
Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition
5 0.10592437 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos
Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification
6 0.079764113 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
7 0.069386013 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
8 0.067847893 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
9 0.063755132 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
10 0.060289029 82 jmlr-2013-Optimally Fuzzy Temporal Memory
11 0.057279252 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics
12 0.045578156 21 jmlr-2013-Classifier Selection using the Predicate Depth
13 0.045539644 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
14 0.042903107 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
15 0.038689554 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
16 0.035995461 22 jmlr-2013-Classifying With Confidence From Incomplete Information
18 0.032642119 8 jmlr-2013-A Theory of Multiclass Boosting
19 0.030742113 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
20 0.030032564 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
topicId topicWeight
[(0, -0.209), (1, -0.035), (2, -0.677), (3, -0.039), (4, 0.01), (5, -0.178), (6, 0.033), (7, -0.027), (8, 0.015), (9, 0.048), (10, -0.069), (11, 0.084), (12, 0.056), (13, 0.017), (14, 0.031), (15, 0.018), (16, 0.019), (17, 0.039), (18, -0.006), (19, -0.032), (20, 0.018), (21, -0.02), (22, 0.04), (23, -0.021), (24, -0.018), (25, 0.003), (26, 0.007), (27, 0.019), (28, -0.008), (29, 0.005), (30, -0.012), (31, 0.014), (32, 0.042), (33, -0.007), (34, -0.004), (35, 0.02), (36, -0.027), (37, -0.007), (38, -0.002), (39, 0.056), (40, 0.007), (41, -0.007), (42, -0.0), (43, 0.004), (44, -0.02), (45, -0.021), (46, -0.006), (47, 0.01), (48, 0.03), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96900177 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
2 0.94886106 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
3 0.90048963 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju
Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models
Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner
Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition
5 0.49332511 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos
Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification
6 0.26283893 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
7 0.2382426 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
8 0.21585499 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics
9 0.20414743 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
10 0.19198489 82 jmlr-2013-Optimally Fuzzy Temporal Memory
12 0.18591908 22 jmlr-2013-Classifying With Confidence From Incomplete Information
13 0.18442437 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
14 0.18396834 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
15 0.16548935 21 jmlr-2013-Classifier Selection using the Predicate Depth
16 0.1647277 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
17 0.15418932 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
18 0.141139 106 jmlr-2013-Stationary-Sparse Causality Network Learning
19 0.13573952 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
20 0.13512503 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
topicId topicWeight
[(0, 0.018), (5, 0.082), (6, 0.026), (10, 0.049), (20, 0.045), (23, 0.583), (68, 0.02), (70, 0.013), (75, 0.033), (87, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.90437704 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
2 0.83687598 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
3 0.59242469 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
4 0.43767667 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju
Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models
Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner
Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition
6 0.39340988 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.38562778 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
8 0.37780204 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
9 0.3596822 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
10 0.3554292 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
11 0.34137979 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
12 0.33839869 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
13 0.33829942 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
14 0.33661416 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
15 0.3295399 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
16 0.3289862 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
18 0.32701388 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.32603842 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
20 0.32414481 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning