jmlr jmlr2013 jmlr2013-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju
Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science and Engineering, University at Buffalo, SUNY, Buffalo, NY 14260, USA. Editors: Isabelle Guyon and Vassilis Athitsos. Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. [sent-5, score-0.406]
2 In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. [sent-6, score-0.817]
3 We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. [sent-9, score-0.731]
4 We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. [sent-10, score-1.211]
5 Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models. [sent-11, score-0.702]
6 1. Introduction. Vision-based activity recognition is currently a very active area of computer vision research, where the goal is to automatically recognize different activities from a video. [sent-12, score-0.918]
7 In a simple case where a video contains only one activity, the goal is to classify that activity, whereas, in a more general case, the objective is to detect the start and end locations of different specific activities occurring in a video. [sent-13, score-0.611]
8 The former, simpler case is known as activity classification and the latter as activity spotting. [sent-14, score-0.704]
9 The ability to recognize activities in videos can be helpful in several applications, such as monitoring elderly persons; surveillance systems in airports and other important public areas to detect abnormal and suspicious activities; and content-based video retrieval, amongst other uses. [sent-15, score-0.68]
10 Motivated by the successes of this modeling technique in solving general high-level problems, we define an activity as a sequence of contiguous sub-actions, where a sub-action is a discrete unit that can be identified in an action stream. [sent-24, score-0.469]
11 Extracting the complete vocabulary of sub-actions in activities is a challenging problem since the exhaustive list of sub-actions involved in a set of given activities is not necessarily known beforehand. [sent-27, score-0.656]
12 We therefore hypothesize that the use of sub-actions in combination with the use of a generative model for representing activities will improve recognition accuracy and can also aid in activity spotting. [sent-29, score-0.86]
13 Background and Related Work. Although extensive research has gone into the study of the classification of human activities in video, fewer attempts have been made to spot actions from an activity stream. [sent-32, score-0.784]
14 A recent, more complete survey on activity recognition research is presented by Aggarwal and Ryoo (2011). [sent-33, score-0.447]
15 We divide the related work in activity recognition into two main categories: activity classification and activity spotting. [sent-34, score-1.151]
16 When referring to activity spotting, we use the term gestures instead of activities, solely to be consistent with the terminology of the ChaLearn Gesture Challenge. [sent-37, score-0.663]
17 model (HMM) is learned over the features; (iii) hierarchical approaches: an activity is modeled hierarchically, as a combination of simpler low-level activities. [sent-38, score-0.506]
18 A typical space-time approach for activity recognition involves the detection of interest points and the computation of various descriptors for each interest point. [sent-43, score-0.617]
19 Hence, when an unlabeled, unseen video is presented, similar descriptors are extracted as mentioned above and presented to a classifier for labeling. [sent-46, score-0.413]
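As a concrete illustration of the space-time pipeline sketched above (extract interest-point descriptors, quantize them into visual words, and feed a per-video histogram to a classifier), the following is a minimal sketch. It is not the authors' implementation: the use of scikit-learn's KMeans and LinearSVC, the vocabulary size, and the function names are assumptions made for the example.

```python
# Minimal sketch of the generic space-time pipeline: quantize per-video
# interest-point descriptors into a bag-of-words histogram, then classify.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histogram(descriptors, codebook):
    """Map an (n_points, dim) descriptor array to a normalized word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_spacetime_classifier(train_descriptors, train_labels, n_words=500):
    # Build the visual vocabulary from all training descriptors pooled together.
    codebook = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    codebook.fit(np.vstack(train_descriptors))
    X = np.array([bow_histogram(d, codebook) for d in train_descriptors])
    clf = LinearSVC().fit(X, train_labels)
    return codebook, clf

def classify_video(descriptors, codebook, clf):
    # An unseen video is described the same way and handed to the classifier.
    return clf.predict(bow_histogram(descriptors, codebook)[None, :])[0]
```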
20 2 Sequential Approaches. Sequential approaches represent an activity as an ordered sequence of features; here the goal is to learn the ordering of a specific activity using state-space models. [sent-66, score-0.704]
21 HMMs and other dynamic Bayesian networks (DBNs) are popular state-space models used in activity recognition. [sent-67, score-0.403]
22 If an activity is represented as a set of hidden states, each hidden state can produce a feature at each time frame, known as the observation. [sent-68, score-0.446]
23 HMMs were first applied to activity recognition in 1992 by Yamato et al. [sent-69, score-0.447]
24 They extracted features at each frame of a video by first binarizing the frame and dividing it into (M × N) meshes. [sent-71, score-0.443]
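A minimal sketch of such mesh features is given below. It assumes the per-mesh feature is the fraction of foreground pixels, which is a common reading of Yamato et al.'s features but is not stated in the excerpt above; the grid size and threshold are illustrative.

```python
# Illustrative sketch of mesh features in the spirit of Yamato et al. (1992):
# binarize each frame, split it into an M x N grid of meshes, and use the
# fraction of foreground pixels per mesh as the per-frame feature vector.
import numpy as np

def mesh_features(frame, M=8, N=8, threshold=128):
    binary = (np.asarray(frame, dtype=np.uint8) > threshold).astype(float)
    H, W = binary.shape
    feats = np.empty(M * N)
    for i in range(M):
        for j in range(N):
            cell = binary[i * H // M:(i + 1) * H // M,
                          j * W // N:(j + 1) * W // N]
            feats[i * N + j] = cell.mean() if cell.size else 0.0
    return feats  # one observation vector per frame, e.g., for an HMM
```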
25 Their experiments showed that CHSMM modeled an activity better than the CHMM. [sent-81, score-0.413]
26 3 Hierarchical Approaches. The main idea of hierarchical approaches is to perform recognition of higher-level activities by modeling them as a combination of other simpler activities. [sent-84, score-0.488]
27 The major advantage of these approaches over sequential approaches is their ability to recognize activities with complex structures. [sent-85, score-0.397]
28 The results from this layer are fed into the second layer and used for the actual activity recognition. [sent-89, score-0.454]
29 The lower-layer HMMs classified the video and audio data with a time granularity of less than 1 second, while the higher layer learned typical office activities such as phone conversation, face-to-face conversation, presentation, etc. [sent-92, score-0.758]
30 , 2005) were used to recognize human activities such as a person having a “short-meal”, “snacks” and a “normal meal”. [sent-95, score-0.455]
31 The main difference between the above-mentioned methods and our proposed method is that these approaches assume that the higher-level activities and atomic activities (sub-actions) are known a priori; hence, the parameters of the model can be learned directly based on this notion. [sent-99, score-0.71]
32 The task of activity spotting was therefore reduced to one of performing an optimal search for activities in the video. [sent-107, score-0.913]
33 (2010) introduced a local descriptor of video dynamics based on visual spacetime oriented energy measures. [sent-109, score-0.47]
34 Figure 1: Our general framework abstracts low-level visual features from videos and connects them to poses, motion patterns and classes of activity. [sent-112, score-0.629]
35 (c) Each video segment is modeled as a distribution over motion patterns. [sent-119, score-0.62]
36 The time component is incorporated by modeling the transitions between the video segments, so that a complete video is modeled as a dynamic network of motion patterns. [sent-120, score-0.923]
37 The distributions and transitions of underlying motion patterns in a video determine the final activity label assigned to that video. [sent-121, score-0.949]
38 A Language-Motivated Hierarchical Model for Classification. Our proposed language-motivated hierarchical approach aims to perform recognition of higher-level activities by modeling them as a combination of other simpler activities. [sent-125, score-0.488]
39 The major advantage of this approach over the typical sequential approaches and other hierarchical approaches is its ability to recognize activities with complex structures. [sent-126, score-0.462]
40 An important aspect of this model is that motion patterns are shared across activities. [sent-133, score-0.374]
41 So although the model is generative in structure, it can act discriminatively as it specifically learns which motion patterns are present in each activity. [sent-134, score-0.399]
42 The fact that motion patterns are shared across activities was validated empirically (Messing et al. [sent-135, score-0.674]
43 A given video is broken into motion segments, each comprising either a fixed number of frames or, at the finest level, a single frame. [sent-140, score-0.528]
44 Each motion segment can be represented as a bag of vectorized descriptors (visual words), so that the input to the model at time t is the bag of visual words for motion segment t. [sent-141, score-0.847]
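A minimal sketch of this segment-level representation is shown below: frames are grouped into fixed-length motion segments and each segment becomes a bag of visual-word counts. The segment length, the codebook object, and the function name are illustrative assumptions, not details taken from the paper.

```python
# Sketch: split a video's per-frame descriptors into fixed-length motion
# segments and represent each segment as a bag of visual-word counts.
import numpy as np

def segment_bags(frame_descriptors, codebook, frames_per_segment=5):
    """frame_descriptors: list over frames of (n_i, dim) descriptor arrays."""
    bags = []
    for start in range(0, len(frame_descriptors), frames_per_segment):
        chunk = frame_descriptors[start:start + frames_per_segment]
        counts = np.zeros(codebook.n_clusters, dtype=int)
        nonempty = [d for d in chunk if len(d)]
        if nonempty:
            words = codebook.predict(np.vstack(nonempty))
            counts = np.bincount(words, minlength=codebook.n_clusters)
        bags.append(counts)  # the model's observation x_t for segment t
    return bags
```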
45 This can be interpreted as follows: for each video m in the corpus, a motion pattern indicator z_t is drawn from p(z_t | z_{t-1}, ψ^{c_m}), that is, from Mult(ψ^{c_m}_{z_{t-1}}), where c_m is the class label for video m. [sent-150, score-1.123]
46 That is, for each visual word, a pose indicator y_{t,i} is sampled according to the pattern-specific pose distribution θ_{z_t}, and then the corresponding pose-specific word distribution φ_{y_{t,i}} is used to draw a visual word. [sent-152, score-0.456]
47 The poses φ_y, motion patterns θ_z and transition matrices ψ^j are sampled once for the entire corpus. [sent-153, score-0.377]
48 The joint distribution of all known and hidden variables given the hyperparameters for a video is: p({x_t, y_t, z_t}_{t=1}^{T}, φ, ψ^j, θ | α, β, γ^j) = p(φ|β) p(θ|α) p(ψ^j|γ^j) ∏_{t=1}^{T} ∏_{i} p(x_{t,i} | y_{t,i}) p(y_{t,i} | z_t) p(z_t | z_{t-1}). [sent-154, score-0.496]
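To make the generative story above concrete, the following is a small ancestral-sampling sketch: z_t is drawn from the class-specific transition row, each pose y_{t,i} from the pattern-specific pose distribution, and each visual word x_{t,i} from the pose-specific word distribution. The sizes, hyperparameter values, and the uniform initial-pattern choice are illustrative assumptions, not the paper's settings.

```python
# Sketch of ancestral sampling from the generative process described above:
# z_t ~ Mult(psi[c, z_prev]), y_{t,i} ~ Mult(theta[z_t]), x_{t,i} ~ Mult(phi[y_{t,i}]).
import numpy as np

rng = np.random.default_rng(0)
C, Z, Y, V = 3, 14, 20, 500        # classes, motion patterns, poses, visual words
alpha, beta, gamma = 0.5, 0.1, 0.5  # illustrative Dirichlet hyperparameters

theta = rng.dirichlet([alpha] * Y, size=Z)      # pattern -> pose distribution
phi = rng.dirichlet([beta] * V, size=Y)         # pose -> word distribution
psi = rng.dirichlet([gamma] * Z, size=(C, Z))   # per-class pattern transitions

def generate_video(c, T=10, words_per_segment=30):
    z_prev = rng.integers(Z)                    # initial pattern (uniform here)
    video = []
    for _ in range(T):
        z = rng.choice(Z, p=psi[c, z_prev])     # motion pattern for segment t
        y = rng.choice(Y, p=theta[z], size=words_per_segment)   # pose per word
        x = np.array([rng.choice(V, p=phi[k]) for k in y])      # visual words
        video.append(x)
        z_prev = z
    return video
```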
49 n^{c_m}_{z_{t+1}, z_t, ¬t} denotes the count, over all videos with label c_m, of the times motion pattern z_{t+1} is followed by motion pattern z_t, excluding the token at t. [sent-168, score-0.483]
50 The middle and bottom rows show the fourteen motion patterns discovered and modeled from the poses. [sent-188, score-0.375]
51 The ten simulated dynamic activity classes were the writing of the ten digital digits, 0-9 as shown in Figure 3. [sent-193, score-0.403]
52 An activity class therefore consisted of the steps needed to simulate the writing of each digit and the purpose of the simulation was to visually observe the clusters of motion patterns involved in the activities. [sent-195, score-0.666]
53 A written activity or digit was therefore classified based on the sequences of distributions of these motion patterns over time. [sent-202, score-0.666]
54 The confusion matrix computed from this experiment is given in Figure 5 and a comparison with other activity recognition methods on the Daily Activities data set is given in Table 1. [sent-235, score-0.447]
55 Qualitatively, Figure 7 pictorially illustrates some examples of different activities having the same underlying shared motion patterns. [sent-237, score-0.605]
56 The results show that the approach based on computing a distribution mixture over motion orientations at each spatial location of the video sequence (Benabbas et al. [sent-241, score-0.528]
57 , 2011). Table 1 row labels: video temporal cropping technique; our supervised dynamic hierarchical model; direction of motion features (Benabbas et al. [sent-262, score-0.459]
58 For example, the activity of answering the phone shares a common motion pattern (#85) with the activities of dialing the phone and drinking water. [sent-264, score-1.051]
59 A Language-Motivated Model for Gesture Recognition and Spotting. Few methods have been proposed for gesture spotting; among them is the work of Yuan et al. [sent-269, score-0.41]
60 The task of gesture spotting was therefore reduced to performing an optimal search for gestures in the video. [sent-271, score-0.488]
61 (2010) who introduced a local descriptor of video dynamics based on visual space-time oriented energy measures. [sent-273, score-0.47]
62 Similar to the previous work, their input was also a video in which a specific action was searched for. [sent-274, score-0.4]
63 We therefore propose to develop a probabilistic framework for gesture spotting that can be learned with very little training data and can readily generalize to different environments. [sent-277, score-0.41]
64 Justification: Although the proposed framework is a generative probabilistic model, it performs comparably to state-of-the-art activity recognition techniques, which are typically discriminative in nature, as demonstrated in Tables 2 and 3. [sent-278, score-0.437]
65 An additional benefit of the framework is its usefulness for gesture spotting based on learning from only one, or few training examples. [sent-279, score-0.41]
66 Background: In speech recognition, unconstrained keyword spotting refers to the identification of specific words uttered, when those words are not clearly separated from other words, and no grammar is enforced on the sentence containing them. [sent-280, score-0.415]
67 Our proposed spotting framework uses the Viterbi decoding algorithm and is motivated by the keyword-filler HMM for spotting keywords in continuous speech. [sent-281, score-0.466]
68 Figure 7: Different activities showing shared underlying motion patterns; the shared motion patterns are 85 and 90, amidst other underlying motion patterns shown. [sent-284, score-0.605]
69 These two models are then combined to form a composite filler HMM that is used to annotate speech parts using the Viterbi decoding scheme. [sent-285, score-0.69]
70 In a similar manner, we compute the probabilistic signature for a gesture class, and using the filler model structure, we test for the presence of that gesture within a given video. [sent-288, score-0.382]
71 Figure 8: Plates model for mcHMM showing the relationship between activities or gestures, states and the two channels of observed visual words (VW). [sent-290, score-0.519]
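A simplified reading of the two-channel structure in Figure 8 is sketched below: each time step emits two bags of visual words (for example, a HoG-based and a HoF-based channel), and the two channels are assumed conditionally independent given the state. This is an illustrative sketch, not the authors' exact mcHMM parameterization.

```python
# Forward-pass log-likelihood for a two-channel HMM in which each state emits
# two bags of visual words, assumed conditionally independent given the state.
import numpy as np
from scipy.special import logsumexp

def log_emission(state_word_logprobs, bag):
    """Multinomial log-likelihood of a bag of word counts under one state/channel."""
    return float(bag @ state_word_logprobs)

def mchmm_loglik(pi, A, B1, B2, bags1, bags2):
    """pi: (S,) initial probs, A: (S,S) transitions, B1/B2: (S,V) log word
    probabilities per channel, bags1/bags2: (T,V) word-count matrices."""
    S = len(pi)
    log_alpha = np.log(pi) + np.array(
        [log_emission(B1[s], bags1[0]) + log_emission(B2[s], bags2[0]) for s in range(S)])
    for t in range(1, len(bags1)):
        emit = np.array([log_emission(B1[s], bags1[t]) + log_emission(B2[s], bags2[t])
                         for s in range(S)])
        log_alpha = logsumexp(log_alpha[:, None] + np.log(A), axis=0) + emit
    return logsumexp(log_alpha)
```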
72 Figure 9 shows an example of the stacked mcHMMs involved in the gesture spotting task. [sent-316, score-0.41]
73 The toy example shown in the figure can spot gestures in a test video comprising at most two gestures. [sent-317, score-0.389]
74 Similarly, we can enter the second gesture from s′′ and end at e′, or go directly from s′′ to e′, which handles the case of a video containing only one gesture. [sent-322, score-0.46]
75 The ratio between the likelihood of the Viterbi path that passed through the keyword model and the likelihood of an alternate path that passed through the non-keyword portion was then used to score the occurrence of a keyword, where a keyword here referred to a gesture class. [sent-325, score-0.385]
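The scoring rule just described can be sketched as a log-likelihood ratio between a gesture (keyword) model and the filler model; `viterbi_loglik` below is a stand-in for any HMM Viterbi scorer (such as the mcHMM), and the threshold is an illustrative choice.

```python
# Sketch of the keyword-filler scoring rule: compare the best path through a
# gesture (keyword) model against the filler (non-keyword) alternative and
# report the log of the likelihood ratio.
def spotting_score(observations, gesture_model, filler_model, viterbi_loglik):
    ll_gesture = viterbi_loglik(gesture_model, observations)
    ll_filler = viterbi_loglik(filler_model, observations)
    return ll_gesture - ll_filler      # log-likelihood ratio

def spot_gestures(observations, gesture_models, filler_model, viterbi_loglik, threshold=0.0):
    """Return the gesture classes whose likelihood-ratio score clears the threshold."""
    scores = {name: spotting_score(observations, m, filler_model, viterbi_loglik)
              for name, m in gesture_models.items()}
    return {name: s for name, s in scores.items() if s > threshold}
```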
76 The toy example shown assumes there are at most two activities in any test video, where the first activity is from the set of activities that start from s′ and end at s′′, followed by one from the set that start from s′′ and end at e′. [sent-328, score-1.008]
77 Experiments and Results using mcHMM. In this section, we present our approach to generating visual words, our observations, and the results of applying the proposed mcHMM model to activity classification and gesture spotting, using publicly available benchmark data sets. [sent-330, score-0.72]
78 1 Generating Visual Words. An important step in generating visual words is the extraction of interest points from frames sampled from the videos at 30 fps. [sent-332, score-0.39]
79 For each of these tracks, motion boundary histogram descriptors based on HoG and HoF descriptors were extracted. [sent-335, score-0.453]
80 Because the HMDB data set comprises real-life scenes which contain people and activities occurring at multiple scales, the frame size in the video was repeatedly reduced by a factor of two, and motion boundary descriptors were extracted at multiple scales. [sent-339, score-1.014]
81 We therefore divided the construction of visual words for the HMDB data set into a two-step process in which visual words were first constructed for each activity class separately, and then the visual words obtained for each class were used as the input samples for clustering the final visual words. [sent-354, score-0.973]
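The two-step vocabulary construction described above can be sketched as follows: cluster descriptors within each activity class first, then cluster the pooled per-class centers into the final vocabulary. The cluster counts and the use of scikit-learn's KMeans are illustrative assumptions.

```python
# Sketch of the two-step vocabulary construction: per-class clustering first,
# then a final clustering over the pooled per-class centers.
import numpy as np
from sklearn.cluster import KMeans

def two_stage_codebook(descriptors_by_class, words_per_class=200, final_words=1000):
    per_class_centers = []
    for class_descriptors in descriptors_by_class.values():
        km = KMeans(n_clusters=words_per_class, n_init=4, random_state=0)
        km.fit(np.vstack(class_descriptors))
        per_class_centers.append(km.cluster_centers_)
    # The per-class centers become the input samples for the final clustering.
    final_km = KMeans(n_clusters=final_words, n_init=4, random_state=0)
    final_km.fit(np.vstack(per_class_centers))
    return final_km   # final_km.predict(...) maps a descriptor to a visual word
```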
82 2 Study Performed on the HMDB and KTH Data Sets. In order to compare our framework to other current state-of-the-art methods, we performed activity classification on video sequences created from the KTH database (Schüldt et al. [sent-357, score-0.635]
83 Table 2: Comparison of our proposed model and features for the KTH data set. [sent-370, score-0.396]
84 Table 3 row labels (methods compared, with accuracies in the Accuracy column): best results on 51 activities (original) (Kuehne et al., 2011); proposed mcHMM on 51 activities (original); best results on 10 activities (original) (Kuehne et al., 2011); proposed mcHMM on 10 activities (original); proposed mcHMM on 10 activities (stabilized). [sent-371, score-0.656] [sent-372, score-0.656]
86 HMDB is currently the most realistic database for human activity recognition, comprising 6766 video clips and 51 activities extracted from a wide range of sources such as YouTube, Google videos, digitized movies and other videos available on the Internet. [sent-382, score-1.285]
87 Each split has 70 videos for training and 30 videos for testing for each class. [sent-385, score-0.426]
88 All the videos in the data set are stabilized to remove the camera motion and the authors of the initial paper (Kuehne et al. [sent-386, score-0.416]
89 Table 3 summarizes the performance of the proposed mcHMM method on 51 activities as well as on 10 activities, for both original and stabilized videos. [sent-389, score-0.684]
90 1 Analysis of Results. For both the case of simple actions, as found in the KTH data set, and the case of significantly more complex actions, as found in the HMDB data set, the mcHMM model performs comparably with other methods, outperforming them in the activity recognition task. [sent-392, score-0.595]
91 This suggests that the overall framework (combination of dense descriptors and a state-based probabilistic model) is fairly robust with respect to these low-level video degradations. [sent-394, score-0.387]
92 3 Study Performed on the ChaLearn Gesture Data Set. Lastly, we present our results of gesture spotting on the ChaLearn gesture data set (Cha, 2011). [sent-397, score-0.587]
93 Since the task at hand was gesture spotting via one-shot learning, only one video per class was provided to train each activity (or gesture). [sent-399, score-1.045]
94 Gesture spotting was then performed by creating a spotting network made up of connected mcHMM models, one for each gesture learned, as explained in Section 5. [sent-415, score-0.643]
95 Because the model is generative, we can detect abnormal activities based on low likelihood measures. [sent-444, score-0.385]
96 This framework was validated by its comparable performance on tests performed on the daily activities data set, a naturalistic data set involving everyday activities in the home. [sent-445, score-0.702]
97 An additional benefit of this framework was its usefulness for gesture spotting based on learning from only one, or few training examples. [sent-447, score-0.41]
98 The use of auto-detected video segments could prove useful both in activity classification and gesture spotting. [sent-451, score-0.812]
99 HMDB: a large video database for human motion recognition. [sent-561, score-0.586]
100 Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models. [sent-608, score-0.44]
wordName wordTfidf (topN-words)
[('activity', 0.352), ('activities', 0.328), ('video', 0.283), ('motion', 0.245), ('mchmm', 0.242), ('spotting', 0.233), ('gesture', 0.177), ('zt', 0.166), ('videos', 0.143), ('visual', 0.132), ('hmdb', 0.127), ('action', 0.117), ('algireddy', 0.117), ('ovindaraju', 0.117), ('wogu', 0.117), ('chalearn', 0.113), ('hmm', 0.111), ('cm', 0.11), ('anguage', 0.107), ('otivated', 0.107), ('descriptors', 0.104), ('mcmclda', 0.098), ('recognition', 0.095), ('pproach', 0.092), ('laptev', 0.09), ('keyword', 0.09), ('ecognition', 0.083), ('viterbi', 0.083), ('lda', 0.079), ('gestures', 0.078), ('nz', 0.075), ('vision', 0.074), ('patterns', 0.069), ('recognize', 0.069), ('nzt', 0.068), ('hierarchical', 0.065), ('poses', 0.063), ('modeled', 0.061), ('pose', 0.061), ('kuehne', 0.059), ('human', 0.058), ('generative', 0.057), ('descriptor', 0.055), ('layer', 0.051), ('frames', 0.051), ('dynamic', 0.051), ('hof', 0.05), ('mult', 0.05), ('benabbas', 0.049), ('buffalo', 0.049), ('messing', 0.049), ('hidden', 0.047), ('frame', 0.047), ('daily', 0.046), ('actions', 0.046), ('phone', 0.045), ('hmms', 0.042), ('features', 0.04), ('ller', 0.039), ('edit', 0.039), ('matikainen', 0.039), ('token', 0.039), ('hog', 0.039), ('kth', 0.038), ('enhancement', 0.038), ('pattern', 0.036), ('sampler', 0.035), ('word', 0.034), ('banana', 0.033), ('interest', 0.033), ('shared', 0.032), ('words', 0.031), ('gibbs', 0.031), ('segment', 0.031), ('competition', 0.031), ('layers', 0.031), ('speech', 0.03), ('temporal', 0.03), ('ser', 0.03), ('bqt', 0.029), ('derpanis', 0.029), ('hospedales', 0.029), ('mesh', 0.029), ('plates', 0.029), ('pproaches', 0.029), ('putative', 0.029), ('rochester', 0.029), ('bayesian', 0.029), ('model', 0.028), ('comparably', 0.028), ('oliver', 0.028), ('stabilized', 0.028), ('comprised', 0.028), ('recognizing', 0.028), ('atomic', 0.026), ('extracted', 0.026), ('motions', 0.026), ('gong', 0.026), ('augmented', 0.026), ('excluding', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
2 0.35482365 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
3 0.26250738 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner
Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition
5 0.094452985 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos
Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification
6 0.085897371 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
7 0.083517499 82 jmlr-2013-Optimally Fuzzy Temporal Memory
8 0.076537259 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
9 0.06093419 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
10 0.053398523 108 jmlr-2013-Stochastic Variational Inference
11 0.047983356 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
12 0.046005454 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
13 0.038666923 15 jmlr-2013-Bayesian Canonical Correlation Analysis
15 0.035104673 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
16 0.03371039 121 jmlr-2013-Variational Inference in Nonconjugate Models
17 0.032108683 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
18 0.029869296 22 jmlr-2013-Classifying With Confidence From Incomplete Information
19 0.027631862 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
20 0.026234828 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
topicId topicWeight
[(0, -0.182), (1, -0.086), (2, -0.612), (3, -0.042), (4, -0.018), (5, -0.156), (6, 0.055), (7, -0.019), (8, 0.032), (9, 0.067), (10, -0.053), (11, 0.082), (12, -0.003), (13, -0.018), (14, 0.023), (15, 0.047), (16, -0.008), (17, 0.006), (18, -0.002), (19, 0.013), (20, 0.014), (21, 0.059), (22, -0.037), (23, -0.009), (24, 0.02), (25, -0.032), (26, -0.029), (27, -0.008), (28, 0.021), (29, 0.003), (30, -0.017), (31, -0.041), (32, 0.037), (33, 0.031), (34, 0.031), (35, -0.015), (36, 0.022), (37, 0.032), (38, -0.021), (39, -0.043), (40, -0.087), (41, 0.022), (42, 0.068), (43, -0.044), (44, 0.001), (45, -0.006), (46, -0.002), (47, -0.032), (48, 0.017), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.96999019 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
2 0.905038 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
3 0.87822926 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
6 0.36709231 82 jmlr-2013-Optimally Fuzzy Temporal Memory
7 0.30901077 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
8 0.30560511 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
9 0.26258856 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
10 0.24963954 15 jmlr-2013-Bayesian Canonical Correlation Analysis
11 0.21135046 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
12 0.19316573 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
13 0.18035901 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
15 0.15918866 22 jmlr-2013-Classifying With Confidence From Incomplete Information
16 0.15417282 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
17 0.14955202 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
18 0.14562085 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
19 0.13838083 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
20 0.13615739 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
topicId topicWeight
[(0, 0.015), (5, 0.073), (6, 0.026), (10, 0.063), (20, 0.506), (23, 0.13), (44, 0.012), (53, 0.022), (68, 0.015), (70, 0.014), (75, 0.03), (85, 0.012)]
simIndex simValue paperId paperTitle
1 0.95911032 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules
Author: Ansaf Salleb-Aouissi, Christel Vrain, Cyril Nortet, Xiangrong Kong, Vivek Rathod, Daniel Cassard
Abstract: In this paper, we propose Q UANT M INER, a mining quantitative association rules system. This system is based on a genetic algorithm that dynamically discovers “good” intervals in association rules by optimizing both the support and the confidence. The experiments on real and artificial databases have shown the usefulness of Q UANT M INER as an interactive, exploratory data mining tool. Keywords: association rules, numerical and categorical attributes, unsupervised discretization, genetic algorithm, simulated annealing
same-paper 2 0.85570115 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
3 0.62890744 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos
Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm
4 0.41638136 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
5 0.3958306 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
7 0.34685013 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
8 0.3403345 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
9 0.31862429 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
10 0.31277296 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
11 0.30876556 104 jmlr-2013-Sparse Single-Index Model
12 0.3028352 82 jmlr-2013-Optimally Fuzzy Temporal Memory
13 0.30281767 95 jmlr-2013-Ranking Forests
14 0.2989881 33 jmlr-2013-Dimension Independent Similarity Computation
15 0.29394037 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
16 0.29224366 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
17 0.29211041 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
18 0.28635231 22 jmlr-2013-Classifying With Confidence From Incomplete Information
19 0.28565452 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
20 0.28446224 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut