jmlr jmlr2013 jmlr2013-80 knowledge-graph by maker-knowledge-mining

80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features


Source: pdf

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 , Ltd, Beijing, 100083, China. Editors: Isabelle Guyon and Vassilis Athitsos. Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. [sent-10, score-0.781]

2 For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. [sent-11, score-0.313]

3 The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). [sent-16, score-0.654]

4 Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP) 1. [sent-17, score-0.71]

5 Introduction Human gestures frequently provide a natural and intuitive communication modality in daily life, and the techniques of gesture recognition can be widely applied in many areas, such as human computer interaction (HCI) (Pavlovic et al. [sent-18, score-0.502]

6 To model gesture signals and achieve acceptable recognition performance, the most common approaches are to use Hidden Markov Models (HMMs) or its variants (Kim et al. [sent-26, score-0.441]

7 Vogler (2003) presented a parallel HMM algorithm to model gesture components and can recognize continuous gestures in sentences. [sent-33, score-0.388]

8 They used HCRF to recognize gestures and showed that HCRF can achieve better performance. [sent-49, score-0.441]

9 Another important approach, widely used in gesture recognition, is dynamic time warping (DTW). [sent-54, score-0.327]

10 Early DTW-based methods were applied to isolated gesture recognition (Corradini, 2001; Lichtenauer et al. [sent-55, score-0.441]

11 Besides these methods, other approaches are also widely used for gesture recognition, such as linguistic sub-units (Cooper et al. [sent-60, score-0.327]

12 , 2009) have become an important branch for gesture recognition. [sent-72, score-0.327]

13 Dardas and Georganas (2011) proposed a method for real-time hand gesture recognition based on the standard BoF model, but they first needed to detect and track hands, which is difficult against a cluttered background. [sent-73, score-0.486]

14 were calculated by optical flow (Lowe, 2004); a codebook was learned using the hierarchical k-means algorithm, and the test gesture sequence was then matched against the database using a term frequency-inverse document frequency (tf-idf) weighting scheme. [sent-87, score-0.646]

15 However, in this paper, we explore one-shot learning gesture recognition (Malgireddy et al. [sent-89, score-0.441]

16 Some important challenging issues for one-shot learning gesture recognition are the following: 1. [sent-91, score-0.441]

17 Second, BoF is a modular system with three parts, namely, i) spatio-temporal feature extraction, ii) codebook learning and descriptor coding, iii) classifier, each of which can be easily replaced with different methods. [sent-106, score-0.417]

18 In this paper, we focus on solving these two challenging issues and propose a new approach to achieve good performance for one-shot learning gesture recognition. [sent-109, score-0.327]

19 • Obtained high-ranking results on the ChaLearn gesture challenge. [sent-116, score-0.327]

20 1 Traditional Bag of Feature (BoF) Model Figure 1(a) illustrates the traditional BoF approach for gesture (or action) recognition. [sent-125, score-0.355]

21 In the training part, after extracting local features from training videos, the visual codebook is learned with the k-means algorithm. [sent-126, score-0.363]

22 In the testing stage, the features are extracted from a new input video, and then those features are mapped into a histogram vector by the descriptor coding method (e. [sent-129, score-0.482]
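
As a minimal sketch of this testing-stage pooling and the NN decision (hypothetical helper names in Python with numpy, not the paper's implementation): a coder maps each descriptor to an M-dimensional code, the codes are pooled into one histogram per video, and the test histogram is matched against the one-histogram-per-class training set.

```python
import numpy as np

def video_histogram(descriptors, codebook, coder):
    """Pool the per-descriptor codes produced by any coder (VQ, SOMP, ...)
    into one fixed-length vector representing the whole gesture video."""
    codes = np.stack([coder(x, codebook) for x in descriptors])  # shape (N, M)
    hist = codes.mean(axis=0)
    return hist / (np.linalg.norm(hist) + 1e-12)                 # L2-normalized

def nn_classify(test_hist, train_hists, labels):
    """Nearest-neighbour decision over the one-per-class training histograms."""
    dists = np.linalg.norm(np.asarray(train_hists) - test_hist, axis=1)
    return labels[int(np.argmin(dists))]
```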

23 First, there is only one training sample per gesture class, while dozens or hundreds of training samples per class are provided in the traditional BoF model. [sent-134, score-0.411]

24 Next, a local extremum detected from the difference-of-Gaussians (DoG) pyramids can only become an interest point if it has sufficient motion in the optical flow pyramid. [sent-158, score-0.48]

25 Finally, following the SIFT descriptor calculation, the MoSIFT descriptors are computed from the Gaussian pyramid and the optical flow pyramid, respectively, so that each MoSIFT descriptor has 256 dimensions. [sent-159, score-0.838]

26 Then we use the k-means algorithm to learn the codebook and apply the SOMP algorithm for descriptor coding. [sent-180, score-0.371]

27 Given a gesture sample consisting of two videos (an RGB video and a depth video), a Gaussian pyramid for every grayscale frame (converted from the RGB frame) and a depth Gaussian pyramid for every depth frame can be built via Equation (1). [sent-202, score-1.114]

28 Figure 3 shows two Gaussian pyramids (L^{I_t}, L^{I_{t+1}}) built from two consecutive grayscale frames and two depth Gaussian pyramids (L^{D_t}, L^{D_{t+1}}) built from the corresponding depth frames. [sent-209, score-0.501]
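
A simplified sketch of building the grayscale and depth pyramids for one time step, assuming OpenCV; the SIFT-style octave/interval structure of Equation (1) is collapsed into plain blur-and-downsample levels here, which is a simplification rather than the paper's exact construction.

```python
import cv2
import numpy as np

def gaussian_pyramid(image, num_levels=4, sigma=1.6):
    """Blur-and-downsample pyramid; a simplification of the octave/interval
    pyramid of Equation (1)."""
    levels = [cv2.GaussianBlur(image, (0, 0), sigma)]
    for _ in range(1, num_levels):
        down = cv2.pyrDown(levels[-1])                       # halve the resolution
        levels.append(cv2.GaussianBlur(down, (0, 0), sigma))
    return levels

def build_frame_pyramids(rgb_frame, depth_frame):
    """Grayscale pyramid L^{I_t} and depth pyramid L^{D_t} for one time step."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_BGR2GRAY)
    return gaussian_pyramid(gray), gaussian_pyramid(depth_frame.astype(np.float32))
```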

29 2555 WAN , RUAN , D ENG AND L I Figure 3: Building Gaussian pyramids and depth Gaussian pyramids for two consecutive frames. [sent-220, score-0.343]

30 (a) the Gaussian pyramid L^{I_t} at time t; (b) the Gaussian pyramid L^{I_{t+1}} at time t + 1; (c) the depth Gaussian pyramid L^{D_t} at time t; (d) the depth Gaussian pyramid L^{D_{t+1}} at time t + 1. [sent-221, score-0.758]

31 [V_x V_y]^T = ⋃_{i=1}^{ζ} [v_x^{ρ_i} v_y^{ρ_i}]^T, (3) where ζ is the number of points in the image F1, v_x^{ρ_i} (v_y^{ρ_i}) denotes the horizontal (vertical) velocity of the point ρ_i, and V_x (V_y) denotes the horizontal (vertical) component of the estimated optical flow for all the points in an image. [sent-242, score-0.317]

32 For example, we use the Gaussian pyramids in Figure 3(a) and (b) to compute the optical flow pyramid via Equation (4). [sent-247, score-0.363]
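
A hedged sketch of this flow-pyramid step: dense optical flow is computed level by level between the two grayscale pyramids. The paper builds the flow pyramid from the Lucas-Kanade formulation; OpenCV's Farneback dense flow is used below only as a convenient stand-in.

```python
import cv2
import numpy as np

def optical_flow_pyramid(pyr_t, pyr_t1):
    """Per-level dense optical flow between two consecutive grayscale pyramids.
    Returns the horizontal (V_x) and vertical (V_y) component pyramids."""
    vx_pyr, vy_pyr = [], []
    for prev_level, next_level in zip(pyr_t, pyr_t1):
        # args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(
            prev_level.astype(np.uint8), next_level.astype(np.uint8), None,
            0.5, 1, 15, 3, 5, 1.2, 0)
        vx_pyr.append(flow[..., 0])   # horizontal velocity per pixel
        vy_pyr.append(flow[..., 1])   # vertical velocity per pixel
    return vx_pyr, vy_pyr
```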

33 Those local extrema can only become interest points if they have sufficient motion in the optical flow pyramid. [sent-259, score-0.351]

34 Other extrema are eliminated because they do not have sufficient motion in the optical flow pyramids. [sent-263, score-0.321]

35 2558 O NE - SHOT L EARNING G ESTURE R ECOGNITION FROM RGB-D DATA U SING BAG OF F EATURES Figure 5: The horizontal and vertical optical flow pyramids are calculated from Figure 3(a) and (b). [sent-272, score-0.325]

36 (a) The horizontal component of the estimated optical flow pyramid V_x^{I_t} at time t; (b) The vertical component of the estimated optical flow pyramid V_y^{I_t} at time t; (c) The depth changing component V_z^{D_t} at time t. [sent-273, score-0.624]

37 For a given point p1 from an image in different scale spaces at time t, we can easily obtain the horizontal and vertical velocities v_x, v_y from the corresponding images of the pyramids V_x^{I_t}, V_y^{I_t}. [sent-283, score-0.441]

38 We can see that the highlighted parts accurately occur in the gesture motion region. [sent-287, score-0.502]

39 That is to say, interest point detection must simultaneously satisfy the condition in Equation (5) and a new condition defined as: v_z ≥ β_2 × √(w^2 + h^2), (7) where v_z is the depth-changing value of a point from the depth-changing pyramid V_z and β_2 is a predefined threshold. [sent-289, score-0.498]
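
A rough sketch of the combined interest-point test: a DoG extremum survives only if it has sufficient 2D motion and sufficient depth change. The exact form of the Equation (5) motion condition is assumed here to mirror Equation (7), and the candidate list, thresholds, and per-level maps are illustrative.

```python
import numpy as np

def filter_interest_points(candidates, vx, vy, vz, frame_w, frame_h,
                           beta1=0.005, beta2=0.005):
    """Keep DoG extrema with sufficient optical-flow and depth-change magnitude.

    `candidates` is a list of (row, col) DoG extrema at one pyramid level;
    vx, vy are the flow components and vz the depth-changing map at that level.
    Both thresholds are scaled by the frame diagonal, mirroring Equation (7)."""
    diag = np.sqrt(frame_w ** 2 + frame_h ** 2)
    keep = []
    for (r, c) in candidates:
        motion = np.sqrt(vx[r, c] ** 2 + vy[r, c] ** 2)   # assumed Equation (5) form
        depth_change = abs(vz[r, c])                      # Equation (7)
        if motion >= beta1 * diag and depth_change >= beta2 * diag:
            keep.append((r, c))
    return keep
```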

40 To calculate the feature descriptors, we first extract the local patches (Γ1 ∼ Γ5) around the detected point in five pyramids (L^{I_t}, L^{D_t}, V_x^{I_t}, V_y^{I_t} and V_z^{D_t}), where Γ1 is extracted from L^{I_t}_{0,1}, Γ2 from L^{D_t}_{0,1}, Γ3 from V^{I_t}_{x,(0,1)}, Γ4 from V^{I_t}_{y,(0,1)} and Γ5 from V^{D_t}_{z,(0,1)}. [sent-300, score-0.341]

41 Figure 8: Computing the feature descriptor in two parts: (a) 3D Gradient Space, (b) 3D Motion Space, (c) Feature descriptor calculation over 16 (4 × 4) grids. [sent-318, score-0.41]

42 Similar to the descriptor calculation in 3D gradient space, we can compute the magnitude and orientation (using v_x, v_y, v_z) for the local patch around the detected points in three planes. [sent-328, score-0.471]

43 Finally, we concatenate these two descriptor vectors into a single descriptor vector with 768 dimensions. [sent-331, score-0.364]
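
To make the 768-dimensional layout concrete: each of the two spaces (3D gradient and 3D motion) is projected onto three planes, and each plane yields a SIFT-like 4 × 4 × 8 = 128-bin histogram, giving 3 × 128 = 384 per space and 768 in total. The sketch below computes one such 128-bin plane histogram from a magnitude/orientation pair; how the planes are derived from the patches Γ1 ∼ Γ5 is omitted, and the helper is an assumption, not the paper's code.

```python
import numpy as np

def plane_histogram(magnitude, orientation, grid=4, bins=8):
    """SIFT-like histogram for one plane: a grid x grid spatial layout with
    `bins` orientation bins per cell -> 4*4*8 = 128 values."""
    h, w = magnitude.shape
    hist = np.zeros((grid, grid, bins), dtype=np.float32)
    cell_h, cell_w = h / grid, w / grid
    bin_width = 2 * np.pi / bins
    for r in range(h):
        for c in range(w):
            gr = min(int(r / cell_h), grid - 1)
            gc = min(int(c / cell_w), grid - 1)
            b = int((orientation[r, c] % (2 * np.pi)) / bin_width) % bins
            hist[gr, gc, b] += magnitude[r, c]
    return hist.ravel()                     # 128-dimensional plane descriptor

# One 3D EMoSIFT descriptor: 3 planes per space, 2 spaces -> 6 x 128 = 768 dims, e.g.
# descriptor = np.concatenate([plane_histogram(m, o) for (m, o) in six_planes])
```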

44 Besides, compared to other similar features (SIFT, MoSIFT, 3D MoSIFT), the new features can capture more compact motion patterns and are not sensitive to slight motion (see Figure 7). [sent-337, score-0.519]

45 For a given sample including an RGB video and a depth video, we can calculate feature descriptors between two consecutive frames. [sent-338, score-0.358]

46 To do that, we will create histograms counting how many times a descriptor vector (representing a feature) appears at interest points anywhere in the video clip representing the gesture. [sent-342, score-0.363]

47 The coding methods map each descriptor into an M-dimensional code to generate the video representation. [sent-364, score-0.361]

48 1 Codebook Learning: Let η denote the number of gesture classes (that means there are η training samples for one-shot learning), Ω = [X^1, X^2, . [sent-368, score-0.355]

49 , X^η], Ω ∈ ℜ^{d×L_tr} is the set of all the descriptor vectors extracted from all the training samples, X^i ∈ ℜ^{d×N_i} with N_i descriptor vectors is the set extracted from the ith class, and L_tr = ∑_{i=1}^{η} N_i is the number of features extracted from all the training samples. [sent-371, score-0.63]
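
A minimal sketch of the codebook-learning step, assuming scikit-learn's KMeans as a stand-in for the paper's k-means implementation (descriptors are stored as rows here rather than columns).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(class_descriptor_sets, num_words):
    """Cluster Omega = [X^1, ..., X^eta], the descriptors of the eta one-shot
    training samples, into M visual words; the rows of the result form B."""
    omega = np.vstack(class_descriptor_sets)                  # shape (L_tr, d)
    km = KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(omega)
    return km.cluster_centers_                                # codebook B, shape (M, d)
```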

50 2 Coding Descriptors by VQ: In the traditional VQ method, we can calculate the Euclidean distance between a given descriptor x ∈ ℜ^d and every codeword b_i ∈ ℜ^d of the codebook B and find the closest codeword. [sent-380, score-0.428]
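
A short sketch of VQ coding as described here: a one-hot assignment to the nearest codeword (rows of `codebook` are the codewords, an assumed layout).

```python
import numpy as np

def vq_code(x, codebook):
    """Vector quantization: an M-dimensional one-hot code that activates only
    the codeword b_i closest to the descriptor x."""
    dists = np.linalg.norm(codebook - x, axis=1)   # Euclidean distance to every b_i
    code = np.zeros(codebook.shape[0], dtype=np.float32)
    code[int(np.argmin(dists))] = 1.0
    return code
```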

51 To the best of our knowledge, we are the first to use SOMP in BoF model for gesture recognition, especially for one-shot learning gesture recognition. [sent-408, score-0.654]

52 h = (1/N) ∑_{i=1}^{N} c_i, (11) where c_i ∈ ℜ^M is the ith descriptor of C ∈ ℜ^{M×N}, N is the total number of descriptors extracted from a sample, and h ∈ ℜ^M. [sent-452, score-0.382]
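
A hedged stand-in for the SOMP step together with the Equation (11) pooling: the paper solves min_C ||X − BC||_F^2 with at most k nonzero coefficients per column; the sketch below greedily codes each descriptor independently with plain OMP, which is a simplification of the batch SOMP, and then averages the codes into the coefficient histogram.

```python
import numpy as np

def omp_code(x, codebook, k=10):
    """Greedy sparse code of descriptor x over the codebook (rows = codewords).
    A per-descriptor simplification of the batch SOMP used in the paper."""
    B = codebook.T                               # (d, M): columns are codewords
    residual, support = x.astype(np.float64).copy(), []
    code = np.zeros(B.shape[1])
    for _ in range(k):
        correlations = np.abs(B.T @ residual)
        correlations[support] = -np.inf          # never reselect a chosen codeword
        support.append(int(np.argmax(correlations)))
        coeffs, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ coeffs    # refit on the current support
    code[support] = coeffs
    return code

def coefficient_histogram(descriptors, codebook, k=10):
    """Equation (11): average the sparse codes of all descriptors in a sample."""
    codes = np.stack([omp_code(x, codebook, k) for x in descriptors])
    return codes.mean(axis=0)
```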

53 So we select the NN classification for gesture recognition. [sent-454, score-0.327]

54 In the above discussion, we assume that every video contains one gesture, but this assumption does not hold for a continuous gesture recognition system. [sent-455, score-0.871]

55 Therefore, we first apply DTW to achieve temporal gesture segmentation, which splits a multi-gesture video into the individual gestures to be recognized. [sent-456, score-0.421]

56 We use the sample DTW code provided on the ChaLearn gesture challenge website (http://gesture. [sent-457, score-0.327]

57 We briefly introduce the process for temporal gesture segmentation by DTW so as to make this paper more self-contained. [sent-462, score-0.386]

58 A video is represented by a set of motion features obtained from difference images as follows. [sent-468, score-0.347]

59 Therefore, a video V with N frames is represented by a matrix (the set of motion features) f_V ∈ ℜ^{9×(N−1)}. [sent-478, score-0.319]

60 We calculate the negative Euclidean distance between each entry (a motion feature) from Ftr and each entry (a motion feature) from Fte . [sent-485, score-0.379]

61 In Figure 11, the left gray image shows the set of motion features (Ftr ) as the reference sequence calculated from training videos. [sent-487, score-0.364]
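
A sketch of the DTW segmentation ingredients: 9-dimensional motion features from difference images over a 3 × 3 grid (the per-cell statistic is an assumption; the challenge sample code may differ in detail), and the accumulated DTW score over the negative-Euclidean-distance matrix. Boundary handling for extracting the actual cut points is omitted.

```python
import numpy as np

def difference_motion_features(frames, grid=3):
    """9-dimensional motion feature per frame transition: mean absolute
    frame difference inside each cell of a 3x3 grid (assumed statistic)."""
    feats = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
        h, w = diff.shape[:2]
        cells = [diff[r * h // grid:(r + 1) * h // grid,
                      c * w // grid:(c + 1) * w // grid].mean()
                 for r in range(grid) for c in range(grid)]
        feats.append(cells)
    return np.asarray(feats).T            # shape (9, N-1), i.e., f_V

def dtw_score(f_train, f_test):
    """Accumulated DTW score over the negative-Euclidean-distance matrix
    between the reference sequence F_tr and the test sequence F_te."""
    n, m = f_train.shape[1], f_test.shape[1]
    local = np.array([[-np.linalg.norm(f_train[:, i] - f_test[:, j])
                       for j in range(m)] for i in range(n)])
    acc = np.full((n + 1, m + 1), -np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + max(acc[i - 1, j],
                                                  acc[i, j - 1],
                                                  acc[i - 1, j - 1])
    return acc[1:, 1:]                    # backtracking on this grid yields the cuts
```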

62 5 Overview of the Proposed Approach: In this section, we describe the proposed approach based on bag of 3D EMoSIFT features for one-shot learning gesture recognition in detail. [sent-492, score-0.603]

63 In the recognition stage, there are five steps: temporal gesture segmentation by DTW, feature descriptor extraction using 3D EMoSIFT, descriptor coding via SOMP, coefficient histogram calculation, and recognition via the NN classifier. [sent-493, score-1.165]
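
Tying the five recognition steps together, here is a sketch of the test-time loop; every helper passed in (dtw_segment, extract_3d_emosift, code_and_pool) is a hypothetical stand-in for the components sketched above, not the authors' code.

```python
import numpy as np

def recognize_sample(rgb_video, depth_video, dtw_segment, extract_3d_emosift,
                     code_and_pool, train_hists, labels):
    """One-shot recognition of a possibly multi-gesture RGB-D sample:
    DTW segmentation -> 3D EMoSIFT descriptors -> coding + coefficient
    histogram -> nearest-neighbour label for each segmented clip."""
    predictions = []
    for rgb_clip, depth_clip in dtw_segment(rgb_video, depth_video):
        descriptors = extract_3d_emosift(rgb_clip, depth_clip)   # (N, 768) array
        hist = code_and_pool(descriptors)                        # Equation (11) pooling
        dists = np.linalg.norm(np.asarray(train_hists) - hist, axis=1)
        predictions.append(labels[int(np.argmin(dists))])
    return predictions
```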

64 Figure 11: Temporal gesture segmentation by DTW. [sent-495, score-0.353]

65 Experimental Results: This section summarizes our results and demonstrates that the proposed method is well suited for one-shot learning gesture recognition. [sent-497, score-0.327]

66 Each batch is made of 47 gesture videos and split into a training set and a test set. [sent-504, score-0.388]

67 Detailed descriptions of the gesture data can be found in Guyon et al. [sent-507, score-0.327]

68 , h_r^K] via Equation (11) (computed in the training stage). • A test sample (RGB-D data): te. Output: • The recognition results: class. 1: Initialization: class = [ ]. 2: Temporal gesture segmentation: [te_1, te_2, . [sent-522, score-0.469]

69 min_C ||X_te − BC||_F^2 s.t. ||c_j||_0 ≤ k, ∀ j. 6: Calculate the coefficient histogram h_te via Equation (11). 7: Recognition: tmp_class = nn_classify(H_r, h_te). 8: class = [class tmp_class]. 9: end for. 10: return class. Figure 12: Some samples from the ChaLearn gesture database. [sent-527, score-0.366]

70 In our case, the strings contain the gesture labels detected in each sample. [sent-529, score-0.385]

71 β1 and β2 determine the detection of interest points based on motion and depth change. [sent-546, score-0.326]

72 If a given codebook size M is too large, it may cause over-clustering on some batches where the number of features is relatively small (e. [sent-572, score-0.318]

73 1, the corresponding mean codebook size 1440 is much smaller than the given codebook size 3500, which is from the best result in Table 3. [sent-587, score-0.378]

74 It shows MLD scores for different spatio-temporal features with different values of γ, where (R) means the features are extracted from the RGB video, and (R+D) means the features are extracted from both the RGB and depth videos. [sent-756, score-0.422]

75 However, those features may not be sufficient to capture the distinctive motion pattern from RGB data alone, because there is only one training sample per class. [sent-761, score-0.327]

76 That is because the descriptors captured by MoSIFT are simply calculated from RGB data while 3D MoSIFT and 3D EMoSIFT construct 3D gradient and 3D motion space from the local patch around each interest point by fusing RGB-D data. [sent-763, score-0.339]

77 To show the distinctive views for both 3D MoSIFT and 3D EMoSIFT features, we record three gesture classes: clapping, pointing and waving. [sent-764, score-0.357]

78 Then we use the 3D MoSIFT and 3D EMoSIFT features extracted from the three training samples to generate a codebook with 20 visual words for each feature type. [sent-767, score-0.382]

79 From the above discussions, we see that 3D EMoSIFT is suitable for one-shot learning gesture recognition. [sent-789, score-0.327]

80 In this section, we separately evaluate these two components and determine which component is more essential to gesture recognition. [sent-850, score-0.327]

81 We randomly select a sample from the ChaLearn gesture database and test the average time with C++ programs and the OpenCV library (Bradski, 2000) on a standard personal computer (CPU: 3. [sent-865, score-0.327]

82 The results are reported in Table 8, where the principal motion method (Escalante and Guyon, 2012) is the baseline method and DTW is an optional method in the ChaLearn gesture challenge (round 2). [sent-873, score-0.502]

83 Table 8 (methods column): motion signature analysis, HMM+HOGHOF, BoF+3D MoSIFT, principal motion, DTW, CRF, HCRF, LDCRF, our method; scores on the validation set (01 ∼ 20). [sent-881, score-0.35]

84 Table 8 (team names column): Alfnie, Turtle Tamers, Joewan. Table 8: Results of different methods on the ChaLearn gesture data set. [sent-899, score-0.327]

85 Those motion features extracted from training videos are used to train CRF-based models. [sent-906, score-0.352]

86 That is because the simple motion features may not be distinctive enough to represent the gesture pattern. [sent-914, score-0.571]

87 Conclusion In this paper, we propose a unified framework based on bag of features for one-shot learning gesture recognition. [sent-916, score-0.489]

88 Additionally, 3D EMoSIFT features are scale and rotation invariant and can capture more compact and richer video representations even though there is only one training sample for each gesture class. [sent-922, score-0.527]

89 In our future research, we will focus on extending 3D EMoSIFT to extract features from complex backgrounds, especially for one-shot learning gesture recognition. [sent-928, score-0.442]

90 Acknowledgments We appreciate ChaLearn providing the gesture database (http://chalearn. [sent-932, score-0.327]

91 Hand gesture recognition using a real-time tracking method and hidden Markov models. [sent-962, score-0.441]

92 Dynamic time warping for off-line recognition of a small gesture vocabulary. [sent-979, score-0.441]

93 Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. [sent-986, score-0.475]

94 Hand gesture recognition following the dynamics of a topology-preserving network. [sent-1012, score-0.441]

95 Results and analysis of the ChaLearn gesture challenge 2012. [sent-1062, score-0.398]

96 Simultaneous gesture segmentation and recognition based on forward spotting accumulative HMMs. [sent-1085, score-0.467]

97 A fast algorithm for vision-based hand gesture recognition for robot control. [sent-1130, score-0.441]

98 Dynamic hand gesture recognition: An exemplar-based approach from motion divergence fields. [sent-1190, score-0.502]

99 Hand gesture recognition based on dynamic bayesian network framework. [sent-1205, score-0.441]

100 Extraction of 2d motion trajectories and its application to hand gesture recognition. [sent-1301, score-0.502]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mosift', 0.37), ('emosift', 0.335), ('gesture', 0.327), ('somp', 0.278), ('codebook', 0.189), ('descriptor', 0.182), ('motion', 0.175), ('mld', 0.171), ('vq', 0.17), ('wan', 0.149), ('pyramid', 0.146), ('bof', 0.14), ('pyramids', 0.128), ('ruan', 0.128), ('recognition', 0.114), ('video', 0.103), ('esture', 0.097), ('eatures', 0.096), ('sift', 0.094), ('eng', 0.093), ('descriptors', 0.093), ('bag', 0.093), ('optical', 0.089), ('ecognition', 0.088), ('shot', 0.088), ('depth', 0.087), ('coding', 0.076), ('dtw', 0.076), ('vy', 0.076), ('chalearn', 0.071), ('features', 0.069), ('vx', 0.068), ('sing', 0.068), ('rgb', 0.066), ('dog', 0.064), ('gestures', 0.061), ('batches', 0.06), ('detected', 0.058), ('vz', 0.057), ('extrema', 0.057), ('octave', 0.055), ('image', 0.051), ('hcrf', 0.05), ('lit', 0.05), ('morency', 0.05), ('visual', 0.049), ('cuboid', 0.049), ('histograms', 0.048), ('extracted', 0.047), ('feature', 0.046), ('ow', 0.046), ('vision', 0.046), ('detect', 0.045), ('crf', 0.043), ('iy', 0.043), ('octaves', 0.043), ('ldt', 0.043), ('earning', 0.042), ('calculated', 0.041), ('frames', 0.041), ('histogram', 0.039), ('lowe', 0.038), ('doll', 0.038), ('horizontal', 0.037), ('ldcrf', 0.036), ('mres', 0.036), ('velocity', 0.035), ('frame', 0.034), ('detection', 0.034), ('scores', 0.034), ('videos', 0.033), ('patches', 0.033), ('ming', 0.033), ('temporal', 0.033), ('laptev', 0.033), ('slight', 0.031), ('yamato', 0.03), ('grayscale', 0.03), ('ci', 0.03), ('distinctive', 0.03), ('interest', 0.03), ('guyon', 0.03), ('vertical', 0.03), ('orientation', 0.03), ('calculate', 0.029), ('blurred', 0.028), ('ltr', 0.028), ('lucas', 0.028), ('suk', 0.028), ('xy', 0.028), ('training', 0.028), ('traditional', 0.028), ('hmm', 0.027), ('besides', 0.027), ('extraction', 0.026), ('spatial', 0.026), ('segmentation', 0.026), ('pattern', 0.025), ('reconstruction', 0.025), ('dbn', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

2 0.32999665 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

3 0.26250738 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

4 0.17006518 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner

Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition

5 0.090429246 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos

Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification

6 0.064301342 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

7 0.059088316 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

8 0.049476959 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

9 0.049448848 21 jmlr-2013-Classifier Selection using the Predicate Depth

10 0.048605785 76 jmlr-2013-Nonparametric Sparsity and Regularization

11 0.043237671 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

12 0.037767574 22 jmlr-2013-Classifying With Confidence From Incomplete Information

13 0.033418376 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

14 0.03178075 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

15 0.030852772 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

16 0.028146364 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints

17 0.027617572 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

18 0.026752766 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

19 0.026406385 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

20 0.023482112 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.171), (1, -0.036), (2, -0.599), (3, -0.048), (4, 0.035), (5, -0.17), (6, 0.049), (7, -0.026), (8, 0.006), (9, 0.06), (10, -0.086), (11, 0.09), (12, 0.067), (13, 0.023), (14, 0.017), (15, 0.037), (16, 0.007), (17, 0.023), (18, -0.023), (19, -0.02), (20, -0.031), (21, -0.014), (22, 0.067), (23, -0.001), (24, 0.002), (25, 0.075), (26, 0.005), (27, -0.009), (28, -0.018), (29, -0.017), (30, -0.027), (31, 0.038), (32, 0.027), (33, -0.012), (34, -0.063), (35, 0.032), (36, -0.037), (37, 0.005), (38, -0.013), (39, 0.011), (40, 0.036), (41, -0.059), (42, -0.009), (43, 0.059), (44, 0.003), (45, -0.009), (46, -0.006), (47, 0.031), (48, 0.013), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95566708 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

2 0.92599255 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

3 0.86809307 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

Author: Daniel Kyu Hwa Kohlsdorf, Thad E. Starner

Abstract: Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users’ normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an “Everyday Gesture Library” or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC’s effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers. Keywords: gesture recognition, gesture spotting, false positives, continuous recognition

4 0.84446508 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

5 0.440979 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

Author: Anastasios Roussos, Stavros Theodorakis, Vassilis Pitsikalis, Petros Maragos

Abstract: We propose the novel approach of dynamic affine-invariant shape-appearance model (Aff-SAM) and employ it for handshape classification and sign recognition in sign language (SL) videos. AffSAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. We construct SA images representing the hand’s shape and appearance without landmark points. We model the variation of the images by linear combinations of eigenimages followed by affine transformations, accounting for 3D hand pose changes and improving model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level, without requiring training from scratch a new singer-specific model. We rather employ a short development data set to adapt the models for a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. Supplementary evaluations of sign recognition experiments, are conducted on a multi-signer, 100-sign data set, from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification

6 0.22339204 21 jmlr-2013-Classifier Selection using the Predicate Depth

7 0.1965373 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

8 0.18718024 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

9 0.17297399 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

10 0.16355649 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

11 0.15861358 22 jmlr-2013-Classifying With Confidence From Incomplete Information

12 0.14976396 76 jmlr-2013-Nonparametric Sparsity and Regularization

13 0.14922905 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

14 0.14526147 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

15 0.14472398 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

16 0.12878653 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

17 0.12575942 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

18 0.12360755 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

19 0.12131367 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

20 0.12113808 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (2, 0.018), (5, 0.071), (6, 0.027), (10, 0.052), (20, 0.059), (23, 0.171), (62, 0.02), (68, 0.026), (70, 0.016), (75, 0.043), (85, 0.02), (87, 0.015), (89, 0.011), (92, 0.335)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7349354 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

2 0.51595193 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

3 0.496254 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

4 0.47776395 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

5 0.39555681 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

6 0.36629164 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

7 0.36303246 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

8 0.35659295 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

9 0.34617749 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

10 0.33315042 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

11 0.33210027 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

12 0.33167446 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

13 0.32962397 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

14 0.32951587 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

15 0.32764205 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

16 0.32685927 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

17 0.32520056 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

18 0.32477438 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

19 0.32267779 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

20 0.32250166 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems