Abstract: We propose a novel approach, the dynamic affine-invariant shape-appearance model (Aff-SAM), and employ it for handshape classification and sign recognition in sign language (SL) videos. Aff-SAM offers a compact and descriptive representation of hand configurations as well as regularized model fitting, which assists hand tracking and the extraction of handshape features. We construct shape-appearance (SA) images that represent the hand’s shape and appearance without landmark points. We model the variation of these images by linear combinations of eigenimages followed by affine transformations, which accounts for 3D hand pose changes and improves the model’s compactness. We also incorporate static and dynamic handshape priors, offering robustness to occlusions, which occur often in signing. The approach includes an affine signer adaptation component at the visual level that does not require training a new signer-specific model from scratch; instead, we employ a short development data set to adapt the models to a new signer. Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements in handshape classification compared to other feature extraction approaches. Supplementary sign recognition experiments are conducted on a multi-signer, 100-sign data set from the Greek sign language lemmas corpus. These explore the fusion with movement cues as well as the signer adaptation of Aff-SAM to multiple signers, providing promising results. Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification

1 Aff-SAM offers a compact and descriptive representation of hand configurations as well as regularized model-fitting, assisting hand tracking and extracting handshape features. [sent-11, score-0.768]

2 We also incorporate static and dynamic handshape priors, offering robustness in occlusions, which occur often in signing. [sent-14, score-0.718]

3 The approach includes an affine signer adaptation component at the visual level, without requiring a new signer-specific model to be trained from scratch. [sent-15, score-0.382]

4 Experiments on the Boston-University-400 continuous SL corpus demonstrate improvements on handshape classification when compared to other feature extraction approaches. [sent-17, score-0.68]

5 Supplementary evaluations of sign recognition experiments are conducted on a multi-signer, 100-sign data set from the Greek sign language lemmas corpus. [sent-18, score-0.336]

6 These explore the fusion with movement cues as well as signer adaptation of Aff-SAM to multiple signers providing promising results. [sent-19, score-0.484]

7 Keywords: affine-invariant shape-appearance model, landmarks-free shape representation, static and dynamic priors, feature extraction, handshape classification [sent-20, score-0.759]

8 The hand localization and tracking in a sign video as well as the derivation of features that reliably describe the configuration of the signer’s hand are crucial for successful handshape classification. [sent-23, score-0.917]

9 In this article, we propose a novel modeling of the shape and dynamics of the hands during signing that leads to efficient handshape features, employed to train statistical handshape models and finally for handshape classification and sign recognition. [sent-27, score-1.923]

10 After developing a procedure for the training of the Aff-SAM, we design a robust hand tracking system by adopting regularized model fitting that exploits prior information about the handshape and its dynamics. [sent-33, score-0.722]

11 Furthermore, we propose to use as handshape features the Aff-SAM’s eigenimage weights estimated by the fitting process. [sent-34, score-0.659]

12 The overall framework is evaluated and compared to other methods in extensive handshape classification experiments. [sent-36, score-0.579]

13 The experiments are based on manual annotation of handshapes that contain 3D pose parameters and the American Sign Language (ASL) handshape configuration. [sent-38, score-0.837]

14 Many methods, including the one presented here, use skin color segmentation for hand detection (Argyros and Lourakis, 2004; Yang et al. [sent-60, score-0.347]

15 Cui and Weng (2000) and Huang and Jeng (2001) employ motion cues assuming the hand is the only moving object on a stationary background, and that the signer is relatively still. [sent-66, score-0.422]

16 Segmented hand images are usually normalized for size, in-plane orientation, and/or illumination and afterwards principal component analysis (PCA) is often applied for dimensionality reduction and descriptive representation of handshape (Sweeney and Downton, 1996; Birk et al. [sent-91, score-0.724]

17 Closely related to PCA approaches, active shape and active appearance models (Cootes and Taylor, 2004; Matthews and Baker, 2004) are employed for handshape feature extraction and recognition (Ahmad et al. [sent-100, score-0.795]

18 Example frames with extracted skin region masks and assigned bodypart labels H (head), L (left hand), R (right hand). [sent-105, score-0.355]

19 A method earlier employed for action-type features is the histogram of oriented gradients (HOG): these descriptors are used for the handshapes of a signer (Buehler et al. [sent-108, score-0.363]

20 (2011) take advantage of linguistic constraints and exploit them via a Bayesian network to improve handshape recognition accuracy. [sent-114, score-0.655]

21 In the sign recognition experiments of Section 8, we employ the handshape subunits construction presented by Roussos et al. [sent-125, score-0.835]

22 The output of this subsystem at every frame is a set of skin region masks together with one or multiple labels assigned to every region, Figure 1. [sent-136, score-0.334]

23 As presented in Section 4, the framework of SA refines this tracking while extracting handshape features. [sent-140, score-0.648]

24 We consider a Gaussian model of the signer’s skin color in the perceptually uniform color space CIE-Lab, after keeping the two chromaticity components a∗, b∗, to obtain robustness to illumination (Cai and Goshtasby, 1999). [sent-143, score-0.42]

25 We assume that the (a∗, b∗) values of skin pixels follow a bivariate Gaussian distribution p_s(a∗, b∗), which is fitted using a training set of color samples (Figure 2). [sent-144, score-0.356]

26 2 Morphological Processing of Skin Masks. In each frame, a first estimation of the skin mask S0 is derived by thresholding at every pixel x the value p_s(a∗(x), b∗(x)) of the learned skin color distribution, see Figures 2, 3(b). [sent-147, score-0.59]

27 The corresponding threshold is determined so that a given percentage of the training skin color samples is classified as skin. [sent-148, score-0.329]
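
The skin color model and its threshold can be illustrated with a minimal sketch. It assumes plain Python lists of (a∗, b∗) pairs; the function names and the `coverage` parameter (the percentage of training samples to keep) are hypothetical and not taken from the paper.

```python
import math

def fit_gaussian_2d(samples):
    """Fit the mean and covariance of a bivariate Gaussian to (a*, b*) samples."""
    n = len(samples)
    ma = sum(a for a, _ in samples) / n
    mb = sum(b for _, b in samples) / n
    saa = sum((a - ma) ** 2 for a, _ in samples) / n
    sbb = sum((b - mb) ** 2 for _, b in samples) / n
    sab = sum((a - ma) * (b - mb) for a, b in samples) / n
    return (ma, mb), (saa, sab, sbb)

def density(ab, mean, cov):
    """Evaluate the fitted density p_s(a*, b*) at one chromaticity pair."""
    saa, sab, sbb = cov
    det = saa * sbb - sab * sab
    da, db = ab[0] - mean[0], ab[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m2 = (sbb * da * da - 2 * sab * da * db + saa * db * db) / det
    return math.exp(-0.5 * m2) / (2 * math.pi * math.sqrt(det))

def threshold_for_coverage(samples, mean, cov, coverage):
    """Choose the threshold so that at least `coverage` of the samples pass it."""
    values = sorted((density(s, mean, cov) for s in samples), reverse=True)
    k = max(1, math.ceil(coverage * len(values)))
    return values[k - 1]

def skin_mask(ab_image, mean, cov, thr):
    """Threshold the density at every pixel to get a first skin mask estimate."""
    return [[density(px, mean, cov) >= thr for px in row] for row in ab_image]
```

Only the chromaticity pair enters the model here, which mirrors the choice of discarding the lightness component for robustness to illumination.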

28 The skin mask S0 may contain spurious regions or holes inside the head area due to parts with a different color, such as the eyes and mouth. [sent-150, score-0.324]

29 Affine Shape-Appearance Modeling. In this section, we describe the proposed framework of the dynamic affine-invariant shape-appearance model, which offers a descriptive representation of the hand configurations as well as a simultaneous hand tracking and feature extraction process. [sent-188, score-0.324]

30 Therefore, it is more effective to represent the 2D handshape without using any landmarks. [sent-193, score-0.579]

31 We thus represent the handshape by implicitly using its binary mask M, while incorporating also the appearance of the hand, that is, the color values inside this mask. [sent-194, score-0.788]
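
As a rough illustration of this landmark-free representation, the sketch below builds an SA image from a color image and a binary mask. The mapping `g` and the constant `background` value are assumptions made for the example; the text only states that the mask is used implicitly together with the color values inside it.

```python
def sa_image(image, mask, g, background=-1.0):
    """Build a shape-appearance image.

    Pixels inside the mask carry the mapped color value g(color);
    pixels outside the mask carry one constant background value,
    so the mask is encoded implicitly in the image itself.
    """
    return [
        [g(px) if inside else background for px, inside in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

For instance, `sa_image(image, mask, lambda v: v / 255)` would keep normalized intensities inside the hand region and set everything else to the background constant.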

32 3 Training of the SAM Linear Combination. In order to train the model of the hand SA images, we employ a representative set of handshape images from frames where the modeled hand is fully visible and non-occluded. [sent-234, score-0.961]

33 4 Regularized SAM Fitting with Static and Dynamic Priors. After having built the shape-appearance model, we fit it to the frames of an input sign language video in order to track the hand and extract handshape features. [sent-274, score-0.87]

34 In parallel, to achieve robustness against occlusions, we exploit prior information about the handshape and its dynamics. [sent-276, score-0.579]

35 For each non-occluded segment, we start from its middle frame and obtain 1) a forward-direction segment that ends at the middle frame of the next occluded segment and 2) a backward-direction segment that ends after the middle frame of the previous occluded segment. [sent-308, score-0.485]

36 Otherwise, if K(n) = 0, we test as initializations the two similarity transforms that, when applied to the SAM mean image A0 , make its mask have the same centroid, area and orientation as the mask of the current frame’s SA image. [sent-379, score-0.354]
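
The moment-matching initialization described above can be sketched as follows. This is a minimal example on binary masks given as lists of rows; the function names are hypothetical, and reading "the two similarity transforms" as the two rotations that differ by 180 degrees (since the orientation of a mask is only defined modulo 180 degrees) is an assumption.

```python
import math

def mask_moments(mask):
    """Return (centroid, area, orientation) of a binary mask given as rows of 0/1."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    area = len(pts)
    cx = sum(x for x, _ in pts) / area
    cy = sum(y for _, y in pts) / area
    mu20 = sum((x - cx) ** 2 for x, _ in pts) / area
    mu02 = sum((y - cy) ** 2 for _, y in pts) / area
    mu11 = sum((x - cx) * (y - cy) for x, y in pts) / area
    # Principal-axis angle from the second-order central moments.
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return (cx, cy), area, theta

def similarity_candidates(model_mask, frame_mask):
    """Two similarity transforms that match centroid, area and orientation."""
    (mcx, mcy), m_area, m_theta = mask_moments(model_mask)
    (fcx, fcy), f_area, f_theta = mask_moments(frame_mask)
    scale = math.sqrt(f_area / m_area)  # area scales with the square of the scale
    candidates = []
    for flip in (0.0, math.pi):  # assumed 180-degree orientation ambiguity
        rot = f_theta - m_theta + flip
        c, s = scale * math.cos(rot), scale * math.sin(rot)
        # Translation that sends the model centroid onto the frame centroid.
        tx = fcx - (c * mcx - s * mcy)
        ty = fcy - (s * mcx + c * mcy)
        candidates.append({"scale": scale, "rotation": rot, "translation": (tx, ty)})
    return candidates
```

Both candidates would then be tried as starting points for the fitting, keeping whichever leads to the better fit.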

37 In addition, extensive handshape classification experiments were performed in order to evaluate the extracted handshape features employing the proposed Aff-SAM method (see Section 7). [sent-392, score-1.226]

38 1 Skin Color and Normalization. The employed skin color modeling adapts to the characteristics of the skin color of a new signer. [sent-400, score-0.602]

39 Figure 8 illustrates the skin color modeling for the two signers of the GSL lemmas corpus, where we test the adaptation. [sent-401, score-0.414]

40 In addition, the mapping g(I) of skin color values, used to create the SA images, is normalized according to the skin color distribution of each signer. [sent-406, score-0.602]

41 This skin color adaptation makes the body-part label extraction of the visual front-end preprocessing behave robustly over different signers. [sent-408, score-0.432]

42 They thus automatically compensate for the fact that the second signer has thinner hands and longer fingers. [sent-420, score-0.322]

43 3 New Signer Fitting. To process a new signer, the visual front-end is applied as in Section 3. [sent-423, score-0.323]

44 We observe that, despite the anatomical differences of the two signers, (Figure 10: Regularized Shape-Appearance Model fitting on 2 signers; source signer (A), new signer (B).) [sent-447, score-0.554]

45 These concern the pose and handshape configurations and are essential for the supervised classification experiments. [sent-457, score-0.667]

46 1 Handshape Parameters and Annotation. The parameters that need to be specified for the annotation of the data are the (pose-independent) handshape configuration and the 3D hand pose, that is, the orientation of the hand in 3D space. [sent-459, score-0.836]

47 For the annotation of the handshape configurations we followed the SignStream annotation conventions (Neidle, 2007). [sent-460, score-0.753]

48 The adopted annotation parameters are as follows: 1) Handshape identity (HSId) which defines the handshape configuration, that is, (‘A’, ‘B’, ‘1’, ‘C’ etc. [sent-464, score-0.666]

49 2 Data Selection and Classes. We select and annotate a set of occluded and non-occluded handshapes so that 1) they cover substantial handshape and pose variation as observed in the data and 2) they are quite frequent. [sent-474, score-0.811]

50 More specifically we have employed three different data sets (DS): 1) DS-1: 1430 non-occluded handshape instances with 18 different HSIds. [sent-475, score-0.579]

51 2) DS-1-extend: 3000 non-occluded handshape instances with 24 different HSIds. [sent-476, score-0.579]

52 3) DS-2: 4962 occluded and non-occluded handshape instances with 42 different HSIds. [sent-477, score-0.673]

53 Table 1 presents an indicative list of annotated handshape configurations and 3D hand orientation parameters. [sent-478, score-0.703]

54 Handshape Classification Experiments. In this section we present the experimental framework consisting of the statistical system for handshape classification. [sent-480, score-0.579]

55 This is based 1) on the handshape features extracted as described in Section 4; 2) on the annotations as described in Section 6. [sent-481, score-0.615]

56 Table 1: Samples of annotated handshape identities (HSId) and corresponding 3D hand orientation (pose) parameters for the D-HFSBP class dependency and the corresponding experiment; in this case each model is fully dependent on all of the orientation parameters. [sent-489, score-0.843]

57 In each case, we show an example handshape image that is randomly selected among the corresponding handshape instances of the same class. [sent-492, score-1.197]

58 This partitioning samples data among all realizations per handshape class in order to equalize class occurrence. [sent-496, score-0.579]

59 The number of realizations per handshape class is 50 on average, ranging from a minimum of 10 to a maximum of 300 depending on the experiment and the handshape class definition. [sent-497, score-1.158]

60 We assign to each experiment’s training set one GMM per handshape class; each has one mixture component and a diagonal covariance matrix. [sent-498, score-0.607]
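
A GMM with a single mixture component and a diagonal covariance reduces to one axis-aligned Gaussian per class, so the classifier can be sketched in a few lines. This is an illustrative sketch only: the function names, the variance floor and the maximum-likelihood decision rule with equal class priors are assumptions, and the feature vectors stand in for the Aff-SAM eigenimage weights.

```python
import math

def fit_diag_gaussian(vectors, floor=1e-6):
    """Estimate the per-dimension mean and variance of one handshape class."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    # The floor keeps a variance from collapsing to zero on tiny classes.
    var = [max(sum((v[j] - mean[j]) ** 2 for v in vectors) / n, floor)
           for j in range(d)]
    return mean, var

def log_likelihood(x, model):
    """Log-density of x under a Gaussian with diagonal covariance."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xj - m) ** 2 / v)
               for xj, m, v in zip(x, mean, var))

def train(features_by_class):
    """Fit one single-component, diagonal-covariance model per class label."""
    return {label: fit_diag_gaussian(vs) for label, vs in features_by_class.items()}

def classify(x, models):
    """Assign x to the class whose model gives the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))
```

Keeping the classifier this simple puts the emphasis on the quality of the features rather than on the modeling power of the classifier.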

61 Note that we are not employing other classifiers since we are interested in the evaluation of the handshape features and not the classifier. [sent-501, score-0.647]

62 Whether the trained handshape models depend on a particular parameter or not is denoted by ‘D’ or ‘*’, respectively. [sent-510, score-0.675]

63 There are two choices: either 1) construct handshape models independent of this parameter or 2) construct different handshape models for each value of the parameter. [sent-516, score-1.158]

64 In other words, at one extreme CD restricts the models’ generalization by making each handshape model specific to the annotation parameters, thus highly discriminable, see for instance in Table 2 the experiment corresponding to D-HFSBP. [sent-517, score-0.666]

65 At the other extreme, CD extends the handshape models’ generalization w. [sent-518, score-0.579]

66 to the annotation parameters, by letting the handshape models account for pose variability (that is depend only on the HSId; same HSId’s with different pose parameters are tied), see for instance experiment corresponding to the case D-H (Table 2). [sent-521, score-0.842]

67 Note that in the occlusion cases, this simplified fitting is done directly on the SA image of the region that contains the modeled hand as well as the other occluded bodypart(s) (that is the other hand and/or the head), without using any static or dynamic priors as those of Section 4. [sent-526, score-0.551]

68 Cropped handshape images are placed at the models’ centroids. [sent-532, score-0.65]

69 In this simplified version too, the hand occlusion cases are treated by simply fitting the model to the Shape-Appearance image that contains the occlusion, without static or dynamic priors. [sent-534, score-0.35]

70 It presents a single indicative cropped handshape image per class to add intuition on the presentation: these images correspond to the points in the feature space that are closest to the specific classes’ centroids. [sent-558, score-0.721]

71 We observe that similar handshape models share close positions in the space. [sent-559, score-0.579]

72 indices are for varying CD field, that is the orientation parameters on which the handshape models are dependent or not (as discussed in Section 7. [sent-596, score-0.657]

73 At one extreme (that is, ‘D-HFBSP’) we trained one GMM for each different combination of the handshape configuration parameters (H,F,B,S,P). [sent-634, score-0.613]

74 Thus, the trained models were dependent on the 3D handshape pose and so are the classes for the classification (34 different classes). [sent-635, score-0.701]

75 At the other extreme (‘D-H’) we trained one GMM for each HSId; thus the trained models were independent of the 3D handshape pose and so are the classes for the classification (18 different classes). [sent-636, score-0.735]

76 3 Data Set DS-1-extend. This is an extension of DS-1 and consists of 24 different HSIds with much more 3D handshape pose variability. [sent-652, score-0.667]

77 We trained models independent of the 3D handshape pose. [sent-653, score-0.613]

78 However, the DS-2 data set consists of 42 handshape HSIds for both occlusion and non-occlusion cases. [sent-673, score-0.705]

79 This indicates that Aff-SAM handles handshape classification obtaining decent results even during occlusions. [sent-676, score-0.579]

80 Given the Aff-SAM based models from signer A, these are then adapted and fitted to another signer (B), as in Section 5, for whom no Aff-SAM models have been trained. [sent-691, score-0.554]

81 2) Second we employ the handshape features and the sub-unit construction via clustering of the handshape features (Roussos et al. [sent-702, score-1.266]

82 For the movement-position lexicon we recompose the constructed dynamic/static SUs, whereas for the Handshape lexicon we recompose the handshape subunits (HSU) to form each sign realization. [sent-705, score-0.787]

83 4) Next, for the training of the SUs we employ a GMM for the static and handshape subunits and a 5-state HMM for the dynamic subunits. [sent-706, score-0.826]

84 5) Finally, we fuse the movement-position and handshape cues via one possible late integration scheme, that is Parallel HMMs (PaHMMs) (Vogler and Metaxas, 1999). [sent-708, score-0.642]
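
The idea of late integration can be illustrated with a simplified score-level sketch: each cue is scored by its own models, and the per-sign log-likelihoods are combined afterwards. This is not the PaHMM decoding procedure itself; the stream weights `w_mp` and `w_hs`, the function name and the sign labels in the usage note are hypothetical.

```python
def fuse_streams(mp_loglik, hs_loglik, w_mp=0.5, w_hs=0.5):
    """Combine per-sign log-likelihoods of two independently modeled cues.

    mp_loglik: sign label -> log-likelihood from the movement-position stream
    hs_loglik: sign label -> log-likelihood from the handshape stream
    Returns the best sign and the dictionary of fused scores.
    """
    signs = mp_loglik.keys() & hs_loglik.keys()
    # A weighted sum of logs corresponds to a weighted product of likelihoods.
    fused = {s: w_mp * mp_loglik[s] + w_hs * hs_loglik[s] for s in signs}
    best = max(fused, key=fused.get)
    return best, fused
```

For example, with movement-position scores `{"HELLO": -10.0, "THANKS": -12.0}` and handshape scores `{"HELLO": -30.0, "THANKS": -20.0}`, equal weights select `"THANKS"`, because the handshape cue outweighs the small movement-position preference for the other sign.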

85 2 Sign Recognition Results. In Figure 16 we present the sign recognition performance on the GSL lemmas corpus employing 100 signs from two signers, A and B, while varying the cues employed: movement-position (MP), handshape (HS) recognition performance and the fusion of both MP+HS cues via PaHMMs. [sent-711, score-1.036]

86 This is expected, and indicates that the handshape cue is crucial for sign recognition. [sent-713, score-0.679]

87 Thus, by applying the affine adaptation procedure and employing only a small development set, as presented in Section 5, we can extract reliable handshape features for multiple signers. [sent-715, score-0.678]

88 Conclusions. In this paper, we propose a new framework that incorporates dynamic affine-invariant Shape-Appearance modeling and feature extraction for handshape classification. [sent-722, score-0.714]

89 occlusions, we employ a regularized fitting of the SAM that exploits prior information on the handshape and its dynamics. [sent-729, score-0.615]

90 This process outputs an accurate tracking of the hand as well as descriptive handshape features. [sent-730, score-0.722]

91 3) We introduce an affine adaptation for signers other than the one used to train the model. [sent-731, score-0.39]

92 4) All the above features are integrated in a statistical handshape classification GMM and a sign recognition HMM-based system. [sent-732, score-0.791]

93 On the task of sign recognition for a 100-sign lexicon of GSL lemmas, the approach is evaluated via handshape subunits and also fused with movement-position cues, leading to promising results. [sent-742, score-0.831]

94 To conclude, given that handshape is among the main sign language phonetic parameters, we address issues that are indispensable for automatic sign language recognition. [sent-745, score-0.899]

95 Extraction of 3D hand shape and posture from image sequences for sign language recognition. [sent-984, score-0.318]

96 Affine-invariant modeling of shape-appearance images applied on sign language handshape classification. [sent-1080, score-0.81]

97 Hand tracking and affine shape-appearance handshape sub-units in continuous sign language recognition. [sent-1088, score-0.808]

98 Exploiting phonological constraints for handshape inference in ASL video. [sent-1141, score-0.579]

99 Advances in dynamic-static integration of movement and handshape cues for sign language recognition. [sent-1148, score-0.802]

100 Recognition with raw canonical phonetic movement and handshape subunits on videos of continuous sign language. [sent-1155, score-0.761]

