In this paper, a novel gesture spotting and recognition technique is proposed to handle hand gestures from continuous hand motion, based on Conditional Random Fields (CRF) in conjunction with a Support Vector Machine (SVM). Firstly, the YCbCr color space and a 3D depth map are used to detect and segment the hand; the depth map neutralizes complex background scenes. Secondly, 3D spatio-temporal features of the hand volume are extracted using dynamic affine invariants, namely elliptic Fourier descriptors and Zernike moments, in addition to three orientation motion features. Finally, the hand gesture is spotted and recognized using the discriminative Conditional Random Fields model. Accordingly, a Support Vector Machine verifies the hand shape at the start and end points of each meaningful gesture, which enforces robust view-invariant recognition. Experiments demonstrate that the proposed method can successfully spot and recognize hand gestures from continuous hand motion data with a 92.50% recognition rate.
The task of locating the start and end points that correspond to a gesture of interest is a challenging one in Human-Computer Interaction. We define a gesture as the motion of the hand to communicate with a computer. The task of locating meaningful patterns in input signals is called pattern spotting [
Boundary-based invariants such as Fourier descriptors explore only the contour information; they cannot capture the interior content of the shape. Moreover, these methods cannot deal with disjoint shapes, where a single closed boundary may not be available, and therefore have limited applications. Region-based invariants, in contrast, take all of the pixels of the image into account to represent the shape. Because region-based invariants combine information from an entire image region rather than exploiting information only along the boundary pixels, they can capture more information from the image. Region-based invariants can also be employed to describe disjoint shapes.
Lee, H.-K. and Kim, J.H. [ proposed an HMM-based threshold model for spotting gestures in continuous input streams.
To address these challenges, forward CRF gesture spotting using a circular buffer is proposed, which handles hand gesture spotting and recognition in stereo color image sequences simultaneously and without time delay. A hand-appearance-based sign verification method using SVM is employed to further improve spotting accuracy. Additionally, a depth image sequence is exploited to identify the Region of Interest (ROI) without processing the whole image, which reduces the cost of ROI searching and increases the processing speed. Our experiments on our own dataset show that the proposed approach is robust and yields promising results, comparing favorably with those previously reported in the literature.
An application of gesture-based interaction with Arabic numbers (0 - 9) is implemented to demonstrate the interplay of the suggested components and the effectiveness of the gesture spotting and recognition approach (see the corresponding figure).
Automatic segmentation and preprocessing is an important stage in our approach. The segmentation of the hand takes place using color information and a 3D depth map. Firstly, the hand (i.e., the area of interest, AOI) is segmented using a Gaussian Mixture Model (GMM) over the YCbCr color space, where the Y channel represents brightness and the (Cb, Cr) channels refer to chrominance [
Secondly, the depth map is used to neutralize complex scene backgrounds, which increases the robustness of the skin segmentation over the complete AOI (see the corresponding figure).
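To make this stage concrete, the sketch below combines a skin mask in YCbCr space with a depth gate. It is a minimal sketch assuming OpenCV and NumPy; the fixed Cb/Cr box thresholds stand in for the paper's trained GMM skin model, and the depth band limits are illustrative values, not the paper's parameters.

```python
# Skin + depth segmentation sketch (OpenCV stores chrominance as YCrCb).
import cv2
import numpy as np

def segment_hand(bgr_frame, depth_map, near=400, far=900):
    """Return a binary mask of the hand (AOI) from color and depth cues."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    # Simple box thresholds standing in for the GMM skin model.
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    # Depth gating: keep only pixels inside the expected hand distance band,
    # which suppresses skin-colored clutter in the background.
    depth_mask = cv2.inRange(depth_map.astype(np.uint16), near, far)
    mask = cv2.bitwise_and(skin, depth_mask)
    # Morphological cleanup of speckle noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```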
To retrieve the extracted features during occlusion, a robust method for hand tracking is employed using mean-shift analysis in conjunction with the depth map. The motivation behind mean-shift analysis is to achieve accurate and robust hand tracking. Mean-shift analysis uses the gradient of the Bhattacharyya coefficient [
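A minimal tracking sketch follows, using OpenCV's mean-shift over a chrominance histogram back-projection, which plays the role of the Bhattacharyya-based similarity surface. Window size, histogram bins, and termination criteria are illustrative assumptions.

```python
# Mean-shift hand tracking sketch; emits the gesture path of centroids.
import cv2

def track_hand(frames, init_window, skin_mask0):
    """Track the hand window across frames with mean shift."""
    x, y, w, h = init_window
    roi = cv2.cvtColor(frames[0], cv2.COLOR_BGR2YCrCb)[y:y+h, x:x+w]
    # Model histogram over the chrominance channels of the initial ROI.
    hist = cv2.calcHist([roi], [1, 2], skin_mask0[y:y+h, x:x+w],
                        [32, 32], [0, 256, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window, centroids = init_window, []
    for frame in frames[1:]:
        ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
        prob = cv2.calcBackProject([ycrcb], [1, 2], hist, [0, 256, 0, 256], 1)
        _, window = cv2.meanShift(prob, window, term)
        x, y, w, h = window
        centroids.append((x + w / 2.0, y + h / 2.0))  # gesture path point
    return centroids
```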
The orientation gives the direction of the hand as it traverses space during the gesture-making process. A gesture path is a spatio-temporal pattern that consists of centroid points (x_hand, y_hand). The orientation feature is therefore based on the hand displacement vector at every point, represented by the orientation with respect to the centroid of the gesture path ($\theta_{1t}$), the orientation between two consecutive points ($\theta_{2t}$), and the orientation between the start point and the current gesture point ($\theta_{3t}$):

$$\theta_{1t} = \arctan\!\left(\frac{y_{t+1} - c_y}{x_{t+1} - c_x}\right), \qquad \theta_{2t} = \arctan\!\left(\frac{y_{t+1} - y_t}{x_{t+1} - x_t}\right), \qquad \theta_{3t} = \arctan\!\left(\frac{y_{t+1} - y_1}{x_{t+1} - x_1}\right)$$

where $(c_x, c_y)$ refers to the centroid of gravity of the $n$ path points and $T$ represents the length of the hand gesture path, such that $t = 1, 2, \ldots, T-1$.
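The three orientation measures translate directly into a few lines of NumPy; the sketch below assumes `path` is the list of tracked hand centroids and uses `arctan2` for quadrant-correct angles.

```python
# The three orientation features of a gesture path of centroids (x_t, y_t).
import numpy as np

def orientation_features(path):
    pts = np.asarray(path, dtype=float)          # shape (T, 2)
    cx, cy = pts.mean(axis=0)                    # centroid of the gesture path
    x, y = pts[:, 0], pts[:, 1]
    # theta1: orientation of each point w.r.t. the path centroid
    theta1 = np.arctan2(y[1:] - cy, x[1:] - cx)
    # theta2: orientation between consecutive points
    theta2 = np.arctan2(y[1:] - y[:-1], x[1:] - x[:-1])
    # theta3: orientation between the start point and the current point
    theta3 = np.arctan2(y[1:] - y[0], x[1:] - x[0])
    return np.stack([theta1, theta2, theta3], axis=1)   # (T-1, 3)
```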
In this manner, a gesture is represented as an ordered sequence of feature vectors, which are projected and clustered in the feature space to obtain discrete code words. This is done using the k-means clustering algorithm [
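A sketch of this quantization step using scikit-learn follows; the codebook size is an assumed value, not the paper's.

```python
# Vector quantization of feature vectors into discrete codewords.
from sklearn.cluster import KMeans
import numpy as np

def build_codebook(all_feature_vectors, n_codewords=33):
    """Cluster the pooled training features; returns the fitted quantizer."""
    return KMeans(n_clusters=n_codewords, n_init=10,
                  random_state=0).fit(np.vstack(all_feature_vectors))

def quantize(kmeans, feature_sequence):
    """Map a gesture's feature sequence to a sequence of codeword indices."""
    return kmeans.predict(np.asarray(feature_sequence))
```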
The shape flow, i.e., the global flow of the hand, is characterized by the elliptic Fourier descriptors and Zernike moments.
The elliptic Fourier descriptors for the hand silhouettes are obtained using the trigonometric form of the shape curve $C_k$:

$$x(s) = a_0 + \sum_{k=1}^{K}\left(a_k\cos\frac{2\pi k s}{S} + b_k\sin\frac{2\pi k s}{S}\right), \qquad y(s) = c_0 + \sum_{k=1}^{K}\left(c_k\cos\frac{2\pi k s}{S} + d_k\sin\frac{2\pi k s}{S}\right)$$

where $s$ is the arc length along the contour, $S$ is the total contour perimeter, and $(a_k, b_k, c_k, d_k)$ are the elliptic Fourier coefficients of the $k$-th harmonic.
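The coefficients can be computed directly from a silhouette boundary via the standard Kuhl-Giardina formulas; below is a plain NumPy sketch, with the harmonic order chosen arbitrarily and consecutive boundary points assumed distinct.

```python
# Elliptic Fourier coefficients (a_k, b_k, c_k, d_k) of a closed contour.
import numpy as np

def elliptic_fourier_coeffs(contour, order=10):
    """contour: (K, 2) array of boundary points of the silhouette."""
    d = np.diff(np.vstack([contour, contour[:1]]), axis=0)   # close the curve
    dt = np.hypot(d[:, 0], d[:, 1])                          # segment lengths
    t = np.concatenate([[0.0], np.cumsum(dt)])
    T = t[-1]                                                # total perimeter
    coeffs = np.zeros((order, 4))
    for n in range(1, order + 1):
        c, s = np.cos(2 * n * np.pi * t / T), np.sin(2 * n * np.pi * t / T)
        k = T / (2 * n ** 2 * np.pi ** 2)
        coeffs[n - 1] = [
            k * np.sum(d[:, 0] / dt * (c[1:] - c[:-1])),   # a_n
            k * np.sum(d[:, 0] / dt * (s[1:] - s[:-1])),   # b_n
            k * np.sum(d[:, 1] / dt * (c[1:] - c[:-1])),   # c_n
            k * np.sum(d[:, 1] / dt * (s[1:] - s[:-1])),   # d_n
        ]
    return coeffs
```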
Invariance of the hand image can be achieved by using Zernike moments, which form an orthogonal set of rotation-invariant moments. Additionally, scale and translation invariance can be implemented using moment normalization [ The Zernike moment of order $n$ with repetition $m$ for a silhouette image $f(x, y)$ is

$$Z_{nm} = \frac{n+1}{\pi}\sum_{x}\sum_{y} f(x, y)\, V_{nm}^{*}(\rho, \theta), \qquad \rho \le 1,$$

where $V_{nm}(\rho, \theta) = R_{nm}(\rho)\,e^{jm\theta}$ is the Zernike polynomial and $R_{nm}(\rho)$ its radial part; translation invariance is obtained by using the center of the silhouette image as the origin of the unit disk. Thus, a Zernike moment feature vector that is, together with the geometric features, invariant to shape translation, rotation, and scaling, with remarkable similarity to the Hu invariant moments, is assigned as $G_z = [Z_{00}, Z_{11}, Z_{22}]$.
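In practice these moments need not be implemented by hand; the sketch below assumes the `mahotas` library, which centers the unit disk on the silhouette's center of mass, with the mapping radius and degree chosen for illustration.

```python
# Zernike moment magnitudes for a binary hand silhouette (sketch).
import mahotas
import numpy as np

def zernike_features(silhouette, degree=8):
    """silhouette: binary hand mask; returns rotation-invariant magnitudes."""
    ys, xs = np.nonzero(silhouette)
    # Use the silhouette's spread as the mapping radius so that the shape
    # fills the unit disk (a simple form of scale normalization).
    radius = max(np.ptp(ys), np.ptp(xs)) / 2.0
    return mahotas.features.zernike_moments(silhouette, radius, degree=degree)
```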
To accurately spot meaningful gestures of the numbers (0 - 9) embedded in the input video stream, a two-layer CRF architecture is applied (see the corresponding figure).
CRFs are a framework based on conditional probability for segmenting and labeling sequential data. CRFs use a single exponential distribution to model all labels of given observations; therefore, there is a trade-off among the weights of the feature functions for each state. In our application, each state corresponds to a segment of the number. In addition, each label in the CRF is modeled as an exponential distribution over the conditional probabilities of the next label given the current label.
(1) Spotting with CRFs
Conditional Random Fields were developed for labeling sequential data (i.e., determining the probability of a given label sequence for a given input sequence) and are undirected, discriminative graphical models. The current label is designed to form a chain, with an edge between itself and the previous label, and each label corresponds to a gesture number. The probability of a label sequence $y$ for a given observation sequence $O$ is calculated as

$$P(y \mid O, \theta) = \frac{1}{Z(O)} \exp\!\left(\sum_{i}\sum_{k} \lambda_k\, t_k(y_{i-1}, y_i, O, i) + \sum_{i}\sum_{k} \mu_k\, s_k(y_i, O, i)\right) \tag{2.7}$$

where the parameter vector $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ contains the weights of the feature functions and $Z(O)$ is a normalization factor over all label sequences. Here, $t_k(y_{i-1}, y_i, O, i)$ is a transition feature function at positions $i$ and $i-1$, and $s_k(y_i, O, i)$ is a state feature function at position $i$.
The CRFs are initially built without a label for non-gesture patterns, because CRFs use a single model for the conditional probability of the sequences, $p(y \mid O, \theta)$.
Gradient ascent with the BFGS optimization technique, run for 300 iterations, is used to train the CRFs to optimal convergence. The labels of the trained CRFs are the gesture numbers {“0”, “1”, …, “9”}; non-gesture patterns are then modeled by adding a label (N), so that {“0”, “1”, …, “9”, N} are the labels of the N-CRFs. The proposed N-CRFs model does not need non-gesture patterns for training and can better spot gestures and non-gesture patterns.
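A training sketch for the label model is shown below, assuming the sklearn-crfsuite package (its L-BFGS optimizer matches the BFGS training described above, with the same 300-iteration cap). The feature template mapping codewords to feature dicts is an illustrative choice, not the paper's exact one.

```python
# CRF training sketch over codeword sequences.
import sklearn_crfsuite

def to_features(codewords):
    """One feature dict per observation: current codeword plus context."""
    return [{
        'code': str(c),
        'prev': str(codewords[i - 1]) if i > 0 else 'BOS',
    } for i, c in enumerate(codewords)]

# X: list of codeword sequences; y: per-frame labels '0'..'9' (plus 'N'
# for non-gesture frames when evaluating the N-CRF variant).
def train_crf(X, y):
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=300,
                               all_possible_transitions=True)
    crf.fit([to_features(seq) for seq in X], y)
    return crf
```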
(2) N-CRFs Model Parameters
The label for non-gesture patterns is created by using the weights of the state and transition feature functions of the initialized CRFs model. There are two main parameter sets of CRFs, namely the state feature functions and the transition feature functions, as in Equation (2.7). Following the idea of Dugad et al. [, the weight of each state feature function of the non-gesture label is derived from the weights of the gth state feature functions of the gesture labels together with a threshold TN; since this derivation involves the variance of the gth state feature functions, TN reflects the width of the state feature functions in some way. The optimal value of TN is 0.7, determined through multiple experiments conducted with a range of values on a training data set.
It is difficult to spot and recognize short gestures because short gestures have fewer samples than long gestures. A further challenging problem is that there is quite a bit of variability in the same gesture, even for the same person. To avoid this problem, a short-gesture detector is added, in which the weights of the self-transition feature functions are increased as a function of the gesture length,
where Nframe(Yl) is the average frame number of a gesture Yl.
The weight of the self-transition feature function of the non-gesture label is approximately assigned the maximum weight among the transition feature functions of the initialized CRFs:

$$\lambda_{(N \to N)} \approx \max_{l,m} \lambda_{(y_l \to y_m)}$$

where $\lambda_{(y_l \to y_m)}$ denotes the weight of the transition feature function from label $y_l$ to label $y_m$.
As with the transition parameters of the non-gesture model described above, a method is employed to compute the weights of the transition feature functions between the labels of the gesture models and the non-gesture label. The weights of the transition feature functions from the non-gesture label to the other labels, and likewise from the gesture labels to the non-gesture label, are computed analogously from the corresponding transition weights of the initialized CRFs.
As a result, the N-CRFs model can better spot gestures and non-gesture patterns.
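To make the parameter construction above concrete, the sketch below assembles the non-gesture label N from a trained CRF's weight tables. Only the max rule for the self-transition comes directly from the text; the mean/variance form for the state features and the dictionary layout are assumptions standing in for the equations not reproduced here.

```python
# Illustrative construction of the non-gesture label N's weights.
import numpy as np

T_N = 0.7  # threshold determined experimentally in the text

def build_nongesture_weights(state_w, trans_w):
    """state_w[label][g]: weight of the gth state feature of a gesture label;
    trans_w[(a, b)]: transition weight from label a to label b (assumed layout)."""
    labels = list(state_w)
    n_feats = len(state_w[labels[0]])
    per_feat = np.array([[state_w[l][g] for l in labels]
                         for g in range(n_feats)])
    # Assumed form: center of the gesture weights, widened by T_N times the
    # per-feature spread (the variance-based derivation described above).
    state_N = per_feat.mean(axis=1) - T_N * per_feat.std(axis=1)
    # Max rule from the text: lambda(N->N) ~ max transition weight.
    trans_N = {('N', 'N'): max(trans_w.values())}
    return state_N, trans_N
```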
Several samples of each gesture in the gesture vocabulary are stored in the database to be used in training the classifier. A gesture is stored as a sequence of (x, y, t) coordinates. In recognition mode, a circular buffer is used to temporarily store the real-time information. The circular buffer contains a number of sequential observations instead of a single observation; it is used to reduce the impact of observation changes over short intervals caused by incomplete feature extraction. The circular buffer is set to store 13 frames. The size of the circular buffer is chosen to equal the length of the shortest gesture (gesture “1” is the shortest gesture in our system). This ensures that at some stage during a continuous motion, the buffer will be completely filled with gesture data and will not contain any transitional motion data. In addition, at most one complete gesture can be stored in the buffer at a time.
Assume that the circular buffer is initialized with the input observation sequence $O = \{o_1, o_2, \ldots, o_T\}$ of length $T = 13$, as in our system.
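A fixed-size deque implements this buffer directly, as the minimal sketch below shows; once full, every new observation automatically displaces the oldest one.

```python
# Circular buffer sketch: always holds the most recent T = 13 observations
# (the length of the shortest gesture, "1").
from collections import deque

BUFFER_SIZE = 13
buffer = deque(maxlen=BUFFER_SIZE)   # old frames fall out automatically

def push_observation(obs):
    """Append the newest frame's feature codeword; O(1) per frame."""
    buffer.append(obs)
    return len(buffer) == BUFFER_SIZE   # True once the window is full
```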
When the value of DP(t) at time t is negative, the start point has not yet been detected, and the circular buffer is therefore shifted one unit (i.e., the oldest observation is dropped and the next observation is appended). Once a start point is detected, the observed key gesture segment A is represented by the union of all possible partial gesture segments.
At each step, the gesture type of A is determined. When the value of DP becomes negative again, or when there are no more gesture images, the final gesture label of the observed gesture segment A is determined. When more gesture images remain, the previous steps are repeated, re-initializing the circular buffer at the next time t. This forward scheme therefore resolves the issue of time delay between gesture spotting and recognition.
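The loop below sketches this forward spotting scheme. Here `dp` stands for the DP(t) test described above and `featurize` for the CRF feature template; both are hypothetical helpers (e.g., the ones sketched earlier), not named components of the paper.

```python
# Forward spotting sketch over the codeword stream.
BUFFER_SIZE = 13  # length of the shortest gesture, as above

def spot_forward(observations, crf, dp, featurize):
    """Return (start, end, label) triples for spotted gesture segments."""
    segments, start = [], None
    for t in range(BUFFER_SIZE, len(observations) + 1):
        window = observations[t - BUFFER_SIZE:t]
        if dp(window) > 0:                      # gesture-like window
            if start is None:
                start = t - BUFFER_SIZE         # candidate start point
        elif start is not None:
            # DP turned negative again: label the accumulated segment A
            # with the CRF and re-initialize the buffer at the next t.
            segment = observations[start:t]
            label = crf.predict([featurize(segment)])[0][-1]
            segments.append((start, t, label))
            start = None
    return segments
```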
The main motivation behind SVM-based gesture verification is to decide whether or not to accept a spotted gesture. This helps to discriminate gestures that may have similar hand motions but different hand shapes. SVMs are trained on a set of images of each gesture's hand shape, described by the features extracted via elliptic Fourier descriptors and Zernike moments; in addition, histogram-of-gradient features are extracted from the training samples for training the SVM [
In our work, the hand shape at the start of the gesture is verified first; the hand appearance is then verified over a period of several frames. A voting value over these frames is used to decide whether to accept or reject the gesture: if the voting value is greater than a specific, experimentally determined threshold, the candidate gesture is accepted as a meaningful gesture.
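The sketch below illustrates the verification step with the histogram-of-gradient variant: HOG features feed an SVM, and a simple vote over the candidate's frames decides acceptance. The scikit-image/scikit-learn APIs, the HOG parameters, and the 0.6 vote threshold are assumptions for illustration.

```python
# Hand-shape verification sketch: HOG features + SVM + frame voting.
from skimage.feature import hog
from sklearn.svm import SVC
import numpy as np

def hog_features(gray_patch):
    return hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_verifier(hand_patches, labels):
    svm = SVC(kernel='rbf', gamma='scale')
    svm.fit([hog_features(p) for p in hand_patches], labels)
    return svm

def verify_gesture(svm, patches, expected_label, vote_threshold=0.6):
    """Accept the spotted gesture if enough frames agree with the CRF label."""
    votes = [svm.predict([hog_features(p)])[0] == expected_label
             for p in patches]
    return np.mean(votes) >= vote_threshold
```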
The input images were captured by a Bumblebee stereo camera system with a 6 mm focal length, at 15 fps and 240 × 320 pixel image resolution; the system was implemented in Matlab. Classification results are based on our database, which contains 600 video samples of isolated gestures (i.e., 60 video samples for each gesture from 0 to 9) captured from three persons. For each isolated number from 0 to 9, 42 videos were used for training the CRF and the SVMs. Additionally, the database contains 280 video samples of continuous hand motion for testing; each video sample contains one or more meaningful gestures.
On a standard desktop PC, training is the more expensive process for CRFs: the time the model needs ranges from 20 minutes to several hours, depending on the observation window. In contrast, inference (i.e., recognition) is inexpensive and very fast for all models on sequences of several frames.
In the automatic gesture spotting task, there are three types of errors, called insertion (I), substitution (S), and deletion (D). An insertion error occurs when the spotter detects a nonexistent gesture; this happens when the emission probability of the current state for a given observation sequence is equal to zero. A substitution error occurs when a key gesture is classified falsely (i.e., classified as another gesture); this error usually happens when the extracted features are falsely quantized to other code words. A deletion error happens when the spotter fails to detect a key gesture. In calculating the recognition ratio (Rec.) (Equation (3.2)), insertion errors are not considered at all. However, insertion errors can in turn cause substitution and deletion errors, because they often act as strong decisions in determining the end point of gestures and can thereby eliminate all or part of a meaningful gesture from the observation. Deletion errors directly affect the recognition ratio, whereas insertion errors do not; insertion errors, however, directly affect the gesture spotting ratio. To take the effect of insertion errors into consideration, another performance measure called reliability (Rel.) is used:

$$\text{Rel.} = \frac{\#\,\text{correctly recognized gestures}}{\#\,\text{test gestures} + \#\,\text{insertion errors}} \times 100\%$$
The recognition ratio and the reliability are computed from the numbers of spotting errors (insertion, deletion, and substitution) observed on the test set.
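The two measures reduce to simple ratios, as the snippet below verifies against the totals in the table that follows.

```python
# Recognition ratio and reliability from the spotting counts
# (totals from the table below: 259 correct, 280 test gestures, 10 insertions).
def recognition_ratio(correct, num_test):
    return 100.0 * correct / num_test

def reliability(correct, num_test, insertions):
    return 100.0 * correct / (num_test + insertions)

print(recognition_ratio(259, 280))   # 92.50
print(reliability(259, 280, 10))     # 89.31 (to two decimals)
```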
Experimental results show that the proposed CRF-based method automatically recognizes meaningful gestures with a 92.50% recognition rate (see the table below).
Meaningful gesture spotting results:

| Gesture path | Train data | Test | I | D | S | Correct | Rec. (%) | Rel. (%) |
|---|---|---|---|---|---|---|---|---|
| “0” | 42 | 28 | 1 | 0 | 1 | 27 | 96.43 | 93.10 |
| “1” | 42 | 28 | 1 | 0 | 0 | 27 | 96.43 | 93.10 |
| “2” | 42 | 28 | 1 | 1 | 1 | 26 | 92.86 | 89.65 |
| “3” | 42 | 28 | 1 | 1 | 1 | 26 | 92.86 | 89.65 |
| “4” | 42 | 28 | 1 | 2 | 1 | 25 | 89.29 | 86.21 |
| “5” | 42 | 28 | 1 | 1 | 1 | 26 | 92.86 | 89.65 |
| “6” | 42 | 28 | 1 | 1 | 1 | 26 | 92.86 | 89.65 |
| “7” | 42 | 28 | 1 | 2 | 1 | 25 | 89.29 | 86.21 |
| “8” | 42 | 28 | 1 | 2 | 1 | 25 | 89.29 | 86.21 |
| “9” | 42 | 28 | 1 | 1 | 1 | 26 | 92.86 | 89.65 |
| Total | 420 | 280 | 10 | 11 | 9 | 259 | 92.50 | 89.31 |
Lee, H.-K. and Kim, J.H. [ achieved a 93.14% recognition rate with their HMM-based threshold model. However, their method has the problem that the system cannot report the detection of a gesture immediately after the gesture reaches its end point, because the endpoint detection process postpones the decision until the next gesture is detected in order to avoid a premature decision. The delayed response may cause users to wonder whether their gesture has been recognized correctly or not. For that reason, their system is not suitable for real-time applications.
Yang, H.-D., Sclaroff, S. and Lee, S.-W. [ obtained an 87.00% recognition rate with a CRF-based threshold model for sign language spotting.
Elmezain, M., Al-Hamadi, A. and Michaelis, B. [ reported a 90.49% recognition rate for spotting hand gestures from continuous hand motion data.
In the light of this comparison (see the table below), the proposed method yields results that compare favorably with those previously reported in the literature, while avoiding the response delay of the threshold-model approaches.
[Figure: image sequences depicting the spotting and recognition results on continuous hand motion.]
| Method | Recognition rate |
|---|---|
| Our method | 92.50% |
| Lee, H.-K. and Kim, J.H. [ | 93.14% |
| Yang, H.-D., Sclaroff, S. and Lee, S.-W. [ | 87.00% |
| Elmezain, M., Al-Hamadi, A. and Michaelis, B. [ | 90.49% |
This paper proposed a novel gesture spotting and recognition technique that handles hand gestures from continuous hand motion, based on Conditional Random Fields in conjunction with a Support Vector Machine. A 3D depth map captured by a Bumblebee stereo camera was used to neutralize complex background scenes. Dynamic affine-invariant features such as elliptic Fourier descriptors and Zernike moments, in addition to three orientation motion features, were supplied to the CRF and SVMs. The discriminative CRF model performed the spotting and recognition processes using the combined orientation features, and the Support Vector Machine verified the hand shape at the start and end points of each meaningful gesture using the elliptic Fourier and Zernike moment features. Experiments showed that our proposed method can successfully spot and recognize hand gestures from continuous hand motion data with a 92.50% recognition rate.