The recent boom in mass media communication (such as social media and mobile devices) has driven new applications of automatic facial expression recognition (FER), in which human facial expressions must be encoded and recognized by digital devices. However, this process must cope with recurrent problems such as illumination changes and partial occlusion. Therefore, in this paper we propose a fully automated FER system based on Local Fourier Coefficients and Facial Fourier Descriptors. The combined power of appearance and geometric features is used to describe the specific facial regions of the eyes-eyebrows, nose and mouth, all building on the properties of the Fourier transform and Support Vector Machines. Hence, our proposal addresses common FER problems such as illumination changes, partial occlusion, image rotation, redundancy and dimensionality reduction. Several tests were performed to demonstrate the efficiency of our proposal, evaluated using three standard databases: CK+, MUG and TFEID. The evaluation results show that the average recognition rate on each database is higher than that of most state-of-the-art techniques surveyed in this paper.
Facial expressions of emotions are defined by facial muscle movements which represent specific human emotions. Psychologists have established six basic, universally recognized facial expressions of emotion: anger, disgust, fear, happiness, sadness and surprise [
On the other hand, ongoing research in computer vision and machine learning seeks a suitable way to encode the facial representations that define human emotions, so that complex Human-Computer Interaction (HCI) can be attained. Formally, automatic facial expression recognition (FER) is the field in charge of analyzing and recognizing facial feature changes from visual information (i.e. spatial or spatio-temporal representations). Applications of automated and real-time FER systems include health care, customer satisfaction analysis, virtual reality, smart environments, video conferencing, human emotion analysis, cognitive science, and more [
FER systems can be categorized as spatial or spatio-temporal [
This paper proposes a fully automated FER system based on the combination of local Fourier coefficients (appearance features) and facial Fourier descriptors (geometric features) of independent, specific facial regions (eyes-eyebrows, nose and mouth). By building independent subspaces in the frequency domain for each facial region, we can address common FER problems such as illumination changes, partial occlusion, image rotation, redundancy and dimensionality reduction. Finally, facial expressions are recognized using Support Vector Machines (SVMs) and evaluated with three widely used data sets: the Extended Cohn-Kanade (CK+) database [
This paper is strongly related to the work presented in [
A deeper literature review of similar works, which serves as the basis for comparison.
A fully automated fiducial point detection and region segmentation stage (instead of manual annotation).
A more detailed description of the method to make its reproduction easier.
A complete evaluation using full-size data sets (not only culture-specific frames).
A higher recognition rate obtained by determining the ideal number of fiducial points and the ideal sub-block size.
A study of the combinations of facial regions for facing the problem of partial occlusion.
In summary, the main contributions of this paper are:
A fully automated FER system based on appearance and geometric features using local Fourier coefficients and SVMs.
A study of local Fourier coefficients with different sizes of sub-blocks.
A study of facial Fourier descriptors with a different number of fiducial points.
A comparison with state-of-the-art methods that address the problem of partial occlusion.
Extensive FER experiments on three different data sets demonstrating the efficiency of the proposed system over previous works.
The rest of the paper is organized as follows: a review of related works is presented in Section 2. The general framework of the proposed FER system is explained in Section 3, followed by the description of the data sets and the evaluation protocol in Section 4. Section 5 shows the experimental results and, finally, conclusions and future work are drawn in Section 6.
Several studies have been proposed for combining the benefits of appearance and geometric features for FER. For instance, Li et al. [
The proposed FER system consists of four steps: face detection, facial region segmentation, feature extraction and classification. As shown in
As mentioned before, face detection is carried out by Viola-Jones algorithm. Thus, we obtain a detected face region (defined as DFR) of size
where
being
In order to obtain the fiducial points of each facial image we applied the work proposed in [
The basis of our proposal is the Fourier transform which has been applied a few times for facial recognition (FR) and FER. For instance, the method proposed in [
Appearance feature extraction is carried out using LFC, which builds on the 2-D DFT. This process consists of dividing the input image into several sub-blocks to locally extract Fourier coefficients. Formally, the 2-D DFT is defined as:
where
Consider
where
where
Considering the Equation (3), the local Fourier coefficient matrix is given by:
where lfc has the same dimensions as FR. In summary, lfc matrix represents the real components of frequency features obtained locally by each sub-block of size
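The sub-block division described above can be sketched as follows, assuming non-overlapping square sub-blocks of side `block_size` and using NumPy's FFT; the function name and the cropping behavior are illustrative, not the paper's exact implementation:

```python
import numpy as np

def local_fourier_coefficients(region, block_size=2):
    """Divide a facial-region image into non-overlapping sub-blocks of
    size block_size x block_size and keep the real part of each
    sub-block's 2-D DFT, as the lfc matrix does."""
    h, w = region.shape
    h -= h % block_size                      # crop so the image tiles exactly
    w -= w % block_size
    lfc = np.empty((h, w))
    for i in range(0, h, block_size):
        for j in range(0, w, block_size):
            block = region[i:i + block_size, j:j + block_size]
            # 2-D DFT of the sub-block; only the real components are kept
            lfc[i:i + block_size, j:j + block_size] = np.fft.fft2(block).real
    return lfc
```

Note that the resulting lfc matrix keeps the dimensions of the (cropped) input region, matching the description above.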
Subsequently, a variation of PCA is applied in order to reduce dimensionality and to correlate the local information with the set of training images. To this end, the lfc matrix is converted into a column vector, so that
where
where P is the total number of images used for training and
Subsequently, the covariance matrix
These eigenvectors are then sorted in descending order according to their corresponding eigenvalues. The sorted eigenvectors of the covariance matrix determine the subspace
where V0 is the eigenvector associated with the largest eigenvalue, V1 is the eigenvector associated with the second largest eigenvalue, and H is the number of eigenvectors used for further projections. It is worth noting that this process is applied so that 90% of the variance of training vectors is retained. Finally, the LFC feature vector
where
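The subspace construction described above can be sketched as below: a standard eigen-decomposition of the covariance matrix, sorted in descending order of eigenvalue, keeping the H leading eigenvectors that retain 90% of the variance. The function names are ours, and the paper's exact PCA variant may differ:

```python
import numpy as np

def pca_subspace(X, retained=0.90):
    """Given training column vectors X (one column per image), build the
    projection subspace from the eigenvectors of the covariance matrix,
    keeping the H leading eigenvectors that retain `retained` variance."""
    mean = X.mean(axis=1, keepdims=True)
    A = X - mean                                  # centered training vectors
    cov = A @ A.T / A.shape[1]                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    H = int(np.searchsorted(ratio, retained)) + 1 # smallest H retaining 90%
    return eigvecs[:, :H], mean                   # subspace [V0 ... V_{H-1}]

def project(x, V, mean):
    """Project a feature column vector onto the learned subspace."""
    return V.T @ (x - mean)
```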
On the other hand, the geometric feature extraction process is based on FFD, which uses Fourier Descriptors (FD). FFD represents a digital boundary by 1-D Fourier coefficients, obtained by applying the DFT to a sequence of coordinate pairs. To this end, each facial region shape is considered as K coordinate pairs, K being the number of facial feature points of the shape. An analysis of the effect of different values of K is presented in Section 5.2.
For applying FFD, suppose that a specific shape of the FR-th facial region is represented as a sequence of coordinates, so that
where
where
Subsequently, the FFD of
for
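A minimal sketch of the Fourier Descriptor computation, assuming the common complex-coordinate formulation of FDs and the standard normalizations for translation, scale and rotation invariance; the paper's exact normalization may differ:

```python
import numpy as np

def facial_fourier_descriptors(points):
    """Fourier Descriptors of a K-point shape: each landmark (x_k, y_k)
    becomes the complex number s(k) = x_k + j*y_k, the 1-D DFT is
    applied, and the coefficients are normalized for translation
    (drop the DC term), scale (divide by the first harmonic's magnitude)
    and rotation/start point (keep only magnitudes)."""
    s = points[:, 0] + 1j * points[:, 1]   # complex boundary sequence
    a = np.fft.fft(s)                      # 1-D DFT of the sequence
    a = a[1:]                              # drop a(0): translation invariance
    return np.abs(a) / np.abs(a[0])        # scale + rotation invariance
```

With these normalizations the descriptor is unchanged when the same shape is translated, scaled or rotated, which is what makes this representation robust to image rotation.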
At this point the two kinds of features are combined: the appearance and geometric feature vectors, individually calculated by Equations (10) and (15), are fused. The fusion begins with the concatenation of both feature vectors, so that
Subsequently, the process of Equations (5)-(9) has to be applied once more. Thus, the final feature vector of one facial region is defined by:
where
where Y represents the concatenation of C individual facial regions. It is worth noting that C can be equal to 2 or 3 depending on how many facial regions are involved in the feature extraction process.
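The fusion described above reduces to two concatenation steps; a minimal sketch with illustrative names (in the full system, the PCA projection of Equations (5)-(9) is applied to the fused vector before the per-region vectors are concatenated into Y):

```python
import numpy as np

def fuse_region_features(lfc_vec, ffd_vec):
    """Fuse the appearance (LFC) and geometric (FFD) feature vectors of
    one facial region by concatenation; the full system then re-applies
    the PCA projection to this fused vector."""
    return np.concatenate([lfc_vec, ffd_vec])

def final_feature_vector(region_vectors):
    """Concatenate the C individual facial-region vectors (C = 2 or 3)
    into the final descriptor Y."""
    if len(region_vectors) not in (2, 3):
        raise ValueError("C must be 2 or 3 facial regions")
    return np.concatenate(region_vectors)
```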
Finally, in order to overcome the identity bias, we follow the assumption that facial expressions can be represented as a linear combination of expressive and neutral face images of the same subject [
for
Support Vector Machines (SVMs) are efficient classifiers known for their generalization capability. Therefore, in this paper, multi-class SVMs employing radial basis function (RBF) kernels were used to classify the six basic facial expressions. The library LIBSVM [
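A minimal sketch of the classification stage using scikit-learn's SVC, which wraps LIBSVM and uses the one-vs-one multi-class scheme; the feature vectors, labels and hyper-parameters below are placeholders, not the paper's settings:

```python
import numpy as np
from sklearn.svm import SVC   # SVC wraps LIBSVM; one-vs-one multi-class

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Placeholder feature vectors and labels standing in for the final Y vectors
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 32))
y_train = np.repeat(np.arange(len(EXPRESSIONS)), 10)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # RBF-kernel SVM
clf.fit(X_train, y_train)
predicted = EXPRESSIONS[int(clf.predict(X_train[:1])[0])]
```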
A subset of the Extended Cohn-Kanade (CK+) database [
The fully automated system was evaluated using the complete version of the CK+ database, the Multimedia Understanding Group (MUG) database [
Data Set | Ang. | Dis. | Fea. | Hap. | Sad. | Sur. | Subjects |
---|---|---|---|---|---|---|---|
CK+ | 45 | 59 | 50 | 69 | 56 | 83 | 116 |
MUG | 52 | 51 | 48 | 52 | 49 | 52 | 52 |
TFEID | 34 | 40 | 40 | 40 | 39 | 36 | 40 |
from each sequence (not only peak frames). Thus, the original number of sequences of these expressions is 25 and 28 respectively.
The system was evaluated following a widely used protocol in FER: leave-one-subject-out (LOSO) cross-validation. This method consists of dividing the database according to the number of subjects, such that each sub-group contains only images of the same subject. One of these sub-groups is then held out for testing and the remaining ones are used for training. This procedure is repeated as many times as there are subjects in the database. Finally, the recognition accuracy is averaged over all trials. In addition to the average recognition rate of LOSO, confusion matrices are also presented in the evaluation results. The diagonal entries of the confusion matrices represent the accuracy of the facial expressions correctly classified, whereas the off-diagonal rates represent the misclassifications.
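The LOSO protocol described above maps directly onto scikit-learn's LeaveOneGroupOut, with the subject id as the group; a sketch with synthetic placeholder data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-ins: n_subjects subjects, one image per basic expression
rng = np.random.default_rng(0)
n_subjects, n_expr = 8, 6
X = rng.normal(size=(n_subjects * n_expr, 16))
y = np.tile(np.arange(n_expr), n_subjects)
subjects = np.repeat(np.arange(n_subjects), n_expr)   # group = subject id

loso = LeaveOneGroupOut()                 # one fold per held-out subject
scores = cross_val_score(SVC(kernel="rbf"), X, y, groups=subjects, cv=loso)
mean_accuracy = scores.mean()             # averaged over all trials
```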
The experimental results are divided into four main tests: 1) analysis of sub-block sizes for LFC, where different sub-block sizes are tested using the subset of CK+; 2) analysis of the number of landmarks for FFD, where the geometric features are defined with different numbers of landmarks using the same subset as the previous test; 3) results of LFC + FFD with all data sets, which presents the results of the main proposal of this paper using CK+, MUG and TFEID; and 4) the comparison with previous methods, which presents the performance of different approaches evaluated on the same data sets, including a comparison with approaches that handle partial occlusion.
In this section, several variations of sub-block sizes are proposed in order to find the ideal sub-block size for LFC. Based on the analysis presented in [
In summary, the following analysis presents the performance of LFC when eight different sizes of L in Equation (2) are used for feature vector calculation.
The results of the eight different sub-block sizes are shown in
Choosing the number of landmarks that define the facial shape is an important issue for every FER system based on geometric features. Therefore, this section presents a test of FFD using eight different numbers of fiducial points (K = 31, 41, 51, 64, 81, 93, 115 and 123). The test consists of analyzing FER performance based on different shape representations obtained by changing the number of landmarks used in Equation (11). It is important to mention that, for this particular test, the landmarks were manually annotated for all images of the CK+ subset. The main differences between the eight shape representations reside in the location and the number of facial landmarks of each facial region. For example, for K = 31 the nose region is represented by 7 landmarks, whereas for K = 123 the same region is represented by 29 landmarks.
Results of the eight K values for FFD are shown in
Finally, the last test performed with the subset of CK+ is a comparison of LFC, FFD and LFC + FFD methods using the ideal size of sub-block and the chosen number of facial landmarks (L = 2 and K = 51 respectively). It is important to mention that final feature vectors of LFC, FFD and LFC + FFD were obtained using the combination of all facial regions, as defined in Equations (10), (15) and (17) respectively.
This section presents the results of our main proposal, the fully automated FER system based on LFC + FFD. Feature vectors were obtained with Equation (19) and classified by SVMs as described in Section 3.3. Results obtained with full data sets of CK+, MUG, and TFEID are presented in
Method: | LFC | FFD | LFC + FFD |
---|---|---|---|
Eyes-Eyebrows | 67.1 | 65.8 | 71.7 |
Nose | 67.1 | 69.6 | 75.8 |
Mouth | 85.4 | 90.0 | 90.8 |
Eyes-Eyebrows-Nose | 76.7 | 66.7 | 78.8 |
Eyes-Eyebrows-Mouth | 90.0 | 92.1 | 94.2 |
Nose-Mouth | 89.2 | 85.0 | 93.3 |
All Regions | 92.1 | 92.5 | 95.8 |
Data Set: | CK+ | MUG | TFEID |
---|---|---|---|
Eyes-Eyebrows | 78.7 | 81.1 | 77.8 |
Nose | 86.2 | 80.2 | 74.7 |
Mouth | 87.7 | 85.7 | 80.4 |
Eyes-Eyebrows-Nose | 89.8 | 88.5 | 86.7 |
Eyes-Eyebrows-Mouth | 96.4 | 94.0 | 93.0 |
Nose-Mouth | 94.0 | 89.9 | 88.0 |
All Regions | 97.9 | 95.9 | 94.9 |
TFEID yields the lowest performance when fewer than two regions are used for feature extraction. In other words, it is more difficult to recognize the six basic expressions using only one facial region with the TFEID data set. In summary, the best performance reached by our proposal is based on all regions for feature extraction, and the mouth seems to be the facial region that best represents the six basic expressions.
Tables 4-6 present the confusion matrices of the proposed system evaluated with CK+, MUG and TFEID respectively. From these tables, we can see that among all data sets sadness is the expression with the most recurrent misclassification problems, and in most cases it is misclassified as anger (3%, 10% and 14% for CK+, MUG and TFEID respectively). On the other hand, surprise is the only expression that obtains perfect recognition accuracy on all data sets. Some situations are specific to each data set, e.g. for CK+ anger is misclassified as disgust, for MUG disgust is misclassified as happiness, and for TFEID fear is misclassified as disgust. In general, it can be summarized that the expressions of surprise and happiness are the easiest to recognize among all basic expressions, whereas sadness and anger are the most difficult.
 | Ang. | Dis. | Fea. | Hap. | Sad. | Sur. |
---|---|---|---|---|---|---|
Anger | 95.7 | 2.2 | 0 | 0 | 2.2 | 0 |
Disgust | 1.7 | 98.3 | 0 | 0 | 0 | 0 |
Fear | 0 | 0 | 95.7 | 4.3 | 0 | 0 |
Happiness | 0 | 0 | 1.5 | 98.5 | 0 | 0 |
Sadness | 3.0 | 0 | 0 | 0 | 97.0 | 0 |
Surprise | 0 | 0 | 0 | 0 | 0 | 100 |
 | Ang. | Dis. | Fea. | Hap. | Sad. | Sur. |
---|---|---|---|---|---|---|
Anger | 94.6 | 0 | 2.7 | 0 | 2.7 | 0 |
Disgust | 0 | 97.0 | 0 | 3.0 | 0 | 0 |
Fear | 0 | 0 | 100 | 0 | 0 | 0 |
Happiness | 0 | 0 | 0 | 100 | 0 | 0 |
Sadness | 10.3 | 3.4 | 6.9 | 0 | 79.3 | 0 |
Surprise | 0 | 0 | 0 | 0 | 0 | 100 |
 | Ang. | Dis. | Fea. | Hap. | Sad. | Sur. |
---|---|---|---|---|---|---|
Anger | 91.7 | 0 | 4.2 | 0 | 4.2 | 0 |
Disgust | 0 | 100 | 0 | 0 | 0 | 0 |
Fear | 0 | 4.2 | 95.8 | 0 | 0 | 0 |
Happiness | 0 | 0 | 2.7 | 97.3 | 0 | 0 |
Sadness | 14.3 | 4.8 | 0 | 0 | 81.0 | 0 |
Surprise | 0 | 0 | 0 | 0 | 0 | 100 |
A comparison with other approaches evaluated with the same data sets is shown in this section. CK+ is one of the most used data sets for FER, therefore
Ref. & Year | Method | Classifier | Data | Features | Protocol | Accuracy (%) |
---|---|---|---|---|---|---|
[ | FPDRC + CARC + SDEP | NN | Image | Both | - | 88.70 |
[ | Weighted Feats. | SVM | Image | Geo. | 2-fold | 93.00 |
[ | Boosted LBP | SVM | Image | App. | 10-fold | 95.10 |
[ | PCA | LDCRF | Sequence | Geo. | 4-fold | 95.79 |
[ | DVNP | RF | Sequence | Geo. | 10-fold | 96.38 |
[ | CNN | LR | Image | App. | 8-fold | 96.76 |
[ | PCA Dictionary | SRC | Image | App. | LOSO | 97.19 |
[ | LBP + NCM | SVM | Image | Both | 5-fold | 97.25 |
[ | CNN + DNN | Joint F-N | Sequence | Both | 10-fold | 97.25 |
Proposed | LFC + FFD | SVM | Image | Both | LOSO | 97.90 |
a. “App.” and “Geo.” refer to appearance and geometric features respectively; b. “Both” refers to a combination of appearance and geometric features.
Ref. & Year | Method | Classifier | Data | Features | Protocol | Accuracy (%) |
---|---|---|---|---|---|---|
[ | Gabor + PCA | NN | Image | App. | 2-fold | 89.29 |
[ | Landmark Dist. | SVM | Image | Geo. | 2-fold | 90.50 |
[ | LFDA | kNN | Image | App. | LOSO | 95.24 |
[ | Triangle Land. | SVM | Sequence | Geo. | 10-fold | 95.50 |
Proposed | LFC + FFD | SVM | Image | Both | LOSO | 95.85 |
Ref. & Year | Method | Classifier | Data | Features | Protocol | Accuracy (%) |
---|---|---|---|---|---|---|
[ | Haar Wavelet | LR | Image | App. | 10-fold | 89.58 |
[ | LBP + MPC | SVM | Image | App. | 10-fold | 92.54 |
[ | Pyramid Feat. | SVM | Image | App. | LOSO | 93.38 |
[ | DSNGE | kNN | Image | App. | LOSO | 93.89 |
Proposed | LFC + FFD | SVM | Image | Both | LOSO | 94.94 |
the highest recognition accuracy. This occurs even though some approaches do not use the complete MUG data set, like [
The last comparison with different approaches is focused on the capability to handle the partial occlusion problem. Methods [
Ref. & Year | Data Set | Method | Classifier | Eyes Occluded (%) | Mouth Occluded (%) | NO (%) |
---|---|---|---|---|---|---|
[ | CK | Eigenphases | SVM | 87.7 | 75.3 | 92.0 |
[ | CK | Random Gabor Filters | SVM | 90.5 | 82.9 | 91.5 |
[ | CK | Radial Gabor Filters | LDA + kNN | 95.1 | 90.8 | 95.3 |
Proposed | CK+ | LFC + FFD | SVM | 94.0 | 89.8 | 97.9 |
a. “NO” refers to No Occlusion.
Ref. & Year | Data Set | Method | Features | Eyes Only (%) | Nose Only (%) | Mouth Only (%) |
---|---|---|---|---|---|---|
[ | CK+ | Weighted Feats. | Geo. | 41.9 | 25.5 | 60.4 |
[ | CK | Eigenphases | App. | 53.3 | 61.0 | 79.3 |
Proposed | CK+ | LFC + FFD | Both | 78.7 | 86.2 | 87.7 |
In this paper, we proposed a fully automated FER system based on the combination of two novel feature extraction methods, LFC and FFD, which are focused on appearance and geometric features obtained from the individual facial regions of eyes-eyebrows, nose and mouth. As a result, our proposal is robust to common FER problems such as illumination changes and image rotation, while also addressing redundancy and dimensionality reduction. In addition, unlike the reviewed state-of-the-art approaches, our proposal works well even when fiducial points are not accurately detected. This is possible because the appearance feature extraction does not depend on the extraction of geometric features. Thus, this proposal only depends on face and eye detection, carried out by the robust Viola-Jones algorithm, which achieved 100% detection on all data sets tested. Evaluation results also show that the proposed system can handle partial occlusion without a heavy decrease in its accuracy.
In general, the results obtained with the proposed algorithm surpass most previous works. In addition, compared with recently popular methods such as CNNs and DNNs, our system shows better performance with the CK+, MUG and TFEID data sets, reaching 98%, 96% and 95% respectively. On the other hand, we acknowledge that the present work may be limited by head pose variations and non-frontal images. Therefore, in order to efficiently recognize spontaneous facial expressions, we will focus on solving these problems in future work.
Finally, the proposed method should also be applicable to other tasks such as face recognition, facial action unit recognition and facial image understanding. Therefore, in the future, we would like to apply our method to some of these possible applications.
We would like to thank the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan for the Japanese government (Monbukagakusho) scholarship which supports the Ph.D. studies of the first author.
Benitez-Garcia, G., Nakamura, T. and Kaneko, M. (2017) Facial Expression Recognition Based on Local Fourier Coefficients and Facial Fourier Descriptors. Journal of Signal and Information Processing, 8, 132-151. https://doi.org/10.4236/jsip.2017.83009