Open Journal of Social Sciences, 2014, 2, 12-18
Published Online July 2014 in SciRes. http://www.scirp.org/journal/jss
http://dx.doi.org/10.4236/jss.2014.27003
How to cite this paper: Zhang, Y.Y. (2014) Robust Factor Analysis and Its Applications in the CSI 100 Index. Open Journal of
Social Sciences, 2, 12-18. http://dx.doi.org/10.4236/jss.2014.27003
Robust Factor Analysis and Its Applications
in the CSI 100 Index
Yingying Zhang
Department of Statistics and Actuarial Science, Chongqing University, Chongqing, China
Email: robertzhangyying@qq.com
Received May 2014
Abstract
We apply robustfa, an object-oriented R package for robust factor analysis, to 28 financial indicators of the 100 listed companies in the China Securities Index (CSI) 100 in the first quarter of 2013. First, a robust estimator, the Ogk estimator, is chosen automatically according to the size of the data. Using the Mahalanobis distances computed from the robust Ogk estimator, we find 47 outliers, i.e., observations whose robust distance exceeds the critical value. The classical estimator and the robust Ogk estimator give quite different results for the sample correlation matrix, the rotated factor loading matrix, the contribution of the factors to the original variables, the contribution rate, the cumulative contribution rate, the scree plot of the eigenvalues of the sample correlation matrix, the scatter plot of the first two factor scores, the factor scores, and the rankings based on the factor scores. Combining the principal component method with the robust Ogk estimator, we condense the 28 financial indicators into 5 factors: the provident fund market value factor, the profit factor, the market value profit rate factor, the value per share factor, and the asset liability factor. Finally, we sort the scores of each factor from high to low and identify some special stocks according to the factor scores. The robust factor analysis results provide a good basis for investors choosing stocks.
Keywords
Robustness, Factor Analysis, R Software, CSI 100 Index, Financial Indicators
1. Introduction
The stock market is an important part of the rapid development of China's socialist market economic system. To achieve long-term stability and healthy development of the stock market, research on and regulation of the domestic stock market must be strengthened. Listed companies are the foundation and the objects of stock market research and supervision. An objective evaluation of the operating performance of listed companies is one of the main bases on which securities companies operate and investors make investment decisions. Making a scientific and fair evaluation of listed companies is therefore one of the main subjects of stock market research.
The 100 stocks of the CSI 100 Index are the largest constituent stocks selected from the CSI 300 Index. The index comprehensively reflects the overall state of the most influential large-capitalization companies in the Shanghai and Shenzhen stock markets. Therefore, it is of great practical significance to evaluate these 100 listed companies effectively.
Factor analysis [1]-[4] is a dimension reduction method in multivariate statistical analysis. It models the hidden factors underlying observed phenomena by studying the internal dependence structure of the correlation matrix or the covariance matrix. It condenses the original variables into a few factors and displays the relationship between the original variables and the factors.
Outliers exist in virtually every dataset in every field of application. The classical estimators of the multivariate mean and covariance matrix are the sample mean and the sample covariance matrix. Outliers affect both, and thus they affect classical factor analysis, which depends on these classical estimators [5]. To limit the impact of outliers, it is therefore necessary to use robust estimators of the mean and the covariance matrix. Several robust estimators are available in the literature: MCD, OGK, MVE, M, S, and Stahel-Donoho. The most direct way to robustify a multivariate analysis is to replace the sample mean and the sample covariance matrix with robust estimates [6] [7], and this is the approach to robust factor analysis that we take. A base package for robust multivariate analysis is robustbase [8]. The package robust [9] has a large overlap with robustbase. The package rrcov [10] builds upon robustbase. It provides many methods based on S4 classes, contains an almost complete set of estimators for computing robust means and robust covariance matrices, and also contains robust principal component analysis, robust linear discriminant analysis, and robust quadratic discriminant analysis. The package robustfa [11] used in this paper follows rrcov and was created to perform robust factor analysis.
2. Empirical Analysis
2.1. Sample Selection and Data Source
This paper uses 28 financial indicators of the 100 listed companies in the China Securities Index (CSI) 100 in the first quarter of 2013 for the empirical research. The data were downloaded from the free Great Wisdom software.
2.2. Robust Factor Analysis
Here we use the FaCov() function to perform robust factor analysis. Its usage is:
FaCov(x, factors = 2, cor = FALSE, cov.control = CovControlMcd(), method = c("mle", "pca", "pfa"),
scoresMethod = c("none", "regression", "Bartlett"), ...)
Here x is a numeric matrix or an object that can be coerced to a numeric matrix (such as a data frame); factors is the number of factors, 2 by default; cor is a logical value indicating whether the calculation should use the covariance matrix (cor = FALSE, the default) or the correlation matrix (cor = TRUE). cov.control specifies which covariance estimator to use by providing a CovControl-class object. The default is a CovControlMcd-class object, which indirectly calls CovMcd(). If cov.control = NULL is specified, the classical estimates are used by calling CovClassic(). method is the method of factor analysis, one of "mle" (the default), "pca", and "pfa". scoresMethod specifies which type of scores to produce. The default is "none"; "regression" gives Thompson's scores, and "Bartlett" gives Bartlett's weighted least-squares scores.
First we choose the value of cov.control. The function getMeth() in rrcov can be used to choose one of the robust estimators according to the size of the dataset.
R> getMeth(CovRobust(scale(ZhongZheng100_28)))
[1] "Orthogonalized Gnanadesikan-Kettenring Estimator"
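Therefore, we take cov.control = CovControlOgk(). We then choose the value of method. The criterion is to choose the method with the smallest sum of squares $Q(E)$ of the residual matrix $E$, where

$$E = R - (AA^T + D) = (e_{ij})_{p \times p}, \qquad Q(E) = \sum_{i=1}^{p} \sum_{j=1}^{p} e_{ij}^2.$$

Here $R$ is the sample correlation (or covariance) matrix being factored, $A$ is the factor loading matrix, and $D$ is the diagonal matrix of specific variances.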
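The criterion can be sketched in base R for the principal component method. This is only an illustration under stated assumptions: the helper `q_e_pca()` and the simulated data are hypothetical and are not part of robustfa, and the loadings are left unrotated, which does not matter here because an orthogonal rotation leaves $AA^T$, and hence $Q(E)$, unchanged.

```r
# Hypothetical helper (not part of robustfa): Q(E) for the principal
# component method applied to a correlation matrix R with k factors.
q_e_pca <- function(R, k) {
  eig <- eigen(R, symmetric = TRUE)
  # Loadings A: leading k eigenvectors scaled by sqrt(eigenvalue)
  A <- eig$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(eig$values[1:k]), k)
  D <- diag(diag(R - A %*% t(A)))   # specific variances on the diagonal
  E <- R - (A %*% t(A) + D)         # residual matrix
  sum(E^2)                          # Q(E): sum of squared residuals
}

set.seed(1)
x <- matrix(rnorm(200 * 6), ncol = 6)   # simulated data, 6 variables
x[, 2] <- x[, 1] + 0.5 * x[, 2]         # induce some correlation
R <- cor(x)

q2 <- q_e_pca(R, 2)   # Q(E) with 2 factors
q4 <- q_e_pca(R, 4)   # Q(E) with 4 factors
```

With as many factors as variables the residual matrix vanishes up to rounding error, which gives a quick sanity check of the helper.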
The results are given in Table 1, which also reports $Q(E)$ for the three classical factor analysis methods. The smallest $Q(E)$ in Table 1 belongs to the combination (pfa, Classic), but the classical sample mean and sample covariance matrix are severely affected by outliers, so we select the minimum $Q(E)$ among the robust estimates instead. This is the combination (pfa, Ogk), with $Q(E) = 1.946182$.

Table 1. The Q(E) values of classical and robust factor analysis.
       Classic     Ogk
pca    2.131621    2.568465
pfa    1.412886    1.946182
mle    NA          NA
We can also calculate the difference between the sample correlation matrices of the robust Ogk estimator and the classical estimator. Due to limited space, we do not show it here; readers can run the program in R to see the result. The difference is large, which means that the outliers seriously influence the classical sample correlation matrix.
Now we use the myplotDD() function in the package robustfa to plot a Cov-class object; the figure is omitted due to limited space. Here the robust estimator is the Ogk estimator. We find that the robust Mahalanobis distances are much larger than the classical Mahalanobis distances, and that outliers have large robust Mahalanobis distances. myplotDD() also shows id.n and ind. Here id.n is the number of observations to identify by a label; by default, it is the number of observations with robust distances larger than cutoff, where by default cutoff = sqrt(qchisq(0.975, p)). ind is the index of the robust distances whose values are larger than cutoff.
cutoff = 6.667893
id.n = length(which(rd > cutoff))
id.n = 47
Here y is the robust distance (rd).
sort.y = (To save space, only the smallest five and largest five
elements of sort.y$x and sort.y$ix are shown.)
$x
Omitted to save space
$ix
[1] 20 19 72 4 60 … 74 98 71 96 77
ind =
[1] 32 49 41 40 2 45 43 59 34 23 44 78 94 42 29 25
[17] 3 90 76 70 84 22 80 28 75 52 63 69 30 39 33 1
[33] 82 92 93 73 68 79 100 26 31 37 74 98 71 96 77
From the above results we see that cutoff = 6.667893. There are id.n = 47 observations with robust distance larger than cutoff. sort.y is a list containing the sorted values of y (the robust distance). sort.y$x is arranged in increasing order, and sort.y$ix contains the corresponding indices. ind shows the indices of the outliers.
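The quantities above can be reproduced in outline with base R. The sketch below is illustrative only: the robust distances `rd` are simulated placeholders rather than the paper's data, and taking `ind` as the last `id.n` entries of `sort.y$ix` is an assumption inferred from the printed output, not a statement about how myplotDD() is implemented.

```r
p <- 28                              # number of financial indicators
cutoff <- sqrt(qchisq(0.975, p))     # default cutoff quoted in the text

set.seed(1)
n <- 100
# Hypothetical robust distances (placeholders, not the paper's data)
rd <- sqrt(rchisq(n, df = p))
rd[c(5, 17, 42)] <- c(12, 30, 9)     # plant three obvious outliers

id.n <- length(which(rd > cutoff))        # number of flagged observations
sort.y <- sort(rd, index.return = TRUE)   # $x: sorted values, $ix: indices
ind <- sort.y$ix[(n - id.n + 1):n]        # indices of the id.n largest distances
```

For p = 28 the computed cutoff agrees with the value 6.667893 reported above.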
Next, we try to draw the scatter plot of the first two factor scores together with the 97.5% confidence ellipses of the two estimators, and we get an error message:
Error in solve.default(cov, ...) :
system is computationally singular: reciprocal condition number = 8.61033e-09
There are two likely reasons. First, the determinants (det) of the sample correlation matrices of the two estimators are close to 0, so the matrices are nearly singular. Second, the condition numbers (kappa) of the sample correlation matrices of the two estimators are very large, so the matrices are ill-conditioned. The determinants and condition numbers of the sample correlation matrices of the two estimators are shown in Table 2. In this case, the principal factor analysis method still iterates on the sample correlation matrix, the resulting factor loading matrix is very poor, and thus the factor scores are poor as well. Therefore, for this dataset, we do not pursue the smallest Q(E) and instead examine the results of method = "pca".
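Both diagnostics are available in base R. The following sketch uses a small simulated matrix with nearly collinear columns (hypothetical data, not the CSI 100 indicators) to show how `det()`, `kappa()`, and `rcond()` expose an ill-conditioned correlation matrix; whether the values in Table 2 were obtained with `exact = TRUE` is not stated in the text, so that argument is an assumption.

```r
set.seed(1)
z <- matrix(rnorm(100 * 3), ncol = 3)
# Hypothetical data: the fourth column is almost an exact linear
# combination of the first two, so the columns are nearly collinear.
x <- cbind(z, z[, 1] + z[, 2] + 1e-6 * rnorm(100))
R <- cor(x)

det(R)                   # close to 0: R is nearly singular
kappa(R, exact = TRUE)   # very large 2-norm condition number
rcond(R)                 # reciprocal condition number, the quantity solve() reports
```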
Table 2. Determinants (det) and condition numbers (kappa) of the sample correlation matrices of the two estimators.
               det             kappa
R_classical    7.153293e-38    2.920069e+15
R_robust       1.766092e-32    2.502002e+13

Table 3 shows the rotated factor loading matrix, the communality, the contribution of the factors to the original variables, the contribution rate, and the cumulative contribution rate calculated by the classical estimator and the robust Ogk estimator with method = "pca". In each row of the factor loading matrix in Table 3, the element with the largest absolute value is displayed in red font if that value is greater than or equal to 0.4, so at most one element per row is red; these elements mainly explain the corresponding variables. Aside from the red elements, elements with absolute value greater than or equal to 0.3 are displayed in green font; these elements partly explain the corresponding variables. From Table 3, the cumulative contribution rate of the first five factors reaches 72.1% for the classical estimator and 73.6% for the robust Ogk estimator.

Table 3. The rotated factor loading matrices and other results of the classical estimator and the robust Ogk estimator.
                    Classical estimator (factor loading)           Robust Ogk estimator (factor loading)
Variables           F1     F2     F3     F4     F5     h_i^2       F1     F2     F3     F4     F5     h_i^2
x1                  0.053  0.901  0.155  0.155  0.161  0.889       0.272  0.617  0.149  0.632  0.108  0.888
x2                  0.124  0.897  0.102  0.196  0.149  0.891       0.252  0.024  0.087  0.909  0.125  0.914
x3                  0.263  0.586  0.168  0.507  0.161  0.723       0.444  0.737  0.171  0.205  0.032  0.813
x4                  0.035  0.619  0.459  0.121  0.129  0.627       0.110  0.531  0.161  0.467  0.082  0.545
x5                  0.238  0.397  0.562  0.374  0.145  0.691       0.566  0.321  0.082  0.487  0.025  0.667
x6                  0.052  0.882  0.229  0.016  0.063  0.837       0.073  0.203  0.051  0.842  0.129  0.775
x7                  0.242  0.095  0.858  0.095  0.157  0.837       0.048  0.044  0.162  0.017  0.927  0.889
x8                  0.046  0.117  0.006  0.003  0.313  0.113       0.110  0.345  0.650  0.003  0.249  0.615
x9                  0.091  0.319  0.111  0.145  0.386  0.292       0.066  0.327  0.518  0.191  0.014  0.416
x10                 0.898  0.054  0.286  0.063  0.146  0.916       0.682  0.214  0.080  0.131  0.616  0.913
x11                 0.433  0.047  0.138  0.638  0.297  0.703       0.364  0.018  0.489  0.167  0.166  0.428
x12                 0.696  0.021  0.031  0.022  0.029  0.487       0.089  0.048  0.540  0.240  0.374  0.500
x13                 0.884  0.056  0.296  0.097  0.156  0.907       0.511  0.197  0.056  0.169  0.728  0.862
x14                 0.920  0.033  0.107  0.319  0.014  0.960       0.944  0.155  0.103  0.001  0.146  0.948
x15                 0.763  0.021  0.383  0.376  0.088  0.878       0.937  0.162  0.067  0.020  0.077  0.914
x16                 0.684  0.109  0.094  0.313  0.347  0.707       0.358  0.486  0.507  0.220  0.155  0.694
x17                 0.943  0.047  0.138  0.127  0.061  0.930       0.217  0.917  0.037  0.062  0.135  0.910
x18                 0.387  0.063  0.004  0.115  0.415  0.339       0.029  0.256  0.226  0.391  0.159  0.295
x19                 0.944  0.046  0.138  0.128  0.056  0.931       0.212  0.919  0.024  0.083  0.142  0.917
x20                 0.942  0.039  0.135  0.137  0.062  0.929       0.161  0.926  0.022  0.133  0.116  0.914
x21                 0.170  0.198  0.099  0.104  0.098  0.098       0.063  0.212  0.145  0.323  0.348  0.295
x22                 0.051  0.100  0.317  0.027  0.785  0.730       0.101  0.258  0.822  0.101  0.223  0.812
x23                 0.116  0.157  0.569  0.249  0.351  0.547       0.444  0.140  0.429  0.062  0.344  0.522
x24                 0.911  0.044  0.047  0.341  0.024  0.952       0.739  0.440  0.320  0.143  0.032  0.863
x25                 0.693  0.092  0.063  0.531  0.008  0.774       0.719  0.394  0.339  0.118  0.082  0.807
x26                 0.374  0.113  0.814  0.086  0.222  0.871       0.055  0.064  0.085  0.133  0.943  0.921
x27                 0.250  0.093  0.879  0.108  0.101  0.865       0.074  0.007  0.129  0.022  0.936  0.899
x28                 0.327  0.164  0.091  0.170  0.764  0.755       0.082  0.050  0.783  0.013  0.186  0.658
Contribution        8.853  3.568  3.687  1.850  2.226              4.862  5.046  3.413  3.002  4.274
Contribution rate   0.316  0.127  0.132  0.066  0.079              0.174  0.180  0.122  0.107  0.153
Cumulative rate     0.316  0.444  0.575  0.641  0.721              0.174  0.354  0.476  0.583  0.736
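The contribution rates reported in Table 3 can be checked with a few lines of base R. The sketch assumes, as the arithmetic of the table suggests, that each contribution rate is the factor's contribution divided by the number of variables p = 28; the variable names are illustrative.

```r
p <- 28   # number of original variables
# "Contribution" row of Table 3 for each estimator
contrib_classical <- c(8.853, 3.568, 3.687, 1.850, 2.226)
contrib_robust    <- c(4.862, 5.046, 3.413, 3.002, 4.274)

rate_classical <- contrib_classical / p   # contribution rates
rate_robust    <- contrib_robust / p

round(cumsum(rate_classical), 3)   # 0.316 0.444 0.575 0.641 0.721
round(cumsum(rate_robust), 3)      # 0.174 0.354 0.476 0.583 0.736
```

The last cumulative values correspond to the 72.1% and 73.6% quoted in the text.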
Since the results of classical factor analysis are greatly affected by the outliers, we only analyze the results of the robust Ogk estimator. From Table 3, we see the following.
Factor 1 mainly explains x5 (provident fund per share), x10 (total assets), x14 (shareholders' equity), x15 (capital accumulation fund), x23 (PB, i.e., price/book value ratio), x24 (total market value), and x25 (circulation market value), and partially explains x3 (return on equity), x11 (fixed assets), x13 (total liabilities), and x16 (main business income). It is therefore called the provident fund market value factor.
Factor 2 mainly explains x3 (return on equity), x4 (operating cash per share), x17 (operating profit), x19 (total profit), and x20 (net profit), and partially explains x1 (earnings per share), x5 (provident fund per share), x8 (net profit on year-on-year basis), x9 (main business revenue on year-on-year basis), x16 (main business income), x24 (total market value), and x25 (circulation market value). It is therefore called the profit factor.
Factor 3 mainly explains x8 (net profit on year-on-year basis), x9 (main business revenue on year-on-year basis), x11 (fixed assets), x12 (intangible assets), x16 (main business income), x22 (price-to-sales ratio), and x28 (net profit margin of main business), and partially explains x23 (PB, i.e., price/book value ratio), x24 (total market value), and x25 (circulation market value). It is therefore called the market value profit rate factor.
Factor 4 mainly explains x1 (earnings per share), x2 (net assets per share), and x6 (unallocated per share), and partially explains x4 (operating cash per share), x5 (provident fund per share), x18 (non-operating income and expenses), and x21 (p/e ratio). It is therefore called the value per share factor.
Factor 5 mainly explains x7 (shareholders' equity ratio), x13 (total liabilities), x26 (equity ratio), and x27 (asset-liability ratio), and partially explains x10 (total assets), x12 (intangible assets), x21 (p/e ratio), and x23 (PB, i.e., price/book value ratio). It is therefore called the asset liability factor.
We can sort each of the 5 factor scores computed by the robust factor analysis from high to low in tables. Because of limited space, the tables are omitted. Among the top 10 and the last 10 stocks of each factor (100 entries in total), 81 are outliers under the robust Ogk estimator. This is to be expected, because the sample points with the largest or smallest factor scores are likely to be outliers.
The ranges of the 5 factor scores computed by the robust Ogk estimator are:

$$-1.5 \le \text{Factor 1} \le 64.6, \quad -6.1 \le \text{Factor 2} \le 66.0, \quad -25.8 \le \text{Factor 3} \le 10.9, \quad -51.6 \le \text{Factor 4} \le 10.5, \quad -11.4 \le \text{Factor 5} \le 54.5.$$

Although the maximum and minimum values of each factor score are asymmetric about zero, in some cases strongly so, the average factor scores of the regular points are zero. If a score is close to zero, then the corresponding stock is close to the average level on that factor.
The larger the value of Factor 1, the higher the provident fund market value, and conversely. Industrial and Commercial Bank of China, Construction Bank, Bank of China, Agricultural Bank of China, and China's Oil have large values of Factor 1, so their provident fund market values are high; Guodian South and Yanghe Shares have small values of Factor 1, so their provident fund market values are low.
The larger the value of Factor 2, the stronger the profitability, and conversely. Industrial and Commercial Bank of China, Construction Bank, and Agricultural Bank of China have large values of Factor 2, so their profitability is strong; Aluminum Corporation of China and *ST Anshan Iron & Steel have small values of Factor 2, so their profitability is weak.
The larger the value of Factor 3, the higher the market value profit rate, and conversely. Bank of China and Construction Bank have large values of Factor 3, so their market value profit rates are high; Aluminum Corporation of China, Industrial and Commercial Bank of China, and China Petroleum & Chemical Corporation have small values of Factor 3, so their market value profit rates are low.
The larger the value of Factor 4, the higher the value per share, and conversely. Guizhou Maotai, Ping An Bank, and China's Ping An have large values of Factor 4, so their values per share are high; Industrial and Commercial Bank of China, Construction Bank, Agricultural Bank of China, and Bank of China have small values of Factor 4, so their values per share are low.
The larger the value of Factor 5, the higher the asset liability value, and conversely. Industrial and Commercial Bank of China, Bank of China, Agricultural Bank of China, Construction Bank, and Bank of Communications have large values of Factor 5, so their asset liability values are high; Aluminum Corporation of China, China's Oil, and China Shenhua Energy have small values of Factor 5, so their asset liability values are low.
There are some special stocks with extreme factor values. For each of the 5 factors in the factor score matrix, the top 10 stocks are said to have "big" factor values and the last 10 stocks "small" factor values. The number of stock categories with special scores on 2 factors is

$$\binom{5}{2} \times 2^2 = 10 \times 4 = 40.$$

Of course, not all of the 40 categories contain stocks. The number of stock categories with special scores on 2, 3, 4, or 5 factors is $40 + 80 + 80 + 32 = 232$. By calculation, 67 of the 232 categories contain stocks. To save space, we only report the stocks with 5 special factor scores: Construction Bank, Bank of China, Agricultural Bank of China, and China Merchants Bank have the 5 factor scores (big, big, big, small, big); Industrial and Commercial Bank of China has the 5 factor scores (big, big, small, small, big); China's Oil and China Petroleum & Chemical Corporation have the 5 factor scores (big, big, small, small, small).
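The category counts follow from choosing which k of the 5 factors carry a special score and then assigning "big" or "small" to each of them. A short base R check of this arithmetic, with illustrative variable names:

```r
n_factors <- 5
k <- 2:n_factors
# choose(5, k) ways to pick the factors, 2^k big/small patterns for them
n_categories <- choose(n_factors, k) * 2^k
n_categories        # 40 80 80 32
sum(n_categories)   # 232
```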
3. Summary
We apply robustfa, an object-oriented R package for robust factor analysis, to 28 financial indicators of the 100 listed companies in the China Securities Index (CSI) 100 in the first quarter of 2013. First, a robust estimator, the Ogk estimator, is chosen automatically according to the size of the data. Using the Mahalanobis distances computed from the robust Ogk estimator, we find 47 outliers, i.e., observations whose robust distance exceeds the critical value. The classical estimator and the robust Ogk estimator give quite different results for the sample correlation matrix, the rotated factor loading matrix, the contribution of the factors to the original variables, the contribution rate, the cumulative contribution rate, the scree plot of the eigenvalues of the sample correlation matrix, the scatter plot of the first two factor scores, the factor scores, and the rankings based on the factor scores. Combining the principal component method with the robust Ogk estimator, we condense the 28 financial indicators into 5 factors: the provident fund market value factor, the profit factor, the market value profit rate factor, the value per share factor, and the asset liability factor. Finally, we sort the scores of each factor from high to low and identify some special stocks according to the factor scores; most of them are outliers. The number of stock categories with special scores on 2, 3, 4, or 5 factors is 232, and 67 of these categories contain stocks. The robust factor analysis results provide a good basis for investors choosing stocks.
Acknowledgements
Yingying Zhang thanks the Natural Science Foundation Project of CQ CSTC (CSTC2011BB0058).
References
[1] Yang, H. (2013) Multivariate Statistical Analysis. Chongqing University Press, Chongqing.
[2] Xue, Y. and Chen, L.P. (2009) Statistical Modeling and R Software. Tsinghua University Press, Beijing.
[3] Wang, X.M. (2009) Applied Multivariate Analysis. 3rd Edition, Shanghai University of Finance and Economics Press, Shanghai.
[4] Zhang, T.J., Yang, A.M. and Zhang, C.H. (2008 ) An Empirical Study of Operational Risk Control Model of State-
Owned Commercial Banks—Based on Exploratory Factor Analysis and Confirmatory Factor Analysis Point Inspection.
Journal of Chongqing University (Social Science Edition), 14, 36-43.
[5] Pison, G., Rousseeuw, P.J., Filzmoser, P. and Croux, C. (2003) Robust Factor Analysis. Journal of Multivariate Analysis, 84, 145-172. http://dx.doi.org/10.1016/S0047-259X(02)00007-6
[6] Maronna, R.A., Martin, D. and Yohai, V. (2006) Robust Statistics: Theory and Methods. John Wiley & Sons, New York. http://dx.doi.org/10.1002/0470010940
[7] Todorov, V. and Filzmoser, P. (2009) An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32, 1-47. http://www.jstatsoft.org/v32/i03/
[8] Rousseeuw, P.J., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T. and Maechler, M. (2013) Robustbase: Basic Robust Statistics. R Package Version 0.9-10. http://CRAN.R-project.org/package=robustbase
[9] Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., Zivot, E., Rocke, D., Martin, D. and Konis, K. (2013) Robust: Insightful Robust Library. R Package Version 0.4-15. http://CRAN.R-project.org/package=robust
[10] Todorov, V. (2013) Rrcov: Scalable Robust Estimators with High Breakdown Point. R Package Version 1.3-4. http://CRAN.R-project.org/package=rrcov
[11] Zhang, Y.Y. (2013) Robustfa: An Object Oriented Solution for Robust Factor Analysis. R Package Version 1.0-5. http://CRAN.R-project.org/package=robustfa