Journal of Software Engineering and Applications, 2013, 6, 37-41
doi:10.4236/jsea.2013.63b009 Published Online March 2013 (http://www.scirp.org/journal/jsea)
Copyright © 2013 SciRes. JSEA
Stock Market Prediction Using Heat of Related Keywords
on Micro Blog
Shengchen Zhou, Xunzhi Shi, Yunchen Sun, Wenting Qu, Yingzi Shi
Sydney Institute of Language & Commerce, Shanghai University, Shanghai, China.
Email: 8860578@163.com
Received 2013
ABSTRACT
Whether stock market investors’ emotions can influence the stock market itself is one of the hot
topics in financial research. In this paper, a method based on the heat of related keywords on
Micro Blog is proposed, as Micro Blog is an ideal source for capturing public opinion on a given
topic. We choose the Shanghai Composite Index as the research object. Through correlation analysis,
Granger causality analysis, and support vector machine classification, the results show that
keyword heat on micro blogs can make a short-term prediction of the stock market, and that the
keyword expressing negative emotion has stronger predictive ability.
Keywords: Micro Blog; Stock Market Prediction; Emotion; SVM
1. Introduction
Stock market prediction has attracted widespread attention from the academic and business
communities. Whether the stock market can be predicted is closely related to the Efficient
Market Hypothesis (EMH). If the EMH holds, stock prices reflect all relevant information, and
price changes then follow a random walk, which means that stock prices cannot be predicted [1].
On the other hand, according to several studies, the stock market does not fully comply with
the random walk hypothesis and still has some predictable components [2-4], and investor
emotion is one of the vital factors that can influence stock market volatility [5-7].
In China, investor emotion data has four main sources: 1) CCTV Index; 2) Haodan Index;
3) Huading Pupil Opinion Survey; 4) Real-time Survey of stock software [8]. Since these
sources are mostly collected by conventional methods, they have certain limitations for
detecting the emotion of the mass of investors.
In recent years, with the popularity of Internet social applications, researchers have begun
to detect public emotions through popular social networking websites. For instance, Asur et
al. [9] used Twitter.com (a famous micro blog website) to predict movie box-office performance
based on the public emotions extracted from Twitter; Johan Bollen et al. [10] found that micro
blogs labeled as “calm” have a powerful ability to predict the Dow Jones Industrial Average,
with a highest accuracy of over 80%.
The above studies focus primarily on sentiment analysis of micro blog text. However, data
that expresses public emotion does not come only from the text itself. We propose detecting
emotion by observing the daily number of micro blogs related to certain keywords (the heat of
keywords), and we use this method to predict the stock market. We collected the heat of six
stock market terms from July 1st, 2011 to December 30th, 2011 (from 16:00 on the previous day
to 10:00 on the current day). The resulting time series are compared with the Shanghai
Composite Index through correlation analysis. After that, traditional statistical and
artificial intelligence approaches are used to measure the predictive ability of keyword heat.
The results show that keyword heat on micro blogs can make a short-term prediction of the
stock market, and that the heat of the keyword reflecting negative emotion has stronger
predictive ability.
2. Data Collection & Processing
2.1. Data of Keywords Heat
We use the Sina Weibo search platform and the GooSeeker web information extraction tool to
collect the daily number of micro blogs related to six keywords during the period from
July 1st, 2011 to December 30th, 2011. The six keywords, in Chinese, are “牛市”, “熊市”,
“利好”, “利空”, “股市~a ~沪深~股指~大盘 上涨” and “股市~a ~沪深~股指~大盘 下跌”, which
translate as “Bull Market”, “Bear Market”, “Positive news”, “Negative news”, “Stock Index
rise” and “Stock Index drop”. After
a sampling analysis of the related micro blogs, we found that these Chinese keywords can
express investors’ positive or negative emotions to some extent. It is worth mentioning that
the keyword heat was collected from 16:00 on the previous day (an hour after the stock market
closed) to 10:00 on the current day (half an hour after the stock market opened). For most of
this period the stock market is closed, so micro blogs containing these keywords are more
likely to express investors’ emotions than to be objective comments. A weighted average was
used to process the keyword heat over holidays.
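Since the stock market was closed on July 2nd and July 3rd, we calculate the keyword heat on
July 4th according to Eq. (1):
$$\tilde{H}_{7/4} = 0.2\,H_{7/2} + 0.2\,H_{7/3} + 0.6\,H_{7/4} \quad (1)$$
where $H_{d}$ is the keyword heat collected for date $d$ and $\tilde{H}_{7/4}$ is the
resulting weighted heat used for July 4th.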
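The weighting in Eq. (1) can be illustrated with a minimal Python sketch. The function name `holiday_weighted_heat` and the sample counts are hypothetical, and the sketch only covers the two-day closure described above; the weights used for longer holidays are not stated in this paper.

```python
def holiday_weighted_heat(closed_day_heats, reopening_heat,
                          closed_weight=0.2, reopening_weight=0.6):
    """Combine the heat of market-closed days with the heat of the reopening day.

    With two closed days this reproduces Eq. (1): each closed day is weighted
    0.2 and the reopening day is weighted 0.6.
    """
    total_weight = closed_weight * len(closed_day_heats) + reopening_weight
    if abs(total_weight - 1.0) > 1e-9:
        # Assumption: weights must sum to 1, as they do in Eq. (1).
        raise ValueError("weights do not sum to 1 for this number of closed days")
    return closed_weight * sum(closed_day_heats) + reopening_weight * reopening_heat


# Hypothetical micro blog counts for July 2nd, July 3rd and July 4th.
adjusted_heat = holiday_weighted_heat([120, 80], 300)
print(adjusted_heat)
```

The explicit check on the weights makes the sketch fail loudly if it is applied to a closure of a different length instead of silently producing a badly scaled value.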
2.2. Data of Stock Market
We collected the closing prices of the Shanghai Composite Index from Google Finance for the
period from July 1st, 2011 to December 30th, 2011. We used +1 or -1 to label whether the
index closed up or down compared with the previous day. In total, the closing price was up on
52 days and down on 73 days. The overall trend of the Shanghai Composite Index was declining.
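A minimal sketch of this labeling step follows. The function name `direction_labels` and the sample prices are hypothetical, and treating an unchanged close as -1 is an assumption, since that case is not specified above.

```python
def direction_labels(closing_prices):
    """Label each day +1 if it closed above the previous day's close, else -1."""
    labels = []
    for previous, current in zip(closing_prices, closing_prices[1:]):
        # Assumption: an unchanged close is counted as a "down" day (-1).
        labels.append(1 if current > previous else -1)
    return labels


# Hypothetical closing prices for five consecutive trading days.
prices = [2759.0, 2812.5, 2816.0, 2810.0, 2794.0]
labels = direction_labels(prices)
print(labels)
print("up days:", labels.count(1), "down days:", labels.count(-1))
```

Because each label compares a day with its predecessor, a series of N closing prices yields N - 1 labels.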
3. Methodology and Experiment
3.1. Overview
Figure 1 illustrates the idea of our method. Specifically, there are three main phases:
1) correlation analysis; 2) Granger causality analysis to find the keywords which have
predictive ability; 3) up or down judgments based on a support vector machine (SVM)
classification model.
3.2. Correlation Analysis
In order to find the correlation between the heat of the six keywords and the closing price
of the Shanghai Composite Index, we used Eviews to perform a correlation analysis. The
results are shown in Table 1. Some of the six keywords have a relatively significant
correlation with the closing price of the Shanghai Composite Index.
The public express their feelings and views toward events through micro-blogging. As a
result, when a significant event occurs, the number of micro blogs containing related topics
or keywords changes. Figure 2 shows the heat of the six keywords from July 1st, 2011 to
December 30th, 2011. In order to compare the six keywords on the same scale, we standardized
them using the Z-score, as in Eq. (2):
$$Z = \frac{X - \mu}{\sigma} \quad (2)$$
In this equation, μ represents the mean and σ stands for the standard deviation.
Figure 1. Phases of methodology.
Table 1. Results of correlation analysis.
Figure 2. July 1st – December 30th Z-score lines of six keywords heat.
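The Z-score standardization of Eq. (2) can be sketched as follows. The function name `z_scores` and the sample series are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption, since the choice is not stated above.

```python
import statistics


def z_scores(values):
    """Standardize a series as (X - mean) / standard deviation, as in Eq. (2)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(value - mean) / std for value in values]


# Hypothetical daily heat counts for one keyword.
heat = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(heat))
```

After this transformation every keyword series has mean 0 and standard deviation 1, which is what allows keywords with very different raw counts to be drawn on one chart.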
From Figure 2, the following can be observed. On July 7th, the People’s Bank of China
announced that it would raise the bank reserve ratio by 0.25 percentage points, and the heat
of the six keywords showed an obvious turning point that day. On August 5th, Standard &
Poor’s downgraded the credit rating of the United States government, and the heat of some
keywords reacted negatively. On November 5th, the People’s Bank of China announced that it
would reduce the bank reserve ratio by 0.5 percentage points; following this announcement,
the heat of these keywords shot up over the next three days, especially the heat of
“positive news”. Keyword heat can therefore reflect events indirectly and carries the
attitude of the public, so it could be a tool to help judge the emotion of investors. In
addition, keyword heat is suitable for analyzing the stock market over the short term.
3.3. Granger Causality Analysis
To further examine the link between keyword heat and the Shanghai Composite Index, this
section uses Granger causality analysis to test the hypothesis that changes in keyword heat
occur before changes in the Shanghai Composite Index.
Granger causality is one focus of econometric research. It is defined as follows: to
determine whether X causes Y, first examine to what extent the current value of Y can be
explained by past values of Y, and then check whether adding lagged values of X improves the
explanation. If the lagged values of X improve the explanation of Y, then X is a Granger
cause of Y [11].
We use Eviews as the research tool. We first apply the ADF test to the six keyword heat time
series to confirm that the series are stationary.
On that basis, the Granger causality test is applied between the Shanghai Composite Index
and the six keyword heat time series. The regression model used in testing is shown in
Eq. (3).
$$Y_t = \sum_{i=1}^{n} \alpha_i Y_{t-i} + \sum_{i=1}^{n} \beta_i X_{t-i} + \varepsilon_t \quad (3)$$
According to the test results shown in Table 2, for some keywords we can reject the null
hypothesis that keyword heat is not a Granger cause of the Shanghai Composite Index closing
price, that is, that keyword heat cannot predict the closing price. There is a significant
Granger causality relationship between the heat of "rise" or "drop" and the Shanghai
Composite Index closing price: at a lag of 2 days for “rise” (p-value < 0.05) and at lags of
1 to 9 days for “drop” (p-value < 0.05).
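The comparison behind the Granger test, regressing Y on its own n lags with and without n lags of X as in Eq. (3), can be sketched in plain Python. This is an illustrative sketch on synthetic data, not the Eviews procedure used in this paper: all function names are hypothetical, the regression has no constant term (an assumption matching Eq. (3) as written), and only the F statistic is computed, not its p-value.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_sum_of_squares(rows, targets):
    """Fit targets on rows by least squares and return the residual sum of squares."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f_statistic(y, x, n_lags):
    """F statistic for the null hypothesis that lagged x does not help explain y."""
    targets = y[n_lags:]
    restricted, unrestricted = [], []
    for t in range(n_lags, len(y)):
        own = [y[t - i] for i in range(1, n_lags + 1)]
        other = [x[t - i] for i in range(1, n_lags + 1)]
        restricted.append(own)
        unrestricted.append(own + other)
    rss_r = residual_sum_of_squares(restricted, targets)
    rss_u = residual_sum_of_squares(unrestricted, targets)
    dof = len(targets) - 2 * n_lags
    return ((rss_r - rss_u) / n_lags) / (rss_u / dof)


# Synthetic series (not the paper's data): y depends on yesterday's x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0.0, 0.5))

f_x_to_y = granger_f_statistic(y, x, n_lags=1)
f_y_to_x = granger_f_statistic(x, y, n_lags=1)
print(f_x_to_y, f_y_to_x)
```

A large F statistic in one direction and a small one in the other is the pattern that leads to rejecting the null hypothesis for that direction only; converting the statistic into a p-value requires the F distribution with (n, N - 2n) degrees of freedom, where N is the number of usable observations.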
The Granger causality test shows that the heat of "rise" and "drop" has moderate ability to
predict the Shanghai Composite Index in the short run. In addition, in the correlation
analysis between keyword heat and closing prices, the coefficient between "rise" and the
closing prices was 0.358 and the coefficient between "drop" and the closing prices was
-0.168. Combined with the semantics of "rise" and "drop", this raises a question: on the
micro blog platform, do investors' positive emotions or negative emotions have the stronger
predictive ability for the stock market?
3.4. SVM Training and Forecasting
The foregoing analysis is mainly based on linear regression, which to some extent has found
a connection between some keywords and the closing price of the Shanghai Composite Index
that can be used to predict fluctuations. This section uses machine learning to classify
future fluctuations from a nonlinear perspective. Some scholars have used support vector
machine (SVM) classification to forecast such fluctuations and achieved good results [12].
SVM is also used as a research tool in this paper.
SVM classification is a machine learning algorithm based on structural risk minimization
(SRM) [13], and it is a prediction tool with good generalization ability. The LIBSVM tool is
used to train the model and classify. LIBSVM is a simple, convenient software package for
SVM pattern recognition and regression developed by Prof. Chih-Jen Lin and Dr. Chih-Chung
Chang [14].
The input data of the training set and test set were the keyword heat sequences, and the
output data were the fluctuation labels (1 and -1) of the closing price of the Shanghai
Composite Index. The data from July 2nd, 2011 to December 30th, 2011 (124 days) were used as
the sample. Half of the sample was used as the training set and the other half as the test
set. The classification model was obtained by training the SVM on the training set, and the
model was then used to predict the labels of the test set.
Table 2. The test results of Granger causality.
Lag  Bull Market  Bear Market  Positive News  Negative News  Rise     Drop
1    0.2755       0.5458       0.7054         0.8275         0.9443   1.00E-05*
2    0.4232       0.6847       0.4388         0.8725         0.0352*  2.00E-05*
3    0.6053       0.8214       0.2597         0.8237         0.0919   4.00E-05*
4    0.7294       0.5454       0.2867         0.8391         0.237    0.0003*
5    0.8372       0.6807       0.4157         0.6282         0.3035   0.0007*
6    0.6917       0.6063       0.6166         0.7754         0.4315   0.0008*
7    0.7663       0.6707       0.4533         0.503          0.3643   0.0046*
8    0.8273       0.608        0.4707         0.4779         0.4033   0.0106*
9    0.8675       0.4965       0.3732         0.6155         0.3829   0.0488*
10   0.9238       0.6403       0.5199         0.7949         0.6522   0.0742
* p-value < 0.05
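The significant lags can be read off Table 2 mechanically. The short sketch below copies the "Rise" and "Drop" columns of Table 2 and applies the 0.05 threshold; the helper name `significant_lags` is hypothetical.

```python
# p-values for lags 1..10, copied from the "Rise" and "Drop" columns of Table 2.
RISE_P_VALUES = [0.9443, 0.0352, 0.0919, 0.237, 0.3035,
                 0.4315, 0.3643, 0.4033, 0.3829, 0.6522]
DROP_P_VALUES = [1.00e-05, 2.00e-05, 4.00e-05, 0.0003, 0.0007,
                 0.0008, 0.0046, 0.0106, 0.0488, 0.0742]


def significant_lags(p_values, alpha=0.05):
    """Return the lags (1-based) whose p-value is below the significance level."""
    return [lag for lag, p in enumerate(p_values, start=1) if p < alpha]


print(significant_lags(RISE_P_VALUES))  # [2]
print(significant_lags(DROP_P_VALUES))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```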
Figure 3. The change of the prediction ability over time.
To improve the accuracy of the SVM model in prediction, the data needed to be normalized,
and an appropriate kernel function and parameters had to be selected. Using the controlled
variable method, the best normalization range was found to be [0,1], and the neuron sigmoid
function was used as the kernel function, as shown in Eq. (4). Cross validation is used to
find appropriate values of the penalty parameter c and the kernel parameter g.
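The [0,1] normalization and the sigmoid function mentioned above can be sketched as follows. The function names and sample values are hypothetical. Fitting the scaling range on the training set and reusing it for the test set is an assumption, and how the sigmoid's argument is formed from the feature vectors inside the SVM kernel is not specified in this paper, so `sigmoid` is shown only as a scalar function.

```python
import math


def fit_min_max(train_values):
    """Return the (minimum, maximum) of the training data used for scaling."""
    low, high = min(train_values), max(train_values)
    if high == low:
        raise ValueError("a constant series cannot be scaled to [0, 1]")
    return low, high


def scale_to_unit_interval(values, low, high):
    """Map values linearly so that low -> 0 and high -> 1."""
    return [(value - low) / (high - low) for value in values]


def sigmoid(t):
    """Logistic sigmoid 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical keyword heat values for a training set and a test set.
train_heat = [10, 20, 30]
test_heat = [15, 35]
low, high = fit_min_max(train_heat)
print(scale_to_unit_interval(train_heat, low, high))
print(scale_to_unit_interval(test_heat, low, high))
print(sigmoid(0.0))
```

Reusing the training range keeps information from the test period out of the preprocessing step; a test value outside that range simply maps outside [0,1].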
$$f(x) = \frac{1}{1 + e^{-t}} \quad (4)$$
According to the Granger causality analysis, only "Stock index rise" and "Stock index drop"
are Granger causes of the Shanghai Composite Index closing price. The SVM modeling showed
that the prediction accuracy using “Stock index rise” is 59.46% and using “Stock index drop”
is 78.38%. So the keyword “Stock index drop” is better than “Stock index rise” for
prediction.
As a Granger cause, “Stock index drop” has a certain lagging effect (lag>1). So in the
following experiment, the data from the next day to the sixth day is used to predict the
fluctuations in order to measure the prediction horizon. The SVM model with the same
settings was used, and the result is shown in Figure 3. The overall accuracy declines over
time, which means that the predictive ability of keyword heat is a short-term effect.
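One way to set up such a lag experiment is sketched below. It assumes that "lag k" means pairing the keyword heat of day t with the fluctuation label of day t + k, which is an interpretation rather than something stated explicitly above; the function names and sample data are hypothetical, and the classifier itself is omitted.

```python
def lagged_pairs(heat, labels, lag):
    """Pair the heat of day t with the up/down label of day t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return list(zip(heat[:len(heat) - lag], labels[lag:]))


def accuracy(predicted, actual):
    """Fraction of predicted labels that match the actual labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Hypothetical aligned series: one heat value and one label per trading day.
heat = [1.2, 0.4, 2.1, 0.9]
labels = [1, -1, 1, -1]
print(lagged_pairs(heat, labels, lag=1))
print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))
```

Repeating the training and testing for each lag and recording the accuracy gives a curve of predictive ability against lag, which is the kind of curve Figure 3 reports.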
4. Discussion
The investor emotion data obtained from micro blogs has a certain connection with the stock
market and can, to a degree, predict it. According to the results, investors' emotions can
be detected from the heat of some keywords on micro blogs. In addition, variations in the
heat reflect the attention paid to certain issues affecting the stock market. Among the six
keywords, "Stock index rise" and "Stock index drop" are Granger causes of the Shanghai
Composite Index closing price. Changes in the heat of these two keywords reveal changes in
the Shanghai Composite Index that lag by up to about a week. A support vector machine can be
used to predict the fluctuations. The heat of "Stock index drop" was shown to be more
accurate in prediction. Based on the results on the lag effect, the prediction horizon is
short, which means the most useful data is that from 16:00 on the previous day to 10:00 on
the current day.
The research has several limitations that need to be addressed. To begin with, both positive
and negative emotions can be found in some micro blogs, for example a micro blog containing
both "Bull market" and "Bear market". A more advanced search pattern should be used to
classify the emotions. Next, during the selected time period, the overall trend of the stock
market was falling steadily, which may be why the negative keyword "Stock index drop"
predicted the stock market more accurately. More historical data should be introduced to
verify the accuracy. Besides, future trends should also be predicted. Finally, whether the
investors on micro blogs can represent all investors and whether the micro blogs are
trustworthy could be examined in future research.
REFERENCES
[1] Zhu Yu. Research on Stock Market Prediction. Northwestern Polytechnical University, 2006.
[2] Peters, E. Chaos and Order in the Capital Markets: A New View of Cycles, Prices and
Market Volatility. New York: John Wiley and Sons, 1991: 5-6.
[3] Lin Xiaoming, Wang Meijin. China’s Stock Market Chaos Phenomenon and Market Efficiency.
The Journal of Quantitative & Technical Economics, 1997, 4: 51-53.
[4] Hu Binbin, Dang Jiarui, Lan Baixiong. Research on China’s Stock Market Predictability.
Finance & Economics, 2001 (3).
[5] Fisher K L, Statman M. Investor Sentiment and Stock Returns. Financial Analysts Journal,
2000, 3/4(2): 13-23.
[6] Chan S Y, Fong W M. Individual Investors’ Sentiment and Temporary Stock Price Pressure.
Journal of Business Finance and Accounting, 2004, 31(5-6): 823-836.
[7] Verma R, Verma P. Noise Trading and Stock Market Volatility. Journal of Multinational
Financial Management, 2007, 17: 231-243.
[8] Yu Peikun, Zhong Ruijun. Can Individual Investor Sentiment Predict Market Rate of Return.
Nankai Business Review, 2009, 12(1): 96-101.
[9] Asur, S. Predicting the Future with Social Media. In Proceedings of 2010 IEEE/WIC/ACM
International Conferences on Web Intelligence and Intelligent Agent Technology, pages
492-499, Toronto, Canada, August 2010. IEEE.
[10] Bollen J, Mao H, Zeng X. Twitter Mood Predicts the Stock Market. Journal of
Computational Science, 2011, 2: 1-8.
[11] Yang Miao. The Multivariate Generalization of the Granger Causality and the Applied
Research. Southwestern University of Finance and Economics, 2002.
[12] Zhang Zuoquan. Application of SVM in Security Investment Analysis. Beijing Jiaotong
University, 2007.
[13] Vapnik V N. The Nature of Statistical Learning Theory. New York: Springer, 1995.
[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2011, 2: 27:1-27:27. Software available
at http://www.csie.ntu.edu.tw/~cjlin/libsvm