J. Biomedical Science and Engineering, 2010, 3, 633-637
doi:10.4236/jbise.2010.36086 Published Online June 2010 (http://www.SciRP.org/journal/jbise/
Published Online June 2010 in SciRes. http://www.scirp.org/journal/jbise
The application of hidden markov model in building genetic
Rui-Rui Ji, Ding Liu, Wen Zhang
Department of Automation and Information Engineering, Xi’an University of Technology, Xi’an, China.
Email: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org
Received 6 March 2010; revised 15 April 2009; accepted 25 April 2009.
The research hotspot in post-genomic era is from se-
quence to function. Building genetic regulatory net-
work (GRN) can help to understand the regulatory
mechanism between genes and the function of organ-
isms. Probabilistic GRN has been paid more attention
recently. This paper discusses the Hidden Markov
Model (HMM) approach served as a tool to build
GRN. Different genes with similar expression levels
are considered as different states during training
HMM. The probable regulatory genes of target genes
can be found out through the resulting states transi-
tion matrix and the determinate regulatory functions
can be predicted using nonlinear regression algo-
rithm. The experiments on artificial and real-life
datasets show the effectiveness of HMM in building
Keywords: Genetic Regulatory Network; Hidden Markov
Model; States Transition; Gene Expression Data
In order to understand the functioning in living organ-
isms, it should be known which genes are expressed,
when and where, and to which extent. The regulation of
gene expression is achieved through the interactions
between DNA, RNA, proteins, and small molecules.
This regulatory system can be described by the structure
of network called genetic regulatory network (GRN).
Building GRN is a reverse-engineering of real gene ex-
pression data to help studying the relationship between
genes systematically, understanding the essential rule of
biological phenomena and provide valuable idea to treat
some complex diseases [1,2]. Many mathematics models
have been proposed during GRN research, such as Boo-
lean model, Linear combination model, Weighted matrix
model, Differential equations model, and so on [3,4].
Resent years, probabilistic GRN model has been paid
more attention for the real biological system is stochastic
and the determinate model cannot infer the complex
process. Bayesian network and Markov Chain have been
studied [5-7]. The final result of probabilistic GRN is
represented using a graph consisting vertices (genes) and
edges (relationships). The relationships between genes are
described through probability. Considering the current
probabilistic GRN is simple and cannot give the dynam-
ics behavior. This paper studies the application of Hid-
den Markov Model (HMM) and nonlinear regression
algorithm to GRN building.
This paper is organized as follows: Section 2 gives a
theoretical representation of HMM. Section 3 focuses on
the application of HMM in building GRN. Section 4
shows the experimental result and discussion. Section 5
is the conclusion.
2. HIDDEN MARKOV MODEL
2.1. Basic Theory
Classical Hidden Markov Model is a kind of stochastic
state machine based on statistical signal, which is a dou-
ble stochastic process where the sequences of states can
not to be observed directly. The fundamental stochastic
process in this model is a Markov chain which describes
the state transition; another stochastic process describes
the statistical corresponding relationship between states
and the observed value. Different states generate par-
ticular sequences of observation according to different
probability. The observer can know the feature of the
state only through the stochastic process. So this model
is called Hidden Markov Model.
A parameter set(, , )
is used to describe a
means the initial state probability, is
the states transition matrix and B is the observed probabil-
ity. Suppose the number of state is N, 12
is the state set. means the state of the model at time t,
, 12 T
(, ,...,)O OOO
is the observed value se-
quence and k is the observed value, the three parame-
ters in a HMM model are defined as follows,
634 R. R. Ji et al. / J. Biomedical Science and Engineering 3 (2010) 633-637
Copyright © 2010 SciRes. JBiSE
is the probability of the initial state equals
a,, (2) (/)
where means the transition probability from state
is a matrix. NN
()( / )
means the observed probability that generat-
ing under state
. B is a M × N matrix.
2.2. Fundamental Problems in HMM
There are three problems needed to be solved when
building a HMM.
For a given HMM (, , )
and the observed se-
quence 12, evaluation means calculating
corresponding the observed
sequence generated by the model, which can evaluate
the similarity of the observed sequence with the given
Forward-Backward Algorithm proposed by Baum is
used to solve the evaluation problem. In practical appli-
cation, the results are very small and usually normalized
or carried on logarithmic operation in calculation process.
Ascertaining an optimal state transition sequence
on the given HMM
Qqqq（）(, , )
and observed sequence is called
Viterbi Algorithm can solve the decoding problem.
Here, the optimal state transition sequence
） is the Q which can make the value
Learning can obtain the optimal parameters
HMM through training algorithm, where
value of (|)PO
maximum, for the given HMM
(, , )
and observed sequence .
There are two solutions to solve the learning problem.
One uses gradient technique, another is based on itera-
tion or recursion like Baum-Welch algorithm which is
often used to train parameters of HMM.
( ,OO ,. )O..,T
3. MODELING GRN BASED ON HMM
3.1. Constructing HMM
A fundamental assumption is that genes sharing similar
expression levels are commonly regulated, and the genes
are involved in related biological functions. Most of
GRNs are built on the basis of clustering. The process of
constructing HMM in this paper is carried on the genes
clustered into the same class.
Considering the genes clustered into same class, dif-
ferent genes are considered as different states. So the
size of state transition matrix is just the same as the num-
ber of genes. State transition probability corresponds with
the regulatory probability between genes. One gene may
be regulated by any other genes, even itself, so the
wholly connected connection structure is used.
2) Observed sequence:
The expression profiles of genes are considered as the
observed sequence. Since these data is easily affected by
noise, smoothing is used firstly to reduce the influence
3) Training steps:
Step 1: initializing parameters 0
of HMM. The num-
ber of states is equal to the number of genes and each
value of state transition matrix is initialized as average
value1/, N is the number of states, and N0
can be computed;
Step 2: revaluating HMM’s parameter0
algorithm is used to train HMM model to acquire λ;
Step 3: computing (|)PO
under the obtained mo-
del λ using Forward-Backward algorithm;
Step 4: judging the convergence criterion. If
(|)(| )PO PO
is not satisfied, then 0
and return to step 2. Else, training process is finished and
final HMM model close to the observed sequence can be
3.2. Building Probabilistic GRN
Regulatory genes for each target gene can be found out
based on the state transition matrix after training HMM.
Then the structure of regulatory network can be built
according to the following steps:
Step 1: For one target gene(1, 2,...,)
, the genes
whose transition probability in the trained state transition
matrix A is bigger than the initial average probability are
found out and these genes are regarded as the parental
regulatory genes of each target gene;
Step 2: repeating step 1 until finding out the global
information for each target gene;
Step 3: predicting the determinate regulatory function
between target gene and its parental regulatory genes
using multiple nonlinear regression and least squares
4. EXPERIMENTS AND DISCUSSIONS
4.1. Experiment with Artificial Data
In order to evaluate the efficiency of our algorithms, a
group of networks are required whose structure had been
known. However, the real structures of GRN are un-
known completely because the research about GRN is
R. R. Ji et al. / J. Biomedical Science and Engineering 3 (2010) 633-637
Copyright © 2010 SciRes.
still in an early stage. So artificial data reported in paper
 are used in this study. Here, the adopted ALARM
network contains 37 discrete variables, 46 edges and the
value of every variable ranges from 2 to 4.
Table 1. The comparison of BANJO and HMM.
algorithm BANJO HMM
right side number 22 31
reverse side number2 5
sensitivity 0.48 0.67
specificity 0.29 0.76
F-factor 0.36 0.71
The network with known structure is called target
network tand the result of our algorithm is called
deduced networkd. Three index sensitivity, specificity
and F-factor are used to evaluate our algorithm. Sensi-
tivity is used to test the inference ability, specificity re-
flects the degree of accuracy and F-factor is the balance
of above two indicators. The bigger F-factor means the
1) Experimental results
Initial elements of state transition matrix are assigned as
1/15, Table 2 is the final state transition matrix trained,
and each row shows the transition probability corre-
sponding with target gene, through which the probable
regulatory genes of target genes can be found out.
means the number of the same edges in both
and d andt; t
means the total number of the edges
Considering the transition probability of MBP1, SWI6,
MCM1 and CLN3 to gene SWI4 are bigger than initial
probability, so these genes can be regarded as the regu-
latory genes of target gene SWI4. Table 3 lists the regu-
latory genes of target genes SWI4, SWI5, CLN2, and
means the total number of the edges in .
Table 1 shows the comparison of standard Simulated
Annealing algorithm, called BANJO developed by Har-
temink and our algorithm. It can be seen that our result is
better than general Simulated Annealing algorithm in all
2) Determinate regulatory relationships
Predicting the determinate regulatory relationships can
offset the drawbacks of the probabilistic GRN, which
can not describe the specific dynamics behavior. The
obtained regulatory relationships between target gene
) and its regulatory genes MCM1 (5
4.2. Experiment with Real-Life Data
The real-life experimental data in this paper comes from
yeast cell cycle expression datasets created by Spellman
, which imply the regulatory information about ge-
netic property in yeast cell cycle. However, the above
mentioned index cannot be used to evaluate the result
because the real biological regulatory network is un-
known completely. So the existing regulatory relation-
ships which had been already proved are used to evalu-
ate our results.
0.0039 0.0026 0.17
Figure 1 compares the real expression profile of gene
CLB2 marked in blue and its predicted value marked in
As listed in Futcher’s paper , 15 main transcrip-
tional factors: MBP1, SWI4, SWI6, FKS1, MCM1, FKH1,
NDD1, SWI5, ACE2, CDC28, CLN3, CLB2, SIC1,
CLN2 and HHT1 are discussed. It had been verified that
there exist interactions among these genes. For conven-
ience, these genes are marked No. 1, 2,…15.
Figure 2 gives a local structure of our resulting GRN.
Where, black connecting lines represent the verified ex-
isting edges . Blue connecting lines represent the
reversed direction with known relationships. Red con-
-necting lines represent the regulatory relationships pre-
Table 2. The states transition matrix.
Gene MBP1 SWI4 SWI6 MCM1SWI5 ACE2CLN3 CLB2SIC1 CLN2
MBP1 0.0542 0.0832 0.0623 0.05830.10140.05010.1357 0.06020.0532 0.0536
SWI4 0.0348 0.0523 0.0401 0.03470.04340.04820.1523 0.05310.0379 0.1768
SWI6 0.0629 0.0961 0.0658 0.06590.06530.06620.0631 0.06610.0598 0.0652
MCM1 0.0435 0.0759 0.0403 0.04310.1108 0.04630.103 0.12040.0446 0.1462
SWI5 0.0622 0.0635 0.0608 0.05920.06420.06310.0629 0.06330.1261 0.0642
ACE2 0.0582 0.0596 0.0574 0.11370.13020.06020.0588 0.058 0.0569 0.0611
CLN3 0.0578 0.1138 0.0584 0.06030.05360.05480.0637 0.05810.0592 0.0514
CLB2 0.0571 0.0562 0.0583 0.06150.05920.15830.0583 0.06330.0544 0.1019
SIC1 0.0585 0.0599 0.0592 0.06230.05820.05920.0552 0.05380.0653 0.0916
CLN2 0.0598 0.0635 0.066 0.05760.06580.06490.0538 0.04980.0531 0.0653
636 R. R. Ji et al. / J. Biomedical Science and Engineering 3 (2010) 633-637
Copyright © 2010 SciRes. JBiSE
Table 3. The regulatory genes of several target genes.
target gene regulatory genes
dicted by our algorithm which is remained to be verified
It had been verified in Reference  that gene CLN3
activates gene SWI4, gene SWI4 regulates gene CLN2
meanwhile, gene CLN1 and CLN2 are both influenced
by gene CLB2, and CLB2 regulates CLN2. The expres-
sion pattern of gene SWI5 is similar with SIC1 and it has
been proved that SWI5 regulates SIC1. These conclu-
sions confirm that our algorithm is effective.
Moreover, it had been verified that CLN3 is regulated
by gene SWI4 and CLB2 is regulated by gene SIC1,
which is identical with the predicted results by our algo-
This paper discusses the application of HMM in building
GRN. The regulatory genes for each target gene can be
found out through the state transition matrix and then the
global structure of GRN can be determined. Simulative
experiment proves that this algorithm is more effective.
The results in real-life data also show its rationality.
Compared with the determinate model, HMM is more
scientific because it describes the transcriptional regula-
tory degree between genes through probability. Espe-
cially, the present algorithms can find out self-regulatory
relationships of genes.
There are still many problems should also be consid-
ered during the research of GRN using HMM, for exam-
Figure 1. The real expression profile and regression result of
Figure 2. The local structure of GRN.
ple, how to choose the initial model. Since the biological
GRN is a time-continuous and complicated dynamic sys-
tem and haven’t been completely known, as a result, how
to evaluate the GRN integrated with biological meanings
effectively is the next research.
 Zhao, G.P. (2002) Bioinformatics. Science Press, Beijing.
 Hidde, D.J. (2002) Modeling and simulation of genetic
regulatory systems: A literature review. Journal of Com-
putational Biology, 9(9), 67-103.
 Akustu, T., Miyano, S. and Kuhara, S. (2000) Inferring
Qualitative relations in genetic networks and metabolic
arrays. Bioinformatics, 16(8), 727-734.
 Bower, J. (2001) Computational modeling of genetic and
biochemical networks. MIT Press, Cambridge.
 Hartemink, A., Gifford, D., Jaakkola, T., et al. (2002)
Bayesian methods for elucidating genetic regulatory
networks. IEEE Intelligent Systems, 17(2), 37-43.
 Ching, W., Fung, E., Ng, M. and Akustu, T. (2005) On
construction of stochastic genetic networks based on
gene expression sequences. International Journal of
Neural Systems, 15(4), 297-310.
 Zhang, S.-Q., Ching, W.-K. and Yue, J. (2008) Construc-
tion and control og genetic regulatory networks: A mul-
tivariate Markov chain approach. Journal of Biomedical
Science and Engineering, 1, 15-21.
 Tsamardinos, I., Brown, L.E. and Aliferis, C.F. (2006)
The max-min hill-climbing bayesian network structure
learning algorithm. Machine Learning, 65(1), 31-78.
 Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R.,
Amders, K., Eisen, M.B., Brown, P.O., Botstein, D. and
Futcher, B. (1997) Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cere-
visiae by microarray hybridization. Molecular Biology of
the Cell, 9(12), 3273-3297.
 Futcher B. (2002) Transcriptional regulatory networks
and yeast cell cycle. Current Opinion in Cell Biology,
R. R. Ji et al. / J. Biomedical Science and Engineering 3 (2010) 633-637
Copyright © 2010 SciRes.
 Paul A.F. M.D. (2004) IncyteDB/OL. http://www.i-ncyte.
 Zhang, Z.-F. (2004) Constructing and predicting gene
regulatory network using micro-array data. National
Central University, Taiwan.