Construction and control of genetic regulatory
networks:a multivariate Markov chain
approach
Construction and control of genetic regulatory
networks:a multivariate Markov chain
approach
1223 4
Shu-Qin Zhang, Wai-Ki Ching, Yue Jiao, Ling-Yun Wu &Raymond H. Chan
1 2
School of Mathematical Sciences, Fudan University, Shanghai, 200433, China. Advanced Modeling and Applied Computing Laboratory,
3
Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong.Institute of Applied Mathematics, Academy of
4
Mathematics and System Sciences, Chinese Academy of Sciences.Department of Mathematics, the Chinese University of Hong Kong, Shatin, N.T.,
Hong Kong. * Correspondence should be addressed to Shu-Qin Zhang (zhangs@fudan.edu.cn).
[1,2,4,11,15,16,17,21] have been proposed. Among
ABSTRACT all the models, BNs and PBNs have received much
attention. The approach is to model the genetic regu-
In the post-genomic era, the construction and latory system by a Boolean network and infer the net-
control of genetic regulatory networks using work structure from real gene expression data. Then
gene expression data is a hot research topic.by using the inferred network model, the underlying
Boolean networks (BNs) and its extensiongene regulatory mechanisms can be uncovered. This
Probabilistic Boolean Networks (PBNs) have is particularly useful as it helps to make useful pre-
been served as an effective tool for this pur-dictions by computer simulations. We refer readers to
pose. However, PBNs are difficult to be used the survey paper by Shmulevich et al. [18, 19] and the
in practice when the number of genes is large book by Shmulevich and Dougherty [20].
because of the huge computational cost.In thisThe BN model was first introduced by Kauffman
paper, we propose a simplified multivariate Markov [12, 13, 14]. The advantages of this model can be
model for approximating a PBN. The new modelfound in Akutsuet al. [1], Kauffman [14] and
can preserve the strength of PBNs, the abilitytoShmulevich et al. [17]. Since genes exhibit switching
capture the inter-dependence of the genes in thebehavior [10], BN models have received much atten-
network, and at the same time reduce the com-tion. In a BN, each gene is regarded as a vertex of the
plexity of the network and therefore the compu-network and is quantized into two levels only (ex-
tational cost. We then present an optimal con-pressed (1) or unexpressed (0)). We remark that the
trol model with hard constraints for the purposeidea and the model can be extended easily to the case
of control/intervention of a genetic regulatoryof more than two states. The target gene is predicted
network. Numerical experimental examples based by several genes called its input genes through a
on the yeast data are given to demonstrate the Boolean function. If the input genes and the Boolean
effectiveness of our proposed model and control functions are given, a BN is defined. The only ran-
policy. domness involved here is the initial system state.
However, the biological system has its stochastic
nature and the microarray data sets used to infer the
network structure are usually not accurate because of
the experimental noise in the complex measurement
process. Thus stochastic models are more reasonable
1. INTRODUCTIONchoices. To overcome the deterministic nature of a
An important issue in systems biology is to under-BN, Akutsuet al. [1] proposed the noisy Boolean net-
stand the mechanism in which cells execute and con-works together with an identification algorithm. In
trol a huge number of operations for normal functions, their model, they relax the requirement of consis-
and also the way in which the cellular systems fail in tency imposed by the Boolean functions. Regarding
disease, eventually to design some control strategy to the effectiveness of a Boolean formalism, Shmulevich
avoid the undesirable state/situation. Many mathe-et al. [17] proposed a PBN that can share the appeal-
matical models such as neural networks, linear model, ing rule-based properties of Boolean networks and it
Bayesian networks, non-linear ordinary differentialis robust in the presence of uncertainty. The model
equations, Petri nets, Boolean Networks (BNs) andparameters can be estimated by using Coefficient of
its generalization Probabilistic Boolean Networks Determination (COD) [8].
(PBNs), multivariate Markov chain model etc.
Keywords: Gene expression sequences; Multivariate
Markov chain; Optimal control policy; Probabilistic
Boolean networks
J. Biomedical Science and Engineering, 2008, 1, 15-21Scientific
Research
Publishing
JBiSE
Published Online May 2008 in SciRes.http://www.srpublishing.org/journal/jbise
SciRes Copyright ©2008
(ij)
The dynamics of the PBN can be studied in the con-Gene i. The matrixP is a transition probability
text of standard Markov chain [17, 18, 19]. Thismatrix for the transitions of states in Sequence j to
makes the analysis of the network easy. However, thestates in Sequencei in one step, see for instance [3].
number of parameters (state of the system) grows In matrix form we have
exponentially with respect to the number of genes n.
Therefore it is natural to develop heuristic methods
for model training or to consider other approximate
model. Here we propose a simplified multivariate
Markov model, which can capture both the intra- and
inter-associations (transition probabilities) among
the gene expression sequences. The number of
2
parameters in the model is onlyO(n) wheren is thewhere
number of genes in a captured network. We remark
that this order is already minimal. We then develop
efficient model parameters estimation methods based
on linear programming. We further propose an opti-
mal control formulation for regulating the network so
as to avoid some undesirable states which may corre-
spond to some disease like cancer.
The rest of the paper is structured as follows. In
section 2, we present the simplified multivariate We note that the column sum ofQ is not equal to one
Markov model. In section 3, the estimation method(ij)
(the column sum of each P is equal to one). The fol-
for model parameters is given. In section 4, an opti-
lowings are two propositions [3] related to some prop-
mal control formulation is proposed. In section 5, we
erties of the model.
apply the proposed model and method to some syn-
Proposition 2.1 If 0 for 1i,jn , then the matrix
thetic examples and also the gene expression dataset ij
of yeast. Concluding remarks are then given to Q has an eigenvalue equal to 1 and the eigenvalues ofQ
address further research issues in section 6. have modulus less than or equal to 1.
(ij)
Proposition 2.2 Suppose that P (1i,jn ) are
2. THE MULTIVARIATE MARKOV irreducible and0 for 1i,jn . Then there is
ij
CHAIN MODELa vector
In this section, we first review a multivariate Markov
chain model proposed in Ching, et al. [3] for model-
ing categorical time series data. We remark that the
model has been first applied to predicting demand of
inventory of correlated products. Later the model such that
was applied to the building of genetic regulatory net-
works [4] from gene expression data. However, the
number of parameters is still large and further reduc-
tion of the model parameters is necessary and a sim-and
plified model was proposed in [5]. In the remainder
of this section, we present the simplified multivariate
Markov chain model.
Givenn categorical time sequences, we assume
they share the same state spaceM. We denote the where m is the number of states.
(ij)
state probability distribution of Sequence j at timetIn Proposition 2.2, we require allP are irreduc-
(j)
byV,j=1,2, ,n . In Ching, et al. [3], the following ible. But actually, if Q is irreducible, we can get the
t
same conclusion. If the model is applied to gene
first-order model was proposed to model the relation-
expression data sequences, one may takeM={0,1}
ships among the sequences:
(i)
and V to be the expression level of the i-th gene at
t
the time t. From (1), the expression probability distri-
bution of the i-th gene at time (t+1) depends on the
(ij)(j)
weighted average of PV. We remark that this is a
Where t
first-order model and actually give the weighting
ij
of how much Geneidepends on Gene j. In Ching, et
al. [4], this model has been used to find cell cycles.
The most proper parent genes for the i-th gene
Here is the non-negative weighting of Genej to (i)
ij (i.e.,V) can be retrieved from the corresponding
t+1
(1)
(2)
SciRes JBiSECopyright © 2008
16S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21
. The higher the value of , the stronger the par-
ijij rence of each gene and we denote it by
ent and child relationship between i-th andj-th gene
will be. When this process is repeated for each j, the
whole genetic network can be constructed. Given a
We therefore expect that
set of genes
If for any gene in this set, the rest genes are the only
candidates being a corresponding parent gene, then
this set of genes forms a cycle.
A simplified model was proposed in Chinget al. [5]
by assuming
From the above equation, it suggests one possible
V
way to estimate the parameters ={ } as follows:
ij
The simplified model has smaller number of
parameters and it has been shown to be statistically
better in terms of BIC, see for instance [5]. Moreover,
Propositions 1 and 2 still hold for the simplified
model. subject to
3. ESTIMATION OF MODEL PARAME-
TERS
(ij)
In this section, we present methods to estimate P
and . We estimate the transition probability matrix
ij We note that the following formulation ofn linear
(ij)programming problems can give the necessary solu-
P by the following method. First we count the tran-tions of Problem (4). For eachi:
sition frequency of the states in the i-th sequence.
After making a normalization, we obtain an estimate
of the transition probability matrix. We have to esti-
mate n such m-by-mtransition probability matrices to Subject to
(ij)
get the estimate forP as follows:
Where
(ij)(ij)
From F, one can obtain the estimate for P as
follows: and
V
Hereis thei-th row of .
ij
We remark that the estimation method can be
applied to the simplified model (3). We remark that
..
Where other vector norms such as and can also be
1
2
used but they have different characteristics. The for-
mer will result in a quadratic programming problem
while will still result in a linear programming
problem. The main computation cost comes from solv-
ing the linear programming problem. In the estima-
tion of , it involves only counting frequencies of
transitions and therefore the cost is minimal. Once
Besides , we need to estimate the parameters .
ij the model parameters are available, one can then con-
It can be shown that the multivariate Markov model struct the underlying genetic network easily. We will
has a “stationary vectorV in Proposition 2. The vec-demonstrate this in the section of numerical exam-
torV can be estimated from the gene expression ples. The model can also be further modified to
sequences by computing the proportion of the occur-include extra conditions such as someare known
ij
SciRes JBiSECopyright © 2008
17
S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21
(3)
(5)
(4)
where v(iiirepresents all the possible net-
to be zero. Such information can be included by add-tt-11
ing the constraints=0 . Furthermore, for large net-work state probability distribution vectors up to time
ij t. We define
work, it is known that the in-degree follows the Pois-
son distribution while the out-degree follows the
power-law, i.e., the number of out-degree to some
negative power. These important properties can also
be easily included in our proposed model [24].
4. THE OPTIMAL CONTROL FORMU-to be the set which contains all the possible state prob-
LATION ability vectors up to time t. We note that one can con-
In this section, we present the optimal control prob-duct a forward calculation to compute allthe possible
lem based on the simplified multivariate Markov statevectorsinthe setsU(1),U(2), U(T) recursively.
model (3) and formulate it based on the principle ofHere the main computational cost is the matrix-
2
dynamic programming. In the simplified model (3) vector multiplication and the cost isO((2n) ) wheren
we proposed above, the matrix Q can be regarded as a is the number of genes in the network. We note that
“transition probability matrix for the multivariatesome state probability distribution actually does not
Markov chain in certain sense, andV can be regarded
texist because the maximum number of controls is K,
as a joint state distribution vector. We then present athe total number of vectors involved is only
control model based on the paper by Ching,et al.[6].
Beginning with an initial joint probability distribu-
tion V the gene regulatory network (or the multivariate
0
Markov chain) evolves according to two possible tran-For example ifK=1, the complexity of the above algo-
sition probability matricesQ and Q. Without any 2
01 rithm is O(T(2n)).
external control, we assume that the multivariate Returning to our original problem, our purpose is
Markov chain evolves according to a fixed transitionto make the system go to the desirable states. The
probability matrix Q (Q). When a control is
0objective here is to minimize the overall average of
applied to the network at one time step, the Markovthe distances of the state vectorsv(ii) (t=1,2, ,T)
t1
chain will evolve according to another transition to the target vectorz, i.e.,
probability Q (with more favorable steady states or a
1
more favorable state distribution). It will then return
back to Q again if there is no control. We note that
0
one can have more than one type of controls, i.e.,
To solve (6), we have to define the following cost
more than one transition probability matrixQ to
1function
choose in each time step. For instance, in order to sup-
press the expression of a particular gene, one can
directly toggle off this gene. One may achieve the
goal indirectly by means of controlling its parent
as the minimum total distance to the terminal state
genes which have a primary impact on its expression
at timeT when beginning with state distribution vec-
too. But for the simplicity of discussion, we assume
torv(w) at timet and that the number of controls
that there is only one direct possible control here. Wet
then suppose that the maximum number of controls used is k. HereW is a Boolean string of lengtht.
t
that can be applied to the network during a finite Given the initial state of the system, the optimization
investigation periodT(finite-horizon) is K where problem can be formulated as:
KT. The objective here is to find an optimal control
policy such that the state of the network is close to a
target state vector Z. Without loss of generality, here
we focus on the first gene among all the genes.
(1) subject to:
Accordingly, we remark that the sub-vector Z
denotes the vector containing the first two entries in
Z. It can be a unit vector (a desirable state) or a prob-
ability distribution (a weighted average of desirable
states). The control system is modeled as:
To solve the optimization problem, one may con-
sider the following dynamic programming formula-
tion:
SciRes JBiSE Copyright © 2008
18 S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21
(7)
(6)
Here 0w and 1w are Boolean strings of size t.
t-1 t-1
The first term in the right-hand-side of (8) is the cost
(distance) when no control is applied at timet while
the second term is the cost when a control is applied.
The optimal control policy can be obtained during the
process of solving (8). We remark that instead of con-
sidering the objective (6), one can consider
and
With{}a new weighting and a different vector
i
.
norm .Furthermore, it is interesting to study the
l
case of infinite horizon. In this caseis chosen to be
t
t-1
(1-) for some discount factor(0,1).
5. NUMERICAL EXPERIMENTS
5.1.A Simple Example
In this subsection, we consider a small five-gene net-The target here is to suppress the first gene but no
work whose gene expression series can be found inpreference on other genes. The control we used is to
the Appendix.showsthefive-gene network.suppress the first gene directly. Thus the control
We note that Gene 1 and Gene 4 depends on all the matrix is as follows:
other genes, Gene 2 depends on Gene 1 and Gene 3
only, Gene 3 depends on Gene 1 and Gene 2 only,
while Gene 5 depends on itself only.
To solve the linear programming problem in equa-
tion (5), infinity norm is chosen for all numericalWithout loss of generality, we assume that the ini-
V
experiments. The matrices , P, and Q (without con-tial state vector is the uniform distribution vector (for
0
trol) are obtained from the proposed model as follow:each gene), that is
Moreover, we assume that the total time T is 12 and
we try several different numbers of controls
K=1,2,3,4,5.shows the numerical results. All
the computations were done in a PC with Pentium D
and Memory 1GB with MATLAB 7.0. In ,
"Policy" represents the optimal time step at the end
Where
of which a control should be applied. For instance,
means that the optimal control policy is to apply the
control at the end of thet=1,2,3-th time step. From
, observable improvements of the optimal
value is obtained when K increases from 1 to 5.
5.2. The Yeast Example
Figure 1
Table 1
Table 1
Table 1
Figure 1. The Five-gene Network.
Table 1.Numerical results for the 5-gene network.
K
Control
Policy
Objective
Value
Time in
Seconds
1
[1]
0.5628
0.02
2
[2]
0.4277
0.02
3
[1,2,3]
0.3379
0.06
4
[1,2,3,7]
0.2717
0.15
5
[1,2,3,7,8]
0.2090
0.23
SciRes JBiSE Copyright © 2008
19
S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21
(8)
Q=
0
In this subsection, we apply our proposed simplified first gene in the first 4 steps, and will not control it in
multivariate Markov models to the yeast data other steps.These experiments show that even the
sequences [23]. Genome transcriptional analysis is number of genes (384 genes in this data set) is com-
animportant analysis in medicine, etiology andparatively large, the method still can find the control
bioinformatics. One of the applications of genome policies fast.
transcriptional analysis is used for eukaryotic cell
cycle in yeast. The fundamental periodicity in 6. CONCLUDING REMARKS
eukaryotic cell cycle includes the events of DNA rep-In this paper, we proposed a simplified multivariate
lication, chromosome segregation and mitosis. It is Markov model for approximating PBNs. Efficient
suggested that improper cell cycle regulation leads to estimation methods based on linear programming
genomic instability, especially in the etiology of both method are presented to obtain the model parameters.
hereditary and spontaneous cancers [9, 22]. Eventu-Methods for recovering the structure and rules of a
ally, it is believed to play one of the important roles PBN are also illustrated in details. We then give an
in the etiology of both hereditary and spontaneousoptimal control formulation for control the network.
cancers. The dataset used in our study is the selected Numerical experiments on synthetic data and gene
set from Yeung and Ruzzo (2001) [23]. In the expression data of yeast are given to demonstrate the
discretization, if an expression level is above (below)effectiveness of our proposed model and formulation.
a certain standard deviation from the average expres-For future research, we will extend the control
sion of the gene, it is over-expressed (under-problem to the case of having multiple control policy.
expressed) and the corresponding state is 1 (0) [4]. We will develop efficient heuristic methods for solv-
To solve the linear programming problem in (5), ing the control problem and genetic algorithm is a
infinity norm is chosen for all numerical experiments.
V
possible approach [7]. Extension of the study to the
The matrices ,P ,andQ (without control) are
0case of infinite horizon is also interesting. Finally,
obtained from the proposed model. The initial statewe will also apply our model to more real world
vector is assumed to be the uniform distribution (for datasets.
each gene) vector
APPENDIX
The five gene expression sequences.
In addition, we assume that the total time T is 12
and several different maximum numbers of controls
K=1,2,3,4,5 are tried in our numerical experiments.
The target is to suppress the first gene but no prefer-
(1)
ence on other genes. That is the target state vectorZ
T
is (1,0) . The control we used is to suppress the first
gene directly. Thus the control matrixQ takes the
1
same form as the following:
ACKNOWLEDGEMENTS
Research supported in part by RGC Grants 7017/07P and HKU
CRCG Grants. The preliminary version of the paper has been pub-
lished in the proceedings of the 2nd International Conference on
Bioinformatics and Biomedical Engineering, Shanghai, China,
It means that we want to control the first gene such 2008.
that it will be unexpressed with more probabilities.
The transitions of all the other genes will not beREFERENENCE
changed.reports the numerical results and [1]T.Akutsu, S. Miyano & S. Kuhara. Inferring Qualitative Rela-
the computational time for different numbers of con-tions in Genetic Networks and Metabolic Arrays.
Bioinformatics 2000, 16: 727-734.
trolsK. From, observable improvements of[2]J. Bower.Computational Modeling of Genetic and Biochemical
the optimal value is obtained whenK increases fromNetworks.MIT Press, Cambridge 2001, M.A.
1 to 5. For example, if we will conduct 4 controls [3]Ching, W., E. Fung & M. Ng. A multivariate Markov Chain
totally in the 12 time steps, we need to suppress theModel for Categorical Data Sequences and Its Applications in
Demand Predictions. IMA Journal of Management Mathemat-
ics 2002, 13: 187-199.
[4]Ching, W., E. Fung, M. Ng & T. Akutsu. On Construction of Sto-
chastic Genetic Networks Based on Gene Expression
Sequences.International Journal of Neural Systems 2005, 15:
297-310.
[5]Ching, W., Zhang, S., & M. Ng. On Multi-dimensional Markov
Chain Models. Pacific Journal of Optimization 2007, 3: 235-
243.
[6]Ching, W., Zhang, S., Jiao, Y., T. Akutsu & Wong, A.Optimal
Finite-Horizon Control for Probabilistic Boolean Networks
Table 2
Table 2
SciRes JBiSE Copyright © 2008
Table 2. Numerical results for the yeast data set.
K
Control
Policy
Objective
Value
Time in
Seconds
1
[1]
0.6430
4.00
2
[2]
0.5751
20.60
3
[1,2,3]
0.5165
67.90
4
[1,2,3,4]
0.4582
152.88
5
[1,2,3,4,5]
0.4000
245.95
20S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21
with Hard Constraints.The International Symposium on Opti-
mization and Systems Biology 2007.
[7]Ching, W., H. Leung, Tsing, N. & Zhang, S. Optimal Control for
Probabilistic Boolean Networks : Genetic Algorithm Approach,
2008.
[8]E. Dougherty, S. Kim & Chen, Y. Coefficient of Determination
in Nonlinear Signal Processing. Signal Processing2000, 80:
2219-2235.
[9]M. Hall, & G. Peters. Genetic Alterations of Cyclins, Cyclin-
dependent Kinases, and Cdk Inhibitors in Human Cancer.Adv.
Cancer Res. 1996, 68: 67-108.
[10]Huang, S. & D.E. Ingber. Shape-dependent Control of Cell
Growth, Differentiation, and Apoptosis: Switching Between
[16]F. Nir, L. Michal, N. Iftach & P. Dana. Using Bayesian Networks
to Analyze Expression Data. Journal of Computational Biology
2000, 7(3-4): 601-620.
[17]I. Shmulevich, E. Dougherty, S. Kim & W. Zhang. Probabilis-
tic Boolean Networks: A Rule-based Uncertainty Model for
Gene Regulatory Networks.Bioinformatics 2002, 18: 261-274.
[18]I. Shmulevich, E. Dougherty, S. Kim & Zhang, W. Control of
Stationary Behavior in Probabilistic Boolean Networks by
Means of Structural Intervention. Journal of Biological Sys-
tems 2002, 10: 431-445.
[19]I. Shmulevich, E. Dougherty, S. Kim & W. Zhang. From
Boolean to Probabilistic Boolean Networks as Models of
Genetic Regulatory Networks.Proceedings of the IEEE 2002,
Attractors in Cell Regulatory Networks. Exp. Cell Res.2000,90: 1778-1792.
261:91-103.[20]I. Shmulevich & E. Dougherty.Genomic Signal Processing,
[11]H. de Jong. Modeling and Simulation of Genetic Regulatory PrincetonUniversityPress,USA,2007.
Systems: A Literature Review.J. Comput. Biol.2002,9: 69-[21]P. Smolen, D. Baxter & J. Byrne. Mathematical Modeling of
103.Gene Network. Neuron 2002, 26: 567-580.
[12]S. Kauffman. Metabolic Stability and Epigenesis in Randomly[22]Wang, T. C., R.D. Cardiff, L. Zukerberg, E. Lees, A. Amold &
Constructed Gene Nets.J. Theoret. Biol. 1969,22:437-467.E.V. Schmidt. Mammary Hyerplasia and Carcinoma in
[13]S. Kauffman. Homeostasis and Differentiation in Random MMTV-cyclinD1TransgenicMice.Nature 1994, 369: 669-
Genetic Control Networks.Nature 1969, 224: 177-178.671.
[14]S. Kauffman. The Origin of Orders, Oxford University Press,[23]K. Yeung & W. Ruzzo. An Empirical Study on Principal Com-
NewYork,1993.ponent Analysis for Clustering Gene Expression Data.
[15]S. Kim, S. Imoto & S. Miyano. Dynamic Bayesian Network Bioinformatics 2001, 17: 763-774.
andNonparametricRegressionforNonlinearModelingof[24]Zhang, S., Ching, W., Tsing, N., H. Leung & Guo, D. A. Multi-
Gene Networks from time Series Gene Expression Data. Proc.ple Regression Approach for Building Genetic Networks.The
1st Computational Methods in Systems Biology, Lecture Note Proceedings of the International Conference on BioMedical
in Computer Science2003, 2602: 104-113.Engineering and Informatics 2008, Sanya, China.
SciRes JBiSE Copyright © 2008
21
S.Q. Zhanget al./J. Biomedical Science and Engineering 1 (2008) 15-21