J. Biomedical Science and Engineering, 2010, 3, 977-985 JBiSE
doi:10.4236/jbise.2010.310128 Published Online October 2010 (http://www.SciRP.org/journal/jbise/).
Global pattern of pairwise relationship in genetic network
Ao Yuan, Qingqi Yue, Victor Apprey, George E. Bonney
National Human Genome Center, Howard University, Washington DC, USA.
Email: yuanao@hotmail.com; ayuan@howard.edu
Received 6 April 2010; revised 22 April 2010; accepted 10 May 2010.
ABSTRACT
In recent times genetic network analysis has been found useful in the study of gene-gene interactions, and the study of gene-gene correlations is a special case of such network analysis. There are many methods for this goal. Most of the existing methods model the relationship between each gene and the set of genes under study. These methods work well in applications, but there are often issues such as non-uniqueness of the solution, computational difficulties, and interpretation of the results. Here we study this problem from a different point of view: given a measure of pairwise gene-gene relationship, we use the technique of pattern and image restoration to infer the optimal network of pairwise relationships. In this method, the solution always exists and is unique, the results are easy to interpret in the global sense, and the computation is simple. The regulatory relationships among the genes are inferred according to the principle that neighboring genes tend to share some common features. The network is updated iteratively until convergence; each iteration monotonically reduces the entropy and variance of the network, so the limit network represents the clearest picture of the regulatory relationships among the genes that the data provide and the model can recover. The method is illustrated with simulated data and applied to real data sets.
Keywords: Convergence, Gene-Gene relationship, Neighborhood, Pattern analysis, Relationship measure.
1. INTRODUCTION
A gene regulatory network (also called a GRN or genetic
regulatory network) is a collection of DNA segments in
a cell which interact with each other (indirectly through
their RNA and protein expression products) and with
other substances in the cell, thereby governing the rates
at which genes in the network are transcribed into
mRNA. From a methodological point of view, genetic networks are models that, in a simplified way, describe a biological phenomenon through interactions between the genes. They provide a high-level view and disregard most details of how exactly one gene regulates the activity of another. The gene-gene pairwise relationships provide a particular insight into the network and are of interest in this study.
Our work is closely related to that of genetic network
analysis, and we first give a brief review of the methods.
Some methods are deterministic, such as differential
(difference) equation models [1-3], which may not be
easy to solve nor have unique solutions. Since the ge-
netic network is a complex system, any artificial model
can only explain part of its mechanism; the unexplained
parts are random noises, so we prefer a stochastic model. Existing stochastic methods for this problem include the linear models [4,5], generalized linear models [6,7], the Bayesian network [8,9], etc. All these methods have their pros and cons, but they share the common disadvantage that the solution may not be unique and the results are not easy to interpret. Also, when the network size exceeds that of the data, these methods break down. In a genetic network the pairwise regulatory relationships among the genes are important. For such data, it is of interest to investigate the underlying patterns that may have biological significance, in particular those arising from pairwise regulatory relationships among the genes. Here we study this problem from a different point of view. Given a measure of pairwise gene-gene relationship, we compute the measures from the data, and use the technique of pattern recognition and image restoration to infer the underlying network relationships.
The pairwise regulatory relationships among the genes are inferred according to the principle that neighboring genes tend to share some common features, as neighboring genes tend to be co-regulated by some enhancers because of their close proximity [10]. In this method, the solution is unique and computationally simple, the results are easy to interpret, and the network can be of any size. In the following we describe our method, study its basic properties, and illustrate its application. This type of method has been used to reveal the true relationships in structured high-dimensional data arrays [11-14].
2. MATERIALS AND METHODS
The gene expression data are generally time dependent, as in Iyer et al. [15]. Let $x_{ij}(t)$ $(i = 1,\dots,m;\ j = 1,\dots,n;\ t = 1,\dots,k)$ be the observed gene expression response for subject $i$, gene $j$ at time $t$. Denote by $x_i(t) = (x_{i1}(t),\dots,x_{in}(t))'$ the observations across all the genes for subject $i$, and use $x(t)$ to denote a general sample of the $x_i(t)$'s. Often for this type of data, $m$ and $k$ are in the low tens, and $n$ in the tens to thousands.
The commonly used differential equation model for genetic network analysis is a set of first-order homogeneous differential equations with constant coefficients; in the simplest case it has the form
$$\frac{dx(t)}{dt} = Wx(t),$$
where $W = (w_{ij})$ is the $n \times n$ matrix of unknown regulatory coefficients to be solved for. This type of model, and its more specific and complicated variations, characterizes well the dynamics of the network over time. The base solutions of the above equation set are given by the matrix exponential $e^{tW} := \sum_{r=0}^{\infty} t^r W^r/r! =: (v_1(t),\dots,v_n(t))'$, and the general solution has the form
$$x_i(t) = \sum_{j=1}^{n} c_j v_j(t), \quad (i = 1,\dots,m),$$
where the $c_j$'s are constants to be determined by initial conditions from the data. So there are in total $n^2 + n = n(n+1)$ coefficients, $n^2$ of them from $W$ and $n$ from the $c_j$'s, to be determined from a total of $mnk$ data points. When $mnk < n(n+1)$ these coefficients cannot be determined; when $mnk \ge n(n+1)$ they may be uniquely or non-uniquely determined, or may still be undetermined. For differential (difference) equation models more complicated than this, solutions are more difficult to obtain.
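To make the parameter count concrete, here is a minimal Python sketch (ours, not part of the paper, whose computations were done in Splus): once the $n^2$ entries of $W$ and the $n$ initial values are fixed, the whole trajectory $x(t) = e^{tW}x(0)$ is determined, which is exactly the $n(n+1)$ unknowns that must be recovered from the $mnk$ data points.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 5                                   # number of genes in this toy network
W = rng.normal(scale=0.3, size=(n, n))  # n^2 unknown regulatory coefficients
x0 = rng.normal(size=n)                 # n unknown initial-condition constants

def x(t):
    # Solution x(t) = expm(t*W) @ x(0) of the homogeneous linear system.
    return expm(t * W) @ x0

print(x(1.0))      # expression profile at t = 1
print(n * n + n)   # total number of coefficients to recover from the data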
The commonly used stochastic model is the multivariate linear model
$$x_i(t+1) = Wx_i(t) + \epsilon_i, \quad E(\epsilon_i) = 0, \quad (i = 1,\dots,m;\ t = 1,\dots,k-1),$$
where $\epsilon_i = (\epsilon_{i1},\dots,\epsilon_{in})'$ is the random deviation unexplained by the model. Denote $X(t) = (x_{ij}(t))$; if $X'(t)X(t)$ is non-singular, the least-squares solution of the above model is $W = X'(t+1)X(t)(X'(t)X(t))^{-1}$, and it may have multiple solutions for different $t$. For $X'(t)X(t)$ to be non-singular, one must have $n \le m$. Even for $n < m$, $X'(t)X(t)$ may not necessarily be non-singular. This puts an immediate restriction on the size of the network to be analyzed. Also, the solution of the above model may not be unique due to different time points.
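As a concrete illustration (our sketch, not the authors' code), the least-squares estimate above can be computed directly; a pseudo-inverse is used so the computation does not fail outright when $X'(t)X(t)$ is singular, which is exactly the situation described when $n > m$.

import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 5                          # m subjects, n genes (needs n <= m here)
Xt = rng.normal(size=(m, n))          # rows are x_i(t)'
W_true = rng.normal(size=(n, n))
Xt1 = Xt @ W_true.T + 0.1 * rng.normal(size=(m, n))   # rows are x_i(t+1)'

# W = X'(t+1) X(t) (X'(t) X(t))^{-1}; pinv guards against a singular X'(t)X(t)
W_hat = Xt1.T @ Xt @ np.linalg.pinv(Xt.T @ Xt)
print(np.abs(W_hat - W_true).max())   # small: W is recovered up to the noise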
For these reasons, we study the problem from a different point of view: by analyzing the pairwise gene-gene relationships in the network. In the following we describe our model, in which there is always a unique solution, the result is easy to interpret, and there is no restriction on the size of the network. Since the pattern in the genetic network is based on the principle of neighboring similarity, the order of the genes matters in the study, and generally we assume the genes are arranged in their chromosomal order.

First we need a measure of the relationship between any pair of genes, and the network can then be represented by the matrix of the pairwise relationships. For a large network, a linear relationship is not adequate, as most of the coefficients will be very small. Also, as mentioned above, such a model in this case has no solution because of the small sample size. Pearson's correlation is a good choice for this purpose; other choices include Kendall's tau and Spearman's rho, etc. Here we illustrate the method with Pearson's correlation, and our goal is to infer the triangular correlation matrix $R = (r_{ij})_{1 \le i < j \le n}$ from the observed data, where $r_{ij}$ is the Pearson correlation coefficient between genes $i$ and $j$. As usually the number $m$ of individuals is small (sometimes as few as 2), estimating the correlations using the data at each time point alone is inadequate. So we use all the data to estimate them. An empirical initial version of these correlations is
$$r^{(0)}_{ij} = \frac{1}{mk}\sum_{r=1}^{m}\sum_{s=1}^{k} \frac{(x_{ri}(t_s) - \bar{x}_i(t_s))(x_{rj}(t_s) - \bar{x}_j(t_s))}{\sqrt{\widehat{Var}(x_i(t_s))\,\widehat{Var}(x_j(t_s))}}, \quad (1 \le i < j \le n), \quad (1)$$
where
$$\bar{x}_i(t) = \frac{1}{m}\sum_{r=1}^{m} x_{ri}(t), \quad \widehat{Var}(x_i(t)) = \frac{1}{m}\sum_{r=1}^{m}(x_{ri}(t) - \bar{x}_i(t))^2, \quad (i = 1,\dots,n;\ s = 1,\dots,k).$$
Here the $x_{ri}(t)$'s are not i.i.d. over the time points $t$, and the sample size $mk$ is often not large, so the above empirical correlations are very crude evaluations of the true correlations $r_{ij}$. The initial table $R^{(0)} = (r^{(0)}_{ij}: 1 \le i < j \le n)$ is used as the raw data for the next step of the analysis. For each fixed $i$, pooling the observations $x_i(t)$ across different time conditions reduces the common features in the data, so this table is biased as an estimate of $R$. We need to restore its values according to the basic property of the genetic regulatory system. Many reports have shown that nearby genes tend to have similar expression profiles [16-19]; thus nearby pairs of genes tend to have similar relationships, and their correlations tend to be close. This is just the same principle as used in image restoration of data arrays of any size. In the following we use this technique to reduce the bias and improve the estimate of $R$ based on the observation $R^{(0)}$.
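A short sketch (ours) of the pooled empirical correlations in (1), assuming the expression data are held in an array X of shape (m, n, k) for subjects, genes and time points; only the upper-triangular part (i < j) of the returned table is used in what follows.

import numpy as np

def initial_correlations(X):
    # X has shape (m, n, k): m subjects, n genes, k time points.
    # Returns the n x n table R0 of pooled empirical correlations, Eq. (1).
    m, n, k = X.shape
    mean = X.mean(axis=0, keepdims=True)        # x-bar_j(t), shape (1, n, k)
    var = ((X - mean) ** 2).mean(axis=0)        # Var-hat(x_j(t)), shape (n, k)
    Z = (X - mean) / np.sqrt(var)[None, :, :]   # standardized residuals
    # average of z_ri(t_s) * z_rj(t_s) over all (r, s) pairs
    return np.einsum('rit,rjt->ij', Z, Z) / (m * k)

# toy usage
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 40, 12))
R0 = initial_correlations(X)
print(R0.shape, R0[0, 1])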
Meloche and Zammar [14] considered a method for image restoration of binary data; here we adopt their idea and adapt their method to gene expression analysis with continuous data. We assume the following model
$$r^{(0)}_{ij} = r_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2), \quad (1 \le i < j \le n) \quad (2)$$
for some unknown $\sigma^2 > 0$, where the $\epsilon_{ij}$'s represent the part of the measurements unexplained by the true regulatory relationships in the model. Define the neighborhood $R^{(0)}_{ij}$ of $r^{(0)}_{ij}$ to be the collection of its nine immediate members, $R^{(0)}_{ij} = \{r^{(0)}_{ab}: |a - i| \le 1,\ |b - j| \le 1\}$, which includes $r^{(0)}_{ij}$ itself at the center. For $r^{(0)}_{ij}$'s on the boundary of $R^{(0)}$ the definition is modified accordingly. For example, $R^{(0)}_{1,2}$ and $R^{(0)}_{n-1,n}$ have only three members, $R^{(0)}_{1,j}$ ($3 < j < n-1$) has six members, etc. Larger neighborhoods of different shapes can also be considered; here we only illustrate the method with the above neighborhood system. We assume the $r^{(0)}_{ij}$'s only depend on their neighborhoods $R^{(0)}_{ij}$. The aim is to provide estimates $\hat{r}_{ij}$ for the true $r_{ij}$'s based on the records $R^{(0)}$. We assume the estimates have the form
$$\hat{r}_{ij} = h(R^{(0)}_{ij}), \quad (1 \le i < j \le n), \quad (3)$$
for some function $h(\cdot)$ to be specified. The performance of the estimates is measured by the average conditional mean squared error
$$\frac{2}{n(n-1)}\sum_{1 \le i < j \le n} E[(\hat{r}_{ij} - r_{ij})^2 \mid R^{(0)}]. \quad (4)$$
The optimal set of estimates is the one which minimizes (4). Although $r_{ij}$ is deterministic, we may view it as a realization of the random variable $r_{IJ}$, with $(I, J)$ uniformly distributed over the integer set $S = \{(i,j): 1 \le i < j \le n\}$. So (4) can be rewritten as
$$E\big\{E[(\hat{r}_{IJ} - r_{IJ})^2 \mid R^{(0)}]\big\} = E[(\hat{r}_{IJ} - r_{IJ})^2 \mid R^{(0)}] = E[(h(R^{(0)}_{IJ}) - r_{IJ})^2 \mid R^{(0)}].$$
Thus by (3), the minimizer of (4) is achieved by
$$r^*_{IJ} := h(R^{(0)}_{IJ}) = E(r_{IJ} \mid R^{(0)}_{IJ}) = E(r_{IJ} \mid R^{(0)}),$$
and so $r^*_{ij} = h(R^{(0)}_{ij}) = E(r_{IJ} \mid R^{(0)}_{ij})$.

To evaluate the above conditional expectation, we need a bit more preparation. Note that $\sigma^2$ is estimated by
$$\hat\sigma^2 = \frac{2}{n(n-1)}\sum_{(i,j) \in S}(r^{(0)}_{ij} - \bar{r}^{(0)})^2, \quad \bar{r}^{(0)} = \frac{2}{n(n-1)}\sum_{(i,j) \in S} r^{(0)}_{ij}.$$
Denote by $\phi(t \mid \sigma^2)$ the normal density function with mean 0 and variance $\sigma^2$, and denote by $S_{ij}$ the collection of indices of $R^{(0)}_{ij}$. Given $R^{(0)}_{ij}$, for $(I,J) \in S_{ij}$, view $r^{(0)}_{IJ}$ as a random variable over the indices $(I,J)$. We define the conditional distribution of $r^{(0)}_{IJ}$ as
$$P(r^{(0)}_{IJ} = r^{(0)}_{uv} \mid R^{(0)}_{ij}) = \frac{\#\{\text{members of } R^{(0)}_{ij} \text{ equal to } r^{(0)}_{uv}\}}{|S_{ij}|} = \frac{1}{|S_{ij}|}.$$
In the above we used the fact that the $r^{(0)}_{uv}$ are continuous random variables, so the collection $\{\text{members of } R^{(0)}_{ij} \text{ equal to } r^{(0)}_{uv}\} = \{r^{(0)}_{uv}\}$ almost surely. The corresponding conditional probability is defined as
$$P((I,J) = (u,v) \mid R^{(0)}_{ij}) = \frac{\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)\,P(r^{(0)}_{IJ} = r^{(0)}_{uv} \mid R^{(0)}_{ij})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)\,P(r^{(0)}_{IJ} = r^{(0)}_{uv} \mid R^{(0)}_{ij})}, \quad (u,v) \in S_{ij}.$$
By (2), we deduce $E(r_{IJ} \mid R^{(0)}_{ij}, (I,J) = (u,v)) = r^{(0)}_{uv}$, so we have
$$\hat{r}_{ij} = E(r_{IJ} \mid R^{(0)}_{ij}) = \sum_{(u,v) \in S_{ij}} E(r_{IJ} \mid R^{(0)}_{ij}, (I,J) = (u,v))\,P((I,J) = (u,v) \mid R^{(0)}_{ij})$$
$$= \frac{\sum_{(u,v) \in S_{ij}} r^{(0)}_{uv}\,\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)\,P(r^{(0)}_{IJ} = r^{(0)}_{uv} \mid R^{(0)}_{ij})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)\,P(r^{(0)}_{IJ} = r^{(0)}_{uv} \mid R^{(0)}_{ij})} = \frac{\sum_{(u,v) \in S_{ij}} r^{(0)}_{uv}\,\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \hat\sigma^2)}, \quad (i,j) \in S. \quad (5)$$
The matrix $\hat{R} = (\hat{r}_{ij})$ is our one-step restored estimate of the genetic correlation network $R$; we also denote it by $R^{(1)} = (r^{(1)}_{ij})$. Denote by $F(\cdot)$ the operator given in (5), so that $r^{(1)}_{ij} = F_{ij}(R^{(0)})$ and $R^{(1)} = F(R^{(0)}) = E(R \mid R^{(0)})$.
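For concreteness, a sketch (ours) of the one-step filter $F$ in (5): each off-diagonal entry is replaced by a Gaussian-weighted average of its at most nine defined neighbours, with the kernel width set to the empirical variance $\hat\sigma^2$ of the off-diagonal entries; the normal density's normalizing constant cancels in the ratio, so an unnormalized Gaussian weight is used.

import numpy as np

def offdiag_variance(R):
    # sigma^2-hat: empirical variance of the entries r_ij with i < j
    return R[np.triu_indices_from(R, k=1)].var()

def restore_once(R, sigma2):
    # One application of F (Eq. (5)) to the upper-triangular table R (n x n array).
    n = R.shape[0]
    out = R.copy()
    for i in range(n):
        for j in range(i + 1, n):
            num, den = 0.0, 0.0
            for a in range(max(i - 1, 0), min(i + 2, n)):
                for b in range(max(j - 1, 0), min(j + 2, n)):
                    if a >= b:
                        continue  # only entries with a < b are defined
                    w = np.exp(-(R[a, b] - R[i, j]) ** 2 / (2.0 * sigma2))
                    num += w * R[a, b]
                    den += w
            out[i, j] = num / den
    return out

# toy usage on a random upper-triangular table
rng = np.random.default_rng(3)
R0 = np.triu(rng.uniform(-1.0, 1.0, size=(40, 40)), k=1)
R1 = restore_once(R0, offdiag_variance(R0))   # one-step restored R^(1)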
We view $F(\cdot)$ as a filter for the noise, so $R^{(1)}$ is a smoothed version of $R^{(0)}$. Let $r = r_{IJ}$ be the random variable of the $r_{ij}$'s over the random index $(I,J)$ and the variation of the possible values of the $r_{ij}$'s, with density $p(\cdot)$. Its uncertainty can be characterized by variance and by entropy, the latter defined as
$$H(p) = -E[\log p(r)] = -\int p(r)\log p(r)\,dr.$$
It is maximized, i.e., most uncertain, when $r$ is uniformly distributed, and takes smaller values when the distribution of $r$ is more concentrated. Entropy has some relationship with variance: the former depends on more innate features of the distribution, such as its moments, while the latter only measures the dispersion about the mean. When $p(\cdot)$ is a normal density with variance $\sigma^2$, then $H(p) = \frac{1}{2}\log(2\pi e\sigma^2)$. For many commonly used parametric distributions, entropy and variance agree with each other, i.e., an increase in one of them implies an increase in the other. But this is not always true, and a general closed-form relationship between variance and entropy does not exist. Variance is more popular in practice because of its simplicity.
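For reference, a two-line numerical check (ours) of the normal-density case just cited, where entropy is a monotone function of the variance:

import numpy as np

def normal_entropy(sigma2):
    # H(p) = (1/2) log(2*pi*e*sigma2) for a normal density with variance sigma2
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma2)

print(normal_entropy(0.04), normal_entropy(0.01))   # smaller variance, smaller entropy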
Although generally, in the image restoration context, $R$ is estimated by applying $F$ just once, a natural question is what happens if we apply the operator $F$ repeatedly, i.e., let
$$R^{(k+1)} = (r^{(k+1)}_{ij}) = F(R^{(k)}) = E(R^{(k)} \mid R^{(k)}), \quad k \ge 0.$$
To investigate this question, we impose the model
$$r^{(k)}_{ij} = r_{ij} + \epsilon^{(k)}_{ij}, \quad \epsilon^{(k)}_{ij} \sim N(0, \sigma^{2(k)}), \quad (1 \le i < j \le n). \quad (6)$$
The estimators $\hat{r}^{(k)}_{ij}$ are obtained by minimizing
$$\frac{2}{n(n-1)}\sum_{1 \le i < j \le n} E[(\hat{r}_{ij} - r_{ij})^2 \mid R^{(k)}] \quad (7)$$
and are given by $\hat{r}^{(k)}_{ij} = E(r_{IJ} \mid R^{(k)}_{ij})$. Similarly, $\sigma^{2(k)}$ is estimated by
$$\hat\sigma^{2(k)} = \frac{2}{n(n-1)}\sum_{(i,j) \in S}(r^{(k)}_{ij} - \bar{r}^{(k)})^2, \quad \bar{r}^{(k)} = \frac{2}{n(n-1)}\sum_{(i,j) \in S} r^{(k)}_{ij}.$$
Since $n$ is usually large, $\hat\sigma^{2(k)}$ is a good estimator of $\sigma^{2(k)}$. Corresponding to (5), we have
$$r^{(k+1)}_{ij} = E(r^{(k)}_{IJ} \mid R^{(k)}_{ij}) = \frac{\sum_{(u,v) \in S_{ij}} r^{(k)}_{uv}\,\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2(k)})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2(k)})}, \quad (i,j) \in S,\ k \ge 1. \quad (8)$$
In the above we do not replace the $r^{(0)}_{ij}$'s by the $r^{(k)}_{ij}$'s inside $\phi(\cdot \mid \cdot)$, but $\sigma^{2(0)}$ is replaced by the step-$k$ value $\sigma^{2(k)}$; this is only for simplicity in the proof of the Proposition below. Finally, $\sigma^{2(k)}$ is replaced by its estimate $\hat\sigma^{2(k)}$ in the actual computation.
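A sketch (ours) of the step $k \to k+1$ update in (8): as described above, the Gaussian weights are still formed from the initial table $R^{(0)}$, while the averaged values and the plugged-in variance come from step $k$.

import numpy as np

def restore_step(R0, Rk, sigma2_k):
    # One pass of Eq. (8): weights from the initial table R0, averaged values
    # from the current table Rk, kernel width from the step-k variance estimate.
    n = R0.shape[0]
    out = Rk.copy()
    for i in range(n):
        for j in range(i + 1, n):
            num, den = 0.0, 0.0
            for a in range(max(i - 1, 0), min(i + 2, n)):
                for b in range(max(j - 1, 0), min(j + 2, n)):
                    if a >= b:
                        continue
                    w = np.exp(-(R0[a, b] - R0[i, j]) ** 2 / (2.0 * sigma2_k))
                    num += w * Rk[a, b]
                    den += w
            out[i, j] = num / den
    return out

With sigma2_k fixed at the initial estimate and Rk = R0, this reduces to the one-step filter sketched earlier.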
View $r^{(k)}$ as a random realization of the $r^{(k)}_{ij}$'s (random both in the index and through $R^{(0)}$), and let $p^{(k)}(\cdot)$ be the density function of $r^{(k)}$. To study the properties of the algorithm, we say a non-negative function $f(\cdot)$ is log-concave if $\log f(\cdot)$ is concave, and assume the following conditions.

A) $p^{(k)}(\cdot)$ is log-concave for all $k$.

B) All the $p^{(k)}(\cdot)$'s belong to a parametric family which is determined by its first two moments.

Although few density functions are concave, many of them are log-concave, for example the normal, exponential (in fact any quadratic exponential family), Gamma, Beta, chi-square, triangular and uniform distributions; but some are not, such as the t and Cauchy distributions. Condition A) does not require all the $p^{(k)}(\cdot)$'s to belong to the same parametric family, nor even to be parametric. Condition B) is satisfied for almost all parametric families, as few parametric families require more than the first two moments to be determined; the only restriction it imposes is that all the $p^{(k)}(\cdot)$'s belong to the same parametric family.
Our algorithm has the following desirable properties (see the Appendix for the proof).

Proposition. 1) Assume either A) or B); then $H(p^{(k+1)}) \le H(p^{(k)})$, $k \ge 0$.

2) $\sigma^{2(k+1)} \le \sigma^{2(k)}$, $k \ge 0$.

3) As $k \to \infty$, the table $R^{(k)}$ converges in the componentwise sense: $R^{(k)} \to R^*$ for some stationary array $R^* = R^*(R^{(0)}, F)$.
This Proposition tells us that, if the assumption of neighboring similarity is valid for $R^{(0)}$, then the estimates $R^{(k)}$ become clearer and clearer (smaller entropy) and more and more accurate as estimators of $R$ (smaller variance). So $R^*$ is the sharpest picture that the data $R^{(0)}$ provide and that can be restored by the filter $F$; it is the innate set of regulatory relationships among the genes that would be recovered by the filter $F$ from the data $R^{(0)}$ under the ideal situation of no noise. Intuitively, this picture has some close relationship with haplotype block structures.

Because of the small sample size ($mk$) and the large number ($n(n-1)/2$) of parameters, there is no way to talk about the consistency of $R^*$ for $R$. So in general $R^*$ and $R$ may not be equal; however, our algorithm enables us to make the best effort we can. Convergence of $R^{(k)}$ can be assessed by the distance criteria: for a given $\delta > 0$ (usually $\delta = 1/100$ or $1/1000$),
$$d_1(R^{(k+1)}, R^{(k)}) = \frac{2}{n(n-1)}\sum_{i<j}|r^{(k+1)}_{ij} - r^{(k)}_{ij}| \le \delta,$$
or
$$d_2(R^{(k+1)}, R^{(k)}) = \Big(\frac{2}{n(n-1)}\sum_{i<j}(r^{(k+1)}_{ij} - r^{(k)}_{ij})^2\Big)^{1/2} \le \delta.$$
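Putting the pieces together, a sketch (ours) of the full iteration with the $d_1$ stopping rule; restore_step is the step function sketched after (8), passed in as an argument.

import numpy as np

def d1(Ra, Rb):
    # mean absolute change over the off-diagonal entries (first criterion above)
    iu = np.triu_indices_from(Ra, k=1)
    return np.abs(Ra[iu] - Rb[iu]).mean()

def restore(R0, step, delta=1e-4, max_iter=100):
    # Iterate Eq. (8), re-estimating sigma^2(k) at each pass, until successive
    # tables differ by less than delta in d1.
    Rk = R0.copy()
    for _ in range(max_iter):
        sigma2_k = Rk[np.triu_indices_from(Rk, k=1)].var()
        R_next = step(R0, Rk, sigma2_k)
        if d1(R_next, Rk) <= delta:
            return R_next
        Rk = R_next
    return Rk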
Network at each time point. We may also investigate the problem at each time point $t$ separately. In this case (1) is replaced by
$$r^{(0)}_{ij}(t) = \frac{1}{m}\sum_{r=1}^{m}\frac{(x_{ri}(t) - \bar{x}_i(t))(x_{rj}(t) - \bar{x}_j(t))}{\sqrt{\widehat{Var}(x_i(t))\,\widehat{Var}(x_j(t))}}, \quad (1 \le i < j \le n),$$
where
$$\bar{x}_i(t) = \frac{1}{m}\sum_{r=1}^{m}x_{ri}(t), \quad \widehat{Var}(x_i(t)) = \frac{1}{m}\sum_{r=1}^{m}(x_{ri}(t) - \bar{x}_i(t))^2, \quad (i = 1,\dots,n;\ t = 1,\dots,k),$$
and $R^{(0)}(t) = (r^{(0)}_{ij}(t): 1 \le i < j \le n)$ is the corresponding initial table at each $t$, with the neighborhood of $r^{(0)}_{ij}(t)$ given by $R^{(0)}_{ij}(t) = \{r^{(0)}_{ab}(t): |a-i| \le 1,\ |b-j| \le 1\}$. In this case (6) becomes
$$r^{(k)}_{ij}(t) = r_{ij}(t) + \epsilon^{(k)}_{ij}(t), \quad \epsilon^{(k)}_{ij}(t) \sim N(0, \sigma^{2(k)}(t)), \quad (1 \le i < j \le n),$$
and $\hat{r}^{(k)}_{ij}(t) = E(r^{(k)}_{IJ}(t) \mid R^{(k)}_{ij}(t))$. Let
$$\hat\sigma^{2(k)}(t) = \frac{2}{n(n-1)}\sum_{(i,j) \in S}(r^{(k)}_{ij}(t) - \bar{r}^{(k)}(t))^2, \quad \bar{r}^{(k)}(t) = \frac{2}{n(n-1)}\sum_{(i,j) \in S} r^{(k)}_{ij}(t).$$
(8) is now
$$r^{(k+1)}_{ij}(t) = E(r^{(k)}_{IJ}(t) \mid R^{(k)}_{ij}(t)) = \frac{\sum_{(u,v) \in S_{ij}} r^{(k)}_{uv}(t)\,\phi(r^{(0)}_{uv}(t) - r^{(0)}_{ij}(t) \mid \sigma^{2(k)}(t))}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv}(t) - r^{(0)}_{ij}(t) \mid \sigma^{2(k)}(t))}, \quad (i,j) \in S,\ k \ge 1.$$
The matrix $R^{(k)}(t) = (r^{(k)}_{ij}(t))$ is the $k$-step restored estimate of the genetic correlation network $R(t) = (r_{ij}(t))$ at time $t$. The Proposition then holds for each fixed $t$.
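The per-time-point analysis needs only the initial table computed from the m subjects at a single time point; a brief sketch (ours), after which the restoration proceeds exactly as before:

import numpy as np

def initial_correlations_at_t(X, t):
    # Time-specific version of Eq. (1); X has shape (m, n, k), t is a time index.
    Xt = X[:, :, t]                              # m subjects x n genes at time t
    Z = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0)  # centre and scale across subjects
    return Z.T @ Z / Xt.shape[0]                 # n x n table R^(0)(t)

# toy usage
rng = np.random.default_rng(5)
R0_t = initial_correlations_at_t(rng.normal(size=(30, 50, 3)), t=1)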
3. SIMULATION STUDY
We simulate 40 genes over 12 time conditions at time (hour) points $(t_1,\dots,t_{12}) = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)$ for 6 individuals, mimicking the setting of the data analyzed in Iyer et al. [15]. We simulate the genes from 6 clusters; the numbers of genes in the clusters are given by the vector $(n_1,\dots,n_6) = (8, 6, 4, 4, 6, 12)$. The baseline values of the gene expressions over time $t \in [0,12]$ in cluster $k$ are generated by functions of the form
$$h_k(t) = a_{k1}\sin t + a_{k2}\sin(t/2) + a_{k3}\sin(t/3), \quad (k = 1,\dots,6).$$
Let $h(t)$ be the vector of length 40, with the first $n_1$ components given by $h_1(t)$, the second $n_2$ components by $h_2(t)$, ..., and the last $n_6$ components by $h_6(t)$. Denote $a_k = (a_{k1}, a_{k2}, a_{k3})$. We arbitrarily choose the $a_k$'s as $a_1$ = (0.54, -0.18, 1.23), $a_2$ = (-0.12, -0.25, 0.45), $a_3$ = (1.0, -0.55, -0.15), $a_4$ = (-0.32, -0.15, -0.65), $a_5$ = (0.15, 0.25, 0.35) and $a_6$ = (-0.52, -0.45, -0.55). First we need to simulate the $r_{ij}$'s with coordinated patterns. We divide the 40 genes into the 6 clusters and assume independence among the clusters. Then, for a given covariance matrix $\Sigma = \mathrm{diag}(\Sigma_1,\dots,\Sigma_6)$, we generate the data using this $\Sigma$ and the time conditions, where $\Sigma_k$ is the covariance matrix for the genes in cluster $k$. Directly specifying a high-dimensional positive definite matrix is not easy, so we let each $\Sigma_k$ have the structure $\Sigma_k = Q_kQ_k'$ for some non-singular $Q_k$, so that $\Sigma_k$ is positive definite. Note that the $Q_kQ_k'$'s may not be correlation matrices, but they are covariance matrices, and so is $\Sigma$. Let $Q_k$ be upper triangular with dimension $n_k$. The non-zero elements of $Q_1$ are drawn from U(0.5, 0.8); those of $Q_2$ from U(0.2, 0.4); those of $Q_3$ from U(0.2, 0.6); those of $Q_4$ from U(-0.3, -0.1); those of $Q_5$ from U(0.6, 0.9); and those of $Q_6$ from U(-0.8, -0.6). Then let $\Sigma^{1/2} = Q = \mathrm{diag}(Q_1,\dots,Q_6)$. The 6 individuals are i.i.d., so we only need to describe the simulation of the observations $x_1 = \{x_{1j}(t): j = 1,\dots,40;\ t = 1,\dots,12\}$ of the first individual. Note that for each fixed $t$, $x_1(t) = (x_{11}(t),\dots,x_{1,40}(t))$ has covariance matrix $\Sigma$. We first generate $y = (y_1,\dots,y_{40})$ with components i.i.d. $N(0,1)$; then
$$x_1(t) = h(t) + \Sigma^{1/2}y + \epsilon(t)$$
is the desired sample, where for each fixed $t$, $\epsilon(t) = (\epsilon_1(t),\dots,\epsilon_{40}(t))$ is the noise, with the $\epsilon_i(t)$'s i.i.d. $N(0,1)$ and independent over $t$. Convert the covariance matrix $\Sigma = (\sigma_{ij})$ to a correlation matrix $R = (r_{ij})$, $r_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$, only for $i < j$. Using the data $X = (x_{ij}(t))$, we compute the $R^{(k)}$'s from (8), then use perspective plots to compare the restored correlations after convergence at step $k$, $R^{(k)}$; the one-step restored $R^{(1)}$; the initial estimate $R^{(0)}$; and the true simulated correlations $R$.
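A condensed sketch (ours) of this simulation design, with the cluster sizes, sinusoidal baselines and uniform ranges listed above; details such as random seeds and the exact noise handling in the authors' Splus implementation may differ.

import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(6)
sizes = [8, 6, 4, 4, 6, 12]                        # cluster sizes (n = 40)
ranges = [(0.5, 0.8), (0.2, 0.4), (0.2, 0.6),
          (-0.3, -0.1), (0.6, 0.9), (-0.8, -0.6)]  # U(a, b) for the Q_k entries
a = np.array([[0.54, -0.18, 1.23], [-0.12, -0.25, 0.45], [1.0, -0.55, -0.15],
              [-0.32, -0.15, -0.65], [0.15, 0.25, 0.35], [-0.52, -0.45, -0.55]])

blocks = [np.triu(rng.uniform(lo, hi, size=(nk, nk)))   # upper-triangular Q_k
          for nk, (lo, hi) in zip(sizes, ranges)]
Q = block_diag(*blocks)                                 # Sigma^{1/2}
Sigma = Q @ Q.T                                         # block-diagonal covariance

def h(t):
    # Baseline h(t): cluster k follows a_k1*sin(t) + a_k2*sin(t/2) + a_k3*sin(t/3)
    per_cluster = a[:, 0] * np.sin(t) + a[:, 1] * np.sin(t / 2) + a[:, 2] * np.sin(t / 3)
    return np.repeat(per_cluster, sizes)

y = rng.normal(size=40)                                 # shared N(0,1) draw for one subject
X1 = np.array([h(t) + Q @ y + rng.normal(size=40)       # x_1(t) = h(t) + Sigma^{1/2} y + eps(t)
               for t in range(1, 13)]).T                # shape (40 genes, 12 time points)

d = np.sqrt(np.diag(Sigma))
R_true = Sigma / np.outer(d, d)                         # true correlation matrix R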
After computation, the algorithm meets the convergence criterion at iteration 14 with $\delta = 10^{-4}$. The distances between the truth and the observed, first-step and last-step estimates are $d_1(R^{(0)}, R) = 0.125$, $d_1(R^{(1)}, R) = 0.108$ and $d_1(R^{(14)}, R) = 0.094$. We see that the estimate after convergence is closest to the true correlations. The results are displayed in Figure 1. We only display the correlations $r_{ij}$ for $j > i$; the values $r_{ii}$ are 1's, and those for $r_{ij}$ ($i > j$) are set to zero, as they can be obtained by symmetry.
From this figure we see that the correlations computed from the raw data, panel (b), are very noisy, and the true pattern in the network, panel (a), is obscured. The one-step estimate, panel (c), gives some limited indication, while the final estimate, panel (d), recovers the true picture reasonably well. Considering the large number of 40(40+1)/2 = 820 parameters and the small sample of 6 individuals over 12 time points on 40 genes, the last-step estimates are quite a success. A large number of simulations yield similar results, with the convergence criterion met within 10 to 15 iterations.

Note that we only used networks of 40 genes, as large networks are not easy to display graphically. The computation for a network with n genes is of the order n(n-1)/2, so there should be no computational problem for an ordinary computer to restore even the whole genome with this method.
4. RESULTS
We use the proposed method to analyze data from 30 microarray chips from the Stanford microarray database: http://smd.stanford.edu/cgi-bin/search/QuerySetup.pl.
The Category is Normal tissue and the subcategory is
PBMC, the following 30 files are the Raw data in the
database: 19430.xls, 19438.xls, 19439.xls, 19446.xls,
19447.xls, 19448.xls, 19449.xls, 19450.xls, 19451.xls,
19500.xls, 19505.xls, 19506.xls, 19507.xls, 21407.xls,
21408.xls, 21409.xls, 21410.xls, 21411.xls, 21412.xls,
21413.xls, 21414.xls, 21415.xls, 21416.xls, 21424.xls,
21425.xls, 21426.xls, 21427.xls, 21428.xls, 21429.xls,
21430.xls. The data we used are the overall intensity
(mean), the 67th column in the 30 Excel files. We choose three subsets of genes on the 30 arrays: set I is genes 0-49, set II is genes 1000-1049 and set III is genes 5000-5049 from the original data set. There are 80 variables for each array. We choose the intensity from normal people for our analysis.

Figure 1. Network correlations: (a) simulated $R$; (b) initial $R^{(0)}$; (c) one-step restored $R^{(1)}$; (d) $k$-step converged $R^{(k)}$.

The initial correlation coefficients among the genes computed from the raw data in each set, and those estimated after convergence by our algorithm, are shown in Figures 2-4. Clearly the initial correlations are noisy, and it is difficult to discern any patterned relationships among the genes. In contrast, the restored
pictures are quite clear. For set I, the coefficients are
rather homogeneous with values around 0.5, but there is
a clear boundary around gene 43, which suggests that
most of the genes in this set have similar relationships,
or functions. But gene 43 seems to have its own separate
mechanism. Genes 38 and 29 also have weak relation-
ships with the other genes. For set II, the relationships
among the genes are not so homogeneous. The genes are moderately correlated with coefficients around 0.5; some genes, around positions 10, 16, 24, 30, and 38, have weak interactions with the other genes. For set III, there is a moderate coordinated pattern among the genes, but three genes, around positions 15, 29, and 40, appear to have relatively independent patterns of regulatory functioning.
5. CONCLUDING REMARKS
We have considered an image restoration method for genetic network analysis. This method gives a unique solution, the results are easy to interpret, and the computation is simple. We may incorporate the genetic distances among the genes into the updating system given in (8). The method is not confined to correlation coefficients among genes; other measures of gene-gene relationships can be treated analogously. Very large networks can be analyzed in principle; the only challenge is how to display the results. We found that when the number of genes exceeds 50, the figure becomes difficult to distinguish visually. The computation for a network of size 40 takes about a couple of minutes using the Splus software. It would be much faster using a C program, and there should be no problem analyzing the whole genome by this method. The only requirement is that the data be arranged in their chromosomal order; otherwise the results may not be easy to interpret.

The method can also be used for other analysis purposes and data types, such as cluster analysis, in which objects are clustered by pattern similarities. It can also be used to analyze qualitative data types, as in haplotype analysis.
Figure 2. Real data I: initial (left panel) and restored (right panel) correlations.
Figure 3. Real data II: initial (left panel) and restored (right panel) correlations.
Figure 4. Real data III: initial (left panel) and restored (right panel) correlations.
6. ACKNOWLEDGEMENTS
The research has been supported in part by the National Center for Research Resources at NIH, grant 2G12 RR003048.
REFERENCES
[1] Goodwin, B.C. (1965) Oscillatory behavior in enzymatic
control processes. In Weber, G., Ed., Advances in Enzyme
Regulation, Pergamon Press, Oxford, 425-438.
[2] Tyson, J.J. and Othmer, H.G. (1978) The dynamics of
feedback cellular control circuits in biochemical path-
ways. Progress in Biophysics, Academic Press, New
York, 1-62.
[3] Reinitz, J., Mjolsness, E. and Sharp, D.H. (1995) Model
for cooperative control of positional information in Dro-
sophila by bicoid and maternal hunchback. Journal of
Experimental Zoology, 271(1), 47-56.
[4] Wessels, L.F.A., Van Someren, E.P. and Reinders, M.J.T.
(2001) A comparison of genetic network models. Pacific
Symposium on Biocomputing, 6(4), 508-519.
[5] D’haeseleer, P., Liang, S. and Somogyi, R. (1999) Gene
expression data analysis and modeling. Pacific Symposium on Biocomputing, Hawaii, USA.
[6] Savageau, M.A. (1976) Biochemical System Analysis: A
Study of Function and Design in Molecular Biology. Ad-
dison-Wesley, Reading, Massachusetts.
[7] Voit, E.O. (2000). Computational Analysis of Biochemi-
cal Systems, Cambridge University Press, Cambridge.
[8] Friedman, N., Linial, M., Nachman, I. and Pe’er, D.
(2000) Using Bayesian network to analyze expression
data. Journal of Computational Biology, 7(3-4), 601-620.
[9] Zhang, B.T. and Hwang, K.B. (2003) Bayesian network
classifiers for gene expression analysis. In: Berrar D.P.,
Dubitzky W. and Granzow M. Ed., A Practical Approach
to Microarray Data Analysis, Kluwer Academic Publish-
ers, Netherlands, 150-165.
[10] Chen, L. and Zhao, H. (2005) Gene expression analysis
reveals that histone deacetylation sites may serve as par-
titions of chromatin gene expression domains. BMC Ge-
netics, 6(1), 44.
[11] Owen, A. (1984) A neighborhood based classifier for
LANDSAT data. Canadian Journal of Statistics, 12(3),
191-200.
[12] Ripley, B.D. (1986) Statistics, images, and pattern recog-
nition. Canadian Journal of Statistics, 14(2), 83-111.
[13] Besag, J. (1986) On the statistical analysis of dirty pic-
tures. Journal of the Royal Statistical Society, 48(3), 259-
302.
[14] Meloche, J. and Zammar, R. (1994) Binary-image resto-
ration. Canadian Journal of Statistics, 22(3), 335-355.
[15] Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T.,
Lee, J.C., Trent, J.M., Staudt, L.M., Hudson J.J., Boguski,
M.S., Lashkari, D., Shalon, D., Botstein, D. and Brown,
P.O. (1999) The transcriptional program in the response
of human fibroblast to serum. Science, 283(5398), 83-87.
[16] Cohen, B.A., Mitra, R.D., Hughes, J.D. and Church, G.M.
(2000) A computational analysis of whole-genome ex-
pression data reveals chromosomal domains of gene ex-
pression. Nature Genetics, 26(2), 183-186.
[17] Caron, H., Schaik, B., Mee, M., Baas, F., Riggins, G.,
Sluis, P., Hermus, M.C., Asperen, R., Boon, K., Voute,
P.A., et al. (2001) The human transcriptome map: clus-
tering of highly expressed genes in chromosomal do-
mains. Science, 291(5507), 1289-1292.
[18] Lercher, M.J., Urrutia, A.O. and Hurst, L.D. (2002)
Clustering of housekeeping genes provides a unified
model of gene order in the human genome. Nature Ge-
netics, 31(2), 180-183.
[19] Spellman P.T., Rubin G.M. (2002) Evidence for large
domains of similarly expressed genes in the Drosophila
genome. Journal of Biology, 1(1), 5.
[20] Ebrahimi, N., Maasoumi, E. and Soofi, E. (1999) Order-
ing univariate distributions by entropy and variance.
Journal of Econometrics, 90(2), 317-336.
Appendix
Proof of the Proposition. Recall that $r^{(k+1)} = r^{(k+1)}_{IJ} = r^{(k+1)}_{IJ}(R^{(k)})$ is random in $(I,J)$ and in $R^{(k)}$; for fixed $(I,J) = (i,j)$, $r^{(k+1)}_{ij} = r^{(k+1)}_{ij}(R^{(k)})$ is random in $R^{(k)}$; given $R^{(0)}$, $r^{(k+1)}_{IJ} = r^{(k+1)}_{IJ}(R^{(k)} \mid R^{(0)})$ is random in $(I,J)$ (discrete); also
$$r^{(k+1)}_{IJ} = E(r^{(k)}_{IJ} \mid R^{(k)}_{IJ}) = E[r^{(k)}_{UV} \mid R^{(k)}_{IJ}]$$
for the random index $(I,J) \in S$ and the random index $(U,V) \in S_{IJ}$.
1) We first prove the result under condition A). We have
$$E\log p^{(k+1)}(r^{(k+1)}) - E\log p^{(k)}(r^{(k+1)}) = \int p^{(k+1)}(r)\log\frac{p^{(k+1)}(r)}{p^{(k)}(r)}\,dr = D(p^{(k+1)}\,\|\,p^{(k)}) \ge 0,$$
which is the relative entropy between $p^{(k+1)}(\cdot)$ and $p^{(k)}(\cdot)$. It is known that $D(p^{(k+1)}\,\|\,p^{(k)}) \ge 0$, with equality if and only if $p^{(k+1)}(\cdot) = p^{(k)}(\cdot)$. Note that log-concavity of $p^{(k)}(\cdot)$ implies, for each given $R^{(k)}_{IJ}$,
$$\log p^{(k)}\big(E[r^{(k)} \mid R^{(k)}_{IJ}]\big) \ge E\big[\log p^{(k)}(r^{(k)}) \mid R^{(k)}_{IJ}\big].$$
Thus by the above two inequalities we get
$$H(p^{(k+1)}) = -E\log p^{(k+1)}(r^{(k+1)}) \le -E\log p^{(k)}(r^{(k+1)}) = -E\log p^{(k)}\big(E[r^{(k)} \mid R^{(k)}_{IJ}]\big) \le -E\,E\big[\log p^{(k)}(r^{(k)}) \mid R^{(k)}_{IJ}\big] = -E\log p^{(k)}(r^{(k)}) = H(p^{(k)}).$$

Under condition B), the result in Ebrahimi et al. [20] states that entropy and variance agree with each other, i.e., an increase/decrease in one implies the same in the other. The conclusion is then immediate from 2).
2) By the total variance formula, we have
$$\sigma^{2(k)} = Var(r^{(k)}_{UV}) = E\big[Var(r^{(k)}_{UV} \mid R^{(k)}_{IJ})\big] + Var\big(E[r^{(k)}_{UV} \mid R^{(k)}_{IJ}]\big) = E\big[Var(r^{(k)}_{UV} \mid R^{(k)}_{IJ})\big] + Var(r^{(k+1)}_{IJ}) \ge Var(r^{(k+1)}_{IJ}) = \sigma^{2(k+1)}, \quad (k = 0, 1, 2, \dots).$$
3) We only need to prove the convergence of the component $r^{(k)}_{ij}$ for any fixed $(i,j)$. In fact, from (8), for any integers $m$ and $k$ we have
$$r^{(k+m)}_{ij} = \frac{\sum_{(u,v) \in S_{ij}} r^{(k+m-1)}_{uv}\,\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2(k+m-1)})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2(k+m-1)})} =: F_{ij}(R^{(k+m-1)}), \quad (i,j) \in S,\ k, m = 0, 1, 2, \dots$$
Since $\sigma^{2(0)} < \infty$ and $\sigma^{2(k)} \ge 0$, by 2) we have $\sigma^{2(k)} \to \sigma^{2*}$ for some $0 \le \sigma^{2*} < \infty$. So if we let
$$\tilde{r}^{(k+m)}_{ij} = \frac{\sum_{(u,v) \in S_{ij}} r^{(k+m-1)}_{uv}\,\phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2*})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2*})} =: \tilde{F}_{ij}(R^{(k+m-1)}), \quad (i,j) \in S,\ k, m = 0, 1, 2, \dots,$$
then $r^{(k+m)}_{ij} - \tilde{r}^{(k+m)}_{ij} = o(1)$ as $k \to \infty$, and thus we only need to prove the convergence of $\{\tilde{r}^{(k+m)}_{ij}\}$.
Note that the weight of the central element in $\tilde{F}_{ij}(R^{(k+m-1)})$ is
$$\frac{\phi(0 \mid \sigma^{2*})}{\sum_{(u,v) \in S_{ij}} \phi(r^{(0)}_{uv} - r^{(0)}_{ij} \mid \sigma^{2*})} =: 1 - C_{ij}.$$
We have $0 < C_{ij} < 1$, and for all $m$,
$$|\tilde{r}^{(l+1)}_{ij} - \tilde{r}^{(l)}_{ij}| \le C_{ij}\,|\tilde{r}^{(l)}_{ij} - \tilde{r}^{(l-1)}_{ij}| \le \cdots \le C^{l}_{ij}\,|\tilde{r}^{(1)}_{ij} - r^{(0)}_{ij}|,$$
so that
$$|\tilde{r}^{(k+m)}_{ij} - \tilde{r}^{(k)}_{ij}| \le \sum_{l=0}^{m-1} C^{k+l}_{ij}\,|\tilde{r}^{(1)}_{ij} - r^{(0)}_{ij}| \le \frac{C^{k}_{ij}}{1 - C_{ij}}\,|\tilde{r}^{(1)}_{ij} - r^{(0)}_{ij}|;$$
thus $\{\tilde{r}^{(k+m)}_{ij}\}_{k = 1, 2, \dots}$ is a Cauchy sequence, and the convergence follows.