Identification of the interactive region by the homology of the sequence spectrum

doi:10.4236/jbise.2010.39117


J. Biomedical Science and Engineering, 2010, 3, 868-883

doi:10.4236/jbise.2010.39117 Published Online September 2010 (http://www.SciRP.org/journal/jbise/)


Masatoshi Nakahara1, Masaharu Takeda2*

1Department of Computer and Information Sciences, Sojo University, Ikeda, Kumamoto, Japan;

2Department of Materials and Biological Engineering,Tsuruoka National College of Technology, Tsuruoka, Yamagata, Japan.

Email: mtakeda@tsuruoka-nct.ac.jp

Received 4 June 2010; revised 9 July 2010; accepted 12 July 2010

ABSTRACT

The base sequence of the genome was governed by fundamental principles such as reverse-complement symmetry and multiple fractality, and the “Sequence Spectrum Method (SSM)”, an analytical method of genome structure based on the structural features of genomic DNA, faithfully visualized these principles. This paper reported that the sequence spectrum in SSM closely reflected the biological phenomena of protein and DNA, and that SSM could identify the interactive regions of protein-protein and DNA-protein interactions in a uniform manner. To investigate the effectiveness of SSM, we analyzed several published protein-protein and DNA-protein interactions, primarily in the genome of Saccharomyces cerevisiae. The proposed method was based on the homology of the sequence spectrum. Its advantage was that it used only the base sequence of the genome and did not require any other information, not even the amino-acid sequence of the protein. It was concluded that the fundamental principles of the genome governed not only the static base sequence but also the dynamic function of protein and DNA.

Keywords: Spectrum of Genome Base Sequence; Homology of Sequence Spectrum; Interactive Region; Reverse-Complement Symmetry; Multiple Fractality; Analytical Method of Genome

1. INTRODUCTION

As described previously [1,2], it was very important to investigate the structure of the entire genome because the four bases should be arranged in a sophisticated fashion in the genome, and the base sequences might essentially reflect the conformations of protein, RNA and DNA. DNA sequences were deeply affected by the adjoining sequences. In other words, the non-coding sequences might play important roles in the expression of each gene (the coding sequences) in the genome. That is, not only the coding regions but also the non-coding regions might be necessary to transmit and transform biological information precisely, rapidly, and stably. Therefore, if we could find meaningful structure in the genome, we might also obtain important information about the functions of protein, RNA and DNA from that structure.

Previously, we showed by analyzing the appearance frequency of the bases that the four bases in genomic DNA were organized according to generation-rules in all organisms, and we proposed three generation-rules for the base sequences in a single strand of DNA: 1) reverse-complement symmetry of sequences of 1 to 9 successive bases, 2) multiple fractality of the distribution of each base depending on the distance, and 3) bias of the four bases, A, T, G and C. These rules were observed universally, regardless of species [1]. Further, we defined the sequence spectrum by the appearance frequency of the base sequences in the genome, and we developed the “Sequence Spectrum Method (SSM)” to visualize and analyze the generation-rules in the entire genome explained above. As one important result, we revealed by using SSM that there was remarkable homology of the sequence spectrum between proteins and tRNAs [2]. This fact suggested that the sequence spectrum could be closely associated with the function of a protein, and that the homology of the sequence spectrum could be related to the mutual interactive region. Identification of the mutual interactive regions of protein, RNA and DNA was definitely important for understanding their functions, and usually the homology of the base sequence or the amino-acid sequence was used for it.

To investigate the effectiveness of SSM, we showed in this paper that SSM could identify the interactive regions of protein-protein and protein-DNA interactions by the homology of the sequence spectra. The advantages of the proposed method were as follows.

1) It used only base sequence of genome and did not

M. Nakahara et al. / J. Biomedical Science and Engineering 3 (2010) 868-883

require any other information, even information about the amino-acid sequence of the protein. Because SSM faithfully reflected the biological information, the conservation of the base sequences of genomic DNA was also reflected in the translated amino-acid sequences of the proteins [1,2].

2) It could identify the interactive region of both pro-

tein-protein and protein-DNA in completely the same

manner.

3) It could be executed fully on a personal computer

and did not require a special high performance computer.

Moreover the identification was done in a few seconds.

2. MATERIALS AND METHODS

2.1. Sequence Spectrum Method (SSM)

SSM was carried out in the same way as the published procedure [2]. The outline of the method was as follows. The base sequence of interest was sectioned into short key sequences from the top (5’-end). For key sequences of nine successive bases (d = 9) there were 262,144 (= 4^9) possible sequences [2]. The appearance frequency of each key sequence was counted in the entire genome and plotted at the position of the first base of the key sequence, as described in the next paragraph. This was carried out over the entire base sequence of interest with a one-base shift (p = 1). The next step was to average the appearance frequencies so that a recognizable pattern of appearance frequency was obtained for the base sequence. This pattern of averaged appearance frequencies was called the “sequence spectrum”. Finally, the homology factor between two sequence spectra was calculated to determine the degree of homology. The exact procedure is explained mathematically below.

Let S be an entire set of base sequences, and B = [bi] be a partial set of interest in S. A base element was denoted by bi (i = 1..M), and M was the base sequence size of B. Each base element bi was A (adenine), T (thymine), G (guanine) or C (cytosine). The key sequence ki and the appearance frequency fi were defined for bi as follows.

Key sequence ki : base sequence comprised of sequen-

tial base elements bi~bi+d-1 (d : base size of the key se-

quence).

Appearance frequency fi : appearance count of ki in S.

The key sequence ki was compared with the base se-

quence of the entire set S, and the appearance frequency

fi was increased by one every time the key sequence ki

matches the partial base sequence of the entire set S.

This procedure was iterated for all key sequences ki to

obtain fi (i = 1..M). In practice all fi were counted and

tabulated in advance by scanning all base sequence in S.

Consequently, the appearance frequency vector F = [fi] (i

= 1..M) was determined (actually, the appearance fre-

quencies for the last (d-1) base elements of B could not

be calculated; however, this was neglected because M >>

d-1).
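The tabulate-in-advance approach described above can be sketched in a few lines. This is only an illustrative sketch: the function names are hypothetical, the genome is assumed to be a single plain string, and only the given strand is scanned.

```python
from collections import Counter

def count_key_sequences(genome: str, d: int = 9) -> Counter:
    """Tabulate how often every d-base key sequence occurs in S (one-base shift)."""
    return Counter(genome[i:i + d] for i in range(len(genome) - d + 1))

def appearance_frequencies(region: str, table: Counter, d: int = 9) -> list[int]:
    """Look up f_i for each key sequence k_i of the region B.

    The last d-1 bases of B start no complete key sequence, so they are skipped.
    """
    return [table[region[i:i + d]] for i in range(len(region) - d + 1)]

# Toy "genome" (an assumption for illustration, not real data).
genome = "ATCGAATAAAGAACCGTTCGGTAAGTCGAATAAAGAATCTGGCATTT"
table = count_key_sequences(genome)
freqs = appearance_frequencies(genome[4:20], table)
```

Building the table once makes each later lookup a constant-time dictionary access, which matches the remark that all fi were counted in advance.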

Next, the appearance frequency fi was averaged as

follows:

$$fs_i = \frac{1}{2m+1} \sum_{j=i-m}^{i+m} f_j$$

where the parameter m was average width. This aver-

aged appearance frequency Fs = [fsi] (i = 1..M) was

called the “sequence spectrum”.
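As a rough illustration of the averaging step, the sketch below assumes a centred moving average over 2m+1 positions. The handling of the two ends of the sequence (the window is simply truncated there) is an assumption of this sketch, not something stated in the text, and the function name is hypothetical.

```python
def sequence_spectrum(freqs, m=10):
    """Centred moving average of the appearance frequencies.

    Assumption: near the ends the window is truncated to the available
    positions, so fewer than 2m+1 values are averaged there.
    """
    n = len(freqs)
    spectrum = []
    for i in range(n):
        lo = max(0, i - m)
        hi = min(n, i + m + 1)
        window = freqs[lo:hi]
        spectrum.append(sum(window) / len(window))
    return spectrum

smoothed = sequence_spectrum([1, 2, 3, 4, 5], m=1)
```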

The next step was to calculate the homology factor to

determine the degree of homology. The homology factor

determines the homologous region of a target base se-

quence with respect to a reference base sequence. In order

to derive the homology factor, the mutual correlation

function MF within the window width of homology was

calculated as































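$$MF_{ij}(Fsr, Fst) = \frac{\sum_{k=0}^{w-1} (fsr_{i+k} - \overline{fsr}_i)(fst_{j+k} - \overline{fst}_j)}{|Fsr_i| * |Fst_j|}$$

$$|Fsr_i| = \sqrt{\sum_{k=0}^{w-1} (fsr_{i+k} - \overline{fsr}_i)(fsr_{i+k} - \overline{fsr}_i)}, \qquad |Fst_j| = \sqrt{\sum_{k=0}^{w-1} (fst_{j+k} - \overline{fst}_j)(fst_{j+k} - \overline{fst}_j)}$$

$$\overline{fsr}_i = \frac{1}{w} \sum_{k=0}^{w-1} fsr_{i+k}, \qquad \overline{fst}_j = \frac{1}{w} \sum_{k=0}^{w-1} fst_{j+k}$$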

where

Fsr— sequence spectrum of the reference base se-

quence

Fst— sequence spectrum of the target base sequence

w— window width of homology
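A windowed correlation of this kind can be sketched as follows. The sketch assumes a mean-centred (Pearson-type) correlation over w consecutive elements, which is consistent with the stated range of -1 to 1. The function name, the zero-based indexing and the treatment of a flat window are assumptions made for illustration.

```python
import math

def mutual_correlation(fsr, fst, i, j, w):
    """Correlation of fsr[i:i+w] and fst[j:j+w], in the range -1..1.

    Assumption: both windows are mean-centred before the normalised
    inner product is taken. A completely flat window returns 0.0.
    """
    if i < 0 or j < 0 or i + w > len(fsr) or j + w > len(fst):
        raise ValueError("window exceeds the spectrum")
    a = fsr[i:i + w]
    b = fst[j:j + w]
    mean_a = sum(a) / w
    mean_b = sum(b) / w
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / norm

same_shape = mutual_correlation([1, 2, 3, 4], [2, 4, 6, 8], 0, 0, 4)
mirrored = mutual_correlation([1, 2, 3, 4], [8, 6, 4, 2], 0, 0, 4)
```

Because the windows are centred and normalised, two spectra with the same shape but different absolute frequencies still score close to 1.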

The mutual correlation function MF ranges from -1 to

1, and then the homology factor HF was defined as

$$HF_{ij}(Fsr, Fst) = (\,\ldots\,1\,\ldots\,) * 100 \ [\%]$$

The higher the homology factor, the more similar the

sequence spectra were. The similar regions of the target

base sequence with respect to the reference base se-

quence were obtained by calculating the homology fac-

tors HFij for all i (i = 0..Mr-w, Mr: size of reference se-

quence) and j (j = 0..Mt-w, Mt: size of target sequence).

When the base sequence was very large, elements of the sequence spectrum were skipped by the size factor p to reduce the size as follows.

$$fs_i \leftarrow fs_{(i-1)*p+1}$$

For instance, when p = 2

$$fs_1, fs_2, fs_3, \ldots \leftarrow fs_1, fs_3, fs_5, \ldots$$

This operation reduced the size to 1/p.

The base sequences of the genomes were obtained from the databases listed below.

Saccharomyces Genome Database. (2010) (http://www.yeastgenome.org/).

NCBI genome data base. (2010) (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome).

2.2. Appearance Frequencies of Bases

For nine successive bases, the appearance frequency was counted for the entire genome by matching from the start of the base sequence in a genome with one base shift (p = 1) as follows.

Ex. Nine successive bases: AATAAAGAA (one base shift)

Base Sequence:

5’-ATCGAATAAAGAACCGTTCGGTAAGTCGAATAAAGAATCTGGCATTT-3’

Count of AATAAAGAA: 2 (occurrences 1 and 2 begin at bases 5 and 29)

For a genome composed of multiple chromosomes, such as that of S. cerevisiae, we calculated the sum of the base frequencies over the 16 chromosomes (in numeric order) plus mtDNA [1].

2.3. The Parameters “d”-, “m”-, “p”-, and “w”-Values of SSM Analysis for the Interaction

Controllable parameters in the sequence spectrum were the base size “d” of the key sequence, the average width “m”, the skip base number (the size factor) “p” and the window width “w” of homology. The parameter “d” determined the highest resolution for extracting the structural features of the base sequence. Therefore this parameter should be chosen to be as large as possible to extract the exact features. Large “m” values were usually used to obtain the overall features of the structure, and smaller “m” values were applied to investigate the structure in detail. The value of “m” normally ranged from 1/10 to 1/100 of the base sequence size [2]. This parameter was adjusted to the base sequence size, especially when the homology factor between a small reference and a large target was calculated [2]. The window width of homology, “w”, determined the width of the similar region to be identified. In this paper the values of “d”, “m”, “p” and “w” were 9, 10, 1 and 200, respectively, to identify the interactive regions of protein and DNA.

In figures of the sequence spectrum the horizontal pa-

rameter was the base size of sequence, M of each gene

or genomic DNA, and the vertical parameter was the se-

quence spectrum. These parameters were appropriately

scaled to show the similar region clearly.

2.4. Procedure of Identification of the Interactive

Region by SSM

To simplify the procedure, it was assumed that the inter-

active region of one protein was given (shown in pur-

ple-blue), and SSM identified the interactive region of

the other protein (shown in red). The procedure to iden-

tify the interactive regions of two proteins by SSM was

as follows. In the following procedure one of two pro-

teins was replaced by DNA when the protein-DNA in-

teraction was investigated.

[Step 1] One protein with the given interactive region

(shown in purple-blue) was designated as a reference

protein, and the other protein with the interactive region

(shown in red) which SSM identified was designated as

a target protein.

[Step 2] The sequence spectra of both the reference

and target proteins were calculated.

[Step 3] The similar regions between the sequence spec-

tra of the reference and target proteins were calculated.

[Step 4] The pair of similar regions (red/purple-blue)

with the highest homology factor (HF) was selected as a

candidate of interactive regions.

[Step 5] The base sequence of the reference protein

was converted to be the reverse complementary and the

steps [2-4] were repeated because of the reverse-com-

plement rule in genome.

[Step 6] In two candidates obtained in steps [4] and

[5], the similar region of the target protein with higher

HF was called first identified region, and the other was

called second identified region.
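Steps 5 and 6 above can be sketched as follows. Steps 2 to 4 are represented by a caller-supplied function, here the hypothetical `best_region(reference, target)`, which is assumed to return a `(score, start_in_target)` pair. The toy scorer at the end is only a stand-in so that the example runs; it is not the homology factor.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq: str) -> str:
    """Reverse-complement base sequence of a single strand."""
    return seq.translate(COMPLEMENT)[::-1]

def rank_candidates(reference: str, target: str, best_region):
    """Run the search for the reference and for its reverse complement.

    best_region is a hypothetical callable standing in for steps 2-4.
    Returns (first, second): the candidate with the higher score first,
    each as (score, start_in_target, orientation).
    """
    forward = best_region(reference, target) + ("forward",)
    reverse = best_region(reverse_complement(reference), target) + ("reverse-complement",)
    first, second = sorted([forward, reverse], key=lambda c: c[0], reverse=True)
    return first, second

def toy_best_region(reference, target):
    # Stand-in scorer (assumption): 100 for an exact match, otherwise 0.
    pos = target.find(reference)
    return (100.0 if pos >= 0 else 0.0, pos)

first, second = rank_candidates("AAGC", "TTGCTTAA", toy_best_region)
```

Passing the scorer as a parameter keeps the orientation logic separate from the way the similarity itself is computed.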

3. RESULTS AND DISCUSSION

This section demonstrates that the homology of the se-

quence spectrum was closely associated with the mutual

interaction of proteins or DNA. The identified interac-

tive regions of the proteins were all the first identified

regions in the examples below. We showed some of the

interactive regions analyzed by SSM in this section.

3.1. Mutual Interaction of Protein-Protein

1) MAS1 and MAS2


Figure 1 showed the interactive regions of MAS1 [Mas1p (β-MPP), Reference [3]] and MAS2 [Mas2p (α-MPP), Reference [4]]. These proteins formed a complex that cleaved the mitochondrial targeting signal of precursors. In Figure 1(a) the active region (in purple-blue) around the key amino acid E73 of MAS1 (Mas1p) was the reference, and the whole coding region of MAS2 (Mas2p) was the target (Figure 1(b)). Previous reports proposed a model in which the glycine-rich region of MAS2 (Mas2p, in red) cooperated with the active region of MAS1 (Mas1p, in purple-blue). Our results strongly supported this model because the region of MAS2 most similar to the active region of MAS1 (in red; HF = 90.5%) was completely identical to the reported glycine-rich region [5,6]. Moreover, the positions of the key amino acids in both proteins (E73 in Mas1p and K296 in Mas2p) were also identical.

Figure 1. Sequence spectra of MAS1 and MAS2 (d = 9, m = 10, p = 1). (a) Coding region of MAS1 (Mas1p, M = 1,386). The active region of MAS1 (Mas1p, reference: M = 200) is shown in purple-blue. This region (corresponding to E46 – E106) carries the characteristic metal-binding motif associated with the catalytic activity [5,6]. (b) Coding region of MAS2 (Mas2p) containing the 5’- and 3’-non-coding regions (target: M = 1,446). The region most similar to the reference is shown in red (HF = 90.5%). The most similar region is glycine-rich and closely related to the catalytic function (I261 – G327 of Mas2p). E73 (shown in red letters) of Mas1p presumably interacts with K296 (shown in red letters) of Mas2p (position of the arrowhead). The scales of the axes for the sequence spectra of the similar regions are the same. The amino acid sequences of Mas1p and Mas2p neighboring the interactive regions are shown in the respective figures.


2) PHO4 and PHO80

Figure 2 showed the sequence spectra of PHO4 (a, Pho4p, reference: the interactive region around the key amino acid P174, in purple-blue) and PHO80 (b, Pho80p, target: the whole coding region). PHO4 (Pho4p) was a transcription factor, and PHO80 (Pho80p) inhibited the transcriptional function of PHO4 (Pho4p). Ogawa & Oshima [7] and Okada & Toh-e [8] reported an interaction between P174 in Pho4p and M42 in Pho80p. The red region in (b), in which M42 of Pho80p was located (Figure 2(b), arrowhead), was the region most similar to the reference region of Pho4p, in which P174 was located (Figure 2(a), arrowhead) (HF = 89.1%). The interactive regions between Pho4p and Pho80p were also discussed later in the Pho2p results (example 6).

3) RPB2 and RPB12

Figure 3 showed the sequence spectra of RPB2 and RPB12. The RPB protein family formed DNA-directed RNA polymerase II [9]. RPB2 (encoding Rpb2p) and RPB12 (encoding Rpb12p) were members of the family, and RPB12 (Rpb12p) combined with RPB2 (Rpb2p). Rpb12p was a very small protein with 70 amino acids whereas

Figure 2. Sequence spectra of PHO4 and PHO80 (d = 9, m = 10, p = 1). (a) Coding region of PHO4 (Pho4p, M = 936; the active region is shown in purple-blue). (b) Coding region of PHO80 (Pho80p, target: M = 880). The region most similar to the reference is shown in red (HF = 89.1%). It has been shown that P174 (shown in red letters) of Pho4p interacts with M42 (shown in red letters) of Pho80p [7,8]. The arrowhead in each spectrum indicates the position of the amino acid P174 of Pho4p and M42 of Pho80p, respectively. The scales of the axes in (a) and (b) are the same. The amino acid sequences of Pho4p and Pho80p neighboring the interactive regions are shown in the respective figures. Red letters indicate amino acids reported to be functional.


Figure 3. Sequence spectra of RPB12 and RPB2 (d = 9, m = 10, p = 1). (a) Coding region of RPB12 (Rpb12p, reference: M = 210). (b) Coding region of the RPB2 gene containing the 5’- and 3’-non-coding regions (Rpb2p, target: M = 3,672). The region most similar to the reference is shown in red (HF = 87.1%). The scales of the axes in (a) and (b) are the same. The amino acid sequences of Rpb12p and Rpb2p neighboring the interactive regions are shown in the respective figures.

Rpb2p was a large one with 1224 amino acids. Therefore in this case the whole coding region of RPB12 (Rpb12p) was suitable for the reference (a) and the coding region of RPB2 (Rpb2p) for the target (b). The result was shown in Figures 3(a-b). The red region was the region of RPB2 (Rpb2p) most similar to RPB12 (Rpb12p, HF = 87.1%). The literature [9] revealed that the interaction between RPB2 (in red) and RPB12 (in purple-blue) occurred at two regions of RPB2, and Figure 3 showed one of these two interaction regions. This result was unlikely to be a coincidence because the target was about 18 times larger than the reference. In addition, the other interacting region was very close to the second identified region in the coding region (not shown), although it was not completely identical (a previous report [9] specified the region around the 900th amino acid of Rpb2p, whereas our results specified the region around the 940th amino acid).

4) GCR1 and GCR2

The interactive region of GCR1 [Gcr1p,10] and GCR2 [Gcr2p,11] was very interesting. In Figure 4 the red region of GCR1 (Gcr1p, leucine zipper) was the first identified region (HF = 92.9%) with respect to the reference region (in purple-blue) of GCR2 (Gcr2p). The sequence spectra suggested that the leucine-zipper region of GCR1 (Gcr1p) might interact with the C-terminus of GCR2 (Gcr2p, purple-blue region), although considerable controversy still existed concerning the interaction between Gcr1p and Gcr2p [12,13]. This case was quite interesting for the following reasons: a) the identified region was derived from the reverse-complement reference region of GCR2, that is, the reverse-complement base sequence of GCR2 was also useful for the analysis of the interactive region by SSM (we designated this the reverse-complement rule), and b) part of the reference region extended beyond the coding region into the downstream region. This meant that in this case the reference region used by the proposed method covered two different objects, the protein region of GCR2 (Gcr2p) and the DNA region of GCR2. That is, the sequence spectrum of a given gene might reflect the information of both protein and DNA, and SSM could be applied to analyze both of them.

Figure 4. Sequence spectra of GCR1 and GCR2 (d = 9, m = 10, p = 1). (a) The reverse-complement sequence of the whole region of GCR2 (Gcr2p) containing the 5’- and 3’-non-coding regions was used as the reference (M = 3,157; the active region is shown in purple-blue). (b) The functional region (K266 – R300, leucine zipper) of GCR1 (Gcr1p, refs. 10-13), which is the region most similar to the reference (HF = 92.9%). This region (leucine zipper, refs. 12, 13) of Gcr1p might interact with the reference region of Gcr2p. The scales of the axes in (a) and (b) are the same. The black and red arrowheads indicate the start codon (M1) and the stop codon (TGA) of GCR2, respectively. The bold black arrowhead of GCR1 indicates the position of E262 (red letter in the amino acid sequence of Gcr1p). The amino acid sequences of Gcr2p and Gcr1p neighboring the interactive regions are shown in the respective figures. Red letters indicate amino acids reported to be functional.


5) SLA1 and SLA2

This example showed that SSM could be applied to large proteins. The sizes of Sla1p (encoded by SLA1) and Sla2p (encoded by SLA2) were 1244 and 968 amino acids, respectively, and Figure 5 showed the interactive regions of these proteins. In Figure 5 the red region of SLA1 (Sla1p) was the first identified region (HF = 94.3%) with respect to the reference region (in purple-blue) of SLA2 (Sla2p), which was converted to its reverse complement. The literature [14] showed that this result was valid.

The three examples 6) ~ 8) below were results of predicting the interactive regions by SSM. In these examples one of the interactive regions was known and the other was unknown, and SSM predicted the unknown interactive region.

Figure 5. Sequence spectra of SLA2 and SLA1 (d = 9, m = 10, p = 1). In the SLA2 (Sla2p)/SLA1 (Sla1p) interaction, the reverse complement of the base sequence gave higher homology than the normal base sequence. (a) The reverse-complement sequence of the coding region of SLA2 (Sla2p) was used as the reference (M = 2,904; the active region is shown in purple-blue). (b) The sequence spectrum of SLA1 (M = 3,732; Sla1p, ref. 14). The region most similar to the reference has HF = 94.3%. The amino acid sequences of Sla2p and Sla1p neighboring the interactive regions are shown in the respective figures.


6) PHO2, PHO4 and PHO80 [15-17]

The identification of the interactive regions might be applied to the characterization of the molecular mechanisms of metabolism. For instance, the example focusing on the interactive regions of PHO2 (Pho2p) - PHO80 (Pho80p) - PHO4 (Pho4p) was very suggestive. PHO2 was a gene encoding a transcription factor, Pho2p, which regulated several genes such as PHO5 together with another transcription factor, Pho4p [15-17]. It was well known that Pho2p interacted cooperatively with Pho4p, and the literature [15] reported that the amino acids around S230 of Pho2p played an important role in the interaction with Pho4p. Accordingly, SSM predicted the target interactive region of Pho4p using the region around S230 of Pho2p as the reference. The predicted region of Pho4p was located very close to, or partially overlapped with, the region interacting with Pho80p, and the positions of the key amino acids, S230 of Pho2p and P174 of Pho4p, were identical (Figure 6).

As described in example 2) PHO4 and PHO80 above, P174 of Pho4p and M42 of Pho80p functioned in the interaction of these proteins (Figure 2). Namely, the positions of the three key amino acids, P174 of Pho4p, M42 of Pho80p, and S230 of Pho2p, were identical

Figure 6. Sequence spectra of PHO2 and PHO4 genes (d = 9, m = 10, p = 1). (a) Coding region of PHO2 (Pho2p, reference: M = 1677). The region most similar to the reference is shown in purple-blue. (b) The reverse-complement sequence of the coding region of PHO4 (Pho4p, M = 936). The active region is shown in red (HF = 93.7%). It has been shown that P174 (shown in red letters) of Pho4p interacts with S230 (shown in red letters) of Pho2p [15-17]. The arrowhead in each spectrum indicates the position of the amino acid S230 of Pho2p and P174 of Pho4p, respectively. The scales of the axes in (a) and (b) are the same. The amino acid sequences of Pho2p and Pho4p neighboring the interactive regions are shown in the respective figures. Red letters indicate amino acids reported to be functional.


in the interactive regions identified by SSM. This fact suggested that Pho80p might interfere with the cooperation between Pho4p and Pho2p, and this result was very reasonable [15-17], although more experimental confirmation would be necessary.

7) PHO2 and SWI5 [18]

SWI5 was a gene encoding a transcription factor, Swi5p, that activated the transcription of genes expressed at the M/G1 phase boundary and in G1 phase, such as PHO2, which encoded Pho2p, a regulatory protein involved cooperatively in phosphate metabolism. The base positions of the interactive region were known for SWI5 but unknown for PHO2 [18]. We predicted the unknown interactive region of Pho2p by SSM (Figure 7).

Figure 7. Sequence spectra of SWI5 and PHO2 genes (d = 9, m = 10, p = 1). (a) Coding region of SWI5 (Swi5p, M = 2127; the active region is shown in purple-blue). (b) Coding region of PHO2 (Pho2p, target: M = 1677). The region most similar to the reference is shown in red (HF = 95.0%). It has been shown that the amino acid sequence (shown in red letters) of Swi5p interacts with the amino acid sequence (shown in red letters) of Pho2p [18]. The arrowhead in each spectrum indicates the position of the functional amino acid N471 of Swi5p and N3305 of Pho2p, respectively. The scales of the axes in (a) and (b) are the same. The amino acid sequences of Swi5p and Pho2p neighboring the interactive regions are shown in the respective figures. Red letters indicate amino acid sequences reported to be functional.


8) ATP3 and ATP15 [19-21]

ATP3 and ATP15 were genes encoding the γ and ε subunits of the F1F0-ATPase complex, respectively, which participated in the rotation of the complex [19-21]. In this example the interactive regions of both ATP3 and ATP15 were unknown. However, we could choose the entire coding region of ATP15 as the reference because the gene was small (186 nt). Therefore, we used w = 186 in SSM in this case. The other values, m, d, and p, were the same as before: 10, 9, and 1, respectively. In addition, the reverse-complement base sequence of ATP15 was used because HF was higher in this analysis. We predicted the unknown interactive region of ATP3 by SSM (Figure 8).

X-ray crystallography of the γ - ε complex of ATP synthase in E. coli and bovine suggested that the 200th amino acid and the adjacent amino acids of the γ-subunit (Atp3p), located at the foot position, could interact with the ε-subunit (Atp15p) [19,20]. The prediction by SSM might be in accord with these X-ray crystallography results. An experiment to confirm the interactive regions of Atp15p and Atp3p analyzed by SSM is in progress.

SSM was an analytical method to identify the base numbers (positions from the 5’-ATG start codon) of the interactive regions (sites) of the reference and target proteins. However, there were not many examples in the yeast genome databases such as SGD in which the interactive regions of both the reference and target proteins were identified with base numbers. Therefore we could not select many examples for the SSM analyses, and we showed all the examples we have in this manuscript.

Figure 8. Sequence spectra of ATP15 and ATP3 genes (d = 9, m = 10, p = 1). In the ATP15 (Atp15p)/ATP3 (Atp3p) interaction, the reverse complement of the base sequence gave higher homology than the normal base sequence. (a) Coding region of ATP15 (Atp15p, M = 186; the active region is shown in purple-blue). (b) Coding region of ATP3 (Atp3p, target: M = 933). The region most similar to the reference is shown in red (HF = 88.6%). It has been shown that the amino acid sequence (shown in red letters) of Atp15p interacts with the amino acid sequence (shown in the red region) of Atp3p [19-21]. The scales of the axes in (a) and (b) are the same. The amino acid sequences of Atp15p and Atp3p neighboring the interactive regions are shown in the respective figures. The amino acid residue N200 of Atp3p (arrowhead, red letter) might interact with Atp15p according to X-ray crystallography [19,20].


The results in this paper could be sufficient to confirm the validity of SSM because the probability of identifying the interactive regions by coincidence was very small. For instance, in the case of MAS1 (Mas1p)/MAS2 (Mas2p), MAS2 was composed of about 1,400 nt, which meant that the probability of identification by coincidence was lower than 1/7 (= 200/1400) under the condition of the homology window width w = 200 nt. The probabilities for the other examples in this manuscript were as follows.

PHO4/PHO80, lower than 2/9 (= 200/900);

RPB12/RPB2, 1/20 (= 200/4000);

GCR2/GCR1, 1/15 (= 200/3000);

SLA2/SLA1, 1/20 (= 200/4000);

PHO2/PHO4, 1/15 (= 200/3000);

PHO2/SWI5, 1/10 (= 200/2000);

ATP15/ATP3, 1/5 (= 200/1000);

GAL1/GAL4, 1/15 (= 200/3000);

GAL4/GAL10, 2/7 (= 200/700);

GAL4/GAL2, 1/7 (= 200/1000);

GAL4/GAL7, 1/4 (= 200/800);
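The ratios above are simply the window width divided by the approximate target length. A minimal sketch of that arithmetic for three of the listed pairs follows. The function name is hypothetical, and the ratio is only the rough upper bound used in the text, treating every window position as equally likely.

```python
from fractions import Fraction

def chance_upper_bound(window: int, target_length: int) -> Fraction:
    """Rough chance-level bound: window width over approximate target length."""
    return Fraction(window, target_length)

# Approximate target lengths (nt) as quoted in the list above.
examples = {"MAS1/MAS2": 1400, "PHO4/PHO80": 900, "RPB12/RPB2": 4000}
bounds = {pair: chance_upper_bound(200, length) for pair, length in examples.items()}
```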

Therefore the results in this paper were statistically meaningful in confirming the validity of the proposed method. In addition, the positions of the key amino acids were identical in the identified interactive regions in the examples of the MAS and PHO proteins. This fact definitely reinforced the proposed method.

Finally, we predicted the interactive regions of many proteins chosen randomly from the 16 different chromosomes of S. cerevisiae [22], and summarized the prediction results in Table 1 to demonstrate the effectiveness of SSM. For the examples in Table 1 we used the same analytical conditions, m = 10, d = 9, p = 1 and w = 200, and predicted the interactive regions of both the reference and target proteins. However, the method proposed in this paper was based on the condition that the interactive region of the reference protein was known and that of the target protein was unknown. Therefore some of these prediction results might be revised in our future work, because the identification ability of SSM was not strong at present when the interactive regions of both the reference and target proteins were unknown. We are now improving SSM so that it can be applied to these cases.

Table 1. Possible interactive regions. For each pair, the upper row indicates the 1st and the lower row the 2nd interactive region. *1) Conditions: m = 10, d = 9, p = 1, w = 200; *2) Reference gene; *3) Chromosome on which the reference gene is located; *4) Number of amino acid residues of the reference protein; *5) Interactive region of the reference protein predicted by SSM; *6) Target gene; *7) Chromosome on which the target gene is located; *8) Number of amino acid residues of the target protein; *9) Interactive region of the target protein; *10) Homology factor between the target and the reference protein; *11) ○: either protein was used as the reverse-complement base sequence.

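Reference*2 | Chromosome*3 | Amino acids*4 | Interactive region*5 | Target*6 | Chromosome*7 | Amino acids*8 | Interactive region*9 | HF (%)*10 | Complement*11
GDH3        | 1            | 457           | 272-338              | GDH1     | 15           | 454           | 116-182              | 93.7      |
            |              |               | 52-118               |          |              |               | 52-118               | 92.3      |
CDC24       | 1            | 854           | 183-249              | ACT1     | 6            | 478           | 83-149               | 94.7      |
            |              |               | 234-300              |          |              |               | 94-160               | 94.2      | ○
PHO11       | 1            | 467           | 374-440              | PHO5     | 2            | 467           | 374-440              | 94.3      |
            |              |               | 88-154               |          |              |               | 144-210              | 93        | ○
ATP2        | 10           | 511           | 170-236              | ATP3     | 2            | 311           | 57-123               | 93.1      | ○
            |              |               | 300-366              |          |              |               | 20-86                | 92.6      |
SUP45       | 2            | 437           | 311-377              | RPS12    | 15           | 143           | 4-70                 | 92.1      |
            |              |               | 188-254              |          |              |               | (-7)-59              | 91.3      |
YDJ1        | 14           | 409           | 292-358              | PRD1     | 3            | 712           | 508-574              | 94        |
            |              |               | 67-133               |          |              |               | 552-618              | 93.6      | ○
GCD2        | 7            | 651           | 550-616              | GCD7     | 12           | 381           | 274-341              | 94.9      | ○
            |              |               | 221-287              |          |              |               | (-21)-45             | 91.4      |
PHO87       | 3            | 923           | 433-499              | SPL2     | 8            | 148           | 24-90                | 92.1      |
            |              |               | 564-630              |          |              |               | 60-126               | 92        | ○
HXT15       | 4            | 567           | 323-389              | GAL2     | 12           | 574           | 52-118               | 96.9      | ○
            |              |               | 483-549              |          |              |               | 399-465              | 96.8      | ○
NAB2        | 7            | 525           | 69-135               | SNF3     | 4            | 884           | 480-544              | 95.4      |
            |              |               | 243-309              |          |              |               | 175-241              | 95.2      | ○
ECM10       | 5            | 644           | 140-206              | SSA1     | 1            | 642           | 237-303              | 94        |
            |              |               | 70-136               |          |              |               | 395-461              | 93.4      | ○
HEM1        | 4            | 548           | 16-82                | LCB2     | 4            | 561           | 296-362              | 95        | ○
            |              |               | 269-335              |          |              |               | 447-513              | 94.1      | ○
POL4        | 3            | 582           | 169-235              | CCA1     | 5            | 546           | 218-284              | 97        | ○
            |              |               | 67-133               |          |              |               | 103-169              | 93.9      |
GUT1        | 8            | 709           | 99-165               | XKS1     | 7            | 600           | 24-90                | 94.5      |
            |              |               | 649-(715)            |          |              |               | (-12)-54             | 93.7      |
YAP1        | 13           | 650           | 335-401              | CAD1     | 4            | 409           | 96-162               | 93.9      |
            |              |               | 531-597              |          |              |               | 217-283              | 93.3      |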


3.2. Mutual Interaction of Protein-DNA

This section clarified that the homology of sequence spectra was also related to the mutual interaction between protein and DNA. The interactions of the transcription factor GAL4 [23] with the promoters of the GAL genes (UASGal signal; GAL1, GAL10, GAL2 and GAL7) [24-26] were taken as an example. Figure 9 showed the sequence spectra of the upstream region of GAL1 as the reference (a) and the reverse-complement base sequence of the coding region of GAL4 as the target (b). We employed the upstream region of GAL1 to demonstrate the effectiveness of the method, although its size of 668 bases was a little large for a reference region. In Figure 9 the red region was the first identified region of GAL4. Surprisingly, this red region was completely identical to the DNA binding region of GAL4 with the zinc finger motif, and the purple-blue region was the promoter region of GAL1. This meant that in this case the proposed method perfectly identified both the interactive reference region (in purple-blue) and the target region (in red) at the same time, even though the two objects differed in kind: a protein region for GAL4 and a DNA region for GAL1.

Thus, the same interactive analysis might be applied to the other GAL genes, GAL10, GAL2 and GAL7, whose promoter regions also interacted with the N-terminal DNA binding domain (zinc-finger domain) of GAL4 (Gal4p).
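
The reverse-complement base sequence used for the GAL4 coding region can be produced as follows. This is a generic sketch of the standard operation, not code from the authors. The function name `reverse_complement` is illustrative, and only the unambiguous bases A, C, G and T are handled.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then read the result backwards.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which is a quick sanity check before feeding a strand into any downstream comparison.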

Figure 10 showed all the promoter regions identified by SSM with the DNA binding region of Gal4p (the reverse-complement base sequence) in Figure 9 as the reference region (in purple-blue). In this figure the reference region of GAL4 was fixed to arrange the layout of the identified regions for the promoter. It was clear from this figure that the promoter sites in the red regions overlapped with each other. We obtained similar results for the PHO genes (data not shown).

Figure 9. Sequence spectra of GAL1 and GAL4 genes (d = 9, m = 10, p = 1). (a) Upstream region of GAL1 (668 nt), used as the reference (in purple-blue). The arrowheads indicate several promoter sequences. (b) DNA binding region of GAL4. The reverse-complement sequence of GAL4 (Gal4p, M = 2,643) was used for comparison with the GAL1 gene. The first 107 amino acids at the N-terminus of Gal4p, which are involved in DNA binding (shown in red, ref. 23), were used as the target. The bold arrowhead on Gal4p indicates the position of L64.

[Enlargement of the spectrum of the interactive region of the GAL promoter regions with the Gal4p DNA binding region.]

Figure 10. Sequence spectra of other GAL genes (d = 9, m = 10, p = 1). (a) DNA binding region of GAL4. The reverse-complement sequence of GAL4 (Gal4p, M = 200) was used as the reference (shown in purple-blue), and the upstream regions of the other GAL genes, GAL10, GAL2 and GAL7, were used as the targets to search for their promoter regions (the arrowheads indicate several promoter sequences). (b) Upstream region of GAL10 (target: M = 668; HF = 89.8%). (c) Upstream region of GAL2 (target: M = 964; HF = 85.8%). (d) Upstream region of GAL7 (target: M = 728; HF = 84.9%). The bracket in each GAL gene indicates the promoter regions (upstream activator sequences, UASGal) that bind the zinc finger motif of Gal4p [23-26]. The UASGal signals (arrowheads) of each GAL gene were concentrated in similar regions, shown in red. The red regions in (b), (c) and (d) were the most similar regions. The base numbers on the abscissa were matched in each panel to either the coding or the upstream region. The bold arrowhead on Gal4p indicates the position of L64.

3.3. Crucial Problems and Discussions

Our results raised several crucial problems, listed below, which were directly related to fundamental principles of life. However, we had to admit that we did not have complete answers to these problems at the moment. Our discussion below therefore included some uncertain hypotheses.

[Question 1] Why was the sequence spectrum associated with the functions of protein and DNA?

Originally the sequence spectrum was devised to examine the generation-rules in the genome, and it succeeded in visualizing the rules of reverse-complement symmetry, multiple fractality and so on. Therefore, the fact that the sequence spectrum was associated with the functions of protein and DNA suggested that the generation-rules could govern not only the static base sequence in the genome, as the blueprint of life, but also the dynamic phenomena of proteins and DNAs, as the principle of the mechanism of life.

[Question 2] Why was the homology of the sequence spectrum closely associated with the interaction of proteins?

A possible answer to this problem was that the sequence spectrum could reflect the higher order structure of proteins. The interacting region was considered to consist of a specific sequence of amino acids. This specificity of the amino acid sequence could be reflected in the appearance frequency of the base sequence corresponding to the amino acid sequence. The homology of the sequence spectrum could thus be interpreted as an affinity between the interactive regions of the proteins.
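
The idea that similar appearance frequencies mark candidate interacting regions can be illustrated with a deliberately simplified stand-in. The sketch below is not the authors' SSM: it does not model the parameters m, d and p or the homology factor HF. It swaps in plain overlapping k-mer counts and cosine similarity, and slides a window the size of the reference along the target. All function names (`kmer_profile`, `cosine_similarity`, `best_window`) and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_window(reference: str, target: str, k: int = 3) -> tuple:
    """Slide a window of len(reference) along `target`.

    Returns (start, score) of the first window whose k-mer profile is most
    similar to that of the reference; `start` is a 0-based offset.
    """
    width = len(reference)
    ref_profile = kmer_profile(reference, k)
    best = (0, -1.0)
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = cosine_similarity(ref_profile, kmer_profile(window, k))
        if score > best[1]:
            best = (start, score)
    return best


# Toy data: the reference is embedded in the target at offset 20.
reference = "ACGTTGCAACGGTCA"
target = "A" * 20 + reference + "T" * 20
start, score = best_window(reference, target)
print(start, round(score, 3))
```

A frequency-based score of this kind ignores the order of the k-mers inside the window, so it is only a rough proxy for the spectrum homology discussed in the text.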

[Question 3] Why was the homology of sequence

spectrum closely associated with the interaction of

protein and DNA?

Similarly to [Question 2], a possible answer to this problem was that the sequence spectrum could reflect the higher order structure of protein and DNA. However, this would raise another crucial problem: why could the sequence spectrum reflect the higher order structure of both protein and DNA in the same manner, although they were totally different objects? To answer this problem, it was necessary to examine the relation between the higher order structures of protein and DNA (or RNA). Our results implied that there could exist a close structural relation between them. For instance, it was well known that a domain of the EF-G factor protein mimicked aminoacyl-tRNA [27]. It could even be possible that the structure of a protein inherited the structure of its original DNA in the genome, because inheritance would be the simplest answer to this problem. SSM basically could detect the interacting regions of gene DNAs through the homology of the sequence spectrum, and this could automatically lead to the detection of the interacting regions of the proteins translated from those gene DNAs through this structural inheritance. We suspected that tRNA and the codon table gave an important clue on this issue, because tRNAs were directly associated with both the amino acids of protein and the triplet codons of DNA. Moreover, the sequence spectra of tRNA and protein possessed a similar relation. For instance, the GTP binding protein RAS2 [28,29] and Gly(GGG)-tRNA, which were both related to guanine (G), were similar in the sequence spectrum [2].

4. CONCLUSIONS

The conclusions obtained in this study were summarized as follows.

1) The homology of the sequence spectrum was closely associated with the interaction of protein and DNA.

2) The SSM was a suitable prediction method to identify interacting regions regardless of the kind of biological macromolecule: DNA, RNA or protein.

3) The SSM was so fast that it did not require a supercomputer; a personal computer was sufficient.

4) The generation-rules in the genome could govern not only the static base sequence in the genome but also the dynamic phenomena of proteins and DNAs.

5) The sequence spectrum could reflect the higher order structure of protein and DNA.

6) There could be a close relation between the structures of protein and DNA.

The proposed SSM-based method should be improved so that it can identify or predict both the reference and target regions at the same time in all cases. This project is now ongoing in our laboratory, and we will report on this subject in our next paper.

REFERENCES

[1] Takeda, M. and Nakahara, M. (2009) Structural features of the nucleotide sequences of genomes. Journal of Computer Aided Chemistry, 10, 38-52.

[2] Nakahara, M. and Takeda, M. (2010) Characterization of the sequence spectrum of DNA based on the appearance frequency of the nucleotide sequences of the genome: A new method for analysis of genome structure. Journal of Biomedical Science and Engineering, 3, 340-350.

[3] Geli, V., Yang, M., Suda, K., Lustig, A. and Schatz, G. (1990) The MAS-encoded processing protease of yeast mitochondria. Overproduction and characterization of its two nonidentical subunits. Journal of Biological Chemistry, 265(31), 19216-19222.

[4] West, A.H., Clark, D.J., Martin, J., Neupert, W., Hartl, F.U. and Horwich, A.L. (1992) Two related genes encoding extremely hydrophobic proteins suppress a lethal mutation in the yeast mitochondrial processing enhancing protein. Journal of Biological Chemistry, 267(34), 24625-24633.

[5] Ito, A. (1999) Mitochondrial processing peptidase: multiple-site recognition of precursor proteins. Biochemical and Biophysical Research Communications, 265(3), 611-616.

[6] Nagao, Y., Kitada, S., Kojima, K., Toh, H., Kuhara, S., Ogishima, T. and Ito, A. (2000) Glycine-rich region of mitochondrial processing peptidase α-subunit is essential for binding and cleavage of the precursor proteins. Journal of Biological Chemistry, 275, 34552-34556.

[7] Ogawa, N. and Oshima, Y. (1990) Functional domains of a positive regulatory protein, PHO4, for transcriptional control of the phosphatase regulon in Saccharomyces cerevisiae. Molecular and Cellular Biology, 10(5), 2224-2236.

[8] Okada, H. and Toh-e, A. (1992) A novel mutation occurring in the PHO80 gene suppresses the PHO4c mutations of Saccharomyces cerevisiae. Current Genetics, 21(2), 95-99.

[9] Cramer, P., Bushnell, D.A. and Kornberg, R.D. (2001) Structural basis of transcription: RNA polymerase II at 2.8 Angstrom resolution. Science, 292(5523), 1863-1876.

[10] Baker, H.V. (1991) GCR1 of Saccharomyces cerevisiae encodes a DNA binding protein whose binding is abolished by mutations in the CTTCC sequence motif. Proceedings of the National Academy of Sciences of the United States of America, 88(21), 9443-9447.

[11] Uemura, H. and Jigami, Y. (1992) Role of GCR2 in transcriptional activation of yeast glycolytic genes. Molecular and Cellular Biology, 12(9), 3834-3842.

[12] Deminoff, S.J., Tornow, J. and Santangelo, G.M. (1995) Unigenic evolution: A novel genetic method localizes a putative leucine zipper that mediates dimerization of the Saccharomyces cerevisiae regulator Gcr1p. Genetics, 141(4), 1263-1274.

[13] Deminoff, S.J. and Santangelo, G.M. (2001) Rap1p requires Gcr1p and Gcr2p homodimers to activate ribosomal protein and glycolytic genes, respectively. Genetics, 158(1), 133-143.

[14] Gourlay, C.W., Dewar, H., Warren, D.T., Costa, R., Satish, N. and Ayscough, K.R. (2003) An interaction between Sla1p and Sla2p plays a role in regulating actin dynamics and endocytosis in budding yeast. Journal of Cell Science, 116(12), 2551-2564.

[15] Liu, C., Yang, Z., Yang, J., Xia, Z. and Ao, S. (2000) Regulation of the yeast transcription factor PHO2 activity by phosphorylation. Journal of Biological Chemistry, 275(41), 31972-31978.

[16] Yang, J. and Ao, S.Z. (1996) Interaction of the yeast PHO2 protein or its mutants with the PHO5 UAS in vitro. Sheng Wu Hua Xue Yu Sheng Wu Li Xue Bao (Shanghai), 28(3), 316-320.

[17] Shimizu, T., Toumoto, A., Ihara, K., Shimizu, M., Kyogoku, Y., Ogawa, N., Oshima, Y. and Hakoshima, T. (1997) Crystal structure of PHO4 bHLH domain-DNA complex: Flanking base recognition. EMBO Journal, 16(15), 4689-4697.

[18] Bhoite, L.T. and Stillman, D.J. (1998) Residues in the Swi5 zinc finger protein that mediate cooperative DNA binding with the Pho2 homeodomain protein. Molecular and Cellular Biology, 18(11), 6436-6446.

[19] Rodgers, A.J. and Wilce, M.C. (2000) Structure of the gamma-epsilon complex of ATP synthase. Nature Structural Biology, 7(11), 1051-1054.

[20] Montgomery, G.C., Leslie, A.G. and Walker, J.E. (2000) The structure of the central stalk in bovine F(1)-ATPase at 2.4 A resolution. Nature Structural Biology, 7(11), 1055-1061.

[21] Tsumuraya, M., Furuike, S., Adachi, K., Kinoshita, K. Jr. and Yoshida, M. (2009) Effect of ε subunit on the rotation of thermophilic Bacillus F1-ATPase. FEBS Letters, 583(7), 1121-1126.

[22] Saccharomyces Genome Database (2010) http://www.yeastgenome.org/

[23] Ding, W.V. and Johnston, S.A. (1997) The DNA binding and activation domains of Gal4p are sufficient for conveying its regulatory signals. Molecular and Cellular Biology, 17(5), 2538-2549.

[24] Johnston, M. and Davis, R.W. (1984) Sequences that regulate the divergent GAL1-GAL10 promoter in Saccharomyces cerevisiae. Molecular and Cellular Biology, 4(11), 1440-1448.

[25] Lorch, Y. and Kornberg, R.D. (1985) A region flanking the GAL7 gene and binding site for GAL4 protein as upstream activating sequences in yeast. Journal of Molecular Biology, 186(4), 821-824.

[26] Tajima, M., Nogi, Y. and Fukazawa, T. (1986) Duplicate upstream activating sequences in the promoter region of the Saccharomyces cerevisiae GAL7 gene. Molecular and Cellular Biology, 6(1), 246-256.

[27] Nissen, P., Kjeldgaard, M., Thirup, S., Polekhina, G., Reshetnikova, L., Clark, B.F. and Nyborg, J. (1995) Crystal structure of the ternary complex of Phe-tRNAPhe, EF-Tu, and a GTP analog. Science, 270(5241), 1464-1472.

[28] Kataoka, T., Powers, S., McGill, C., Fasano, O., Strathern, J., Broach, J. and Wigler, M. (1984) Genetic analysis of yeast RAS1 and RAS2 genes. Cell, 37(2), 437-445.

[29] Mabuchi, T., Ichimura, Y., Takeda, M. and Douglas, M.G. (2000) ASC1/RAS2 suppresses the growth defect on glycerol caused by the atp1-2 mutation in the yeast Saccharomyces cerevisiae. Journal of Biological Chemistry, 275(14), 10492-10497.