J. Biomedical Science and Engineering, 2010, 3, 340-350 JBiSE
doi:10.4236/jbise.2010.34047 Published Online April 2010 (http://www.SciRP.org/journal/jbise/).

Characterization of the sequence spectrum of DNA based on the appearance frequency of the nucleotide sequences of the genome: A new method for analysis of genome structure

Masatoshi Nakahara1, Masaharu Takeda2
1Department of Computer and Information Sciences, Sojo University, Ikeda, Japan; 2Department of Materials and Biological Engineering, Tsuruoka National College of Technology, Tsuruoka, Japan.
Email: mtakeda@tsuruoka-nct.ac.jp
Received 13 January 2010; revised 25 January 2010; accepted 30 January 2010.

ABSTRACT

The nucleotide (base) sequence of the genome might reflect biological information beyond the coding sequences. The appearance frequencies of successive base sequences (key sequences) were calculated for entire genomes. Based on the appearance frequency of the key sequences of the genome, any DNA sequence on the genome could be expressed, together with its adjoining base sequences, as a sequence spectrum, which could be used to study the corresponding biological phenomena. In this paper, we used the 64 successive three-base sequences (triplets) as the key sequences, and compared the spectra of specific genes with those of the chromosome, and the spectra of specific genes with those of tRNA genes, in Saccharomyces cerevisiae, Schizosaccharomyces pombe and Escherichia coli. Based on these analyses, a gene and its corresponding position on the chromosome showed highly similar spectra with the same fold enlargement (approximately 400-fold) in the S. cerevisiae, S. pombe and E. coli genomes. In addition, a homologous structure was also observed between protein-encoding genes and the appropriate tRNA gene(s) in the genome.
This analytical method might faithfully reflect the encoded biological information; that is, conservation of the base sequence in the coding region corresponded to conservation of the translated amino acid sequence. The method might be universally applicable to other genomes, even those that consist of multiple chromosomes.

Keywords: Appearance Frequency of Triplet in Genome Base Sequence; Self-Similarity; Analytical Method of Genome Structure

1. INTRODUCTION

It was well known that there were structural hierarchies in the genome, such as the chromosome, nucleosome, ORF (open reading frame) and so on [1]. Among them, much attention has been paid to the ORF, and many research projects were being performed from the viewpoint of protein function using methods such as proteome and transcriptome analyses [2-5]. Many studies of entire genome sequences have been reported [6-11], although complete genome base sequences have only been revealed within the last 10 years or so. However, we currently have limited tools, including the pertinent hardware and software, to analyze a large-scale molecule such as a whole genome. It was very important to investigate the structural features of the entire genome because the four bases could be arranged in a sophisticated fashion in the genome, and in principle the base sequences might be reflected in the conformations of protein, RNA and DNA. In other words, if we could identify a meaningful structure, or an analytical method for the genome, we could also obtain important information about the functions of protein, RNA and DNA from that structure. The four bases in genomic DNA were arranged in a sophisticated manner in all organisms, and their arrangement clearly distinguished the coding and non-coding regions of the genome. By analyzing the appearance frequency of the bases, it was shown that first, symmetry [8-11], second, bias [12-15] and third, fractality [16-19] could be necessary to generate genome base sequences.
We analyzed genome structure based on the appearance frequency of genome base sequences [20]. We have studied many genome sequences downloaded from databases such as NCBI [21], and calculated the appearance frequencies of an arbitrarily chosen base sequence (key sequence) in a genome. Subsequently, we determined the sequence spectra of chromosomes, genes and DNA from the key sequences of the genome (chromosome), and analyzed both the coding and non-coding sequences because the key sequences were used throughout the genome in cells. However, in the coding regions of the DNA, the appearance frequencies of the key sequences of an individual gene should vary within the genome because the protein-encoding gene and the adjoining (5’- and 3’-) non-coding base sequences were different. In other words, the appearance frequencies of the base sequences should be different for each gene. Even if the base sequences of two genes were identical, the adjoining base sequences differed, suggesting that each DNA sequence might have an effect on the expression of the gene and function as an informative DNA molecule [20].

M. Nakahara et al. / J. Biomedical Science and Engineering 3 (2010) 340-350. Copyright © 2010 SciRes. JBiSE

Each gene was transcribed to mRNA, and translated to a protein on the ribosome (polyribosome) according to the DNA sequence of each coding region. In other words, the biological information of DNA (base sequence) should be transferred to protein via mRNA (base sequence). That is, the information in the base sequence of DNA was transformed to the amino acid sequence by tRNAs corresponding to the base sequences of the mRNA on the ribosome [22].

However, the coding regions varied in individual genomes and species [23,24]. The non-coding sequences might be necessary to precisely, rapidly, and consistently regulate gene expression [24,25].
In other words, the genome might be a “field” on which the four bases were arranged in a sophisticated manner into genes that were regulated and expressed to carry out the biological phenomena of life. Therefore, analytical methods to characterize genome structure were needed to understand the encoded biological phenomena.

In this study, we developed an analytical method based on the frequencies of the nucleotide (base) sequences in the whole genome according to the flow of biological information, and focused on the self-similarity in the genomes of S. cerevisiae and S. pombe, where most of the genes had introns, and E. coli, in which most genes were in operons.

2. MATERIALS AND METHODS

2.1. Sequence Spectrum Method (SSM)

The outline of the proposed method was as follows. The base sequence of interest was sectioned into short runs of bases starting from the top (5’-end). Each sectioned base sequence was called a key sequence. For three successive bases (d = 3), the appearance frequencies of the 64 triplets (the genetic codons) were shown in Table 1 (key sequence at d = 3). For nine successive bases (d = 9), there were 262,144 key sequences (= 4^9) [20]. The appearance frequency of each key sequence was counted in the entire genome, and was plotted at the position of the first base of the key sequence, as described later in the Materials and Methods. These procedures were carried out for the entire base sequence of interest with a one-base shift (p = 1). The next step was to average the appearance frequencies so that a recognizable pattern of appearance frequency was obtained for the base sequence. This pattern of the averaged appearance frequency was called the “sequence spectrum”. Finally, the homology factor between two sequence spectra was calculated to determine the degree of homology. The exact procedure was explained below in a mathematical way.
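The enumeration and counting of key sequences outlined above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `key_sequences` and `appearance_table` and the short string `toy_genome` are assumptions introduced here, and a real analysis would read a full genome sequence instead.

```python
from collections import Counter
from itertools import product

def key_sequences(d):
    """All 4**d possible key sequences of length d over A, T, G, C."""
    return ["".join(bases) for bases in product("ATGC", repeat=d)]

def appearance_table(genome, d=3):
    """Count every length-d key sequence in `genome` with a one-base shift."""
    counts = Counter(genome[i:i + d] for i in range(len(genome) - d + 1))
    # Report all 4**d keys, including those that never occur.
    return {key: counts.get(key, 0) for key in key_sequences(d)}

# Hypothetical toy sequence standing in for the entire set S.
toy_genome = "ATGATGCAT"
table = appearance_table(toy_genome, d=3)
print(len(table), table["ATG"])  # 64 keys; ATG occurs twice in the toy sequence
```

With d = 9 the same enumeration yields 4^9 = 262,144 keys, which matches the count quoted in the text.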
Let S be the entire set of base sequences, and let B = [b_i] be a partial set of interest in S. A base element was denoted by b_i (i = 1, ..., M), where M was the base sequence size of B. Each base element b_i was A (adenine), T (thymine), G (guanine) or C (cytosine). The key sequence k_i and the appearance frequency f_i were defined for b_i as follows.

Key sequence k_i: base sequence comprised of the sequential base elements b_i ~ b_{i+d-1} (d: base size of the key sequence)
Appearance frequency f_i: appearance count of k_i in S

The key sequence k_i was compared with the base sequence of the entire set S, and the appearance frequency f_i was increased by one every time the key sequence k_i matched a partial base sequence of the entire set S. This procedure was iterated for all key sequences k_i to obtain f_i (i = 1, ..., M). Consequently, the appearance frequency vector F = [f_i] (i = 1, ..., M) was determined (strictly, the appearance frequencies for the last (d-1) base elements of B could not be calculated; however, this was neglected because M >> d-1).

Next, the appearance frequency f_i was averaged as follows:

$$ fs_i = \frac{1}{2m+1} \sum_{j=i-m}^{i+m} f_j $$

where the parameter m was the average width. This averaged appearance frequency Fs = [fs_i] (i = 1, ..., M) was called the “sequence spectrum”.

The next step was to calculate the homology factor to determine the degree of homology. The homology factor determined the homologous region of a target base sequence with respect to a reference base sequence. In order to derive the homology factor, the mutual correlation function MF was calculated as

$$ MF_k(Fsr, Fst) = \frac{\sum_{i=1}^{M_r} \left(fsr_i - \overline{fsr}\right)\left(fst_{i+k} - \overline{fst}_k\right)}{\lVert Fsr \rVert \, \lVert Fst_k \rVert} $$

$$ \lVert Fsr \rVert = \sqrt{\sum_{i=1}^{M_r} \left(fsr_i - \overline{fsr}\right)^2}, \qquad \lVert Fst_k \rVert = \sqrt{\sum_{i=1}^{M_r} \left(fst_{i+k} - \overline{fst}_k\right)^2} $$

$$ \overline{fsr} = \frac{1}{M_r} \sum_{i=1}^{M_r} fsr_i, \qquad \overline{fst}_k = \frac{1}{M_r} \sum_{i=1}^{M_r} fst_{i+k} $$

where
Fsr: sequence spectrum of the reference base sequence with base size M_r
Fst: sequence spectrum of the target base sequence with base size M_t (> M_r)

The mutual correlation function MF ranged from -1 to 1, and the homology factor HF was then defined as

$$ HF_k(Fsr, Fst) = \frac{MF_k + 1}{2} \times 100 \ [\%] $$

The higher the homology factor, the more homologous the sequence spectra were. The homologous regions of the target base sequence with respect to the reference base sequence were obtained by calculating the homology factors HF_k for all k (k = 0, ..., M_t - M_r), and targeting the regions with higher homology factors.

When the target base sequence was very large, elements of the target sequence spectrum were skipped by the size factor p to reduce the size as follows.

$$ fst'_i = fst_{(i-1)p+1} $$

For instance, when p = 2,

$$ (fst'_1, fst'_2, fst'_3, \ldots) = (fst_1, fst_3, fst_5, \ldots) $$

This operation reduced the size to 1/p.

Table 1. Key sequences of three successive bases*1 in the genome*2.

(a) S. cerevisiae
Triplet  Frequency    Triplet  Frequency    Triplet  Frequency    Triplet  Frequency
AAA      478,677      AAT      359,378      AAG      263,401      AAC      219,288
ATA      302,770      ATT      358,051      ATG      221,867      ATC      214,197
AGA      246,395      AGT      184,087      AGG      138,976      AGC      139,262
ACA      208,942      ACT      183,292      ACG      106,020      ACC      141,084
TAA      271,996      TAT      301,699      TAG      156,650      TAC      172,399
TTA      271,724      TTT      475,621      TTG      279,349      TTC      286,655
TGA      244,596      TGT      207,422      TGG      179,858      TGC      150,406
TCA      245,024      TCT      244,505      TCG      110,351      TCC      154,145
GAA      288,804      GAT      213,000      GAG      136,067      GAC      118,074
GTA      172,583      GTT      218,208      GTG      128,946      GTC      117,316
GGA      154,364      GGT      139,691      GGG       81,268      GGC       95,122
GCA      150,888      GCT      139,012      GCG       67,875      GCC       95,478
CAA      281,266      CAT      222,808      CAG      152,602      CAC      129,575
CTA      155,668      CTT      261,471      CTG      152,121      CTC      135,857
CGA      110,589      CGT      105,859      CGG       70,348      CGC       68,463
CCA      181,394      CCT      138,308      CCG       71,012      CCC       82,880

(b) S. pombe
Triplet  Frequency    Triplet  Frequency    Triplet  Frequency    Triplet  Frequency
AAA      569,684      AAT      409,666      AAG      277,238      AAC      234,759
ATA      310,191      ATT      409,111      ATG      227,572      ATC      207,984
AGA      225,118      AGT      196,340      AGG      128,892      AGC      158,220
ACA      212,145      ACT      193,959      ACG      110,332      ACC      123,580
TAA      334,648      TAT      310,127      TAG      162,059      TAC      183,503
TTA      334,208      TTT      572,331      TTG      296,280      TTC      292,897
TGA      244,964      TGT      213,557      TGG      156,002      TGC      157,500
TCA      245,161      TCT      227,278      TCG      123,339      TCC      149,364
GAA      291,250      GAT      207,564      GAG      134,381      GAC      108,437
GTA      185,292      GTT      236,486      GTG      113,029      GTC      109,314
GGA      148,699      GGT      123,656      GGG       67,242      GGC       75,049
GCA      157,454      GCT      157,621      GCG       64,622      GCC       75,416
CAA      295,764      CAT      227,501      CAG      134,892      CAC      113,317
CTA      160,646      CTT      277,788      CTG      135,142      CTC      134,949
CGA      122,848      CGT      110,569      CGG       62,511      CGC       64,344
CCA      156,714      CCT      129,667      CCG       61,979      CCC       67,351

(c) E. coli
Triplet  Frequency    Triplet  Frequency    Triplet  Frequency    Triplet  Frequency
AAA      108,901      AAT       82,992      AAG       63,364      AAC       82,578
ATA       63,692      ATT       83,395      ATG       76,229      ATC       86,476
AGA       56,618      AGT       49,774      AGG       50,611      AGC       80,848
ACA       58,633      ACT       49,863      ACG       73,263      ACC       74,899
TAA       68,837      TAT       63,279      TAG       27,241      TAC       52,591
TTA       68,824      TTT      109,825      TTG       76,968      TTC       83,846
TGA       83,483      TGT       58,369      TGG       85,132      TGC       95,221
TCA       84,033      TCT       55,469      TCG       71,733      TCC       56,025
GAA       83,490      GAT       86,547      GAG       42,460      GAC       54,737
GTA       52,670      GTT       82,590      GTG       66,108      GTC       54,225
GGA       56,199      GGT       74,291      GGG       47,470      GGC       92,123
GCA       96,010      GCT       80,285      GCG      114,609      GCC       92,961
CAA       76,607      CAT       76,974      CAG      104,785      CAC       66,752
CTA       26,762      CTT       63,653      CTG      102,900      CTC       42,714
CGA       70,934      CGT       73,159      CGG       86,870      CGC      115,673
CCA       86,442      CCT       50,412      CCG       87,031      CCC       47,764

*1; The 5’- to 3’-end corresponds to the left to the right letter of each triplet.
*2; The S. cerevisiae genome is composed of 16 chromosomes plus mtDNA. The S. pombe genome is composed of 3 chromosomes plus mtDNA.

The base sequences of the genomes were obtained from the databases listed below.
Saccharomyces cerevisiae: http://www.mips.biochem.mpg.de/
Schizosaccharomyces pombe: http://www.sanger.ac.uk/
Escherichia coli: http://bmb.med.miami.edu/Ecogene/ecoWeb/

2.2. Appearance Frequencies of Bases or Base Sequences

In order to analyze the structure of the base sequence, the most appropriate parameter was considered to be the appearance frequency. For three successive bases (triplets), the appearance frequency was counted for the entire genome by matching the triplet from the start of the base sequence in a genome with a one-base shift (p = 1) as follows.

Ex. Triplet bases: AAT (one base shift)
Base sequence: 5’-ATCGAATCCGTAATTCGGAGTCGAATT-3’
Count of AAT: 1, 2, 3 (incremented at each match; three matches in total)

3. RESULTS

3.1. Sequence Spectrum

Figure 1 showed the sequence spectrum of the F1F0-ATPase subunit gene ATP1 [26, YBL099W] in Saccharomyces cerevisiae. In this figure, the vertical axis (the sequence spectrum fs_i) was not labeled and was scaled for display, because only the shape of the sequence spectrum was meaningful in this manuscript. The horizontal axis was the base sequence number i (i = 1, ..., M); it was also omitted in the following figures because it was easily derived from the base sequence size M.

Controllable parameters in the sequence spectrum were the base size d of the key sequence, the average width m, and the size factor p (skipped base numbers). The parameter d determined the highest resolution for extracting the structural features of the base sequence. In this report, we used key sequences with d = 3 (appearance frequency table of triplets, Table 1) for the numerical experiments on the homologous structure discussed in the following sections. However, as shown in Figure 1, smaller m-values caused a more jagged zigzag pattern in the sequence spectrum, and eventually it became more difficult to identify the structure of the base sequence (Figure 1(a)).
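The counting example of Section 2.2 and the role of the average width m can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' code: the names `appearance_frequencies` and `sequence_spectrum` are hypothetical, the 27-base example sequence is used as both the entire set S and the sequence of interest B, and the averaging window is simply clipped at the two ends of the sequence, a boundary rule that the text does not specify.

```python
from collections import Counter

def appearance_frequencies(B, S, d=3):
    """f_i: number of times the key sequence starting at base i of B occurs in S."""
    table = Counter(S[j:j + d] for j in range(len(S) - d + 1))
    # The last d-1 positions of B have no complete key and are skipped.
    return [table[B[i:i + d]] for i in range(len(B) - d + 1)]

def sequence_spectrum(f, m):
    """Average f over a window of half-width m (2m+1 points, clipped at the ends)."""
    spectrum = []
    for i in range(len(f)):
        window = f[max(0, i - m): i + m + 1]
        spectrum.append(sum(window) / len(window))
    return spectrum

# Example sequence from Section 2.2, used here as both S and B (an assumption).
S = "ATCGAATCCGTAATTCGGAGTCGAATT"
f = appearance_frequencies(S, S, d=3)
print(f[4])                          # key AAT at the fifth base; AAT occurs 3 times
print(sequence_spectrum(f, 0) == f)  # m = 0 leaves the raw zigzag unchanged
smooth = sequence_spectrum(f, 2)     # a larger m gives a smoother spectrum
```

Setting m = 0 reproduces the unaveraged pattern of Figure 1(a), whereas larger m trades resolution for a smoother curve.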
Therefore, large m-values were usually used to obtain the overall features of the structure, and smaller m-values were applied to investigate the structure in detail (Figure 1(b)). The value of m normally ranged from 1/10 to 1/100 of the base sequence size. In this manuscript, m = 2 for a tRNA, m = 60 for a gene, and m = 8,000 for a chromosome. The size factor p was adjusted to the base sequence size, especially when the homology factor between a small reference and a large target was calculated. The possible appearance frequencies f_i of the key sequences k_i were calculated for the entire set S in advance. The appearance frequency table depended on the entire set S, and in general S was the genome of the target species.

3.2. Reverse-Complement Symmetry in the Appearance Frequency Table

Table 1 showed the appearance frequencies of the key sequences (3 successive bases = triplet, d = 3) for Saccharomyces cerevisiae (a), Schizosaccharomyces pombe (b), and Escherichia coli (c). This table revealed some important features of the genome. In the case of S. cerevisiae, first, it was notable that the appearance frequencies of a key sequence and its reverse-complementary key sequence were almost the same. The reverse-complement key sequence was derived by reversing the base order of the original key sequence, exchanging A and T, and exchanging G and C. For example, the appearance frequency of 5’-ATT was 358,051 and that of 5’-AAT was 359,378. The difference was less than 1%. The largest difference was about 2%, for 5’-GGG (81,268) and 5’-CCC (82,880). This held regardless of the species, as seen for Schizosaccharomyces pombe (Table 1(b)) and Escherichia coli (Table 1(c)). This reverse-complement symmetry led to the fact that the numbers of A and T were almost equal, and the numbers of G and C were almost equal.
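The reverse-complement comparison described above is easy to check numerically. The sketch below uses only the four S. cerevisiae values quoted in the text; the helper names `reverse_complement` and `relative_difference` are assumptions introduced for illustration, and the relative difference is taken with respect to the smaller of the two counts, which is one reasonable reading of the percentages given.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(key):
    """Exchange A<->T and G<->C, then reverse the base order."""
    return key.translate(COMPLEMENT)[::-1]

def relative_difference(a, b):
    """Difference between two counts relative to the smaller one."""
    return abs(a - b) / min(a, b)

# S. cerevisiae appearance frequencies quoted in the text (Table 1(a)).
freq = {"ATT": 358_051, "AAT": 359_378, "GGG": 81_268, "CCC": 82_880}

for key in ("ATT", "GGG"):
    partner = reverse_complement(key)
    diff = relative_difference(freq[key], freq[partner])
    print(f"{key} vs {partner}: {diff:.2%}")
```

For these values the ATT/AAT pair differs by well under 1% and the GGG/CCC pair by roughly 2%, in line with the statements above.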
[Figure 1 panels: (a) ATP1, m = 0; (b) ATP1, m = 60]
Figure 1. Sequence spectrum of ATP1. Sequence spectra of ATP1 [26-28] from Saccharomyces cerevisiae with different average widths: (a) m = 0 and (b) m = 60. The vertical axis (appearance frequency of the triplet, d = 3) of the sequence spectrum is not labeled and is scaled for display. The horizontal axis is the base sequence of ATP1 (1,638 nt, designated as M; from ATG = start codon to TAA = stop codon). The skipped base numbers (p) are shown in the figures. The zigzag motif becomes more moderate and the resolution becomes lower as the average width m becomes larger.

Generally it was well known that the numbers of A and T, and the numbers of G and C, were the same in DNA due to its double helix structure. However, the coincidence of base numbers described here occurred within the genome sequence itself, so it did not follow from the double helix structure. Therefore, the coincidence of base numbers occurred even in a single strand when the base sequence size was very large. In fact, this reverse-complement symmetry occurred in each chromosome as well.

On the other hand, it did not occur when the base sequence size was not large enough. For instance, the base sequence size of a single gene was not adequate. The fact that the appearance frequencies of a key sequence and its reverse-complementary key sequence were almost equal implied that there must be a certain amount of symmetry in the genome.

Second, the appearance frequency (in parentheses) of each key sequence was not random; some of the key sequences had very close appearance frequencies even when they did not have a complementary relationship. For example, in the case of S. cerevisiae, the key sequences 5’-AAC (219,288), 5’-ATC (214,197) and 5’-ACA (208,942) had close appearance frequencies of about 210,000, and those of the key sequences 5’-ACG (106,020), 5’-CGA (110,589) and 5’-GAC (118,074) were about 110,000.
These different key sequences with close appearance frequencies might have a similar effect on the sequence spectrum. In other words, single-stranded DNA with base symmetry might be able to form many double-helical stems within a molecule, and in the peaks of the sequence spectrum the “up” side of a double-helical stem might have the same effect as its “down” side. Needless to say, these facts were valid regardless of the species.

3.3. Homologous Structure in Genomes (Enlargement-Reduction of the Base Sequence)

ATP1 (YBL099W) of S. cerevisiae was present on the left arm of chromosome II (37,045-38,679 from the left telomere). Figure 2 showed the spectra of (a) chromosome II (813,139 nt) and of ATP1 (1,638 nt; see also Figure 1(b)). The red arrowhead indicated the position of ATP1 on chromosome II [27, 28]. When the spectrum of ATP1 (1,638 nt) was skipped by 3 bases (p = 3) and the homology between chromosome II and the skipped ATP1 was analyzed, the red region of chromosome II (20,401 ~ 60,401 = 40,000 nt) was homologous to the region 1,341 ~ 1,638 (297 nt) of the 3-base-skipped ATP1 (Figure 2(b); HF of the red region of chromosome II to the purple region of ATP1 = 95%). When ATP1 was skipped by 10 or 16 bases, the area of ATP1 homologous to the red region of chromosome II was enlarged to 990 nt (Figure 2(c), 648 ~ 1,638) or 1,584 nt (Figure 2(d), 54 ~ 1,638), respectively. That is, the base sequence of the complete ATP1 gene had self-similarity to the gene position on chromosome II.

Other genes of S. cerevisiae were highly homologous with their gene positions on each chromosome, irrespective of gene size, gene order, direction of transcription, or chromosome. The fold-enlargement of the gene to each chromosome was calculated as approximately 400-fold (Table 2(a)).

The same relationship of enlargement-reduction between chromosome and gene was observed in S. pombe (eukaryotic cells, Table 2(b)) and E. coli (prokaryotic cells, Table 2(c)).
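The homology scan used in this section (skipping by the size factor p, then computing the homology factor at every offset k) can be sketched as below. This is a minimal sketch assuming that the mutual correlation function is the mean-centered, normalized correlation implied by its stated range of -1 to 1; the function names and the short numeric spectra are hypothetical, and returning 0.0 for a flat (zero-variance) window is a choice made here, not something stated in the paper.

```python
from math import sqrt

def skip(spectrum, p):
    """Keep every p-th element (elements 1, 1+p, 1+2p, ...), reducing the size to about 1/p."""
    return spectrum[::p]

def mutual_correlation(ref, win):
    """Mean-centered, normalized correlation of two equal-length spectra, in [-1, 1]."""
    n = len(ref)
    mean_ref, mean_win = sum(ref) / n, sum(win) / n
    num = sum((a - mean_ref) * (b - mean_win) for a, b in zip(ref, win))
    den = (sqrt(sum((a - mean_ref) ** 2 for a in ref))
           * sqrt(sum((b - mean_win) ** 2 for b in win)))
    return num / den if den else 0.0  # flat window: treated as uncorrelated (assumption)

def homology_factors(ref, target):
    """HF_k in percent for every offset k = 0 .. Mt - Mr."""
    n = len(ref)
    return [(mutual_correlation(ref, target[k:k + n]) + 1) / 2 * 100
            for k in range(len(target) - n + 1)]

# Hypothetical spectra: the reference pattern is embedded in the target at offset 2.
reference = [1, 2, 3, 2]
target = [5, 5, 1, 2, 3, 2, 9]
hf = homology_factors(reference, target)
best_k = max(range(len(hf)), key=hf.__getitem__)
print(best_k, round(hf[best_k], 1))
```

In the analyses described above, the chromosome spectrum and the gene spectrum would each be reduced with `skip` before scanning, and regions with high HF_k would be reported as homologous.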
In the case of small intron-containing genes in S. pombe, and genes in operons in the E. coli genome, the homology condition of the base width was also 100 nt, like that of the S. cerevisiae genome. Therefore, the homology pattern in a wide range of organisms might be dependent on the base sequence size of the gene analyzed. In any case, in the S. cerevisiae, S. pombe and E. coli genomes, genes and the base sequence near the chromosomal position of the gene had self-similarity with each other in the same ratio, approximately 400-fold. In some preliminary experiments, we observed the self-similarity of a gene to its chromosomal position in H. sapiens (for instance, Hs.5174 and chromosome 22; data not shown). This self-similarity might be universal in all species.

[Figure 2 panels: (a) Chromosome II; (b) ATP1, p = 3, HF = 95.0%; (c) ATP1, p = 10, HF = 95.9%; (d) ATP1, p = 16, HF = 92.4%]
Figure 2. Homology of chromosome II to ATP1. (a) Saccharomyces cerevisiae chromosome II (813,139 nt, from the left telomere sequence to the right telomere sequence), m = 8,000, d = 3, p = 400. The ATP1 gene is located 37,001 bases from the left telomere of chromosome II (arrowhead) [26-28]. The red region is composed of 40,000 nt (numbers on the abscissa, 20,401 - 60,401). The numbers on the abscissa indicate the base number from the left telomere according to MIPS. (b) ATP1 gene (1,638 nt, F1F0-ATPase complex subunit) [26], m = 60, d = 3, p = 3. (c) ATP1 (1,638 nt), m = 60, d = 3, p = 10. (d) ATP1 (1,638 nt), m = 60, d = 3, p = 16. The homologous region (purple) of ATP1 corresponding to the red region is numbered with the initial base “A” of the start codon (ATG) of the ATP1 coding region as 1 [26, 28].

3.4. Homologous Structure in tRNAs (Enlargement-Reduction of the Base Sequence)

If a homologous structure was general, it must exist not only in protein-coding genes but also in RNA genes. In fact, the sequence spectrum of each gene was more than 80% similar to those of tRNA genes in S. cerevisiae, S. pombe and E. coli (Table 3). Most amino acids have multiple genetic codons. Each genetic codon had multiple tRNA genes on several different chromosomes. How were the multiple tRNA genes used properly to construct proteins during the transformation of biological information in organisms? The genetic codons for glutamate (Glu) were 5’-GAA and 5’-GAG. In S. cerevisiae, there were 14 nuclear-encoded Glu(GAA)-tRNA genes on various chromosomes, and all of them were composed of 72 identical nucleotides (bases). Three of these 14 Glu(GAA)-tRNA genes were present on chromosome V (576,869 bp), located at positions 177,098 ~ 177,169, 354,930 ~ 355,001 and 487,397 ~ 487,326, and were designated Glu(GAA-1), Glu(GAA-2) and Glu(GAA-3), respectively [29-31, Figure 3 lower panel].

Figure 3 showed the sequence spectra of these three Glu(GAA)-tRNA genes on chromosome V and of ATP1 [26-28]. The window length of the tRNA gene was 70 nt in the analysis because the Glu(GAA)-tRNA genes were composed of 72 nt (bold black bar in the upper panel). In addition, the Glu(GAA)-tRNA spectrum analysis used DNA sequences of 112 bp, consisting of each of the three Glu(GAA)-tRNA genes (72 bp, black letters) plus the 20 adjoining nucleotides on the 5’- and 3’-sides (green letters). As a result, the homology factors (HF) of ATP1 to these three Glu(GAA)-tRNA genes were different; that is, 77.0% for GAA-1, 77.0% for GAA-2 and 88.5% for GAA-3, although these Glu(GAA)-tRNA genes were all composed of 72 identical nucleotides.

The sequence spectra of ATP1 (1,638 nt) and the 14 nuclear-encoded Glu(GAA)-tRNA genes (72 nt) were fairly homologous.
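Extracting a tRNA gene together with its 20 adjoining bases on each side, as described above, might look like the following sketch. Everything here is illustrative: the function name `extract_with_flanks`, the convention that a start coordinate larger than the end coordinate marks a gene on the reverse (Crick) strand, and the toy chromosome string are assumptions, not details given in the paper.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def extract_with_flanks(chromosome, first, last, flank=20):
    """Return the gene plus `flank` bases on each side, read 5'->3'.

    `first` and `last` are 1-based inclusive coordinates as quoted in the text;
    first > last is taken to mean a gene on the reverse (Crick) strand.
    """
    lo, hi = min(first, last), max(first, last)
    segment = chromosome[max(0, lo - 1 - flank): hi + flank]
    if first > last:
        # Reverse strand: complement the bases and reverse their order.
        segment = segment.translate(COMPLEMENT)[::-1]
    return segment

# Toy chromosome (hypothetical): a 7-base "gene" at positions 31-37.
toy = "A" * 30 + "GATTACA" + "C" * 30
print(extract_with_flanks(toy, 31, 37, flank=5))  # AAAAAGATTACACCCCC
print(extract_with_flanks(toy, 37, 31, flank=5))  # GGGGGTGTAATCTTTTT

# Size check for Glu(GAA-1): 72-nt gene plus 20 nt on each side.
print((177_169 - 177_098 + 1) + 2 * 20)           # 112
```

The size check reproduces the 112-bp analysis window quoted for the 72-nt Glu(GAA)-tRNA genes.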
The red area of the Glu(GAA)-tRNA gene was homologous to the homologous area (purple) of the ATP1 gene (1,638 bp), and the bracket in Figure 3 showed the Glu(GAA)-tRNA gene consisting of 72 bp. The homologous area (red) of the Glu(GAA)-tRNA to the ATP1 gene overlapped with a part of the adjoining sequences of the tRNA gene (the homologous region of the tRNA gene with the ATP1 gene was also indicated from the red base to the red base in the lower panel of Figure 3). In other words, the sequence spectrum analyses based on the frequencies of the base sequences in the genome indicated that the sequence spectrum of a gene might be influenced by the adjoining DNA sequences. The smaller the number of bases in the DNA sequence, as for the tRNA genes, the greater these effects.

In the same way, the other 11 nuclear-encoded Glu(GAA)-tRNA genes on several different chromosomes were generally homologous to the ATP1 gene on chromosome II, which encoded a subunit of the F1F0-ATPase complex [26-28], but their homology factors (HF) varied. The Glu(GAA)-tRNA gene with the maximum homology was on chromosome IX (HF = 89.2%; position 370,414-370,485, Watson strand) and the one with the minimum was on chromosome VII (HF = 73.8%; position 328,586-328,657, Watson strand). These results indicated that the analyses of such small DNA sequences were deeply affected by the adjoining sequences.

Other protein-encoding genes were highly homologous to the appropriate tRNA genes in the yeast S. cerevisiae. Similar homology of protein-encoding genes to appropriate tRNA genes in the same organism was observed for other genes in S. pombe and E. coli (data not shown). These results showed that the homologous structures spread consistently from a very small gene (tRNA) to a complete chromosome with the same scale, regardless of the species.

Table 2. Self-similarity of a gene to the chromosome.

Gene            nt (*1)   Chromosome (*2)   nt (*3)     Intron #   p-value (*4)   HF (%) (*5)
(a) S. cerevisiae
SEO1            1,779     1, left           230,203     0          17             61.2
FLO1            4,611     1, right          230,203     0          46             73.3
ATP1            1,638     2, left           813,139     0          16             92.4
SUP45           1,311     2, right          813,139     0          13             72.4
PRD1            2,136     3, left           315,350     0          21             77.7
PHO87           2,769     3, right          315,350     0          27             75.2
ATP16           480       4, left           1,531,929   0          4              93
RAD9            3,927     4, right          1,531,929   0          39             74.2
PAU2            360       5, left           576,870     0          3              85.2
GLC7            1,461     5, right          576,870     0          14             73.6
EMP47           1,335     6, left           270,148     0          13             81.4
PHO4            939       6, right          270,148     0          9              82.8
POX1            2,244     7, left           1,090,936   0          22             80.9
TFC4            3,075     7, right          1,090,936   0          30             69.9
GUT1(STE20)     2,127     8, left           562,638     0          21             61.6
IRE1(NDT80)     3,345     8, right          562,638     0          33             80
HOP1            1,815     9, left           439,885     0          17             64.5
MRS1(PAN1)      1,089     9, right          439,885     0          10             91.7
CYR1            6,078     10, left          745,440     0          61             79.7
ATP2            1,533     10, right         745,440     0          15             75.1
SDH1            1,920     11, left          666,445     0          17             71.1
CCP1(NUP133)    1,083     11, right         666,445     0          10             76.2
HSP104          2,724     12, left          1,078,173   0          27             68.4
MAS1            1,386     12, right         1,078,173   0          13             81.2
CYB2(CAT2)      1,773     13, left          924,430     0          17             88.4
HXT2(AAC1)      1,623     13, right         924,430     0          16             70.1
RAS2            966       14, left          784,328     0          9              86.1
POP2            1,299     14, right         784,328     0          13             75.4
ADH1            1,044     15, left          948,061     0          10             85.2
ADE2            1,713     15, right         948,061     0          17             78.8
TBF1(PHO85)     1,686     16, left          948,061     0          16             67.7
PZF1            1,287     16, right         948,061     0          12             91.2
(b) S. pombe
ATP2            1,578     1 (968,783)       5,579,133   0          15             78.6
RPL37           337       1 (1,275,535)     5,579,133   1          3              81.3
  RPL37(exon)   270       -                 -           -          2              77.5
CDC24           1,823     1 (2,863,965)     5,579,133   6          18             81.3
  CDC24(exon)   1,506     -                 -           -          15             77.5
ATP1            2,049     1 (5,256,781)     5,579,133   2          20             75
  ATP1(exon)    1,611     -                 -           -          16             76.1
MEU6            2,083     2 (454,230)       4,539,804   2          20             82.3
  MEU6(exon)    1,956     -                 -           -          19             82.3
CDC2            1,189     2 (1,500,340)     4,539,804   4          11             76.4
  CDC2(exon)    894       -                 -           -          8              78.8
ATP16           483       2 (3,046,873)     4,539,804   0          4              90.4
SPO4            1,672     2 (3,827,178)     4,539,804   2          16             85.1
  SPO4(exon)    1,290     -                 -           -          12             74.6
RAF1            1,917     3 (100,255)       2,455,984   0          19             73
HIF2            1,875     3 (194,552)       2,455,984   3          18             71.9
  HIF2(exon)    1,695     -                 -           -          16             N.D. (*6)
SRK1            3,932     3 (1,302,900)     2,455,984   1          39             64
  SRK1(exon)    1,743     -                 -           -          17             69
GAF1            2,568     3 (1,666,310)     2,455,984   0          25             76.2
TIF6            1,104     3 (2,223,154)     2,455,984   2          11             86.4
  TIF6(exon)    735       -                 -           -          7              76
ATP5            838       3 (2,268,884)     2,455,984   2          8              74.8
  ATP5(exon)    651       -                 -           -          6              93
(c) E. coli (K12)
araA            1,503     66,835            4,639,221   0          15             74.5
lacZ            3,075     362,455           4,639,221   0          30             66.3
galE            1,017     790,262           4,639,221   0          10             87.7
trpD            1,596     1,317,813         4,639,221   0          15             77.5
cybB            531       1,488,926         4,639,221   0          5              87.8
galF            894       2,111,458         4,639,221   0          8              88
argA            1,332     2,947,264         4,639,221   0          13             68.9
secY            1,332     3,440,788         4,639,221   0          13             82.8
atpA            1,741     3,916,339         4,639,221   0          17             73.9
purA            1,299     4,402,710         4,639,221   0          12             83

*1, Base numbers of the gene without intron. *2, Gene position on the chromosome (from the left to the right = S. pombe). *3, Size (base numbers) of the chromosome or genome. *4, Skipped base numbers of the gene (max. p-value). *5, HF (%) of the entire gene, at the max. p-value, to the homologous region of the chromosome. *6, not determined.

Table 3. Self-similarity of a protein-encoding gene to a tRNA gene.

Gene    Size (nt) (*1)   Chromosome (*2)   tRNA (*3)   Size (nt) (*4)   Chromosome (*5)   p (*6)
(S. cerevisiae)
ATP1    1,638            2                 Glu(GAA)    72               12                16
RAS2    936              14                Lys(AAG)    72               6                 9
ADH1    1,047            15                Arg(AGG)    72               10                10
TFC4    3,075            7                 Ser(TCG)    103              3                 30
PAU2    360              5                 Ser(AGC)    101              6                 3
CYR1    6,078            10                Ser(AGC)    101              6                 60
(S. pombe)
ATP1    1,611            1                 Tyr(TAC)    84               2                 16
YPT3    645              1                 Arg(AGA)    73               2                 6
CDC2    894              2                 Ser(TCT)    82               1                 8
SPO4    1,290            2                 Thr(ACT)    72               3                 12
GAF1    2,568            3                 Ser(AGC)    95               2                 25
TIF6    735              3                 Arg(AGA)    73               3                 7
(E. coli)
galE    1,017            K12 genome        Ser(TCC)    88               K12 genome        10
atpA    1,735            K12 genome        Ser(AGC)    93               K12 genome        17
cybB    531              K12 genome        Ser(TCC)    88               K12 genome        5
lacZ    3,075            K12 genome        Arg(CGT)    77               K12 genome        30

*1; Base numbers of the gene without intron. *2; Chromosome on which the gene is present. *3; Homologous tRNA gene. *4; Size of the tRNA gene. *5; Chromosome on which the tRNA gene is present. *6; Skipped base numbers of the gene.

[Figure 3 panels: (a) Glu(GAA)-tRNA-1, HF = 77.0%; (b) Glu(GAA)-tRNA-2, HF = 77.0%; (c) Glu(GAA)-tRNA-3, HF = 88.5%; (d) ATP1, M = 1,638, d = 3, p = 23]
Figure 3. Homology of Glu(GAA)-tRNA gene to ATP1 gene.
Upper panel: Sequence spectrum of (a) Glu(GAA-1)-tRNA gene; (b) Glu(GAA-2)-tRNA gene; (c) Glu(GAA-3)-tRNA gene on chromosome V of S. cerevisiae; (d) sequence spectrum of ATP1. The bold black line indicates the area of the Glu(GAA)-tRNA gene consisting of 72 bp.
(a) (GAA-1) 177,098 ~ 177,169 (Watson strand, left to right)
(b) (GAA-2) 354,930 ~ 355,001 (Watson strand, left to right)
(c) (GAA-3) 487,397 ~ 487,326 (Crick strand, right to left)
Lower panel: The adjoining DNA sequences of each Glu(GAA)-tRNA gene, and the orientation of each tRNA gene. The base sequences of Glu(GAA)-tRNA (72 bp, black letters), the adjoining sequences (5’-20 bp, 3’-20 bp, green letters), and the outside sequences that were analyzed (pink letters) are shown [29-31]. The ATP1-homologous region of each Glu(GAA)-tRNA gene runs from the underlined red base to the underlined red base (70 bp).

4. DISCUSSION

The results obtained in this study might lead to the development of generation rules for the base sequence of the genome.
The reason why genomes possess homologous structure regardless of the size of the base sequence could be related to the physical hierarchy in the structure of the genome, such as the double helix structure of DNA, nucleosome structure, super helix structure, and so on. The phenomenon in which homologous patterns appear at various size levels is known as “self-similarity” or “fractal”. Therefore, the structure of the genome could be essentially related to the fractal.

During the 1990s, many papers reported that the genome bases should follow the fractal rule [15-18, etc.], and Genome Projects for many species had revealed genomic base sequences in the last 10 years. Therefore, analyses of concrete biological phenomena based on the structures of genomes should be in progress.

In this paper, the sequence spectrum analyses used m = 2 for a tRNA, m = 60 for a protein, and m = 8,000 for a chromosome. In the case of the sequence spectrum of a protein, m = 10 (average of 20 nt) or m = 60 (average of 120 nt) was easier to use for the analysis, as these m-values corresponded to 6 ~ 7 or 40 amino acid residues, respectively [32]. In the case of the chromosome, m was adjusted to 8,000 (average of 16,000 nt = 80 nucleosomes) or 10,000 (average of 20,000 nt = 100 nucleosomes). In any case, the smaller the adjusted m-value, the higher the resolution of the sequence spectrum. These results suggested that “m” might be reflected in the higher order structure of a molecule, whether a tRNA gene, a protein or a chromosome, but the detailed biological meaning of the m-value is still under investigation [33, 34].

In addition, as described previously, each genetic codon had multiple tRNA genes on several different chromosomes.
How are the multiple tRNA genes used properly to construct proteins during the transfer of biological information in organisms? In biological processes, the base sequence of DNA is transcribed to mRNA, and the base sequence of mRNA is then translated into the amino acid sequence by tRNAs. In such cases, a higher homology factor (HF) of a tRNA gene might be one of the features that distinguish the appropriate protein. In other words, the base sequence of DNA was reflected in the amino acid sequence through the base sequence of RNA. Therefore, the above method might be applicable to the interaction sites of DNA, RNA, and protein. In such analyses, the selection of the d- and p-values might be important to obtain the highest resolution of the sequence spectrum corresponding to the structural features of the target DNAs or proteins.
Genomic DNA might be enlarged and reduced because its base sequence had fractality; it therefore showed similarity to related sites, so that a gene could be related to its position on the chromosome. The coding and non-coding regions of a genome differed with respect to bases, as described. As a result, biases of the four bases occurred on genomic DNA [20].
The analyses based on the appearance frequency of the base sequences in a genome should be universally applicable to everything that is expressed by base sequences, not only in Saccharomyces cerevisiae, but also in Homo sapiens, Escherichia coli and all other genomes; therefore, this method might be applied as a first screen to characterize interaction sites in biological phenomena.
5. CONCLUSIONS
The results obtained in this study are summarized as follows.
1) Homologous structure exists in the appearance frequency of short base sequences such as triplets over an entire chromosome in the genome, and the homology factor was deeply affected by the 5'- and 3'-adjoining base sequences when the target DNA was small in size or located at the boundary.
2) Homologous structure was universally observed in a variety of species.
3) The homology of the sequence spectrum of a gene was observed in the appropriate tRNA genes, and the analysis (SSM) of the DNA base sequences might be reflected in that of the protein.
4) In other words, the SSM might serve as a vehicle of biological information and as a suitable prediction method for identifying interacting regions of DNA, RNA or protein under appropriate conditions of “m”, “d” and “p” for each gene or genomic DNA.
5) SSM faithfully reflected the biological information; that is, conservation of the base sequence of genomic DNA also conserved the translated amino acid sequence (the protein sequence) in the coding region.
6) SSM could deal consistently with molecules that consist of base sequences.
6. ACKNOWLEDGEMENTS
The authors wish to thank Dr. Hiroshi Shibata at Sojo University for his comments about the fractal analysis in this research.
REFERENCES
[1] Singer, M. and Berg, P. (1991) Genes & genomes: A changing perspective. University Science Books.
[2] Garrel, J.I. (1997) The yeast proteome handbook. Third edition, Beverly, Proteome Inc.
[3] Velculescu, V.E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M.A., Bassett, D.E.Jr., Hieter, P., Vogelstein, B. and Kinzler, K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243-251.
[4] Wan, X.F., VerBerkmoes, N.C., McCue, L.A., Stanek, D., Connelly, H., et al. (2004) Transcriptomic and proteomic characterization of the fur modulon in the metal-reducing bacterium Shewanella oneidensis. The Journal of Bacteriology, 186, 8385-8400.
[5] Sakharkar, K.R., Sakharkar, M.K., Culiat, C.T., Chow, V.T. and Pervaiz, S. (2006) Functional and evolutionary analyses on expressed intronless genes in the mouse genome. FEBS Letters, 580, 1472-1478.
[6] Karkas, J.D., Rudner, R. and Chargaff, E. (1968) Separation of B. subtilis DNA into complementary strands. II. Template functions and composition as determined by transcription by RNA polymerase. Proceedings of the National Academy of Sciences of the United States of America, 60, 915-920.
[7] Bell, S.J. and Forsdyke, D.R. (1999) Accounting units in DNA. Journal of Theoretical Biology, 197, 51-61.
[8] Abe, T., Kanaya, S., Kinouchi, M., Kudo, Y., Mori, H., et al. (1999) Gene classification method based on batch-learning SOM. Genome Informatics Series, 10, 314-315.
[9] Baisnee, P.-F., Hampson, S. and Baldi, P. (2002) Why are complementary DNA strands symmetric? Bioinformatics, 18, 1021-1033.
[10] Chen, L. and Zhao, H. (2005) Negative correlation between compositional symmetries and local recombination rates. Bioinformatics, 21, 3951-3958.
[11] Albrecht-Buehler, G. (2006) Asymptotically increasing compliance of genomes with Chargaff’s second parity rules through inversions and inverted transpositions. Proceedings of the National Academy of Sciences of the United States of America, 103, 17828-17833.
[12] Wilson, J.T., Wilson, L.B., Reddy, V.B., Cavallesco, C., Ghosh, P.K., et al. (1980) Nucleotide sequence of the coding portion of human alpha globin messenger RNA. Journal of Biological Chemistry, 255, 2807-2815.
[13] Wada, A., Suyama, A. and Hanai, R. (1991) Phenomenological theory of GC/AT pressure on DNA base composition. Journal of Molecular Evolution, 32, 374-378.
[14] Nakamura, Y., Itoh, T. and Martin, W. (2007) Rate and polarity of gene fusion and fission in Oryza sativa and Arabidopsis thaliana. Molecular Biology and Evolution, 24, 110-121.
[15] Paila, U., Kondam, R. and Ranjan, A. (2008) Genome bias influences amino acid choice: analysis of amino acid substitution and re-compilation matrices exclusive to an AT-biased genome. Nucleic Acids Research.
[16] Voss, R.F. (1992) Evolution of long-range fractal correlation and 1/f noise in DNA base sequences. Physical Review Letters, 68, 3805-3809.
[17] Bains, W. (1993) Local self-similarity of sequence in mammalian nuclear DNA is modulated by a 180 bp periodicity. Journal of Theoretical Biology, 161, 13-143.
[18] Weinberger, E.D. and Stadler, P.F. (1993) Why some fitness landscapes are fractal. Journal of Theoretical Biology, 163, 255-275.
[19] Lu, X., Sun, Z., Chen, H. and Li, Y. (1998) Characterizing self-similarity in bacteria DNA sequences. Physical Review E, 58, 3578-3584.
[20] Takeda, M. and Nakahara, M. (2009) Structural Features of the Nucleotide Sequences of Genomes. Journal of Computer Aided Chemistry, 10, 38-52.
[21] NCBI Genome Data Base (2009) http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome.
[22] Crick, F.H. (1968) The origin of the genetic code. Journal of Molecular Biology, 38, 367-379.
[23] International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
[24] Mattick, J.S. (2004) RNA regulation: A new genetics? Nature Reviews Genetics, 5, 316-323.
[25] Lynch, M. (2007) The frailty of adaptive hypotheses for the origins of organismal complexity. Proceedings of the National Academy of Sciences of the United States of America, 104, 8597-8604.
[26] Takeda, M., Chen, W.-H., Saltzgaber, J. and Douglas, M.G. (1986) Nuclear genes encoding the yeast mitochondrial ATPase complex: analysis of ATP1 coding the F1-ATPase α-subunit and its assembly. Journal of Biological Chemistry, 261, 15126-15133.
[27] Takeda, M., Okushiba, T., Hayashida, T. and Gunge, N.
(1994) ATP1 and ATP2, the F1F0-ATPase α and β subunit genes of Saccharomyces cerevisiae, are respectively located on chromosomes II and X. Yeast, 10, 1531-1534.
[28] Mewes, H.W., Albermann, K., Bähr, M., Frishman, D., Gleissner, A., et al. (1997) Overview of the yeast genome. Nature, 387 (supp), 7-65.
[29] Dietrich, F.S., Mulligan, J., Hennessy, K., Yelton, M.A., Allen, E., et al. (1997) The nucleotide sequence of Saccharomyces cerevisiae chromosome V. Nature, 387 (supp), 78-81.
[30] Saccharomyces Genome Database. (2009) (http://www.yeastgenome.org/).
[31] Transfer RNA data base. (2009) (http://gtrnadb.ucsc.edu/).
[32] Matthews, B.W. (1993) Structural and genetic analysis of protein stability. Annual Review of Biochemistry, 62, 139-160.
[33] Kornberg, R.D. (1974) Chromatin structure: a repeating unit of histones and DNA. Science, 184, 868-871.
[34] van Holde, K. and Zlatanova, J. (1995) Chromatin higher order structure: Chasing a mirage? Journal of Biological Chemistry, 270, 8373-8376.