Paper Menu >>
Journal Menu >>
J. Biomedical Science and Engineering, 2008, 1, 127-132 Published Online August 2008 in SciRes. http://www.srpublishing.org/journal/jbise JBiSE Exploring diversity of different classificatory hu- man tandem repeats Zhong-Yu Liu1, Xu-FengLi1, Xian-PingDing1, Bo Liu1, Hao Peng2&Yi Yang1 1College of Life Sciences, Sichuan University, Chengdu, 610064, China. 2Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China. Correspondence should be addressed to Z.Y. Liu (zhongyujohn@gmail.com). ABSTRACT Tandem repeats (TRs) are associated with dis- ease genes, play an important role in evolution and are important in genomic organization and function. Much research has been done on de- scriptions of properties of tandem repeats, such as copy-number, period, etc, and correlation be- tween mutations within tandem repeats and dis- ease. This project aims to detect some differ- ences w hich may exist within the features of dif- ferent tandem repeats associated w ith disease in human whole-genome. The features of tandem repeats associated with diabetes genes were compared to the counterparts of non-diabetes disease genes. Availability: TRbase is available at http://www.trbase2.cn/ 1. INTRODUCTION Repetitive DNA sequences have been identified in large quantities in both eukaryotic and prokaryotic genomes [2]. In some cases they can account for a large portion of the genome, for example, in the human genome they have been known to contribute around 40-50% of the total DNA sequence. The existence of repetitive DNA in pro- karyotes is limited, but it is found widely distributed throughout a large variety of eukaryotes, and can be found throughout the genome in both protein coding re- gions and inergenic regions. One of the reasons tandem repeats are of great interest is because of their nature to expand and contract unpredictably. It has been reported that microsatellites are biased towards expanding in length [15]. It has also been reported that repeats within coding regions appear to have some kind of constraint hindering their expansion, whereas tandem repeats in untranslated regions do not appear to have these con- straints, therefore much higher copy numbers of these repeats are often present [16]. Tandem repeats are known to have high mutability rates which cause differences in repeat length between lineages; this implies that these high mutability rates con- tribute to overall genome evolution. The frequent changes in tandemly repeated regions within genomes, although caused by mutation, are more specifically assumed to be due to slippage during DNA replication or unequal alignment during DNA recombination [17]. However these processes are not thought to be the sole cause of the differences observed between lineages in the divergence of tandem repeats within their genomes [6, 17]. Within this current study these concepts have been expanded by investigating the influence that disease may have had on the evolution of tandem repeats that either cause the dis- ease or are within disease genes. From an evolutionary standpoint, the sequences with tandem repeats have sev- eral interesting feature [9], Formation of the repeating sequences is an error-prone process, with mutations in genomic DNA repeats occurring far more frequently than the background rate of point mutations [2]. This suggests that repetitive sequences evolve more quickly than non- repetitive sequences. In general, one of the most interest- ing features of prokaryotic and eukaryotic genomes (both coding and non-coding regions) is the presence of rela- tively short perfect tandemly repeated DNA sequences. These repeated DNA sequences are distributed almost at random throughout the genome [7, 8, 13]. Much research indicates that at least ten kinds of inherited neurological disease including Huntington’s disease and spinocerebel- lar ataxia, as well as many less serious diseases such as epilepsy and deafness, are known to be the product of tandem repeat expansions (http://tandem.bu.edu) [10, 18]. Diabetes mellitus is characterized by abnormally high levels of sugar (glucose) in the blood. The most common forms of diabetes are type 1 diabetes (5%), which is an autoimmune disorder, and type 2 diabetes (90%), which is associated with obesity (http://www.ncbi.nlm.nih.gov/ books/bookres.fcgi/diabetes/pdf_ch1.pdf). The vast ma- jority of diabetes cases fall into the categories of type 1 and type 2 diabetes. However, up to 5% of cases have other specific causes and include diabetes that results from the mutation of a single gene. About 18 regions of the genome have been linked with influencing type 1 dia- betes risk. These regions, each of which may contain sev- eral genes, have been labeled IDDM1 to IDDM18. In rare forms of diabetes, mutations of one gene can result in disease. However, in type 2 diabetes, many genes are thought to be involved. “Diabetes genes” may show only as subtle variation in the gene sequence, and these varia- tions may be extremely common. The difficulty lies in linking such common gene variations, known as single nucleotide polymorphisms (SNPs), with an increased risk of developing diabetes. One method of finding the diabe- tes susceptibility genes is by whole-genome linkage stud- ies. SciRes Copyright © 2008 128 Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132 SciRes Copyright © 2008 JBiSE This study specifically concentrated on the statistical comparisons of some features (copy number, percentage match, period, indels, and %GC ) of tandem repeats asso- ciated with disease; its aim was to find some differences among features of tandem repeats between one specific set of disease genes (in this case, these are diabetes genes) and other disease genes. If those differences existed, fur- ther research would be carried out to explore the relation- ship between those diversities. This paper will present an inchoative detection of those features in terms of com- parison of features of tandem repeats associated with dia- betes genes and non-diabetes disease genes selected ran- domly from all disease genes (excluding diabetes genes) on chromosome 12. 2. IMPLEMENTATION AND METHODS 2.1. TRbase extension All the disease genes and relevant data about the features of tandem repeats of disease genes were retrieved from a web-accessible relational tandem repeats database TRbase that relates tandem repeats to gene locations and disease genes of the human genome [1]. DNA sequences and annotations were retrieved for the completed chromo- somes 4, 5, 6, 14, 16, 18, 19, 20, 21, and 22 [1]; however this project required data on all those disease genes and their relevant information in whole-human genomes, in- dicating that TRbase need extending to all human chro- mosomes prior to data preparations. DNA sequences and annotations of the remaining chromosomes (1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 15, 17, X and Y) were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/mapview/maps. cgi?TAXID=9606&MAPS=ideogr,cntg-r,ugHs,genes&C HR=1), all tandem repeats were detected using the TRF program (version 3.01; Benson, 1999) with parameters as in [1] applied to DNA sequences extracted from GenBank in the FASTA format using the Seqret program of EMBOSS [12]. 2.2. Representative chromosome generation It has been proposed that vertebrate genomes, including human, are made up of compositionally homogeneous DNA segments based on G+C content [16]. These re- gions, known as isochores, have been studied diabetes genes using density gradient centrifugation on mechani- cally sheared DNA in the range of 50-100 kb [16] since their discovery in the 1970s [17]. Isochores are biologi- cally interesting due to the association between increasing G+C content and high gene density [18, 19, 20 ]. According to Bernardi’s theories, there are five fami- lies of isochores, each having a different level of cytosine and guanine (C and G, respectively) as described in Ta- ble 1. There are two G+C-poor isochore families L1 and L2 that make up approximately 60% of the humangenome. The isochore family L1 is defined to be regions corre- sponding to less than 37% G+C content; L2 is defined to be regions containing between 37% and 41% G+C. The Table 1. Isochore classifications. Isochore classifications are the GC ranges for each of the five isochore classifications as defined by Bernardi (2000). ANote that the L1 and L2 isochore classes together represent 60% of the human genome. Isochore Class Percent (G+C) Percent of Genome L1 0-37 L2 37-41 60.0A H1 41-46 24.0 H2 41-46 7.5 H3 53-100 4.7 isochore family H1 forms 24% of the human genome and corre-sponds to regions between 41% and 46% G+C. The other G+C rich isochore family H2 forms 7.5% of the human genome and corresponds to those regions contain- ing between 46% and 53% G+C. The final isochore fam- ily, H3 forms almost 5% of the genome and corresponds to those very G+C rich regions which are greater than 53% G+C. Since the overall composition of the human genome is approximately 60% AT and 40% GC, the L1 and L2 families correspond to isochore regions containing less than average G+C content while the H1, H2, and H3 families correspond to isochore regions containing higher than average G+C content. Columns 1-4 in Table 2 were created using these guidelines to split the histograms for 75 kb fragments for the various chromosomes into densi- ties of 60%, 84%, and 91.5%, which would theoretically find the isochore boundaries. The first three columns in Table 2 were retrieved from an article of [14]. X1, X2 …. X9 stands for the substantial value of each feature in each row in Table 2. In each row, Y = (X1- mean) 2 + (X2- mean) 2…. + (X9- mean) 2 (mean is the value in bottom row in Table 2). The lowest value in column 11 corre- sponded to chromosome 12, which indicated that chro- mosome 12 was the most representative one in all human chromosomes. 2.3. Diabetes genes and control variables genera- tion and parameters selection All the diabetes genes were detected by the Search Dis- ease Information at TRbase website (http://trbase2.cn) [1], which provided information of their location within the human genome. The tandem repeats in diabetes genes were identified at the advanced composite search page at the TRbase website. When selecting diabetes genes, the parameter settings are: High Stringency detection parame- ter was used; the tandem repeat copy number ranged from 1.9 to 13086.4; the percentage match to the consensus sequence was kept above 70%; the tandem repeat unit length varied from 1 to 1000. The preceding parameter settings were applied to tandem repeats within intron, exon and intergene regions. All diabetesgenes were iden tified one by one using the above parameters, and each result formed a table with many columns. The col- Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132 129 SciRes Copyright © 2008 JBiSE umns(excluding labeled gene, copy Number, pe- riod,%Match, Indels and Consensus) would be deleted in the combined table. A new column composed of the % G+C of each consensus was inserted into the table with deletion of consensus column. Alternatively, diabetes genes data were gained from the MySQL TRbase, using MySQL commands equivalent to the processes stated above. The non-diabetes disease genes [with five features (copy Number, period, % Match, indels and % G+C)] for two control groups were selected randomly from all dis- eases genes (excluding four diabetes genes) on chromo some 12; the number of diseases genes selected randomly in the two control groups was equal to the number of dia- betes genes in the whole genome. The method to prepare the diabetes gene table was applied to generate control 1 Table 2. Representative chromosome dependent on all those data. Columns 1-4 shows that those data associated with G+C contents and the breakpoints of 60%, 84%, and 91.5% indicate breakpoints for the defined isochore classes L2-H1, H1-H2, and H2-H3 (Bernardi, 2000). The data in columns 5-10 retrieved (July 23, 2006) from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd= Re- trieve&dopt= Overview&list_uids=2). Y = (X1- mean) 2 + (X2- mean) 2…. + (X9- mean) 2 (mean is the value in bottom row in Table 2).The number of genes and protein coding, the nucleotide length, structural RNAs, Pseudogenes, contigs of each chromosome were obtained from Entrez Genome in NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd=Retrieve&dopt=Over- view&list_uids=1. Isochore Boundary locations based on totalercent of all frag- ments Some Features of each chromosome retrieved from Entrez ge- nome in NCBI Chromosome 60% of all fragment L2-H1 Boundary (X1) 84% of all fragment H1- H2 Boundary (X2) 91.5%of all fragment H2-H3 Boundary (X3) Gene den- sity (/Mb) (X4) Pro- tein coding den- sity (/Mb) (X5) Struc- tural RNAs (X6) Pseud o genes (X7) Con- tigs (X8) %co ding (X9) Y 1 44% 49% 51% 11.2313.65180 364 39 1 4.32E+0 4 2 44% 47% 49% 7.68 8.52 49 255 18 1 4043.4 3 41% 47% 49% 7.38 8.47 39 223 5 0 1545.2 4 40% 43% 45% 6.08 6.25 52 207 14 0 366.41 5 41% 44% 46% 7.08 8.26 48 188 6 0 335.33 6 39% 43% 45% 8.91 8.96 197 302 9 1 2.94E+0 4 7 46% 51% 52% 9.28 10.2979 240 12 1 2376.7 8 42% 45% 49% 7.01 7.82 47 151 10 0 2567.7 9 47% 53% 54% 8.62 9.74 46 218 39 1 1670.6 10 44% 48% 49% 8.08 10.1833 157 18 1 2346.5 11 46% 52% 55% 13.6913.8567 371 8 1 1.15E+0 4 12 44% 48% 50% 10.2411.8 55 195 7 1 105.03 13 41% 44% 47% 4.87 4.43 39 119 5 0 6306.9 14 43% 51% 55% 11.478.55 102 248 1 0 4524.5 15 43% 46% 47% 9.58 11.17115 152 11 1 4334 16 47% 51% 55% 12.4716.6357 132 5 1 3928.4 17 49% 52% 54% 18.3122.4797 146 10 2 3563.5 18 41% 44% 46% 7.75 6.23 6 79 5 0 16587 19 51% 54% 55% 25.4529.2395 133 4 3 5274.3 20 47% 50% 53% 11.4813.7920 127 6 1 6416.8 21 50% 55% 56% 7.82 9.2 11 71 4 0 17912 22 50% 54% 56% 15.2116.0622 98 9 1 10970 X 40% 43% 45% 8.68 9.28 69 287 17 0 8778.3 Y 39% 42% 43% 5.57 2.65 21 184 17 0 2060.6 mean 43% 48% 51% 9.68 11.1464 193.6 3 11.63 1 7.92E+0 3 130 Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132 SciRes Copyright © 2008 JBiSE Figure 1. Frequency distributions of copy number of tandem repeats in three data group: Diabetes, Control 1, Control 2. Figure 2. Frequency distributions of percentage of match of tandem repeats in three data group: Diabetes, Control 1, Control 2. Figure 3. Frequency distributions of period of tandem repeats in three data group: Diabetes, Control 1, Control 2. Figure 4. Frequency distributions of indels of tandem repeats in three data group: Diabetes, Control 1, Control 2. Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132 131 SciRes Copyright © 2008 JBiSE Figure 5. Frequency distributions of %G+C of tandem repeats in three data group: Diabetes, Control 1, Control 2. and control 2 tables in which the tandemrepeats from control 1 and control 2 groups respectively were identi- fied at TRbase website. After preparation of the three tables (diabetes genes, control 1, control 2), distributions, chi-square test and independent-sample t-test were performed for the data in the three tables. 3. RESULTS AND DISCUSSION 3.1. Quantitative comparison between tandem repeats in diabetes genes and non-diabetes dis- ease gene In order to identify whether differences exist in quantity between tandem repeat-containing genes and non- tandem repeat-containing genes of diabetes and non- diabetes disease genes, Table 3 was created. χ2 tests were performed between (1) diabetes genes and control 1; (2) diabetes genes and control 2; (3) con- trol 1 and control 2. The results of the χ2 tests are re- spectively: χ2 (1)=10.5, p(1)=0.01; χ2 (2)=3.4 , Table 3. Comparison of the 3 group data. ChromosomeTR genes Non-TR genes Total Diabetes genes All 9 38 47 Control 1 12 23 24 47 Control 2 12 17 30 47 Total 49 92 141 Table 4. P-value of t-test. p >0.05 means not significantly differ- ent; p < 0.05 means significantly different. Copy Num- ber %match Period Indels %G+C Diabetes vs control 1 0.080 0.000 0.000 0.000 0.056 Diabetes vs control 2 0.128 0.000 0.530 0.000 0.422 Control 1 vs control 2 0.830 0.139 0.001 0.540 0.557 p(2)=0.1; χ2 (3)=2.12 , p(3)=0.2. The result between diabetes genes and control 1 is inconsistent with the re- sult between dia-betes genes and control 2, which indi- cates that the quantitative distribution of TRs versus non- TRs in diseases genes is irregular. 3.2. Property of tandem repeats in diabetes genes and non-diabetes diseases genes Frequency distributions were plotted for the five features of tandem repeats. In Figure 1-5 are respectively the histograms of frequency distributions of the five features of tandem repeats within the diabetes genes, control1 and control 2 data groups. In Figure 1 as well as Figure 5, the frequency distributions of experimental (diabetes) genes, contro1 1 and control 2 are very similar; the three frequency distributions of period in Figure 3 are not obviously different from each other. In the Figure 2 and Figure 4, the key two items of %match and indels of tandem repeats in diabetes genes differ from in control 1and control 2 genes. This means that the significant differences exist between the data in diabetes genes and non-diabetes disease genes. 3.3. Independent sample t-test The data were compared pairwise between each feature of tandem repeats in diabetes genes, control 1 and con- trol 2, Using independent sample t-tests. The results (shown in Table 4) of t-test manifest that only percent age match and indels of tandem repeats have significant differences between diabetes genes and non-diabetes disease genes. 4. CONCLUSION AND FUTURE PER- SPECTIVE TRbase extended provides a platform to study the asso- 132 Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132 SciRes Copyright © 2008 JBiSE ciations between disease genes and previously uncharac- terized tandem repeats in whole human genomes. In all features (copy number, percentage match, period, indels, %G+C) of tandem repeats associated with disease genes, statistically significant differences only exist for %match and indels features of tandems repeats associated with different disease genes. Currently, just a very prelimi- nary research work has been done in mining of those differences, further investigations are being conducted, for example, correlations, regressions and modelling of the differences will be performed, and use machine learning methods to train and test the model. ACKNOWLEDGEMENTS The authors wish to express sincere appreciation to Dr. A.-M. Patch for instruction in extending TRbase. In addition, special thanks are due to Dr S. J. Aves, whose critical eye, and enlightened mentoring were instrumental and inspiring. I wish to acknowledge my gratitude to Dr Z. R. Yang, whose instructions and inspiration to me were helpful throughout the whole project. I also thank Robin Batten for his assis- tance in source data support. REFERENCES [1] Buard, J. and Vergnaud, G. (1994) Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). EMBO J. 13, 3203-3210. [2] Van Belkum, A., Scherer, S., and Verbrugh, H. (1998) Short- sequences DNA repeats in prokaryotic genomes, Microbiol. Mol.Biol.Rev. 62, 275-293. [3] Rubinsztein. D.C., Amos, B, and Cooper, G. (1999) Microsatellite and trinucleotide-repeat evolution: evidence for mutational bias and different rates of evolution in different lineages. Phil. Trans. R. Soc. Lond. B 354, 1095-1099. [4] Sutherland, G..R.and Richards, R.I. (1995) Simple tandem DNA repeats and human genetic disease. Proc. Natl. Acad. Sci. USA 92 3636-3641. [5] Tóth,G., Gaspari,Z. and Jurka,J. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res., 10, 967– 981. [6] Hancock, J.M., Worthey, E.A. and Santibez-Koref, M.F. (2001) A Role for Selection in Regulating the Evolutionary Emergence of Disease-Causing and Other Coding CAG Repeats in Humans and Mice. Mol. Biol. Evol. 18, 1014-1023. [7] Kajava .A.V (2001) Review: Proteins with repeated sequence – structural prediction and modeling. Jour. Stru. Biol. 134, 132-144. [8] Huang C., Lin Y., Yang Y., Huang S. and Chen C. (1998). The telomeres of Streptomyces chromosomes contain conserved palin- dromic sequences with potential to form complex secondary struc- tures. Mol. Microbiol. 28, 905–916. [9] Richard G. F., Hennequin C., Thierry A. and Dujon B. (1999) Tri- nucleotide repeats and other microsatellites in yeasts. Res. Micro- biol. 150, 589–602. [10] Heslop-Harrison J. S. (2003( Tandemly repeated DNA sequences and centromeric chromosomal regions of Arabidopsis species. Chromosome Res. 11, 241–253. [11] Lalioti M. D., Scott H. S., Buresi C., Bottani A., Norris M. A., Malafosse A. and Antonarakis S. E. (1997) Dodecamer repeat ex- pansion in cystatin B gene in progressive myoclonus epilepsy. Na- ture 386, 847–852. [12] Wren J. D., Forgacs E., Fondon J. W., Pertsemlidis A., Cheng S. Y. and Gallardo T. et al. 2000 Repeat polymorphisms within gene re- gions: phenotypic and evolutionary implications. Am. J. Hum. Genet. 67, 345–356.. [13] Boby T.., Patch A.M., Aves S.J. (2005) TRbase: a database relat- ing tandem repeats to disease genes for the human genome. Bioin- formatics, 21(6):811-6. [14] Benson,G. (1999) Tandem Repeats Finder: a program to analyze DNA sequences.Nucleic Acids Res., 27, 573–580. [15] Rice. p., Longdnen, I. and Bleasby, A. (2000) Mini- and microsa- tellite expansions: the recombination connection. EMBO Rep., 1, 122-126. [16] Bernardi, G. (1993) "The isochore organization of the human genome and its evolutionary history – a review."Gene, 135:57-66. [17] Macaya, G., Thiery, J.P., Bernardi, G. (1976) "An approach to the organization of eukaryotic genomes at a macromolecular level." Journal of Molecular Biology,108(1): 237-254. [18] Mouchiroud, D., D'Onofrio, G., Aissani, B, Macaya, G., Gautier, C. Bernardi, G. (1991) "The distribution of genes in the human ge- nome." Gene, 100:181-187. [19] Gardiner,K. (1996) "Base composition and gene distribution: critical patterns in mammalian genome organization." Trends in Genetics, 12(12):519-524. [20] Zoubak, S., Clay, O., Bernardi, G. (1996) "The gene distribution of the human genome." Gene, 174:95-102. [21] Rouchka, E.C., States, D.J. (2002) .Compositional Analysis of Homogeneous Regions in Human Genomic DNA. Technical Re- port. Washington University Department of Computer Science, WUCS-2002-2 |