Exploring diversity of different classificatory human tandem repeats

doi:10.4236/jbise.2008.12021


J. Biomedical Science and Engineering, 2008, 1, 127-132

Published Online August 2008 in SciRes. http://www.srpublishing.org/journal/jbise JBiSE


Zhong-Yu Liu1, Xu-Feng Li1, Xian-Ping Ding1, Bo Liu1, Hao Peng2 & Yi Yang1

1College of Life Sciences, Sichuan University, Chengdu, 610064, China. 2Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China.

Correspondence should be addressed to Z.Y. Liu (zhongyujohn@gmail.com).

ABSTRACT

Tandem repeats (TRs) are associated with disease genes, play an important role in evolution, and are important in genomic organization and function. Much research has described the properties of tandem repeats, such as copy number and period, and the correlation between mutations within tandem repeats and disease. This project aims to detect differences that may exist among the features of different tandem repeats associated with disease in the whole human genome. The features of tandem repeats associated with diabetes genes were compared with their counterparts in non-diabetes disease genes.

Availability: TRbase is available at http://www.trbase2.cn/

1. INTRODUCTION

Repetitive DNA sequences have been identified in large quantities in both eukaryotic and prokaryotic genomes [2]. In some cases they account for a large portion of the genome; in the human genome, for example, they are known to contribute around 40-50% of the total DNA sequence. Repetitive DNA is limited in prokaryotes but widely distributed across a large variety of eukaryotes, where it is found throughout the genome in both protein-coding and intergenic regions. One reason tandem repeats are of great interest is their tendency to expand and contract unpredictably. It has been reported that microsatellites are biased towards expanding in length [15]. It has also been reported that repeats within coding regions appear to be under some constraint that hinders their expansion, whereas tandem repeats in untranslated regions do not appear to be so constrained, so much higher copy numbers of these repeats are often present [16].

Tandem repeats are known to have high mutability rates, which cause differences in repeat length between lineages; this implies that these high mutability rates contribute to overall genome evolution. The frequent changes in tandemly repeated regions within genomes, although caused by mutation, are more specifically assumed to be due to slippage during DNA replication or unequal alignment during DNA recombination [17]. However, these processes are not thought to be the sole cause of the differences observed between lineages in the divergence of tandem repeats within their genomes [6, 17]. The current study extends these concepts by investigating the influence that disease may have had on the evolution of tandem repeats that either cause the disease or lie within disease genes. From an evolutionary standpoint, sequences with tandem repeats have several interesting features [9]. Formation of the repeating sequences is an error-prone process, with mutations in genomic DNA repeats occurring far more frequently than the background rate of point mutations [2]. This suggests that repetitive sequences evolve more quickly than non-repetitive sequences. In general, one of the most interesting features of prokaryotic and eukaryotic genomes (both coding and non-coding regions) is the presence of relatively short, perfect, tandemly repeated DNA sequences. These repeated DNA sequences are distributed almost at random throughout the genome [7, 8, 13]. Much research indicates that at least ten kinds of inherited neurological disease, including Huntington’s disease and spinocerebellar ataxia, as well as many less serious diseases such as epilepsy and deafness, are the product of tandem repeat expansions (http://tandem.bu.edu) [10, 18].

Diabetes mellitus is characterized by abnormally high levels of sugar (glucose) in the blood. The most common forms of diabetes are type 1 diabetes (5%), which is an autoimmune disorder, and type 2 diabetes (90%), which is associated with obesity (http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/diabetes/pdf_ch1.pdf). The vast majority of diabetes cases fall into the categories of type 1 and type 2 diabetes. However, up to 5% of cases have other specific causes, including diabetes that results from the mutation of a single gene. About 18 regions of the genome have been linked with type 1 diabetes risk. These regions, each of which may contain several genes, have been labeled IDDM1 to IDDM18. In rare forms of diabetes, mutations of one gene can result in disease. However, in type 2 diabetes, many genes are thought to be involved. “Diabetes genes” may show only a subtle variation in the gene sequence, and these variations may be extremely common. The difficulty lies in linking such common gene variations, known as single nucleotide polymorphisms (SNPs), with an increased risk of developing diabetes. One method of finding diabetes susceptibility genes is the whole-genome linkage study.

128 Z. Y. Liu et al. / J. Biomedical Science and Engineering 1 (2008) 127-132

This study concentrated specifically on statistical comparisons of several features (copy number, percentage match, period, indels, and %GC) of tandem repeats associated with disease; its aim was to find differences in the features of tandem repeats between one specific set of disease genes (in this case, diabetes genes) and other disease genes. If such differences exist, further research will be carried out to explore the relationships among them. This paper presents a preliminary investigation of those features, comparing the features of tandem repeats associated with diabetes genes with those of non-diabetes disease genes selected randomly from all disease genes (excluding diabetes genes) on chromosome 12.

2. IMPLEMENTATION AND METHODS

2.1. TRbase extension

All the disease genes and the relevant data on the features of their tandem repeats were retrieved from TRbase, a web-accessible relational tandem repeat database that relates tandem repeats to gene locations and disease genes of the human genome [1]. For TRbase, DNA sequences and annotations had been retrieved for the completed chromosomes 4, 5, 6, 14, 16, 18, 19, 20, 21, and 22 [1]; however, this project required data on all disease genes and their relevant information across the whole human genome, so TRbase needed to be extended to all human chromosomes before data preparation. DNA sequences and annotations of the remaining chromosomes (1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 15, 17, X and Y) were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?TAXID=9606&MAPS=ideogr,cntg-r,ugHs,genes&CHR=1). All tandem repeats were detected using the TRF program (version 3.01; Benson, 1999) with parameters as in [1], applied to DNA sequences extracted from GenBank in FASTA format using the Seqret program of EMBOSS [12].

2.2. Representative chromosome generation

It has been proposed that vertebrate genomes, including the human genome, are made up of compositionally homogeneous DNA segments based on G+C content [16]. These regions, known as isochores, have been studied using density gradient centrifugation on mechanically sheared DNA in the range of 50-100 kb [16] since their discovery in the 1970s [17]. Isochores are biologically interesting because of the association between increasing G+C content and high gene density [18, 19, 20].

According to Bernardi’s theories, there are five families of isochores, each having a different level of cytosine and guanine (C and G, respectively), as described in Table 1. Two G+C-poor isochore families, L1 and L2, together make up approximately 60% of the human genome. The isochore family L1 is defined as regions with less than 37% G+C content; L2 is defined as regions containing between 37% and 41% G+C. The isochore family H1 forms 24% of the human genome and corresponds to regions between 41% and 46% G+C. The other G+C-rich isochore family, H2, forms 7.5% of the human genome and corresponds to regions containing between 46% and 53% G+C. The final isochore family, H3, forms almost 5% of the genome and corresponds to the very G+C-rich regions with more than 53% G+C.

Table 1. Isochore classifications. The G+C ranges for each of the five isochore classes as defined by Bernardi (2000). A: the L1 and L2 isochore classes together represent 60% of the human genome.

Isochore Class    Percent (G+C)    Percent of Genome
L1                0-37             60.0A
L2                37-41            (combined with L1)
H1                41-46            24.0
H2                46-53            7.5
H3                53-100           4.7
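The family boundaries quoted above can be expressed as a small lookup. The following is a minimal sketch in Python, not code from this study: the function name `isochore_family` is hypothetical, and treating each upper bound as exclusive (so that exactly 37% falls in L2 and exactly 53% in H3) is an assumption, since the text does not say how boundary values are assigned.

```python
def isochore_family(gc_percent: float) -> str:
    """Return the isochore family (L1, L2, H1, H2 or H3) for a G+C percentage."""
    if not 0.0 <= gc_percent <= 100.0:
        raise ValueError("gc_percent must be between 0 and 100")
    # Upper bounds taken from the ranges quoted in the text.
    # Assumption: each upper bound is exclusive.
    for upper_bound, family in ((37.0, "L1"), (41.0, "L2"), (46.0, "H1"), (53.0, "H2")):
        if gc_percent < upper_bound:
            return family
    return "H3"


# Example: a fragment with 39.5% G+C falls in the G+C-poor L2 family.
example_family = isochore_family(39.5)
```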

Since the overall composition of the human genome is approximately 60% AT and 40% GC, the L1 and L2 families correspond to isochore regions with lower than average G+C content, while the H1, H2 and H3 families correspond to isochore regions with higher than average G+C content. Columns 1-4 in Table 2 were created using these guidelines to split the histograms of 75 kb fragments for the various chromosomes at cumulative densities of 60%, 84% and 91.5% (that is, 60%, 60% + 24% and 84% + 7.5%), which would theoretically locate the isochore boundaries. The first three columns in Table 2 were taken from [14]. X1, X2, ..., X9 denote the values of the nine features in each row of Table 2. For each row, $Y = (X_1 - \text{mean})^2 + (X_2 - \text{mean})^2 + \dots + (X_9 - \text{mean})^2$, where mean is the value in the bottom row of Table 2 for the corresponding feature. The lowest value of Y (column 11) corresponded to chromosome 12, which indicated that chromosome 12 was the most representative of all human chromosomes.
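The selection rule, a sum of squared deviations from the column means followed by choosing the smallest Y, can be sketched as below. This is an illustrative Python sketch under stated assumptions: the function name `representativeness_scores` and the three-feature toy values are hypothetical and are not taken from Table 2, and the means are computed from the rows supplied rather than read from the table's bottom row.

```python
from typing import Dict, List


def representativeness_scores(features: Dict[str, List[float]]) -> Dict[str, float]:
    """Return Y, the sum of squared deviations from the column means, per chromosome."""
    names = list(features)
    n_features = len(features[names[0]])
    # Column means over all chromosomes supplied (the role of the bottom row of the table).
    means = [
        sum(features[name][i] for name in names) / len(names)
        for i in range(n_features)
    ]
    return {
        name: sum((x - m) ** 2 for x, m in zip(features[name], means))
        for name in names
    }


# Hypothetical toy values with three features per chromosome, not data from Table 2.
toy_features = {
    "chrA": [40.0, 8.0, 50.0],
    "chrB": [44.0, 10.0, 60.0],
    "chrC": [48.0, 12.0, 70.0],
}
scores = representativeness_scores(toy_features)
most_representative = min(scores, key=scores.get)
```

Because the features are summed without scaling, columns with large numeric ranges contribute most to Y; whether to standardize each feature first is a design choice the formula leaves open.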

2.3. Diabetes genes and control variables generation and parameters selection

All the diabetes genes were identified using the Search Disease Information function of the TRbase website (http://trbase2.cn) [1], which provided their locations within the human genome. The tandem repeats in diabetes genes were identified on the advanced composite search page of the TRbase website. The parameter settings used when selecting diabetes genes were as follows: the High Stringency detection parameter was used; the tandem repeat copy number ranged from 1.9 to 13086.4; the percentage match to the consensus sequence was kept above 70%; and the tandem repeat unit length varied from 1 to 1000. These settings were applied to tandem repeats within intron, exon and intergenic regions. All diabetes genes were searched one by one using the above parameters, and each result formed a table with many columns. The results were combined into a single table, from which all columns except the gene label, copy number, period, %match, indels and consensus were deleted. A new column containing the %G+C of each consensus sequence was then inserted into the table, and the consensus column was deleted. Alternatively, the diabetes gene data were obtained from the MySQL version of TRbase, using MySQL commands equivalent to the steps described above.
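The table preparation described above (apply the search settings, keep five columns, replace the consensus by its %G+C) might look like the following Python sketch. It is not the TRbase interface or its MySQL schema: the `TandemRepeat` record, the field names, the helper functions and the two example records are all hypothetical, and only the numeric thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class TandemRepeat:
    """Hypothetical record for one tandem repeat hit; not the TRbase schema."""
    gene: str
    copy_number: float
    period: int
    percent_match: float
    indels: int
    consensus: str


def gc_percent(consensus: str) -> float:
    """Percentage of G and C bases in a consensus sequence."""
    seq = consensus.upper()
    if not seq:
        raise ValueError("consensus sequence is empty")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_search_settings(tr: TandemRepeat) -> bool:
    """Apply the copy number, %match and period ranges quoted in the text."""
    return (
        1.9 <= tr.copy_number <= 13086.4
        and tr.percent_match > 70.0
        and 1 <= tr.period <= 1000
    )


def feature_row(tr: TandemRepeat) -> dict:
    """Keep the five analysed features; the consensus is replaced by its %G+C."""
    return {
        "gene": tr.gene,
        "copy_number": tr.copy_number,
        "period": tr.period,
        "percent_match": tr.percent_match,
        "indels": tr.indels,
        "gc_percent": gc_percent(tr.consensus),
    }


# Hypothetical example records, for illustration only.
records = [
    TandemRepeat("GENE_A", 12.3, 3, 95.0, 0, "CAG"),
    TandemRepeat("GENE_B", 1.5, 20, 88.0, 2, "ATATATATATGCGCATATAT"),  # copy number below 1.9
]
table = [feature_row(tr) for tr in records if passes_search_settings(tr)]
```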

The non-diabetes disease genes for the two control groups, each described by five features (copy number, period, %match, indels and %G+C), were selected randomly from all disease genes (excluding four diabetes genes) on chromosome 12; the number of disease genes selected randomly in each control group was equal to the number of diabetes genes in the whole genome. The method used to prepare the diabetes gene table was also applied to generate the control 1

Table 2. Selection of a representative chromosome. Columns 1-4 show the data associated with G+C content; the breakpoints of 60%, 84% and 91.5% indicate the boundaries between the defined isochore classes L2-H1, H1-H2 and H2-H3 (Bernardi, 2000). The data in columns 5-10 were retrieved (July 23, 2006) from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd=Retrieve&dopt=Overview&list_uids=2. $Y = (X_1 - \text{mean})^2 + (X_2 - \text{mean})^2 + \dots + (X_9 - \text{mean})^2$, where mean is the value in the bottom row of Table 2 for the corresponding feature. The numbers of genes and protein-coding genes, the nucleotide length, structural RNAs, pseudogenes and contigs of each chromosome were obtained from Entrez Genome at NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome&cmd=Retrieve&dopt=Overview&list_uids=1).

Isochore boundary locations based on total percent of all fragments: X1 = L2-H1 boundary (60% of all fragments); X2 = H1-H2 boundary (84% of all fragments); X3 = H2-H3 boundary (91.5% of all fragments).
Features of each chromosome retrieved from Entrez Genome at NCBI: X4 = gene density (/Mb); X5 = protein-coding density (/Mb); X6 = structural RNAs; X7 = pseudogenes; X8 = contigs; X9 = %coding.

Chromosome  X1    X2    X3    X4     X5     X6   X7      X8     X9  Y
1           44%   49%   51%   11.23  13.65  180  364     39     1   4.32E+0
2           44%   47%   49%   7.68   8.52   49   255     18     1   4043.4
3           41%   47%   49%   7.38   8.47   39   223     5      0   1545.2
4           40%   43%   45%   6.08   6.25   52   207     14     0   366.41
5           41%   44%   46%   7.08   8.26   48   188     6      0   335.33
6           39%   43%   45%   8.91   8.96   197  302     9      1   2.94E+0
7           46%   51%   52%   9.28   10.29  79   240     12     1   2376.7
8           42%   45%   49%   7.01   7.82   47   151     10     0   2567.7
9           47%   53%   54%   8.62   9.74   46   218     39     1   1670.6
10          44%   48%   49%   8.08   10.18  33   157     18     1   2346.5
11          46%   52%   55%   13.69  13.85  67   371     8      1   1.15E+0
12          44%   48%   50%   10.24  11.8   55   195     7      1   105.03
13          41%   44%   47%   4.87   4.43   39   119     5      0   6306.9
14          43%   51%   55%   11.47  8.55   102  248     1      0   4524.5
15          43%   46%   47%   9.58   11.17  115  152     11     1   4334
16          47%   51%   55%   12.47  16.63  57   132     5      1   3928.4
17          49%   52%   54%   18.31  22.47  97   146     10     2   3563.5
18          41%   44%   46%   7.75   6.23   6    79      5      0   16587
19          51%   54%   55%   25.45  29.23  95   133     4      3   5274.3
20          47%   50%   53%   11.48  13.79  20   127     6      1   6416.8
21          50%   55%   56%   7.82   9.2    11   71      4      0   17912
22          50%   54%   56%   15.21  16.06  22   98      9      1   10970
X           40%   43%   45%   8.68   9.28   69   287     17     0   8778.3
Y           39%   42%   43%   5.57   2.65   21   184     17     0   2060.6
mean        43%   48%   51%   9.68   11.14  64   193.63  11.63  1   7.92E+0


Figure 1. Frequency distributions of copy number of tandem repeats in the three data groups: Diabetes, Control 1, Control 2.

Figure 2. Frequency distributions of percentage match of tandem repeats in the three data groups: Diabetes, Control 1, Control 2.

Figure 3. Frequency distributions of period of tandem repeats in the three data groups: Diabetes, Control 1, Control 2.

Figure 4. Frequency distributions of indels of tandem repeats in the three data groups: Diabetes, Control 1, Control 2.


Figure 5. Frequency distributions of %G+C of tandem repeats in the three data groups: Diabetes, Control 1, Control 2.

and control 2 tables, in which the tandem repeats from the control 1 and control 2 groups, respectively, were identified at the TRbase website.

After preparation of the three tables (diabetes genes, control 1, control 2), frequency distributions were plotted and chi-square tests and independent-sample t-tests were performed on the data in the three tables.

3. RESULTS AND DISCUSSION

3.1. Quantitative comparison between tandem repeats in diabetes genes and non-diabetes disease genes

Table 3 was created to determine whether the proportions of tandem repeat-containing genes and non-tandem repeat-containing genes differ between diabetes genes and non-diabetes disease genes.

Table 3. Comparison of the three data groups.

                  Chromosome    TR genes    Non-TR genes    Total
Diabetes genes    All           9           38              47
Control 1         12            23          24              47
Control 2         12            17          30              47
Total                           49          92              141

Table 4. P-values of the t-tests. p > 0.05 means not significantly different; p < 0.05 means significantly different.

                          Copy number    %match    Period    Indels    %G+C
Diabetes vs control 1     0.080          0.000     0.000     0.000     0.056
Diabetes vs control 2     0.128          0.000     0.530     0.000     0.422
Control 1 vs control 2    0.830          0.139     0.001     0.540     0.557

$\chi^2$ tests were performed between (1) diabetes genes and control 1; (2) diabetes genes and control 2; (3) control 1 and control 2. The results of the $\chi^2$ tests are, respectively: $\chi^2$(1) = 10.5, p(1) = 0.01; $\chi^2$(2) = 3.4, p(2) = 0.1; $\chi^2$(3) = 2.12, p(3) = 0.2. The result between diabetes genes and control 1 is inconsistent with the result between diabetes genes and control 2, which indicates that the quantitative distribution of TR versus non-TR genes among disease genes is irregular.
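For readers who want to repeat a comparison of this kind, a 2x2 Pearson chi-square test on the TR and non-TR gene counts of Table 3 can be sketched in Python as follows. The function name `chi_square_2x2` is hypothetical, and the sketch assumes the uncorrected Pearson statistic with one degree of freedom; the paper does not state which variant or software was used, so the output need not match the values quoted above.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Expected count under independence; a zero marginal would divide by zero.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# (TR genes, non-TR genes) counts from Table 3.
diabetes = (9, 38)
control_1 = (23, 24)
control_2 = (17, 30)

chi2_1, p_1 = chi_square_2x2((diabetes, control_1))
chi2_2, p_2 = chi_square_2x2((diabetes, control_2))
chi2_3, p_3 = chi_square_2x2((control_1, control_2))
```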

3.2. Properties of tandem repeats in diabetes genes and non-diabetes disease genes

Frequency distributions were plotted for the five features of tandem repeats. Figures 1-5 show, respectively, the histograms of the frequency distributions of the five features of tandem repeats within the diabetes genes, control 1 and control 2 data groups. In Figure 1 as well as Figure 5, the frequency distributions of the experimental (diabetes) genes, control 1 and control 2 are very similar; the three frequency distributions of period in Figure 3 are not obviously different from each other. In Figure 2 and Figure 4, the two key features, %match and indels, of tandem repeats in diabetes genes differ from those in the control 1 and control 2 genes. This indicates that significant differences exist between the data for diabetes genes and non-diabetes disease genes.
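A frequency distribution of this kind can be computed by binning each feature and dividing the bin counts by the group size. The Python sketch below is illustrative only: the function name `frequency_distribution`, the bin width of 10 and the example %match values are hypothetical, since the paper does not report the binning used for Figures 1-5.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_distribution(values: Iterable[float], bin_width: float) -> Dict[float, float]:
    """Map each bin's lower edge to the fraction of values falling in that bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    values = list(values)
    # Floor division assigns each value to the bin [k * width, (k + 1) * width).
    counts = Counter(int(v // bin_width) for v in values)
    return {k * bin_width: counts[k] / len(values) for k in sorted(counts)}


# Hypothetical %match values for one data group, for illustration only.
percent_match_values = [71.0, 75.5, 82.0, 99.0, 100.0]
distribution = frequency_distribution(percent_match_values, 10)
```

Using relative frequencies rather than raw counts keeps groups with different numbers of tandem repeats comparable on the same axis.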

3.3. Independent sample t-test

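Each feature of tandem repeats was compared pairwise among diabetes genes, control 1 and control 2 using independent-sample t-tests. The results of the t-tests (shown in Table 4) indicate that only the percentage match and indels of tandem repeats differ significantly between diabetes genes and both groups of non-diabetes disease genes. Period differs significantly between diabetes genes and control 1 (p = 0.000) but not between diabetes genes and control 2 (p = 0.530), and the two control groups also differ from each other in period (p = 0.001).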
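The test statistic behind an independent-sample t-test can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function name `pooled_t_statistic` and the example samples are hypothetical, and the equal-variance (pooled) form is an assumption because the paper does not say whether equal variances were assumed. Converting the statistic to a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind` is one existing routine that returns both.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, int]:
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / degrees_of_freedom
    # Standard error of the difference in means; zero if both samples are constant.
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, degrees_of_freedom


# Hypothetical feature values for two groups, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_value, df = pooled_t_statistic(group_a, group_b)
```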

4. CONCLUSION AND FUTURE PERSPECTIVE

The extended TRbase provides a platform to study the associations between disease genes and previously uncharacterized tandem repeats in the whole human genome. Among all features (copy number, percentage match, period, indels, %G+C) of tandem repeats associated with disease genes, statistically significant differences exist only for the %match and indels features of tandem repeats associated with different disease genes. So far, only very preliminary work has been done in mining these differences. Further investigations are being conducted: for example, correlations, regressions and modelling of the differences will be performed, and machine learning methods will be used to train and test the model.

ACKNOWLEDGEMENTS

The authors wish to express sincere appreciation to Dr. A.-M. Patch for instruction in extending TRbase. In addition, special thanks are due to Dr S. J. Aves, whose critical eye and enlightened mentoring were instrumental and inspiring. I wish to acknowledge my gratitude to Dr Z. R. Yang, whose instructions and inspiration to me were helpful throughout the whole project. I also thank Robin Batten for his assistance in source data support.

REFERENCES

[1] Buard, J. and Vergnaud, G. (1994) Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). EMBO J. 13, 3203-3210.

[2] Van Belkum, A., Scherer, S. and Verbrugh, H. (1998) Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 62, 275-293.

[3] Rubinsztein, D.C., Amos, B. and Cooper, G. (1999) Microsatellite and trinucleotide-repeat evolution: evidence for mutational bias and different rates of evolution in different lineages. Phil. Trans. R. Soc. Lond. B 354, 1095-1099.

[4] Sutherland, G.R. and Richards, R.I. (1995) Simple tandem DNA repeats and human genetic disease. Proc. Natl. Acad. Sci. USA 92, 3636-3641.

[5] Tóth, G., Gaspari, Z. and Jurka, J. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967-981.

[6] Hancock, J.M., Worthey, E.A. and Santibáñez-Koref, M.F. (2001) A Role for Selection in Regulating the Evolutionary Emergence of Disease-Causing and Other Coding CAG Repeats in Humans and Mice. Mol. Biol. Evol. 18, 1014-1023.

[7] Kajava, A.V. (2001) Review: Proteins with repeated sequence – structural prediction and modeling. J. Struct. Biol. 134, 132-144.

[8] Huang, C., Lin, Y., Yang, Y., Huang, S. and Chen, C. (1998) The telomeres of Streptomyces chromosomes contain conserved palindromic sequences with potential to form complex secondary structures. Mol. Microbiol. 28, 905-916.

[9] Richard, G.F., Hennequin, C., Thierry, A. and Dujon, B. (1999) Trinucleotide repeats and other microsatellites in yeasts. Res. Microbiol. 150, 589-602.

[10] Heslop-Harrison, J.S. (2003) Tandemly repeated DNA sequences and centromeric chromosomal regions of Arabidopsis species. Chromosome Res. 11, 241-253.

[11] Lalioti, M.D., Scott, H.S., Buresi, C., Bottani, A., Norris, M.A., Malafosse, A. and Antonarakis, S.E. (1997) Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature 386, 847-852.

[12] Wren, J.D., Forgacs, E., Fondon, J.W., Pertsemlidis, A., Cheng, S.Y., Gallardo, T. et al. (2000) Repeat polymorphisms within gene regions: phenotypic and evolutionary implications. Am. J. Hum. Genet. 67, 345-356.

[13] Boby, T., Patch, A.M. and Aves, S.J. (2005) TRbase: a database relating tandem repeats to disease genes for the human genome. Bioinformatics 21(6), 811-816.

[14] Benson, G. (1999) Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-580.

[15] Rice, P., Longden, I. and Bleasby, A. (2000) Mini- and microsatellite expansions: the recombination connection. EMBO Rep. 1, 122-126.

[16] Bernardi, G. (1993) The isochore organization of the human genome and its evolutionary history – a review. Gene 135, 57-66.

[17] Macaya, G., Thiery, J.P. and Bernardi, G. (1976) An approach to the organization of eukaryotic genomes at a macromolecular level. J. Mol. Biol. 108(1), 237-254.

[18] Mouchiroud, D., D'Onofrio, G., Aissani, B., Macaya, G., Gautier, C. and Bernardi, G. (1991) The distribution of genes in the human genome. Gene 100, 181-187.

[19] Gardiner, K. (1996) Base composition and gene distribution: critical patterns in mammalian genome organization. Trends Genet. 12(12), 519-524.

[20] Zoubak, S., Clay, O. and Bernardi, G. (1996) The gene distribution of the human genome. Gene 174, 95-102.

[21] Rouchka, E.C. and States, D.J. (2002) Compositional Analysis of Homogeneous Regions in Human Genomic DNA. Technical Report. Washington University Department of Computer Science, WUCS-2002-2.