In recent years, Jatropha curcas L. has gained popularity as a potential biodiesel plant. The variation in oil content reported among accessions from different agroclimatic zones has necessitated an assessment of the existing genetic variability, so that reliable molecular markers can be generated for selecting high oil yielding varieties. EST-derived SSR markers are more useful than genomic markers because they represent the transcriptome and are therefore directly linked to functional genes. The present report describes the in silico mining of microsatellites (SSRs) using J. curcas ESTs from various tissues, viz. embryo, root, leaf and seed, available in the public domain at NCBI. A total of 13,513 ESTs were downloaded. From these ESTs, 7552 unigenes were obtained and 395 SSRs were identified in 377 SSR-ESTs. These EST-SSRs can be used as potential microsatellite markers for diversity analysis, marker-assisted selection (MAS), etc. Since the Jatropha genes carrying SSRs have been identified in this study, EST-SSRs directly linked to genes will be useful for developing trait-linked markers.
In recent years, Jatropha curcas L. has gained popularity as a potential biodiesel plant. It is commonly known as purging nut or Barbados nut. The plant belongs to the family Euphorbiaceae; it is native to Mexico and Central America and was later introduced into many parts of the tropics and subtropics. J. curcas is commonly known to be a poisonous plant. It is a semi-evergreen shrub or small tree reaching a height of 6 m (20 ft). It can survive arid conditions and can therefore be grown on drylands and wastelands. The seeds of this plant are highly toxic but produce oil that can be used as biodiesel after transesterification, as well as in soap and candle making. Although the plant was traditionally considered a weed, its oil has recently started gaining importance as a “fuel of the future” or “green fuel” and has been in the news, with transport companies eager to run trains, cars and aeroplanes on biodiesel to cut down on both cost and pollution.
The oil content of Jatropha curcas kernels is reported to vary (40% to 58%) among accessions belonging to different agroclimatic zones of India [
DNA markers are not typically influenced by environmental conditions and can therefore be used to describe patterns of genetic variation among plant populations and to identify duplicated accessions within germplasm collections [
The existing information regarding the extent and pattern of genetic variation in J. curcas populations is limited [
There are less popular but extremely useful markers like SSRs (Simple Sequence Repeats) and SNPs (Single Nucleotide Polymorphisms) [
The traditional methods of developing SSR markers are usually time-consuming and labor-intensive [
Expressed Sequence Tags (ESTs) are generated by end sequencing of a large number of randomly picked clones from a cDNA library constructed using mRNA isolated from a specific tissue or a specific developmental stage of an organism. EST-derived SSR markers are generally less polymorphic than genomic SSRs [
The present report describes the in silico mining of microsatellites (SSRs) using J. curcas ESTs from various tissues, viz. embryo, root, leaf and seed, available in the public domain at NCBI. At the time of mining, a total of 13,513 ESTs were available and were downloaded. From these ESTs, 7552 unigenes were obtained, and 395 EST-SSRs were identified in 377 SSR-ESTs. The EST-SSRs obtained computationally in this study can be used as potential microsatellite markers for various studies such as diversity analysis, marker-assisted selection (MAS), etc. Since the Jatropha genes carrying SSRs have been identified in this study, EST-SSRs directly linked to genes will be useful for developing trait-linked markers.
EST sequences of J. curcas were downloaded from NCBI’s dbEST database (http://ncbi.nlm.nih.gov/) [
To find the singletons and to assemble the contigs from the total ESTs, an online tool “EGassembler” (http://egassembler.hgc.jp/) [
The SSR search was carried out for repeat motifs ranging from mono- to hexa-nucleotides. The minimum number of contiguous repeat units required for each motif type was: mononucleotide, 20; dinucleotide, 10; trinucleotide, 7; tetranucleotide, 5; pentanucleotide, 4; and hexanucleotide, 4. The space between SSRs was set to 100 and the space between imperfect SSRs to ≤5. After obtaining the motifs, sequence complementarity was taken into consideration, and complementary motifs such as AG and CT, AC and GT, or AAC and GTT were grouped into a single class under the mono-, di-, tri-, tetra-, penta- or hexa-nucleotides, as appropriate. After the SSRs were obtained, primers were designed from the flanking regions using the same software as for the SSR search. The parameters provided in the software for primer designing are given in
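The search criteria described above can be illustrated with a minimal Python sketch. It is not the software used in this study: it only detects perfect repeats with a regular expression, applies the minimum repeat-unit thresholds listed above, and pairs each motif with its reverse complement. The function names (`find_ssrs`, `motif_class`) are hypothetical, and the settings for spacing between SSRs and for imperfect SSRs are not modelled.

```python
import re

# Minimum number of contiguous repeat units per motif length (values from the text).
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 4, 6: 4}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(_COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit (e.g. 'AGAG' is not)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least (min_rep - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


def motif_class(motif):
    """Label a motif together with its reverse complement, e.g. 'AG/CT'."""
    return "/".join(sorted({motif, reverse_complement(motif)}))
```

For example, `find_ssrs` applied to a hypothetical sequence containing (AG)12 and (AAC)7 reports both repeats, whereas (AG)9 falls below the dinucleotide threshold and is not reported.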
EST sequences from which primers were designed were searched for their gene annotations using BLASTX at The Arabidopsis Information Resource (TAIR) (http://www.arabidopsis.org/index.jsp) [
Parameters for primer designing
Sr. No. | Criteria | Minimum | Maximum | Optimum |
---|---|---|---|---|
1 | Amplicon size (bp) | 150 | 1000 | - |
2 | GC clamps | 0 | - | - |
3 | Primer size (nt) | 18 | 22 | 20 |
4 | Tm (°C) | 55 | 61 | 59 |
5 | GC content (%) | 45 | 50 | - |
6 | Region scanned | Auto | Auto | - |
7 | End stability | 250 | - | - |
The size of the available EST data used in this study has been calculated in accordance with the size of the genome of J. curcas (C = 416 Mb) reported by Carvalho and coworkers [
Overview of the study indicating the major steps and the statistics leading to generation of the EST-SSRs
For searching the SSRs, the repeat motifs in the software were selected from mono- to hexa-nucleotide, as going above this motif range the frequency of occurrence of SSRs is drastically reduced. Thus, the SSRs were obtained in the form of repeat motifs ranging from mono- to hexa-nucleotides. Out of the 7552 unigenes searched for SSRs, 395 SSRs (
The 395 SSRs were present in 377 SSR-ESTs, as 17 (4%) SSR-ESTs contained more than one SSR, e.g. FM889616.1 with 3 SSRs having the motifs (GA)32, (AG)14 and (AG)14 (data not shown). SSRs of the mononucleotide class were the most abundant, with a frequency of one SSR per 17.83 kb, followed by dinucleotide (1/36.00 kb), trinucleotide (1/84.82 kb), pentanucleotide (1/347.00 kb), hexanucleotide (1/381.70 kb) and tetranucleotide (1/424.11 kb).
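The frequencies quoted above are of the form "one SSR per x kb", i.e. the total unigene length divided by the number of SSRs in a class. The short sketch below reproduces that arithmetic. The per-class counts are those reported in this study; the total length of 3817 kb is an assumption, back-calculated from the reported frequencies and consistent with the ~3.8 Mb of unigene data, so small rounding differences from the reported values are to be expected.

```python
# Number of SSRs per class, as reported in this study.
SSR_COUNTS = {"mono": 214, "di": 106, "tri": 45, "tetra": 9, "penta": 11, "hexa": 10}

# ASSUMPTION: total unigene length in kb, inferred from the reported
# frequencies (e.g. 381.70 kb x 10 hexanucleotide SSRs); not stated in the text.
TOTAL_UNIGENE_KB = 3817.0


def kb_per_ssr(total_kb, count):
    """Average length of sequence (kb) searched per SSR found."""
    return total_kb / count


frequencies = {name: kb_per_ssr(TOTAL_UNIGENE_KB, n) for name, n in SSR_COUNTS.items()}

for name, kb in frequencies.items():
    print(f"{name}: 1 SSR per {kb:.2f} kb")
```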
The overall analysis of the distribution of the microsatellites into various classes of the repeat types (mono-, di-, tri-, tetra-, penta- and hexa-nucleotides) showed that the number of the microsatellites decreased with increasing motif size (
The non-dominance of trinucleotides relative to other classes, which gives rise to the decreasing trend across classes with increasing motif size, is in contrast to several earlier studies but in agreement with that reported for several dicots [
In terms of SSR coverage of the available unigene data (~3.8 Mb), a total of 11.7 kb (0.31%) was covered by SSR motifs. Of this, mononucleotides represented 6.5 kb (0.17%), dinucleotides 3.3 kb (0.08%), trinucleotides 1.1 kb (0.03%), tetranucleotides 0.18 kb (0.004%), pentanucleotides 0.26 kb (0.006%) and hexanucleotides 0.24 kb (0.006%).
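As a rough illustration of how such coverage figures arise, the sketch below computes the bases covered by a set of SSRs (motif length multiplied by the number of repeat units) and expresses a covered length as a percentage of the total. The helper names are hypothetical; the example motifs are those reported for FM889616.1, and the total of 3800 kb is the approximate "~3.8 Mb" quoted above, so the percentages are only approximate.

```python
def covered_bp(ssrs):
    """Total bases covered by a list of (motif, repeat_count) pairs."""
    return sum(len(motif) * repeats for motif, repeats in ssrs)


def coverage_percent(covered_kb, total_kb):
    """Covered length as a percentage of the total sequence length."""
    return 100.0 * covered_kb / total_kb


# SSRs reported for FM889616.1: (GA)32, (AG)14, (AG)14.
example_ssrs = [("GA", 32), ("AG", 14), ("AG", 14)]
example_bp = covered_bp(example_ssrs)

# ASSUMPTION: total unigene length taken as the approximate 3.8 Mb from the text.
TOTAL_KB = 3800.0

# Covered lengths per class (kb), as reported in this study.
REPORTED_KB = {"mono": 6.5, "di": 3.3, "tri": 1.1, "tetra": 0.18, "penta": 0.26, "hexa": 0.24}
percent = {name: coverage_percent(kb, TOTAL_KB) for name, kb in REPORTED_KB.items()}
```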
The various classes of repeat motifs, when analyzed further, showed that some motifs in each category were more abundant than others (
Categorization of SSRs by repeat units and repeat motif (columns 4 to >20 give the number of repeat units)
Repeat Type | Repeat Motif | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | >20 | Total | Class total (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mononucleotide | A/T | - | - | - | - | - | - | - | - | - | - | - | - | 211 | 211 | 214 (54%) |
Mononucleotide | G/C | - | - | - | - | - | - | - | - | - | - | - | - | 3 | 3 | |
Mononucleotide | Subtotal | - | - | - | - | - | - | - | - | - | - | - | - | 214 | | |
Dinucleotide | AG/CT | - | - | - | - | - | - | 6 | 4 | 4 | 1 | 4 | 2 | 14 | 35 | 106 (27%) |
Dinucleotide | AT/AT | - | - | - | - | - | - | - | 1 | 2 | 2 | 1 | 1 | 13 | 20 | |
Dinucleotide | AC/GT | - | - | - | - | - | - | - | - | - | - | - | 1 | - | 1 | |
Dinucleotide | TA/TA | - | - | - | - | - | - | 3 | 3 | 1 | - | 1 | 2 | 7 | 17 | |
Dinucleotide | GA/TC | - | - | - | - | - | - | 4 | 7 | 5 | 4 | - | 3 | 10 | 33 | |
Dinucleotide | Subtotal | - | - | - | - | - | - | 13 | 15 | 12 | 7 | 6 | 9 | 44 | | |
Trinucleotide | AAC/GTT | - | - | - | 1 | - | - | - | - | 1 | - | - | - | - | 2 | 45 (11.5%) |
Trinucleotide | AAT/ATT | - | - | - | 1 | 1 | - | - | - | 1 | - | - | - | - | 3 | |
Trinucleotide | ACC/GGT | - | - | - | 1 | - | - | - | - | - | - | - | - | - | 1 | |
Trinucleotide | AGA/TCT | - | - | - | 4 | - | 3 | 1 | 3 | - | 1 | - | - | - | 12 | |
Trinucleotide | AGC/GCT | - | - | - | 2 | - | - | - | - | - | - | - | - | - | 2 | |
Trinucleotide | ATA/TAT | - | - | - | 1 | 2 | - | 1 | 1 | - | - | - | - | - | 5 | |
Trinucleotide | ATG/CAT | - | - | - | - | - | 1 | - | - | - | - | - | - | - | 1 | |
Trinucleotide | CAC/GTG | - | - | - | 1 | - | - | - | - | - | - | - | - | - | 1 | |
Trinucleotide | CAG/CTG | - | - | - | 1 | - | 1 | - | 1 | - | - | - | - | - | 3 | |
Trinucleotide | CTT/AAG | - | - | - | 1 | - | - | - | - | - | - | - | - | - | 1 | |
Trinucleotide | GAA/TTC | - | - | - | 2 | 1 | 1 | - | - | - | - | - | - | - | 4 | |
Trinucleotide | GCA/TGC | - | - | - | 1 | 1 | - | - | - | - | - | - | - | - | 2 | |
Trinucleotide | GGA/TCC | - | - | - | 1 | - | - | - | - | - | - | - | - | - | 1 | |
Trinucleotide | TAA/TTA | - | - | - | 1 | - | 2 | 1 | 2 | - | - | - | - | - | 6 | |
Trinucleotide | TTG/CAA | - | - | - | 1 | - | - | - | - | - | - | - | - | - | 1 | |
Trinucleotide | Subtotal | - | - | - | 19 | 5 | 8 | 3 | 7 | 2 | 1 | - | - | - | | |
Tetranucleotide | AAGA/TCTT | - | 2 | - | - | - | - | - | - | - | - | - | - | - | 2 | 9 (2%) |
Tetranucleotide | AATT/AATT | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | CATA/TATG | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | TATT/AATA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | TTAA/TTAA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | TTAT/ATAA | - | - | 1 | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | TTCT/AGAA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | TTTA/TAAA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Tetranucleotide | Subtotal | - | 8 | 1 | - | - | - | - | - | - | - | - | - | - | | |
Pentanucleotide | AAGAA/TTCTT | - | 1 | - | 1 | - | - | - | - | - | - | - | - | - | 2 | 11 (3%) |
Pentanucleotide | AGGAA/TTCCT | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | ATTTT/AAAAT | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | CTTCT/AGAAG | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | TAAAA/TTTTA | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | TATTT/AAATA | 1 | - | 1 | - | - | - | - | - | - | - | - | - | - | 2 | |
Pentanucleotide | TCTTT/AAAGA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | TTATA/TATAA | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | TTTCT/AGAAA | - | 1 | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Pentanucleotide | Subtotal | 5 | 4 | 1 | 1 | - | - | - | - | - | - | - | - | - | | |
Hexanucleotide | AAAAAG/CTTTTT | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | 10 (2.5%) |
Hexanucleotide | CAGCTC/GAGCTG | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | GCTGGT/ACCAGC | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | GGATCA/TGATCC | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | GTTTCA/TGAAAC | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | TTCCAT/ATGGAA | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | TTTATT/AATAAA | 1 | - | - | - | - | - | - | - | - | - | - | - | - | 1 | |
Hexanucleotide | TTTCTC/GAGAAA | 3 | - | - | - | - | - | - | - | - | - | - | - | - | 3 | |
Hexanucleotide | Subtotal | 10 | - | - | - | - | - | - | - | - | - | - | - | - | | |
Total | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 395 | |
Distribution of SSRs into various classes
Abundance of SSRs of various types
Repeat Types | No. of SSRs | Abundance (%) |
---|---|---|
Mononucleotide | 214 | 54 |
Dinucleotide | 106 | 27 |
Trinucleotide | 45 | 11 |
Tetranucleotide | 9 | 2 |
Pentanucleotide | 11 | 3 |
Hexanucleotide | 10 | 3 |
Most abundant motifs and their relative abundance in each of the SSR types
Repeat Types | Most Abundant Motifs | Relative Abundance (%) |
---|---|---|
Mononucleotide | A/T | 98 |
Dinucleotide | AG/CT | 33 |
Trinucleotide | AGA/TCT | 27 |
Tetranucleotide | AAGA/TCTT | 22 |
Pentanucleotide | AAGAA/TTCTT, TATTT/AAATA | 18 each |
Hexanucleotide | TTTCTC/GAGAAA | 30 |
26.6% and rest of them ranging from 2% - 13% of the total microsatellites in this class. The CCG/CGG motif is reported to be the rarest motif in dicots [
The analysis of repeat units under each motif class revealed a varying range of repeat units in each class. In dinucleotide motifs, repeat units ranged from 10 - 45; in trinucleotide motifs, from 7 - 13; in tetranucleotide motifs, from 5 - 6; in pentanucleotide motifs, from 4 - 7; and the hexanucleotide motif was represented by a single class of 6 repeat units only. Further analysis of the number of repeat units in every class of the SSRs, especially tri-, tetra-, penta- and hexa-nucleotides, showed that the number of microsatellites decreased with increasing number of repeat units, with little variation; e.g. among trinucleotide SSRs, those with 7 repeat units accounted for 42.2% of the class while those with 13 repeat units accounted for 2.2%. Amongst the pentanucleotide SSRs, the category with 4 repeat units accounted for as much as 45.5% of the class, compared with 9% for seven repeat units (
For the use of SSRs as markers, it is necessary to design the primers. The SSRs commonly used for marker development are those belonging to di-, tri- and tetra-nucleotides [
For each of the SSRs, a pair of forward and reverse primers was designed by the software from the flanking regions of the respective SSR-EST. A total of 181 SSRs from 172 SSR-ESTs were used for primer designing and yielded 79 SSR primer pairs (data not shown). These 79 primer pairs were designed from 76 SSR-ESTs, as some of these contained more than one SSR, e.g. JES 56 and 57 (Supplementary
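The way a candidate primer pair could be screened against the design parameters used in this study can be sketched as follows. This is an illustration only, not the primer-design software itself: it checks primer size (18 - 22), GC content (45% - 50%) and amplicon size (150 - 1000) from the parameter table, assuming units of nucleotides, percent and base pairs respectively. Tm, GC clamps and end stability are not checked, because the model the software uses for them is not stated. All function names are hypothetical.

```python
# Limits (minimum, maximum) taken from the primer-design parameter table.
PRIMER_SIZE = (18, 22)        # nucleotides (assumed unit)
GC_CONTENT = (45.0, 50.0)     # percent
AMPLICON_SIZE = (150, 1000)   # base pairs (assumed unit)


def gc_percent(primer):
    """Percentage of G and C bases in a primer sequence."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def within(value, limits):
    """True if value lies inside the inclusive (minimum, maximum) limits."""
    low, high = limits
    return low <= value <= high


def primer_ok(primer):
    """Check one primer against the size and GC-content limits."""
    return within(len(primer), PRIMER_SIZE) and within(gc_percent(primer), GC_CONTENT)


def pair_ok(forward, reverse, amplicon_size):
    """Check a primer pair and its expected amplicon size."""
    return (
        primer_ok(forward)
        and primer_ok(reverse)
        and within(amplicon_size, AMPLICON_SIZE)
    )
```

A pair would be retained only if both primers and the amplicon fall inside all three ranges; in practice the Tm window (55 - 61, optimum 59) would be applied as a further filter.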
Distribution of SSRs as per repeat unit size in different types
The GC level of the genome of J. curcas is typical of core dicots, therefore, it should be easy to annotate by sequence comparison with Arabidopsis [
The data showed that most of the ESTs-PD express functional proteins, although for some the protein is not yet predicted. On the basis of the functions of the predicted proteins, the ESTs-PD were classified into three major classes, viz. Cellular Component, Biological Process and Molecular Function (
Functional categorization of ESTs-PD by loci A: Cellular component, B: Biological process, C: Molecular function
The in silico mining of EST-SSRs of Jatropha was carried out in this study by taking advantage of the enormous EST data available in the public database, the importance of ESTs in SSR mining, and the speed and ease of modern bioinformatics tools. The stringency of the preset parameters was kept high so as not to compromise the level of polymorphism in the potential EST-SSRs, and thus their utility as markers; this matters all the more in this study because low levels of polymorphism have been reported in Jatropha. The functional annotation of the SSR-ESTs showed that most of them are associated with expressed proteins and, therefore, with trait-linked genes. Thus, the genes of Jatropha carrying SSRs were identified in this study, and the EST-SSRs generated would be useful for developing trait-linked markers. As expressed sequences are highly conserved, SSRs developed from ESTs are characterized by transferability across species. Owing to this characteristic, these SSRs could also be useful as markers in closely related species like Ricinus, or in related species with limited or no sequence information, thus saving the time and resources needed to repeat SSR mining. EST-SSRs like JES35, generated from an EST expressing a gene of the fatty acid biosynthesis pathway (AT1G48750), would be of utmost importance for marker development in Jatropha. With more data being submitted to the public database at a rapid pace, more such SSRs can be sought in comparative genomic studies. The knowledge generated in this study is a step towards the development of markers in this plant and in related species.
The authors are thankful to the Department of Biotechnology, Government of India, for financial support for research and for the fellowships of two authors, Dr. Poonam Bhargava and Mr. Ganesh B. Patil, and to the Puri foundation for Education in India for providing infrastructure facilities.