Chitinases catalyze the hydrolysis of chitin, a linear homopolymer of β-(1,4)-linked N-acetylglucosamine. The broad range of applications of chitinolytic enzymes makes their identification and study very promising. Metagenomic approaches offer access to functional genes in uncultured representatives of the microbiota and hold great potential for the discovery of novel enzymes, but tools to explore these data extensively are still scarce. In this study, we developed a chitinase mining pipeline to facilitate the comprehensive search for these enzymes in environmental metagenomic databases and to explore phylogenetic relationships among the retrieved sequences. Fungal and bacterial chitinase sequences from UniProtKB belonging to glycoside hydrolase (GH) families 18, 19 and 20 were used to generate 15 reference datasets, which were then used to build high-quality seed alignments with the MAFFT program. Profile Hidden Markov Models (pHMMs) were built from each seed alignment using the hmmbuild program of the HMMER v3.0 package. The best-hit sequences returned by hmmsearch against two environmental metagenomic databases (Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis, CAMERA, and Integrated Microbial Genomes, IMG/M) were retrieved and further analyzed. The Neighbor-Joining (NJ) trees generated for each chitinase dataset showed some variability in the catalytic domain region of the metagenomic sequences and revealed common sequence patterns among all the trees. Scanning the retrieved metagenomic sequences for chitinase conserved domains/signatures with both InterPro and RPS-BLAST confirmed the efficacy and sensitivity of our pHMM-based approach in detecting putative chitinase sequences. These analyses provide insight into the potential reservoir of novel molecules in metagenomic databases while supporting the chitinase mining pipeline developed in this work.
By using our chitinase mining pipeline, a larger number of previously unannotated metagenomic chitinase sequences can be classified, enabling further studies on these enzymes.
Enzymes are catalysts that support the development of environmentally friendly industrial processes. At present, most industrial enzymes of major importance are of microbial origin, so the search for novel catalysts of this kind is a key step towards the development of innovative bioprocesses. Chitinases are enzymes responsible for the hydrolysis of chitin, a linear homopolymer of β-(1,4)-linked N-acetylglucosamine, which is the second most abundant biopolymer in nature. A set of different enzymes is needed to drive the complete hydrolysis of chitin to free N-acetylglucosamine (GlcNAc), involving diverse modes of action known to be synergistic and consecutive [
Based on amino acid sequence similarities, these chitinolytic enzymes are classified into glycoside hydrolase (GH) families 18, 19 and 20 [
The low discovery rate of novel natural products from culturable microorganisms [
Typical metagenomic analyses rely on similarity searches against reference databases, followed by annotation of the output. The most frequently used similarity search tool is BLAST [
The aim of this work was to develop and validate a data mining strategy based on profile HMMs (pHMMs) to broadly explore environmental metagenomic databases for putative chitinase sequences. The results confirmed the efficacy of our pipeline in detecting chitinase sequences and highlighted the power of pHMM-based strategies to identify remote homologues.
Curated fungal and bacterial amino acid sequences of chitinases belonging to glycoside hydrolase (GH) families 18, 19 and 20 were retrieved from the UniProtKB version 2011-06 database (http://www.uniprot.org) in July 2011. A total of 170, 13 and 46 sequences were collected for GH family-18, GH family-19 and GH family-20, respectively. GH family-18 sequences UniProtKB IDs were: P04067, P36912, P80036, F3Y8V4, D6ESW9, F3NDC4, O83008, Q6T6I1, P07254, A8G807, Q9ALZ0, Q8KKF5, Q9L5D5, Q43919, Q25BN2, Q8GHI4, P32823, A7M6A0, Q9WX41, Q9AMP1, C3LU56, D2YR61, D2YAB4, Q48373, Q5MYT4, Q9RCG5, Q56077, Q845S2, O30678, Q09IY6, Q6BCF8, A5YRG4, B7UB89, P20533, Q48494, Q9KHB3, C6IW88, B1VBB0, A6FD95, P96168, A6CVZ0, Q9CE95, Q7PC52, Q9Z493, Q59143, Q59924, Q9KY99, Q59141, D0VV10, D0VV09, Q81A65, D5TUL7, P11797, A6B8H6, A7LHM6, B2TQ75, B8DGV4, C1IAI6, D0ELI3, D0WTF8, D1RSJ9, D3YGV3, E3YUT8, F3RZA6, O50076, Q0MRC3, Q1EM71, Q547S1, Q5WKC0, Q8KWS2, Q99PX0, Q9KED7, Q9REI6, O69311, D5ZUF3, D6EQC0, O86826, Q9S5K1, Q05638, Q09WI7, A0Q8N1, F4BBA4, E2MRS9, B2SEL0, P36909, Q6A4C3, Q700B8, Q75ZW9, Q7PC51, Q8KVU8, Q9L8G0, Q9Z9M8, Q9ZIX2, Q8RQP6, B1W0A0, D6ANP5, A4GZI8, B5H9B1, D5ZXC4, P11220, P27050, Q9Z9M7, E3FMX3, Q099U8, Q1CZN0, Q092X1, C6J4E8, E8U3R7, D6EPZ7, D9XVU0, D9XI74, B5I3A2, A7UGE4, Q12735, Q9UV45, Q9UV49, A6YNL9, Q99006, P48827, Q9C1T8, Q9C1T7, Q9C1T6, O59928, Q9C1U0, Q65YQ7, Q9C1T4, Q9C1T9, O14456, Q9Y841, A9LI60, Q8J042, Q5MNU1, A6Y9S8, P32470, Q870C0, Q873X9, Q06HA3, Q3YLC5, Q9HGU5, Q92222, Q9HEW6, Q9P4Q1, Q5YLC0, A6YJX1, Q4FCX2, Q92270, Q7Z8C9, Q8J1Y3, Q96VR2, E5KCK8, A5JV26, A3RLY3, A5X8W3, Q96UW2, A2VEC4, Q8NJQ4, D6N0Y7, D6N0Y8, F6MIV5, E5LEW9, A2SW11, E9F7R6, P29026, P29027, P29025, P54197, P40954, P40953, P29029, P46876. GH family-19 sequences UniProtKB IDs were: Q9WXI9, Q59I46, Q9LBM0, Q8GI53, Q9S6T0, Q8CK55, B3XZQ2, O50152, Q9Z4P2, Q5J1K1, Q9RHU4, Q9RHU5, Q25BT4.
GH family-20 sequences UniProtKB IDs were: Q9F9B4, Q75V90, A7M7B5, Q9LC82, Q7WUL4, Q9L448, Q9ZN69, Q9WXH9, D2KW09, P49008, A1XNE6, P49007, Q8VUM1, Q9R6Y9, Q9FAC5, Q9ZH38, Q7PC48, Q7PC49, Q54468, P49610, Q9ACN7, O85361, Q83WL6, Q9RHV6, Q84FS9, P96155, D9ISD9, D9ISE0, P13670, Q60081, Q04786, C8VMN3, Q8J2T0, A2SW08, P43077, Q309C3, P13723, Q643Y1, Q9URR8, E3NYM0, P87258, Q0ZLH7, P78738, P78739, Q8NIN7, Q8NIN6. The great sequence diversity found in GH family-18 required partitioning it into nine subsets of bacterial sequences and three subsets of fungal sequences. This division took into account both the existing chitinase subfamilies and the topology of a Neighbor-Joining guide tree. The retrieved sequences were then used to generate 15 multi-FASTA chitinase reference sets (12 for GH family-18, one for GH family-19 and two for GH family-20).
Two environmental metagenomic databases were selected to test our chitinase mining strategy. The first one was CAMERA v2.0 [
First, multiple sequence alignments (seed alignments) were generated for each chitinase reference set using the default settings (“--auto”) of the MAFFT v6.717b program [
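The alignment, model-building and search steps can be sketched as a small command builder. This is a minimal sketch under stated assumptions, not the authors' scripts: the helper `build_commands` and all file names are hypothetical, and it assumes the `mafft`, `hmmbuild` and `hmmsearch` executables are available on the PATH. The function only assembles the command lines; it does not run them.

```python
from pathlib import Path


def build_commands(reference_fasta, metagenome_fasta, workdir="."):
    """Return (argv, stdout_path) pairs for one chitinase reference set.

    All file names are hypothetical placeholders.
    """
    work = Path(workdir)
    stem = Path(reference_fasta).stem
    seed = work / f"{stem}.seed.aln"   # MAFFT writes the alignment to stdout
    model = work / f"{stem}.hmm"
    table = work / f"{stem}.hits.tbl"
    return [
        # 1. seed alignment with MAFFT's automatic strategy selection
        (["mafft", "--auto", str(reference_fasta)], seed),
        # 2. profile HMM built from the seed alignment
        (["hmmbuild", str(model), str(seed)], None),
        # 3. search of the metagenomic protein set; per-sequence hits go to a table
        (["hmmsearch", "--tblout", str(table), str(model), str(metagenome_fasta)], None),
    ]


if __name__ == "__main__":
    for argv, stdout_path in build_commands("gh19_bacterial.fasta", "metagenome.faa"):
        redirect = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + redirect)
```

Each `argv` list could be passed to `subprocess.run`, with the MAFFT output redirected to the seed alignment file. One such triple would be produced for each of the 15 reference sets.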
The resulting sequence database searches (described in detail in Section 2.3) were used to extract the best-hit sequence of each metagenomic dataset, that is, the hit with the lowest E-value among all the sequences of a metagenomic project. Best-hit sequences were retrieved in FASTA format using the fastacmd program of the BLAST package [
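The best-hit selection described above can be illustrated with a short parser. This is a sketch only: it assumes the hits of one metagenomic project were saved as an hmmsearch per-sequence table (whitespace-delimited, with the full-sequence E-value in the fifth column), and the function name and the example rows are hypothetical.

```python
def best_hit(tblout_lines):
    """Return (target_name, e_value) of the lowest full-sequence E-value, or None."""
    best = None
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split()
        target, e_value = fields[0], float(fields[4])
        if best is None or e_value < best[1]:
            best = (target, e_value)
    return best


# Hypothetical table rows, truncated to the first seven columns.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "read_0001  -  GH18_set1  -  3.2e-15  55.1  0.1",
    "read_0042  -  GH18_set1  -  1.4e-40  138.9  0.0",
    "read_0107  -  GH18_set1  -  7.5e-06  24.3  0.2",
]
print(best_hit(example))
```

Applying the function once per project and per pHMM would yield one candidate sequence identifier for each combination, which could then be used to fetch the sequence itself.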
Best-hit sequences (described in detail in Section 2.4) were selected to perform phylogenetic reconstructions using the Neighbor-Joining (NJ) algorithm from MEGA 5.05 [
The construction of chitinase reference sequence sets was a key step in the success of the mining strategy applied in this work. Collecting and grouping the chitinase sequences into subsets yielded 15 chitinase groups covering all three chitinase GH families: nine bacterial GH family-18, three fungal GH family-18, one bacterial GH family-19, one fungal GH family-20 and one bacterial GH family-20 (
The hmmsearch analysis against the CAMERA and IMG/M environmental metagenomic databases retrieved a total of 708, 104 and 256 best-hit sequences of putative GH family-18, 19 and 20 chitinases, respectively. Scanning these sequences with RPS-BLAST revealed chitinase conserved domains in 74.6%, 97.1% and 97.7% of the GH family-18, GH family-19 and GH family-20 sequences, respectively (Figures 2(a)-(c)). Only a small portion of the sequences hit conserved domains other than the chitinase ones (4.8% of GH family-18 and 0.8% of GH family-20). Sequences with no hits accounted for 20.6% of GH family-18, but only 2.9% of GH family-19 and 1.6% of GH family-20 (Figures 2(a)-(c)). The InterPro search detected chitinase signatures in 81.7%, 89.4% and 98.8% of the metagenomic sequences belonging to GH family-18, 19 and 20, respectively (Figures 2(d)-(f)). Compared with the RPS-BLAST search, the InterPro analysis revealed a higher percentage of sequences carrying protein signatures other than the chitinase ones (10.3% of GH family-18, 8.7% of GH family-19 and 0.4% of GH family-20) and a smaller percentage of sequences with no hits against the databases examined (8.0% of GH family-18, 1.9% of GH family-19 and 0.8% of GH family-20) (Figures 2(d)-(f)).
A large difference in diversity among the three chitinase GH families was revealed by the RPS-BLAST and InterPro analyses. GH family-19 and GH family-20 presented no more than 12 types of conserved domains, and most of the sequences shared the same conserved domain hits (Tables 1 and 2). In contrast, GH family-18 displayed up to 34 different conserved domains, with no set of conserved domains predominating in the majority of sequences (at most, half of the sequences shared the same conserved domain hits) (Tables 1 and 2). In addition, the scanning of IMG/M sequences showed that some sequences annotated as hypothetical proteins exhibited chitinase conserved domain hits, demonstrating the sensitivity of our mining pipeline.
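The three outcome categories reported above (chitinase domain, other domain, no hit) amount to a simple tally over per-sequence labels. The following is a minimal sketch of that bookkeeping; the function name, the label strings and the ten example labels are hypothetical and are not the study's data.

```python
from collections import Counter


def category_percentages(labels):
    """Map each category label to its percentage of all sequences (one decimal)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Hypothetical labels for ten sequences, not the study's data.
labels = ["chitinase"] * 7 + ["other_domain"] * 1 + ["no_hit"] * 2
print(category_percentages(labels))
```

Because every sequence receives exactly one label, the percentages for a family should sum to 100 up to rounding, which gives a quick consistency check on reported values.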
The phylogenetic analysis generated NJ trees for each chitinase dataset. All datasets showed some variability in the amino acid sequence of the catalytic domain region, except for the two active site residues (aspartate and glutamate in GH families 18 and 20, and two glutamates in GH family-19), which were conserved in almost all sequences examined (data not shown). In addition, the NJ tree analysis revealed two common patterns: all the trees contained metagenomic sequences phylogenetically related to characterized chitinases, and all of them also contained metagenomic sequences that did not cluster with any characterized chitinase (Figures 3-6). Interestingly, some metagenomic sequences annotated as “hypothetical protein” in the IMG/M database were retrieved by our mining pipeline and grouped with chitinase GH family-18 reference sequences in the NJ phylogenetic analysis (
The broad range of applications of chitinolytic enzymes makes their identification and study very promising. Metagenomic approaches offer access to functional genes in uncultured representatives of the microbiota and hold great potential for the discovery of novel enzymes, but tools to explore these data extensively are still scarce. This study aimed to develop a chitinase mining pipeline to facilitate the comprehensive search for these enzymes in metagenomic databases. The use of a pHMM-based strategy allowed sensitive and efficient detection of putative chitinase sequences.
The generation of representative seed alignments and the selection of the homology detection method are key steps in sequence mining pipelines. The quality of an alignment is critical to its utility in different approaches, such as functional analysis, evolutionary studies and structure prediction [
pHMMs are statistical models that use multiple alignments of homologous sequences to quantify amino acid frequencies and the position-specific probabilities of insertions and deletions along the alignment [
Footnotes to Tables 1 and 2: (a) only the conserved domain hits found in more than 10% of the sequences analyzed are displayed in the table; (b) percentage of sequences that showed a hit with that conserved domain.
the sequence family than a single sequence [
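The position-specific amino acid frequencies that underlie a pHMM can be illustrated on a toy alignment. This is a deliberately simplified sketch: it computes raw per-column frequencies only, whereas a real profile builder such as hmmbuild also models insert and delete states and adjusts the estimates (for example with sequence weighting and priors). The function name and the four aligned sequences are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return, per alignment column, the relative frequency of each residue.

    Gap characters ('-') are ignored; a column of only gaps yields an empty dict.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile


# Hypothetical toy alignment of four sequences.
toy = ["DGIDE", "DGLDE", "DG-DE", "DAIDE"]
for i, freqs in enumerate(column_frequencies(toy), start=1):
    print(i, freqs)
```

Fully conserved columns, such as the first one here, carry a single residue with frequency 1, while variable columns spread the probability over several residues. This per-position information is what a single query sequence cannot provide.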
The best-hit sequences (the ones retrieved after the hmmsearch analysis) were scanned for chitinase conserved domains and motifs/signatures in order to evaluate the performance of our chitinase mining pipeline in detecting true putative chitinase sequences. Many annotation pipelines use searches against conserved domain databases since these regions are evolutionarily conserved units in proteins [
the best hit metagenomic sequences, validating our chitinase mining pipeline. Best-hit metagenomic sequences showing no hits to any conserved domain may represent putative novel chitinases that possibly would not be identified using sequence-sequence similarity searches. Furthermore, some IMG/M metagenomic sequences annotated as hypothetical proteins produced hits with chitinase conserved domains in our analysis, indicating that our pipeline may have high sensitivity and is able to detect remote homologues.
The results obtained in the RPS-BLAST and InterPro analyses emphasized the large differences in diversity among the three chitinase GH families 18, 19 and 20. As described in previous reports, GH family-18 shows higher variability in evolutionary terms and contains the greatest number of protein members [
Traditional sequence search pipelines frequently are not able to exploit metagenomic databases extensively. The current flood of sequence data from metagenomic studies and the wide range of applications of chitinases brought about the need to develop a new data search pipeline. The chitinase mining pipeline developed in this work was based on the generation of high-quality seed alignments from reliable chitinase reference sets, which were then used to construct chitinase pHMMs. The searches using these pHMMs retrieved high percentages of putative chitinase sequences, which were confirmed in silico by scanning for chitinase conserved domains and motifs/signatures and by NJ phylogenetic reconstructions. The results confirmed the efficacy of our pipeline in detecting chitinase sequences and highlighted the sensitivity of pHMM-based strategies in identifying remote homologues. These analyses provide insight into the potential reservoir of novel molecules in metagenomic databases while supporting the in silico chitinase mining pipeline developed in this work and identifying phylogenetic relationships among the chitinase sequences. By using our chitinase mining pipeline, a larger number of previously unannotated metagenomic chitinase sequences can be classified, enabling further exploration of these enzymes.
This research was supported by CAPES/PNPD.