Open Journal of Genetics, 2013, 3, 183-194 OJGen http://dx.doi.org/10.4236/ojgen.2013.33021 Published Online September 2013 (http://www.scirp.org/journal/ojgen/) Correctness and accuracy of template-based modeled single chain fragment variable (scFv) protein anti-breast cancer cell line (MCF-7) Elham O. Mahgoub, Ahmed Bolad Alneelain Medical Research Center, Faculty of Medicine Department of Microbiology and Unit of Immunology, Al-Neelain University, Khartoum, Sudan Email: ilhamomer@yahoo.com Received 8 February 2013; revised 20 March 2013; accepted 3 April 2013 Copyright © 2013 Elham O. Mahgoub, Ahmed Bolad. This is an open access article distributed under the Creative Commons Attri- bution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. ABSTRACT Multiple sequence alignments can be used in the tem- plate-based modelling of protein structures to build fragment-based assembly models. Therefore, useful functional information on the 3D structure of the anti-MCF-7 scFv protein can be obtained using available bioinformatics tools. This paper utilises several commonly-used bioinformatics tools and da- tabases, including BLAST (Basic Local Alignment Search Tool), GenBank, PDB (Protein Data Bank), KABAT numbering and SWISS-MODEL, to gain specific functional insights into the anti-MCF-7 scFv protein and the assembly of single-chain fragment variable (scFv) antibodies, which consist of a variable heavy chain (VH) and a variable light chain (VL) connected by the linker (Gly4-Ser)3. The linker has been built as a loop structure using the Insight II software. The accuracy of the loop structure has been evaluated using Root Mean Square Deviation (RMSD). The accuracies of the VL and VH tem- plate-based structures are enhanced by using the evaluation methods Verify3D, ERRAT and Ram- chandran plotting, which measure the error in the residues. In the results, 100% of the light-chain resi- dues scored above 0.2, whereas 88.5% of the heavy- chain residues’ scored above 0.15 in the Verify3D evaluation method. Meanwhile, using ERRAT, the alignments of both chains scored more than 70% in space. Additionally, the Ramchandran plot evalua- tion method showed large numbers of residues in the favoured areas in both chains; these findings demon- strated that all of the chosen templates were the best candidates. Keywords: Single Chain Fragment Variable; Homology Modeling; SWISS-MODEL; Insight II; Model Evaluation Method 1. INTRODUCTION The prediction of protein structure is one of the most important goals pursued by bioinformatics and theoreti- cal chemistry. The scFv anti-MCF-7 gene was con- structed from the mouse B-cell hybridoma line C3A8 using phage display technology in a previous study. The objective of scFv protein homology modelling is to pre- dict the three-dimensional structure of the VH and VL chains of the scFv protein from their amino acid se- quences. Modelling prediction includes additional rele- vant information, such as the structures of related pro- teins. In other words, it deals with the prediction of a protein’s tertiary structure from its primary structure. Chua et al. [1] investigated many uses for this technol- ogy in scFv (single-chain variable fragment) genes cloned from anti-CMV (anti-cucumber mosaic virus). The scFv anti-MCF-7 antibody structure is modelled using SWISS-MODEL, and the VH and VL models are connected by the linker (Gly4-Ser)3 in the Insight II software. Thus, the complimentary-determining regions CDRs in the modelled antibody structure are determined by KABAT numbering and mapped to provide insight for further epitopes analysis. Homology modelling is based on the reasonable the- ory that two homologous proteins will share very similar structures. Because a protein’s folding is more evolution- arily conserved than its amino acid sequence, a target sequence can be modelled with reasonable accuracy on a very distantly related template, provided that the rela- tionship between the target and the template can be dis- cerned through sequence alignment. Homology modelling OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 184 was first applied by Tom Blundell in the late 1970’s, us- ing early computer imaging methods [2]. It has been sug- gested that the primary bottleneck in comparative model- ling arises from difficulties in alignment rather than from errors in structure prediction, given a known-good align- ment [3]. Unsurprisingly, suggested homology modelling is most accurate when the target and template have simi- lar sequences. Modeller is a popular software tool for producing homology models using methodology derived from NMR spectroscopy data processing. The standard procedure of template-based modelling consists of four steps: 1) finding known structures (tem- plates) related to the sequence to be modelled (target); 2) aligning the target sequence onto the template structures; 3) building the structural framework by copying the aligned regions, or by satisfying spatial constraints from the templates; 4) constructing the unaligned loop regions and adding side-chain atoms. The first two steps are usu- ally performed as a single procedure because the correct selection of templates relies on their accurate alignment with the target [4]. Similarly, the last two steps are also performed simultaneously because the atoms of the core and loop regions interact closely. SWISS-MODEL pro- vides an automated web server for basic homology mod- elling. Accordingly, models are pre-computed similarity relationships between sequences, structures and binding sites [5]. Stru cture evaluation is the most important component of structure prediction. There are several methods to evaluate protein structures, such as Ramchandran plot- ting, Verify3D and ERRAT. These programmes are freely available at the UCLA-DOE server. Moreover, the Ramchandran plot was developed by Gopalsamudran Narayana [6], and Verify3D was demonstrated by Eisenberg [7]. In this study, the heavy and light chains are modelled using SWISS-MODEL and connected to- gether with the peptide linker (Gly4-Ser)3, which was built using the Insight II software. The CDRs in the modelled antibody structure were determined by KABAT numbering and mapped inside the model structures. Moreover, the template structures of the heavy and light chains were evaluated to gain confidence about the cor- rectness of the predicted structures. 2. METHODS 2.1. Protein Homology Modelling of the Heavy Chain and Light Chain All of the procedures were performed to predict protein structure through homology modelling. First, the Ex- PASy website (http://www.us.expasy.org /tools/dna.html ) was used to translate the nucleotide sequence into the protein sequence. Next, the amino acid sequences of the VH and VL chains were submitted to ncbi-genbank (http://blast.ncbi.n lm.nih.gov/Blast.cgi) to identify the template structures with the highest percentage of align- ment. Additionally, similarity was confirmed between the VH and VL sequences and their template sequences. The alignment between the sequences was refined manu- ally using Pairwise (http://www/search/pairwise.shtml). The alignment was obtained from the Pairwise website, and then Cluster-X software was used to predict the VH and VL protein models. The alignment between the se- quences was then submitted to the SWISS-MODEL Automated Comparative Protein Modelling Server web- site (http://swissmodel.expasy.org/workspace/index). The structure was visualised with the Accelrys Visu- alize software (http://www.accelrys.com). Also, the mod- els were represented as ribbons generated using the Dis- cover software from Accelrys (San Diego, CA, USA) The higher sequence similarity of the combining sites of VH and VL of the scFv protein was then used to con- struct 3D structures. Furthermore, comparing the amino acids against the DNA was allowed to construct realistic models of the VH and VL chains of the scFv protein. The target amino acids were manually changed until they were similar to the 3BKY and 1AY1 sequences. 2.2. Build scFv Full Structure Using Builder/Insight Ii Software The Builder/Insight II software was used to connect the VH and VL models using the linker (Gly4-Ser)3 and then to build the scFv secondary structure. The scFv secon- dary structure was built using the Build Model command in Builder/Insight II. This command prepares Modeller input files to connect the VH and VL models by the linker (Gly4-Ser)3. Certain other commands were also used to build the linker (Gly4-Ser)3, such as the Get command, which reads files containing single-letter amino acid codes; the Put command, which writes output to files of either single-sequence rows or full alignments; and the Copy command, which copies the amino acid sequence row. The last command used was the Start command, which starts the Modeller background job. 2.3. Energy Minimisation of scFv Predicted Structures Insight II contains all of the necessary information to define the topology, coordinates, and force field parame- ters. These parameters include the atom types and partial charges. When doing energy minimization, the Discover module of Insight II provides a convenient interface. This module builds Discover input files from information provided through graphical interfaces, and it allows Dis- cover jobs to run interactively. In Insight II, the force field parameters were set up using three command steps: first, the Forcefield/Select Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 185 command was entered, and then atom types were as- signed using the Fix command for Potential Action in Forcefield/Potentials. Alternatively, the atoms types were assigned with the Atom/Potential command in the Bio- polymer module, and then the Accept option for Poten- tial Action in Forcefield/Potentials was used. Finally, to assign the charges, the Fix command was used for both Partial Chg Action and Formal Chg Action under Force- field/Potentials. The next step was used to minimise the energy of the scFv antibody structure. The correctness of the structure has already been checked using the assigned atom types and partial charges commands. To perform this step, the command Potential or Partial charge in Molecule/Label was used to label each atom. The structural information was specified by moving to the Discover module in In- sight II. The Constraint Pull-down menu contains various atom-constraining and restraining procedures. In Parame- ters, the simulation type for Discover (Minimize, Dy- namics, etc.) and the choices for the cut-off parameters for non-bonded interactions were selected. Additionally, to start a simulation, the command Run/Run was entered for the object being calculated. Each Discover run was assigned a number based on the order of the execution start times. The files created during the execution were identified by the calculation object and the job integer, and the file extension specifies the file type. 2.4. Structural Evaluation of the Heavy-Chain Model and Light-Chain Model To evaluate the scFv structures, Ramchandran plotting, Verify3D and ERRAT were used. These programmes are freely available at the UCLA-DOE server: (http://www.Shan non.mbi.ucla.edu/DOE/services/SV/). These structural evaluation methods allowed the reliable recognition of suitable templates for the heavy and light chains of the scFv protein structure. Additionally, the structural evaluation methods were able to produce se- quence-structure alignments with fewer gaps. Root mean square deviation (RMSD) is a technique that was developed by Giannakakos (2000). This method was used to evaluate the similarity of protein structures to their templates and to determine the accuracy of the alignment of the residues of two structures. The units used are Angstroms (Å). 2 1 N i I SDD N where, i is the index that identifies a pair of corresponding residues in two structures. N is the number of atoms. Di is the distance between corresponding i atoms. The computation of the RMSD requires a sequence alignment that defines which pairs of residues corre- spond to each other and an optimal superposition of the two structures in space. 3. RESULTS 3.1. VH and VL Chains Nucleotide and Amino Acids Sequences The nucleotide and amino acid sequences of the VH and VL chains are shown in Figures 1 and 2. The DNA se- quences of both the VH and VL chains were obtained from First BASE Laboratories Sdn. Bhd and translated into amino acid sequences in the TRANSLATE pro- gramme. The three CDRs for both chains were high- lighted using KABAT numbering. The nucleotide and amino acid sequences of the VH and VL chains were ob- tained for use in model prediction. The CDR sequences are shown in red lettering in Figure 2. 3.2. Template Search and Selection Generally, all current comparative modelling consists of four sequential steps: fold assignment and template se- lection, template-target alignment, model building and model evaluation. The selection of the template structure is generally performed by a programme that detects se- quence similarity only, such as FASTA, BLAST, and programmes based on dynamic programming methods [8,9]. However, a slightly related sequence-structure pair needs to be identified through a more difficult method that relies on structural information or multiple se- quences from the family of interest. First, a database search through unrelated sequence similarity searches was conducted by BSI-BLAST at the NCBI database http://www.ncbi.n lm.nih.gov.BLAST to identify a ho- mologous protein that possessed a crystal structure for use as a template. The X-BLAST identified many templates that were chosen to align with the VH and VL target chains, as shown in Tables 1 and 2. A reliable structure can only be obtained when the target and template are properly aligned. That state can only be achieved when the se- quence identity between the modelled sequence and at least one known structure is >30% [10]. The heavy chain (VH) consisted of approximately 113 amino acids with 75% identity with 3BKY, the template sequence shown in Table 1. The amino acid sequence of the light chain (VL) consisted of approximately 105 amino acids with 85% identity with 1AY1, the template sequence shown in Table 2. The CDR regions in the VH and VL amino acids were determined using KABAT (www.kabatdatabase.com), as shown in Figures 1 and 2. The CDRs of the heavy chain are in boldface, with CDR-H1 shown in red, CDR-H2 in blue and CDR-H3 in Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 Copyright © 2013 SciRes. 186 Figure 1. DNA Blast in NCBI GenBank of the investigated scFv gene. The identity was 99% for the heavy chain of the scFv gene and was 100% for the light chain of the scFv gene gi[1612455[gb[AAG28706.2]. yellow. The CDRs of the light chain are also in boldface, with CDR-L1 shown in red, CDR-L2 in yellow and CDR3-L3 in green. The similarity between two corre- sponding amino acids in the sequence alignment of the target chains and their templates in this work was very high; therefore, the predicted structures were accurate and reliable. can be higher if the segments of the model are selected from homologous sequences (Blundell and Srinivasan 1996). High identity between the target and template sequences generally allows the construction of a pre- dicted 3D structure with high accuracy. An identity of above 60% tends to produce a structure comparable to medium-resolution NMR or low-resolution crystallogra- phy without crystallisation or experimental structure de- termination [12]. Because homology modelling was used to produce the structural models in this work, no crystal- lisation or experimental structural determination was needed. Additionally, the numbers of structurally con- served regions (SCRs), comprising approximately 85% of the light chain and 75% of the heavy chain, were iden- tified, and the accuracy of the predicted structure was high and reliable. The 3D structures predicted for the light chain and heavy chain were constructed through the SWISS-MODEL website (http://swissmodel.expasy.org/workspace/index), as shown in Figures 4(a) and (b). 3.3. Target-Template Alignments Multiple sequence alignment is useful for placing dele- tions or insertions in areas where the sequences are sig- nificantly different [11]. The structural information from the template structure can also be used to guide the alignment by modifying the gap penalty function to fa- vour gaps in structurally reasonable contexts. The VL and VH chain models were further aligned with the tem- plate sequences by box-shading the conserved regions to elucidate the variability of the amino acids that conferred certain differences between the sequences. The target domains that were assessed to interact through the inter- face modes in a given PDB structure were listed as can- didate members of the heavy- and light-chain complex, as shown in Tables 1 and 2. Figure 3(a) shows several amino acid variations and insertion regions, especially between the heavy-chain amino acid sequence alignment with the 3BKY template and as shows in Figure 3(b) the light-chain amino acid alignment with the 1AY1 tem- plate. 3.4. Building the Full Structure of the scFv Antibody Using Builder/Insight II Software Builder/Insight II software was used to connect the VH and VL models by the linker (Gly4-Ser)3 and then to build the full scFv secondary structure in CPK display, as shown in Figures 5(a)-(c). The CPK model shows all of the CDRs on the surface of the molecule. The peptide linker appears in the middle of the structure, whereas Comparative protein modelling stresses that accuracy OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 187 Figure 2. The nucleotide and amino acid sequences for the VH and VL chains were determined to use later in model prediction. The sequences were obtained from First BASE Laboratories Sdn. Bhd and translated using the TRANSLATE programme. Table 1. Selecting target templates for the heavy chain. pdb|3DIF|B Crystal Structure of Fabox117 75% pdb|3BKY |H Crystal Structure of the heavy chain of Chimeric Antibody C2 (the chosen template) 75% pdb|3HQK|Q Q Chain Q, X-Ray Crystal Structure of An Arginine Ag 73% pdb|2OSL|H Chain H, Crystal Structure of Rituximab Fab 76% CDR-HI, CDR-H2 and CDR-H3 are found in the upper part of the structure, and CDR-L1, CDR-L2, and CDR-L3 are at the bottom of the structure. The linker provides the molecular flexibility required to move 35 to 40 Å (10−9 Kcal/mol) [1]. The root mean square devia- tion (RMSD) evaluation method was used to measure the accuracy of the loop structures, such as the linker (Gly4-Ser)3 and the sequences in the VH and VL inser- tion gaps. The insertion of gaps into an alignment be- tween two protein sequences, known as the loop struc- Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 188 Table 2. Selecting target templates for the light chain. pdb|1AY1|L Chain L, Anti Taq Fab Tp7 (the chosen template) 85% pdb|1BGX|L Chain L, Taq Polymerase In Complex With ligand bound Tp7, An Inhibitory Fab 85% pdb|1BAF|L Chain L, 2.9 Angstroms Resolution Structure Of An Anti-Dinitrophenyl-Spin-Label Mono- clonal Antibody Fab Fragment With ligand bound 86% (a) (b) Figure 3. (a) Heavy-chain sequence alignment with the 3BKY template in the ncbi-blast website. The sequence identity was 75%. This result was then used for the heavy-chain model prediction. (b) Light-chain sequence alignment with the 1AY1 template se- quence in the ncbi-blast website. The sequence identity was 85%. This result was then used for the light-chain model prediction. (a) (b) Figure 4. The heavy- and light-chain 3D structures for the scFv antibody and its CDRs. (a) Heavy chain: CDR-HI (red), CDR-H2 (blue) and CDR-H3 (yellow). (b) Light chain: CDR-Ll (red), CDR-L2 (yellow) and CDR-L3 (green). The CDR amino acid regions in the heavy chain (VH) and light chain (VL) were determined using KABAT numbering. ture, is a major determinant of the accuracy of the align- ment. 3.5. Energy Minimisation of the Predicted Structures In the energy minimisation of the protein, the hydrogen atoms were relaxed first, followed by the side chains of the amino acid residues, and finally the whole molecule. Despite the logic of this approach, however, the struc- tures minimised by an unconstrained path fit the experi- mental structures better than those minimised by con- trained paths. Moreover the unconstrained path s , Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 189 (a) (b) (c) Figure 5. (a) The full scFv protein model built by joining the VH and VL chains together by the peptide linker (Gly4-Ser)3 using BUILDER/Insight II. (b) The full scFv protein model is shown in CPK display. The heavy-chain, linker and light-chain models are clearly shown. (c) The full scFv protein model is shown in CPK display. The CPK model was energy-minimised in a CFF91 force field. The CDR molecules are shown on the surface of the CPK model. required much less computer time. The effects of the steepest descents were compared with those of the con- jugate gradient algorithms in energy minimisation. Fi- nally, steepest descents were used in the initial stages of the minimisation and conjugate gradients in the final stages of the minimisation. The full scFv model was en- ergy-minimised using 30 steps of steepest descent fol- lowed by 50 steps of conjugate gradient in the water shell, calculated with Amber 6.0 (University of Califor- nia, USA) with certain restraints to preferred geometric regions; also, two Na+ ions were added to neutralise the system. 3.6. Structural Evaluation of the Heavy-Chain and Light-Chain Models A knowledge-based homology modelling approach was used to predict the 3D structures of the heavy and light chains. The templates of the predicted structures were evaluated using three independent evaluation methods to gain confidence about the correctness and accuracy of the templates. All of the templates were submitted to the structure evaluation website (UCLA-DOE). The struc- tures were evaluated using three programmes, Ram- chandran plotting [6], Verify3D [7] and ERRAT. These methods were essential for understanding 3D protein Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 190 models and the estimation of their accuracy. Both the overall accuracy and the accuracy in the individual re- gions of a model must be determined. The predicted structures of the VH and VL chains met the above standard, as x-blast expanded the set of homo- logues of the target sequence, and the scoring matrix was used to search for new homologues. Additionally, tem- plate sequences with high identities to the target se- quences were used, specifically 99% identity for the heavy chain and 100% for the light chain. The high se- quence identity ensured a high accuracy for the models because the average structural similarity increases with sequence identity. 3.6.1. ERRAT Method As shown in Figure 6(a), ERRAT is a programme for verifying protein structures that have been determined by crystallography [13]. It is also useful for verifying pro- tein structures from the numbers of non-bounded con- tacts within a cut-off distance of 3.5 Å between different pairs of atom types (CC, CN, CO, NN, NO, OO). The error function is based on the statistics of non-bound atom-atom interactions in the reported structure com- pared with high-resolution structures. As shown in Fig - ure 6(a), the predicted structure of the light chain exhib- ited an overall quality factor of 78.505%. Additionally, in Figure 6(b), two lines were drawn to indicate the con- fidence with, which it was possible to reject regions that exceed the error value. The predicted models show- ing high resolution in the crystal structure generally pro- duce values of approximately 70% [14]. The confidence level of an overall quality factor for the heavy chain of 70.347% significantly determined the correctness of the predicted structure (Figure 6(b)). The model evaluation method outperforms the programmes in the high se- quence identity range, producing good modelling accu- racy overall. 3.6.2. Verif y 3D Method Verify3D evaluates the environment of each residue in a model with respect to the expected environment, as found in high-resolution X-ray structures [15]. Verify3D analyses the compatibility of an atomic model (3D) with its own amino acid sequence (1D) [16]. The accuracy of a 3D model can be assessed by its 3D profile, regardless of whether the model has been produced by X-ray, NMR or computational procedures, by comparing the model to its amino acid sequence using its 3D profile [17]. The 3D-1D average score against sequence number, as indi- cated in Figure 7(a), shows that 100% of the total resi- dues scored from 0.2 to 0.7 in the light chain, whereas 88.5% of the total residues scored from 0.15 to 0.7 in the heavy chain. As shown in Figure 7(b), both predicted models have 3D-1D average scores of more than 0.15. These models contain high-scoring regions, with the correctness of the good models above 0.15. The results significantly determined the correctness of the model as the average score of distinct structures. The average is often a score below 0.1 that may dip below zero at its lowest points [17]. 3.6.3. Ramchandran Plo t Method The Ramchandran plot method tests the light- and heavy-chain polypeptide angles and identifies favoured residues and allowed residues. In the light-chain pre- dicted model, the test showed that 83.0% (73) of the (a) (b) Figure 6. (a) The ERRAT evaluation methods for the light-chain residues gave 78.505% as the overall quality factor; this ERRAT value is considered good enough to use this model. In the ERRAT histogram, the correct regions are shown in black, and the incorrect regions are shown in grey. (b) The ERRAT evaluation method for the heavy-chain residues gave 70.347% as the overall quality factor; this ERRAT value is considered good enough to use this model. In the ERRAT histogram, the cor- rect regions are shown in black, and the incorrect regions are shown in grey. Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 191 (a) (b) Figure 7. (a) The Verify3D curve for the light-chain model execs between residue numbers and 3-1 dimensions score. The light-chain model gave more than 86%. The residues of the light-chain model scored 0.3 of 3D-ID. (b) The Verify3D curve of the heavy-chain model execs between residue numbers and 3-1 dimensions score. The heavy-chain model gave more than 85%. The residues of the heavy-chain model scored more than 0.3 of 3D-ID. residues lie in the most favoured region, with 14.8% (13) of the residues in the additional allowed region, as shown in Figure 8(a). The quality of the plot was better than that of the template 1AY1, as only 78.0% and 21.0% of the residues of the template structure 1AY1 fell into the most favoured region and additional allowed region, re- spectively. However, 2.3 residues were in the disallowed region for both models. The catalytic serine residue (Ser, Gly and Met as 113, 114, and 116, respectively) lies in the most favoured region. This standard, described by [18], was a typical conformation for the nucleophilic elbow, which was located in the tightly constrained Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 192 beta-turn-type structure between a beta-strand and an alpha-helix. The Ramachandran plot of the heavy-chain predicted model, as shown in Figure 8(b), reveals that 81.8% (81) of the residues lie in the most favoured re- gion, with 16.2% (16) of the residues in the additional allowed region. The quality of the plot was better than that of the template 3BKY, as only 76.0% and 23.0% of the residues of the template structure 3BKY fell into the most favoured region and additional allowed region, re- spectively. However, zero residues were in the disal- lowed region for both models. The catalytic serine resi- due (Ser113) lies in the allowed region. 4. DISCUSSION Knowledge-based homology modelling relies on the identification of one known protein structure, which is likely to resemble the structure of the query sequence, and on the production of an alignment that maps the residues in the query sequence to the residues in the tem- plate sequence. Therefore, the heavy- and light-chain genes were sequenced, and the sequences were deposited in GenBank. The mapped residues in the query were aligned to residues in the template sequence. A number of scFv structures at the Protein Data Bank (PDB) www.rcsb.org/pdb were used [19], and general informa- tion on antigen binding was documented. Hence, the scFv protein sequences in PDB were used in x-BLAST to identify suitable templates for homology modelling. Figure 3(a) shows the amino acid sequence alignment of the light-chain predicted structure and the 1AY1 tem- plate. Additionally, Figure 3(b) shows the amino acid sequence alignment of the heavy-chain predicted struc- ture and the 3BKY template structure. There were a few amino acid variations, and there were several insertion regions, especially between the heavy-chain predicted structure and the 3BKY template structure, as shown in Figure 3(b). Normally, an optimal alignment leads to a more accu- rate model. The PDB search results showed a high se- quence similarity of 75% for 3BKY, the heavy-chain template, and of 85% for 1AY1, the light-chain template. Any model can be predicted with sequence similarity equal to or greater than 30% [10]. Thus, the availability of a structural homolog at PDB was confirmed. The scFv antibody sequence was then submitted to SWISS-MO- DEL, and the VH and VL structures were separately modelled. As shown in Figures 4(a) and (b), mapping the complimentary-determining regions (CDRs) are im- portant for supporting library diversity [20]. The ca- nonical conformations for the CDRs in the scFv antibody 3D structure were successfully mapped. The CDRs, as shown in Figures 4(a) and (b), were mapped to identify their positions in the heavy and light chains. The VH and VL models were linked by the synthetic peptide [(GlY4Ser)3, followed by energy minimisation in a CFF91 force field. The modelled scFv structure was rep- resented as a CPK model, and the CDRs were mapped with Accelrys Visualize at the website (http://www.accelrys.com). Thus, the CDRs in the mod- elled antibody structure were determined by KABAT numbering (Figures 4(a) and (b)). The loop region of the structure was the most impor- tant task in modelling the scFv protein. The loop regions of the model are the structures constructed without a template guide [21,22]. The loop evaluation was meas- ured using the root mean square deviation (RMSD). Therefore, the synthetic peptide (GlY4Ser)3 that was built using BUILDER/Insight II had to be measured. Also, the loop structure was recorded with the root mean square deviation (RMSD), which was 4 Kcal/mol in the heavy chain and 2 Kcal/mol in the light chain. An optimal su- perposition (minimal RMSD) can be achieved by trans- lating and rotating one structure to its relative structure in space [23]. Therefore, the optimal alignment (optimal set of pairs of corresponding residues) was obtained and is given in Tables 1 and 2. As expected for structures of good quality, the templates of the correct models have average energy profiles smaller than zero over most of their lengths. The models based on incorrect alignments show higher energy compared with reliable structures. These results confirm the efficiency of the achieved minimisation strategy in modelling closely related ho- mologies. To determine the reliability of the united atom approximation, all of the above minimisations were per- formed with united atom models. This approximation gave structures with similar but slightly higher RMS deviations than the all-atom models, but gave additional savings of 60% - 70% in computer time. Previously, steepest descents have been used in the initial stages of minimisation and conjugate gradients in the final stages of minimisation. Therefore, the structures minimised by conjugate gradients alone resembled the structures mini- mised initially by the steepest descents and subsequently by the conjugate gradient algorithms. The predicted VL and VH structures were evaluated using three independent evaluation methods to gain con- fidence about the correctness of the predicted structures. Also, the evaluation of a model normally involves checking the sequence identity and functional environ- ment [15]. The VH and VL structures were evaluated using Ramachandran plots, Verify3D and ERRAT. These methods are freely available at the UCLA–DOE server (www.Shannon.mbi.ucla.edu). Furthermore, Ha- tem et al. [14] reported that very good models score above 70% with ERRAT evaluation methods; thus, in this work, the correctness of both predicted structures was significantly above this confidence level, with scores of 78.505% for the light chain and 70.347% for the heavy chain. Moreover, in Verify3D, in which the method analysed the compatibility of an atomic model Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 193 (a) (b) Figure 8. (a) A Ramachandran plot showing the analysis of 118 structures with a resolution of at least 2.0 Angstroms and an R-factor no greater than 20%. A good quality model would be expected to have over 90% of the residues in the most favoured regions. In this model, more than 90% of the residues are in the favoured regions. (b) A Ramachandran plot showing the analy- sis of 118 structures with a resolution of at least 2.0 Angstroms and an R-factor no greater than 20%. A good quality model would be expected to have over 90% of the residues in the most favoured regions. In the heavy-chain model, more than 99% of the residues are in the favoured region. (3D) with its own amino acid sequence (1D) [16], the light-chain and heavy-chain residues scored more than 0.3 of 3D-ID in that method, as shown in Figures 7(a) and (b). Therefore, the results determined that both mod- els were correct models that could be predicted with the templates used, as approved by Hatem et al. [14]. A basic requirement for a good model is the stereo- chemistry in displaying main-chain torsion angles phi, psi (φ, ψ) as determined by procheck [21]. procheck is widely used to calculate the Ramchandran angles of pro- tein structures, particularly crystal structures available in the Protein Data Base (PDB) [9]. In the Ramchandran method, the polypeptide chain is displayed using the φ, ψ angles pair in a given protein structure as described by Ramchandran [6]. In this paper, the models were consid- ered to be good quality because 99% and 90% of the heavy- and light-chain residues were in the favoured regions, as shown in Figures 8(a) and (b), respectively. Moreover, none of the residues were in the disallowed region for either model. The Ramchandran plot is less effective than Verify3D at revealing damaged fragments, as it sometimes appears normal even though the structure is completely wrong. 5. CONCLUSION The study of the scFv protein prediction models was ac- curate enough to be useful in essential ligand characteri- sation. This work presents the anti-MCF-7 scFv protein sequence against PDB (protein database), using BL- AST-P to identify suitable templates for homology mod- elling. The PDB search results show a high sequence similarity (99%) to a synthetic peptide. Thus, the avail- ability of a structural homolog at PDB was confirmed. Next, the anti-MCF-7 scFv amino acid sequence was submitted to SWISS-MODEL, and the VH and VL structures were separately modelled. The models were represented as ribbons, generated using RasMol. The canonical conformations for the CDRs in scFv anti- MCF-7 are mapped in 3D and mapped regions. The indi- vidually modelled VH and VL structures were linked by a synthetic peptide [(Gly4Ser)3 using BUILDER/Insight II, followed by energy minimisation in a CFF91 force field. The modelled anti-MCF-7 scFv structure is repre- sented as a CPK model, and the CDRs are mapped. Thus, the structure of an anti-MCF-7 scFv was modelled, and the CDRs were mapped to the structure in 3D. The model was subsequently evaluated using Verify3D, ERRAT and Ramachandran plots. Parts of the protein with unsatisfactory energy were realigned to the template, and the whole process of model building and evaluation was repeated until most of the average energy profile was below zero. REFERENCES [1] Chua, K.H., et al. (2006) Bioinformatics in molecular Copyright © 2013 SciRes. OPEN ACCESS
E. O. Mahgoub, A. Bolad / Open Journal of Genetics 3 (2013) 183-194 Copyright © 2013 SciRes. 194 OPEN ACCESS immunology laboratories demonstrated: Modeling an anti-CMV scFv antibody. Bioinformation, 1, 118-120. [2] Blundell, T.L. and Srinivasan, N. (1996) Symmetry, sta- bility, and dynamics of multidomain and multicomponent protein systems. Proceedings of National Academy of Sciences of the USA, 93, 14243-14248. doi:10.1073/pnas.93.25.14243 [3] Katchalski, K., Katzir, E., Shariv. I., et al. (1992) Mo- lecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation tech- niques. Proceedings of National Academy of Sciences of the USA, 89, 2195-2199. [4] Wu, S. and Zhang, Y. (2009) Chapter 11: Protein struc- ture prediction. In: D. Edwards, et al., Eds., Bioinformat- ics: Tools and Applications, Springer Science+Business Media, LLC, Berlin, 225-242. [5] Janin, J., Henrick, K., Moult, J., et al. (2003) A critical assessment of predicted interactions. Proteins: Structure, Function, and Bioinformatics, 52, 2-9. doi:10.1002/prot.10381 [6] Ramachandran, G.N., Ramakrishnan, C. and Sasisekharan, V. (1963) Stereochemistry of polypeptide chain configu- rations. Journal of Molecular Biology, 7, 95-99. doi:10.1016/S0022-2836(63)80023-6 [7] Eisenberg, D., Lüthy, R. and Bowie, J.U. (1997) VER- IFY3D: Assessment of protein models with three-dimen- sional profile. Methods in Enzymology, 277, 396-404. doi:10.1016/S0076-6879(97)77022-8 [8] Sanchez, R. and Sali, A. (1998) Large-scale protein struc- ture modeling of the Saccharomyces cerevisiae genome. Proceedings of National Academy of Sciences of the USA, 95, 13597-13602. doi:10.1073/pnas.95.23.13597 [9] Dlakic, M. (2002) A model of the replication fork block- ing protein Fob1p based on the catalytic core domain of retroviral integrates protein. Protein Science, 11, 1274- 1277. [10] Marti-Renom, M.A., Stuart, A.C., Fiser, A., et al. (2000) Comparative protein structure modeling of genes and genomes. Annual Review of Biophysics and Biomolecular Structure, 29, 291-325. doi:10.1146/annurev.biophys.29.1.291 [11] Madhusudhan, M.S., et al. (2006) Variable gap penalty for protein sequence-structure alignment. Protein Engi- neering, Design and Selection, 19, 129-133. doi:10.1093/protein/gzj005 [12] Sanchez, R., Pieper, U., Melo, F., et al. (2000) Protein structure modeling for structural genomics. Nature Struc- tural Biology, 7, 986-990. [13] Colovos, C. and Yeates, T.O. (1993) Verification of pro- tein structures: Patterns of non-bonded atomic interac- tions. Protein Science, 2, 1511-1519. doi:10.1002/pro.5560020916 [14] Hatem, R., Pierre, B., Elie, E., et al. (2005) Structural and functional analysis of the C-terminal STAS (sulfate transporter and anti-sigma antagonist) domain of the Arabidopsis thaliana sulfate transporter SULTR. The Journal of Biological Chemistry, 280, 15976-15983. [15] Fiser, A., Sanchez, R., Melo, F., et al. (2001) Comparative protein structure modeling. In: Becker, O.M., Ed., Com- putational Biochemistry and Biophysics, Marcel Dekker, New York, 275-312. [16] Bowie, J.U., Luthy, R. and Eisenberg, D. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164-170. doi:10.1126/science.1853201 [17] Lüthy, R., Bowie, J.U. and Eisenberg, D. (1992) Assess- ment of protein models with three-dimensional profiles. Nature, 356, 83-85. doi:10.1038/356083a0 [18] Tyndall, D.A., Linda, A., Fothergill-Gilmore, P., et al. (2000) Crystal structure of a thermostable lipase from Bacillus stearohermophilus P1. Journal of Molecular Bi- ology, 323, 859-869. [19] Bernstein, F.C., Koetzle, T.F., Williams, J.B., et al. (1977) Protein data bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535-542. doi:10.1016/S0022-2836(77)80200-3 [20] DeBartolo, J., Colubri, A., Jha, A.K., et al. (2009) Mim- icking the folding pathway to improve homology free protein structure prediction. Proceedings of National Academy of Sciences of the USA, 106, 3734-3739. [21] Laskowski, R.A., Moss, D.S. and Thornton, J.M. (1993) Main-chain bond lengths and bond angles in protein structures. Journal of Molecular Biology, 231, 1049- 1067. [22] Vriend, G. (1990) WHATIF: A molecular modeling and drug design program. Journal of Molecular Graphics, 8, 52-56. doi:10.1016/0263-7855(90)80070-V [23] Ryan, D., Xiaotao, Q., Rosemarie, S., et al. (2011) Rela- tive packing groups in template-based structure predic- tion: Cooperative effects of true positive constraints. Journal of Computational Biology, 18, 17-26. LIST OF ABBREVIATIONS scFv—Single Chain Fragment Variable VL—Hyper variable light chain VH—Hyper variable heavy chain MCF—7-mammary gland carcinoma of the breast cells line anti-CMV—anti-cucumber mosaic virus CDRs—complementary determining regions Gly—Glycine amino acid Ser—Serine amino acid.
|