Cysteine-dependent protein sequences were downloaded from annotated database resources to generate comprehensive EGF, Sushi, Laminin and Immunoglobulin (IgC) motif-specific sequence files. Each dataset was vertically registered and the cumulative distribution of amino acid functional group chemistry determined relative to the respective complement of cysteine residues providing critical disulfide stabilization of these four well-known modular motif families. The cysteine-aligned amino acid distribution data revealed limited ionic, polar, hydrophobic or other side chain preferences, unique to each protein scaffold. In contrast, all four cysteine-dependent protein families exhibited strong positional preference for the aromatic residues phenylalanine (Phe) and tyrosine (Tyr), relative to analogous cysteine landmarks. More than eighty percent of the members in each protein family were found to possess the same conserved -Cys-(Xxx)3-4-(Phe/Tyr)- arrangement, placing an aromatic amino acid at analogous EGF-C5+4, Sushi-C2+4, Laminin-C7+4 and IgC-C1+5. Over seventy percent of EGF, Sushi and IgC sequences exhibited a second obvious Cys-associated aromatic site, -(Phe/Tyr)-Xxx-Cys-, at EGF-C4-2, Sushi-C2-2 and IgC-C2-2. The cysteine-associated placement of aromatic amino acid chemistry in four major disulfide-dependent protein families likely represents conservation of a molecular determinant of global importance in the structure-function of this large and diverse subset of extracellular proteins.
One of the major impacts of advanced protein bioinformatics has been to demonstrate the widespread deployment of a few selected “modular” protein structures across the range of modern genomes [1,2]. The incorporation of one or more epidermal growth factor (EGF), Immunoglobulin (IgC), complement-like Sushi and/or Laminin motif structures into numerous circulating enzymes, growth factors and biochemical modulators, mixed-module and poly-modular membrane-bound receptors and extracellular matrix proteins is well documented in the protein database [3-6]. The prevalence of EGF, Immunoglobulin, Sushi and Laminin homologues in characterized genomes, from slime mold to humans, clearly demonstrates evolutionary selection of some ancestral utility. However, whether these heavily employed and often repetitive protein modules simply represent convenient peptide backbones upon which to incorporate diverse usage-defined determinants of molecular interaction, or whether they contribute some inherent motif-specific or global functionality to the larger protein super-structures in which they are incorporated remains largely unknown.
The reliable annotation of individual EGF, Sushi, Laminin and IgC domain sequences and subsequent assembly into separate motif-specific sequence files, based on extended taxonomical relationships identified by respected sequence alignment tools [7,9], makes secondary analysis of these now well-established families of related protein orthologs and paralogs possible. The Immunoglobulin, Sushi, EGF and Laminin sequence families exhibit distinct consensus patterns of two, four, six or eight cysteine residues, respectively. These familial cysteine landmarks, which constitute the basis for four distinct disulfide-stabilized native protein structures, are easily recognized and aligned despite often dramatic variability in the length, amino acid composition and sequence of the inter-cysteine strands that comprise each member of these large and diverse protein families. The exhaustive compilation and vertical registration of cysteine-aligned EGF, Sushi, Laminin and IgC domain sequence files have made it possible to visualize this sequence diversity, as well as to identify both motif-specific and shared elements of structure-function occurring within the respective cysteine-defined consensus structures.
The continuously updated taxonomical listing of available EGF, Sushi, Laminin and Immunoglobulin constant domain (IgC) protein sequences was accessed through the National Center for Biotechnology Information’s network of linked bioinformatics resources at www.ncbi.nlm.nih.gov [10,11]. This study evaluated more than fifteen hundred individual EGF and calcium-binding EGF (cbEGF) motif modules, including the EGF sequence families cd00053, pfam00008, smart00181, cd00054, and smart00179 registered in NCBI’s Conserved Domains database [
With nearly absolute conservation of their motif-defining disulfide-bond structures, the alignment of analogous cysteines within respective Laminin, EGF, Sushi, and IgC protein families was relatively uncomplicated. Guided by the NCBI’s Multiple Alignment Viewer, the extensive sequence files generated for each protein family were visually aligned by simple vertical registration of analogous cysteine (Cys) residues, with variable-length inter-cysteine strands each divided mechanically at the midpoint, with no attempt to align other homologous amino acid chemistry (see
“registered” sequence arrays represent the actual linear arrangement of amino acids in the vicinity of each landmark cysteine, with none of the gaps introduced by methods employing traditional homology scoring and local alignment algorithms. In order to ensure that every EGF, Sushi, Laminin and IgC sequence represented a unique entry in the database, specific effort was made to exclude the many identical sequence entries bearing different identifiers. As long as each entry satisfied NCBI’s statistical homology-based scoring (E-value) standards to be included in one of the respective protein families, no effort to filter or otherwise adjudicate the inclusion or exclusion of specific individual or groups of protein orthologs or paralogs was applied.
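The vertical registration described above can be illustrated with a minimal Python sketch. It is not the authors' procedure, which was carried out visually with the help of NCBI's Multiple Alignment Viewer. The function names `split_at_midpoint` and `register`, the `.` padding character, and the rule that the extra residue of an odd-length strand stays with the preceding cysteine are all assumptions made for illustration. The sketch also assumes every sequence contains only its landmark cysteines.

```python
def split_at_midpoint(strand):
    """Divide an inter-cysteine strand into a left half (kept next to the
    preceding Cys) and a right half (kept next to the following Cys)."""
    # Assumption: for odd lengths the extra residue goes to the left half.
    mid = (len(strand) + 1) // 2
    return strand[:mid], strand[mid:]


def register(sequences, pad="."):
    """Return equal-length rows in which analogous cysteines share a column."""
    parts = [seq.split("C") for seq in sequences]
    n_fields = len(parts[0])
    if n_fields < 2:
        raise ValueError("sequences must contain at least one cysteine")
    if any(len(p) != n_fields for p in parts):
        raise ValueError("all sequences must contain the same number of cysteines")

    # Field 0 is the N-terminal tail, the last field is the C-terminal tail,
    # and every field in between is an inter-cysteine strand.
    halves = []
    for p in parts:
        row = [("", p[0])]                              # N-tail sits against the first Cys
        row += [split_at_midpoint(s) for s in p[1:-1]]  # strands split at the midpoint
        row.append((p[-1], ""))                         # C-tail sits against the last Cys
        halves.append(row)

    # Column widths are set by the longest half-strand in each field.
    left_w = [max(len(r[i][0]) for r in halves) for i in range(n_fields)]
    right_w = [max(len(r[i][1]) for r in halves) for i in range(n_fields)]

    rows = []
    for r in halves:
        fields = [
            left.ljust(left_w[i], pad) + right.rjust(right_w[i], pad)
            for i, (left, right) in enumerate(r)
        ]
        rows.append("C".join(fields))
    return rows


if __name__ == "__main__":
    for line in register(["AACDEFCGG", "ACDCGGG"]):
        print(line)
```

Padding is placed only at the midpoint of each strand, so the residues flanking every cysteine keep their true linear distance from that landmark.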
The Cys-aligned sequence text files were then imported into standard spreadsheet format and subsequently converted to aligned “sequence tables” for automated tabulation of amino acid distribution. The cumulative occurrence of each of the twenty possible amino acids was assessed at each position in every Laminin, EGF, Sushi and IgC sequence in their respective datasets. A total of more than one hundred and fifty thousand individual amino acids were evaluated, and the cumulative positional occurrence of each residue was plotted relative to the nearest Cys landmark. This relative positional mapping permitted examination of these enormously diverse sequence datasets despite often dramatic variation in the length of analogous inter-cysteine strands, both within and across protein families.
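The tabulation step can be sketched in Python as follows. This is a hedged illustration rather than the spreadsheet workflow actually used: the helper names `landmark_positions`, `tabulate` and `site_label` are hypothetical, ties between two equally distant cysteines are assigned to the preceding one by assumption, and every cysteine in a sequence is treated as a landmark.

```python
from collections import Counter, defaultdict


def landmark_positions(seq):
    """Yield (cys_number, offset, residue) for every non-Cys residue,
    measured from the nearest cysteine in the linear sequence."""
    cys = [i for i, aa in enumerate(seq) if aa == "C"]
    if not cys:
        raise ValueError("sequence contains no cysteine landmark")
    for i, aa in enumerate(seq):
        if aa == "C":
            continue
        # Nearest cysteine; on a tie the preceding Cys wins (assumption).
        k = min(range(len(cys)), key=lambda j: (abs(i - cys[j]), cys[j] > i))
        yield k + 1, i - cys[k], aa


def tabulate(sequences):
    """Count residues at each (cys_number, offset) site across a dataset."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cys_no, offset, aa in landmark_positions(seq):
            counts[(cys_no, offset)][aa] += 1
    return counts


def site_label(family, cys_no, offset):
    """Format a site in the text's notation, e.g. EGF-C5+4 or IgC-C2-2."""
    return f"{family}-C{cys_no}{offset:+d}"


if __name__ == "__main__":
    table = tabulate(["ACDEFYC", "CGGYGC"])
    for (cys_no, offset), tally in sorted(table.items()):
        print(site_label("Demo", cys_no, offset), dict(tally))
```

Because each residue is indexed by its offset from a cysteine rather than by absolute position, sequences with inter-cysteine strands of different lengths contribute to the same sites.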
As a consequence of compiling and evaluating the distribution of individual amino acids in motif-specific datasets, general amino acid composition data for each protein family was generated (
The distribution of amino acids in disulfide-dependent protein families was systematically evaluated and the cumulative occurrence of homologous amino acid chemistry determined relative to essential familial cysteine landmarks. Data for the different amino acid functional classes, acidic residues, basic residues, alcohol-based and amide/imidazole-containing polar amino acids, residues with aliphatic and aromatic sidechains, and the structural residues glycine and proline (not shown) were determined separately for each protein family.
a The total number of residues identified in each class for each protein family was calculated as a percentage of 19,054 total Laminin residues; 62,185 total EGF residues; 27,017 total Sushi residues; and 58,821 total IgC residues, respectively.
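The percentage calculation described in this footnote can be sketched in Python as below. The assignment of residues to classes in `CLASSES` is an assumption based on the class names given in the text (the source does not list the members of each class), and `class_composition` is a hypothetical helper name.

```python
from collections import Counter

# Assumed membership of each functional class (one-letter codes).
CLASSES = {
    "acidic": "DE",
    "basic": "KR",
    "alcohol": "ST",
    "amide/imidazole": "NQH",
    "aliphatic": "AVLIM",
    "aromatic": "FYW",
    "structural": "GP",
    "cysteine": "C",
}


def class_composition(sequences):
    """Return each class's share of all residues in the dataset, in percent."""
    lookup = {aa: name for name, members in CLASSES.items() for aa in members}
    counts = Counter(
        lookup.get(aa, "other") for seq in sequences for aa in seq
    )
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in sorted(class_composition(["CDEKFY", "GPST"]).items()):
        print(f"{name:16s} {pct:5.1f}%")
```

The denominator is the total residue count of the whole family dataset, matching the per-family totals quoted in the footnote.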
Full-length Laminin motif modules employ eight consistently located cysteines that represent landmarks around which the distribution/accumulation of all other residues was determined (
The EGF superfamily represents perhaps the largest and most diverse collection of related protein structures in all of nature. A few sites of preferred amino acid chemistry can be observed for most functional group classes (
The Sushi domain data demonstrated a slightly higher and more uniform distribution of all amino acid classes (
sequences incorporate an amide/imidazole site at Sushi-C1+8, a degree of conservation similar to that observed for this functional class in the EGF family.
Immunoglobulin constant (IgC) domains have the longest peptide chain-length of the four families and a single disulfide (
Analysis of the data compiled for aromatic residues showed a strong preference for site-specific accumulation/distribution of phenylalanine (Phe) and tyrosine (Tyr) residues in all four protein families (
Over time, the usual mechanisms of genetic variation, acting on protein structure-function at the molecular level, have resulted in the duplication, mobilization and adaptation of ancestral EGF, Sushi, Laminin and Immunoglobulin genes to establish the recognized families now widely employed across modern eukaryotic genomes [1,2,14,15]. More than three thousand annotated EGF, Sushi, Laminin and Immunoglobulin sequences, each representing a unique protein ortholog, paralog or other polymorphism, were systematically extracted from the protein database. The composition of each of the resulting motif-specific sequence datasets was not manipulated or purposefully constructed in any way, to avoid introducing any preconceived bias into the data, but this otherwise blind approach to assembling the most comprehensive datasets possible certainly reflects selective trends of past and present sequence discovery efforts. The inclusion of a few highly repetitive or heavily represented paralogs might be observable, but would minimally affect accumulation data of the size and diversity employed here. To fully appreciate the sequence diversity of relevant protein families, readers are directed to representative alignments of original source data at www.ncbi.nlm.nih.gov/cdd [
The method of simple vertical registration has been similarly used in previous tabulation of antibody framework residues in examining large sets of Immunoglobulin variable (IgV) domains [16,17]. The study presented here looked exclusively at amino acid functional group chemistry with respect to actual linear sequence relationships only, unlike algorithm-based alignment methods complicated by the many natural insertions and deletions responsible for variable length inter-cysteine strands. The mechanical splitting of inter-cysteine sequences of different length makes data recorded at sites farther from landmark cysteines generally less significant, especially for the highly variable-length strands of the EGF family. The less variable strands in Sushi, Laminin and IgC families exhibited sequence homology even some distance from relevant cysteines. This method for examining the evolutionary success of four major motif modules has made it possible to illustrate the sequence diversity that has accumulated, as well as to demonstrate the extreme conservation of selected cysteine-associated sites of preferred amino acid chemistry shared within and across modern protein families.
Selective pressure exerted at the molecular level has provided for conservation of those elements of protein structure-function ultimately beneficial to survival, allowing diversity to exist elsewhere in protein primary structure, as long as the biochemical “fitness” of the native protein has not been too severely impaired. It would seem reasonable that in the evolutionary progression of primordial proteomes, the forces of diversification that introduce new function must be balanced by conservation of those elements of protein structure-function necessary for any modified protein to achieve and maintain a viable protein structure and permit the altered molecular entity to be successfully maintained or integrated into existing macromolecular or cellular structures, biochemical processes or physiological pathways [18,19].
While this study identified some preferred amino acid chemistry at a limited number of sites in each of the families examined, the data clearly demonstrate that all four of these disulfide-dependent modular motif structures represent readily diversifiable peptide scaffolds onto which a significant number of differently arranged amino acid chemical functionalities can be deployed to meet the biological purposes of a range of protein superstructures. The events responsible for introducing and maintaining the shared aromatic determinants -C-(X)3-4-(F/Y)- and -(F/Y)-X-C- must, however, represent an early evolutionary parallel in a diversification/selection process leading to their accumulation in the multitude of protein orthologs and paralogs now comprising these four major protein families. Being conserved in four different modular motif structures, the observed positional restriction of aromatic chemistry must provide beneficial biochemical function beyond the level of individual motif structure and likely has global ramifications for protein structure-function, potentially representing some shared determinant of common regulation, cellular trafficking, or quality control in the folding and assembly of diverse multimeric/multidomain proteins.
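For readers who wish to screen their own sequences for the two aromatic determinants, the motif notation can be expressed as regular expressions. The following Python sketch is illustrative only: it treats X as any residue, uses hypothetical names (`CYS_THEN_AROMATIC`, `AROMATIC_THEN_CYS`, `find_motifs`), and searches a whole sequence without regard to which landmark cysteine is involved, which is a simplification of the landmark-specific analysis reported here.

```python
import re

# -C-(X)3-4-(F/Y)-: Cys, three or four residues of any kind, then Phe or Tyr.
# The lookahead wrapper lets overlapping occurrences be reported.
CYS_THEN_AROMATIC = re.compile(r"(?=(C.{3,4}[FY]))")

# -(F/Y)-X-C-: Phe or Tyr, one residue of any kind, then Cys.
AROMATIC_THEN_CYS = re.compile(r"(?=([FY].C))")


def find_motifs(pattern, seq):
    """Return (start_index, matched_text) for each occurrence in seq."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    demo = "GCPNGGYTCVK"  # made-up peptide, not a database sequence
    print(find_motifs(CYS_THEN_AROMATIC, demo))
    print(find_motifs(AROMATIC_THEN_CYS, demo))
```

Start indices are zero-based positions in the input string. Mapping a hit to a site such as EGF-C5+4 would additionally require numbering the cysteines of the domain.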
Although amino acid sequence conservation is generally thought of in the context of mature native protein structure, the possibility that the cysteine-associated distribution of aromatic residues detected in linear protein sequence data might be conserved to function in nascent protein or other premature non-native protein structures must also be considered. The presence of an appropriate export leader sequence or other extracellular reference was identified for each of the proteins evaluated here, confirming that all of these cysteine-rich proteins are targeted for export via the endoplasmic reticulum (ER), where they would all be subjected to the same protein folding environment and quality control inspection. Analogous positioning of aromatic residues near essential cysteines in multiple extracellular protein families may be relevant to protein folding, quality control or trafficking, in light of evidence suggesting an aromatic-based recognition of ER folding substrates by Protein Disulfide Isomerase [20,21].
We postulate that the Cys-associated placement of aromatic functionality, observed in four major disulfide-stabilized protein families, represents the conservation of a common element for facilitating and/or signaling changes in protein conformation as part of the protein folding/quality control process in the ER. In particular, placement of a conserved aromatic site near essential cysteines might allow one or more of the growing list of ER-resident chaperones, like Disulfide Isomerase, to more easily recognize unfolded or misfolded domains where critical disulfides are either reduced or scrambled, by virtue of interaction with aromatic groups whose positions have been evolutionarily restricted to facilitate detection in non-native folding intermediates. Subsequent achievement of native disulfide configuration and the consequential establishment of a mature protein conformation where exposed aromatic groups become sufficiently concealed from any Phe/Tyr-specific ER binding proteins could contribute to signaling readiness for export.
In theory, the incorporation of a single, simple, generic determinant marking non-native protein structure would constitute the most efficient system for inspecting the enormous variety of disulfide configurations and protein conformations presented by the diverse population of disulfide-dependent protein structures that transit the ER. This capability could be especially important in monitoring the folding/assembly of repetitive or multi-subunit superstructures where the presence of an easily recognized, position-restricted determinant that demarcates and potentially facilitates the segregation of individual modules within these larger proteins during folding and assembly would seem advantageous. If or how such an aromatic “folding sensor” might cooperate with established glyco-based protein inspection systems should be considered.