A general and elementary protein folding step was described in a previous article. Energy conservation during this folding step yielded an equation with remarkable solutions over the field of rational numbers, from which sets of sequences optimized for folding were derived. In this work, a geometrical analysis of protein beta-sheet backbone structures allows the definition of positions of topological interest. They correspond to amino acids’ alpha carbons located on, or close to, a unique axis crossing all the strands of a beta-sheet, as defined here. These positions of topological interest are shown to be highly correlated with the absence of sequences optimized for folding. Applications in protein structure prediction for the quality assessment of structural models are envisioned.

Protein structure prediction from sequences remains a major challenge even though the problem is several decades old [1,2]. Protein structure prediction was recently achieved using ab initio methods for small proteins, using templates with sequence or fold similarity or using sets of correlated mutations [3-9]. One-dimensional protein sequences can generally be predicted from gene sequences on genomic scales [10,11]. Secondary structures can also be efficiently predicted computationally from protein sequences [12-17]. However, three-dimensional protein structures have generally been solved experimentally and computationally by time-consuming and costly approaches such as X-ray diffraction on protein crystals or nuclear magnetic resonance on concentrated protein solutions. Independently, studies on protein folding allowed major conceptual advances in the understanding of general protein properties linked to the conversion of one-dimensional sequences into three-dimensional structures [18-21]. Molten globules and pre-molten globules have been characterized [22,23]. A rugged funnel-like energy landscape was described for protein folding [

The program pdb2 [

The gap is characterized by an integer value, which is the integer part of the midpoint between the gap’s ends, the gap being the set of amino acid positions for which no SOF was found (Figures 1 and 2). Independently, positions of topological interest (TIPs) were determined from the protein domain structures’ backbone either by visual analysis of the structure using the PyMOL software or by automatic annotation using pdb22 (see below). For each protein consisting of L amino acids, the number of TIPs T and the number of gaps G were noted in the Annex

The program pdb22 is available at the address: http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::pdb22; like pdb2, it is written in Perl and uses the same entry files. For each protein within the list, the pdb22 output file (.xls) provides its PDB name, the amino acid number and name in three-letter code, the start and the end of beta-strands indicated as amino acid numbers, the name of the sheet noted on the lines corresponding to amino acids found at the intersection of a beta-strand with the sheet axis, and the distance D, which is calculated in ångströms and averaged per beta-strand for each sheet consisting of n strands using the following equation:

where mindist(i) is the minimal distance between an alpha carbon of strand i and the sheet axis. The distance d is estimated for each pair of amino acids defining an axis characterized by the atomic coordinates of one amino acid’s alpha carbon in the first strand and another one in the sequence’s last strand. The sheet axis is defined as the axis for which the distance d is minimal. For a sheet, the minimum of all distances d is noted D.
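The axis search described above can be illustrated by a minimal Python sketch. It is not the pdb22 implementation (which is written in Perl and is not reproduced here); the function names `point_line_distance` and `sheet_axis` are hypothetical, the axis is treated as an infinite line, and dividing the summed minimal distances by n is an assumption based on the phrase "averaged per beta-strand".

```python
import math
from itertools import product


def point_line_distance(p, a, b):
    """Distance from point p to the infinite line through a and b (3D tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    # |ab x ap| / |ab| is the perpendicular distance from p to the line.
    cross = (
        ab[1] * ap[2] - ab[2] * ap[1],
        ab[2] * ap[0] - ab[0] * ap[2],
        ab[0] * ap[1] - ab[1] * ap[0],
    )
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in ab))


def sheet_axis(strands):
    """Return (D, a, b) for one beta-sheet.

    strands: list of strands in sequence order, each a list of alpha-carbon
    coordinates (x, y, z) in angstroms.
    Every candidate axis joins one alpha carbon of the first strand (a) to one
    of the last strand (b); d is the per-strand average of mindist(i)
    (assumption), and D is the smallest d over all candidate axes.
    """
    best = None
    for a, b in product(strands[0], strands[-1]):
        if a == b:
            continue  # two identical points do not define a line
        d = sum(
            min(point_line_distance(ca, a, b) for ca in strand)
            for strand in strands
        ) / len(strands)
        if best is None or d < best[0]:
            best = (d, a, b)
    return best


if __name__ == "__main__":
    # Idealized flat sheet: three parallel strands 4.8 angstroms apart.
    toy_sheet = [
        [(0.0, 0.0, 0.0), (3.3, 0.0, 0.0), (6.6, 0.0, 0.0)],
        [(0.0, 4.8, 0.0), (3.3, 4.8, 0.0), (6.6, 4.8, 0.0)],
        [(0.0, 9.6, 0.0), (3.3, 9.6, 0.0), (6.6, 9.6, 0.0)],
    ]
    D, a, b = sheet_axis(toy_sheet)
    print(D, a, b)
```

In this sketch, the alpha carbons closest to the selected axis in each strand would be the candidate positions of topological interest; on real, twisted sheets D is generally non-zero.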

The probability q of having C coincidences occurring at random, that is, the probability for G gaps to coincide with T TIPs within the error range e, was calculated according to Equation (2), which derives from the inclusion-exclusion principle (cf. Annex for the equation’s proof).

with

The corresponding probabilities q are reported for each protein structure defined by its PDB reference in the Annex
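As a numerical illustration of what q measures, the following Python sketch estimates by simulation the probability of observing at least C coincidences by chance. It does not implement Equation (2). It assumes a simple null model in which the gap positions are fixed and the T TIPs are drawn uniformly without replacement among the L positions, and it counts a coincidence when a gap has at least one TIP within plus or minus e residues. The function names and default parameters are hypothetical.

```python
import random


def count_coincidences(gaps, tips, e):
    """Number of gaps that have at least one TIP within +/- e positions."""
    return sum(any(abs(g - t) <= e for t in tips) for g in gaps)


def estimate_q(L, gaps, T, e, C, trials=20000, seed=0):
    """Monte Carlo estimate of P(at least C coincidences) under the null model.

    L: protein length; gaps: list of integer gap positions (1-based);
    T: number of TIPs; e: tolerated error; C: required number of coincidences.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    positions = range(1, L + 1)
    hits = 0
    for _ in range(trials):
        tips = rng.sample(positions, T)  # T distinct random TIP positions
        if count_coincidences(gaps, tips, e) >= C:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical protein: 150 residues, one gap at position 114, 6 TIPs.
    print(estimate_q(L=150, gaps=[114], T=6, e=2, C=1))
```

Such an estimate can serve as a sanity check on a closed-form value of q, provided the same definition of a coincidence is used.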

In order to compute the p-value of the test, the probability of failing at most 14 times within 46 experiments (one experiment for each protein structure associated with a PDB reference), when the probability of failure is taken as 0.5, was computed using the binomial law as in Equation (3):

$$p = \sum_{k=0}^{14} \binom{46}{k}\, 0.5^{k}\, (1-0.5)^{46-k} \qquad (3)$$
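The binomial tail described above (at most 14 failures in 46 experiments with a failure probability of 0.5) can be evaluated directly with the Python standard library. This is a minimal sketch; the helper name `binomial_cdf` is hypothetical.

```python
from math import comb


def binomial_cdf(k_max, n, p):
    """P(X <= k_max) for X following a binomial law with n trials and probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1))


if __name__ == "__main__":
    # At most 14 failures out of 46 experiments, failure probability 0.5.
    p_value = binomial_cdf(14, 46, 0.5)
    print(f"p = {p_value:.4g}")
```

Summing the exact terms avoids the error of a normal approximation, which is noticeable this far into the tail.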

The severity of this statistical test is highlighted, for example, by the data obtained for the protein of 193 amino acids referenced 3pn3 in the PDB, for which the correct identification of one coincidence for the gap was not considered successful because of the large number of TIPs defined, which is associated with a probability (q > 0.5; Annex

Independently, a program (pdb7) was written to take lists of PDB files as input and to report, within the output sequence file, the gaps and TIPs calculated using pdb2 and pdb22 respectively. For each beta-sheet, the axis was defined as the line, drawn through two alpha carbons taken in the first and last strands in the protein sequence, that minimizes the distance to the closest alpha carbon of each beta-strand, as described above. Analysis of the pdb7 output files yielded the results for the 248 correlations evaluated between gaps and TIPs (

An elementary step of protein folding was described as a folding unit or chemical group folding onto a folding entity to yield a larger folding entity [

A gap was defined as one or several amino acid(s) position(s) for which no sequence with optimal folding properties (SOF) is found. A quarter of the proteins analyzed yielded graphs of SOF which did not contain any gap. As an example, a single gap was noted between amino acids 114 and 115 for the central domain of C. symbiosum pyruvate phosphate dikinase (

Topologically interesting positions (TIPs) can be determined from protein domain structures’ atomic coordinates. Beta-sheets are typically curved planes in three dimensions because of the twist found within beta-strands [

An error e on the predicted gap positions was allowed and chosen to increase slightly with increasing protein length, as described in the methods section (

^{a}Distance is the difference between a gap position and a topologically interesting position (TIP) within a beta-strand sequence; ^{b}relative occurrence assuming a random assignment of gaps and TIPs within seven-amino-acid-long beta-strands; ^{c}occurrences in a non-redundant set of proteins, derived from the PDB, with at least one seven-amino-acid-long beta-strand.

To obtain an independent proof of this conclusion, another program (pdb7) was then written for automatic annotation of gaps and TIPs on protein sequences: the hypothesis that the observed distribution of distances between gaps and TIPs (

An elementary protein folding step was described [

While numerical applications of equations from classical mechanics are commonly carried out over the field of real numbers, the following pieces of evidence indicate that discreteness provides a useful basis, suited in particular to understanding why the genetic code is the way it is. The genetic code is remarkable because of its quasi-universality within living organisms on Earth and because it is about four billion years old [

This mathematical and physical formalism provides information on beta-sheet structures from protein sequences as shown recently for the prediction of edge strands [

In this work, the absence of sequences optimized for folding was linked to topological information on protein beta-sheets. It should be of interest to extend this analysis to other secondary structure elements such as protein helices while considering the impact of protein families and classes [71,72].

There is a need for practical methods describing complex chemical processes [

The authors thank B. Néron for interfacing pdb2 and pdb22, Y. Benoist and A. Kempf for discussions.

PDB is the Protein Data Bank; SCOP is the structural classification of proteins; SOF is a sequence optimized for folding; a gap consists of one or several contiguous amino acids position(s) for which no SOF were found; G is the number of gaps; TIPs are Topologically Interesting Positions; T is the number of TIPs; e is the error tolerated around a gap and is defined as plus or minus a number of amino acids; C is the number of coincidences between TIPs and gaps; L is the number of amino acids of a protein or its length; D is the sum for all strands of a beta-sheet of the distances from the sheet axis to the closest amino acid alpha carbon in each strand; q is a probability as defined in Equation (2); p is the statistical p-value as defined in Equation (3).

We give here a proof of the formula for the probability of having at least C coincidences between the G gaps and T TIPs in a protein of length L, up to an acceptable error of e:

The last fraction in the sum is the probability for the T TIPs to avoid j chosen gaps. With the additional binomial coefficient

But some events appear several times in these terms. Fix an integer k between G – C + 1 and G. A distribution of TIPs avoiding exactly k gaps gets counted

for every

For the sake of readability, let us note

Hence the sum we are interested in is rewritten:

And it remains to prove:

We may use an analytic proof. Define the polynomial in two variables:

Now let

Putting

^{a}In Annex