One of the problems in developing a mathematical theory of the genetic code (a summary is presented in [1], the detailed version in [2]) is the problem of calculating the genetic code. No similar problem is known in the literature, and it could only have been posed in the 21st century. This work is devoted to one approach to solving it. For the first time, a detailed description of a method for calculating the genetic code is given; the idea was first published in [3], and the choice of one of the most important sets for the calculation is based on [4]. This set of amino acids corresponds to a complete representation of the set of overlapping triple genes belonging to the same DNA strand. A separate issue was the initial point that triggers the iterative search for all codes admitted by the initial data. Mathematical analysis showed that this set contains some ambiguities, which were found by means of the compressed representation of the set that we propose. As a result, the method of calculation was reduced to two main stages, and only single-valued domains were used in the calculations at the first stage. The proposed approach made it possible to significantly reduce the amount of computation at each step in this complex discrete structure.
The idea of calculating the genetic code arose after many years of research on mathematical genetics. The basic idea was that the code can apparently be calculated today, half a century after its discovery, on the basis of experimental data already known. The approach proposed below is neither the principal nor a comprehensive one; it should be regarded as one attempt to find an approach to the solution of the task. After all, the task is to find a solution in a complex structure that describes the main characteristics of almost all existing living organisms.
Each new century puts forward new tasks in both theoretical and applied mathematics. In this paper, we consider a new problem of discrete mathematics, which could be posed only in the 21st century. This task belongs to the field of biomathematics, because biology and related tasks are key in the present century. To our knowledge, there are no other publications on this topic; therefore this work can be considered the first full account of the calculation of the genetic code.
First of all, the question arose of which sets can be used in this case. Our approach was based on the search for all codes satisfying the set of amino acids that record overlapping genes. The author has studied the mathematical analysis of such genes for a number of years [
We consider an unusual way of recording genetic information, overlapping genes, in which the same DNA portion corresponds to more than one protein. We investigated all 5 possible cases of gene overlap permitted by the DNA structure, which were studied earlier [
The principal position of this research is indicated in [
In the mathematical analysis of overlaps of more than two genes, we have investigated several problems. Of course, it would be possible to construct the sets of all e.o. for 3 to 6 genes; this is not difficult with modern computing facilities. The main question, however, is what new conclusions this can give. That is why we proceed in the traditional way, from tasks. Let us first briefly discuss only some of them, for which we have already published solutions. The first of these concerns the analysis of ambiguities [
encodings correspond to the same pair of amino acids (see the example below). Another problem was connected with the construction and analysis of a set of elementary overlaps for 3 genes overlapping in the same DNA chain. It is established that there are only 307 such overlaps. On the basis of these overlaps, a new problem was posed, connected with the calculation of the genetic code by mathematical methods [
We call e.o.-i an elementary overlap for i amino acids, where i ∈ (2, 6). Thus, the e.o. introduced earlier in [
For the amino acid Ama (marked by hatching) encoded by the triplet n1n2n3, there are 5 alternative amino acids Ama1-Ama5, whose encodings are formed by shifts of −1 and +1 in the same DNA chain (→) and of −1, 0, +1 in the complementary DNA strand (←). The designations ni, i ∈ (0, 4), are nucleotides from the set A, T, C, G, and n′i, i ∈ (0, 4), are their complementary components: n′i = T for ni = A and n′i = G for ni = C, for any i ∈ (0, 4), and vice versa. In order to sequentially isolate e.o.-2 for all 5 cases of pair overlaps from [
successively leave one of the 5 pairs of amino acids: (Ama, Amai), i ∈ (1, 5). In order to sequentially isolate all overlap cases for e.o.-3 in
The following theorem holds.
Theorem. The homogeneous subset of e.o.-6 contains only 31 e.o., which belong only to the sets e.o.-2 and e.o.-3, of which:
- 5 e.o.-2 for overlaps from one DNA chain:
- 19 e.o.-2 for overlaps from different DNA chains:
- 4 e.o.-3 for overlaps from one DNA chain.
- 3 e.o.-3 for overlaps from different DNA chains.
We give a comment on the statement of the theorem. Cases 22.1 and 22.2 correspond to the same amino acid pair Ser-Ser; this is one of six ambiguities, which were established earlier [
It follows from (2) that there are no similar e.o. for overlaps from different DNA chains, so they are not possible in the structures of homogeneous e.o.-4 to e.o.-6.
Introduction of basic sets
As follows from the previous section, homogeneous elementary overlaps occur only for overlaps involving no more than three amino acids. To select the working sets, we consider all such overlaps.
First of all, it is necessary to exclude from consideration all homogeneous overlaps in which two strands of DNA participate. Considering these overlaps requires introducing a double strand of DNA, which is an additional condition in the problem. In eliminating such homogeneous overlaps, we proceed from the principle of constructing an algorithm with a minimum number of conditions. Therefore, only homogeneous overlaps belonging to the same DNA chain remain under consideration: for pairs of amino acids (1) there are only 5 of them, and for three amino acids (3) there are 4 in total. Thus, we selected the main working sets of e.o., namely those in which these homogeneous overlaps are present. The final version of these sets is presented on pages 312-319 in [
Let us consider in more detail the sets with these overlaps. Earlier [
In
Amino acids shown in
Formulation of the problem. Introduction of a compressed set.
FORMULATION OF THE PROBLEM. Let there be a set of 4 letters N: a, b, c, d, and also triplets, i.e., arbitrary triples of these letters, of which there are 64 in all. Each of the 20 canonical amino acids can be encoded by an arbitrary combination of such triplets. The task is to search for all the genetic codes that correspond to all the elements of the three sets U1, U2, U3 designated above, corresponding to the genetic experiments.
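The search space in this formulation is small enough to enumerate directly. The following minimal sketch lists the 64 triplets over the four abstract letters; the names `LETTERS` and `TRIPLETS` are illustrative and do not come from the paper.

```python
from itertools import product

# The four abstract letters of the set N from the problem statement.
LETTERS = "abcd"

# All ordered triples of letters: 4 * 4 * 4 = 64 triplets.
TRIPLETS = ["".join(t) for t in product(LETTERS, repeat=3)]

print(len(TRIPLETS))   # number of triplets
print(TRIPLETS[:5])    # the first few, in lexicographic order
```

A candidate genetic code is then an assignment of some of these triplets to each of the 20 amino acids.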
i | Amai | U2: m12 | U2: m23 | U1: m1-3 | U1: m3-1 |
---|---|---|---|---|---|
1 | Met | 4 (Tyr, His, Asn, Asp) | 2 (Trp, Cys) | 3 (Lys, Pro, Gly) | 1 (Gly) |
2 | Trp | 3 (Met, Val, Leu) | 1 (Gly) | 3 (Phe, Pro, Gly) | 1 (Gly) |
3 | Phe | 4 (Phe, Ile, Val, Leu) | 3 (Phe, Ser, Leu) | 3 (Phe, Pro, Gly) | 2 (Phe, Pro) |
4 | Tyr | 3 (Ile, Val, Leu) | 3 (Met, Ile, Thr) | 3 (Phe, Pro, Gly) | 2 (Phe, Pro) |
5 | His | 4 (Pro, Thr, Ala, Ser) | 3 (Met, Ile, Thr) | 3 (Phe, Pro, Gly) | 2 (Phe, Pro) |
6 | Asn | 3 (Gln, Lys, Glu) | 3 (Met, Ile, Thr) | 3 (Lys, Pro, Gly) | 2 (Phe, Pro) |
7 | Asp | 2 (Gly, Arg) | 3 (Met, Ile, Thr) | 3 (Lys, Pro, Gly) | 2 (Phe, Pro) |
8 | Cys | 3 (Met, Val, Leu) | 2 (Val, Ala) | 3 (Phe, Pro, Gly) | 2 (Phe, Pro) |
9 | Gln | 4 (Pro, Thr, Ala, Ser) | 4 (Asn, Lys, Ser, Arg) | 3 (Phe, Pro, Gly) | 2 (Lys, Gly) |
10 | Lys | 3 (Gln, Lys, Glu) | 4 (Asn, Lys, Ser, Arg) | 3 (Lys, Pro, Gly) | 2 (Lys, Gly) |
11 | Glu | 2 (Gly, Arg) | 4 (Asn, Lys, Ser, Arg) | 3 (Lys, Pro, Gly) | 2 (Lys, Gly) |
12 | Ile | 4 (Tyr, His, Asn, Asp) | 4 (Phe, Tyr, Ser, Leu) | 3 (Lys, Pro, Gly) | 3 (Phe, Lys, Pro) |
13 | Val | 4 (Cys, Gly, Ser, Arg) | 6 (Trp, Phe, Tyr, Cys, Ser, Leu) | 3 (Lys, Pro, Gly) | 4 |
14 | Pro | 4 (Pro, Thr, Ala, Ser) | 5 (His, Gln, Pro, Leu, Arg) | 3 (Phe, Pro, Gly) | 4 |
15 | Thr | 4 (Tyr, His, Asn, Asp) | 5 (His, Gln, Pro, Leu, Arg) | 3 (Lys, Pro, Gly) | 4 |
16 | Ala | 4 (Cys, Gly, Ser, Arg) | 5 (His, Gln, Pro, Leu, Arg) | 3 (Lys, Pro, Gly) | 4 |
17 | Gly | 3 (Trp, Gly, Arg) | 5 (Asp, Glu, Val, Ala, Gly) | 3 (Lys, Pro, Gly) | 4 |
18 | Ser | 7 (Phe, Gln, Lys, Glu, Ile, Val, Leu) | 7 (His, Gln, Val, Pro, Ala, Leu, Arg) | 4 | 4 |
19 | Leu | 8 (Phe, Ile, Val, Pro, Thr, Ala, Ser, Leu) | 6 (Trp, Phe, Tyr, Cys, Ser, Leu) | 3 (Phe, Pro, Gly) | 4 |
20 | Arg | 7 (Gln, Lys, Glu, Pro, Thr, Ala, Ser) | 5 (Asp, Glu, Val, Ala, Gly) | 4 | 4 |
Notation. The four data columns give the number of overlaps and, in parentheses, the list of overlapping amino acids: overlaps on bases 1 and 2 (m12), on bases 2 and 3 (m23), of base 1 with the third base (m1-3), and of base 3 with the first base (m3-1). The last two columns refer only to overlaps with Lys, Phe, Pro, Gly, so this number cannot exceed 4. Letter sets: X: a, d; Y: b, c; M: a, b, c; N: a, b, c, d.
In what follows, we use the standard three-letter abbreviations for each of the 20 amino acids.
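The letter sets X, Y, M, N act as wildcards in a compressed triplet notation. As a small illustration, the sketch below expands such a pattern into explicit triplets; the helper name `expand` and the dictionary `LETTER_SETS` are assumptions introduced here for illustration only.

```python
from itertools import product

# Letter sets from the Notation: uppercase symbols stand for subsets of a, b, c, d.
LETTER_SETS = {"X": "ad", "Y": "bc", "M": "abc", "N": "abcd"}

def expand(pattern):
    """Expand a compressed triplet such as 'aaX' into explicit triplets."""
    # A lowercase letter stands for itself; an uppercase symbol for its subset.
    choices = [LETTER_SETS.get(ch, ch) for ch in pattern]
    return ["".join(t) for t in product(*choices)]

print(expand("aaX"))   # two triplets
print(expand("abM"))   # three triplets
print(expand("ccN"))   # four triplets
```

With this notation, a single pattern such as ccN records four triplets at once, which is what keeps the result tables compact.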
In [
We introduce one concise representation for the 307 elements of the principal set U3. In
It should be noted that these ambiguities correspond to the values of Ser, Leu, Arg, both along the abscissa axis and along the ordinate axis. However, the most significant area in
The above property allowed us to define the first stage of the calculation, in which the encodings for the value Ama of each element are calculated from the encodings for the corresponding pair Ama1 and Ama2. The results of the step-by-step solution of the problem are presented in
The initial approximation
THE SOLUTION OF THE PROBLEM. We use the standard three-letter abbreviations for each of the 20 amino acids listed in the first column of
A0: Amai, i ∈ (1, 20). (5)
We introduce a definition. Let us turn to the previously introduced homogeneous overlaps. As before, we call a combination of amino acids constructed on the basis of an elementary genetic overlap homogeneous if it consists of one and the same amino acid. For homogeneous elements of the set we have the following.
Encodings found at steps 1-9 of the search; the row Σ gives the cumulative number of triplets.
Ama | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Met | | | | | | | abd | | |
Trp | | | | bdd | | | | | |
Phe | bbb | bbY | | | | | | | |
Tyr | | | | | | baY | | | |
His | | | | | | caY | | | |
Asn | | | | | | aaY | | | |
Asp | | | | | | daY | | | |
Cys | | | | bdY | | | | | |
Gln | | | caX | | | | | | |
Lys | aaa | aaX | | | | | | | |
Glu | | | daX | | | | | | |
Ile | | | | | | | abM | | |
Val | | | dbN | | | | | | |
Pro | ccc | ccN | | | | | | | |
Thr | | | | | acN | | | | |
Ala | | | dcN | | | | | | |
Gly | ddd | ddN | | | | | | | |
Ser | | | | | | | | bca bcc adY | bcd bcb |
Leu | | | | | | | | cbX cbb bbX | cbc |
Arg | | | | | | | | cdN adX | |
Σ | 4 | 12 | 24 | 27 | 31 | 35 | 43 | 58 | 61 |
Property. Let the encodings Ama for homogeneous u3 have one of the following three representations:
n1N1N2, N3n2N4, N5N6n3, (6)
where lowercase letters denote single elements of the set N and uppercase letters denote subsets of this set, up to N itself. Then a homogeneous u3 can exist only if at least one base triplet, that is, a triplet with three identical letters, is used.
For the proof we successively substitute each of the representations (6) into u3:
where n′i is a single element of the set Ni, i ∈ (1, 6), and the row n.s. contains the nucleotide sequences that are formed after this substitution. In the first case, in (3), the base codon n1n1n1 was used for encoding the amino acid Ama from the bottom position; in the second, n2n2n2 from the middle position; and in the third, n3n3n3 from the top position.
We turn to the homogeneous u3 from the set U3, of which there turned out to be 4:
Within the framework of the assumption specified in the Property, the following step-by-step process of searching for a genetic code is proposed; see
Lys: aaa, Phe: bbb, Pro: ccc, Gly: ddd (9)
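The count of four homogeneous triple overlaps can be cross-checked independently. The sketch below is not the author's procedure: it assumes that the standard genetic code is already known (written in the usual TCAG table ordering) and simply scans all 5-nucleotide windows of one strand for three overlapping triplets that encode the same amino acid. The names `CODE` and `homogeneous_triples` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG ordering, one-letter amino acid symbols, '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def homogeneous_triples():
    """Amino acids X such that some window n0..n4 reads X in all three frames."""
    found = set()
    for window in product(BASES, repeat=5):
        s = "".join(window)
        # The three overlapping triplets of one strand: shifts -1, 0, +1.
        aas = {CODE[s[0:3]], CODE[s[1:4]], CODE[s[2:5]]}
        if len(aas) == 1 and "*" not in aas:
            found.update(aas)
    return found

print(sorted(homogeneous_triples()))
```

Under this assumption the scan finds exactly Phe, Gly, Lys and Pro (one-letter symbols F, G, K, P), and every window it accepts contains a triplet of three identical nucleotides, in agreement with the Property and with (9).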
For further calculations, we turn to some generalized data on the sets U2 and U1, which are given in
Step 2. From
where the first 4 elements of u1 correspond to overlaps for 3 positions (in parentheses the alternative variants are indicated, see column m3-1 from
Lys: aaX, Phe: bbY, Pro: ccN, Gly: ddN, (11)
From
The solution search in step 3 is illustrated in
where n.s. is the nucleotide sequence. From (12) we have single-valued encodings for 4 amino acids: Gln, Glu, Val, Ala, and with (9) we find:
Gln: caX, Glu: daX, Val: dbN, Ala: dcN, (13)
Thus, it is shown that the use of the introduced reduced set leads to minimal costs at this step, compared to a direct search over the 307 elements of the main set U3.
Step 4. The solution is explained in
on the basis of which we obtain
Trp: bdd, Cys: bdY (15)
Step 5. In the set U3 there are no elements that have two amino acids from the sets (11), (13), (15): all the bands for the
We get two encodings for Thr: xca, xcc, where x is not yet known. To find all the encodings for Thr, let us look at its values m1-3 and m3-1. From the equality m1-3 = 3 (Lys, Pro, Gly) it follows that x can take no more than two values, a or d, since Phe is absent from the listed set. However, the value d is impossible, because dca and dcc are Ala encodings. In addition, from the equality m3-1 = 4 (which means that the first positions of the encodings of all 4 amino acids Lys, Phe, Pro, Gly overlap with the third position of the Thr encoding), the third position in the Thr encoding is equal to N. Thus, we have:
Thr: acN. (17)
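The elimination of the unknown first letter x in Step 5 can be traced mechanically. The sketch below is one possible reading of that argument, not the author's code: a letter x is admitted only if every one of the four base amino acids whose known encodings can end in x is listed in column m1-3 for Thr, and x is then rejected if xca or xcc is already assigned. The helper names and the dictionary `known` are assumptions made for illustration.

```python
from itertools import product

LETTER_SETS = {"X": "ad", "Y": "bc", "M": "abc", "N": "abcd"}

def expand(pattern):
    """Expand a compressed triplet such as 'aaX' into explicit triplets."""
    choices = [LETTER_SETS.get(ch, ch) for ch in pattern]
    return ["".join(t) for t in product(*choices)]

# Encodings known before Step 5, taken from (11) and (13).
known = {"Lys": expand("aaX"), "Phe": expand("bbY"), "Pro": expand("ccN"),
         "Gly": expand("ddN"), "Ala": expand("dcN")}
BASE_FOUR = ["Lys", "Phe", "Pro", "Gly"]
m13_thr = {"Lys", "Pro", "Gly"}   # column m1-3 of the table, row Thr

def first_letter_candidates(m13):
    """Letters x whose overlap partners (third position equal to x) all lie in m13."""
    out = []
    for x in "abcd":
        ending_in_x = {aa for aa in BASE_FOUR if any(c[2] == x for c in known[aa])}
        if ending_in_x <= m13:
            out.append(x)
    return out

candidates = first_letter_candidates(m13_thr)
taken = {c for codons in known.values() for c in codons}
# Reject x if either of the two found Thr encodings xca, xcc is already in use.
x_values = [x for x in candidates if x + "ca" not in taken and x + "cc" not in taken]
print(candidates, x_values)
```

The letters b and c drop out because Phe (bbY) would then have to appear in m1-3, and d drops out because dca and dcc belong to Ala, leaving x = a as in (17).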
Step 6. In
on the basis of which we calculate four encodings: Tyr: bac, Asn: aac, His: cac, Asp: dac. Taking into account the data from column m3-1, we finally find:
Tyr: baY, Asn: aaY, His: caY, Asp: daY (19)
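The use of column m3-1 in Step 6 amounts to a lookup: the third position of an encoding ranges over the first letters of the base amino acids listed in m3-1. The following sketch illustrates this for the four amino acids of (19); the two-letter prefixes are read off the encodings bac, aac, cac, dac, and the function name `third_position` is an assumption introduced here.

```python
# First letters of the base codons from (9).
FIRST_LETTER = {"Lys": "a", "Phe": "b", "Pro": "c", "Gly": "d"}
LETTER_SETS = {"X": "ad", "Y": "bc", "M": "abc", "N": "abcd"}

def third_position(m31):
    """Letter class for the third position, given the amino acids in column m3-1."""
    letters = "".join(sorted(FIRST_LETTER[aa] for aa in m31))
    for name, members in LETTER_SETS.items():
        if members == letters:
            return name
    return letters   # a single letter or an unnamed subset

# Column m3-1 of the table for the four amino acids of Step 6.
M31 = {"Tyr": ["Phe", "Pro"], "Asn": ["Phe", "Pro"],
       "His": ["Phe", "Pro"], "Asp": ["Phe", "Pro"]}
PREFIX = {"Tyr": "ba", "Asn": "aa", "His": "ca", "Asp": "da"}

step6 = {aa: PREFIX[aa] + third_position(M31[aa]) for aa in PREFIX}
print(step6)
```

The same lookup is consistent with other rows of the table, for example m3-1 = (Gly) gives the single letter d for Met and Trp, and m3-1 = (Phe, Lys, Pro) gives the class M for Ile.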
Step 7. For these amino acids we draw 4 bands horizontally and vertically, respectively. In doing so, only two amino acids, Met and Ile, were found in 8 positions; see the bold font in
Let us single out two overlaps,
which were sufficient for the calculation. Taking into account the data from column m3-1, we finally find:
Met: abd, Ile: abM, (21)
where M: a, b, c
Step 7 completes the search for encodings in the uniqueness domain introduced above: there turned out to be 43 of them for the first 17 amino acids; they are given in
Step 8. We now find solutions in the area where the values of Ama1 and Ama2 do not belong only to these 17 amino acids. On the basis of
From these overlaps we find the following encodings:
Ser: bca, bcc, adY; Leu: cbX, cbb, bbX; Arg: cdN, adX. (23)
Step 9. From the ambiguity region in
From the first and second overlaps we find Ser: bcd, bcb, and from the third, Leu: cbc. The final encodings for Ser, Leu, Arg are presented in
When passing from the set of letters a, b, c, d to the canonical nucleotides A, C, T, G, 24 similar genetic codes can be obtained. Only one of them is standard, with a = A, b = T, c = C, d = G; the triplets baX, bda then become TAA, TAG, TGA, which play the role of terminator codons, i.e., codons that stop protein synthesis.
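The closing counts can be checked by expanding the final encodings. The sketch below collects the encodings (9)-(23) together with the Step 9 additions, verifies that they cover 61 distinct triplets, and translates the three unassigned triplets with a = A, b = T, c = C, d = G. The dictionary `FINAL` and the helper `expand` are illustrative names; the data are those given in the text.

```python
from itertools import permutations, product

LETTER_SETS = {"X": "ad", "Y": "bc", "M": "abc", "N": "abcd"}

def expand(pattern):
    """Expand a compressed triplet such as 'aaX' into explicit triplets."""
    choices = [LETTER_SETS.get(ch, ch) for ch in pattern]
    return ["".join(t) for t in product(*choices)]

# Final encodings in compressed notation, one space-separated string per amino acid.
FINAL = {
    "Met": "abd", "Trp": "bdd", "Phe": "bbY", "Tyr": "baY", "His": "caY",
    "Asn": "aaY", "Asp": "daY", "Cys": "bdY", "Gln": "caX", "Lys": "aaX",
    "Glu": "daX", "Ile": "abM", "Val": "dbN", "Pro": "ccN", "Thr": "acN",
    "Ala": "dcN", "Gly": "ddN",
    "Ser": "bca bcc adY bcd bcb", "Leu": "cbX cbb bbX cbc", "Arg": "cdN adX",
}

codon_to_aa = {}
for aa, patterns in FINAL.items():
    for pattern in patterns.split():
        for codon in expand(pattern):
            assert codon not in codon_to_aa, "a triplet was assigned twice"
            codon_to_aa[codon] = aa

all_triplets = {"".join(t) for t in product("abcd", repeat=3)}
unassigned = sorted(all_triplets - set(codon_to_aa))

# The standard relabeling named in the text.
to_nucleotides = str.maketrans("abcd", "ATCG")
stops = sorted(t.translate(to_nucleotides) for t in unassigned)

# Each bijection of a, b, c, d onto A, T, C, G gives one relabeled code.
n_relabelings = len(list(permutations("ATCG")))

print(len(codon_to_aa), unassigned, stops, n_relabelings)
```

The three triplets left unassigned are baa, bad and bda, i.e., baX and bda, and the number of relabelings is 4! = 24.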
The author thanks the brilliant interpreter O. N. Kozlov, who translated this text from Russian.
The work was supported by the Russian Foundation for Basic Research (project codes 16-01-00018, 17-01-00053).
Kozlov, N.N. (2017) Computation of the Genetic Code: Full Version. Journal of Computer and Communications, 5, 78-94. https://doi.org/10.4236/jcc.2017.510008