J. Biomedical Science and Engineering, 2010, 3, 291-299
doi:10.4236/jbise.2010.33039 Published Online March 2010 (http://www.SciRP.org/journal/jbise/
Published Online March 2010 in SciRes. http://www.scirp.org/journal/jbise
Covariation of mutation pairs expressed in HIV-1 protease and
reverse transcriptase genes subjected to varying treatments
David King, Roger Cherry, Wei Hu*
Department of Computer Science, Houghton College, Houghton, USA.
Email: wei.hu@houghton.edu.
Received 7 October 2009; revised 27 November 2009; accepted 4 December 2009.
A previous study, focused on the correlation of muta-
tion pairs of synonymous (S) and asynonymous (A)
mutations, distinguished only between the treated
and untreated data of protease and reverse tran-
scriptase (RT) of HIV-1 subtype B. It is well known
that single mutation patterns in HIV-1 are treat-
ment-specific. It logically follows that covariation
between mutations will also be treatment specific.
Thus, our motivation is to give a more in depth study
of the covariation between mutation pairs, analyzing
not only treated and untreated, but what specific
treatments were used, and how they affected the co-
variation between the mutations differently. We in-
tended to further deepen this study by analyzing the
covariation of mutations in protease and RT in dif-
ferent subtypes of HIV-1. We found that virus sam-
ples subjected to antiretroviral Protease- and RT-
inhibitors do show different patterns of mutation
covariation in B-subtype protease and RT of HIV-1,
while maintaining the same overall trend. <A, A>
covariation will tend to be higher and more distinct
from <A, S> and <S, S> covariation after treatment.
The same trend continues in protease and RT re-
gardless of subtype. We also found the highly cova-
ried codon positions, position pairs, and position-
covariation clusters in protease, affected by different
treatments. Most of them are well known major
drug-resistance sites for these treatments.
Keywords: HIV; Covariation; Synonymous Mutation;
Asynonymous Mutation; Protease; Reverse Transcriptase;
Drug Resistance
Analysis of mutations in human immunodeficiency virus
type one (HIV-1) has become a vital component of
treatment development. This is largely due to the ability
of mutations to alter the effectiveness of retroviral drugs
in treatment.
In particular, the study of correlation, or covariation,
between mutations has been a focus. A particularly strong
correlation between amino acids can be seen as evidence
of a functional link between those amino acids. Studying
the covariation of these mutations will help both our
understanding of the HIV-1 virus, and our ability to treat it.
There is more than one type of mutation which HIV
undergoes. However, the changes in the HIV-1 genome,
which is a string of nucleotides, do not necessarily lead to
changes in the amino acids a particular portion of the
genome generates. Asynonymous mutations, or muta-
tions that affect the viral amino acid sequences, have been
the focus of much research. In a previous study [1], there
has been shown to be a significant increase in the co-
variance of asynonymous (A) mutations after treatment.
The other mutation type, synonymous (S), those muta-
tions which do not affect the viral amino acid sequence,
has not shown as extreme change due to treatment. Pre-
vious studies [1] have also shown that on the average, the
correlation between two mutations decreases as the
physical distance between the mutations increased.
These studies are hindered by the scarcity of data for
many subtypes of HIV and several varieties of antiretro-
viral drugs, since clinical tests are administered according
to the needs of the patients, not the desire for data. Ge-
netic records primarily focus on subtype B HIV, the most
prevalent variety of the virus in the western world, so
most research in turn focuses on mutations in HIV-1,
subtype B.
Previous studies [1,2,3,4,5] in this field have been
limited in scope, focusing mainly on sequences of sub-
type B, and mainly distinguishing between treated and
untreated sequences without considering the specific
treatment involved.
Our current study expands upon that research. We have
run analysis of datasets of HIV-1 sequences, distin-
guishing based on the specific drug administered. In
addition, we have run analysis on other subtypes of
HIV-1, in order to get a more complete picture of the
ways treatment, protease inhibitors (PIs) and nucleotide
292 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes JBiSE
reverse transcriptase inhibitors (NRTIs), affects the co-
variation of HIV-1 mutations.
2.1. HIV-1 Sequence Datasets
We used datasets from the Stanford HIV Drug Resistance
Database (http://hivdb.stanford.edu/). All reference se-
quences were taken from the Los Alamos HIV Sequence
Database (http://www.hiv.lanl.gov /content/index). All
data were in FASTA-format nucleotide sequences.
A reference sequence, in this study, is a consensus
sequence, found to be normative of a given genomic
region and subtype. We used one reference sequence for
each genomic region and each subtype. A mutation is
considered to be a deviation from this reference sequence.
We used two categories of datasets. Our primary
dataset, the treatment-specific, consisted only of B-
subtype protease and RT, downloaded exclusively from
the Stanford database. Only data sets of significant size
(100 or more) were used. We used two datasets of pro-
tease sequences, both of subtype B, one treated with the
drug IDV (642 sequences) and another treated with NFV
(899 sequences). The RT datasets were also of subtype B
exclusively, and included a set of sequences treated with
the drug AZT (361 sequences), and one with a common
combination of drugs, AZT, 3TC, and EFV (114 sequences).
Our second set of datasets was of treated/untreated
protease and RT of different subtypes. B-subtype,
C-subtype, and recombinant subtype AE were obtained
for both datasets. Of these, there were 8335 untreated
B-subtype protease sequences, 8138 treated. There were
8364 treated B-subtype RT sequences, 5880 untreated.
C-subtype had 1112 sequences untreated protease, 1565
treated protease, 650 treated RT, and 2202 untreated RT.
Due to lack of data, Recombinant subtype AG was ob-
tained for protease only. Also due to lack of data, we
analyzed only the RT of subtype A (106 sequences
treated, 1519 sequences untreated).
2.2. Covariation Measurements in Specific
Mutation Pairs
We used covariation measure D’ to determine the amount
of non-random association between the mutations con-
sidered in a pair. D’ is a well known measure for deter-
mining non-random association, and was used in several
previous studies, including [1]. The formula and com-
plete procedure of computing D’ can be found in [6].
We chose D’ as a measure above other covariation
measures because of its symmetry: the D’ value, which is
a value between -1 and 1, provides an equal scale for
evaluating both negative and positive correlation. This
allows us to study both positive and negative correlation
of mutation pairs.
The D’ value of a given mutation pair containing mu-
tations X and Y relies on a 2 × 2 contingency table con-
sisting of NXY, NX, NY, and NO, where NXY is the number of
sequences in the dataset which contain both mutations, NX,
is the number of sequences in the dataset which contain
only mutation X, NY, is the number of sequences in the
dataset which contain only mutation Y, and NO, is the
number of sequences in the dataset which contain neither
mutation. N is the total number of sequences in the dataset.
As in [1], we also used a value θ = (Nxy*NO)/(NX*NY)
which is a maximum likelihood estimator for independ-
ence of mutations X and Y. When θ = 1, there is complete
independence of X and Y.
We used this θ value as a cutoff when plotting our
curves. By using this value to cutoff some of the outlier
points which throw the curves off, we create more clear
and reliable plots. In our plots, we only allowed data
points with θ > 1.5 or θ < 0.5.
In each dataset, a singular cutoff was utilized, such that
mutations which occur only once in the dataset were not
used in the calculation of D’.
2.3. Counting Paradigm for Specific Mutation Pairs
The collection and calculation of the mutation pairs are
handled at the same time by the following algorithm.
Data preprocessing and alignment is just as important
to the algorithm as the central process itself. In preproc-
essing, we ensured that each sequence was correctly
aligned to the reference sequence of the same genomic
region and subtype. Each reference sequence was taken
from the Los Alamos HIV Database. If an individual
sequence couldn’t be aligned with the reference se-
quence, it was not used, as a single unaligned sequence
within a dataset can drastically affect the output of the
D’ analysis.
Gaps were not allowed in the reference sequences, but
were allowed in the data sequences provided they aligned
properly with the reference sequence. If a data sequence
was properly aligned, but longer than the reference se-
quence, we only analyzed the portion of the sequence
which could be compared with the reference sequence.
2.4. D’ Values According to Codon Position
The collection and calculation of the mutation pairs are
handled by a simple counting mechanism. We compared
all nucleotide sequences of our dataset against a con-
sensus sequence, and made not of each nucleotide sub-
stitution, and whether that substitution constituted a
synonymous or asynonymous mutation. For each se-
quence in the dataset, we record all valid pairs of muta-
tions. Mutations pairs that involve both asynonymous
mutations were labeled as <A, A>, those that involve
one asynonymous and one synonymous mutation were
labeled <A, S>, and those that involve two synonymous
mutations are <S, S>. Then, we take frequency counts on
all mutation pairs across all sequences in order to calcu-
late the D’ of each mutation pair.
D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes
For display, we use a sliding window curve. This en-
hances the readability and reliability of the curve. Sim-
ply graphing this data such that each physical position is
an average of all D’ values at that physical position give
an unsteady curve towards the greater physical distances.
As the physical distance increases, the number of data
points available for that physical distance decreases,
leading to greater oscillation as the plot goes on.
A sliding window has the same amount of data going
into each point on the graph, and is thus more reliable.
Our sliding window curves each use 3% of the data in
the set per window, with a 50% overlap.
2.5. D’ According to Genomic Position
We analyzed D’ according to amino acid position within
the genomic region as in [1]. This gives us information
on how specific codon positions interact with one an-
other within the gene, particularly in response to differ-
ent treatments.
We also performed a pair-wise analysis of these specific
mutation positions in order to reveal more on the differ-
ences between <A, A>, <A, S>, and <S, S> mutation pairs.
Using this data, we generated covariation histograms.
In these histograms, the value at each codon position is
the sum of the D’ value for all mutation pairs associated
with that position. Each mutation pair will contribute
total D’ value to the positions of its two mutations. In
this manner, positions which are either the site of great
amounts of mutation or high covariance will stand out,
with positions which are both high in mutation amount
and covariance being seen as peeks.
In order to further explore the relationships between
the amino acid positions, we cast our histograms into two
dimensional contour plots, which reveal clusters of co-
variation. To generate these plots, we form a square two
dimensional table with a length equal to the number of
amino acid positions in a given dataset. Each mutation
pair is then mapped to a position on this table, based on
the position of that pair’s mutations. For example, the
mutation pair L10I and Q20V would be mapped to posi-
tion x = 10, y = 20.
The value of each position in the table is the sum of
all D’ values of the pairs assigned to that position. This
provides a visual representation of the relationships be-
tween positions, with higher values representing posi-
tions which are highly correlated with one another, and
the lower values representing unrelated positions.
3.1. Effects of Specific Treatment on the
Covariation of B-Subtype Protease and RT
First, in order to discover the effects of specific treat-
ments on the covariation of HIV-1 mutation pairs, we
ran a D’ analysis on data sets of B-subtype protease and
RT with known treatment types. For reference, we also
included the generically treated and untreated datasets of
B-subtype protease and RT, in order to see how the dif-
ferent treatments effected the genomes and, and how the
compare to the effects of overall treatments.
Our findings revealed that covariation between muta-
tions is, as we expected, treatment dependent. In com-
paring IDV- and NFV-treated protease, these results
become clear. The plots in Figure 1, show the results of
the analysis according to physical distance, display
clearly different patterns in their covariation. The average
D’ values of <A, A> pair covariation are just at 0.3 for
both datasets, however we can clearly see peaks of high
covariance in different positions on the <A, A> curves. If
we compare these two treatment-specific plots against D’
values generated from the set of generically treated
HIV-1 sequences (those sequences that have received
treatment of any sort, plot not pictured), we can see that
the differences even more pronounced. The average D’ of
<A, A> mutation pairs in protease which has seen any
sort of treatment whatsoever is much higher—a value just
at 0.4, and yet a different set of peaks within the curve.
We see similar results in RT. The curves generated by RT
treated with the drug AZT are considerably different than
the generically treated RT, as can be seen in Figure 2.
The generic trends of covariation, however, were
largely the same despite what specific treatments were
used. <A, A> covariation tended to be higher than <A, S>
or <S, S> covariation in all datasets. In addition, we also
noticed that on average, <A, S> covariation tended to be
higher than <S, S>. This separation was even present in
the untreated dataset.
Figure 1. Different treatments lead to different patterns of
covariation. These two sliding window plots display the D’
analysis of two different treatment types. The top displays
results derived from IDV-based treatment, and the bottom plot
displays those derived from NFV. Clearly, the different treat-
ments induce quite different <A, A> covariation patterns in the
sliding window curves. The different treatments do not seem to
significantly affect the <A, S> or <S, S> curves.
294 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes JBiSE
To ensure that these were typical results that were
caused by treatment of HIV, we retrieved a dataset from
Stanford that contained sequences from the same set of
patients, 470 sequences of both before and after treatment.
Numerical analysis revealed the treatment both increased
the amount of <A, A> covariation from an average value
of 0.278 to 0.308 and increased the overall separation of
the curves. Before treatment, the average difference be-
tween <A, A> and <A, S> covariation was a value
of0.085 from 0.278 to 0.193, and the average difference
between <A, S> and <S, S> was 0.051 from 0.193 to
0.142. After treatment, the difference between <A, A>
and <A, S> was 0.104 from 0.308 to 0.204, and the
We also found that, in agreement with previous results
[1], <A, A> covariation increased when subjected to any
form of treatment. In Figure 3, we can see the changes
made by specific drug treatments, both before and after
treatment. There is a clear pattern of increase in the <A,
A> category.
There were instances where <A, S> or <S, S> co-
variation was decreased, and other instances where the
<A, S> or <S, S> covariation was increased.
Figure 2. Treatment specific RT. These graphs display the
analysis of B-type RT before and after treatment. The top plot
displays the results from the analysis of untreated B-subtype RT.
The middle plot displays the results derived from any RT se-
quences which have received any NRTI treatment whatsoever,
and the bottom plot displays the results derived from those
sequences treated only with the specific drug AZT. Clearly, the
AZT-specific treatment had a different effect than the overall
treatment. The curves for the overall treatment are very well
separated, whereas the AZT-specific curves are not as well
separated, but still somewhat distinct. The average values of the
three curves are seperated. <A, A> has an average of 0.359, <A,
S> has 0.326, and <S, S> has 0.284.
Figure 3. Before and after treatment for different drug treat-
ments. This chart shows the effects of specific treatments on
B-subtype protease and reverse transcriptase. The values here
are averaged from the three curves in the sliding window plots
we generated. In each group, the first column is <A, A> co-
variation, the second is <A, S>, the third is <S, S>. The gray
portions represent the average D’ before treatment. The black
portions represent the change in D’ from the treatment. If they
are above the gray, the D’ value increased with treatment. If
they are below the gray, the D’ decreased. The column labeled
‘Same Patients’ is the dataset containing the exact same group
of patients, both before and after treatment.
difference between <A, S> and <S, S> 0.066 from 0.204
to 0.138. These results, and the typical trends these re-
sults show, can be seen in Figure 4.
We also analyzed <A, A> mutation pairs according to
their codon positions, rather than physical distance. This
analysis can be seen in Figure 4.
The top plot shows a control analysis of generally
treated subtype B protease. In this plot, we show the
thirty positions which were most significantly affected
by the treatment, and what their total D’ value was prior
to and after treatment. This plot shows that, in almost all
significantly affected positions, there was an increase in
<A, A> covariation. In addition, we can see that several
of the most affected positions are also medically signifi-
cant, according the Stanford HIV database.
The second plot shows a comparison between the IDV
and NFV treated datasets. Again, we can see that the two
treatments cause different patterns in the covariation
pattern. Certain codon positions have roughly the same
amount of covariation after treatment, but others, in-
cluding several medically significant positions, seen in
the bottom plot, have significantly different covariation
values, such as positions 20, 46, and 82.
3.2. D’ Results from Treated/Untreated Datasets
of Different Subtypes
For this portion of the study, we did not distinguish
based on treatment type, but rather only invested in dis-
tinguishing between ‘treated’ and ‘untreated’ sequences.
Selecting a specific treatment type limited the larger
treated datasets into subsets too small for proper analysis.
D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes
Figure 4. Site-specific Analysis for B-subtype Protease. The top plot shows the thirty postitions who’s covariation was most affected
by the application of generic treatment ot B-subtype protease. The positions with * or ^ next to them are the major or minor positions
respectively that are associated with drug resistance according to the Stanford HIV Database. The bottom plot contrasts the difference
between the IDV- and NFV-treated datasets on medically significant sites.
3.3. Pair-Wise Mutation Analysis and Clustering
For use in the analysis of protease and RT, we selected
HIV-1 subtypes A, B, and C, as well as recombinant
subtypes AE, and AG. Subtype AG was only analyzed
for protease, and subtype A was only analyzed for RT,
due to lack of data.
Results of the pair-wise analysis revealed that there is a
clear-and-distinct difference between the position-based
covariance of <A, A> mutation pairs, <A, S> mutation
pairs, and <S, S> mutation pairs.
With the treatment-specific datasets, we analyzed all
datasets, and generated sliding window curves for all of
them, mapping the relationship of D’ values of mutation
pairs and their physical distances.
There tended to be far greater <A, A> covariance at
certain positions than <A, S> or <S, S> covariance in
general. Additionally, these peaks of high <A, A> covari-
ance tended to be close to one another, creating clusters or
areas of high <A, A> covariance within the genome. By
contrast, <A, S> covariation was less clustered, and <S,
S> not clustered at all. This can be seen in Figure 6.
Our results showed that different subtypes yield dif-
ferent patterns of covariation, and that once again the
typical trends were maintained on average. There was a
clear seperation of <A, A>, <A, S> and <S, S> covara-
tion, both before and after treatment, although treatment
in most cases improved the separation. There was one
exception to this: subtypes A and C RT displayed a sig-
nificant increase in<A, A> covariation, but similar in-
creases in <A, S> and <S, S> covariation lead to them
having less-separated curves after treatment.
There was also increase in <A, A> covariation after
treatment in all datasets. While this increase in <A,
A> covariation is consistent for all datasets, we did
notice that subtype-B protease and RT had a consid-
erably larger increase in covariation than any other
subtype. Figure 5 shows a summary of the findings in
this section.
Figure 5. Before and after treatment for different subtypes.
This chart shows the effects of treatment on different subtypes of
protease and RT. For the most part, data followed expected pat-
terns. Subtype A RT does not have a clear distinction between <A,
A>, <A, S> and <S, S>, but beyond that, plots behave normally.
296 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes JBiSE
Figure 6. Treated/Untreated covariation comparison for protease. These three plots show the thirty positions whose covariation was
most effected by treatment. These positions were selected because they had the largest difference in their total D’ values between
treated and untreated. The <A, A> mutation show that frequently D’ values were higher after treatment, trend that was not as clear in
<A, S> and <S, S> plots. D’ values for <A, A> covariation are higher than those of <A, S> covariation, and much higher than those
of <S, S> covariation. <S, S> covariation seems not to have been effected by treatment very much: the highest difference between
before and after was less than 6.5.
Casting these histograms into a 2D contour plot re-
vealed further information about the relationships be-
tween specific positions: we can see that covariation
between positions is clearly related to the amount of
covaration at a specific position. Two positions having
high D’ values will very likely have a high correlation.
Both the histogram and the contour mapping of generi-
cally treated protease are shown in Figure 7.
Based on the position-covariation histogram of gener-
ally-treated subtype B protease as in Figure 4, we
D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes
Figure 7. Treated B Protease. The top plot is a position relationship chart, with bright colors showing positions which are highly
correlated with one another, and dark colors showing positions which are not. The shade of the grid at the left is representative of
total D’, the sum of the covariance values for all mutation pairs at that position. The bottom plot is a histogram of D’ values for gen-
erally-treated B-subtype protease. Each column in the histogram is the sum of all values for a particular position in the 2D chart.
These charts were generated from the statistically significant mutation pairs with a Fisher Test P value less than 0.05 and a ChiSQ
Test P value less than 0.05.
selected the 20 most correlated statistically significant
<A, A> positions according to D’ value, which are: 10+^,
13+, 20+^, 33+^, 34, 36+^, 46+*, 48+*, 54+*, 62+,
63+^, 67, 71+^, 72, 73+^, 82+*, 84+*, 89, 90+*, and 95,
with + positions having also been found in [1] using the
θ value, and positions with * or ^ being sites of major or
minor drug resistance respectively according to the
Stanford HIV database. The D’ analysis has an advan-
tage of being able to find negative correlation effectively.
We also found the negatively correlated positions: 3, 30+,
64, 88+, 96. We find clusters of covariation to occur near
positions 10, 20, 37, 50, 73, and 90.
We also found top statistically significant correlated
mutation pairs in our Treated Protease dataset. In order
to do this, we sorted all <A, A> pairs according to the
Fisher Test P value followed by the ChiSQ Test P value,
298 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes JBiSE
giving us the most statistically significant pairs. Then we
chose the top thirty according to the highest D’ value.
Fisher’s Exact Test and Pearson’s Chi-Squared Test are
done by calling the functions in R: fisher.test and
chisq.test with their default values, such as the confi-
dence interval = 95% in Fisher test and Yates’s correc-
tion applied in Chi-Squared Test.
We selected the 34 most correlated position pairs from
our Treated Protease dataset, which are: (10, 46), (10,
79), (12, 19), (13, 20), (13, 89), (15, 20), (20, 34), (20,
73), (20, 90), (33, 34), (33, 89), (34, 36) (34, 54), (34,
62), (34, 71), (36, 82), (47, 73), (48, 54), (48, 82), (54,
61), (54, 71), (54, 73), (54, 82), (63, 67), (63, 72), (63,
82), (71, 72), (71, 82), (72, 73), (72, 90), (73, 90), (79,
84), and (90, 95). The correlation of these positions is
not reliant on a specific mutation, but all mutations asso-
ciated with these positions. A list of the most correlated
mutation pairs can be seen in Table 1.
Table 1. Top 30 highly covaried <A, A> mutation pairs.
Mut X Mut Y D' Fisher
Test P
Test P
<I62V(A) I66L(A)> 0.818621 5.47E-08 1.01E-07
<L63P(A) G73S(A)> 0.820203 1.66E-66 1.17E-50
<E35G(A) M36I(A)> 0.829569 2.01E-17 2.77E-18
<L10I(A) I54T(A)> 0.835171 4.23E-32 5.05E-31
<L10F(A) P79N(A)> 0.841458 5.91E-06 1.14E-09
<T4A(A) I84V(A)> 0.844762 6.99E-05 1.02E-05
<K20R(A) M36I(A)> 0.846462 0 0
<L38W(A) I62V(A)> 0.847875 5.81E-10 1.30E-09
<I13A(A) M46I(A)> 0.850174 0.000136 8.07E-05
<E35N(A) M36I(A)> 0.851419 2.58E-14 7.05E-15
<T12P(A) G68D(A)> 0.85259 5.58E-09 7.51E-31
<I66V(A) L90M(A)> 0.85367 2.77E-21 2.13E-19
<I54V(A) Q61R(A)> 0.861101 7.84E-05 5.91E-05
<T4A(A) L10F(A)> 0.861276 6.62E-07 1.35E-11
<N83S(A) I84V(A)> 0.862011 4.14E-10 8.92E-13
<G73S(A) 90M(A)> 0.868072 5.87E-239 3.71E-218
<I72K(A) L90M(A)> 0.881701 1.20E-12 2.27E-11
<I13M(A) L90M(A)> 0.88433 5.17E-09 4.38E-08
<P79A(A) I84V(A)> 0.8871 3.05E-42 6.05E-56
<L90M(A) C95F(A)> 0.890186 9.91E-44 3.35E-39
<L63P(A) I66V(A)> 0.905986 1.46E-08 2.09E-06
<L63P(A) I72L(A)> 0.908414 5.33E-21 5.82E-15
<L63P(A) I72E(A)> 0.913298 2.44E-09 5.93E-07
<G73T(A) L90M(A)> 0.913535 6.34E-89 1.12E-78
<I72L(A) L90M(A)> 0.926688 4.36E-65 4.96E-57
<G48A(A) I54V(A)> 0.926895 1.50E-09 4.67E-10
<D30N(A) K45Q(A)> 0.92856 4.25E-16 2.85E-38
<G73A(A) L90M(A)> 0.931959 1.91E-16 2.09E-14
<I66L(A) L90M(A)> 0.933267 6.96E-09 8.29E-08
<C67F(A) L90M(A)> 0.985661 3.68E-44 1.14E-36
4.1. Biological Significance of <A> Type
Mutations Versus <S> Type Mutations
Throughout the study, we can see a marked difference
between the <A, A> category mutation pairs, the <A, S>
category, and the <S, S> category. This is trend is con-
sistent and universal. <A, A> pairs are, on average, the
most covaried, <A, S> pairs are less so, and <S, S> pairs
have even less covariation. This can be clearly seen in all
plots which include the three different types of mutation
pairs, but is most clearly seen in Figure 6.
We suggest the reason for this is that <A> mutations
necessarily lead to greater covariance. Because an <A>
mutation will have a more significant impact on an or-
ganism, it is more likely to be related to other changes
within the genome. This is why <A, A> mutation pairs
have such high covariance. However, an <A> type mu-
tation might just as likely be related to a synonymous
mutation as well. Thus <A, S> mutation pairs will also
have a relatively high covariance, as opposed to <S, S>
mutation pairs. <S> type mutations have a lesser impact
on the organism at large, because the amino acid types are
We can further see this confirmed when we look at the
general covariance of mutations at specific positions, as
seen in Figure 7. <S> type mutations have a much higher
occurrence frequency than <A> type mutations. The
covariation of <A, A> and <A, S> pairs, however, is
much higher than that of <S, S> pairs. This seems to
imply that <A> type mutations have a greater effect on
the genome itself.
4.2. Biological Importance of Individual
Mutation Sites in Relation to Specific
We can see in Figure 7 the effects which treatment has
on the different mutation types. <A, A> mutation pairs in
general have a dramatic increase of covariation after
treatment. The mutation correlation patterns we dis-
covered in the bottom plot of Figure 6 are consistent to
the single mutation patterns in [7]. We find that in [7],
IDV-treated datasets negatively weight in positions 30
while NFV leads to highest positive weight among all
the other weights. Similarely, Position 76 in IDV has the
highest weight of all the other weights, while the NFV-
treatment gives that position a negative weight. This is
consistent with the findings of our plot. Note the distinct
difference between IDV and NFV at positions 30 and 76
in the bottom plot of Figure 7.
Position 30 is an interesting case, as the overall corre-
lation is negative, which seems to point out that other
mutations are frequently absent when this mutation is
present. However, we know that position 30 hosts a mu-
tation, D30N, which is correlated with other mutations
D. King et al. / J. Biomedical Science and Engineering 3 (2010) 291-299
Copyright © 2010 SciRes
when the specific PI treatment is neflinavir. This seems
to hint that other treatment types have a steep inverse
correlation at this mutation site. At the very least, we see
that the treatment IDV gives a negative correlative
weight at position 30 [7].
4.3. Differences in Covariation in Different
Treatments and Subtypes
While the general trends we found were largely consis-
tent throughout our comparison between the different
treatments and subtypes, we found the differences in the
covariation patterns between the subtypes and treatments
In Figure 4, we see that the increase in covariation
between the untreated sequences and the the sequences
which received any treatment whatsoever is far larger
than the increase in covariation present in those se-
quences only treated with individual drugs. For example,
the two drugs, NFV and IDV, have the most data within
the Stanford database. In spite of this, neither the co-
variation increase from NFV or IDV alone is enough to
cause the dramatic increase we see from generic treat-
ment of B-subtype protease. The same is true of AZT-
treated RT compared with generically treated RT. The
generically treated datasets accounts for sequences
treated with single drugs, such as NFV or IDV or AZT,
as well as those treated with combinations of drugs. Our
results, then, suggest that combinations of treatments
lead to greater covariance than single treatments. This is
further supported by the results of the RT treated with
the combination of drugs, AZT, 3TC, and EFV, which
have a greater increase in covariation than any single-
treatment, but still not as much as the generically-treated
We would like to thank Houghton College for providing the funding for
this research through the Summer Research Institute at Houghton, as
well as Qi Wang for responding to our e-mails regarding his paper [1]
both promptly and helpfully.
[1] Wang, Q. and Lee, C. (2007) Distinguishing functional
amino acid covariation from background linkage disequi-
librium in HIV protease and reverse transcriptase. PLoS
ONE, 2(8), 814.
[2] Liu, Y., Eyal, E. and Bahar, I. (2008) Analysis of corre-
lated mutations in HIV-1 protease using spectral cluster-
ing. Bioinformatics, 24, 1243-1250.
[3] Gilbert, P.B., Novitsky, V. and Essex, M. (2005) Covari-
ability of selected amino acid positions for HIV type 1
subtypes C and B. AIDS Research and Human Retrovi-
ruses, 21(12), 1016-1030
[4] Hoffman, N.G., Schiffer, C.A. and Swanstrom, R. (2003)
Covariation of amino acid positions in HIV-1 protease.
Virology, 314, 536-548.
[5] Rhee, S.Y., Liu, T.F., Holmes, S.P. and Shafer, R.W.
(2007) HIV-1 subtype B protease and reverse transcrip-
tase amino acid covariation. PLoS Computational Biol-
ogy, 3(5), 87.
[6] Hedrick, P. (1987) Gametic disequilibrium measure:
proceed with caution. Genetics, 117, 331-341.
[7] Rhee, S.Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag,
D. and Shafer, R.W. (2006) Genotypic predictors of hu-
man immunodeficiency virus type 1 drug resistance.
Proceedings of the National Academy of Sciences USA.
103, 17355-17360.