Paper Menu >>
Journal Menu >>
J. Biomedical Science and Engineering, 2010, 3, 785-790 JBiSE doi:10.4236/jbise.2010.38104 Published Online August 2010 (http://www.SciRP.org/journal/jbise/). Published Online August 2010 in SciRes. http:// www. scirp. org/journal/jbise Categorizing HIV-1 subtypes using an ant-based clustering algorithm David King*, Wei Hu Department of Computer Science, Houghton College, Houghton, USA. Email: David.King@Houghton.edu Received 28 May 2010; revised 17 June 2010; accepted 22 June 2010. ABSTRACT Human Immunodeficiency Virus (HIV) is especially difficult to treat due to its rapid mutation rate. There are currently eleven different genomic subtypes of HIV-1, as well as a number of recombinant subtypes. An area of study in bioinformatics is the development of algorithms to identify the subtypes of HIV-1 ge- nomes. Ant-based algorithms have the ability to find global solutions in optimizations problems, and are also able to process complex data efficiently. We pro- posed a new technique named Ant Colony Anchor Al- gorithm (ACAA), using anchors of training data on a topographic map to categorize HIV-1 sequences based on ant-based clustering. We used three sets of se- quences from the POL region of the HIV-1 genome. We categorized these three dataset with the Subtype Analyzer (STAR), a current HIV-1 categorization al- gorithm, and the ACAA. We found that the ACAA returned higher accuracy values of 83.2%, 67.1%, and 53.5% for our three datasets respectively, than the STAR’s 47.3%, 49.4% and 18%. The results of the ACAA are the average results of 20 runs of the algo- rithm. We also observed the performance of the algo- rithm on specific subtypes, and observed that while the STAR and ACAA performed with similar accuracy on several subtypes (A, B, and C in particular), the ACAA had a significant advantage over the STAR in others, especially in categorizing recombinant subtypes. Keywords: Ants; Clustering; HIV; Classification; Subtype; STAR; ATTA; Bioinformatics 1. INTRODUCTION 1.1. HIV-1 Subtype Categorization Human Immunodeficiency Virus (HIV), which causes Acquired Immunodeficiency Syndrome (AIDS), falls into two broad categories: HIV-1 and HIV-2. HIV-2 is largely constrained to West Africa, and has a relatively lower transmission rate. HIV-1, on the other hand, has a very high transmission rate, and accounts for most HIV infections on a global scale. Treatment of HIV-1 is a difficult process because of the virus’ exceptionally high mutation rate. When DNA is replicated in the reproductive cycle of the virus, any- where between 1 in 1000 and 1 in 10000 nucleotides will be improperly transcribed. This has lead to an exception- ally diverse pool of viruses, genetically speaking. HIV-1 has been classified into 11 subtypes based on genomic patterns and variations (subtypes A-H, J, N, and O). Additionally, there are recombinant types of these vi- ruses, which contain combined DNA from two or more viruses. These recombinant subtypes are caused by the infection of a single cell by multiple viruses [1]. There is neither a cure for HIV-1 nor a preventative vaccine. Antiretroviral drugs have been shown to be an effective treatment. These drugs interfere with the DNA of the virus. As such, the genetic makeup of the virus has tremendous effect on its resistance to the treatments. Because the subtype has such a significant impact on the effectiveness of the drugs, categorization of different subtypes of HIV-1 is critical to the treatment process. The development of an algorithm to effectively cate- gorize HIV-1 is of prime interest in this area of study. One current categorization algorithm is the STAR (Sub- type Analyzer), developed by University College Lon- don Centre for Virology. The STAR functions by main- taining profiles of these 11 categories. These profiles are calculated based on aligned training sequence data. Each profile is a two-dimensional frequency matrix, where the number of rows in the matrix is the length of the train ing sequences in amino acid positions, and the number of columns is 20—one for each possible amino acid. These profiles are calculated based on training sequence data, where each position in a subtype profile is the frequency of that particular amino acid at that position in all train- ing sequences of that subtype. A global profile matrix is 786 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 785-790 Copyright © 2010 SciRes. JBiSE also calculated, using the amino acid frequency from all training sequences rather than just a single subtype [1]. Each test sequence to be categorized receives a Z-score based on each subtype profile. The test se- quence is categorized as the subtype which gives the score closest to the mean. If the sequence is not within 1.5 standard deviations of the mean for any of the sub- types, the virus is categorized into the lump category, ‘unassigned’ [1]. This is not an innately helpful cate- gory; however neither is it innately harmful. It is pre- ferred to leave a sequence unclassified rather than mis- classified. There are other algorithms, similar in nature to the STAR, which have also been turned towards the purpose of subtype categorization. The SUDI algorithm1, devel- oped by Los Alamos National Laboratory, uses phylo- genic analysis to determine subtype. The HIV Genotyp- ing algorithm2, developed by NCBI uses both a similar- ity search as well as a bootscanning similarity process to categorize subtypes. Stanford’s HIVseq3 algorithm cate- gorizes based only on a simple similarity search [2]. 1.2. Ant-Bas ed Clustering Algorithms Ant-based algorithms fall under a category of algorithms which model their behavior from ants in th e wild. These algorithms have found effective applications in bioin- formatics because of their ability to find global solutions in optimization problems, as well as for their ability to efficiently process complex data quickly [3]. Ant-based clustering algorithms refer specifically to those algorithms which model the clustering behavior of ants in the wild. The behavior of these social insects displays a high level of swarm intelligence. In this in- stance, the habits of many relatively simple individuals form a pattern of behavior with distinct order and pur- pose. In the same way that real ants will cluster objects into like p iles within a space (e.g., within a colony, eggs will be clustered into one chamber, while food will be clustered into another), the algorithm will cluster like pieces of data into the same area on a two-dimensional topographic map, forming clusters. Clustering data on a two dimensional map can greatly reduce the dimension- ality which needs to be analyzed, allowing for much more efficient analysis and evaluation than would be possibl e w ith the raw data [4]. Ant-based clustering algorithms were originally pro- posed in 1991 by Deneubourg et al. [5], and were de- signed to emulate the sorting and clustering behaviors of ants. Early algorithms of this variety were ineffective, but showed enough potential for clustering data that new improvements continued to occur [3]. One such improvement, called ATTA (Adaptive Time- dependent Transporter Ants), was proposed by Julia Handl [3], and is effective at clustering like data points on a two dimensional map. A sample output of the ATTA can be seen in Figure 1. The ATTA utilized formulae which were altered from those used in previous algo- rithms. Several other changes were made to the process of the algorithm which increased the ability of the algo- rithm to produce a distinct clustering [6]. The goal of this study is to develop a new algorithm based on the ATTA to categorize HIV-1 subtypes, and to compare the performance of our algorithm with that of the STAR (http://www.vgb.ucl.ac.uk/starn.shtml). 2. DATA 2.1. HIV-1 Sequence Data An analysis of the STAR algorithm used sequence data which includes the entire protease region of the HIV-1 genome and 340 amino acids of the reverse transcriptase (RT), making for a total sequence length of 439. The protease and RT regions of HIV-1 play a role in the re- productive cycle of the virus. Common antiretroviral treatments interfere with the functions of these regions. Because of their medical significance, sequence data of Figure 1. ATTA. This is visual representation of the ATTA. To generate this plot, 800 two-dimensional vectors were generated, each containing values close (within a vector distance of 1) to one of the points [2, 2], [2, 8], [8, 2] or [8, 8]. The points were assigned colors based on which of the four points they were associated with, then clustered using the ATTA. 1http://www.hiv.lanl.gov/content/hiv-db/SUDI/sudi.html 2http://www.ncbi.nih.gov/projects/genotyping/ 3http://hivdb.stanford.edu/ D. King et al. / J. Biomedical Science and Engineering 3 (2010) 785-790 787 Copyright © 2010 SciRes. JBiSE protease and RT has a high availability, making them prime for use in categorization algorithms [1]. In this study, we used three datasets, each covering different subtypes and different portions of these genomic re- gions. Dataset 1 contained sequences of the same 439 amino acid region as in Myers et al. [1]. We include data from 8 subtypes (Table 1). We obtained these sequences from the Los Alamos National Laboratory HIV Sequence Da- tabase. Each subtype contained no more than 50 se- quences and no less than 35. Dataset 2 contained sequences of a shorter length: 99 amino acids from the protease, and only 240 amino acids from the RT (339 amino acids total, nucleotide positions 2253 to 3270). These were also downloaded from the Los Alamos National Laboratory HIV Sequence Data- base. 10 subtypes were covered, with sequence numbers between 35 and 50 (Table 1). Dataset 3 contained sequences of the protease ge- nomic region (99 amino acids, ranging from nucleotide positions 2253 to 2550). These were downloaded from the Stanford drug resistance database. This dataset con- tained 10 subtypes (Table 1). Again, each data type had no more than 50 and no less than 35 sequences. Because misaligned sequences can significantly re- duce the effectiveness of sequence analysis, we per- formed a multiple alignment on all sequences in each dataset using the ClustalW2 algorithm. In each dataset, every subtype was separated into training and test data. For the purposes of consistency, we used 30 sequences as training data for every subtype. All other sequences (between 5 and 20 in number) were used as test data to be categorized. All datasets are available on request from the author. 2.2. HIV-1 Sequence Encoding Each dataset was a collection of sequenced HIV-1 ge- nomes. Each sequence consisted of a list of amino acids. In order to properly analyze these sequences, we con- verted each into a binary vector. Because there are twenty different possible amino acids, we mapped each amino acid to a unique twenty-dimensional binary vector. For example, we mapped the amino acid Alanine to the point [1, 0, 0, 0, 0...], while we mapped Arginine to [0, 1, 0, 0, 0...]. After all the amino acids in a sequence were converted, we concatenated these individual vectors into an ordered, high-dimensional sequence vector, with di- mensionality of 20N, where N is the number of amino acids in the sequence. In this manner, each unique amino acid was represented in the sequence vector. 3. METHODS 3.1 The ATTA Algorithm The ATTA takes as input a collection of vectors, called documents—a term used by the ATTA. These documents are distributed randomly across a two dimensional grid called a topographic map. The vector data of the docu- ments bears no relation to their initial coord inates on the topographic map. The goal of the algorithm is to cluster the documents on the topographic map such that the documents which are similar to one another will be lo- cated near one another on the map. The ants are simple entities which have the ability to wander freely across the map. They are able to pick up, carry, and drop documents as the algorithm runs. The probability that a document is picked up or dropped by an ant is dependent on the nearby documents. For each iteration of the algorithm, a random ant from the colony is chosen. The ant moves to a different loca- tion on the map within a user-defined distance (our maximum movement distance was 50). At the beginning of the algorithm, this movement is randomly generated, but later is based partially on previous movements of that ant. The ant attempts to drop the document it is car- rying based on a probability calculated from the pdrop formula. If the drop action is successful, the ant will choose a new document at random, and attempt to pick with a probability calculated from the ppick formula. If this operation is unsuccessful, the ant will choose an- other document at random, and try again. This continues until the ant has successfully picked a new document [3]. The ppick and pdrop formulae are defined as: else if ifif ppick 2* 2* )( 11)(0.1 elseif ifif pdrop 4* 2* )( 1)(0.1 Table 1. This table contains the subtypes used for each dataset, as well as the number of test data sequences for each subtype. Any subtype with the heading N/A was not included in that dataset. Subtype A B C D F F1 G J K O U AE AG Dataset 1 20 20 20 20 N/A 7 19 N/A N/A N/A N/A 20 20 Dataset 2 20 20 20 20 N/A 20 20 N/A N/A 8 10 20 20 Dataset 3 20 20 20 20 20 N/A 20 20 20 N/A N/A 20 20 788 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 785-790 Copyright © 2010 SciRes. JBiSE where otherwise0 0 ),( 10)( ),( 1 1 )( * 2 * ji jif ji if Jif where i is the candidate document, σ is the neighborhoo d size, J is the set of documents in the neighborhood, j is the current document in J, δ(i, j) is the vector distance between document i and document j, and α is a user-defined constant. It is important to note that since the vector information of all documents is constant, δ(i, j) need only be calculated once for all possible document pairs. Because the algorithm merely refer- ences pre-calculated values, it is able to process even high dimensional vectors very efficiently. The value of f* becomes larger as i becomes more similar with the other elements in J. As such, ppick¸ which has an inverse relation to f*, becomes smaller as f* increases, leading to a dramatically decreased likeli- hood that the element is picked up when it is nearby similar elements. By contrast, pdrop skyrockets when f* increases, leading to a very high likelihood that an ele- ment will be dropped amongst like elements. These formulae are specific to the ATTA, and are revised fro m those used in previous ant-based clustering algorithms [7]. In this manner, after a large number of successive it- erations, the documents with similar vectors are likely to have similar Euclidian positions, forming clusters of like documents on the map. The output of the ATTA is a clustering of the input vectors. The quality of these clusters is measured by a Pearson Correlation, which evaluates the correlation between the distances of two documents on the topog- raphic map and the distances between the vectors of the two document s [7] . 3.2 Modifications to the ATTA The ATTA is a clustering algorithm, not originally in- tended for categorization purposes. Where the ATTA takes in all documents at once, and clusters them all in the same manner, categorization algorithms build a clas- sifier based on training data, and then apply the classifier to the test data to categorize. Using the ATTA as a base, we built a categorization algorithm to categorize input test sequences using a classifier build from training se- quences. Our classifier takes the form of anchors of training data in fixed positions on the topographic map. We as- sign each subtype of training data a unique square region on the topographic map, called the anchor area. The Euclidian positions of the training documents are fixed. This anchoring technique ensures that the training documents of each subtype are well separated, and will not move through the clustering phase of the algorithm. As a result, the ants will tend to drop test documents of a given subtype near the pre-defined anchor of the same subtype. We call this the Ant Colony Anchor Algorithm (ACAA). A sample output of the ACAA is visible in Figure 2. The pseudo- code for the AC AA is available in Section 1 of the supplementary file. 3.3 Subtype Categorization Based on Clustering When running the ACAA on HIV-1 sequences, we con- verted each sequence to a binary vector as in Subsection 2.2. Each document in the ACAA represents one HIV-1 sequence. After the clustering process of the ACAA is finished, we defined a local area for each test document. The ra- dius of this local area was one half the width of the an- chor areas. Each test document was classified as the subtype which has the highest representation of training documents within the test document’s local area. This is a modified version of the classification strategy used in the study by Lee et al. [8]. Because the ACAA is non-deterministic, we utilized repetition to stabilize the results of the categorization. We recorded each test sequence’s predicted subtype in each repetition of the ACAA. There are a total of 20 repetitions in our experiment—further repetitions did not Figure 2. ACAA. This figure clearly shows the anchors of training data on the topographic map, with test data clustering around the training anchors of appropriate type. D. King et al. / J. Biomedical Science and Engineering 3 (2010) 785-790 789 Copyright © 2010 SciRes. JBiSE improve performan ce significantly. The final categoriza- tion is based on majority votes. 4. RESULTS We categorized our test data using the ACAA and the STAR’s online interface. For each dataset, we main- tained a record of the number of sequences correctly classified, the number incorrectly classified, and the number left unclassified. The STAR is deterministic in nature, and so the cate- gorization was only run a single time. Because the ACAA is non-deterministic in nature, we ran the algo- rithm on each dataset twenty times, taking the average values of the twenty runs. We also recorded the individ- ual run which produced the highest rate of correct cate- gorization (Table 2). The ACAA yields a more accurate classification than the STAR for all datasets. For the ACAA, the lengthier Dataset 1 yields the best results, while Dataset 2 still gives reasonable accuracy. The STAR was able to cate- gorize both Datasets 1 and 2 with approximately the same level of accuracy. In both the STAR and the ACAA, the categorization of Dataset 3 was not very successful, although the ACAA did categorize with a much higher accuracy than the STAR. In all datasets, we observe that the ratio of inaccuracy to unclassified sequences is higher in our algorithm than in the STAR. In addition, in Datasets 2 and 3, we ob- serve a higher number of misclassified sequences in the ACAA than in the STAR. The statistical significance of these results is undeni- able—every accuracy rating of the ACAA had a P-v alue of 0.0. The accuracies generated by the STAR had P-values of 0.0 for Datasets 1 and 2, and 0.004 for Dataset 3. The statistical significance was determined by gener- ating 10,000 randomized classification predictions for each dataset, and comparing the accuracy of each against the accuracy of the prediction generated by the ACAA. The statistical significance was the percentage of ran- domly generated predictions with accuracies greater than those produced by the ACAA. In addition to analysis on each specific dataset, we also observed distinct differences in the ability of both algorithms to categorize specific subtypes. To measure this, we maintained records of the accuracy for each specific subtype in each dataset (Table 3). In so doing, we observed that there are clear differ- ences in how certain subtypes categorize. Typically the subtypes A-D are easier to categorize than others. The ACAA gives a more accurate classification on recombi- nant subtypes, especially AE. Other subtypes, particu- larly J, K, and U were problematic to classify in both instances. We observed in both algorithms that, while the accu- racy for some subtypes increased as sequence length increased, there were some subtypes which were more accurately categorized when the sequence length was shorter—the G subtype, for instance. Overall, however, we find that the ACAA gives a more consistently accurate categorization for the subtypes used in this study than the STAR gives for all sequence lengths. Table 2. Results of Analysis. This table shows a synopsis of the performance of the ACAA and STAR. Accuracy is the percentage of sequences correctly classified, inaccuracy is the percentage incorrectly classified, and unclassified is the number of sequences which were not assigned a category. The ACAA results are the average values from 20 runs of the algorithm, with the values in parentheses being the results of the run which. STAR accuracy STAR inaccuracy STAR unclassifiedACAA accuracyACAA inaccuracy ACAA unclassified Dataset 1 47.3% 16.4% 36.3% 83.2% (85.6%) 6.9% (6.1%) 10.1% (8.2%) Dataset 2 49.4% 16.9% 33.7% 67.1% (70.2%) 19.9% (16.9%) 13.0% (12.9%) Dataset 3 18% 17% 65% 53.5% (57%) 27.3% (30.5%) 19.2% (19.5%) Table 3. Subtype specific Results. This table shows the accuracy for each subtype for both the ACAA and the STAR. Results of the ACAA are the average of the same twenty runs as used in Table 2. Subtype A B C D F F1 G J K O U AE AG Dataset 1 ACAA STAR 86.7% 55% 97.7% 95% 95% 100% 78.2% 5% N/A N/A 6.4% 0% 46.8% 21% N/A N/A N/A N/A N/A N/A N/A N/A 100% 0% 100% 80% Dataset 2 ACAA STAR 82% 75% 94.7% 80% 49.5% 55% 71.7% 75% N/A N/A 55% 0% 4.5% 25% N/A N/A N/A N/A 100% 100% 1% 0% 100% 0% 99% 90% Dataset 3 ACAA STAR 56.5% 60% 91.2% 0% 85.5% 80% 37% 15% 17.2% 0% N/A N/A 75.7% 25% 27.2% 5% 8% 0% N/A N/A N/A N/A 65% 0% 71.2% 0% 790 D. King et al. / J. Biomedical Science and Engineering 3 (2010) 785-790 Copyright © 2010 SciRes. JBiSE 5. SUMMARY We proposed to use ant-based clustering as a method to categorize HIV-1 sequences according to subtype. We developed the ACAA algorithm to utilize the approach of anchoring test data on a topographic map in order to categorize test data. We ran our algorithm on three distinct datasets, con- taining varied HIV-1 subtypes and sequence lengths. For comparison, we also ran the STAR algorithm, a previ- ously developed algorithm for HIV-1 subtype classifica- tion, on each dataset. We have demonstrated increased accuracy of the ACAA algorithm over the STAR algorithm based on HIV-1 subtype classification. Our results imply that the ACAA is a viable alternative for the algorithmic classi- fication of binary vector-based data. 6. ACKNOWLEDGEMENTS We would like to thank Julia Handl for providing the original java code of the ATTA, and also for her quick and helpful responses to our en- quiries about the code. We would also like to thank Thomas Leitner of the Los Alamos Na- tional Laboratories HIV Database, who also responded to queries in an expedient and helpful manner. REFERENCES [1] Myers, E.R., et al. (2005) A stati stical model for HIV- 1 sequence classification using the subtype analyzer (STAR). Bioinformatics, 21(17), 3535-3540. [2] Oliveira, T., et al. (2005) An automated genotyping system for analysis of HIV-1 and other microbial se- quences. Bioinformatics, 21(19), 3797-380 0. [3] Handl, J. (2003) Ant-based methods for tasks of cluster- ing and topographic mapping: Improvements, evaluation and comparison with alternative methods. Ph.D. Thesis, Friedrich-Alexander University, Erlangen-Nürnberg. [4] Chen, L., et al. (2004) An adaptive ant colony clustering algorithm. Proceedings of the 3rd International Confer- ence on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004, 1387-1392. [5] Deneubourg, J.L., et al. (1991) The dynamics of collec- tive sorting: Robot-like ants and ant-like robots. Pro- ceedings of the 1st International Conference on Simula- tion of Adaptive Behavior: From Animals to Animats, MIT Press, Cambridge, 1, 356-365. [6] Handl, J., et al. (2004) Strategies for increased robustness of ant-based clustering. Engineering Self-Organising Sys- tems (Lecture Notes in Computer Science), 2977, 90- 104. [7] Handl, J., et al. (2004) Ant-based clustering and topog- raphic mapping. Artificial Life, 12(1), 35-61. [8] Lee, M., et al. (2007) An ant-based clustering system for knowledge discovery in DNA chip analysis data. Pro- ceedings of World Academy of Science, Engineering and Technology, 32, 261-266. |