Paper Menu >>
Journal Menu >>
J. Biomedical Science and Engineering, 2010, 3, 684-699 JBiSE doi:10.4236/jbise.2010.37093 Published Online July 2010 (http://www.SciRP.org/journal/jbise/). Published Online July 2010 in SciRes. http://www.scirp.org/journal/jbise Nucleotide host markers in the influenza A viruses Wei Hu Department of Computer Science, Houghton College, Houghton, USA. Email: wei.hu@houghton.edu Received 7 May 2010; revised 23 May 2010; accepted 25 May 2010. ABSTRACT In the efforts to understand the molecular character- istics responsible for the ability of influenza viruses to cross species, various amino acid host markers in influenza viruses were uncovered. Our previous study identified a collection of novel amino acid host markers in ten proteins of 2009 pandemic H1N1. As an extension of our prior work, the objective of the current study was to employ Random Forests, a ro- bust pattern recognition technique, to discover nu- cleotide host makers in the ten corresponding genes of 2009 pandemic H1N1, along with those in the genes of avian and swine viruses. Although different, there was an association between the amino acid markers in proteins and the nucleotide markers in the related genes due to codon translations. Moreover, nucleotide host markers have the capability to indi- cate important positions within a codon for host switches as well as the significance of synonymous mutations on host shifts, all of which amino acid markers could not provide. Our findings highlighted that two or even three nucleotide markers could co- exist within a single codon, and the different impor- tance values of these markers could further discri- minate the multiple markers within a codon. The nucleotide markers found in this study rendered a comprehensive genomic view of the complex and sys- temic nature of host adaptation. They verified and enriched the known amino acid markers and offered a larger set of finer host markers for further experi- mental confirmation. Keywords: 2009 Pandemic H1N1; Host Switch Marker; Influenza; Mutation; Random Forests 1. INTRODUCTION The swine-origin 2009 pandemic H1N1 was a clear re- minder that understanding the biological mechanisms of cross-species transmission of influenza viruses remained an urgent and crucial research topic. Extensive search of host-shift markers in the influenza viruses resulted in a rich set of avian-human or swine-human markers [1-7]. However, sequence analysis of the recently emerged 2009 pandemic H1N1 virus suggested the absence of these well-known host switch markers [8]. Although the symptoms of 2009 pandemic H1N1 were mild, the fear was that the new virus might mutate to a more virulent virus. A recent experiment [9] indicated that the intro- duction of traditional virulence markers (mutations) in PB2 of 2009 pandemic H1N1 did not confer increased virulence or transmission, implying that these markers had minimum impact on this new virus. To tackle the question of where to find the host mark- ers in 2009 pandemic H1N1, it was hypothesized in [8] that they might exist outside of the space of the previ- ously discovered markers. A new procedure using Ran- dom Forests was designed to identify a collection of novel amino acid host markers in ten proteins of 2009 pandemic H1N1, which included, in addition to the SR polymorphism found in [10], a set of markers in PB2 that might play compensatory roles in efficient replica- tion and transmission of this novel virus. The purpose of this study was to uncover nucleotide ho st markers in the ten corresponding genes of 2009 pandemic H1N1 to provide finer and complement information about the host adaptation of this new virus. Furthermore, the nu- cleotide host markers in the ten corresponding genes of avian and swine viruses were also included in this report. Using nucleotide sequences, it was found in [11,12] that mononucleotide composition, rather than the higher- order compositions, was sufficient to distinguish the human and avian viruses with high accuracy. The vi- ruses that replicated in mammals including 2009 pan- demic H1N1 were more likely to change G to A in the mRNA than vice versa. The patterns of nucleotide fre- quency according to host species demonstrated that the 2009 pandemic H1N1 virus had been evolving in swine prior to its emergence. Another separate report [13] con- firmed that the pattern of nucleotide composition of HA and NA genes of 2009 pandemic H1N1 was closest to that of swine H1N1 compared with the viruses of other W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 685 origins and this novel viru s originated from swine H1N1 based on the codon usage bias. To study the selective pressure acting on each gene segment of 2009 pandemic H1N1 [14], the ratio between the rate of non synony- mous substitution s per non synonymous site and the rate of synonymous substitutions per synonymous sites was computed, exhibiting an active pu rifying selection on all segments. Specially, purifying selection was extreme on NP, MP, PA and PB1, moderate on NS and HA. PB1-F2 protein is a virulence factor in influenza viruses. How- ever, genomic annotations of 2009 pandemic H1N1 [15] discovered a nucleotide mutation (C → A) to render a stop codon at position 12, which resulted in a truncated PB1-F2 protein for this new virus. Many host markers are amino acid markers including the ones in [8]. However, amino acids and nucleotides are related because of codon translations. Some codon substitutions are more likely than others due to the ge- netic code structure and selective pressures favor some codons for enhanced translation speed and fidelity. Therefore, it is not realistic to assume that each amino acid is equally likely to be encoded by any of its codons. In general codon-based host shift information is more accurate than the amino acid-based. Based on this ob- servation, the current study aimed to identify nucleotide host markers through a large-scale comparative analysis of ten genes of influenza viruses. These markers could demonstrate which positions within a codon were im- portant and uncover the synonymous mutations that might be crucial for host switches. To facilitate the dis- covery of these markers, this report proposed to employ Random Forests, a robust pattern recognition technique that was previously applied successfully as a cost effec- tive approach to the study of ten proteins of influenza viruses in [8 ]. 2. MATERIALS AND METHODS 2.1. Sequence Data All influenza virus nucleotide sequences corresponding to the protein sequences used in [8] were retrieved from the Influenza Virus Resource (http://www.ncbi/nlm.nih.giv / genomes/FLU/FLU.html) of the National Center for Biotechnology Information (NCBI). All the sequences used in the study were aligned with MAFFT [16 ]. 2.2. Random Forests Random Forest, proposed by Leo Breiman in 1999 [17], is an ensemble classifier based on many decision trees. Each tree is built on a bootstrap sample from the origin al training set and is unprun ed to obtain low-bias trees. The variables used for splitting the tree nodes are a random subset of the whole variable set. The classification deci- sion of a new instance is made by majority voting over all trees. About one-third of the instances are left of the bootstrap sample and not used in the construction of the tree. These instances in the training set are called “out- of-bag” instances and are used to evaluate the perform- ance of the classifier, which can achieve both low bias and low variance with bagging and randomization. 2.3. Feature Selection Using Random Forests Random Forest calculates several measures of variable importance. The mean decrease in accuracy measure was employed in [18] to rank the importance of the features in prediction. This measure is based on the decrease of classification accuracy when values of a variable in a node of a tree are permuted randomly. In this study, two packages of R, randomForest and varSelRF [18], were utilized to compute the importance of the nucleotides in a given gene sequence dataset. The effectiveness and ro- bustness of this technique as a feature selection method has been demonstrat ed in va rious studies [19-24]. Random Forests produce non-deterministic outcomes. To compensate this bias, the Random Forests algorithm was run multiple times and then the average of the results was taken. The importance of each position in the nu- cleotide sequences was based on the averaged calcula- tions by using the function randomVarImpsRF in var- SelRF repeated 5 times. 3. RESULTS 3.1. Comparison of Ten Genes of Influenza Viruses Based on their Consensus Nucleotide Sequences To explore the relationship among the genes of influenza viruses, the Hamming distance, defined as the number of positions at which the corresponding nucleotides of two sequences are different, of any tw o consensus nucleotid e sequences of avian, human, 2009 pandemic H1N1, and swine viruses was calculated. The distance information in Ta b l e 1 provided insight into the sequence similarity between the genes of different viruses. In particular, the distances between 2009 pandemic H1N1 and avian, hu- man, and swine viruses reflected the origin of 2009 pandemic H1N1 with its genes derived from avian (PB2 and PA), human H3N2 (PB1), classical swine (HA, NP, and NS), and Eurasian avian-like swine H1N1 (NA and M) lineages [25]. 3.2. Important Nucleotide Host Markers in Ten Genes of Influenza Viruses In [8], important amino acid host markers in ten proteins of influenza viruses were uncovered, based on which the novel host markers in 20 09 pande mic H1N1 wer e identi- fied. The main task here was to find the nucleotide host markers in the ten corresponding genes of 2009 pan- W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 686 Table 1. This table contains the Hamming distances of ten genes of avian, human, 2009 pandemic H1N1, and swine viruses based on their consensus nucleotide sequences. Genes HA NA NP M1 M2 NS1 NS2 PA PB1 PB2 Dist (Avian, 2009_pandemic) 389 249 242 51 15 108 40 160 2 56 231 Dist (Human, 2009_pandemic) 389 298 250 98 35 115 55 352 118 354 Dist (Swine, 2009_pandemic) 135 263 78 83 15 47 19 192 1 84 211 Dist (Avian, Human) 390 339 254 79 28 61 40 332 215 329 Dist (Avian, Swine) 337 316 212 64 12 77 28 161 158 177 Dist (Human, Swine) 3 42 244 222 64 30 89 42 269 152 286 demic H1N1, avian, and swine viruses, thus offering further information about the adaptation of these viruses to humans. In the following sections, each of the ten genes of human viruses was compared to that of 2009 pandemic H1N1, avian, and swine viruses. Random Forests were employed to locate the top 20 important codons, served as host markers, in the genes of influenza that could separate human from 2009 pandemic H1N1, avian, and swine viruses. In different genes there were several codons that contained two or even three impor- tant nucleotide markers selected by Random Forests, a remarkable feature that amino acid markers lack. The top important codons in each gene for differenti- ating human from 2009 pandemic H1N1, avian, and swine viruses were displayed in single figure (Figures 1-10). The comparison of amino acid markers in [8] and nucleotide markers found in this study revealed several shared sites in each protein/gene, illustrating their sig- nificance as host markers. The consensus nucleotides (codons) comprising these sites in each gene were pre- sented in Tables 2-11, which could also serve as a con- firmation and refinement of the results in [8]. Due to high genetic variation of the HA and NA genes, only the HA nucleotide sequences of H1 subtype and the NA nucleotide sequences of N1 subtype of 2009 pan- demic H1N1, avian, human, and swine viruses were utilized in the current analysis. Therefore, the important codons in HA and NA found in this study were sub- type-specific. Because all the PB1-F2 proteins of 2009 pandemic H1N1 were truncated and nonfunctional, the genes encoding these proteins were excluded in this study. 3.2.1. HA Gene One of the advantages of the nucleotide markers over amino acid markers is their ability to represent the syn- onymous mutations that might be significant for host shifts. In comparison of human with avian, 2009 pan- demic H1N1, and swine viruses, there were several sy- nonymous mutation positions in HA with high impor- tance. They were 197(3) (cac(H), cat(H)) and 230(3) (gag(E), gaa(E)) in avian and 197(3) (cac(H), cat(H)) and 254(3) (gga(G), ggg(G)) in 2009 pandemic H1N1. Codon 197(3) had a very high importance in both avian and 2009 pandemic H1N1, although it contained a syn- onymous mutation in both cases. The codons in 2009 pandemic H1N1 (Figure 1) including 184, 258, and 314 had significant effects on the receptor binding specificity of HA of 2009 pa ndemic H1N1 [26]. The HA activ e site located in a cleft is composed of the codons 91, 150, 152, 180, 187, 191, and 192, and the active site cleft o f HA is formed by its right edge (131_GVTAA) and left edge (221_RGQAGR) [2 7]. Three codons 127( 2), 128(1) , and 129(2) in Table 2 were near the right edge and codon 225(3) in avian (Figure 1) was on the left edge of the active site. The importance values of top codons in avian were more homogenous than those in the 2009 pandemic H1N1 and swine. As in case of the amino acid markers [8], the HA1 domain of HA contained more significant codons than the HA2 domain (Figure 1). 3.2.2. N A Gene In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in NA with high importance. They were 263(3) (gtg(V), gtt(V)) and 410(3) (cca(P), cct(P)) in avian, 156(1) (aga(R), agg(R)), 339(3) (act(T), tcg(S)), and 440(3) (agt(S), agc(S)) in 2009 pandemic H1N1, and 89(3) (tcc(S), tca(S)) and 267(3) (gtt(V), ata(I)) in swine. Furthermore, sequence alignment re- vealed a deletion at codon 435 in the NA nucleotide se- quences of 2009 pandemic H1N1, avian, and swine vi- ruses, causing a very high importance at that codon in avian and swine (Figure 2). The NA active site is a shallow pocket constructed from conserved residues, some of which contact the sub- strate directly and participate in catalysis, while others W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 687 Tabl e 2. This table contains the consensus nucleotides (codons) at positions in HA that have high importance in separating 2009 pandemic H1N1, avian H1, and swine H1 from human H1 viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 45(3) (a) 71(2)(p) 72(1)(s) 74(1)(s) 94(1)(s) 127(2)(s)128(1)(p)129(2)(s)139(1)(a)141(2)(s) 152(2)(s) 157(2)(s)168(1)(p) Avian aat(N) ctc(L) act(T) aac(N) gaa(E)gag(E)aca(T)act(T)tct(S) gcc(A) aca(T) tca(S)aat(N) Human aaa(K) att(I) tcc(S) gaa(E) tat(Y)acc(T)gta(V)acc(T)aat(N)aaa(K) acg(T) ttg(L)aac(N) 2009 H1N1 aga(R) tcc(S) aca(T) agc(S) gat(D)gac(D)tcg(S)aac(N)gct(A)gca(A) gtt(V) tca(S)gat(D) Swine agg(R) ttc(F) aca(T) agc(S) gat(D)gaa(E)aca(T)aac(N)gct(A)gca(A) gta(V) tca(S)aat(N) Position 205(3)(s) 216(2)(p) 235(3)(a) 236(2)(a) 259(2)(a)275(3)(p)298(1)(p)302(1)(p)314(1)(p)365(2)(p) 374(2)(p) 404(3)(a)472(1)(s) Avian aag(K) gct(A) gac(D) caa(Q) aag(K)tgc(C)atc(I) gaa(E)atg(M)cag(Q) gga(G) att(I) gat(D) Human cat(H) aaa(K) gaa(E) ccc(P) aga(R) tgt(C)gtc(V)gag(E)atg(M)caa(Q) ggg(G) atg(M)aac(N) 2009 H1N1 aga(R) ata(I) gag(E) ccg(P) aga(R)tgc(C)atc(I) aaa(K)ctg(L)ctg(L) gag(E) ata(I) gat(D) Swine aaa(K) gca(A) gag(E) cct(P) aga(R)tgt(C)gtc(V)gaa(E) atg(M) caa(Q) ggg(G) ata(I) gat(D) Tabl e 3. This table contains the consensus nucleotides (codons) at positions in NA that have high importance in separating 2009 pandemic H1N1, avian N1, and swine N1 from human N1 viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 126(2)(p) 157(1)(s) 163(3)(s) 166(2)(p) 189(2)(p)214(3)(a,s)221(3)(a)222(3)(a)257(2)(p)269(1)(p) 285(1)(p) 329(3)(a,s)331(2)(p) Avian cac(H) aca(T) gtg(V) gct(A) agt(S) gac(D)aac(N)aac(N)aaa(K)ttg(L) gcc(A) aat(N) gga(G) Human cac(H) gcc(A) cta(L) gct(A) ggc(G)gaa(E)aag(K)caa(Q)aag(K)ttg(L) act(T) aaa(K)gga(G) 2009 H1N1 ccc(P) acc(T) att(I) gtt(V) aat(N) gac(D)aac(N)aat(N)aga(R)atg(M) tct(S) aat(N)aag(K) Swine cac(H) acc(T) att(I) gct(A) gga(G)gat(D)aac(N)aaa(K)aaa(K)ctg(L) aca(T) aat(N)ggg(G) Position 336(1)(s) 340(1)(a,s) 344(1)(a) 351(2)(a) 365(2)(p,s)369(2)(a)395(2)(p)397(2)(p)398(3)(p)435(1)(a,s) 435(2)(a,s) 435(3)(a,s) Avian ggt(G) cct(P) tat(Y) ttt(F) act(T) agc(S)gca(A)act(T)gat(D)--- --- --- Human aat(N) gtt(V) aac(N) tac(Y) aac(N)aag(K)gca(A)act(T) gat(D)aca(T) aca(T) aca(T) 2009 H1N1 ggt(G) tct(S) aat(N) ttc(F) att(I) aac(N)gga(G)aat(N)gag(E)--- --- --- Swine ggc(G) tct(S) aat(N) ttt(F) atc(I) agt(S) gca(A)act(T) gat(D)--- --- --- provide a structural framework [28]. According to the numbering in [29], these residues of N1 are 118, 119, 151, 152, 156, 179, 180, 223, 225, 228, 247, 277, 278, 293, 295, 368, and 402. The important codon s in Figure 2 including 157(1), 221( 3), 222(3 ), and 369( 2) were n ear these residue positions, and codon 156(1) carrying a sy- nonymous mutation in 2009 pandemic H1N1 is at one of these positions. 3.2.3. M1 Gen e Residue positions 115, 121, and 137 were avian-human host shift markers in [5]. Codons 103(3), 115(1), 121(1), 137(1), 218(1), 218(3), and 239(1) were identified as avian-human markers in this study and in [8], with codon 218 being select ed twice, 218(1) and 218(3 ). Remarkably, codons 149 and 180 carry ing a synonymous mutation had a very higher i m portance t han resi due posit ions 11 5, 1 21, and 137. Residue position 137 was a swine-human marker in [2]. There were codons 115(1), 115(3), 137(1), 218(1), and 218(3) selected as swine-human markers in this study and in [8], and two codons 115 and 218 were chosen W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 688 Figure 1. Top important HA codon positions in distinguishing avian H1, human H1, 2009 pandemic H1N1, and swine H1 viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an asterisk are the important residue positions identified in [8]. twice, i.e., 115(1), 115(3), and 218(1), 218(3). Even though the previously discovered residue position 137 received the highest importance, the two newly found codons 17 3 and 180 ha d a very hi gh importance as well. It was noteworthy that codon 180 was significant in both avian an d sw in e and wa s loca ted in the C-terminal part of M1 (codons 165-252) that bind to vRNP (viral ribonu- Figure 2. Top important NA codon positions in distinguishing avian N1, human N1, 2009 pandemic H1N1, and swine N1 viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an asterisk are the important residue positions identified in [8]. cleoproteins) [30] (Figure 3). This study and [8] found that codons 30(1), 30(2), 115(3), 142(3), 166(3), 209(1), 214(3), and 218(3 ) were impor tant host markers in 2009 pandemic H1N1 with codon 30 being chosen twice. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in M1 with high importance. They were 149(3) (gcc(A), gca(A)) and 180(3) (gtg(V), W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 689 Figure 3. Top important M1 codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. gtt(V)) in avian, 117(3) (cta(L), ctc(L)), 186(3) (gct(A), gca(A)), and 242(3) (aaa(K), aag(K)) in 2009 pandemic H1N1, and 90(3) (ccg(P), cca(P)), 173(3) (atc(I), ata(I)) and 180(3) (gta(V), gtt(V)) in swine (Figure 3). 3.2.4. M2 Gen e This gene has three domains, one N-terminal extracellu- lar domain (24 codon s) recognized by host immune sys- tem, one transmembrane domain (19 codons) responsi- ble for ion channe l activity, and one cytoplasmic tail (54 Figure 4. Top important M2 codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. codons) interacting with M1 and required for genome packing and formation of virus particles [3 1]. Residue positions 11, 14, 20, 28, 54, 55, 57, 78, and 86 were avian-human host shift sites found in [5]. Co- dons 11(2), 14(2), 18(2), 20(2), 43(1), 54(2), 55(3), 57(1), 78(1), 86(2), and 93(2) were important avian- human markers in this study and in [8], plus codons 18(2), 43(1), and 93(2) were new markers, with codon 93(2) carrying a high importance. W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 690 Tabl e 4. This table contains the consensus nucleotides (codons) at positions in M1 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 30(1)(p) 30(2)(p) 115(1)(a,s) 115(3)(a) 115(3)(p,s) 121(1)(a) 137(1)(a,s) Avian gat(D) gat(D) gtt(V) gtt(V) gtt(V) act(T) acg(T) Human gat(D) gat(D) ata(I) ata(I) ata(I) gct(A) gct(A) 2009 H1N1 agt(S) agt(S) gtg(V) gtg(V) gtg(V) act(T) aca(T) Swine gat(D) gat(D) gta(V) gta(V) gta(V) gct(A) act(T) Position 142(3)(p) 166(3)(p) 209(1)(p) 214(3)(p) 218(1)(a,s) 218(3)(a,p,s) 239(1)(a) Avian gtg(V) gtg(V) gct(A) cag(Q) aca(T) aca(T) gcc(A) Human gtg(V) gtg(V) gcc(A) cag(Q) gcc(A) gcc(A) acc(T) 2009 H1N1 gct(A) gct(A) act(T) cat(H) act(T) act(T) gcc(A) Swine gtg(V) gtg(V) gct(A) cag(Q) aca(T) aca(T) gcc(A) Tabl e 5. This table contains the consensus nucleotides (codons) at positions in M2 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 11(2)(a,p) 13(2)(p) 14(2)(a) 16(2)(p) 18(2)(a)20(2)(a,p,s)28(1)(p)28(2)(s)31(2)(p) 31(3)(s) 43(1)(a,p)43(2)(p) Avian acc(T) aac(N) gga(G) gag(E) aga(R)agc(S) att(I) att(I)agt(S) agt(S) ctt(L) ctt(L) Human atc(I) aac(N) gaa(E) ggg(G) aga(R)aac(N) gtt(V)gtt(V)agt(S) agt(S) ctt(L) ctt(L) 2009 H1N1 acc(T) agc(S) gaa(E) gag(E) aga(R)agc(S) att(I) att(I)aat(N) aat(N) act(T) act(T) Swine acc(T) aac(N) gga(G) gag(E) aga(R)aac(N) gtt(V)gtt(V)agc(S) agc(S) ctt(L) ctt(L) Position 54(2)(a,s) 55(1)(s) 55(3)(a) 57(1)(a,s) 77(2)(p)78(1)(a,p,s)79(1)(s)82(2)(s)86(2)(a,p,s) 89(1)(s) 93(2)(a,s)95(2)(s) Avian cgc(R) ctt(L) ctt(L) tac(Y) cgg(R)cag(Q) gaa(E)agt(S)gtt(V) ggt(G) aac(N) gag(E) Human ctc(L) ttc(F) ttc(F) cac(H) cga(R)aag(K) gaa(E)aat(N)gct(A) agt(S) agc(S) gag(E) 2009 H1N1 cgc(R) ttt(F) ttt(F) tac(Y) caa(Q)cag(Q) gaa(E)agt(S)gtt(V) ggt(G) aac(N) gag(E) Swine cgc(R) ttt(F) ttt(F) tac(Y) cga(R)cag(Q) aaa(K)agt(S)gtt(V) ggt(G) aac(N) gag(E) Residue positions 57, 86, and 93 were swine-human shift markers in [32]. Codons 20(2), 28(2), 31(3), 54(2), 55(1), 57(1 ), 78 (1), 79 (1), 82(2) , 86(2) , 89(1 ), 93(2 ), and 95(2) were primary swine-human markers in this study and in [8], a nd in pa rticular codon 78( 1) was new and had a much higher importance than the residue positions 57, 86, and 93 discovered in [32]. Similarly, codons 11(2), 13(2), 16(2), 20(2), 28(1), 31(2), 43(1), 43(2), 77(2), 78(1), and 86(2) were major host markers in 2009 pan- demic H1N1 in this study and in [8]. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in M2 with high importance. They were 53(3) (cg t(R), cga(R) ) in avian, 27(3) (gtc(V) , gtt(V)) and 50(3) (tgt(C), tgc(C)) in 2009 pandemic H1N1, and 26(3) (ctc(L), ctt(L)) and 68(3) (gtg(V), gta(V)) in swine (Figure 4). Figur e 4 indicated that codons 20(2), 78(1), 86(2) were significant in all three categories: 2009 pandemic H1N1, avian, an d swi ne, als o co don 2 0(2) was in the N- terminal W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 691 Figure 5. Top important NS1 codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. extracellular domain and codons 78(1) and 86(2) in the cytoplasmic tail. 3.2.5. NS1 Gene NS1 is a multifunc tional g ene [33]. Its N-termin al reg ion has an RNA-binding domain (codons 1-73) and its C-terminal region (codons 74-237) contains the effector domain that inhibits the maturation and exportation of the host cellular antiviral mRNAs [34]. Residue positions 22, 60, 81, 84, 215, and 227 were avian-human host shift markers in [4]. Codons 22(1), Figure 6. Top important NS2 codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. 70(1), 79(2), 79(3), 80(3), 81(3), 84(2), 84(3), 98(1), 114(1), 171(2), and 215(1) were sign ificant av ian-human markers in this study and in [8] ( Figure 5). Furthermore, our analysis selected two position s 2 and 3 within codon 84 with a much higher importance than the previous residue positions 22, 60, 81, 215, and 227 discovered in [4]. Another codon had two positions selected at 79(2) W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 692 Ta b le 6 . This table contains the consensus nucleotides (codons) at positions in NS1 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 6(1)(p) 22(1)(a,s) 25(1)(s) 25(3)(s) 41(2)(s)67(1)(s)70(1)(a)79(2)(a)79(3)(a)80(3)(a) 81(3)(a,s)82(3)(a) Avian gtg(V) ttt(F) caa(Q) caa(Q) aag(K)cgg(R)gag(E)atg(M)atg(M)act(T) att(I) gct(A) Human gtg(V) gtt(V) caa(Q) caa(Q) aag(K)agg(R)aag(K)atg(M)atg(M)acc(T) atg(M) gcc(A) 2009 H1N1 atg(M) ttt(F) aat(N) aat(N) aag(K)tgg(W)aaa(K)atg(M)atg(M)aca(T) att(I) gca(A) Swine gtg(V) ttt(F) aat(N) aat(N) aag(K)tgg(W)aaa(K)atg(M)atg(M)acc(T) att(I) gca(A) Position 84(2)(a,s) 84(3)(a) 91(1)(p) 98(1)(a) 114(1)(a)119(1)(p)129(1)(p)171(1)(p)171(2)(a)205(2)(p) 206(1)(p,s)209(1)(s) Avian gtg(V) gtg(V) act(T) atg(M) tcc(S) atg(M)ata(I) gat(D) gat(D) agc(S) agt(S) gat(D) Human aca(T) aca(T) act(T) ttg(L) cct(P) atg(M)atg(M)att(I) att(I) agc(S) agt(S) aat(N) 2009 H1N1 gta(V) gta(V) tct(S) atg(M) cct(P) ttg(L) gta(V) tat(Y) tat(Y) aac(N) tgt(C) aat(N) Swine gta(V) gta(V) gct(A) atg(M) tct(S) atg(M)ata(I) gat(D) gat(D)agc(S) cgt(R) gat(D) Ta b le 7 . This table contains the consensus nucleotides (codons) at positions in NS2 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 6(1)(p) 14(1)(a,s)26(1)(s) 32(1)(s) 34(2)(p)40(3)(p)48(1)(p)48(3)(p)49(1)(s) 57(2)(a,p,s) 57(3)(a,s)60(1)(a) Avian gtg(V) atg(M) gag(E) ata(I) cag(Q)ctc(L)gca(A)gca(A) gtg(V) tcc(S) tcc(S) agc(S) Human gtg(V) ttg(L) gag(E) ata(I) cag(Q)atc(I)gca(A)gca(A) gta(V) tta(L) tta(L) aac(N) 2009 H1N1 atg(M) atg(M) gag(E) gta(V) cgg(R)ata(I)act(T)act(T) gtg(V) tac(Y) tac(Y) agc(S) Swine gtg(V) atg(M) gag(E) gta(V) cag(Q)atc(I)gcc(A)gcc(A) gta(V) tac(Y) tac(Y) aac(N) Position 60(2)(a,p) 63(3)(a) 70(1)(a) 70(3)(s) 83(1)(p)85(3)(a)89(2)(a)89(3)(a,s)107(1)(a,s)115(1)(p) 115(3)(a) Avian agc(S) gga(G) agt(S) agt(S) gtg(V)cat(H)att(I) att(I) ctt(L) act(T) act(T) Human aac(N) gga(G) ggt(G) ggt(G) gtg(V)cac(H)aca(T)aca(T) ttt(F) act(T) act(T) 2009 H1N1 agc(S) gaa(E) gga(G) gga(G) atg(M)cac(H) gcg(A)gcg(A)ctt(L) gct(A) gct(A) Swine aac(N) gaa(E) ggt(G) ggt(G) gtg(V)cac(H)atg(M)atg(M) ctt(L) act(T) act(T) and 79(3) as well. Both of these double selected codons were located in the C-terminal region. The results of this study and [8] suggested that codons 22(1), 25(1), 25(3), 41(2), 67(1), 81(3), 84(2), 206(1), 209(1), and 215(1) were key swine-human markers and codons 6(1), 91(1), 100(1), 119(1), 128(3), 129(1), 171(1), 205(2), and 206(1) were essential host markers in 2009 pandemic (Figure 5). In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in NS1 with high importance. They were 157(3) (gtg(V)), (gtt(V)) in avian, 5(3) (acc(T)), (act(T)), 68(3) (atc(I)), (att(I)), 90(3) (ctt(L)), cta(L)), and 128(3) (ata(I)), (atc(I)) in 2009 pandemic H1N1, and 146(1) (cta(L)), (tta(L)) in swine (Figure 5). 3.2.6. NS2 Gene Residue positions 60, 70, and 107 were avian-human host shift markers in [4]. Codons 14(1), 57(2), 57(3), 60(1), 60(2), 63(3), 70(1), 85(3), 89(2), 89(3), 107(1), and 115(3) were avian-human markers in this study and W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 693 in [8] with codons 57, 60, and 89 being selected twice. Codons 57(3) and 89(3) were not only new markers but also had a very high importance, comparable to that of positions 70 and 107 in [4](Figure 6). Codons 14(1), 26(1), 32(1), 49(1 ), 57(2 ), 57( 3), 70 (3) , 89(3), and 107(1) were important swine-human markers in this study and in [8] with codon 57 being selected twice. The same analysis also identified codons 16(1), 34(2), 40(3), 48(1), 48(3), 57(2), 60(2), 83(1), and 115(1) as important host markers in 2009 pandemic H1N1, with codon 48 being selected twice. Notably, there was one codon position 57(2) that was important in all three categories: avian, 2009 pandemic H1N1, and swine, and it was in the M1 binding domain (codons 59-116) [5]. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in NS2 with high importance. They were 69(3) (ttg(L), cta(L)) and 84(1) (cga(R), aga(R)) in avian, 5(3) (acc(T), act(T)) and 13(3) (ctt(L), cta(L)) in 2009 pandemic H1N1, and 48(3) (gcc(A), gca(A)) and 84(1) (cgg(R), aga(R)) in swine (Figure 6). Further more, codons 8 4(1) and 13(3) r eceived a very high importance in both avian and swine and in 2009 pan- demic H1N1, respectively. Codon 84(1) was in the M1 binding domain. 3.2.7. NP Gene Residue positions 16, 33, 61, 100, 136, 214, 283, 305, 313, 357, 375, and 423 were avian-human host shift markers in [4]. Codons 16(2), 61(1), 100(1), 100(2), 283(2), 305(1), 305(3), 313(2), 357(1), and 455(3) were significant for discriminating avian and human viruses in this study and in [8] with codons 100 and 305 being chosen twice (Figure 7). In this study and in [8], codons 16(2), 61(1), 214(2), 283(2), 289(1), 313(2), 372(3), 442(1), and 455(3) were main swine-human marker s, and similarly codons 53(3), 289(1), 313(1), 316(3), 430(2), and 444(1) were vital host markers in 2009 pandemic H1N1. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in NP with high importance. They were 108(3) (ctg(L), ctc(L)) and 155(3) (gtg(V), gtt(V)) in avian, 177(3) (ggt(G), gga(G)), 182(3) (gcg(A), gca(A)), and 363(3) (gtc(V), gta(V)) in 2009 pandemic H1N1, and 3(3) (tct(S), tcc(S)), 94(3) (gga(G), ggg(G)), and 376(3) (tcc(S), tca(S)) in swine (Figure 7). NP has three regions (codons 1-160, 256-340 and 340–498) that bind to PB1 and PB2 [35], and codons 108(3) and 155(3) in avian and codons 3(3), 94(3), and 376(3) in swine were in two of these three regions. One region, codons 360-374, in NP of 2009 pandemic H1N1 was deemed extremely important for host range restric- tion and is a common feature of pandemic viruses [36], and codon 363(3) carrying a synonymous mutation in 2009 pandemic H1N1 was in this region . 3.2.8. PA Gen e Residue positions 28, 55 , 57, 65, 66, 100, 225, 268, 321, 337, 356, 382, 400, 404, 409, 421, and 552 were avian-human host shift markers in [4], whereas this study and [8] indicated that codons 55(1), 56(1), 225(1), 337(1), 337(3), 421(2), and 552(2) were important avian-human markers, with codon 337 being chosen twice (Figure 8). Residue positions 268 and 552 were swine-human markers uncovered in [32]. Codons 225(1), 268(1), 272(1), 337(1), 421(2), and 552(2) were key swine- human markers in our analysis and in [8] with codon 337(1) having a similar importance as position 268. Also codons 85(2), 186(1), 275(2), 277(1), and 388(3) were crucial for classifying 2009 pandemic H1N1 and human viruses in this study and in [8]. The N-terminal domain of PA (codons 1-256) harbors several functional domains, including an endonuclease active site with a putative active site motif, two putative nuclear transport motifs (codons 124-139 (NLS1) and codons 186-247 (NLS2)), and a proteolytic domain that can induce generalized proteolysis of both viral and host proteins. The C-ter minal domain of PA (codons 257-716) binds to PB1 for complex formation and nuclear trans- port [37]. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in PA with high importance. They were 106(3) (ctc(L), cta(L)), 138(3) (ata(I), att(I)), and 270(3) (tta(L), ctt(L)) in avian, 44(3) (gtt(V) , gta(V)), 173(3) (act(T), acc(T)), and 526(3) (tca(S), tct(S)) in 2009 pandem ic H1N 1, and 106(3) (ct t(L), ct a(L)), 290(3) (tta(L), ttg(L)), and 345(3) (cta(L), ttg(L)) in swine (Figur e 8). Codon 138(3) was in the first putative nuclear localization signal (NLS1) region, and codons 270(3) and 526(3) were in the C-terminal domain of PA. 3.2.9. PB 1 G ene Residue position 336 was the only avian-human host shift markers in [4]. Codons 327(2), 336(1), 361(3), 401(3), and 576 (3) were importan t host markers in av ian in this study and in [8] with codons 361(1), 401(3), and 576(3) having a much higher importance than position 336 (Figure 9). The analysis of this study and [8] suggested that co- dons 327(2), 339(3), and 638(3) were important swine- human markers, and codons 12(1), 175(1), 212(1), 339(3), 364(1), 435(2), 618(3), 638(3), 728(1), and 728(3) were vital for classifying 2009 pandemic H1N1 and human viruses with codon 728 being sel ected twice. W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 694 Table 8. This table contains the consensus nucleotides (codons) at positions in NP that have high importance in separating 2009 pan- demic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position num- ber indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 16(2)(a,s) 53(3)(p) 61(1)(a,s) 100(1)(a)100(2)(a)214(2)(s)283(2)(a,s)289(1)(p,s) 305(1)(a) 305(3)(a) Avian ggt(G) gaa(E) ata(I) aga(R) aga(R) aga(R) ctt(L) tat(Y) cgt(R) cgt(R) Human gat(D) gaa(E) ttg(L) gta(V) gta(V) aaa(K) cct(P) tac(Y) aaa(K) aaa(K) 2009 H1N1 ggt(G) gat(D) ata(I) ata(I) ata(I) agg(R) ctt(L) cat(H) aaa(K) aaa(K) Swine ggt(G) gag(E) ata(I) gta(V) gta(V) agg(R) ctt(L) cat(H) aaa(K) aaa(K) Position 313(1)(p) 313(2)(a,s) 316(3)(p) 355(3)(a)357(1)(a)372(3)(s)430(2)(p) 442(1)(s) 444(1)(p) 455(3)(s) Avian ttc(F) ttc(F) att(I) aga(R) caa(Q) gaa(E) aca(T) act(T) atc(I) gat(D) Human tac(Y) tac(Y) atc(I) cgg(R) aaa(K) gat(D) act(T) gca(A) atc(I) gaa(E) 2009 H1N1 gtc(V) gtc(V) atg(M) aga(R) aag(K) gaa(E) agc(S) aca(T) gtt(V) gat(D) Swine ttc(F) ttc(F) atc(I) aga(R) aag(K) gaa(E) act(T) act(T) att(I) gat(D) Table 9. This table contains the consensus nucleotides (codons) at positions in PA that have high importance in separating 2009 pan- demic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position num- ber indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 55(1)(a) 65(1)(a) 85(2)(p) 186(1)(p) 225(1)(a,s) 268(1)(s) 272(1)(s) Avian gat(D) tct(S) aca(T) ggt(G) agc(S) ctc(L) gat(D) Human aat(N) ctt(L) aca(T) ggc(G) tgc(C) atc(I) aat(N) 2009 H1N1 gac(D) tct(S) atc(I) agt(S) agc(S) ctc(L) gat(D) Swine gat(D) tct(S) aca(T) ggt(G) agc(S) ctc(L) gat(D) Position 275(2)(p) 277(1)(p) 337(1)(a,s) 337(3)(a) 388(3)(p) 421(2)(a,s) 552(2)(a,s) Avian cct(P) tct(S) gct(A) gct(A) agc(S) agt(S) act(T) Human cct(P) tat(Y) tca(S) tca(S) agc(S) atc(I) agt(S) 2009 H1N1 ctt(L) cat(H) gct(A) gct(A) gga(G) agc(S) act(T) Swine ccc(P) tct(S) gct(A) gct(A) agt(S) agc(S) act(T) PB1 contains several functional domains, including cRNA binding domain (codons 1-139 and 267-493), vRNA binding domain (codons 1-83 and 233-249 and 494-758), NLS region (codons 180-195 and 202-252), PA binding domain (codons 1-25), and PB2 binding domain (codons 600-757 ) [5]. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in PB1 with high importance. They were 148(3) (gag(E), gaa(E)), 322(3) (ata(I) , att(I)), and 454(3) (ccg(P), cca(P)) in avian, 62(3) (ggt(G) , ggg(G)), 167(1) (tta(L), ctc(L)), 543(3) (acg(T), aca(T)), and 601(3) (ata(I), atc(I)) in 2009 pandemic H1N1, 60(3) (gag(E), gaa(E)) and 726(3) (gca(A), gcc(A)) in swine (Figur e 9). It was striking that codons 322(3), 167(1), and 60(3), each carrying a synonymous mutation, had the highest importance in 2009 pandemic H1N1, avian, and swine, respectively. Many of these significant codons carrying synonymous mutations were located in the functional domains of PB1. W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 695 Figure 7. Top important NP codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. 3.2.10. PB2 Gene Amino acid positions 9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674, and 702 were avian-human host shift markers in [4]. Codons 81(2), 105(2), 199(1), 271(1), 475(1), 588(2), 627(1), 674(1), and 674(3) were significant avian-human markers in this study and in [8]. In particular, our analysis revealed Figure 8. Top important PA codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. codon 199(1) as equally essential as codon 627(1), a well-known host marker ( Figure 10). Residue position 44 was a swine-human marker in [32]. Our analysis and [8] implied codons 81(2), 199(1), 567(1), 613(1), and 674(1) were as equally significant as position 44. In fact, codon 674(1) received the highest importance in swine. Moreover, codons 54(2), 225(1), 315(3), 559(2), 591(2), and 645(1) were key host mark- ers in 2009 pandemic H1N1. W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 696 Figure 9. Top important PB1 codon positions in distinguishing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an aster- isk are the important residue positions identified in [8]. In comparison of human with avian, 2009 pandemic H1N1, and swine viruses, there were several synony- mous mutation positions in PB2 with high importance. They were 373(3) (att(I), ata(I)), 598(3) (aca(T), act(T)), and 604(3) (cgt(R), aga(R)) in avian, 142(1) (agg(R), cgc(R)), 142(3) ( agg(R), c gc(R)), 221 (3) (gc c(A), gct( A)), 553(3) (ata(I) ,atc(I)), and 664(1) (c ga(R), aga(R)) in 2009 pandemic H1N1, and 169(3) (cca(P), ccc(P)) and 527(1) (ttg(L), ctg(L)) in swine (Figure 10). There was one codon 142 carrying two synonymou s mutation positions, both were selected as important host markers in 2009 pandemic H1N1. Codons 221(3), 553(3), and 664(1) were the top three important ones in 2009 pandemic H1N1. The PB2-NP binding domain contains codons 1-269 and 580-683, and the PB2-PB1 binding domain contains codons 51-2 59 and 580-759 [5]. Codons 142(1) , 142(3), 169(3), 2 21(3), 598(3), 604(3), and 664( 1) were in the PB2-PB1 and PB2-NP binding domains. In addition to codon 142 in 2009 pandemic H1N1, there was one codon 674 in avian (Figure 10) that in- cluded two significant positions. The lik elihood to affect the host shifts by any potential mutations at these multi- ple selected codons might be higher than any other codons. 4. DISCUSSION In the current study, a comprehensive analysis of the nucleotide sequences of ten genes of influenza viruses was performed to discover a catalogue of host markers, illustrating the complex and systematic nature of host adaptation. One of the benefits of using nucleotide se- quences was their capability to detect synonymous mu- tations that were essential for host switches. These syn- onymous mutations could not be found at the amino acid level. Moreover, the nucleotide markers could pinpoint exactly where the important positions were within a codon. Our investigation also revealed several codons in ten genes of 2009 pandemic H1N1, avian, and swine viruses that contained two or even three important positions selected by Random Forests for host shifts, thus provid- ing extra and finer information abo ut the ho st ad aptatio n . These codons might have a higher probability to affect the host switch than those codons containing only a sin- gle important position. Amino acid mutation E627K in PB2 is a well-known determinant for adaptation from avian to human hosts. The nucleotide marker information uncovered in this study suggested that generally it was codon gaa to en- code E in avian viruses and codon aag to encode K in human viruses. Furthermore, it was the first position within codon 627 that was essential for the discrimina- tion of avian and human viruses, although there was another possibility in the th ird pos ition within codo n 627. The SR polymorphism found in [10] contained two posi- tions 590 and 591. Our analysis demonstrated that it was the second position within codon 591 that was really vital for the separation of 2009 pandemic H1N1 and human viruses (Table 11). W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 697 Table 10. This table contains the consensus nucleotides (codons) at positions in PB1 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 12(1)(p) 175(1)(p) 212(1)(a,p) 327(2)(a,s) 336(1)(a) 339(3)(p,s) 361(3)(a) 364(1)(p) Avian gtt(V) gat(D) ctg(L) aga(R) gtc(V) att(I) agc(S) ctt(L) Human gtt(V) gat(D) gtg(V) aaa(K) atc(I) atc(I) aga(R) ctc(L) 2009 H1N1 att(I) aac(N) ctg(L) aga(R) atc(I) atg(M) aga(R) att(I) Swine gtg(V) gat(D) ttg(L) aga(R) gtt(V) att(I) agc(S) ctc(L) Position 401(3)(a) 435(2)(p) 576(3)(a) 618(3)(p) 638(3)(p,s) 728(1)(p) 728(3)(p) Avian gcc(A) aca(T) ctg(L) gaa(E) gag(E) att(I) att(I) Human gca(A) aca(T) cta(L) gag(E) gag(E) att(I) att(I) 2009 H1N1 gca(A) ata(I) tta(L) gat(D) gat(D) gtc(V) gtc(V) Swine gca(A) aca(T) cta(L) gaa(E) gaa(E) att(I) att(I) Table 11. This table contains the consensus nucleotides (codons) at positions in M1 that have high importance in separating 2009 pandemic H1N1, avian, and swine from human viruses. The single digit in parenthesis is the position within the codon that was se- lected by Random Forests. The single letter ‘a’ (for avian) or ‘p’ (for pandemic 2009) or ‘s’ (for swine) in parenthesis after a position number indicates whether the position is important for avian or 2009 pandemic H1N1 or swine viruses. Position 44(1)(s) 54(2)(p) 81(2)(a,s) 105(2)(a) 199(1)(a,s) 225(1)(p) 271(1)(a) 315(3)(p) 475(1)(a) Avian gca(A) aaa(K) aca(T) aca(T) gct(A) agc(S) aca(T) atg(M) ctg(L) Human tca(S) aaa(K) atg(M) gtg(V) tct(S) agc(S) gca(A) atg(M) atg(M) 2009 H1N1 gca(A) aga(R) aca(T) aca(T) gct(A) ggc(G) gca(A) ata(I) ctg(L) Swine gca(A) aaa(K) aca(T) aca(T) gct(A) agc(S) aca(T) atg(M) ctg(L) Position 559(2)(p) 567(1)(s) 588(2)(a) 591(2)(p) 613(1)(s) 627(1)(a) 645(1)(p) 674(1)(a,s) 674(3)(a) Avian act(T) gac(D) gcc(A) caa(Q) gtt(V) gaa(E) atg(M) gca(A) gca(A) Human gct(A) aat(N) att(I) caa(Q) acc(T) aag(K) atg(M) act(T) act(T) 2009 H1N1 att(I) gat(D) acc(T) cgg(R) gtc(V) gaa(E) ttg(L) gca(A) gca(A) Swine act(T) gac(D) gcc(A) caa(Q) gtc(V) gaa(E) atg(M) gca(A) gca(A) In [8], a set of residue positions in the PB2 protein in- cluding the SR polymorphism found in [10 ] were identi- fied as host markers in 2009 pandemic H1N1. Codons 54(2), 225(1), 315( 3), 559(2), 591(2) , and 645(1) in PB2 of 2009 pandemic H1N1 were concluded as essential host markers both in this study and in [8]. Furthermore, the current study fo und thr ee new codo ns 221(3), 553(3), and 664(1) that were the top three significant codons in 2009 pa ndemic H1N1. They coul d augm ent the re pertoire of existing host markers in PB2 that might play com- pensatory roles, as the SR polymorphism, in the viral replication and transmission of 2009 pandemic H1N1. Additionally, the new nucleotide markers carrying syn- onymous m utations fo und in NP, PA, and PB 1, along with those in PB2 would provide further information for the life cycle of 2009 pandemic H1N1. Also, co dons 728(1), 728(3) in PB1, carrying a non-synonymous mutation, and codons 142(1), 142(3) in PB2, carrying a synonymous mutation, were of special in terest in this regard as multi- ple selected markers within the same codon. W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 698 Figure 10. Top important PB2 codon positions in distinguish- ing avian, human, 2009 pandemic H1N1, and swine viruses. The single digit in parenthesis is the position within the codon that was selected by Random Forests. The positions with an asterisk are the important residue positions identified in [8]. 5. CONCLUSION As an exten sion of our earlier work in [8], Random For- ests were employed to discover a collection of nucleo- tide positions, served as host markers, in ten genes of influenza that could separate 2009 pandemic H1N1, avian, and swine viruses from human viruses with high confidence. Our results indicated that two or even three important positions could coexist within a single codon, i.e., multiple nucleotide markers might be present within one codon, and the different importance values of these positions could further differentiate these multiple mar- kers within a codon. The nucleotide markers uncovered in the current study provided a complete genomic view of host adaptation of influenza viruses. They verified and enriched the known amino acid host markers and generated new information about the adaptation of zoo- notic viruses to humans, thus offering a larger set of fin- er potential sites for further experimental verification to elucidate their biological functions in cellular processes. 6. ACKNOWLEDGEMENTS We thank Houghton College for its financial support. REFERENCES [1] Chen, G.W., Chang, S.C., Mok, C.K., Lo, Y.L., Kung, Y.N., et al. (2006) Genomic signatures of human versus avian influenza A viruses. Emerging Infectious Diseases, 12(9), 1353-1360. [2] Chen, G.W. and Shih, S.R. (2009) Genomic signatures of influenza A pandemic (H1N1) 2009. Emerging Infectious Diseases, 15(12), 1897-1903. [3] Pan, C., Cheung, B., Tan, S., Li, C., Li, L., et al. (2010) Genomic signature and mutation trend analysis of pan- demic (H1N1) 2009. Influenza A Virusus PLoS One, 5(3), e9549. [4] Miotto, O., Heiny, A., Tan, T. W., August, J.T. and Brusic, V. (2008) Identification of human-to-human transmissi- bility factors in PB2 proteins of influenza A by large- scale mutual information analysis. BMC Bioinformatics., 9(Suppl 1), S18. [5] Miotto, O., Heiny, A.T., Albrecht, R., García-Sastre, A., Tan, T.W., August, J.T. and Brusic, V. (2010) Com- plete-proteome mapping of human influenza A adaptive mutations: Implications for human transmissibility of zoonotic strains. PLoS One, 5(2), e9025. [6] Finkelstein, D.B., Mukatira, S., Mehta, P.K., Obenauer, J.C., Su, X., Webster, R.G. and Naeve, C.W. (2007) Per- sistent host markers in pandemic and H5N1 influenza viruses. Journal of Virology, 81(19), 10292-10299. [7] Allen, J.E., Gardner, S.N., Vitalis, E.A. and Slezak, T.R. (2009) Conserved amino acid markers from past influ- enza pandemic strains. BMC Microbiology, 9(1), 77. [8] Hu, W. (2010) Novel host markers in the 2009 pandemic H1N1 influenza A virus. Journal of Biomedical Science and Engineering, 3(6), 584-601. [9] Herfst, S., Chutinimitkul, S., Ye, J., de Wit, E., Munster, V.J., Schrauwen, E.J., Bestebroer, T.M., Jonges, M., Meijer, A., Koopmans, M., Rimmelzwaan, G.F., Oster- haus, A.D., Perez, D.R. and Fouchier, R.A. (2010) Intro- duction of virulence markers in PB2 of pandemic swine-origin influenza virus does not result in enhanced virulence or transmission, Journal of Virology, 84(8), 3752-3758. [10] Mehle, A. and Doudna, J.A. (2009) Adaptive strategies of the influenza virus polymerase for replication in humans. Proceedings of National Academic Science in USA., 106(50), 21312-21316. [11] Alexande,r S., Benjamin, G., Gustavo, P., Ian Lipkin, W. and Raul R. (2010) Host dependent evolutionary patterns W. Hu / J. Biomedical Science and Engineering 3 (2010) 684-699 Copyright © 2010 SciRes. JBiSE 699 and the origin of 2009 H1N1 pandemic influenza. PLoS Current Influenza, RRN1147. [12] Rabadan, R., Levine, A.J. and Robins, H. (2006) Com- parison of avian and human influenza A viruses reveals a mutational bias on the viral genomes. Journal of Virology, 80(23), 11887-11891. [13] Microbiol Biotechnol, J. (2010) Comparative study of the nucleotide bias between the novel H1N1 and H5N1 sub- types of influenza A viruses using bioinformatics tech- niques. Ahn I, Son HS. Bioinformatics Team, 20(1), 63- 70. [14] Valli, M.B., Meschi, S., Selleri, M., Zaccaro, P., Ippolito, G., Capobianchi, M.R. and Menzo, S. (2010) Evolution- ary pattern of pandemic influenza (H1N1) 2009 virus in the late phases of the 2009 pandemic. PLoS Current In- fluenza, RRN1149. [15] Ramakrishnan, M.A., Gramer, M.R., Goyal, S.M. and Sreevatsan, S. (2009) A Serine12Stop mutation in PB1- F2 of the 2009 pandemic (H1N1) influenza A: a possible reason for its enhanced transmission and pathogenicity to humans. Journal of Veterinary Science, 10(4), 349-351. [16] Katoh, K., Kuma, K., Toh, H. and Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acid Research, 33, 511-518. [17] Breiman, L. (2001) Random Forests, Machine Learning, 45(1), 5-32. [18] Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(3), 3-16. [19] Archer, K.J. and Kimes, R.V. (2008) Empirical charac- terization of random forest variable importance measures. Computational Statistics and Data Analysis, 52(4), 2249- 2260. [20] Reif, D.M., Motsinger, A.A., McKinney, B.A., Crowe, J.E. and Moore, J.H. (2006) Feature selection using a random forests classifier for the integrated analysis of multiple data types. Proceedings of 2006 IEEE Symposium on Computa- tional Intelligence and Bioinformatics and Computational Biology, CIBCB ’06. [21] Granittoa, P.M., Furlanellob, C., Biasiolia, F. and Gas- peria, F. (2006) Recursive feature elimination with ran- dom forest for PTR-MS analysis of agroindustrial prod- ucts. Chemometrics and Intelligent Laboratory Systems, 83(2), 83-90. [22] Menze1, B.H., Kelm, B.M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W. and Hamprecht, F.A. (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature se- lection and classification of spectral data. BMC Bioin- formatics, 10, 213. [23] Gao, D., Zhang, Y.-X. and Zhao, Y.-H. (2009) Random forest algorithm for classification of multi-wavelength data. Research in Astronomy and Astrophysics, 9(2), 220- 226. [24] Hu, W. (2009) Identifying predictive markers of chemo- sensitivity of breast cancer with random forests. Journal of Biomedical Science and Engineering, 3(1), 59-64. [25] Gavin, J.D., Smith, D.V., Justin, B., Samantha, J.L., et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature, 459(7250), 1122-1125. [26] Hu, W. (2010) Quantifying the effects of mutations on receptor binding specificity of influenza viruses. Journal of Biomedical Science and Engineering, 3(3), 227-240. [27] KováccaronOVá, A., Ruttkay-Nedecký, G., HaverlíK1, I.K. and Janecccaronek, S. (2002) Sequence similarities and evolutionary relationships of influenza virus A he- magglutinins. Virus Genes, 24(1), 57-63. [28] Colman, P.M., Hoyne, P.A. and Lawrence, M.C. (1993) Sequence and structure alignment of paramyxovirus he- magglutinin-neuraminidase with influenza virus neura- minidase. Journal of Virology, 67(6), 2972-2980. [29] Maurer-Stroh, S., Ma, J.M., Lee, R.T.C., Sirota, F.L. and Eisenhaber, F. (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites. Biology Direct, 4, 18. [30] Baudin, F., Petit, I., Weissenhorn, W. and Ruigrok, R.W.H. (2001) In vitro dissection of the membrane bind- ing and RNP binding activities of influenza virus M1 protein. Virology, 281(1), 102-108. [31] Furuse, Y., Suzuki, A., Kamigaki, T. and Oshitani, H. (2009) Evolution of the M gene of the influenza A virus in different host species: Large-scale sequence analysis. Journal of Virology, 6(1), 67. [32] Yang, H., Carney, P. and Stevens, J. (2010) Structure and Receptor binding properties of a pandemic H1N1 virus hemagglutinin. PLoS Current Influenza, RRN1152. [33] Dundon, W.G. and Capua, I. (2009) A closer look at the NS1 of influenza virus. Viruses, 1(3), 1057-1072. [34] Lin, D., Lan, J. and Zhang, Z. (2007) Structure and func- tion of the NS1 protein of influenza A virus. Acta Bio- chim Biophys Sin (Shanghai), 39(3), 155-162. [35] Ye, Q., Krug, R.M. and Tao, Y.J. (2006) The mechanism by which influenza A virus nucleoprotein forms oli- gomers and binds RNA. Nature, 444(7122), 1078-1082. [36] Liu, X. and Zhao, Y.P. (2010) Switch region for patho- genic structural change in conformational disease and its prediction. PLoS One, 5(1), e8441. [37] Yuan, P.W., Bartlam, M., Lou, Z.Y., Chen, S.D., Zhou, J., He, X.J., Lv, Z.Y., Ge, R.W., Li, X.M., Deng, T., Fodor, E., Rao, Z.H. and Liu, Y.F. (2009) Crystal structure of an avian influenza polymerase PAN reveals an endonuclease active site. Nature, 458(7240), 909-913. |