J. Biomedical Science and Engineering, 2009, 2, 550-558. doi: 10.4236/jbise.2009.27080. Published Online November 2009 in SciRes (http://www.scirp.org/journal/jbise).

Analysis of correlated mutations, stalk motifs, and phylogenetic relationship of the 2009 influenza A virus neuraminidase sequences

Wei Hu
Department of Computer Science, Houghton College, Houghton, NY, USA. Email: wei.hu@houghton.edu
Received 13 October 2009; revised 27 October 2009; accepted 30 October 2009.

ABSTRACT
The 2009 H1N1 influenza pandemic has attracted worldwide attention. The new virus, which first emerged in Mexico in April 2009, was identified as a unique combination of a triple-reassortant swine influenza A virus, composed of genetic information from pigs, humans, and birds, and a Eurasian swine influenza virus. Several recent studies on the 2009 H1N1 virus utilized small datasets to conduct analysis. With the sequences available to date, we were able to extend the previous research in three areas. The first was finding two networks of co-mutations that may potentially affect the current flu-drug binding sites on neuraminidase (NA), one of the two surface proteins of the flu virus. The second was discovering a special stalk motif, which was dominant in the H5N1 strains in the past, in the 2009 H1N1 strains for the first time. Because of the high virulence associated with this motif, the second finding is significant for current research on 2009 H1N1. The third was updating the phylogenetic analysis of the current NA sequences of 2009 H1N1 and H5N1, which demonstrated that, in clear contrast to previous findings, the N1 sequences in 2009 are diverse enough to cover different major branches of the phylogenetic tree of those in previous years. As the novel influenza A H1N1 virus continues to spread globally, our results highlight the importance of performing timely analysis on the 2009 H1N1 virus.
Keywords: Entropy; Co-mutation; Mutation; Mutual Information; Neuraminidase; Phylogenetic Analysis; Random Forest; Stalk Motif; Swine Flu

1. INTRODUCTION

There are three types of flu viruses: A, B, and C. Type A viruses are the most pathogenic to humans. The influenza A virus has two glycoproteins on its surface, hemagglutinin (HA) and neuraminidase (NA), on the basis of which it is classified. There are 16 types of HA proteins and 9 types of NA proteins, which are named H1, H2, H3, etc. For example, “bird flu” is H5N1 and “swine flu” is H1N1. The HA binds the virus to sialic acid receptors on the host cell surface. The NA protein facilitates the release of virions to infect other cells by removing sialic acid residues from the viral HA during entry and release from cells. The NA protein is a tetramer of four identical polypeptide chains anchored in the membrane of the virus. Its head domain is globular and supported by a long and thin stalk.

The “Spanish” influenza pandemic of 1918–1919 caused about 50 million deaths worldwide, and about one-third of the world’s population was infected. One unique feature of the 1918 influenza pandemic was the simultaneous infection of humans and swine. Recent studies on the 1918 virus revealed that the genes encoding its HA and NA surface proteins were derived from an avian-like influenza virus shortly before the start of the pandemic [1].

In April 2009, a novel strain of the influenza A (H1N1) virus was discovered in patients from Mexico and the United States, and it spread across the globe via human-to-human contact within a very short time. Because of the seriousness of this new flu virus, the World Health Organization (WHO) officially declared the H1N1 outbreak a global pandemic. Genomic analysis of the 2009 influenza A (H1N1) virus suggested that it is closely related to common reassortant swine influenza A viruses isolated in North America, Europe, and Asia.
Its NA sequences have 94.4% similarity at the nucleotide level with European swine influenza A virus strains from 1992 [2].

The flu drugs currently in use, such as oseltamivir (Tamiflu®) and zanamivir (Relenza®), target only the NA protein and disrupt the ability of the virus to escape infected cells and move on to infect healthy cells. Clinical reports suggested that the new virus is susceptible to the two drugs [3]. However, a growing concern is that more drug-resistant mutants will emerge under the selection pressure of constant drug use. Researchers from Rensselaer Polytechnic Institute [4] designed a new flu drug that targets both the HA and NA of the virus as an effective way to treat the next mutation of H1N1 swine flu.

Two recent studies [5,6] provided insights into the interactions of flu drugs with the NA of the 2009 H1N1 virus. One study [6] developed a 3D structure of 2009 H1N1 NA and compared it with the crystal structure of 2006 H5N1 NA and the structure of 1918 H1N1 NA. It found that the hydrophobic Tyr347 in H5N1 NA does not match the hydrophilic carboxyl group of oseltamivir as well as the corresponding residue in H1N1 NA does, which explains in part why the H5N1 avian influenza virus is resistant to oseltamivir.

Another study [5] found that the NA sequences of 2009 H1N1 are phylogenetically more closely related to European H1N1 swine flu and H5N1 avian flu than to their H1N1 counterparts in the Americas. It also investigated the sequence variations of 2009 H1N1 NA using three NA sequences: A/H1N1/California/04/2009, A/H5N1/Vietnam/2004, and A/H1N1/Brevig Mission/1/18 (the 1918 Spanish flu). With multiple sequence alignment, the authors found that among the 387 residues of the NA domain, the 2009 H1N1 NA differs from the other two strains at 21 positions. The novel mutations of NA are mainly located at the protein surface and not near the binding pocket for currently used NA inhibitors. It is natural to explore whether there are any potential mutation sites that may interfere with the active sites in the near future.

In light of the possible emergence of new mutations of 2009 H1N1 that could lead to serious drug resistance, it is imperative to study the potential mutations and co-mutations of the current sequences of 2009 H1N1, because mutations tend to function in concert to achieve some biological purpose. In this study, we employed entropy and mutual information [7,8] to investigate this issue.
Because the NA stalk supports the head domain, its length can influence the function of NA. A special NA stalk motif with a 20-amino acid deletion in the 49th to 68th positions of the stalk region was first identified in H5N1 in 2000. The proportion of H5N1 isolates carrying this special NA stalk motif increased gradually from 2000 to 2007, and the motif was present in all 173 H5N1 human isolates from 2004 to 2007. The H5N1 virus carrying this special NA stalk motif has the highest virulence and pathogenicity in chickens and mice [9]. This finding prompted us to search for similar stalk motifs in the current 2009 H1N1 virus strains.

In summary, the goal of our study is to conduct a timely analysis of the mutations, co-mutations, stalk motifs, and phylogenetic relationship of the 2009 H1N1 NA sequences available to date. Such information can be valuable in further efforts to improve drug design and flu treatment.

2. MATERIALS AND METHODS

2.1. Sequence Data

Published NA sequences of 7251 influenza A viruses were downloaded from the Influenza Virus Resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) of the National Center for Biotechnology Information (NCBI) on Sept 13, 2009. We were mainly interested in the sequences from 2009, but also needed the sequences from 2008 and 2007 for comparison in the study of stalk motifs. There were 283 sequences of H1N1 and H5N1 and 52 of H3N2 in 2009. All the sequences used in the study were aligned with MAFFT [10].

2.2. Entropy and Mutual Information

In information theory [7,8], entropy is a measure of the uncertainty associated with a random variable. Let $x$ be a discrete random variable that has a set of possible values $\{a_1, a_2, a_3, \ldots, a_n\}$ with probabilities $\{p_1, p_2, p_3, \ldots, p_n\}$, where $p(x = a_i) = p_i$.
The entropy $H$ of $x$ is

$$H(x) = -\sum_{i=1}^{n} p_i \log p_i$$

The mutual information of two random variables is a quantity that measures the mutual dependence of the two variables, or the average amount of information that $x$ conveys about $y$. It can be defined as

$$I(x, y) = H(x) + H(y) - H(x, y)$$

where $H(x)$ is the entropy of $x$ and $H(x, y)$ is the joint entropy of $x$ and $y$. $I(x, y) = 0$ if and only if $x$ and $y$ are independent random variables.

In the current study, each of the $N$ columns in a multiple sequence alignment of a set of NA sequences of $N$ residues is considered a discrete random variable $x_i$ ($1 \le i \le N$) that takes on one of the 20 ($n = 20$) amino acid types with some probability. $H(x_i)$ has its minimum value 0 if all the residues at position $i$ are the same, and achieves its maximum, $\log 20$, if all 20 amino acid types appear with equal probability at position $i$, which can be verified by the Lagrange multiplier technique. A position of high entropy is one at which the amino acids vary often. While $H(x_i)$ measures the genetic diversity at position $i$, $I(x_i, x_j)$ measures the correlation between residue substitutions at positions $i$ and $j$.

Entropy and mutual information have been applied extensively to sequence analysis. Mutual information was employed to identify groups of covariant mutation positions in the sequences of HIV-1 protease and to distinguish the correlated residue substitutions resulting from neutral mutations from those induced by multi-drug resistance [11]. Based on entropy, a simple informational index was proposed in [12] to characterize the patterns of synonymous codon usage bias. In another
study, sequence data of 1032 complete genomes of influenza A virus (H3N2) from 1968 to 2006 were used to construct networks of genomic co-occurrence to describe H3N2 virus evolutionary patterns and dynamics. It suggested that amino acid substitutions corresponding to nucleotide co-changes cluster preferentially in known antigenic regions of HA [13]. Further, mutual information was used to construct a site transition network based on 4064 HA1 sequences of A/H3N2 from 1968 to 2008, which was able to model the evolutionary path of the influenza virus and to predict seven possible HA mutations for the next antigenic drift in the 2009–2010 season [14]. Recently, entropy and mutual information were also applied to identify critical positions and co-mutated positions on HA for predicting antigenic variants [15].

2.3. Mutual Information Evaluation

To assess the significance of the mutual information values of residue pairs of NA, it is necessary to show that these values are significantly higher than those obtained from random sequences. For each residue position of NA, we randomly permuted the amino acids from different sequences at that position and calculated the mutual information of these random sequences. This procedure was repeated 1000 times. The P value was calculated as the percentage of the mutual information values of the permuted sequences that were higher than those of the original NA sequences.

2.4. Random Forest Clustering

Random Forest, proposed by Leo Breiman in 1999 [16], is an ensemble classifier based on many decision trees. The structure of a single tree can easily be altered by a small perturbation of the data. Random Forest overcomes this problem by averaging across different decision trees.
For many data sets, Random Forest produces a highly accurate classifier for supervised learning, comparable to the Support Vector Machine, a state-of-the-art machine-learning algorithm. It computes proximities between cases, and this technique can be extended to unlabeled data, leading to unsupervised clustering. In [17] random forest clustering was applied to renal cell carcinoma.

To view the clusters formed by Random Forest, multidimensional scaling [18] was utilized to project the high-dimensional data down into a low-dimensional space while preserving the distances between cases. First, the proximities between cases i and j form a symmetric and positive definite matrix {prox(i,j)}. Then a second positive definite and symmetric matrix {cv(i,j)} is constructed using the entries of {prox(i,j)}. Random Forest extracts a few of the largest eigenvalues of the cv matrix and their corresponding eigenvectors. The values of $\sqrt{e(i)}\,v(i)$ are referred to as the ith scaling coordinate, where e(i) and v(i) are the ith eigenvalue and eigenvector of the matrix cv [19]. In this study, the first and second scaling coordinates were utilized to visualize the data.

2.5. Important Sites in NA

The N1 active site is a shallow pocket constructed from conserved residues, some of which contact the substrate directly and participate in catalysis, while others provide a structural framework [12]. According to the numbering in [5], these residue positions of N1 are 118, 119, 151, 152, 156, 179, 180, 223, 225, 228, 247, 277, 278, 293, 295, 368, and 402. The antigenic sites of N1 are residues 83–143, 156–190, 252–303, 330, 332, 340–345, 368, 370, 387–395, 431–435, and 448–468.

3. RESULTS

3.1. Mutations and Co-mutations

The NA molecule is a homotetramer consisting of four identical polypeptide chains, each of about 470 amino acids. The exact number varies depending on the strain of the virus. The enzymatic domain of the NA is supported from the virus envelope by a polypeptide stalk of variable length.
The major molecular determinants known to influence the functional activities of the NA protein are the enzyme active site, the stalk length, the sialic acid binding site, and potential glycosylation sites.

In this study, entropy analysis was applied to locate the positions that have an elevated likelihood of developing mutations and those that have already mutated. The 31 positions with the highest entropy are displayed in Figure 1. All except four (248, 339, 340, and 454) were mutational positions discovered in [5]. These four exceptional positions are of interest because they have the potential to mutate, and position 248 is adjacent to the active site residue 247. Notably, there were two clusters of mutations, one near position 286 and another near position 386 (Figure 1).

For overview and comparison, we also plotted in Figure 1 the entropy distributions of the N1 and N2 sequences deposited at the NCBI web site during 2009 so far. The sequences of N1 and N2 varied the most in the neighborhood of the stalk region, between positions 36 and 76. In addition, the N1 sequences in H1N1 and H5N1 are experiencing a much greater genetic change than the N2 sequences in H3N2 this year, as illustrated by their entropy (Figure 1).

Next, we sought to probe the potential mutations that may affect the drug binding sites from a greater distance. We calculated the mutual information of every possible residue pair among the 469 residues of NA, a total of 109746 pairs (469 × 468 / 2). The top 44 pairs (the top 0.04% of all pairs) were selected, all with a P value of zero. Because we were interested in mutations in the NA domain, only the pairs in that region were retained, which gave us 11 pairs:
(149, 263), (149, 321), (263, 321), (228, 321), (188, 365), (189, 369), (221, 369), (189, 386), (149, 389), (263, 389), and (321, 389). From these 11 pairs, two networks of co-mutations were uncovered (Figure 2) and mapped to the homology-based 3D structure of N1 built in [5] (Figures 3 and 4). These two networks of co-mutations may form interaction chains that connect distant residues to the active sites.

The first network, consisting of positions 149, 263, 321, and 389, has the remarkable property that each position is highly correlated with all of the other three. Position 149 is near the active site and the bound drug and is therefore of great importance; it is part of the 150-loop region, which includes positions 147, 148, 150, and 151. Calcium ions are important for the thermostability and enzyme activity of influenza virus NAs. Three potential metal binding sites in each monomer of the tetramer were observed. The two mutation positions 321 and 389 are located in the region of one such site, at position 470 [20] (Figure 4).

The second network has positions 188, 189, 221, 365, and 369. Position 221 is near the three active sites 223, 225, and 228. Positions 365 and 369 are close to the active site 368 and the bound drug; therefore positions 188 and 189 may function together with 365 and 369 to influence the active site 368 and the bound drug (Figures 5 and 6). Position 221 is not a mutation site in the three-sequence alignment in [5], but it has high entropy.

Figure 1. The top plot shows the 31 residues of highest entropy in the NA domain (83–469) of H1N1 and H5N1 (2009). The residues that had one amino acid different from the two reference strains in [5] are marked with one asterisk, and those that had two different amino acids are marked with two asterisks.
The middle and the bottom plots show the entropy of all residues in NA (1–469) of H1N1 and H5N1 (2009) and of H3N2 (2009), respectively.
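The per-position entropy plotted in Figure 1 can be sketched in a few lines. This is a minimal illustration of the entropy definition in Section 2.2, not the code used in the study: the function names and the toy alignment are hypothetical, the natural logarithm is assumed (the text does not state the base), and a gap character is simply counted as one more symbol.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy H = -sum(p_i * log p_i) of one alignment column.

    Assumption: natural log, and any gap character counts as a symbol.
    """
    total = len(column)
    return sum(-(c / total) * math.log(c / total)
               for c in Counter(column).values())

def alignment_entropies(aligned_seqs):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]

def top_positions(entropies, k):
    """1-based positions of the k highest-entropy columns (ties keep order)."""
    order = sorted(range(len(entropies)), key=lambda i: -entropies[i])
    return [i + 1 for i in order[:k]]

# Hypothetical toy alignment, not data from the study.
toy_alignment = ["ACDA", "ACEA", "ACDC", "ACEC"]
entropies = alignment_entropies(toy_alignment)
most_variable = top_positions(entropies, 2)
```

In the toy alignment the first two columns are fully conserved and have entropy 0, while the last two columns split evenly between two residues and have entropy log 2, so they are the ones returned by `top_positions`.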
Figure 2. This plot shows the residues involved in the two networks of co-mutations, with an arc to indicate the correlation between co-mutations. Three active sites, 151, 152, and 368, are displayed next to their closest mutations. (Residues on the axis: 149, 151, 152, 188, 189, 221, 223, 225, 228, 263, 321, 365, 368, 369, 386, 389.)

Figure 3. This plot shows in 3D structure four residues, 248, 339, 340, and 454, that have high entropy and have not mutated yet, and one active site, 247. Residue 248 is very close to active site 247. Calcium ion site 471 is marked to show its closeness to the two residues 339 and 340. Residue 247 is in pink, 248 in yellow, 339 in blue, 340 in green, and 454 in black. The backbone of the antibody recognition sites is colored green, and the bound drug (zanamivir) and three calcium ions are shown in red.

Figure 4. This plot shows in 3D structure the four residues, 149, 263, 321, and 389, in the first network of co-mutations. Residue 149 is in pink, 263 in yellow, 321 in blue, and 389 in green. Calcium ion site 470 is marked to illustrate its close position to residues 321 and 389.

Figure 5. This plot shows in 3D structure the five residues, 188, 189, 221, 365, and 389, in the second network of co-mutations. Residue 188 is in pink, 189 in yellow, 221 in blue, 365 in green, and 389 in black.

Figure 6. This plot shows in 3D structure the closeness of the four residues 221, 223, 225, and 228. Residue 221 is a part of the second network of co-mutations, and the other three residues, 223, 225, and 228, are active sites. Residue 221 is in pink, 223 in yellow, 225 in blue, and 228 in green.
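The correlations drawn in Figure 2 are mutual information values whose significance was checked by permutation (Section 2.3). The sketch below shows one way this could be computed for a single pair of alignment columns. It is an illustration under stated assumptions rather than the study's implementation: the function names, the fixed random seed, the small tolerance used when comparing floating-point values, and the toy columns are all hypothetical.

```python
import math
import random
from collections import Counter

def entropy(symbols):
    """Shannon entropy (natural log, an assumption) of a sequence of symbols."""
    total = len(symbols)
    return sum(-(c / total) * math.log(c / total)
               for c in Counter(symbols).values())

def mutual_information(col_x, col_y):
    """I(x, y) = H(x) + H(y) - H(x, y) for two alignment columns."""
    joint = list(zip(col_x, col_y))
    return entropy(col_x) + entropy(col_y) - entropy(joint)

def permutation_p_value(col_x, col_y, n_perm=1000, seed=0, tol=1e-12):
    """Fraction of permutations whose MI is higher than the observed MI.

    Each column is shuffled independently across sequences, which breaks
    any association between the two positions.
    """
    rng = random.Random(seed)
    observed = mutual_information(col_x, col_y)
    perm_x, perm_y = list(col_x), list(col_y)
    higher = 0
    for _ in range(n_perm):
        rng.shuffle(perm_x)
        rng.shuffle(perm_y)
        if mutual_information(perm_x, perm_y) > observed + tol:
            higher += 1
    return higher / n_perm

# Number of residue pairs among 469 positions, as stated in the text.
n_pairs = math.comb(469, 2)

# Hypothetical toy columns: y covaries perfectly with x, z is unrelated to x.
col_x, col_y, col_z = "AAAACCCC", "DDDDEEEE", "DEDEDEDE"
mi_xy = mutual_information(col_x, col_y)
mi_xz = mutual_information(col_x, col_z)
p_xy = permutation_p_value(col_x, col_y)
p_xz = permutation_p_value(col_x, col_z)
```

For the perfectly covarying pair no permutation can exceed the observed value, so its P value is zero, which mirrors the "P value of zero" reported for the top 44 pairs. For the unrelated pair the observed mutual information is zero and a large share of permutations exceed it.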
This is a position of interest because it is close to the active sites 223, 225, and 228 and co-mutates with a cluster of mutation positions, 365, 366, and 369. The latter cluster also encloses another active site, 368 (Figures 2, 5 and 6).

3.2. Stalk Motifs

In this study, our intention was to discover the NA stalk motif patterns of H1N1 and H5N1 in 2009, which are the focus of this work, and those of 2008 and 2007, to place our findings in the right historical context. As noted in the previous section, the N1 strains in 2009 are experiencing rapid genetic variation in the stalk region compared with other regions of N1, which could be reflected in the different stalk motifs appearing this year.

We extracted the sub-sequences consisting of positions 36 to 79 in the NA sequences and discarded those that contained no amino acids. All the different motifs found are displayed in Table 1. In 2007, the NA stalk motifs in H1N1 and H5N1 fell into three types, referred to as types 1, 2, and 3. In 2008, the type 3 stalk motif disappeared and type 4 appeared. In 2009, types 1, 2, and 4 persisted. Types 1 and 2 were present in all three years. Type 2, referred to as the special stalk motif in [9], was in H5N1 in all three years, and type 1 was in H1N1 in all three years.

The important change in 2009 was that type 1, a common stalk motif in H1N1, was more prevalent in H5N1, and type 2, a common stalk motif in H5N1, was in H1N1 for the first time: a dramatic exchange of the two motifs between these two subtypes of flu viruses. The special stalk motif was dominant in H5N1 in 2007 and 2008, but this was no longer the case in 2009. The new type 4 was more evident in H5N1 in 2009. These new patterns, or exchanges of different patterns, of stalk motifs remind us again of the fast evolutionary nature of the flu virus and the need for timely analysis of its data.
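The motif extraction described above (take alignment positions 36 to 79, discard empty stalk regions, then count distinct motifs per subtype and year) can be sketched as follows. This is a hedged illustration, not the study's code: it assumes that alignment columns coincide with the residue numbering used in the text, the function names are hypothetical, and the toy sequences are padded with filler residues outside the stalk region. The two motifs are written from the residues listed in Table 1, with the 20-residue deletion at positions 49 to 68 represented by 20 gap characters.

```python
from collections import Counter

STALK_START, STALK_END = 36, 79  # 1-based, inclusive, as in the text

def stalk_motif(aligned_seq):
    """Sub-sequence covering alignment positions 36..79."""
    return aligned_seq[STALK_START - 1:STALK_END]

def has_special_deletion(aligned_seq):
    """True if positions 49..68 are all gaps (the special stalk motif)."""
    return set(aligned_seq[48:68]) == {"-"}

def tally_motifs(records):
    """Count motifs per (subtype, year); records are (subtype, year, sequence)."""
    counts = Counter()
    for subtype, year, seq in records:
        motif = stalk_motif(seq)
        if motif.strip("-") == "":  # stalk region contains no amino acids
            continue
        counts[(subtype, year, motif)] += 1
    return counts

# Toy data: filler residues stand in for positions 1..35 and 80 onward.
PREFIX, SUFFIX = "M" * 35, "A" * 10
FULL = "HS" + "IQLGNQNQIETCNQSVITYENNTWVNQTYVNISNTNFAAG" + "QS"
SPECIAL = "HS" + "IQTGNQHQAEP" + "-" * 20 + "ISNTNFLTE" + "KA"
records = [
    ("H5N1", 2007, PREFIX + SPECIAL + SUFFIX),
    ("H1N1", 2009, PREFIX + SPECIAL + SUFFIX),
    ("H1N1", 2009, PREFIX + FULL + SUFFIX),
    ("H1N1", 2009, PREFIX + "-" * 44 + SUFFIX),  # discarded: empty stalk
]
motif_counts = tally_motifs(records)
```

Keying the tally on subtype and year as well as the motif string makes it straightforward to see a motif appearing in a subtype for the first time, which is the kind of change the section reports for type 2 in H1N1.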
The occurrence of the special stalk motif in 2009 H1N1, which may bring increased virulence to the current swine flu epidemic, is worthy of further attention and surveillance. These alterations in the stalk region of NA in H1N1 and H5N1 could also be reflected in the phylogenetic analysis presented in the next section.

3.3. Phylogenetic Analysis

Although a phylogenetic tree of a small number of NA sequences was constructed before [5], new H1N1 NA sequences are being deposited at the NCBI web site regularly, so it is instructive to perform phylogenetic analysis on these new sequences. For easy comparison, the software MEGA [21] was used to reproduce the phylogenetic tree in Figure 3 of [5] with the same NA sequences: eight sequences of 2009 H1N1 (numbered from 1 to 8) available at the time of the research in [5] (as of April 29th) and 44 different representative sequences of H1N1 or H5N1 (numbered from 9 to 52) from previous years (left plot of Figure 7). We employed Random Forest to cluster these sequences to get a different view of their phylogenetic relationship (Figure 8), where a number is used to represent each sequence because of the limited space in that plot. The association of these numbers with their sequences can be found in Figure 7, where a number is printed before each flu subtype, as in “2 H1N1 California 09 2009” and “6 H1N1 Texas 05 2009”. Random Forest clustering revealed that all eight 2009 H1N1 sequences, which formed their own single cluster, were similar to the others only in the second scaling coordinate, not in the first. We reasoned that this was because all eight sequences were from the same country, and that the structure of these clusters might be different if the plentiful sequences deposited recently at the NCBI web site were used.

We took note of some minor but interesting differences between the phylogenetic tree in the left plot of Figure 7 and the Random Forest-based clusters of the same sequences in the left plot of Figure 8.
In Figure 8, the eight sequences 1, 2, 3, …, 8 were grouped in one cluster away from all the other sequences. This cluster was close to two groups of sequences in the second scaling coordinate. The first group consisted of sequences 13, 14, 16, 10, and 11, and the second had sequences 38, 35, and 31. The phylogenetic tree displayed the close relationship between the first group and the group of sequences 1, 2, …, 8, but did not do so for the second group. The first group had H1N1 swine Chachoengsao 2005 (13), H1N1 swine Zhejiang 2007 (14), H1N1 swine Hungary 2006 (16), H1N1 swine Scotland 1999 (10), and H1N1 Thailand 2005 (11). The second group had H1N1 Wilson Smith 33 (38), H11N1 Delaware 1987 (35), and H1N1 Brevig Mission 1918 (31). Apart from sequence 35, these two groups were all of the H1N1 subtype, and the first group inherited from the ancient strains in the second group, which is why they were similar and clustered together by Random Forest.

Table 1. Different NA stalk motifs. Each motif covers the stalk region from the 36th to the 79th position; a dash marks a deleted position.

Motif (positions 36 to 79)                      Subtype  Year  Number of strains sharing this motif  Motif type
HS IQTGSQNHTGICNQRIITYENSTWVNHTYVNINNTNVVAG KD  H1N1     2007  196                                   1
HS IQTGNQHQAEP--------------------ISNTNFLTE KA  H5N1     2007  189                                   2
-- ----NQNQVEP--------------------ISNTNFLTE KA  H5N1     2007  1                                     3
HS IQIGSQGYPETCNQSVITYENNTWVNQTYINISNTNLIGG QA  H5N1     2007  2                                     1
HS IQTGSQNNTGICNQRIITYENSTWVNHTYVNINNTNVVAG ED  H1N1     2008  147                                   1
-- -------------------------NHTYVNINNTNVVAG ED  H1N1     2008  1                                     4
HS IQTGNQCQAEP--------------------ISNTKFLTE KA  H5N1     2008  70                                    2
HS IQLGNQNQIETCNQSVITYENNTWVNQTYVNISNTNFAAG QS  H1N1     2009  157                                   1
HS IQTGNQCQDEP--------------------ISNTKFLTE KA  H1N1     2009  21                                    2
-- --------------SVITYENNTWVNQTYVNISNTNFAAG QS  H1N1     2009  39                                    4
HS INTGNQHQAEP--------------------ISNANFLTE KA  H5N1     2009  1                                     2
HS IQLGNQNQIETCNQSVITYENNTWVNQTYVNISNTNFAAG QS  H5N1     2009  20                                    1

Figure 7. Left plot: a reproduced phylogenetic tree of the NA protein sequences of the N1 subtype family in [5]. Right plot: a phylogenetic tree of the same sequences used in the left plot, except that the first eight sequences were replaced with eight new representative sequences from the 283 N1 sequences in 2009.
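The scaling coordinates behind the Random Forest clusters (Section 2.4) can be illustrated with a small sketch. It is not the Random Forest software used in the study. It assumes that the cv matrix is obtained from the proximity matrix by double centering, as commonly described in the Random Forest documentation, and it finds the leading eigenpairs with a plain power iteration so that it runs on the standard library alone. The function names and the toy proximity matrix are hypothetical.

```python
import math

def double_center(prox):
    """cv(i,j) = 0.5 * (prox(i,j) - mean_i - mean_j + grand mean).

    Assumes a symmetric proximity matrix (row means equal column means).
    """
    n = len(prox)
    row_mean = [sum(row) / n for row in prox]
    grand_mean = sum(row_mean) / n
    return [[0.5 * (prox[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def mat_vec(a, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def top_eigenpairs(matrix, k=2, iters=500):
    """Largest k eigenpairs of a symmetric matrix via power iteration
    with deflation (adequate for small illustrative matrices)."""
    a = [row[:] for row in matrix]
    n = len(a)
    pairs = []
    for idx in range(k):
        v = [math.sin(i + 1 + idx) for i in range(n)]  # fixed start vector
        for _ in range(iters):
            w = mat_vec(a, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        eigval = sum(x * y for x, y in zip(v, mat_vec(a, v)))
        pairs.append((eigval, v))
        for i in range(n):  # remove the component just found
            for j in range(n):
                a[i][j] -= eigval * v[i] * v[j]
    return pairs

def scaling_coordinates(prox, k=2):
    """k-th coordinate vector is sqrt(e(k)) * v(k) for the cv matrix."""
    pairs = top_eigenpairs(double_center(prox), k)
    return [[math.sqrt(max(e, 0.0)) * x for x in v] for e, v in pairs]

# Hypothetical proximities for four cases forming two tight pairs.
toy_prox = [[1.0, 0.9, 0.1, 0.1],
            [0.9, 1.0, 0.1, 0.1],
            [0.1, 0.1, 1.0, 0.9],
            [0.1, 0.1, 0.9, 1.0]]
coords = scaling_coordinates(toy_prox)
first = coords[0]
```

In the toy example the first scaling coordinate places cases 1 and 2 on one side of the origin and cases 3 and 4 on the other, which is the kind of separation read off the plots in Figure 8. For real data a numerical library's symmetric eigensolver would be the more robust choice.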
To get a new view of the clusters of the NA sequences of 2009 H1N1 and H5N1 available to date, eight new representative sequences were selected with cd-hit [22] from the 283 sequences of that year to replace the eight NA sequences, numbered from 1 to 8, used in the left plots of Figures 7 and 8. The new sequences were H1N1 Washington 2009 (1), H1N1 North Carolina 2009 (2), H5N1 Hong Kong 2009 (3), H1N1 Dublin 2009 (4), H1N1 Stockholm 2009 (5), H1N1 Crete 2009 (6), H1N1 swine Hong Kong 2009 (7), and H1N1 swine Guangdong 2009 (8). In the Random Forest-based clusters in the right plot of Figure 8, there were three major branches, with the center made of ancient NA sequences. Sequences 1, 2, and 8, sequences 3 and 4, and sequences 5, 6, and 7 were each on a separate branch, which was similarly reflected in the phylogenetic tree in the right plot of Figure 7. Sequences 1 and 2 were close to sequences 50 and 52, and sequence 8 was close to sequences 46 and 43. Sequences 3 and 4 were close to sequences 18, 16, 10, 13, 12, 9, 15, 11, and 14, which were mainly European swine strains. Sequences 5 and 6 were close to sequences 30 and 32, and sequence 7 was close to sequences 22, 25, and 33. All eight representative sequences were evenly spread over various clusters in the right plot of Figure 8 and in the phylogenetic tree in the right plot of Figure 7, which illustrates that as the 2009 H1N1 strains continue to evolve, a diverse genetic makeup of the sequences will develop.

Figure 8. Left plot: Random Forest-based clusters of the same NA sequences used in the left plot of Figure 7. Right plot: Random Forest-based clusters of the same NA sequences used in the right plot of Figure 7.

4. CONCLUSIONS

The recent outbreak of the novel 2009 H1N1 influenza has raised global concerns regarding its virulence and pandemic potential. Several recent studies on the 2009 H1N1 virus used small datasets to conduct analysis. With the sequences available to date, we were able to extend the previous research in three areas using entropy, mutual information, and Random Forest. The first was finding two networks of co-mutations that may potentially affect the current flu-drug binding sites on NA. The second was discovering a special stalk motif, which was dominant in the H5N1 strains in the past, in the 2009 H1N1 strains for the first time. Because of the high virulence associated with this motif, the second finding is significant for current research on 2009 H1N1. The third was updating the phylogenetic analysis of the NA sequences of 2009 H1N1 and H5N1, which demonstrated that, in clear contrast to previous findings, the N1 sequences in 2009 are diverse enough to cover different major branches of the phylogenetic tree of those in previous years. As the novel influenza A H1N1 virus continues to spread globally, our results highlight the value of performing timely analysis on the 2009 H1N1 virus.

5. ACKNOWLEDGMENTS

We thank Houghton College for its financial support.

REFERENCES

[1] Taubenberger, J. K. and Morens, D. M., (2006) 1918 Influenza: the mother of all pandemics, Emerg. Infect. Dis., 12(1), 15–22.
[2] Trifonov, V., Khiabanian, H., and Rabadan, R., (2009) Geographic dependence, surveillance, and origins of the 2009 influenza A (H1N1) virus, N. Engl. J. Med., 361, 115–119.
[3] Centers for Disease Control and Prevention (CDC), (2009) Update: Drug susceptibility of swine-origin influenza A (H1N1) viruses, MMWR Morb. Mortal. Wkly. Rep., 58, 433–435.
[4] Weïwer, M., Chen, C. C., Kemp, M. M., and Linhardt, R. J., (2009) Synthesis and biological evaluation of non-hydrolyzable 1,2,3-triazole-linked sialic acid derivatives as neuraminidase inhibitors, European Journal of Organic Chemistry, 16, 2587.
[5] Maurer-Stroh, S., Ma, J., Lee, R. T. C., Sirota, F. L., and Eisenhaber, F., (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites, Biol. Direct, 4, 18.
[6] Wang, S. Q., Du, Q. S., Huang, R. B., Zhang, D. W., and Chou, K. C., (2009) Insights from investigating the interaction of oseltamivir (Tamiflu) with neuraminidase of the 2009 H1N1 swine flu virus, Biochemical and Biophysical Research Communications, 386(3), 432–436.
[7] Cover, T. M. and Thomas, J. A., (1991) Elements of information theory, John Wiley and Sons, New York.
[8] MacKay, D., (2003) Information theory, inference, and learning algorithms, Cambridge University Press.
[9] Zhou, H. B., Yu, Z. J., Hu, Y., Tu, J. G., Zou, W., Peng, Y. P., Zhu, J. P., Li, Y. T., Zhang, A. D., Yu, Z. N., Ye, Z.
P., Chen, H. C., and Jin, M. L., (2009) The special neura- minidase stalk-motif responsible for increased virulence and pathogenesis of H5N1 influenza A virus, PLoS One, 4(7), e6277. [10] Katoh, K., Kuma, K., Toh, H., and Miyata, T., (2005) MAFFT version 5: Improvement in accuracy of multiple sequence alignment, Nucleic. Acids. Res., 33, 511–518. [11] Liu, Y., Eyal, E., and Bahar, I., (2008) Analysis of corre- lated mutations in HIV-1 protease using spectral cluster- ing, Bioinformatics, 24(10), 1243–1250. [12] Colman, P. M., Hoyne, P. A., and Lawrence, M. C., (1993) Sequence and structure alignment of paramyxovirus he- magglutinin-neuraminidase with influenza virus neura- minidase, J. Virol., 67, 2972–2980. [13] Du, X. J., Wang, Z., Wu, A. P., Song, L., Cao, Y., Hang, H. Y., and Jiang, T. J., (2008) Networks of genomic co-occurrence capture characteristics of human influenza A (H3N2) evolution, Genome. Res., 18, 178–187. [14] Xia, Z., Jin, G. L., Zhu J., and Zhou, R. H., (2009) Using a mutual information-based site transition network to map the genetic evolution of influenza A/H3N2 virus, Bioinformatics, 25(18), 2309–2317. [15] Huang, J. W., King, C. C., and Yang, J. M., (2009) Co- evolution positions and rules for antigenic variants of human influenza A/H3N2 viruses, BMC Bioinformatics, 10(Suppl 1), S41. [16] Breiman, L., (2001) Random forests, Machine Learning, 45(1), 5–32. [17] Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., and Horvath, S., (2005) Tumor classification by tissue mi- croarray profiling: Random forest clustering applied to renal cell carcinoma, Mod. Pathol., 18(4), 547–57. [18] Cox, T. F. and Cox, M. A. A., (2001), Multidimensional scaling, Chapman and Hall. [19] http://www.stat.berkeley.edu/~breiman/RandomForests/. [20] Xu, X. J., Zhu, X. Y., Dwek, R. A., Stevens, J., and Wil- son, I. A., (2008) Structural characterization of the 1918 influenza virus H1N1 neuraminidase, Journal of Virology, 82(21), 10493–10501. 
[21] Kumar, S., Nei, M., Dudley, J., and Tamura, K., (2008) MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences, Brief Bioinfor- matics, 9, 299–306. [22] Li, W. and Godzik, A., (2006) Cd-hit: A fast program for clustering and comparing large sets of protein or nucleo- tide sequences, Bioinformatics, 22, 1658–1659. SciRes Copyright © 2009 JBiSE
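For readers who wish to experiment with the entropy and mutual information measures named in the conclusions, the following minimal Python sketch computes the Shannon entropy of one alignment column and the mutual information between two columns, using the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y). It is an illustration only: the function names, the base-2 logarithm, and the toy alignment are assumptions of this sketch and are not taken from the analysis pipeline used in this paper.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p), summed over the observed symbols
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mutual_information(col_x, col_y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) of two columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must come from the same set of sequences")
    joint = list(zip(col_x, col_y))  # paired residues, one pair per sequence
    return column_entropy(col_x) + column_entropy(col_y) - column_entropy(joint)


def alignment_columns(alignment):
    """Transpose a list of equal-length aligned sequences into columns."""
    return [list(col) for col in zip(*alignment)]


# Hypothetical toy alignment (not real NA data): positions 0 and 1 always
# change together, while position 2 is fully conserved.
toy_alignment = ["AKT", "AKT", "GRT", "GRT"]
cols = alignment_columns(toy_alignment)

print(column_entropy(cols[0]))             # variable position
print(column_entropy(cols[2]))             # conserved position
print(mutual_information(cols[0], cols[1]))  # co-varying pair
print(mutual_information(cols[0], cols[2]))  # pair with a conserved position
```

In this toy case the co-varying pair of positions has high mutual information, whereas any pair that includes the conserved position has none, which is the intuition behind using mutual information to look for candidate co-mutation sites. On real data, small sample sizes inflate these estimates, so some form of correction or significance threshold would normally be needed.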