Saturday, May 30, 2009

Eurosurveillance, Volume 14, Issue 21, 28 May 2009

CLUSTER ANALYSIS OF THE ORIGINS OF THE NEW INFLUENZA A(H1N1) VIRUS

A Solovyov1

Physics Department, Princeton University, Princeton, United States
Center for Infection and Immunity, Mailman School of Public Health, Columbia University, New York, United States
Department of Biomedical Informatics, Center for Computational Biology and Bioinformatics, Columbia University College of Physicians and Surgeons, New York, United States

Date of submission: 27 May 2009
In March and April 2009, a new strain of influenza A(H1N1) virus was isolated in Mexico and the United States. Since the initial reports, more than 10,000 cases from around the world have been reported to the World Health Organization. Several hundred isolates have already been sequenced and deposited in public databases. We have studied the genetics of the new strain and identified its closest relatives through a cluster analysis approach. We show that the new virus combines genetic information related to different swine influenza viruses. Segments PB2, PB1, PA, HA, NP and NS are related to swine H1N2 and H3N2 influenza viruses isolated in North America. Segments NA and M are related to swine influenza viruses isolated in Eurasia.


Influenza A virus is a single-stranded RNA virus with a segmented genome. When different influenza viruses co-infect the same cell, progeny viruses can be released that contain a novel mix of segments from both parental viruses. Since the first reported pandemic in 1918, there have been two other pandemics in the 20th century. In both cases, the pandemic strains presented a novel reassortment of genome segments derived from human and avian viruses [1-3]. The origins of the 1918 strain are not so clear, although different analyses suggest that this virus had an avian origin [4,5].

When and where pandemic reassortments happen remains a mystery. Avian viruses often undergo reassortment events among different subtypes. Several reports suggest that reassortments are also frequent between human viruses [6,7]. Swine are frequently found with co-infections, and reassortment of swine, human, and avian viruses has been reported [8-10,3]. In addition, cell surface oligosaccharide receptors of the swine trachea present both an N-acetylneuraminic acid-alpha2,3-galactose (NeuAcalpha2,3Gal) linkage, preferred by most avian influenza viruses, and a NeuAcalpha2,6Gal linkage, preferred by human viruses [11]. Co-infection, the co-habitation of swine and poultry on small family farms all over Asia, and the presence of avian as well as human receptor types in pigs have led to the “mixing vessel” conjecture [12,13], which suggests that most inter-host reassortments are produced in pigs.

Recently, a new A(H1N1) subtype strain was identified, initially in Mexico, and then rapidly reported on all continents. As of 27 May, 12,954 cases of the new influenza A(H1N1) virus infection, including 92 deaths, have been reported to the World Health Organization [14,15]. Several approaches have been used to understand the origins of this strain. Searches in public databases containing influenza A genomes using sequence alignment tools indicated that the closest relatives for each of the eight genomic segments are from viruses circulating in swine for the past decade [16-19]. These include genome segments derived from “triple reassortant” swine viruses that, in the late 1990s, combined genome segments from viruses previously identified in humans, birds, and swine [20]. Similar conclusions were drawn by the application of phylogenetic techniques [16,21].

Here we present a cluster analysis using Principal Component Analysis and unsupervised clustering. Clustering methods are particularly robust under changes in the underlying evolutionary models. Our results substantiate previous reports [16,21] and demonstrate that, for each of the genome segments of the new influenza A(H1N1) virus, the closest relative was most recently identified in swine. This is compatible with a reassortment of Eurasian and North American swine viruses (Figure 1).

Figure 1. Origins of the new influenza A(H1N1) virus

Materials and methods

Influenza sequences were obtained from the National Center for Biotechnology Information (NCBI) [22] in the United States. We performed a search using the Basic Local Alignment Search Tool (BLAST) for each of the eight A/California/04/2009(H1N1) segments separately, recording the 50 best matches. We then constructed the union of all these matches, taking the sequences of all their segments available in the database. We aligned these sequences using the stretcher algorithm as implemented in the EMBOSS package.
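
The union step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout and the strain names are hypothetical placeholders, not the authors' code or real BLAST output.

```python
from typing import Dict, List, Set


def union_of_top_hits(hits_by_segment: Dict[str, List[str]], top_n: int = 50) -> Set[str]:
    """Return every strain that appears among the top_n ranked hits of any segment.

    hits_by_segment maps a segment name (e.g. "HA") to a list of strain
    names ordered from best to worst match (hypothetical input layout).
    """
    strains: Set[str] = set()
    for ranked_hits in hits_by_segment.values():
        strains.update(ranked_hits[:top_n])
    return strains


# Toy input with placeholder strain names (not real search results).
toy_hits = {
    "HA": ["strain_A", "strain_B", "strain_C"],
    "NA": ["strain_D", "strain_A", "strain_E"],
}
selected = union_of_top_hits(toy_hits, top_n=2)
```

Every strain in the resulting set would then contribute all of its available segments to the alignment step, so a strain that matches well on one segment is also compared on the other seven.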

After the alignment, we translate the sequences into binary data by comparing them to the reference sequence site by site. A mutation maps to 1, while a nucleotide identical to that in the reference sequence maps to 0. Whenever there are masks, they map to the corresponding fractional numbers. Gaps are not counted as polymorphisms. Therefore, S sequences restricted to P polymorphic sites translate to an SxP matrix. Each row of this matrix represents one of the sequences and can be thought of as a vector in a P-dimensional space.
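
A minimal Python sketch of this encoding follows. It makes several assumptions that the text does not confirm: "masks" are read as IUPAC ambiguity codes, a masked site is scored as the fraction of its possible bases that differ from the reference, and the reference itself is unambiguous. The function names are hypothetical.

```python
from typing import List, Tuple

# IUPAC nucleotide codes and the bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_value(ref_base: str, base: str) -> float:
    """Score one aligned site: 0 = identical, 1 = mutation, fraction = masked."""
    ref_base = ref_base.upper().replace("U", "T")
    base = base.upper().replace("U", "T")
    if ref_base == "-" or base == "-":
        return 0.0  # gaps are not counted as polymorphisms
    options = IUPAC[base]
    # Assumed rule for masks: share of possible bases that differ from the reference.
    return sum(o != ref_base for o in options) / len(options)


def encode(reference: str, sequences: List[str]) -> Tuple[List[List[float]], List[int]]:
    """Return the S x P matrix and the indices of the P polymorphic sites."""
    full = [[site_value(r, b) for r, b in zip(reference, seq)] for seq in sequences]
    sites = [j for j in range(len(reference)) if any(row[j] > 0 for row in full)]
    return [[row[j] for j in sites] for row in full], sites


# Toy alignment: three sequences already aligned to a six-site reference.
matrix, sites = encode("ACGT-A", ["ACGT-A", "ACTTAA", "RCGTCA"])
```

In the toy alignment, only sites 0 and 2 are polymorphic, so three sequences of length six reduce to a 3x2 matrix.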

We perform Principal Component Analysis (PCA) to determine the most significant coordinates in this P-dimensional space. We then retain the principal components that capture 85% of the total variance, discard the remaining ones, and project the data onto this relevant coordinate subset.
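
The projection step can be sketched with NumPy as below. This is an illustrative implementation via the singular value decomposition, not necessarily the one used by the authors. The function name is hypothetical, and the sketch assumes the data matrix has non-zero total variance.

```python
import numpy as np


def pca_project(X, variance_kept=0.85):
    """Project rows of X onto the leading principal components.

    Keeps the smallest number of components whose cumulative share of the
    total variance reaches variance_kept. Returns the projected data and the
    variance share of each retained component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA works on mean-centred columns
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance share per component
    k = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    k = min(k, len(s))                       # guard against rounding at the top end
    scores = centered @ vt[:k].T             # coordinates in the retained subspace
    return scores, explained[:k]


# Toy data: nearly all of the variance lies along the first column.
toy = np.array([[0, 0, 0], [10, 0.1, 0], [20, 0, 0.1], [30, 0.1, 0.1]])
toy_scores, toy_explained = pca_project(toy)
```

The first two columns of the projected data are the coordinates one would plot in a two-component scatter plot, and the returned variance shares give the percentage of variation attached to each axis.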

This procedure is followed by consensus K-means clustering. Namely, to target K clusters, one repeats the K-means clustering procedure N times and forms the matrix n whose elements nij (i,j=1,…,S) give the number of trials, out of N, in which the i-th and j-th sequences were clustered together. In our analysis we set N=100. The matrix of the distances between the samples is:

dij = 1 - nij/N

One then performs standard hierarchical clustering with this matrix, targeting K clusters. This procedure does not depend on any assumptions made by phylogenetic models. Note that these techniques can be used for inferring phylogenies as well [23], though this is beyond the scope of the present note.
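
The consensus procedure can be sketched with NumPy as follows. Several choices here are assumptions made for illustration only: the distance between two sequences is taken to be the fraction of trials in which they were not clustered together (1 - nij/N), K-means is a plain Lloyd iteration with random initial centres, and the hierarchical step uses average linkage. All function names are hypothetical.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=50):
    """One run of plain Lloyd K-means from random initial centres."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):          # leave an empty cluster's centre as is
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def consensus_distance(X, k, n_trials=100, seed=0):
    """Distance matrix 1 - n_ij / N built from N K-means trials (assumed form)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((len(X), len(X)))
    for _ in range(n_trials):
        labels = kmeans_labels(X, k, rng)
        counts += labels[:, None] == labels[None, :]
    return 1.0 - counts / n_trials


def average_linkage(dist, k):
    """Agglomerative clustering down to k clusters (average linkage assumed)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)             # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(len(dist), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels


# Toy data: two well-separated groups of three points each.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]], dtype=float)
dist = consensus_distance(pts, k=2)
groups = average_linkage(dist, k=2)
```

Repeating K-means many times averages out its dependence on the random starting centres, which is the reason for clustering on the co-occurrence counts rather than on a single run.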


Comparison of the available sequences of the new A(H1N1) virus (as of 27 May 2009) did not identify significant sequence variation, except for a few point mutations. Hence A/California/04/2009(H1N1) was chosen as the representative for further analyses. There are many different phylogenetic techniques, each with its own assumptions about evolutionary models, which differ in how they compute genetic distances, probabilities, etc. Unlike phylogenetic techniques, clustering methods do not need to evaluate a tree, which is a more complicated structure than a set of clusters. Clustering techniques do not provide a detailed phylogenetic structure because they analyse group features of the sequence data. For the same reason, the clustering analysis is more robust to the assumptions we make, for instance, the choice of genetic distance. Unsupervised methods provide a way of identifying clusters without relying on previous information about the origin, host and time of isolation.

Figures 2a-2h show the data projected onto the first two principal components with the corresponding percentage of variation. The figures clearly show that in all cases the new virus sequences clustered with those of swine viruses. The closest matches for each of the segments are summarised in the Table.

Our analyses support the hypothesis that the 2009 pandemic influenza A(H1N1) virus derives from one or multiple reassortment(s) between influenza A viruses circulating in swine in Eurasia and in North America. This is schematically illustrated in Figure 1.

Supplementary Tables 1 to 8 show the results of the clustering for each of the eight segments (PB2, PB1, PA, HA, NP, NA, M and NS).

The work of T. Briese, G. Palacios and W. I. Lipkin was supported by National Institutes of Health awards HL083850 and AI57158 (Northeast Biodefense Center - Lipkin). The work of A. Solovyov has been supported by grant NSF PHY-0756966.


1. Webster RG, Laver WG. Studies on the origin of pandemic influenza. I. Antigenic analysis of A 2 influenza viruses isolated before and after the appearance of Hong Kong influenza using antisera to the isolated hemagglutinin subunits. Virology. 1972;48(2):433–444.
2. Kawaoka Y, Krauss S, Webster RG. Avian-to-human transmission of the PB1 gene of influenza A viruses in the 1957 and 1968 pandemics. J Virol. 1989;63(11):4603–4608.
3. Scholtissek C, von Hoyningen V, Rott R. Genetic relatedness between the new 1977 epidemic strains (H1N1) of influenza and human influenza strains isolated between 1947 and 1957 (H1N1). Virology. 1978;89(2):613–617.
4. Taubenberger JK, Reid AH, Lourens RM, Wang R, Jin G, Fanning TG. Characterization of the 1918 influenza virus polymerase genes. Nature. 2005;437(7060):889-93.
5. Rabadan R, Levine AJ, Robins H. Comparison of avian and human influenza A viruses reveals a mutational bias on the viral genomes. J Virol. 2006;80(23):11887-91.
6. Rabadan R, Levine AJ, Krasnitz M. Non-random reassortment in human influenza A viruses. Influenza Other Respi Viruses. 2008;2(1):9-22.
7. Nelson MI, Viboud C, Simonsen L, Bennett RT, Griesemer SB, St George K, et al. Multiple reassortment events in the evolutionary history of H1N1 influenza A virus since 1918. PLoS Pathog. 2008 Feb 29;4(2):e1000012.
8. Zhou NN, Senne DA, Landgraf JS, Swenson SL, Erickson G, Rossow K, et al. Genetic reassortment of avian, swine, and human influenza A viruses in American pigs. J Virol. 1999;73(10):8851-6.
9. Webby RJ, Swenson SL, Krauss SL, Gerrish PJ, Goyal SM, Webster RG. Evolution of swine H3N2 influenza viruses in the United States. J Virol. 2000;74(18):8243-51.
10. Lindstrom SE, Cox NJ, Klimov A. Genetic analysis of human H2N2 and early H3N2 influenza viruses, 1957-1972: evidence for genetic divergence and multiple reassortment events. Virology. 2004;328(1):101-19.
11. Ito T, Couceiro JN, Kelm S, Baum LG, Krauss S, Castrucci MR, et al. Molecular basis for the generation in pigs of influenza A viruses with pandemic potential. J Virol. 1998;72(9):7367-73.
12. Scholtissek C. Pigs as the ‘mixing vessel’ for the creation of new pandemic influenza A viruses. Med Princip Prac. 1990;2:65–71.
13. Van Reeth K. Avian influenza in swine: a threat for the human population. Verh K Acad Geneeskd Belg. 2006;68(2):81-101.
14. World Health Organization (WHO). Influenza A(H1N1). Available from:
15. Centers for Disease Control and Prevention (CDC). H1N1 Swine flu. Available from:
16. Trifonov V, Khiabanian H, Greenbaum B, Rabadan R. The origin of the recent swine influenza A(H1N1) virus infecting humans. Euro Surveill. 2009;14(17):pii=19193. Available from:
17. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, Balish A, et al. Antigenic and Genetic Characteristics of Swine-Origin 2009 A(H1N1) Influenza Viruses Circulating in Humans. Science. 22 May 2009 [Epub ahead of print] DOI: 10.1126/science.1176225
18. Novel Swine-Origin Influenza A (H1N1) Virus Investigation Team. Emergence of a Novel Swine-Origin Influenza A (H1N1) Virus in Humans. N Engl J Med. 22 May 2009. [Epub ahead of print].
19. Trifonov V, Khiabanian H, Rabadan R. Geographic Dependence, Surveillance, and Origins of the 2009 Influenza A (H1N1) Virus. N Engl J Med. 27 May 2009. [Epub ahead of print] DOI: 10.1056/NEJMp0904572.
20. Shinde V, Bridges CB, Uyeki TM, Shu B, Balish A, Xu X, et al. Triple-Reassortant Swine Influenza A (H1) in Humans in the United States, 2005-2009. N Engl J Med. 22 May 2009. [Epub ahead of print]
21. Rambaut A. Human/Swine A/H1N1 Influenza Origins and Evolution. 3 May 2009. Available from:
22. National Center for Biotechnology Information. Influenza virus resource, information, search and analysis. Available from:
23. Alexe G, Satya RV, Seiler M, Platt D, Bhanot T, Hui S, et al. PCA and clustering reveal alternate mtDNA phylogeny of N and M clades. J Mol Evol. 2008;67(5):465-87.
