2020 April 23
The recent Peter Forster et al. study "Phylogenetic network analysis of SARS-CoV-2 genomes" has attracted much attention, and most media reports seem to echo how the Cambridge University Research News inaccurately characterized the study in the headline that first announced it: "Study charts the 'incipient supernova' of COVID-19 through genetic mutations as it spread from China and Asia to Australia, Europe and North America". Perhaps this reflects the way the authors discussed their results in the text and presented them in their Figure 1. That figure is dominated by a big group, called cluster B by the authors and representing COVID-19 patients from Wuhan, with lines radiating from it to all other clusters, including clusters A (mainly patients from the rest of Asia and North America) and C (mainly Australia and Europe).
But the Forster study is not about the propagation of COVID-19. Rather, it is about the phylogenetic evolution of SARS-CoV-2, the virus that causes it, from its presumed origin, the bat coronavirus, whose genome (an RNA genome) has 96.2% similarity with that of the human virus. From this perspective, I present a schematic, simplified, topologically identical redrawing of Forster's Figure 1, "Evolution of the virus of Covid-19", shown below. In this plot, propagation runs away from the source (the bat) and is drawn downwards or sideways, so that the ancestor-descendant relation is clear. Forster's Figure 1, rotated 180 degrees, is shown in the upper-left corner of my plot. If one turns the branch leading to "bat" so that it points up, then it is clear that Figure 1 and my plot are topologically identical.
Some clarification before going further. A phylogenetic tree (or network) of genomes such as Figure 1 (or my plot) is not the same as a "father-begets-son" genealogic tree. Here the begetting in the ancestor-descendant relation is by mutation, presumably occurring in the host carrying the mutating virus (sorry folks, no sex here). Hence, while it is not possible for a son to beget a father on a genealogic tree, it is in principle possible on a phylogenetic tree to get closer to the original ancestor (here, the bat) by mutation. More on this later.
My plot shows that the two groups of viruses designated respectively by "Snohomish" (from the first confirmed Covid-19 case in Snohomish, WA, USA on 2020/01/19) and "Guangdong" (a province in southern China) are closest to the original ancestor (bat), and the plot suggests that they shared a theoretical, unidentified common ancestor two mutations away from each. Both are equally probable ancestors of group A1, which is the most probable ancestor of the big group "Wuhan". This is why I say the Cambridge University Research News headline is inaccurate and misleading. Perhaps this could be traced to the Forster et al. article itself, where the word ancestral was attached collectively to a large group called "A type" that included "Snohomish", "Guangdong", and A1, and individually to "Guangdong", A1 and "Wuhan", but not individually to "Snohomish".
If that is the case, how then was patient Snohomish infected? Here there are many possibilities. One is that he was already carrying the virus before traveling to China, that is, he contracted it locally, near home. Would that have been possible? A New York Times article of 2020 March 10 and its live coverage of April 22 suggest that it would. Briefly, the article reports that in Seattle (of which Snohomish is a suburb), a team of researchers led by the physician Dr. Helen Chu repurposed swabs taken in a months-old flu study (of residents showing flu symptoms) to test for the coronavirus and quickly found a positive case, a teenager with no travel history. The repurposing effort was later stopped by federal and state officials.
We are still left with the connection (via A1) between virus "Snohomish" and virus "Wuhan", and Forster's study suggests the transmission is from "Snohomish" to "Wuhan". Earlier we said it is in principle possible for mutations to go backwards, in this case, say, from "Wuhan" to "Snohomish", which are separated by five mutations. But going from "Wuhan" to "Snohomish" would require five specific, directed mutations rather than random ones, so the chance that this would happen would be roughly one in (30,000)**5/5!, or approximately one in 200 times a billion times a billion (i.e., 1 in 200,000,000,000,000,000,000). Here, 30,000 is the approximate size (in nucleotide bases) of the coronavirus genome. Thus, it is virtually impossible that the transmission was from "Wuhan" to "Snohomish".
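The order of magnitude of this estimate can be checked with a few lines of Python. This is only a sketch of the counting argument: it assumes, as the text does, a genome of about 30,000 bases and five mutations, and it counts only which sites are hit, ignoring which base each site changes to. The function names are illustrative, not taken from any of the cited studies.

```python
from math import comb, factorial

GENOME_LENGTH = 30_000  # approximate genome size in bases, as quoted in the text
N_MUTATIONS = 5         # mutations separating "Wuhan" from "Snohomish" in the plot


def approx_site_choices(genome_length: int, n_mutations: int) -> float:
    """The estimate used in the text: L**k / k!."""
    return genome_length ** n_mutations / factorial(n_mutations)


def exact_site_choices(genome_length: int, n_mutations: int) -> int:
    """Exact number of ways to pick k distinct sites out of L (binomial coefficient)."""
    return comb(genome_length, n_mutations)


approx = approx_site_choices(GENOME_LENGTH, N_MUTATIONS)
exact = exact_site_choices(GENOME_LENGTH, N_MUTATIONS)

print(f"L**k / k! is about {approx:.2e}")
print(f"C(L, k)   is about {exact:.2e}")
print(f"chance of one specific set of sites: about 1 in {approx:.1e}")
```

Both counts come out on the order of 10**20, so the approximation L**k / k! is more than adequate for the argument being made.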
How reliable are the Forster results so far as our plot is concerned? A study on the evolution and transmission of the COVID-19 virus (i.e., SARS-CoV-2) by Wen-Bin Yu et al., posted on ChinaXiv.org on 2020/02/21, five weeks before the Forster paper was published, also presented a phylogenetic tree for the virus. The Forster study and the Yu study used the same source of SARS-CoV-2 data, but because of its earlier date, the Yu study was based on a smaller set of 93 early-outbreak virus genomes, compared with the 160 genomes used in the Forster study. Both also added to their analyses the genome of the bat coronavirus as an outgroup, which I have taken here to be the root of the resulting phylogenetic tree/network. The methods used in the two analyses are similar but not identical. Specifically, Yu constructed a tree, and Forster, a network (that happened to be very close to a tree). The results of the Forster and Yu studies are in substantial agreement. In particular, the groups "Snohomish", "Guangdong", A1, and "Wuhan" in the Forster study have exact counterparts in the Yu study, as shown in our plot. Crucially, in the Yu study "Snohomish"/H38 is ancestral to "Wuhan"/H1 via A1/H3.
In another study of the origin and evolution of SARS-CoV-2, published on 2020/03/03, X. Tang et al., using a method different again from those used by Forster and by Yu, constructed a phylogenetic network on 103 early-outbreak SARS-CoV-2 genomes that is consistent with the results of Forster and Yu. Tang also showed that the human and bat coronaviruses are closer to each other than either is to the pangolin coronavirus.
Thus, the scientific backing of the following appears firm: "Guangdong" and "Snohomish" are siblings and are co-ancestral to A1, which in turn is ancestral to "Wuhan".
Can the source of SARS-CoV-2 be pushed further back, beyond "Guangdong"/"Snohomish"? There must be a large number of swabs extant from flu patients in both China and the US, taken during, say, October and November of 2019, and a concerted effort to repurpose those swabs for Covid-19 tests may shed some light on "patient zero"--if that person ever existed--or on the unidentified common ancestor of "Snohomish" and "Guangdong".
© HC Lee, April 23, 2020, Taoyuan, Taiwan
Postscript. Richard Corlett (co-author of Yu et al. paper) writes: Remember that the bat sequence comes from a family of bats - the Rhinolophidae - which is not found in North or South America, and the sequence closest to COVID-19 is from China. 2020/04/24