Some comments on this preprint that made the rounds today.
Is this evidence of a lab-cultured ancestral SARS-CoV-2?
Nope - the data prove it's not, and likely we're just looking at common sample contamination from early clinical sequencing.
1/🧵
researchsquare.com/article/rs-1330800/v1
Early during the pandemic we had two SARS-CoV-2 lineages - A and B. These are distinguished by "T" in A and "C" in B at position 8782 and by "C" in A and "T" in B at position 28144.
So "A" becomes T/C and "B" C/T at these positions.
3/
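To make the A/B rule above concrete, here's a minimal sketch in Python. The function name `call_lineage` and the "undetermined" label are my own for illustration, not part of any lineage-assignment tool, and it only looks at the two positions mentioned above.

```python
# Lineage-defining bases as described above:
# positions (8782, 28144) are T/C in lineage A and C/T in lineage B.
LINEAGE_MARKERS = {("T", "C"): "A", ("C", "T"): "B"}


def call_lineage(base_8782: str, base_28144: str) -> str:
    """Return 'A', 'B', or 'undetermined' from the two marker bases.

    Hypothetical helper: anything other than a clean T/C or C/T pair
    (including N or other ambiguous calls) is left undetermined.
    """
    key = (base_8782.upper(), base_28144.upper())
    return LINEAGE_MARKERS.get(key, "undetermined")


print(call_lineage("T", "C"))  # A
print(call_lineage("C", "T"))  # B
print(call_lineage("T", "T"))  # undetermined
```

Note that this only works when each position has one confident base call.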
Closely related viruses like BANAL-52 and RaTG13 have T/C at these positions, so A is closer to those than B. Many have concluded that this makes A the "root" of the SARS-CoV-2 tree and ancestral to B.
That is not a correct assumption (e.g., because of reversions).
4/
So we don't actually know the "root" of the tree - and we don't know if e.g., A is ancestral to B or B is ancestral to A.
I'm not going to go into more details on this as it's technical - we'll have a study on it soon.
5/
The three samples identified in the paper that are contaminated are:
SRR13441704
SRR13441705
SRR13441708
These samples are unrelated to SARS-CoV-2, but they contain SARS-CoV-2 reads - because the sequencing run was contaminated, as often happens.
6/
This allows us to assemble SARS-CoV-2 genomes from these samples - presumably from early (~December, 2019 / January, 2020) specimens. This is what the authors of the preprint did.
Here's the problem though - the samples are *super* messy.
7/
Here's just one example - at three key positions (8782, 18060, 28144) for sample SRR13441704.
A LOT of mismatches - which means that confidently calling genomes becomes near-impossible. Further, this suggests that we're dealing with mixed samples.
8/
Because of this, we can't properly call consensus sequences and do lineage assignment (A vs B) - because of low quality and because of mixed samples.
However, it seems reasonable to assume that we have a mixture of both A and B in these samples.
9/
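For a rough idea of what "mixed" means at a single position, here's a sketch in Python. The 20% minor-allele cutoff and the function names are assumptions I picked for illustration, not the criteria used here or in the preprint, and the read bases in the example are made up.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def allele_fractions(bases):
    """Fraction of reads supporting each base at one position (non-ACGT ignored)."""
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def is_mixed(bases, minor_threshold=0.2):
    """Flag a site as mixed if the second most common base reaches the cutoff.

    minor_threshold is an illustrative assumption, not an established standard.
    """
    fractions = sorted(allele_fractions(bases).values(), reverse=True)
    return len(fractions) > 1 and fractions[1] >= minor_threshold


# Made-up pileups at one position, purely for illustration:
print(is_mixed("TTTTTCCCCC"))  # True  (50% T, 50% C)
print(is_mixed("TTTTTTTTTC"))  # False (minor base at 10%)
```

When a site is flagged like this, a single consensus base (and therefore an A vs B call) isn't trustworthy.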
Further, we have a unique mutation at 23525 (H655Y), which is also found in e.g., Omicron and Gamma. This appears to be a derived mutation (i.e., not ancestral).
10/
❗️We also have a deletion at 21763-21786 (S:68-75) that leads to the in-frame deletion of "IHVSGTNG".
This deletion is *definitely* derived - because we don't see it in e.g., BANAL-52, RaTG13 - but we *do* see it occurring during the pandemic (e.g., EPI_ISL_434692).
11/
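If you want to check the coordinates yourself, here's a small sketch in Python. It assumes positions follow the Wuhan-Hu-1 reference (NC_045512.2), where the spike coding sequence starts at genome position 21563. The helper names are hypothetical.

```python
# Assumption: coordinates follow the Wuhan-Hu-1 reference (NC_045512.2),
# where the spike (S) coding sequence starts at genome position 21563.
S_START = 21563


def spike_codon(pos: int) -> tuple[int, int]:
    """Map a 1-based genome position to (spike codon number, position within codon)."""
    offset = pos - S_START
    if offset < 0:
        raise ValueError("position is upstream of the spike start")
    return offset // 3 + 1, offset % 3 + 1


def deletion_summary(start: int, end: int) -> tuple[int, bool]:
    """Return (deletion length in nt, whether the length preserves the reading frame)."""
    length = end - start + 1
    return length, length % 3 == 0


print(spike_codon(23525))              # (655, 1)
print(deletion_summary(21763, 21786))  # (24, True)
```

Position 23525 is the first base of spike codon 655, consistent with H655Y. The deletion is 24 nt, a multiple of three, so it stays in-frame and removes 8 residues, matching the 8-letter "IHVSGTNG".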
Here's an example of the same deletion observed during the pandemic. Also, note that SRR13441708 has an additional 3bp deletion in this region.
12/
Here's what it looks like across the SARSr-CoVs - note how both BANAL-52 and RaTG13 don't have this deletion.
13/
❗️Because of this deletion (and to an extent H655Y), we *know* that what we're looking at here is *not* an ancestral virus.
We're likely just looking at clinical samples - or even cultured samples - from early in the pandemic that were sequenced.
14/
Sequenced in Shenzhen btw - probably not where the WIV gets their sequencing done... 🧪
15/
Could the contaminating virus be one that was cultured (as the preprint suggests)? Sure - they had just discovered a novel coronavirus...
Alternatively, Vero/CHO contaminants may not actually be Vero/CHO. Or, other samples on the same run came from Vero/CHO. 🤷
18/
So why did the original authors delete some of the raw files? Well, probably because they realized there was contamination in their dataset (from the sequencing company) and wanted to sort that out before people might get crazy ideas about Antarctic lab leaks. 🤷
19/
Absolutely nothing unusual about any of this - we observe contaminants all the time and it's up to people analyzing data to realize that and have the expertise to not jump to conclusions based on messy data.
It happens all the time. I wish it didn't, but it does.
20/
And to those of you wondering - no, I did not reject this manuscript on the bioRxiv.
That would just fit the conspiracy theory a little too well, wouldn't it?
Time for rational thinking. 🕵️
21/