Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen.

MacLean OA; Lytras S; Weaver S; Singer JB; Boni MF; Lemey P; Kosakovsky Pond SL; Robertson DL

doi:10.1371/journal.pbio.3001115

Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen.

Kosakovsky Pond SL ²,

Robertson DL ¹

Affiliations

1. MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom.
Authors
MacLean OA¹
Lytras S¹
Singer JB¹
Robertson DL¹
(4 authors)
2. Temple University, Institute for Genomics and Evolutionary Medicine, Philadelphia, Pennsylvania, United States of America.
Authors
Weaver S²
Kosakovsky Pond SL²
(2 authors)
3. Center for Infectious Disease Dynamics, Department of Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America.
Authors
Boni MF³
(1 author)
4. Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Authors
Lemey P⁴
(1 author)

ORCIDs linked to this article

Show all (6)

Plos Biology, 12 Mar 2021, 19(3):e3001115
https://doi.org/10.1371/journal.pbio.3001115 PMID: 33711012 PMCID: PMC7990310

This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.

Free full text in Europe PMC

This is an update of "" bioRxiv. 2020 Jul 30:2020.05.28.122366. doi: 10.1101/2020.05.28.122366. A comment on this article appears in "SARS-CoV-2 roots." Nat Ecol Evol. 2022 Jan;6(1):10. doi: 10.1038/s41559-021-01612-y.

Abstract

Virus host shifts are generally associated with novel adaptations to exploit the cells of the new host species optimally. Surprisingly, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has apparently required little to no significant adaptation to humans since the start of the Coronavirus Disease 2019 (COVID-19) pandemic and to October 2020. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus the early SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects evidence for significant positive episodic diversifying selection acting at the base of the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in these ancestral bat hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor about 1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. While an undiscovered "facilitating" intermediate species cannot be discounted, collectively, our results support the progenitor of SARS-CoV-2 being capable of efficient human-human transmission as a consequence of its adaptive evolutionary history in bats, not humans, which created a relatively generalist virus.

Free full text

PLoS Biol. 2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. https://doi.org/10.1371/journal.pbio.3001115

PMCID: PMC7990310

PMID: 33711012

Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen

Oscar A. MacLean, Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing,^#¹ Spyros Lytras, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing,^#¹ Steven Weaver, Formal analysis, Investigation, Resources, Software,² Joshua B. Singer, Data curation, Resources, Software,¹ Maciej F. Boni, Data curation, Funding acquisition, Methodology, Writing – review & editing,³ Philippe Lemey, Formal analysis, Funding acquisition, Investigation, Visualization, Writing – original draft, Writing – review & editing,⁴ Sergei L. Kosakovsky Pond, Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing,^2,^* and David L. Robertson, Conceptualization, Formal analysis, Funding acquisition, Investigation, Supervision, Writing – original draft, Writing – review & editing^1,^*

Damien Tully, Academic Editor

Author information Article notes Copyright and License information Disclaimer

This article has been cited by other articles in PMC.

Go to:

Associated Data

Supplementary Materials: S1 Text: Inferring population demography from distribution of allele frequencies. (PDF)
pbio.3001115.s001.pdf (111K)
S2 Text: Signals of positive selection in SARS-CoV-2 and frequency-based analysis of SARS-CoV-2 polymorphisms. (PDF)
pbio.3001115.s002.pdf (562K)
S3 Text: Model selection when testing for a nCoV-specific rate change. (PDF)
pbio.3001115.s003.pdf (496K)
S1 Fig: Phylogenetic tree (generated with RAxML and visualized using FigTree), showing the relationship of SARS-CoV-1 and SARS-CoV-2 (orange) to related bat and pangolin Sarbecoviruses from the whole genome alignment. Grey, red, and blue variants are coloured according to Letko and colleagues [12] who showed experimentally some viruses are able to use human ACE2 (red), while some require exogenous protease treatment in vitro (grey); the red outlier is a known recombinant. Black indicates not tested by Letko and colleagues, while the virus in blue (sampled in Bulgaria) could not be induced to infect human cells. The scale bar corresponds to nucleotide substitutions per site.
(PNG)
pbio.3001115.s004.png (80K)
S2 Fig: Schematic representation of selection methods implemented on each data set of the analysis: (i) the 133,741 SARS-CoV-2 genomes and (ii) the 69 Sarbecovirus genomes.
(PNG)
pbio.3001115.s005.png (296K)
S3 Fig: Synonymous rate of each of the 3 synonymous rate variation classes estimated by BUSTED[S] + HMM and the proportion attributed to each class for all nonrecombinant open reading frames. (PNG)
pbio.3001115.s006.png (228K)
S4 Fig: Corrected synonymous dinucleotide usage (SDUc) values for the Orf1ab of each Sarbecovirus for all dinucleotide frame positions: (A) pos1: first and second codon positions, (B) pos2: second and third codon positions, and (C) bridge: third codon position and first position of the next codon, plotted against patristic distance from SARS-CoV-2 (reference genome Wuhan-Hu-1). (D) The tip colours of the phylogeny correspond to the SDUc data points. SARS-CoV-1 is labelled in orange in panels A, B, and C for comparison. The non-nCoV part of the phylogeny has been collapsed for clarity.
(PNG)
pbio.3001115.s007.png (320K)
S5 Fig: Maximum likelihood trees for NRR1 (left) and NRR2 (right). The trees were inferred using IQTREE using a GTR substitution model with gamma-distributed rate variation among sites. The nCoV lineage is indicated in red. SARS and SARS-CoV-2 are shaded in blue and red, respectively.
(PNG)
pbio.3001115.s008.png (847K)
S6 Fig: Estimates of shifts in CpG in the Sarbecovirus phylogeny based on NRR1 (left) and NRR2 (right). The log-transformed CpG content values are shown at the tips of the trees. The identified shifts are indicated with black circles on their respective branches and with different colours for the lineages and CpG measures involved. The inset shows the results for penalized least-squares model selection criterion (BGHml).
(PNG)
pbio.3001115.s009.png (51K)
S1 Table: List of GISAID accessions used for FUBAR analysis (see S2 Text). (CSV)
pbio.3001115.s010.csv (155K)
S2 Table: List of sites found to be under selection by MEME in the nCoV clade. (XLSX)
pbio.3001115.s011.xlsx (24K)
S3 Table: BUSTED[S] fits for nonrecombinant segments. For each segment, 2 rows of numbers are presented: top—for the BUSTED[S]-HMM model and bottom—for the BUSTED[S] model, allowing for multinucleotide substitutions (see Lucaci and colleagues [45] for details). The model with the better AIC-c and significant test results (p-value) are shown in bold. CV synonymous rate variation (SRV): coefficient of variation (CV) of the site-to-site synonymous rate distribution. : the rate of double-nucleotide substitutions (relative to the rate of single-nucleotide synonymous substitutions); ẟ: the relative rate of triple nucleotide substitutions; δ_islands: the relative rate of triple nucleotide substitutions that involve synonymous codons, e.g., for Serine.
(XLSX)
pbio.3001115.s012.xlsx (14K)
S4 Table: List of Sarbecoviruses used in the selection analysis and their GISAID accessions. (CSV)
pbio.3001115.s013.csv (4.2K)
S5 Table: GISAID accessions used in the SARS-CoV-2 selection analysis. (CSV)
pbio.3001115.s014.csv (2.7M)
S6 Table: GISAID sequences acknowledgement table. (PDF)
pbio.3001115.s015.pdf (4.0M)
S1 Data: XML files with BEAST model parameters (sequences are excluded due to GISAID data sharing restrictions). (GZ)
pbio.3001115.s016.tar.gz (37K)
Attachment: Submitted filename: MacLean-Lytras-etal-reviewers-response-24thDec.pdf
pbio.3001115.s017.pdf (83K)
Attachment: Submitted filename: MacLean-Lytras-etal-reviewer-response-24thDec.docx
pbio.3001115.s018.docx (28K)

Data Availability Statement: All relevant data are within the paper, its Supporting Information files or associated online resources.

Go to:

Abstract

Virus host shifts are generally associated with novel adaptations to exploit the cells of the new host species optimally. Surprisingly, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has apparently required little to no significant adaptation to humans since the start of the Coronavirus Disease 2019 (COVID-19) pandemic and to October 2020. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus the early SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects evidence for significant positive episodic diversifying selection acting at the base of the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in these ancestral bat hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor about 1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. While an undiscovered “facilitating” intermediate species cannot be discounted, collectively, our results support the progenitor of SARS-CoV-2 being capable of efficient human–human transmission as a consequence of its adaptive evolutionary history in bats, not humans, which created a relatively generalist virus.

Go to:

Introduction

In December 2019, an outbreak of pneumonia cases in the city of Wuhan, China, was linked to a novel coronavirus. Evolutionary analysis identified this new virus to humans as a severe acute respiratory syndrome-related virus [1], in the Sarbecovirus subgenus of the Betacoronavirus genus, sister lineage to the original SARS virus; subsequently named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) to reflect this relationship to SARS-CoV [2]. This novel lineage represents the seventh known human-infecting member of the Coronaviridae. The initial outbreak of human cases of the virus was connected to the Huanan Seafood Wholesale Market in Wuhan [3], and while related viruses have been found in horseshoe bats [4] and pangolins [5], their divergence represents decades of evolution [6] leaving the direct origin of the pandemic unknown. In addition to elucidating the transmission route from animals to humans, key questions for assessing future risk of emergence are: (i) what is the extent of evolution, if any, required for a bat virus to transmit to humans, and (ii) what subsequent evolution will occur once the virus is established within the human population?

The first SARS virus outbreak in 2002/2003, causing approximately 8,000 infections, and its reemergence in late 2003, causing 4 infections, were linked to Himalayan palm civets and raccoon dogs in marketplaces in Guangdong Province [7,8]. Later, it became clear that while these animals may have been conduits for spillover to humans, they were not true viral reservoirs [9]. Extensive surveillance work subsequently identified related viruses circulating in horseshoe bats in China, some of which can replicate in human cells [10,11]. The bat viruses most closely related to SARS-CoV (hereafter referred to as SARS-CoV-1 for clarity), can bind to human angiotensin-converting enzyme (hACE2, the receptor SARS-CoV-2 also uses for cell entry), while the addition of a host protease is required to cleave the Spike protein before it can bind hACE2 for the more divergent bat viruses tested (S1 Fig) [12]. Two of the key changes which appear to have been required to generate this highly capable pathogen, the specific receptor binding domain sequence and the inserted furin cleavage site, can all be traced to bat coronaviruses [6,13–15]. Collectively, these results demonstrate that, unlike most other RNA viruses which acquire adaptations after switching to a new host species [16,17] for efficient replication and spreading as successfully as exhibited by SARS-CoV-2, the Sarbecoviruses—which already transmit frequently among bat species [18]—can exploit the generalist properties of their ACE2 binding ability, facilitating successful infection of non-bat species, including humans. A main difference between SARS-CoV-1 and SARS-CoV-2 is the increased binding affinity for hACE2 in the latter [19] permitting more efficient use of human cells and the upper respiratory tract, and on average lower severity but, paradoxically—due to the higher number of infections—higher disease burden.

There is intense interest in the mutations emerging in the SARS-CoV-2 pandemic [20–22]. Although the vast majority of observed genomic change is expected to be “neutral” [23,24], mutations with functional significance to the virus will likely arise, as they have in many other viral epidemics and pandemics [25]. In SARS-CoV-2, amino acid replacements in the Spike protein may reduce the efficacy of vaccines, replacements in proteases and polymerases will result in acquired drug resistance, and other mutations could change the biology of the virus, e.g., enhancing its transmissibility as demonstrated for the replacement D614G [26], contributing to adaption to humans, a new host species.

A main way to begin to understand the functional impact of mutations is to characterise the selective regime they are under. Mutations which are under positive selection are of particular interest as they are more likely to reflect a functional change. However, identifying mutations under positive selection from frequency data alone can be misleading, as allele frequencies in viral pandemics are significantly driven by biased sampling, founder effects, and superspreading events [22]. Exponentially growing populations can increase in average fitness [27]; however, they are also expected to exhibit elevated genetic drift, with deleterious mutations “surfing” expansion waves [28].

Here, we investigate the evolutionary history of bat Sarbecoviruses that shaped the emergence and rapid spread of SARS-CoV-2 by contrasting results from comprehensive searches for signatures of positive selection (a measure of molecular adaptation) in the virus circulating in humans since the Coronavirus Disease 2019 (COVID-19) outbreak began, to signatures of historic selection acting on related bat viruses. We use an array of selection detection methods (summarised in S2 Fig) on a set of (i) 133,741 SARS-CoV-2 genomes sampled from December 2019 to October 2020 and on (ii) 69 Sarbecovirus genomes (including a representative of SARS-CoV-1 and SARS-CoV-2 sequences) separated into phylogenetically congruent regions based on detected recombination patterns [6]. Finally, we detect a shift in CpG representation on the Sarbecovirus tree, associated with the phylogenetic clade SARS-CoV-2 is found in.

Go to:

Results

Evidence of nearly neutral evolution and relatively weak purifying selection in the first 11 months of the SARS-CoV-2 pandemic

We first analyse selection acting on the encoded amino acids of SARS-CoV-2 using 133,741 quality control (QC)-filtered genome sequences from the GISAID database as of 12 October 2020, representing a sample of the variants circulating in humans during the first 11 months of the pandemic (see Methods, S5 Table). Purifying selection acts more strongly on nonsynonymous sites, supported by the estimate that the nonsynonymous substitution rate (dN) was only 4% the speed of the synonymous substitution rate (dS) in the divergence between SARS-CoV-2 and the bat Sarbecovirus RaTG13 [29]. We used the phylogenetically corrected single-likelihood ancestor counting (SLAC) method [30] to calculate “frequency-stratified” dN/dS estimates. This qualitative analysis revealed that, on average, the evolutionary process is indicative of weakly purifying selection, with point estimates of dN/dS ranging from 0.4 to 0.9, and dN/dS generally declining with increasing frequency (Fig 1A). Most of the bootstrap-based 95% confidence intervals exclude a dN/dS = 1, although for ranges with a small number of variants (e.g., ≥5,000 sequences), there exists a high degree of uncertainty in these estimates. The vast majority of 20,687 observed mutations occur at very low frequency, with 79% of mutations observed in 10 or fewer of the 133,741 SARS-CoV-2 genome sequences analysed (Fig 1A), consistent with a model of exponential population growth of virus spread (S1 Text).

An external file that holds a picture, illustration, etc.
Object name is pbio.3001115.g001.jpg

Fig 1

(A) Estimates of molecular adaptation (dN/dS) for 133,741 SARS-CoV-2 genome sequences based on the counting SLAC method [30] (black circles—point estimates, error-bars: 95% interpercentile range based on 500 bootstrap replicates) and the number of variants as a function of their frequency (blue bars). (B) Cumulative number of Spike and RdRp gene sequences in GISAID passing QC filters from 31 March to 12 October 2020, and the number unique haplotypes among them. (C) Cumulative fractions of codon sites in Spike and RdRp, which harbour different types of sequence variants, are positively selected (MEME [36] p ≤ 0.05, internal branches only), or are negatively selected (FEL [30] p ≤ 0.05, internal branches only). (D) Estimates of gene-wide dN/dS in Spike and RdRp on internal branches and terminal branches (MG94xREV model), and the total length of internal tree branches, which serves as a good proxy of statistical power to detect selection. And (E) statistical evidence for episodic positive selection at 52 codons in SARS-CoV-2 Spike (S) and RdRp that reached significance (p ≤ 0.05) at least once during the analysis period. The list of accessions for the SARS-CoV-2 sequences downloaded from GISAID on 12 October 2020 are provided in S5 Table. dN, nonsynonymous substitution rate; dS, synonymous substitution rate; FEL, fixed effects likelihood; MEME, mixed effects model of evolution; QC, quality control; RdRp, RNA-dependent RNA polymerase; S, Spike; SARS-CoV-2, Severe Acute Respiratory Syndrome Coronavirus 2; SLAC, single-likelihood ancestor counting.

Our initial searches for signatures of departures from neutral evolution in the SARS-CoV-2 sequence data from late March produced numerous false positive signals, due to artefactual lab recombination and other sequencing errors found on terminal branches (see S2 Text and S1 Table). Subsequently, we have been performing regular analyses of the data excluding these terminal branches (available online at http://covid19.datamonkey.org). As more sequences were deposited in public databases (Fig 1B), we expected to see increasingly more variation, with >95% of codons in Spike (S) and RNA-dependent RNA polymerase (RdRp) containing some variation in most recent analyses (Fig 1C). Many of those variants may be due to sequencing errors, but the rapid increase of variants is clear in the sequence data sampled from the circulating viruses. Similar increasing trends are seen for sites with amino-acid variants that are present in at least 2 sequences (47.1% in S and 36.5% in RdRp). Interestingly, the number of unique sequences has increased more slowly than the number of sequenced genomes (Fig 1B), and the fraction of sites inferred to be under positive selection remains approximately 2% in both genes, while the fraction of sites detected to be under purifying selection is up to 11.4% in S and 7.2% in RdRp. This observation shows SARS-CoV-2 is evolving relatively slowly with no dramatic increases in selective pressures occurring over the sampling period December 2019 to October 2020. This pattern is further confirmed by the relatively stable behaviour of gene-wide dN/dS estimates (Fig 1D), both on internal branches, which include successful transmissions, and terminal branches, which sometimes include intrahost “dead-ends” or deleterious variation, and thus have higher estimates [31].

Even in Spike, which is being assiduously scrutinized for selection due to its immunogenic and phenotypic importance, overall selective pressure is stable over time and consistent with weak purifying selection. The genetic homogeneity of SARS-CoV-2 results in very shallow phylogenetic trees, despite >130,000 collected sequences, with cumulative branch lengths only about 0.6 (S) and 0.4 (RdRp) substitutions/site (Fig 1D), somewhat limiting the statistical power of comparative methods [32]. There are individual sites in the SARS-CoV-2 genomes which possess statistically detectable signatures of episodic diversifying selection along internal tree branches. In selection scans of Spike and RdRp, for example, there are 4 sites that appear to be selected in data prior to April 15 (early pandemic, Fig 1E). These include Spike 614 and RdRp 323 which are in nearly perfect linkage disequilibrium (R² > 0.99), with the former extensively discussed by others as having some functional significance [33,34], while the latter has received much less attention. The majority of these flagged sites, however, are not detected until later in the pandemic, suggesting that they were not critical for the exponential spread of the virus. Most observed mutations are only transient in the viral population. This is because they are mildly deleterious and will fail to persist over longer time periods of the viral outbreak, being gradually purged by purifying selection [31,35]. Exponential growth, consistent with the SARS-CoV-2 trajectory (S1 Text), is known to reduce the efficacy of purifying selection [28], with deleterious mutations able to “surf” expansion waves. It is likely that many of these putatively deleterious segregating mutations will ultimately be lost, reducing the long-term dN/dS, though their persistence will be influenced by future demographic patterns [28], and the effect of lockdowns on the circulating variants. With increasing levels of host immunity helped by the deployment vaccines and ongoing widespread SARS-CoV-2 circulation, we fully expect to see increased evidence for adaptive evolution in Spike and other genes, but this evolution is not relevant to initial adaptation for efficient spread among human hosts. This can already be seen in the data, with 15/52 sites appearing on the list of positively selected sites only in the 12 October 2020 (most recent) analysis (Fig 1E).

What about the bats? Evidence for positive selection in Sarbecoviruses

Coronaviruses frequently recombine in their bat hosts, with the Spike open reading frame (ORF) being an apparent hotspot for recombination events, with potentially adaptive implications for the viruses, e.g., antigenic shift in the context of immune evasion [9,37–40]. To avoid the confounding effects of recombination on the inference of selection patterns, we separately analysed maximally nonrecombinant regions, as derived in Boni and colleagues [6] (see Methods). For each of the 19 nonrecombinant ORF regions identified, we define as a working nomenclature the new coronavirus “nCoV” clade, the set of viruses closest to SARS-CoV-2 lineage in the Sarbecovirus phylogeny (see Methods). We find that genomic sites are generally subject to stabilising selection in this nCoV clade, with 8,184/9,744 (84%) of codon sites conserved at the amino-acid level, and 4,274 (43.7%) sites, of which 3,388 were variable at the nucleotide level, showing evidence of purifying selection in this lineage (using the fixed effects likelihood (FEL) method [30], Fig 2).

An external file that holds a picture, illustration, etc.
Object name is pbio.3001115.g002.jpg

Fig 2

Schematic of the nonrecombinant ORF regions used for the nCoV clade selection analyses.

(A) Displays (top to bottom): Individual codon sites (N = 85), mapped to SARS-CoV-2 genomic coordinates, found to be under episodic positive selection in the nCoV clade (MEME p ≤ 0.05); brighter colours indicate that a larger fraction of lineages was subject to selection; sites (N = 3,388) subject to negative selection in the nCoV clade (FEL p ≤ 0.05); genome structure of SARS-CoV-2; ticks inside the genome structure indicate sites that are conserved in the nCoV clades (N = 8,184); nonrecombinant fragments (N = 20) found in the Boni and colleagues [6] analysis; colours show the coefficient of variation for the distribution of site-level synonymous rates. (B) The nCoV clade for 9/20 nonrecombinant segments that exhibit any evidence of branch-level selection according to the aBSREL method. Branches with significant tests (p ≤ 0.05) are shown in the orange-red colours; the colour is based on the average dN/dS estimate for these branches, thickness is proportional to the number of individual sites (genome-wide) that have evidence for positive selection along that branch. The list of GenBank and GISAID accessions for the Sarbecovirus sequences used are provided in S4 Table. aBSREL, adaptive branch-site random effects likelihood; dN, nonsynonymous substitution rate; dS, synonymous substitution rate; FEL, fixed effects likelihood; MEME, mixed effects model of evolution; nCoV, new coronavirus; ORF, open reading frame; SARS-CoV-2, Severe Acute Respiratory Syndrome Coronavirus 2.

To search for evidence of positive selection on specific branches of the phylogeny in the nonrecombinant regions, we first used the adaptive branch-site random effects likelihood (aBSREL) method [41]. Our analysis finds that diversifying selection left its imprints primarily in the deepest branches of the nCoV clade or lineage leading to it, with no evidence of selection in the terminal branch leading to SARS-CoV-2 (Fig 2). This is consistent with the nonhuman progenitor of SARS-CoV-2 requiring little or no novel adaptation to successfully infect humans. Still, no model can detect all signatures of historic genomic adaptation, and mutations which may enable SARS-CoV-2 to infect humans could have arisen by genetic drift in the reservoir host before human exposure.

We next sought evidence of ORF-specific positive selection across the full phylogeny of the nCoV clade (episodic diversifying selection) using BUSTED[S] [42], coupled with a hidden Markov model (HMM) with 3 rate categories to describe site-specific synonymous rate variation and allow autocorrelation across neighbouring sites in these rates [43]. Six nonrecombinant regions of Orf1ab, Spike, and ORF N show evidence of episodic diversifying positive selection in the nCoV clade (Fig 2A). This finding is consistent with evidence of positive selection operating on Orf1ab in Middle East Respiratory Syndrome (MERS) [44], and Spike and N proteins being essential for immune recognition. All segments show a nontrivial signal of synonymous rate site-to-site variation, highlighting the importance of accounting for synonymous rate variation in this study, as recent literature [45] indicates that when the coefficient of variation (CV) for the distribution of site-to-site synonymous rates exceeds 0.5. Importantly, not accounting for synonymous rate variation risks mistaking it for evidence of selection.

A subsequent scan for the specific codon sites showing evidence of nCoV-specific selection using the mixed effects model of evolution (MEME) method [36] revealed 85 individual sites inferred to evolve subject to episodic diversifying selection in the nCoV clade or lineage leading to it (Fig 2B and S2 Table). Most of these sites (55/85) are found on Orf1ab, which is also substantially longer than the other regions. Consistent with BUSTED[S] results, Spike and ORF N have the next greatest number of selected sites (7 and 11, respectively; S2 Table). Interestingly, despite its short length, ORF3a has 6 sites with evidence of positive selection. Little is known about the function of this accessory gene; however, it could be related to immune evasion, like that of other short accessory genes (e.g., Orf3b) [46]. Still, this analysis does not allow attributing specific functional relevance to the individual sites.

The BUSTED[S] method also partitioned synonymous rate variation into 3 rate classes across the sites. The majority of regions showed large, in some cases more than 20-fold, differences between rate classes, with all 3 classes representing a substantial proportion of sites for most regions (S3 Fig), with varying degrees of autocorrelation. This suggests that strong purifying selection is acting on some synonymous sites (e.g., due to conserved RNA motifs, RNA structures, or overlapping ORFs), and some synonymous mutations in the SARS-CoV-2 genome may not be selectively neutral or occur at sites that are hypervariable. Some synonymous rate variation may also be attributed to the 5′ and 3′ context-specific mutation rate variation observed in SARS-CoV-2 [47]. Because multinucleotide mutations could be taking place in SARS-CoV-2 genomes, e.g., trinucleotide substitutions in N at positions 28881–28883 [48]), and not considering them in models can lead to false positives in certain cases [49], we modified BUSTED[S] to include them and to estimate the extent of multinucleotide mutations in the SARS-CoV-2 data [45]. For 11 out of 20 segments (S3 Table), this model provides a better information theoretical fit to the data, including 2 of 5 segments where BUSTED[S] found evidence of positive selection (N and ORF1ab/regE). For N, both models agree that there is evidence of positive selection, while for ORF1ab/regE, the inclusion of multiple nucleotide changes removes the signal of selection due to their effect (S3 Table). Overall, the addition of multinucleotide mutations does not dramatically alter the overall selective picture.

Patterns of CpG depletion and recombination in the nCoV clade

Genome composition measures, such as dinucleotide representation and codon usage, can also be an informative tool for characterising the host evolutionary history of a virus [50]. Two major host antiviral mechanisms are currently thought to drive the depletion of CpG dinucleotides (a cytosine followed by a guanine in the 5′ to 3′ direction) in virus genomes. This depletion is apparently primarily mediated either through selective pressures by a CpG-targeting mechanism involving the Zinc finger Antiviral Protein (ZAP) [51] or C to U hypermutation by APOBEC3 cytidine deaminases [52], and it has been demonstrated that the CpG-binding ZAP protein inhibits SARS-CoV-2 replication in human lung cells [53]. These evolutionary forces are likely to vary across tissues and between hosts [50]. Thus, a smaller or greater level of CpG depletion in specific viral lineages may be indicative of a switch in the evolutionary environment of that lineage or its ancestors, although care must be taken to not overinterpret such results [54].

We examined the CpG representation using the corrected Synonymous Dinucleotide Usage (SDUc) framework, controlling for amino acid abundance and single nucleotide composition bias in the sequences [55]. In Orf1ab, the longest ORF in the genome, all viruses show CpG underrepresentation (SDUc < 1), while viruses in the nCoV clade have even lower CpG levels than the other Sarbecoviruses (S4 Fig). To further examine this trend, we fitted CpG representation as traits in a phylogenetic comparative approach [56], using the 2 longest putatively nonrecombinant regions in the genome alignment: NRR1, comprising of subregions of Orf1ab regC, regD, and regE, and NRR2, identical to Orf1ab regE apart from a small region at one end being removed based on RmYN02’s recombination pattern (see Methods). This method formally tests for adaptive shifts of quantitative traits across a phylogeny and estimates where these events have taken place on the tree. In this way, we identified an adaptive shift favouring CpG suppression in the lineage leading to the nCoV clade (Fig 3A and 3B). This CpG shift may indicate a change of evolutionary environment, e.g., host or tissue preference, since CpG depletion following host switches has been observed in other human infecting RNA viruses, such as Influenza B [50].

An external file that holds a picture, illustration, etc.
Object name is pbio.3001115.g003.jpg

Fig 3

(A) Bayesian phylogeny of a modified NRR2 region using different local clocks for the nCoV clade (red branches) and the rest of the phylogeny. All internal nodes’ posterior values are above 0.98. Viruses infecting bats are indicated by black circles, pangolins white circles, and SC1 and SC2 labelled in blue and red, respectively, at the tips of the tree. The inset summarises the substitution rate estimates on a natural log scale for the two-parameter local clock model with colours corresponding to the branches in the tree. The estimated date for the shared common ancestor of the nCoV clade (1467) and the RmYN02/SARS-CoV-2 divergence (1976) are shown with confidence intervals. (B) CpG relative representation for all dinucleotide frame positions (pos1: first and second codon positions; pos2: second and third codon positions; bridge: third codon position and first position of the next codon) is presented as SDUc values. (C) Schematic of our proposed evolutionary history of the nCoV clade and putative events leading to the emergence of SARS-CoV-2. The list of GenBank and GISAID accessions for the Sarbecovirus sequences used are provided in S4 Table. nCoV, new coronavirus; SARS-CoV-2, Severe Acute Respiratory Syndrome Coronavirus 2; SC1, SARS-CoV-1; SC2, SARS-CoV-2; SDUc, corrected synonymous dinucleotide usage.

We investigated if this adaptive shift in CpG content at the base of the nCoV clade was associated with an elevation in substitution rate, which would be expected if dinucleotide mutations were being driven to fixation by positive selection. We tested a model which allowed for a rate change on the internal branch connecting the nCoV clade to the ancestral node and found significant evidence of an elevated substitution rate, consistent with an adaptive shift along this branch (Fig 3A). We could not distinguish between a transient acceleration on the internal branch or an nCoV clade-wide substitution rate acceleration (S3 Text). However, we propose that the elevated rate specific to the nCoV clade ancestral lineage may explain some of the variation between different Coronaviridae substitution rates, previously attributed to a time-dependent evolutionary rate phenomenon [6]. Future estimates of split timings should incorporate this substitution rate variation, as they can have a significant impact on estimates (S3 Text).

Zhou and colleagues [57] report a novel bat-infecting Sarbecovirus sample, RmYN02, which has the most similar sequence to SARS-CoV-2 of known Sarbecoviruses for most of its genome, closer than RaTG13. The part of the RmYN02 Spike ORF that is recombinant falls outside the nCoV clade of the Sarbecovirus phylogeny. Molecular dating using BEAST [58] on a region chosen to be a nonrecombinant part of the genome (NRR2) [6] (Fig 3A, see Methods)—accounting for the aforementioned nCoV branch local clock model—indicates that RmYN02 shared an ancestor with SARS-CoV-2 about 1976 (consistent with a previous estimate [29]), and strengthening the evidence for a direct bat to human SARS-CoV-2 emergence. The RmYN02 genome sequence offers an opportunity to test if a recombination event can take place between the 2 distinct CpG content Sarbecovirus lineages. A sliding window of CpG relative dinucleotide abundance (RDA) [59] shows that CpG levels of SARS-CoV-2 and RmYN02 only differ at the recombinant region (Fig 4A), further demonstrated by the SDUc values of the nCoV and non-nCoV parts of RmYN02 Spike (Fig 4B). The finding that RmYN02 is a recombinant between the high and low CpG lineages means that viruses from both lineages can coinfect the same bat species. Βoth the shift in CpG pressures and much of the inference of positive selection at the protein level (Fig 2), coupled with the consistency of the CpG selection pressure throughout the lineage, should reflect a single expansion in evolutionary environment in the lineage leading to the nCoV clade. That viruses from this lineage have not subsequently specialised on novel distinct non-bat hosts is also supported by the absence of related Sarbecoviruses in the Sunda/Malayan pangolins prior to their trafficking into China [60]. Together, this suggests the shift at the base of the lineage does not represent specialisation on a host other than bats, but instead an evolved preference towards different host tissues or perhaps circulation in a population of bat species possessing distinct antiviral immunity.

An external file that holds a picture, illustration, etc.
Object name is pbio.3001115.g004.jpg

Fig 4

(A) 3kb sliding window plot of RDA across the whole-genome alignment of Wuhan-Hu-1 (turquoise) and RmYN02 (magenta). Shaded regions depict the Spike ORF region in the alignment. The dashed line indicates the inferred RmYN02 Spike recombination breakpoint, splitting the shaded region into non-nCoV (yellow) and nCoV (grey). (B) SDUc values calculated for each frame position of the 2 RmYN02 Spike nonrecombinant regions and the corresponding Wuhan-Hu-1 regions. The absolute differences between SDUc values of SARS-CoV-2 and RmYN02 for each frame position are significantly greater in the non-nCoV than in the nCoV region (t_2.07 = 3.03, p = 0.0450; unpaired one-tailed t test with unequal variance). The list of GenBank and GISAID accessions for the Sarbecovirus sequences used are provided in S4 Table. nCoV, new coronavirus; ORF, open reading frame; RDA, relative dinucleotide abundance; SARS-CoV-2, Severe Acute Respiratory Syndrome Coronavirus 2; SDUc, corrected synonymous dinucleotide Usage.

Go to:

Discussion

The data presented here highlight that the Sarbecovirus clade SARS-CoV-2 emerged from shows evidence of positive selection on its deeper branches, coupled with an adaptive shift in CpG composition in this lineage. Our selection analysis on SARS-CoV-2 genomes across the timeline of the pandemic (Fig 1) indicates a lack of diversifying selection taking place during the virus’s circulation in humans from December 2019 to October 2020. We do find evidence of some ongoing positive selection, presumably associated with minor optimisations to the human population (Fig 1E). However, selection is acting on a small number of sites, and there is a lack of selective sweeps since sequencing began. These results suggest that the majority of adaptive changes which generated SARS-CoV-2 took place prior to its emergence in the human population. There are indications of increased selective pressure in some recent lineages sampled in late 2020, associated with faster spread and higher than usual number of nonsynonymous substitutions, for example, the England and South Africa associated lineages B.1.1.7 [61] and B.1.351 [62], respectively. These appear to be associated with SARS-CoV-2 evolution in the context of host immunity due to previous exposure and/or chronic infections of probably immunocompromised individuals (discussed in [63]), and not the slow rate of evolution associated with acute SARS-CoV-2 infections and transmission that has predominated in the pandemic to October 2020.

That SARS-CoV-2 can readily transmit to other animals (pangolins, mink, cats, and others) is strongly indicative this generalist property evolved in the bat reservoir species and not as a consequence of adaptation to human–human transmission. Nonetheless, the amount of time between the first spillover of the progenitor SARS-CoV-2 in humans and sequencing the first variants remains unknown. This raises the concern that important changes might have taken place in that “unsampled” period that cannot be picked up by our SARS-CoV-2 genomic analysis. Despite the inability to directly address these concerns without earlier SARS-CoV-2 sequences or broader sampling of the virus’s close relatives, such adaptive changes should theoretically be detected in our Sarbecovirus analysis. That we fail to find evidence of diversifying selection on the terminal branch leading up to the emergence of SARS-CoV-2 in humans (Fig 2B), indicates that the adaptations that created a generalist capable of efficient replication in humans and other mammals, probably did not occur in the unsampled SARS-CoV-2 lineage.

Recombination between a proximal ancestor of RmYN02 and a non-nCoV Sarbecovirus further demonstrates the viruses in this subgenus cocirculate in the same bat reservoirs. The apparent “success” of these bat viruses to transmit to multiple other mammals and spread with few to no significant genomic changes further supports the hypothesis that the SARS-CoV-2 progenitor is from a viral lineage with a relatively generalist nature (Fig 3C). The level of specialisation of a virus is likely to be driven by many factors, such as mutation rates and available opportunities for transmission, with generalism imposing constraints on change [64]. The recent finding of convergent mutations in mink-circulating SARS-CoV-2 variants, which appear to readily transmit back to humans, demonstrates the potential for low-level specialisation on new hosts following exposure of a generalist virus to a new host species [65]. In fact, our analysis detects some evidence of diversifying selection specific to the terminal branches of the pangolin infecting nCoV viruses (Fig 2B), suggestive of a narrative similar to that in minks. Contrastingly, that is not the case with the terminal branch leading to SARS-CoV-2. Genomes from additional closely related Sarbecovirsues would offer additional power to our analyses and provide greater insight into any potential evolutionary hurdles these zoonoses face.

Our analysis suggests that the early ancestors of the nCoV clade—existing hundreds of years ago (Fig 3A and S3 Text)—developed this generalist phenotype through an expansion in their evolutionary environment (host switch, or tissue tropism) within bat species, allowing for efficient spillover events. The shift in CpG suppression at the base of the nCoV clade, which is maintained in all sampled members, could be indicative of an immune evasion adaptation—partly responsible for the proposed generalist properties of the viruses—by evading known CpG-targeting mammalian immune mechanisms such as ZAP [53]. Notably, SARS-CoV-1, the first known Sarbecovirus that emerged in humans in recent years is not a member of the nCoV clade, demonstrating that the nCoV-specific evolutionary changes are not the sole route to a human SARS-like pathogen.

An overarching point stemming from these observations is the lack of sampling and knowledge of the diversity in this viral subgenus. In particular, the closest known bat viruses to SARS-CoV-2 are relatively divergent in time (Fig 3A), and the apparent generalist nature of these viruses suggests that there are species of wild mammals, yet to be sampled, infected with nCoV-like viruses. Serological studies of communities in China that come into contact with bats indicate that incidental and dead-end spillover of SARS-like viruses into humans do occur [66,67]. Due to the high diversity and generalist nature of these Sarbecoviruses, a future spillover, potentially coupled with a recombination event with SARS-CoV-2, is possible, and such a “SARS-CoV-3” emergence could be sufficiently divergent to evade either natural or vaccine-acquired immunity, as demonstrated for SARS-CoV-1 versus SARS-CoV-2 [68]. We must therefore dramatically ramp up surveillance for Sarbecoviruses at the human–animal interface [15] and monitor carefully for future SARS-CoV emergence in the human population.

Go to:

Methods

SARS-CoV-2 GISAID sequence filtering

After removing any sequence with a nonhuman host (e.g., bat, pangolin), to reduce the impact of sequencing errors on selection analysis, the data from GISAID were filtered by excluding all sequences which meet any of the following criteria: any sequence of length less than 29,000 nucleotides; any sequences with ambiguous nucleotides in excess of 0.5% of the genome; any sequences with greater than 1% divergence from the longest sampled sequence (Wuhan-Hu-1); and any sequence with stop codons. To retain codon alignments and retain correct amino acid sequences for selection analyses, unique (identical sequences are collapsed for alignment to reduce computational expenditure; “N” is treated as matching any resolved character in this process) in-frame nucleotide sequences were translated to amino-acids and aligned with MAFFT [69]. The amino-acid alignment is mapped back to the constituent nucleotide sequences to produce a codon-level alignment. Only unique haplotypes are retained for comparative phylogenetic analyses, since including identical copies is not informative for these types of inference.

SARS-CoV-2 positive selection

We initially utilised methodology which incorporated the full phylogeny and found misleading signatures such as sequencing errors and lab-based recombination on the terminal branches, which confounded the software (S2 Text). These errors are reflected in the elevated dN/dS ratio observed on the terminal branches (Fig 1). We subsequently used the FEL [30] and MEME methods [36] to infer negative and episodic diversifying positive selection, respectively, using an implementation which only considered internal branches to infer positive selection. MEME uses a maximum likelihood methodology and performs a likelihood ratio test for positive selection on each site, comparing modes which allow or disallow positive diversifying selection at a subset of branches (dN/dS > 1). FEL performs a test that assumes uniform selective pressure on all branches (pervasive selection). These selection analyses were performed in the HyPhy software package v.2.5.14.

Sarbecoviruses alignment and recombination

To avoid the confounding effects of recombination, we have analysed each ORF separately and divided the Orf1ab and Spike ORFs into putative nonrecombinant regions, based on the 7 major recombination breakpoints presented in Boni and colleagues [6]. This produces 5 nonrecombinant regions for Orf1ab (regions A to E) and 5 regions for Spike (regions A to D, and the variable loop—region VL). The protein sequences of the nonrecombinant regions SARS-CoV-2, SARS-CoV-1, and 67 closely related viruses with nonhuman hosts (bats and pangolins, S4 Table), identified based on sequence similarity and retrieved from online databases NCBI Genbank and GISAID, were aligned using MAFFT version 7 (L-INS-i) [69]. Subsequent manual corrections were made on the protein alignments and PAL2NAL (http://www.bork.embl.de/pal2nal) was used to convert them to codon alignments. Phylogenies for each codon alignment were inferred using RAxML with a GTR+Γ nucleotide substitution model [70].

Sarbecovirus selection analysis

We used an array of selection detection methods to examine whether the lineage leading to SARS-CoV-2 has experienced episodes of diversifying positive selection. Each nonrecombinant region was examined separately. We separated each region’s phylogeny into an nCoV and non-nCoV/SARS-CoV-1 lineage. The nCoV clade includes SARS-CoV-2 and the viruses that form a monophyly with it, excluding the SARS-CoV-1 containing sister clade. These are the bat-infecting viruses CoVZC45, CoVZXC21, RmYN02, and RaTG13, and the pangolin-infecting viruses Pangolin-CoV and the P2V, P5L, P1E, P5E, P4L cluster (S4 Table). Note, some recombinant regions of CoVZC45, CoVZXC21, and RmYN02 do not belong to the nCoV clade, and these were excluded from any analysis of these regions.

We tested for evidence of episodic diversifying selection on the internal branches of the nCoV clade using BUSTED[S], accounting for synonymous rate variation as described in Wisotsky and colleagues [71]. We developed an extension to BUSTED[S], which included an HMM with 3 rate categories to describe site-specific synonymous rate variation [43]. This HMM allows explicit incorporation of autocorrelation in synonymous rates across codons. Such autocorrelation would be expected if selection or mutation rate variation were spatially localised within ORFs. The rate switching parameter between adjacent codons of the HMM describes the extent of autocorrelation, with values under 1/N (N = number of rate classes) suggestive of autocorrelation. Standard HMM techniques (e.g., the Viterbi path) applied to these models can reveal where the switches between different rate types occur, thereby partitioning the sequence into regions of weaker or stronger constraint on synonymous substitutions.

The aBSREL method [41] was used on all branches of the nCoV clade to determine which specific branches drive the inference of selection. Finally, we examined which specific codon sites are under negative selection on average over the nCoV clade using FEL [30] and under pervasive or episodic diversifying positive selection on the nCoV clade using MEME [36]. P values of ≤0.05 for the likelihood ratio tests, specific to each method, were taken as evidence of statistical significance. Most selection analyses were performed in the HyPhy software package v.2.5.14, with BUSTED adjusted to allow multiple nucleotide substitutions using v2.5.24 [72].

CpG depletion

To quantify over/under representation of CpG dinucleotides in the Sarbecovirus genomes, we developed a modified version of the synonymous dinucleotide usage (SDU) metric [55], which now accounts for biased base composition. The original SDU metric compares the observed proportion of synonymous CpG, o; for each pair of frame positions in a coding sequence, h; to that expected under equal synonymous codon usage, e; for each amino acid (or amino acid pair) that can have CpG containing codons (or codon pairs), i. The SDU metric is the mean of these ratios weighted by the number of informative amino acids (or pairs) in the sequence, n (Eq 1).

To incorporate the biased and variable base composition of SARS-CoV-2 and other Sarbecoviruses [47], here we have estimated expected codon usage based on each virus’ whole-genome nucleotide composition. We term this new metric the corrected synonymous dinucleotide usage (SDUc). We use observed base frequencies from each virus to generate the corrected null expectation of the metric, e′, instead of assuming equal usage (Eq 1). The expected proportion, e′, for every amino acid/amino acid pair was estimated by randomly simulating codons based on the whole-genome single nucleotide proportions of each virus. This e′ was then used for all SDUc calculations of the corresponding virus.

As this metric is susceptible to error when used for short coding sequences, we applied SDUc on the longest ORF (Orf1ab) of all the viruses. To estimate the extent of phylogenetic independence between synonymous sites across SDUc datapoints, we measured the pairwise synonymous divergence (Ks) between viruses. Pairwise Ks values were calculated using the seqinr R package [73], which utilises the codon model of Li (1993) [74], demonstrating the partial but not complete independence within the 2 lineages. The Ks median and maximum is 0.54 and 0.89 within the nCoV clade, and 0.34 and 1.09, respectively, within the non-nCoV/SARS-CoV-1 clade.

S D U c_{C p G, h} = \frac{\sum_{i = 1}^{k} n_{i} \times \frac{o_{i, h}}{{e^{'}}_{i, h}}}{N}

Eq 1

(Eq 1, N = total number of informative amino acids)

Spike recombination analysis

To determine the recombination breakpoint on the Spike ORF of the RmYN02 virus, we used the RDP5 method suite [75], implementing 7 methods: RDP, GENECONV, Chimaera, MaxChi, BootScan, SiScan, and 3seq. We first performed the analysis on the whole-genome alignment of the Sarbecoviruses and then determined the relevant breakpoint within the Spike ORF by rerunning the method on the Spike-only alignment. The accepted breakpoint (position 24058 in the RmYN02 genome) was consistently called by 6 out of the 7 tested methods (RDP, GENECONV, Maxchi, Chimaeara, SiSscan, and 3seq). Similarly, the preceding breakpoint of the non-nCoV region was called at position 21248 of the RmYN02 genome (before the start of the Spike ORF).

Bayesian inference under a local molecular clock

To assess the overall nonrecombinant phylogenetic relation between the viruses, we used the longest nonrecombinant genomic regions described in Boni and colleagues [6], NRR1 and NRR2, and added RmYN02 to the alignments. Based on the recombination breakpoints determined above for RmYN02, NRR2 was adjusted to end at position 21266 of Wuhan-Hu-1, instead of 21753, corresponding to the start of the RmYN02 recombinant region (RmYN02 position 21248). Time-measured evolutionary histories for NRR1 and NRR2 were inferred using a Bayesian approach, implemented through the Markov chain Monte Carlo (MCMC) framework available in BEAST 1.10 [58]. Motivated by the observation of a higher root-to-tip divergence for the nCoV clade (see S3 Text and S5 Fig) and by the CpG trait shift on the branch ancestral to the nCoV clade (see below), we specified a fixed local clock [76] that allows for a different rate on the branch leading to the nCoV clade. In the absence of strong temporal signal, we specified an informative normal prior distribution (with mean = 0.0005 and standard deviation = 0.0002) on the rate on all other branches based on recent estimates under a relaxed molecular clock [77]. To maintain the overall topology under the local clock model, we constrained the Kenyan (KY352407) bat virus as outgroup for the viruses from China in NRR1, and the Kenyan as well as a Bulgarian bat virus (NC_014470) as outgroup for the viruses from China in NRR2. We partitioned the coding regions of NRR1 and NNR2 by codon position and specified an independent general time-reversible (GTR) substitution model with gamma distribution rate variation among sites for each of the 3 partitions. We used a constant size coalescent model as tree prior and specified a lognormal prior with mean = 6.0 and standard deviation = 0.5 on the population size. Three independent MCMC analyses were run for 250 million states for each data set. We used the BEAGLE library v3 [78] to increase computational performance. For completeness, we performed the same analysis specifying the local clock model on the entire nCoV clade instead of just the branch leading to it. This has an effect on the substitution rate and node time estimates; however, we cannot formally distinguish which model is a better fit (S3 Text). We focused on the branch-only local clock model, since it directly tests our hypothesis that the rate change is linked to the CpG adaptive shift specific to that branch. Fig 3A presents the phylogeny for NRR2 because it is the longest intact nonrecombinant region and could possibly yield more reliable estimates. The additional NRR1 phylogenies and nCoV clade local model phylogenies are presented in S3 Text. BEAST parameter XML files are provided in S1 Data (sequence alignments are excluded to abide to GISAID data sharing restrictions). Continuous parameters were summarised and effective sample sizes were estimated using Tracer [79]. Divergence time estimates for the nCoV clade are summarised in S3 Text. Trees were summarised as maximum clade credibility (MCC) trees using TreeAnnotator and visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/).

Identifying shifts in CpG content

Shifts in CpG content were identified using a phylogenetic comparative method that infers adaptive shifts in multivariate correlated traits with the R package PhylogeneticEM [56]. This approach models trait evolution on phylogenies using an Ornstein–Uhlenbeck (OU) process and uses a computationally tractable version of the full multivariate OU model (scalar OU) for multivariate traits. Estimates of the shift positions are obtained using an Expectation–Maximization (EM) algorithm. The shift positions are estimated for various numbers of unknown shifts, and a lasso-regression model selection procedure identifies the optimal number of shifts. We applied the procedure to the MCC trees for NRR1 and NRR2 with their respective CpG SDUc values (ln transformed as required by the phylogenetic approach). The trees were forced to be ultrametric, as required for the scalar OU model, by extending all the external edges of the trees to match the most recently sampled tip. Using this procedure, we identified 3 and 2 CpG shifts in both NRR1 and NRR2, respectively (S6 Fig), with the shift on the branch ancestral to the nCoV clade being the only consistent one identified in both genomic regions (Fig 3A).

Go to:

Supporting information

S1 Text

Inferring population demography from distribution of allele frequencies.

(PDF)

Click here for additional data file.^{(111K, pdf)}

S2 Text

Signals of positive selection in SARS-CoV-2 and frequency-based analysis of SARS-CoV-2 polymorphisms.

(PDF)

Click here for additional data file.^{(562K, pdf)}

S3 Text

Model selection when testing for a nCoV-specific rate change.

(PDF)

Click here for additional data file.^{(496K, pdf)}

S1 Fig

Phylogenetic tree (generated with RAxML and visualized using FigTree), showing the relationship of SARS-CoV-1 and SARS-CoV-2 (orange) to related bat and pangolin Sarbecoviruses from the whole genome alignment.

Grey, red, and blue variants are coloured according to Letko and colleagues [12] who showed experimentally some viruses are able to use human ACE2 (red), while some require exogenous protease treatment in vitro (grey); the red outlier is a known recombinant. Black indicates not tested by Letko and colleagues, while the virus in blue (sampled in Bulgaria) could not be induced to infect human cells. The scale bar corresponds to nucleotide substitutions per site.

(PNG)

Click here for additional data file.^{(80K, png)}

S2 Fig

Schematic representation of selection methods implemented on each data set of the analysis: (i) the 133,741 SARS-CoV-2 genomes and (ii) the 69 Sarbecovirus genomes.

(PNG)

Click here for additional data file.^{(296K, png)}

S3 Fig

Synonymous rate of each of the 3 synonymous rate variation classes estimated by BUSTED[S] + HMM and the proportion attributed to each class for all nonrecombinant open reading frames.

(PNG)

Click here for additional data file.^{(228K, png)}

S4 Fig

Corrected synonymous dinucleotide usage (SDUc) values for the Orf1ab of each Sarbecovirus for all dinucleotide frame positions: (A) pos1: first and second codon positions, (B) pos2: second and third codon positions, and (C) bridge: third codon position and first position of the next codon, plotted against patristic distance from SARS-CoV-2 (reference genome Wuhan-Hu-1). (D) The tip colours of the phylogeny correspond to the SDUc data points. SARS-CoV-1 is labelled in orange in panels A, B, and C for comparison. The non-nCoV part of the phylogeny has been collapsed for clarity.

(PNG)

Click here for additional data file.^{(320K, png)}

S5 Fig

Maximum likelihood trees for NRR1 (left) and NRR2 (right).

The trees were inferred using IQTREE using a GTR substitution model with gamma-distributed rate variation among sites. The nCoV lineage is indicated in red. SARS and SARS-CoV-2 are shaded in blue and red, respectively.

(PNG)

Click here for additional data file.^{(847K, png)}

S6 Fig

Estimates of shifts in CpG in the Sarbecovirus phylogeny based on NRR1 (left) and NRR2 (right).

The log-transformed CpG content values are shown at the tips of the trees. The identified shifts are indicated with black circles on their respective branches and with different colours for the lineages and CpG measures involved. The inset shows the results for penalized least-squares model selection criterion (BGHml).

(PNG)

Click here for additional data file.^{(51K, png)}

S1 Table

List of GISAID accessions used for FUBAR analysis (see S2 Text).

(CSV)

Click here for additional data file.^{(155K, csv)}

S2 Table

List of sites found to be under selection by MEME in the nCoV clade.

(XLSX)

Click here for additional data file.^{(24K, xlsx)}

S3 Table

BUSTED[S] fits for nonrecombinant segments.

For each segment, 2 rows of numbers are presented: top—for the BUSTED[S]-HMM model and bottom—for the BUSTED[S] model, allowing for multinucleotide substitutions (see Lucaci and colleagues [45] for details). The model with the better AIC-c and significant test results (p-value) are shown in bold. CV synonymous rate variation (SRV): coefficient of variation (CV) of the site-to-site synonymous rate distribution. [slashed lambda] : the rate of double-nucleotide substitutions (relative to the rate of single-nucleotide synonymous substitutions); ẟ: the relative rate of triple nucleotide substitutions; δ_islands: the relative rate of triple nucleotide substitutions that involve synonymous codons, e.g., for Serine.

(XLSX)

Click here for additional data file.^{(14K, xlsx)}

S4 Table

List of Sarbecoviruses used in the selection analysis and their GISAID accessions.

(CSV)

Click here for additional data file.^{(4.2K, csv)}

S5 Table

GISAID accessions used in the SARS-CoV-2 selection analysis.

(CSV)

Click here for additional data file.^{(2.7M, csv)}

S6 Table

GISAID sequences acknowledgement table.

(PDF)

Click here for additional data file.^{(4.0M, pdf)}

S1 Data

XML files with BEAST model parameters (sequences are excluded due to GISAID data sharing restrictions).

(GZ)

Click here for additional data file.^{(37K, gz)}

Go to:

Acknowledgments

We would like to thank all the authors who have kindly deposited and shared genome data. A table with GISAID genome sequence acknowledgments can be found in S6 Table. Credit also needs to be given to the surveillance projects for generating the genome data that are available in GenBank and to the software developers for making the tools we have used freely available. We thank Alex Gunnarsson, Xiaowei Jiang, Joseph Hughes, and Kyriaki Nomikou for thankful comments on the manuscript.

Go to:

Abbreviations:

aBSREL	adaptive branch-site random effects likelihood
COVID-19	Coronavirus Disease 201
CV	coefficient of variation
dN	nonsynonymous substitution rate
dS	synonymous substitution rate
EM	expectation–maximization
FEL	fixed effects likelihood
GTR	general time-reversible
hACE2	human angiotensin-converting enzyme 2
HMM	hidden Markov model; MCC, maximum clade credibility
MCMC	Markov chain Monte Carlo
MEME	mixed effects model of evolution
MERS	Middle East Respiratory Syndrome
nCoV	new coronavirus
ORF	open reading frame
OU	Ornstein–Uhlenbeck
QC	quality control
RDA	relative dinucleotide abundance
RdRp	RNA-dependent RNA polymerase
SARS-CoV-2	Severe Acute Respiratory Syndrome Coronavirus 2
SDUc	corrected synonymous dinucleotide usage
SLAC	single-likelihood ancestor counting
ZAP	Zinc finger Antiviral Protein

Go to:

Funding Statement

DLR is funded by the Medical Research Council (MC_UU_1201412) and Wellcome Trust (220977/Z/20/Z). OAM is funded by the Wellcome Trust (206369/Z/17/Z). SLKP and SW are supported in part by the National Institutes of Health (R01 AI134384 (NIH/NIAID)) and the National Science Foundation (award 2027196). PL acknowledges funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725422-ReservoirDOCS), the European Union's Horizon 2020 project MOOD (874850), the Wellcome Trust through project 206298/Z/17/Z (The Artic Network) and the Research Foundation -- Flanders (`Fonds voor Wetenschappelijk Onderzoek -- Vlaanderen', G066215N, G0D5117N and G0B9317N). MFB is funded by a grant from the Bill and Melinda Gates Foundation (INV-005517) and by NIH/NIAID Center of Excellence in Influenza Research and Surveillance contract (HHS N272201400007C). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Go to:

Data Availability

All relevant data are within the paper, its Supporting Information files or associated online resources.

Go to:

References

1. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, et al.. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–9. 10.1038/s41586-020-2008-3 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

2. Gorbalenya AE, Baker SC, Baric RS, de Groot RJ, Drosten C, Gulyaeva AA, et al.. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology. Nature Research. 2020:536–44. 10.1038/s41564-020-0695-z [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

3. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al.. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia. N Engl J Med. 2020;382:1199–207. 10.1056/NEJMoa2001316 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

4. Zhou P, Lou YX, Wang XG, Hu B, Zhang L, Zhang W, et al.. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–3. 10.1038/s41586-020-2012-7 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

5. Lam TTY, Shum MHH, Zhu HC, Tong YG, Ni XB, Liao YS, et al.. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature. 2020;583:282–5. 10.1038/s41586-020-2169-0 [Abstract] [CrossRef] [Google Scholar]

6. Boni MF, Lemey P, Jiang X, Lam TTY, Perry BW, Castoe TA, et al.. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol. 2020:1–10. 10.1038/s41564-019-0652-x [Abstract] [CrossRef] [Google Scholar]

7. Guan Y, Zheng BJ, He YQ, Liu XL, Zhuang ZX, Cheung CL, et al.. Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern China. Science (80). 2003;302:276–8. 10.1126/science.1087139 [Abstract] [CrossRef] [Google Scholar]

8. Song HD, Tu CC, Zhang GW, Wang SY, Zheng K, Lei LC, et al.. Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human. Proc Natl Acad Sci U S A. 2005;102:2430–5. 10.1073/pnas.0409608102 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

9. Graham RL, Recombination BRS. Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission. J Virol. 2010;84:3134–46. 10.1128/JVI.01394-09 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

10. Menachery VD, Dinnon IIIKH, Yount BL, McAnarney ET, Gralinski LE, Hale A, et al.. Trypsin Treatment Unlocks Barrier for Zoonotic Bat Coronavirus Infection. J Virol. 2020;94:e01774–19. 10.1128/JVI.01774-19 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

11. Ge XY, Li JL, Lou YX, Chmura AA, Zhu G, Epstein JH, et al.. Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. Nature. 2013;503:535–8. 10.1038/nature12711 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

12. Letko M, Marzi A, Munster V. Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat Microbiol. 2020;5:562–9. 10.1038/s41564-020-0688-y [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

13. Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF. The proximal origin of SARS-CoV-2. Nature Medicine. Nature Research. 2020:450–2. 10.1038/s41591-020-0820-9 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

14. Gallaher WR. A palindromic RNA sequence as a common breakpoint contributor to copy-choice recombination in SARS-COV-2. Arch Virol. 2020;1:3. 10.1007/s00705-020-04750-z [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

15. Lytras S, Hughes J, Xia W, Jiang X, Robertson DL. Exploring the natural origins of SARS-CoV-2. bioRxiv. 2021; 2021.01.22.427830. 10.1101/2021.01.22.427830 [CrossRef] [Google Scholar]

16. Parrish CR, Holmes EC, Morens DM, Park E-C, Burke DS, Calisher CH, et al.. Cross-Species Virus Transmission and the Emergence of New Epidemic Diseases. Microbiol Mol Biol Rev. 2008;72:457–70. 10.1128/MMBR.00004-08 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

17. Woolhouse MEJ, Taylor LH, Haydon DT. Population biology of multihost pathogens. Science. 2001. 10.1126/science.1059026 [Abstract] [CrossRef] [Google Scholar]

18. Lin XD, Wang W, Hao ZY, Wang ZX, Guo WP, Guan XQ, et al.. Extensive diversity of coronaviruses in bats from China. Virology. 2017;507:1–10. 10.1016/j.virol.2017.03.019 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

19. Shang J, Ye G, Shi K, Wan Y, Luo C, Aihara H, et al.. Structural basis of receptor recognition by SARS-CoV-2. Nature. 2020;581:221–4. 10.1038/s41586-020-2179-y [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

20. Verity R, Okell LC, Dorigatti I, Winskill P, Whittaker C, Imai N, et al.. Estimates of the severity of coronavirus disease 2019: a model-based analysis. Lancet Infect Dis. 2020;20:P669–77. 10.1016/S1473-3099(20)30243-7 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

21. Grubaugh ND, Petrone ME, Holmes EC. We shouldn’t worry when a virus mutates during disease outbreaks. Nature Microbiology. Nature Research. 2020:529–30. 10.1038/s41564-020-0690-4 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

22. Maclean OA, Orton RJ, Singer JB, Robertson DL. No evidence for distinct types in the evolution of SARS-CoV-2. Virus Evol. 2020;6:veaa034. 10.1093/ve/veaa034 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

23. Kimura M. Evolutionary rate at the molecular level. Nature. 1968;217:624–6. 10.1038/217624a0 [Abstract] [CrossRef] [Google Scholar]

24. Ohta T. Slightly deleterious mutant substitutions in evolution. Nature. 1973;246:96–8. 10.1038/246096a0 [Abstract] [CrossRef] [Google Scholar]

25. Geoghegan JL, Holmes EC. The phylogenomics of evolving virus virulence. Nat Rev Genet. Nature Publishing Group; 2018. pp. 756–769. 10.1038/s41576-018-0055-5 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

26. Volz E, Hill V, McCrone JT, Price A, Jorgensen D, O’Toole Á, et al.. Evaluating the effects of SARS-CoV-2 Spike mutation D614G on transmissibility and pathogenicity. Cell. 2020. 10.1016/j.cell.2020.11.020 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

27. Boni MF, Nguyen TD, de Jong MD, van Doorn HR. Virulence attenuation during an influenza A/H5N1 pandemic. Philos Trans R Soc B Biol Sci 2013;368: 20120207. 10.1098/rstb.2012.0207 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

28. Peischl S, Dupanloup I, Kirkpatrick M, Excoffier L. On the accumulation of deleterious mutations during range expansions. Mol Ecol. 2013;22:5972–82. 10.1111/mec.12524 [Abstract] [CrossRef] [Google Scholar]

29. Nielsen R, Wang H, Pipes L. Synonymous mutations and the molecular evolution of SARS-Cov-2 origins. Virus Evolution. 2020; veaa098. 10.1093/ve/veaa098 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

30. Kosakovsky SL, Frost SDW. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol. 2005;22:1208–22. 10.1093/molbev/msi105 [Abstract] [CrossRef] [Google Scholar]

31. Pybus OG, Rambaut A, Belshaw R, Freckleton RP, Drummond AJ, Holmesà EC. Phylogenetic Evidence for Deleterious Mutation Load in RNA Viruses and Its Contribution to Viral Evolution. Mol Biol Evol. 2007;24:845–52. 10.1093/molbev/msm001 [Abstract] [CrossRef] [Google Scholar]

32. Anisimova M, Bielawski JP, Yang Z. Accuracy and Power of the Likelihood Ratio Test in Detecting Adaptive Molecular Evolution. Mol Biol Evol 2001;18: 1585–1592. Available: https://academic.oup.com/mbe/article-abstract/18/8/1585/993482 10.1093/oxfordjournals.molbev.a003945 [Abstract] [CrossRef] [Google Scholar]

33. Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, et al.. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell. 2020. 10.1016/j.cell.2020.06.043 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

34. Plante JA, Liu Y, Liu J, Xia H, Johnson BA, Lokugamage KG, et al.. Spike mutation D614G alters SARS-CoV-2 fitness. Nature. 2020. 10.1038/s41586-020-2895-3 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

35. Duchêne S, Holmes EC, Ho SYW. Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates. Proc R Soc B Biol Sci. 2014;281:20140732. 10.1098/rspb.2014.0732 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

36. Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Kosakovsky Pond SL. Detecting individual sites subject to episodic diversifying selection. PLoS Genet. 2012;8:e1002764. 10.1371/journal.pgen.1002764 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

37. Kosakovsky SL, Posada D, Gravenor MB, Woelk CH, Frost SDW. GARD: a genetic algorithm for recombination detection. Bioinformatics. 2006;22:3096–8. 10.1093/bioinformatics/btl474 [Abstract] [CrossRef] [Google Scholar]

38. Anthony SJ, Gilardi K, Menachery VD, Goldstein T, Ssebide B, Mbabazi R, et al.. Further evidence for bats as the evolutionary source of middle east respiratory syndrome coronavirus. MBio. 2017;8. 10.1128/mBio.00373-17 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

39. Dudas G, Rambaut A. MERS-CoV recombination: implications about the reservoir and potential for adaptation. Virus Evol. 2016;2:vev023. 10.1093/ve/vev023 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

40. Wong MC, Cregeen SJJ, Ajami NJ, Petrosino JF. Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019. bioRxiv. 2020; 2020.02.07.939207. 10.1101/2020.02.07.939207 [Abstract] [CrossRef] [Google Scholar]

41. Smith MD, Wertheim JO, Weaver S, Murrell B, Scheffler K, Kosakovsky Pond SL. Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection. Mol Biol Evol. 2015;32:1342–53. 10.1093/molbev/msv022 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

42. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous Site-to-Site Substitution Rate Variation Dramatically Inflates False Positive Rates of Selection Analyses: Ignore at Your Own Peril. Mol Biol Evol. 2020;37:2430–9. 10.1093/molbev/msaa037 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

43. Felsenstein J, Churchill GA. A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution. Mol Biol Evol 1996;13: 93–104. Available: https://academic.oup.com/mbe/article-abstract/13/1/93/1055515 10.1093/oxfordjournals.molbev.a025575 [Abstract] [CrossRef] [Google Scholar]

44. Forni D, Cagliani R, Mozzi A, Pozzoli U, Al-Daghri N, Clerici M, et al.. Extensive Positive Selection Drives the Evolution of Nonstructural Proteins in Lineage C Betacoronaviruses. J Virol. 2016;90:3627–39. 10.1128/JVI.02988-15 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

45. Lucaci A, Wisotsky S, Shank S, Weaver S, Kosakovsky Pond S. Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes. bioRxiv. 2020; 2020.05.13.091652. 10.1101/2020.05.13.091652 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

46. Konno Y, Kimura I, Uriu K, Fukushi M, Irie T, Koyanagi Y, et al.. SARS-CoV-2 ORF3b Is a Potent Interferon Antagonist Whose Activity Is Increased by a Naturally Occurring Elongation Variant. Cell Rep 2020; 108185. 10.1016/j.celrep.2020.108185 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

47. Simmonds P, Rampant C. →U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term Evolutionary Trajectories. mSphere. 2020;5:e00408–20. 10.1128/mSphere.00408-20 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

48. De Maio Nicola; Walker Conor; Borges Rui; Weilguny Lukas; Slodkowicz Greg; Goldman Nick. Issues with SARS-CoV-2 sequencing data. In: Virological [Internet]. 2020. [cited 2020 Dec 24]. Available: https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 [Google Scholar]

49. Venkat A, Hahn MW, Thornton JW. Multinucleotide mutations cause false inferences of lineage-specific positive selection. Nat Ecol Evol. 2018. 10.1038/s41559-018-0584-5 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

50. Greenbaum BD, Levine AJ, Bhanot G, Rabadan R. Patterns of Evolution and Host Gene Mimicry in Influenza and Other RNA Viruses. PLoS Pathog. 2008;4:e1000079. 10.1371/journal.ppat.1000079 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

51. Takata MA, Gonçalves-Carneiro D, Zang TM, Soll SJ, York A, Blanco-Melo D, et al.. CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature. 2017;550:124–7. 10.1038/nature24039 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

52. Bishop KN, Holmes RK, Sheehy AM, Malim MH. APOBEC-mediated editing of viral RNA. Science (80-). 2004;305:645. 10.1126/science.1100658 [Abstract] [CrossRef] [Google Scholar]

53. Kmiec D, Nchioua R, Sherrill-Mix S, Stürzel CM, Heusinger E, Braun E, et al.. CpG frequency in the 5’ third of the env gene determines sensitivity of primary HIV-1 strains to the zinc-finger antiviral protein. MBio. 2020;11. 10.1128/mBio.02903-19 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

54. Pollock DD, Castoe TA, Perry BW, Lytras S, Wade KJ, Robertson DL, et al.. Viral CpG deficiency provides no evidence that dogs were intermediate hosts for SARS-CoV-2. Mol Biol Evol. 2020;37:2706–10. 10.1093/molbev/msaa178 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

55. Lytras S, Hughes J. Synonymous Dinucleotide Usage: A Codon-Aware Metric for Quantifying Dinucleotide Representation in Viruses. Viruses. 2020;12:462. 10.3390/v12040462 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

56. Bastide P, Ané C, Robin S, Mariadassou M, Alfaro M. Inference of Adaptive Shifts for Multivariate Correlated Traits. Syst Biol. 2018;67:662–80. 10.1093/sysbio/syy005 [Abstract] [CrossRef] [Google Scholar]

57. Zhou H, Chen X, Hughes AC, Bi Y, Shi WA. Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein. Curr Biol. 2020;30:1–8. 10.1016/j.cub.2019.10.048 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

58. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018;4:vey016. 10.1093/ve/vey016 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

59. Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics. Elsevier Current Trends. 1995:283–90. 10.1016/S0168-9525(00)89076-9 [Abstract] [CrossRef] [Google Scholar]

60. Lee J, Hughes T, Lee MH, Field H, Rovie-Ryan JJ, Sitam FT, et al.. No Evidence of Coronaviruses or Other Potentially Zoonotic Viruses in Sunda pangolins (Manis javanica) Entering the Wildlife Trade via Malaysia. Ecohealth. 2020;17:406–18. 10.1007/s10393-020-01503-x [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

61. Volz E, Mishra S, Chand M, Barrett JC, Johnson R, Hopkins S, et al.. Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. medRxiv. 2021; 2020.12.30.20249034. 10.1101/2020.12.30.20249034 [CrossRef] [Google Scholar]

62. Tegally H, Wilkinson E, Giovanetti M, Iranzadeh A, Fonseca V, Giandhari J, et al.. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv. 2020. [Google Scholar]

63. Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, et al. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations—SARS-CoV-2 coronavirus/nCoV-2019 Genomic Epidemiology—Virological. [cited 20 Jan 2021]. Available: https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563

64. Holmes EC. Error thresholds and the constraints to RNA virus evolution. Trends Microbiol. 2003. 10.1016/j.tim.2003.10.006 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

65. Van Dorp L, Tan CC, Lam SD, Richard D, Owen C, Berchtold D, et al.. Recurrent mutations in SARS-CoV-2 genomes isolated from mink point to rapid host-adaptation. bioRxiv. 2020; 2020.11.16.384743. 10.1101/2020.11.16.384743 [CrossRef] [Google Scholar]

66. Wang N, Li SY, Lou Yang X, Huang HM, Zhang YJ, Guo H, et al.. Serological Evidence of Bat SARS-Related Coronavirus Infection in Humans, China. Virol Sin. Science Press; 2018. pp. 104–107. 10.1007/s12250-018-0012-7 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

67. Li H, Mendelsohn E, Zong C, Zhang W, Hagan E, Wang N, et al.. Human-animal interactions and bat coronavirus spillover potential among rural residents in Southern China. Biosaf Heal. 2019;1:84–90. 10.1016/j.bsheal.2019.10.004 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

68. Anderson DE, Tan CW, Chia WN, Young BE, Linster M, Low JH, et al.. Lack of cross-neutralization by SARS patient sera towards SARS-CoV-2. Emerg Microbes Infect. 2020;9:900–2. 10.1080/22221751.2020.1761267 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

69. Katoh K, Standley DMMAFFT. Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–80. 10.1093/molbev/mst010 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

70. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinforma Appl. 2014;30:1312–3. 10.1093/bioinformatics/btu033 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

71. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous site-to-site substitution rate variation dramatically inflates false positive rates of selection analyses: ignore at your own peril. Mol Biol Evol. [cited 8 Apr 2020]. 10.1093/molbev/msaa037 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

72. Kosakovsky Pond SL, Poon AFY, Velazquez R, Weaver S, Lance Hepler N, Murrell B, et al.. HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Mol Biol Evol. 2019;37:295–9. 10.1093/molbev/msz197 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

73. Charif D, Lobry JR. SeqinR 1.0–2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. Structural Approaches to Sequence Evolution. Springer, Berlin, Heidelberg; 2007. pp. 207–232. 10.1007/978-3-540-35306-5_10 [CrossRef] [Google Scholar]

74. Li WH. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. Journal of Molecular Evolution. Springer; 1993. pp. 96–99. 10.1007/BF02407308 [Abstract] [CrossRef] [Google Scholar]

75. Martin DP, Murrell B, Golden M, Khoosal A, Muhire B. RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evol. 2015;1:vev003. 10.1093/ve/vev003 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

76. Worobey M, Han GZ. Rambaut A. A synchronized global sweep of the internal genes of modern avian influenza virus. Nature. 2014;508:254–7. 10.1038/nature13016 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

77. Drummond AJ, Ho SYW, Phillips MJ, Relaxed Phylogenetics RA. Dating with Confidence. Penny D, editor PLoS Biol. 2006;4:e88. 10.1371/journal.pbio.0040088 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

78. Ayres DL, Cummings MP, Baele G, Darling AE, Lewis PO, Swofford DL, et al.. BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics. Syst Biol. 2019;68:1052–61. 10.1093/sysbio/syz020 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

79. Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7. Syst Biol. 2018;67:2. 10.1093/sysbio/syy032 [Europe PMC free article] [Abstract] [CrossRef] [Google Scholar]

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r001

Decision Letter 0

Paula Jauregui, PhD, Senior Editor

23 Oct 2020

Dear Dr. Robertson,

Thank you for submitting your manuscript entitled "Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen" for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Oct 25 2020 11:59PM.

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at gro.solp@ygoloibsolp if you have any queries relating to your submission.

Kind regards,

Paula

---

Paula Jauregui, PhD,

associate Editor

PLOS Biology

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r002

Decision Letter 1

Paula Jauregui, PhD, Senior Editor

4 Dec 2020

Dear Dr. Robertson,

Thank you very much for submitting your manuscript "Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers.

In light of the reviews (below), we are pleased to offer you the opportunity to address the comments from the reviewers in a revised version that we anticipate should not take you very long. We will then assess your revised manuscript and your response to the reviewers' comments and we may consult the reviewers again.

In particular, referee #1 says that due to the under sampling in wildlife species, a greater sampling of animal species in nature including bats could indicate a closer ancestor and may well demonstrate selection and host-adaptations, and says that you should recognize this and the paucity of SARS-CoV-2 genomes from very early in the pandemic from China in the text, adding that you can’t dismiss the hypothesis of human-specific adaptation. Referee #1 also has several questions and comments about your analysis and assumptions, that we consider all should be addressed. Referee #2 has suggestions for modifying text and figures in order to make it clearer for a general audience. Both reviewers think that you should change the title.

We expect to receive your revised manuscript within 1 month.

Please email us (gro.solp@ygoloibsolp) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Resubmission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Paula

---

Paula Jauregui, PhD,

Associate Editor,

gro.solp@aveinoiugeruajp,

PLOS Biology

*****************************************************

REVIEWER'S EXPERTISE:

Reviewer #1: viral phylodynamics

Reviewer #2: dynamics of zoonotic infections

Reviewer #1: This manuscript by MacLean/Lytras and colleagues is a fascinating analysis assessing the nature of selection occurring in horseshoe bats compared to SARS-CoV-2 evolution in humans. In this study the authors explore the evolutionary history of bat Sarbecoviruses that may help shed some insight into the emergence and rapid spread of SARS-CoV-2. One of the quirks of this pandemic is that the virus responsible evolves relatively slowly resulting in a very shallow phylogenetic tree. This feature inhibits the statistical power of standard comparative methods and as such other computational approaches are needed. This manuscript adapts a rigorous methodology to measure the strength and direction of selection from SARS-CoV-2 and its Bat precursors while carefully considering confounding errors (sequencing or other lab-based errors) that may have occurred on terminal branches.

One thing that does irk me slightly is the authors assertions of the "generalist" nature of these viruses allowing for efficient spillover events. While there is no doubt that SARS-CoV-2 is well adapted for humans and the authors have not found any evidence for selection in the ancestral branches leading to SARS-CoV-2 there is a huge amount of under sampling in wildlife species so a greater sampling of animal species in nature including bats could indicate a closer ancestor and may well demonstrate selection and host-adaptations. This caveat and the paucity of SARS-CoV-2 genomes from very early in the pandemic from China should be duly recognized within the text. Moreover, given the divergence estimates from RmYN02/SARS-CoV-2 of 1976 there is potentially decades of unobserved evolution that may have occurred that warrants consideration.

Specific comments.

1. The methods are quite technical and for a broader reader of PLoS Biology I think they would find it challenging to understand the nuances of each computational approach. It may be worth the authors considering a figure (even as supplementary) to highlight the methodological approach used.

2. The authors premise that the majority of host adaptations occurred before the emergence of SARS-CoV-2 in humans while may be technically sound with the current sequence data but how can the authors dismiss the hypothesis that human-specific adaptation would have likely reached fixation even before the first SARS-CoV-2 genome was sequenced?

3. As this field is continuously evolving with new genomic data being added daily the statement in line 328-331 should no longer be considered accurate as we can now observe rapid adaptation of SARS-CoV-2 in mink populations. E.g Y453F in the receptor biding motif and this rarely occurs in humans. My point linked with the above point is that increased surveillance permitted us to observe this variant in minks while similarly changes may have occurred earlier in the pandemic and become fixed before sequencing was done. Could rapid adaptation in late 2019 from unsampled asymptomatic transmission chains be plausible?

4. Unsurprisingly, the authors find that up until early June, relatively weak purifying selection was acting on SARS-CoV-2 sequences. If the authors extended their analysis to the present what would they expect to find given that there is more diversity in current circulating variants?

5. While SARS-CoV-2 consensus sequences remain largely unchanged over time within hosts is there evidence for a more dynamic evolutionary process sub-consensus?

6. The authors use MEME as documented in Figure 1c to individual sites subjected to episodic diversifying selection. However, I thought that MEME assumes that selective pressures between branches are uncorrelated. Surely, this is not the case for SARS-CoV-2 as changes are occurring very slowly across the phylogeny as neighboring branches will be correlated. Have the authors considered a mixed effects covarion model and if this would improve power to detect directional selection?

7. The addition of synonymous site rate variation is a great addition to BUSTED as constant dS rates can elevate false-positive and reduce power to test individual sites for selection. However, can the authors comment on whether accommodating synonymous rate variation results in reduced power compared to the original method?

8. How did the authors consider multinucleotide mutations in their analysis as I am sure the authors are aware that there is a high possibility of false positives with branch sites tests like BUSTED. For example, Venkat et al. (2018).

9. From figure 2 there appears to be evidence for selection in the pangolin CoV cluster within Orf1ab. While not the focus of this paper does this not suggest that there may be other adaptations under selection and responsible for its emergence in this animal host?

10. The title while scientifically accurate is a bit clumsy to me. I would suggest something catchier for a reader.

11. How did the authors consider genomes with Ns in collapsing sequences into unique haplotypes. Were they ignored and only A,C,G,T characters considered?

Reviewer #2: This important paper explores the evolutionary selective pressure on the Sarbecovirus subgenus of viruses and a subset of early SARS-CoV-2 genomes circulating in the human population in order to assess whether natural selection facilitated SARS-CoV-2's cross-species transmission to and consequential spread in the human population. In particular, the authors undertake the following key analyses:

1. Exploration of natural selection in SARS-CoV-2 (hereafter, SC2 for simplicity) human genomes.

First, the authors analyze, ~50000 human genomes of SC2, up until June 28, 2020 of the pandemic, limiting their analysis to genomes >29000 bps, with <1% divergence from the reference, with <0.5% ambiguous bases, and lacking stop codons. They use this dataset to assess evidence of purifying and/or positive selection in SC2, finding that most genetic variants in the virus occur at low frequency (in <15 genomes) and exhibit weakly purifying selection, consistent with a model of exponential virus growth. A few exceptions occur in the case of a few high frequency variants found in >3000 genomes that show weakly positive selection.

The authors undertake a number of advanced analyses to validate these few SNPs that are deemed to be under positive selection, identifying 10 candidate mutations with dN/dS ratios >1 that could be positively selected. They consider each of these individually, investigating the timing and laboratory of submission, the possibility of sequencing error or recombination to give rise to this variant and ultimately converge on four mutations that appear to be truly under positive selection: RdRp 323, S943, S614, and S141.

2. Analysis of selection in SARSr-CoV sequences in bats.

For the second major analysis of the paper, the authors analyze a subset of 19 non-recombinant regions of several bat/human/pangolin SARSr-CoVs identified in Boni et al 2019. Each non-recombinant region is analyzed independently, and for each region, the authors separate the viruses into an nCoV clade, representing those most closely related to SC2 and forming a monophyly and a non-nCoV clade, representing those more distantly related.

Using the program aBSREL, the authors first search for positive selection on specific branches in the SARSr-CoV phylogeny and find its imprints in the deepest branches of the nCoV lineage long before the emergence of the virus to humans, suggesting that it was not recent selection that allowed for the cross species shift.

The authors next use a program called BUSTED and their own extension to explore synonymous rate variation (SRV) across SARSr-CoV genomes, finding positive selection in the nCoV clade in Orf1ab, Spike and N proteins. Using the MEME method, the authors identify 85 particular sites in the nCoV clade under selection, most in Orf1ab, Spike, and N. Critically, they show a higher than expected proportion of sites in Orf3a, for which the function is not known, but they suggest it might play a role in immune evasion.

The authors' extension to the BUSTED method allowed them to infer differing substitution rate classes across the non-recombinant sites in the genome; the authors determine up to 200-fold differences in the rate of synonymous substitution across sites in the genome, suggesting that some synonymous sites may be under strong purifying selection to purge deleterious mutations.

3. CpG depletion in the nCoV clade

The third and final major analysis of the paper investigates the depletion of CpG sites in the nCoV clade. CpG depletion is believed to be advantageous for virus evolution because It aids in evasions of a CpG-targeted mammalian immune response involving Zinc-finger Antiviral Protein (ZAP), as well as antiviral C to U hypermutation carried out by APOBEC3 cytidine deaminases. The authors use a framework called Synonymous Dinucleotide Usage (SDUc) to compare CpG representation across the 19 non-recombinant regions of the SARSr-CoV clade, finding significant CpG under-representation in Orf1ab for all Sarbecoviruses and lower CpG content overall in the nCoV clade Sabecoviruses vs all others. They fit this trait on two alignments of the 19 non-recombinant regions of the SARSr-CoVs to identify points where the "CpG suppressive" trait evolved: in particular, on the lineage leading to the nCoV clade.

Finally, they tested a model which allowed for a relaxed mutation rate in different clades to find evidence of an elevated substitution rate on the nCoV lineage subsequent to this CpG depletion event, giving way to a generalist virus clade.

General comments:

Though familiar with the SC2 evolutionary history literature, I am not a phylogeneticist by training and cannot comment critically on the methods selected for recombination and selection analysis (aBSREAL, MEME, BUSTED, SDUc). These appear to be appropriate to me. More generally, however, I believe that this paper is an important and relevant contribution to the SC2 evolution literature and should be published soon in PLoS Biology. It is not, however, the most clearly written paper I have encountered, and I have a few suggestions for how the authors could make their findings more accessible.

Title: Title is a bit of a mouthful. What about dropping the "not humans" bit and including the word "generalist" -- something to the effect of: "Natural selection of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen"

Intro: It would be helpful if you set up these three major analyses summarized above at the end of the intro (~Line 99). Because PLoS Bio requires results to come before Methods, it can often lead to slightly disjointed papers, but I think one quick sentence preparing the reader for the three major areas of focus to follow would help a lot. In addition, I would recommend being very explicit about the two datasets used in this paper and stating that analysis #1 refers to the 50000 human SC2 genomes while analyses #2 and 3 refer to the SARSr-CoV dataset (in fact, two versions of it as presented in Boni et al. 2020)

Discussion: I think some discussion of the pros vs. cons of being a specialist vs. a generalist virus is warranted here. The paper suggests that the entire clade of Sarbecoviruses is a highly generalist clade which seems like it should be a majorly adaptive feature. Why then have they not outcompeted all of the other tradeoffs? A nod to the literature on specialism vs. generalism in host-pathogen coevolution would be appropriate.

Figures

Fig 1: Fig 1 is largely appropriate for summarizing the analysis of human SC2 genomes and evidence of positive selection but Fig 1A could be improved: it shows 'Variant Frequency' on the x-axis, which I believe gives the number of genomes in the dataset for in which a given variant is found, by both the dN/dS ratio on the primary y-axis (dots) and the count of variants of this type in the dataset on the secondary y-axis. So I think it is saying that there are ~7000 individual variants which are unique to only one genome in the dataset and these have a mean corresponding dN/dS ratio around .85. Likewise, I see it as saying that there are <50 variants that are found in over 3000 genomes and that these have an average dN/dS ratio of around 1.4. Is this a correct interpretation?

It would be helpful if the bar color marched the color of the secondary y-axis text (count) and the dots matched the color of the primary y-axis text (dN/DS) to avoid confusion.

Also, why is there no range of error for the bars (i.e. through bootstrapping) or the dots from the SLAC method?

Fig 2. I found Fig 2 difficult to interpret. I have a few suggestions for how it could be improved: 1. Break it down into subsections (A, B, C) so that you can refer to each separately in the caption.

2. In the caption, explain the genome structure and positive/negative selection subset first, sinc these are at the top of the figure (and refer to this as figure component A).

3. The individual phylogenies are fine—just group these together as part B. I *think* the color scale for dN/dS ratio corresponds to the mean ratio inferred for the highlighted clades within the lineage, so if these are subgrouped together it will be easier to understand.

4. Then, discard the donut plots. It is unclear why they are only present for some of the ORF regions (presumably only those with significant positive selection are shown?) and also confusing that their color scheme is different from the phylogenies. Instead, just either refer to the supplementary table as is done anyway or make a small table as part C that lists omega3 parameter for each ORF from Table S4 and corresponding percentage and mention in the caption that omega1 and 2 were basically 0 for all parameters.

Fig 3. Lovely figure. Could be slightly easier to explain if you made the CpG values part B of the figure and the schematic part C (or if the schematic were a different figure entirely—they don't really relate to one another).

Fig 4. In general, very clear. Can you do some comparative stats on these SDUc values for the frame positions in these two regions of the spike protein (inset) to show that WuHan and RmYN02 don't differ in the nCoV half but do in the non-nCoV half?

A few minor line-by-line comments here:

Line 43: change to "which created a relatively generalist"

Line 47-50: wording is awkward. Change to "Evolutionary analysis identified this new virus to humans as a Severe acute respiratory syndrome-related coronavirus [1], in the Sarbecovirus subgenus of the Betacoronavirus genus, sister to the original SARS virus; it was subsequently named SARS-CoV-2 to reflect this relationship [2]."

Line 62-63: it is not proven that SARS-CoV transmitted to humans via civets and the cited ref is just a review paper. Suggestion to change to "Later it became clear that while these animals may have been conduits for spillover to humans, they were not true viral reservoirs"

Line 88: suggestion to change "us" to "humans"

Line 144: would be good to here cite Plante et al 2020, now published in Nature

Line 242: should be one sentence

Line 359: why was 1% chosen as a cutoff for too divergent (I agree, as 1% divergence would far outpace the known mutation rate for SARS-CoV-2 but I think you should provide a ref indicating the reasonable range of expected divergence, especially up to the point in time studies in your data subset)

Line 360: sudden shift to first person, present tense is perplexing. Please keep tense consistent throughout the paper. Past tense probably makes more sense (i.e. "the data from GISAID was filtered").

Supplementary Materials:

Table S4 could be easily incorporated into the pdf file for the supplementary materials and would be more accessible that way.

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r003

Author response to Decision Letter 1

1 Jan 2021

Attachment

Submitted filename: MacLean-Lytras-etal-reviewers-response-24thDec.pdf

Click here for additional data file.^{(83K, pdf)}

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r004

Decision Letter 2

Paula Jauregui, PhD, Senior Editor

15 Jan 2021

Dear Dr. Robertson,

Thank you for submitting your revised Research Article entitled "Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen" for publication in PLOS Biology. I have now obtained advice and discussed with the Academic Editor.

We will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the editorial requirements. Please also make sure to address the data and other policy-related requests noted at the end of this email.

In particular, it seems that you provide data for supplementary figures 1, 3, 6 and 7 but it is not stated where the data can be found and it is not very clear to what figure this data belongs to. Please see at the end of this email for more information.

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

Please do not hesitate to contact me should you have any questions.

Sincerely,

Paula

---

Paula Jauregui, PhD,

Associate Editor,

gro.solp@aveinoiugeruajp,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figure 1A, 1B, 1C, 1D, 1E, 2B, 3A, 4A, 4B, Supplementary figure 8A, 8B, 8C, 8D, Supplementary figure 10, Supplementary figure 11, Supplementary figure 12.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r005

Author response to Decision Letter 2

22 Jan 2021

Attachment

Submitted filename: MacLean-Lytras-etal-reviewer-response-24thDec.docx

Click here for additional data file.^{(28K, docx)}

2021 Mar; 19(3): e3001115.

Published online 2021 Mar 12. 10.1371/journal.pbio.3001115.r006

Decision Letter 3

Paula Jauregui, PhD, Senior Editor

25 Jan 2021

Dear David,

On behalf of my colleagues and the Academic Editor, Damien Tully, I am pleased to say that we can in principle offer to publish your Research Article "Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen" in PLOS Biology, provided you address any remaining formatting and reporting issues if needed. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

Thank you again for supporting Open Access publishing. We look forward to publishing your paper in PLOS Biology.

Sincerely,

Paula

---

Paula Jauregui, PhD

Associate Editor

PLOS Biology

gro.solp@aveinoiugeruajp

Articles from PLOS Biology are provided here courtesy of PLOS

Full text links

Read article at publisher's site: https://doi.org/10.1371/journal.pbio.3001115

Read article for free, from open access legal sources, via Unpaywall: https://journals.plos.org/plosbiology/article/file?id=10.1371/journal.pbio.3001115&type=printable

Citations & impact

Impact metrics

106

Citations

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/101646141

Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/101646141

Smart citations by scite.ai
Explore citation contexts and check if this article has been supported or disputed.
https://scite.ai/reports/10.1371/journal.pbio.3001115

Supporting

Mentioning

Contrasting

175

Article citations

Mapping the Evolutionary Space of SARS-CoV-2 Variants to Anticipate Emergence of Subvariants Resistant to COVID-19 Therapeutics.
Rojas Chávez RA, Fili M, Han C, Rahman SA, Bicar IGL, Gregory S, Helverson A, Hu G, Darbro BW, Das J, Brown GD, Haim H
PLoS Comput Biol, 20(6):e1012215, 10 Jun 2024
Cited by: 0 articles | PMID: 38857308 | PMCID: PMC11192331
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Finally neutralizing the threat? A novel SARS-CoV-2 vaccine platform that elicits enhanced neutralizing antibody responses.
Salamango DJ
mBio, 15(4):e0006724, 26 Feb 2024
Cited by: 0 articles | PMID: 38407097 | PMCID: PMC11005347
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Evolution of the SARS-CoV-2 Omicron Variants: Genetic Impact on Viral Fitness.
Liu W, Huang Z, Xiao J, Wu Y, Xia N, Yuan Q
Viruses, 16(2):184, 25 Jan 2024
Cited by: 2 articles | PMID: 38399960 | PMCID: PMC10893260
Review
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
The most exposed regions of SARS-CoV-2 structural proteins are subject to strong positive selection and gene overlap may locally modify this behavior.
Rubio A, de Toro M, Pérez-Pulido AJ
mSystems, 9(1):e0071323, 14 Dec 2023
Cited by: 0 articles | PMID: 38095866 | PMCID: PMC10804949
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Evolutionary Shortcuts via Multinucleotide Substitutions and Their Impact on Natural Selection Analyses.
Lucaci AG, Zehr JD, Enard D, Thornton JW, Kosakovsky Pond SL
Mol Biol Evol, 40(7):msad150, 01 Jul 2023
Cited by: 5 articles | PMID: 37395787 | PMCID: PMC10336034
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC

Go to all (106) article citations

Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

BioStudies: supplemental material and supporting data

http://www.ebi.ac.uk/biostudies/studies/S-EPMC7990310?xr=true

Funding

Funders who supported this work.

Bill & Melinda Gates Foundation (1)

Grant ID: INV-005517
10 publications

Bill and Melinda Gates Foundation (1)

Grant ID: INV-005517
9 publications

European Research Council (2)

The evolutionary dynamics of pathogen emergence and establishment: from Reservoir Detection to Outbreak Control (ReservoirDOCs)
Prof Philippe LEMEY, University of Leuven
Grant ID: 725422
82 publications
Grant ID: 725422-ReservoirDOCS
18 publications

European Union's Horizon 2020 project MOOD (1)

Grant ID: 874850
2 publications

Medical Research Council (4)

Viral Genomics and Bioinformatics
Professor Andrew Davison, University of Glasgow
Grant ID: MC_UU_12014/12
327 publications
MRC Centre for Virus Research (CVR): exceptional award
Prof Massimo Palmarini, University of Glasgow
Grant ID: MC_PC_19026
159 publications
Grant ID: MC_UU_1201412
14 publications
Grant ID: 2294377
12 publications

NIAID NIH HHS (1)

Grant ID: R01 AI134384
54 publications

NIGMS NIH HHS (1)

Grant ID: U01 GM110749
47 publications

National Institutes of Health (2)

Grant ID: R01 AI134384
4 publications
Grant ID: NIH/NIAID Center of Excellence in Influenza Research and Surveillance contract (HHS N272201400007C)
1 publication

National Science Foundation (1)

Rapid: Collaborative Research: Agile and effective responses to emerging pathogen threats through open data and open analytics
Sergei Pond, Temple University
Grant ID: 2027196
1 publication

Research Foundation - Flanders (1)

Grant ID: G066215N, G0D5117N and G0B9317N
1 publication

Wellcome Trust (3)

Grant ID: 220977/Z/20/Z
32 publications
Putting genomic surveillance at the heart of viral epidemic response.
Prof Ian Goodfellow, University of Cambridge
Grant ID: 206298/Z/17/Z
140 publications
Host determinants of disease outcomes in arboviral infections
Prof Massimo Palmarini, University of Glasgow
Grant ID: 206369/Z/17/Z
31 publications

Search life-sciences literature (44,722,910 articles, preprints and more)

Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen.

Author information

Affiliations

Authors

Authors

Authors

Authors

ORCIDs linked to this article

Abstract

Free full text

Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen

Oscar A. MacLean

Spyros Lytras

Steven Weaver

Joshua B. Singer

Maciej F. Boni

Philippe Lemey

Sergei L. Kosakovsky Pond

David L. Robertson

Associated Data

Abstract

Introduction

Results

Evidence of nearly neutral evolution and relatively weak purifying selection in the first 11 months of the SARS-CoV-2 pandemic

What about the bats? Evidence for positive selection in Sarbecoviruses

Patterns of CpG depletion and recombination in the nCoV clade

Discussion

Methods

SARS-CoV-2 GISAID sequence filtering

SARS-CoV-2 positive selection

Sarbecoviruses alignment and recombination

Sarbecovirus selection analysis

CpG depletion

Spike recombination analysis

Bayesian inference under a local molecular clock

Identifying shifts in CpG content

Supporting information

S1 Text

S2 Text

S3 Text

S1 Fig

S2 Fig

S3 Fig

S4 Fig

S5 Fig

S6 Fig

S1 Table

S2 Table

S3 Table

S4 Table

S5 Table

S6 Table

S1 Data

Acknowledgments

Abbreviations:

Funding Statement

Data Availability

References

Decision Letter 0

Decision Letter 1

Author response to Decision Letter 1

Attachment

Decision Letter 2

Author response to Decision Letter 2

Attachment

Decision Letter 3

Full text links

Citations & impact

Impact metrics

Citations of article over time

Alternative metrics

Article citations

Data

Data behind the article

BioStudies: supplemental material and supporting data

Similar Articles

Funding

Bill & Melinda Gates Foundation (1)﻿

Bill and Melinda Gates Foundation (1)﻿

Bill & Melinda Gates Foundation (1)

Bill and Melinda Gates Foundation (1)

European Research Council (2)

European Union's Horizon 2020 project MOOD (1)

Medical Research Council (4)

NIAID NIH HHS (1)

NIGMS NIH HHS (1)

National Institutes of Health (2)

National Science Foundation (1)

Research Foundation - Flanders (1)

Wellcome Trust (3)