An American scientist has incited a new skirmish over the origin of the coronavirus, reporting that he has retrieved potentially significant genetic data about SARS-CoV-2 that had been stored in, and later deleted from, a digital archive at the National Institutes of Health.
Jesse Bloom, a computational biologist at the Fred Hutchinson Cancer Research Center in Seattle, posted his findings on the preprint server bioRxiv, where papers that have not yet been peer-reviewed or published in a journal have been landing by the thousands since the start of the pandemic.
The scientific significance of Bloom’s research remained unclear Wednesday, but it stirred instant online reaction, favorable and unfavorable alike, among scientists who have been debating the flurry of theories about the initial coronavirus outbreak.
“I recognize this is a hot-button topic,” Bloom said in an interview with The Washington Post. “It’s not a highly traditional scientific study, but at least it has some new data and new information.”
Bloom, who retrieved the data through Google Cloud, does not claim that it advances one theory or another. But he contends it bolsters evidence that the virus was circulating in Wuhan, China, before the December 2019 outbreak, linked to a market selling live animals, of COVID-19, the illness caused by the virus.
What is not in dispute is that raw data was deleted from a database at the NIH. Processed forms of the same data were included in a preprint paper from Chinese scientists posted in March 2020 and, after peer review, published that June in the journal Small.
The NIH released a statement Wednesday saying that a researcher who originally published the genetic sequences asked for them to be removed from the NIH database so that they could be included in a different database. The agency said it is standard practice to remove data if requested to do so. The NIH statement did not identify the scientist who requested that the material be excised from the agency’s Sequence Read Archive, known as the SRA.
“These SARS-CoV-2 sequences were submitted for posting in SRA in March 2020 and subsequently requested to be withdrawn by the submitting investigator in June 2020. The requester indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA to avoid version control issues,” the NIH said.
The statement said the NIH “can’t speculate on motive beyond a submitter’s stated intentions.”
Bloom’s paper acknowledges that there are benign reasons why researchers might want to delete data from a public database.
The data Bloom cited are not the only submissions withdrawn from the NIH during the pandemic. The agency, in response to an inquiry from The Post, said the National Library of Medicine has so far identified eight instances since the start of the pandemic in which researchers withdrew submissions to the library.
“This one from China and the rest from submitters predominantly in the U.S.,” the NIH said in its response. “All of those followed standard operating procedures.”
Bloom said in an email to The Post that he was not accusing the NIH of wrongdoing. But Bloom’s online paper suggests the deletion of data violates scientific norms and the code of trust essential to science. On Twitter, Bloom said the data was also taken down from a Chinese database.
“Certainly, the consequence of removing the sequences was to obscure their existence,” Bloom told The Post in the interview.
In the preprint, he wrote that “the current study suggests that at least in one case, the trusting structures of science have been abused to obscure sequences relevant to the early spread of SARS-CoV-2 in Wuhan.”
Efforts by The Post to reach the senior author of the sequencing paper have been unsuccessful.
Robert Garry, a Tulane University virologist who co-wrote an influential March 2020 paper saying SARS-CoV-2 was a natural virus and not engineered, took issue with the new Bloom paper. Among his criticisms: The key data from the China study, a list of mutations seen in the virus sequences, has remained available to researchers in an appendix. He said Bloom found the same mutations.
“Jesse Bloom found exactly nothing new that is not already part of the scientific literature,” Garry wrote in an email. He called the Bloom paper “inflammatory.”
Benjamin Neuman, a virologist at Texas A&M University, agreed that the data on mutations remained public. Neuman said he understood Bloom’s goal — to use the raw genomic sequences to construct what is known as a phylogenetic tree of SARS-CoV-2. Such a diagram would show how and when the virus evolved and splintered into different lineages.
“The question is what constitutes adequate publication?” Neuman said in an email. “Is it having access to the data, which we have through [the paper from China], or access to the data in your preferred form, which is what Bloom mined out? It’s the exact same data in refined vs. raw form.”
Bloom is no stranger to the debate over the virus’s origins. He was the lead author of a letter to the journal Science, signed by an additional 17 prominent scientists, that last month criticized a World Health Organization probe into the origins of the virus. The letter called for a deeper investigation of the “lab leak” hypothesis, which asserts that the coronavirus — accidentally or by design — potentially slipped out of a laboratory in Wuhan.
Stanford University microbiologist David Relman, another organizer of that letter, said of Bloom’s findings: “It shows how critical it is that early data be sought, preserved, and shared in trying to infer virus evolutionary paths and origins, since early data are always sparse to begin with, and since analyses are therefore so sensitive to specific data that happen to be available.”
In his paper, Bloom does not claim that the data he retrieved advances the argument for a lab leak or a natural zoonosis.
“This study provides no evidence either way,” Bloom said in an email. “But it does indicate that we probably have not exhausted all relevant data.”
He added, “I think as scientists we really need to focus on the following two questions: How can we get more data? How can we better analyze the data we have?”
Bloom said the deleted sequences he recovered reinforce a notion supported by previous analyses, including a conclusion from the WHO-convened investigation into the virus’s origins conducted earlier this year: The virus probably infected people before the outbreak at the Huanan Seafood Market in December 2019. That spreading event, though large, was not necessarily the first instance of SARS-CoV-2 in humans.
After Bloom learned that sequences had been removed from the NIH archive, he checked Google Cloud storage, where the archive’s files are usually also present.
“I just started entering sample numbers,” he said, “and files on the Google Cloud came up.” Raw sequencing data from the cloud storage allowed him to reconstruct partial sequences of a dozen viruses from early in the pandemic.
Bloom was the sole author of this preprint, he said, because he did not want to ask a student or postdoctoral researcher in his lab to contribute to a report on such a charged topic. He said he shared the paper with the NIH and other researchers before posting it online.
“I’m cautiously optimistic that this is not the last piece of information that can be found” in alternative repositories or supplemental materials of obscure papers from early in the pandemic, Bloom said.
Ian Lipkin, a Columbia University epidemiologist, said by email that Bloom’s paper offers “evidence of what many of us speculated — that the virus was circulating before the market outbreak. The retraction of sequence data is unprecedented and must be addressed.”
University of California at San Diego evolutionary biologist Joel Wertheim, who has studied the emergence of the virus in Hubei province, said, “I actually don’t think this study adds much to the origins debate.”
The sequences Bloom analyzed are more similar to related coronaviruses found in bats than is the virus that infected many people at the seafood market. But researchers were already aware of two genetic lineages of the coronavirus that spread in Wuhan in January and February 2020, Wertheim said, and “these genome fragments further demonstrate this point.”
Speculation emerged on Twitter on Wednesday that Bloom’s findings could alter the timeline of the virus’s emergence, but Wertheim said that’s doubtful: “I’m not convinced that this paper makes a strong case for altering our molecular clock estimates, since similar — more complete — data were included in previous studies.”
President Joe Biden has ordered intelligence agencies to conduct a review of information that could shed light on the origins of the virus. In an interview with Yahoo News published Tuesday, Director of National Intelligence Avril Haines said the ultimate answer might never be found.
“We’re hoping to find a smoking gun,” Haines said, but “it’s challenging to do that.” She added: “It might happen, but it might not.”
Haines said that teams were seeking to collect new intelligence, in addition to taking a fresh look at information that was already gathered.