Already, 60 percent of Americans of Northern European descent — the primary group using the genetic-genealogy sites — can be identified through such databases whether or not they’ve joined one themselves, according to a new study.
The genetic-genealogy industry is booming. In recent years, more than 15 million people have offered up their DNA — a cheek swab, some saliva in a test-tube — to services such as 23andMe and Ancestry.com in pursuit of answers about their heritage. In exchange for a genetic fingerprint, individuals may find a birth parent, long-lost cousins, perhaps even a link to Oprah or Alexander the Great.
But as these registries of genetic identity grow, it’s becoming harder for individuals to retain any anonymity. Already, 60 percent of Americans of Northern European descent — the primary group using these sites — can be identified through such databases whether or not they’ve joined one themselves, according to a study published Thursday in the journal Science.
Within two or three years, 90 percent of Americans of European descent will be identifiable from their DNA, researchers found. The science-fiction future, in which everyone is known whether or not they want to be, is nigh.
“It’s not the distant future, it’s the near future,” said Yaniv Erlich, lead author of the study. Erlich, formerly a genetic-privacy researcher at Columbia University, is chief science officer of MyHeritage, a genetic-ancestry website.
The science involves a search for third cousins. To identify a person through a DNA sample, an investigator uploads a previously analyzed genetic sequence to a database. The goal is to find someone who shares enough DNA to place them in the third cousin or closer range. Most of us have at least 800 people out there, somewhere in the world, who fall into this category. So long as one of these people is in a database, a skilled sleuth may be able to use other publicly available information to start building a family tree and figure out the person’s actual identity.
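A rough back-of-the-envelope sketch shows why even a small database can reach so many people. It is not the study's model. It assumes, purely for illustration, that each of the roughly 800 relatives lands in a database independently and with the same probability; the function name `chance_of_match` and the coverage values are hypothetical.

```python
def chance_of_match(relatives: int, coverage: float) -> float:
    """Chance that at least one of `relatives` people is in a database.

    `coverage` is the fraction of the relevant population in the database.
    Assumption (a simplification): each relative is in the database
    independently, with probability equal to `coverage`.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    # P(at least one match) = 1 - P(no relative is in the database)
    return 1.0 - (1.0 - coverage) ** relatives


if __name__ == "__main__":
    # Hypothetical coverage levels, chosen only to show the trend.
    for coverage in (0.0005, 0.001, 0.002, 0.005):
        odds = chance_of_match(800, coverage)
        print(f"{coverage:.2%} of population in database -> {odds:.0%} chance of a match")
```

Under these simplified assumptions, the odds climb quickly as coverage grows, because only one of the many relatives needs to be in the database.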
That technique has been used in recent months to identify more than 15 suspects in murder and sexual-assault cases. The breakthroughs began in April with an arrest in the case of the Golden State Killer, who terrorized California with rapes and killings in the 1970s and 1980s. Other successes soon followed. A truck driver in Washington state was charged with the 1987 slayings of a Canadian couple; a DJ in Pennsylvania was charged with the slaying of a teacher in 1992.
Watching these developments, Erlich wondered about the odds of identifying any given person through cousins’ DNA in one of these databases.
His analysis is based not on the big genealogy databases such as 23andMe and Ancestry, but on two of the smallest: GEDmatch, which has about 1 million profiles, and MyHeritage, which had about 1.5 million at the time of the study. That’s because, for legal and logistical reasons, the larger sites cannot be easily used to identify anyone other than customers who mail in saliva.
But the smaller sites, set up to help genealogists maximize the odds of finding relatives, are more flexible. GEDmatch allows law-enforcement officials to scan its database in homicide and sexual-assault cases. MyHeritage does not, but it permits uploads from external labs. With both, it’s hard to be sure what’s being uploaded: grandma’s saliva, crime scene blood, a sample from a medical study or something else entirely.
To determine the odds of correctly identifying an individual from a given DNA sample, Erlich and his colleagues — from Columbia University, the Hebrew University of Jerusalem and the New York Genome Center — analyzed 30 DNA kits chosen at random from the GEDmatch database.
Their results were eye-opening. The team found that a DNA sample from an American of Northern European heritage could be tracked successfully to within a third-cousin distance of its owner in 60 percent of cases. A comparable analysis on the MyHeritage site had similar results. (The analysis focused on Americans of Northern European background because 75 percent of the users on GEDmatch and other genealogy sites belong to that demographic.)
Some experts have raised questions about the study’s methodology. Its sample size was small, and it didn’t factor in that more than one match is often required to identify a suspect.
CeCe Moore is a genetic genealogist with Parabon, a forensic-consulting firm. She expressed worry in an email that the Science paper may obscure the difficulty involved in puzzling out someone’s identity; it takes a highly skilled expert to build a family tree from the initial genetic clues.
Still, she said, the take-away of the study is “not news to us.” In recent months Moore has been involved in a dozen homicide and sexual-assault cases that used GEDmatch to identify suspects. Of the 100 crime-scene profiles her firm had uploaded to GEDmatch by May, half were obviously solvable, she said, and 20 were “promising.”
“I think it’s a strong and convincing paper,” said Graham Coop, a population-genetics researcher at the University of California, Davis. In a blog post in May, Coop calculated how lucky investigators had been in the Golden State Killer case. He reached a statistical conclusion similar to Erlich’s: Society is not far from being able to identify 90 percent of people through the DNA of their cousins in genealogical databases.
“This is this moment of, wow, oh, this opens up a lot of possibilities, some of which are good and some are more questionable,” he said.
In an alarming result, the Science study found that a supposedly “anonymized” genetic profile taken from a medical data set could be uploaded to GEDmatch and positively identified. This shows that an individual’s private health data might not be so private after all.
Erlich has urged genealogy companies to consider attaching some sort of cryptographic signature to the genetic profiles they analyze. This would help ensure that whoever uploads a genetic profile is who they say they are, and make it harder for anyone to abuse this data, should they, for example, want to figure out who attended a protest.
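The idea of tagging a profile so that tampering or an unknown origin can be detected might look something like the sketch below. It is only an illustration: it swaps in a shared-secret HMAC from Python's standard library for a true public-key signature, which a real deployment would more likely use, and the key, function names and sample data are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the testing company. A real scheme would
# more likely use a public-key signature so that others can verify a
# profile without being able to forge one.
LAB_KEY = b"example-lab-secret"


def sign_profile(profile: bytes, key: bytes = LAB_KEY) -> str:
    """Return an authentication tag for the raw genetic profile."""
    return hmac.new(key, profile, hashlib.sha256).hexdigest()


def verify_profile(profile: bytes, tag: str, key: bytes = LAB_KEY) -> bool:
    """Check that the profile matches the tag it was uploaded with."""
    expected = sign_profile(profile, key)
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    profile = b"rs0000001 AG\nrs0000002 CC\n"  # made-up sample data
    tag = sign_profile(profile)
    print(verify_profile(profile, tag))           # untouched profile
    print(verify_profile(profile + b"x", tag))    # altered profile
```

A site that accepts uploads could then refuse any profile whose tag fails to verify, such as one assembled from a crime scene or a research data set rather than issued to a customer.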
Possibilities and limits
Daniel MacArthur, a genomics researcher at Massachusetts General Hospital, said he endorses the cryptographic signature, but that it doesn’t go far enough. “We live in a world where people are very interested in obtaining and sharing their genetic data to learn more about themselves,” he said. “It’s a natural human instinct. But legislative protection is required to ensure that it’s not used for nefarious purposes.”
The high-profile use of genetic genealogy to identify violent criminals this summer led to speculation about how else it might be used: to invade people’s medical privacy, to track the identity of undercover agents or in searches by law enforcement or immigration officials in ways that may be more morally ambiguous to some than finding a killer.
Ethicists said greater awareness of the possibilities and limits of the technology is necessary, given that many people don’t realize that a public DNA profile holds information not just about one person but about a family, connecting to hundreds of other people. A sibling shares, on average, half of your genetic profile. A first cousin shares an eighth. A second cousin, 1/32nd.
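The arithmetic behind such fractions can be sketched in a few lines. The standard expectation is that shared DNA halves with each generation back to the common ancestors, and that full relatives are linked through two of them. The function name `expected_shared_fraction` is hypothetical, and the figures are averages; actual sharing between any two relatives varies.

```python
from fractions import Fraction


def expected_shared_fraction(cousin_degree: int) -> Fraction:
    """Average fraction of DNA shared by full cousins of a given degree.

    Degree 0 means full siblings, 1 means first cousins, and so on.
    Each person is (degree + 1) generations below the shared ancestors,
    and there are two shared ancestors, giving
    2 * (1/2) ** (2 * (degree + 1)) = (1/2) ** (2 * degree + 1).
    """
    if cousin_degree < 0:
        raise ValueError("cousin_degree must be 0 or greater")
    return Fraction(1, 2) ** (2 * cousin_degree + 1)


if __name__ == "__main__":
    labels = ["sibling", "first cousin", "second cousin", "third cousin"]
    for degree, label in enumerate(labels):
        print(f"{label}: {expected_shared_fraction(degree)}")
```

Each step out in cousinship cuts the expected share by a factor of four, which is why a third-cousin match carries so little DNA yet can still anchor a family tree.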
“By making this real, and by making people understand just how interconnected we are by our genetics, and how skilled investigators could use these — with a fairly high success rate — to find second and third cousins or even closer relatives, underlines the power of this new technology and really brings home the reality of it,” said Benjamin Berkman, a bioethics researcher at the National Institutes of Health.