Only 35 of 100 studies that the Reproducibility Project looked at held up fully to scrutiny.
The past several years have been bruising ones for the credibility of the social sciences. A star social psychologist was caught fabricating data, leading to more than 50 retracted papers. A top journal published a study supporting the existence of ESP that was widely criticized. The journal Science pulled a political-science paper about the effect of gay canvassers on voters’ behavior because of concerns about faked data.
Now, a painstaking yearslong effort to reproduce 100 studies published in three leading psychology journals has found that more than half of the findings did not hold up when retested. The analysis was done by research psychologists, many of whom volunteered their time to double-check what they considered important work.
Their conclusions, reported Thursday in the journal Science, have confirmed the worst fears of scientists who have long worried that the field needed a strong correction.
The vetted studies were considered part of the core knowledge by which scientists understand the dynamics of personality, relationships, learning and memory. Therapists and educators rely on such findings to help guide decisions, and the fact that so many studies were called into question could sow doubt about the scientific underpinnings of their work.
“I think we knew or suspected that the literature had problems, but to see it so clearly, on such a large scale — it’s unprecedented,” said Jelte Wicherts, an associate professor in the department of methodology and statistics at Tilburg University in the Netherlands.
More than 60 studies did not hold up. Among them was one on free will. That study found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.
Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.
A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.
The project began in 2011, when University of Virginia psychology professor Brian Nosek decided to find out whether suspect science was a widespread problem.
He and his team recruited more than 250 researchers, selected 100 studies published in 2008, and rigorously redid the experiments in close collaboration with the original authors.
The new analysis, the Reproducibility Project, found no evidence of fraud or that any original study was definitively false. Rather, it concluded the evidence for most published findings was not nearly as strong as originally claimed.
Nosek said the study is a reminder that a single study rarely provides definitive answers and why scientists often greet new findings by saying, “More research is needed.”
“Any one study is not going to be the last word. Each individual study has some evidence. It contributes some information toward a conclusion. But the real conclusion, when you can say confidently that something is true or false, is based on an accumulation of evidence over many studies,” said Nosek.
And yes, he said at a news conference in New York: “Even this project itself is not … a definitive word about reproducibility.”
The report appears at a time when retractions of published papers are rising sharply in a wide variety of disciplines. Scientists have pointed to a hypercompetitive culture across science that favors novel, sexy results and provides little incentive for researchers to replicate the findings of others, or for journals to publish studies that fail to find a splashy result.
“We see this as a call to action, both to the research community to do more replication, and to funders and journals to address the dysfunctional incentives,” said Nosek, who is also executive director of the Center for Open Science, the nonprofit data-sharing service that coordinated the project published Thursday, in part with $250,000 from the Laura and John Arnold Foundation.
The new analysis focused on studies published in three of psychology’s top journals: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition.
The act of double-checking another scientist’s work has been divisive. Many senior researchers resent the idea that an outsider, typically a younger scientist with less expertise, would critique work that often has taken years of study to pull off.
“There’s no doubt replication is important, but it’s often just an attack, a vigilante exercise,” said Norbert Schwarz, a professor of psychology at the University of Southern California.
Schwarz, who was not involved in any of the 100 studies that were re-examined, said the replication studies themselves were virtually never vetted for errors in design or analysis.
Nosek’s team addressed this complaint in part by requiring the researchers attempting to replicate the findings to collaborate closely with the original authors, asking for guidance on design, methodology and materials.
Most of the replications also included more subjects than the original studies, giving them more statistical power.
Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies.
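The distinction between significance and effect size can be sketched with a toy computation of Cohen’s d, a standard effect-size measure (the numbers below are invented for illustration and do not come from any of the studies in the project):

```python
from statistics import mean, stdev

def cohens_d(sample_a, sample_b):
    """Standardized difference between two group means,
    divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_sd = (((n_a - 1) * stdev(sample_a) ** 2 +
                  (n_b - 1) * stdev(sample_b) ** 2) /
                 (n_a + n_b - 2)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled_sd

# Hypothetical scores: an original study's two groups
# versus a replication's two groups.
original_treated = [5.1, 5.4, 5.9, 6.2, 6.0, 5.7]
original_control = [4.0, 4.3, 4.6, 4.1, 4.4, 4.2]
replication_treated = [4.8, 5.0, 5.3, 4.7, 5.1, 4.9]
replication_control = [4.4, 4.6, 4.9, 4.3, 4.7, 4.5]

d_orig = cohens_d(original_treated, original_control)
d_rep = cohens_d(replication_treated, replication_control)
print(f"original d = {d_orig:.2f}, replication d = {d_rep:.2f}")
```

A replication can show the same direction of effect as the original (d above zero in both cases) while the size of that effect shrinks markedly, which is the pattern the project reported across the 100 studies.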
Yet very few of the redone studies contradicted the original ones; their results were simply weaker.