We rely on research journals to vet what they publish, but does every study these academic gatekeepers accept truly hold water? Wharton marketing professor Gideon Nave worked with a multinational group of researchers on a project that aimed to replicate the results of 21 social science experiments published in the journals Nature and Science. Only 13 of the replications produced results that supported the original studies. A surprising 38 percent, or eight of the 21, failed to generate the same results. Nave sat down with Knowledge@Wharton to discuss the findings of his team’s paper, “Evaluating the Replicability of Social Science Experiments in Nature and Science Between 2010 and 2015,” and the future of research.
Knowledge@Wharton: Studies like yours have shown that the results of high-profile experiments often can’t be replicated. Do you think there’s a “replication crisis,” as people are calling this?
Gideon Nave: I don’t know if I want to use the word “crisis” to describe it, but we certainly know that many results that are published in top academic journals, including classic results that are parts of textbooks and TED Talks, don’t replicate well, which means that if you repeat the experiment with exactly the same materials in a different population, sometimes in very similar populations, the results don’t seem to hold. Top academic journals like Science and Nature, which are the ones we used in this study, have acceptance rates of something like five percent of papers that are submitted, so it’s not like they don’t have papers to select from. In my view, the replication rates we’ve seen in these studies are lower than what you would expect.
K@W: Can you describe some of the experiments you tried to replicate?
Nave: The experiments we used were social science experiments involving human participants, either online or in laboratory studies. The experiments we selected also typically had some manipulation, meaning there is an experimental setting where half of the population gets some treatment and the other half gets another. For example, we had a study in which people looked at a picture of a statue. In one condition, it was Rodin’s The Thinker, and in the other one, it was a man throwing a discus. The assumption of the researchers was that when you show people Rodin’s Thinker, it makes them more analytical, so this was the manipulation. And then they measured people’s religious beliefs. The finding that the paper reported was that when you look at the picture of the Rodin and become more analytical, you are less likely to report that you believe in God.
K@W: What did you find when you tried to replicate that one?
Nave: This study specifically did not replicate. I think the problem is the manipulation itself. I’m not sure that looking at the Rodin statue makes you more analytical in the first place.
K@W: You looked at 21 experiments. What were some of the key takeaways from the entire project?
Nave: There is an ongoing debate in the social sciences as to whether there is a replication problem or not. The results of previous studies that failed to replicate a large number of papers published in top journals in psychology and economics were dismissed by some of the researchers. Some said that this was just some kind of statistical fluke, or maybe the replications weren’t sufficiently similar to the original experiments. We wanted to overcome some of these limitations. In order to do so, we first sent all of the materials to the original authors and got their endorsement of the experiment. In case we got something wrong, we also got comments from them. There was joint collaboration with the original authors in order to replicate the experiment as closely as possible to the original. The second thing we did was preregister the analysis, so everything was open online. People could go and read what we were doing. Everything was very clear a priori—before we ran the studies—in terms of what analyses we would use. The third thing was using much larger samples than the originals. Sample size is a very important factor in an experiment. If you have a large sample, you’re more likely to be able to detect effects that are smaller. The larger your sample, the better the estimate you have of the effect size, and the better your capacity to detect effects that are smaller.
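To make the sample-size point concrete, here is a minimal sketch of a textbook power calculation for comparing two groups. It is not taken from the paper: the function name `n_per_group`, the normal approximation, and the default significance level and power are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.9):
    """Approximate participants needed per group to detect a standardized
    difference in means (`effect_size`) with a two-sided test.

    Uses the common normal approximation n = 2 * ((z_alpha + z_power) / d)^2.
    The defaults for alpha and power are illustrative assumptions.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Halving the effect size roughly quadruples the required sample.
    for d in (0.5, 0.25):
        print(d, n_per_group(d))
```

Under these assumptions the required sample grows with the inverse square of the effect size, which is why detecting a smaller effect than the one originally reported calls for a much larger study.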
K@W: Even in the studies that did replicate, the effect size was much smaller, correct?
Nave: Yes. We’ve seen it in previous studies. Again, in this study, because the samples were so large, the studies that failed to replicate had essentially a zero effect. But then we could tease apart the studies that didn’t replicate from the ones that did replicate. Even the studies that did replicate well had on average an effect that was only 75 percent of the original, which means that the original studies probably overstated the size of the effect by 33 percent. This is something one would expect to see if there is a publication bias in the literature. If results that are positive are being published and results that are negative are not being published, you expect to see an inflation of the effect size. This is what we saw in the studies. This means that if you want to replicate the study, you probably in the future want to use a larger number of participants than what you had in the original, so you can be sure you’ll detect an effect that is smaller than what was reported originally.
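The arithmetic behind the 75 percent and 33 percent figures, and the way publication bias can inflate effect sizes, can be sketched as follows. This is an illustrative simulation, not the team's analysis: the true effect, sample size, number of studies, and the rule that only positive, statistically significant results get published are all assumptions.

```python
import random
from statistics import NormalDist, mean


def overstatement(replication_ratio):
    """If the replicated effect is `replication_ratio` times the original,
    return the fraction by which the original overstated it."""
    return 1 / replication_ratio - 1


def simulate_published_effects(true_effect=0.2, n_per_group=50,
                               studies=5000, alpha=0.05, seed=1):
    """Return (mean of all estimates, mean of 'published' estimates).

    Simplification: each study's estimate is drawn directly from a normal
    sampling distribution with unit-variance outcomes. A study counts as
    published only if its estimate is positive and significant.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # standard error of a difference in means
    all_estimates, published = [], []
    for _ in range(studies):
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        if estimate / se > z_crit:  # hypothetical publication filter
            published.append(estimate)
    return mean(all_estimates), mean(published)


if __name__ == "__main__":
    print(overstatement(0.75))           # about one third, i.e. 33 percent
    print(simulate_published_effects())  # published mean exceeds overall mean
```

In this toy setup the average of all estimates stays close to the true effect, while the average of the published subset is noticeably larger, which is the inflation pattern described above.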
K@W: This was a collaboration of all these researchers. What was their reaction to the results?
Nave: The reactions were pretty good, overall. When this crisis debate started and there were many failures to replicate the original findings, replication wasn’t a normal thing to do. It was perceived by the authors of the original studies as hostile. I have to say, it doesn’t feel nice when your own study doesn’t replicate. But now, there is more acceptance that it’s okay if your study doesn’t replicate. It doesn’t mean you did something bad on purpose. It can happen.
K@W: Could this type of research change how larger journals and high-profile journals like Nature and Science accept papers?
Nave: I think it already has. If we look at the studies that we replicated, all of the experiments that failed to replicate took place between 2010 and 2013. From the past two years of the studies that we selected, everything replicated. Those were only four studies, so I’m not going to make bold claims like “Everything now replicates.” But it’s very clear that there were changes in journal policies. This is especially true for psychology journals, where one now has to share the data, share the analysis scripts. You get a special recognition badge when you preregister the study. Preregistration is a very important thing. It’s committing to an analysis plan before you do the experiment. When you do that, you limit the amount of bias that your own decisions can induce when you analyze the data. There were previous studies conducted, mostly here at Wharton, showing that when you have some flexibility in the analysis, you’re very likely to find results that are statistically significant but don’t reflect a real effect.
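One simple way to see why analytic flexibility matters is to simulate a researcher who measures several outcomes when no real effect exists and reports a finding if any one of them is significant. This sketch is not the method of the studies mentioned above. It assumes independent outcomes and a standard normal test statistic, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(num_outcomes, studies=20000, alpha=0.05, seed=7):
    """Share of simulated null studies in which at least one of
    `num_outcomes` independent tests comes out significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(studies):
        # With no true effect, each test statistic is standard normal.
        if any(abs(rng.gauss(0, 1)) > z_crit for _ in range(num_outcomes)):
            hits += 1
    return hits / studies


def analytic_rate(num_outcomes, alpha=0.05):
    """Closed-form counterpart for independent tests: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** num_outcomes


if __name__ == "__main__":
    for k in (1, 5):
        print(k, false_positive_rate(k), analytic_rate(k))
```

Committing in advance to a single outcome and analysis, as preregistration does, corresponds to the `num_outcomes=1` case, where the false-positive rate stays near the nominal five percent.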
K@W: What’s the next step in your own research?
Nave: One of the things we’ve done is try to use machine learning to go over the papers and see whether an algorithm can predict whether studies will replicate or not. For this specific experiment, the algorithm can detect replicability in something like 80 percent of the cases, which is not bad at all. So we’re working on automating this process. Another thing is just to continue to replicate. Replicability should be an integral part of the scientific process. We have neglected it, maybe for some time. Maybe it was because people were perceived as belligerent or aggressive if they tried to challenge other people’s views. But when you think of it, this is the way science has progressed for many years. If a study doesn’t replicate, you’d better know it before building on it and standing on the shoulders of the researchers who conducted it. The Rodin study had about 400 citations in as little as four years. These papers have a high impact on many disciplines.
Published as “Is There a Replication Crisis in Research?” in the Spring/Summer 2019 issue of Wharton Magazine.