All papers in the expert’s recommended reading order. The full collection as the expert intended it.
Introduction
We are used to treating science as a source of knowledge that can be trusted a priori. But is this really the case? How much can we trust the results of psychological studies when making important decisions about our mental health and well-being?
This collection contains articles devoted to checking psychological experiments — how well they replicate. I examine two independent bodies of evidence: the reproducibility of the experiments themselves in large independent multi-lab studies, and the reproducibility of meta-analyses. The latter are especially relevant, since the media often cite their conclusions specifically.
A Bayesian argument that most published findings are probably false — and the math checks out uncomfortably well.
Bayesian arguments are often underestimated, and that's a pity: they let you assess the reliability of seemingly unverifiable information. Start here for a view of the crisis as a whole.
— ES
Background:
Basic statistics; type I/II errors; p-value and what it REALLY means
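Ioannidis's core argument boils down to a short positive predictive value (PPV) computation: PPV = (1−β)R / ((1−β)R + α), where R is the pre-study odds that a tested relationship is real. A minimal sketch of that formula in Python; the specific numbers below are illustrative assumptions, not values from the paper:

```python
def ppv(prior_odds, power, alpha=0.05):
    """Positive predictive value: P(effect is real | the test came out significant).

    prior_odds: pre-study odds R that a tested relationship is real (Ioannidis 2005)
    power:      probability of detecting a real effect (1 - beta)
    alpha:      type I error rate
    """
    # Per 1 null relationship tested there are R real ones:
    true_positives = power * prior_odds    # real effects flagged significant
    false_positives = alpha                # null effects flagged significant
    return true_positives / (true_positives + false_positives)

# Illustrative assumptions: 1 real effect per 10 candidates, 35% power
# (a realistic figure for psychology):
print(round(ppv(prior_odds=0.1, power=0.35), 2))  # → 0.41
```

With only one in ten tested hypotheses true and typical power, under half of the "significant" findings are real, before any bias or flexible analysis is even introduced.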
Nature asks 1,576 researchers if there's a crisis — 52% say "significant crisis," and yet 73% still trust the literature.
SummaryAI
A Nature survey of researchers across disciplines. Over 70% have tried and failed to reproduce someone else's work; over 50% have failed to reproduce their own. The top factors blamed: selective reporting, pressure to publish, low statistical power. Yet paradoxically, most respondents still trust the published literature, and less than 31% think failure to replicate means the original result is wrong. A snapshot of collective cognitive dissonance.
In 2016, 72% of scientists still believed the scientific literature; I wonder how many do now? Interestingly, scientists trust their OWN field more...
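One of the factors the respondents blame, low statistical power, is easy to demonstrate with a toy simulation: even a perfectly real effect "fails to replicate" much of the time when samples are small. A hypothetical sketch (the effect size and sample size are illustrative assumptions, not survey data):

```python
import random

def significance_rate(true_d=0.4, n=30, trials=10_000, seed=1):
    """Share of two-group studies (n per group, unit-variance outcomes) in which
    a z-test on the group mean difference reaches |z| > 1.96."""
    rng = random.Random(seed)
    se = (2 / n) ** 0.5  # standard error of the mean difference when sigma = 1
    hits = 0
    for _ in range(trials):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n        # control group
        mean_b = sum(rng.gauss(true_d, 1) for _ in range(n)) / n   # treatment group
        hits += abs(mean_b - mean_a) / se > 1.96
    return hits / trials

# A genuinely real d = 0.4 effect studied with 30 participants per group:
print(significance_rate())
```

Power here comes out to only about a third, so roughly two of every three faithful replications of this real effect would return non-significant; failure to replicate is not, by itself, proof that the original was wrong.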
Richard A. Klein, Kate A. Ratliff, Michelangelo Vianello et al. · 2014 · Social Psychology
At a GlanceAI
36 labs, 6,344 participants, 13 effects — and the answer depends more on the effect than on who tests it.
SummaryAI
The first "Many Labs" project. 13 classic and contemporary effects replicated across 36 independent samples. In aggregate: 10 of 13 effects replicated consistently, 1 showed weak support, and 2 (flag priming, currency priming) did not replicate at all. Crucially, variation in effect size across labs was small — replicability depended on the effect itself, not the sample or setting. Lab-vs-online and US-vs-international differences were negligible.
A good start for understanding the context. The first replications were on well-known experiments and effects and showed very good results.
100 replications, 5 indicators, 1 uncomfortable conclusion — psychology's published effects are substantially weaker than reported.
SummaryAI
270 researchers replicated 100 studies from three major psychology journals (PSCI, JPSP, JEP:LMC), using five indicators of replication success: only 36% of replications achieved p < .05 (vs. 97% of originals); the mean replication effect size was half the original; 47% of original effect sizes fell within the replication 95% CI; 39% were subjectively rated as replicated; and meta-analytically combining both studies left 68% significant.
Cognitive psychology effects replicated better than social (50% vs. 25% by significance). Replication success was predicted by strength of original evidence (lower p-values and larger effect sizes) rather than by team expertise or replication quality. The authors attribute much of the gap to publication and reporting biases inflating original estimates, and emphasize that the results don't establish any individual effect as true or false but reveal that the field's cumulative evidence base is less certain than assumed.
Interesting results (poor reproducibility; weak effect), but the conclusions are overly cautious, possibly due to fear of the community’s reaction
Richard A. Klein, Michelangelo Vianello, Fred Hasselman et al. · 2018 · Advances in Methods and Practices in Psychological Science
Summary
ML2 replicated 28 effects across 125 samples spanning 36 countries. 14 of 28 (50%) replicated by the traditional significance criterion. The key finding: variation in effect sizes was mostly due to which effect was being tested, not where or by whom. The "hidden moderators" defense (failures to replicate are due to unidentified contextual differences between samples) received little empirical support. When effects failed, they failed consistently across diverse settings.
In all our papers, we write that you can't generalize the conclusions to the population at large. However, apparently, that's not the problem: if the effect exists, then it works everywhere!
— ES
Method:
Crowdsourced multi-lab direct replication
Background:
Their first manuscript is essential reading beforehand.
Colin F. Camerer, Anna Dreber, Felix Holzmeister et al. · 2018 · Nature Human Behaviour
At a Glance
Even Nature and Science papers replicate only 62% of the time — and at half the original effect size.
SummaryAI
The Social Science Replication Project replicated 21 social science experiments published in Nature and Science between 2010–2015. 13 of 21 (62%) replicated in the same direction with a significant effect. The average replication effect size was about 50% of the original. A prediction market among scientists correctly predicted replication outcomes about 75% of the time — suggesting the community has reasonable intuitions about which findings are real.
What struck me here was the authors' observation that good reviewers know whether an effect will replicate — which, to be fair, makes you think they don't always say so out loud…
Daniel Lakens, Elizabeth Page-Gould, Marcel A. L. M. van Assen et al. · 2017
At a Glance
Can you even reproduce a meta-analysis from its own description? Often: no.
Summary
A preliminary audit of whether published meta-analyses in psychology can be reproduced from the information they report. The findings pointed to widespread problems: missing data, unclear coding decisions, insufficient reporting of methods and choices.
Note: this is a preliminary report on OSF. A later manuscript addresses the details.
We often rely on meta-analyses, but: a) they often don’t provide their raw data (=> can’t be rechecked), b) they handle statistics in an extremely “loose” manner
— ES
Method:
Methodological & statistical assessment based on the published data
Esther Maassen, Marcel A. L. M. van Assen, Michèle B. Nuijten et al. · 2020 · PLOS ONE
At a Glance
In half of the meta-analyses, there isn’t enough data for a proper verification.
Summary
The authors attempted to reproduce 500 primary study effect sizes from 33 randomly selected psychological meta-analyses. Only ~55% were fully reproducible. The rest were incomplete (11%), incorrect (14%), or ambiguous (19%). In 30 of 33 meta-analyses, at least some effect sizes contained errors. The good news: when they recalculated meta-analytic results with corrected effect sizes, the overall conclusions mostly didn't change (significance of pooled effects was preserved). The bad news: that's partly because errors were random rather than systematic.
Half the effect sizes in psychology meta-analyses can't be reproduced from the primary papers. Sleep well.
Amanda Kvarven, Eirik Strømland, Magnus Johannesson · 2019 · Nature Human Behaviour
At a Glance
Meta-analyses overestimate effect sizes 3x compared to multi-lab replications. Three times.
SummaryAI
The headline study of this sub-field. The authors matched 15 psychological effects that had both a published meta-analysis and a pre-registered multi-lab replication (from Many Labs, RRR, etc.). The mean meta-analytic effect size was 0.42 versus a mean replication effect size of 0.15 — a nearly threefold overestimate. The difference was significant for 12 of 15 pairs. Bias correction methods (trim-and-fill, PET-PEESE, 3PSM) reduced the gap somewhat but couldn't fully explain it. No evidence for "replicator selection" (that replicators systematically choose weak study designs).
A good idea is to compare the conclusions from meta-analyses with the results of replication studies conducted by independent laboratories. The outcome, I suppose, is clear to you...
— ES
Method:
Comparison between the literature and independent estimates
Background:
Basic statistics; but for the content, it's essential to review the Many Labs and OSC projects first to get clarity on the independent verifications.
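The mechanism usually suspected here, publication bias, can be illustrated with a toy simulation: if only significant estimates enter the literature, the pooled published estimate of a small true effect is badly inflated. A sketch under illustrative assumptions (the parameters are mine, not Kvarven et al.'s):

```python
import random

def mean_published_effect(true_d=0.15, n=30, studies=20_000, seed=7):
    """Average effect estimate among studies that clear z > 1.96, when the
    true standardized effect is true_d (two groups, n per group)."""
    rng = random.Random(seed)
    se = (2 / n) ** 0.5                 # standard error of the effect estimate
    published = [
        est
        for _ in range(studies)
        if (est := rng.gauss(true_d, se)) / se > 1.96  # only significant results survive
    ]
    return sum(published) / len(published)

# True effect 0.15 (the replication-side mean in Kvarven et al.); the average
# of the "published" estimates comes out several times larger.
print(round(mean_published_effect(), 2))
```

Under these assumptions the published average lands around four times the true effect. Real literatures also contain some non-significant studies, which is one reason the empirical gap (0.42 vs. 0.15) is "only" threefold — and why bias corrections alone couldn't close it.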
Molly Lewis, Maya B. Mathur, Tyler J. VanderWeele et al. · 2022 · Royal Society Open Science
At a Glance
Meta-analyses and MLRs disagree — but they're still correlated (r = 0.72), so don't throw the baby out with the bathwater.
Summary
A re-analysis and commentary on Kvarven et al. (2019). Lewis et al. show that while meta-analytic estimates are systematically larger, they're strongly correlated with MLR estimates (r = 0.72) — meaning meta-analyses are informative, not worthless. Using sensitivity analyses for publication bias (worst-case selection models), they find that publication bias alone cannot fully account for the discrepancy in 8 of 15 cases. They consider alternatives: genuine effect heterogeneity from minor methodological differences, differential intervention fidelity, and possible context-sensitivity of social-psychological effects. The core conclusion: the discrepancy is real and still largely unexplained.
An alternative view of the results of the previous article (Kvarven et al. (2019)): that it’s not the meta-analyses that are to blame, but something else. Useful for general context and variety.
Rubén López-Nicolás, Daniel Lakens, Jose A. López-López et al. · 2024 · Advances in Methods and Practices in Psychological Science
At a GlanceAI
Even clinical meta-analyses (the ones that guide actual treatment) are only 67% reproducible.
Summary
From 100 randomly selected articles on clinical-psychological interventions (2000–2020), 217 meta-analyses were evaluated. Only 67% were "process reproducible" (i.e., the underlying data could be retrieved at all). Of those, 52 showed discrepancies > 5% in their main results. After multi-stage correction (fixing coding errors, qualitative assessment, contacting authors), 27 meta-analyses from 10 papers remained irreproducible. The process-reproducibility rate improved over time (41% for 2000–2010, 80% for 2016–2020), suggesting data-sharing norms are working. Most numerical discrepancies were minor and didn't change conclusions — but data availability remains the biggest barrier.
The authors rightly note that the key problem is the lack of access to the original raw data, plus the fact that if a paper says “data upon request,” it usually means “no.” I agree 100%.
— ES
Method:
Multi-stage reproducibility audit of randomly selected meta-analyses
John Protzko, Jon Krosnick, Leif Nelson et al. · 2023 · Nature Human Behaviour
At a GlanceAI
"The crisis is over!" said the paper. Then it got retracted.
SummaryAI
The authors claimed that newly discovered social-behavioral effects, when selected and designed with methodological rigor (large samples, pre-registration, representative sampling), could achieve high replicability. The paper was published in Nature Human Behaviour and initially received as evidence that the crisis had been "solved." It was subsequently retracted due to concerns about the data and analyses.
The best example in the selection! The article that claimed the crisis had been overcome had to be retracted due to doubts about the data and analysis :)