The Research Was Wrong: Medical Reversals and the Replication Crisis
Hundreds of standard medical practices have been overturned by later evidence. This isn't a scandal — it's science working. But it means population research is a starting point, not an answer.
The treatment your doctor recommended for a decade was tested against placebo last year. The placebo won.
In 2002, arthroscopic knee surgery for osteoarthritis was one of the most common orthopedic procedures in the United States — about 650,000 per year, at a cost of roughly $5,000 each. The surgery involved inserting a camera into the knee and either removing debris or smoothing cartilage, depending on the patient. Surgeons had been doing it since the 1970s. Patients reported improvement. The procedure made physiological sense.
Then a randomized controlled trial compared it directly to a sham surgery: a procedure in which the surgeon made small incisions in the skin and simulated the operation, but never inserted the instruments or treated the joint itself. The sham surgery group reported as much improvement as the real surgery group. The surgery, performed on millions of knees, worked no better than placebo.
This is not an isolated case. It is the norm.
The scale of medical reversals
In 2013, researchers led by Vinay Prasad reviewed every original article published in the New England Journal of Medicine over a ten-year period (2001 to 2010): 2,044 articles representing what was considered the frontier of medical knowledge. Of those, 363 tested an existing standard of care. Of those 363 tests of established practice, 146 (40.2%) reversed it, meaning they found that the current standard was no better than a less invasive or cheaper alternative, or was outright harmful.
A separate analysis published in eLife in 2019 screened more than 3,000 randomized trials from three leading journals (JAMA, the Lancet, and NEJM) and identified 396 reversals of established practice. Direct replication efforts tell a similar story. In psychology, the Reproducibility Project attempted to replicate 100 studies published in top journals: only 36% of the replications produced a statistically significant result. In cancer biology, a replication effort that reported in 2021 found that only about half of the effects it tested held up.
The point is not that science is broken. The point is that any individual study, no matter how well-designed, is a probabilistic claim. It is the best available evidence at a moment in time, not the truth.
Why this happens
Several overlapping problems compound each other.
Publication bias. Journals prefer positive results. Negative results ("we tested this and it didn't work") are rarely published and rarely read. This means the published literature systematically overrepresents findings that look large and consistent, whether or not they are real, and underrepresents effects that are small, inconsistent, or absent. If ten labs test the same hypothesis and three find a positive result by chance while seven find nothing, the three positive results may get published and the seven nulls end up in file drawers. The literature then looks like strong evidence for an effect that is largely noise.
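The file-drawer effect is easy to reproduce in a toy simulation. The sketch below is illustrative, not a model of any real literature: it assumes a true effect of exactly zero, 20 participants per group, a simple z-test with known unit variance, and a hypothetical rule that only significant positive results get published. The function name `simulate_file_drawer` and every parameter value are made up for the example, and it runs far more than ten labs so the averages are stable.

```python
import random
import statistics


def simulate_file_drawer(n_labs=10_000, n_per_group=20, true_effect=0.0, seed=0):
    """Simulate many labs testing one hypothesis; 'publish' only significant positives."""
    rng = random.Random(seed)
    z_crit = 1.96                       # two-sided 5% threshold (assumed z-test)
    se = (2 / n_per_group) ** 0.5       # standard error of a mean difference, sd = 1
    all_estimates, published = [], []
    for _ in range(n_labs):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        if diff / se > z_crit:          # assumed rule: nulls go in the file drawer
            published.append(diff)
    return all_estimates, published


all_estimates, published = simulate_file_drawer()
print(f"labs run: {len(all_estimates)}, published: {len(published)}")
print(f"mean effect, all labs:       {statistics.fmean(all_estimates):+.3f}")
print(f"mean effect, published only: {statistics.fmean(published):+.3f}")
```

Under these assumptions only a few percent of labs "publish", yet the published studies on their own suggest a sizeable effect, while the full set of results averages out near zero.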
P-hacking and researcher degrees of freedom. A p-value below 0.05 is the conventional threshold for statistical significance and, in practice, for a publishable result. Researchers, consciously or not, can push toward this threshold by trying multiple outcomes, multiple analytical approaches, or multiple ways of defining who counts in the sample; by stopping data collection as soon as significance is reached; or by excluding outliers until the p-value drops below the threshold. Each individual decision seems defensible. The cumulative effect is that many published "significant" results are false positives.
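A back-of-envelope calculation shows how quickly this flexibility adds up. The sketch below assumes the analyses are independent and that there is no real effect, which is a simplification: real outcomes are usually correlated, so the true inflation is somewhat smaller. The function name is hypothetical.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests of a true null gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    rate = chance_of_false_positive(k)
    print(f"{k:>2} analyses tried -> {rate:.0%} chance of at least one spurious hit")
```

With ten independent looks at the data, the chance of at least one result below 0.05 is about 40%; with twenty it is about 64%.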
Underpowered studies. The effect size assumed in a study's power calculation is often optimistic, because it's based on earlier, underpowered studies that overestimated the effect. Small samples reach statistical significance only when the estimated effect happens to be large, which inflates apparent effect sizes in the published literature. Later, larger studies find smaller effects, which looks like "the research was contradicted" but is actually the research converging on the true value.
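The arithmetic behind this can be sketched with a normal-approximation power calculation. This is a rough illustration, assuming a two-group comparison, a two-sided test at 0.05, and standardized effect sizes (0.5 assumed, 0.2 true) chosen purely for the example. `power_two_group` is a hypothetical helper, not a substitute for a proper power analysis.

```python
from statistics import NormalDist


def power_two_group(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group z-test for a standardized effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5   # expected value of the z-statistic
    # Probability that the z-statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


n = 64  # per group; sized for about 80% power if the optimistic assumption were right
print(f"assumed effect d = 0.5: power = {power_two_group(0.5, n):.2f}")
print(f"true effect    d = 0.2: power = {power_two_group(0.2, n):.2f}")
```

A study sized for roughly 80% power under the optimistic assumption has only about 20% power if the true effect is the smaller one, so the few such studies that do reach significance tend to be the ones that overestimated the effect.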
Population mismatch. Clinical trials are often conducted in narrow populations: predominantly male, predominantly white, predominantly middle-aged, and often recruited at academic medical centers whose patients are atypical. The results generalize less cleanly than the abstract suggests. A drug that works in a specific subgroup gets licensed for a broad population, where the benefit is smaller or absent.
The surrogate endpoint problem. Many trials measure something that is assumed to correlate with the outcome you actually care about — LDL cholesterol as a proxy for heart attack, blood pressure as a proxy for stroke. Drugs that hit the surrogate sometimes fail to improve the actual outcome. HDL-raising drugs are a clear example: despite strong epidemiological associations between HDL and cardiovascular health, drugs that raised HDL did not reduce heart attacks.
Notable reversals
These are not fringe cases. They involve treatments prescribed to hundreds of millions of people.
Hormone replacement therapy. For decades, observational studies suggested that postmenopausal women on HRT had lower rates of heart disease. The mechanism was plausible: estrogen has known cardioprotective effects in premenopausal women. Prescriptions were widespread. The Women's Health Initiative RCT, which randomized 16,000 women, found that combined HRT actually increased coronary heart disease risk, breast cancer risk, stroke risk, and pulmonary embolism risk. Prescriptions fell by 50% within two years.
Arthroscopic knee surgery. The 2002 sham-controlled trial described above, published in NEJM, has been followed by others with similar results. A 2013 Finnish trial of arthroscopic surgery for degenerative meniscal tears found no benefit over sham surgery. A 2017 trial found no benefit over physical therapy. The surgery is still performed at scale.
Low-fat dietary guidelines. The diet-heart hypothesis — that saturated fat raises LDL, LDL causes heart disease, therefore saturated fat causes heart disease — was the basis for low-fat dietary guidelines from the 1980s onward. The evidence base was largely observational, with influential contributions from Ancel Keys's Seven Countries Study. Subsequent meta-analyses of RCTs found no significant effect of reducing saturated fat on cardiovascular mortality. The relationship between dietary fat and cardiovascular disease is more complicated than the guidelines assumed.
Antiarrhythmic drugs post-heart attack. In the 1980s, it was known that ventricular arrhythmias after a heart attack predicted higher mortality, and that antiarrhythmic drugs suppressed those arrhythmias. The reasoning was that suppressing the risk factor should reduce the outcome. The Cardiac Arrhythmia Suppression Trial (CAST), published in 1989, randomized post-MI patients to antiarrhythmic drugs or placebo. The drug group had higher mortality: the drugs suppressed the arrhythmias and killed more patients. Routine use of these drugs to suppress arrhythmias after a heart attack was abandoned.
Routine episiotomy during childbirth. This was standard practice for decades, on the grounds that a clean surgical cut heals better than a tear. Multiple RCTs found that restrictive episiotomy policies led to less severe perineal trauma and faster healing than routine episiotomy. Routine episiotomy is no longer recommended.
Vitamin E and heart disease. Observational studies found that people with higher vitamin E intake had lower rates of cardiovascular disease. Several RCTs found no benefit. The HOPE trial found no reduction in cardiovascular events with 400 IU/day of vitamin E. Later meta-analyses found a small increase in all-cause mortality at high doses.
What this means for personal health decisions
The reversals do not mean you should ignore population research. They mean you should calibrate your confidence in it correctly.
A large, pre-registered RCT in a relevant population, replicated by independent groups, with a clinically meaningful outcome (not a surrogate), and a plausible mechanism — this is strong evidence. It should update your priors substantially.
A single observational study, or a small RCT in a narrow population, or a trial measuring a surrogate endpoint, or a finding that has not been replicated — this is weak-to-moderate evidence. It should update your priors modestly. It is a starting point for deciding what to test, not a recommendation to act.
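One way to make "strong" and "weak" concrete is Bayes' rule applied to a statistically significant result. The sketch below is a simplification that ignores bias and p-hacking, and the prior (10%) and power values (20% and 90%) are assumptions picked for illustration, not estimates from any study mentioned here. `prob_finding_is_true` is a hypothetical name.

```python
def prob_finding_is_true(prior: float, power: float, alpha: float = 0.05) -> float:
    """Bayes' rule: P(effect is real | the study reported a significant result)."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only: a 10% prior that the effect is real.
weak = prob_finding_is_true(prior=0.10, power=0.20)          # small, underpowered study
strong = prob_finding_is_true(prior=0.10, power=0.90)        # large, well-powered trial
replicated = prob_finding_is_true(prior=strong, power=0.90)  # plus an independent replication

print(f"after one underpowered study:   {weak:.0%}")
print(f"after one well-powered trial:   {strong:.0%}")
print(f"after an independent replication: {replicated:.0%}")
```

Under these assumed inputs, a significant result from the underpowered study leaves the claim at roughly 31%, the well-powered trial at roughly 67%, and an independent replication of the latter at roughly 97%.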
The crucial problem is that most health claims (in the media, in wellness culture, and often in clinical practice) are presented with a confidence that the underlying evidence does not support. When you read "X reduces dementia risk by 30%," the correct response is to ask: what was the study design, what was the population, what was the absolute risk reduction, has it been replicated, and what is the plausible mechanism?
Most wellness advice fails this test immediately. Usually it fails not because the underlying finding is false, but because the confidence is unearned.
The case for personal experimentation
Population research tells you what happened, on average, in a specific group, under specific conditions. It cannot tell you what will happen to you.
This is not a limitation that better research can fix. It is a feature of biology. You are not the population average. Your metabolism, genetics, sleep patterns, stress levels, gut microbiome, chronotype, and history are different from the average study participant. The effect that produced a statistically significant result in 500 people may be driven by a subgroup you are not in, or may average out a positive effect in some people and a negative effect in others, including you.
The appropriate response to the replication crisis is not cynicism — "nothing is reliable, ignore it all" — and not naive deference — "the guidelines said so." The appropriate response is to treat population research as a prior: a starting point that updates your probability of an effect being real, but that cannot substitute for measuring your own response.
Your own response is measurable. Not every question can be answered with a self-experiment, but many relevant questions can. Does delaying caffeine until 10 AM actually reduce your afternoon crash? Does a 65°F bedroom actually improve your sleep score? Does a post-lunch walk actually improve your afternoon focus? These are not mystical questions. They are empirical questions about your body that can be answered with a few weeks of structured measurement.
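As a rough sketch of what "structured measurement" could look like, the example below compares days with and without a post-lunch walk using a permutation test. The ratings are made up for illustration, the function name is hypothetical, and the approach assumes the walk days were chosen at random in advance (a coin flip each morning, say) so that days are comparable.

```python
import random
import statistics


def permutation_test(with_change, without_change, n_shuffles=10_000, seed=0):
    """Two-sided permutation test on the difference in means between two sets of days."""
    rng = random.Random(seed)
    observed = statistics.fmean(with_change) - statistics.fmean(without_change)
    pooled = list(with_change) + list(without_change)
    n_with = len(with_change)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # relabel days at random, as if the walk made no difference
        diff = statistics.fmean(pooled[:n_with]) - statistics.fmean(pooled[n_with:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


# Made-up afternoon focus ratings (1 to 10), for illustration only.
walk_days = [7, 6, 8, 7, 7, 8, 6, 7, 8, 7]
no_walk_days = [6, 5, 7, 6, 6, 5, 7, 6, 5, 6]

diff, p_value = permutation_test(walk_days, no_walk_days)
print(f"mean difference: {diff:+.2f}, permutation p-value: {p_value:.3f}")
```

The p-value here is the share of random relabelings that produce a gap at least as large as the one observed. A small value suggests the difference is unlikely to be day-to-day noise, though a few weeks of subjective ratings remains weak evidence by the standards applied to published research above.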
The medical reversal literature is, in this light, not a reason to distrust science. It is evidence that the only population you can definitively study is n=1. Everything else is a useful prior.
How to read research without being misled
A few heuristics that hold up under scrutiny:
Distrust relative risk without absolute risk. "Reduces risk by 30%" means nothing without the baseline. A 30% reduction from 0.1% to 0.07% absolute risk is not the same as a 30% reduction from 10% to 7%. Absolute risk reduction and number needed to treat are the numbers that tell you whether something is worth doing.
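The two cases can be worked through directly. A minimal sketch, using the baseline risks from the example above; the function names are illustrative, and the number needed to treat is rounded up to a whole person by convention.

```python
import math


def absolute_risk_reduction(baseline_risk: float, relative_risk_reduction: float) -> float:
    """Absolute drop in risk implied by a relative reduction at a given baseline."""
    return baseline_risk * relative_risk_reduction


def number_needed_to_treat(baseline_risk: float, relative_risk_reduction: float) -> int:
    """People who must be treated for one of them to benefit, rounded up."""
    return math.ceil(1 / absolute_risk_reduction(baseline_risk, relative_risk_reduction))


for baseline in (0.001, 0.10):
    arr = absolute_risk_reduction(baseline, 0.30)
    nnt = number_needed_to_treat(baseline, 0.30)
    print(f"baseline {baseline:.1%}: ARR = {arr:.2%}, NNT = {nnt}")
```

The same "30% reduction" means treating 3,334 people to prevent one event at a 0.1% baseline, but only 34 people at a 10% baseline.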
Prefer pre-registered trials. A pre-registered trial commits to the primary outcome and analytical approach before seeing data. This prevents post-hoc reframing of a secondary outcome as the primary finding when the primary fails. Most published observational studies are not pre-registered.
Check who funded it. Industry-funded trials are not automatically wrong, but the effect sizes reported in industry-funded trials are systematically larger than those in independently funded replications of the same interventions. A drug trial funded by the manufacturer that shows a 40% reduction in the surrogate endpoint should be read differently from an independent replication in a broader population showing 15%.
Treat case studies and testimonials as hypothesis generators. The fact that a specific person responded dramatically to an intervention is interesting. It suggests the intervention is worth testing. It cannot tell you whether you will respond similarly.
Weight replication heavily. A finding that has been replicated by independent groups in different populations with pre-registered methods is more reliable than a finding from a single landmark study, regardless of journal prestige or sample size.
Look for the mechanism. A finding with a plausible, well-understood mechanism is more credible than a statistical association that lacks one. This is not determinative — mechanisms can be wrong and statistical associations can be real — but it helps calibrate confidence.
The replication crisis is, at bottom, a recalibration of confidence. Treating any individual study as establishing a fact was always overconfident. The more honest representation (this study found X, in this population, under these conditions, with this effect size, and we should test it further) is less satisfying but more accurate. The data is better than nothing. It is not better than measuring yourself.