What p < 0.05 Actually Means (And Why You're Probably Misreading It)

Statistical significance is one of the most misunderstood concepts in health research. It doesn't mean the effect is real, large, or relevant to you. Here's what it actually means, and the three numbers that matter instead.

The headline says the supplement works. The p-value says p = 0.03. You should buy it, right?

In 2016, the American Statistical Association did something unusual for a professional scientific body: it published an official statement on the meaning of a p-value. The statement laid out six principles, and its overall message could be summarized as: p-values are widely misunderstood, and that misunderstanding causes serious harm.

The statement came after decades of accumulated evidence that p < 0.05 — the dominant criterion for scientific publication — was being systematically misapplied, misreported, and misinterpreted. Researchers were using it to claim things it cannot support. Journalists were using it to write headlines it does not justify. Consumers were making decisions based on a number whose meaning they had never been correctly taught.

This piece is an attempt to correct that. Not because statistics are intrinsically interesting, but because if you're making health decisions based on research — or running your own experiments and analyzing the results — you need to know what these numbers actually mean.

What p < 0.05 does not mean

Let's start here, because the misconceptions are more durable than the correct understanding.

It does not mean the probability that the result is a false positive is 5%. This is the most common misinterpretation. A p-value of 0.05 does not mean there's a 95% probability the effect is real. The p-value says nothing directly about the probability of the hypothesis being true.

It does not mean the effect is large or clinically meaningful. A study can reach p < 0.05 with an effect so small it has no practical relevance. With a large enough sample, trivial effects — a 0.2% improvement, a half-point change on a 100-point scale — will be statistically significant. Statistical significance and clinical significance are different concepts.

It does not mean the study will replicate. A p-value is calculated from a specific sample. Another sample from the same population will produce a different p-value. A result at p = 0.049 is not "just barely significant" in a way that makes it almost certain to replicate; it is a result at the boundary of a conventional threshold, and its replication probability in a new sample of the same size may be only 50%.

It does not mean "the drug works" or "the supplement is effective." These are causal claims. P-values emerge from statistical tests that measure whether observed data are surprising under a specific null hypothesis. They do not establish causation, mechanism, or generalizability.
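The replication point above can be checked with a few lines of arithmetic. This is a minimal sketch under a generous assumption: the true effect is exactly as large as the one the original study observed, and the replication uses the same sample size and a two-sided 0.05 threshold. The function name `replication_probability` is made up for illustration.

```python
from statistics import NormalDist

def replication_probability(p_original, alpha=0.05):
    """Chance that a same-size replication is significant in the same direction,
    ASSUMING the true effect equals the originally observed effect."""
    std_normal = NormalDist()
    # z-score that corresponds to the original two-sided p-value
    z_observed = std_normal.inv_cdf(1 - p_original / 2)
    # z-score a replication must exceed to count as significant
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # Under the assumption, the replication z-score is Normal(z_observed, 1)
    return 1 - std_normal.cdf(z_critical - z_observed)

prob = replication_probability(0.049)
print(f"p = 0.049 originally -> replication chance about {prob:.2f}")
print(f"p = 0.001 originally -> replication chance about {replication_probability(0.001):.2f}")
```

If the original estimate was inflated by luck, which is common for results that barely cross the threshold, the real replication chance is lower than this sketch suggests.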

What p < 0.05 actually means

A p-value answers the following specific question: If the null hypothesis were true (meaning there is actually no effect), how often would I see a result at least as extreme as this one by chance alone?

Suppose a trial tests whether magnesium improves sleep. A p-value of 0.05 means: if magnesium had absolutely no effect on sleep, I would still get a difference this large or larger about 5% of the time just due to random sampling variation. It is a measure of surprise under the null hypothesis. It is not a measure of how probable the hypothesis is.

The null hypothesis is typically the hypothesis of no effect. Rejecting it at p < 0.05 means the data are inconsistent with no effect at a conventional threshold. It does not tell you how large the effect is, whether it matters, or whether you specifically will experience it.
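One way to make this definition concrete is to simulate a world where the null hypothesis is true and count how often p < 0.05 shows up anyway. The sketch below is illustrative only: the group size, the number of simulated experiments, the seed, and the use of a simple z-test with a known standard deviation of 1 are all assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (sigma assumed known)."""
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_alarm_rate(trials=2000, n=30, alpha=0.05, seed=1):
    """Share of no-effect experiments that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups come from the SAME distribution: there is no real effect.
        treated = [rng.gauss(0, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        if two_sided_p(treated, control) < alpha:
            hits += 1
    return hits / trials

rate = false_alarm_rate()
print(f"Share of no-effect experiments with p < 0.05: {rate:.3f}")
```

The share lands close to 5%, which is all the threshold promises: a cap on how often pure noise looks "significant," not a statement about any single finding.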

The three numbers that actually matter

When you read research — or analyze your own experiment data — there are three numbers that do more work than the p-value.

1. Effect size

Effect size answers: how large is the difference? P-values are affected by sample size. Effect sizes are not (at least not systematically). A study with 10,000 participants can produce p < 0.001 for a trivially small effect. A study with 20 participants can produce p > 0.10 for a large effect that exists but isn't detected at this sample size.
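To see how strongly sample size drives the p-value, here is a rough sketch that converts an observed standardized difference into an approximate p-value. It assumes two equal-sized groups and a normal (z) approximation, which is slightly optimistic for very small samples. The function name and the example numbers are hypothetical.

```python
from statistics import NormalDist

def approx_p_value(observed_d, n_per_group):
    """Approximate two-sided p-value for an observed standardized difference.

    Assumes two equal groups and a z approximation; not an exact t-test.
    """
    z = observed_d * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A tiny effect in a huge study (10,000 participants, 5,000 per group)
print(f"d = 0.1, n = 5000 per group: p = {approx_p_value(0.1, 5000):.2g}")
# A sizeable effect in a tiny study (20 participants, 10 per group)
print(f"d = 0.7, n = 10 per group:   p = {approx_p_value(0.7, 10):.2g}")
```

The first result clears p < 0.001 despite a small effect, while the second misses p < 0.10 despite a much larger one.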

Common effect size measures:

  • Cohen's d: the difference between two group means divided by the pooled standard deviation. By convention, a d of 0.2 is small, 0.5 is medium, and 0.8 is large. A sleep intervention that improves sleep quality by half a standard deviation is meaningfully different from one that improves it by a tenth.
  • Absolute risk reduction (ARR): for health outcomes, the difference in outcome rates between groups. If a drug reduces heart attack rate from 4% to 3%, the ARR is 1 percentage point. This is the number that tells you how much you stand to gain from taking the drug.
  • Number needed to treat (NNT): the inverse of ARR. In this case, 1 / 0.01 = 100, so you'd need to treat 100 people to prevent one heart attack. An NNT of 10 is clinically important. An NNT of 500 means the treatment helps very few of the people who take it.

The relative risk reduction (RRR), as in "reduces risk by 25%", is largely uninformative without the baseline. A 25% relative reduction from 0.1% to 0.075% affects almost no one. A 25% relative reduction from 20% to 15% saves many lives. News headlines often report relative risk because it sounds larger. Always look for the absolute number.
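The effect size measures above are simple enough to compute by hand, and a small sketch makes the contrast between relative and absolute numbers easy to see. The function names are made up for illustration, the sleep scores are invented, and `risk_summary` assumes the treated group has the lower event rate.

```python
from statistics import mean, variance

def cohens_d(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treated), len(control)
    pooled_variance = (
        (n_t - 1) * variance(treated) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / pooled_variance ** 0.5

def risk_summary(control_rate, treated_rate):
    """ARR, NNT, and RRR from two event rates given as fractions (0.04 = 4%)."""
    arr = control_rate - treated_rate
    if arr <= 0:
        raise ValueError("this sketch assumes the treated rate is lower")
    return {"arr": arr, "nnt": 1 / arr, "rrr": arr / control_rate}

# Invented sleep scores, purely to show the call
print(f"Cohen's d: {cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5]):.2f}")

# Three scenarios from the text, all with the same 25% relative reduction
for control, treated in [(0.04, 0.03), (0.001, 0.00075), (0.20, 0.15)]:
    s = risk_summary(control, treated)
    print(f"{control:.3%} -> {treated:.3%}: RRR {s['rrr']:.0%}, "
          f"ARR {s['arr']:.3%}, NNT {s['nnt']:.0f}")
```

All three scenarios share the same relative reduction, but the number needed to treat ranges from 20 to 4,000.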

2. Confidence interval

A confidence interval (typically 95% CI) gives you a range of plausible effect sizes consistent with the data. If a study reports a mean difference of 5 points (95% CI: 1 to 9), it means: given this data, effect sizes between 1 and 9 are all statistically compatible with what was observed. The most likely estimate is 5, but it could plausibly be as small as 1 or as large as 9.
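For a sense of where such an interval comes from, here is a minimal sketch using the normal approximation: estimate plus or minus about 1.96 standard errors. The standard error of 2.04 is an assumption picked so the output matches the "5 points (95% CI: 1 to 9)" example; a real study would compute it from its own data, often with a t-distribution for small samples.

```python
from statistics import NormalDist

def confidence_interval(estimate, standard_error, level=0.95):
    """Normal-approximation confidence interval around an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    margin = z * standard_error
    return estimate - margin, estimate + margin

# Hypothetical inputs: mean difference of 5 points, standard error of 2.04
low, high = confidence_interval(5.0, 2.04)
print(f"95% CI: {low:.1f} to {high:.1f}")
```

Halving the standard error, which takes roughly four times the sample size, halves the width of the interval.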

Wide confidence intervals mean uncertain estimates. A result reporting "30% improvement (95% CI: 2% to 58%)" is not strong evidence for 30% improvement; it is evidence that the true effect is somewhere between barely anything and quite a lot. Wide intervals are the honest signal of underpowered studies.

Narrow confidence intervals that exclude zero are the reliable form of statistical significance. "5% improvement (95% CI: 3% to 7%)" means you can be fairly confident the true effect is positive and in this range — though still subject to all the other caveats about study quality and generalizability.

3. Prior probability

The probability that a finding is a true positive depends not just on the p-value but on the prior probability that the hypothesis is true before the study was run. This is Bayes' theorem in action.

Consider: a study tests whether a specific herbal compound improves memory. Prior probability (based on mechanism, preclinical evidence) might be 10% — plausible but uncertain. The study finds p = 0.04. What is the probability this is a true finding?

The calculation (simplified) depends on statistical power and prior probability. With 80% power, a 0.05 significance threshold, and a 10% prior, a significant result has roughly a 64% probability of being a true positive: 0.8 × 0.10 = 0.08 true positives for every 0.05 × 0.90 = 0.045 false positives. That is better than a coin flip, but far from the "95% certain" that people assume from p < 0.05.

Now consider a study testing whether exercise improves cardiovascular function. Prior probability: ~95%, based on decades of mechanistic and epidemiological evidence. A p = 0.04 result here corresponds to a very high probability of being a true positive — not because the p-value is different, but because the prior was different.
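The simplified calculation behind both examples can be sketched in a few lines. It treats the result as simply "significant at the 0.05 threshold" rather than using the exact p-value, and it ignores bias and multiple testing, so read it as a rough guide rather than a precise probability. The function name is made up for illustration.

```python
def positive_predictive_value(prior, power=0.80, alpha=0.05):
    """Probability that a significant finding is a true positive.

    Simplified Bayes' theorem: assumes the only inputs are the prior,
    the study's power, and the significance threshold alpha.
    """
    true_positives = power * prior          # real effects that get detected
    false_positives = alpha * (1 - prior)   # null effects that look significant
    return true_positives / (true_positives + false_positives)

herbal = positive_predictive_value(prior=0.10)
exercise = positive_predictive_value(prior=0.95)
print(f"Herbal compound (10% prior): {herbal:.0%}")
print(f"Exercise (95% prior):        {exercise:.1%}")
```

Lowering the power in this sketch, as in an underpowered study, pulls the first number down further.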

The same p-value means very different things depending on what was being tested. A p = 0.05 result from a mechanistically well-understood, well-replicated intervention in a relevant population is much more credible than a p = 0.05 result from a novel, implausible hypothesis tested once.

Why most positive findings in health research are false

In 2005, John Ioannidis published a paper titled "Why Most Published Research Findings Are False." It became one of the most cited papers in medical literature. The argument was mathematical, not rhetorical: given typical values of prior probability, statistical power, and the prevalence of researcher degrees of freedom (researcher choices that inflate false positive rates), more than half of published findings in many fields are false positives.

The intuition: if most hypotheses tested are false (which is common in exploratory research), and the false positive rate at p < 0.05 is 5%, and many studies are underpowered, then the majority of "significant" findings in a literature can be noise. The multiple testing problem compounds this: if a researcher tests 20 independent outcomes where no real effect exists and reports whichever one reaches p < 0.05, the chance of at least one false positive is not 5% but about 64% (1 − 0.95^20), because the most extreme result is being selected from 20 tries.
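The arithmetic behind the multiple testing problem is short enough to check directly. This sketch assumes the tested outcomes are independent and that none of them has a real effect; correlated outcomes would give a somewhat lower number. The function name is made up for illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20):
    print(f"{k:>2} outcomes tested: {familywise_error_rate(k):.0%} "
          "chance of at least one p < 0.05")
```

With 20 outcomes, finding "something significant" is the expected result even when nothing is going on.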

Pre-registration addresses this: a pre-registered study commits to its primary outcome and analysis plan before seeing data, which means you can't select the significant result after the fact. A pre-registered p = 0.04 is much more credible than an unregistered one.

Reading a result correctly

Here is what a responsible read of a health finding looks like.

A study reports: "Participants taking omega-3 supplementation showed a statistically significant improvement in depression scores compared to placebo (p = 0.02)."

The questions you should ask:

  1. What was the effect size? If it's a 0.5-point improvement on a 60-point scale, the statistical significance is irrelevant.
  2. What was the 95% CI? A narrow CI sitting clearly above zero is a different result from a wide CI that barely excludes zero.
  3. What was the study design? RCT or observational? Pre-registered? What population?
  4. Has it been replicated? A single study with p = 0.02 and no replication carries much less weight than three independent replications with consistent effect sizes.
  5. What's the plausible mechanism? Omega-3 has established anti-inflammatory and neurological mechanisms — this raises prior probability. A study of a supplement with no known mechanism requires more replication.
  6. Who funded it? Industry-funded studies tend to report more favorable results than independently funded ones.

After running through these questions, you end up with a calibrated belief — not a binary "true/false," but a probability that the effect is real and an estimate of its likely magnitude. That probability is what should inform your decision to try the intervention, not the p-value alone.

What this means for your own experiments

When you run your own experiment and calculate whether your results are statistically significant, the interpretation is different from a clinical trial — but the concepts are the same.

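A p-value from your own n=1 data answers: if the intervention did nothing, how often would I see a difference this large between my intervention and control periods by random chance? A low p-value means a difference this large would be unusual under pure noise. It is not the probability that your result is noise. And the same caveats apply: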

  • Effect size matters. A statistically significant improvement in sleep score of 0.2 points (on a 10-point scale) may be real but is possibly not worth acting on. A 1.5-point improvement is worth taking seriously.
  • You need enough observations. With 5 intervention days and 5 control days, you have almost no statistical power to detect anything but enormous effects. 20+ observations per condition gives you a reasonable chance of detecting large effects; moderate effects need considerably more (roughly 60+ per condition for 80% power in a simple two-group comparison).
  • Confounders still bias the estimate. A statistically significant result in a poorly controlled experiment is evidence of something — but maybe not what you think.
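To get a feel for the "enough observations" point, here is a rough power sketch. It assumes a simple comparison of two equal-sized sets of independent observations, a two-sided 0.05 threshold, and a normal approximation. Real n=1 data often has day-to-day carryover that this ignores, so treat the numbers as optimistic. The function name is made up for illustration.

```python
from statistics import NormalDist

def approx_power(true_d, n_per_condition, alpha=0.05):
    """Approximate chance of reaching p < alpha when the true effect is true_d.

    Normal approximation for two equal groups; ignores the negligible chance
    of a significant result in the wrong direction.
    """
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    expected_z = true_d * (n_per_condition / 2) ** 0.5
    return 1 - std_normal.cdf(z_critical - expected_z)

for n in (5, 20, 64):
    print(f"n = {n:>2} per condition: "
          f"moderate effect (d = 0.5) {approx_power(0.5, n):.0%}, "
          f"large effect (d = 0.8) {approx_power(0.8, n):.0%}")
```

A "non-significant" result from a handful of days mostly reflects the small sample, not the absence of an effect.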

For most personal experiments, the more useful framing is not "is this significant?" but "what is my best estimate of the effect size, and what range of values are plausible given this data?" A Bayesian credible interval — using your prior belief about the effect and updating it with your data — is often more actionable than a p-value, because it directly answers the question you care about: "Given everything I know, how large is this effect likely to be for me?"
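One simple way to build such an interval is a normal-normal update, where both your prior belief and your data are summarized as a mean and an uncertainty. This is a minimal sketch, not the only way to do it: the normal shapes are an assumption, and the example numbers (a skeptical prior centered on zero, an observed 1.5-point improvement with a standard error of 0.5) are invented.

```python
from statistics import NormalDist

def posterior_interval(prior_mean, prior_sd, estimate, standard_error, level=0.95):
    """Combine a normal prior with a normal data estimate.

    Returns the posterior mean and a credible interval at the given level.
    """
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / standard_error ** 2
    post_precision = prior_precision + data_precision
    # Posterior mean is a precision-weighted average of prior and data
    post_mean = (
        prior_mean * prior_precision + estimate * data_precision
    ) / post_precision
    post_sd = post_precision ** -0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return post_mean, (post_mean - z * post_sd, post_mean + z * post_sd)

# Hypothetical: skeptical prior (mean 0, sd 1), observed effect 1.5 (SE 0.5)
post_mean, (low, high) = posterior_interval(0.0, 1.0, 1.5, 0.5)
print(f"Best estimate: {post_mean:.2f} points "
      f"(95% credible interval: {low:.2f} to {high:.2f})")
```

The skeptical prior pulls the estimate below the raw 1.5 points, and more data shrinks that pull.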

The goal is calibrated confidence, not significance. You want to know what you can trust, at what level, and why. A p-value, on its own, can't tell you that. An effect size, a confidence interval, and a prior probability — together — can.