Most medical researchers blindly adhere to the popular dogma of p-values. According to this dogma, statistical significance is declared on the basis of a p-value alone (often a p-value below 0.05). To the practitioner of this religion, statistics refers solely to the investigation of such values. However, the probability that an association is true given a statistically significant finding depends not only on the estimated p-value but also on the prior probability of it being real, the research bias (the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced) and the statistical power of the test. More specifically, it can be seen that the positive predictive value (PPV) of a test (i.e. the post-study probability that the association is true) equals*:
PPV = ((1 – β)R + uβR) / (R + α – βR + u – uα + uβR)
where R is the ratio of the number of “true relationships” to “no relationships” among those tested in the field, α is the Type I error rate, β is the Type II error rate (and hence 1 – β is the “power” of the test) and u is the research bias. Hence, according to the equation above (assuming negligible bias for now), a research finding is more likely to be true than false iff (1 – β)R > α.
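As a minimal sketch, the PPV expression given in Ioannidis (2005) can be evaluated directly. The function name `ppv` and the default values (α = 0.05, β = 0.2) are illustrative assumptions, not something the text prescribes.

```python
def ppv(R, alpha=0.05, beta=0.2, u=0.0):
    """Post-study probability that a claimed association is true.

    R     -- ratio of true relationships to no relationships in the field
    alpha -- Type I error rate
    beta  -- Type II error rate (power is 1 - beta)
    u     -- research bias
    """
    # Claimed findings that are true: detected by the test, or reported through bias.
    true_pos = (1 - beta) * R + u * beta * R
    # Claimed findings that are false: Type I errors, or reported through bias.
    false_pos = alpha + u * (1 - alpha)
    # true_pos + false_pos expands to R + alpha - beta*R + u - u*alpha + u*beta*R.
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the effect of R and u.
    for R in (0.05, 0.1, 1.0):
        for u in (0.0, 0.2):
            print(f"R={R:<5} u={u:<4} PPV={ppv(R, u=u):.3f}")
```

With u = 0 the function returns more than 0.5 exactly when (1 – β)R > α, which is the condition stated above.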
The graphs below highlight the relationship between the variables. As we can easily observe, the higher the R and the lower the Type II error rate, the higher the PPV. The red surface corresponds to the zero research bias case, while the green and yellow surfaces correspond to u = 0.2 and u = 0.6 respectively. The blue plane corresponds to PPV = 0.5, i.e. the cut-off positive predictive value. The multicoloured floor of the graph indicates the levels of β and R (for u = 0, 0.2, 0.6) for which research findings are more likely true than not.
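The region marked on the floor of the graph can also be found numerically. Setting PPV = 0.5 in the Ioannidis (2005) expression and solving for R gives R > (α + u(1 – α)) / (1 – β + uβ); this rearrangement is our own algebra rather than a formula quoted from the paper, and the function name below is hypothetical.

```python
def min_ratio_for_ppv_above_half(beta, alpha=0.05, u=0.0):
    """Smallest R (exclusive) for which PPV exceeds 0.5.

    Derived by requiring true positives to outnumber false positives:
    (1 - beta)*R + u*beta*R > alpha + u*(1 - alpha).
    Undefined when beta = 1 and u = 0 (zero power, no bias).
    """
    return (alpha + u * (1 - alpha)) / (1 - beta + u * beta)


if __name__ == "__main__":
    # Same bias levels as the three surfaces described in the text.
    for u in (0.0, 0.2, 0.6):
        for beta in (0.2, 0.5, 0.8):
            r_min = min_ratio_for_ppv_above_half(beta, u=u)
            print(f"u={u:<4} beta={beta:<4} PPV > 0.5 requires R > {r_min:.3f}")
```

For u = 0 the threshold reduces to α / (1 – β), matching the bias-free condition (1 – β)R > α.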
The Corollaries (as highlighted by Ioannidis (2005))
- The smaller the studies conducted in a scientific field (and hence the lower the power), the less likely the research findings are to be true.
- The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
- Research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments.
- The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u.
- The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. Conflicts of interest and prejudice may increase bias, u.
So how often are published medical research findings more likely true than not? How often do they lie above the blue plane? According to Ioannidis, rather rarely:
“In the described framework, a PPV exceeding 50% is quite difficult to get… A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable standardization of laboratory and statistical methods, outcomes, and reporting thereof to minimize bias.”
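The figures in this quote can be checked against the PPV expression from Ioannidis (2005). The sketch below uses α = 0.05 throughout; the power, R and u assigned to each scenario are assumptions chosen to mirror the scenarios described in the quote and should be checked against the paper before being relied on. All names are hypothetical.

```python
def ppv(R, alpha=0.05, beta=0.2, u=0.0):
    """PPV with bias: ((1-b)R + u*b*R) / (R + a - b*R + u - u*a + u*b*R)."""
    true_pos = (1 - beta) * R + u * beta * R
    false_pos = alpha + u * (1 - alpha)
    return true_pos / (true_pos + false_pos)


# (label, power, R, u): assumed values for each scenario in the quote.
SCENARIOS = [
    ("Adequately powered RCT, 1:1 pre-study odds", 0.80, 1.0, 0.10),
    ("Confirmatory meta-analysis of good-quality RCTs", 0.95, 2.0, 0.30),
    ("Meta-analysis of small inconclusive studies", 0.80, 1 / 3, 0.40),
    ("Underpowered early-phase clinical trial", 0.20, 1 / 5, 0.20),
    ("Well-powered exploratory epidemiological study", 0.80, 1 / 10, 0.30),
    ("Discovery-oriented research with massive testing", 0.20, 1 / 1000, 0.80),
]


def scenario_ppvs(alpha=0.05):
    """Return {label: PPV} for every scenario above."""
    return {label: ppv(R, alpha, 1 - power, u) for label, power, R, u in SCENARIOS}


if __name__ == "__main__":
    for label, value in scenario_ppvs().items():
        print(f"{label:50s} PPV = {value:.4f}")
```

Under these assumed inputs the first scenario comes out at roughly 0.85, the early-phase trial at roughly one in four and the well-powered epidemiological study at roughly one in five, in line with the quoted text.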
Furthermore, the same researcher analysed 49 of the most highly regarded research findings in medicine over the previous 13 years and noted that:
“Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (P = .008). Among randomized trials, studies with contradicted or stronger effects were smaller (P = .009) than replicated or unchallenged studies although there was no statistically significant difference in their early or overall citation impact. Matched control studies did not have a significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.”
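The comparison of 5 of 6 nonrandomized studies against 9 of 39 randomized trials can be reproduced with a short calculation. The sketch below assumes a two-sided Fisher exact test; the quote does not name the test used, so this is a plausibility check rather than a reconstruction of the authors' analysis. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of k first-column counts in the first row.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Rows: nonrandomized, randomized.
    # Columns: contradicted or stronger effect, otherwise.
    p = fisher_exact_two_sided(5, 1, 9, 30)
    print(f"p = {p:.4f}")
```

Under that assumption the result is about 0.008, consistent with the quoted P value.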
This is very worrying indeed…
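* Quite often independent teams address the same sets of hypotheses. In such cases it is reasonable to consider the findings globally rather than in isolation. In the absence of bias, the PPV of a finding claimed by at least one of n such independent studies of equal power equals (Ioannidis (2005)):
PPV = R(1 – β^n) / (R + 1 – (1 – α)^n – Rβ^n)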
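A minimal sketch of the multi-study case, using the bias-free expression from Ioannidis (2005), PPV = R(1 – β^n) / (R + 1 – (1 – α)^n – Rβ^n), for a finding claimed by at least one of n independent studies of equal power. The function name and the example inputs are assumptions for illustration.

```python
def ppv_n_studies(R, n, alpha=0.05, beta=0.2):
    """PPV when at least one of n independent, equally powered studies
    claims a statistically significant finding (no bias)."""
    # True relationships picked up by at least one study.
    true_pos = R * (1 - beta ** n)
    # Null relationships flagged by at least one study.
    false_pos = 1 - (1 - alpha) ** n
    # true_pos + false_pos equals R + 1 - (1 - alpha)^n - R*beta^n.
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical field with R = 1 and 80% power per study.
    for n in (1, 2, 5, 10):
        print(f"n={n:<3} PPV={ppv_n_studies(1.0, n):.3f}")
```

For n = 1 this reduces to the single-study, bias-free PPV, and with power above α the value falls as more teams test the same hypothesis.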