Weighing the Evidence

The thesis of this book is that statistical hypothesis testing does not answer the actual questions that clinical researchers pose, but instead answers a question that the researcher 1) has not asked, and 2) has no interest in having answered.

My approach to this dilemma is to answer the question, “If we had only estimation theory and not statistical hypothesis testing, how would we analyze clinical research data?” This approach gives clinical researchers direct access to the answer to their fundamental research question: “Does the experimental exposure help my patients or injure them?”

This book begins quite non-mathematically, discussing the philosophical concerns about the use of the p-value and the acculturation of generations of health care researchers to statistical hypothesis testing, even though it was not designed for clinical research from its first principles. Its inculcation has led to the institutionalization of physicians, biostatisticians, and administrators who, frankly, would be lost without this single number’s presence. [1]

The clinical research community has permitted itself to be caught up in the tidal drift generated by the need for a computational, interpretative tool. While this device added structure to research interpretation in the 1950s, it has, in my view, placed restrictions on research design that have nothing to do with biology, pathophysiology, or even logistics, but are instead driven by the need to generate a p-value-based assessment of the impact of the intervention or exposure.

This is not a conspiracy theory book. None of the p-value history that I provide is nefarious. While experienced and prominent members of the statistical community have been influential in reinforcing p-value primacy, there is no statistical hypothesis testing Darth Vader in command. In fact, many statisticians conduct statistical hypothesis testing simply because 1) that is what has been asked of them, and 2) they know of no alternative. We ourselves are to blame for this miasma. Our answer resides not in a Star Wars villain but in Shakespeare’s Julius Caesar: the fault lies not in our stars, but in ourselves.

The book combines a new approach, duality theory, with a well-established branch of mathematics, measure theory, to weigh the evidence in a clinical research effort supporting benefit and supporting harm. Duality theory states that an estimator of an effect in a clinical trial, be it a difference in mean change in diastolic blood pressure or a prevalence ratio, simultaneously contains evidence of benefit and evidence of harm, and the evidence for each can be extracted.
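
A minimal numerical sketch of this idea follows. It is my illustration only, not the formal development: it assumes a flat-prior normal model for the estimator and adopts the convention that negative effects (for example, a fall in diastolic blood pressure relative to control) represent benefit. The function names are invented for the example.

    import math

    def normal_cdf(x):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def weigh_evidence(estimate, std_error, threshold=0.0):
        # Illustrative assumption: the true effect is modeled as
        # N(estimate, std_error^2), and effects below `threshold`
        # (e.g., a fall in diastolic blood pressure versus control)
        # count as benefit; reverse the two outputs otherwise.
        p_benefit = normal_cdf((threshold - estimate) / std_error)
        return {"benefit": p_benefit, "harm": 1.0 - p_benefit}

    # Hypothetical estimate: mean difference of -3 mmHg, standard error 2.
    print(weigh_evidence(-3.0, 2.0))
    # {'benefit': 0.933..., 'harm': 0.066...}

Whatever the formal machinery, the point is the same: a single estimator carries probability mass on both sides of the clinical threshold, and both masses can be reported.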


[1] Those readers who are already familiar with this dialogue can skip Chapter 1.

This is my assessment of where we are with the use of hypothesis testing in clinical research, as well as the path by which we arrived here.

Neither I nor any of the colleagues with whom I have been privileged to work would argue that mathematics has no place in health care research interpretation. When used correctly, it can summarize the findings of complicated research programs.

However, statistical hypothesis testing in general, and the p-value in particular, fail this test.

By itself (and few people argue that it is useful by itself), statistical hypothesis testing cannot summarize even a simple single-outcome experiment. It must be accompanied by the effect size, the effect size’s standard error, and the confidence interval to convey both the strength of the association and the variability around that strength.
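
To make this concrete, here is a small sketch of such a summary for a hypothetical two-group comparison. The numbers and the function name are invented for illustration, and a normal approximation is assumed.

    import math

    def summarize_two_groups(mean1, se1, mean2, se2):
        # Effect size, standard error, 95% CI, and two-sided p-value for a
        # difference of two independent means, under a normal approximation.
        effect = mean1 - mean2
        se = math.sqrt(se1 ** 2 + se2 ** 2)
        ci95 = (effect - 1.96 * se, effect + 1.96 * se)
        z = abs(effect / se)
        p_value = 1.0 - math.erf(z / math.sqrt(2.0))  # = 2 * (1 - Phi(|z|))
        return {"effect": effect, "std_error": se, "ci95": ci95, "p_value": p_value}

    # Hypothetical single-outcome trial: treatment vs. control mean change.
    print(summarize_two_groups(-8.0, 1.5, -5.0, 1.4))
    # effect -3.0, SE ~2.05, 95% CI ~(-7.02, 1.02), p ~0.14

Even in this simplest of settings, the p-value is only one entry in the summary; it is the confidence interval, not the p-value, that shows the range of effects the data support.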

We have much to be thankful for in the methodologic rigor that entered health care research in the 1950s. We should stay close to these improvements and let them continue to guide our clinical research efforts. Solid, dependable protocols, concordantly executed, are prerequisites for health care research.

But not the p-value.

Experiments now are much more complicated than in the 1950s, when the “0.05 rule” was first enforced. Clinical trial programs now commonly have multiple treatment arms. They can examine dose response. They can react to a protocol-mandated discontinuation of a treatment arm. They can contain outcomes assessed over multiple time points, or multiple outcomes assessed at a single follow-up time point. They contain proper subgroups, complex proteomics, and exploratory analyses.

P-values were simply not designed for this complex environment.

Unfortunately, rather than set them aside when the research enterprise became complex, the statistical and administrative communities “doubled down” on them. The new research environment excluded subgroup analyses, secondary endpoints, dose-response relationships (and, yes, exploratory analyses) from quantitative inclusion in the assessment of the study, principally because there was no way statistical hypothesis testing could manage all of this.

Rather than discard a constraining metric, they simply ignored the complexity of the research program that did not lend itself to the p-value, relying only on the part of the program they deemed interpretable through the type I error allocation rule. This is not unlike the hungry man who starves because his weak flashlight does not reveal the feast just out of his view.

We need something better. The following is my idea…