Excerpt from Multiple Analyses in Clinical Trials

In chapter 3 we acknowledged the inevitability of multiple analyses in clinical trials. Since additional endpoints can be added to the design of a clinical trial relatively cheaply, their inclusion can be cost-effective. In addition, epidemiologic requirements for building the tightest causal link between the clinical trial’s intervention and that trial’s endpoints serve as powerful motivators for the inclusion of multiple analyses. These carefully considered, prospectively designed evaluations may provide, for example, information about the relationship between the dose of medication and the disease, or evaluate the mechanism by which the clinical trial’s intervention produces its impact on disease reduction. The cost of carrying out these analyses is commonly small compared to the overall cost of the clinical trial.

However, we have also observed that increasing the number of hypothesis tests also increases the overall type I error level. In clinical trials, measuring the type I error level is a community obligation of the trial’s investigators; the type I error level measures the likelihood that an intervention, known to produce an adverse event and a financial burden, will have no beneficial effect in the population from which the sample was derived. Thus the type I error level is an essential component in the risk-benefit evaluation of the intervention and must be both accurately measured and tightly controlled. While the prospective design and concordant execution of a clinical trial ensures that the estimate of the type I error level at the experiment’s conclusion is trustworthy, this research environment does not guarantee that the type I error level will be low.

In this chapter, we will develop the requisite skills to control and manage type I error when there are multiple endpoints in a two-armed clinical trial. In doing so we will rely on the familywise error level (ξ) as the primary tool for type I error level control.

4.2 Important Assumptions

Since effective type I error level management can only occur when the estimates of this error level are both accurate and trustworthy, we will assume that the trials for which these skills are developed are prospectively designed and concordantly executed. This permits us to steer clear of the problems presented by the random research paradigm.[1] In addition, we will assume in this chapter that the clinical trial endpoints are independent of one another.
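Under this independence assumption, the familywise error level ξ has a simple closed form: if each of K endpoints is tested at a per-test level α, then ξ = 1 − (1 − α)^K. A minimal sketch (the function name is illustrative, not part of any standard library):

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one type I error across k independent
    hypothesis tests, each conducted at level alpha."""
    return 1 - (1 - alpha) ** k

# Three independent endpoints, each tested at the traditional 0.05 level
print(round(familywise_error(0.05, 3), 3))  # 0.143
```

Even with only three independent endpoints, the familywise error level nearly triples the per-test level, which is why ξ, rather than the individual test levels, is the quantity we will manage.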

In addition, although the focus of this chapter is the discussion of type I error levels, which are the primary statistical difficulty in multiple endpoint analyses, this emphasis should not be interpreted as denying the time-tested advice that experimental interpretation is an exercise involving the joint consideration of effect sizes, standard errors, and confidence intervals. P values are a necessary component of this evaluation, but they are not the sole component. They do not measure effect size, nor do they convey the extent of study discordance. A small p value does not in and of itself mean that the sample size was adequate, that the effect size is clinically meaningful, or that there has been a clear attribution of effect to the clinical trial’s intervention. Each of these other important issues must be examined separately, through a careful, critical review of the research effort, to gain a clear view of what the sample is saying about the population.

4.3 Clinical Trial Result Descriptors

In order to continue our development we will need some unambiguous terminology to categorize the results of clinical trials. It is customary to classify clinical trials on the basis of their results, e.g. positive trials or negative trials. Here we will elaborate upon and clarify these useful descriptors.

4.3.1 Positive and Negative Trials

Assume that investigators are executing a prospectively designed, concordantly executed clinical trial to demonstrate the benefit of a randomly allocated intervention for reducing the clinical consequences of a disease or condition. For ease of discussion, we will also assume that the clinical trial has only one prospectively designed endpoint which requires a hypothesis test. Define the hypothesis test result as positive if the hypothesis test rejects the null hypothesis in favor of benefit. Since the clinical trial had only one hypothesis test, and that hypothesis test was positive, the clinical trial is described as positive. This definition is consistent with the terminology now in general use, and we will use it in this text.

The commonly used descriptor for a negative hypothesis test can be somewhat confusing, requiring us to make a simple adjustment. Typically, a negative hypothesis test is defined as a hypothesis test which did not reject the null hypothesis and therefore did not find that the clinical trial’s intervention produced the desired benefit for the population being studied. However, this terminology can cause confusion, since it is possible for a hypothesis test to demonstrate a truly harmful finding.[2] The hypothesis test which demonstrates not benefit but harm must also have a descriptor. We will distinguish these two hypothesis test results as follows. Define a negative hypothesis test as a hypothesis test that has demonstrated that the intervention produced harm. Now, define a null hypothesis test as a hypothesis test that indicates the intervention produced neither benefit nor harm (i.e., the null hypothesis has not been rejected). Thus a positive trial demonstrates that the intervention produced the designed benefit, a negative trial demonstrates that the intervention produced harm, and a null trial demonstrates that neither harm nor benefit was obtained.[3] The set of descriptors for the trial is the same as the set of descriptors for the hypothesis test.

4.3.2 Null Results versus Uninformative Results

There is one final adaptation we need to make to this nomenclature: the notion of power. If a clinical trial is positive, then of the two statistical errors (type I and type II), the trial’s critics concern themselves only with the type I error. The same is true for the interpretation of a negative trial (using our new definition of a negative trial as a trial whose one hypothesis test on its prospectively defined endpoint demonstrated that the intervention caused harm). This is because the finding in the sample was positive (negative), and the statistical error associated with a positive (negative) sample result is the type I error.[4] However, a study with a null finding must also address a possible statistical error which occurred in the sampling process. For a null finding, the statistical event of interest is the type II error. A type II error occurs when a population in which the intervention produces a benefit generates, through chance alone, a research sample that demonstrates no intervention benefit. The population is intervention-positive, but the sample is intervention-null. When the research sample finding is intervention-null, it becomes important to consider how likely it is that the null finding could have been produced by a population in which the intervention had a positive effect.[5] This translates into having adequate statistical power[6] for the null finding to be treated as a null result.

Since null findings are readily produced by hypothesis tests with inadequate power, the correct interpretation of the hypothesis test depends on the size of the type II error. For example, consider a study which is required to recruit 3868 patients in order to demonstrate, with 90% power and an alpha error level of 0.05, that an intervention reduces total mortality by 20% from a cumulative mortality rate of 0.20.[7] Unfortunately, during the execution of their clinical trial, the investigators are only able to recruit 2500 of the required 3868 patients. At the conclusion of the study, the investigators find that the relative risk for the cumulative mortality event is 0.85, representing a 15% reduction in the total mortality rate produced by the intervention. However, the investigators cannot conclude that the study is null. This is because their inability to recruit the remaining 1368 patients has dramatically reduced the power of the hypothesis test from 90% to 49%, for a type II error of 1 – 0.49 = 0.51. Stated another way, although it was unlikely that a population in which the intervention was effective for mortality would produce a sample of 3868 patients in which the intervention was ineffective, it is very likely that that same population would produce a sample of 2500 patients in which the intervention appeared ineffective. In this case, although the investigators were unable to reject the null hypothesis of no effect, the large type II error blocks them from saying that the result of the study was null. They instead must say that the study was uninformative on the mortality issue.
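The figures in this example can be reproduced with a standard normal-approximation calculation for comparing two proportions. The sketch below assumes a two-sided test at α = 0.05 with equal allocation to the two arms, and recovers the quoted 49% by evaluating power at the observed 15% reduction; the precise conventions belong to Appendix 6, and the function names here are illustrative:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group(p_control, p_treated, alpha=0.05, power=0.90):
    """Patients per arm for a two-sided, two-sample test of proportions
    (unpooled normal approximation)."""
    d = abs(p_control - p_treated)
    var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(z * z * var / (d * d))

def power_two_proportions(n_per_arm, p_control, p_treated, alpha=0.05):
    """Approximate power of the same test for a given per-arm sample size."""
    d = abs(p_control - p_treated)
    var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    z_beta = sqrt(n_per_arm * d * d / var) - Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(z_beta)

# Design: detect a 20% reduction from a cumulative mortality rate of 0.20
print(2 * n_per_group(0.20, 0.16))                        # 3868 patients in all
# Achieved enrollment of 2500 (1250 per arm), power evaluated
# at the observed 15% reduction (0.20 -> 0.17)
print(round(power_two_proportions(1250, 0.20, 0.17), 2))  # ~0.49
```

The same power function confirms the design point: with 1934 patients per arm and the planned 20% reduction, power is approximately 90%.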

A hypothesis test which does not reject the null hypothesis but has inadequate power will be described as “uninformative”. This is consistent with a commonly used admonition at the F.D.A.: “Absence of evidence is not evidence of absence.” In the circumstances of the preceding clinical trial, this aphorism may be interpreted as “absence of evidence (of a beneficial effect of the intervention in the research sample) is not evidence of absence (of a beneficial effect of the intervention in the population)”. The absence of evidence of the effect in the sample is evidence of absence of the effect in the population at large only in the high-power environment of a concordantly executed clinical trial.

From this discussion we see that clinical trials whose results are based on hypothesis tests are either positive, negative, null, or uninformative.


[1] The difficulties of random research are examined in chapter two.

[2] An example of a negative trial is the CAST study (Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N Engl J Med 1989;321:406–412). CAST demonstrated the harmful effects of arrhythmia treatment in patients who had suffered a heart attack.

[3] The finding of a null result has been described as demonstrating “neither therapeutic triumph nor therapeutic calamity”.

[4] Recall that a type I error is the event that a population in which the intervention has no effect will produce a sample with a beneficial (or harmful) effect.

[5] Since, during the design of a clinical trial, the investigator does not know whether the sample results will be positive, negative, or null, she must protect her research result from both a type I error and a type II error. This is why both of these error rates are built into the sample size calculation developed in Appendix 6.

[6] Statistical power is defined as one minus the type II error. For example 90% power means there is a 10% probability of a type II error.

[7] An elementary discussion of sample size and power computations is provided in Appendix 6.