Many medical research studies are published with large numbers of significance tests. These are not usually independent, being carried out on the same set of subjects, so the above calculations do not apply exactly. However, it is clear that if we go on testing long enough we will find something which is `significant'. We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in twenty which we should get by chance alone.
This is particularly important when we find that a clinical trial or epidemiological study gives no significant difference overall, but does so in a particular subset of subjects, such as women aged over 60. For example, Lee et al. (1980) simulated a clinical trial of the treatment of coronary artery disease by allocating 1073 patient records from past cases into two `treatment' groups at random. They then analysed the outcome as if it were a genuine trial of two treatments. The analysis was quite detailed and thorough. As we would expect, it failed to show any significant difference in survival between those patients allocated to the two `treatments'. Patients were then subdivided by two variables which affect prognosis, the number of diseased coronary vessels and whether the left ventricular contraction pattern was normal or abnormal. A significant difference in survival between the two `treatment' groups was found in those patients with three diseased vessels (the maximum) and abnormal ventricular contraction. As this would be the subset of patients with the worst prognosis, the finding would be easy to account for by saying that the superior `treatment' had its greatest advantage in the most severely ill patients! The moral of this story is that if there is no difference between the treatments overall, significant differences in subsets are to be treated with the utmost suspicion. This method of looking for a difference in treatment effect between subgroups of subjects is incorrect. A correct approach would be to use a multifactorial analysis, as described in Chapter 17, with treatment and group as two factors, and test for an interaction between groups and treatments. The power for detecting such interactions is quite low, and we need a larger sample than would be needed simply to show a difference overall (Altman and Matthews 1996; Matthews and Altman 1996a, 1996b).
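The effect seen in the simulated trial can be reproduced in a few lines. The sketch below is not Lee et al.'s analysis: it assumes a normally distributed outcome with known standard deviation in place of survival, six equal subgroups of hypothetical size, and a simple z test within each subgroup. The function names and the parameters `n_per_arm`, `n_trials`, and `seed` are illustrative only. Both `treatment' arms are drawn from the same distribution, so every significant subgroup result is spurious.

```python
import random
from statistics import NormalDist, mean


def subgroup_p_values(n_subgroups=6, n_per_arm=50, rng=random):
    """Return one two-sided P value per subgroup for a trial with no real effect."""
    p_values = []
    for _ in range(n_subgroups):
        # Both arms come from the same N(0, 1) distribution: the null hypothesis is true.
        arm_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        # z test for a difference in means, taking the standard deviation as known (1).
        z = (mean(arm_a) - mean(arm_b)) / (2.0 / n_per_arm) ** 0.5
        p_values.append(2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return p_values


def proportion_with_spurious_subgroup(n_trials=1000, alpha=0.05, seed=1):
    """Proportion of null trials in which at least one subgroup is 'significant'."""
    rng = random.Random(seed)
    hits = sum(min(subgroup_p_values(rng=rng)) < alpha for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    print(proportion_with_spurious_subgroup())
```

With six independent subgroups, the proportion printed should be in the region of one in four, far above the nominal one in twenty for a single test.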
This spurious significant difference comes about because, when there is no real difference, the probability of getting no significant differences in six subgroups is 0.95^6 = 0.74, not 0.95. We can allow for this effect by the Bonferroni method. In general, if we have k independent significance tests, at the alpha level, of null hypotheses which are all true, the probability that we will get no significant differences is (1 - alpha)^k. If we make alpha small enough, we can make the probability that none of the separate tests is significant equal to 0.95. Then if any of the k tests has a P value less than alpha, we will have a significant difference between the treatments at the 0.05 level. Since alpha will be very small, it can be shown that (1 - alpha)^k is approximately equal to 1 - k alpha. If we put k alpha = 0.05, so alpha = 0.05/k, we will have probability approximately 0.05 that at least one of the k tests will have a P value less than alpha if the null hypotheses are true. Thus, if in a clinical trial we compare two treatments within 5 subsets of patients, the treatments will be significantly different at the 0.05 level if there is a P value less than 0.01 within any of the subsets. This is the Bonferroni method. Note that they are not significant at the 0.01 level, but at only the 0.05 level. The k tests together test the composite null hypothesis that there is no treatment effect on any variable.
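The arithmetic in this paragraph can be checked directly. The following is a minimal sketch; the function names are illustrative and not part of any library. It evaluates the probability of no significant result among k independent tests, and compares the exact overall error rate at the Bonferroni level with the approximation k alpha.

```python
def prob_no_significant(alpha, k):
    """Probability that none of k independent tests is significant when all nulls are true."""
    return (1.0 - alpha) ** k


def bonferroni_alpha(overall_alpha, k):
    """Per-test significance level that keeps the overall level near overall_alpha."""
    return overall_alpha / k


# Six subgroups, each tested at the 0.05 level.
print(round(prob_no_significant(0.05, 6), 2))  # 0.74

# Five subgroups with the Bonferroni correction.
per_test = bonferroni_alpha(0.05, 5)
print(round(per_test, 4))  # 0.01

# Exact chance of at least one significant test at that level, close to k * alpha = 0.05.
print(round(1.0 - prob_no_significant(per_test, 5), 4))  # 0.049
```

The last figure is slightly below 0.05, which shows that the approximation (1 - alpha)^k = 1 - k alpha errs a little on the cautious side even for independent tests.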
We can do the same thing by multiplying the observed P value from each significance test by the number of tests, k, ignoring any kP which exceeds one. Then if any kP is less than 0.05, the two treatments are significantly different at the 0.05 level.
For example, Williams et al. (1992) randomly allocated elderly patients discharged from hospital to two groups. The intervention group received timetabled visits by health visitor assistants, while the control group were not visited unless there was perceived need. Soon after discharge and after one year, patients were assessed for physical health, disability, and mental state using questionnaire scales. There were no significant differences overall between the intervention and control groups, but among women aged 75-79 living alone the control group showed significantly greater deterioration in physical score than did the intervention group (P=0.04), and among men over 80 years the control group showed significantly greater deterioration in disability score than did the intervention group (P=0.03). The authors stated that `Two small sub-groups of patients were possibly shown to have benefited from the intervention. ... These benefits, however, have to be treated with caution, and may be due to chance factors.' Subjects were cross-classified by age groups, whether living alone, and sex, so there were at least eight subgroups, if not more. Thus even if we consider the three scales separately, only a P value less than 0.05/8 = 0.006 would provide evidence of a treatment effect. Alternatively, the Bonferroni-adjusted P values are 8 times 0.04 = 0.32 and 8 times 0.03 = 0.24.
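Multiplying each P value by the number of tests is easily written as a small function. This is a sketch only: `bonferroni_adjust` is an illustrative name, and capping the product at 1 is an assumed way of handling products which exceed one, since a probability cannot be larger than that. The two P values are those quoted above, with k = 8 subgroups.

```python
def bonferroni_adjust(p_values, k=None):
    """Multiply each P value by the number of tests k, capping the result at 1."""
    if k is None:
        # Assume the P values supplied are all the tests that were carried out.
        k = len(p_values)
    return [min(1.0, k * p) for p in p_values]


# The two subgroup results, adjusted for at least eight subgroups.
adjusted = bonferroni_adjust([0.04, 0.03], k=8)
print([round(p, 2) for p in adjusted])  # [0.32, 0.24]

# Neither adjusted value is below 0.05, so there is no evidence of a treatment effect.
print(any(p < 0.05 for p in adjusted))  # False
```

Note that k must be the number of tests actually carried out, not only the number reported, so in practice it is often a lower bound, as it is here.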
A similar problem arises if we have multiple outcome measurements. For example, Newnham et al. (1993) randomized pregnant women to receive a series of Doppler ultrasound blood flow measurements or to control. They found a significantly higher proportion of birthweights below the 10th and 3rd centiles (P=0.006 and P=0.02). These were only two of many comparisons, however, and one would suspect that there may be some spurious significant differences among so many. At least 35 were reported in the paper, though only these two were reported in the abstract (birthweight was not the intended outcome variable for the trial). These tests are not independent, because they are all on the same subjects, using variables which may not be independent. The proportions of birthweights below the 10th and 3rd centiles are clearly not independent, for example. The probability that two correlated variables both give non-significant differences when the null hypothesis is true is greater than (1 - alpha)^2, because if the first test is not significant, the second now has a probability greater than 1 - alpha of being not significant also. (Similarly, the probability that both are significant exceeds alpha^2, and the probability that only one is significant is reduced.) For k tests the probability of no significant differences is greater than (1 - alpha)^k and so greater than 1 - k alpha. Thus if we carry out each test at the alpha = 0.05/k level, we will still have a probability of no significant differences which is greater than 0.95. A P value less than alpha for any variable, or kP < 0.05, would mean that the treatments were significantly different. For the example, we have alpha = 0.05/35 = 0.0014 and so by the Bonferroni criterion the treatment groups are not significantly different. Alternatively, the P values could be adjusted to 35 times 0.006 = 0.21 and 35 times 0.02 = 0.70.
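The claim about two correlated tests can be illustrated by simulation. The sketch below assumes that the two test statistics follow a bivariate Normal distribution with correlation `rho` when the null hypotheses are true; that distribution, the function name, and the values of `rho`, `n_sims`, and `seed` are assumptions made for illustration and do not come from the trial data.

```python
import random
from statistics import NormalDist


def prob_both_non_significant(rho, alpha=0.05, n_sims=100_000, seed=1):
    """Estimate P(both tests non-significant) for two z statistics with correlation rho."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    both = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0.0, 1.0)
        # Build z2 so that it is standard Normal with correlation rho to z1.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        if abs(z1) < z_crit and abs(z2) < z_crit:
            both += 1
    return both / n_sims


if __name__ == "__main__":
    print("independent value:", round(0.95 ** 2, 4))
    for rho in (0.0, 0.5, 0.8):
        print(rho, prob_both_non_significant(rho))
```

With `rho = 0` the estimate should sit close to (1 - alpha)^2 = 0.9025, and it should rise towards 0.95 as the correlation increases.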
Because the probability of obtaining no significant differences if the null hypotheses are all true is greater than the 0.95 which we want it to be, the overall P value is actually smaller than the nominal 0.05, by an unknown amount which depends on the lack of independence between the tests. The power of the test, its ability to detect true differences in the population, is correspondingly diminished. In statistical terms, the test is conservative.
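How conservative the Bonferroni method becomes can be gauged by simulating correlated tests for which every null hypothesis is true. The sketch below assumes k test statistics which are standard Normal with a common pairwise correlation `rho`, a convenient but hypothetical structure. The function name and all parameter values are illustrative. It estimates the probability that at least one test is significant at the 0.05/k level.

```python
import random
from statistics import NormalDist


def bonferroni_familywise_error(rho, k=10, overall_alpha=0.05, n_sims=20_000, seed=1):
    """Estimate P(at least one of k equicorrelated null tests is significant at alpha/k)."""
    rng = random.Random(seed)
    # Two-sided critical value for the Bonferroni per-test level overall_alpha / k.
    z_crit = NormalDist().inv_cdf(1.0 - (overall_alpha / k) / 2.0)
    shared_weight = rho ** 0.5
    own_weight = (1.0 - rho) ** 0.5
    false_alarms = 0
    for _ in range(n_sims):
        # A component shared by all k statistics induces pairwise correlation rho.
        shared = rng.gauss(0.0, 1.0)
        if any(
            abs(shared_weight * shared + own_weight * rng.gauss(0.0, 1.0)) > z_crit
            for _ in range(k)
        ):
            false_alarms += 1
    return false_alarms / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, bonferroni_familywise_error(rho))
```

For independent tests the estimate should be just under 0.05. As `rho` increases it should fall well below 0.05, which is the loss of power described above.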
Other multiple testing problems arise when we have more than two groups of subjects and wish to compare each pair of groups (Section 10.9), when we have a series of observations over time, such as blood pressure every 15 minutes after administration of a drug, where there may be a temptation to test each time point separately (Section 10.7), and when we have relationships between many variables to examine, as in a survey. For all these problems, the multiple tests are highly correlated and the Bonferroni method is inappropriate, as it will be highly conservative and may miss real differences.
Lee, K.L., McNeer, J.F., Starmer, F.C., Harris, P.J., and Rosati, R.A. (1980) Clinical judgements and statistics: lessons from a simulated randomized trial in coronary artery disease. Circulation 61, 508-15.
Matthews, J.N.S. and Altman, D.G. (1996a) Statistics Notes. Interaction 2: compare effect sizes not P values. British Medical Journal 313, 808.
Matthews, J.N.S. and Altman, D.G. (1996b) Statistics Notes. Interaction 3: how to examine heterogeneity. British Medical Journal 313, 862.
Newnham, J.P., Evans, S.F., Con, A.M., Stanley, F.J., and Landau, L.I. (1993) Effects of frequent ultrasound during pregnancy: a randomized controlled trial. Lancet 342, 887-91.
Williams, E.I., Greenwell, J., and Groom, L.M. (1992) The care of people over 75 years old after discharge from hospital: an evaluation of timetabled visiting by Health Visitor Assistants. Journal of Public Health Medicine 14, 138-44.
This page maintained by Martin Bland.
Last updated: 14 April, 2004.