Many medical research studies are published with large numbers of significance tests. These are not usually independent, being carried out on the same set of subjects, so the above calculations do not apply exactly. However, it is clear that if we go on testing long enough we will find something which is ‘significant’. We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in 20 which we should get by chance alone.

This is particularly important when we find that a clinical trial or
epidemiological study gives no significant difference overall, but does
so in a particular subset of subjects, such as women aged over 60. For
example, Lee *et al.* (1980) simulated a clinical trial of the treatment
of coronary artery disease by allocating 1073 patient records from past
cases into two ‘treatment’ groups at random.
They then analysed the outcome
as if it were a genuine trial of two treatments. The analysis was quite
detailed and thorough. As we would expect, it failed to show any significant
difference in survival between those patients allocated to the
two ‘treatments’.
Patients were then subdivided by two variables which affect prognosis,
the number of diseased coronary vessels and whether the left ventricular
contraction pattern was normal or abnormal. A significant difference in
survival between the two ‘treatment’ groups was found in those patients
with three diseased vessels (the maximum) and abnormal ventricular contraction.
As this would be the subset of patients with the worst prognosis, the finding
would be easy to account for by saying that the superior ‘treatment’ had
its greatest advantage in the most severely ill patients! The moral of
this story is that if there is no difference between the treatments overall,
significant differences in subsets are to be treated with the utmost suspicion.
This method of looking for a difference in treatment effect between subgroups
of subjects is incorrect. A correct approach would be to use a multifactorial
analysis, as described in Chapter 15, with treatment and group as two factors,
and test for an interaction between groups and treatments (Section 15.5).
The power for
detecting such interactions is quite low, and we need a larger sample than
would be needed simply to show a difference overall (Altman and Matthews
1996; Matthews and Altman 1996a, 1996b).
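The way a subgroup search manufactures a 'significant' result can be illustrated with a rough simulation in the spirit of Lee *et al.* (1980). This is only a sketch under stated assumptions, not their analysis: the six equally likely subgroups, the common death risk of 0.3, the use of a simple z test for two proportions in place of a survival analysis, and all function names are illustrative choices that do not come from the original study.

```python
import math
import random


def two_proportion_p(deaths_a, n_a, deaths_b, n_b):
    """Two-sided P value from a large-sample z test comparing two proportions."""
    if n_a == 0 or n_b == 0:
        return 1.0  # an empty arm gives no comparison
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # everyone had the same outcome
    z = (deaths_a / n_a - deaths_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_trial(rng, n_patients=1073, n_subgroups=6, death_risk=0.3):
    """Return (overall P, subgroup P values) for one trial with no real effect."""
    # counts[subgroup][arm] holds [deaths, patients]
    counts = [[[0, 0], [0, 0]] for _ in range(n_subgroups)]
    for _ in range(n_patients):
        subgroup = rng.randrange(n_subgroups)  # assumed: equally likely subgroups
        arm = rng.randrange(2)                 # 'treatment' allocated at random
        counts[subgroup][arm][0] += rng.random() < death_risk
        counts[subgroup][arm][1] += 1
    subgroup_ps = [two_proportion_p(a[0], a[1], b[0], b[1]) for a, b in counts]
    totals = [[sum(c[arm][i] for c in counts) for i in range(2)] for arm in range(2)]
    overall_p = two_proportion_p(totals[0][0], totals[0][1],
                                 totals[1][0], totals[1][1])
    return overall_p, subgroup_ps


def any_subgroup_significant_rate(n_trials=1000, alpha=0.05, seed=1):
    """Proportion of null trials in which at least one subgroup has P < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        _, subgroup_ps = simulate_trial(rng)
        if min(subgroup_ps) < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # With six separate subgroups we would expect a proportion somewhere
    # near 1 - 0.95**6, i.e. about a quarter of all null trials.
    print(any_subgroup_significant_rate())
```

Although neither 'treatment' does anything, a sizeable fraction of the simulated trials contain at least one subgroup that looks significant at the 0.05 level.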

This spurious significant difference comes about because, when there is no real difference, the probability of getting no significant differences in six subgroups is 0.95^{6} = 0.74, not 0.95. We can allow for this effect by the **Bonferroni** method. In general, if we have *k* independent significance tests, at the *α* level, of null hypotheses which are all true, the probability that we will get no significant differences is (1 - *α*)^{k}. If we make *α* small enough, we can make the probability that none of the separate tests is significant equal to 0.95. Then if any of the *k* tests has a P value less than *α*, we will have a significant difference between the treatments at the 0.05 level. Since *α* will be very small, it can be shown that (1 - *α*)^{k} is approximately equal to 1 - *kα*. If we put *kα* = 0.05, so *α* = 0.05/*k*, we will have probability approximately 0.05 that at least one of the *k* tests will have a P value less than *α* if the null hypotheses are true. Thus, if in a clinical trial we compare two treatments within 5 subsets of patients, the treatments will be significantly different at the 0.05 level if there is a P value less than 0.05/5 = 0.01 within any of the subsets. This is the Bonferroni method. Note that the treatments are then significantly different at the 0.05 level only, not at the 0.01 level. The *k* tests together test the composite null hypothesis that there is no treatment effect on any variable.
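As a quick numerical check of the approximation, the following sketch compares the per-test level obtained by solving (1 - *α*)^{k} = 0.95 exactly (often called the Šidák correction, a name not used in the text) with the Bonferroni level 0.05/*k*. The function names are illustrative only.

```python
def exact_per_test_alpha(k, overall=0.05):
    """Solve (1 - alpha)**k = 1 - overall for alpha (k independent tests)."""
    return 1 - (1 - overall) ** (1 / k)


def bonferroni_per_test_alpha(k, overall=0.05):
    """Bonferroni approximation: alpha = overall / k."""
    return overall / k


for k in (2, 5, 8, 35):
    exact = exact_per_test_alpha(k)
    approx = bonferroni_per_test_alpha(k)
    # Probability that at least one of k independent tests is significant
    # when each is carried out at the Bonferroni level.
    achieved = 1 - (1 - approx) ** k
    print(f"k={k:2d}  exact={exact:.5f}  bonferroni={approx:.5f}  "
          f"overall={achieved:.4f}")
```

The Bonferroni level is always slightly smaller than the exact one, so the overall probability of at least one significant result falls just short of 0.05 even when the tests really are independent.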

We can do the same thing by multiplying the observed P value from each
significance test by the number of tests, *k*, any *k*P which exceeds
one being ignored. Then if any *k*P is less than 0.05, the two treatments
are significantly different at the 0.05 level.

For example, Williams *et al.* (1992) randomly allocated elderly
patients discharged from hospital to two groups. The intervention group
received timetabled visits by health visitor assistants; the control
group were not visited unless there was a perceived need. Soon after discharge
and after one year, patients were assessed for physical health, disability, and
mental state using questionnaire scales. There were no significant differences
overall between the intervention and control groups, but among women aged
75-79 living alone the control group showed significantly greater deterioration
in physical score than did the intervention group (P=0.04), and among men
over 80 years the control group showed significantly greater deterioration
in disability score than did the intervention group (P=0.03). The authors
stated that ‘Two small sub-groups of patients were possibly shown to have
benefited from the intervention. ... These benefits, however, have to be
treated with caution, and may be due to chance factors.’ Subjects were
cross-classified by age groups, whether living alone, and sex, so there
were at least eight subgroups, if not more. Thus even if we consider the
three scales separately, only a P value less than 0.05/8 = 0.006 would
provide evidence of a treatment effect. Alternatively, the Bonferroni-adjusted
P values are 8 × 0.04 = 0.32 and 8 × 0.03 = 0.24.
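The adjustment by multiplication can be sketched in a few lines. The function name is hypothetical, and capping the product at one is a common convention assumed here; the text above simply ignores products which exceed one.

```python
def bonferroni_adjust(p_values, k=None):
    """Multiply each P value by the number of tests k, capping the result at 1."""
    if k is None:
        k = len(p_values)  # assume the list holds every test carried out
    return [min(1.0, k * p) for p in p_values]


# Williams et al. (1992): two reported subgroup P values, at least 8 subgroups.
adjusted = bonferroni_adjust([0.04, 0.03], k=8)
print([round(p, 2) for p in adjusted])  # → [0.32, 0.24]
```

Passing `k` explicitly matters here, because only the two 'significant' subgroup results were reported while at least eight subgroups were examined.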

A similar problem arises if we have multiple outcome measurements. For
example, Newnham *et al.* (1993) randomized pregnant women to receive
a series of Doppler ultrasound blood flow measurements or to control. They
found a significantly higher proportion of birthweights below the 10th
and 3rd centiles (P=0.006 and P=0.02). These were only two of many comparisons,
however, and one would suspect that there may be some spurious significant
differences among so many. At least 35 were reported in the paper, though
only these two were reported in the abstract (birthweight was not the intended
outcome variable for the trial). These tests are not independent, because
they are all on the same subjects, using variables which may not be independent.
The proportions of birthweights below the 10th and 3rd centiles are clearly
not independent, for example. The probability that two correlated variables
both give non-significant differences when the null hypothesis is true
is greater than (1 - *α*)^{2}, because if the first test
is not significant, the second now has a probability greater than 1 - *α*
of being not significant also. (Similarly, the probability that both are
significant exceeds *α*^{2}, and the probability that
only one is significant is reduced.) For *k* tests the probability
of no significant differences is greater than (1 - *α*)^{k}
and so greater than 1 - *kα*. Thus if we carry out each test
at the *α* = 0.05/*k* level, we will still have a probability
of no significant differences which is greater than 0.95. A P value less
than *α* for any variable, or *k*P < 0.05, would mean
that the treatments were significantly different. For the example, we have
*α* = 0.05/35 = 0.0014. Neither observed P value (0.006 and 0.02) is
below this, and so by the Bonferroni criterion the treatment groups are
not significantly different. Alternatively, the P values could be adjusted
to 35 × 0.006 = 0.21 and 35 × 0.02 = 0.70.

Because the probability of obtaining no significant differences if the null hypotheses are all true is greater than the 0.95 which we want it to be, the overall P value is actually smaller than the nominal 0.05, by an unknown amount which depends on the lack of independence between the tests. The power of the test, its ability to detect true differences in the population, is correspondingly diminished. In statistical terms, the test is conservative.
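The conservatism can be seen in a small simulation. This is a sketch under assumptions which are not in the text: each test statistic is taken to be standard Normal under the null hypothesis, every pair of statistics shares the same correlation `rho`, and each test is two-sided at the Bonferroni level 0.05/*k*. The function and parameter names are illustrative.

```python
import math
import random
from statistics import NormalDist


def familywise_error_rate(k, rho, overall=0.05, n_sims=20000, seed=1):
    """Estimate P(at least one of k tests significant) when all nulls are true.

    Each test statistic is standard Normal and every pair has correlation
    rho (0 <= rho <= 1). Each test is two-sided at the level overall / k.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - overall / (2 * k))
    shared_weight = math.sqrt(rho)
    own_weight = math.sqrt(1 - rho)
    false_alarms = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # component common to all k statistics
        if any(abs(shared_weight * shared + own_weight * rng.gauss(0, 1)) > critical
               for _ in range(k)):
            false_alarms += 1
    return false_alarms / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, familywise_error_rate(k=10, rho=rho))
```

With `rho = 0` the estimated overall error rate should sit close to, and just below, 0.05; as the correlation rises it should fall further below the nominal level, which is the sense in which the method is conservative.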

Other multiple testing problems arise when we have more than two groups of subjects and wish to compare each pair of groups (Section 10.9), when we have a series of observations over time, such as blood pressure every 15 minutes after administration of a drug, where there may be a temptation to test each time point separately (Section 10.7), and when we have relationships between many variables to examine, as in a survey. For all these problems, the multiple tests are highly correlated and the Bonferroni method is inappropriate, as it will be highly conservative and may miss real differences.

Lee, K.L., McNeer, J.F., Starmer, F.C., Harris, P.J., and Rosati, R.A.
(1980) Clinical judgements and statistics: lessons from a simulated randomized
trial in coronary artery disease. *Circulation* **61**, 508-15.

Matthews, J.N.S. and Altman, D.G. (1996a) Statistics Notes. Interaction
2: compare effect sizes not P values. *British Medical Journal* **313**,
808.

Matthews, J.N.S. and Altman, D.G. (1996b) Statistics Notes. Interaction
3: how to examine heterogeneity. *British Medical Journal* **313**,
862.

Newnham, J.P., Evans, S.F., Con, A.M., Stanley, F.J., and Landau, L.I. (1993)
Effects of frequent ultrasound during pregnancy: a randomized controlled
trial. *Lancet* **342**, 887-91.

Williams, E.I., Greenwell, J., and Groom, L.M. (1992) The care of people
over 75 years old after discharge from hospital: an evaluation of timetabled
visiting by Health Visitor Assistants. *Journal of Public Health Medicine*
**14**, 138-44.

Adapted from pages 123–125 of
*An Introduction to Medical Statistics* by Martin Bland, 2015,
reproduced by permission of
Oxford University Press.


This page maintained by Martin Bland.

Last updated: 7 August, 2015.