Regression towards the mean or Why was Terminator III such a disappointment?

Martin Bland
Dept of Health Sciences
University of York

Talk first presented to the Commission for Health Improvement, January 2004.

An example of regression towards the mean

The figure below comes from a trial of the efficacy and tolerability of borage oil in adults and children with atopic eczema (Takwale et al., 2003). This was a randomised, double blind, placebo controlled, parallel group trial where patients with eczema were recruited, treated, and followed over time to assess their symptoms. The graph shows an atopic dermatitis score, SASSAD, using six areas and six signs, throughout the trial (n=140; error bars show SE).

Subjects were recruited because they had eczema which had led them to seek medical advice, hence they had high symptom scores. As they are followed over time, their average score falls, whether they receive active treatment or not. This is a classic instance of regression towards the mean.

Outline

In this talk I will try to answer the following questions:

• What is regression towards the mean?
• Where does the name come from?
• Why is it such a problem?
• What can we do about it?

I shall proceed mainly by illustration. I shall not go into the mathematics of regression towards the mean, but I will give references to several papers describing statistical techniques for estimating its effects in different circumstances.

What is regression towards the mean?

The graph below shows pulse rate measured by two different observers for 185 students. They are not the same; there is considerable variation.

The two pulses should be the same apart from measurement error. The next figure shows the line of equality, on which the points would all lie if the pulse measurements were the same. This represents the true or functional relationship between them, that they are identical apart from error. The diagram also shows lines through the means of the two variables. The means are the same and in fact the whole distribution is the same for each variable.

What is the mean second pulse measurement for students whose first pulse is 60 b/min? As few students have a measurement exactly 60, we can find the mean for those around 60, between 55 and 65, shown by the dark band on the next graph:

We can see that the mean is going to be more than 60 b/min. The next figure shows the point marked:

It is closer to the mean for the second measurement than is 60 b/min to the mean for the first measurement. We can do this for any value of the first pulse measurement. The next graph shows it for first measurements around 50, 60, 70, 80, etc. These means do not lie on the line of identity but on one which crosses it.

These means lie on the regression line, the regression of second pulse on first pulse. This is shown in the next graph:

It works in the same way if we start with the second measurement. What was the average first measurement for students whose second measurement was 60 b/min?

Again, it is closer to the mean.

The mean first pulse for a given value of the second pulse lies on the other regression line, the regression of first pulse on second pulse.

There are two regression lines. Neither is the same as the line of equality, which represents the true, functional relationship.
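These ideas are easy to reproduce in simulation. The sketch below uses entirely invented numbers, chosen only so that the correlation between the two readings resembles the pulse data: each student has a stable underlying pulse, and each observer's reading adds independent error. The subgroup mean is pulled towards the overall mean, and both regression slopes fall below the slope of 1 of the line of equality.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical model: a stable underlying pulse for each student,
# plus independent measurement error for each observer (SDs invented)
true_pulse = rng.normal(72, 8, n)
first = true_pulse + rng.normal(0, 6, n)
second = true_pulse + rng.normal(0, 6, n)

# Mean second reading for students whose first reading is around 60 b/min:
band = (first > 55) & (first < 65)
print(round(second[band].mean(), 1))   # pulled up towards the mean of 72

# Both regression slopes are about r, below the line of equality's slope of 1
slope_2_on_1 = np.polyfit(first, second, 1)[0]
slope_1_on_2 = np.polyfit(second, first, 1)[0]
print(round(slope_2_on_1, 2), round(slope_1_on_2, 2))   # both about 0.64
```

With a true-value SD of 8 and an error SD of 6, the correlation between the two readings is 64/(64+36) = 0.64, which is exactly the slope we expect for both regressions.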

Where does the name come from?

The name “regression” comes from a paper by the Victorian geneticist and polymath Francis Galton (Galton 1886) entitled “Regression towards mediocrity in hereditary stature”. Galton set up a stand at the International Health Exhibition of 1884, where he measured the heights of families attending. He adjusted the female heights by multiplying by 1.08. He then calculated the average height of the two parents, the “midparent height”, and related it to the height of their adult children:

This plot is based on Galton’s original. The area of the circle represents the number of coincident points. The line is the regression of child height on midparent height. The means of both are the same, 68.2 inches.

Consider parents with midparent height 70 inches. Their children had heights between 67 and 73 inches, with a mean of 69.6 inches. The mean height of this subgroup of children was closer to the mean height of all children than the mean height of their midparents was to the mean of all midparents. Galton called this ‘regression towards mediocrity’.

The same thing happens if we start with the children. For example, for the children with height 70 inches, the mean height of their midparents is 67.9 inches. This is a statistical, not a genetic phenomenon.

Because the word “mediocrity” has acquired adverse connotations since Galton’s time, we now call the phenomenon “regression towards the mean”.

Why is regression towards the mean such a problem?

Treatment to reduce high levels of a measurement

Subjects with extreme values of a measurement, such as high blood pressure, may be selected and treated to bring their values closer to the mean. If they are measured again, we will observe that the mean of the extreme group is now closer to the mean of the whole population, i.e. reduced. This is often interpreted as showing the effect of the treatment. However, even if subjects are not treated the mean blood pressure will go down, due to regression towards the mean.

In the following graph, A and B are selected.

The average goes down, though the distribution is unchanged.

For example, in the Australian trial in mild hypertension (Reader et al. 1980), patients were selected if their average diastolic blood pressure (DBP) over four readings at two visits was between 95 and 110 mm Hg and their systolic blood pressure was below 200 mm Hg. In the placebo group, the mean DBP on screening was 100.4 mm Hg and the mean DBP during the trial was 93.9 mm Hg, a mean fall of 6.6 mm Hg without any treatment.
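The same pattern appears in a simulation with wholly invented numbers, chosen only to be loosely hypertension-like. Nobody is treated, yet the mean of the selected band falls on remeasurement:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Each subject has a stable "true" DBP; each visit adds independent variation
true_dbp = rng.normal(85, 10, n)
screening = true_dbp + rng.normal(0, 6, n)
follow_up = true_dbp + rng.normal(0, 6, n)   # no treatment given

# Select the "mildly hypertensive" band on the screening reading
selected = (screening >= 95) & (screening <= 110)
print(round(screening[selected].mean(), 1))  # about 100
print(round(follow_up[selected].mean(), 1))  # several mm Hg lower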
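The same pattern appears in a simulation with wholly invented numbers, chosen only to be loosely hypertension-like. Nobody is treated, yet the mean of the selected band falls on remeasurement:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Each subject has a stable "true" DBP; each visit adds independent variation
true_dbp = rng.normal(85, 10, n)
screening = true_dbp + rng.normal(0, 6, n)
follow_up = true_dbp + rng.normal(0, 6, n)   # no treatment given

# Select the "mildly hypertensive" band on the screening reading
selected = (screening >= 95) & (screening <= 110)
print(round(screening[selected].mean(), 1))  # about 100
print(round(follow_up[selected].mean(), 1))  # several mm Hg lower
```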

A non-medical example of the same thing is provided by a study of reoffending by ex-prisoners. A government minister was reported as claiming that prison sentences work, because following release from prison the next offence for which ex-prisoners were convicted tended to be for a less serious crime than that which had led to the prison sentence (Fletcher 1995). Because more serious crimes are more likely to be punished by prison sentences, ex-prisoners are a group selected because their last crime was at the extreme of the distribution. Hence the "average seriousness" of their next crimes will be lower.

Difference from baseline

In a trial we may measure the outcome variable before and after treatment. Researchers think that if they observe some imbalance between groups on the baseline measurement, they can allow for this by taking the difference "after minus baseline" as the outcome. This does not work. The following simulations show why:
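A sketch of such a simulation, with all numbers invented: the outcome is measured before and after in two groups, there is no treatment effect anywhere, yet because the groups differ at baseline their mean changes from baseline differ too.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

true_score = rng.normal(50, 10, n)
baseline = true_score + rng.normal(0, 8, n)
after = true_score + rng.normal(0, 8, n)   # no treatment effect at all

# Imagine the two groups are imbalanced at baseline (split at the
# population mean here, purely to exaggerate the imbalance)
group_a = baseline < 50
group_b = ~group_a

change = after - baseline
print(round(change[group_a].mean(), 1))   # low group appears to rise
print(round(change[group_b].mean(), 1))   # high group appears to fall
```

Taking “after minus baseline” has not removed the imbalance: the change score is itself negatively correlated with the baseline measurement.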

Strength of relationship between two variables

Suppose we wish to look at the relationship between two variables where the predictor variable is measured with error. For example, we might compare the coronary heart disease mortality in three groups of men categorised by their level of serum cholesterol. Some of the members of the low cholesterol group would be measured when they were lower than their personal mean serum cholesterol and so put in the ‘low’ group, but on subsequent measurement they would not fall into this group.

The mean observed serum cholesterol in the ‘low’ group would appear to rise when serum cholesterol was measured at a subsequent occasion, and, similarly, the mean serum cholesterol of the high group would appear to fall.

The difference in CHD mortality between the ‘low’ and the ‘high’ groups would thus be the difference in mortality between groups whose true difference in mean serum cholesterol was less than the apparent difference. The change in mortality per unit of serum cholesterol would be under-estimated.

An example was given by Gardner and Heady (1973). They looked at the relationship between ischaemic heart disease (IHD) over ten years and blood pressure at the start of the period. They compared the relationship of IHD with a single measurement of blood pressure and with the mean of six measurements. The mean of six will be closer to the true value of the subject’s long-term average blood pressure than will a single measurement. The relationship is therefore stronger between IHD and the mean of six BP readings than between IHD and a single BP reading.
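This attenuation is easy to demonstrate. In the sketch below (all numbers invented) the outcome truly rises one unit per unit of the underlying value; regressing on a single noisy reading roughly halves the slope, while regressing on the mean of six readings recovers most of it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

true_bp = rng.normal(90, 10, n)
# The outcome truly rises 1 unit per mm Hg of *true* blood pressure
outcome = true_bp + rng.normal(0, 20, n)

single = true_bp + rng.normal(0, 10, n)              # one reading
six_readings = true_bp[:, None] + rng.normal(0, 10, (n, 6))
mean_of_six = six_readings.mean(axis=1)              # average of six readings

print(round(np.polyfit(single, outcome, 1)[0], 2))       # about 0.50
print(round(np.polyfit(mean_of_six, outcome, 1)[0], 2))  # about 0.86
```

With equal true and error variances, the expected slope from a single reading is 100/(100+100) = 0.5; averaging six readings cuts the error variance to one sixth, giving 100/116.7 = 0.86.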

Relating change to initial value

We may be interested in the relationship between the initial value of a measurement and the change in that quantity over time. In anti-hypertensive drug trials, for example, it may be postulated that the drug’s effectiveness would be different (usually greater) for patients with more severe hypertension. Regression towards the mean will be greater for the patients with the highest initial blood pressures, so that we would expect to observe the postulated effect even in untreated patients.

In the Australian trial in mild hypertension (Reader et al. 1980), the falls in diastolic blood pressure (DBP) in the placebo group were as follows:

Screening DBP     Mean screening   Mean DBP on       Mean fall in
group (mm Hg)     DBP (mm Hg)      placebo (mm Hg)   DBP (mm Hg)
95-99             97.0             92.1              5.0
100-104           101.9            94.5              7.4
105-109           106.7            97.5              9.2

The higher the screening DBP in these untreated patients, the greater the fall. This is unlikely to be due to the effectiveness of the placebo.

Here is a corresponding table from the pulse data:

First pulse       Mean first       Mean second      Mean fall in
group (b/min)     pulse (b/min)    pulse (b/min)    pulse (b/min)
70-79             73.6             74.8             -1.2
80-89             83.1             78.5             4.6
90-129            99.5             89.8             9.7

The bigger the first pulse measurement, the greater is the mean fall to the second pulse measurement.
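The same table shape can be generated by simulation, with no treatment and no change in the underlying distribution (the means and SDs below are invented, chosen only to mimic the pulse data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

true_pulse = rng.normal(72, 8, n)
first = true_pulse + rng.normal(0, 6, n)
second = true_pulse + rng.normal(0, 6, n)   # same distribution, no treatment

# Mean fall from first to second reading, by band of first reading
falls = []
for lo, hi in [(70, 80), (80, 90), (90, 130)]:
    band = (first >= lo) & (first < hi)
    falls.append(first[band].mean() - second[band].mean())
    print(f"{lo}-{hi - 1}: mean fall {falls[-1]:.1f}")
```

The mean fall grows steadily with the first reading, just as in the observed table, although nothing has changed between the two occasions.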

Assessing the appropriateness of clinical decisions

Clinical decisions are sometimes assessed by asking a review panel to read case notes and decide whether they agree with the decision made. Because agreement between observers is seldom perfect, the panel is sure to conclude that some decisions are ‘wrong’.

For example, Barrett et al. (1990) reviewed cases of women who had had Caesarian section because of fetal distress. Five observers reviewed case notes and decided whether they thought the Caesarian had been appropriate or not appropriate. They were not unanimous in their judgements: the percentage agreement between pairs of observers in the panel varied from 60% to 82.5%. The expected agreement if they made their decisions by tossing a coin would be 50%, so this is not particularly good agreement. They judged a Caesarian ‘appropriate’ if at least four of the five observers thought a Caesarian should have been done. They concluded that 30% of all Caesarians for fetal distress were unnecessary.

As Esmail and Bland (1990) pointed out, as only women who had undergone Caesarian were reviewed, it was inevitable that another observer would conclude that some Caesarians had been inappropriate. Given the poor agreement between the judges, this number was bound to be quite high.

Comparison of two methods of measurement

When comparing two methods of measuring the same quantity, researchers are sometimes tempted to regress one method on the other. The fallacious argument is that if the methods agree the slope should be one. Because of the regression towards the mean effect, we expect the slope to be less than one even if the two methods agree closely.

For example, two studies of the validity of self-reported weight (Schlichting et al. 1981, Kuskowska-Wolk et al. 1989) used the same design. Self-reported weight was obtained from a group of subjects, and the subjects were then weighed. Regression analysis was done with reported weight as the outcome variable and measured weight as the predictor variable. The regression slope was less than one in each study. According to the regression equation, the mean reported weight of heavy subjects was less than their mean measured weight, and the mean reported weight of light subjects was greater than their mean measured weight. Both Schlichting et al. (1981) and Kuskowska-Wolk et al. (1989) interpreted this as follows: those who are overweight tend to report weights below their true value, and those who are excessively thin tend to report greater weights than they really have! But, of course, we expect the slope to be less than 1.0 when the distributions are the same and the true functional relationship is equality.

Recent examples of the misunderstanding of regression in the study of agreement between different methods of measurement can be found in the talk “Applying the Right Statistics: Analyses of Measurement Studies” available on this website and published in an expanded version by Bland and Altman (2003).

Two phase sampling

Two phase sampling is a procedure for studying small subgroups of a population. We take a large sample of the population (the first phase) and find out which are the members of the subgroups in which we are interested. We then take a sample from these subgroups for more detailed study (the second phase).

For example, children were asked on two occasions how often they smoked:

                               First occasion
Second occasion    >1/day   >1/week   occasional   never   Total
>1/day               15        5          2          0       22
>1/week              12       25         11          1       49
occasional            6       32         75         25      138
never                 0        2         10         72       84
Total                33       64         98         98      293

If we analyse the data by the second occasion smoking habits, none of the groups is really representative of children from the whole population who gave these replies. We analysed them by their answers on the first questionnaire. This was my first publication and, though I didn’t realise it, it involved regression towards the mean.

Extremity in different variables

We take an extreme group defined by a variable. Should they be equally extreme on other variables?

In an example from educational research, children were defined to be "gifted" if their IQ exceeded a cut-off. School attainment was measured on other scales. The mean attainment of the gifted children was fewer SDs above the population mean than was mean IQ for this group. This was interpreted as meaning that schools were failing "gifted" children.
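This is simply extremity in a correlated variable. A sketch, in which the 0.7 correlation and the two-SD cut-off are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical standardised scores, correlated about 0.7
iq_z = rng.normal(0, 1, n)
attain_z = 0.7 * iq_z + rng.normal(0, np.sqrt(1 - 0.7**2), n)

gifted = iq_z > 2                      # "gifted" = IQ two SDs above the mean
mean_iq = iq_z[gifted].mean()          # well above 2 SDs
mean_attain = attain_z[gifted].mean()  # about 0.7 times as far above the mean
print(round(mean_iq, 1), round(mean_attain, 1))
```

The attainment of the “gifted” group is necessarily less extreme than their IQ whenever the correlation is below 1, whether or not the schools are failing them.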

Publication bias

Referees for papers submitted for publication do not always agree as to which papers should be accepted. Because referees’ judgements of the quality of papers are made with error, they cannot be perfectly correlated with any measure of the true quality of the paper.

So the next time the Lancet turns you down and you see much weaker papers being published, be consoled that this is one more example of regression towards the mean.

Hollywood Sequels

A sequel to a Hollywood movie is only made if the original film is a success, which we might take as indicating that it is of high ‘quality’, commercial quality at any rate. (Art house films rarely get sequels, though they may, in critical terms, be of high quality.) The average ‘quality’ of sequels will be closer to the mean than the average ‘quality’ of originals which have sequels, due to regression towards the mean. They will thus tend to be of lower ‘quality’ than the original.

This does not mean that they are bad films. They may still have higher ‘quality’ than the average of all films. Also, this rule applies to the group as a whole, not to every member of it. A sequel is not necessarily of lower ‘quality’ than its original. It is the average ‘quality’ which is lower. However, the majority will be of lower quality than the original and so such films tend to be a disappointment, and the further from the original they get the greater the disappointment will be. "Terminator III" was a good example.

The ‘Curse of Hello’ and the ‘Sports Illustrated Jinx’

People who appear on the covers of these magazines often have bad things happen to them afterwards. Film stars flop, sportsmen lose. But you only get on these covers if you have recently been unusually successful. Regression towards the mean predicts that on average they will be less successful afterwards.

What happens to the teams of Premier League Managers of the Month? They often lose in the next month!

What can we do about regression to the mean?

We can tackle the problems caused by regression to the mean at both the design and the analysis stages.

Design

Control groups

We can avoid the problems caused by regression towards the mean in “before and after” studies by the use of a control group. The randomised trial with concurrent controls obviates problems caused by regression to the mean.

For example, in the Australian trial in mild hypertension (Reader et al. 1980), the falls in diastolic blood pressure (DBP) in the treated and the placebo groups were as follows:

Treatment    Mean screening   Mean DBP on         Mean fall in
group        DBP (mm Hg)      follow-up (mm Hg)   DBP (mm Hg)
Active       100.5            88.3                12.2
Placebo      100.4            93.9                6.6

Although DBP fell in both treated and placebo groups, it fell by more in the treated group, showing the effect of the active treatment.

Duplicate baseline measurements

If we are concerned about problems arising because we want to use a baseline measurement, for example in epidemiological follow-up studies of the strength of a relationship or in the study of the relationship between change and initial value, we can make a duplicate baseline measurement. We use one baseline measurement to select subjects or to calculate changes from, and the other as the predictor variable in the analysis.

Duplicate baseline measurements are best collected on a different occasion from the one used to group subjects. This is because the correlation between measurements on different occasions will be less than between measurements made on the same occasion.
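A sketch of the design, with invented numbers: select subjects on one screening reading, but measure change from a duplicate reading taken on another occasion. The spurious fall disappears.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

true_dbp = rng.normal(85, 10, n)
screen1 = rng.normal(0, 6, n) + true_dbp   # used only to select subjects
screen2 = rng.normal(0, 6, n) + true_dbp   # duplicate baseline, separate occasion
follow = rng.normal(0, 6, n) + true_dbp    # untreated follow-up

high = screen1 > 95
fall_vs_selection = (screen1 - follow)[high].mean()   # inflated by regression to the mean
fall_vs_duplicate = (screen2 - follow)[high].mean()   # close to zero

print(round(fall_vs_selection, 1))   # clearly positive despite no treatment
print(round(fall_vs_duplicate, 1))   # approximately zero
```

The duplicate reading regresses towards the mean by the same amount as the follow-up reading, so the difference between them is unbiased.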

Analysis

We can avoid the regression towards the mean effect which can occur with baseline measurements by not taking differences from baseline. Analysis of covariance is greatly to be preferred (Vickers and Altman 2001).

We can also estimate the expected regression towards the mean effect when we select a subgroup. We can then compare this to the change we actually see. How we do this depends on the data available to us. I shall not give any details here, but I have given references to methods applicable in different circumstances.

The regression towards the mean effect is predicted by the following version of the regression equation:

E(Y | X = x) = μ_Y + r (σ_Y / σ_X) (x − μ_X)

where r is the correlation between X and Y. When the two measurements have the same distribution, as for a repeated measurement of the same quantity, this simplifies to E(Y | X = x) = μ + r(x − μ). How we use this depends on what data we have and how reliably we can estimate the elements of the equation.

Adjusting for baseline measurements in clinical trials

Do not use differences from baseline. Use analysis of covariance instead, with the baseline measurement as covariate. This also has the advantage that we do not include the measurement error twice in the residual error used in t tests, regression, etc., which happens when we take differences from baseline (Vickers and Altman 2001).
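A minimal sketch of the analysis, using simulated data and ordinary least squares written out by hand rather than a dedicated ANCOVA routine: the coefficient on the treatment indicator, with baseline as a covariate, recovers the true effect (all numbers below are invented).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

true_score = rng.normal(50, 10, n)
baseline = true_score + rng.normal(0, 8, n)
treat = rng.integers(0, 2, n)          # randomised allocation, 0 or 1
true_effect = -5.0                     # invented treatment effect
after = true_score + true_effect * treat + rng.normal(0, 8, n)

# Analysis of covariance as a regression: after ~ intercept + baseline + treat
X = np.column_stack([np.ones(n), baseline, treat])
coef, *_ = np.linalg.lstsq(X, after, rcond=None)
print(round(coef[2], 1))   # estimated treatment effect, close to -5
```

Under randomisation a change-score analysis would also be unbiased, but adjusting for the baseline as a covariate gives a smaller residual error and hence a more precise estimate.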

Extreme subgroup remeasured, population distribution and correlation known

The expected or mean Y for a subgroup chosen so that X > x0 is given by:

E(Y | X > x0) = μ + r ( E(X | X > x0) − μ )

where μ is the common mean of the two measurements and r is the correlation between them.

For an illustration consider the pulse data. We have r = 0.675 and mean for first pulse = 72.6 b/min. If we select subjects whose first pulse is greater than 90 b/min, what is their predicted mean second pulse measurement? The mean first measurement is 100.2 b/min.

Hence we predict 91.2 b/min. In fact, the mean second pulse for this group is 90.4 b/min.
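As a quick arithmetic check, plugging the numbers from the text into the prediction mean + r × (subgroup mean − mean):

```python
mu = 72.6                      # mean first pulse (b/min), from the text
r = 0.675                      # correlation between the two measurements
subgroup_mean_first = 100.2    # mean first pulse of those above 90 b/min

predicted_second = mu + r * (subgroup_mean_first - mu)
print(round(predicted_second, 1))   # 91.2, against an observed 90.4
```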

As the graph shows, this is very close to prediction:

To apply this we need the right r. This may reduce as the measurements in which we are interested get further apart in time. Gordon et al. (1976) gave the following correlations between baseline and subsequent measurements of blood pressure in the Framingham Study:

Given sufficient knowledge of our variable, we can clearly estimate the appropriate correlation for any period of follow-up.

Extreme subgroup measured for second variable whose population mean is unknown, correlation known

We can estimate the population mean, and hence the regression effect, provided the observations follow a Normal distribution. A method was given by Davis (1976) and extended by Chinn and Heller (1981). This method also enables us to deal with the effect on variables other than a remeasurement of the same thing. The method depends on the Normal assumption, but does not appear to be greatly sensitive to departures from it.

Correlation unknown

We can use the data we have to estimate the correlation, provided both variables follow a Normal distribution. These methods appear to be very sensitive to the Normal assumption. If we know the mean and SD of the X variable, or the cut-off proportion, we can use the method of James (1973). If we know the mean of the X variable, we can use the method of Mee and Chua (1991). If we know nothing, we can use the method given by Senn and Brown (1985). This is computationally more complex.

Change and initial value

This is a really nasty problem. Don’t do it; if it is really important, make duplicate baseline measurements. Hayes (1988) gives a review, Blomqvist (1977) gives a method, and Vollmer (1988) develops it. Simon Thompson has a Bayesian method.

Summary

Regression towards the mean is a frequently occurring phenomenon. We can estimate it in some cases and we can avoid it by design. It sets many traps for the unwary.

The most important thing is to be aware!

References

Bland JM, Altman DG. (1994) Regression towards the mean. British Medical Journal 308, 1499.

Bland JM, Altman DG. (1994b) Some examples of regression towards the mean. British Medical Journal 309, 780.

Vickers AJ, Altman DG. (2001) Analysing controlled trials with baseline and follow up measurements. British Medical Journal 323, 1123-1124.