Regression towards the mean
or
Why was Terminator III such a disappointment?

Martin Bland
Dept of Health Sciences
University of York

Talk first presented to the Commission for Health Improvement, January 2004.

This talk is based partly on work done with Doug Altman (Bland and Altman 1994, 1994b).

An example of regression towards the mean

The figure below comes from a trial of the efficacy and tolerability of borage oil in adults and children with atopic eczema (Takwale et al., 2003). This was a randomised, double blind, placebo controlled, parallel group trial where patients with eczema were recruited, treated, and followed over time to assess their symptoms. The graph shows an atopic dermatitis score, SASSAD, using six areas and six signs, throughout the trial (n=140; error bars show SE).

[Figure: Line graphs of symptom score against time for two groups, Borage and Placebo. Both lines fall slightly over time.]

Subjects were recruited because they had eczema which led them to seek medical advice, hence they had high symptom scores. As they are followed over time, their average score falls, whether they receive active treatment or not. This is a classic instance of regression towards the mean.

Outline

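In this talk I will try to answer the following questions:

What is regression towards the mean?
Where does the name come from?
Why is regression towards the mean such a problem?
What can we do about regression to the mean?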

I shall proceed mainly by illustration. I shall not go into the mathematics of regression towards the mean, but I will give references to several papers describing statistical techniques for estimating its effects in different circumstances.

What is regression towards the mean?

The graph below shows pulse rate measured by two different observers for 185 students. They are not the same; there is considerable variation.

[Figure: Scatter plot of two pulse rates, the first on the horizontal axis, the second on the vertical axis, showing a fairly strong relationship.]

The two pulses should be the same apart from measurement error. The next figure shows the line of equality, on which the points would all lie if the pulse measurements were the same. This represents the true or functional relationship between them, that they are identical apart from error. The diagram also shows lines through the means of the two variables. The means are the same and in fact the whole distribution is the same for each variable.

[Figure: Scatter plot of two pulse rates, with the line of equality at 45 degrees to the two axes and vertical and horizontal lines through the means, which cross on the line of equality.]

What is the mean second pulse measurement for students whose first pulse is 60 b/min? As few students have a measurement exactly 60, we can find the mean for those around 60, between 55 and 65, shown by the dark band on the next graph:

[Figure: Scatter plot of pulse rates, with line of equality and mean lines. Vertical line at first pulse = 60 beats per minute, shaded area from 55 to 65.]

We can see that the mean is going to be more than 60 b/min. The next figure shows the point marked:

[Figure: Scatter plot of pulse rates, line of equality, mean lines, line at first pulse = 60, shaded area 55 to 65. Diamond marks point of mean second pulse = 66 if first pulse 55 to 65.]

This mean, 66 b/min, is closer to the mean for the second measurement than 60 b/min is to the mean for the first measurement. We can do this for any value of the first pulse measurement. The next graph shows it for first measurements around 50, 60, 70, 80, etc. These means do not lie on the line of equality but on one which crosses it.

[Figure: Scatter plot of pulse rates, line of equality, diamonds mark mean second pulse for first = 50, 60, etc., to 110 and would lie on a rough straight line, less steep than line of equality.]

These means lie on the regression line, the regression of second pulse on first pulse. This is shown in the next graph:

[Figure: Scatter plot of pulse rates, line of equality, diamonds mark mean second pulse for first = 50, 60, etc., to 110 and lie close to a straight line, less steep than line of equality.]

It works in the same way if we start with the second measurement. What was the average first measurement for students whose second measurement was 60 b/min?

[Figure: Scatter plot of pulse rates, with line of equality and mean lines. Horizontal line at second pulse = 60 beats per minute, shaded area from 55 to 65.]

Again, it is closer to the mean.

[Figure: Scatter plot of pulse rates, line of equality, mean lines, line at second pulse = 60, shaded area 55 to 65. Diamond marks point of mean first pulse = 65 if second pulse 55 to 65.]

The mean first pulse for a given value of the second pulse lies on the other regression line, the regression of first pulse on second pulse.

[Figure: Scatter plot of pulse rates, line of equality, line at second pulse = 60, shaded area 55 to 65, diamond point (65, 60), regression of first on second is a line steeper than line of equality.]

There are two regression lines. Neither is the same as the line of equality, which represents the true, functional relationship.

[Figure: Scatter plot of pulse rates, line of equality, regression of second on first is a line less steep than line of equality, regression of first on second is a line steeper than line of equality.]
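The construction above can be imitated with simulated data. The sketch below does not use the pulse data. It assumes that each subject has a true value and that each of two measurements adds independent Normal error; the means, SDs, sample size, and seed are all illustrative assumptions.

```python
import random
import statistics

def simulate_pairs(n=20000, true_mean=72.0, true_sd=10.0, error_sd=7.0, seed=1):
    """Two measurements of the same true value, each with independent error.
    All parameter values are assumptions made for illustration."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        true_value = rng.gauss(true_mean, true_sd)
        first = true_value + rng.gauss(0.0, error_sd)
        second = true_value + rng.gauss(0.0, error_sd)
        pairs.append((first, second))
    return pairs

def least_squares_slope(x, y):
    """Slope of the least squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

pairs = simulate_pairs()
first = [f for f, _ in pairs]
second = [s for _, s in pairs]

# Mean second measurement for subjects whose first was around 60 (55 to 65).
mean_second_near_60 = statistics.fmean(s for f, s in pairs if 55 <= f < 65)

# Both regression lines, expressed as slopes on a plot with the first
# measurement on the horizontal axis and the second on the vertical axis.
slope_second_on_first = least_squares_slope(first, second)
slope_first_on_second = 1 / least_squares_slope(second, first)

print(round(mean_second_near_60, 1))
print(round(slope_second_on_first, 2), round(slope_first_on_second, 2))
```

Under these assumptions the mean second measurement for the band lies between 60 and the overall mean, one regression line is less steep than the line of equality, and the other is steeper.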

Where does the name come from?

The name “regression” comes from a paper by the Victorian geneticist and polymath Francis Galton (Galton 1886) entitled “Regression towards mediocrity in hereditary stature”. Galton set up a stand at the Great Exhibition, where he measured the heights of families attending. He adjusted the female heights by multiplying by 1.08. He then calculated the average height of the two parents, the “midheight”, and related it to the height of their adult children:

[Figure: Plot of child height against mid-parent height, points represented by circles of area proportional to the number of coincident points, regression line, correlation not strong (r=0.39).]

This plot is based on Galton’s original. The area of the circle represents the number of coincident points. The line is the regression of child height on midparent height. The means of both are the same, 68.2 inches.

Consider parents with midheight 70 inches. Their children had heights between 67 and 73 inches, and a mean height of 69.6 inches. The mean height of the subgroup of children was closer to the mean height of all children than the mean height of the subgroup of midparents was to the mean height of parents. Galton called this ‘regression towards mediocrity’.

The same thing happens if we start with the children. For example, for the children with height 70 inches, the mean height of their midparents is 67.9 inches. This is a statistical, not a genetic phenomenon.

Because the word “mediocrity” has acquired adverse connotations since Galton’s time, we now call this “regression towards the mean”.

Why is regression towards the mean such a problem?

Regression towards the mean can happen in several different types of study. The study of heredity is just one. Once one becomes aware of the regression effect it seems to be everywhere. The following are just a few examples; Andersen (1990) gives more.

Treatment to reduce high levels of a measurement

Subjects with extreme values of a measurement, such as high blood pressure, may be selected and treated to bring their values closer to the mean. If they are measured again, we will observe that the mean of the extreme group is now closer to the mean of the whole population, i.e. reduced. This is often interpreted as showing the effect of the treatment. However, even if subjects are not treated the mean blood pressure will go down, due to regression towards the mean.

In the following graph, A and B are selected.

[Figure: Plot of hypothetical before and after study, subjects A (100 on occasion 1, 90 on occasion 2), B (110 and 110), C (90 and 100), D (80 and 80).]

The average goes down, though the distribution is unchanged.
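The arithmetic behind the four-subject example can be checked directly. In this small sketch the values are those given in the figure description; the cut-off of 100 is an assumption, chosen only because it selects A and B.

```python
# Values from the figure description: (occasion 1, occasion 2).
subjects = {"A": (100, 90), "B": (110, 110), "C": (90, 100), "D": (80, 80)}

cutoff = 100  # assumed selection rule: first measurement of 100 or more
selected = [name for name, (first, _) in subjects.items() if first >= cutoff]

mean_first = sum(subjects[name][0] for name in selected) / len(selected)
mean_second = sum(subjects[name][1] for name in selected) / len(selected)

# The set of values is the same on both occasions; only who holds them changes.
same_distribution = (sorted(v[0] for v in subjects.values())
                     == sorted(v[1] for v in subjects.values()))

print(selected, mean_first, mean_second, same_distribution)
# → ['A', 'B'] 105.0 100.0 True
```

The selected pair averages 105 on the first occasion and 100 on the second, although the four values observed are identical on the two occasions.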

For example, in the Australian trial in mild hypertension (Reader et al. 1980), patients were selected if their average diastolic blood pressure (DBP) over four readings at two visits was between 95 and 110 mm Hg and their systolic blood pressure was below 200 mm Hg. In the placebo group, the mean DBP on screening was 100.4 mm Hg and the mean DBP during the trial was 93.9 mm Hg, a mean fall of 6.6 mm Hg without any treatment.

A non-medical example of the same thing is provided by a study of reoffending by ex-prisoners. A government minister was reported as claiming that prison sentences work, because following release from prison the next offence for which ex-prisoners were convicted tended to be for a less serious crime than that which had led to the prison sentence (Fletcher 1995). Because more serious crimes are more likely to be punished by prison sentences, ex-prisoners are a group selected because their last crime was at the extreme of the distribution. Hence the "average seriousness" of their next crimes will be lower.

Difference from baseline

In a trial we may measure the outcome variable before and after treatment. Researchers sometimes think that if they observe some imbalance between groups on the baseline measurement, they can allow for this by taking the difference "after minus baseline" as the outcome. This does not work. The following simulations show why:

[Figure: Difference from baseline simulation: after against before (baseline) and line of equality (r=0.54), difference (after minus before) against before (r = -0.23).]

Any imbalance will be reversed, due to regression to the mean. Subjects who tend to have low baseline measurements will tend to have high after minus baseline measurements (Vickers and Altman 2001).
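A simulation of this kind is easy to sketch. The one below is not the simulation shown in the figure and will not reproduce its correlations. It assumes a stable level for each subject plus independent Normal error on each occasion, with no treatment effect at all; the variances and seed are illustrative.

```python
import random
import statistics

def pearson_r(x, y):
    """Product moment correlation coefficient between x and y."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
before, after = [], []
for _ in range(10000):
    true_value = rng.gauss(0.0, 1.0)                 # subject's stable level
    before.append(true_value + rng.gauss(0.0, 1.0))  # baseline, with error
    after.append(true_value + rng.gauss(0.0, 1.0))   # follow-up, no treatment

change = [a - b for a, b in zip(after, before)]

r_after_before = pearson_r(before, after)
r_change_before = pearson_r(before, change)
print(round(r_after_before, 2), round(r_change_before, 2))
```

The after measurement is positively correlated with baseline, but the change score is negatively correlated with it, so a group which starts low by chance will tend to show the larger rise.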

Strength of relationship between two variables

Suppose we wish to look at the relationship between two variables where the predictor variable is measured with error. For example, we might compare the coronary heart disease mortality in three groups of men categorized by their level of serum cholesterol. Some of the members of the low cholesterol group would be measured when they were lower than their personal mean serum cholesterol and so put in the ‘low’ group, but on subsequent measurement they would not fall into this group.

The mean observed serum cholesterol in the ‘low’ group would appear to rise when serum cholesterol was measured at a subsequent occasion, and, similarly, the mean serum cholesterol of the high group would appear to fall.

The difference in CHD mortality between the ‘low’ and the ‘high’ groups would thus be the difference in mortality between groups whose true difference in mean serum cholesterol was less than the apparent difference. The change in mortality per unit of serum cholesterol would be under-estimated.

An example was given by Gardner and Heady (1973). They looked at the relationship between ischaemic heart disease (IHD) over ten years and blood pressure at the start of the period. They compared the relationship of IHD with a single measurement of blood pressure and with the mean of six measurements. The mean of six will be closer to the true value of the subject’s long term average blood pressure than will a single measurement. The relationship is therefore stronger between IHD and the mean of six BP readings than between IHD and a single BP reading.

[Figure: Line graphs: ten year ischaemic heart disease risk against systolic blood pressure for the first SBP reading and for the mean of six SBP readings. The line for the mean of six is steeper.]
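The weakening of a relationship by measurement error in the predictor can be illustrated by simulation. The sketch below is hypothetical throughout: it does not use the Gardner and Heady data, the outcome is an invented continuous score rather than IHD risk, and the slope, SDs, and seed are assumptions.

```python
import random
import statistics

def least_squares_slope(x, y):
    """Slope of the least squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
true_slope = 0.05   # hypothetical effect of true blood pressure on the outcome
n_subjects = 20000

def reading(true_value):
    """One blood pressure reading: the true value plus measurement error."""
    return true_value + rng.gauss(0.0, 12.0)

single, mean_of_six, outcome = [], [], []
for _ in range(n_subjects):
    true_bp = rng.gauss(130.0, 15.0)
    single.append(reading(true_bp))
    mean_of_six.append(statistics.fmean(reading(true_bp) for _ in range(6)))
    outcome.append(true_slope * true_bp + rng.gauss(0.0, 1.0))

slope_single = least_squares_slope(single, outcome)
slope_mean_of_six = least_squares_slope(mean_of_six, outcome)
print(round(slope_single, 3), round(slope_mean_of_six, 3))
```

Both estimated slopes fall short of the slope used to generate the data, but the one based on the mean of six readings is much closer to it.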

Relating change to initial value

We may be interested in the relationship between the initial value of a measurement and the change in that quantity over time. In anti-hypertensive drug trials, for example, it may be postulated that the drug’s effectiveness would be different (usually greater) for patients with more severe hypertension. Regression towards the mean will be greater for the patients with the highest initial blood pressures, so that we would expect to observe the postulated effect even in untreated patients.

In the Australian trial in mild hypertension (Reader et al. 1980), the falls in diastolic blood pressure (DBP) in the placebo group were as follows:

Screening DBP     Mean screening    Mean DBP on        Mean fall in
group (mm Hg)     DBP (mm Hg)       placebo (mm Hg)    DBP (mm Hg)
95-99              97.0             92.1               5.0
100-104           101.9             94.5               7.4
105-109           106.7             97.5               9.2

The higher the screening DBP in these untreated patients, the greater the fall. This is unlikely to be due to the effectiveness of the placebo.

Here is a corresponding table from the pulse data:

First pulse       Mean first        Mean second        Mean fall in
group (b/min)     pulse (b/min)     pulse (b/min)      pulse (b/min)
70-79             73.6              74.8               -1.2
80-89             83.1              78.5                4.6
90-129            99.5              89.8                9.7

The bigger the first pulse measurement, the greater is the mean fall to the second pulse measurement.

Assessing the appropriateness of clinical decisions

Clinical decisions are sometimes assessed by asking a review panel to read case notes and decide whether they agree with the decision made. Because agreement between observers is seldom perfect, the panel is sure to conclude that some decisions are ‘wrong’.

For example, Barrett et al. (1990) reviewed cases of women who had had Caesarian section because of fetal distress. Five observers reviewed case notes and decided whether they thought the Caesarian had been appropriate or not appropriate. They were not unanimous in their judgements. The percentage agreement between pairs of observers in the panel varied from 60% to 82.5%. The expected agreement if they made their decisions by tossing a coin would be 50%, so this is not particularly good agreement. They judged a Caesarian ‘appropriate’ if at least four of the five observers thought a Caesarian should have been done. They concluded that 30% of all Caesarians for fetal distress were unnecessary.

As Esmail and Bland (1990) pointed out, as only women who had undergone Caesarian were reviewed, it was inevitable that another observer would conclude that some Caesarians had been inappropriate. Given the poor agreement between the judges, this number was bound to be quite high.
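A rough calculation shows how large this number can be. The sketch below is a deliberately simplified model, not an analysis of the Barrett et al. data: it assumes that every decision was equally defensible and that each of five reviewers independently endorses it with the same probability `p_agree`. Both the independence and the values tried are assumptions.

```python
from math import comb

def prob_judged_appropriate(p_agree, panel_size=5, votes_needed=4):
    """Probability that at least `votes_needed` of the panel endorse a decision,
    if each member independently endorses it with probability `p_agree`."""
    return sum(
        comb(panel_size, k) * p_agree ** k * (1 - p_agree) ** (panel_size - k)
        for k in range(votes_needed, panel_size + 1)
    )

for p_agree in (0.7, 0.8, 0.9):
    judged_inappropriate = 1 - prob_judged_appropriate(p_agree)
    print(p_agree, round(judged_inappropriate, 3))
```

Under this model, reviewers who each endorse a decision 80% of the time would still fail to reach four votes out of five for about 26% of decisions.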

Comparison of two methods of measurement

When comparing two methods of measuring the same quantity, researchers are sometimes tempted to regress one method on the other. The fallacious argument is that if the methods agree the slope should be one. Because of the regression towards the mean effect, we expect the slope to be less than one even if the two methods agree closely.
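A short derivation shows why the slope falls below one. It is a sketch under simple assumptions which are not stated in the text: each method records the true value T with its own independent error, X = T + e_X and Y = T + e_Y, and the errors are unrelated to T. The slope of the regression of Y on X is then:

```latex
\[
  \beta_{Y \mid X}
  = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_{e_X}^2}
  < 1
  \quad \text{whenever } \sigma_{e_X}^2 > 0 .
\]
```

Here σ_T² is the variance of the true values and σ_{e_X}² is the error variance of the method used as predictor. The slope reaches one only if the predictor is measured without error, however well the two methods agree.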

For example, two studies of the validity of self-reported weight (Schlichting et al. 1981, Kuskowska-Wolk et al. 1989) used the same design. Self reported weight was obtained from a group of subjects, and the subjects were then weighed. Regression analysis was done with reported weight as the outcome variable and measured weight as the predictor variable. The regression slope was less than one in each study. According to the regression equation, the mean reported weight of heavy subjects was less than their mean measured weight, and the mean reported weight of light subjects was greater than their mean measured weight. Both Schlichting et al. (1981) and Kuskowska-Wolk et al. (1989) interpreted this as follows: those who are overweight tend to report weights below their true value, and those who are excessively thin tend to report greater weights than they really have! But, of course, we expect the slope to be less than 1.0 when the distributions are the same and the true functional relationship is equality.

Recent examples of the misunderstanding of regression in the study of agreement between different methods of measurement can be found in the talk “Applying the Right Statistics: Analyses of Measurement Studies” available on this website and published in an expanded version by Bland and Altman (2003).

Two phase sampling

Two phase sampling is a procedure for studying small subgroups of a population. We take a large sample of the population (the first phase) and find out which are the members of the subgroups in which we are interested. We then take a sample from these subgroups for more detailed study (the second phase).

7000 children were asked about their cigarette smoking to obtain a sample of child smokers and non-smokers. The second phase sample was: all children reporting smoking one or more cigarettes a day, and random samples of those smoking one or more a week, less than one a week, and never having smoked. Four months later the children in the second phase sample were given a detailed questionnaire, including a further question on the amount smoked (Bewley et al., 1974). The following was obtained:

                      First occasion
Second occasion    >1/day    >1/week    occasional    never    Total
>1/day               15         5           2           0        22
>1/week              12        25          11           1        49
occasional            6        32          75          25       138
never                 0         2          10          72        84
Total                33        64          98          98       293

If we analyse the data by the second occasion smoking habits, none of the groups is really representative of children from the whole population who gave these replies. We analysed them by their answers on the first questionnaire. This was my first publication and, though I didn’t realise it, it involved regression towards the mean.

Extremity in different variables

We take an extreme group defined by a variable. Should they be equally extreme on other variables?

In an example from educational research, children were defined to be "gifted" if their IQ exceeded a cut-off. School attainment was measured on other scales. The mean attainment of the gifted children was fewer SDs above the population mean than was mean IQ for this group. This was interpreted as meaning that schools were failing "gifted" children. In fact, unless IQ and attainment are perfectly correlated, regression towards the mean makes this result inevitable.

Publication bias

Referees for papers submitted for publication do not always agree as to which papers should be accepted. Because referees’ judgements of the quality of papers are made with error, they cannot be perfectly correlated with any measure of the true quality of the paper.

Thus when an editor accepts the ‘best’ papers for publication, the average quality of these will be less than the editor thinks, and the average quality of those rejected will be higher than the editor thinks (Rousseeuw 1991).

So the next time the Lancet turns you down and you see much weaker papers being published, be consoled that this is one more example of regression towards the mean.
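The editor's predicament can be sketched by simulation. The model below is an assumption made for illustration: each paper has a true quality, the referees' judgement is that quality plus independent Normal error of the same size, and the editor accepts the top 10% as judged. None of these figures comes from Rousseeuw (1991).

```python
import random
import statistics

rng = random.Random(4)
papers = []
for _ in range(10000):
    true_quality = rng.gauss(0.0, 1.0)
    judged_quality = true_quality + rng.gauss(0.0, 1.0)  # judgement with error
    papers.append((judged_quality, true_quality))

papers.sort(reverse=True)        # best judged papers first
n_accept = len(papers) // 10     # assumed acceptance rate of 10%
accepted, rejected = papers[:n_accept], papers[n_accept:]

def mean_judged(group):
    """Mean quality as it appeared to the referees."""
    return statistics.fmean(judged for judged, _ in group)

def mean_true(group):
    """Mean true quality, which the editor never sees."""
    return statistics.fmean(true for _, true in group)

print(round(mean_judged(accepted), 2), round(mean_true(accepted), 2))
print(round(mean_judged(rejected), 2), round(mean_true(rejected), 2))
```

The accepted papers are truly better than the rejected ones on average, but by much less than the judgements suggest: the accepted papers are worse than they looked and the rejected ones better.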

Hollywood Sequels

A sequel to a Hollywood movie is only made if the original film is a success, which we might take as indicating that it is of high ‘quality’, commercial quality at any rate. (Art house films rarely get sequels, though they may, in critical terms, be of high quality.) The average ‘quality’ of sequels will be closer to the mean than the average ‘quality’ of originals which have sequels, due to regression towards the mean. They will thus tend to be of lower ‘quality’ than the original.

This does not mean that they are bad films. They may still have higher ‘quality’ than the average of all films. Also, this rule applies to the group as a whole, not to every member of it. A sequel is not necessarily of lower ‘quality’ than its original. It is the average ‘quality’ which is lower. However, the majority will be of lower quality than the original and so such films tend to be a disappointment, and the further from the original they get the greater the disappointment will be. "Terminator III" was a good example.

The ‘Curse of Hello’ and the ‘Sports Illustrated Jinx’

People who appear on the covers of these magazines often have bad things happen to them afterwards. Film stars flop, sportsmen lose. But you only get on these covers if you have recently been unusually successful. Regression towards the mean predicts that on average they will be less successful afterwards.

What happens to the teams of Premier League Managers of the Month? They often lose in the next month!

What can we do about regression to the mean?

We can tackle the problems caused by regression to the mean at both the design and the analysis stages.

Design

Control groups

We can avoid the problems caused by regression towards the mean in “before and after” studies by the use of a control group. The randomised trial with concurrent controls obviates problems caused by regression to the mean.

For example, in the Australian trial in mild hypertension (Reader et al. 1980), the falls in diastolic blood pressure (DBP) in the treated and the placebo groups were as follows:

Treatment    Mean screening    Mean DBP on           Mean fall in
group        DBP (mm Hg)       follow-up (mm Hg)     DBP (mm Hg)
Active       100.5             88.3                  12.2
Placebo      100.4             93.9                   6.6

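Although DBP fell in both treated and placebo groups, it fell by more in the treated group, showing the effect of the active treatment. The difference between the two falls, 12.2 - 6.6 = 5.6 mm Hg, estimates the effect of treatment over and above regression towards the mean.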

Duplicate baseline measurements

If we are concerned about problems arising because we want to use a baseline measurement, for example in the strength of relationship in epidemiological follow-up studies or in the study of the relationship between change and initial value, we can make a duplicate baseline measurement. We use one baseline measurement to select subjects or to calculate changes from, and the other as the predictor variable in the analysis.

Duplicate baseline measurements are best collected on a different occasion from the one used to group subjects. This is because the correlation between measurements on different occasions will be less than between measurements made on the same occasion.
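The benefit of a duplicate baseline can be illustrated by simulation. In the sketch below, all values are hypothetical: each subject has a stable true level, three measurements are made on separate occasions with independent Normal error, and no treatment is given. The first measurement is used only to select subjects.

```python
import random
import statistics

rng = random.Random(5)
subjects = []
for _ in range(20000):
    true_value = rng.gauss(90.0, 8.0)
    screening = true_value + rng.gauss(0.0, 6.0)  # used only to select subjects
    baseline = true_value + rng.gauss(0.0, 6.0)   # duplicate, used in analysis
    follow_up = true_value + rng.gauss(0.0, 6.0)  # later, with no treatment
    subjects.append((screening, baseline, follow_up))

cutoff = 95.0  # assumed entry criterion
selected = [s for s in subjects if s[0] >= cutoff]

mean_screening = statistics.fmean(s[0] for s in selected)
mean_baseline = statistics.fmean(s[1] for s in selected)
mean_follow_up = statistics.fmean(s[2] for s in selected)

fall_from_screening = mean_screening - mean_follow_up
fall_from_baseline = mean_baseline - mean_follow_up
print(round(fall_from_screening, 1), round(fall_from_baseline, 1))
```

Measured from the screening value, the selected group appears to fall by several units with no treatment at all. Measured from the duplicate baseline, the fall is close to zero, because the error in the duplicate played no part in the selection.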

Analysis

We can avoid the regression towards the mean effect which can occur with baseline measurements by not taking differences from baseline. Analysis of covariance is greatly to be preferred (Vickers and Altman 2001).

We can also estimate the expected regression towards the mean effect when we select a subgroup. We can then compare this to the change we actually see. How we do this depends on the data available to us. I shall not give any details here, but I have given references to methods applicable in different circumstances.

The regression towards the mean effect is predicted by the following version of the regression equation:

$$Y = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (X - \mu_X)$$

where $\rho$ is the correlation between X and Y, $\mu_X$ and $\mu_Y$ are their means, and $\sigma_X$ and $\sigma_Y$ are their standard deviations. How we use this depends on what data we have and how reliably we can estimate the elements of the equation.
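A special case may make the equation more concrete. Suppose, as an assumption, that X and Y are two measurements of the same quantity with a common mean μ and a common standard deviation, which is roughly the situation of the pulse data. The ratio of standard deviations is then one, and writing E(Y | X) for the predicted Y the equation reduces to:

```latex
\[
  E(Y \mid X) = \mu + \rho\,(X - \mu),
  \qquad
  X - E(Y \mid X) = (1 - \rho)\,(X - \mu).
\]
```

A subject observed at a distance X − μ from the mean is expected, on remeasurement, to lie only ρ times as far from it. The expected regression towards the mean is the remaining fraction, 1 − ρ, of that distance.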

Adjusting for baseline measurements in clinical trials

Do not use differences from baseline. Use analysis of covariance instead, with the baseline measurement as covariate. This also has the advantage that we do not include the measurement error twice in the residual error used in t tests, regression, etc., which happens when we take differences from baseline (Vickers and Altman 2001).
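For a two group trial, analysis of covariance with a common slope can be written out by hand, which may help to show what the adjustment does. The sketch below is an illustration on simulated data, not the analysis of any trial mentioned here; the true effect, SDs, sample sizes, and seed are assumptions. The adjusted effect is the difference in outcome means minus the pooled within-group slope times the difference in baseline means.

```python
import random
import statistics

def within_group_sums(x, y):
    """Means of x and y, with sums of squares and products about them."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return mean_x, mean_y, sxx, sxy

def ancova_effect(base_treated, out_treated, base_control, out_control):
    """Treatment effect adjusted for baseline, assuming a common slope."""
    mx_t, my_t, sxx_t, sxy_t = within_group_sums(base_treated, out_treated)
    mx_c, my_c, sxx_c, sxy_c = within_group_sums(base_control, out_control)
    pooled_slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    return (my_t - my_c) - pooled_slope * (mx_t - mx_c)

def simulate_group(rng, n, effect):
    """Baseline and outcome for one trial arm; `effect` shifts the outcome."""
    baseline, outcome = [], []
    for _ in range(n):
        true_value = rng.gauss(100.0, 10.0)
        baseline.append(true_value + rng.gauss(0.0, 7.0))
        outcome.append(true_value + effect + rng.gauss(0.0, 7.0))
    return baseline, outcome

rng = random.Random(6)
true_effect = -5.0  # hypothetical treatment effect
base_t, out_t = simulate_group(rng, 5000, true_effect)
base_c, out_c = simulate_group(rng, 5000, 0.0)

estimate = ancova_effect(base_t, out_t, base_c, out_c)
print(round(estimate, 1))
```

Because the pooled slope is less than one, any chance difference in baseline is only partly subtracted, which is what regression towards the mean requires. Taking differences from baseline amounts to forcing the slope to be exactly one.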

Extreme subgroup remeasured, population distribution and correlation known

We select subjects with high initial measurements and measure them again later. To use this method we need data on the distribution of the variable in the whole population from which the subgroup is drawn and the correlation between successive measurements, r. It is described by Shepard and Finison (1983).

The expected or mean Y for a subgroup chosen so that X>x is given by:

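$$\bar{x}_{\text{pop}} + r (\bar{x}_{\text{sub}} - \bar{x}_{\text{pop}})$$

where $\bar{x}_{\text{pop}}$ is the population mean and $\bar{x}_{\text{sub}}$ is the mean first measurement in the selected subgroup.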

For an illustration consider the pulse data. We have r = 0.675 and mean for first pulse = 72.6 b/min. If we select subjects whose first pulse is greater than 90 b/min, what is their predicted mean second pulse measurement? The mean first measurement is 100.2 b/min.

$$\bar{x}_{\text{pop}} + r (\bar{x}_{\text{sub}} - \bar{x}_{\text{pop}}) = 72.6 + 0.675 \times (100.2 - 72.6) = 91.2$$

Hence we predict 91.2 b/min. In fact, the mean second pulse for this group is 90.4 b/min.

As the graph shows, this is very close to prediction:

[Figure: Scatter plot of pulse rates, line of equality, shaded area to right of 90, point at first pulse = 100.2, second = 90.2, horizontal line at 91.2.]
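The calculation for the pulse data can be wrapped in a small function for reuse. This is a minimal sketch: the function name and arguments are illustrative rather than taken from Shepard and Finison (1983), and it assumes the same population mean and SD on both occasions.

```python
def predicted_subgroup_mean(pop_mean, subgroup_mean_first, r):
    """Expected mean on remeasurement for a subgroup selected on the first
    measurement, assuming the same population mean and SD on both occasions."""
    return pop_mean + r * (subgroup_mean_first - pop_mean)

# Figures quoted in the text for students with first pulse above 90 b/min.
predicted = predicted_subgroup_mean(pop_mean=72.6, subgroup_mean_first=100.2, r=0.675)
expected_fall = 100.2 - predicted

print(round(predicted, 1), round(expected_fall, 1))
# → 91.2 9.0
```

The expected fall of about 9 b/min arises from regression towards the mean alone, and is the yardstick against which an observed fall in such a subgroup should be judged.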

To apply this we need the right r. This may reduce as the measurements in which we are interested get further apart in time. Gordon et al. (1976) gave the following correlations between baseline and subsequent measurements of blood pressure in the Framingham Study:

[Figure: Scatter graph showing correlation coefficient (logarithmic scale) against year, points very close to the regression line, which slopes downwards.]

Given sufficient knowledge of our variable, we can clearly estimate the appropriate correlation for any period of follow-up.

Extreme subgroup measured for second variable whose population mean is unknown, correlation known

We can estimate the population mean and hence the regression effect provided the observations follow a Normal distribution. A method was given by Davis (1976) and extended by Chinn and Heller (1981). This method also enables us to deal with the effect on variables other than a remeasurement of the same thing. The method depends on the Normal assumption, but does not appear to be greatly sensitive to departures from it.

Correlation unknown

We can use the data we have to estimate the correlation, provided both variables follow a Normal distribution. These methods appear to be very sensitive to the Normal assumption. If we know the mean and SD of the X variable, or the cut-off proportion, we can use the method of James (1973). If we know the mean of the X variable, we can use the method of Mee and Tin (1991). If we know nothing, we can use the method given by Senn and Brown (1985). This is computationally more complex.

Change and initial value

This is a really nasty problem. Don’t do it, or, if it is really important, make duplicate baseline measurements. Hayes (1988) gives a review, Blomqvist (1977) gives a method, and Vollmer (1988) develops it. Simon Thompson has a Bayesian method.

Summary

Regression towards the mean is a frequently occurring phenomenon. We can estimate it in some cases and we can avoid it by design. It sets many traps for the unwary.

The most important thing is to be aware!

References

Andersen B. (1990) Methodological errors in medical research: an incomplete catalogue. Oxford, Blackwell.

Barrett JFR, Jarvis GJ, Macdonald HN, Buchan PC, Tyrrell SN, and Lilford RJ. (1990) Inconsistencies in clinical decisions in obstetrics. Lancet 336, 549-551.

Bewley BR, Bland JM, Harris R. (1974) Factors associated with the starting of cigarette smoking by primary school children. British Journal of Preventive and Social Medicine 28, 37-44.

Bland JM, Altman DG. (1994) Regression towards the mean. British Medical Journal 308, 1499.

Bland JM, Altman DG. (1994b) Some examples of regression towards the mean. British Medical Journal 309, 780.

Bland JM and Altman DG. (2003) Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology 22, 85-93.

Blomqvist N. (1977) On the relation between change and initial value. Journal of the American Statistical Association 72, 746-749.

Chinn S and Heller RF. (1981) Some further results concerning regression to the mean. American Journal of Epidemiology 114, 902-905.

Davis CE. (1976) The effect of regression to the mean in epidemiologic and clinical studies. American Journal of Epidemiology 104, 493-498.

Esmail A and Bland M. (1990) Caesarian section for fetal distress. Lancet 336, 819.

Fletcher H. (1995) Ways to reduce the risk of further crime. The Guardian, London, 14 February, page 19.

Galton F. (1886) Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute 15, 246-263.

Gardner MJ and Heady JA. (1973) Some effects of within-person variability in epidemiological studies. Journal of Chronic Disease 26, 781-795.

Gordon T, Sorlie P, Kannel WB. (1976) Problems in the assessment of blood pressure: The Framingham Study. International Journal of Epidemiology 5, 327-334.

Hayes RJ. (1988) Methods for assessing whether change depends on initial value. Statistics in Medicine 7, 915-927.

James KE. (1973) Regression toward the mean in uncontrolled clinical studies. Biometrics 29, 121-130.

Kuskowska-Wolk A, Karlsson P, Stolt M, and Rossner S. (1989) The predictive value of body mass index based on reported weight and height. International Journal of Obesity 13, 441-43.

Mee RW and Tin CC. (1991) Regression towards the mean and the paired sample t test. American Statistician 45, 39-42.

Reader R et al. (1980) The Australian trial in mild hypertension: report by the management committee. Lancet i, 1261-7.

Rousseeuw PJ. (1991) Why the wrong papers get published. Chance 4, 41-43.

Schlichting P, Hoilund-Carlsen PF, and Quaade F. (1981) Comparison of self reported height and weight with controlled height and weight in women and men. International Journal of Obesity 5, 67-76.

Senn SJ and Brown RA. (1985) Estimating treatment effects in clinical trials subject to regression to the mean. Biometrics 41, 555-560.

Shepard DS and Finison LJ. (1983) Blood pressure reductions: correcting for regression towards the mean. Preventive Medicine 12, 304-317.

Takwale A, Tan E, Agarwal S, Barclay G, Ahmed I, Hotchkiss K, Thompson JR, Chapman T, Berth-Jones J. (2003) Efficacy and tolerability of borage oil in adults and children with atopic eczema: randomised, double blind, placebo controlled, parallel group trial. British Medical Journal 327, 1385-8.

Vickers AJ and Altman DG. (2001) Analysing controlled trials with baseline and follow up measurements. British Medical Journal 323, 1123-1124.

Vollmer WM. (1988) Comparing change in longitudinal studies: adjusting for initial value. Journal of Clinical Epidemiology 41, 651-657.

This page is maintained by Martin Bland.
Last updated: 28 February, 2013.