- The correlation coefficient
- Test of significance and confidence interval for *r*
- Regression analyses
- Confidence intervals and P values in regression
- Testing the assumptions of regression
- Dichotomous predictor variables
- References

Correlation coefficients are used to measure the strength of the relationship or association between two quantitative variables. For example, Table 1 shows height, muscle strength and age in 41 alcoholic men.

Height (cm) | Quadriceps muscle strength (N) | Age (years) | Height (cm) | Quadriceps muscle strength (N) | Age (years)
---|---|---|---|---|---
155 | 196 | 55 | 172 | 147 | 32
159 | 196 | 62 | 173 | 441 | 39
159 | 216 | 53 | 173 | 343 | 28
160 | 392 | 32 | 173 | 441 | 40
160 | 98 | 58 | 173 | 294 | 53
161 | 387 | 39 | 175 | 304 | 27
162 | 270 | 47 | 175 | 404 | 28
162 | 216 | 61 | 175 | 402 | 34
166 | 466 | 24 | 175 | 392 | 53
167 | 294 | 50 | 175 | 196 | 37
167 | 491 | 35 | 176 | 368 | 51
168 | 137 | 65 | 177 | 441 | 49
168 | 343 | 41 | 177 | 368 | 48
168 | 74 | 65 | 177 | 412 | 32
170 | 304 | 55 | 178 | 392 | 49
171 | 294 | 47 | 178 | 540 | 41
172 | 294 | 31 | 178 | 417 | 42
172 | 343 | 38 | 178 | 324 | 55
172 | 147 | 31 | 179 | 270 | 32
172 | 319 | 39 | 180 | 368 | 34
172 | 466 | 53 | | |

We will begin with the relationship between height and strength. Figure 1 shows a plot of strength against height.

**Figure 1. Scatter diagram showing muscle strength and height
for 41 male alcoholics**


This is a scatter diagram. Each point represents one subject. If we look at Figure 1, it is fairly easy to see that taller men tend to be stronger than shorter men or, looking at it the other way round, that stronger men tend to be taller than weaker men. It is only a tendency: the tallest man is not the strongest, nor is the shortest man the weakest. Correlation enables us to measure how close this association is.

The correlation coefficient is based on the products of differences from the mean
of the two variables.
That is, for each observation we subtract the mean, just as when calculating a
standard deviation.
We then multiply the deviations from the mean for the two variables
for a subject together, and add them.
We call this the **sum of products about the mean**.
It is very like the sum of squares about the mean used for measuring variability.

To see how correlation works, we can draw two lines on the scatter diagram, a horizontal line through the mean strength and a vertical line through the mean height, as shown in Figure 2.

**Figure 2. Scatter diagram showing muscle strength and height for 41 male alcoholics,
with lines through the mean height and mean strength**


Because large heights tend to go with large strength and small heights with small strength, there are more observations in the top right quadrant and the bottom left quadrant than there are in the top left and bottom right quadrants.

In the top right quadrant, the deviations from the mean will be positive for both variables, because each is larger than its mean. If we multiply these together, the products will be positive. In the bottom left quadrant, the deviations from the mean will be negative for both variables, because each is smaller than its mean. If we multiply these two negative numbers together, the products will also be positive.

In the top left quadrant, the deviations from the mean will be negative for height, because the heights are all less than the mean, and positive for strength, because strength is greater than its mean. The product of a negative and a positive number will be negative, so all these products will be negative. In the bottom right quadrant, the deviations from the mean will be positive for height, because the heights are all greater than the mean, and negative for strength, because the strengths are less than the mean. The product of a positive and a negative number will be negative, so all these products will be negative also.

When we add the products for all subjects, the sum will be positive, because there are more positive products than negative ones. Further, subjects with very large values for both height and strength, or very small values for both, will have large positive products. So the stronger the relationship is, the bigger the sum of products will be. If the sum of products is positive, we say that there is a positive correlation between the variables.

Figure 3 shows the relationship between strength and age in Table 1.

**Figure 3. Scatter diagram showing muscle strength and age for 41 male alcoholics**


Strength tends to be less for older men than for younger men. Figure 4 shows lines through the means, as in Figure 2.

**Figure 4. Scatter diagram showing muscle strength and age for 41 male alcoholics,
with lines through the mean**


Now there are more observations in the top left and bottom right quadrants, where products are negative, than in the top right and bottom left quadrants, where products are positive. The sum of products will be negative. When large values of one variable are associated with small values of the other, we say we have negative correlation.

The sum of products will depend on the number of observations and the units
in which they are measured.
We can show that the maximum possible value it can have is the
square root of the sum of squares for height multiplied by the
square root of the sum of squares for strength.
Hence we divide the sum of products by the square roots of the two sums of squares.
This gives the **correlation coefficient**, usually denoted by *r*.

Using the abbreviation ‘*r*’ looks very odd.
Why ‘*r*’ and not ‘*c*’ for correlation?
This is for historical reasons and it is so ingrained in statistical practice
that we are stuck with it.
If you see an unexplained ‘*r* =’ in a paper, it means the correlation coefficient.
Originally, ‘*r*’ stood for ‘regression’.

Because of the way *r* is calculated, its maximum value = 1.00 and its
minimum value = –1.00.
We shall look at what these mean later.

The correlation coefficient is also known as **Pearson’s correlation coefficient**
and the **product moment correlation coefficient**.
There are other correlation coefficients as well, such as Spearman’s and Kendall’s,
but if it is described simply as ‘the correlation coefficient’ or just ‘the correlation’,
the one based on the sum of products about the mean is the one intended.

For the example of muscle strength and height in 41 alcoholic men, *r* = 0.42.
This is a positive correlation of fairly low strength.
For strength and age, *r* = –0.42.
This is a negative correlation of fairly low strength.
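The calculation described above can be reproduced directly from Table 1. The following Python sketch computes *r* for strength and height as the sum of products about the mean divided by the square roots of the two sums of squares (the data are transcribed from Table 1):

```python
from math import sqrt

# Height (cm) and quadriceps strength (N) for the 41 men in Table 1.
height = [155, 159, 159, 160, 160, 161, 162, 162, 166, 167, 167, 168, 168,
          168, 170, 171, 172, 172, 172, 172, 172, 172, 173, 173, 173, 173,
          175, 175, 175, 175, 175, 176, 177, 177, 177, 178, 178, 178, 178,
          179, 180]
strength = [196, 196, 216, 392, 98, 387, 270, 216, 466, 294, 491, 137, 343,
            74, 304, 294, 294, 343, 147, 319, 466, 147, 441, 343, 441, 294,
            304, 404, 402, 392, 196, 368, 441, 368, 412, 392, 540, 417, 324,
            270, 368]

def correlation(x, y):
    """Pearson's r: the sum of products about the mean divided by the
    square roots of the two sums of squares about the mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sum_products / sqrt(ss_x * ss_y)

r = correlation(height, strength)
print(round(r, 2))  # a positive correlation of fairly low strength
```

Dividing by the square roots of the two sums of squares is what guarantees that the result lies between –1.00 and +1.00, whatever the units of measurement.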

Figures 5 to 13 show the correlations between several simulated variables. Each pair of variables was generated to have the correlation shown above it. Figure 5 shows a perfect correlation.

**Figure 5. Scatter diagram showing simulated data from a population where the
two variables are exactly equal, population correlation coefficient = 1.00**


The points lie exactly on a straight line and we could calculate Y exactly from X.
In fact, Y = X; they could not be more closely related.
*r* = +1.00 when large values of one variable are associated with large values
of the other and the points lie exactly on a straight line.
Figure 6 shows a strong, but not perfect, positive relationship.

**Figure 6. Scatter diagram showing simulated data from a population where the
population correlation coefficient = 0.90**


Figure 7 also shows a positive relationship, but less strong.

**Figure 7. Scatter diagram showing simulated data from a population where the
population correlation coefficient = 0.50**


The size of the correlation coefficient clearly reflects the degree of closeness on the scatter diagram. The correlation coefficient is positive when large values of one variable are associated with large values of the other.

Figure 8 shows what happens when there is no relationship at all,
*r* = 0.00.

**Figure 8. Scatter diagram showing simulated data from a population where there
is no relationship between the variables and the population
correlation coefficient is zero**


This is not the only way *r* can be equal to zero, however.
Figure 9 shows data where there is a relationship,
because large values of Y are associated with small values of X and with
large values of X, whereas small values of Y are associated with values of X
in the middle of the range.

**Figure 9. Scatter diagram showing simulated data from a population where there
is a strong relationship between the variables and yet
the population correlation coefficient is zero**


The products about the mean will be positive in the top right and bottom left quadrants
and negative in the top left and bottom right quadrants; these balance, giving a sum which is zero.
It is possible for *r* to be equal to 0.00 when there is a
relationship which is not linear.
A correlation *r* = 0.00 means that there is no linear relationship,
i.e. that there is no relationship where large values of one variable are
consistently associated either with large or with small values of the other, but not both.
Figure 10 shows another perfect relationship, but not a straight line.

**Figure 10. Scatter diagram showing simulated data from a population where there
is a perfect relationship between the variables and yet
the population correlation coefficient is less than one**


The correlation coefficient is less than 1.00.
*r* will not equal –1.00 or +1.00 when there is a perfect relationship
unless the points lie on a straight line.
Correlation measures closeness to a linear relationship, not to any perfect relationship.

The correlation coefficient is negative when large values of one variable are associated with small values of the other. Figure 11 shows a rather weak negative relationship.

**Figure 11. Scatter diagram showing simulated data from a population where there
is a weak negative relationship between the variables**


Figure 12 shows a strong negative relationship.

**Figure 12. Scatter diagram showing simulated data from a population where there
is a strong negative relationship between the variables**


Figure 13 shows a perfect negative relationship.

**Figure 13. Scatter diagram showing simulated data from a population where there
is a perfect linear negative relationship between the variables**


*r* = –1.00 when large values of one variable are associated
with small values of the other and the points lie on a straight line.

We can test the null hypothesis that the correlation coefficient in the population is zero.
This is done by a simple t test.
The distribution of *r* if the null hypothesis is true, i.e. in the absence of
any relationship in the population, depends only on the number of observations.
This is often described in terms of the degrees of freedom for the t test,
which is the number of observations minus 2.
Because of this, it is possible to tabulate the critical value for the test
for different sample sizes.
Bland (2000) gives a table.
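The test statistic itself is simple: under the null hypothesis, t = r√(n − 2)/√(1 − r²) follows a t distribution with n − 2 degrees of freedom. A Python sketch, taking r = 0.42 and n = 41 from the strength and height example above:

```python
from math import sqrt

# r = 0.42 and n = 41 come from the strength and height example above.
r, n = 0.42, 41

df = n - 2                              # degrees of freedom
t = r * sqrt(df) / sqrt(1 - r ** 2)     # test statistic

# The tabulated two-sided 5% critical value of t with 39 degrees of
# freedom is about 2.02, so this t is clearly significant at that level.
print(round(t, 2), df)  # 2.89 39
```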

For the test of significance to be valid, we must assume that:

- at least one of the variables is from a Normal distribution,
- the observations are independent.

Large deviations from the assumptions make the P value for this test very unreliable.

For the muscle strength and height data of Figure 1,
*r* = 0.42, P = 0.006.
Computer programs almost always print this when they calculate a correlation coefficient.
As a result you will rarely see a correlation coefficient reported without it,
even when the null hypothesis that the correlation in the population is
equal to zero is absurd.

We can find a confidence interval for the correlation coefficient in the population, too.
The distribution of the sample correlation coefficient when the null hypothesis
is not true, i.e. when there is a relationship, is very awkward.
It does not become approximately Normal until the sample size is in the thousands.
We use a very clever but rather intimidating mathematical function called
**Fisher’s z transformation**.
This produces a very close approximation to a Normal distribution with a
fairly simple expression for its mean and variance
(see Bland 2000 if you really want to know).
This can be used to calculate a 95% confidence interval on the transformed scale,
which can then be transformed back to the correlation coefficient scale.
For the strength and height data, *r* = 0.42 and the 95% confidence interval for the population correlation coefficient is 0.13 to 0.64.
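A sketch of the Fisher's *z* calculation in Python, assuming *r* = 0.42 and *n* = 41 as above (the transformation is z = arctanh r, with standard error approximately 1/√(n − 3)):

```python
from math import atanh, tanh, sqrt

r, n = 0.42, 41             # strength and height data from above

z = atanh(r)                # Fisher's z transformation of r
se = 1 / sqrt(n - 3)        # its approximate standard error
lo = tanh(z - 1.96 * se)    # back-transform the limits to the
hi = tanh(z + 1.96 * se)    # correlation coefficient scale
print(round(lo, 2), round(hi, 2))
```

Note that the interval is not symmetric about *r* = 0.42; back-transforming from the *z* scale pulls the limits towards zero unevenly.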

For Fisher’s *z* transformation to be valid, we must make a much
stronger assumption about the distributions than for the test of significance.
We must assume that both of the variables are from Normal distributions.
Large deviations from this assumption can make the confidence interval very unreliable.

The use of Fisher’s *z* is tricky without a computer, approximate, and
requires a strong assumption.
Computer programs rarely print this confidence interval and so you rarely see it,
which is a pity.

Regression is the rather strange name given to a set of methods for predicting one variable from another. The data shown in Table 2 come from a student project aimed at estimating body mass index (BMI) using only a tape measure.

Abdominal circumference (cm) | BMI (Kg/m^{2}) | Abdominal circumference (cm) | BMI (Kg/m^{2}) | Abdominal circumference (cm) | BMI (Kg/m^{2})
---|---|---|---|---|---
51.9 | 16.30 | 64.2 | 19.44 | 73.1 | 20.25
53.1 | 19.70 | 64.4 | 19.31 | 73.2 | 21.07
54.3 | 16.96 | 64.4 | 18.15 | 73.2 | 24.57
57.4 | 11.99 | 64.7 | 20.55 | 74.0 | 20.60
57.6 | 14.04 | 64.8 | 15.70 | 74.1 | 16.86
57.8 | 15.16 | 65.0 | 18.73 | 74.4 | 22.58
58.2 | 16.31 | 65.2 | 18.52 | 74.7 | 21.42
58.2 | 16.17 | 65.6 | 21.08 | 74.8 | 23.11
59.0 | 20.08 | 66.2 | 17.58 | 74.8 | 24.11
59.2 | 14.81 | 66.8 | 18.51 | 79.3 | 19.71
59.5 | 18.02 | 66.9 | 18.75 | 79.7 | 23.14
59.8 | 18.43 | 67.0 | 19.68 | 80.0 | 19.48
59.8 | 15.50 | 67.5 | 18.06 | 80.3 | 23.28
60.2 | 17.64 | 67.8 | 21.12 | 80.4 | 22.59
60.2 | 17.54 | 67.8 | 20.60 | 82.2 | 28.78
60.4 | 14.18 | 68.0 | 19.40 | 82.2 | 25.89
60.6 | 17.41 | 68.2 | 22.11 | 83.2 | 25.08
60.7 | 19.44 | 68.6 | 19.23 | 83.9 | 27.41
61.2 | 21.63 | 69.2 | 19.49 | 85.2 | 22.86
61.2 | 15.55 | 69.2 | 20.12 | 87.8 | 32.04
61.4 | 18.37 | 69.2 | 24.06 | 88.3 | 25.56
62.4 | 17.69 | 69.4 | 19.97 | 90.6 | 28.24
62.5 | 17.64 | 70.2 | 19.52 | 93.2 | 28.74
63.2 | 18.70 | 70.3 | 23.77 | 100.0 | 31.04
63.2 | 20.36 | 70.9 | 18.90 | 106.7 | 30.98
63.2 | 18.04 | 71.0 | 20.89 | 108.7 | 40.44
63.2 | 18.04 | 71.0 | 17.85 | |
63.4 | 17.22 | 71.2 | 21.02 | |
63.8 | 18.47 | 72.2 | 19.87 | |
64.2 | 17.09 | 72.8 | 23.51 | |

In the full data, analysed later, we have abdominal circumference, mid upper arm circumference, and sex as possible predictors. We shall start with the female subjects only and will look at abdominal circumference.

BMI, also known as Quetelet's index, is a measure of fatness defined for adults as weight in Kg divided by height in metres squared. Can we predict BMI from abdominal circumference? Figure 14 shows a scatter plot of BMI against abdominal circumference and there is clearly a strong relationship between them.

**Figure 14. Scatter plot of BMI against abdominal circumference**

We could try to draw a line on the scatter diagram which would represent the relationship between them and enable us to predict one from the other. We could draw many lines which might do this, as shown in Figure 15, but which line should we choose?

The method which we use to do this is simple linear regression. This is a method to predict the mean value of one variable from the observed value of another. In our example we shall estimate the mean BMI for women of any given abdominal circumference measurement.

We do not treat the two variables, BMI and abdominal circumference,
as being of equal importance, as we did for correlation coefficients.
We are predicting BMI from abdominal circumference and BMI is the **outcome**,
**dependent**, **y**, or **left hand side** variable.
Abdominal circumference is the **predictor**, **explanatory**, **independent**,
**x**, or **right hand side** variable.
Several different terms are used.
We predict the outcome variable from the observed value of the predictor variable.

The relationship we estimate is called linear, because it makes a straight line on the graph. A linear relationship takes the following form:

BMI = intercept + slope × abdominal circumference

The intercept and slope are numbers which we estimate from the data. Mathematically, this is the equation of a straight line. The intercept is the value of the outcome variable, BMI, when the predictor, abdominal circumference, is zero. The slope is the increase in the outcome variable associated with an increase of one unit in the predictor.

To find a line which gives the best prediction, we need some criterion for 'best'. The one we use is to choose the line which makes the distances from the points to the line in the y direction as small as possible. These distances are the differences between the observed BMI and the BMI predicted by the line. They are shown in Figure 16.

**Figure 16. Differences between the observed and predicted values
of the outcome variable**

If the line goes through the cloud of points, some of these differences will be
positive and some negative.
There are many lines which will make the sum zero, so we cannot just minimise
the sum of the differences.
As we did when estimating variation using the variance and standard deviation
(Week 1), we square the differences to get rid of the minus signs.
We choose the line which will minimise the sum of the squares of these differences.
We call this the **principle of least squares** and call the estimates that we obtain
the **least squares line or equation**.
We also call this estimation by **ordinary least squares** or **OLS**.

There are many computer programs which will estimate the least squares equation and for the data of Table 2 this is

BMI = –4.15 + 0.35 × abdominal circumference

This line is shown in Figure 17.

**Figure 17. The least squares regression line for BMI and abdominal circumference**

The estimate of the slope, 0.35, is also known as the **regression coefficient**.
Unlike the correlation coefficient, this is not a dimensionless number,
but has dimensions and units depending on those of the variables.
The regression coefficient is the increase in BMI per unit increase
in abdominal circumference, so is in kilogrammes per square metre per centimetre,
BMI being in Kg/m^{2} and abdominal circumference in cm.
If we change the units in which we measure, we will change the regression coefficient.
For example, if we measured abdominal circumference in metres,
the regression coefficient would be 35 Kg/m^{2}/m.
The intercept is in the same units as the outcome variable, here Kg/m^{2}.

In this example, the intercept is negative, which means that when abdominal circumference is zero the BMI is negative. This is impossible, of course, but so is zero abdominal circumference. We should be wary of attributing any meaning to an intercept which is outside the range of the data. It is just a convenience for drawing the best line within the range of data that we have.
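Within the range of the data, though, the fitted equation gives a usable prediction of mean BMI. A quick check in Python using the coefficients reported above:

```python
def predicted_bmi(abdomen_cm):
    """Mean BMI (Kg/m^2) predicted from abdominal circumference (cm),
    using the least squares equation reported above."""
    return -4.15 + 0.35 * abdomen_cm

# For a woman with an 80 cm abdominal circumference,
# within the range of the observed data:
print(round(predicted_bmi(80), 2))  # 23.85
```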

We can find confidence intervals and P values for the coefficients, subject to assumptions. These are that the deviations from the line should have a Normal distribution with uniform variance. (In addition, as usual, the observations should be independent.)

For the BMI data, the estimated slope = 0.35 Kg/m^{2}/cm,
with 95% CI = 0.31 to 0.40 Kg/m^{2}/cm, P<0.001.
The P value tests the null hypothesis that in the population from which these women come,
the slope is zero.
The estimated intercept = –4.15 Kg/m^{2}, 95% CI = –7.11 to –1.18 Kg/m^{2}.
Computer programs usually print a test of the null hypothesis that the intercept is zero,
but this is not much use.
The P value for the slope is exactly the same as that for the correlation coefficient.

For our confidence intervals and P values to be valid, the data must conform to the assumptions that deviations from line should have a Normal distribution with uniform variance. The observations must be independent, as usual. Finally, our model of the data is that the line is straight, not curved, and we can check how well the data match this.

We can check the assumptions about the deviations quite easily
using techniques similar to those used for t tests.
First we calculate the differences between the observed value of the outcome variable
and the value predicted by the regression, the regression estimate.
We call these the **deviations from the regression line**,
the **residuals about the line**, or just **residuals**.
These should have a Normal distribution and uniform variance,
that is, their variability should be unrelated to the value of the predictor.
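Residuals are straightforward to compute once the line has been fitted. A Python sketch with made-up roughly linear data (not the BMI data); a useful property to remember is that least squares residuals always sum to zero:

```python
# Made-up illustrative data, roughly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Fit by ordinary least squares.
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Residuals: observed outcome minus the regression estimate.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print([round(e, 3) for e in residuals])
```

These residuals are what we plot in the histogram, Normal plot, and residual-against-predictor plot described below.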

We can check both of these assumptions graphically. Figure 18 shows a histogram and a Normal plot for the residuals for the BMI data.

**Figure 18. Histogram and Normal plot for residuals for the BMI
and abdominal circumference data**

The distribution is a fairly good fit to the Normal. We can assess the uniformity of the variance by simple inspection of the scatter diagram in Figure 17. There is nothing to suggest that variability increases as abdominal circumference increases, for example. It appears quite uniform. A better plot is of residual against the predictor variable, as shown in Figure 19.

**Figure 19. Scatter plot of residual BMI against abdominal circumference**

Again, there is no relationship between variability and the predictor variable. If the relationship is actually a straight line, the plot of residual against predictor should show no relationship between mean residual and predictor. If there is such a relationship, usually with the residuals higher or lower at the extremes of the plot than in the middle, this suggests that a straight line is not a good way to look at the data. A curve might be better.

Table 3 shows 24 hour energy expenditure (MJ) in groups of lean and obese women.

Lean | Obese
---|---
6.13 | 8.79
7.05 | 9.19
7.48 | 9.21
7.48 | 9.68
7.53 | 9.69
7.58 | 9.97
7.90 | 11.51
8.08 | 11.85
8.09 | 12.79
8.11 |
8.40 |
10.15 |
10.88 |

In Week 4, we analysed these data using the two sample t method. We can also do this by regression. We define a variable = 1 if the woman is obese and = 0 if she is lean.

If we carry out regression:

energy = 8.07 + 2.23 × obese

slope: 95% CI = 1.05 to 3.42 MJ, P=0.0008.

Compare this with the two sample t method:

Difference (obese – lean) = 10.298 – 8.066 = 2.232.

95% CI = 1.05 to 3.42 MJ, P=0.0008.

The two methods give identical results. They are shown graphically in Figure 20.
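This equivalence is easy to verify numerically. A Python sketch using the Table 3 values, coding obese = 1 and lean = 0 and fitting by ordinary least squares as above:

```python
# 24 hour energy expenditure (MJ) from Table 3.
lean = [6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40,
        10.15, 10.88]
obese = [8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79]

# Dummy predictor: 0 for lean, 1 for obese.
x = [0] * len(lean) + [1] * len(obese)
y = lean + obese

# Ordinary least squares slope and intercept.
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# With a 0/1 predictor, the intercept is the lean group mean and the
# slope is the difference between the two group means.
print(round(intercept, 2), round(slope, 2))  # 8.07 2.23
```

The fitted line passes through the two group means, which is why the regression coefficient reproduces the two sample difference exactly.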

The assumptions of the two sample t method are that

- energy expenditure follows a Normal distribution in each population,
- variances are the same in each population.

The assumptions of regression are that

- differences between observed and predicted energy expenditure follow a Normal distribution,
- variances of differences are the same whatever the value of the predictor.

These are the same. The energy expenditure predicted for a group by the regression is equal to the mean of the group and if the differences from the group mean follow a Normal distribution, so do the residuals about the regression line, which goes through the two means. If variances are the same for the two values of the predictor, then they are the same in the two groups.

Altman DG. (1991) *Practical Statistics for Medical Research.*
Chapman and Hall, London.

Bland M. (2000) *An Introduction to Medical Statistics*. Oxford University Press.

Hickish T, Colston K, Bland JM, Maxwell JD. (1989)
Vitamin D deficiency and muscle strength in male alcoholics.
*Clinical Science* **77**, 171-176.

Prentice AM, Black AE, Coward WA, Davies HL, Goldberg GR, Murgatroyd PR,
Ashford J, Sawyer M, Whitehead RG. (1986)
High-levels of energy-expenditure in obese women.
*British Medical Journal* **292**, 983-987.


This page maintained by Martin Bland.

Last updated: 20 January, 2020.