Correlation coefficient using repeated observations

This is a section from my textbook An Introduction to Medical Statistics, Fourth Edition. I hope that the topic will be useful in its own right, as well as giving a flavour of the book. Section references are to the book.

11.12 Using repeated observations

In clinical research we are often able to take several measurements on the same patient. We may want to investigate the relationship between two variables, and take pairs of readings with several pairs from each of several patients. The analysis of such data is quite complex. This is because the variability of measurements made on different subjects is usually much greater than the variability between measurements on the same subject, and we must take these two kinds of variability into account. What we must not do is to put all the data together, as if they were one sample. The observations would not be independent.

Consider the simulated data of Table 11.3.

**Table 11.3** Simulated data showing 10 pairs of measurements of two independent variables for four subjects
	Subject 1	Subject 2	Subject 3	Subject 4
	X	Y	X	Y	X	Y	X	Y
	47	51	49	52	51	46	63	64
	46	53	50	56	46	48	70	62
	50	57	42	46	46	47	63	66
	52	54	48	52	45	55	58	64
	46	55	60	53	52	49	59	62
	36	53	47	49	54	61	61	62
	47	54	51	52	48	53	67	58
	46	57	57	50	47	48	64	62
	36	61	49	50	47	50	59	67
	44	57	49	49	54	44	61	59
Mean	45.0	55.2	50.2	50.9	49.0	50.1	62.5	62.6
Correlation	r=−0.33	r=0.49	r=0.06	r=−0.39
Significance	P=0.35	P=0.15	P=0.86	P=0.27

The data were generated from random numbers, and there is no relationship between X and Y at all. First, values of X and Y were generated for each ‘subject’, then a further random number was added to make the individual ‘observation’. For each subject separately, there was no significant correlation between X and Y. For the subject means, the correlation coefficient was r = 0.77, P = 0.23. However, if we put all 40 observations together we get r = 0.53, P = 0.0004. Even though the coefficient is smaller than that between subject means, because it is based on 40 pairs of observations rather than 4 it becomes significant.
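The correlations quoted above can be checked directly from Table 11.3. The following is a minimal sketch in plain Python; the `pearson` and `mean` helpers and the variable names are illustrative choices, not taken from the book.

```python
from math import sqrt

# Table 11.3, transcribed as {subject: (X values, Y values)}.
DATA = {
    1: ([47, 46, 50, 52, 46, 36, 47, 46, 36, 44],
        [51, 53, 57, 54, 55, 53, 54, 57, 61, 57]),
    2: ([49, 50, 42, 48, 60, 47, 51, 57, 49, 49],
        [52, 56, 46, 52, 53, 49, 52, 50, 50, 49]),
    3: ([51, 46, 46, 45, 52, 54, 48, 47, 47, 54],
        [46, 48, 47, 55, 49, 61, 53, 48, 50, 44]),
    4: ([63, 70, 63, 58, 59, 61, 67, 64, 59, 61],
        [64, 62, 66, 64, 62, 62, 58, 62, 67, 59]),
}


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# 1. Each subject separately (10 pairs each).
per_subject = {s: pearson(x, y) for s, (x, y) in DATA.items()}

# 2. Subject means (4 pairs).
mean_x = [mean(x) for x, _ in DATA.values()]
mean_y = [mean(y) for _, y in DATA.values()]
r_means = pearson(mean_x, mean_y)

# 3. All 40 observations pooled, ignoring subject (the analysis to avoid).
all_x = [v for x, _ in DATA.values() for v in x]
all_y = [v for _, y in DATA.values() for v in y]
r_pooled = pearson(all_x, all_y)

for s, r in per_subject.items():
    print(f"subject {s}: r = {r:.2f}")
print(f"subject means: r = {r_means:.2f}")
print(f"pooled:        r = {r_pooled:.2f}")
```

Only the coefficients are computed here. The P values in the text also need the t distribution, which the Python standard library does not provide.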

The data are plotted in Figure 11.17, with three other simulations:

[Figure: four scatter diagrams showing multiple points; three of the relationships are statistically significant.]

Figure 11.17 Simulations of 10 pairs of observations on four subjects.

As the null hypothesis is always true in these simulated data, the population correlations for each ‘subject’ and for the means are zero. Because the numbers of observations are small, the sample correlations vary greatly. As Table 11.2 shows, large correlation coefficients can arise by chance in small samples. However, the overall correlation is ‘significant’ in three of the four simulations, though in different directions.
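How often the pooled analysis goes wrong can be explored by repeating this kind of simulation many times. The sketch below is not the procedure used for the book's figure: the Normal distributions, the mean of 50, the standard deviations (`sd_between`, `sd_within`), and the seed are all assumptions chosen only to resemble Table 11.3 loosely. It generates independent subject-level values of X and Y, adds independent noise for each observation, and counts how often the pooled correlation of 40 pairs would be declared significant at the 5% level.

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_pooled_r(rng, n_subjects=4, n_obs=10,
                      sd_between=8.0, sd_within=4.0):
    """One data set with no true X-Y relationship; returns the pooled r.

    All distributional settings are assumptions made for illustration.
    """
    xs, ys = [], []
    for _ in range(n_subjects):
        # Independent subject-level values of X and Y.
        subject_x = rng.gauss(50, sd_between)
        subject_y = rng.gauss(50, sd_between)
        for _ in range(n_obs):
            # A further independent random number makes each observation.
            xs.append(subject_x + rng.gauss(0, sd_within))
            ys.append(subject_y + rng.gauss(0, sd_within))
    return pearson(xs, ys)


rng = random.Random(2015)  # arbitrary seed, for reproducibility only
N_SIMS = 2000

# For 40 pairs treated as independent (38 degrees of freedom), |r| above
# about 0.312 corresponds to a two-sided P below 0.05.
R_CRITICAL = 0.312

n_significant = sum(
    abs(simulate_pooled_r(rng)) > R_CRITICAL for _ in range(N_SIMS)
)
false_positive_rate = n_significant / N_SIMS
print(f"pooled analysis 'significant' in {false_positive_rate:.0%} of simulations")
```

Under these assumptions the proportion should come out well above the nominal 5%, even though X and Y are unrelated by construction.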

We have only four subjects, and so only four independent points. By using the repeated data, we are not increasing the number of subjects, but the statistical calculation is done as if we have, and so the number of degrees of freedom for the significance test is incorrectly increased and a spurious significant correlation is produced.
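The effect of the inflated degrees of freedom can be seen with the usual t test for a correlation coefficient, t = r √((n − 2)/(1 − r²)) on n − 2 degrees of freedom. The sketch below applies it to the two coefficients quoted earlier; the function name `t_from_r` is an illustrative choice.

```python
from math import sqrt


def t_from_r(r, n_pairs):
    """t statistic and degrees of freedom for testing r = 0,
    treating all n_pairs as independent observations."""
    df = n_pairs - 2
    return r * sqrt(df / (1 - r * r)), df


# Pooled analysis: 40 pairs wrongly treated as independent.
t_pooled, df_pooled = t_from_r(0.53, 40)

# Subject means: 4 genuinely independent pairs.
t_means, df_means = t_from_r(0.77, 4)

print(f"pooled: t = {t_pooled:.2f} on {df_pooled} df")
print(f"means:  t = {t_means:.2f} on {df_means} df")
```

The smaller coefficient gives the larger t statistic (about 3.85 on 38 degrees of freedom against about 1.71 on 2) purely because of the number of pairs entered into the calculation.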

There are two simple ways to approach this type of data, and which is chosen depends on the question being asked. If we want to know whether subjects with a high value of X tend to have a high value of Y also, we can use the subject means and find the correlation between them. If we have different numbers of observations for each subject, we can use a weighted analysis, weighted by the number of observations for the subject. If we want to know whether changes in one variable in the same subject are paralleled by changes in the other, we need to use multiple regression, taking subjects out as a factor (Section 15.1, Section 15.8). In either case, we should not mix observations from different subjects indiscriminately.
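Both approaches can be sketched for the data of Table 11.3. This is a simplified illustration rather than the book's worked method. For the second question it centres each subject's observations on that subject's own means, which removes the between-subject differences in the same way as fitting subject as a factor, so long as the degrees of freedom are reduced to match. The helper names are hypothetical.

```python
from math import sqrt

# Table 11.3, transcribed as {subject: (X values, Y values)}.
DATA = {
    1: ([47, 46, 50, 52, 46, 36, 47, 46, 36, 44],
        [51, 53, 57, 54, 55, 53, 54, 57, 61, 57]),
    2: ([49, 50, 42, 48, 60, 47, 51, 57, 49, 49],
        [52, 56, 46, 52, 53, 49, 52, 50, 50, 49]),
    3: ([51, 46, 46, 45, 52, 54, 48, 47, 47, 54],
        [46, 48, 47, 55, 49, 61, 53, 48, 50, 44]),
    4: ([63, 70, 63, 58, 59, 61, 67, 64, 59, 61],
        [64, 62, 66, 64, 62, 62, 58, 62, 67, 59]),
}


def mean(values):
    return sum(values) / len(values)


def weighted_pearson(x, y, w):
    """Correlation of x and y with each pair weighted by w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / sqrt(sxx * syy)


# Question 1: do subjects with high X tend to have high Y?
# Correlate subject means, weighted by the number of observations.
mean_x = [mean(x) for x, _ in DATA.values()]
mean_y = [mean(y) for _, y in DATA.values()]
counts = [len(x) for x, _ in DATA.values()]
r_between = weighted_pearson(mean_x, mean_y, counts)

# Question 2: within a subject, do changes in X go with changes in Y?
# Centre each observation on its own subject's means, then correlate.
dx, dy = [], []
for x, y in DATA.values():
    mx, my = mean(x), mean(y)
    dx += [v - mx for v in x]
    dy += [v - my for v in y]
sxy = sum(a * b for a, b in zip(dx, dy))
r_within = sxy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Residual degrees of freedom: observations - subjects - 1 for the slope.
df_within = len(dx) - len(DATA) - 1

print(f"between subjects: r = {r_between:.2f} ({len(DATA)} subjects)")
print(f"within subjects:  r = {r_within:.2f} ({df_within} df)")
```

Because every subject here has 10 observations, the weighted correlation of means equals the unweighted value of 0.77. The within-subject coefficient is close to zero, as expected for data generated with no relationship.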

Adapted from pages 171–172 of An Introduction to Medical Statistics by Martin Bland, 2015, reproduced by permission of Oxford University Press.

This page maintained by Martin Bland.
Last updated: 7 August, 2015.