Measurement in Health and Disease: Observer variation

What do we mean by observer variation?

Figure 1 shows the first 9 patients from a study of 28, where each patient was measured 3 times by each of 3 observers.

Figure 1. Pupil diameters measured 3 times by each of 3 observers, first 9 patients from a study of 28.

Inspection of the data suggests that there is more variation between observations by different observers than between repeated observations by the same observer on the same patient. Patient 6 is a good example. The variability between measurements on the same subject by different observers is called observer variation.

We can estimate the effects of observer variation using the same kinds of statistics as we do for measurement error by the same observer: within-subject standard deviation and coefficient of variation, and correlation coefficients, usually ICCs. We can estimate these statistics for different observers on the same occasion, on different occasions, and so on.

For the data of Figure 1 (all 28 subjects) the intra-observer within-subject standard deviation was $s_w$ = 0.38 mm. The corresponding ICC was 0.80. The inter-observer within-subject standard deviation was 0.48 mm and the ICC was 0.72. The standard deviation is greater with different observers and the ICC is smaller, both reflecting the greater error when different observers make the measurement.
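As a rough illustration of how these two statistics can be obtained, the sketch below computes a within-subject standard deviation and an ICC from a one-way analysis of variance. The function name and the small data set are hypothetical, invented for illustration; they are not the pupil diameter data, and the ICC formula is the simple one-way version, which may differ from the one used for the figures quoted above.

```python
import numpy as np

def within_subject_sd_and_icc(x):
    """x: 2-D array, one row per subject, one column per repeated measurement."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1)
    # Within-subject mean square: pooled variance of repeats about each subject mean.
    ms_within = ((x - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
    # Between-subject mean square from the one-way analysis of variance.
    ms_between = k * ((subject_means - x.mean()) ** 2).sum() / (n - 1)
    # Between-subject variance component, truncated at zero.
    var_between = max((ms_between - ms_within) / k, 0.0)
    s_w = np.sqrt(ms_within)
    icc = var_between / (var_between + ms_within)
    return s_w, icc

# Hypothetical diameters (mm): 4 subjects, 3 repeats each. Not real data.
demo = [[4.0, 4.2, 4.1], [5.5, 5.3, 5.6], [3.2, 3.0, 3.1], [6.1, 6.4, 6.0]]
s_w, icc = within_subject_sd_and_icc(demo)
print(f"s_w = {s_w:.3f} mm, ICC = {icc:.3f}")
```

If the columns hold repeats by one observer, the result describes intra-observer error; if each column holds a measurement by a different observer, the same calculation describes inter-observer error.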


Why investigate observer variation?

Many designs can be used to investigate observer variation, depending on the purpose of the investigation and the resources available. There are several reasons for carrying out observer comparison studies.

Sometimes our focus of interest is the properties of the measurement method itself:

Sometimes the focus may be on the observers rather than the measurement method:

This variety of purpose leads to a variety of designs and analyses. For some purposes, such as demonstrating the possibility of the measurement being applied by different observers or observer training, the number of observers is fixed by the objective. For others, the main problem is getting enough observers to have a reasonable sample to represent observers in general.

The usual design is to get several observers each to measure several subjects, preferably more than once. All we need to do is to ask a sample of observers, representative of the observers whose variation we wish to study, to make repeated observations on each of a sample of subjects, the order in which observers make their measurements being randomized. We then ask by how much the variation between measurements on the subject is increased when these measurements are made by different observers.

In practice, the ideal design of a representative sample of observers making repeated measurements on each of a sample of subjects is almost always impossible in the study of clinical measurements. There are several reasons for this:

  1. One can rarely obtain a representative sample of observers. Clinical measurements often require considerable skill, and observers for new methods of measurement may be hard to find. Studies involving only two observers are not uncommon.
  2. Many measurements which involve subjective assessment cannot be repeated by the same observer without the result of the first measurement influencing the second.
  3. Many methods of measurement are either uncomfortable or invasive, and a long series of measurements cannot be done on the same subject.

For these reasons, most observer comparison studies are a compromise between the ideal study design and practical and ethical limitations.

There is one other possible design which might be considered ‘ideal’. This is to have every subject measured by two different observers, using new observers every time. We could then use the methods for simple measurement error to estimate the standard deviation within subjects when each measurement is by a different observer. This design is most unlikely to be used in practice, but we may sometimes choose to analyse our data as if it had been used, ignoring the fact that the same observer is used several times.
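A minimal sketch of that calculation, assuming one pair of measurements per subject with each member of the pair made by a different observer. With two measurements the within-subject variance for a subject is half the squared difference, so the pooled estimate is the mean of these halves. The function name and the numbers are hypothetical.

```python
import numpy as np

def sw_from_pairs(first, second):
    """Within-subject SD from paired measurements, one pair per subject."""
    d = np.asarray(first, dtype=float) - np.asarray(second, dtype=float)
    # Variance of a pair is d^2 / 2; average over subjects, then take the root.
    return np.sqrt(np.mean(d ** 2) / 2.0)

# Hypothetical readings: two subjects, each measured once by two different observers.
s_pairs = sw_from_pairs([10.0, 12.0], [11.0, 12.0])
print(s_pairs)
```

Because each member of a pair comes from a different observer, this standard deviation includes the observer contribution as well as the measurement error of a single observer.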

One solution to the problem of needing many observers to measure the same subject is to carry out several small replicates of the ideal design and then combine them. An example is the study of the measurement of abdominal circumference by fetal ultrasound shown in Table 1.

Table 1. Ultrasound abdominal circumference measurements (cm) by 16 observers (L. Chitty, personal communication)
Observer   Subject 1        Subject 2        Subject 3
1          13.6 13.3 12.9   14.7 14.8 14.7
2          13.8 14.2 13.2   14.9 14.1 14.5   17.2 17.5 17.6
3          14.5 14.2 13.8                    16.3 15.2 16.1
4          13.7 13.7 13.4   14.4 14.3 13.6   16.8 16.8 17.5
Observer   Subject 4        Subject 5        Subject 6
5          14.8 14.6 14.8   18.3 18.5 18.5   12.6 12.6 12.4
6          14.9 14.4 14.2   17.4 17.9 17.0   12.3 12.1 12.1
7          14.3 14.4 14.3   17.7 17.0 18.3   12.5 12.2 12.6
8          13.8 14.1 14.1   17.4 17.9 16.4   13.0 12.6 12.7
Observer   Subject 7        Subject 8        Subject 9
9          12.4 11.7 11.6                    11.3 11.6 10.7
10         11.5 12.5 12.8   16.1 15.8 15.4    9.7 10.2  9.8
11         14.6 12.7 11.5   16.7 16.5 16.2   10.7 10.3  9.8
12         13.5 13.4 12.5   17.0 16.6 17.2   10.9 11.2 11.3
Observer   Subject 10       Subject 11       Subject 12
13         14.3 14.4 14.8   15.6 15.9 16.1   20.2 20.9 21.1
14         14.3 15.5 14.6   15.7 15.0 16.5   20.1 20.7 20.9
15         14.6 14.8 15.4   16.3 16.1 15.6
16         14.1 14.6 13.7   14.4 15.1 15.2   20.5 20.5 21.1

It was thought feasible for four observers each to make three measurements on a patient. The investigators were able to arrange for three patients to be available for a group of four observers. Thus we have a block of data consisting of four observers, three subjects, and three measurements by each observer on each subject. This is the ideal study design, apart from the small numbers of observers and subjects involved. Now we can repeat this, using four more observers and three more patients, and combine the two studies. Thus we increase the numbers of observers and subjects without putting too many demands on either. In the study shown in Table 1, there were four replications, so that altogether sixteen observers each made three measurements on three patients, and there were twelve patients in all.

This design enables unlimited numbers of observers and patients to be studied without undue stress on either subjects or observers. It also lends itself very neatly to a multicentre study, where small groups of observers could make their measurements in different institutions.
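One way to see how the replicates combine is to pool the variation between repeated measurements within each observer-subject cell across blocks. The sketch below does this for two of the four blocks of Table 1 (observers 5 to 8 and observers 13 to 16), summing the sums of squares and the degrees of freedom. The function name is hypothetical, and this estimates only the repeatability for a single observer; it is an illustration, not necessarily the analysis applied to these data.

```python
import numpy as np

def pooled_within_sd(blocks):
    """Pool within-cell sums of squares over blocks; a cell is one observer's repeats on one subject."""
    ss, df = 0.0, 0
    for block in blocks:
        for cell in block:
            cell = np.asarray(cell, dtype=float)
            ss += ((cell - cell.mean()) ** 2).sum()
            df += cell.size - 1
    return np.sqrt(ss / df), df

# Observers 5-8 on subjects 4-6 (cm), from Table 1.
block_2 = [
    [14.8, 14.6, 14.8], [18.3, 18.5, 18.5], [12.6, 12.6, 12.4],
    [14.9, 14.4, 14.2], [17.4, 17.9, 17.0], [12.3, 12.1, 12.1],
    [14.3, 14.4, 14.3], [17.7, 17.0, 18.3], [12.5, 12.2, 12.6],
    [13.8, 14.1, 14.1], [17.4, 17.9, 16.4], [13.0, 12.6, 12.7],
]
# Observers 13-16 on subjects 10-12 (cm); observer 15 has only two cells.
block_4 = [
    [14.3, 14.4, 14.8], [15.6, 15.9, 16.1], [20.2, 20.9, 21.1],
    [14.3, 15.5, 14.6], [15.7, 15.0, 16.5], [20.1, 20.7, 20.9],
    [14.6, 14.8, 15.4], [16.3, 16.1, 15.6],
    [14.1, 14.6, 13.7], [14.4, 15.1, 15.2], [20.5, 20.5, 21.1],
]

s_w_pooled, df_pooled = pooled_within_sd([block_2, block_4])
print(f"pooled s_w = {s_w_pooled:.3f} cm on {df_pooled} degrees of freedom")
```

A missing cell simply contributes nothing, so incomplete blocks can still be pooled, and further blocks or centres can be added to the list in the same way.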

Another strategy which has been used is to construct a physical model of the object to be measured. Obvious advantages are that the model can then be measured as often as required and the true value is known. For example, Moertel and Hanley (1976) made model tumours from 12 solid spheres, arranged in random order on a soft mattress and covered with foam rubber 0.5 inches thick for the six smaller spheres and 1.5 inches thick for the six larger spheres. They then invited 16 experienced oncologists to measure the diameter of each sphere, each observer using the technique and equipment which they routinely used in clinical practice.

There are other ways in which observer variation can be studied without the presence of the subject. When physical contact is not necessary, a video recording of a patient can be used as a subject and measured repeatedly. For example, Falkowski et al. (1980) used video recordings of psychiatric interviews to investigate observer variation in assessment of ego state. It may be possible to present the same subject more than once, as in the British Hypertension Society training film of blood pressure measurements. In this, the manometer is shown while the Korotkov sound is heard on the sound track. Each recording is included twice, but the observers are not told and do not notice this and so there is no bias in the second reading from knowledge of the first.

Such artificial measurement situations are very useful for investigating some of the sources of variation in a measurement and for observer training, but we cannot be sure that they embody all the sources of variation present in practice. They cannot entirely replace investigations of measurement variation in the living subject.


Analysis of observer variation studies

In this lecture we shall consider only continuous outcomes, i.e. measurements, as opposed to categorical ones. We deal with the latter separately using Cohen’s kappa statistics.

The full data for the pupil diameter study are shown in Table 2. To estimate the increase in variation when different observers are used, we use analysis of variance. Compared to the simple measurement error problem, the analysis of variance is more complicated, because we have more sources of variation. The variation for repeated observations by the same observer on the same subject we will call $s_w^2$, as before. The variation between subjects, that is between the true values of the quantity being measured, will be $s_b^2$, as before. By ‘true value’ we mean the average value we would get from many measurements by many different observers.

The variation due to observers is made up of two different components. An observer may have a bias, a fixed effect where that observer consistently measures higher or lower than others. There may also be a random effect, which we will call the heterogeneity, where the observer measures higher than others for some subjects and lower for others.

The meaning of heterogeneity may be obscure, and a thought experiment may make it clearer. In the film 10 (Edwards 1979), Dudley Moore scores feminine attractiveness out of ten. Suppose we wish to estimate the observer variation of this highly subjective measurement. We persuade several observers to rate several subjects, and to repeat the rating several times.

Now there will be an overall mean rating, for all subjects by all observers on all occasions. Some subjects will receive higher mean scores than others and this variation about the overall mean is measured by $s_b^2$. If we get the same observer to rate the same subject several times, the ratings will vary. The variation between the individual measurement and the mean for that observer's measurement of that subject is measured by the measurement error, $s_w^2$. Some observers will be more generous in their ratings than others. The variation of the observer means about the overall mean is measured by another variance, $s_o^2$. For a given observer, this is the bias, the tendency to rate high or low.

What about the heterogeneity? It is well known that people tend to be attracted to partners who look like them. Tall, thin women marry tall, thin men, and short, fat men marry short, fat women, for example. (Take a good look at your friends if you don't believe this.) Thus Bland, who is short, may give higher ratings to short women than to tall ones, and Altman, who is tall, may give higher ratings to tall women than to short, even though their overall mean ratings may be the same. This is the heterogeneity, or observer times subject interaction, and it may be just as important as the observer bias. It comes from the difference between the mean rating for a given subject by a given observer and the rating we would expect for this subject and observer given the mean rating over all observers for the subject and the mean rating over all subjects by the observer.

Physical measurements can behave in the same way. Measured blood pressure is said to be higher when subject and observer are of opposite sex than when they are the same sex. If both observers and subjects include both sexes, this will contribute to heterogeneity. In general, there may be unknown observer and subject factors which contribute to heterogeneity and our method of analysis must allow for the possibility of their presence. We will denote the extra variability in measurements due to this heterogeneity by $s_h^2$.

The final measurement is made up of the overall mean, the difference from the mean for that particular subject, the difference from the mean for the observer, the heterogeneity, and the measurement error. We assume that the effects of subject, observer and measurement error are added.
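Written out as a sketch, this additive model for the k-th measurement on subject i by observer j might look like the following. The symbols $y_{ijk}$, $\mu$, $a_i$, $b_j$, $h_{ij}$ and $e_{ijk}$ are assumptions introduced here for illustration and are not notation used elsewhere in these notes.

$$
y_{ijk} = \mu + a_i + b_j + h_{ij} + e_{ijk}
$$

Here $\mu$ is the overall mean, $a_i$ is the difference from the mean for subject i, $b_j$ is the difference from the mean for observer j, $h_{ij}$ is the heterogeneity for that observer with that subject, and $e_{ijk}$ is the measurement error.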

Hence we have four different variances, and if we have measurements on different subjects made by different observers, the variance will be the sum of all of them:

$$
s^2 = s_b^2 + s_o^2 + s_h^2 + s_w^2
$$
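To make the four components concrete, the sketch below estimates them for one complete block of Table 1 (observers 5 to 8, subjects 4 to 6) by equating the mean squares of a balanced two-way analysis of variance to their expected values. It assumes that both subjects and observers are random samples and that the block is complete; the function and variable names are hypothetical, and with only four observers and three subjects the estimates are very imprecise. It is an illustration of the idea, not necessarily the analysis developed in the rest of these notes.

```python
import numpy as np

# Observers 5-8 on subjects 4-6 from Table 1 (cm); axes: observer, subject, repeat.
y = np.array([
    [[14.8, 14.6, 14.8], [18.3, 18.5, 18.5], [12.6, 12.6, 12.4]],
    [[14.9, 14.4, 14.2], [17.4, 17.9, 17.0], [12.3, 12.1, 12.1]],
    [[14.3, 14.4, 14.3], [17.7, 17.0, 18.3], [12.5, 12.2, 12.6]],
    [[13.8, 14.1, 14.1], [17.4, 17.9, 16.4], [13.0, 12.6, 12.7]],
])

def variance_components(y):
    n_obs, n_subj, k = y.shape
    grand = y.mean()
    cell = y.mean(axis=2)            # mean for each observer-subject cell
    obs = y.mean(axis=(1, 2))        # mean for each observer
    subj = y.mean(axis=(0, 2))       # mean for each subject
    ms_subj = n_obs * k * ((subj - grand) ** 2).sum() / (n_subj - 1)
    ms_obs = n_subj * k * ((obs - grand) ** 2).sum() / (n_obs - 1)
    inter = cell - obs[:, None] - subj[None, :] + grand
    ms_int = k * (inter ** 2).sum() / ((n_obs - 1) * (n_subj - 1))
    ms_err = ((y - cell[:, :, None]) ** 2).sum() / (n_obs * n_subj * (k - 1))
    # Method-of-moments estimates, with negative estimates truncated at zero.
    comp = {
        "subject": max((ms_subj - ms_int) / (n_obs * k), 0.0),    # s_b^2
        "observer": max((ms_obs - ms_int) / (n_subj * k), 0.0),   # s_o^2
        "heterogeneity": max((ms_int - ms_err) / k, 0.0),         # s_h^2
        "error": ms_err,                                          # s_w^2
    }
    comp["total"] = sum(comp.values())                            # s^2
    return comp

components = variance_components(y)
for name, value in components.items():
    print(f"{name:14s} {value:.4f}")
```

The total is the variance we would expect for a single measurement on a randomly chosen subject by a randomly chosen observer, matching the sum above.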

To recap: