Applying the Right Statistics: Analyses of Measurement Studies

Martin Bland
Professor of Medical Statistics
St. George's Hospital Medical School

This is a slightly expanded version of a talk given at the Annual Scientific Meeting: Clinical Radiology of the Royal College of Radiologists, 19 September, 2001.

Published as: Bland JM and Altman DG. (2003) Applying the Right Statistics: Analyses of Measurement Studies. Ultrasound in Obstetrics and Gynecology, 22, 85-93.

Introduction

Many research papers in radiology concern measurement. This is a topic which in the past has been much neglected in the medical research methods literature. When I was first approached with a question on measurement error, I turned in vain to my books. I had to work it out myself.

I am going to deal in this talk with two types of study: the estimation of the agreement between two methods of measurement, and the estimation of the agreement between two measurements by the same method, also called repeatability. In both cases I shall be concerned with the question of interpreting the individual clinical measurement. For agreement between two different methods of measurement, I shall be asking whether we can use measurements by these two methods interchangeably, i.e. can we ignore the method by which the measurement was made. For two measurements by the same method, I shall be asking how variable can measurements on a patient be if the true value of the quantity does not change and what this measurement tells us about the patient's true or average value.

I shall avoid all mathematics, which even an audience as intelligent as this one finds difficult to follow during a presentation, except for one formula near the end, for which I shall apologise when the time comes. Instead I shall show what happens when we apply some simple statistical methods to a set of randomly generated data, and then show how this informs the interpretation of these methods when they are used to tackle measurement problems in the radiology literature.

For an example of the sort of study with which I shall be concerned, Borg et al. (1995) compared single X-ray absorptiometry (SXA) with single photon absorptiometry (SPA). They produced the following scatter plot for arm bone mineral density:

The analysis uses a correlation coefficient and a regression line. In this talk I shall show why correlation might be misleading, because it is dependent on the range of measurements and the way the sample of subjects was chosen and because it ignores any systematic difference between the two methods of measurement. I shall go on to show how regression can be misleading when the intercept and slope are interpreted. I shall then show an appropriate use of regression. I shall then describe a simple alternative approach, the 95% limits of agreement. I shall show how the assumptions of this can be checked using plots of difference against average. I shall explain why plots of difference against standard method might be misleading. I shall show how measurement error and agreement can be estimated in comparable ways. I shall then show how confidence intervals for the limits are easy to calculate.

In the simplest case, we start with two observations on each of a group of subjects. These observations may be by two different methods, by the same method but different observers, or by the same method and the same observer. I shall use some randomly generated data. This is not because I do not have any real data, but because with randomly generated data we know the answer.

I generated 100 observations from a Normal distribution with mean 10 and standard deviation 3, to represent the true value of the quantity being measured. Now, we create two measurements, X and Y, by adding some measurement error to the true value.
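A minimal sketch of how such a data set might be generated, using only the Python standard library. The error standard deviation of 1 and the seed are assumptions made for illustration; the talk does not state them, and the numbers produced will not match those quoted in the text.

```python
import random

def simulate_measurements(n=100, true_mean=10.0, true_sd=3.0, error_sd=1.0, seed=1):
    """Return (true, x, y): true values and two error-laden measurements of each."""
    rng = random.Random(seed)
    # True value of the quantity for each subject
    true = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    # X and Y measure the same thing, with independent errors of the same size
    x = [t + rng.gauss(0.0, error_sd) for t in true]
    y = [t + rng.gauss(0.0, error_sd) for t in true]
    return true, x, y

true, x, y = simulate_measurements()
print(len(x), round(sum(true) / len(true), 2))
```

Because X and Y are built from the same true value with no offset, any apparent bias in later analyses of such data can only be an artefact of the method of analysis.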

The red line is the line of equality, where X and Y are exactly the same. This is a line through the origin at 45 degrees to the axes. I shall come to the regression line later.

This is an ideal data set, as we know that X and Y both measure the same thing, that there are no systematic biases, and that errors are the same throughout the range. I shall compare the results from this ideal data set with some analyses drawn from the radiology literature and show how these can be misinterpreted. This is a very difficult area and I would not want the authors of my examples to be thought of as being foolish or perverse in their approach to the analysis. These are typical examples.

Correlation coefficients

One obvious analysis is a correlation coefficient. For these data the correlation coefficient between Measurement X and Measurement Y is r = 0.86, P<0.0001.

There are two problems with this. The first is that correlation depends on the range of the variables. The second is that the correlation coefficient describes association, not agreement. To take the first problem first, if we consider only subjects whose true measurement would be 10 or less we get r = 0.60:

and if we consider only subjects whose true measurement would be greater than 10 we get r = 0.62:

If we take several pairs of measurements on the same subject, who is not changing as we do it, the correlation can be zero. So the correlation coefficient depends on the group of subjects selected. It should be used only if we have a representative sample of the patient population we wish to study.
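The effect of restricting the range can be sketched as follows. This is an illustration only: the sample size of 5000 (chosen so that the result is stable), the error standard deviation of 1 and the seed are all assumptions, so the correlations will differ from the 0.86, 0.60 and 0.62 quoted above.

```python
import math
import random

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(2)
true = [rng.gauss(10, 3) for _ in range(5000)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]

# Split the subjects by their true value, as in the text
low = [(p, q) for t, p, q in zip(true, x, y) if t <= 10]
high = [(p, q) for t, p, q in zip(true, x, y) if t > 10]

r_all = pearson_r(x, y)
r_low = pearson_r([p for p, _ in low], [q for _, q in low])
r_high = pearson_r([p for p, _ in high], [q for _, q in high])
print(round(r_all, 2), round(r_low, 2), round(r_high, 2))
```

The measurement error is identical in all three groups; only the spread of true values differs, and the correlation falls with it.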

Consider this example from Schild et al. (2000) comparing final fetal weight estimated by three-dimensional ultrasound with actual birth weight.

r=0.976, P<0.01, n=65

The distribution of birth weight shows that there are many more low birthweight babies than we would expect in a representative sample of births. There are 4/65 below 1000 g. Because this widens the range of birth weights, the correlation coefficient is much larger than it would be in a representative sample.

The second problem with correlation is that it looks at the degree of association, not agreement. If I have a third measurement Z, which consistently overestimates by 2 units, the correlation of Measurement Y with Measurement Z is the same as its correlation with Measurement X, 0.86.

The correlation between Y and Z is the same as between Y and X, but the agreement is not.
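A short sketch of why this happens, assuming that Z is simply X plus 2 units (my reading of the example) and using simulated data with an assumed error standard deviation of 1. Adding a constant shifts every point by the same amount, so the correlation is unchanged while the mean difference moves by exactly 2.

```python
import math
import random

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(3)
true = [rng.gauss(10, 3) for _ in range(100)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]
z = [v + 2.0 for v in x]  # assumed: Z overestimates by a constant 2 units

r_yx = pearson_r(y, x)
r_yz = pearson_r(y, z)
mean_diff_yx = sum(q - p for p, q in zip(x, y)) / len(x)
mean_diff_yz = sum(q - p for p, q in zip(z, y)) / len(z)
print(round(r_yx, 3), round(r_yz, 3))
print(round(mean_diff_yx, 2), round(mean_diff_yz, 2))
```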

Large systematic differences or biases are possible. For example, Bakker et al. (1999) investigated the agreement between renal volume measurements by ultrasound (US) and magnetic resonance (MR) imaging. Their data for 40 kidneys were as follows:

The black bars represent the MR measurements and the white bars the US.

There is a clear and significant mean difference of about 25% between the two methods. A correlation coefficient would completely miss this difference and thus be highly misleading.

Correlation is thus inappropriate for the study of agreement between different methods of measurement. Despite this, people do it.

For example, Borg et al. (1995) compared single X-ray absorptiometry (SXA) with single photon absorptiometry (SPA). They produced the following scatter plot for arm bone mineral density:

This looks like good agreement, a tight cloud of points and a high correlation of r=0.98. But should this make us think we could use bone mineral densities measured by SXA and SPA interchangeably? Look what happens when we add the line of equality to the diagram:

Nearly all the points lie to the left of it, and there is a clear bias. The SXA measurements tend to exceed the SPA measurements by 0.02 g/cm².

Regression

Some applications of regression are also inappropriate. For the randomly generated data, the regression line is shown below:

The green regression line does not coincide with the red line of equality. It does not go through the origin and its slope is less than one. If we look at the computer output, we see that the slope is 0.84, with 95% confidence interval 0.74 to 0.94. It is therefore significantly different from 1.0. Similarly, the intercept is significantly different from 0.0.

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   .8372939   .0497111     16.843   0.000        .738644    .9359439
   _cons |   1.492289   .4939103      3.021   0.003         .51214    2.472438
------------------------------------------------------------------------------

The reason for this is that regression attempts to predict the observed Y from the observed X, not the true Y from the true X. The errors in X and Y reduce the slope of the line and so raise the lower end and lower the upper end. Doing the regression the other way round, with X as dependent, has the same effect:

------------------------------------------------------------------------------
       x |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       y |   .8876808   .0527026     16.843   0.000       .7830943    .9922674
   _cons |   1.122613   .5194797      2.161   0.033       .0917224    2.153503
------------------------------------------------------------------------------

Again the slope is significantly less than one and the intercept significantly greater than zero.
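The flattening of the slope can be reproduced with a small simulation. This is a sketch under assumed settings (5000 subjects, error standard deviation 1, an arbitrary seed), not the data analysed above. With these assumptions the expected slope of observed Y on observed X is var(true) / (var(true) + var(error)) = 9/10, even though X and Y agree perfectly on average.

```python
import random

def least_squares(x, y):
    """Return (intercept, slope) for the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((p - mx) ** 2 for p in x)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(4)
true = [rng.gauss(10, 3) for _ in range(5000)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]

a_yx, b_yx = least_squares(x, y)  # Y regressed on X
a_xy, b_xy = least_squares(y, x)  # X regressed on Y
print(round(a_yx, 2), round(b_yx, 2))
print(round(a_xy, 2), round(b_xy, 2))
```

Both slopes come out below one and both intercepts above zero, whichever variable is treated as dependent, which is why tests of slope = 1 or intercept = 0 tell us nothing about bias.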

Bankier et al. (1999), in a study of subjective visual grading versus objective quantification with macroscopic morphometry and thin-section CT densitometry in pulmonary emphysema, reported the following table:

Linear Regression Results: Subjective Scores and
Densitometric and Morphometric Measurements
----------------------------------------------
        Subjective Score      Subjective Score
        and Densitometric     and Morphometric
Reader    Measurement           Measurement
----------------------------------------------
  1       0.350,1.059           0.629,1.365
  2      -0.008,0.598           0.443,1.147
  3       0.002,0.658           0.854,1.038
----------------------------------------------
Note. Data are 95% CIs for the intercepts of
regression lines

These measurements were not on the same scale, but all scales had a common point at zero. Bankier et al. (1999) interpret this table thus: "All but one of the CIs did not contain zero, which is suggestive of systematic overestimation of emphysema when compared with objective measurements." I disagree. This is what we would expect to see if there were no such bias.

Others have tested the null hypothesis that the slope is equal to 1.00, which we should not expect to be true even if the agreement is good, because measurement error alone pulls the slope below one.

Hence we can see that although regression and correlation are used they are used inappropriately. I shall return to an appropriate application of regression later.

Appropriate use of regression

I mentioned earlier that there is an appropriate use of regression in the evaluation of agreement. We can regress the old measurement on the new and calculate the standard error of a prediction of the old from the new. We can use this to estimate a predicted old measurement for any observed value of the new, with a confidence interval, called a prediction interval. This gives us something akin to the limits of agreement. The problem is that it is not constant, being smallest near the middle of the range and wider as we get further towards the extremes. This is quite marked for small samples, but not for large. For the X Y data, regarding X as the old or standard method and Y as the new, we get:

Here the spreading out is very small. If we use a smaller sample, the spreading out is clearer:
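The widening of the prediction interval can be sketched with the usual formula for predicting a single new observation, half-width = z × s × sqrt(1 + 1/n + (x0 - mean)² / Sxx), where s is the residual standard deviation. The function name, the simulated data and the use of z = 1.96 in place of the exact t quantile are assumptions made for illustration; for small samples the t quantile would make the interval wider still.

```python
import math
import random

def prediction_half_width(new, old, new0, z=1.96):
    """Approximate half-width of the 95% prediction interval for old at new = new0."""
    n = len(new)
    mx, my = sum(new) / n, sum(old) / n
    sxx = sum((p - mx) ** 2 for p in new)
    sxy = sum((p - mx) * (q - my) for p, q in zip(new, old))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((q - intercept - slope * p) ** 2 for p, q in zip(new, old))
    s_res = math.sqrt(rss / (n - 2))  # residual standard deviation
    return z * s_res * math.sqrt(1 + 1 / n + (new0 - mx) ** 2 / sxx)

def widening(n, seed):
    """Ratio of interval width 2 SDs from the mean to the width at the mean."""
    rng = random.Random(seed)
    true = [rng.gauss(10, 3) for _ in range(n)]
    old = [t + rng.gauss(0, 1) for t in true]
    new = [t + rng.gauss(0, 1) for t in true]
    mx = sum(new) / n
    sd_new = math.sqrt(sum((p - mx) ** 2 for p in new) / (n - 1))
    centre = prediction_half_width(new, old, mx)
    edge = prediction_half_width(new, old, mx + 2 * sd_new)
    return edge / centre

print(round(widening(100, 5), 3), round(widening(10, 5), 3))
```

The ratio depends only on the sample size here, so it shows directly that the spreading out is slight for 100 subjects and much more noticeable for 10.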

This application of regression is particularly useful when the two methods of measurement have different units, as in the study of subjective visual grading versus objective quantification with thin-section CT densitometry (Bankier et al. 1999) described above. I analysed my very first methods of measurement study this way.

95% limits of agreement

When Doug Altman and I wrote our first paper on comparing methods of measurement (Altman and Bland, 1983), we were mainly concerned with exposing these potentially misleading approaches. However, we thought that we should suggest a valid method of analysis too. Our proposal was the limits of agreement method. We start with the differences between measurements by the two methods, which we thought we should estimate. We calculate the mean and standard deviation of these differences, then the mean difference plus and minus 1.96 standard deviations. We expect 95% of differences to lie between these limits.

For the X Y data, the differences Y-X have mean -0.06 and standard deviation 1.46. Hence the 95% limits are -0.06 - 1.96*1.46 = -2.92 and -0.06 + 1.96*1.46 = 2.80. Hence a measurement by Method Y would be between 2.92 units less than a measurement by Method X and 2.80 greater.
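The calculation is simple enough to sketch in a few lines. The function name is mine; the worked numbers are the mean and standard deviation quoted above.

```python
import math

def limits_of_agreement(x, y):
    """Return (mean difference, SD of differences, lower limit, upper limit) for Y - X."""
    d = [q - p for p, q in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d, sd_d, mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d

# Arithmetic from the text: mean difference -0.06, standard deviation 1.46
mean_d, sd_d = -0.06, 1.46
lower = mean_d - 1.96 * sd_d
upper = mean_d + 1.96 * sd_d
print(round(lower, 2), round(upper, 2))  # -2.92 2.8
```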

We thought this was an Aunt Minnie, so obvious and so clearly answering the question as to need no justification (Applegate and Neuhauser 1999). We therefore did not go into detail. In a later paper (Bland and Altman 1986) we elaborated the idea and gave a worked example.

The width of the 95% limits of agreement, -2.92 to 2.80, is 5.7. For the same data, the average width of the prediction interval found by regression is also 5.7. These two approaches are similar, but the limits of agreement approach is much simpler and easier to express and interpret.

For a recent practical example, Cicero et al. (2001) compared cervical length at 22-24 weeks of gestation measured by transvaginal and transperineal-translabial ultrasonography. Their data looked like this:

and they quoted the limits of agreement as -5.8 mm to 6.1 mm. Interestingly, they also quoted r=0.934, P<0.0001 and that line is the regression line, not the line of equality. Old habits die hard.

Checking assumptions

The 95% limits of agreement depend on some assumptions about the data: that the mean and standard deviation of the differences are constant throughout the range of measurement, and that these differences are from an approximately Normal distribution. To check these assumptions we proposed two plots: a scatter diagram of the difference against the average of the two measurements and a histogram of the differences. For the X-Y data, these look like this:

We also suggested adding the 95% limits of agreement and the mean difference to the scatter plot:

Cicero et al. (2001) show such a plot:

About 95% of points should lie within the limits. (In this graph there are many overlapping points.)
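A sketch of the quantities behind such a plot, on simulated data with an assumed error standard deviation of 1 and an arbitrary seed. The function returns the averages and differences, which could be passed to any plotting library, together with the limits and the proportion of differences falling inside them.

```python
import math
import random

def check_limits(x, y):
    """Return (averages, differences, lower, upper, proportion inside the limits)."""
    diffs = [q - p for p, q in zip(x, y)]
    avgs = [(p + q) / 2 for p, q in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    lower, upper = mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
    inside = sum(lower <= d <= upper for d in diffs) / n
    return avgs, diffs, lower, upper, inside

rng = random.Random(7)
true = [rng.gauss(10, 3) for _ in range(100)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]

avgs, diffs, lower, upper, inside = check_limits(x, y)
print(f"limits {lower:.2f} to {upper:.2f}; proportion inside {inside:.2f}")
```

A proportion far from 0.95, or a visible trend or funnel shape when the differences are plotted against the averages, would suggest that the assumptions do not hold.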

To our chagrin, the histogram does not seem to have been adopted with the same enthusiasm, but the scatter plot alone is a reasonable check. Also to our chagrin, many researchers seem to think that the plot is the analysis. It is not, of course, but only a check on the assumptions of the limits of agreement.

These assumptions are not always met and checking is essential. In a study of fetal lung volume measurement using three-dimensional ultrasound, Bahmaie et al. (2000) produced the following difference against mean plot for measurements by two different observers:

This shows a divergence as the magnitude increases, making the limits of agreement suspect.

Plotting difference against average

The reason for plotting the difference against the average, rather than either of the measurements singly, is that Y-X and X are inevitably correlated. For the X-Y data these correlations are:

Correlation with difference Y-X
Average of X & Y     -0.06, P=0.6
X                    -0.31, P=0.002
Y                    +0.21, P=0.04

We expect Y-X and X to be negatively correlated and Y-X and Y to be positively correlated when the difference is not really related to the magnitude, as we know to be the case in this example. Y-X and the average of Y and X will not be correlated if there is no real relationship. We can see this in the plots of difference against X and against Y:
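These spurious correlations are easy to reproduce. The sketch below uses 5000 simulated subjects with an assumed error standard deviation of 1, so the values will not match the table above, but the signs should: negative with X, positive with Y, and close to zero with the average.

```python
import math
import random

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(8)
true = [rng.gauss(10, 3) for _ in range(5000)]
x = [t + rng.gauss(0, 1) for t in true]
y = [t + rng.gauss(0, 1) for t in true]

diff = [q - p for p, q in zip(x, y)]
avg = [(p + q) / 2 for p, q in zip(x, y)]

r_avg = pearson_r(diff, avg)
r_x = pearson_r(diff, x)
r_y = pearson_r(diff, y)
print(round(r_avg, 2), round(r_x, 2), round(r_y, 2))
```

The error in X appears with a minus sign in Y-X and a plus sign in X, which is what produces the negative correlation; in the average the errors in X and Y enter equally and cancel.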

Now, consider this example from Schild et al. (2000) comparing final fetal weight estimated by three-dimensional ultrasound with actual birth weight. There is no evidence of any tendency for the points to diverge from the line of equality in the scatter diagram:

They plotted the percentage error (essentially the difference) against the actual birthweight:

The downward trend in the graph is readily apparent. This can be seriously misleading. Consider the following from a comparison of MR with conventional arthrography (Jaramillo et al., 1999):

"For all parameters, the difference between the measurements of the two modalities depended on the magnitude of the measurements. By using arthrography as the standard, a slope test indicated overestimation with MR imaging at small measurements and an underestimation at large measurements (all P<0.001)."

If they regressed MR minus arthrography on arthrography, this is what we would expect in the absence of a true relationship between difference and magnitude (Bland and Altman 1995). It would look like this:

Citation and use

The 95% limits of agreement method has been widely cited and quite widely used, though many citers do not appear to have read the paper. For example, Jaramillo et al. (1999), cited above, say in their methods section that "For each parameter, agreement between MR imaging and arthrography was investigated using the method of Bland and Altman [1986]. Arthrography was considered to be the standard and differences between methods were calculated and plotted. A slope test was used to assess whether these differences varied systematically over the range of measurements." The results section of the paper contains no limits of agreement, but rather correlation and rank correlation coefficients with P values! As for plotting difference against a standard measurement, Bland and Altman (1986) actually wrote "It would be a mistake to plot the difference against either value separately because the difference will be related to each, a well-known statistical artefact [Gill et al. 1985]."

Measurement error

Measurement error can be analysed in a manner analogous to the limits of agreement. The main difference is that there should not be any bias, which simplifies things. Correlation can be used provided there is a population from which the sample can be regarded as a representative sample, in which case it is a measure of the information content of the measurement. However, this does not help us to interpret a clinical measurement on a given patient. To do this we need to consider the variability between repeated measurements on the same subject. If we calculate the standard deviation of the differences between pairs of repeated measurements, we can calculate 1.96 times this standard deviation. This gives the repeatability coefficient, which is the difference which will be exceeded by only 5% of pairs of measurements on the same subject (British Standards Institution 1975). It is thus directly comparable to the 95% limits of agreement. Thus we can use this to compare agreement which a new method of measurement would have with a standard method, with the agreement which the new method would have with itself.
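The repeatability coefficient can be sketched in the same way as the limits of agreement. The function name and the simulated data (2000 subjects, error standard deviation 1, arbitrary seed) are assumptions for illustration. With these settings the differences between repeats have standard deviation about 1.41, so the coefficient should be near 2.8.

```python
import math
import random

def repeatability_coefficient(first, second):
    """1.96 times the SD of the differences between paired repeated measurements."""
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return 1.96 * sd_d

rng = random.Random(9)
true = [rng.gauss(10, 3) for _ in range(2000)]
first = [t + rng.gauss(0, 1) for t in true]
second = [t + rng.gauss(0, 1) for t in true]

rc = repeatability_coefficient(first, second)
# Proportion of pairs whose absolute difference exceeds the coefficient
exceed = sum(abs(b - a) > rc for a, b in zip(first, second)) / len(first)
print(round(rc, 2), round(exceed, 3))
```

About 5% of pairs should differ by more than the coefficient, which is what makes it directly comparable with the 95% limits of agreement.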

In our 1986 paper we advocated a design where each method would be used twice on each subject, so that limits of agreement between the two methods and coefficients of repeatability for each method separately could be compared. We regret that this has not been widely adopted by researchers.

Estimation and confidence intervals

Another feature which we stressed in the 1986 paper was that agreement is a question of estimation, not hypothesis testing. Estimates are usually made with some sampling error, and limits of agreement are no exception. We showed how to estimate confidence intervals for the limits of agreement. (The Lancet editorial team, in the person of David Sharp, cut the length of the paper, and much improved it in the process. However, when I explained the importance of the deleted paragraph on confidence intervals, it was reinstated.) Another regret is that these confidence intervals are seldom quoted. For the data of Cicero et al. (2001), the mean difference was 0.2 mm with standard deviation 3.0 mm, giving 95% limits of agreement -5.8 mm to +6.1 mm. There were 234 cases. The standard error of the limits is approximately √(3s²/n), where s is the standard deviation of the differences and n is the number of cases (for which bit of algebra I apologise!). This gives √(3 × 3.0²/234) = 0.34. The 95% confidence interval for the limits of agreement is given by plus or minus 1.96 standard errors = 0.67, so for the lower limit the 95% confidence interval is -6.5 to -5.1 and for the upper limit it is +5.4 to +6.8. Not so hard, really!
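The arithmetic for the Cicero et al. (2001) figures can be checked with a short sketch. It assumes the approximate standard error sqrt(3s²/n) given in Bland and Altman (1986); the function name is mine.

```python
import math

def limits_ci(lower, upper, sd, n):
    """Approximate 95% confidence intervals for the two limits of agreement."""
    se = math.sqrt(3 * sd ** 2 / n)  # approximate standard error of each limit
    half = 1.96 * se
    return se, half, (lower - half, lower + half), (upper - half, upper + half)

# Limits -5.8 mm and +6.1 mm, SD of differences 3.0 mm, 234 cases
se, half, lower_ci, upper_ci = limits_ci(-5.8, 6.1, 3.0, 234)
print(round(se, 2), round(half, 2))
print([round(v, 1) for v in lower_ci], [round(v, 1) for v in upper_ci])
```

With 234 cases the intervals are narrow. In a small study they would be much wider, which is a good reason to quote them.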

Conclusion

The limits of agreement approach is fundamentally very simple and direct. Provided its assumptions of uniform mean and standard deviation are met, it can be carried out by anyone with basic statistical knowledge. It provides statistics which are easy to interpret in a meaningful way.

It can be extended to many more complex situations (Bland and Altman 1999):

when distributions are not Normal
when difference is related to magnitude
when there are repeated measurements on the same subject, either paired or not
when there are varying numbers of observations on subjects
there is also a non-parametric version.

References

Altman DG, Bland JM. (1983) Measurement in medicine: the analysis of method comparison studies. The Statistician 32: 307-317.

Applegate KE, Neuhauser DVB. (1999) Whose Aunt Minnie? Radiology 211: 292.

Bahmaie A, Hughes SW, Clark T, Milner A, Saunders J, Tilling K, Maxwell DJ. (2000) Serial fetal lung volume measurement using three-dimensional ultrasound. Ultrasound in Obstetrics & Gynecology 16: 154-158.

Bakker J, Olree M, Kaatee R, de Lange EE, Moons KGM, Beutler JJ, Beek FJA. (1999) Renal volume measurements: Accuracy and repeatability of US compared with that of MR imaging. Radiology 211: 623-628.

Bankier AA, De Maertelaer V, Keyzer C, Gevenois PA. (1999) Pulmonary emphysema: Subjective visual grading versus objective quantification with macroscopic morphometry and thin-section CT densitometry. Radiology 211: 851-858.

Bland JM, Altman DG. (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet i: 307-310.

Bland JM, Altman DG. (1995) Comparing methods of measurement: why plotting difference against standard method is misleading. Lancet 346: 1085-1087.

Bland JM, Altman DG. (1999) Measuring agreement in method comparison studies. Statistical Methods in Medical Research 8: 135-160.

Borg J, Møllgaard A, Riis BJ. (1995) Single x-ray absorptiometry: performance characteristics and comparison with single photon absorptiometry. Osteoporosis International 5: 377-381.

British Standards Institution. (1975) Precision of test methods 1: Guide for the determination and reproducibility for a standard test method (BS 5497, Part 1). London: BSI.

Cicero S, Skentou C, Souka A, To MS, Nicolaides KH. (2001) Cervical length at 22-24 weeks of gestation: comparison of transvaginal and transperineal-translabial ultrasonography. Ultrasound in Obstetrics & Gynecology 17: 335-340.

Gill JS, Zezulka AV, Beevers DG, Davies P. (1985) Relationship between initial blood pressure and its fall with treatment. Lancet i: 567-569.

Jaramillo D, Galen T, Winalski CS, DiCanzio J, Zurakowski D, Mulkern RV, McDougall PA, Villegas-Medina OL, Jolesz FA, Kasser JR. (1999) Legg-Calvé-Perthes disease: MR imaging evaluation during manual positioning of the hip--Comparison with conventional arthrography. Radiology 212: 519-525.

Schild RL, Fimmers R, Hansmann M. (2000) Fetal weight estimation by three-dimensional ultrasound. Ultrasound in Obstetrics & Gynecology 16: 445-452.

This page is maintained by Martin Bland.
Last updated: 12 January, 2005.