# Assessing agreement using Cohen’s kappa

This is a section from Martin Bland’s text book An Introduction to Medical Statistics, Fourth Edition. I hope that the topic will be useful in its own right, as well as giving a flavour of the book. Section references are to the book.

## 20.3 Assessing agreement using Cohen’s kappa

Table 20.5 shows answers to the question ‘Have you ever smoked a cigarette?’ obtained from a sample of children on two occasions, using a self-administered questionnaire and an interview (Bland et al. 1975).

We would like to know how closely the children’s answers agree.

One possible method of summarizing the agreement between the pairs of observations is to calculate the percentage of agreement, the percentage of subjects observed to be the same on the two occasions. For Table 20.5, the percentage agreement is 100 × (61 + 25)/94 = 91.5%. However, this method can be misleading because it does not take into account the agreement which we would expect even if the two observations were unrelated.

Consider Table 20.6, which shows some artificial data relating observations by one observer, A, to those by three others, B, C, and D.

For Observers A and B, the percentage agreement is 80%, as it is for Observers A and C. This would suggest that Observers B and C are equivalent in their agreement with A. However, Observer C always chooses ‘No’. Because Observer A chooses ‘No’ often, A and C appear to agree, but in fact they are using different and unrelated strategies for forming their opinions. Observers A and D give ratings which are independent of one another, the frequencies in Table 20.6 being equal to the expected frequencies under the null hypothesis of independence (chi-squared = 0.0), calculated by the method described in Section 13.1. The percentage agreement is 68%, which may not sound very much worse than 80% for A and B. However, there is no more agreement than we would expect by chance. The proportion of subjects for which there is agreement tells us nothing at all. To look at the extent to which there is agreement other than that expected by chance, we need a different method of analysis: Cohen’s kappa.

Cohen’s kappa (Cohen 1960) was introduced as a measure of agreement which avoids the previously described problems by adjusting the observed proportional agreement to take account of the amount of agreement which would be expected by chance.

First, we calculate the proportion of units where there is agreement, p, and the proportion of units which would be expected to agree by chance, pe. The expected numbers agreeing are found as in chi-squared tests, by row total times column total divided by grand total (Section 13.1). For Table 20.5, for example, we get p = (61 + 25)/94 = 0.915 and

pe = (63 × 67/94 + 31 × 27/94)/94 = 0.572

Cohen’s kappa (κ) is then defined by

κ = (p − pe)/(1 − pe)

For Table 20.5 we get:

κ = (0.915 − 0.572)/(1 − 0.572) = 0.801
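As a quick sketch in Python: the cell counts `[[61, 2], [6, 25]]` are inferred from the diagonal entries (61 and 25) and the marginal totals (rows 63 and 31, columns 67 and 27) quoted above, so treat them as an assumption rather than a transcription of Table 20.5.

```python
# Cohen's kappa for a 2x2 agreement table.
# Cell counts inferred from the totals quoted in the text.
table = [[61, 2], [6, 25]]
n = sum(sum(row) for row in table)

# observed proportion agreeing: the diagonal cells
p = sum(table[i][i] for i in range(2)) / n

# chance agreement: row total x column total / grand total, summed and scaled
row = [sum(r) for r in table]
col = [sum(table[i][j] for i in range(2)) for j in range(2)]
pe = sum(row[i] * col[i] for i in range(2)) / n**2

kappa = (p - pe) / (1 - pe)
print(round(p, 3), round(pe, 3), round(kappa, 3))  # 0.915 0.572 0.801
```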

Cohen’s kappa is thus the agreement adjusted for that expected by chance. It is the amount by which the observed agreement exceeds that expected by chance alone, divided by the maximum which this difference could be. Kappa distinguishes very well between the agreement shown by the pairs of observers A and B, A and C, and A and D in Table 20.6. For Observers A and B, κ = 0.37, whereas for Observers A and C κ = 0.00, as it is for Observers A and D.

We will have perfect agreement when all agree, so p = 1. For perfect agreement κ = 1. We may have no agreement in the sense of no relationship, when p = pe and so κ = 0. We may also have no agreement when there is an inverse relationship. In Table 20.5, this would be if children who said ‘no’ the first time said ‘yes’ the second and vice versa. We would have p < pe and so κ < 0. The lowest possible value for κ is −pe/(1 − pe), so depending on pe, κ may take any negative value. Thus κ is not like a correlation coefficient, lying between −1 and +1. Only values between 0 and 1 have any useful meaning.

Note that kappa is always less than the proportion agreeing, p, unless agreement is perfect. You could just trust me, or we can see this mathematically because:

p − κ = p − (p − pe)/(1 − pe)
= (p × (1 − pe) − (p − pe))/(1 − pe)
= (p − p × pe − p + pe)/(1 − pe)
= (pe − p × pe)/(1 − pe)
= pe × (1 − p)/(1 − pe)

and this must be greater than 0 because pe, 1 − p, and 1 − pe are all greater than 0 unless p = 1, when p − κ is equal to 0. Hence p must be greater than kappa.
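If you would rather not take the algebra on trust, the identity can be checked numerically for the Table 20.5 values (p = 86/94, pe = 5058/8836, as computed earlier):

```python
# Check the identity p - kappa = pe * (1 - p)/(1 - pe) for Table 20.5.
p = 86 / 94
pe = 5058 / 8836
kappa = (p - pe) / (1 - pe)

lhs = p - kappa
rhs = pe * (1 - p) / (1 - pe)
print(abs(lhs - rhs) < 1e-12)  # True: the two sides agree
```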

How large should kappa be to indicate good agreement? This is a difficult question, as what constitutes good agreement will depend on the use to which the assessment will be put. Kappa is not easy to interpret in terms of the precision of a single observation. The problem is the same as arises with correlation coefficients for measurement error in continuous data. Table 20.7 gives two guidelines for its interpretation, one by Landis and Koch (1977), the other, slightly adapted from Landis and Koch, by Altman (1991).

I prefer the Altman version, because kappa less than zero to me suggests positive disagreement rather than poor agreement and ‘almost perfect’ seems too good to be true. Table 20.7 is only a guide, and does not help much when we are interested in the clinical meaning of an assessment.

There are problems in the interpretation of kappa. Kappa depends on the proportions of subjects who have true values in each category. For a simple example, suppose we have two categories, the proportion in the first category is p1, and the probability that an observer is correct is q. For simplicity, we shall assume that the probability of a correct assessment is unrelated to the subject’s true status. This gives us for kappa:

κ = p1(1 − p1)/(q(1 − q)/(2q − 1)² + p1(1 − p1))

Trust me. Inspection of this equation shows that, unless q = 1 (all observations always correct, so κ = 1) or q = 0.5 (random assessments, so κ = 0), kappa depends on p1, having a maximum when p1 = 0.5. Thus kappa will be specific for a given population. This is like the intraclass correlation coefficient, to which kappa is related, and has the same implications for sampling. If we choose a group of subjects to have a larger number in rare categories than does the population we are studying, kappa will be larger in the observer agreement sample than it would be in the population as a whole.

Figure 20.2 shows the predicted two-category kappa against the proportion who are ‘yes’ for different probabilities that the observer’s assessment will be correct under this simple model.

Figure 20.2 Predicted kappa for two categories, ‘yes’ and ‘no’, by probability of a ‘yes’ and probability observer will be correct. The verbal categories of Altman’s classification are shown.

Kappa is maximum when the probability of a true ‘yes’ is 0.5. As this probability gets closer to zero or to one, the expected kappa gets smaller, quite dramatically so at the extremes when agreement is very good. Unless the agreement is perfect, if one of two categories is small compared with the other, kappa will almost always be small, no matter how good the agreement is. This causes grief for a lot of users. We can see that the lines in Figure 20.2 correspond quite closely to the categories shown in Table 20.7.

A large-sample approximate standard error and confidence interval can be found for kappa. The standard error of κ is given by

SE(κ) = √(p × (1 − p)/(n × (1 − pe)²))

where n is the number of subjects. The 95% confidence interval for κ is κ − 1.96 × SE(κ) to κ + 1.96 × SE(κ) as κ is approximately Normally Distributed, provided np and n(1 − p) are large enough, say greater than five.

For the data of Table 20.5:

SE(κ) = √(p × (1 − p)/(n × (1 − pe)²))
= √(0.915 × (1 − 0.915)/(94 × (1 − 0.572)²))
= 0.067

For the 95% confidence interval we have: 0.801 − 1.96 × 0.067 to 0.801 + 1.96 × 0.067 = 0.67 to 0.93.
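The standard error and confidence interval calculation above can be sketched as follows, using the exact fractions for p and pe from Table 20.5:

```python
import math

# Large-sample standard error and 95% confidence interval for kappa,
# Table 20.5 values (p = 86/94, pe = 5058/8836, n = 94).
p, pe, n = 86 / 94, 5058 / 8836, 94
kappa = (p - pe) / (1 - pe)

se = math.sqrt(p * (1 - p) / (n * (1 - pe) ** 2))
lo, hi = kappa - 1.96 * se, kappa + 1.96 * se
print(round(se, 3), round(lo, 2), round(hi, 2))  # 0.067 0.67 0.93
```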

We can also carry out a significance test of the null hypothesis of no agreement. The null hypothesis is that in the population κ = 0, i.e. p = pe. This affects the standard error of kappa, because the standard error depends on p, in the same way that it does for proportions (Section 8.4). Under the null hypothesis, p can be replaced by pe in the standard error formula:

SE(κ) = √(pe × (1 − pe)/(n × (1 − pe)²))
= √(pe/(n × (1 − pe)))

If the null hypothesis were true, κ/SE(κ) would be from a Standard Normal Distribution. For the example, κ/SE(κ) = 6.71, P<0.0001. This test is one tailed (Section 9.5), as zero and all negative values of κ mean no agreement. Because the confidence interval and the significance test use different standard errors, it is possible to get a significant difference when the confidence interval contains zero. In that case, there would be evidence of some agreement, but kappa would be poorly estimated.
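The test statistic quoted above can be reproduced in the same way, replacing p by pe in the standard error:

```python
import math

# Significance test of no agreement: SE computed under the null p = pe.
p, pe, n = 86 / 94, 5058 / 8836, 94
kappa = (p - pe) / (1 - pe)

se0 = math.sqrt(pe / (n * (1 - pe)))
z = kappa / se0
print(round(z, 2))  # 6.71, compared with the Standard Normal Distribution
```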

Cohen (1960) dealt with only two observers. In many observer variation studies, we have observations on a group of subjects by many observers. For an example, Table 20.8 shows the results of a study of observer variation in transactional analysis (Falkowski et al. 1980).

Observers watched video recordings of discussions between people with anorexia and their families. Observers classified 40 statements as being made in the role of ‘adult’, ‘parent’ or ‘child’, as a way of understanding the psychological relationships between the family members. For some statements, such as statement 1, there was perfect agreement, all observers giving the same classification. Other statements, e.g. statement 15, produced no agreement between the observers. These data were collected as a validation exercise, to see whether there was any agreement at all between observers.

Fleiss (1971) extended Cohen’s kappa to the study of agreement between many observers. To estimate kappa by Fleiss’s method, we ignore any relationship between observers for different subjects. This method does not take any weighting of disagreements into account, and so is suitable for the data of Table 20.8. We shall omit the details. For Table 20.8, κ = 0.43.

Fleiss only gives the standard error of kappa for testing the null hypothesis of no agreement. For Table 20.8 it is SE(κ) = 0.02198. If the null hypothesis were true, the ratio κ/SE(κ) would be from a Standard Normal Distribution; κ/SE(κ) = 0.43156/0.02198 = 19.6, P<0.001. The agreement is highly significant and we can conclude that transactional analysts’ assessments are not random. We can extend Fleiss’s method to the case when the number of observers is not the same for each subject but varies, and for weighted kappa (Section 20.4).
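Although the text omits the details of Fleiss’s method, its structure is simple enough to sketch. Each subject contributes an agreement proportion over all pairs of raters, and chance agreement comes from the overall category proportions. The data below are hypothetical (five subjects, four raters, three categories), not the values of Table 20.8:

```python
# Sketch of Fleiss's (1971) kappa for many raters. Each row of `counts`
# is one subject; each column a category; entries are the number of
# raters placing that subject in that category.
def fleiss_kappa(counts):
    n_raters = sum(counts[0])           # raters per subject (assumed constant)
    n_subjects = len(counts)
    n_cats = len(counts[0])

    # per-subject agreement: proportion of agreeing rater pairs
    p_i = [(sum(c * c for c in row) - n_raters) /
           (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_i) / n_subjects

    # chance agreement from the overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(q * q for q in p_j)

    return (p_bar - p_e) / (1 - p_e)

# hypothetical data, for illustration only
counts = [[4, 0, 0], [0, 4, 0], [2, 2, 0], [1, 1, 2], [0, 0, 4]]
print(round(fleiss_kappa(counts), 3))  # 0.549
```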

Kappa statistics have several applications. They are used to assess agreement between classifications made on the same participants on different occasions, between classifications made by different observers, between classifications made by different methods, and by different reviewers identifying relevant studies or extracting data from studies in systematic reviews.

### References

Altman, D.G. (1991). Practical Statistics for Medical Research. Chapman and Hall, London.

Bland, J.M., Bewley, B.R., Banks, M.H., and Pollard, V.M. (1975). Schoolchildren’s beliefs about smoking and disease. Health Education Journal, 34, 71–8.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–47.

Falkowski, W., Ben-Tovim, D.I., and Bland, J.M. (1980). The assessment of the ego states. British Journal of Psychiatry, 137, 572–3.

Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–82.

Landis, J.R. and Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

Adapted from pages 317–322 of An Introduction to Medical Statistics by Martin Bland, 2015, reproduced by permission of Oxford University Press.