Table 20.5 shows answers to the question ‘Have you ever
smoked a cigarette?’ obtained from a sample of children
on two occasions, using a self-administered questionnaire
and an interview (Bland *et al.* 1975).

Self-administered questionnaire | Interview: Yes | Interview: No | Total |
---|---|---|---|
Yes | 61 | 2 | 63 |
No | 6 | 25 | 31 |
Total | 67 | 27 | 94 |

We would like to know how closely the children’s answers agree.

One possible method of summarizing the agreement between the pairs of observations is to calculate the percentage of agreement, the percentage of subjects observed to be the same on the two occasions. For Table 20.5, the percentage agreement is 100 × (61 + 25)/94 = 91.5%. However, this method can be misleading because it does not take into account the agreement which we would expect even if the two observations were unrelated.

Consider Table 20.6, which shows some artificial data relating observations by one observer, A, to those by three others, B, C, and D.

Observer A | Observer B: Yes | Observer B: No | Observer B: Total | Observer C: Yes | Observer C: No | Observer C: Total | Observer D: Yes | Observer D: No | Observer D: Total |
---|---|---|---|---|---|---|---|---|---|
Yes | 10 | 10 | 20 | 0 | 20 | 20 | 4 | 16 | 20 |
No | 10 | 70 | 80 | 0 | 80 | 80 | 16 | 64 | 80 |
Total | 20 | 80 | 100 | 0 | 100 | 100 | 20 | 80 | 100 |

For Observers A and B, the percentage agreement is 80%, as it is for Observers A and C. This would suggest that Observers B and C are equivalent in their agreement with A. However, Observer C always chooses ‘No’. Because Observer A chooses ‘No’ often, A and C appear to agree, but in fact they are using different and unrelated strategies for forming their opinions. Observers A and D give ratings which are independent of one another, the frequencies in Table 20.6 being equal to the expected frequencies under the null hypothesis of independence (chi-squared = 0.0), calculated by the method described in Section 13.1. The percentage agreement is 68%, which may not sound very much worse than 80% for A and B. However, there is no more agreement than we would expect by chance. The proportion of subjects for which there is agreement tells us nothing at all. To look at the extent to which there is agreement other than that expected by chance, we need a different method of analysis: Cohen’s kappa.
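The comparison of observed agreement with the agreement expected under independence can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `agreement` and the layout of the tables as nested lists are illustrative assumptions.

```python
def agreement(table):
    """Return (observed, expected-by-chance) proportions agreeing.

    table[i][j] is the number of subjects placed in category i by the
    first observer and in category j by the second observer.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: the diagonal cells as a proportion of all subjects.
    observed = sum(table[i][i] for i in range(k)) / n
    # Expected agreement under independence: row total x column total / n
    # for each diagonal cell, again as a proportion of all subjects.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return observed, expected


# Table 20.6: rows are Observer A (Yes, No); columns are the other observer.
table_20_6 = {
    "B": [[10, 10], [10, 70]],
    "C": [[0, 20], [0, 80]],
    "D": [[4, 16], [16, 64]],
}

for name, table in table_20_6.items():
    p, p_e = agreement(table)
    print(f"A vs {name}: observed {p:.2f}, expected by chance {p_e:.2f}")
# A vs B: observed 0.80, expected by chance 0.68
# A vs C: observed 0.80, expected by chance 0.80
# A vs D: observed 0.68, expected by chance 0.68
```

Only for Observers A and B does the observed agreement exceed what independence alone would produce.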

Cohen’s kappa (Cohen 1960) was introduced as a measure of agreement which avoids the previously described problems by adjusting the observed proportional agreement to take account of the amount of agreement which would be expected by chance.

First, we calculate the proportion of units where there
is agreement, *p*, and the proportion of units which
would be expected to agree by chance, *p*_{e}.
The expected
numbers agreeing are found as in chi-squared
tests, by row total times column total divided by grand
total (Section 13.1).
For Table 20.5, for example, we get

$$p_e = \frac{63 \times 67/94 + 31 \times 27/94}{94} = 0.572$$

Cohen’s kappa (*κ*) is then defined by

$$\kappa = \frac{p - p_e}{1 - p_e}$$

For Table 20.5 we get:

$$\kappa = \frac{0.915 - 0.572}{1 - 0.572} = 0.801$$

Cohen’s kappa is thus the agreement adjusted for that
expected by chance. It is the amount by which the
observed agreement exceeds that expected by chance
alone, divided by the maximum which this difference
could be. Kappa distinguishes between the agreement
shown between pairs of observers A and B, A and C, and
A and D in Table 20.6 very well. For Observers A and B,
*κ* = 0.37, whereas for Observers A and C *κ* = 0.00, as it
does for Observers A and D.
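The kappa values quoted for Tables 20.5 and 20.6 can be reproduced with a few lines of Python. This is a sketch under the definition *κ* = (*p* − *p*_{e})/(1 − *p*_{e}); the function name `cohen_kappa` and the nested-list table layout are assumptions made for illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: first rating)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion agreeing: the diagonal of the table.
    p = sum(table[i][i] for i in range(k)) / n
    # Chance-expected proportion: row total x column total / n on the diagonal.
    p_e = sum(
        sum(table[i]) * sum(row[i] for row in table) for i in range(k)
    ) / n ** 2
    return (p - p_e) / (1 - p_e)


# Table 20.5: rows are the questionnaire (Yes, No), columns the interview.
print(f"Table 20.5: {cohen_kappa([[61, 2], [6, 25]]):.3f}")    # 0.801

# Table 20.6: Observer A against Observers B, C, and D.
print(f"A vs B: {cohen_kappa([[10, 10], [10, 70]]):.3f}")      # 0.375
print(f"A vs C: {cohen_kappa([[0, 20], [0, 80]]):.3f}")        # 0.000
print(f"A vs D: {cohen_kappa([[4, 16], [16, 64]]):.3f}")       # 0.000
```

The value for Observers A and B is exactly 0.375, which the text quotes as 0.37.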

We will have perfect agreement when all agree, so
*p* = 1. For perfect agreement *κ* = 1. We may have no
agreement in the sense of no relationship, when *p* = *p*_{e}
and so *κ* = 0.

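Note that kappa is always less than the proportion
agreeing, *p*, unless agreement is perfect. You could just
trust me, or we can see this mathematically because:

$$\begin{aligned}
p - \kappa &= p - \frac{p - p_e}{1 - p_e} \\
&= \frac{p(1 - p_e) - (p - p_e)}{1 - p_e} \\
&= \frac{p - p\,p_e - p + p_e}{1 - p_e} \\
&= \frac{p_e - p\,p_e}{1 - p_e} \\
&= \frac{p_e(1 - p)}{1 - p_e}
\end{aligned}$$

and this must be greater than 0 because *p*_{e},
1 − *p*, and 1 − *p*_{e} are all greater than 0, unless *p* = 1, when *p* − *κ* = 0.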

How large should kappa be to indicate good agreement? This is a difficult question, as what constitutes good agreement will depend on the use to which the assessment will be put. Kappa is not easy to interpret in terms of the precision of a single observation. The problem is the same as arises with correlation coefficients for measurement error in continuous data. Table 20.7 gives two guidelines for its interpretation, one by Landis and Koch (1977), the other, slightly adapted from Landis and Koch, by Altman (1991).

Value of kappa | Strength of agreement (Landis and Koch) | Strength of agreement (Altman) |
---|---|---|
<0.00 | Poor | – |
0.00 – 0.20 | Slight | Poor |
0.21 – 0.40 | Fair | Fair |
0.41 – 0.60 | Moderate | Moderate |
0.61 – 0.80 | Substantial | Good |
0.81 – 1.00 | Almost perfect | Very good |

I prefer the Altman version, because kappa less than zero to me suggests positive disagreement rather than poor agreement and ‘almost perfect’ seems too good to be true. Table 20.7 is only a guide, and does not help much when we are interested in the clinical meaning of an assessment.

There are problems in the interpretation of kappa.
Kappa depends on the proportions of subjects who have
true values in each category. For a simple example, suppose
we have two categories, the proportion in the first
category is *p*_{1}, and the probability that an observer is
correct is *q*. For simplicity, we shall assume that the probability
of a correct assessment is unrelated to the subject’s
true status. This gives us for kappa:

$$\kappa = \frac{p_1(1 - p_1)}{\dfrac{q(1 - q)}{(2q - 1)^2} + p_1(1 - p_1)}$$

Trust me. Inspection of this equation shows that unless
*q* = 1 (all observations always correct, *κ* = 1) or
*q* = 0.5 (random assessments, *κ* = 0),
kappa depends on *p*_{1},
having a maximum when *p*_{1} = 0.5.
Thus kappa will be specific
for a given population. This is like the intraclass correlation
coefficient, to which kappa is related, and has the
same implications for sampling. If we choose a group of
subjects to have a larger number in rare categories than
does the population we are studying, kappa will be larger
in the observer agreement sample than it would be in the
population as a whole.
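The dependence of kappa on *p*_{1} under this simple model is easy to tabulate. The Python sketch below evaluates the model equation after multiplying numerator and denominator by (2*q* − 1)^{2}, which gives the same value but avoids dividing by zero when *q* = 0.5. The function name `model_kappa` and the chosen values of *p*_{1} and *q* are illustrative assumptions.

```python
def model_kappa(p1, q):
    """Predicted kappa for two categories under the simple model.

    p1 is the true proportion in the first category and q the probability
    that an observer classifies a subject correctly. The value is undefined
    only in the degenerate case q in (0, 1) excluded and p1 in {0, 1},
    i.e. when both terms of the denominator are zero.
    """
    signal = (2 * q - 1) ** 2 * p1 * (1 - p1)
    noise = q * (1 - q)
    return signal / (noise + signal)


# Kappa for a fairly accurate observer (q = 0.9) as one category gets rarer.
for p1 in (0.5, 0.3, 0.1, 0.01):
    print(f"p1 = {p1:<4}  kappa = {model_kappa(p1, 0.9):.2f}")
```

With *q* = 0.9 the predicted kappa is 0.64 when the two categories are equally common and falls to about 0.39 when only one subject in ten is in the first category, although the observers are no less accurate.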

Figure 20.2 shows the predicted two-category kappa against the proportion who are ‘yes’ for different probabilities that the observer’s assessment will be correct under this simple model.

**Figure 20.2 Predicted kappa for two
categories, ‘yes’ and ‘no’, by probability of a
‘yes’ and probability observer will be correct.
The verbal categories of Altman’s classification are shown.**

Kappa is maximum when the
probability of a true ‘yes’ is 0.5. As this probability gets
closer to zero or to one, the expected kappa gets smaller,
quite dramatically so at the extremes when agreement is
very good. Unless the agreement is perfect, if one of two
categories is small compared with the other, kappa will
almost always be small, no matter how good the agreement
is. This causes grief for a lot of users. We can see
that the lines in Figure 20.2 correspond quite closely to
the categories shown in Table 20.7.

A large-sample approximate standard error and
confidence interval can be found for kappa. The standard
error of *κ* is given by

$$\mathrm{SE}(\kappa) = \sqrt{\frac{p(1 - p)}{n(1 - p_e)^2}}$$

where *n* is the number of subjects. The 95% confidence
interval for *κ* is *κ* −
1.96 × SE(*κ*) to *κ* +
1.96 × SE(*κ*),
as *κ* is approximately Normally Distributed, provided *np*
and *n*(1 − *p*) are large enough, say greater than five.

For the data of Table 20.5:

$$\mathrm{SE}(\kappa) = \sqrt{\frac{p(1 - p)}{n(1 - p_e)^2}} = \sqrt{\frac{0.915 \times (1 - 0.915)}{94 \times (1 - 0.572)^2}} = 0.067$$

For the 95% confidence interval we have: 0.801 − 1.96 × 0.067 to 0.801 + 1.96 × 0.067 = 0.67 to 0.93.
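The standard error and confidence interval can be computed directly from the table. This Python sketch assumes the large-sample standard error has (1 − *p*_{e}) squared in the denominator, which is the form that reproduces the value of 0.067 quoted for Table 20.5. The function name `kappa_with_ci` is an illustrative assumption.

```python
import math


def kappa_with_ci(table, z=1.96):
    """Return kappa, its large-sample SE, and the (lower, upper) limits."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p = sum(table[i][i] for i in range(k)) / n
    p_e = sum(
        sum(table[i]) * sum(row[i] for row in table) for i in range(k)
    ) / n ** 2
    kappa = (p - p_e) / (1 - p_e)
    # Assumed form: SE = sqrt(p(1 - p) / (n (1 - p_e)^2)).
    se = math.sqrt(p * (1 - p) / (n * (1 - p_e) ** 2))
    return kappa, se, (kappa - z * se, kappa + z * se)


kappa, se, (lower, upper) = kappa_with_ci([[61, 2], [6, 25]])
print(f"kappa = {kappa:.3f}, SE = {se:.3f}, 95% CI {lower:.2f} to {upper:.2f}")
# kappa = 0.801, SE = 0.067, 95% CI 0.67 to 0.93
```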

We can also carry out a significance test of the null
hypothesis of no agreement. The null hypothesis is that
in the population *κ* = 0, or *p* = *p*_{e}.
This affects the standard
error of kappa because the standard error depends on
*p*, which under the null hypothesis is replaced by *p*_{e}:

$$\mathrm{SE}(\kappa) = \sqrt{\frac{p_e(1 - p_e)}{n(1 - p_e)^2}} = \sqrt{\frac{0.572 \times (1 - 0.572)}{94 \times (1 - 0.572)^2}} = 0.119$$

If the null hypothesis were true, *κ*/SE(*κ*) would be
from a Standard Normal Distribution. For the example,
*κ*/SE(*κ*) = 6.71, P<0.0001.
This test is one tailed (Section
9.5), as zero and all negative values of *κ* mean no
agreement. Because the confidence interval and the significance
test use different standard errors, it is possible to
get a significant difference when the confidence interval
contains zero. In that case, there would be evidence of
some agreement, but kappa would be poorly estimated.
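The test can be sketched in Python as follows. The sketch assumes that the null standard error is obtained by putting *p* = *p*_{e} in the large-sample standard error, which reproduces the ratio of 6.71 quoted for Table 20.5. The function name `kappa_null_test` is illustrative, and the one-sided P value is taken from the Standard Normal upper tail using `math.erfc`.

```python
import math


def kappa_null_test(table):
    """Return (z, one-sided P) for the null hypothesis kappa = 0."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p = sum(table[i][i] for i in range(k)) / n
    p_e = sum(
        sum(table[i]) * sum(row[i] for row in table) for i in range(k)
    ) / n ** 2
    kappa = (p - p_e) / (1 - p_e)
    # Assumed null SE: p replaced by p_e, so
    # SE0 = sqrt(p_e (1 - p_e) / (n (1 - p_e)^2)).
    se_null = math.sqrt(p_e * (1 - p_e) / (n * (1 - p_e) ** 2))
    z = kappa / se_null
    # Upper-tail probability of a Standard Normal deviate exceeding z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


z, p_value = kappa_null_test([[61, 2], [6, 25]])
print(f"z = {z:.2f}, one-sided P = {p_value:.1e}")
```

The null standard error (about 0.119) is larger here than the one used for the confidence interval (0.067), which is why the two procedures need not give matching conclusions.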

Cohen (1960) dealt with only two observers. In many
observer variation studies, we have observations on a
group of subjects by many observers. For an example,
Table 20.8 shows the results of a study of observer variation
in transactional analysis (Falkowski *et al.* 1980).

Statement | Observer A | Observer B | Observer C | Observer D | Observer E | Observer F | Observer G | Observer H | Observer I | Observer J |
---|---|---|---|---|---|---|---|---|---|---|
1 | C | C | C | C | C | C | C | C | C | C |
2 | P | C | C | C | C | P | C | C | C | C |
3 | A | C | C | C | C | P | P | C | C | C |
4 | P | A | A | A | P | A | C | C | C | C |
5 | A | A | A | A | P | A | A | A | A | P |
6 | C | C | C | C | C | C | C | C | C | C |
7 | A | A | A | A | P | A | A | A | A | A |
8 | C | C | C | C | A | C | P | A | C | C |
9 | P | P | P | P | P | P | P | A | P | P |
10 | P | P | P | P | P | P | P | P | P | P |
11 | P | C | C | C | C | P | C | C | C | C |
12 | P | P | P | P | P | P | A | C | C | P |
13 | P | A | P | P | P | A | P | P | A | A |
14 | C | P | P | P | P | P | P | C | A | P |
15 | A | A | P | P | P | C | P | A | A | C |
16 | P | A | C | P | P | A | C | C | C | C |
17 | P | P | C | C | C | C | P | A | C | C |
18 | C | C | C | C | C | A | P | C | C | C |
19 | C | A | C | C | C | A | C | A | C | C |
20 | A | C | P | C | P | P | P | A | C | P |
21 | C | C | C | P | C | C | C | C | C | C |
22 | A | A | C | A | P | A | C | A | A | A |
23 | P | P | P | P | P | A | P | P | P | P |
24 | P | C | P | C | C | P | P | C | P | P |
25 | C | C | C | C | C | C | C | C | C | C |
26 | C | C | C | C | C | C | C | C | C | C |
27 | A | P | P | A | P | A | C | C | A | A |
28 | C | C | C | C | C | C | C | C | C | C |
29 | A | A | C | C | A | A | A | A | A | A |
30 | A | A | C | A | P | P | A | P | A | A |
31 | C | C | C | C | C | C | C | C | C | C |
32 | P | C | P | P | P | P | C | P | P | P |
33 | P | P | P | P | P | P | P | P | P | P |
34 | P | P | P | P | A | C | C | A | C | C |
35 | P | P | P | P | P | A | P | P | A | P |
36 | P | P | P | P | P | P | P | C | C | P |
37 | A | C | P | P | P | P | P | P | C | A |
38 | C | C | C | C | C | C | C | C | C | P |
39 | A | C | C | C | C | C | C | C | C | C |
40 | A | P | C | A | A | A | A | A | A | A |

Observers watched video recordings of discussions between people with anorexia and their families. Observers classified 40 statements as being made in the role of ‘adult’, ‘parent’ or ‘child’, as a way of understanding the psychological relationships between the family members. For some statements, such as statement 1, there was perfect agreement, all observers giving the same classification. Other statements, e.g. statement 15, produced no agreement between the observers. These data were collected as a validation exercise, to see whether there was any agreement at all between observers.

Fleiss (1971) extended Cohen’s kappa to the study of
agreement between many observers. To estimate kappa
by Fleiss’s method, we ignore any relationship between
observers for different subjects. This method does not
take any weighting of disagreements into account, and
so is suitable for the data of Table 20.8. We shall omit the
details. For Table 20.8, *κ* = 0.43.
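For readers who want the omitted details, the following Python sketch computes a many-observer kappa for Table 20.8 in the way Fleiss's method is usually described: the agreement for each statement is the proportion of pairs of observers who agree, and the chance agreement is the sum of the squared overall category proportions. The function name `fleiss_kappa` and the encoding of each statement as a 10-letter string (Observers A to J in order) are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Kappa for many observers; each item of ratings is one subject's labels."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of ordered observer pairs giving the same category.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_subjects
    total = n_subjects * n_raters
    # Chance agreement from the overall proportion in each category.
    p_e = sum((t / total) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Table 20.8, statements 1 to 40; letters are Observers A to J in order.
table_20_8 = """
CCCCCCCCCC PCCCCPCCCC ACCCCPPCCC PAAAPACCCC AAAAPAAAAP
CCCCCCCCCC AAAAPAAAAA CCCCACPACC PPPPPPPAPP PPPPPPPPPP
PCCCCPCCCC PPPPPPACCP PAPPPAPPAA CPPPPPPCAP AAPPPCPAAC
PACPPACCCC PPCCCCPACC CCCCCAPCCC CACCCACACC ACPCPPPACP
CCCPCCCCCC AACAPACAAA PPPPPAPPPP PCPCCPPCPP CCCCCCCCCC
CCCCCCCCCC APPAPACCAA CCCCCCCCCC AACCAAAAAA AACAPPAPAA
CCCCCCCCCC PCPPPPCPPP PPPPPPPPPP PPPPACCACC PPPPPAPPAP
PPPPPPPCCP ACPPPPPPCA CCCCCCCCCP ACCCCCCCCC APCAAAAAAA
""".split()

print(f"kappa = {fleiss_kappa(table_20_8):.2f}")  # kappa = 0.43
```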

Fleiss only gives the standard error of kappa for testing
the null hypothesis of no agreement. For Table 20.8
it is SE(*κ*) = 0.02198. If the null hypothesis were true, the
ratio *κ*/SE(*κ*) would be from a
Standard Normal Distribution;
*κ*/SE(*κ*) = 0.43156/0.02198 = 19.6, P<0.001. The
agreement is highly significant and we can conclude that
transactional analysts’ assessments are not random. We
can extend Fleiss’s method to the case when the number
of observers is not the same for each subject but varies,
and for weighted kappa (Section 20.4).
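The standard error under the null hypothesis can be sketched in Python as well. The variance expression used here is an assumption: it is the formula usually attributed to Fleiss (1971), which is not reproduced in this text, although it does give the quoted value of 0.02198. The category totals (86 ‘adult’, 136 ‘parent’, 178 ‘child’) were tallied from Table 20.8, and the function name `fleiss_null_se` is illustrative.

```python
import math


def fleiss_null_se(category_totals, n_subjects, n_raters):
    """SE of many-observer kappa under the null hypothesis of no agreement.

    Assumed formula (attributed to Fleiss 1971), with p_j the overall
    proportion of classifications in category j:
    Var = 2 / (N n (n - 1))
          * (sum p_j^2 - (2n - 3)(sum p_j^2)^2 + 2(n - 2) sum p_j^3)
          / (1 - sum p_j^2)^2
    """
    total = n_subjects * n_raters
    props = [t / total for t in category_totals]
    s2 = sum(p ** 2 for p in props)
    s3 = sum(p ** 3 for p in props)
    numerator = s2 - (2 * n_raters - 3) * s2 ** 2 + 2 * (n_raters - 2) * s3
    variance = (
        2 / (n_subjects * n_raters * (n_raters - 1)) * numerator / (1 - s2) ** 2
    )
    return math.sqrt(variance)


# Totals for 'adult', 'parent', 'child' over 40 statements and 10 observers.
se = fleiss_null_se([86, 136, 178], n_subjects=40, n_raters=10)
z = 0.43156 / se  # kappa for Table 20.8 as quoted in the text
print(f"SE = {se:.5f}, z = {z:.1f}")  # SE = 0.02198, z = 19.6
```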

Kappa statistics have several applications. They are used to assess agreement between classifications made on the same participants on different occasions, between classifications made by different observers, between classifications made by different methods, and by different reviewers identifying relevant studies or extracting data from studies in systematic reviews.

Altman, D.G. (1991). *Practical Statistics for Medical
Research.* Chapman and Hall, London.

Bland, J.M., Bewley, B.R., Banks, M.H., and Pollard, V.M.
(1975). Schoolchildren’s beliefs about smoking and
disease. *Health Education Journal*, **34**, 71–8.

Cohen, J. (1960). A coefficient of agreement for nominal
scales. *Educational and Psychological Measurement*, **20**,
37–47.

Falkowski, W., Ben-Tovim, D.I., and Bland, J.M. (1980).
The assessment of the ego states. *British Journal of
Psychiatry*, **137**, 572–3.

Fleiss, J.L. (1971). Measuring nominal scale agreement
among many raters. *Psychological Bulletin*, **76**, 378–82.

Landis, J.R. and Koch, G.G. (1977). The measurement of
observer agreement for categorical data. *Biometrics*, **33**,
159–74.

Adapted from pages 317–322 of
*An Introduction to Medical Statistics* by Martin Bland, 2015,
reproduced by permission of
Oxford University Press.

This page maintained by Martin Bland

Last updated: 7 August, 2015