This website is for students following the M.Sc. in Evidence Based Practice at the University of York.
Each of 183 students was observed twice, by two different student observers. The observers measured height (mm), arm circumference (mm), head circumference, and pulse (beats/min), and recorded sex and eye colour (black, brown, blue, grey, hazel, green, other). They entered these data into a computer file, with eye colour and sex entered as numerical codes.
The following table shows the eye colour recorded by the two observers, with rows for the first observer and columns for the second:

| First observer | black | brown | blue | grey | hazel | green | other | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| black | 6 | 4 | 0 | 0 | 0 | 0 | 0 | 10 |
| brown | 6 | 69 | 0 | 0 | 4 | 0 | 1 | 80 |
| blue | 0 | 0 | 39 | 1 | 0 | 2 | 2 | 44 |
| grey | 0 | 1 | 1 | 4 | 0 | 4 | 0 | 10 |
| hazel | 0 | 1 | 0 | 0 | 9 | 4 | 0 | 14 |
| green | 0 | 0 | 1 | 1 | 1 | 15 | 2 | 20 |
| other | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 5 |
| Total | 12 | 75 | 41 | 6 | 14 | 27 | 8 | 183 |
The Stata output for the kappa statistic for this table is:
```
. kap eye1 eye2

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
-----------------------------------------------------------------
  79.23%      26.16%     0.7188     0.0385      18.69      0.0000
```
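For readers who want to check the arithmetic behind this output, the following is a minimal sketch in Python (ours, not part of the course materials; the variable names are our own). Kappa is (observed agreement - expected agreement) / (1 - expected agreement), where the expected agreement is what the marginal totals of the table would produce by chance.

```python
# Unofficial check of the unweighted kappa, using the eye colour table above.
# Rows: first observer; columns: second observer.
table = [
    [6,  4,  0, 0, 0,  0, 0],   # black
    [6, 69,  0, 0, 4,  0, 1],   # brown
    [0,  0, 39, 1, 0,  2, 2],   # blue
    [0,  1,  1, 4, 0,  4, 0],   # grey
    [0,  1,  0, 0, 9,  4, 0],   # hazel
    [0,  0,  1, 1, 1, 15, 2],   # green
    [0,  0,  0, 0, 0,  2, 3],   # other
]

n = sum(map(sum, table))                         # 183 students
row_totals = [sum(row) for row in table]         # first observer's totals
col_totals = [sum(col) for col in zip(*table)]   # second observer's totals

# Observed agreement: proportion on the diagonal (145/183).
observed = sum(table[i][i] for i in range(7)) / n
# Expected agreement: chance agreement from the marginal totals.
expected = sum(row_totals[i] * col_totals[i] for i in range(7)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f"Agreement = {observed:.2%}, Expected = {expected:.2%}, Kappa = {kappa:.4f}")
# Agreement = 79.23%, Expected = 26.16%, Kappa = 0.7188
```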
Question 1:
How would you describe the level of agreement in this table?
Question 2:
The expected agreement is much lower than for sex, where it was 54.52%. Why is this?
Question 3:
How could we improve the kappa statistic?
Question 4:
What pairs of categories might be regarded as minor disagreements?
Question 5:
What might be plausible weights for the pairs of eye colour categories?
We can use the following disagreement weights:
| | black | brown | blue | grey | hazel | green | other |
| --- | --- | --- | --- | --- | --- | --- | --- |
| black | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| brown | 1 | 0 | 2 | 2 | 1 | 2 | 2 |
| blue | 2 | 2 | 0 | 1 | 2 | 2 | 2 |
| grey | 2 | 2 | 1 | 0 | 2 | 1 | 2 |
| hazel | 2 | 1 | 2 | 2 | 0 | 1 | 2 |
| green | 2 | 2 | 2 | 1 | 1 | 0 | 2 |
| other | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Some programs, such as Stata, require agreement weights rather than disagreement weights, so we could use those instead. (SPSS 16 does not do weighted kappa.)
Question 6:
What weights for agreement would correspond to these disagreement weights?
This is the Stata output:
```
. kapwgt eyes 1 \ 0.5 1 \ 0 0 1 \ 0 0 0.5 1 \ 0 0.5 0 0 1 \ 0 0 0 0.5 0.5 1 \ 0 0 0 0 0 0 1

. kap eye1 eye2, wgt(eyes)

Ratings weighted by:
  1.0000  0.5000  0.0000  0.0000  0.0000  0.0000  0.0000
  0.5000  1.0000  0.0000  0.0000  0.5000  0.0000  0.0000
  0.0000  0.0000  1.0000  0.5000  0.0000  0.0000  0.0000
  0.0000  0.0000  0.5000  1.0000  0.0000  0.5000  0.0000
  0.0000  0.5000  0.0000  0.0000  1.0000  0.5000  0.0000
  0.0000  0.0000  0.0000  0.5000  0.5000  1.0000  0.0000
  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
-----------------------------------------------------------------
  86.61%      34.52%     0.7955     0.0432      18.40      0.0000
```
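Again as an unofficial check, the following Python sketch (ours, not part of the course materials) converts the disagreement weights above into the agreement weights Stata uses, taking each agreement weight as 1 minus half the disagreement weight, and reproduces the weighted kappa:

```python
# Unofficial check of the weighted kappa. Agreement weight = 1 - disagreement/2,
# so disagreement 0 -> 1, 1 -> 0.5, and 2 -> 0, matching the kapwgt matrix above.
disagree = [
    [0, 1, 2, 2, 2, 2, 2],   # black
    [1, 0, 2, 2, 1, 2, 2],   # brown
    [2, 2, 0, 1, 2, 2, 2],   # blue
    [2, 2, 1, 0, 2, 1, 2],   # grey
    [2, 1, 2, 2, 0, 1, 2],   # hazel
    [2, 2, 2, 1, 1, 0, 2],   # green
    [2, 2, 2, 2, 2, 2, 0],   # other
]
weight = [[1 - d / 2 for d in row] for row in disagree]

table = [                    # same two-observer table as before
    [6,  4,  0, 0, 0,  0, 0],
    [6, 69,  0, 0, 4,  0, 1],
    [0,  0, 39, 1, 0,  2, 2],
    [0,  1,  1, 4, 0,  4, 0],
    [0,  1,  0, 0, 9,  4, 0],
    [0,  0,  1, 1, 1, 15, 2],
    [0,  0,  0, 0, 0,  2, 3],
]
n = sum(map(sum, table))
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Every cell now counts in proportion to its agreement weight, so the
# "minor" disagreements get partial credit instead of none.
observed = sum(weight[i][j] * table[i][j]
               for i in range(7) for j in range(7)) / n
expected = sum(weight[i][j] * row_totals[i] * col_totals[j]
               for i in range(7) for j in range(7)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f"Agreement = {observed:.2%}, Expected = {expected:.2%}, Kappa = {kappa:.4f}")
# Agreement = 86.61%, Expected = 34.52%, Kappa = 0.7955
```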
Question 7:
How does the weighting change the results?
This page maintained by Martin Bland.
Last updated: 21 July, 2008.