Types of missing data

This is a section from Martin Bland’s textbook An Introduction to Medical Statistics, Fourth Edition. I hope that the topic will be useful in its own right, as well as giving a flavour of the book. Section references are to the book.

19.2 Types of missing data

It is usual to define three kinds of missing data:

missing completely at random (MCAR);
missing at random (MAR);
missing not at random (MNAR).

These terms are widely used, but are a bit misleading. When we say data are missing completely at random, we mean that the missingness has nothing to do with the person being studied. For example, a questionnaire might be lost in the post, or a blood sample might be damaged in the lab. In CADET, sex might be MCAR. Of course, this is not truly random; it means that whether something is missing is not related to the subject of the missing data.

When we say data are missing at random, we mean that the missingness is to do with the person but can be predicted from other information about the person. It is not specifically related to the missing information. For example, if a child does not attend an educational assessment because the child is (genuinely) ill, this might be predictable from other data we have about the child’s health, but it would not be related to what we would have measured had the child not been ill. Are the depression data MAR? We cannot tell this from the data. We know that the PHQ9 scores are not MCAR, because the proportions missing in the two treatment groups are different. We know that at least one observation is not MAR, because, tragically, the participant had committed suicide. This is always a danger in depression research.
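The comparison of proportions missing in two treatment groups can be sketched as a two-sample test of proportions using the Normal approximation. The counts below are hypothetical and are not the CADET figures, and the function name is our own. A clear difference argues against MCAR, but the absence of a difference would not show that data are MCAR.

```python
import math

def two_proportion_z(missing_a, n_a, missing_b, n_b):
    """Compare proportions missing in two groups (large sample Normal test).

    Returns (difference in proportions, z statistic, two-sided P value).
    """
    p_a = missing_a / n_a
    p_b = missing_b / n_b
    # Pooled proportion under the null hypothesis of equal missingness.
    pooled = (missing_a + missing_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the Standard Normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Hypothetical counts for illustration only: 30 of 200 scores missing
# in one group and 60 of 200 missing in the other.
diff, z, p_value = two_proportion_z(30, 200, 60, 200)
print(f"difference = {diff:.3f}, z = {z:.2f}, P = {p_value:.4f}")
```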

When data are missing not at random, the missingness is specifically related to what is missing, e.g. a person does not attend a drug test because the person took drugs the night before. The suicide victim has the PHQ9 at 4 months MNAR. The problem is to decide which of these situations we have, and in the same dataset we may have some data missing for each reason. We had some missing data in the foot ulcer data in Table 10.2. Some of the capillary densities were missing because the skin biopsy was not usable to count the capillaries. We could regard these as MCAR. Some were missing because the foot had been amputated. As a frequent reason for foot amputation is gangrene from severe foot ulcers, I think we would have to classify these as MNAR.
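A small simulation may make the three mechanisms concrete. This is a sketch under assumed conditions, not an analysis of any dataset in the book: the baseline and outcome scores, the probabilities of being observed, and the cut-off of 10 are all invented for illustration. It compares the mean of the observed outcomes with the mean of all outcomes, first overall and then within a stratum defined by the always-observed baseline.

```python
import random
import statistics

rng = random.Random(2015)  # fixed seed so that the run is repeatable
n = 20000

# Hypothetical data: a baseline score that is always observed and an
# outcome score that is related to it.
baseline = [rng.gauss(10, 3) for _ in range(n)]
outcome = [b + rng.gauss(0, 3) for b in baseline]

def keep_flags(driver):
    """Observe the outcome with probability 0.9 if driver < 10, else 0.5."""
    return [rng.random() < (0.9 if d < 10 else 0.5) for d in driver]

observed = {
    "MCAR": [rng.random() < 0.7 for _ in range(n)],  # unrelated to the person
    "MAR": keep_flags(baseline),  # depends on information we hold
    "MNAR": keep_flags(outcome),  # depends on the missing value itself
}

def mean_where(values, flags):
    """Mean of the values whose flag is True."""
    return statistics.fmean([v for v, f in zip(values, flags) if f])

low = [b < 10 for b in baseline]  # stratum defined by observed information
full_mean = statistics.fmean(outcome)
full_low = mean_where(outcome, low)

results = {}
for name, keep in observed.items():
    overall_bias = mean_where(outcome, keep) - full_mean
    in_low = [k and s for k, s in zip(keep, low)]
    stratum_bias = mean_where(outcome, in_low) - full_low
    results[name] = (overall_bias, stratum_bias)
    print(f"{name}: overall bias {overall_bias:+.2f}, "
          f"bias within low-baseline stratum {stratum_bias:+.2f}")
```

Under these assumptions, the MCAR bias should be close to zero both overall and within the stratum. The MAR data should be biased overall but not within the stratum, because the baseline accounts for the missingness. The MNAR data should be biased in both, because nothing we have observed explains why the values are missing.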

There are several strategies which can be applied:

try to obtain the missing data;
leave out incomplete cases and use only those for which all variables are available;
replace missing data by a conservative estimate, e.g. the sample mean;
try to estimate the missing data from the other data on the person.
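The last three strategies can be sketched on a toy dataset. The records below are hypothetical, and simple least squares regression on one other variable is only one possible way to estimate missing data from the other data on the person; it is an assumption made here for illustration.

```python
import statistics

# Hypothetical records (x, y): x is always observed, y may be missing (None).
records = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0), (5, None)]

# Strategy: leave out incomplete cases.
complete = [(x, y) for x, y in records if y is not None]

# Strategy: replace each missing y by the mean of the observed y values.
y_mean = statistics.fmean([y for _, y in complete])
mean_filled = [(x, y_mean if y is None else y) for x, y in records]

# Strategy: estimate each missing y from x, using a least squares line
# fitted to the complete cases.
x_mean = statistics.fmean([x for x, _ in complete])
slope = (sum((x - x_mean) * (y - y_mean) for x, y in complete)
         / sum((x - x_mean) ** 2 for x, _ in complete))
intercept = y_mean - slope * x_mean
regression_filled = [
    (x, intercept + slope * x if y is None else y) for x, y in records
]

print("complete cases:   ", complete)
print("mean substitution:", mean_filled)
print("regression:       ", regression_filled)
```

Mean substitution gives every incomplete case the same value, whatever else we know about the person, whereas the regression estimate uses that other information. Either way, the filled-in values are estimates and not observations, so an analysis which treats them as observed is likely to understate the uncertainty.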

Trying to obtain missing data is obviously a good idea if we can do it. For the missing recording of sex, we were able to fill in two observations from the participants’ forenames. Often, this is not possible. One participant committed suicide. The PHQ9 for this person cannot be regarded as missing at random. The 4-month PHQ9 was set to the maximum 27. We checked whether this had a large effect on the estimates by running the analysis with and without this participant.
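The idea of checking an analysis with and without one participant can be sketched as follows. This is a deliberately simplified illustration: the scores are hypothetical, not the CADET data, and it compares a single group mean, not the treatment estimate from the trial analysis.

```python
import statistics

PHQ9_MAX = 27  # maximum possible PHQ9 score, as given in the text

# Hypothetical 4-month scores for one group; None marks the participant
# whose score could not be obtained.
scores = [5, 12, 9, None, 7, 15, 3, 10]

# Analysis 1: set the missing score to the worst possible value.
with_worst_case = [PHQ9_MAX if s is None else s for s in scores]
# Analysis 2: omit the participant altogether.
without_participant = [s for s in scores if s is not None]

mean_with = statistics.fmean(with_worst_case)
mean_without = statistics.fmean(without_participant)
shift = mean_with - mean_without

print(f"mean with worst-case value:  {mean_with:.2f}")
print(f"mean without participant:    {mean_without:.2f}")
print(f"difference between analyses: {shift:.2f}")
```

With only eight scores, one extreme value moves the mean a good deal. In a larger sample the two analyses would usually be much closer, which is what such a check is meant to show.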

Leaving out incomplete cases and using only the available data may cause bias, as we have seen, and is inefficient. If all our missing data are MCAR there should be no bias, at least, but this is unusual in health applications.

Adapted from pages 306–307 of An Introduction to Medical Statistics by Martin Bland, 2015, reproduced by permission of Oxford University Press.

This page maintained by Martin Bland
Last updated: 7 August, 2015