The problem of missing data

This is a section from Martin Bland’s textbook An Introduction to Medical Statistics, Fourth Edition. I hope that the topic will be useful in its own right, as well as giving a flavour of the book. Section references are to the book.

19.1 The problem of missing data

Missing data are an almost inevitable part of research on people, whether it is medical, educational, or social. People who participate in these studies are usually free agents and can decide to withhold information at any time or may omit information by accident, or information might be misrecorded or lost.

For an example, we shall look at what happened in CADET (Richards et al. 2009, 2013), a randomized controlled trial of treatments for people presenting with depression in primary care. In this cluster randomized trial, 582 patients were allocated to collaborative care provided by a mental health worker or to treatment as usual by the primary care doctor. The trial was done at three geographical sites (Manchester, Bristol, and London). Fifty-one primary care practices were allocated using minimization within each site (Section 2.14), balanced for the index of multiple deprivation (IMD), number of primary care doctors, and the list size (the number of patients registered). Two practices recruited no patients, leaving us with 49. (The data presented here come from a preliminary analysis of CADET; we later filled in some of the missing data.)

The data which were collected included:

primary care practice size (number of patients and number of doctors) and the index of multiple deprivation for catchment area,
participant age, sex, employment, marital status, etc.,
participant depression (using the nine-item PHQ9 scale), anxiety (using the seven-item GAD7 scale), and quality of life (using the widely used SF36 scale) at recruitment to the trial, i.e. at baseline, all from multi-item questionnaire scales,
participant depression (PHQ9), anxiety (GAD7), quality of life (SF36), and a multi-item client satisfaction with care questionnaire scale (using the CSQ scale) after 4 months.

Data were missing in several ways:

for four participants the sex was not recorded, which must be a data entry error by researchers,
for some of the completed scales, one or more items were omitted while other items had been completed,
for some participants, all items were omitted in a scale, so the scale had not been completed at all.
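
The last two kinds of missingness listed above can be told apart mechanically for a multi-item scale. The following is a minimal sketch in Python, not the procedure used in CADET: the function name `classify_scale` and the example responses are hypothetical, and `None` is assumed to mark an item that was not answered.

```python
from typing import Optional, Sequence


def classify_scale(items: Sequence[Optional[int]]) -> str:
    """Classify one participant's responses to a multi-item scale.

    None marks an omitted item (an assumption of this sketch).
    """
    n_missing = sum(1 for item in items if item is None)
    if n_missing == 0:
        return "complete"
    if n_missing == len(items):
        return "scale not completed"
    return "some items omitted"


# Hypothetical PHQ9 responses (nine items), for illustration only.
responses = {
    "A": [1, 2, 0, 3, 1, 1, 0, 2, 1],
    "B": [1, None, 0, 3, 1, 1, None, 2, 1],
    "C": [None] * 9,
}
labels = {pid: classify_scale(items) for pid, items in responses.items()}
for pid, label in labels.items():
    print(pid, label)
```

Keeping the two incomplete patterns separate matters because a scale with only some items omitted still carries information about the participant, whereas a scale that was not completed at all carries none.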

The planned primary analysis for CADET was to use the PHQ9 depression score at 4 months as the primary outcome variable and to adjust the estimated effect of collaborative care for three cluster level variables (geographical site, IMD, and list size) and two individual level variables (age and PHQ9 at baseline). Because there were three sites, two dummy variables (Section 15.8) were needed for these, and with treatment, IMD, and list size that made five predictor variables at the practice level. With 49 practices, we thought that five was all that we should include for a valid multiple regression, giving us almost 10 observations per variable (Section 15.1). Number of doctors was highly correlated with list size, so we did not think that omitting it would be a problem. As this was a cluster randomized trial, we needed to take the clustering into account in the analysis. For CADET, we used the robust standard errors method (Section 15.15).
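
The count of practice level predictors can be checked with a short sketch. The Python below is illustrative only: the function `site_dummies`, the variable names, and the choice of Manchester as the reference site are assumptions, not taken from the CADET analysis.

```python
SITES = ["Manchester", "Bristol", "London"]
REFERENCE = "Manchester"  # assumed reference category, chosen arbitrarily


def site_dummies(site: str) -> dict:
    """Code a three-level site variable as two 0/1 dummy variables."""
    if site not in SITES:
        raise ValueError(f"unknown site: {site}")
    return {f"site_{s}": int(site == s) for s in SITES if s != REFERENCE}


# Practice level predictors: two site dummies, treatment, IMD, list size.
practice_predictors = list(site_dummies(REFERENCE)) + ["treatment", "imd", "list_size"]
n_practices = 49
per_variable = n_practices / len(practice_predictors)

print(site_dummies("Bristol"))
print(len(practice_predictors), round(per_variable, 1))
```

With five predictors and 49 practices, the ratio is 9.8, which is the "almost 10 observations per variable" referred to above.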

One possible approach to the analysis is to leave out all the incomplete cases and use only those for which all variables are available. If we do this available data, or complete case, analysis, we have 499 complete cases for the required variables. The adjusted estimate, collaborative care minus treatment as usual, is −1.42 PHQ9 scale points (SE = 0.50, 95% confidence interval = −2.44 to −0.41, P = 0.007). This would suggest that collaborative care results in lower mean depression scores than does treatment as usual. The variable with most missing values in this analysis is PHQ9 at 4 months, where 13.6% were missing altogether: 17.3% of the collaborative care group and 10.2% of the treatment as usual group. This difference in missingness is significant, P = 0.01 by a chi-squared test. It would be easy to criticise the trial on the grounds that the difference in proportion of missing depression scores may explain the treatment difference. We need to find a way of taking the missing data into account.
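
The comparison of missingness between the two groups can be reproduced approximately. The numbers in each trial arm are not given in this section, so the counts below (48 of 277 and 31 of 305 missing) are assumptions, chosen only to be consistent with the quoted 582 patients and the percentages 17.3%, 10.2%, and 13.6%; they are not taken from the trial report. The sketch uses the chi-squared test without continuity correction and the fact that, for one degree of freedom, the P value is erfc(√(χ²/2)).

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Chi-squared test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, P value) on 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-squared > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Assumed counts, reconstructed from the quoted percentages.
missing_cc, n_cc = 48, 277    # collaborative care
missing_tau, n_tau = 31, 305  # treatment as usual

stat, p_value = chi_squared_2x2(missing_cc, n_cc - missing_cc,
                                missing_tau, n_tau - missing_tau)
print(f"chi-squared = {stat:.2f}, P = {p_value:.2f}")
```

Under these assumed counts the P value is close to 0.01, in line with the figure quoted above.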

Available data analysis is also inefficient. Some of the cases were omitted because the baseline PHQ9 was missing (all because some, but not all, items in the nine-item scale were omitted) but had PHQ9 after 4 months

References

Richards, D.A., Hill, J.J., Gask, L., Lovell, K., Chew-Graham, C., Bower, P., Cape, J., Pilling, S., Araya, R., Kessler, D., Bland, J.M., Green, C., Gilbody, S., Lewis, G., Manning, C., Hughes-Morley, A., and Barkham, M. (2013). Clinical effectiveness of collaborative care for depression in UK primary care (CADET): cluster randomised controlled trial. British Medical Journal, 347, f4913.

Richards, D.A., Hughes-Morley, A., Hayes, R.A., Araya, R., Barkham, M., Bland, J.M., Bower, P., Cape, J., Chew-Graham, C.A., Gask, L., Gilbody, S., Green, C., Kessler, D., Lewis, G., Lovell, K., Manning, C., and Pilling, S. (2009). Collaborative Depression Trial (CADET): multicentre randomised controlled trial of collaborative care for depression – study protocol. BMC Health Services Research, 9, 188.

Adapted from pages 305–306 of An Introduction to Medical Statistics by Martin Bland, 2015, reproduced by permission of Oxford University Press.


This page maintained by Martin Bland
Last updated: 7 August, 2015