Statistics Guide for Research Grant Applicants
C. Observational Studies
Consider a case-control study (see A-1.4) that
is concerned with identifying the risks for cardiovascular disease. Smoking
history is an obvious variable worthy of investigation. In case-control
studies, one must first decide upon the population from which subjects will be
selected (e.g. a hospital ward, clinic, general population, etc.). Where cases
have been obtained from a hospital clinic, controls are often selected from
another hospital population for ease of access. However, the latter scenario
can introduce selection bias. For example, if cases are obtained from a
cardiovascular ward and smoking history is one of the variables under
investigation, it would be unsuitable to obtain controls from a ward containing
patients with smoking-related disease, e.g. lung cancer. The choice of a
suitable control group is fraught with problems, both practical and statistical.
These problems are discussed in detail in Breslow & Day (1980). In general the
ideal is to select as controls a random sample from the general population that
gave rise to the cases. However, this assumes that a list of subjects in the
population (i.e. a sampling frame) exists, which in many cases it does not.
Sometimes in a case-control study (see A-1.4)
cases and controls are matched. You can have 1-1 matching of one control for
each case or 1-m matching of m controls per case. The latter is
often used to increase statistical power for a given number of cases (see
Breslow & Day 1980). For each case, one or more controls are found that have
the same, or very similar, values in a set of matching variables. Matching
variables typically include age and sex. Usually two or three matching
variables are selected; any more would make the selection of controls
difficult. It is hoped that, by matching, any differences between cases and
controls are not a result of differences between groups in the matching
variables. How far this aim is achieved depends in part on the closeness of
the matching. Here a balance has to be struck between matching as closely as
possible and what can be achieved in practice. Pilot work (see A-1.9) may be
useful in making this judgement. When describing a matched case-control study
in a grant application, it is not sufficient to say, for example, that a
control will be matched to each case for age. The reviewer will want to know
how closely: to within one year, to within five years? Also of interest to the
reviewer is how the controls will be selected (e.g. at random from a list of
all possible matches) and what happens when a control refuses. Will a second
control be selected as a replacement, and how many replacements will be
permitted?
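The selection rules a reviewer will look for can be written down quite precisely. The following is only a minimal sketch in Python, not a procedure prescribed by this guide. It assumes matching on sex and on age within a stated tolerance, random selection from all eligible candidates, and a fixed limit on replacements after a refusal. The field names ("id", "age", "sex"), the function name and the consents callback are all hypothetical.

```python
import random

def select_controls(case, pool, m=1, age_tolerance=5,
                    max_replacements=2, consents=lambda c: True, rng=None):
    """Choose up to m controls for one case from a pool of candidates.

    A candidate is eligible if it has the same sex as the case and an age
    within +/- age_tolerance years. Eligible candidates are approached in
    random order. For each of the m control slots, at most
    1 + max_replacements candidates are approached; if all refuse, the slot
    is left unfilled. Everyone approached is removed from the pool.
    """
    rng = rng or random.Random()
    eligible = [c for c in pool
                if c["sex"] == case["sex"]
                and abs(c["age"] - case["age"]) <= age_tolerance]
    rng.shuffle(eligible)  # random order of approach

    chosen = []
    for _ in range(m):
        refusals = 0
        while eligible and refusals <= max_replacements:
            candidate = eligible.pop()
            pool.remove(candidate)  # approached, so no longer available
            if consents(candidate):
                chosen.append(candidate)
                break
            refusals += 1
    return chosen
```

Stating the tolerance, the number of controls per case and the replacement limit as explicit parameters mirrors the details that should appear in the application itself.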
The main purpose of matching is to control for confounding (see A-1.6).
However, it should be appreciated that confounding factors can be controlled
for in other ways (see E-5), and these other ways become increasingly
appealing when we consider some of the problems associated with matching:
1) It is not possible to examine the effects of the matching variables upon the
status of the disease/disorder (either present or absent). Thus although the
disease or condition of interest will be related to the matching variables, the
matching variables should not be of interest in themselves.
2) If we match we should take the matching into account in the statistical
analysis. This makes the analysis quite complicated (see E-6.1, Breslow & Day
1980).
3) In a 1-1 matched case-control study matched pairs are analysed together, and
so missing information on a control means that its case is also treated as
missing in the statistical analysis. Similarly, missing information on a case
leads to the loss of information on its matched control(s).
4) Bias can arise if we match on a variable that turns out to form part of the
causal pathway between the risk factor under study and disease. This bias is
said to be due to overmatching.
See Bland & Altman (1994c) and Breslow & Day (1980) for further discussion of
matching.
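Points 2 and 3 can be illustrated with one standard analysis for 1-1 matched pairs and a binary exposure: the odds ratio estimated from the discordant pairs only (case exposed and control unexposed, divided by control exposed and case unexposed). This is a sketch under those assumptions, not necessarily the analysis recommended in E-6.1. The function name and the data layout are hypothetical.

```python
def matched_pair_odds_ratio(pairs):
    """Estimate the odds ratio from 1-1 matched pairs.

    pairs is a list of (case_exposed, control_exposed) tuples, where each
    value is True, False, or None when the exposure is missing.
    A pair with any missing value is dropped as a whole.
    Returns (odds_ratio, pairs_analysed, pairs_dropped); odds_ratio is None
    when no pair has an exposed control and an unexposed case.
    """
    complete = [(a, b) for a, b in pairs if a is not None and b is not None]
    case_only = sum(1 for a, b in complete if a and not b)
    control_only = sum(1 for a, b in complete if b and not a)
    odds_ratio = case_only / control_only if control_only else None
    return odds_ratio, len(complete), len(pairs) - len(complete)
```

Concordant pairs (both exposed or both unexposed) contribute nothing to the estimate, and a pair with a missing value on either member is lost entirely, which is the loss of information described in point 3.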
Consider, for example, a case-control study investigating an association
between diet and bowel cancer. Let us assume that diet is to be assessed by an
interviewer-administered food frequency questionnaire. If the interviewer is
aware of the medical condition of the patients then this may lead to assessment
bias, namely a difference between the information recorded by the interviewer
(assessor) and the actual "truth". The interviewer may record poorer diets than
actually consumed for those patients with cancer. Assessment bias can be
overcome if the assessor is 'blind' to the medical condition, thus avoiding any
manipulation of results, whether conscious or subconscious (although 'blinding'
is difficult to achieve in a case-control study of cancer where interviews are
face to face). Assessment bias can even arise in a case-control study when data
are being extracted from medical records, as the process of extraction may be
influenced by knowledge of the outcome (i.e. case or control). In this case
'blind' extraction is advocated.
Recall bias is a particular problem in both case-control studies (see A-1.4)
and cross-sectional studies (see A-1.5) when information is collected
retrospectively, as the patient's outcome, e.g. disease status, is known and
patients are being asked to recall past events. Patient data collected
retrospectively may be of poor quality as they depend on the patient's ability
to recollect the past. In addition, patients' ability to recall may be
influenced by their known outcome, and it is this difference in recall between
outcome groups that may bias observed associations. If recall bias is likely to
be a problem then the grant applicants should at least consider alternative
methodologies. Could the data be collected from another source, e.g. 'blind'
extraction from historical records? Would it be possible to undertake a
prospective study where, for example, exposure information is collected before,
and without knowledge of, future disease?
We may be interested in demonstrating an association between unemployment and
current poor health. We might decide to undertake a cross-sectional study and
obtain a sample from a London borough. The aim of the research would be to
extrapolate our findings from this sample to the population of the borough and
then possibly nationally. Therefore, our sample should at least be
representative of the population of the borough from which it was drawn. In
practice, we could only obtain a truly representative sample through random
sampling of the whole borough. Even then, the sample would only be
representative of a particular time period. It may be difficult to
extrapolate the results to the same borough during another time period, and
more difficult still to extrapolate them nationally.
Sometimes by chance a random sample is not as representative as we would like.
For example, in our cross-sectional survey to investigate associations between
unemployment and current health it may be particularly important to ensure that
we have an adequate representation of all postcode areas in the borough, thereby
reflecting the socioeconomic deprivation that exists. One way of doing this is
to undertake stratified random sampling. Stratified random sampling is a means
of using our knowledge of the population to ensure the representative nature of
the sample and increase the precision of population estimates. Postcode area
would be known as the stratification factor. Usually we undertake proportional
stratified sampling, in which the total sample size is allocated between the
strata in proportion to each stratum's share of the total population. For
example, if 10% of the borough's population live in one postcode area then we
randomly select 10% of the sample from this stratum.
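Proportional allocation can be sketched in a few lines of Python. This is an illustration only. It assumes a complete sampling frame in which each member's stratum is known, and it uses largest-remainder rounding (a choice made here, not one stated in this guide) so that the stratum sample sizes add up exactly to the total. The function names and the stratum_of callback are hypothetical.

```python
import random

def proportional_allocation(stratum_sizes, total_sample):
    """Share total_sample between strata in proportion to stratum size.

    stratum_sizes maps each stratum label to its population count.
    Largest-remainder rounding makes the allocations sum to total_sample.
    """
    population = sum(stratum_sizes.values())
    exact = {s: total_sample * n / population for s, n in stratum_sizes.items()}
    alloc = {s: int(x) for s, x in exact.items()}  # round down first
    shortfall = total_sample - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:shortfall]:
        alloc[s] += 1
    return alloc

def stratified_sample(frame, stratum_of, total_sample, rng=None):
    """Draw a proportional stratified random sample from a sampling frame.

    frame is a list of population members and stratum_of(member) returns
    the member's stratum (e.g. postcode area). Returns a dict mapping each
    stratum to the members randomly selected from it.
    """
    rng = rng or random.Random()
    strata = {}
    for member in frame:
        strata.setdefault(stratum_of(member), []).append(member)
    alloc = proportional_allocation(
        {s: len(members) for s, members in strata.items()}, total_sample)
    return {s: rng.sample(members, alloc[s]) for s, members in strata.items()}
```

With a total sample of 200 and a stratum holding 10% of the population, the allocation gives that stratum 20 subjects, as in the example in the text.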
Stratification does not depart from the principle of random sampling. All it
means is that, before any selection takes place, the population is divided into
strata and we then sample at random within each stratum. It is possible to have
more than one stratification factor. For example, in addition to stratifying by
postcode area, we may stratify by age group within each postcode area.
Nonetheless, we have to be careful not to stratify by too many factors.
Stratified random sampling requires that we have a large population, for which
all of the members and their stratification factors are listed. Obviously, as
the number of stratification factors increases, so also does the time and
expense involved. In return, we can be more confident of the representative
nature of the sample and thereby the generalisability of the results.
All medical research is undertaken on a group of selected individuals.
However, the usefulness of any medical research lies in the
generalisation of the findings rather than in the information gained about a
particular group of individuals. Nonetheless, many studies use very
restrictive inclusion criteria, making it very difficult to generalise results.
For example, if the subjects in a cross-sectional study investigating
associations between bowel cancer and diet were selected from an
area that was predominantly social class IV or V, can the results be
extrapolated to individuals in a different social class? Such extrapolation
is not straightforward, and the researchers in such a study should
consider incorporating other geographical areas with a wider range of social
classes. Even if the study had such a sample, the reader of the journal
article must pay careful attention to the ethnicity of the study subjects
before extrapolating the results of a study conducted in the UK to, say, Asia.
Observational studies are conducted to investigate associations between risk
factors and a disease or disorder, rather than to find out anything about the
individual patients in the study. See Altman & Bland
(1998) for further discussion on generalisation and extrapolation.
The response rate to a questionnaire survey is the proportion of subjects who
respond to the questionnaire. Questionnaire surveys, particularly postal
surveys, tend to have low response rates (anything from 30% to 50% is not
unusual). Subjects who respond to questionnaires often differ from those who
do not, and so the results of a study with a low response rate will not be seen
as representative of the population of interest. Thus, if a grant proposal
includes a questionnaire survey the reviewers will be looking for ways in which
the applicants plan to maximise response. Response rates can be enhanced by
including self-addressed stamped envelopes, informing respondents of the
importance of the study and ensuring anonymity. If responses are not anonymous,
then response rates can also be increased by following up non-responders to the
first posting with another copy of the questionnaire or a telephone call. If
responses are anonymous, non-responders cannot be identified, so a second
posting has to go to everyone and may result in duplicate replies from some
respondents. See Edwards et al. (2002) for a discussion on improving response
rates to questionnaires.
References for this chapter
Altman DG, Bland JM. (1998) Generalisation and extrapolation. British Medical
Journal 317, 409-410.
Bland JM, Altman DG. (1994c) Matching. British Medical Journal 309, 1128.
Breslow NE, Day NE. (1980) Statistical Methods in Cancer Research:
Volume 1 - The analysis of case-control studies. IARC Scientific
Publications No. 32, Lyon.
Edwards P, Roberts I, Clarke M, DiGuiseppi C, Pratap S, Wentz R, Kwan I.
(2002) Increasing response rates to postal questionnaires: systematic review.
British Medical Journal 324, 1183-1185.
This page is maintained by Martin Bland.
Last updated: 10 September, 2009.