Statistics Guide for Research Grant Applicants

E. Describing the statistical methods

E-1 Introduction
- E-1.1 Terminology
- E-1.2 Level of detail required
E-2 Is the proposed method appropriate for the data?
- E-2.1 'Ordinal' scores
E-3 Paired and unpaired comparison
E-4 Assumptions
- E-4.1 Transformations
E-5 Adjustment for confounding
E-6 Hierarchical or multilevel data
- E-6.1 Analysing hierarchical data
E-7 Multiple testing
E-8 Change over time (regression towards the mean)
E-9 Intention to treat in clinical trials
E-10 Cluster randomised trials
E-11 Collapsing variables
E-12 Estimation and confidence intervals

E-1 Introduction

When reading through the statistical analysis section of a grant proposal, the reviewer is looking for several things. Have the statistical methods been described adequately and in terms that are unambiguous? Will the data generated by the study be measured on an appropriate scale (see A-4.1) and be of an appropriate type (see A-4.2) for analysis by the methods proposed? Will the assumptions made by the proposed methods hold and what is planned if they do not? Will the proposed statistical methods take adequate account of the study design and the structure of the data (e.g. serial measurements, hierarchical data)?

E-1.1 Terminology

When describing the proposed statistical methods it is appropriate and helpful to the reviewer to use statistical terminology. However, it is important that the applicant actually understands the terminology and uses it appropriately if the description is to be unambiguous. Very often applicants say that they plan a multivariate analysis when in fact they mean a multifactorial analysis. These two types of analysis are appropriate in different settings and are used to answer different questions. They also make different assumptions. To add to the confusion, the terms multivariate and multifactorial are frequently used interchangeably, and therefore incorrectly, in the medical literature as well as in grant proposals. So when and how should they be used? In any statistical technique it is the outcome variable whose variation is being modelled and about which assumptions (e.g. data come from a Normal distribution) are made. The explanatory variable, on the other hand, is assumed to take fixed values. A statistical method involving only one outcome variable can be described as univariate and a method involving multiple outcome variables as multivariate. Univariate analyses can be further divided into unifactorial if only one explanatory variable is to be considered and multifactorial if multiple explanatory variables are to be considered.

E-1.2 Level of detail required

It is important for the reviewer to be reassured that the applicants have thought in some detail about how they will use their data to address their study aims. Often, however, applicants concentrate on describing their study design and on calculating their sample size but then dismiss their proposed statistical analysis in a single sentence. For example, 'data will be analysed using the computer package SPSS', 'data will be analysed using multivariate techniques' or 'data will be analysed using multifactorial methods'. Such descriptions are completely inadequate. There is a wealth of statistical techniques that can be run using SPSS or that can be described as multivariate or multifactorial. The actual name of the technique intended should therefore be given explicitly e.g. principal component analysis, multiple regression analysis etc. This information is essential to the reviewer because, although certain techniques may come under the same umbrella term e.g. multifactorial, they are appropriate in very different settings. For example, multiple logistic regression and multiple regression may both be described as multifactorial, but the latter can only be used when the outcome is a continuous variable and the former when the outcome is a binary variable (two possible categories e.g. yes and no) or the sum of binary variables (see Bland 2000, chapter 17).

E-2 Is the proposed method appropriate for the data?

When writing the statistical analysis section the applicants should be mindful of the type of data that will be generated by their study (see A-4.1 and A-4.2). For example, if they have one outcome variable measured across two independent groups of subjects, is that outcome variable continuous, ordinal, categorical or more specifically binary (only two categories e.g. yes and no)? If the outcome is binary then in order to compare groups one of the following statistical tests might be used: the Chi-squared test or Fisher's Exact Test. However, if the outcome is continuous then the two-sample t test or the Mann-Whitney U test might be used. For an ordinal outcome variable the Chi-squared test for trend or the Mann-Whitney U test might be employed, and finally for a non-ordered categorical outcome the Chi-squared test might be used. Although the appropriateness of each test will depend on other assumptions (see E-3 and E-4) in addition to data type, it is clear that the two-sample t test is not appropriate for binary or categorical outcomes or for non-interval data. (For further information on the significance tests mentioned see Armitage, Berry and Matthews 2002, Altman 1991, Bland 2000.)
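The mapping from outcome type to candidate test described above can be written down as a small lookup. This is only a sketch of the paragraph's content: the dictionary keys and the function name `candidate_tests` are illustrative labels, and the choice of test still depends on the assumptions discussed in E-3 and E-4.

```python
# Lookup of the tests named in the text, keyed by type of outcome variable,
# for a comparison of two independent groups.
# The keys and the helper name are illustrative labels only.
SUGGESTED_TESTS = {
    "binary": ["Chi-squared test", "Fisher's Exact Test"],
    "continuous": ["two-sample t test", "Mann-Whitney U test"],
    "ordinal": ["Chi-squared test for trend", "Mann-Whitney U test"],
    "categorical": ["Chi-squared test"],
}


def candidate_tests(outcome_type):
    """Return the tests the text suggests for the given outcome type."""
    try:
        return SUGGESTED_TESTS[outcome_type]
    except KeyError:
        raise ValueError(f"unknown outcome type: {outcome_type!r}") from None


print(candidate_tests("ordinal"))
```

Note that the two-sample t test appears only under the continuous outcome, in line with the warning in the text.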

E-2.1 'Ordinal' scores

A particular type of data of interest is the score formed by adding up numerical responses to a series of questions in a questionnaire that has been designed to obtain information on different aspects of the same phenomenon e.g. quality of life. Often each question requires only a yes/no answer (coded Yes=1, No=0). The resulting scores are not interval data and it is debatable whether they are even ordinal (see A-4.1 and A-4.2). A positive response to Question A may add 1 to the overall score as may a positive response to Question B, but since A and B are measuring two different aspects, what does a difference in score of 1 actually mean?

When such scores are based on a large number of questions, they are often treated as if they were continuous variables for the purposes of multivariate or multifactorial analysis (see E-1.1). The main reason for this is the lack of any alternative methodology. However, care should always be taken when interpreting results of such analyses and the problem should not just be ignored. Where scores are based on only a few yes/no type questions (say < 10) treating the score as continuous cannot be justified. Indeed some statisticians would argue that 'ordinal' scores should never be treated as continuous data. Certainly for unifactorial analysis, non-parametric methods should be considered (see Conover 1980).

E-3 Paired and unpaired comparison

Very often, whether in a clinical trial or in an observational study (see A-1), there are two sets of data on the same variable and it is of interest to compare these two sets. Of primary importance is whether or not they can be considered to be independent of each other. Dependence will arise if the two sets consist of measurements or counts made on the same subjects at two different points in time, for example middle-aged men with lung function measured at a baseline clinic and at 5-year follow-up. It can also arise in observational studies if we have subjects with disease (cases) and subjects without disease (controls) and each control is matched to a case for important confounding factors such as age and sex, that is, a 1-1 matched case-control study. With this type of dependence the comparison of interest is the paired comparison (i.e. differences within pairs). When the two sets of data are independent, as for example when subjects are randomly allocated to groups in a randomised trial or when two distinct groups (e.g. males and females) are compared in an observational study, the comparison of interest is the unpaired comparison (i.e. differences between groups).

The distinction is important as paired and unpaired comparisons require different statistical techniques. Thus if the data are continuous (see A-4) representing some measurement on two independent groups of subjects an unpaired t test (or Mann-Whitney U test) might be used but if measurements are 'before' and 'after' measurements on the same subjects a paired t test (or Wilcoxon Signed rank test) might be used. For binary data i.e. data taking the values 0 and 1 only, the tests for unpaired and paired comparison would be the Chi-squared test (or Fisher's Exact Test) and McNemar's test respectively; although the appropriateness of each test would depend on other assumptions holding as discussed below (see E-4). (For further information on the significance tests mentioned see Armitage, Berry and Matthews 2002, Altman 1991, Bland 2000.)
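To illustrate why the distinction matters, the sketch below computes both the unpaired (pooled variance) and the paired t statistic for the same 'before' and 'after' measurements. The data are invented for illustration and the function names are hypothetical; only the Python standard library is used, and P values would still have to be obtained from t tables or a statistics package.

```python
import math
from statistics import mean, stdev


def unpaired_t(group_a, group_b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / se, n_a + n_b - 2


def paired_t(before, after):
    """Paired t statistic based on within-pair differences; returns (t, df)."""
    diffs = [a - b for b, a in zip(before, after)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Invented lung function values (litres) for six subjects measured twice.
before = [3.1, 2.8, 3.6, 3.3, 2.9, 3.8]
after = [2.9, 2.7, 3.3, 3.1, 2.8, 3.5]

t_wrong, df_wrong = unpaired_t(before, after)  # ignores the pairing
t_right, df_right = paired_t(before, after)    # uses within-subject differences
print(f"unpaired: t = {t_wrong:.2f} on {df_wrong} df")
print(f"paired:   t = {t_right:.2f} on {df_right} df")
```

In these invented data every subject declines by a similar amount, so the paired statistic is much larger in magnitude than the unpaired one, which is swamped by the variation between subjects.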

E-4 Assumptions

It is often not appreciated that in order to formulate statistical significance tests certain assumptions are made and that if those assumptions do not hold then the tests are invalid. For example the unpaired t test (sometimes referred to as the two-sample t test) assumes that data have been selected at random from two independent Normal distributions with the same variance. The assumption of independence is satisfied if as indicated above, data come from measuring two distinct/unmatched groups of subjects. One simple way of checking that data come from a Normal distribution is to produce a histogram of the data or to produce a Normal plot. A symmetrical bell shaped histogram or a straight line of points on the Normal plot indicates Normality. To check for equal variances it is worth just eyeballing the calculated standard deviations for each group to see if they differ markedly. This is as good a method as any. Testing for a significant difference between variances is of limited use as the size of difference that can be detected with any degree of certainty is dependent upon sample size. Thus a large sample size may detect a real though trivial difference in variance whereas a small sample would miss a large important difference.

Other tests:

The paired t test (see E-3) assumes that differences between paired observations come from a Normal distribution.
The chi-squared test and McNemar's test (see E-2 and E-3) are large sample tests and their validity depends on sample size. For small samples Fisher's exact test or an exact version of McNemar's test may be required. (Several programs are available to do these, including StatXact by Cytel.)
Contrary to popular belief, tests based on ranks such as the Mann-Whitney U test or the Wilcoxon signed rank test (see E-2 and E-3) also make assumptions and cannot be used for very small samples (see Conover 1980).
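As an illustration of an exact test for small samples, the sketch below gives an exact version of McNemar's test. It relies on the fact that, under the null hypothesis, each discordant pair is equally likely to fall in either direction, so the count in one direction follows a Binomial distribution with probability 0.5. The function name and the example counts are hypothetical.

```python
from math import comb


def exact_mcnemar_p(b, c):
    """Two-sided exact P value from the two discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    smaller = min(b, c)
    # Probability of a split at least as extreme as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    # Double the tail for a two-sided test, capping at 1.
    return min(1.0, 2 * tail)


# Hypothetical example: 1 pair changed in one direction and 9 in the other.
print(exact_mcnemar_p(1, 9))
```

Only the discordant pairs contribute to the test; pairs in which both members give the same response carry no information about the difference.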

E-4.1 Transformations

If a histogram of the data is not symmetrical and has a long tail to the right, the distribution is described as positively skew. If the longer tail is to the left the distribution is said to be negatively skew. If data appear to follow a positively skewed distribution but the proposed statistical analysis assumes that data come from a Normal distribution then a logarithmic transformation may be useful. We simply take logs of the basic data and use the logged data in the analysis, provided a histogram of the logged data looks sufficiently symmetrical and bell shaped i.e. provided the logged data come from a Normal distribution. If the logged data do not appear to follow a Normal distribution then there are some other transformations that can be tried e.g. the square root transformation. Very often a transformation which restores Normality will also lead to equal variances across groups (see Wetherill 1981).
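The effect of a logarithmic transformation on a positively skew distribution can be checked numerically as well as by histogram. The sketch below uses a simple moment-based skewness coefficient (an assumption of this example, not a method prescribed in this guide) and invented data; values near zero suggest symmetry, and positive values a tail to the right.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed standardised deviation."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Invented positively skew measurements (all values must be > 0 to take logs).
raw = [1.2, 1.5, 1.9, 2.4, 3.0, 3.8, 4.9, 7.5, 12.0, 25.0]
logged = [math.log(v) for v in raw]

print(f"skewness of raw data:    {skewness(raw):.2f}")
print(f"skewness of logged data: {skewness(logged):.2f}")
```

A numerical summary like this supplements the histogram or Normal plot rather than replacing it, and results from an analysis of logged data need to be interpreted on the log scale or back-transformed.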

If the data come from a negatively skew distribution or if transformations are not very helpful then a non-parametric test based on ranks may be a more useful approach. The non-parametric equivalent to the two-sample t test is the Mann-Whitney U test (see Conover 1980 for more details of non-parametric tests).

E-5 Adjustment for confounding

When in an observational study we observe for example an association between good lung function and vitamin C, we cannot assume that vitamin C is directly benefiting lung function. There may be some other factor such as smoking which is inversely associated with vitamin C and which has a direct effect on lung function. In other words smoking may confound the association between vitamin C and lung function.

The effects of confounding can be adjusted for at the design stage in terms of matching (see C-1.2) or stratified randomisation (see B-5.7) etc or at the analysis stage using multifactorial methods (see E-1.1). A list of confounders and how they are to be adjusted for should always form part of the section on proposed statistical methods. If the applicants decide to adjust at the analysis stage then the collection of information on confounding variables should form part of their plan of investigation. Some consideration should also be given to how detailed this information needs to be. To adjust for smoking in the lung function vitamin C example, something more than a 3 category variable of current smoker, ex-smoker and non-smoker is required as both amount smoked and length of time exposed may be important. If the applicants propose to adjust at the design stage then they need to appreciate that this will build some sort of structure into their data which will have implications for the statistical analysis (see E-6 and E-6.1). For example, in a 1-1 matched case-control study you cannot treat the cases and controls as two independent samples but rather as paired samples (see E-3).

E-6 Hierarchical or multilevel data

This is where your data have some sort of hierarchy e.g. patients within GP practices, subjects within families. We cannot ignore the fact that subjects within the same group are more alike than subjects in different groups. This problem is one of a lack of independence. Most basic significance tests (e.g. t tests) assume that within each group being compared (e.g. treatment A or treatment B), the data are independent observations from some theoretical distribution. For example the 2-sample t test assumes that the data within each group are independent observations from a Normal distribution. If for example the data are measurements of total cholesterol made on patients from a sample of general practices, measurements of subjects in the same practice will tend to be more similar than measurements of subjects from different practices. Hence the assumption of independence fails.

E-6.1 Analysing hierarchical data

It is often the case that a carefully designed study incorporates balance by introducing a hierarchical structure to the data. For example, in a case-control study you may match one control of the same age and sex 1-1 to each case (see C-1.2). You then have case-control pairs and below them in the hierarchy, the subjects within pairs. This matching should be taken into account in the statistical analysis (see Breslow & Day 1980). In a clinical trial you may stratify your subjects by age and randomly allocate subjects to treatments within strata (see B-5.7). Strata should be adjusted for in the statistical analysis if we are to maximise precision. In a cluster randomised trial (see B-5.9) you may randomly allocate a sample of general practices to one of two interventions and measure outcome in the patients. In this case, general practice should be used as the unit of analysis and not the patient. In other words we have to summarise the outcome for each practice; we cannot simply add up all the patients in the intervention practices and the non-intervention practices (see Altman & Bland 1997, Bland & Kerry 1997, Kerry & Bland 1998, Kerry & Bland 1998c).
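A minimal sketch of using the practice as the unit of analysis is given below: patient values are first summarised as one mean per practice, and it is these practice means that would then be compared between interventions. The records, practice labels and function name are invented for illustration, and a simple unweighted mean per practice is assumed.

```python
from statistics import mean

# Invented records: (practice, intervention arm, total cholesterol in mmol/l).
patients = [
    ("practice_1", "intervention", 5.2),
    ("practice_1", "intervention", 5.6),
    ("practice_1", "intervention", 5.0),
    ("practice_2", "intervention", 6.1),
    ("practice_2", "intervention", 6.3),
    ("practice_3", "control", 6.0),
    ("practice_3", "control", 6.4),
    ("practice_3", "control", 6.2),
    ("practice_4", "control", 5.5),
    ("practice_4", "control", 5.9),
]


def arm_practice_means(records):
    """Return {arm: [one mean per practice]} so that practices are the units."""
    by_practice = {}
    for practice, arm, value in records:
        by_practice.setdefault((practice, arm), []).append(value)
    per_arm = {}
    for (practice, arm), values in by_practice.items():
        per_arm.setdefault(arm, []).append(mean(values))
    return per_arm


summaries = arm_practice_means(patients)
for arm, practice_means in summaries.items():
    print(arm, len(practice_means), "practices:", practice_means)
```

The comparison between arms is then based on two observations per arm here (the practices), not on five patients per arm, which makes clear how much information a cluster design really provides.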

Sometimes the hierarchical structure is there because you sample at one level and collect data at another as for example, in a survey of old people's homes where each client at each home in the sample is asked to take part. Some adjustment for clustering within homes is required here, which may involve complex statistical methods such as multilevel modelling (Goldstein 1995).

The reviewer is looking for some indication that the applicants appreciate the structure of their data and how this will impact in terms of their proposed statistical analysis and the likely complexity of that statistical analysis.

E-7 Multiple testing

E-7.1 Multiple testing: when does it arise?

Multiple significance testing arises in several ways:

More than one outcome measurement in a clinical trial, e.g. in a trial of ventilation of small neonates we might want to look at survival and time on ventilation. We may want to look at the differences for each variable. (See B-6 and E-7.4a.)
More than one predictor measurement in an observational study, e.g. in a study of the effects of air pollution on hospital admissions, we might need to consider the effects of several pollutants and several lag times, i.e. pollution on the day of the admission, the day before the admission, two days before, etc. If any of these is significant we want to conclude that air pollution as a whole has an effect. (See E-7.4b.)
Measurements repeated over time (serial data), e.g. measurements of circulating hormone level at intervals after the administration of a drug or a placebo. We may want to look at the difference between groups at each time. (See E-7.4c.)
Comparisons of more than two groups, e.g. we may have three different treatments in a trial, such as two doses of an active drug and a placebo. We may want to compare each pair of groups, i.e. the two doses and each dose with placebo. (See E-7.4d.)
Testing the study hypothesis within subgroups, e.g. for males and females separately or for severe and mild disease separately. (See B-6 and E-7.4e.)
Repeatedly testing the difference in a study as more patients are recruited. (See B-5.10d, B-7.2 and E-7.4f.)

If the main analysis of your study involves any of these, this should be described in the proposal, together with how you intend to allow for the multiple testing.

E-7.2 Multiple testing: why is it a problem?

The problem is that if we carry out many tests of significance, we increase the chance of false positive results, i.e. spurious significant differences or type I errors. If there are really no differences in the population, i.e. all the null hypotheses are true, the probability that we will get at least one significant difference is going to be a lot more than 0.05. (For explanation of statistical terms see D-4).

For a single test when the null hypothesis is true, the probability of a false positive, significant result is 0.05, by definition, and so the probability of a true negative, non significant result is (1-0.05) = 0.95. If we have two tests, which are independent, i.e. the variables used in the tests are independent, the probability that both tests are true negative, not significant results is 0.95² = 0.9025. Hence the probability that at least one of the two tests will be a false positive is 1-0.9025 = 0.0975, not 0.05. If we do k independent tests, the probability that at least one will be significant is 1-0.95^k.

For 14 independent tests, the probability that at least one will be significant is thus 1-0.95¹⁴ = 0.51. There would be a more than 50% chance of a spurious significant difference.
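The arithmetic above can be reproduced in a few lines. The function name is illustrative, and the calculation assumes, as in the text, that the tests are independent and all null hypotheses are true.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k


for k in (1, 2, 5, 14, 20):
    print(f"{k:2d} tests: {familywise_error(k):.4f}")
```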

Tests are independent in subgroup analysis, provided the subgroups do not overlap. If the tests are not independent, as is usually the case, the probability of at least one false positive will be less than 1-0.95^k, but by an unknown amount. If we do get a false positive, however, the chance of more than one false positive is greater than when tests are independent. To see this, imagine a series of variables which are identical. Then the chance of a false positive is still 0.05, less than 1-0.95^k, but if one occurs it will be significant for all the variables. Hence having several significant results does not provide a reliable guide that they are not false positives.

E-7.3 The Bonferroni correction

One possible way to deal with multiple testing is to use the Bonferroni correction. Suppose we do several tests, k in all, using a critical P value of alpha, the null hypotheses all being true. The probability of at least one significant difference is 1 - (1 - alpha)^k. We set this to the significance level we want, e.g. 0.05. We get 1 - (1 - alpha)^k = 0.05. Because alpha is going to be very small, we can use the approximation (1 - alpha)^k ≈ 1 - k alpha. Hence 1 - (1 - alpha)^k ≈ 1 - (1 - k alpha) = k alpha. Setting k alpha = 0.05 gives alpha = 0.05/k. So if we do our k multiple tests and find that one of them has P < 0.05/k, the composite null hypothesis that all k null hypotheses are true is rejected at the 0.05 level. In practice it is better to multiply all the individual P values by k; then if any is significant (P < 0.05) the test of the composite null hypothesis is significant at the 0.05 level, and the smallest modified P value gives the P value for the composite null hypothesis (Bland & Altman 1995, Bland 2000a).
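The practical form of the correction, multiplying each P value by k, can be sketched as follows. Capping the modified values at 1 is an addition made here so that they remain valid probabilities; the function names and the example P values are illustrative.

```python
def bonferroni_adjust(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def composite_p(p_values):
    """P value for the composite null hypothesis: smallest modified P value."""
    return min(bonferroni_adjust(p_values))


observed = [0.01, 0.04, 0.5]  # invented P values from k = 3 tests
print(bonferroni_adjust(observed))
print(composite_p(observed) < 0.05)
```

In this invented example only the first test remains below 0.05 after correction, although two of the three were below 0.05 before it.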

The Bonferroni correction assumes that the tests are independent. Applying the Bonferroni correction when tests are not independent means that the P value is larger than it should be, but by an unknown amount. Hence the power of the study is reduced, also by an unknown amount. If possible, we look for other methods which take the structure of the data into account, unlike Bonferroni.

If we are going to have multiple significance tests in our study, we should say in the proposal how we are going to deal with them (see E-7.4).

E-7.4 How to deal with multiple testing

We could ignore the problem and take each test at face value. This would lay us open to charges of misleading the reader, so it is not a good idea.

We could choose one test as our main test and stick to it. This is good in clinical trials but can be impractical in other designs. It may ignore important information.

We could use confidence intervals (see E-12) instead of significance tests. This is often desirable quite apart from multiple testing problems, but confidence intervals will be interpreted as significance tests whatever the author may wish.

There are several better options, depending on the way multiple testing comes about (see E-7.4a, E-7.4b, E-7.4c, E-7.4d, E-7.4e and E-7.4f).

E-7.4a More than one outcome measurement in a clinical trial.

We should keep the number of outcomes to a minimum, but there is a natural desire to measure anything which may be interesting and then comparing the treatment groups is almost irresistible. We can use the Bonferroni method (see E-7.3). We multiply each observed P value by the total number of tests conducted. If any modified P value is less than 0.05 then the treatment groups are significantly different. This tests a composite null hypothesis, i.e. that the treatments do not differ on any of the variables tested. If we have two or three main outcome variables where a difference in any of them would lead us to conclude that the treatments were different, we should build this into sample size calculations by dividing the preset type I error probability (usually 0.05) by the number of tests. If we are going to carry out many tests on things we have measured, we should state in the protocol that these will be adjusted by the Bonferroni method. Alternatively, we should state that any such tests will be described clearly as hypothesis-generating analyses which will not enable any firm conclusion to be drawn.

E-7.4b More than one predictor measurement in an observational study.

Usually we ignore this unless the variables are closely related making multiple testing a real problem. When this is the case we can use the Bonferroni method (see E-7.3). We should test each of our predictors and apply the correction to the P values. In a protocol, we should use 0.05/number of tests as the type I error in sample size calculations. An alternative would be to put the group of variables into a multifactorial analysis such as multiple or logistic regression, and test them all together using the reduction in sum of squares or equivalent. We would ignore individual P values. You need quite a lot of observations to do this reliably.

E-7.4c Measurements repeated over time (serial measurements)

We should not carry out tests at each time point separately. Not only does this increase the chance of a false positive but, as it uses the data inefficiently, it increases the chance of a false negative also. There are several possible approaches. One is to create a summary statistic (Bland 2000b, Matthews et al. 1990) such as the area under the curve. The peak value and time to peak can also be used, but as they do not use all the data they may be less efficient, particularly the time to peak. On the other hand, they have a direct interpretation. For data where the variable increases or decreases throughout the observation the rate of change, measured by the slope of a regression line, may be a good summary statistic. For a proposal, you should decide on the summary statistic you are going to use. For your sample size calculation you will need an estimate of the standard deviation and of the size of difference you wish to detect (much more difficult, as you are using an indirect measure). A pilot study (see A-1.9) is very useful for this. There are several other approaches, including repeated measures analysis of variance and multilevel modelling (Goldstein 1995). These are more difficult to do and to interpret. The research team should include an experienced statistician if these are to be used.
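The summary statistics mentioned above can each be calculated per subject and then compared between groups with a single test. The sketch below shows one way to compute them for one subject's series; the trapezium rule for the area under the curve is an assumption of this example, and the function names and data are invented.

```python
from statistics import mean


def area_under_curve(times, values):
    """Area under the curve by the trapezium rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2
    return area


def peak_and_time_to_peak(times, values):
    """Largest observed value and the time at which it was first observed."""
    peak = max(values)
    return peak, times[values.index(peak)]


def slope(times, values):
    """Least squares slope of value on time (rate of change)."""
    t_bar, v_bar = mean(times), mean(values)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


# Invented hormone levels for one subject at 0, 1, 2, 4 and 6 hours.
hours = [0, 1, 2, 4, 6]
level = [2.0, 5.0, 8.0, 6.0, 3.0]
print(area_under_curve(hours, level))
print(peak_and_time_to_peak(hours, level))
print(slope(hours, level))
```

Each subject contributes one number per summary statistic, so the analysis reduces to an ordinary two-group comparison of those numbers.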

E-7.4d Comparisons of more than two groups

We start off with a comparison of all groups, using analysis of variance or some other multiple group method. If the difference between groups is significant, we then go on to compare each pair of groups. There are several ways to do this. One way would be to do t tests between each pair, using the residual variance from the analysis of variance (which increases the degrees of freedom and so increases the power compared to a standard t test). This is called the least significant difference analysis. However, because we are carrying out multiple tests, the risk of false positives is high. We can apply Bonferroni, but this loses a lot of power and as we are not testing a composite hypothesis it is not really appropriate. There are several better and more powerful methods, which have the property that only one significant difference should be found in 20 analyses of variance if the null hypothesis is true, rather than one in 20 pairs of groups. These include the Newman Keuls range test, suitable for groups of equal size (see Armitage, Berry and Matthews 2002, Bland 2000), and Gabriel's test, suitable for groups of unequal size (see Bland 2000). Different statistical packages offer different methods for doing this, and you may be rather limited by your software. We will not try to review this as it will change as new releases of software are issued. In your protocol you should say which method you are going to use. Sample size calculations with more than two groups are difficult. It should be acceptable if you use the method for two groups and assume that if your sample is adequate for a comparison of two of your groups it will be adequate for all of them.

E-7.4e Testing the study hypothesis within subgroups.

There are two possible reasons for wanting to do this. First, we might want to see whether, even though a treatment difference may not be significant overall, there is some group of patients within which there is a difference. If so, we wish to conclude that the treatment has an effect. The Bonferroni correction is required here, as we are testing a composite hypothesis. As the tests are independent we should not experience loss of power. Second, we might want to see whether the main study difference varies between groups; for example, whether a treatment effect is greater in severe than in mild cases. We should do this by estimating the interaction (see A-1.6a) between the main study factor (e.g. treatment) and the subgrouping factor (e.g. severity) (see Altman and Matthews 1996, Matthews and Altman 1996a, 1996b). Separate tests within subgroups will not tell us this. Not only do we have multiple testing, but we cannot conclude that two subgroups differ just because we have a significant difference in one but not in the other (Matthews and Altman 1996a). Not significant does not mean that there is no effect.

E-7.4f Repeatedly testing the difference in a study as more patients are recruited.

This is a classic multiple testing problem. It is solved by adopting a sequential trial design (see B-5.10d), where the multiple testing is built in and allowed for in the sample size estimation. The testing is arranged so that the overall P value is 0.05. There are several designs for doing this. Anyone using such designs should consult Whitehead (1997).

E-8 Change over time (regression towards the mean)

Problems occur if one of the aims of a study is to investigate the association between change over time and initial value. Due to measurement error alone, those with high values at baseline are more likely to have lower than higher values at follow-up and those with low values at baseline are more likely to have higher than lower values at follow-up. A spurious inverse association between change and initial value is therefore to be expected. This phenomenon is an example of regression towards the mean (Bland & Altman 1994a, 1994b). If we are interested in any real association between change and initial value then we must first remove the effects of regression towards the mean (see Hayes 1988).

If we are convinced that change over time depends on initial value then we may want to adjust for initial value in any multifactorial analysis (see E-1.1). However, if we have change as the outcome variable and include initial value as an explanatory variable we may introduce bias due to regression towards the mean. Such bias may be reduced by adjusting for average ((initial + follow-up)/2) rather than initial value (Oldham 1962). In observational studies (see A-1.1), however, any attempt to adjust for initial value may be an over-adjustment due to the horse-racing effect. The horse-racing effect is basically the tendency for individuals with faster rates of decline over time in the outcome measure of interest (e.g. lung function) to have lower initial values because of past decline (see Vollmer 1988).
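Regression towards the mean, and the effect of relating change to the average rather than to the initial value, can be seen in a small simulation. The sketch assumes that each subject's true value does not change at all and that the two measurements carry independent errors of equal size; the means, standard deviations, seed and function names are arbitrary choices for illustration.

```python
import random
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, seed=1):
    """Return (corr of change with initial, corr of change with average)."""
    rng = random.Random(seed)
    true_values = [rng.gauss(100, 10) for _ in range(n)]
    # No real change: both measurements are the true value plus error.
    initial = [t + rng.gauss(0, 5) for t in true_values]
    follow_up = [t + rng.gauss(0, 5) for t in true_values]
    change = [f - i for i, f in zip(initial, follow_up)]
    average = [(i + f) / 2 for i, f in zip(initial, follow_up)]
    return correlation(change, initial), correlation(change, average)


r_initial, r_average = simulate()
print(f"change vs initial value: r = {r_initial:.2f}")
print(f"change vs average:       r = {r_average:.2f}")
```

Although nobody's true value changes, change is inversely correlated with initial value purely because of measurement error, whereas under these assumptions its correlation with the average is close to zero.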

E-9 Intention to treat in clinical trials

In a randomised clinical trial subjects are allocated at random to groups. The aim of this is to produce samples that are similar/comparable at baseline in terms of factors, other than treatment, that might influence outcome. In effect they can be regarded as random samples from the same underlying population. However, as a clinical trial progresses some patients may change treatments or simply stop taking their allocated treatment. There is then a temptation to analyse subjects according to the treatment they actually received rather than the treatment to which they were originally allocated. This approach, though appearing reasonable at first glance, fails to retain the comparability built into the experiment at the start by the random allocation. Patients who change or stop taking treatment are unlikely to be 'typical'. Indeed they may well have changed because the treatment they were on was not working or they were experiencing adverse side effects. It is therefore important in the analysis of randomised controlled trials to adhere (and to be seen to adhere) to the concept of analysis by intention to treat. This means that in the statistical analysis all subjects should be retained in the group to which they were originally allocated regardless of whether or not that was the treatment that they actually received. A statement to this effect should be made when describing the proposed statistical analysis for any randomised trial. (See also B-9.)

E-10 Cluster randomised trials

When subjects are in clusters, such as GP practices, this must be taken into account in the analysis plan. See B-5.9, E-6, E-6.1, Altman & Bland (1997), Bland & Kerry (1997), Kerry & Bland (1998), and Kerry & Bland (1998c).
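One simple way of taking clustering into account is a summary-measure analysis, in which each cluster contributes a single value and the clusters, not the individual subjects, are the units of analysis. The sketch below assumes invented outcome data for six hypothetical practices and computes the ingredients of a two-sample t comparison of practice means. It is an illustration rather than a recommended analysis plan, and other methods (see E-6.1) may be more suitable.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical outcomes for patients within practices, by trial arm.
intervention = {
    "practice A": [5.0, 7.0],
    "practice B": [6.0, 8.0, 10.0],
    "practice C": [9.0, 11.0],
}
control = {
    "practice D": [4.0, 6.0],
    "practice E": [5.0, 7.0, 9.0],
    "practice F": [8.0, 10.0],
}


def cluster_means(arm):
    """One summary value (the mean outcome) per cluster."""
    return [mean(values) for values in arm.values()]


m1 = cluster_means(intervention)
m2 = cluster_means(control)

# Two-sample t comparison with the practice as the unit of analysis.
difference = mean(m1) - mean(m2)
df = len(m1) + len(m2) - 2
pooled_variance = ((len(m1) - 1) * variance(m1) + (len(m2) - 1) * variance(m2)) / df
standard_error = sqrt(pooled_variance * (1 / len(m1) + 1 / len(m2)))
t_statistic = difference / standard_error

print(f"difference in practice means = {difference:.2f}")
print(f"standard error = {standard_error:.2f}, t = {t_statistic:.2f} on {df} df")
```

The degrees of freedom depend on the number of practices (six here), not on the number of patients, which is why a trial with few clusters gives little information however many patients each cluster contains.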

E-11 Collapsing variables

If information is collected on a continuous (see A-4.2) or pseudo-continuous variable then in general it is this variable that should be used in any statistical analysis and not some grouped version. If we group a continuous variable prior to analysis we are losing information, unless the variable has been recorded to some spurious level of precision and grouping simply reflects a more realistic approach. Grouping continuous variables for presentation purposes is fine provided it is the full ungrouped variable that is used in any significance testing. For example, you may wish to present lung function by fifths of the distribution of dietary fatty fish intake. However, when testing for an association between fatty fish and lung function, the continuous fatty fish variable should be used. One exception would be if some strong a priori reason existed for us to believe that any association between fatty fish and lung function was step-like or discontinuous rather than following a straight line or smooth curve, for example if we believed a priori that lung function was influenced by any versus no intake of fatty fish but did not vary according to the amount eaten. Another exception would be if very few people ate fatty fish. In either case fatty fish could be analysed as a dichotomy (yes/no).
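The loss of information from grouping can be seen in a simulation. The sketch below assumes a linear association between a standardised, Normally distributed intake variable and an outcome, then repeats the analysis after splitting intake at its median. The variable names echo the fatty fish example but the data are simulated and the effect size is an arbitrary assumption.

```python
import random
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(7)  # fixed seed so the run is reproducible
n = 5000

# Simulated data: outcome depends linearly on intake (assumed slope 0.3).
fish_intake = [rng.gauss(0, 1) for _ in range(n)]
lung_function = [0.3 * x + rng.gauss(0, 1) for x in fish_intake]

# Grouped version: high versus low intake, split at the median.
cut = median(fish_intake)
high_intake = [1.0 if x > cut else 0.0 for x in fish_intake]

r_continuous = pearson(fish_intake, lung_function)
r_grouped = pearson(high_intake, lung_function)

print(f"continuous intake: r = {r_continuous:.3f}")
print(f"grouped intake:    r = {r_grouped:.3f}")
```

The association appears weaker when the grouped variable is used, so a test based on it has less power than a test based on the continuous variable.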

E-12 Estimation and confidence intervals

In most studies, even those primarily designed to detect as statistically significant a difference in outcome between two groups, the magnitude of any difference or association is of interest. In other words, in most studies one of the study aims is estimation, whether we are estimating some beneficial effect of a treatment, the prevalence of a disease, the gradient of some linear association, the relative risk associated with some exposure, or the sensitivity of a screening tool. In all these examples we are attempting to estimate some characteristic of a wider population using a single sample or study, and we need to be mindful that another study of the same size might yield a slightly different estimate. It is therefore important when presenting estimates that we provide some measure of their variability from study to study of the same size. This is done by the calculation of confidence intervals. A 95% confidence interval is constructed in such a way that, over many studies of the same size, about 95 out of 100 such intervals would capture the true population value that we are trying to estimate. A 90% confidence interval will capture the true population value 90 times out of 100. Confidence intervals are based on the standard error of the estimate and therefore reflect the variability of the estimate from sample to sample of the same size. They give us some idea of how large the true population value might be (the upper limit of the interval) and how small it might be (the lower limit). Further, if the interval is wide it tells us that we do not have a very precise estimate, but if it is narrow then we have a good and useful estimate.
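The '95 times out of 100' interpretation can be checked by simulation. The sketch below draws many samples from a population with a known mean, calculates a 95% confidence interval for the mean from each, and counts how often the interval contains the true value. It assumes Normal data and uses the large-sample Normal multiplier (about 1.96) rather than the t distribution; the population values are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(2009)  # fixed seed so the run is reproducible
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

true_mean, sd = 50.0, 10.0   # assumed population values
n, n_studies = 100, 2000     # sample size and number of simulated studies

covered = 0
for _ in range(n_studies):
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    estimate = sum(sample) / n
    sample_variance = sum((v - estimate) ** 2 for v in sample) / (n - 1)
    standard_error = sqrt(sample_variance / n)
    lower = estimate - z * standard_error
    upper = estimate + z * standard_error
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_studies
print(f"proportion of intervals containing the true mean: {coverage:.3f}")
```

The proportion should be close to 0.95. Any single interval either contains the true value or it does not; the 95% refers to the long-run behaviour of the method.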

In most studies, therefore, the calculation of confidence intervals should form an important part of the statistical analysis. The fact that confidence intervals will be calculated, and how they will be calculated, should form part of the section in the grant proposal on proposed statistical methods. Normally 95% confidence intervals are calculated. If the applicant envisages calculating, say, 90% or 99% confidence intervals, then some justification should be given. The method of calculation should always be appropriate to the data. Further, the validity of confidence intervals, as with significance tests, depends on certain assumptions. For example, if we use the t method to calculate a 95% confidence interval around the difference in two means (Bland 2000) then the basic data should be continuous. Further, we assume (as for the two-sample t test; see E-4) that data come from two Normal distributions with the same variance. The reviewer is looking for some acknowledgement that the applicants are aware of assumptions and that they have some idea of what they will do if assumptions do not hold (see Altman et al. 2000).

If estimation rather than significance testing is the primary aim of the study then confidence intervals will also form an intrinsic part of any sample size calculations. Sample size will be chosen such that population characteristics will be estimated with adequate precision where precision is measured in terms of the width of confidence intervals (see D-8.1 and Altman et al. 2000).
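As an illustration of a precision-based calculation, the sketch below uses the usual large-sample formula for a proportion, n = z² p(1 − p) / d², where p is the anticipated proportion and d is the required half-width of the confidence interval. The function name and the example values are hypothetical, and the formula may be unsuitable for proportions close to 0 or 1.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, half_width, level=0.95):
    """Subjects needed so that the large-sample confidence interval for a
    proportion near p has roughly the requested half-width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)


# Example: anticipated prevalence 0.5, estimated to within +/- 0.05.
n_required = sample_size_for_proportion(0.5, 0.05)
print(n_required)  # 385
```

Halving the required half-width roughly quadruples the sample size, because the width of the interval is proportional to one over the square root of n.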

E-12.1 Proportions close to 1 or zero

When estimating a proportion close to 1 (e.g. 0.95, 0.92), as is often the case in studies of sensitivity and specificity, or a proportion close to 0 (e.g. 0.05, 0.07), the 95% confidence interval is unlikely to be symmetrical and should be calculated using exact rather than large sample methods. Exact 95% confidence intervals can be calculated using specialist statistical software such as StatXact (by Cytel) and CIA (Altman et al. 2000). For the simple case of a single proportion, we offer biconf, a free DOS program by Martin Bland. Robert Newcombe gives some free Excel programs to do this and more. (See also D-8.1.)
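The difference between the two approaches can be seen in a short calculation. The sketch below computes an exact interval of the Clopper-Pearson type by solving the binomial tail equations numerically, and compares it with the large-sample interval. It is not the algorithm of any of the programs named above; the function names and the example (19 successes out of 20) are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, tol=1e-10):
    """Bisection on [0, 1] for a function f that decreases in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, level=0.95):
    """Exact (Clopper-Pearson type) confidence interval for x successes in n."""
    alpha = 1 - level
    # Lower limit: the p at which P(X >= x) = alpha / 2.
    lower = 0.0 if x == 0 else solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= x) = alpha / 2.
    upper = 1.0 if x == n else solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


def large_sample_ci(x, n, level=0.95):
    """Large-sample interval: estimate +/- z * standard error."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


exact_lower, exact_upper = exact_ci(19, 20)
approx_lower, approx_upper = large_sample_ci(19, 20)
print(f"exact:        {exact_lower:.3f} to {exact_upper:.3f}")
print(f"large sample: {approx_lower:.3f} to {approx_upper:.3f}")
```

For 19 out of 20 the large-sample interval is symmetrical about 0.95 and its upper limit exceeds 1, which is impossible for a proportion. The exact interval stays within 0 to 1 and extends much further below the estimate than above it.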

References for this chapter

Altman DG. (1991) Practical Statistics for Medical Research. Chapman and Hall, London.

Altman DG, Bland JM. (1997) Units of analysis. British Medical Journal 314 1874.

Altman DG, Machin D, Bryant T, Gardner MJ. (2000) Statistics with Confidence, 2nd. ed., British Medical Journal, London.

Altman DG, Matthews JNS. (1996) Interaction 1: Heterogeneity of effects. British Medical Journal 313 486.

Armitage P, Berry G, Matthews JNS. (2002) Statistical Methods in Medical Research 4th ed. Blackwell, Oxford.

Bland JM, Altman DG. (1994a) Regression towards the mean. British Medical Journal 308 1499.

Bland JM, Altman DG. (1994b) Some examples of regression towards the mean. British Medical Journal 309 780.

Bland JM, Altman DG. (1995) Multiple significance tests: the Bonferroni method. British Medical Journal 310 170.

Bland JM, Kerry SM. (1997) Trials randomised in clusters. British Medical Journal 315 600.

Bland M. (2000) An Introduction to Medical Statistics, 3rd. ed. Oxford University Press, Oxford.

Bland M. (2000a) An Introduction to Medical Statistics, 3rd. ed. Oxford University Press, section 9.10.

Bland M. (2000b) An Introduction to Medical Statistics, 3rd. ed. Oxford University Press, section 10.7.

Breslow NE, Day NE. (1980) Statistical Methods in Cancer Research: Volume 1 - The analysis of case-control studies. IARC Scientific Publications No. 32, Lyon.

Conover WJ. (1980) Practical Nonparametric Statistics, 2nd ed. John Wiley & Sons, New York.

Goldstein H. (1995) Multilevel Statistical Models, 2nd ed. Arnold, London.

Hayes RJ. (1988) Methods for assessing whether change depends on initial value. Statistics in Medicine 7 915-27.

Kerry SM, Bland JM. (1998) Analysis of a trial randomised in clusters. British Medical Journal 316 54.

Kerry SM, Bland JM. (1998c) Trials which randomise practices I: how should they be analysed? Family Practice 15 80-83.

Matthews JNS, Altman DG. (1996a) Interaction 2: compare effect sizes not P values. British Medical Journal 313 808.

Matthews JNS, Altman DG. (1996b) Interaction 3: How to examine heterogeneity. British Medical Journal 313 862.

Matthews JNS, Altman DG, Campbell MJ, Royston P. (1990) Analysis of serial measurements in medical research. British Medical Journal 300 230-35.

Oldham PD. (1962) A note on the analysis of repeated measurements of the same subjects. J Chron Dis 15 969.

Vollmer WM. (1988) Comparing change in longitudinal studies: adjusting for initial value. J Clin Epidemiol 41 651-657.

Wetherill GB. (1981) Intermediate Statistical Methods. Chapman & Hall, London.

Whitehead J. (1997) The Design and Analysis of Sequential Clinical Trials, revised 2nd. ed. Chichester, Wiley.

This page is maintained by Martin Bland.

Last updated: 30 September, 2009.