Talk first presented to the London Hypertension Society.
I was first asked to speak on this topic by Donald Singer for a meeting organised by the London Hypertension Society. The aim, of course, is not to upset the statistical referee, but this way round is more fun.
Researchers come to me with comments from statistical referees quite often. I usually agree with the referee. This is not as bad as it sounds, because I can often show the frustrated authors how to do what the referee suggests and so get their papers accepted. We must accept, however, that referees, statistical or otherwise, are fallible people just like you and me and, like us, they get it wrong. After all, the authors might have spent months working on their paper and the referee is unlikely to spend more than half a day on it. Sometimes I'll disagree with the referee and help the author to fight, but this is definitely the minority of cases. And, of course, those referees (of whom there are far too many) who recommend changing or rejecting my work are fools and charlatans.
When I am the referee, on the other hand, I find it is the authors who are unfit to be let out alone. I find myself gasping at the folly of my fellow men and women and racing down the corridor to show my colleagues the latest jaw-dropper. I could not resist, for example, the grant applicant who asked under computing for money to buy 'soft wear'; a nice cashmere sweater for cold computer rooms, perhaps. I have often thought it a pity that such things do not get to a wider audience. Accordingly, when I was given this tempting title I decided to make it a personal account and use some of my referee's reports. I based it on my experience as a statistical referee for the Lancet, as summer relief in 1994 and 1995. I doubt very much that things have improved dramatically since then, but if you know different, let me know.
In what follows, I shall use quotes from some of my reports to the Lancet. As all the papers were confidential, I have changed a few details to protect the ignorant. As a rule, I have no qualms about publicly pointing out the mistakes of others once they have been published. If we do not do this, the conclusions that follow from these mistakes will be quoted by others, usually without any criticism, and become generally accepted. Also, if you publish your work, you must be prepared to defend your position, or amend it. But when work has not been published, but rather is at some point on the twisting road to publication, I think that it would be unfair to criticise it publicly. On the other hand, such work can often be very illuminating. I'm not the only statistician who has thought that I would really like this paper that I am reviewing to be published, because it would make a wonderful teaching example of what not to do. I have therefore done my best in what follows to describe real papers but at the same time to preserve the confidentiality of the reviewing process. I have disguised the nature of the research, sometimes calling the variables 'X' rather than giving them their proper name. I have even changed some of the numbers. However, I do not think that I have changed or exaggerated the nature of the statistical mistake that I was pointing out. These quotes come from reports on just 15 papers, so if something comes up several times it may be pretty common.
After I had given the talk a couple of times, I wanted to generalise a bit and incorporate the views of other statistical referees. I used a completely non-statistical approach: a convenience sample with a low response rate. I used Allstat, an email list that keeps statisticians in touch with one another. I broadcast the following message:
Subject: Statistical referees for medical journals
To allstaters who act as statistical referees for medical journals.
I am preparing a talk entitled "How to upset the statistical referee". This is based on my own (rather limited) adventures with the Lancet. I wondered what are the pet hates of other referees, the things which really irritate them? If there is something which authors do which really upsets you, could you tell me what it is? I shall, of course, post a summary of replies.
Allstat responded beyond all expectations. I received 35 replies, many of which were very extensive and wide-ranging. I found this rather overwhelming and I never did produce that summary of replies which I had promised. Apologies to Allstat for that.
Eventually, I managed to sort and classify these replies. I have added the Allstat comments wherever they fit in with my own and added them separately where they do not.
I think of this as my only purely qualitative research project: two convenience samples (of reviews and of respondents to my Allstat message), one of which was self-selected, used to triangulate the theory generated.
You may be surprised by some of the things which my colleagues and I object to, as you will see many of them appearing frequently in journals. Some of them are what might be termed 'parastatistics', statistics as practised by users of statistics but not by statisticians. Not all statisticians would agree with me or with my respondents, either, and we should not forget that the collective noun for statisticians is a 'variance'. Given these cautions, I hope that what follows will give a good introduction to what might be going through the mind of the statistical referee for your paper. Here we go.
My most frequent and severe complaints concern significance tests and confidence intervals. I think that one of the greatest statistical crimes is to carry out a significance test, get a large P value, and then interpret this as meaning that there is no difference. This happens again and again. My comments included:
`This is a small trial of two similar regimes. They interpret "no significant difference" as meaning "no difference". I do not think that there was any chance of a significant difference anyway. They should present confidence intervals as in the Lancet's guidelines.'
`Not significant should NOT be interpreted as "no change".'
`The conclusion interprets "not significant" as meaning "no difference", which it does not. It means that a difference has not been shown to exist.'
`The habit of reporting non-significant differences as no differences gives me no confidence in the report of no change here. I suggest that some data be included.'
A couple of my Allstat respondents mentioned this, too:
'Interpreting P>0.05 as "evidence" of no difference, without reference to sample size or confidence intervals'
'Interpreting non-significance as "no difference" to such an extent that the Discussion focuses around why this should be also grates high on the pet hates scale.'
Wherever possible, authors should report confidence intervals for differences, not just significance tests. For years statisticians have been trying to persuade researchers of this (e.g. Gardner and Altman 1986). This is the usual guideline of most journals anyway, including the Lancet. The current guidelines, from the Lancet website, include: 'When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information.' Authors continually ignore this, and my papers were no exception. My comments included:
`The results should be presented as confidence intervals, not significance tests. For example, the non-significant 19% adverse reactions on the test treatment compared to 12% on the standard treatment is a relative risk of adverse reaction 1.5, 95% confidence interval 0.5 to 4.6. Thus the data are compatible with more than four times as many adverse reactions on the new than on the standard treatment. For the presence of X, one in each group, the relative risk is 1.3; the 95% confidence interval is 0.08 to 20. Thus the data are compatible with more than twenty times as many Xs on the new than on the standard treatment!'
`A confidence interval for the mean difference would be much better than significance tests. A non-significant difference in 10 subjects cannot be interpreted.'
`A finding of "not significant" is meaningless in 4 or 5 subjects. Confidence intervals should be used.'
`This non-significant difference, reported as "unchanged", is proportionately greater than many significant differences in this paper. A confidence interval for the mean difference would be much better.'
My Allstat sample agreed with me, four respondents mentioning the point. Typical comments were:
'Papers where the only statistics are p-values.'
'Insisting on giving the test statistics, and refusing to give estimated effects.'
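To show the kind of calculation being asked for, here is a minimal sketch of a 95% confidence interval for a relative risk, computed on the log scale with a large-sample standard error. The counts below are hypothetical, not those of any of the papers discussed.

```python
# Sketch: a 95% confidence interval for a relative risk, calculated on the log
# scale with a large-sample standard error. The counts are hypothetical.
from math import exp, log, sqrt

a, n1 = 8, 42     # adverse reactions / subjects on the test treatment (hypothetical)
b, n2 = 5, 40     # adverse reactions / subjects on the standard treatment (hypothetical)

rr = (a / n1) / (b / n2)
se_log_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
lower = exp(log(rr) - 1.96 * se_log_rr)
upper = exp(log(rr) + 1.96 * se_log_rr)
print(f"Relative risk {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```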
Computers now print out the exact P values for most test statistics. These should be given, rather than changed to "not significant" or P>0.05. Similarly, if we have P=0.0072, we are wasting information if we report this as P<0.01. This method of presentation arises from the pre-computer era, when calculations were done by hand and P values had to be found from tables. Personally, I would quote this to one significant figure, as P=0.007, as figures after the first do not add much, but the first figure can be quite informative. Two of my 15 Lancet papers had these problems:
`A report of "p=NS" is not very informative. If significance tests must be used, the exact P value is preferable.'
`These are P=0.01, P=0.006, P=0.05, not P<0.01, P<0.006, P<0.05. In fact, the first is actually P=0.012, so what they have written is incorrect.'
Several Allstat respondents raised this issue. Their comments included:
'Using 'NS' for any p>0.05, including p=0.0501 (three replies made this point)'
'Showing a table of p-values to huge numbers of decimal places when they're significant, but not even to one place when not: 'NS' should be banished!'
'Also, statistical methods sections which say "all results were regarded as significant at the 5% level", followed by results where p<0.05 or p=NS.'
'The term "failed to achieve statistical significance"'.
'I mainly derive irritation from little things, such as "P<0.013"'
So if you want to avoid irritating the statistical referee (and you may not) you should quote your P values correctly to one significant figure.
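As a small illustration of that advice, here is a sketch of reporting a P value to one significant figure. The helper function and the example values are just for illustration.

```python
# Sketch: quoting a P value to one significant figure.
def one_sig_fig(p: float) -> str:
    return f"P={p:.1g}"

print(one_sig_fig(0.0072))   # P=0.007
print(one_sig_fig(0.012))    # P=0.01
```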
P values greatly exercised my Allstat respondents. Three complained about multiple testing:
'Carrying out hundreds of significance tests, instead of either addressing specified hypotheses, or admitting that the study is descriptive.'
'Massed p-values, like firing a blunderbuss into a fishpond.'
'Skipping of a non-significant finding on the principal outcome to concentrate on a significant result in a side issue, whether this is the infamous sub-group or some minor outcome measure.'
If we carry out many tests of significance, even if the null hypotheses are all true, we expect that 5% of them will be significant. If we then concentrate on these significant tests in our report we can give a very misleading impression. One of my favourite examples is due to Newnham et al. (1993), who randomized pregnant women to receive a series of Doppler ultrasound blood flow measurements or to control. They found a significantly higher proportion of birthweights below the 10th and 3rd centiles in the Doppler group compared to the controls (P=0.006 and P=0.02). These were only two of many comparisons and at least 35 were reported in the paper. Only these two were reported in the abstract. (Birthweight was not the intended outcome variable for the trial.) This trial was widely reported and the finding that Doppler ultrasound reduced birthweight was reported in the national news.
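A quick simulation makes the point: if we test many true null hypotheses, about 5% of the tests come out 'significant' at the 5% level. This sketch uses simulated data and two-sample t tests; the numbers have nothing to do with the Newnham trial.

```python
# Sketch: many tests of true null hypotheses give about 5% "significant" results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_per_group = 1000, 20
significant = 0
for _ in range(n_tests):
    x = rng.normal(size=n_per_group)
    y = rng.normal(size=n_per_group)      # same population, so the null hypothesis is true
    if stats.ttest_ind(x, y).pvalue < 0.05:
        significant += 1
print(f"{significant} of {n_tests} tests significant at the 5% level")
```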
Another Allstat respondent raised an interesting point:
'If you aren't already a fan you should be watching ER. Last night there was talk of a chi-squared analysis showing significance at the 0.06 level, "so we only need one more positive result"'.
I wonder how many of the television audience understood that one. Most statistical analyses assume that the observations are independent of one another. If we do not have independent observations, an analysis which requires this will be wrong. If we test each time an observation is added, the observations cannot be independent, because an observation will only be made if the previous ones did not show a significant difference. We would be doing multiple testing, and the probability of a test reaching the nominal P value of 5% if the null hypothesis were true would be much more than 5%. I doubt that people who do this would actually mention it in their paper. The final test would be presented as if it were the only one carried out. Doing this could be the result of ignorance, researchers genuinely thinking that this is a valid procedure. If the researcher knows that the procedure is not valid, it is fraud. In either case, we would end with a potentially false and misleading conclusion.
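A sketch of the sequential-testing problem, again with simulated data: testing repeatedly as observations accumulate pushes the chance of ever seeing P<0.05, when the null hypothesis is true, well above 5%.

```python
# Sketch: re-testing as observations accumulate inflates the type I error,
# even though every null hypothesis is true. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_max = 500, 50
ever_significant = 0
for _ in range(n_trials):
    x = rng.normal(size=n_max)
    y = rng.normal(size=n_max)            # null hypothesis true throughout
    for n in range(5, n_max + 1):         # re-test after each added observation
        if stats.ttest_ind(x[:n], y[:n]).pvalue < 0.05:
            ever_significant += 1
            break
print(f"Proportion of 'trials' ever significant: {ever_significant / n_trials:.2f}")
```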
I don't know quite what an Allstat respondent meant by this complaint:
'Authors who use p-value cut-offs other than <0.05, <0.01 or <0.001 and then don't attempt to justify the levels they use (I find this especially in the papers concerning large animals where there are only 3 cows and insufficient data for any conventional statistical significance at all).'
I suspect that he or she was referring to authors who regard differences as significant if P<0.10 or even higher probabilities. This can be justifiable in some circumstances. An example might be in the screening of novel chemicals for pharmaceutical activity. We put all chemicals through an initial screen intended to select some for a further, more intensive screen. It is more important to detect any which have biological activity than to avoid further testing any which do not. A high critical value of P is therefore appropriate: we accept a high type I error in order to get a low type II error. If authors wish to do this in published papers, however, they must justify it to the reader, and to the referee.
One of my respondents complained about the use of:
'"Significant" when they mean important.'
This is a difficult one. According to the Shorter Oxford Dictionary, the second meaning of 'significant' is 'important, notable', and has been since 1761. Its statistical meaning relates more to its first definition: 'full of meaning or import'. Thus, if a difference is significant in a sample this difference has meaning, because there is evidence that it exists in the population. I do not think that statisticians can really appropriate 'significant' and deny its other uses, but it's unlikely that I am going to be the statistical referee for your paper, because I do it as rarely as possible. Other statisticians may be more jealous of 'significant' than I and, in the interests of publication, I recommend avoiding its non-statistical applications. The Lancet supports this line, instructing authors to 'Avoid nontechnical uses of technical terms in statistics, such as . . . "significant" . . .'
Another respondent mentioned:
'Direct comparison of p-values.'
I think that what this person had in mind was concluding that one difference is larger or more important than another because it has a smaller P value. This is sometimes done, for example, when a change is tested in two separate groups of subjects and a difference between the P values is interpreted as evidence of a difference between the groups. This is one of my own particular bêtes noires. An example came up in one of my Lancet reviews:
`It is not correct to compare two groups by testing changes in each one separately. Significance does not depend only on magnitude, but on variability and sample size. A two sample t method should be used to compare the log ratios in the two groups.'
One of my respondents made the same point:
'People who carry out controlled clinical trials but do not carry out a controlled analysis. Instead of quoting the estimated treatment effect (active - placebo) with its standard error, they quote the "effect" in the group given active treatment (usually difference from baseline).'
In general we need only note that the P value measures the strength of the evidence that an effect exists in the population; it does not tell us much about the magnitude of that effect, and a large P value does not, in itself, mean that there is no population difference or that the difference is small.
We must compare effect sizes, not P values. A special case of this was mentioned by another Allstat respondent:
'Sub-group analyses unsupported by interaction tests.'
Sometimes authors will carry out significance tests of the same difference or relationships in different subgroups of their subjects, for example in young and old, male and female. They will then conclude that the difference exists only or mainly in the subgroups where a significant difference was found. As explained above, this conclusion does not follow from the analysis and the correct approach is to test the difference between the magnitudes of the effects in the subgroups (Altman and Matthews 1996; Matthews and Altman 1996a, 1996b; Altman and Bland 2003). This is known as a test of interaction.
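A minimal sketch of such a test of interaction, along the lines of Altman and Bland (2003): compare the two effect estimates directly, using the standard error of their difference. The estimates and standard errors below are hypothetical.

```python
# Sketch: a test of interaction comparing two estimated effects directly.
from math import sqrt
from scipy import stats

d1, se1 = 5.0, 2.0     # estimated effect and SE in subgroup 1 (hypothetical)
d2, se2 = 1.0, 2.5     # estimated effect and SE in subgroup 2 (hypothetical)

diff = d1 - d2
se_diff = sqrt(se1 ** 2 + se2 ** 2)
z = diff / se_diff
p = 2 * stats.norm.sf(abs(z))
print(f"Difference between effects {diff:.1f}, "
      f"95% CI {diff - 1.96 * se_diff:.1f} to {diff + 1.96 * se_diff:.1f}, P = {p:.1g}")
```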
Referees' criticisms of the study design are the most difficult to deal with. Criticisms of the presentation, analysis, and interpretation of the data can be remedied fairly easily, because all these things can be changed. Once the study has been carried out and the data collected, it cannot be redesigned. It is therefore essential that the design be correct to begin with. Statisticians are forever saying that they should be consulted before the project begins, although, as we are elusive beasts, this is often pretty difficult to achieve.
In my Lancet reports there were only two design issues. The first was a treatment comparison using observational data:
`From a statistical viewpoint, this is pretty awful. I don't think we should have non-randomised clinical trials in the Lancet.'
I think that we have now got past the argument about whether randomised trials are effective or ethical and want to know what the randomised trial evidence for a treatment is. I do not think that randomized trials are the only source of useful information, but authors must be aware of the principles of randomization and have a pretty clear idea of why they are using data from non-randomized subjects and what the limitations of such data are. Two of my Allstat respondents mentioned randomisation. One complained about:
'The adamant refusal of medical investigators to use randomization and random sampling.'
I found this surprising, as in my experience medical investigators are usually very ready to use randomization and there are vast numbers of randomized trials in the literature. However, experience can vary greatly and this informant may have been working in an area of application where trials are few. Sometimes the perspective of others can be startling. In their textbook Using and understanding medical statistics (Matthews and Farewell 1988) the authors wait until chapter 8 before mentioning the Normal distribution, saying that continuous data are rarely encountered in medical research! They devote three chapters to survival analysis. Their experience in cancer research had certainly given them an entirely different perspective from mine, as I cut my statistical teeth on peak expiratory flow and forced expiratory volume. When I read that for the first time, my thought was: 'Ever heard of blood pressure?' (Despite this, it's a good book.) However, I entirely agreed with my respondent about random sampling. This is almost unknown in medicine, though usually there are good reasons for this.
Another respondent mentioned:
'Claims that a study is randomised or blinded when in fact allocation has been by hospital number, date of birth, day of week etc, and blinding has been patently superficial and ineffective.'
This is spot on. People who use systematic allocation of this type (hospital number, etc.) sometimes argue that this is random, because the hospital number is not going to be related to the patients' prognosis. But when Bradford Hill first advocated randomisation in clinical trials, it was firstly to avoid such allocation schemes (Chalmers 1999). If clinicians admitting patients to a trial know what treatment the patient will receive, as they will in these systematic systems, this may bias the decision to admit the patient or not. Schulz et al. (1995) have shown that when the admitting clinician is aware of the treatment patients will receive, the treatment effect is larger, on average, than when treatment is concealed. This implies that such open allocation tends to be biased. This might arise, for example, because clinicians might judge a potential trial recruit to be too frail for the trial treatment, but not for the control treatment. They might then decide to recruit the patient to the trial if the patient would receive the control treatment, but not if the patient would receive the trial treatment. Thus a bias in favour of the trial treatment would be built in. Schulz et al. (1995) also showed that trials where the investigators were not blinded to treatment had larger average treatment effects than trials where investigators were blinded. Sometimes blinding is impossible, sometimes it is difficult, but we must always be aware of its importance and the potential for bias when it is not used. I think the referee wants to see that the authors understand this and are suitably cautious in their interpretation as a result. A good point for the discussion.
The other design issue which came up was sample size:
`This is a small trial of two similar regimes. How was the sample size decided? Was there a power calculation? What difference were the authors hoping to detect?'
I have had experience of sample size calculations being removed from papers to shorten them, at the request of the journal. I think we should resist such shortsighted editing, but I think that in this case no sample size calculations, other than feasibility, had been done. I doubted that even had there been the modest treatment effect which they might have hoped for, the chance of getting a significant difference in such a small trial would have been much above 5%. (It is 5% even if there is no difference at all.)
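For readers who have never seen one, here is a sketch of the simplest kind of power calculation the review asks about: the number of subjects per group needed to detect a given difference between two means at a two-sided 5% significance level with 80% power. The difference and standard deviation are hypothetical.

```python
# Sketch: sample size per group to detect a difference 'delta' between two means.
from math import ceil
from scipy.stats import norm

delta, sd = 5.0, 10.0            # hypothetical difference and standard deviation
alpha, power = 0.05, 0.80
z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)

n_per_group = ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)
print(f"About {n_per_group} subjects per group")
```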
On the subject of sample size, I had no example in my Lancet series, but another thing I would pounce on would be a sample size calculation for a cluster randomized trial which ignored the clustering. I would treat analysis which ignored the clustering in the same way. See other talks: Cluster designs: a personal view, Sample size in guidelines trials, and Cluster randomised trials in the medical literature.
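A sketch of the adjustment being referred to: a sample size calculated for individual randomisation is inflated by the design effect 1 + (m - 1) x ICC, where m is the average cluster size and ICC the intracluster correlation coefficient. All numbers below are hypothetical.

```python
# Sketch: inflating a sample size for a cluster randomised design.
n_individual = 126      # subjects needed under individual randomisation (hypothetical)
m = 20                  # average cluster size (hypothetical)
icc = 0.05              # intracluster correlation coefficient (hypothetical)

design_effect = 1 + (m - 1) * icc
n_needed = round(n_individual * design_effect)
print(f"Design effect {design_effect:.2f}; about {n_needed} subjects needed in total")
```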
Standard deviations and standard errors are the basic currency of statistics, familiar to most researchers, yet they seem to cause a lot of difficulty. One problem is that authors often quote them without specifying what they are quoting. I had two examples in my 15 papers:
`I presume the numbers in brackets are standard deviations. The authors should say so.'
`Are these ± numbers standard deviations, standard errors or confidence intervals?'
One of my respondents also mentioned this:
'± notation without any interpretation of whether it refers to se, sd, or CIs.'
Actually, I find the use of the '±' symbol itself rather misleading. If we quote 'mean ± SD', as researchers often do, what does this mean? We are not saying that the observations all lie between mean - SD and mean + SD. In fact, we expect about one third of them to be outside these limits. Similarly, if we quote 'mean ± SE' we do not actually wish to imply that the population mean lies between mean - SE and mean + SE. This would only be true for 2/3 of samples. I think that standard deviations and standard errors are best placed in parentheses: mean (SD). In one of my papers this ± notation seems to have gone rather haywire:
`There is something wrong with the presentation of X. We have "mean X ... was 51.9 ± 7.9 (range)". Is 7.9 the standard deviation? Have the authors omitted the range by mistake?'
Or did they perhaps mean that the minimum value was 51.9 - 7.9 and the maximum 51.9 + 7.9? This seems most unlikely.
Sometimes the main comparison in a paper is for the same subjects under different conditions, e.g. before and after an intervention. A paired t test might be used. This test uses the mean, standard deviation and standard error of the mean for the differences. Authors often quote the P value from a paired test, but quote the standard deviation or standard error for each condition separately, instead of for differences within the subject. I had a sample of this:
`Most of the standard errors given are irrelevant, as it is the change within subjects which is important, and the standard error of the mean difference is the relevant figure.'
One of my respondents complained about the same thing:
'Confidence intervals (or SE's) on group means, rather than on comparisons.'
If the correct standard deviations and standard errors are given, it is much easier
for other workers to incorporate your results in meta-analysis, to compare them with
their own data, and so on.
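A sketch of the paired analysis being described, with hypothetical before-and-after measurements: the summary that matters is the mean of the within-subject differences, with its own standard error and confidence interval.

```python
# Sketch: a paired before/after comparison summarised by the mean and standard
# error of the within-subject differences. The measurements are hypothetical.
import numpy as np
from scipy import stats

before = np.array([12.1, 10.4, 13.2, 11.8, 9.9, 12.5, 10.7, 11.3])
after = np.array([12.9, 10.8, 13.1, 12.6, 10.5, 13.4, 11.2, 11.9])

diff = after - before
mean_diff = diff.mean()
se_diff = diff.std(ddof=1) / np.sqrt(diff.size)
t_crit = stats.t.ppf(0.975, df=diff.size - 1)
print(f"Mean difference {mean_diff:.2f}, "
      f"95% CI {mean_diff - t_crit * se_diff:.2f} to {mean_diff + t_crit * se_diff:.2f}")
```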
I had very few comments specifically on presentation in my Lancet reviews, although
my Allstat respondents had quite a lot to say. I made a suggestion that the zero should
be included on the y-axis of a graph, and I made this point about a graph:
`I think a scatter plot, showing the actual data, would be much more informative. Are
the thin lines standard errors?'
On similar lines, one of my Allstat respondents complained about:
'Dynamite pushers, skyscrapers with TV-aerials'.
What he had in mind, and on which I had been commenting, was a graph like Figure 1:
Figure 1. Bar graph showing capillary density (per mm2) in the feet of ulcerated patients
and a healthy control group (data, but not graph, supplied by Marc Lamah).
You see graphs like this frequently in journals and it may come as a surprise to researchers
that many statisticians dislike them intensely. There are several reasons for this.
My Allstat respondents complained about:
'Summary graphs with less information than the original data.'
A scatter diagram, as in Figure 2, shows all of the data rather than just two summary statistics per group.
Figure 2. Scatter graph of the capillary density data.
If we want the summary statistics as well, they can be added, as in Figure 3.
Figure 3. Scatter graph of the capillary density data with mean and standard deviation added.
This now shows all the information in Figure 1 and Figure 2.
If there are a large number of points, the scatter diagram will become a mass of
indistinguishable points. In this case we can use box and whisker plots
(see Bland 2000a), as in Figure 4.
Figure 4. Box and whisker graph of the capillary density data.
These do not give all the information in a scatter diagram, but they do show central
tendency, spread and the shape of the distribution. We can see from Figure 4 that the
distributions are roughly symmetrical, apart from one rather extreme point, that the
control group tend to have higher capillary density than the ulcer group, and that the
data are suitable for the t distribution to be applied.
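For those who want to try it, here is a sketch of this kind of display: the individual points plotted alongside a box and whisker summary, instead of a bar chart with error bars. The data are simulated and hypothetical, not the capillary density data of the figures.

```python
# Sketch: plot the individual observations with a box and whisker summary.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
ulcer = rng.normal(40, 8, size=20)       # hypothetical capillary densities
control = rng.normal(50, 8, size=20)

fig, ax = plt.subplots()
ax.boxplot([ulcer, control])
for i, group in enumerate([ulcer, control], start=1):
    jitter = rng.uniform(-0.08, 0.08, size=group.size)
    ax.scatter(np.full(group.size, float(i)) + jitter, group, alpha=0.6, s=15)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Ulcer", "Control"])
ax.set_ylabel("Capillary density (per mm2)")
plt.show()
```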
A second objection to graphs such
as Figure 1, and a common complaint which I had made in my review, is that authors do not always make clear
what the vertical lines represent: standard deviations, standard errors, or confidence
intervals, an irritation which I mentioned above concerning '±' notation. A third objection
to the bar graph shown in Figure 1 is that it has only four numbers in it, which could be
reported much more efficiently in the text. Two of my respondents made similar points:
'Using bar charts to show that the proportion of women in the study was 55% and men 45%,
and similar low information ways of using ink and space.' (Two similar replies.)
On the other hand, one respondent complained about:
'Tables of data with (literally) hundreds of figures when the information content is
minimal and a graph would be more useful.'
The Lancet instructs its authors to 'Use graphs as an alternative to tables with many
entries'. Personally, I am usually inclined to tables rather than graphs. I think
that this bias (yes, I have them!) arises because I do not have a strong visual imagination
or ability to think pictorially. However, I also think that the argument that other
researchers can make use of your findings more easily if they are presented numerically
rather than graphically is a forceful one, and this should lead us to choose numbers when
in doubt.
I have no problems with the view of my respondents who were irritated by authors:
'Giving far too many decimal places.' (3 replies).
The week before writing this, I reviewed a paper which gave all P values, F statistics,
and even degrees of freedom to four decimal places, e.g. 'F=1.9367 with 34.3452 and
45.3298 degrees of freedom, P=0.0189'. This used an approximation to the F distribution
which involved changing the degrees of freedom, making them fractional. Now I doubt
that the F statistic conveys much useful information anyway, and all those extra decimal
places certainly do not. There is no point in reporting F, t, or chi-squared statistics to more
than two decimal places. I do not think that anything would be lost by reducing the
decimal places to two here: 'F=1.94 with 34.35 and 45.33 degrees of freedom, P=0.019'.
Indeed, I would render the P value to one significant figure: 'P=0.02'. Only the first
non-zero number and the number of zeros preceding it are important. The reason for this
profligate and unconsidered reporting of many decimal places must be that computer
programs deliver them. Programmers try to give the users everything they could possibly
want and if the program calculates the F statistic to seven significant figures, why not
print them out? But this is no reason for the researcher to burden his readers with them.
They often make text and tables much more difficult to read. Correlation coefficients
are a frequent example. Programs often print them to four decimal places, but is there
really any important difference between 'r=0.3421' and 'r=0.3379'? I think that 'r=0.34'
would do very nicely for both and make the meaning of text and tables easier to grasp.
One respondent complained about something which I also dislike:
'Using multiple crosshatched three-dimensional bars' (2 replies).
I find that three-dimensional effects seldom make a graph clearer. The effect is usually
to make it more difficult to read.
Many statistical methods require the data to meet some assumptions, such as that data
follow a Normal distribution with uniform variance. Such assumptions are often not
checked, particularly for t methods. The statistical referee can often detect skewness
from the data and graphs given in the paper (Altman and Bland 1996).
One giveaway is a
standard deviation which is greater than half the mean, which implies that two standard
deviations below the mean would be a negative number. For most measurements negative
values are impossible, so we could not have any observations less than mean minus two
standard deviations, yet 2.5% of observations from a Normal distribution would be found
there. Such data cannot therefore be from a Normal distribution. Another giveaway is to give
mean or median and quartiles or extreme values. If the mean or median is not close to
the centre of the interval determined by the limits, we should suspect that the
distribution is skew. Yet another betrayer of non-Normal distributions can arise when
the mean and standard deviation or standard error are calculated separately for several
different groups, then given in a table or graph. The standard deviation should not be
related to the mean. Often we see that groups with large means also have large standard
deviations. A scatter diagram of the data, while highly desirable, can also reveal
deviations from the assumptions of statistical methods. I had three examples of obvious
deviations from assumptions in my 15 papers:
`Are the thin lines standard errors? If so, they suggest that the data are not Normal,
which casts doubt on the F test.'
`I would be surprised if these measurements followed Normal distributions. Figure 2
suggests that this is not the case, as the distribution of X looks positively skew.
The authors should check the distributions of their variables, and use a logarithmic
transformation where appropriate.'
`The data are very skewed, positively for X (mean 17.6, range 16.0-21.7) and negatively
for Y (mean 8.6, range 4.9-9.4). This is produced by the selection criteria for the
trial, which accepts subjects with X > 16.0 and Y < 9.5. No attempt is made to allow
for this in the analyses, which assume that data follow Normal distributions.'
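As a small illustration of the checks being asked for, here is a sketch using simulated, positively skewed data: the skewness is calculated on the original scale and again after a logarithmic transformation.

```python
# Sketch: checking a positively skewed variable and the effect of a log
# transformation. The data are simulated and hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.lognormal(mean=1.0, sigma=0.6, size=40)

print("Skewness on the original scale:", round(float(stats.skew(x)), 2))
print("Skewness after log transformation:", round(float(stats.skew(np.log(x))), 2))
```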
To my surprise, only one of my respondents mentioned this:
'Authors who don't attempt to check the normality of their data and use normal
theory with clearly non-normal data.'
The Lancet specifies that authors should: 'Put a general description of methods
in the Methods section. When data are summarized in the Results section, specify
the statistical methods used to analyze them.' This is good advice. It is certainly
annoying when authors do not tell the reader what statistical method is being used
and I had an instance of this in my 15 reviews, in which I complained that:
`The statistical test used should be stated.'
My Allstat respondents thought this was an important problem, complaining about:
'Authors who assume that the description of the statistics is so unimportant that
they don't actually give any information at all' (5 similar replies).
One had a specific complaint about authors:
'Stating only that "statistical analysis was done using x computer package"'.
Telling us which package was used is important, as they are not all the same and
many statistical methods can be implemented in different ways which may give
different answers. Indeed, the Lancet asks for it: 'Specify any general-use
computer programs used'. But it is not enough to tell us what is being done.
In mathematical language, we would say that it is necessary but not sufficient.
This reported statistical methods section deserves to become a classic of pointless
minimalism:
'The analysis was performed on an IBM486, under MSDOS'
A less frequent, but also irritating, practice is not using the methods stated
in the method section of the paper. It is easy to do this, as papers often go
through many drafts, with parts being cut out and new ones inserted, but it is
annoying when an obscure method is referenced and the referee spends time looking it
up only to find that this time had been wasted. I had an example of this in my 15 papers:
`I do not think Hotelling's t test is actually used anywhere.'
An Allstat respondent made the same point:
'Reference in the methods section to analyses undertaken but with no results
appearing anywhere in the report.'
This comment from my reviews combined a method reported in the method section
which was not used with not saying what was done in the analyses which were reported:
`I think that tests other than paired t tests were done. I can't actually find
any data suitable for a paired t test. ... the appropriate method would be Fisher's
exact test, which gives P=0.2 ... this should be a rank correlation. I get
tau=0.37, P=0.08 . . . The appropriate method would be Fisher's exact test, which
gives P=0.09.'
I have no idea what they had actually done, but I was pretty confident that whatever
it was, was wrong. Sometimes I had to pinch myself to reassure myself that this was
not a ghastly nightmare, and that people had really submitted this stuff to the
world's most prestigious medical journal.
Baseline characteristics deserve special mention because two common parastatistical
practices relate to them. Baseline characteristics are those which we record after
subjects have been recruited to the trial but before treatment begins. There are
several good reasons for making and reporting baseline measurements. The first of
these is obvious: we want to describe the population which our trial subjects represent.
The second is that we want to check and demonstrate that the randomization process has
worked. This is not always the case.
I was asked to advise on a trial where a programming
error had resulted in almost all the older subjects being allocated to one arm of the trial
and almost all the younger subjects to the other. My advice had to be 'Do it again'
(MacArthur 2001).
The third is that we may want to adjust the treatment difference
for prognostic variables. If a variable measured at baseline is a strong predictor of the
outcome of treatment, adjusting for it statistically may reveal treatment effects
which were masked. Altman (1991) gives a good example.
The first common parastatistical mistake is to carry out tests of significance on the
baseline variables between the randomized treatment groups. Randomization produces
treatment groups which are random samples from the same population. Therefore, any
null hypothesis that states that there is no difference between the populations from
which the groups come is true. Any significant differences between the treatment groups
have arisen by chance; they are type I errors. I had two examples of this in my 15 reviews:
`The tests of significance at baseline should not be done. If the subjects are randomized,
they come from the same population and the null hypothesis is true. There is no reason to test it.'
`There is no need to test the difference between the groups before the withdrawal of
treatment. Because they are randomised, they are from the same population until treatment
is changed, and hence the null hypotheses are true.'
One of my Allstat respondents mentioned this, too, complaining about:
'Significance testing of baseline variables in RCTs.'
The second parastatistical error is that, having tested for differences between baseline
characteristics, authors adjust the treatment difference in the outcome measurement only for
those baseline variables which showed a significant difference, and not for
any others. It is not the chance relationship of baseline variables to treatment which
is important, but their relationship to the outcome variable. Even when the treatment
groups are exactly balanced for the prognostic variable, adjusting for it statistically
should remove a lot of variability from the error term and so make confidence intervals
narrower and possibly make P values smaller. I had a good example of this approach in
one of my reviews:
`The statement that adjustment for baseline characteristics is not needed because
baseline differences are not significant is quite wrong. Such adjustments may reduce
the variability and so improve the power.'
An Allstat respondent made the same point, complaining about authors:
'Not reporting analyses adjusted for baseline values of prognostic covariates.'
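A sketch of the kind of adjusted analysis being recommended, using simulated data: the treatment effect in a randomised comparison is estimated with and without adjustment for a baseline covariate (analysis of covariance). All names and numbers below are hypothetical.

```python
# Sketch: adjusting a randomised treatment comparison for a baseline covariate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 60
baseline = rng.normal(100, 15, size=n)
treatment = rng.integers(0, 2, size=n)           # randomised allocation, 0 or 1
outcome = 0.6 * baseline + 4 * treatment + rng.normal(0, 8, size=n)
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "baseline": baseline})

unadjusted = smf.ols("outcome ~ treatment", data=df).fit()
adjusted = smf.ols("outcome ~ treatment + baseline", data=df).fit()
print("Unadjusted effect:", round(unadjusted.params["treatment"], 2),
      "SE", round(unadjusted.bse["treatment"], 2))
print("Adjusted effect:  ", round(adjusted.params["treatment"], 2),
      "SE", round(adjusted.bse["treatment"], 2))
```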
A lot of other issues came up once or twice, either in my own reviews or from my
correspondents. I think that this represents the tip of a very large iceberg of
possible mistakes on the part of researchers. I present them in the hope that my
readers will in future avoid these particular ones at any rate.
An occasional mistake is to include repeated measurements on same subject as if
they were different subjects. The data are then analysed using methods which
assume that the observations are independent. This can have the effect of making
P values too small and confidence intervals too narrow. I had a couple of examples
in my reviews:
`It is wrong to mix multiple observations from different subjects in this way
(Bland and Altman 1994). An appropriate method is described
by Bland and Altman (1995).'
`It is not clear why two subjects were measured twice. Inspection of Table 1
suggests that the intention was to measure at 18 hours but that subject 3 was tested
additionally at 2 hours and subject 5 at 48 hours. This should be clarified. Repeat
observations on the same subject and observations on different subjects cannot be
mixed as if they were all independent. I suggest that the first observation on
subject 3 and the second on subject 5 should be omitted from the statistical analysis,
as they are at very different times.'
The same problem can occur on a larger scale:
`However, they ignore the fact that these 21 groups of subjects are from 9 different
trials, and analyse the data as if they are all from the same population.'
Again, this would have the effect of making the P values too small and the confidence
intervals too narrow. There are well-established methods of meta-analysis
(see, for example, Bland 2000b)
for carrying out the combination of data from different trials and authors should use them.
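A minimal sketch of the simplest such method, an inverse-variance (fixed effect) meta-analysis of per-trial estimates, rather than pooling the subjects as if they came from one population. The estimates and standard errors below are hypothetical.

```python
# Sketch: inverse-variance (fixed effect) combination of per-trial estimates.
import numpy as np

effects = np.array([0.40, 0.10, 0.55, 0.25])   # treatment effect in each trial (hypothetical)
ses = np.array([0.20, 0.15, 0.30, 0.25])       # standard error in each trial (hypothetical)

w = 1 / ses ** 2
pooled = np.sum(w * effects) / np.sum(w)
se_pooled = 1 / np.sqrt(np.sum(w))
print(f"Pooled effect {pooled:.2f}, "
      f"95% CI {pooled - 1.96 * se_pooled:.2f} to {pooled + 1.96 * se_pooled:.2f}")
```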
Significance test methods based on rank order, such as the Mann Whitney and Wilcoxon
tests and those associated with the Spearman and Kendall rank correlation coefficients,
are inappropriate when samples are very small. One cannot have a significant two-sided
test at the 5% level when samples are smaller than two groups of four for the Mann Whitney
U test or less than six for the Wilcoxon paired test or the rank correlation coefficients.
Each possible rank ordering has probability greater than 0.05. Hence rank methods on very
small samples are inevitably not significant and there is no point in using them. I made
this point in one of my reviews:
`Rank methods are inappropriate for such small samples as they cannot detect any
differences, no matter how large the difference is.'
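The point is easy to demonstrate: with two groups of three, even complete separation of the groups cannot give P below 0.05 in an exact Mann Whitney test. The values are hypothetical.

```python
# Sketch: the smallest possible two-sided P for an exact Mann Whitney U test
# with two groups of three, however extreme the data.
from scipy import stats

x = [1, 2, 3]
y = [10, 11, 12]
result = stats.mannwhitneyu(x, y, alternative="two-sided", method="exact")
print(f"Most extreme possible two-sided P = {result.pvalue:.2f}")
```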
Curiously, I have been asked by publishers to review at least three proposals for
introductory statistics text-books (not written by statisticians) which contained
the statement that when we have fewer than six observations we should use non-parametric
methods, because parametric methods such as t tests are inappropriate, it being
impossible to verify the Normal distribution assumptions. The opposite is the case,
because parametric methods can produce significant differences for very small samples
although rank-based methods cannot. I wish I knew the source of this often-repeated idea.
As for checking the Normal assumption, we often have a good idea from other data whether
this is reasonable.
Correlation coefficients can cause a problem because there is an assumption that the data
are a representative (i.e. random) sample of the population and that both variables are
random variables. They should not be used when the values of one variable are set by
the experimenter. I had two instances of this in my reviews:
` . . . Correlation is inappropriate when one of the variables is fixed by the investigator
(dose and time) . . . One and two sample t methods and regression should be used.'
`The statement that there is no significant correlation between time of measurement and
X is meaningless. The times are almost equal except for the duplicate measurements.
The ratio is much higher for the early measurement and much lower for the late
measurement, suggesting that there is a possibility of a strong relationship with time.'
One of my respondents, somewhat enigmatically, cited:
'Spurious use of correlation and regression (oh dear not again!)'
Statisticians mostly have a background in mathematics, as do I, and have been trained
for many years to think logically. Indeed, a colleague, Shirley Beresford, once remarked
that she thought that the main contribution of statisticians in medical research was not
to carry out statistical analyses but 'to inject a bit of logic into the situation'. So
imbued with logic are we that we can forget that this is not the only way of thinking and
is not the main method of thinking for most people, nor is it always the most useful. Thus
to us this one is jaw-dropping:
`The comparisons of X means between the low X and high X groups are not useful. If we
divide subjects according to X and then compare the mean X between the two groups, of course
it will be significant. We could do the same thing with their telephone numbers.'
Of course, the null hypothesis that mean X will be the same in a group chosen to have X
below a cut-off and a group chosen to have X above the cut-off is inevitably false. As
we know this, there is no point in testing it. I presume the authors simply split the
subjects into two groups then tested everything between them. One of my Allstat
respondents made a similar point about:
'Dichotomising continuous variables especially if they identify 'responders' and
'non-responders' using these variables.'
Splitting the subjects into two groups using a continuous variable reduces the
amount of information which we have. P values may become larger and we may miss
important relationships. Some researchers might be tempted to split the sample
not at an arbitrary cut-off, such as the overall mean, but to choose a cut-off to
minimise a P value and make a relationship significant. This is a real misuse of
statistics and will produce misleading results.
The authors of one of the Lancet papers were particularly unlucky (or lucky,
depending how you look at it) because they were applying my own work on
agreement
between methods of measurement and received this comment:
'I suggest replacing the term "95% confidence intervals of agreement" by "95%
limits of agreement". The "95% limits of agreement" of Bland and Altman are not
a confidence interval, but two point estimates.'
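For reference, a minimal sketch of the 95% limits of agreement themselves: the mean difference between the two methods plus or minus 1.96 standard deviations of the differences. The paired measurements below are hypothetical.

```python
# Sketch: 95% limits of agreement between two methods of measurement.
import numpy as np

method_a = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.7, 9.5, 10.4])
method_b = np.array([10.6, 11.2, 10.1, 12.5, 11.3, 11.4, 9.9, 10.9])

d = method_a - method_b
mean_d, sd_d = d.mean(), d.std(ddof=1)
print(f"Mean difference {mean_d:.2f}, 95% limits of agreement "
      f"{mean_d - 1.96 * sd_d:.2f} to {mean_d + 1.96 * sd_d:.2f}")
```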
My Allstat respondents came up with a lot more. One mentioned:
'Chi-square test analyses of ordered categorical data.'
What was meant is that we often have categorical data where the categories are
ordered in some way, such as physical condition being classified as 'poor', 'fair',
'good' or 'excellent'. The usual chi-squared test for a contingency table ignores
this ordering and tests the null hypothesis of no relationship of any sort between
the variables. This is usually a mistake, but an
understandable one. Many textbooks use examples with ordered categories to
illustrate chi-squared tests.
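In the absence of a real example, here is a sketch with hypothetical counts: two groups classified into ordered categories, analysed first with the ordinary chi-squared test and then with a chi-squared test for trend (1 degree of freedom), which uses the ordering.

```python
# Sketch: ordinary chi-squared test versus a chi-squared test for trend on
# ordered categories. The counts are hypothetical.
import numpy as np
from scipy import stats

# rows: group A, group B; columns: poor, fair, good, excellent
table = np.array([[10, 12, 15, 13],
                  [ 5, 10, 16, 19]])

chi2_stat, p_ordinary, dof, expected = stats.chi2_contingency(table)
print(f"Ordinary chi-squared test: P = {p_ordinary:.2f}")

scores = np.array([1, 2, 3, 4])      # scores for the ordered categories
n = table.sum(axis=0)                # column totals
r = table[0]                         # group A counts in each category
N, R = n.sum(), r.sum()
p_bar = R / N
numerator = (np.sum(r * scores) - R * np.sum(n * scores) / N) ** 2
denominator = p_bar * (1 - p_bar) * (np.sum(n * scores ** 2) - np.sum(n * scores) ** 2 / N)
chi2_trend = numerator / denominator
print(f"Chi-squared test for trend: P = {stats.chi2.sf(chi2_trend, df=1):.2f}")
```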
Another gave the example of
'Rate per 1000 person-years = 3 (95% CI -3 to 9).'
The rate of something per year cannot be negative, so the calculation of the
confidence interval has produced an impossible lower limit. This happens because
researchers apply methods designed for the analysis of large samples or large numbers
of events to small samples or small numbers of events. They calculate standard errors
and then calculate the confidence interval using the Normal distribution, as the
observed value ± 1.96 standard errors. But if the number of events or the sample
size is not large enough for this Normal approximation we can get negative lower
limits. The same thing can happen with proportions close to the top of their
range of possible values, such as sensitivities and specificities, which are sometimes
given confidence intervals with upper limits above 100%. There are better approximations
and exact methods which can be used in these cases to give confidence intervals which
do not include impossible values. Even zero would be an impossible lower limit for
the rate in the example, for if in the sample we had observed a case, as we must to
get a rate of 3 per 1000 person-years, then the rate in the population cannot be zero.
We sometimes see confidence intervals like the one given presented as '3 (95% CI 0 to 9).'
This happens because researchers calculate the interval as -3 to 9, recognise that -3 is
impossible, and replace it with zero.
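A sketch of the difference, with hypothetical data chosen to give a rate of about 3 per 1000 person-years: the Normal approximation gives a negative lower limit, while an exact interval for the Poisson mean does not.

```python
# Sketch: exact Poisson confidence interval versus the Normal approximation.
# Hypothetical data: 1 event in 333 person-years.
from math import sqrt
from scipy.stats import chi2

events, person_years = 1, 333
rate = 1000 * events / person_years

se = 1000 * sqrt(events) / person_years                 # large-sample standard error
print(f"Normal approximation: {rate - 1.96 * se:.1f} to {rate + 1.96 * se:.1f} per 1000")

lower = chi2.ppf(0.025, 2 * events) / 2 * 1000 / person_years
upper = chi2.ppf(0.975, 2 * (events + 1)) / 2 * 1000 / person_years
print(f"Exact interval: {lower:.2f} to {upper:.1f} per 1000")
```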
My respondents made a couple of general points about the way statistics is carried out
in medical research. One complained about:
'Papers where the statistical methods are copied from a previous paper in the field,
which was in turn copied from a previous paper, which was in turn . . .'
This undoubtedly happens, and most statisticians have had the experience of researchers
who say that a published paper had used a particular method of which the statistician
disapproves, and was published, so why shouldn't they? Another respondent complained about:
'Doctors who don't realise that statistics is an advancing science; and the best methods
of 20 years ago are not always the best methods of today.'
Well, I think that there are plenty of statisticians in this category, too, and I have
no doubt that I am guilty of this from time to time. I do not think we can expect
researchers to keep up with what is happening in statistics as well as in their own
field. Perhaps, though, we can expect them to embrace a new and better technique
when the referee has pointed it out.
One despondent respondent commented:
'There is no hope, at times.'
Some of my respondents complained about authors' attitudes to statisticians. These included:
'Papers which show no sign of having had input from a statistician.'
I can sympathise with this, but statisticians can be hard to find for many researchers.
The trouble is, you don't know what you don't know, so it is hard to spot your own mistakes
or to realise that you need help. I think that it should be much easier for researchers
to get not just statistical advice but also collaboration. Trying to teach doctors how
to analyse their own data is very inefficient. It requires a different way of thinking
from medicine, and few people can do both. It is much better to train statisticians to
collaborate with them. An additional advantage, unfortunately, is that we do not pay the
statistician as much as the doctor, so it makes economic sense too. Another respondent
felt that statisticians did not get the prominence they deserved:
'Acknowledgements to a statistician who clearly did all the analysis and should be on the paper.'
Researchers sometimes ask me whether I would like to be acknowledged for my help. I usually
paraphrase Oscar Wilde and tell them that there is only one thing worse than being acknowledged,
and that is not being acknowledged. I think that the role of the statistician in research is
often worthy of authorship, but when I think I am entitled to be an author I am usually
welcomed. I think that statisticians have to make clear to researchers who consult them
that they have to have something to show for the time they spend in advisory work and that
if they make a real contribution, they should be included in the author list. On the other
hand, I often refuse authorship because I feel that I have not done enough or could not defend
the paper.
Two respondents commented on the attitude of authors to statistical referees:
'People who ignore referees comments and send [the] paper to another journal.'
Sometimes this is all an author can do, but I agree that usually authors should
take note of what referees say. If, as can happen, the referee has missed the
point of the paper entirely, the author should ask why and see how the point can
be clarified. Another respondent mentioned:
'The view of many doctors that any comment made by a statistician regarding the
quality of the design must by definition be niggling and unimportant.'
I have been accused of being an academic who does not understand the real world of
life and death in which doctors operate. This may be true, but so what? I understand
something about the world of research and its interpretation. On the whole, though, I get
on very well with the medical profession and have found them warmly welcoming.
Some respondents did not answer my question about what researchers did to annoy referees,
but got a few things off their chests about what reviewers did to annoy authors. One
complained about:
'Making comments which you know are a matter of opinion and not fact without declaring
them as such.'
This is fair enough. If a referee knows that something is only a matter of opinion,
they should not condemn others for disagreeing. Another complained about referees:
'Suggesting extensions to analyses which you know will involve far more work than is
justified by any likely improvement to the analysis.'
If a referee really did know this, then the complaint would be justified. Another
respondent did not like referees:
'Taking far more time to review a manuscript than is reasonable.'
Mea culpa to that. Refereeing is a difficult task for which one gets little or
no reward and which competes for time with the work for which the statistician is paid.
Some journals do pay a small fee, but it could not possibly compensate for the time spent
in understanding a paper and finding the holes in it. However, I will try to do better.
Another respondent disliked referees:
'Using the anonymity usually afforded to pursue your own interests.'
My own experience as a statistical referee is that I am not remotely interested in the
papers which I am sent and I am not clear how I could pursue my own interests by impeding
their publication. This is more likely to be a complaint about specialist referees who
are working in the same area.
Another respondent wrote:
'I am giving a pet hate of my own about statistical referees. It is the apparently
absolute conviction that their own method of dealing with a data set, whether it be by
confidence intervals for differences between groups, their favourite (and usually
obscure) measure of agreement, or idiosyncratic ways of normalising data before analysis,
is the only right and proper one. In fact, as we all know, a collection of statisticians
represents a variance of at least two standard deviations, and they agree to an even
lesser extent than psychiatrists. So let's have a bit more humility, please.'
I wondered if the comment about the measure of agreement was a dig at myself. I am
quite keen on confidence intervals for differences, too. However, it is certainly true
that there is often more than one acceptable way to analyse data. I am irritated by
referees who always insist on nonparametric methods because they do not believe that
any data follow a Normal distribution, and by those who always insist that nonparametric
methods are replaced by parametric ones.
When I first gave this talk, without the Allstat sample, one of my audience said
that he did not think that any of the things I had mentioned really upset me. He
thought that what really annoyed me was statistics not being taken seriously by researchers.
I did not think this was the case. I think that what really upset me about this
refereeing experience was that there were so many errors in so few papers, and in
papers submitted to one of the world's most prestigious medical journals. The
journal's own guidelines were ignored. Nothing about most of these papers
suggested that the authors had read them.
This suggests a lack of care about research, regarding it as an unimportant activity
which does not merit the effort which one hopes these medical researchers put into other
aspects of their work. This matters. Incorrect analysis may lead to incorrect conclusions.
Incorrect conclusions may lead to incorrect treatments and advice to patients.
People can die.
We can draw a few tentative conclusions from this study. The things which
should be avoided above all are: interpreting 'not significant' as meaning 'no difference';
reporting significance tests without estimates and confidence intervals; carrying out
significance tests on baseline characteristics in randomised trials; and failing to say
what statistical methods were used.
A good aid to writing up clinical trials, and worth reading anyway,
is the CONSORT statement (Moher et al., 2001),
a template for doing this developed by a
group of statisticians and trialists. If you follow this you should sail
through the refereeing process.
I'll finish this talk with three comments from my Lancet reviews:
`The statistics are all wrong but it should be fairly easy to put them right.
What a huge number of authors and none of them understand statistics!'
`Why do they do a totally statistical project without a statistician?
I suggest they get one!'
And just to show that not all my 15 reviews were negative:
`My comments are very minor, not enough to make me rate any part of the paper as inadequate.
I like it.'
I thank Donald Singer for first suggesting the topic, the editors of the Lancet for providing
such rich source material, and my Allstat respondents, including Colin Chalmers, Rick Chappell,
Tim Cole, Margaret Corbett, Carole Cull, Keith Dear, Michael Dewey, Simon Dunkley, the late
Nicola Dollimore, Clarke Harris, Dan Heitjan, Jim Hodges, Alan Kelly, Peter Lewis, Russell
Localio, Alison Macfarlane, Sarah MacFarlane, David Mauger, Richard Morris, Ian Plewis,
Mike Procter, Paul Seed, Stephen Senn, Jim Slattery, Anthony Staines, Graham Upton, Andy Vail,
Ian White, Sheila Williams, Ian Wilson, and a few whose names did not come through with the email.
Altman, D.G. (1991) Practical Statistics for Medical Research. Chapman and Hall, London, pp. 389-391.
Altman DG, Bland JM. (1996) Detecting skewness from summary information. British Medical Journal 313, 1200.
Altman DG, Bland JM. (2003) Interaction revisited: the difference between two estimates. British Medical Journal 326, 219.
Altman DG, Matthews JNS. (1996) Interaction 1: Heterogeneity of effects. British Medical Journal 313, 486.
Bland JM, Altman DG. (1994) Correlation, regression and repeated data. British Medical Journal 308, 896.
Bland JM, Altman DG. (1995) Calculating correlation coefficients with repeated observations: Part 1, correlation within subjects. British Medical Journal 310, 446.
Bland M. (2000a) An Introduction to Medical Statistics, 3rd edition. Oxford University Press, Oxford. Section 4.5, Medians and quantiles.
Bland M. (2000b) An Introduction to Medical Statistics, 3rd edition. Oxford University Press, Oxford. Section 17.11, Meta-analysis: data from several studies.
Chalmers I. (1999) Why transition from alternation to randomisation in clinical trials was made. British Medical Journal 319, 1372.
Gardner, M.J. and Altman, D.G. (1986) Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292, 746-50.
MacArthur C, Shennan AH, May A, Whyte J, Hickman N, Cooper G, Bick D, Crewe L, Garston H, Gold L, Lancashire R, Lewis M, Moore P, Wilson M, Bharmal S, Elton C, Halligan A, Hussain W, Patterson M, Squire P, de Swiet M. (2001) Effect of low-dose mobile versus traditional epidural techniques on mode of delivery: a randomised controlled trial. Lancet 358, 19-23.
Matthews, D.E. and Farewell, V. (1988) Using and understanding medical statistics, second edition. Karger, Basel.
Matthews JNS, Altman DG. (1996a) Interaction 2: compare effect sizes not P values. British Medical Journal 313, 808.
Matthews JNS, Altman DG. (1996b) Interaction 3: How to examine heterogeneity. British Medical Journal 313, 862.
Moher D, Schulz KF, Altman DG. (2001) The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. Lancet 357, 1191-1194.
Newnham, J.P., Evans, S.F., Con, A.M., Stanley, F.J., Landau, L.I. (1993) Effects of frequent ultrasound during pregnancy: a randomized controlled trial. Lancet 342, 887-91.
Schulz, K.F., Chalmers, I., Hayes, R.J., and Altman, D.G. (1995) Bias due to non-concealment of randomization and non-double-blinding. Journal of the American Medical Association 273, 408-12.
From the Lancet's instructions to authors:
Describe statistical methods with enough detail to enable a knowledgeable reader with
access to the original data to verify the reported results. When possible, quantify
findings and present them with appropriate indicators of measurement error or uncertainty
(such as confidence intervals). Avoid relying solely on statistical hypothesis testing,
such as the use of P values, which fails to convey important quantitative information.
Discuss the eligibility of experimental subjects. Give details about randomization.
Describe the methods for and success of any blinding of observations. Report complications
of treatment. Give numbers of observations. Report losses to observation (such as dropouts
from a clinical trial). References for the design of the study and statistical methods
should be to standard works when possible (with pages stated) rather than to papers in
which the designs or methods were originally reported. Specify any general-use computer
programs used.
Put a general description of methods in the Methods section. When data are summarized in
the Results section, specify the statistical methods used to analyze them. Restrict tables
and figures to those needed to explain the argument of the paper and to assess its support.
Use graphs as an alternative to tables with many entries; do not duplicate data in graphs
and tables. Avoid nontechnical uses of technical terms in statistics, such as "random"
(which implies a randomizing device), "normal," "significant," "correlations," and "sample."
Define statistical terms, abbreviations, and most symbols.
The Lancet's full instructions to authors
are well worth reading.