Robert P. Abelson.
Statistics as Principled Argument.
Lawrence Erlbaum Associates. 1995

rating : 2 : great stuff
review : 21 January 2010

Statistics is not boring. Really! It can be (and often is) made boring by bad teaching, but the underlying subject is fascinating – drawing valid conclusions from seemingly noisy or random observations, by a process that can look like magic to the uninitiated. And like any applied mathematical technique, there’s much more to it than just the mathematics itself. The correct application, and the explanation of that application, are equally important to get right. And so here Abelson describes his MAGIC approach to principled statistical arguments: the researcher needs to consider Magnitude, Articulation, Generality, Interestingness, and Credibility.

Magnitude

Statistical significance is not enough. It merely tells you whether your observations could have arisen by chance, or, as Abelson delightfully puts it:

p39. The particular level (p value) at which the null hypothesis is rejected … is often used as a gauge of the degree of contempt in which the null hypothesis deserved to be held.

But as he points out, this isn’t very helpful, since p depends on the sample size: the bigger the sample, the more “statistically significant” any given effect becomes. (This is particularly a problem in areas like computer science, where doing a few thousand, or even a few million, runs might be quite cheap, and ridiculously small p values can be obtained.) So it is good practice also to quote the effect size: not only show that your observation is unlikely to be due to chance, but also show that the difference is big enough to get excited about.
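To see this concretely, here is a minimal simulation (my sketch, not from the book, assuming numpy and scipy are available): the same tiny true effect becomes ever more “significant” as the sample grows, while the effect size (Cohen’s d) barely moves.

    # The same small true effect (0.1 sd) at increasing sample sizes:
    # p collapses towards zero, but the effect size stays put.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    for n in [20, 200, 2000, 20000]:
        a = rng.normal(0.0, 1.0, n)            # control group
        b = rng.normal(0.1, 1.0, n)            # treatment: tiny true effect
        _, p = stats.ttest_ind(a, b)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (b.mean() - a.mean()) / pooled_sd  # Cohen's d, the effect size
        print(f"n = {n:6d}   p = {p:.4f}   d = {d:+.3f}")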

However, even having a significant p value might not mean what you think it means. Abelson describes The Replication Fallacy, which is:

p75. … an overconfidence in the repeatability of statistically significant results. The following thought experiment may help to correct the fallacy. Imagine an experimenter who has run a two-group study, and has found by t test the result p = .05. What is the chance that if she exactly repeated the study with a new sample of subjects (and the same n per group) that she would again get a significant result at the .05 level? …
     Half the time, the observed effect size from the second study ought to be bigger than that of the first study, and half the time, smaller. Because the first observed effect size was just big enough to obtain a p value of .05, anything smaller would yield a nonsignificant p > .05. This analysis thus yields an expected repeatability of 50-50, much lower than the usual intuition.
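This is easy enough to check with a rough simulation (my sketch, not Abelson’s, assuming a two-sided two-group t test with an arbitrary 30 subjects per group, and numpy and scipy available): set the true effect to exactly the one that scrapes in at p = .05, and count how often a same-sized replication is itself significant.

    # Replication rate when the true effect just reaches p = .05.
    import numpy as np
    from scipy import stats

    n = 30                               # subjects per group (assumed)
    df = 2 * n - 2
    t_crit = stats.t.ppf(0.975, df)      # two-sided .05 threshold
    d_obs = t_crit * np.sqrt(2 / n)      # mean difference (sd = 1) that just hits p = .05

    rng = np.random.default_rng(1)
    trials, hits = 20000, 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(d_obs, 1.0, n)    # true effect = the observed one
        _, p = stats.ttest_ind(a, b)
        hits += p < 0.05
    print(f"replication rate ~ {hits / trials:.2f}")  # hovers around 0.5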

Articulation

Articulation is about reporting the results clearly, without getting lost in irrelevant and uninteresting details (but without fudging the important but inconvenient details). This can be difficult if the results are inconclusive or borderline. Abelson makes the very important, but often forgotten, point that:

p105. If a null hypothesis survives a significance test … this outcome does not mean that a value of zero can be assigned to the true comparative difference; it merely signifies that we are not confident of the direction of the (somewhat small) true difference.
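A confidence interval makes this concrete (my example, not Abelson’s, again assuming numpy and scipy): an underpowered comparison can fail the significance test while its interval remains wide enough to admit a sizeable true difference in either direction.

    # A nonsignificant result is not evidence of a zero effect:
    # the 95% CI straddles zero, but only because it is wide.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    a = rng.normal(0.0, 1.0, 15)         # small groups: low power
    b = rng.normal(0.2, 1.0, 15)         # a real but small true effect

    _, p = stats.ttest_ind(b, a)
    diff = b.mean() - a.mean()
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * a.var(ddof=1)
                  + (len(b) - 1) * b.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    half = stats.t.ppf(0.975, df) * se
    print(f"p = {p:.3f}, 95% CI = [{diff - half:+.2f}, {diff + half:+.2f}]")
    # typically p > .05 here, with a CI like [-0.5, +0.9]: consistent
    # with zero, but equally with a decent effect in either direction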

Even if the null hypothesis is refuted, it can still be hard to articulate the results. Abelson gives advice on “ticks and buts” that helps expose the interesting and relevant results.

Generality

All results are in some sense specific to the particular experiments run, but they are only really useful if they can be interpreted more generally. If every experiment can say only what those people did, on that day, under those circumstances, it’s not very useful: we want to know what people in general will do, or even why they do it.

This might require doing more experiments, particularly to test a theory. If your theory explains why something happens, you should be able to construct a situation in which it won’t:

p143. “You never understand a phenomenon unless you can make it go away” … “or unless you can reverse its direction”

This fits in with the overall approach: statistical analysis isn’t about a single isolated experiment, it’s about a series of experiments advancing and changing the understanding, contributing to the lore.

Interestingness

Moreover, results should be interesting: they should change the way people think about the subject, they should be “surprising”. What makes something interesting?

p160. when an unambiguous prediction of a folk theory or a scientific theory is generally believed, it is more interesting to cast doubt on it than to provide evidence strengthening it

One needs to be careful here, however. There is folk theory that is “generally believed” but for which there is precious little evidence. The movement towards evidence-based medicine is based on this observation: there are things everybody knows that just ain’t so. In these cases it is important to gather the evidence. I agree that evidence disproving such folk theories is more interesting, but in some cases it might not be more important. Nevertheless, in general this is good advice, and Abelson suggests a “surprisingness coefficient”: how different the observed result is from what you expect it to be. If you expect the null hypothesis to hold (which, let’s face it, you rarely do), then the surprisingness is the same as the effect size. But if you expect a big effect, and find it, that’s not very surprising. If you expect an effect in one direction and find it in the other, that’s the best of all!
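Read literally, that coefficient is just the gap between what you saw and what you expected beforehand (a toy formalisation of my own, not Abelson’s notation):

    # Surprisingness as the gap between observed and expected effect
    # (my toy reading, not Abelson's own notation).
    def surprisingness(observed: float, expected: float) -> float:
        return abs(observed - expected)

    print(surprisingness(0.5, 0.0))   # expected the null: equals the effect size
    print(surprisingness(0.5, 0.6))   # expected a big effect, found it: dull
    print(surprisingness(0.5, -0.4))  # expected the opposite direction: best of all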

Credibility

Finally, we come on to credibility. If your results beggar belief, then your methodology is going to be attacked. You can counter this by guarding against the well-known problems (which grow in number as the lore progresses), but eventually, if your result is just too unbelievable, you are probably going to have to come up with a corresponding theory that is better than the prevailing one. Anomalous results are not enough to overturn the prevailing view: you have to replace it with something better.

As well as the MAGIC chapters, there is a lovely chapter entitled “On Suspecting Fishiness” (who could resist buying such a book?), which highlights some things to look out for that could indicate error, or worse, fraud. It includes an interesting little tale about Mendel and his pea plants. We’ve all been brought up on the story that Mendel cheated, and massaged his results to get the answer he wanted, caught out by clever statisticians who showed his answers were just too good to be true. Well, maybe the story isn’t as clear-cut as that. Here Abelson recounts a different analysis that shows Mendel might have got his results by using a dodgy statistical procedure: not such a heinous crime after all, particularly as statistics wasn’t the well-developed subject in the mid-1800s that it is today.

This is not an introductory text: it assumes you know about t tests and p values, etc. But it is also not particularly mathematical – it is simply full of good, clear-headed advice, wisdom even, on doing statistics properly. Although it is written from the perspective of a psychologist analysing experiments done in noisy environments on a small number of subjects, it is a must-read for anyone using statistics seriously.