Introduction to Statistics for Research: Transformations

The need for transformations
Commonly used transformations for quantitative data
Logarithms
Transformations for a single sample
Transformations when comparing two groups
Transformations for paired data
Can all data be transformed?
Are transformations cheating?
References

The need for transformations

In the last lecture I described statistical methods in which we have to assume that data follow a Normal distribution with uniform variance. Later we shall meet regression methods, which require similar assumptions to be made about the data. Most analyses of continuous data in the health research literature are of this type.

We should always check these assumptions. If the data meet the assumptions we can analyse the data as described. If they are not met, we have two possible strategies: we can use a method which does not require these assumptions, such as a rank-based method, or we can transform the data mathematically to make them fit the assumptions more closely. In this lecture I describe the second approach. Instead of analysing the data as observed, we carry out a mathematical transformation first.

For example, Figure 1 shows serum triglyceride from cord blood in 282 babies.

Figure 1. Histogram with Normal distribution curve for serum triglyceride measurements from cord blood in 282 babies (data of Tessi Hanid)

These data do not follow a Normal distribution closely. Figure 2 shows the same plots for the logarithms of triglyceride measurements.

Figure 2. Histogram with Normal distribution curve for log transformed serum triglyceride

The logarithm follows a Normal distribution more closely than do the triglyceride measurements themselves. We could analyse the logarithm of serum triglyceride using methods which required the data to follow a Normal distribution. We call the logarithm of the triglyceride a logarithmic transformation of the data, or log transformation for short. We call the data without any transformation the raw data.

Even if a transformation does not produce a really good fit to the Normal distribution, it may still make the data much more amenable to analysis. Figure 3 shows a histogram for the area of venous ulcer at recruitment to the VenUS I trial.

Figure 3. Histogram for area of venous ulcer at recruitment, VenUS I trial

The raw data have a very skew distribution and the small number of very large ulcers might lead to problems in analysis. Although the log transformed data are still skew, the skewness is much less and the data much easier to analyse. Figure 4 shows a histogram for the log transformed ulcer area.

Figure 4. Histogram for area of venous ulcer at recruitment after log transformation, VenUS I trial

Making a distribution more like the Normal is not the only reason for using a transformation. Figure 5 shows prostate specific antigen (PSA) for three groups of prostate patients: with benign conditions, with prostatitis, and with prostate cancer.

Figure 5. Prostate specific antigen (PSA) by prostate diagnosis (data of Cutting et al., 1999)

One very high value makes it very difficult to see the structure in the rest of the data, although, as we would expect, we can see that the cancer group have the highest PSA values. A log transformation of the PSA gives a much clearer picture, shown in Figure 6.

Figure 6. Log transformed PSA by prostate diagnosis

The variability is now much more similar in the three groups.

The logarithm is not the only transformation used in the analysis of continuous data. Figure 7 shows arm lymphatic flow in patients with and without rheumatoid arthritis and oedema.

Figure 7. Arm lymphatic flow in rheumatoid arthritis with oedema (data of Kiely et al., 1995)

The distribution is positively skew and the variability is clearly greater in the groups with greater lymphatic activity. A square root transformation has the effect of making the data less skew and making the variation more uniform, as shown in Figure 8.

Figure 8. Arm lymphatic flow in rheumatoid arthritis with oedema, after square root transformation

In these data, a log transformation proved to have too great an effect, making the distribution negatively skew, and so the square root of the data was used in the analysis (Kiely et al., 1995).

Commonly used transformations for quantitative data

There are three commonly used transformations for quantitative data: the logarithm, the square root, and the reciprocal. (The reciprocal of a number is one divided by that number, hence the reciprocal of 2 is ½.) There are good mathematical reasons for these choices, which Bland (2000) discusses. They are based on the need to make variances uniform. If we have several groups of subjects and calculate the mean and variance for each group, we can plot variability against mean. We might have one of these situations:

Variability and mean are unrelated. We do not usually have a problem and can treat the variances as uniform. We do not need a transformation.
Variance is proportional to mean. A square root transformation should remove the relationship between variability and mean.
Standard deviation is proportional to mean. A logarithmic transformation should remove the relationship between variability and mean.
Standard deviation is proportional to the square of the mean. A reciprocal transformation should remove the relationship between variability and mean.

We call these transformations variance-stabilising, because their purpose is to make variances the same. For most data encountered in healthcare research, the first or third situation applies.
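The third situation can be illustrated with a small numerical sketch. The data here are hypothetical: each group is the same made-up base sample multiplied by a different scale factor, so the standard deviation is exactly proportional to the mean. The names `base` and `scales` are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical base sample; each group is this sample multiplied by a scale
# factor, so the group standard deviation is proportional to the group mean.
base = [0.6, 0.8, 1.0, 1.2, 1.4]
scales = [1, 10, 100]

raw_sds = []
log_sds = []
for k in scales:
    group = [k * x for x in base]
    raw_sds.append(statistics.stdev(group))
    # log(k * x) = log(k) + log(x), so the scale factor only shifts the logs
    log_sds.append(statistics.stdev([math.log(x) for x in group]))

print("SD on raw scale:", [round(s, 3) for s in raw_sds])
print("SD on log scale:", [round(s, 3) for s in log_sds])
```

On the raw scale the standard deviations grow with the mean; on the log scale they are the same in every group, which is what we mean by stabilising the variance.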

Variance-stabilising transformations also tend to make distributions Normal. There is a mathematical reason for this, as for so much in statistics. It can be shown that if we take several samples from the same population, the means and variances of these samples will be independent if and only if the distribution is Normal. This means that uniform variance tends to go with a Normal distribution. A transformation which makes variance uniform will often also make data follow a Normal distribution and vice versa.

There are many other transformations which could be used, but you see them very rarely. We shall meet one other, the logistic transformation used for dichotomous data, in Week 7.

By far the most frequently used is the logarithm. This is particularly useful for concentrations of substances in blood. The reason for this is that blood is very dynamic, with reactions happening continuously. Many of the substances we measure are part of a metabolic chain, both being synthesised and metabolised to something else. The rates at which these reactions happen depend on the amounts of other substances in the blood and the consequence is that the various factors which determine the concentration of the substance are multiplied together. Multiplying and dividing tends to produce skew distributions. If we take the logarithm of several numbers multiplied together we get the sum of their logarithms. So log transformation produces something where the various influences are added together and addition tends to produce a Normal distribution.

The square root is best for fairly weak relationships between variability and magnitude, i.e. variance proportional to mean or standard deviation proportional to the square root of the mean. The logarithm is next, for standard deviation proportional to the mean, and the reciprocal is best for very strong relationships, where the standard deviation is proportional to the square of the mean. In the same way, the square root removes the least amount of skewness and reciprocal the most.

The square root can be used for variables which are greater than or equal to zero, but the log and the reciprocal can only be used for variables which are strictly greater than zero, because neither the logarithm nor the reciprocal of zero is defined. We shall look at what to do with zero observations in Transformations for paired data, below.

Which transformation should we use for what kind of data? For physical body measurements, like limb length or peak expiratory flow, we often need to use only the raw data. For concentrations measured in blood or urine, we usually try the log first, then if this is insufficient try the reciprocal. For counts, the square root is usually the first thing to try. There are methods to determine which transformation will best fit the data, but trial and error, with scatter plots, histograms and Normal plots to check the shape of the distribution and relationship between variability and magnitude, is usually much quicker because the computer can produce them almost instantaneously.

Logarithms

What is a logarithm?

Statistics is often thought of as a mathematical subject. For readers of research and for users of statistical methods, however, the mathematics seldom put in much of an appearance. They are hidden down in the computer program engine room. All we actually see on deck are basic mathematical operations (we add and subtract, divide and multiply) and the occasional square root. There is only one other mathematical operation with which the user of statistics must become familiar: the logarithm. We come across logarithms in graphical presentation (logarithmic scales), in relative risks and odds ratios (standard errors and confidence intervals, logistic regression), in the transformation of data to have a Normal distribution or uniform variance, in the analysis of survival data (hazard ratios), and many more. In these notes I explain what a logarithm is, how to use logarithms, and try to demystify this most useful of mathematical tools.

I shall start with logarithms (usually shortened to ‘log’) to base 10. In mathematics, we write 10² to mean 10×10. We call this ‘10 to the power 2’ or ‘10 squared’. We have 10² = 10×10 = 100.

We call 2 the logarithm of 100 to base 10 and write it as log₁₀(100) = 2.

In the same way, 10³ = 10×10×10 is ‘10 to the power 3’ or ‘10 cubed’, 10³ = 1000 and log₁₀(1000) = 3. 10⁵ = 10×10×10×10×10 is ‘10 to the power 5’, 10⁵ = 100,000 and log₁₀(100,000) = 5.

10 raised to the power of the log of a number is equal to that number. 10¹ = 10, so log₁₀(10) = 1.

Before the days of electronic calculators, logarithms were used to multiply and divide large or awkward numbers. This is because when we add on the log scale we multiply on the natural scale. For example

log₁₀(1000) + log₁₀(100) = 3 + 2 = 5 = log₁₀(100,000)

10³ × 10² = 10³⁺² = 10⁵

1000 × 100 = 100,000

Adding the log of 1000, which is 3, and the log of 100, which is 2, gives us 5, which is the log of 100,000, the product of 1000 and 100. So adding on the log scale is equivalent to multiplying on the natural scale.

When we subtract on the log scale we divide on the natural scale. For example

log₁₀(1000) – log₁₀(100) = 3 – 2 = 1 = log₁₀(10)

10³ ÷ 10² = 10³⁻² = 10¹

1000 ÷ 100 = 10

Subtracting the log of 100, 2, from the log of 1000, 3, gives us 1, which is the log of 10, 1000 divided by 100. So subtracting on the log scale is equivalent to dividing on the natural scale.

So far we have raised 10 to powers which are positive whole numbers, so it is very easy to see what 10 to that power means; it is that number of 10s multiplied together. It is not so easy to see what raising 10 to other powers, such as negative numbers, fractions, or zero would mean. What mathematicians do is to ask what powers other than positive whole numbers would mean if they were consistent with the definition we started with.

What is ten to the power zero? The answer is 10⁰ = 1, so log₁₀(1) = 0. Why is this? Let us see what happens when we divide a number, 10 for example, by itself:

10 ÷ 10 = 1

log₁₀(10) – log₁₀(10) = 1 – 1 = 0 = log₁₀(1)

When we subtract the log of 10, which is 1, from the log of 10, 1, the difference is zero. This must be the log of 10 divided by 10, so zero must be the logarithm of one.

So far, we have added and subtracted logarithms. If we multiply a logarithm by a number, on the natural scale we raise to the power of that number. For example:

3×log₁₀(100) = 3×2 = 6 = log₁₀(1,000,000)

100³ = 1,000,000.

If we divide a logarithm by a number, on the natural scale we take that number root. For example, log₁₀(1,000)/3 = 3/3 = 1 = log₁₀(10) and the cube root of 1,000 is 10, i.e. 10 × 10 × 10 = 1,000. Note that we are multiplying and dividing a logarithm by a plain number, not by another logarithm.
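The four rules for working on the log scale can be checked numerically. This is a minimal sketch using Python's standard `math` module; the variable names are chosen here for illustration.

```python
import math

# Adding logs corresponds to multiplying on the natural scale.
sum_of_logs = math.log10(1000) + math.log10(100)
log_of_product = math.log10(1000 * 100)

# Subtracting logs corresponds to dividing on the natural scale.
difference_of_logs = math.log10(1000) - math.log10(100)
log_of_quotient = math.log10(1000 / 100)

# Multiplying a log by a plain number corresponds to raising to that power.
multiplied_log = 3 * math.log10(100)
log_of_power = math.log10(100 ** 3)

# Dividing a log by a plain number corresponds to taking that root.
divided_log = math.log10(1000) / 3
log_of_root = math.log10(10)

print(sum_of_logs, log_of_product)
print(difference_of_logs, log_of_quotient)
print(multiplied_log, log_of_power)
print(divided_log, log_of_root)
```

Each pair of printed values should agree, apart from any tiny rounding error in the computer's arithmetic.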

Logarithms which are not whole numbers

Logarithms do not have to be whole numbers. For example, 0.5 (or ½) is the logarithm of the square root of 10. We have 10^0.5 = 10^½ = √10 = 3.16228. We know this because

10^½ × 10^½ = 10^(½+½) = 10¹ = 10.

We do not know what 10 to the power ½ means. We do know that if we multiply 10 to the power ½ by 10 to the power ½, we will have 10 to the power ½ + ½ = 1. So 10 to the power ½ multiplied by itself is equal to 10 and 10^½ must be the square root of 10. Hence ½ is the log to base 10 of the square root of 10.

Logarithms which are not whole numbers are the logs of numbers which cannot be written as 1 and a string of zeros. For example the log₁₀ of 2 is 0.30103 and the log₁₀ of 5 is 0.69897. Of course, these add to 1, the log₁₀ of 10, because 2 × 5 = 10:

0.30103 + 0.69897 = 1.00000

Negative logarithms are the logs of numbers less than one. For example, the log of 0.1 is –1. This must be the case, because 0.1 is one divided by 10:

1 ÷ 10 = 0.1

log₁₀(1) – log₁₀(10) = 0 – 1 = –1 = log₁₀(0.1)

In the same way, the log of ½ is minus the log of 2: log₁₀(½) = –0.30103. Again, this is consistent with everything else. For example, if we multiply 2 by ½ we will get one:

log₁₀(2) + log₁₀(½) = 0.30103 – 0.30103 = 0 = log₁₀(1)

2 × ½ = 1

What is log₁₀(0)? It does not exist. There is no power to which we can raise 10 to give zero. To get a product equal to zero, one of the numbers multiplied must equal zero. As we take the logs of smaller and smaller numbers, the logs are larger and larger negative numbers. For example, log₁₀(0.0000000001) = –10. We say that the log of a number tends towards minus infinity as the number tends towards zero. The logarithms of negative numbers do not exist, either. We can only use logarithms for positive numbers.
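The fractional and negative logarithms quoted above, and the failure of the log of zero, can be seen directly on a computer. This sketch uses Python's standard `math` module, which signals an undefined logarithm by raising `ValueError`; the variable names are illustrative assumptions.

```python
import math

log_2 = math.log10(2)        # about 0.30103
log_5 = math.log10(5)        # about 0.69897
log_half = math.log10(0.5)   # minus the log of 2
root_10 = 10 ** 0.5          # ten to the power one half, about 3.16228

print(round(log_2, 5), round(log_5, 5), round(log_2 + log_5, 5))
print(round(log_half, 5), round(root_10, 5))

# Zero has no logarithm, so the attempt to calculate one is refused.
try:
    math.log10(0)
    zero_has_log = True
except ValueError:
    zero_has_log = False

print("log of zero exists:", zero_has_log)
```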

The logarithmic curve and logarithmic scale

Figure 9 shows the curve representing the logarithm to base 10.

Figure 9. Graph of the logarithm to base 10, with a logarithmic scale on the right side.

The curve starts off at the bottom of the vertical scale just right of zero on the horizontal scale, coming up from minus infinity as the log of zero, if we were able to get it on the paper. It goes through the point defined by 1 on the horizontal axis and 0 on the vertical axis, and continues to rise but less and less steeply, going through the points defined by 10 and 1 and by 100 and 2.

The right hand vertical axis of Figure 9 shows the variable on a logarithmic scale. The scale is marked in unequal divisions which correspond to the logarithms of the numbers printed. On this scale, the distance between 1 and 10 is the same as the distance between 10 and 100. So equal distances mean equal ratios (10/1 = 100/10) rather than equal differences, as on a linear or natural scale.

Showing data on a logarithmic scale can often show us details which are obscured on the natural scale. For example, Figure 5 shows Prostate Specific Antigen (PSA) for three groups of subjects. Figure 6 shows the log of PSA. A lot more detail is clear, including the huge overlap between the three groups.

The units in which log₁₀(PSA) is measured may not be easily understood by those who use PSA measurements. Instead, we can put the original units from Figure 5 onto the graph shown in Figure 6 by means of a logarithmic scale. Figure 10 shows the PSA on a logarithmic scale: the structure revealed by the logarithm is shown, but in the original units.

Figure 10. PSA for three groups of subjects, showing the logarithm and the original units on a logarithmic scale

Natural logarithms and base ‘e’

An early use of logarithms was to multiply or divide large numbers, to raise numbers to powers, etc. For these calculations, 10 was the obvious base to use, because our number system uses base 10, i.e. we count in tens. We are so used to this that we might think it is somehow inevitable, but it happened because we have ten fingers and thumbs on our hands. If we had had twelve digits instead, we would have counted to base 12, which would have made a lot of arithmetic much easier. Other bases have been used. The ancient Babylonians, for example, are said to have counted to base 60, though perhaps only a few of them did much counting at all.

Base 10 for logarithms was chosen for convenience in arithmetic, but it was a choice; it was not the only possible base. Logarithms to the base 10 are also called common logarithms, the logarithms for everyone to use.

Mathematicians also find it convenient to use a different base, called ‘e’, to give natural logarithms. The symbol ‘e’ represents a number which cannot be written down exactly, like pi, the ratio of the circumference of a circle to its diameter. In decimals, e = 2.718281 . . . and this goes on and on indefinitely, just like pi.

We use this base because the slope of the curve y = log₁₀(x) is log₁₀(e)/x. The slope of the curve y = log_e(x) is 1/x. It displays a rather breathtaking insouciance to call ‘natural’ the use of a number which you cannot even write down and which has to be labelled by a letter, but that is what we do. Using natural logs avoids awkward constants in formulae and as long as we are not trying to use them to do calculations, it makes life much easier. When you see ‘log’ written in statistics, it is the natural log unless we specify something else.

Logs to base e are sometimes written as ‘ln’ rather than ‘log_e’. On calculators, the button for natural logs is usually labelled ‘ln’ or ‘ln(x)’. The button labelled ‘log’ or ‘log(x)’ usually does logs to the base 10. If in doubt, try putting in 10 and pressing the log button. As we have seen, log₁₀(10) = 1, whereas log_e(10) = 2.3026.
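The same check can be made in software. In Python's standard `math` module, `math.log` is the natural log and `math.log10` is the log to base 10, which matches the statistical convention described above. The value `x = 250.0` is a hypothetical number chosen only to show how to convert between the two bases.

```python
import math

natural_log_of_10 = math.log(10)    # log to base e
common_log_of_10 = math.log10(10)   # log to base 10

print(round(natural_log_of_10, 4), common_log_of_10)

# Converting between bases: log_e(x) = log10(x) / log10(e).
x = 250.0  # hypothetical value for illustration
converted = math.log10(x) / math.log10(math.e)

print(round(converted, 6), round(math.log(x), 6))
```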

Antilogarithms

The antilogarithm is the opposite of the logarithm. If we start with a logarithm, the antilogarithm or antilog is the number of which this is the logarithm. Hence the antilog to base 10 of 2 is 100, because 2 is the log to base 10 of 100. To convert from logarithms to the natural scale, we antilog:

antilog₁₀(2) = 10² = 100

We usually write this as 10² rather than antilog₁₀(2). On a calculator, the antilog key for base 10 is usually labelled ‘10^x’.

To antilog from logs to base e on a calculator, use the key labelled ‘e^x’ or ‘exp(x)’. Here ‘exp’ is short for ‘exponential’. This is another word for ‘power’ in the sense of ‘raised to the power of’ and the mathematical function which is the opposite of the log to base e, the antilog, is called the exponential function. So ‘e’ is for ‘exponential’.

We also use the term ‘exponential’ to describe a way of writing down very large and very small numbers. Suppose we want to write the number 1,234,000,000,000. Now this is equal to 1.234 × 1,000,000,000,000. We can write this as 1.234 × 10¹². Similarly, we can write 0.000,000,000,001234 as 1.234 × 0.000,000,000,001 = 1.234 × 10⁻¹². Computers print these numbers out as 1.234E12 and 1.234E–12. This makes things nice and compact on the screen or printout, but can be very confusing to the occasional user of numerical software.
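Antilogs and the exponential notation can both be tried out in a few lines. This is a sketch in Python, using only the standard library. The exact way a number is printed varies between programs: Python, for example, writes an explicit plus sign in the exponent.

```python
import math

# Antilog to base 10: ten raised to the power of the logarithm.
antilog_base_10 = 10 ** 2

# Antilog to base e: the exponential function undoes the natural log.
round_trip = math.exp(math.log(6.34))

# Reading numbers written in exponential notation.
big = float("1.234E12")
small = float("1.234E-12")

# Writing a number in exponential notation with three decimal places.
big_text = f"{1234000000000:.3E}"

print(antilog_base_10, round(round_trip, 2))
print(big, small, big_text)
```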

Transformations for a single sample

As we have seen, for the serum cholesterol in stroke patients data, the log transformation gives a good fit to the Normal. What happens if we analyse the logarithm of serum cholesterol then try to transform back to the natural scale?

For the raw data, serum cholesterol: mean = 6.34, SD = 1.40.

For log (base e) serum cholesterol: mean = 1.82, SD = 0.22.

If we take the mean on the transformed scale and back-transform by taking the antilog, we get exp(1.82) = 6.17. This is less than the mean for the raw data. The antilog of the mean log is not the same as the untransformed arithmetic mean.

In fact, it is the geometric mean, which is found by multiplying all the observations and taking the n’th root. (It is called geometric because if we have just two numbers we could draw a rectangle with those two numbers as the lengths of the long and short sides. The geometric mean is the side of a square which has the same area as this rectangle.) Now, if we add the logs of two numbers we get the log of their product. Thus when we add the logs of a sample of observations together we get the log of their product. If we multiply the log of a number by a second number, we get the log of the first raised to the power of the second. So if we divide the log by n, we get the log of the n’th root. Thus the mean of the logs is the log of the geometric mean.
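The equivalence between the two routes to the geometric mean can be checked on a small example. The three observations below are hypothetical, chosen so that the answer is easy to see: their product is 64 and the cube root of 64 is 4.

```python
import math
import statistics

sample = [2.0, 4.0, 8.0]  # hypothetical observations
n = len(sample)

# Route 1: multiply all the observations and take the n'th root.
nth_root_of_product = math.prod(sample) ** (1 / n)

# Route 2: take the mean of the logs, then antilog.
mean_log = statistics.mean([math.log(x) for x in sample])
antilog_of_mean_log = math.exp(mean_log)

arithmetic_mean = statistics.mean(sample)

print(round(nth_root_of_product, 4), round(antilog_of_mean_log, 4))
print(round(arithmetic_mean, 4))
```

Both routes give the same geometric mean, which for these skew numbers is smaller than the arithmetic mean.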

What about the units for the geometric mean? If cholesterol is measured in mmol/L, the log of a single observation is the log of a measurement in mmol/L. The sum of n logs is the log of the product of n measurements in mmol/L and is the log of a measurement in mmol/L to the power n. The n’th root is thus again the log of a number in mmol/L and the antilog is back in the original units, mmol/L.

The antilog of the standard deviation is not measured in mmol/L. To find a standard deviation, we calculate the differences between each observation and the mean, square and add. On the log scale, we take the difference between each log transformed observation and the log geometric mean. We have the difference between the logs of two numbers each measured in mmol/L, giving the log of their ratio, which is the log of a dimensionless pure number. We cannot transform the standard deviation back to the original scale.

If we want to use the standard deviation, it is easiest to do all calculations on the transformed scale and transform back, if necessary, at the end. For example, to estimate the 95% confidence interval for the geometric mean, we find the confidence interval on the transformed scale. On the log scale the mean is 1.8235 with standard error of 0.0235. This standard error is calculated from the standard deviation, which is a pure number without dimensions, and the sample size, which is also a pure number. It, too, is a pure number without dimensions. The 95% confidence interval for the mean is

1.8235 – 1.96 × 0.0235 to 1.8235 + 1.96 × 0.0235 = 1.777 to 1.870.

If we antilog these limits we get 5.91 to 6.49. To get the confidence limits we took the log of something in mmol/L, the mean, and added or subtracted the log of a pure number, the standard error multiplied by 1.96. On the natural scale we have taken something in mmol/L and multiplied or divided by a pure number. We therefore still have something in mmol/L. The 95% confidence interval for the geometric mean is therefore 5.91 to 6.49 mmol/L.
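The calculation of this interval can be written out step by step. The sketch below uses only the summary figures quoted above (mean log 1.8235, standard error 0.0235) rather than the original observations, and the variable names are illustrative.

```python
import math

mean_log = 1.8235   # mean of log (base e) serum cholesterol
se_log = 0.0235     # standard error on the log scale

# Confidence limits are found on the transformed scale first...
lower_log = mean_log - 1.96 * se_log
upper_log = mean_log + 1.96 * se_log

# ...and only then transformed back by taking antilogs.
geometric_mean = math.exp(mean_log)
lower = math.exp(lower_log)
upper = math.exp(upper_log)

print(f"log scale: {lower_log:.3f} to {upper_log:.3f}")
print(f"natural scale: {lower:.2f} to {upper:.2f} mmol/L")
```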

For the arithmetic mean, using the raw, untransformed data we get 6.04 to 6.64 mmol/L. This interval is slightly wider than for the geometric mean. In highly skew distributions, unlike serum cholesterol, the extreme observations have a large influence on the arithmetic mean, making it more prone to sampling error and the confidence interval for the arithmetic mean is usually quite a lot wider.

In the same way we can estimate centiles on the transformed scale and then transform them back. In a Normal distribution the central 95% of observations are within 1.96 standard deviations of the mean. For log serum cholesterol, this is 1.396 to 2.251. The antilog is 4.04 to 9.50 mmol/L.

We can do this for square root transformed and reciprocal transformed data, too. If we do all the calculations on the transformed scale and transform back only at the end, we will be back in the original units. The mean calculated in this way using a reciprocal transformation also has a special name, the harmonic mean.

Transformations when comparing two groups

Table 1 shows measurements of biceps skinfold thickness compared for two groups of patients, with Crohn’s disease and Coeliac disease.

Table 1. Biceps skinfold thickness (mm) in two groups of patients
Crohn’s Disease		Coeliac Disease
1.8	2.8	4.2	6.2	1.8	3.8
2.2	3.2	4.4	6.6	2.0	4.2
2.4	3.6	4.8	7.0	2.0	5.4
2.5	3.8	5.6	10.0	2.0	7.6
2.8	4.0	6.0	10.4	3.0

We ask whether there is any difference in skinfold between patients with these diagnoses and what it might be.

Figure 11 shows the distribution of biceps skinfold. This is clearly positively skew.

Figure 11. Untransformed biceps skinfold thickness for Crohn’s disease and Coeliac disease patients, with histogram and Normal plot of residuals

Figure 12 shows the same data after square root transformation. This is still skew, though less so than the untransformed data.

Figure 12. Square root transformed biceps skinfold thickness

Figure 13 shows the effect of a log transformation.

Figure 13. Log transformed biceps skinfold thickness

The distribution is now more symmetrical and the Normal plot is closer to a straight line. Figure 14 shows the effect of a reciprocal transformation, which looks fairly similar to the log.

Figure 14. Reciprocal transformed biceps skinfold thickness

Any of the transformations would be an improvement on the raw data.

Table 2 shows the result of a two sample t test and confidence interval for the raw data and the transformations. The transformed data clearly give a better test of significance than the raw data, in that the P values are smaller.

Table 2. Comparison of mean biceps skinfold between Crohn’s disease and Coeliac disease patients using different transformations
Transformation	t (27 d.f.)	P	95% confidence interval for difference on transformed scale	Variance ratio larger/smaller
None	1.28	0.21	-0.71 mm to 3.07 mm	1.52
Square root	1.38	0.18	-0.140 to 0.714	1.16
Logarithm	1.48	0.15	-0.114 to 0.706	1.10
Reciprocal	-1.65	0.11	-0.203 to 0.022	1.63
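The t statistics and variance ratios in Table 2 can be recalculated from the observations in Table 1. The following is a minimal sketch using only Python's standard library, with the usual pooled-variance formula for the two sample t test. The function name `two_sample_t` is an assumption made for this illustration, and the P values are not computed because the standard library has no t distribution.

```python
import math
import statistics

# Biceps skinfold thickness (mm), read from Table 1.
crohns = [1.8, 2.2, 2.4, 2.5, 2.8, 2.8, 3.2, 3.6, 3.8, 4.0,
          4.2, 4.4, 4.8, 5.6, 6.0, 6.2, 6.6, 7.0, 10.0, 10.4]
coeliac = [1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6]


def two_sample_t(a, b):
    """Return (t, variance ratio larger/smaller) using the pooled variance."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, max(var_a, var_b) / min(var_a, var_b)


transformations = {
    "None": lambda x: x,
    "Square root": math.sqrt,
    "Logarithm": math.log,        # natural log
    "Reciprocal": lambda x: 1 / x,
}

results = {}
for name, transform in transformations.items():
    results[name] = two_sample_t([transform(x) for x in crohns],
                                 [transform(x) for x in coeliac])
    t, ratio = results[name]
    print(f"{name:12s} t = {t:6.2f}   variance ratio = {ratio:.2f}")
```

Note that the sign of t changes for the reciprocal, because large skinfolds have small reciprocals.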

The confidence intervals for the transformed data are more difficult to interpret. The confidence limits for the difference between means cannot be transformed back to the original scale.

For the square root transformation, the lower limit is negative. We can square this, which would give a positive number, and this will happen whatever the limits because all squares are positive. Hence squaring the limits will not give a 95% confidence interval for the difference in biceps skinfold: the difference is not significant, so the confidence interval must include the null hypothesis value, which would be zero. The same problem arises with the logarithmic transformation, because all antilogs are positive. For the reciprocal, we could transform back, but what would this mean? The closer the limits are on the reciprocal scale, the further apart they will be on the natural scale. The upper limit for the reciprocal is very small (0.022) with reciprocal 45.5. The difference clearly could not be 45.5 mm, as all the observations are much smaller than this. The null hypothesis value, zero on the reciprocal scale, transforms back to infinity! A point to watch out for is that the square root and logarithm keep differences in the same direction as the raw data, whereas the reciprocal reverses the direction.

Confidence limits for the difference cannot be transformed back to the original scale. However, the logarithm does give interpretable results (0.89 to 2.03), but these are not limits for the difference in millimetres. They do not contain zero, yet the difference is not significant. The back-transformed limits from the log transformation, 0.89 to 2.03, are the 95% confidence limits for the ratio of the Crohn’s disease geometric mean to the Coeliac disease geometric mean, and they include 1.0, the null hypothesis value for a ratio. When we take the difference between the logarithms of the two geometric means, we get the logarithm of their ratio, not of their difference.

Transformed data give us only a P value when comparing groups, unless we use the log, in which case we can get confidence intervals for ratios.
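The back-transformation for the log scale is a single antilog of each limit. This sketch starts from the log scale confidence limits given in Table 2 (–0.114 and 0.706, natural logs) rather than from the raw observations; the variable names are illustrative.

```python
import math

# 95% confidence limits for the difference in mean log skinfold (Table 2).
lower_log_difference = -0.114
upper_log_difference = 0.706

# A difference of logs is the log of a ratio, so the antilogs are limits
# for the ratio of the two geometric means, not for a difference in mm.
ratio_lower = math.exp(lower_log_difference)
ratio_upper = math.exp(upper_log_difference)

# The null hypothesis value for a ratio is 1, the antilog of zero.
includes_no_effect = ratio_lower < 1.0 < ratio_upper

print(f"ratio of geometric means: {ratio_lower:.2f} to {ratio_upper:.2f}")
print("interval includes 1:", includes_no_effect)
```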

Transformations for paired data

Table 3 shows attacks of angina over four weeks in a crossover trial comparing pronethalol with placebo.

Table 3. Attacks of angina over four weeks on pronethalol and on placebo (Pritchard et al., 1963)
Patient	Placebo	Pronethalol	Placebo minus Pronethalol
1	71	29	42
2	323	348	–25
3	8	1	7
4	14	7	7
5	23	16	7
6	34	25	9
7	79	65	14
8	60	41	19
9	2	0	2
10	3	0	3
11	17	15	2
12	7	2	5

People reporting a lot of attacks have much larger differences than those reporting few attacks. A paired t test would not be valid. We would like to transform the data to make them fit the assumptions required for the paired t method. Differences are often negative, as one is here. We cannot log or square root negative numbers. Mathematically, there is nothing to stop us taking reciprocals, but the reciprocal has what we call a discontinuity at zero. As we go towards zero from the positive end, we get larger and larger positive numbers. Zero itself has no reciprocal. As we go towards zero from the negative end, we get larger and larger negative numbers. So at zero the reciprocal switches from an infinitely large negative number to an infinitely large positive number. We cannot use any of these transformations for the differences. Instead, we transform the original observations then calculate the differences from the transformed observations.

Table 4 shows the square root transformed data.

Table 4. Square root transformed attacks of angina over four weeks on pronethalol and on placebo
Patient	Placebo	Pronethalol	Placebo minus Pronethalol
1	8.426149	5.385165	3.040985
2	17.972200	18.654760	–0.682558
3	2.828427	1.000000	1.828427
4	3.741657	2.645751	1.095906
5	4.795832	4.000000	0.795832
6	5.830952	5.000000	0.830952
7	8.888194	8.062258	0.825936
8	7.745967	6.403124	1.342843
9	1.414214	0.000000	1.414214
10	1.732051	0.000000	1.732051
11	4.123106	3.872983	0.250122
12	2.645751	1.414214	1.231538

We could also try a log transformation. There is a problem, however. We have a zero observation and zero has no logarithm. What we usually do is to add a small constant to everything. This should be somewhere between zero and the smallest non-zero observation. If there is no reason not to do so, we usually choose 1.0 for this constant. Table 5 shows the transformed data.

Table 5. Log(x + 1) transformed attacks of angina over four weeks on pronethalol and on placebo
Patient	Placebo	Pronethalol	Placebo minus Pronethalol
1	4.276666	3.401197	0.875469
2	5.780744	5.855072	-0.074328
3	2.197225	0.693147	1.504077
4	2.708050	2.079442	0.628609
5	3.178054	2.833213	0.344841
6	3.555348	3.258096	0.297252
7	4.382027	4.189655	0.192372
8	4.110874	3.737670	0.373204
9	1.098612	0.000000	1.098612
10	1.386294	0.000000	1.386294
11	2.890372	2.772589	0.117783
12	2.079442	1.098612	0.980830

Figure 15 shows the Normal plot for the differences for the raw data and for the two transformations.

Figure 15. Normal plots for differences for the raw data, square root transformed data, and log plus one transformed data for the number of attacks of angina

In the Normal plots, the points lie much closer to the straight line for both square root and log(x + 1) transformed data. There is not much to choose between them, though the log may look a bit better. The transformed data appear to fit the Normal assumption needed for the paired t method much better than the raw data.

If we apply paired t tests to each of the scales, we get for the natural scale: P = 0.11, for the square root scale: P = 0.0011, and for the log(x + 1) scale: P = 0.0012. The two transformations give very similar P values. Both are smaller than the P = 0.006 we got with the sign test in the third lecture.
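These three analyses can be sketched in a few lines of Python. This is an illustration only, not the software used for the lecture. The counts are those obtained by squaring the entries of Table 4, and the helper names (t_density, two_sided_p, paired_t) are invented for this example. To avoid relying on a statistics package, the P value is found by numerical integration of the t density using Simpson's rule.

```python
import math
from statistics import mean, stdev

# Attacks over four weeks, recovered by squaring the Table 4 entries.
placebo     = [71, 323, 8, 14, 23, 34, 79, 60, 2, 3, 17, 7]
pronethalol = [29, 348, 1, 7, 16, 25, 65, 41, 0, 0, 15, 2]


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided P value: integrate the density from 0 to |t| by Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return 1 - central_area


def paired_t(transform):
    """Transform both columns first, then take differences and test them."""
    diffs = [transform(a) - transform(b) for a, b in zip(placebo, pronethalol)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, two_sided_p(t, n - 1)


results = {
    "raw": paired_t(lambda x: x),
    "square root": paired_t(math.sqrt),
    "log(x + 1)": paired_t(math.log1p),
}
for scale, (t, p) in results.items():
    print(f"{scale:12s} t = {t:5.2f}  P = {p:.4f}")
```

Note that the transformation is applied to each observation before the difference is taken, never to the difference itself.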

Can all data be transformed?

Not all data can be transformed successfully. Sometimes we have very long tails at both ends of the distribution, which makes transformation by log, square root or reciprocal ineffective. For example, Figure 16 shows the distribution of blood sodium in ITU patients.

Figure 16. Blood sodium in 221 ITU patients (data of Friedland et al., 1996)

This is fairly symmetrical, but has longer tails than a Normal distribution. The shape of the Normal plot is first convex then concave, reflecting this. We can often ignore this departure from the Normal distribution, for example when using the two-sample t method, but not always. If we were trying to estimate a 95% range, for example, we might not get a very reliable answer.

Sometimes we have a bimodal distribution, which makes transformation by log, square root or reciprocal ineffective. Figure 17 shows systolic blood pressure in the same sample of ITU patients.

Figure 17. Systolic blood pressure in 250 ITU patients (data of Friedland et al., 1996)

This is clearly bimodal. The serpentine Normal plot reflects this. None of the usual transformations will affect this and we would still have a bimodal distribution. We should not ignore this departure from the Normal distribution.

Sometimes we have a large number of identical observations, which will all transform to the same value whatever transformation we use. These are often at one extreme of the distribution, usually at zero. For example, Figure 18 shows the distribution of coronary artery calcium in a large group of patients.

Figure 18. Coronary artery calcium in 2217 subjects (data of Sevrukov et al., 2005)

More than half of these observations were equal to zero. Any transformation would leave more than half the observations with the same value, at the extreme of the distribution. It is impossible to transform these data to a Normal distribution.
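A small made-up example shows why. The scores below are hypothetical, not the study data, and the function name share_at_minimum is invented for the illustration. Whatever transformation we apply, the tied observations stay tied.

```python
import math

# Hypothetical calcium scores: 6 of the 10 values are zero (not the study data).
scores = [0, 0, 0, 0, 0, 0, 3, 18, 120, 950]


def share_at_minimum(values):
    """Proportion of observations equal to the smallest value."""
    lowest = min(values)
    return sum(v == lowest for v in values) / len(values)


scales = {
    "raw": scores,
    "square root": [math.sqrt(x) for x in scores],
    "log(x + 1)": [math.log1p(x) for x in scores],
}
for name, values in scales.items():
    print(f"{name:12s} share at the minimum = {share_at_minimum(values):.1f}")
```

On every scale the same six observations sit together at the bottom of the distribution, so the spike at the extreme remains.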

What can we do if we cannot transform data to a suitable form? If the sample is large enough, we can ignore the distribution and use large sample z methods. For small samples we can do the same, and hope that the effect of the departure from assumptions is to make confidence intervals too wide and P values too big, rather than the other way round. This is usually the effect of skewness, but we should always be very cautious in drawing conclusions. It is usually safer to use methods that do not require such assumptions. These include the non-parametric methods, such as the Mann-Whitney U test and Wilcoxon matched-pairs test, which are beyond the scope of this course. These methods will give us a valid significance test, but usually no confidence interval.

Are there data which should not be transformed? Sometimes we are interested in the data in the actual units only. Cost data are a good example. Costs of treatment usually have distributions which are highly skew to the right. However, we need to estimate the difference in mean costs in pounds. No other scale is of interest. We should not transform such data. We rely on large sample comparisons or on methods which do not involve any distributions. Economists often use a group of methods called bootstrap or resampling methods, which do not rely on any assumptions about distributions.
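To give an idea of how resampling works, here is a minimal sketch of a percentile bootstrap interval for a difference in mean costs. The costs are simulated from a skew distribution purely for illustration, the group names and sizes are assumptions, and the percentile interval is only the simplest of several bootstrap intervals in use.

```python
import random
from statistics import mean

rng = random.Random(2020)  # fixed seed so the sketch is repeatable

# Simulated right-skewed costs in pounds for two hypothetical treatment groups.
costs_a = [rng.lognormvariate(6.0, 1.0) for _ in range(80)]
costs_b = [rng.lognormvariate(6.3, 1.0) for _ in range(80)]

observed = mean(costs_b) - mean(costs_a)


def resampled_difference():
    """Resample each group with replacement and return the difference in means."""
    sample_a = rng.choices(costs_a, k=len(costs_a))
    sample_b = rng.choices(costs_b, k=len(costs_b))
    return mean(sample_b) - mean(sample_a)


boot = sorted(resampled_difference() for _ in range(2000))

# Percentile interval: cut off 2.5% of the bootstrap differences at each end.
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]

print(f"difference in mean cost: {observed:.0f} pounds")
print(f"95% bootstrap interval:  {lower:.0f} to {upper:.0f} pounds")
```

The estimate and its interval stay in pounds throughout, which is the point: no transformation is applied to the costs.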

Are transformations cheating?

At about this point, someone will ask ‘Aren’t transformations cheating?’. Data transformation would be cheating if we tried several different transformations until we found the one which gave the result we wanted, just as it would be if we tried several different tests of significance and chose the one which gave the result nearest to what we wanted, or compared treatment groups in a clinical trial using different outcome variables until we found one which gave a significant difference. Such approaches are cheating because the P values and confidence intervals we get are wrong. However, it is not cheating if we decide on the analysis we want to use before we see its result and then stick to it.

It should be remembered that the linear scale is not the only scale which we use for measurements. Some variables are always measured on a log scale. Well-known examples are the decibel scale for measuring sound intensity and the Richter scale for measuring earthquakes. The reason we use logarithmic scales for these is that the range of energy involved is huge and a difference which is easily perceived at low levels would not be noticed at high. If you were sitting in a library and someone said ‘Hello’ to you, you would certainly notice it, but if you were standing next to a jumbo jet preparing for take-off or in a disco you would not. Many healthcare professionals see several measurements of acidity every working day, but they do not worry about pH being a logarithmic scale, and a logarithm with its minus sign removed at that. It is simply the most convenient scale on which to measure.

Should we measure the power of spectacle lenses by focal length or in dioptres? We use dioptres in ophthalmology, which is the reciprocal transformation of the focal length in metres. Concentrations are measured in units of solute contained in one unit of solvent, but this is an arbitrary choice. We could measure in units of solvent required to contain one unit of solute, which is the reciprocal. Similarly, we measure car speed in miles or kilometres per hour, but we could just as easily use the number of hours or minutes required to go one mile. We measure fuel consumption like this, in miles per gallon or kilometres per litre rather than gallons per mile or litres per kilometre.

We often choose scales of measurement for convenience, but they are just that, choices. There is often no overwhelming reason to use one scale rather than another. In the same way, when we use a transformation, we are choosing the scale for ease of statistical analysis, not to get the answer we want.

References

Bland M. (2000) An Introduction to Medical Statistics. Oxford University Press.

Cutting CW, Hunt C, Nisbet JA, Bland JM, Dalgleish AG, Kirby RS. (1999) Serum insulin-like growth factor-1 is not a useful marker of prostate cancer. BJU International 83, 996-999.

Friedland JS, Porter JC, Daryanani S, Bland JM, Screaton NJ, Vesely MJJ, Griffin GE, Bennett ED, Remick DG. (1996) Plasma proinflammatory cytokine concentrations, Acute Physiology and Chronic Health Evaluation (APACHE) III scores and survival in patients in an intensive care unit. Critical Care Medicine 24, 1775-81.

Kiely PDW, Bland JM, Joseph AEA, Mortimer PS, Bourke BE. (1995) Upper limb lymphatic function in inflammatory arthritis. Journal of Rheumatology 22, 214-217.

Markus HS, Barley J, Lunt R, Bland JM, Jeffery S, Carter ND, Brown MM. (1995) Angiotensin-converting enzyme gene deletion polymorphism: a new risk factor for lacunar stroke but not carotid atheroma. Stroke 26, 1329-33.

Pritchard BNC, Dickinson CJ, Alleyne GAO, Hurst P, Hill ID, Rosenheim ML, Laurence DR. (1963) Report of a clinical trial from Medical Unit and MRC Statistical Unit, University College Hospital Medical School, London. British Medical Journal 2, 1226-7.

Sevrukov AB, Bland JM, Kondos GT. (2005) Serial electron beam CT measurements of coronary artery calcium: Has your patient’s calcium score actually changed? American Journal of Roentgenology 185, 1546-1553.

This page maintained by Martin Bland.
Last updated: 13 January, 2020.