Prof. of Health Statistics

University of York

- What is a cross-over trial?
- Advantages of cross-over designs
- Estimation and significance tests
- Analysis for a simple two period two treatment crossover trial
- Interaction between period and treatment
- Carry over effects
- Should we test the period × treatment interaction?
- Washout periods
- Baseline measurements
- Books on cross-over trials
- References

In these notes I shall describe the uses of and contra-indications for cross-over trials, the analysis of a cross-over trial comparing two treatments, and some features of cross-over trial design. I shall be describing a recent trial on which I have collaborated and two older trials drawn from the literature.

For example, an early two treatment cross-over trial was done to compare
pronethalol with placebo for the treatment of angina pectoris.
Patients received placebo for two periods of two weeks and pronethalol
for two periods of two weeks, in random order (Pritchard *et al.* 1963).
They completed diaries of attacks of angina. The results were as follows:

Treatment | Numbers of attacks |
---|---|
Placebo | 2, 3, 7, 8, 14, 17, 23, 34, 60, 79, 71, 323 |
Pronethalol | 0, 0, 1, 2, 7, 15, 16, 25, 29, 41, 65, 348 |

There is great variability in the numbers of attacks and the difference is not significant. The Mann Whitney U test gives P = 0.4. But this analysis is wrong; it ignores the data structure. These observations should be paired, as in Table 1.

Patient | Placebo | Pronethalol | Placebo minus Pronethalol |
---|---|---|---|
1 | 71 | 29 | 42 |
2 | 323 | 348 | –25 |
3 | 8 | 1 | 7 |
4 | 14 | 7 | 7 |
5 | 23 | 16 | 7 |
6 | 34 | 25 | 9 |
7 | 79 | 65 | 14 |
8 | 60 | 41 | 19 |
9 | 2 | 0 | 2 |
10 | 3 | 0 | 3 |
11 | 17 | 15 | 2 |
12 | 7 | 2 | 5 |

Now we can see, despite the great variability, a suggestion of a treatment effect. Eleven of the 12 participants had more attacks on placebo than on pronethalol. As the distribution of differences is far from Normal, we can use the sign test to compare the two treatments. This gives P = 0.006. The paired analysis gives a highly significant difference, whereas the two sample analysis using the Mann Whitney U test gave P = 0.4.
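The sign test on Table 1 can be reproduced by hand. The following is a minimal sketch in Python (not the software used in these notes), using only the standard library; the variable names are chosen here for illustration. It counts the positive differences and doubles the exact binomial tail probability under the null hypothesis that positive and negative differences are equally likely.

```python
from math import comb

# Paired attack counts from Table 1, one pair per patient.
placebo = [71, 323, 8, 14, 23, 34, 79, 60, 2, 3, 17, 7]
pronethalol = [29, 348, 1, 7, 16, 25, 65, 41, 0, 0, 15, 2]

diffs = [p - q for p, q in zip(placebo, pronethalol)]
nonzero = [d for d in diffs if d != 0]  # the sign test drops tied pairs
n = len(nonzero)
n_pos = sum(d > 0 for d in nonzero)

# Two-sided exact binomial P value under H0: P(positive difference) = 0.5.
k = max(n_pos, n - n_pos)
tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
p_value = min(1.0, 2 * tail)

print(n_pos, n, round(p_value, 3))  # 11 12 0.006
```

The tail here is (12 + 1)/4096, so the two-sided P value is 26/4096, which rounds to the 0.006 quoted above.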

Cross-over designs have several advantages over a parallel group design of the same size:

- each participant acts as their own control,
- this removes the variability between participants from the treatment comparison,
- so fewer participants are needed.

They have some disadvantages, too:

- short term treatment, because we need to switch treatments before participants quit the trial,
- no follow-up, because at the end of treatment all patients have had both treatments.

Cross-over trials are not suitable for many disease and treatment combinations. Cross-over trials are suitable for:

- chronic diseases (such as angina, asthma, or arthritis),
- symptomatic treatment, where the disease will still be present and in a similar state for both treatments,
- quick, quantitative outcome variables (such as attack frequency, lung function, pain scores),
- early stages in treatment development.

Cross-over trials are not suitable for:

- acute conditions (such as myocardial infarction, pneumonia),
- treatment to cure or change the course of the disease (antibiotics, clot-busters), because they would leave no disease present for the second treatment,
- treatments which persist or have long-term effects,
- slow outcomes (such as time to recurrence), because we must move on to the next treatment,
- qualitative outcomes which are yes or no, because they typically require large samples and cross-over trials are usually small,
- later stages in treatment development (side effects of long term treatment), because we usually want a long follow-up time.

Trialists are encouraged to present results of trials as estimates with confidence intervals rather than as significance tests, i.e. P values. Cross-over trials are typically small, so t methods are required to do this. In the pronethalol example, only P values were given, because the distributions were very skew.

Does this matter? We can argue that it does not matter so much as it would in a larger trial, as cross-over trials are usually at an early stage in treatment development. The estimate of the treatment effect which we would get might not be very relevant to that which we would achieve in long term use. P values are often more important than estimates.

A trial where there are two treatments, each given once, in random order, is called a simple two period two treatment cross-over trial. It is also called an AB/BA design, because patients are randomised to receive A then B or B then A.

The analysis will be illustrated using a cross-over trial of a homeopathic preparation intended to reduce mental fatigue. This was a trial in healthy volunteers. On different occasions, paid student and staff volunteers received either the homeopathic preparation or a placebo. They underwent a psychological test to measure their resistance to mental fatigue.

There were two treatments, labelled A and B: one was a homeopathic dose of potassium phosphate and the other an apparently identical placebo as control. This was a triple blind trial, in that I did not know which was which at the time of analysis.

Subjects took A or B, in random order, on different occasions, and carried out a test where accuracy was the outcome measurement. There were 86 subjects, 43 for each order.

Table 2 shows the results of the homeopathy trial.

A first: acc1 | A first: acc2 | B first: acc1 | B first: acc2 |
---|---|---|---|
84 | 108 | 50 | 101 |
85 | 108 | 86 | 99 |
88 | 82 | 89 | 106 |
88 | 89 | 91 | 102 |
88 | 107 | 92 | 100 |
91 | 104 | 93 | 106 |
92 | 107 | 93 | . |
93 | 89 | 97 | 106 |
98 | 89 | 99 | 106 |
98 | 107 | 101 | 103 |
101 | 80 | 102 | 95 |
101 | 90 | 102 | 99 |
101 | 99 | 102 | 101 |
103 | 98 | 102 | 101 |
103 | 106 | 102 | 106 |
103 | 107 | 102 | 108 |
104 | 107 | 102 | 108 |
104 | 108 | 103 | 105 |
105 | 106 | 103 | 108 |
105 | 107 | 104 | 90 |
105 | 108 | 105 | 104 |
106 | 100 | 105 | 107 |
106 | 104 | 105 | 107 |
106 | 107 | 105 | 108 |
106 | 107 | 106 | 96 |
106 | 107 | 106 | 108 |
106 | 108 | 106 | 108 |
106 | 108 | 106 | 108 |
106 | 108 | 106 | . |
107 | 100 | 107 | 105 |
107 | 104 | 107 | 106 |
107 | 105 | 107 | 106 |
107 | 107 | 107 | 106 |
107 | 107 | 107 | 107 |
107 | 108 | 107 | 107 |
107 | 108 | 107 | 108 |
108 | 94 | 108 | 107 |
108 | 104 | 108 | 107 |
108 | 106 | 108 | 108 |
108 | 108 | 108 | 108 |
108 | 108 | 108 | 108 |
108 | 108 | 108 | 108 |
108 | 108 | 108 | 108 |

The variables acc1 and acc2 are the accuracy scores for the first period and second period. The observations are sorted by first observation. A dot (.) marks a missing second measurement.

There appears to be a ceiling effect, where the maximum possible score is 108 and many students achieve this. Two students did not come back for the second measurement.

Figure 1 shows a plot of the accuracy score by treatment and period.

**Figure 1. The accuracy test for the two periods and two treatments**

(Observations have been jittered slightly so that they can be seen.)

    . ttest diffamb=0
    One-sample t test
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     diffamb |      84    1.035714      1.0045    9.206397   -.9621963    3.033625
    ------------------------------------------------------------------------------
    Degrees of freedom: 83
                          Ho: mean(diffamb) = 0
         Ha: mean < 0             Ha: mean != 0              Ha: mean > 0
           t =   1.0311                t =   1.0311              t =   1.0311
       P < t =   0.8472          P > |t| =   0.3055          P > t =   0.1528

The estimated treatment effect = 1.0 (95% CI –1.0 to 3.0, P=0.3). However, we should ask whether the assumptions of this analysis are met by the data. The mean and standard deviation of the differences should be constant throughout the range, because we estimate them as single numbers. We can check this by a plot of the difference against average of the two scores, as in Figure 2.

**Figure 2. Difference in accuracy scores against average of the two scores**

Clearly the standard deviation depends strongly on the accuracy. The differences should follow an approximately Normal distribution and the histogram (Figure 3) suggests that the tails are much too long.

**Figure 3. Distribution of differences in accuracy scores**

    . signrank diffamb=0
    Wilcoxon signed-rank test
            sign |      obs   sum ranks    expected
    -------------+---------------------------------
        positive |       36      1991.5      1739.5
        negative |       35      1487.5      1739.5
            zero |       13          91          91
    -------------+---------------------------------
             all |       84        3570        3570
    unadjusted variance    50277.50
    adjustment for ties     -180.00
    adjustment for zeros    -204.75
                          ----------
    adjusted variance      49892.75
    Ho: diffamb = 0
                 z =   1.128
        Prob > |z| =   0.2592

We have P = 0.3, as before. The conclusion must be that there is no evidence for a treatment effect.
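Both test statistics in this simple analysis follow directly from the summary figures printed in the output. As a rough check, sketched in Python rather than Stata and with variable names chosen here for illustration, the paired t statistic is the mean difference divided by its standard error, and the signed-rank z is the excess of the positive rank sum over its expectation divided by the square root of the adjusted variance. The P values need the t and Normal distribution functions, which are left to a statistics package.

```python
from math import sqrt

# Paired t test: summary statistics copied from the `ttest diffamb=0` output.
n, mean_diff, sd_diff = 84, 1.035714, 9.206397
se = sd_diff / sqrt(n)      # standard error of the mean difference
t_stat = mean_diff / se     # referred to a t distribution on n - 1 df
df = n - 1

# Wilcoxon signed-rank test: figures copied from the `signrank` output.
sum_pos_ranks, expected, adj_variance = 1991.5, 1739.5, 49892.75
z = (sum_pos_ranks - expected) / sqrt(adj_variance)

print(round(se, 4), round(t_stat, 4), df, round(z, 3))  # 1.0045 1.0311 83 1.128
```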

In this simple analysis, any difference between periods goes into the error. They increase the standard deviation of treatment differences. A better way to analyse such data is to adjust for period effects. We can do this in two ways. First we show a step by step method using t tests (Armitage and Hills, 1982), then an all-in-one method using analysis of variance.

To see how the analysis works, we will use the following notation:

- A1 = the mean for A in the first period
- A2 = the mean for A in the second period
- B1 = the mean for B in the first period
- B2 = the mean for B in the second period

First we ask whether there is evidence for a period effect, i.e. are scores in the first period the same as in the second? For example, in this study there might be a learning effect, with accuracy increasing with repetition of the test.

If there is no period effect, we expect the mean difference between the treatments, A minus B, to be the same for the two treatment orders.

The period effect, first period minus second period, will be estimated by

(A1 – A2 + B1 – B2)/2.

We can rearrange this as

(A1 – B2 – A2 + B1)/2 = (A1 – B2)/2 – (A2 – B1)/2

(A1 – B2) is the mean treatment difference, A minus B, for the group with A first and (A2 – B1) is the mean treatment difference for the group with B first. We can test the null hypothesis that the difference between these two mean differences is zero by comparing the difference A minus B between orders. Figure 4 shows a scatter plot of the difference in accuracy score between treatments against treatment order.

**Figure 4. Difference in accuracy score between treatments for the two treatment orders**

We can compare the mean difference, A minus B, between the two orders using a two sample t test:

    . ttest diffamb, by(order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     A first |      43   -.8604651    1.295952    8.498127   -3.475803    1.754872
     B first |      41     3.02439    1.498978    9.598145   -.0051582    6.053939
    ---------+--------------------------------------------------------------------
    combined |      84    1.035714      1.0045    9.206397   -.9621963    3.033625
    ---------+--------------------------------------------------------------------
        diff |           -3.884855    1.975746               -7.815243    .0455321
    ------------------------------------------------------------------------------
    Degrees of freedom: 82
                     Ho: mean(A first) - mean(B first) = diff = 0
         Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
           t =  -1.9663                t =  -1.9663              t =  -1.9663
       P < t =   0.0263          P > |t| =   0.0527          P > t =   0.9737

There is weak evidence of a period effect, P=0.05. If A is first, mean A minus B is negative, meaning the second score (B) is higher; if B is first, mean A minus B is positive, meaning the second score (A) is higher.
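The period comparison can be checked from the group summaries in the output. The sketch below is in Python rather than Stata, and the helper name `pooled_t` is introduced here for illustration. It computes the equal-variance two sample t statistic from the group sizes, means, and standard deviations, and halves the difference between orders to give the estimated period effect, first period minus second period.

```python
from math import sqrt

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two sample t test from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, diff / se, df

# Mean and SD of A minus B for each order, copied from the output above.
diff, se, t_stat, df = pooled_t(43, -0.8604651, 8.498127,
                                41, 3.02439, 9.598145)

# (A1 - B2)/2 - (A2 - B1)/2: the period effect, first minus second period.
period_effect = diff / 2

print(round(period_effect, 2), round(t_stat, 2), df)  # -1.94 -1.97 82
```

On these figures the mean score is estimated to be about 1.9 points lower in the first period than in the second, which is consistent with a learning effect.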

The distribution in Figure 4 looks quite good for the t test, but we can compare the non-parametric analysis. This uses the Mann Whitney U test or two sample rank sum test:

    . ranksum diffamb, by(order)
    Two-sample Wilcoxon rank-sum (Mann-Whitney) test
           order |      obs    rank sum    expected
    -------------+---------------------------------
         A first |       43        1610      1827.5
         B first |       41        1960      1742.5
    -------------+---------------------------------
        combined |       84        3570        3570
    unadjusted variance    12487.92
    adjustment for ties     -151.09
                          ----------
    adjusted variance      12336.83
    Ho: diffamb(order==A first) = diffamb(order==B first)
                 z =  -1.958
        Prob > |z| =   0.0502

Again we have weak evidence of a period effect, P=0.05. The two analyses produce very similar results. So there appears to be some evidence for a learning effect in the accuracy score.

We can allow for a possible period effect by looking at the treatment difference for period 1, A1 – B1 and the treatment difference for period 2, A2 – B2, and averaging them to give (A1 – B1)/2 + (A2 – B2)/2. We can rearrange this:

(A1 – B1)/2 + (A2 – B2)/2 = (A1 – B2)/2 – (B1 – A2)/2

Hence to estimate and test the treatment effect, we compare the mean difference, period 1 minus period 2, between the two orders: A1 – B2 for the A first group and B1 – A2 for the B first group. This is called the CROS analysis. We get:

    . ttest diff1m2, by(order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     A first |      43   -.8604651    1.295952    8.498127   -3.475803    1.754872
     B first |      41    -3.02439    1.498978    9.598145   -6.053939    .0051582
    ---------+--------------------------------------------------------------------
    combined |      84   -1.916667    .9887793    9.062312   -3.883309    .0499756
    ---------+--------------------------------------------------------------------
        diff |            2.163925    1.975746               -1.766462    6.094313
    ------------------------------------------------------------------------------
        diff = mean(A first) - mean(B first)                          t =   1.0952
    Ho: diff = 0                                     degrees of freedom =       82
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.8617         Pr(|T| > |t|) = 0.2766          Pr(T > t) = 0.1383

The estimate of the effect is half the observed difference: 2.163925/2 = 1.1 (95% CI –0.9 to 3.0, P=0.3). There is no evidence for a treatment effect.
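The halving applies to the confidence limits as well as to the point estimate. A small Python sketch of the arithmetic, with the three numbers copied from the `diff` row of the output and variable names chosen here for illustration:

```python
# Difference between orders and its 95% limits, from the `diff` row above.
diff, ci_low, ci_high = 2.163925, -1.766462, 6.094313

# Each order's mean of (period 1 minus period 2) contains the treatment effect
# once, with opposite signs, so their difference is twice the treatment effect.
effect = diff / 2
effect_ci = (ci_low / 2, ci_high / 2)

print(round(effect, 1), round(effect_ci[0], 1), round(effect_ci[1], 1))  # 1.1 -0.9 3.0
```

The t statistic and P value are unchanged by the halving, because the estimate and its standard error are divided by the same constant.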

The non-parametric equivalent is a Mann Whitney U test of the difference, period 1 minus period 2, between the two orders:

    . ranksum diff1m2, by(order)
    Two-sample Wilcoxon rank-sum (Mann-Whitney) test
           order |      obs    rank sum    expected
    -------------+---------------------------------
         A first |       43        1949      1827.5
         B first |       41        1621      1742.5
    -------------+---------------------------------
        combined |       84        3570        3570
    unadjusted variance    12487.92
    adjustment for ties     -100.77
                          ----------
    adjusted variance      12387.15
    Ho: diff1m2(order==A first) = diff1m2(order==B first)
                 z =   1.092
        Prob > |z| =   0.2750

Again there is no evidence for a treatment effect, P=0.3. The two analyses give very similar results.
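One way to see why these two contrasts separate the treatment and period effects is to write the four means under a simple additive model. This is only a sketch: the symbols μ (overall mean), τ (treatment effect, A minus B) and π (period effect, first minus second) are introduced here for illustration, and the model assumes there is no period × treatment interaction.

```latex
\begin{aligned}
A_1 &= \mu + \tfrac{\tau}{2} + \tfrac{\pi}{2}, &
B_1 &= \mu - \tfrac{\tau}{2} + \tfrac{\pi}{2}, \\
A_2 &= \mu + \tfrac{\tau}{2} - \tfrac{\pi}{2}, &
B_2 &= \mu - \tfrac{\tau}{2} - \tfrac{\pi}{2}, \\[4pt]
A_1 - B_2 &= \tau + \pi, &
B_1 - A_2 &= -\tau + \pi, \\[4pt]
\tfrac{1}{2}(A_1 - B_2) - \tfrac{1}{2}(B_1 - A_2) &= \tau, &
\tfrac{1}{2}(A_1 - B_2) - \tfrac{1}{2}(A_2 - B_1) &= \pi .
\end{aligned}
```

Under these assumptions the period effect cancels from the treatment contrast and the treatment effect cancels from the period contrast, which is why each can be tested by a two sample comparison between orders.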

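We can do the same analysis by analysis of variance, with accuracy score as the outcome variable and subject, treatment, and period as factors. The P values for treatment (0.2766) and period (0.0527) are the same as those from the two sample t tests above: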

    . anova score sub treat period
                      Number of obs =     170     R-squared     =  0.6490
                      Root MSE      = 6.40033     Adj R-squared =  0.2765
        Source |  Partial SS    df       MS           F     Prob > F
    -----------+----------------------------------------------------
         Model |  6210.08374    87  71.3802729       1.74     0.0059
               |
           sub |  5990.61699    85  70.4778469       1.72     0.0071
         treat |  49.1391331     1  49.1391331       1.20     0.2766
        period |  158.377228     1  158.377228       3.87     0.0527
               |
      Residual |   3359.0692    82  40.9642585
    -----------+----------------------------------------------------
         Total |  9569.15294   169  56.6222068

If we compare the results of the CROS analysis with the simple paired t test, we have treatment estimate 1.1 (95% CI –0.9 to 3.0, P=0.3) by CROS and estimated treatment effect = 1.0 (95% CI –1.0 to 3.0, P=0.3) by paired t test. In fact the P value for the CROS test is fractionally smaller, 0.28 compared to 0.31, and the confidence interval very slightly narrower, so ignoring the period effect has little impact on the results in this example.

In general though, it is better to take the period effect into account and do the CROS analysis. The period effect might be bigger than here and there is nothing to lose.

We often want to ask whether the effects of B are the same if it follows A as they are if B comes first. In the mental fatigue trial, there could be an interaction because of the ceiling effect and practice improving accuracy. Treatment A could raise scores to the ceiling and all participants could get near the ceiling in the second period, due to practice. This would result in no treatment difference if A came first, but a difference if B came first.

We ask whether the treatment difference is the same whatever order of treatments is given. In other words, is there an interaction between period and treatment? Is there an order effect? If there is no interaction, the participant’s average response should be the same whichever order treatments were given. We ask: is A1 + B2 = A2 + B1? Note that this is the same as comparing the treatment difference in period 1 with the treatment difference in period 2:

is A1 – B1 = A2 – B2?

To test for a period × treatment interaction, we can compare the sum or the average of the scores on the two treatments between orders. The participant’s average response should be the same in whichever order treatments are given.

Figure 5 shows the average score for treatments A and B plotted against the order in which treatments were given.

**Figure 5. Average score for treatments A and B by order in which treatments were given**

We can compare average between orders using a two sample t test:

    . ttest av1and2, by(order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     A first |      43     102.593    .9138191    5.992312    100.7489    104.4372
     B first |      41    103.2439    .9346191    5.984482     101.355    105.1328
    ---------+--------------------------------------------------------------------
    combined |      84    102.9107    .6504313    5.961301     101.617    104.2044
    ---------+--------------------------------------------------------------------
        diff |           -.6508792    1.307167               -3.251251    1.949493
    ------------------------------------------------------------------------------
    Degrees of freedom: 82
                     Ho: mean(A first) - mean(B first) = diff = 0
         Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
           t =  -0.4979                t =  -0.4979              t =  -0.4979
       P < t =   0.3099          P > |t| =   0.6199          P > t =   0.6901

There is no evidence of an interaction, P=0.6. However, the distributions shown in Figure 5 are negatively skew, not Normal and the assumptions for the t test are not well met.

We can do a Mann Whitney U test instead:

    . ranksum av1and2, by(order)
    Two-sample Wilcoxon rank-sum (Mann-Whitney) test
           order |      obs    rank sum    expected
    -------------+---------------------------------
         A first |       43        1766      1827.5
         B first |       41        1804      1742.5
    -------------+---------------------------------
        combined |       84        3570        3570
    unadjusted variance    12487.92
    adjustment for ties      -67.77
                          ----------
    adjusted variance      12420.15
    Ho: av1and2(order==A first) = av1and2(order==B first)
                 z =  -0.552
        Prob > |z| =   0.5811

Again we have P = 0.6 and no evidence of any interaction between treatment and order in this trial.

The power of the test of interaction is low and alpha = 0.10 is often recommended as a decision point, rather than 0.05. As we shall see below, the real question is whether we should test at all and what we should do if we find anything.

In the mental fatigue trial, there could be an interaction because of the ceiling effect and practice. Another possibility in cross-over trials is a carry-over effect, where the first treatment continues to have an effect in the second period.

This example, a trial of Nicardipine against placebo in patients with
Raynaud’s phenomenon (Kahan *et al.*, 1987) was given by Altman (1991).
Patients with Raynaud’s phenomenon were given either the drug
Nicardipine or a placebo, each for a two week period, in random order.
They were asked to record the number of attacks of Raynaud’s phenomenon
which they experienced.
Table 3 shows the results.

Nicardipine first: Period 1 (Nicardipine) | Nicardipine first: Period 2 (Placebo) | Nicardipine first: Placebo – Nicardipine | Placebo first: Period 1 (Placebo) | Placebo first: Period 2 (Nicardipine) | Placebo first: Placebo – Nicardipine |
---|---|---|---|---|---|
16 | 12 | –4 | 18 | 12 | 6 |
26 | 19 | –7 | 12 | 4 | 8 |
8 | 20 | 12 | 46 | 37 | 9 |
37 | 44 | 7 | 51 | 58 | –7 |
9 | 25 | 16 | 28 | 2 | 26 |
41 | 36 | –5 | 29 | 18 | 11 |
52 | 36 | –16 | 51 | 44 | 7 |
10 | 11 | 1 | 46 | 14 | 32 |
11 | 20 | 9 | 18 | 30 | –12 |
30 | 27 | –3 | 44 | 4 | 40 |
Means: | | | | | |
24.0 | 25.0 | 1.0 | 34.3 | 22.3 | 12.0 |

When Nicardipine was the first treatment, there was no obvious difference between Nicardipine and placebo and the mean difference was only 1.0 attacks. When placebo was the first treatment, there was a much larger difference between Nicardipine and placebo and the mean difference was 12.0 attacks. That looks like carry-over to me! The Nicardipine appears to be still acting when the subject takes the placebo. Figure 6 shows the difference in numbers of attacks on placebo and on Nicardipine for the two treatment orders.

**Figure 6. Difference in attacks of Raynaud’s phenomenon on Nicardipine and placebo by treatment order, with zero line**

We can test for a period × treatment interaction by comparing the average of each participant's two counts between the two orders. In the output, the groups labelled "1st peri" and "2nd peri" are the Nicardipine first and placebo first orders respectively, as their means match those in Table 3:

    . ttest av , by(order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    1st peri |      10        24.5    3.952496    12.49889    15.55883    33.44117
    2nd peri |      10        28.3    4.782027     15.1221     17.4823     39.1177
    ---------+--------------------------------------------------------------------
    combined |      20        26.4    3.050582    13.64262    20.01506    32.78494
    ---------+--------------------------------------------------------------------
        diff |                -3.8    6.204031               -16.83419    9.234185
    ------------------------------------------------------------------------------
        diff = mean(1st peri) - mean(2nd peri)                        t =  -0.6125
    Ho: diff = 0                                     degrees of freedom =       18
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.2739         Pr(|T| > |t|) = 0.5479          Pr(T > t) = 0.7261

There is no evidence for an interaction, P = 0.5. It is not significant even at the liberal 0.10 level. But there appears to be one!
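Because Table 3 gives the individual counts, this interaction test can be reproduced from the raw data. The following is a sketch in Python using only the standard library; the helper name `pooled_t` and the list names are introduced here for illustration. It averages each patient's two counts and compares the averages between orders with an equal-variance two sample t statistic.

```python
from math import sqrt
from statistics import mean, variance

# Table 3: (period 1, period 2) attack counts for each patient.
nicardipine_first = [(16, 12), (26, 19), (8, 20), (37, 44), (9, 25),
                     (41, 36), (52, 36), (10, 11), (11, 20), (30, 27)]
placebo_first = [(18, 12), (12, 4), (46, 37), (51, 58), (28, 2),
                 (29, 18), (51, 44), (46, 14), (18, 30), (44, 4)]

def pooled_t(x, y):
    """Equal-variance two sample t statistic for two lists of numbers."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    diff = mean(x) - mean(y)
    return diff, se, diff / se, df

# Interaction test: each patient's average over both periods, by order.
avg_nf = [(p1 + p2) / 2 for p1, p2 in nicardipine_first]
avg_pf = [(p1 + p2) / 2 for p1, p2 in placebo_first]
diff, se, t_stat, df = pooled_t(avg_nf, avg_pf)

print(f"{diff:.1f} {se:.3f} {t_stat:.4f} {df}")  # -3.8 6.204 -0.6125 18
```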

We can compare treatments using the CROS analysis:

    . ttest diff1m2, by(order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    1st peri |      10          -1    3.119829    9.865766   -8.057544    6.057544
    2nd peri |      10          12    5.168279    16.34353    .3085399    23.69146
    ---------+--------------------------------------------------------------------
    combined |      20         5.5    3.294733    14.73449   -1.395955    12.39595
    ---------+--------------------------------------------------------------------
        diff |                 -13    6.036923               -25.68311   -.3168945
    ------------------------------------------------------------------------------
        diff = mean(1st peri) - mean(2nd peri)                        t =  -2.1534
    Ho: diff = 0                                     degrees of freedom =       18
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0225         Pr(|T| > |t|) = 0.0451          Pr(T > t) = 0.9775

There is some evidence for a treatment effect, P = 0.045. But the estimate must be in doubt, due to the apparent interaction and I would not trust it.
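The CROS figures for the Nicardipine trial can be reproduced from Table 3 in the same way. This is a sketch in Python with illustrative names, not the Stata used for the output above. For each patient it takes period 1 minus period 2, compares the two orders, and halves the difference to give the treatment estimate, here Nicardipine minus placebo.

```python
from math import sqrt
from statistics import mean, variance

# Table 3: (period 1, period 2) attack counts for each patient.
nicardipine_first = [(16, 12), (26, 19), (8, 20), (37, 44), (9, 25),
                     (41, 36), (52, 36), (10, 11), (11, 20), (30, 27)]
placebo_first = [(18, 12), (12, 4), (46, 37), (51, 58), (28, 2),
                 (29, 18), (51, 44), (46, 14), (18, 30), (44, 4)]

def pooled_t(x, y):
    """Equal-variance two sample t statistic for two lists of numbers."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    diff = mean(x) - mean(y)
    return diff, se, diff / se, df

# Period 1 minus period 2 for every patient, by order.
d_nf = [p1 - p2 for p1, p2 in nicardipine_first]  # Nicardipine - placebo
d_pf = [p1 - p2 for p1, p2 in placebo_first]      # placebo - Nicardipine
diff, se, t_stat, df = pooled_t(d_nf, d_pf)

treatment_effect = diff / 2                   # Nicardipine minus placebo
period_effect = (mean(d_nf) + mean(d_pf)) / 2 # period 1 minus period 2

print(f"{diff:.1f} {se:.3f} {t_stat:.4f} {treatment_effect:.1f}")  # -13.0 6.037 -2.1534 -6.5
```

On this analysis the period-adjusted estimate is 6.5 fewer attacks on Nicardipine than on placebo, subject to the doubt about the interaction discussed above.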

As an aside we can compare the results of the CROS with those of a simple paired t test:

    . ttest diffamb=0
    One-sample t test
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     diffamb |      20         6.5     3.19745    14.29943    -.192339    13.19234
    ------------------------------------------------------------------------------
        mean = mean(diffamb)                                          t =   2.0329
    Ho: mean = 0                                     degrees of freedom =       19
        Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
     Pr(T < t) = 0.9719         Pr(|T| > |t|) = 0.0563          Pr(T > t) = 0.0281

From this, the evidence for a treatment effect is weaker and not conventionally significant, P = 0.056. The CROS analysis removes the (non-significant) period difference from the error, so its standard error is a little smaller and the test a little more powerful.

Should we test for an interaction in a crossover trial? And what should we do about it when there is one? There are two views about this. One follows Grizzle (1965). He recommended testing the interaction routinely. He argued that if the interaction were significant, we cannot use the tainted second period. We should use the period 1 data only.

If we do this for the fatigue trial, we get difference, A – B = 0.5 (95% CI –3.2 to 4.2, P=0.8). This approach is called the two-stage analysis. Its proponents recommend that we should do the cross-over trial with a sufficient sample size to have adequate power from a two-group comparison of period 1 only. This seems to contradict the whole purpose of a cross-over trial, sacrificing its greater efficiency.

If we compare the full data estimate using the CROS analysis, our estimated difference, A – B, = 1.1 (95% CI –0.9 to 3.0, P=0.3). The confidence interval is narrower and the P value smaller. With the two stage analysis we lose power and precision.

For the Nicardipine data, the estimate for the first period only is given by:

    . ttest per1, by( order)
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    1st peri |      10          24    4.935135    15.60627    12.83595    35.16405
    2nd peri |      10        34.3    4.740019    14.98926    23.57733    45.02267
    ---------+--------------------------------------------------------------------
    combined |      20       29.15    3.533505    15.80231    21.75429    36.54571
    ---------+--------------------------------------------------------------------
        diff |               -10.3    6.842758                -24.6761    4.076101
    ------------------------------------------------------------------------------
        diff = mean(1st peri) - mean(2nd peri)                        t =  -1.5052
    Ho: diff = 0                                     degrees of freedom =       18
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0748         Pr(|T| > |t|) = 0.1496          Pr(T > t) = 0.9252

This is not significant, P = 0.15, compared to P = 0.045 by CROS.
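The loss of precision from discarding the second period can be seen by putting the two standard errors side by side. The sketch below, in Python with illustrative names, recomputes the period 1 only comparison from Table 3 and compares its standard error with that of the CROS treatment estimate, which is half the standard error of the difference between orders.

```python
from math import sqrt
from statistics import mean, variance

# Table 3: (period 1, period 2) attack counts for each patient.
nicardipine_first = [(16, 12), (26, 19), (8, 20), (37, 44), (9, 25),
                     (41, 36), (52, 36), (10, 11), (11, 20), (30, 27)]
placebo_first = [(18, 12), (12, 4), (46, 37), (51, 58), (28, 2),
                 (29, 18), (51, 44), (46, 14), (18, 30), (44, 4)]

def pooled_t(x, y):
    """Equal-variance two sample t statistic for two lists of numbers."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    diff = mean(x) - mean(y)
    return diff, se, diff / se, df

# Period 1 only: a parallel group comparison, Nicardipine minus placebo.
diff1, se1, t1, df1 = pooled_t([p1 for p1, _ in nicardipine_first],
                               [p1 for p1, _ in placebo_first])

# CROS: the treatment estimate is half the difference between orders,
# so its standard error is half the standard error of that difference.
_, se_between, _, _ = pooled_t([p1 - p2 for p1, p2 in nicardipine_first],
                               [p1 - p2 for p1, p2 in placebo_first])
se_cros = se_between / 2

print(f"{diff1:.1f} {se1:.3f} {t1:.4f}")        # -10.3 6.843 -1.5052
print(f"{se_cros:.3f} {se1 / se_cros:.2f}")     # 3.018 2.27
```

In this trial the period 1 only estimate has a standard error more than twice that of the CROS estimate, which is the loss of power and precision described above.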

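Senn (1989) argued that the interaction test is highly misleading. The average of the first and second periods is highly correlated with the first period, so the interaction test and the period 1 comparison are not independent. When the interaction test is significant by chance, the period 1 comparison is likely to be significant too, so the two-stage analysis gives more false positive results than its nominal significance level suggests. For the fatigue trial, the correlation is shown in Figure 7, where the accuracy score in the first period can be seen to be quite strongly related to the average of the two.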

**Figure 7. Accuracy score for Period 1 against the average accuracy score over both periods**

A washout period is a time when the participants do not receive any active trial treatment. It is intended to prevent continuation of the effects of the trial treatment from one period to another, carry-over. A typical cross-over trial with washout periods might look like this:

- Washout / run-in — removes effects of pre-trial treatments
- Treatment 1
- Washout — removes effects of treatment 1
- Treatment 2
- Washout — removes effects of treatment 2
- Usual care

A washout period is necessary if treatments might interact in an adverse way. If two drugs are being compared which have antagonistic methods of action, we do not want them both to be present at the same time.

In a placebo controlled trial, we do not need washout periods for safety reasons. We could simply make the treatment periods long enough so that the first treatment has been eliminated by the time we make the measurements for the second treatment.

In drug trials, washout periods should be at least 3 × the half life of the drug in the body (FDA). If no washout periods are used, the treatment periods should be longer than would be required for washout and no measurements should be made in the time that would be needed for washout.

Baseline measurements are made before we begin the trial. In a cross-over trial, baseline measurements may be made before the trial treatments begin:

- Washout
- Baseline 1
- Period 1 outcome
- Period 2 outcome

or at the start of each period:

- Washout
- Baseline 1
- Period 1 outcome
- Washout
- Baseline 2
- Period 2 outcome

In a parallel group trial, baseline measurements can be very useful as covariates and can greatly improve power or reduce required sample size. They are of less value in cross-over trials.

As in a parallel group trial, Baseline 1 can be used as a descriptive variable for the trial population, so that we know what kind of participants are taking part in the trial. We can also use Baseline 1 to look for a baseline × treatment interaction, e.g. do people with high baseline values of the outcome measurement have a different treatment effect from people with low baseline values? This requires a larger sample than required to detect the overall treatment effect to be worthwhile.

With only one baseline, we can also include it as a treatment period, to give a three period design and use it to improve the estimate of variance. This might be of some limited value in very small trials where we have few degrees of freedom.

When there are two baselines, we can include them as covariates in an analysis of covariance. This may increase power and improve the estimate. We make the treatment period the unit of analysis and use subject, treatment, and period or treatment order as categorical factors and baseline for the period as a continuous covariate. This might be of value if the level of the outcome would be changing slowly over time in the absence of a trial, but it runs the risk of being distorted because the effects of the first treatment are still present at the time of the second baseline. We must have an adequate washout period to do this. Just as in a parallel group trial, we should not use differences from baseline as our outcome variable. This increases the measurement error. Comparing two differences from baseline in a cross-over trial would give four lots of measurement error rather than two and we would lose power.
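The count of "four lots of measurement error rather than two" can be made explicit with a small variance calculation. As a sketch, suppose each observation is the participant's underlying level plus an independent measurement error with variance σ_e² (notation introduced here for illustration), and write y_A, y_B for the outcomes on the two treatments and x_A, x_B for the corresponding period baselines. The measurement error contribution to the variance of each within-participant contrast is then:

```latex
\begin{aligned}
\text{outcomes only:}\quad
  & \operatorname{Var}_{\text{error}}\bigl(y_A - y_B\bigr)
    = \sigma_e^2 + \sigma_e^2 = 2\sigma_e^2, \\
\text{differences from baseline:}\quad
  & \operatorname{Var}_{\text{error}}\bigl[(y_A - x_A) - (y_B - x_B)\bigr]
    = 4\sigma_e^2 .
\end{aligned}
```

Under these assumptions, subtracting the baselines doubles the measurement error variance of the treatment contrast.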

Although we can make some use of them, baseline measurements are not really necessary in a cross-over trial. The comparison is within the trial participant anyway.

There are two text-books devoted to cross-over trials. I have referred to both, but they both now have second editions, Senn (2002) and Jones and Kenward (2003). For a brief introduction, try Altman (1991).

Altman DG. (1991) *Practical Statistics for Medical Research.*
London: Chapman and Hall.

Grizzle JE. (1965) The two-period change-over design and its use in clinical trials.
*Biometrics*, **21**: 467-480.

Jones B and Kenward MG. (1989) *Design and Analysis of Cross-Over Trials*.
London: Chapman and Hall.

Jones B and Kenward MG. (2003) *Design and Analysis of Cross-Over Trials, 2nd ed.*
London: Chapman and Hall.

Kahan A, Amor B, Menkes CJ, *et al.* (1987)
Nicardipine in the treatment of Raynaud’s phenomenon: a randomised double-blind trial.
*Angiology* **38**: 333-7.

Pritchard BNC, Dickinson CJ, Alleyne GAO, Hurst P, Hill ID, Rosenheim ML, Laurence DR.
(1963) Report of a clinical trial from Medical Unit and MRC Statistical Unit,
University College Hospital Medical School, London.
*British Medical Journal* **2**: 1226-7.

Senn S. (1989) *Cross-Over Trials in Clinical Research*. Chichester: Wiley.

Senn S. (2002) *Cross-Over Trials in Clinical Research, 2nd ed.* Chichester: Wiley.

This page maintained by Martin Bland.

Last updated: 15 September, 2010.