1 Overview
1.1 What is Clinstat?
Clinstat is an interactive program which can be used to carry out the analysis
of small data sets and as an aid to learning statistics. Clinstat is
menu-driven, so there are no commands to learn. It carries out all the main
univariate, unifactorial statistical calculations: regression and correlation,
t tests, chi-squared tests, analysis of variance, rank methods, etc., using
data from data files or from direct keyboard entry. It has a self-checking
data entry program and many data editing options.
There are many more powerful statistical packages available, such as SPSS, SAS,
GLIM, etc., but Clinstat's simple menu-driven format makes it especially
suitable for those new to statistical analysis and so particularly useful as a
teaching aid. In addition, Clinstat includes a number of programs specifically
written for teaching and learning statistics. These carry out simulations to
illustrate concepts such as standard error and the central limit theorem, and
draw the main probability distributions for parameters of your choice.
Clinstat was written originally for medical researchers who came for
statistical advice, hence its name "Clinical Statistics". It was developed as
an aid for subsidiary calculations on tables produced by other programs in
survey analysis, for use in student projects and for use in teaching students
about the principles of statistics. With the development of the ubiquitous PC,
it is possible to use the program interactively in teaching where each student
runs the program as the class proceeds.
1.2 About this manual
Clinstat does not really need a manual, as its menu-driven format makes most
operations straightforward and self-evident. The manual describes the
functions available in Clinstat, but is not necessary for day to day use of the
program.
The manual contains some explanations of the statistical procedures it
describes, but these are very limited. References are given to textbooks and
to particular papers in the literature where appropriate. Most references are
to An Introduction to Medical Statistics (Bland, 1987). Clinstat was developed
in conjunction with this book and most of the analyses, simulations and
graphics which the book contains were done using it. Clinstat will also do
many analyses which are not included in Bland (1987), and for these appropriate
references will be given.
1.3 Machine requirements
Clinstat is an interactive data input, editing and statistical analysis package
for IBM compatible microcomputers. It is written and compiled in Microsoft
QuickBasic. It requires an IBM compatible computer with at least one floppy
disk drive. It can use two disk drives, a hard disk and a printer if
available. You can run Clinstat from your hard disk or from floppy disks.
Clinstat supports CGA, EGA, VGA and Hercules video graphics adapters, and
supports colour graphics on all except Hercules. If there is a maths
coprocessor, Clinstat will find and use it.
1.4 Menus and questions
Clinstat is a menu-driven program, that is, it gives you lists of options and
you choose one of these by typing its number. These are arranged in a
hierarchy, so that choosing an option may lead to another menu giving more
detailed options, and so on. For example, the regression option from the main
menu leads to a menu including data input, data editing, summary statistics and
plots, regression, correlation, etc. Choosing the data input option leads to a
menu with options to get data from keyboard, read data from a disk file, etc.
If you choose the disk file option, Clinstat then asks for the name of the file
and gets the data. Although this sounds complicated, it is much simpler for
the beginner than a program which is driven by commands, which have to be
learned from a manual or help facility. Clinstat does not have a help facility
because it does not need one.
In almost all Clinstat menus, the first option is "0 none required", which
means that you do not want any of the options which follow. Typing "0" RETURN
will move Clinstat on to the next step, which is usually a higher level menu.
Thus you can quickly return to the main menu and use a different procedure or
quit altogether.
Sometimes Clinstat asks a question, such as "Do you want to label your
variables?" The answer is almost always yes or no. Clinstat only requires the
first character, "y" or "n", and will accept upper or lower case. Only for such
things as file names is the whole word needed.
If you accidentally go down a path you do not want, typing RETURN will usually
halt the procedure and get you back to the menu. Some sub-menus have an option
to quit the procedure instead.
1.5 Variables and cases
Many of these programs refer to variables and cases. Suppose we record the
age, sex, height and weight for each of 15 people. Then age, sex, height and
weight will be variables and each person will be a case. We have a value for
each variable obtained for each case. If you are setting up a data file on
disk, you are strongly recommended to have a case number as one of your
variables. The case number used by Clinstat to refer to cases is the sequence
number of the case in the computer's memory and may not be useful for
identifying cases to match with your paper records.
1.6 Data capacity
Clinstat will handle a data matrix with 10,000 numbers (cases times
variables), with a maximum of 100 variables and 500 cases. Cross-tabulations
are limited to 20 rows and 20 columns, group comparisons to 20 groups.
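The three limits on the data matrix apply together, so a file can be within
the limits on cases and variables and still be too large. The following is a
minimal Python sketch of that check, for illustration only. It is not part of
Clinstat, and the names fits_clinstat, MAX_NUMBERS, MAX_VARIABLES and
MAX_CASES are hypothetical.

```python
# Limits quoted in section 1.6. The names are illustrative, not Clinstat's.
MAX_NUMBERS = 10_000    # cases times variables
MAX_VARIABLES = 100
MAX_CASES = 500


def fits_clinstat(n_cases: int, n_variables: int) -> bool:
    """Return True if a data matrix of this shape is within the stated limits."""
    return (
        0 < n_variables <= MAX_VARIABLES
        and 0 < n_cases <= MAX_CASES
        and n_cases * n_variables <= MAX_NUMBERS
    )


print(fits_clinstat(500, 20))   # 500 x 20 = 10,000 numbers, within the limits
print(fits_clinstat(500, 21))   # 500 x 21 = 10,500 numbers, over the total
```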
1.7 Missing data
Programs which use data from disk have a provision for setting a missing data
code. When a disk file is set up, a missing data code, say 9999, can be defined
for any variable. Then if the variable is not known for a case, the missing
data code is entered instead.
When new variables are created by the edit programs, they can also be assigned
missing data codes and, for example, a new variable which is the sum of two old
variables will be given the missing data code when either old variable is
missing.
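The rule for derived variables can be sketched as follows. This is an
illustration in Python under stated assumptions, not Clinstat's own code: the
function name add_with_missing and the sample scores are hypothetical, and a
code of None stands for a variable with no missing data code.

```python
def add_with_missing(a, b, code_a, code_b, new_code):
    """Sum two variables case by case, giving new_code when either is missing.

    code_a and code_b are the missing data codes of the old variables,
    or None if a variable has no missing data code.
    """
    result = []
    for x, y in zip(a, b):
        a_missing = code_a is not None and x == code_a
        b_missing = code_b is not None and y == code_b
        result.append(new_code if (a_missing or b_missing) else x + y)
    return result


score1 = [12, 9999, 15]
score2 = [8, 7, 9999]
print(add_with_missing(score1, score2, 9999, 9999, 9999))
# prints [20, 9999, 9999]
```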
The regression, group comparison and survival data programs read only two or
three variables from disk. They will automatically exclude any case which has
either variable missing.
The cross-tabulation program does not recognize missing data codes. In this
program many variables may be read at once, and excluding cases with one of
them missing could lead to unacceptable loss of information. Two methods can
be used to deal with missing data in this program. There is a restrictions
facility which can restrict a table to cases having values for a variable
between specified limits. There are also facilities to edit a table, deleting
or combining rows or columns. It will be seen that the missing data code is of
limited value if you are only going to tabulate and do chi-squared tests.
1.8 The "Restrictions" feature
Several of these programs have a restrictions feature. This means that
operations can be restricted to cases having values for variables between
certain limits. For example, having examined the relationship between height
and weight you may wish to do this for men aged over 30 years.
After the program asks you to enter restrictions, it will ask for a variable
number. Tell it which variable you want. In the example in 1.5 above, sex is
variable 2, so type 2. It then asks for permissible values for the variable.
These may be either separate values or inclusive ranges. The lower and upper
limits of a range must be separated by 'TO', with the first limit, 'TO' and the
second limit each being followed by RETURN. There is a maximum of 10
separate values or ranges. When the permissible values have been entered, type
'NO' or 'N'. For the example, if sex had been coded 1 for male we would have 1
RETURN N RETURN. This is then repeated for another variable. For the example,
the age restriction could be entered as 30 RETURN TO RETURN 200 RETURN N
RETURN, which, as ranges are inclusive, selects ages of 30 and over.
200, of course, is greater than any possible age. This goes on until a
variable request receives a reply of 'NO'. The program lists the restrictions
and asks whether they are correct. If not, they can be re-entered. If yes,
the program proceeds to select cases which meet these restrictions.
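The selection logic described above can be sketched in Python. This is only an
illustration of the idea, not Clinstat's implementation. The function names
permitted and select_cases, the representation of a range as a (low, high)
pair, and the sample data are all assumptions made for the example.

```python
def permitted(value, allowed):
    """Check a value against a list of single values and (low, high) ranges.

    Ranges are inclusive at both ends, as in the restrictions feature.
    """
    for item in allowed:
        if isinstance(item, tuple):
            low, high = item
            if low <= value <= high:
                return True
        elif value == item:
            return True
    return False


def select_cases(cases, restrictions):
    """Keep the cases that satisfy every restriction.

    restrictions maps a variable number (counting from 1) to its allowed list.
    """
    return [
        case for case in cases
        if all(permitted(case[var - 1], allowed)
               for var, allowed in restrictions.items())
    ]


# Variables in order: age, sex (1 = male, 2 = female), height, weight.
cases = [
    [25, 1, 178, 70],
    [42, 1, 181, 85],
    [51, 2, 165, 62],
    [30, 1, 175, 77],
]
# Sex equal to 1, and age in the inclusive range 30 to 200.
print(select_cases(cases, {2: [1], 1: [(30, 200)]}))
# prints [[42, 1, 181, 85], [30, 1, 175, 77]]
```

Because the range is inclusive, the case aged exactly 30 is selected. With age
in whole years, a lower limit of 31 would select only those strictly over 30.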
1.9 Clinstat data files
Clinstat data sets consist of two ASCII text files, filename.DAT and
filename.CLB. The main file, filename.DAT, holds the actual data in free
format. Each line corresponds to a case, the variables being separated by two
spaces and the line being terminated by a carriage return. This file can be
read by many other statistical programs, so having started with Clinstat it is
possible to move to another program if more advanced statistical procedures are
needed. Similarly, Clinstat can read data produced by other programs.
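Because the .DAT file is plain text with one case per line, it is easy to read
or write from other software. The Python sketch below shows one way this might
be done. It is an assumption-laden illustration rather than a description of
Clinstat's own routines: the names parse_dat and format_dat are hypothetical,
values are written with Python's general number format, and the line ending
used when writing is a choice made for the example.

```python
def parse_dat(text):
    """Parse free-format text: one case per line, values split on spaces."""
    return [
        [float(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def format_dat(cases):
    """Write one case per line with two spaces between the values."""
    lines = ["  ".join(f"{value:g}" for value in case) for case in cases]
    return "\n".join(lines) + "\n"


sample = "25  1  178  70\r\n42  2  165.5  62\r\n"
print(parse_dat(sample))
# prints [[25.0, 1.0, 178.0, 70.0], [42.0, 2.0, 165.5, 62.0]]
```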
The second file, filename.CLB, is the Clinstat labels file. For each data
matrix, the following information is stored:
    the number of variables
    the number of cases
    variable labels (up to 15 characters)
    input limits (minimum and maximum allowable values for each variable)
    missing data codes (0,0 if code absent, 1,code if present)
    number of value labels
    value label codes for each variable, start and count
    value labels
    0 0 0 0 (future use)
Clinstat can read files created by other programs provided they are in the free
ASCII format described above. The input program (main menu option 1) is used
to create a CLB file to go with the main data file.
1.10 Clinstat graphics
Clinstat produces a number of graphs: scatter diagrams, histograms, survival
curves, probability distributions, etc. These may be displayed using the CGA,
EGA, VGA or Hercules graphics adapters. Of these, the CGA, EGA and VGA options
support colour, the Hercules does not. For computers without any graphics
adapter, some graphs are shown as character graphics; only the simulation and
distribution plotting programs do not do this.
If the log file option is used, it is the character graphics form which is
stored on the log file. If a non-graphics printer (e.g. a daisy wheel) is
used, the character graphics form is printed.
Clinstat will print some graphs on a suitable dot matrix, laser or inkjet
printer. When the graph is displayed, the prompt line at the bottom of the
screen reads:
Press space bar to continue, S to save graph on disk and P to print
Pressing the "P" key will erase the prompt line and start the printer. The
graph will be in landscape mode, i.e. turned on its side. This is an 8 pin
print, so any dot matrix printer should work. Hewlett Packard compatible laser
and desk jet printers will print graphs. On some PCs this is a very slow
procedure, particularly 8086 computers with Hercules screens. Do not worry if
nothing seems to happen for a while. If you find graph printing too slow, try
saving the graphs to disk (see below) and printing them as a batch job using
the graph retrieval program, main menu option 2.
Graphics printing is not available on the teaching programs, main menu option
9.
Clinstat will save some graphs on disk and retrieve them. When the graph is
displayed, the prompt line at the bottom of the screen reads:
Press space bar to continue, S to save graph on disk and P to print
Pressing the "S" key will erase the prompt line and ask for the name of the
destination file. The information required to draw the graph is then stored on
the file. The prompt line reappears when the graph has been saved.
The graph can be retrieved using main menu option 2, sub menu option 5. This
program reads the graphics information file and draws the graph. It can be
used to print the graph as described above.
Clinstat graphs are stored as text, not as a screen image. They can only be
retrieved within Clinstat. If you want to incorporate your graph into a word
processor file, you should use one of the memory resident programs which can be
used to capture screens, such as Word Perfect's grab or Lotus' scr. Clinstat's
graphics screens are accessible to these on most computers.
Graphics saving is not available on the teaching programs, main menu option 9.
1.11 Errors in programs and user's guide
This User's Guide is correct at the date of release. However, these programs
are continually being modified to improve speed and data capacity, remove
errors and add new facilities. Thus the current program disks may contain
features not described in the user's guide. The programs will still work as
described here, however.
Every effort is made to ensure that there are no bugs in the programs. The
programs are still under development, and there are sure to be some. If you
find one, tell Martin Bland, who will endeavour to fix it.
2 Installing Clinstat
2.1 Clinstat on floppy disks
You can run Clinstat from floppy disks. For systems with two floppy drives,
use a: for the program and b: for data disks and choose the "two floppy disk
drives" option from the setup menu (see below). Otherwise, use the single
drive option. Clinstat will tell you when to change disks.
Copy the Clinstat disks and keep the originals as backup. You can then use the
copies to run Clinstat. You may make as many copies as you wish.
2.2 Installing Clinstat on your fixed disk
You can run Clinstat from the fixed disk c:. If you have an IBM graphics
adapter (CGA, EGA or VGA), or no graphics at all, install Clinstat as follows:
Put yourself in the root directory by
    c:
    cd \
Put Clinstat disk 1 in drive a: and type
    a:install
Clinstat will now set up a directory called CLINSTAT on the hard disk c: and a
file CLINSTAT.BAT in the root directory. It will ask you to put in each disk
in turn.
You may have more than one disk image on the same physical disk. If Clinstat
has been supplied on three 3½" disks, each disk will contain two disk images.
If Clinstat has been supplied on high density disks, each disk will contain
three disk images. When asked to put in Disk 2, you should leave the first
disk in place, as the Disk 2 files are on it. Press any key to continue. On
the hard disk, Clinstat requires 1.6 megabytes of storage.
Alternatively, you can set up your own directory and copy all the files to it
using DOS. Note that Clinstat must be started from within its own directory as
it reads files there. The command from within the directory is
    cl
If you have a Hercules graphics adapter proceed as above except that you must
type
    a:installh
instead of
    a:install
If you use DOS to copy Clinstat, the starting command from within the
directory is then
    clh
This is a batch file which runs a file called QBHERC.COM, which enables the
QuickBasic in which Clinstat was written to use the Hercules graphics card.
3 Starting Clinstat
3.1 Starting Clinstat from the fixed disk
Unless it has been installed in some special way, you can run Clinstat from DOS
by typing CLINSTAT at the DOS prompt. This will work if Clinstat has been
installed on your hard disk using the install program, for example. If a menu
shell has been installed, follow the appropriate instructions for the menu.
3.2 Starting Clinstat from the floppy disk
To run Clinstat from the floppy disk, put disk 1 in drive a: and type
A: CL (CLH if you are using a Hercules graphics adapter)
Clinstat will start and will tell you when you need to change
disks. The package resides on six 360K disks:
Disk 1: Start, input and edit data files, installation.
Disk 2: Copy, merge and join disk files, tabulation and cross-tabulation
from data files, two way table analysis using chi-squared tests,
comparison of two proportions, Cohen's Kappa.
Disk 3: Regression, correlation and paired data analysis.
Disk 4: Independent samples comparisons.
Disk 5: Survival analysis, random sampling, SMR calculations.
Disk 6: Simulations, probability distributions.
You may have more than one disk image on the same physical disk. If Clinstat
has been supplied on three 3½" disks, each disk will contain two disk images.
If Clinstat has been supplied on high density disks, each disk will contain
three disk images.
3.3 The set up menu
Clinstat produces a banner screen including the release date and version
details and also the current date supplied by DOS. The current date and time
are also printed at the start of your hard copy. The details of the system are
now given. When running Clinstat for the first time, make sure that this is
correct. In particular, make sure that the graphics adapter is correct. If
you try to create a graph on a video adapter which your computer does not have,
you will get an error code 5 message. If you want to change the setup, type
"Y" RETURN and the change setup menu will appear. Choose the item you wish to
change and when you have finished choose 0 to return to the main program.
Note that colour screens are not supported with the Hercules video adapter.
3.4 Printers and log files
Clinstat will next give three options for output: to screen only, to log file
or to printer. You will only be offered the printer option if there is one in
your setup, and the log file option if you have a hard disk or twin floppy
drive system.
If you want printed output, you can send this directly to the printer, which
will print your results as you go along. You can stop Clinstat printing and
start it again as you wish.
Alternatively, you can put the output onto a log file. This is a disk file
which records the results of calculations as you go along. It can then be
printed from DOS or can be edited using edlin or a word processor. See your
DOS or word processor manual for details of reading, editing and printing ASCII
files. The log file is quiet. It is particularly useful in teaching
laboratories, etc., where several computers may share one printer and the noise
of printers may distract others.
To send output to a log file, choose option 2. You are then asked for the name
of the log file.
PC file names consist of a name of up to eight letters and numbers, optionally
followed by a dot and an extension of up to three letters and numbers. You are
strongly recommended to use the extension ".DAT" for all data files and ".LOG"
for all log files. Other extensions may have special meanings and lead to
errors. For example, Clinstat automatically creates a file filename.CLB,
containing the variable labels, limits, etc., to go with the data file
filename.DAT. It is essential that your log file has a different name to your
data files, or unpredictable errors will occur.
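The naming rules above can be checked mechanically before a session. The
Python sketch below is an illustration only. It follows the description in
this section (letters and numbers, eight characters plus an optional
three-character extension) rather than the full DOS rules, and the names
is_valid_pc_name and clashes_with_data_set are hypothetical.

```python
import re

# Up to eight letters or digits, then optionally a dot and up to three more.
NAME_8_3 = re.compile(r"[A-Za-z0-9]{1,8}(\.[A-Za-z0-9]{1,3})?")


def is_valid_pc_name(name):
    """True if the name fits the pattern described in this section."""
    return NAME_8_3.fullmatch(name) is not None


def clashes_with_data_set(log_name, data_name):
    """True if the log file would overwrite the data file or its .CLB file.

    DOS file names are not case sensitive, so names are compared in upper case.
    """
    data_upper = data_name.upper()
    base = data_upper.split(".")[0]
    return log_name.upper() in {data_upper, base + ".CLB"}


print(is_valid_pc_name("SURVEY1.DAT"))                    # True
print(is_valid_pc_name("TOOLONGNAME.DAT"))                # False
print(clashes_with_data_set("SURVEY.CLB", "SURVEY.DAT"))  # True
print(clashes_with_data_set("SURVEY.LOG", "SURVEY.DAT"))  # False
```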
All the output produced by Clinstat will then be stored on the floppy disk for
printing later. As with the printer, you can stop output going to the log file
and restart it as you wish.
3.5 Printing the log file
When you have quit Clinstat you can print your log file. Make sure your
computer is connected to a printer. Make sure that the printer is switched on
and ready.
To print your file you must use commands from the disk operating system (DOS).
From the DOS prompt type
    PRINT filename.LOG
The computer may ask you
    NAME OF LIST DEVICE [PRN]:
The printer should be PRN so just press RETURN. The
computer will now print your output file.
As an alternative, you can print the log file directly from option 2 of the
main menu. This will terminate the Clinstat session, however.
3.6 The main Clinstat menu
The main Clinstat menu looks like this:
    Programs available:
    0 quit Clinstat
    1 input and edit data files
    2 copy, join and merge files, create subfiles, retrieve saved graph,
      print log file
    3 tabulation and cross-tabulation (data on disk file only)
    4 chi-squared and other tests on two-way tables
    5 regression, correlation, paired comparisons (disk file or keyboard)
    6 comparing two or more groups (disk file or keyboard)
    7 survival data (disk file or keyboard)
    8 miscellaneous calculations
    9 simulations and other demonstrations of statistical ideas
    Program required = ?
The particular Clinstat program you want is chosen from this list or from a sub
menu reached from this. For example, keying 5 RETURN will start the regression
program. Keying 8 RETURN leads to a second menu listing the miscellaneous
programs. The first of these, random sampling and allocation, is started by
keying 1 RETURN. The following sections of the manual describe each of the
Clinstat programs.
The sub menus are arranged as follows:
1 input and edit data files
    1 set up new data file
    2 add to, list and edit existing data file
2 copy, join and merge files, create subfiles, retrieve saved graph, print
  log file
    1 copy a Clinstat file
    2 join two Clinstat files (same variables, different cases)
    3 merge two Clinstat files (different variables, same cases)
    4 create a subfile
    5 retrieve Clinstat graphs
3 tabulation and cross-tabulation (data on disk file only)
4 chi-squared and other tests on two-way tables
    1 chi-squared tests and other two-way table analysis
    2 combine several 2x2 tables
    3 compare two proportions
    4 Cohen's kappa
5 regression, correlation, paired comparisons (disk file or keyboard)
6 comparing two or more groups (disk file or keyboard)
    1 from individual observations
    2 t tests from means and standard deviations
    3 ANOVA and multiple comparisons from means and standard deviations
7 survival data (disk file or keyboard)
8 miscellaneous calculations
    1 random sampling and allocation
    2 determination of sample size
    3 histogram, mean and standard deviation (disk file or keyboard)
    4 calculation of standardized mortality ratios from mortality rates
    5 confidence intervals for standardized mortality ratios
9 simulations and other demonstrations of statistical ideas
    1 tossing coins
    2 pintable simulation
    3 people on boxes - simulation of mean and variance
    4 central limit simulation
    5 sampling distribution of mean
    6 sum of squares simulation
    7 confidence interval simulation
    8 probability distributions
    9 simulations of clinical trials
4 Data input to a disk file and editing the file, main menu option 1
4.1 Starting a new data file from the keyboard
After main menu option 1, choose option 1, set up a new data file. Choose the
option to start a new file from scratch. The program asks for the full path
name for the new file. This means the directory and file name. See your DOS
manual for more information. If you specify a filename only, with no path, the
file will go into the Clinstat directory. You are strongly recommended to use
the extension .DAT for your data files.
A new file will destroy any previous file of that name, so Clinstat warns you
and asks you if you wish to continue. If in doubt, quit and use DOS to check.
Otherwise, answer Y and the program will go on.
4.1.1 Variable labels
Clinstat asks for the number of variables. This is the number of separate
pieces of information recorded for each case or record. Remember that,
although Clinstat denotes each case by the case number corresponding to the
order in which cases were entered, it is often useful to have a separate study
number which corresponds to the original paper record.
Clinstat asks for variable labels, up to 15 characters. These are not used to
refer to the variable, that is, you do not have to type them in every time you
use the variable. Clinstat uses variable number for this purpose. The
variable label is printed whenever the variable is used, so it is a check and
aide-memoire. Clinstat checks the number of characters and a warning is given
for more than 15. The program truncates the name and gives you the option to
type it in again.
After the labels are all entered, a menu allows you to list labels, change
them, add and delete them. This last facility is useful if you have made a
mistake in counting the number of variables or have missed one out. Variables
can be inserted at any point in the variable list.
4.1.2 Missing data codes
The program next asks for missing data codes. These are optional and you may
have them for some variables and not others. You can use the same missing data
code for all the variables if you like. 999 is a good missing data code, as
you are unlikely to type this by mistake. Zero or blank are not good codes as
they make it very difficult to trap keying errors in data entry.
4.1.3 Input limits
The program then asks for minimum and maximum input limits. These are
optional, the default being -1E38 and +1E38 (i.e. 1 times 10 to the power 38).
The limits need not include the missing data code. During data input, they
trap many errors arising from either hitting RETURN too soon or not at all and
from hitting keys twice.
4.1.4 Value labels
Clinstat will store value labels. These are labels of up to seven characters
attached to each possible value of a variable. For example, a variable "sex"
could be coded "1" or "2" and these could have labels "male" and "female"
respectively. These labels are retrieved by the tabulation program and printed
on cross-tabulations. They are also retrieved as group labels by the group
comparison program.
Value labels are optional. You do not need to have any at all. If you use
them, you do not need to have labels for all variables. Missing data codes are
automatically coded "missing" unless you stipulate otherwise.
The same value labels are often used by several variables. For example, the
codes "yes", "no", "dt know" might be used by many variables in a file. In
Clinstat they are only put in once, together with the list of variables which
uses them. Up to 200 value labels can be stored.
After the labels and limits have been written to disk, the program proceeds to
the data input and edit menu, as described below.
4.2 Setting up a file from an existing ASCII data matrix
You can read any rectangular data matrix, case by case, with Clinstat. The
data should be an ASCII file in free format, the numbers being separated by
spaces, commas or carriage returns. If you choose this option in the input
program, Clinstat asks for the name of the file and the numbers of cases and
variables it contains. You then put in variable labels and the optional
missing data codes and input limits as described above. Clinstat writes these
onto the .CLB file. It does not write anything on your data file. The input
program then leads to the data input and edit menu, which you would usually
want to quit.
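Since the numbers may be separated by spaces, commas or carriage returns, a
case need not occupy exactly one line, which is why the numbers of cases and
variables have to be supplied. The Python sketch below illustrates one way of
reading such a file. It is not Clinstat's code: the function name read_matrix
is hypothetical, and raising an error when the count of numbers does not match
is an assumption made for the example.

```python
import re


def read_matrix(text, n_cases, n_variables):
    """Read a rectangular data matrix, case by case, from free-format text.

    Numbers may be separated by spaces, commas or line ends.
    """
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    expected = n_cases * n_variables
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} numbers, found {len(tokens)}")
    values = [float(t) for t in tokens]
    # Cut the stream of numbers into rows of n_variables values each.
    return [
        values[i * n_variables:(i + 1) * n_variables]
        for i in range(n_cases)
    ]


text = "1, 25, 178\n2 31\n165\n"
print(read_matrix(text, n_cases=2, n_variables=3))
# prints [[1.0, 25.0, 178.0], [2.0, 31.0, 165.0]]
```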
4.3 Adding to or editing an existing data file
For an old file, the program asks the name of the file and then reads the
numbers of variables and cases, the variable labels, the missing data codes and
input limits from disk. It then proceeds to the data input and edit menu.
4.4 Putting in data at the data input and edit menu
You arrive at the main data editing menu via the new file input or old file
input and edit options. Depending on whether the file has any data, the menu
offers options to input and edit data, edit labels and write the file to disk,
or in addition to sort data, find frequency distributions, recode, create and
delete variables. To put in more data, choose option 1.
The data input section presents a menu offering options to enter more data,
list and change data entered, list labels, update the disk file, and control
printing. Option 1 enables you to add cases to the file. Data are put in one
case at a time. Clinstat prints both number and name for each variable and
asks for the value to be entered. A 'NO' or 'N' will tell the program that no
more data is to be entered and the menu will reappear. A value outside the
input limits and not equal to the missing data code will cause the computer to
beep and the program will ask for that variable again.
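The acceptance rule for a keyed value can be sketched as below. This is an
illustration in Python, not Clinstat's own routine. The function name
accept_value and the example limits for height are hypothetical, and the
limits are assumed to be inclusive.

```python
# Default input limits given in section 4.1.3.
DEFAULT_MIN = -1e38
DEFAULT_MAX = 1e38


def accept_value(value, minimum=DEFAULT_MIN, maximum=DEFAULT_MAX,
                 missing_code=None):
    """Accept a value if it is the missing data code or lies within the limits."""
    if missing_code is not None and value == missing_code:
        return True
    return minimum <= value <= maximum


# Height in cm with limits 100 to 220 and missing data code 999.
print(accept_value(175, 100, 220, 999))    # True
print(accept_value(1755, 100, 220, 999))   # False: a key was hit twice
print(accept_value(999, 100, 220, 999))    # True: the missing data code
```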
After data input, data can be listed, errors corrected, and data stored on
disk. When the data are on the disk file they are safe.
4.5 Using more than one computer for data input
If you have several computers available, in a computer classroom for example,
you can have several people putting in your data at once, onto separate disks.
The files can then be joined together.
To use more than one computer to enter data from different sets of forms, first
use one computer to set up the variable names, missing data codes and input
limits for your data file. When the second menu comes up, ready to put in
data, do not put in any data, but quit Clinstat. Copy the Clinstat label file
filename.CLB from the original disk to your other computers using a floppy
disk. Exactly how you do this depends on the type of computer, whether they
are networked, etc. You should use different file names for each set of
cases, such as SAMPLE1.CLB, SAMPLE2.CLB, SAMPLE3.CLB. You can then type in
data on each disk. Make sure each subject is only entered once.
When all the data are in, use the file utilities in Clinstat to join the files
and produce a composite file on another disk. This is option 2 on the main
Clinstat menu. Use the option for joining two data files described below.
4.6 Data file editing
4.6.1 Editing features available
Clinstat provides comprehensive data checking, data correcting and data
manipulation facilities. The Clinstat main menu described above leads to the
editing program. This will read the data matrix, missing data codes, labels
and limits from disk. The main menu provides options for listing data, listing
data with a specified value for a given variable, correcting data, finding
frequency distributions of variables, correcting variable labels, missing data
codes and input limits, recoding variables, controlling the printer, deleting
cases and variables, and creating new variables as functions of the old ones.
The corrected data file is stored on disk.
4.6.2 Checking and correcting data errors (data cleaning)
The principal use of the program is to check and correct data entered using the
input program. Typical use would be to find the frequency distribution for
each variable. This gives the frequency of every value, rather than for scale
intervals. (If you want that, see Sections 9.4, 10.4). Thus, for example, the
frequency distribution of the study number given to each subject in a study
will reveal any duplicate entries. The frequency distribution for height will
reveal any values which are suspiciously high or low, etc. When a value which
may be wrong has been identified, we can list all cases with that particular
value of the variable. We can then check back to the original documents, find
the correct code, and use the error correction facility to put in the correct
value. The program will delete a duplicate case, or put in a case which has
been omitted.
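The cleaning routine described above (a frequency distribution of every value,
followed by a listing of the cases with a suspect value) can be sketched in
Python. This is an illustration only, not Clinstat's code. The function names
and the sample data are hypothetical.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, frequency) pairs for every distinct value, in order."""
    return sorted(Counter(values).items())


def duplicated_values(values):
    """Return the values which occur more than once."""
    return [v for v, n in frequency_distribution(values) if n > 1]


def cases_with_value(cases, variable, value):
    """Return the case numbers (counting from 1) where the variable equals value."""
    return [
        number for number, case in enumerate(cases, start=1)
        if case[variable - 1] == value
    ]


# Variables in order: study number, height in cm.
cases = [[101, 175], [102, 168], [101, 175], [104, 1755]]
study_numbers = [case[0] for case in cases]
print(frequency_distribution(study_numbers))  # [(101, 2), (102, 1), (104, 1)]
print(duplicated_values(study_numbers))       # [101]
print(cases_with_value(cases, 1, 101))        # [1, 3]
```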
4.6.3 Changing labels, limits and missing data codes.
Option 2 on the input and edit menu enables you to inspect and change the
variable labels, value labels, missing data codes and input limits.
4.6.4 Recoding variables
The other use of the program is to prepare data for analysis. This is done by
recoding and by creating new variables. For example, suppose we have entered
age in years, but want to tabulate other variables by age in 10 year age
groups. We can recode age from years to decades. We want codes 0 - 9 to be
replaced by 1, codes 10 - 19 to be replaced by 2, etc. The recode option
enables us to do this. First we define the new code 1 to be all old codes 0 to
9. The program puts a 1 wherever there is a value between 0 and 9 for age. We
then define another new code, 2, to be old codes 10 to 19, and all values
between 10 and 19 for age will be replaced by 2, and so on until all the old
codes have been recoded. Note that we must not start by changing 90 - 99 into
10, 80 - 89 into 9, and so on until 0 - 9 into 1, as the last step will change
everything to 1. If in any doubt, create a new variable equal to age (see
below) and recode that. You can then compare the old and new code using the
data listing facility to check that the recoding is what you expected.
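The effect of the order of recoding can be demonstrated with a short Python
sketch. It illustrates the pitfall only and does not reproduce Clinstat's
recode program. The function names recode and recode_in_order are
hypothetical.

```python
def recode(values, low, high, new_code):
    """Replace every value in the inclusive range low to high by new_code."""
    return [new_code if low <= v <= high else v for v in values]


def recode_in_order(values, rules):
    """Apply (low, high, new_code) rules one after another."""
    for low, high, new_code in rules:
        values = recode(values, low, high, new_code)
    return values


ages = [5, 14, 37, 95]
# 0-9 becomes 1, 10-19 becomes 2, and so on up to 90-99 becomes 10.
ascending = [(10 * k, 10 * k + 9, k + 1) for k in range(10)]
descending = list(reversed(ascending))

print(recode_in_order(ages, ascending))    # [1, 2, 4, 10]
print(recode_in_order(ages, descending))   # [1, 1, 1, 1]
```

In the descending order each new code falls inside a range which is recoded
later, so the final step, 0 to 9 into 1, changes everything to 1.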
4.6.5 Creating new variables out of old
New variables are created as functions of the old. For example, we can create
a new variable "age in months" = "age in years" times 12. We can create a copy
of a variable for recoding by new variable = old variable + constant, and
setting the constant as zero. We can also add, subtract and multiply
variables. We can divide one variable by another, transform a variable by
logarithm, exponent, raise to any power (e.g. -1 gives reciprocal, 0.5 gives
square root), sine, cosine, arcsine, arctangent, integer part. These enable
many other functions to be created, e.g. tangent, arccosine etc. Clinstat will
also find any linear combination of variables: i.e. any sum of variables
multiplied by constants. In addition, we can do such things as change time in
hours and minutes to minutes (multiply by .01, take the integer part to give
the hours, subtract the hours, multiply by 100 to give minutes, multiply hours
by 60 and add). The program will create new variables until the data space is
full. After this, any more new variables must replace existing ones, but this
is unlikely to be a serious limitation. Procedures such as the hours to minutes
calculation described above use a lot of intermediate variables which can be
over-written. In any case these redundant variables should be deleted using
the option for deleting variables.
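The hours and minutes calculation can be followed step by step in a short
Python sketch. It assumes that the time was keyed in as a single number such
as 1430 for 14 hours 30 minutes, which is what the steps above imply; the
function name is invented for illustration. Each line corresponds to one
intermediate variable.

```python
import math


def hhmm_to_minutes(hhmm):
    """Convert a time keyed in as e.g. 1430 to minutes since midnight."""
    scaled = hhmm * 0.01         # multiply by .01: 1430 -> 14.30
    hours = math.floor(scaled)   # integer part gives the hours: 14
    fraction = scaled - hours    # subtract the hours: 0.30
    minutes = fraction * 100     # multiply by 100 to give minutes: 30
    total = hours * 60 + minutes # multiply hours by 60 and add
    return round(total)          # rounding removes floating point noise


print(hhmm_to_minutes(1430))  # 870
```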
The variable creation program preserves missing data codes. You define a code
for the new variable and any case with missing data for an old variable has
the new variable set to the missing data code.
4.6.6 Sorting data
Clinstat will also sort data into the order given by any of the variables. Note
that this will change all the case numbers. You should make sure that your
own, external case number is one of the variables.
4.6.7 Storing the updated file
When the data matrix is correct, it is saved on disk. It is a good idea to use
a different name to the original file name, perhaps by adding a number. This
helps make it clear what the file contains. If you use the old file name then
the old file will be destroyed. More than one copy of the file can be made by
changing the data disk or file name after writing is complete and running the
data file to disk option again.
5 File handling utilities, main menu option 2
5.1 Copy a file
File copy simply copies a file from one disk to another, and is simple and self
explanatory. On floppy based systems, once the program is running the program
disk must be removed to make way for the data disk bearing the file to be
copied. You can also do this using the DOS copy command. Copy 'filename.' to
get the labels file as well.
5.2 Join two files
Join files will join two files with the same variables in the same order, but
different cases. It is useful if you have used more than one computer to put
in your data.
5.3 Merge two files
Merge files will put together two files with the same cases in the same order,
but different variables. The files must be on the same disk. Use the file
copy utility if they are not. This program enables you to add variables to
your data set. Set up a new file with the new variables and enter the data
using the input program. Make sure the cases are in the same order. If you
have a study number for each case, the editing program can be used to sort the
data matrix into case number order. Then use this program to merge the files
together. Any duplicated variables can be deleted in the editing program or by
creating a subfile as described below.
5.4 Create a subfile
Creating a subfile is a very useful procedure. This will create a new data
file containing all or some selected variables, and all or some cases. You will
need a separate disk for the subfile. Variables are selected by entering the
required variable numbers. Cases are selected by the restriction procedure
described in 1.8 above. This means, for example, that if sex is a variable you
can produce a subfile of females only, and another of males only. Regression
analysis could then be done for sexes separately.
5.5 Retrieve a Clinstat graph saved on disk
Scatter diagrams, histograms and survival curves can be saved as disk
files. Clinstat saves the text required to create the graph rather than a
screen image. This program retrieves the graphics file and recreates the
graph. This can then be printed if you have a suitable printer (see Section
1.10).
6 Tabulation and cross-tabulation, main menu option 3
6.1 Program limitations
The program lists data and produces one way and two way tables. There is no
limit on the size of one way tables, but two way tables are limited to 20 rows
and 20 columns. Variables with more than 20 different possible values cannot
be cross-tabulated. After a two way table has been found, a set of statistical
procedures including chi-squared tests, Fisher's exact test, etc. is available.
6.2 Multi-way tables
The most powerful feature of the program is the ability to restrict the data to
certain cases. For example, we might tabulate a respiratory symptom, present
or absent, by sex. It may be that the two are related because more males than
females smoke cigarettes. We can repeat the tabulation restricting our
attention to non-smokers, and then to smokers only, to see if the relationship
still exists. In this way we can build up a three way table, using the third
variable to make the restriction. In the same way four way or five way tables
can be built up. The restriction facility can be used for data listing,
one-way and two-way tables. It is described in detail in 1.8 above.
6.3 Reading the data file
The program is entered from the Clinstat main menu option 3. A list of options
appears, data input being option 1. This produces a request for the file
name. The program reads and lists the variable labels. It may then tell you
that there is not enough room for all the variables. If there is enough room,
it will read in the data file. Otherwise, you will be forced to select
variables required for this analysis. The program asks for a list of the
variables you require and gives you the opportunity to change this before reading the
data. Even though only a few variables may be used, the variable numbers used
by this program are those on the original disk file. Thus we may choose
variables 1, 5, 7, 23, and these are the numbers the program will use, not 1,
2, 3, 4.
After the data has been read from disk the message "Setting up arrays" is
shown. The computer is finding the number of possible values for each
variable. If a variable has more than 20 values a message is shown warning
that this cannot be cross-tabulated.
When the menu returns the data entry is complete. Return to the main menu.
6.4 Data listing and restrictions on the data
If data listing is selected from the main menu, the program menu will offer
options for listing without and with restrictions. If restrictions are chosen,
Clinstat will request 'Enter restrictions'. These are entered as described in
1.8 above. First the variable number is requested, then the permissible values
for a variable (maximum of 10). The list of values is terminated by 'NO'. The
permissible values may be single values or the lower and upper limits of a
range. These are separated by 'TO'. The RETURN key must be pressed after 'TO'
and after each limit. More than 10 values will halt the procedure. This is
repeated for another variable, and so on until the variable request receives a
reply of 'NO'. The program then lists the restrictions. There is a facility
to change these if you make a mistake. Cases will be selected if they satisfy
all the restriction procedures. The listing program will then ask for all
variables or some, all cases or some, and list the data in the usual way. If
no cases meet the restrictions a message 'No such cases exist' will be printed.
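The selection rule (a case is kept only if it satisfies every restriction) can
be sketched in Python. The data layout here is an assumption made for
illustration: each case is a dictionary from variable number to value, and
each restriction lists single values or (lower, upper) pairs, with the limits
taken as inclusive.

```python
def satisfies(case, restrictions):
    """Return True if the case meets every restriction.

    restrictions maps a variable number to a list of permitted items;
    an item is a single value or a (low, high) range.
    """
    for var, permitted in restrictions.items():
        value = case[var]
        ok = any(
            item[0] <= value <= item[1] if isinstance(item, tuple)
            else value == item
            for item in permitted
        )
        if not ok:
            return False
    return True


def select_cases(cases, restrictions):
    """Keep the cases which satisfy all the restrictions."""
    return [case for case in cases if satisfies(case, restrictions)]


# Hypothetical data: variable 1 = sex (1 or 2), variable 2 = age in years.
cases = [{1: 1, 2: 34}, {1: 2, 2: 51}, {1: 2, 2: 19}, {1: 1, 2: 70}]
females_20_to_59 = {1: [2], 2: [(20, 59)]}
print(select_cases(cases, females_20_to_59))  # [{1: 2, 2: 51}]
```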
6.5 One-way tables
The one-way table option will ask for the variable to be entered. It will then
ask for restrictions as described in 1.8 and 6.4 above. It will then list the
possible values of the variable, the number of cases with each value, and the
percentage.
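The content of a one-way table can be sketched in a few lines of Python. This
shows only the calculation, not Clinstat's layout; the function name is
invented for illustration.

```python
from collections import Counter


def one_way_table(values):
    """Return (value, count, percentage) for each distinct value, in order."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], 100.0 * counts[v] / total) for v in sorted(counts)]


for value, count, percent in one_way_table([1, 2, 2, 1, 1]):
    print(value, count, f"{percent:.1f}%")
```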
6.6 Two-way tables
The two-way table option will ask for the row variable and column variable. It
will then ask for restrictions as described in 1.8 and 6.4 above. It will then
list the two-way table, giving the values of the
variables at the start of the relevant row and the top of the relevant column.
It will then ask whether you want percentages, chi-squared tests etc. 'NO'
returns to the menu. 'YES' leads to a menu of options. This is described in
detail under 'Two-way table analysis', Section 7 below.
To return to the tabulation menu, request to enter a new table.
7 Two-way table analysis, main menu options 3 and 4
7.1 Facilities and limitations
The two way table analysis options can be used for contingency tables produced
by the cross-tabulation program, main menu option 3, or for tables keyed in
directly using the keyboard entry version, main menu option 4. The program
carries out chi-squared tests for two way tables, including options for trend,
Yates' correction, partitioning chi-squared and McNemar's test. It also
carries out Fisher's exact test, gives row and column percentages, expected
frequencies and residuals, and provides extensive table editing. The limit on
table size is 20 rows and 20 columns.
7.2 Data input
Data entry is from disk file via the tabulator program, main menu option 3 (see
Section 6), or directly from the keyboard using option 1 of main menu option 4.
The main menu offers the range of options. The enter new table option gets a
new two way table from the keyboard using the main menu 4 version, or goes back
to the cross-tabulation menu if main menu option 3 is used (see section 6). The
keyboard data entry asks for the number of rows and columns, then for the data,
row by row. When the data have been entered the two-way table is displayed,
with row and column totals. The rows are numbered on the left, and the columns
across the top.
7.3 Editing the table
The editing option gives a menu offering change one cell, delete row or column,
combine rows or columns, reorder rows or columns, display current table and
reinstate original table. The change cell option enables input errors to be
corrected. The delete row and column and combine row and column option enable
the table to be manipulated to remove small expected values for valid
chi-squared tests. The reordering options are for use in trend analysis and
for partitioning chi-squared. The edited table can be displayed at any time
during editing, and the original table can be returned as it was before editing
if an error has been made.
7.4 Percentages
Row percentages displays the table with each cell as a percentage of the row
total. Column percentages gives each cell as a percentage of the column
total. Percentages are given to one decimal place.
7.5 Chi-squared tests
The chi-squared test is the usual chi-squared test for a two-way table, testing
the null hypothesis of no association (Bland, 1987, Section 13.1). The
chi-squared statistic, degrees of freedom and associated probability are given,
together with the number of cells with expected values less than 5. If there
are too many such cells, the expected values can be displayed and the edit
routine used to combine or delete rows and columns as appropriate (Bland, 1987,
Section 13.2). For two by two tables with small expected values Yates'
continuity correction is available (Bland, 1987, Section 13.6), or Fisher's
exact test (Bland, 1987, Section 13.5).
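As a sketch of the arithmetic, the usual chi-squared statistic, its degrees of
freedom and the count of small expected values can be computed as follows.
This is the standard textbook calculation written in Python, not Clinstat's
own code; the function name and the yates argument are invented for
illustration, and the probability is left out because it needs a chi-squared
distribution function.

```python
def chi_squared(table, yates=False):
    """Chi-squared test of no association for a two-way table of counts.

    Returns (statistic, degrees of freedom, number of expected values < 5).
    yates=True applies the continuity correction, meant for 2 by 2 tables.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected value = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(0.0, diff - 0.5)
            stat += diff ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    n_small = sum(1 for row in expected for e in row if e < 5)
    return stat, df, n_small


print(chi_squared([[10, 20], [30, 40]]))
print(chi_squared([[10, 20], [30, 40]], yates=True))
```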
7.6 Fisher's exact test
Fisher's exact test (Bland, 1987, Section 13.5) gives the probability of the
observed table and of each table which is more extreme. More extreme tables
are printed, with their probabilities. The total of these is the total
one-sided probability and this is doubled to give the approximate two-sided
probability. The program chooses which cells to increment and which to
decrement; the order of the rows and columns does not matter.
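The calculation can be sketched in Python for a two by two table with cells
a, b in the first row and c, d in the second. The probabilities are the usual
hypergeometric ones with all margins fixed. How Clinstat decides the direction
of "more extreme" is not stated beyond the description above, so the rule used
here (move the first cell further from its expected value) and the capping of
the doubled probability at 1 are assumptions.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Return (one-sided P, doubled two-sided P) for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability of the table whose first cell is x, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    if a * n <= r1 * c1:
        # Observed first cell at or below expectation: decrement it.
        tail = range(lowest, a + 1)
    else:
        # Observed first cell above expectation: increment it.
        tail = range(a, highest + 1)
    one_sided = sum(prob(x) for x in tail)
    return one_sided, min(1.0, 2 * one_sided)


print(fisher_exact(3, 1, 1, 3))
```

For this table the tail consists of the observed table (probability 16/70)
and the one more extreme table (probability 1/70).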
7.7 Chi-squared for trend
Chi-squared for trend (Bland, 1987, Section 13.4) carries out the usual trend
analysis. You have the option of choosing the row and column variables. If
you do not choose them, they are set to 1, 2, 3, 4, .., i.e. the
row and column categories are regarded as being equally spaced. The program
gives the chi-squared for linear trend, about linear trend and the total
chi-squared, together with their degrees of freedom and probabilities.
7.8 Tests for matched samples, McNemar's and Stuart-Maxwell test
Two tests are given: McNemar's test (Bland, 1987, Section 13.8) and the
Stuart-Maxwell test (Maxwell, 1970). McNemar's test gives the test for
equality of proportions in matched samples. It is assumed that the two
variables are in the same order, e.g. first row and column both yes, second row
and column both no. The table is printed in this format as a reminder. The
program gives the chi-squared statistic testing equality without and with a
continuity correction, the probability in each case, and the number of cells
with expected value less than 5. For testing equality of the row and column
totals in variables with more than two categories, a test suggested by Stuart
and by Maxwell is provided. The numbers of rows and columns must be equal.
For a 2 x 2 table this is the same as McNemar's test.
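For a two by two table, McNemar's statistic depends only on the two
off-diagonal cells. The following Python sketch uses the standard formulae
with and without the continuity correction; the function name is invented for
illustration and the sketch does not reproduce Clinstat's check on expected
values. The probability for one degree of freedom is obtained from the Normal
distribution via the complementary error function.

```python
from math import erfc, sqrt


def mcnemar(table):
    """McNemar's test for a 2 by 2 table of matched pairs.

    Rows and columns must be in the same order (e.g. yes, no).
    Returns (chi2, P, corrected chi2, corrected P), each on 1 d.f.
    """
    b, c = table[0][1], table[1][0]  # the discordant pairs

    def p_value(chi2):
        # Upper tail of chi-squared with 1 degree of freedom.
        return erfc(sqrt(chi2 / 2))

    plain = (b - c) ** 2 / (b + c)
    corrected = (abs(b - c) - 1) ** 2 / (b + c)
    return plain, p_value(plain), corrected, p_value(corrected)


print(mcnemar([[20, 10], [5, 15]]))
```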
7.9 Partitioning chi-squared
Partitioning chi-squared is complicated. You should refer to a book (Armitage
and Berry, 1987, Section 12.2) for details. A short menu offers options to
list partition chi-squares, to sum partition chi-squares, and to type a
comment. The list option gives the chi-squared statistic for each partition.
The partition is defined by a row and column. It refers to the two by two
table defined by that row and column element as the second row, second column
cell, the sum of the row up to but not including this element as the second
row, first column, the sum of the column up to but not including this element
as the first row, second column cell, and the sum of all the elements in both
rows before and columns above this element as the first row, first column
cell. The sum partition option asks for the partitions to be added, in terms
of the row and column of the defining element for each partition. It then
prints the chi-squared statistic for each partition, and their sum with degrees
of freedom and probability.
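The definition of the two by two table for a partition is easier to follow as
code. This Python sketch builds only the table, exactly as described above; it
does not compute the partition chi-squared itself. The function name is
invented for illustration, and rows and columns are numbered from 1 as on the
screen.

```python
def partition_table(table, row, col):
    """Two by two table for the partition defined by element (row, col).

    row and col are numbered from 1 and must both be at least 2.
    """
    i, j = row - 1, col - 1
    # First row, first column: everything above and to the left.
    a = sum(table[r][c] for r in range(i) for c in range(j))
    # First row, second column: the column above the defining element.
    b = sum(table[r][j] for r in range(i))
    # Second row, first column: the row to the left of the defining element.
    c_left = sum(table[i][c] for c in range(j))
    # Second row, second column: the defining element itself.
    d = table[i][j]
    return [[a, b], [c_left, d]]


table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(partition_table(table, 2, 2))  # [[1, 2], [4, 5]]
print(partition_table(table, 3, 3))  # [[12, 9], [15, 9]]
```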
7.10 Residuals
The display residuals option gives three possible definitions of residual about
the no-association model: observed - expected, (observed - expected) / square
root expected, and as adjusted by division by their standard deviations. Only
one will be displayed, the program returning immediately to the main menu.
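The three definitions can be sketched in Python. The first two follow directly
from the text. For the third, the manual does not give the formula, so the
sketch assumes the usual adjusted residual, in which the standard deviation of
observed - expected is the square root of expected x (1 - row total / n) x
(1 - column total / n). The function name is invented for illustration.

```python
from math import sqrt


def residuals(table):
    """Return (raw, standardised, adjusted) residual tables."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    raw, standardised, adjusted = [], [], []
    for i, row in enumerate(table):
        raw_row, std_row, adj_row = [], [], []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = observed - expected
            # Assumed variance of (observed - expected) under no association.
            var = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            raw_row.append(diff)
            std_row.append(diff / sqrt(expected))
            adj_row.append(diff / sqrt(var))
        raw.append(raw_row)
        standardised.append(std_row)
        adjusted.append(adj_row)
    return raw, standardised, adjusted


raw, standardised, adjusted = residuals([[10, 20], [30, 40]])
print(raw[0][0], standardised[0][0], adjusted[0][0])
```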
8 Calculations for discrete and binary data, main menu option 4
8.1 Combining several 2 by 2 tables by the Mantel-Haenszel method
Combining several 2 by 2 tables by the Mantel-Haenszel method (Armitage and
Berry, 1987, Section 16.2) is only available in a keyboard entry version. The
program asks for the number of tables to be entered. After the number of two
by two tables, the program asks for the first table, row by row. This is
displayed and the option is given to re-enter the table if an error has been made.
If the table is correct, the next table is requested and displayed. When all
the tables have been entered, the combined chi-squared statistic for no
association is given, with its degrees of freedom (always 1) and probability.
The Standard Normal deviate which some texts prefer is the square root of the
chi-squared statistic. The probability is the same.
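A sketch of the combined test in Python follows. It uses the standard
Mantel-Haenszel formula without a continuity correction; whether Clinstat
applies a correction is not stated here, so that choice is an assumption, as
is the function name.

```python
from math import erfc, sqrt


def mantel_haenszel(tables):
    """Combine several 2 by 2 tables [[a, b], [c, d]].

    Returns (chi-squared on 1 d.f., Standard Normal deviate, probability).
    """
    sum_a = sum_expected = sum_variance = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        r1, r2, c1, c2 = a + b, c + d, a + c, b + d
        sum_a += a
        sum_expected += r1 * c1 / n
        sum_variance += r1 * r2 * c1 * c2 / (n * n * (n - 1))
    chi2 = (sum_a - sum_expected) ** 2 / sum_variance
    deviate = sqrt(chi2)
    # The same probability serves for chi2 on 1 d.f. and for the deviate.
    prob = erfc(deviate / sqrt(2))
    return chi2, deviate, prob


print(mantel_haenszel([[[10, 20], [30, 40]], [[5, 5], [5, 5]]]))
```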
8.2 Comparison of two proportions and odds ratio
8.2.1 Facilities and limitations
This program compares two proportions using the large-sample Normal
approximation (Bland, 1987, Section 8.6, Section 9.8). Input is from the
keyboard only and printer output is provided.
8.2.2 Data input
The program first asks for the denominators of the two proportions, then the
numerators. We often want to compare several sets of proportions using the
same denominators, so after the first data input there is a facility to enter
numerators only, keeping the denominators from the previous calculation.
8.2.3 Output
The program prints the two proportions, their difference, the standard error of
the difference, a 95% confidence interval for the difference (Bland, 1987,
Section 8.6), and a Normal test of the null hypothesis of equal proportions,
with the associated two-tailed probability (Bland, 1987, Section 9.8).
A check is made on the adequacy of the approximation.
The comparison of two proportions program also calculates large sample
confidence intervals for a single proportion (Bland, 1987, Section 8.4), a
single odds, the ratio of two proportions and the odds ratio (Gardner and
Altman, 1989, Section 6).
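The main output can be sketched in Python using the usual large-sample
formulae: the confidence interval uses the separate proportions in the
standard error, and the test uses the pooled proportion. This is assumed to
match the methods in the sections of Bland cited above; the function name and
argument order are invented for illustration, and the adequacy check is not
reproduced.

```python
from math import erfc, sqrt


def compare_proportions(x1, n1, x2, n2):
    """Compare proportions x1/n1 and x2/n2 by the Normal approximation.

    Returns (difference, SE, (lower, upper) 95% limits, z, two-tailed P).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error for the confidence interval: separate proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    limits = (diff - 1.96 * se, diff + 1.96 * se)
    # Standard error for the test: common proportion under the null.
    pooled = (x1 + x2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_null
    prob = erfc(abs(z) / sqrt(2))
    return diff, se, limits, z, prob


print(compare_proportions(30, 100, 20, 100))
```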
8.3 Cohen's Kappa, main menu option 4
8.3.1 Facilities and limitations
This program takes a square two-way table from the keyboard. Cohen's Kappa,
weighted Kappa (Fleiss, 1981, Section 13.1) and the test suggested by Stuart
and Maxwell (see Maxwell, 1970) for equality of marginal totals can be
calculated. The limit on table size is 20 rows and columns.
8.3.2 Data input and editing
The program asks for the number of rows of the square two-way table. It then
asks for data row by row. When all the data has been entered the table is
printed.
Errors in data entry can be corrected by the edit table option. The only
facility provided is to change one cell of the table. The table can be printed
again at any time.
8.3.3 Cohen's Kappa
Cohen's Kappa is printed assuming that the row and column categories are in the
same order. This means that cells along the main diagonal of the table are
"agree", and other cells are "disagree". Kappa is printed with its standard
error and an approximate 95% confidence interval. A one-sided test of
significance for no agreement is also given. Note that this may give a
significant difference when the confidence interval includes zero.
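Kappa itself is simple to calculate, as this Python sketch shows. It uses the
standard definition (observed agreement minus chance agreement, divided by one
minus chance agreement); the standard error is omitted because several
formulae exist and the one Clinstat uses is not given here. The function name
is invented for illustration.

```python
def cohen_kappa(table):
    """Cohen's Kappa for a square table with categories in the same order."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion on the main diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance from the marginal totals.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


print(cohen_kappa([[20, 5], [10, 15]]))  # approximately 0.4
```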
8.3.4 Weighted Kappa
Weights are put in under option 4. The data must be put in before the weights,
as the program assumes the same number of rows for weights as for the last
two-way table entered. Errors in input can be corrected by the edit weights
option, and the weights can be printed at any time.
Weights are zero for cells with perfect agreement and increase as disagreement
becomes more important. For example, a variable with categories "good", "fair"
and "poor" might have weights 0 for "good" & "good", 1 for "good" & "fair" and
2 for "good" & "poor".
If no weights have been entered, Clinstat assumes weights at equal intervals
with agreement along the main diagonal, like this:
---------------------------
|     |      Column       |
| Row |   1     2     3   |
|-----+-------------------|
|  1  |   0     1     2   |
|  2  |   1     0     1   |
|  3  |   2     1     0   |
---------------------------
The same weights can be used for weighted Kappa on successive two-way tables,
without re-entering the weights, unless the size of the table is changed.
Weighted Kappa is printed with the same statistics as Kappa.
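A Python sketch of weighted Kappa with disagreement weights follows. It uses
the form 1 - (weighted observed disagreement) / (weighted expected
disagreement), which is assumed to be equivalent to the calculation Clinstat
performs; the function name is invented for illustration. When no weights are
supplied, the equally spaced default described above is generated.

```python
def weighted_kappa(table, weights=None):
    """Weighted Kappa for a square table using disagreement weights."""
    k = len(table)
    if weights is None:
        # Default: 0 on the diagonal, increasing by 1 per step away from it.
        weights = [[abs(i - j) for j in range(k)] for i in range(k)]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(weights[i][j] * table[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_totals[i] * col_totals[j] / n
                   for i in range(k) for j in range(k))
    return 1 - observed / expected


print(weighted_kappa([[20, 5], [10, 15]]))  # approximately 0.4
```

For a two by two table the default weights are 0 and 1 only, so weighted Kappa
reduces to ordinary Kappa.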
8.3.5 Test equality of margins, Stuart-Maxwell test
The Stuart-Maxwell test tests the null hypothesis that the marginal
distributions are equal, i.e. that the numbers put in each category by the two
judges would be the same. The result is given as a chi-squared statistic.
This is also available in the contingency table program, option 1 in the
Clinstat sub menu.
9 Regression, correlation, and paired data comparisons, main menu option 5
9.1 Facilities and limitations
This program is started by option 5 from the Clinstat main menu. It will accept
data from the keyboard or from a disk file, and will save data on disk. This
program performs calculations on a pair of variables, denoted by X and Y. It
will do simple linear regression (with options such as residual plots),
correlation and rank correlation (Spearman's and Kendall's), produce basic
statistics (mean, median, variance, etc), plot histograms and scatter diagrams
and carry out paired t tests, sign test and Wilcoxon test.
The main regression, correlation and paired data analysis menu offers data
input, listing and editing, summary statistics and plots, regression,
correlation and rank correlation, tests on paired data, and storage of data on
disk.
9.2 Data input
Data input offers 3 options: from keyboard, from disk, or from disk with
restrictions. If the data are to be put in via the keyboard, option 1,
Clinstat first asks whether you want to name your two variables. The default
option is "X" and "Y". The program will then ask for the data case by case,
first the value of X then Y. "N" or "NO" terminates the data entry.
If the data are on a Clinstat disk file, the program will ask for the file name
and read the variable labels. It will then ask which variables are required as
X and Y. If restrictions are used the procedure of 1.8 is followed. It then
reads the data from disk and returns control to the menu. Cases with missing
data for either variable are omitted.
9.3 List and edit data
Option 2 of the regression menu provides a number of listing and editing
options. The data can be listed on the screen or on the printer or log file.
Data editing enables errors to be corrected. Cases can be changed, added or
deleted. Cases are referred to by the case number.
As an aid to data checking, a scatter diagram can be plotted. This option
plots Y against X in high resolution graphics, fully labelled. Single points
are shown by a cross; multiple points are represented by a small number showing
the number of coincident observations.
The data transformations option offers log (base 10), exponent of 10 (antilog),
and raise to any power for either X or Y. If an invalid transformation has
been requested, such as log (0) or (-1) to the power .5, a message is given and
the transformation is aborted. More extensive transformations are available in
the data file editing program obtained from Clinstat main menu option 1 if you
want them.
9.4 Summary statistics and plots
Option 3 of the regression menu, summary statistics and plots, produces a
further menu, offering histograms of X, Y and X-Y, Normal plots of X, Y and
X-Y, frequency distributions and summary statistics.
Summary statistics provide mean, minimum and maximum, median, variance,
standard deviation, standard error of mean (Bland, 1987, Section 8.2), and
corrected sum of squares for X and Y (Bland, 1987, Section 4.7), and the
corrected sum of products of X and Y (Bland, 1987, Section 11.2, Section
11.8). Summary statistics are available for X-Y also, used when looking at
differences between the same quantity measured under two different conditions
and in comparisons of different methods of measurement and repeatability
studies.
Histograms (Bland, 1987, Section 4.3) are available for X, Y and X-Y. You can
choose the interval size and starting point yourself or let Clinstat do it for
you. Clinstat will also calculate and print the frequencies themselves. For
frequency distributions, too, the choice of interval can be left to the program
or done by the user. A histogram of the frequency distribution is optional.
This enables you to draw any histogram you like.
At the point where Clinstat asks whether you wish to set your own scales, it
displays a menu which also offers to switch a Normal Distribution curve (Bland,
1987, Section 7.2) option on and a mean and standard deviation option on. Both
these options can be used at the same time. The Normal curve option draws a
Normal Distribution curve of the same mean and variance as the histogram,
standardised to have the same area. This provides a rough idea of the fit of
the data to the Normal Distribution and enables comparison of this with the
shape of the Normal plot. The mean and standard deviation option marks the
position of the mean along the horizontal axis, together with the mean ± one
and two standard deviations. This illustrates the meaning of standard
deviation and is useful for showing how it behaves with symmetrical and skew
distributions (Bland, 1987, Section 4.7). The options remain selected until
you turn them off again or return to the main Clinstat menu.
Three kinds of scatter diagram are printed. A Normal plot is a plot of the
variable against the corresponding percentile of the Normal Distribution
(Bland, 1987, Section 7.5). If the data follow a Normal Distribution the
Normal plot will be a straight line. As a guide, Clinstat draws the line along
which the points are expected to lie if the data are Normally distributed.
This option is rather slow on some PCs.
The straightforward scatter diagram of Y against X is also available (Bland,
1987, Section 5.7). Clinstat will plot the difference between X and Y against
the mean of X and Y. This is useful in the analysis of paired data, where we
often want to know whether the magnitude of the difference is related to the
magnitude of the measurement itself (Bland, 1987, Section 10.2, Section 15.1,
Section 15.3). Clinstat will optionally draw a line at zero difference, useful
in gauging whether there is a tendency for differences to be in one direction.
It will also draw mean difference and mean difference ± 2 standard deviation
lines, useful in the analysis of studies comparing two methods of measurement
(Bland, 1987, Section 15.3, Bland and Altman, 1986).
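The quantities behind the difference against mean plot can be sketched in
Python. The sketch only computes the plotted points and the positions of the
three horizontal lines; it draws nothing, and the function name is invented
for illustration.

```python
from statistics import mean, stdev


def difference_plot_lines(x, y):
    """Points and reference lines for a plot of X - Y against (X + Y) / 2.

    Returns (points, mean difference, mean - 2 SD, mean + 2 SD).
    """
    diffs = [a - b for a, b in zip(x, y)]
    points = [((a + b) / 2, a - b) for a, b in zip(x, y)]
    d_bar = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return points, d_bar, d_bar - 2 * sd, d_bar + 2 * sd


points, d_bar, lower, upper = difference_plot_lines([10, 12, 14], [9, 12, 12])
print(points)
print(d_bar, lower, upper)
```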
9.5 Simple linear regression and regression through the origin
Option 4 of the regression menu performs linear regression, the fitting of a
straight line relationship between Y and X (Bland, 1987, Section 11.2). Simple
linear regression gives the least squares regression equation of Y on X. This
is given with the standard errors of slope and intercept, 95% confidence
interval for the slope, t test for the null hypothesis that slope = 0, and the
residual sum of squares (Bland, 1987, Section 11.5). A menu of options
follows, including analysis of variance table (Armitage and Berry, 1987,
Section 9.1), prediction of Y from X and X from Y (both expected value and
future observation, with standard errors) (Bland, 1987, Section 11.6), scatter
plots with regression line and prediction errors (Bland, 1987, Section 11.6),
residuals listing, plot against X, Normal plot and histogram (Bland, 1987,
Section 11.7). The histogram has the Normal Distribution curve and mean and
standard deviation options.
The regression through the origin option gives the same output as for simple
linear regression, except that the intercept is constrained to be zero
(Armitage and Berry, 1987, Section 9.3). Note that this will produce silly
answers if the model is inappropriate, which it usually is.
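The two fits differ only in how the slope is calculated, as this Python sketch
of the standard least squares formulae shows. Standard errors and tests are
omitted, and the function names are invented for illustration.

```python
def simple_regression(x, y):
    """Least squares line of Y on X. Returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Corrected sum of squares of X and corrected sum of products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_through_origin(x, y):
    """Least squares slope with the intercept constrained to be zero."""
    # Uncorrected sums are used because the line is forced through (0, 0).
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


x, y = [1, 2, 3], [2, 4, 7]
print(simple_regression(x, y))
print(regression_through_origin(x, y))
```

Comparing the two slopes for the same data gives a quick indication of how
much the constraint changes the fitted line.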
Switch X and Y enables the variables to be exchanged so that regression of X on
Y can be done. Labels are switched if they are used, otherwise the old Y is
labelled X and vice versa. Listing the data should remove any confusion. This
option is also useful if the data have been written in order Y, X on the
original recording sheet, as the data can be entered in this way.
9.6 Correlation and rank correlation
The correlation option, option 5 of the regression menu, offers correlation and
rank correlation coefficients. The first option in the correlation section
gives the product moment correlation coefficient, r, its degrees of freedom,
95% confidence interval and the two sided probability of r under the null
hypothesis of zero correlation (Bland, 1987, Section 11.10, Section 11.11).
Fisher's z transformation and its standard error are also given (Gardner and
Altman, 1989, Section 5). This is a Normally distributed transformation of r,
provided both variables themselves follow Normal distributions. It is the basis
of the confidence interval and could be used to compare correlation
coefficients in different samples.
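The transformation and the confidence interval based on it can be sketched in
Python using the standard formulae: z is half the log of (1 + r) / (1 - r),
its standard error is 1 / square root of (n - 3), and the limits for z are
transformed back to the correlation scale. The function name is invented for
illustration.

```python
from math import atanh, sqrt, tanh


def fisher_z(r, n):
    """Fisher's z, its standard error and a 95% interval for r."""
    z = atanh(r)  # the same as 0.5 * log((1 + r) / (1 - r))
    se = 1 / sqrt(n - 3)
    lower = tanh(z - 1.96 * se)  # back-transform the limits for z
    upper = tanh(z + 1.96 * se)
    return z, se, (lower, upper)


print(fisher_z(0.5, 28))
```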
Two rank correlation coefficients are available, Spearman's rho and Kendall's
tau b (Bland, 1987, Section 12.4, Section 12.5). Only significance tests of
the null hypothesis are given; no confidence intervals are available.
9.7 Paired comparisons, paired t method, sign test, Wilcoxon paired test
Option 6 offers three ways of looking at differences between X and Y, the
paired t confidence interval and test, the sign test or Binomial test and the
Wilcoxon signed rank test. The sign test and Wilcoxon tests provide tests of
the non-parametric hypotheses corresponding to that of the paired t test.
9.7.1 Paired t method
The t option gives the mean difference, its standard error, a 95% confidence
interval, and a test of the null hypothesis that the mean difference is zero
(Bland, 1987, Section 10.2).
9.7.2 Sign test
The sign test tests the null hypothesis that differences are equally likely to
be positive as negative. The sign test gives the exact probability of the data
under the null hypothesis, using the Binomial Distribution (Bland, 1987,
Section 9.2).
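The Binomial calculation can be sketched in Python. Two details are
assumptions, since they are not stated above: zero differences are dropped,
and the probability reported is two-sided (twice the smaller tail, capped at
1). The function name is invented for illustration.

```python
from math import comb


def sign_test(x, y):
    """Exact sign test on the differences X - Y.

    Returns (number positive, number negative, two-sided probability).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    positive = sum(1 for d in diffs if d > 0)
    negative = n - positive
    k = min(positive, negative)
    # Binomial(n, 1/2) probability of k or fewer of the rarer sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positive, negative, min(1.0, 2 * tail)


x = [5, 6, 7, 8, 9, 10, 11, 12, 13, 1]
y = [4, 5, 6, 7, 8, 9, 10, 11, 12, 2]
print(sign_test(x, y))  # 9 positive, 1 negative
```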
9.7.3 Wilcoxon matched-pairs signed rank test
For samples of size less than or equal to 25, an exact probability table is
used in the Wilcoxon test, giving "P>0.05", "P<0.05", or "P<0.01". For larger
samples a Normal approximation is used (Bland, 1987, Section 12.3).
9.8 Storing data on disk
Saving data on disk enables you to keep the data which has been keyed into this
program. On twin floppy drive systems, a data disk is needed in Drive B:. On
one drive systems it must be exchanged for the program disk. The labels can
be changed at this point if you wish. The file created can be read by the
general data editing program as if it had been entered under the general input
program from Clinstat main menu option 1.
10 Independent sample comparisons, main menu option 5
10.1 Facilities and limitations
This program is started from the Clinstat main menu option 5 and then choosing
sub menu option 1. It will accept data from the keyboard or from disk and save
data on disk. Functions include basic statistics and plots, t and F tests, and
rank tests. Up to 20 groups can be compared.
This program carries out several comparison procedures for a continuous
variable between two or more independent groups. The program displays a menu
offering data input, data editing, summary statistics and plots, t and F
methods, and rank tests and storage of data on a disk file.
Two other programs in the sub menu from main menu option 5 enable similar
comparisons to be made using means already calculated. F and t tests between
two groups can be done given data in the form of mean, standard deviation and
number for each group. Multiple comparisons between more than two means can be
done from the same information, or from means and the residual sum of squares
from an analysis of variance.
10.2 Data input
Option 1, data input, leads to the data input menu, which offers keyboard
input, data from disk, data from disk with restrictions.
If the data are entered via the keyboard, the program asks how many groups
there are. It then asks whether you want to label groups and variables, the
defaults being "Group 1", "Group 2", etc. and "X" for the variable. Data are
then entered for each group, case by case, the group being terminated by "NO"
or "N".
If input is from disk, Clinstat asks for the file name and reads the file size
and labels, which are displayed. Two variables are requested: the variable to
be analysed and the group variable. For example, if we wish to compare the
blood pressures of men and women in a sample, blood pressure would be the
variable to be analysed and sex would be the grouping variable. If
restrictions are used, the procedure of 1.8 is followed. Cases with missing
data for either group or analysis variable are omitted. The program reads the
data and sorts it into groups, each group being defined by one value of the
grouping variable. It then prints the number of groups and the number in each
group with the defining value of the grouping variable. If there are value
labels for the group variable on the label file, then these will become the
group labels. Otherwise, the next option is to label the groups, e.g. as
"male" and "female" in the above example. The default is "Group 1", "Group 2",
etc. The same labels can be retained for the next analysis.
10.3 Listing and editing
Data can be listed on the screen or on the printer. Listing is group by
group. If the data were read from disk, the grouping procedure may have
changed the order.
Editing offers options to change a case, delete a case, or add a case, with the
option to list the data at any point. Each correction option asks for the
group number and for change or deletion the case number within the group. If
you can't remember the case number, 0 will return you to the edit menu and you
can list the data to check. Deleting all the cases in a group deletes the
group. Groups can be combined.
Three standard transformations are available: log to base 10, exponent of 10
(antilog) and raise to a power. The last allows such transformations as square
root (power = 0.5) and reciprocal (power = -1). If an invalid procedure is
met, e.g. log (0) or -1 to power 0.5, a message is printed and the
transformation aborted.
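The behaviour of the three transformations can be sketched in Python. This is an illustration only, not Clinstat's own code; the function name transform and its arguments are hypothetical. It assumes that a single invalid value aborts the whole transformation, as described above.

```python
import math

def transform(values, kind, power=None):
    """Return the transformed values, or None if any value is invalid.

    kind is "log10", "antilog" or "power" (hypothetical names).
    """
    if kind not in ("log10", "antilog", "power"):
        raise ValueError("unknown transformation: " + kind)
    out = []
    for x in values:
        try:
            if kind == "log10":
                y = math.log10(x)          # fails for x <= 0
            elif kind == "antilog":
                y = 10.0 ** x
            else:
                y = x ** power             # e.g. 0.5 = square root, -1 = reciprocal
                if isinstance(y, complex):  # negative number to a fractional power
                    raise ValueError("result is not a real number")
        except (ValueError, ZeroDivisionError, OverflowError):
            return None                    # abort: the data are left untransformed
        out.append(y)
    return out
```

For example, transform([0], "log10") and transform([-1], "power", 0.5) both return None, corresponding to the cases where Clinstat prints a message and aborts.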
10.4 Summary statistics and plots
For each group the number, mean, median, minimum, maximum, variance, standard
deviation, standard error of the mean and sum of squares are printed.
The plots available are a scatter plot of groups, a histogram or normal plot
for one group and a histogram or normal plot of within-group residuals.
Within-group residuals are the difference between the observation and the group
mean. Thus they enable us to assess assumptions of Normal Distribution etc.
without these being obscured by difference between groups.
At the point where Clinstat asks whether you wish to set your own scales, it
displays a menu which also offers to switch a Normal Distribution curve option
on and a mean and standard deviation option on. Both these options can be used
at the same time. The Normal curve option draws a Normal Distribution curve of
the same mean and variance as the histogram, standardised to have the same
area. This provides a rough idea of the fit of the Normal and enables
comparison of this with the shape of the Normal plot (Bland, 1987, Section
7.5). The mean and standard deviation option marks the position of the mean
along the horizontal axis, together with the mean ± one and two standard
deviations. This illustrates the meaning of standard deviation and is useful
for showing how it behaves with symmetrical and skew distributions. The
options remain selected until you turn them off again or return to the main
Clinstat menu.
The scatter plot option gives a "dot plot" for each group. Coincident points
are separated and shown side by side. This plot can be used to assess the
assumption of uniform variance. Unless there are many groups, the group labels
are printed on the plot.
The histogram intervals can be set by you or calculated by Clinstat. The
frequency distribution option also enables you to set your own interval or
leave it to Clinstat. You can plot the histogram if you wish.
A Normal plot is a plot of the variable against the corresponding percentile of
the Normal Distribution (Bland, 1987, Section 7.5). If the data follow a
Normal Distribution the Normal plot will be a straight line. As a guide,
Clinstat draws the line along which the points are expected to lie if the data
are Normally distributed. This option is rather slow on some PCs.
10.5 Normal Distribution methods
Option 4 from the independent comparisons menu offers several procedures which
require that the observations be Normally distributed: t and F methods and
Bartlett's test for uniformity of variances.
10.5.1 Two-sample t and F tests
The t test option is for comparing two groups. If there are more than two
groups, the program asks which are to be compared; otherwise, it compares
groups 1 and 2 immediately. Three comparisons are made. Firstly, the means are
compared using the two sample t test assuming uniform variance within the
groups. The program prints the mean difference, its standard error, a 95%
confidence interval for the difference, a t test of the null hypothesis of
difference = 0, with degrees of freedom and two tailed probability, and the
within groups variance (Bland, 1987, Section 10.3). Secondly, the variances
are compared using an F test
(Armitage and Berry, 1987, Section 7.7). Again the degrees of freedom and
probability are given. Thirdly, the means are compared using an approximate t
test, without assuming uniform variance, using the Satterthwaite approximation
which reduces the degrees of freedom. This gives the same statistics as the
uniform variance test. This is a less powerful test when the variances are
uniform.
Both comparisons of means, from individual observations and from means and
standard deviations, include the large sample Normal confidence interval and
test of significance. The effect of the large sample assumption can thus be
explored.
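The two t comparisons described above can be sketched in Python as follows. This is a minimal illustration using the textbook formulae, not Clinstat's own code; the function names are hypothetical. Only the difference, standard error, t statistic and degrees of freedom are computed; probabilities and confidence intervals would need the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Two sample t test assuming uniform variance within the groups."""
    n1, n2 = len(x), len(y)
    # within groups (pooled) variance
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = sqrt(s2 * (1 / n1 + 1 / n2))
    diff = mean(x) - mean(y)
    return diff, se, diff / se, n1 + n2 - 2

def satterthwaite_t(x, y):
    """Approximate t test without assuming uniform variance."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    se = sqrt(a + b)
    # Satterthwaite approximation: reduced, usually fractional, degrees of freedom
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    diff = mean(x) - mean(y)
    return diff, se, diff / se, df
```

With groups of equal size the two standard errors coincide and only the degrees of freedom differ, which shows directly how the approximation loses power when the variances really are uniform.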
10.5.2 One-way analysis of variance
This carries out the usual one-way analysis of variance for unequal sized
groups (Armitage and Berry, 1987, Section 7.1). The ANOVA table is printed.
All the groups are automatically included. After the analysis of variance a
number of procedures are available. The means for each group may be presented,
with standard errors and 95% confidence intervals based on the residual
variance. A linear contrast may be tested, either using the usual F test for
pre-defined orthogonal contrasts or Scheffé's test for a contrast picked out as
large (Armitage and Berry, 1987, Section 7.4). The contrast may be defined by
giving coefficients or by giving two sets of groups to be contrasted. The
value of the contrast, its standard error, and 95% confidence interval are
given, together with the variance ratio and the probabilities for the two
tests.
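The quantities in the ANOVA table can be sketched in Python as follows. This is an illustration of the standard calculation for groups of unequal size, not Clinstat's own code; the function name one_way_anova is hypothetical, and the probability for the variance ratio is not computed.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (SS between, df between, SS within, df within, variance ratio F)."""
    all_obs = [v for g in groups for v in g]
    grand = mean(all_obs)
    # between groups sum of squares, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # within groups (residual) sum of squares
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f
```

The residual mean square, ss_within / df_within, is the residual variance on which the standard errors and confidence intervals for the group means are based.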
Several multiple comparison procedures are offered. Straightforward group
differences, standard errors and 95% (Student) confidence intervals are given.
Two least significant difference methods, Student's and Fisher's, are available
for any chosen significance level. Two Studentized range tests, Tukey's and
the Newman-Keuls test (Armitage and Berry, 1987, Section 7.4), are available
for probabilities 0.05 and 0.01 only. These are only for equal sized groups.
For unequal groups Gabriel's test (Kendall and Stuart, 1968, Section 35.54) is
given. This is a multiple comparison procedure which can be used with groups
of any size. It says that two groups are significantly different if every
subset of groups containing this pair has a sum of squares large enough to give
an overall significant difference between the groups if groups outside the
subset made no contribution. The test is not widely known because the
calculations required are very time consuming, but it is a useful technique. On
some PCs this is slow if there are many groups.
10.5.3 Bartlett's test
Bartlett's test tests the homogeneity of variance in Normal populations
(Armitage and Berry, 1987, Section 4.6). The result is printed as a
chi-squared statistic, with degrees of freedom and associated probability. All
the groups are automatically included.
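The usual form of Bartlett's statistic can be sketched in Python as follows. This is an illustration of the textbook formula, including the standard correction factor; whether Clinstat applies exactly this form is an assumption. The function name bartlett is hypothetical.

```python
from math import log
from statistics import variance

def bartlett(groups):
    """Return (chi-squared statistic, degrees of freedom)."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [variance(g) for g in groups]
    df_total = sum(dfs)
    # pooled within groups variance
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    m = df_total * log(pooled) - sum(d * log(v) for d, v in zip(dfs, variances))
    # correction factor, which improves the chi-squared approximation
    c = 1 + (sum(1 / d for d in dfs) - 1 / df_total) / (3 * (k - 1))
    return m / c, k - 1
```

The statistic is zero when all the group variances are equal and grows as they diverge.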
10.5.4 Confidence intervals for means
The program calculates confidence intervals for each group mean, using that
group's own standard deviation. If you want to calculate a confidence interval
using the combined standard deviation of all the groups, you should use one way
analysis of variance (Section 10.5.2) and then choose option one from its sub-menu.
The confidence interval uses the t method, so data are assumed to be from a
Normal distribution. For larger samples (say > 30) this is equivalent to the
Normal method, 1.96 standard errors on either side of the mean, and for samples
greater than 100 the Normal distribution assumption can be ignored (Bland,
1987, Chapter 10).
10.6 Methods based on ranks
Three rank tests are available, the Mann-Whitney U test (an exact equivalent of
the Wilcoxon two sample test), the Kruskal-Wallis one-way analysis of variance
by ranks and the Kolmogorov-Smirnov two sample test.
10.6.1 Mann-Whitney U test
The Mann-Whitney U test (Bland, 1987, Section 12.2) compares any two groups.
If there are only two groups the program proceeds directly to the analysis; if
there are more than two groups it asks which pair is to be compared. The
program prints the value of U and the two group sizes. If both groups have
size less than 20, an exact probability table is used, which results in the
probability being recorded as ">0.05", "<0.05" or "<0.01". For larger groups,
the Normal approximation is used and the two-tailed probability for this is
printed.
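The calculation of U and its Normal approximation can be sketched in Python as follows. This is an illustration only, not Clinstat's own code; the function name is hypothetical. Ties between groups are counted as one half, and no tie correction is applied to the variance, which is an assumption. The exact tables used for small groups are not reproduced.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Return (U, z, two-tailed probability from the Normal approximation)."""
    n1, n2 = len(x), len(y)
    # count the pairs in which the x observation exceeds the y observation
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u1, n1 * n2 - u1)          # the smaller of the two possible U values
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * NormalDist().cdf(z)        # z <= 0 because u is the smaller value
    return u, z, p
```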
10.6.2 Kruskal-Wallis test
The Kruskal-Wallis one-way analysis of variance by ranks (Conover, 1980,
Section 5.2) compares all groups automatically. If there are three groups each
with size less than or equal to five, an exact probability table is used, which
results in the probability being recorded as ">0.05", "<0.05", or "<0.01".
Otherwise, a chi-squared approximation is used, and the probability associated
with this is printed.
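The Kruskal-Wallis statistic can be sketched in Python as follows. This is an illustration of the standard formula, not Clinstat's own code; the function names are hypothetical. Tied observations receive the average of their ranks, and no further correction for ties is made, which is an assumption.

```python
def midranks(values):
    """Rank the values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom for the chi-squared approximation)."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1), len(groups) - 1
```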
10.6.3 Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test (Conover, 1980, Section 6.3) compares any two
groups, chosen as for the Mann-Whitney test. The program prints the value of D
and the probability as ">0.05", "<0.05" or "<0.01". If both groups have size
less than 25 an exact table is used. Otherwise an approximation is used.
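The statistic D is the largest absolute difference between the two cumulative relative frequency distributions. A minimal Python sketch follows; it is an illustration, not Clinstat's own code, and the function name is hypothetical. The probability tables are not reproduced.

```python
def ks_two_sample_d(x, y):
    """Largest absolute difference between the two empirical distributions."""
    d = 0.0
    for t in sorted(set(x) | set(y)):
        f1 = sum(v <= t for v in x) / len(x)   # proportion of x at or below t
        f2 = sum(v <= t for v in y) / len(y)   # proportion of y at or below t
        d = max(d, abs(f1 - f2))
    return d
```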
10.7 Save data on disk
This option asks for the file name. The variable labels may be changed if you
wish. The data are then stored in the usual Clinstat format. Group names are
stored as value labels.
11 Survival analysis, main menu option 7
11.1 Facilities and limitations
This program is started from the Clinstat main menu option 7. It will accept
censored survival data from the keyboard or from disk and save data on disk.
Functions include survival probabilities, survival curves, and logrank tests.
Up to 20 groups can be held, but only two compared in any one operation. Data
can be stored on a Clinstat disk file.
The program uses the terminology "death" and "withdrawal" to describe the
possible end points of the survival time, but of course survival analysis has
many other applications. "Death" denotes a definite outcome, "withdrawal" a
censored observation. In the study of time to conception, for example, the
definite outcome or "death" would be the start of a life.
11.2 Data input
Option 1, data input, leads to the data input menu, which offers keyboard
input, data from disk, data from disk with restrictions.
If the data are entered via the keyboard, the program asks how many groups
there are. It then asks whether you want to label groups and variables, the
defaults being "Group 1", "Group 2", etc., "time" for the survival time
variable and "Outcome" for the variable which records the type of outcome, e.g.
withdrawal or death. Data are then entered for each group, case by case, the
group being terminated by "NO" or "N". For each case the time is entered, then
"D" or "W" to denote death or withdrawal.
If input is from disk, Clinstat asks for the file name and reads the file size
and labels, which are displayed. Three variables are requested: the time
variable, the outcome variable and the group variable. Clinstat also asks for
the codes in the outcome variable which represent death or withdrawal. For
example, these might be 1 for a death and 2 for a withdrawal or censoring. If
restrictions are used, the procedure of 1.8 is followed. The program reads the
data and sorts them into groups, each group being defined by one value of the
grouping variable. Cases with missing data for time, outcome or group variable
are omitted. It then prints the number of groups and the number in each group
with the defining value of the grouping variable. If there are value labels
for the group variable on the data file, these become the group labels.
Otherwise, the next option is to label the groups, e.g. as "Treated" and
"Control". The default is "Group 1", "Group 2", etc. The same labels can be
retained for the next analysis.
11.3 Listing and editing
Data can be listed on the screen or on the printer. Listing is group by
group. If the data were read from disk, the grouping procedure may have
changed the order.
Editing offers options to change a case, delete a case, or add a case, with the
option to list the data at any point. Each correction option asks for the
group number and for change or deletion the case number within the group. If
you can't remember the case number, 0 will return you to the edit menu and you
can list the data to check. Deleting all the cases in a group deletes the
group. Groups can be combined.
11.4 Survival curves
Survival curves can be plotted for a single group or for two groups. For two
groups, the second survival curve is shown as a broken line and in a different
colour on colour monitors. Censored observations are indicated by short
vertical lines at the censoring point (Bland, 1987, Section 15.6).
11.5 Logrank test and standard errors
Option 4 from the survival analysis menu offers the logrank test (Armitage and
Berry, 1987, Section 14.6), survival probabilities (Armitage and Berry, 1987,
Section 14.5), and a standard error and confidence interval for a survival rate
(Greenwood's method, Armitage and Berry, 1987, Section 14.4).
The logrank test is a nonparametric test of the significance of the difference
in survival between two groups. It can be done between any two groups. If there
are more than two groups, Clinstat asks which groups you wish to compare. The
number of deaths observed and expected under the null hypothesis are printed
for each group, with the chi-squared test statistic and probability.
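The observed and expected deaths can be sketched in Python as follows. This is an illustration only, not Clinstat's own code. It assumes each case is held as a (time, died) pair, with died False for a withdrawal, and it uses the simple form of the chi-squared statistic, the sum of (O - E) squared over E with one degree of freedom. Whether Clinstat uses this form or the variance-based form is not stated in this manual. The function name logrank is hypothetical.

```python
def logrank(group1, group2):
    """Return (O1, E1, O2, E2, chi-squared) for two lists of (time, died) pairs."""
    both = group1 + group2
    o1 = sum(1 for _, died in group1 if died)
    o2 = sum(1 for _, died in group2 if died)
    e1 = 0.0
    for t in sorted({time for time, died in both if died}):
        n1 = sum(1 for time, _ in group1 if time >= t)   # at risk in group 1
        n2 = sum(1 for time, _ in group2 if time >= t)   # at risk in group 2
        d = sum(1 for time, died in both if died and time == t)
        # deaths at time t are shared out in proportion to the numbers at risk
        e1 += d * n1 / (n1 + n2)
    e2 = (o1 + o2) - e1
    chi2 = (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return o1, e1, o2, e2, chi2
```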
The survival probabilities by the Kaplan-Meier method for censored data are
printed for any group, with the details of the calculation.
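The Kaplan-Meier calculation can be sketched in Python as follows. This is an illustration only, not Clinstat's own code. Cases are assumed to be (time, died) pairs, with died False for a withdrawal, and a withdrawal at the same time as a death is assumed to be still at risk for that death. The function name kaplan_meier is hypothetical.

```python
def kaplan_meier(cases):
    """Return a list of (time, number at risk, deaths, survival probability)."""
    survival = 1.0
    table = []
    for t in sorted({time for time, died in cases if died}):
        at_risk = sum(1 for time, _ in cases if time >= t)
        deaths = sum(1 for time, died in cases if died and time == t)
        # multiply by the proportion surviving this death time
        survival *= (at_risk - deaths) / at_risk
        table.append((t, at_risk, deaths, survival))
    return table
```

For the four cases (1, death), (2, withdrawal), (3, death), (4, death) the survival probabilities are 0.75, 0.375 and 0. The withdrawal at time 2 reduces the number at risk at time 3 from 3 to 2 without itself changing the estimate.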
11.6 Save data on disk
This option asks for the file name. Three variables are stored: the survival
time, the group and the outcome. Outcome is either 1 for a death or 2 for a
withdrawal. The variable labels may be changed if you wish. The data are then
stored in the usual Clinstat format. Group names are stored as value labels.
12 Random numbers, sampling and allocation, main menu option 8
12.1 Facilities and limitations
This program will print random numbers in sets of 100, 1000 (printer only) or a
random permutation of the numbers 1 to N for any N up to 1000. It will carry out
three types of random allocation for up to 1000 subjects: unconstrained, in
equal groups, or in equal subsets within equal groups. It will carry out two
types of random sampling: for fixed sampling probability or fraction (any
number) and fixed sample size (population up to 1000).
12.2 Random digits
The program will print 100 random digits in a 10 x 10 matrix with row and
column numbers. It will also print a page of random digits. This option is
only available on the printer and prints 1000 random digits in groups of four.
The program will also print a random permutation of the numbers from 1 to any
given number up to 1000.
12.3 Random allocation
Random allocation (Bland, 1987, Section 2.2) can be unconstrained. This gives
random allocation into any number of groups, which are labelled 1, 2, 3,...,
etc., for a given total number of subjects. The group sizes may not be equal.
Equal groups random allocation may be selected. This allocates up to 1000
subjects into any number of groups. The groups will be of equal size, so the
number of groups must divide the number of subjects. Finally, subjects may be
allocated to groups in blocks, so that each subset of patients contains equal
numbers in each group. If this is used, then whenever a trial is stopped there
will be approximately equal numbers in the groups (Pocock, 1983, Section 5).
This option allocates up to 1000 subjects into any number of groups within
subsets. Thus if there are two groups and subsets of 10 subjects, the first 10
subjects will have 5 allocated to group 1 and 5 to group 2. The next 10
subjects will also have this and so on. The number of groups must divide the
number in a subset, which must divide the total number to be allocated.
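Allocation in blocks can be sketched in Python as follows. This is an illustration only, not Clinstat's own code or random number generator; the function name and the seed argument are hypothetical. Each subset holds equal numbers of each group label in random order.

```python
import random

def block_allocation(n_subjects, n_groups, subset_size, seed=None):
    """Allocate subjects to groups 1..n_groups, balanced within each subset."""
    if subset_size % n_groups or n_subjects % subset_size:
        raise ValueError("groups must divide the subset, which must divide the total")
    rng = random.Random(seed)
    per_group = subset_size // n_groups
    allocation = []
    for _ in range(n_subjects // subset_size):
        # one subset: equal numbers of each group label, then shuffled
        subset = [g for g in range(1, n_groups + 1) for _rep in range(per_group)]
        rng.shuffle(subset)
        allocation.extend(subset)
    return allocation
```

With two groups and subsets of 10, block_allocation(20, 2, 10) returns 20 group numbers in which each successive run of 10 contains five 1s and five 2s. Setting subset_size equal to n_subjects gives the equal groups allocation.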
12.4 Simple random sampling
There are two simple random sample options. The first chooses a simple random
sample (Bland, 1987, Section 3.4) for a given sampling probability. This
chooses a sample from a population of given size, each subject being chosen
independently with the given probability. The sample size obtained
is printed at the end. Alternatively, a simple random sample of fixed size can
be found. This option gives a random sample of required size for a population
of up to 1000 subjects.
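The two sampling options can be sketched in Python as follows. This is an illustration only, not Clinstat's own code; the function names are hypothetical, and subjects are assumed to be numbered from 1 to the population size.

```python
import random

def sample_by_probability(population_size, probability, rng):
    """Choose each subject independently; the sample size is itself random."""
    return [i for i in range(1, population_size + 1) if rng.random() < probability]

def sample_fixed_size(population_size, sample_size, rng):
    """Choose exactly sample_size distinct subjects, all subsets equally likely."""
    return sorted(rng.sample(range(1, population_size + 1), sample_size))
```

For example, with rng = random.Random(), sample_by_probability(1000, 0.1, rng) gives a sample of about 100 subjects whose exact size varies from run to run, whereas sample_fixed_size(1000, 100, rng) always gives exactly 100.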
13 Determination of sample size using power calculations
13.1 Facilities and limitations
This program enables investigation of sample size options for the comparison of
two means, the comparison of two proportions and the detection of a correlation
between two variables. The method is based on large sample significance tests
and the power against the alternative hypothesis (Bland, 1987, Section 9.9,
Section 9.10, Armitage and Berry, 1987, Section 6.6).
All sample size calculations are approximate, and where small samples are
indicated these will be inaccurate as large sample formulae are used. There is
no allowance for the effect of degrees of freedom in the comparison of means,
for example.
The program's first menu gives a choice between comparison of means, comparison
of proportions, or detection of correlation. The second menu depends on this
choice. As appropriate, the significance level, ratio of group one sample size
to group two sample size, power and standard deviation can be changed. Sample
size can then be estimated given population difference or correlation (the
alternative hypothesis), or difference or correlation detectable with a given
sample size.
Three plots are also available: power against sample size, power against
difference or correlation, and sample size against difference or correlation.
These graphs can be printed and saved on disk for retrieval.
For comparisons of means and proportions, the sample size used is the total
sample size, the size of the two groups combined. The ratio of one group size
to the other is initially set at 1:1, but can be changed to allow consideration
of other schemes.
13.2 Sample size for the comparison of two means
Power, significance level, sample size ratio, and standard deviation are set
separately from the menu. Given these, the sample size required to detect a
difference or the difference detectable with a particular sample size can be
found.
When comparing two means, the standard deviation of the observations is
important. Clinstat sets this to 1.0. If you know what it should be, from a
pilot study or previous publications, you can put it in at the next menu.
Otherwise, you can interpret the difference between means in terms of a number
of standard deviations.
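The large sample calculation for two means can be sketched in Python as follows. This is an illustration of the standard formula, not Clinstat's own code. The function name is hypothetical, and the defaults shown (two-sided significance level 0.05, power 0.90) are assumptions rather than Clinstat's documented defaults. The result is the total sample size for both groups combined.

```python
from statistics import NormalDist

def total_n_two_means(difference, sd=1.0, alpha=0.05, power=0.90, ratio=1.0):
    """Total sample size needed to detect a difference between two means.

    ratio is the group one sample size divided by the group two sample size.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # the variance of the difference is sd^2 (1/n1 + 1/n2), with n1 = ratio * n2
    return z ** 2 * sd ** 2 * (1 + ratio) ** 2 / (ratio * difference ** 2)
```

With the standard deviation left at 1.0, total_n_two_means(1.0) gives about 42, so roughly 21 subjects per group are needed to detect a difference of one standard deviation. Unequal groups always need a larger total: total_n_two_means(1.0, ratio=2.0) gives about 47.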
13.3 Sample size for the comparison of two proportions
The standard error of the difference between two proportions depends on the
magnitude of the proportions as well as their difference, so for all options
comparing proportions at least one of the population proportions must be
specified. You need to know roughly what the proportions involved are going to
be.
Power, significance level, and sample size ratio are set separately from the
menu. Given these, the sample size required to detect a difference can be
found. Because the standard error of a proportion depends on the proportion
itself, both proportions have to be entered rather than the difference between
them. The difference from a given proportion detectable with a particular
sample size can also be found.
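The corresponding calculation for two proportions can be sketched in Python as follows. This is an illustration of one common large sample formula, not Clinstat's own code; the function name and the default significance level and power are assumptions. Both proportions enter the formula, not just their difference.

```python
from statistics import NormalDist

def total_n_two_proportions(p1, p2, alpha=0.05, power=0.90, ratio=1.0):
    """Total sample size needed to detect a difference between p1 and p2.

    ratio is the group one sample size divided by the group two sample size.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # the variance of the difference is p1(1-p1)/n1 + p2(1-p2)/n2, n1 = ratio * n2
    n2 = z ** 2 * (p1 * (1 - p1) / ratio + p2 * (1 - p2)) / (p1 - p2) ** 2
    return n2 * (1 + ratio)
```

For example, total_n_two_proportions(0.2, 0.4) gives about 210, whereas the same difference of 0.2 between proportions 0.4 and 0.6 needs about 252, because the standard errors are larger near 0.5.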
13.4 Sample size for the detection of a correlation
Power and significance level are set separately from the menu. Given these,
the sample size required to detect a correlation or the correlation detectable
with a particular sample size can be found.
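One common large sample approach to the correlation case uses Fisher's z transformation, and can be sketched in Python as follows. This manual does not state which formula Clinstat uses, so this is an assumption; the function name and default values are hypothetical.

```python
from math import atanh
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.90):
    """Sample size needed to detect a population correlation r."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Fisher's z = atanh(r) has standard error approximately 1/sqrt(n - 3)
    return (z / atanh(r)) ** 2 + 3
```

For example, n_for_correlation(0.5) gives about 38. The result is the same for r and -r, since positive and negative correlations are detected with equal power.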
13.5 Sample size for the mean difference or comparison of two means in paired
or matched samples
When comparing two means in paired samples or the same sample on two occasions,
the standard deviation of the differences between pairs of observations on the
same subject or matched pair is very important. Clinstat presents results
based on this standard deviation. You can either start with this or with the
standard deviation of the observations between subjects (the usual standard
deviation) and the correlation coefficient between the first and second
measurements, from which the standard deviation of differences can be
calculated. Clinstat sets the standard deviation of differences to 1.0. If
you know what any of these should be, from a pilot study or from previous
publications, you can put it in at the next menu. If you put in the standard
deviation of the observations between subjects and the correlation coefficient
between paired measurements, Clinstat calculates the standard deviation of
differences from them. Otherwise, you can interpret the difference between
means in terms of a number of standard deviations of differences, as for two
independent means (Section 13.2). This may not be very helpful, and some data is usually
required before anything useful can be obtained from this option. Note that
you need either the standard deviation of the differences or both the standard
deviation between subjects and the correlation coefficient between paired
measurements. The standard deviation between subjects is insufficient without
the correlation.
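The relationship between the three quantities can be sketched in Python as follows. This is an illustration only, and it assumes that the standard deviation between subjects is the same for the first and second measurements. The function name is hypothetical.

```python
from math import sqrt

def sd_of_differences(sd_between_subjects, correlation):
    """Standard deviation of within-pair differences.

    Var(X1 - X2) = 2 * sd^2 * (1 - r) when both measurements have the same sd.
    """
    return sd_between_subjects * sqrt(2 * (1 - correlation))
```

With a correlation of 0.5 the standard deviation of differences equals the standard deviation between subjects. With a higher correlation it is smaller, which is where the advantage of pairing comes from, and this is why the standard deviation between subjects alone is not enough.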
Power, significance level, sample size ratio, and standard deviation are set
separately from the menu. Given these, the sample size required to detect a
difference or the difference detectable with a particular sample size can be
found.
13.6 Power against difference or correlation
This is a plot showing power on the vertical axis against difference or
correlation on the horizontal axis. The program asks for the total sample
size. For differences between means and proportions this is then split into n1
and n2 according to the sample size ratio set separately from the menu.
The plot is symmetrical about zero for the difference between two means and for
correlation, as positive and negative differences can be found with equal
power. It is not symmetrical for the difference between two proportions. The
proportion for group 1 is fixed and the other proportion cannot be less than
zero or greater than one.
13.7 Power against sample size
This is a plot showing power on the vertical axis against total sample size on
the horizontal axis. The program asks for the difference between two means,
the two proportions p1 and p2, or the correlation. For differences between
means and proportions the total sample size is split into n1 and n2 according
to the sample size ratio set separately from the menu.
13.8 Sample size against difference or correlation
This is a plot showing total sample size on the vertical axis against
difference or correlation on the horizontal axis. The power is set separately
from the menu. The program asks for the maximum size in which you are
interested, and plots from zero to this or above. It uses a rounding algorithm
for the scale which means that it often gives sample sizes bigger than those
asked for. For the difference between two proportions, the program asks for
the proportion in group 1, p1. For differences between means and proportions
the total sample size is split into n1 and n2 according to the sample size
ratio set separately from the menu.
The plot is symmetrical about zero for the difference between two means and for
correlation, as positive and negative differences can be found with equal
power. It is not symmetrical for the difference between two proportions. The
proportion for group 1 is fixed and the other proportion cannot be less than
zero or greater than one.
14 Histogram, mean and standard deviation for a single variable, main menu
option 8
14.1 Facilities and limitations
This program is started from option 8 from the Clinstat main menu, sub menu
option 3. It will accept data from the keyboard or from a disk file, and will
save data on disk. This program performs calculations on a single continuous
variable. It will produce basic statistics (mean, median, variance, etc), plot
histograms and scatter diagrams.
The main summary statistics menu offers data input, listing and editing,
summary statistics and plots, and storage of data on disk.
14.2 Data input
Data input offers 3 options: from keyboard, from disk, or from disk with
restrictions. If the data are to be put in via the keyboard, option 1,
Clinstat first asks whether you want to name your variables. The default
option is "X". The program will then ask for the data case by case. "N" or
"NO" terminates the data entry.
If the data are on a Clinstat disk file, the program will ask for the file name
and read the variable labels. It will then ask which variable is required. If
restrictions are used the procedure of 1.8 is followed. It then reads the data
from disk and returns control to the menu. Cases with missing data for the
variable are omitted.
14.3 List and edit data
Option 2 of the single variable menu provides a number of listing and editing
options. The data can be listed on the screen or on the printer or log file.
Data editing enables errors to be corrected. Cases can be changed, added or
deleted. Cases are referred to by the case number.
The data transformations option offers log (base 10), exponent of 10 (antilog),
and raise to any power. If an invalid transformation has been requested, such
as log (0) or (-1) to the .5, a message is given and the transformation is
aborted. More extensive transformations are available in the data file editing
program obtained from Clinstat main menu option 1 if you want them.
14.4 Summary statistics and plots
Option 3 of the single variable menu, summary statistics and plots, produces a
further menu, offering histogram, Normal plot, frequency distribution and
summary statistics.
Summary statistics provide mean, minimum and maximum, median, variance,
standard deviation, standard error of mean, and corrected sum of squares
(Bland, 1987, Section 4.5, Section 4.6, Section 4.7).
You can choose the interval size and starting point for the histogram (Bland,
1987, Section 4.3) yourself or let Clinstat do it for you. A Normal
Distribution curve (Bland, 1987, Section 7.2, Section 7.4) and the position of
the mean and the mean ± one and two standard deviations (Bland, 1987, Section
4.7) can be added to the histogram if required. Clinstat will also calculate
and print the frequencies themselves. For the frequency distribution, too, the
choice of interval can be left to the program or done by the user. A histogram
of the frequency distribution is optional. This enables you to draw any
histogram you like.
A Normal plot is a plot of the variable against the corresponding percentile of
the Normal Distribution (Bland, 1987, Section 7.5). If the data follow a
Normal Distribution the Normal plot will be a straight line. As a guide,
Clinstat draws the line along which the points are expected to lie if the data
are Normally distributed. This option is rather slow on some PCs.
14.5 Confidence intervals for the mean
The program calculates a confidence interval for the mean. The confidence
interval uses the t method, so data are assumed to be from a Normal
distribution. For larger samples (say > 30) this is equivalent to the Normal
method, 1.96 standard errors on either side of the mean, and for samples
greater than 100 the Normal distribution assumption can be ignored (Bland,
1987, Chapter 10).
14.6 Storing data on disk
Saving data on disk enables you to keep the data which has been keyed into this
program. On twin floppy drive systems, a data disk is needed in Drive B:. On
one drive systems it must be exchanged for the program disk. The variable
label can be changed at this point if you wish. The file created can be read
by the general data editing program as if it had been entered under the general
input program from Clinstat main menu option 1.
15 Standardized Mortality Ratios, main menu option 8
15.1 Calculation of Standardized Mortality Ratios
This program calculates a Standardized Mortality Ratio (Bland, 1987, Section
16.3) from a standard set of mortality rates and an age distribution for the
study population. The SMR, its standard error and a 95% confidence interval
are calculated. For small samples the confidence interval is found by the
exact Poisson method (Gardner and Altman, 1989, Section 6).
Further SMRs can be calculated for other diseases by entering new standard
rates, and for other populations by entering a new age distribution.
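The basic calculation can be sketched in Python as follows. This is an illustration only, not Clinstat's own code; the function name is hypothetical. It assumes that the number of observed deaths is supplied along with the age-specific standard rates and the numbers in each age group of the study population. The SMR is given as a ratio rather than multiplied by 100. Only the large sample standard error is shown; the exact Poisson interval used for small samples is not reproduced.

```python
from math import sqrt

def smr(observed_deaths, standard_rates, population):
    """Return (expected deaths, SMR as a ratio, large sample standard error)."""
    # expected deaths if the standard rates applied in each age group
    expected = sum(rate * n for rate, n in zip(standard_rates, population))
    ratio = observed_deaths / expected
    # the observed count is treated as Poisson, so its standard error is sqrt(O)
    se = sqrt(observed_deaths) / expected
    return expected, ratio, se
```

A large sample 95% confidence interval would then be the SMR minus and plus 1.96 standard errors.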
15.2 Confidence intervals and significance tests for Standardized Mortality
Ratios
This program calculates a confidence interval for a Standardized Mortality
Ratio from the observed and expected frequencies. It is suitable for small
samples as exact Poisson methods are used (Gardner and Altman, 1989, Section
6). The SMR, a 95% confidence interval, and a test of the null hypothesis that
the SMR is one are calculated. The program also compares two SMRs. The
confidence interval for the ratio is calculated, as this is much easier to do
in the small sample case than the interval for the difference. A significance
test of the null hypothesis that the two SMRs are equal is also given.
On some PCs this may be slow if the observed frequencies are large.
16 Simulations and other demonstrations of statistical principles, main menu
option 9
Clinstat includes a number of programs developed for computer aided learning
and teaching. These can be used in private study, in group exercises if a
computer lab or classroom is available, or as demonstrations in the lecture
theatre, using a Kodak Datashow or similar device.
These programs do not produce any printer or log file output.
16.1 Tossing coins
This very simple program illustrates the concept of randomness and
probability. Any number of coins can be tossed. First toss a single coin. If
you repeat this several times you will see that we cannot predict whether Head
or Tail will show. However, if we toss several coins, say 10, we can be fairly
sure that we will get some heads and some tails. If we toss a large number of
coins, we see that about half the coins show heads and half show tails. We can
predict what will happen over many trials, "in the long run", but not the
single trial (Bland, 1987, Section 6.1).
16.2 Pintable simulation
This program is based on a program for the Commodore Pet by I. J. Wood (1981).
It represents Bernoulli's experiment of a ball falling down a series of pins,
bouncing in either direction. This simulates a Binomial distribution (Bland,
1987, Section 6.4, Bland, 1984). Because this is a computer, the ball does not
have to have the same probability of bouncing to right and left.
After the pintable, the program will print a histogram for the number of
bounces to the right and compare the observed probabilities for this
distribution with those predicted by the Binomial.
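The idea of the pintable can be sketched in Python as follows. This is an illustration only, not a translation of the original program; the function names and the seed argument are hypothetical.

```python
import random
from math import comb

def pintable(n_balls, n_pins, p_right, seed=None):
    """Drop n_balls down n_pins rows; count the balls by number of right bounces."""
    rng = random.Random(seed)
    counts = [0] * (n_pins + 1)
    for _ball in range(n_balls):
        rights = sum(rng.random() < p_right for _pin in range(n_pins))
        counts[rights] += 1
    return counts

def binomial_probabilities(n_pins, p_right):
    """Probability of 0, 1, ..., n_pins bounces to the right."""
    return [comb(n_pins, k) * p_right ** k * (1 - p_right) ** (n_pins - k)
            for k in range(n_pins + 1)]
```

Dividing each count by n_balls gives the observed proportions, which can be set beside the output of binomial_probabilities for the same number of pins and probability.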
16.3 People on boxes - simulation of mean and variance
This program illustrates the effect of addition and subtraction upon mean and
variance (Bland, 1987, Section 6.6, Bland, 1984). The random variables used are
human height, the heights of boxes and the depths of holes. The program will
draw a screen of stick people, with their mean height, standard deviation and
variance. It will do the same for random boxes and holes. Constants are
represented by the constant boxes, all identical, and constant holes, again
identical.
Something can be added to human height by getting the people to stand on a
box. If the boxes are all the same we add a constant. The mean increases but
the variability is unchanged. If the boxes are random and independent of the
heights, the mean of the sum is the sum of the means and the variance of the
sum is the sum of the variances of human height and box height. If the heights
of the boxes are not independent of the heights of the people this is not
necessarily true. If the box is chosen so that the person can reach a light bulb, the
short people must find the big boxes and so the box and people heights will be
negatively correlated. The variance of the sum is not the sum of the
variances; it is reduced.
Something can be subtracted from human height by getting the people to stand in
a hole. If the holes are all of the same depth we subtract a constant. The
mean decreases but the variability is unchanged. If the holes are random and
independent of the heights, the mean of the difference is the difference
between the means. The variability is increased, however, and the variance of
the difference is the sum of the variances of human height and hole depth. If
the depths of the holes are not independent of the heights of the people this
is not necessarily true. If each person chooses a hole in order to hide, the
tall people must find the deepest holes, and so hole depth and height will be
positively correlated. As for the boxes, the variance of the difference is no
longer the sum of the variances; it is reduced.
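The rules for boxes and holes can be checked numerically. The sketch below is
an illustration only, not the Clinstat program: the Normal distributions, the
means and standard deviations in centimetres, the bulb height of 210 and the
way the dependent boxes and holes are generated are all assumptions made for
the example.

```python
import random
from statistics import mean, variance

rng = random.Random(42)
n = 20000
people = [rng.gauss(170, 10) for _ in range(n)]  # heights, variance about 100
boxes = [rng.gauss(30, 5) for _ in range(n)]     # independent, variance about 25
holes = [rng.gauss(30, 5) for _ in range(n)]     # independent, variance about 25

on_boxes = [h + b for h, b in zip(people, boxes)]
in_holes = [h - d for h, d in zip(people, holes)]

# Dependent cases (assumed mechanisms): short people take big boxes to reach
# a bulb at 210 cm; tall people take deep holes so that they can hide.
bulb_boxes = [210 - h + rng.gauss(0, 2) for h in people]
hide_holes = [h + rng.gauss(0, 2) for h in people]
reach = [h + b for h, b in zip(people, bulb_boxes)]
hidden = [h - d for h, d in zip(people, hide_holes)]

def report(label, values):
    print(f"{label:28s} mean {mean(values):7.1f}  variance {variance(values):7.1f}")

report("people", people)
report("on constant 30 cm boxes", [h + 30 for h in people])
report("on random boxes", on_boxes)
report("in random holes", in_holes)
report("on boxes to reach the bulb", reach)
report("in holes to hide", hidden)
```

With independent boxes or holes the variance should be close to 100 + 25 =
125 in both cases. In the two dependent cases it falls well below the
variance of height alone.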
16.4 Central limit simulation
The Central Limit Theorem states that if we have the sum of several
independent, identically-distributed random variables, this sum tends towards a
Normal Distribution as the number of variables increases (Bland, 1987, Section
7.2). A consequence of this is that the mean of a large sample will come from
a Normal distribution whatever the distribution of the observations themselves.
This program illustrates this using a Uniform or Rectangular distribution to
produce the observations. All possible numbers between 0 and 1 are equally
likely. This is produced by the RND(X) function familiar to BASIC programmers.
The program first asks for the number of observations to be added, then the
number of sums to be generated. The graphic display constructs a histogram of
the sums. As each sum is generated, another block is added to
the histogram. When it is complete, the corresponding Normal Distribution
curve is drawn. This has the same mean and variance and is scaled to have the
same area as the histogram. The program then asks you if you want to do it
again. If you answer "Y" the program asks for the number to be added; if you
answer "N" it returns to the simulations menu.
Start with 1 observation and 400 runs. The graphic display constructs a
histogram of the Uniform variable, showing a roughly rectangular shape. The
Normal Distribution curve looks quite unlike the histogram. Now try two
observations added. The histogram is now triangular and the Normal
Distribution curve fits it better, but not well. Look at the tails of the
distributions to see the poor fit there. As you increase the number of
observations added, you will see that the fit improves rapidly, until by the
time six Uniform variables are added the two distributions are very close
indeed. As you increase the number in the sum, you can increase the number of
runs, too (Bland, 1984).
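The same experiment can be tried in a few lines of Python. This is a sketch
under stated assumptions rather than the Clinstat code: Python's
random.random() stands in for RND(X), the function names are invented for the
example, and the fit in the tails is summarised by the fraction of sums more
than two standard deviations from the mean (about 0.0455 for a Normal
distribution).

```python
import random
from math import sqrt
from statistics import mean, pvariance

def uniform_sums(k, runs, rng):
    """Generate `runs` sums, each of k Uniform(0, 1) observations."""
    return [sum(rng.random() for _ in range(k)) for _ in range(runs)]

def tail_fraction(sums, k):
    """Fraction of sums more than 2 theoretical SDs from the theoretical mean."""
    centre, sd = k / 2, sqrt(k / 12)   # mean and SD of a sum of k Uniforms
    return sum(abs(s - centre) > 2 * sd for s in sums) / len(sums)

rng = random.Random(0)
for k in (1, 2, 6):
    sums = uniform_sums(k, 400, rng)
    print(f"{k} added: mean {mean(sums):.3f} (theory {k / 2:.3f}), "
          f"variance {pvariance(sums):.3f} (theory {k / 12:.3f}), "
          f"beyond 2 SD {tail_fraction(sums, k):.3f} (Normal 0.0455)")
```

For a single observation the tail fraction is exactly zero, because no
Uniform value lies more than 0.5 from the mean while two standard deviations
is about 0.577. This is the poor fit in the tails described above.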
16.5 Sampling distribution of mean and proportion
This program plots a histogram like that for the central limit theorem. Instead
of the sum of Uniform random variables, it calculates the mean of Normal random
variables, mean = 0, standard deviation = 1. The mean and standard deviation
of the resulting sampling distribution are printed. The standard deviation is,
of course, the estimated standard error of the mean (Bland, 1987, Section 8.1,
Section 8.2, Bland, 1984).
Start with a single observation, to show the underlying distribution with
standard deviation = 1. Then increase the sample size to 4, then 9, then 16.
The standard deviations will be approximately 1/2, 1/3, and 1/4, showing that
the standard error is inversely proportional to the square root of the number
of observations.
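The demonstration with sample sizes 1, 4, 9 and 16 can be reproduced as
follows. This is a minimal sketch, not the Clinstat program; the function
name, the 400 samples per run and the seed are assumptions for the example.

```python
import random
from statistics import mean, stdev

def sample_means(n, samples, rng):
    """Means of `samples` samples, each of n Normal(0, 1) observations."""
    return [mean([rng.gauss(0, 1) for _ in range(n)]) for _ in range(samples)]

rng = random.Random(3)
for n in (1, 4, 9, 16):
    means = sample_means(n, 400, rng)
    print(f"n = {n:2d}: SD of sample means {stdev(means):.3f}, "
          f"1/sqrt(n) = {n ** -0.5:.3f}")
```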
The program also plots a histogram of Binomial proportions. The population
proportion is set by the user. The program generates proportions from samples
of specified size. The mean and standard deviation of the resulting sampling
distribution are printed. The standard deviation is, of course, the estimated
standard error of the proportion (Bland, 1987, Section 8.4).
This program illustrates both standard error and the Normal approximation to
the Binomial distribution. It needs larger samples than the means option. A
good demonstration uses a population proportion 0.3, and sample size 10 with
300 samples, sample size 40 with 300 samples, sample size 90 with 300 samples,
sample size 160 with 200 samples, and sample size 250 with 200 samples.
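The suggested demonstration for proportions can be sketched in the same way.
The code below is an illustration under assumptions, not Clinstat's own: the
function name and seed are invented, and the observed standard deviation is
compared with the usual formula sqrt(p(1-p)/n) for the standard error of a
proportion.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_proportions(p, n, samples, rng):
    """Proportions of successes in `samples` Binomial samples of size n."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(samples)]

p = 0.3
rng = random.Random(5)
for n, samples in ((10, 300), (40, 300), (90, 300), (160, 200), (250, 200)):
    props = sample_proportions(p, n, samples, rng)
    print(f"n = {n:3d}: mean {mean(props):.3f}, SD {stdev(props):.4f}, "
          f"sqrt(p(1-p)/n) = {sqrt(p * (1 - p) / n):.4f}")
```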
16.6 Sum of squares simulation
This program shows why we use n-1 as the divisor when estimating variance
(Bland, 1987, Section 4.7, Section 4A.1, Section 6A.2, Bland, 1985). It takes
random samples from a population to show that the sum of squares about the
sample mean is proportional to the number of observations minus one, whereas
the sum of squares about the population mean (which is, unusually, known in
this case) is proportional to the number of observations itself.
This is a menu driven program. The population can be listed and the population
mean and sum of squares shown, with the mean squared difference from the mean,
the population variance. We use small samples to estimate this.
You can draw a simple random sample of any size, say 4. The sample chosen is
printed, together with a number of quantities, including the sum of squares
about the sample mean. This sum of squares divided by n-1 is the usual
estimate of the population variance. This sum of squares divided by n is also
given. Many calculators give the square root of this as "sigma n".
Try it again and you will get a different sample. A simulation run will
generate and aggregate the results from many such samples. Try a simulation
run with sample size 2 and 50 runs. You get the same things averaged over the
50 samples. This can be repeated using sample sizes 3, 4 and 5. A summary of
the runs so far shows how these estimates of variance change with n.
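The idea behind the simulation run can be sketched as follows. This is not
the Clinstat program: the population of the integers 1 to 100 is
hypothetical, and the samples are drawn with replacement (an assumption made
so that the expected sums of squares are exactly n-1 and n times the
population variance).

```python
import random
from statistics import mean, pvariance

population = list(range(1, 101))   # hypothetical population of 100 values
mu = mean(population)
sigma2 = pvariance(population)     # mean squared difference from mu

def simulation_run(n, runs, rng):
    """Average sums of squares about the sample mean and about mu."""
    about_sample_mean = about_population_mean = 0.0
    for _ in range(runs):
        sample = rng.choices(population, k=n)   # drawn with replacement
        m = mean(sample)
        about_sample_mean += sum((x - m) ** 2 for x in sample)
        about_population_mean += sum((x - mu) ** 2 for x in sample)
    return about_sample_mean / runs, about_population_mean / runs

rng = random.Random(7)
for n in (2, 3, 4, 5):
    ss_m, ss_mu = simulation_run(n, 2000, rng)
    print(f"n = {n}: SS about sample mean / variance = {ss_m / sigma2:.2f}, "
          f"SS about population mean / variance = {ss_mu / sigma2:.2f}")
```

The first ratio should be close to n-1 and the second close to n, which is
why dividing the sum of squares about the sample mean by n-1 estimates the
population variance.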
16.7 Confidence interval simulation
This program illustrates the behaviour of confidence intervals (Bland, 1987,
Section 8.3, Bland, 1985). It has a menu structure similar to that of the sums
of squares simulation. It draws random samples from a fixed population and
calculates means with 95% confidence intervals. The population can be listed
and the true mean found.
A single random sample can be drawn, and the 95% confidence interval
calculated. For most samples this should include the population mean. A
simulation run gives a graphic display, showing the value of the variable along
the horizontal scale, with the population mean marked by a vertical line. In
the long run, about one in twenty confidence intervals excludes the population
mean.
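The long-run behaviour can be checked with a short simulation. The sketch
below is an assumption-laden illustration, not the Clinstat code: it samples
from a Normal population with mean 100 and standard deviation 15 rather than
from Clinstat's fixed population, uses samples of size 10, and takes the t
value for 9 degrees of freedom (2.262) from tables.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_975_9DF = 2.262   # two-sided 95% point of t with 9 d.f., from tables

def confidence_interval(sample):
    """95% confidence interval for the mean of a sample of size 10."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return m - T_975_9DF * se, m + T_975_9DF * se

def coverage(runs, rng, mu=100.0, sigma=15.0, n=10):
    """Proportion of intervals that include the population mean mu."""
    hits = 0
    for _ in range(runs):
        low, high = confidence_interval([rng.gauss(mu, sigma) for _ in range(n)])
        hits += low <= mu <= high
    return hits / runs

rng = random.Random(9)
print(f"Proportion of intervals including the mean: {coverage(2000, rng):.3f}")
```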
16.8 Probability distributions
This program plots any of the following distributions: Normal, Binomial,
Poisson, t, Chi-squared and F (Bland, 1987, Section 6.4, Section 6.7, Section
7.2, Section 7A). The Normal is plotted for any mu and sigma over the range -6
to +6, the Binomial for any p and n<=200, the Poisson for any mu <= 100, the t
for any degrees of freedom over the range -6 to +6, the Chi-squared for any
degrees of freedom up to 30, and the F for any degrees of freedom over the
range 0 to 6. The Binomial, Poisson and t plots have a Normal Distribution
curve option.
16.9 Simulations of clinical trials
There are two simulated clinical trials. The first has a fixed sample size of
55 patients, the second can have any sample size (Bland, 1986).
16.9.1 Clinical trial with fixed sample size
For this trial you have to use a table of random numbers or some other method
to allocate 55 subjects to two treatments. The random allocation program
could be used. Clinstat prints a table of results. You must decide what to
do about several patients without full data. This program can be used to
illustrate the concept of significance in class. Calculate chi-squared tests
for survival versus death and for degree of improvement. Some will be
significant and some not. In fact, there is no difference in mortality but
there is a true difference in the degree of improvement. The sample size is
too small to be sure of detecting this, however.
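The class exercise for mortality can be imitated in Python. This sketch is
not the model built into Clinstat: the 16 deaths among 55 patients, the coin
toss allocation and the function names are hypothetical. Each patient's
outcome is fixed whatever the treatment, so there is no true difference in
mortality, and each simulated student makes a separate random allocation and
calculates a chi-squared test without continuity correction.

```python
import random
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0          # an empty row or column: no test is possible
    return n * (a * d - b * c) ** 2 / margins

def p_value_1df(chi2):
    """Upper tail probability of chi-squared with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

died = [i < 16 for i in range(55)]   # hypothetical: 16 deaths in 55 patients

def one_trial(rng):
    """Allocate each patient to A or B by coin toss; test death by treatment."""
    table = [[0, 0], [0, 0]]   # rows: treatment A, B; columns: died, survived
    for outcome in died:
        arm = rng.randrange(2)
        table[arm][0 if outcome else 1] += 1
    (a, b), (c, d) = table
    return p_value_1df(chi_squared_2x2(a, b, c, d))

rng = random.Random(11)
trials = 2000
fraction = sum(one_trial(rng) < 0.05 for _ in range(trials)) / trials
print(f"Proportion of allocations significant at 0.05: {fraction:.3f}")
```

Although the treatments do not differ, a small proportion of allocations
give a significant result, which is the point the exercise is meant to
bring out.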
16.9.2 Clinical trial with variable sample size
This program does the randomization for you. You can vary the sample size. The
chi-squared test is built in, together with some table editing functions to get
a valid and appropriate test. You can print the actual model on which the
simulation is based and repeat the simulation, using the same model again or a
different one.
17 Error messages
Clinstat traps some errors before they happen; for others it uses the
QuickBASIC run-time error codes (Microsoft, 1987). Some of these should never
happen; others will be labelled as they occur. A full list of these codes
follows:
3 RETURN without GOSUB
4 Out of DATA
5 Illegal function call
6 Overflow
7 Out of memory
9 Subscript out of range
11 Division by zero
14 Out of string space
16 String formula too complex
19 No RESUME
20 RESUME without error
24 Device timeout
25 Device fault
27 Out of paper
39 CASE ELSE expected
40 Variable required
50 FIELD overflow
51 Internal error
52 Bad file name or number
53 File not found
54 Bad file mode
55 File already open
56 FIELD statement active
57 Device I/O error
58 File already exists
59 Bad record length
64 Bad file name
67 Too many files
68 Device unavailable
69 Communication buffer overflow
70 Permission denied
71 Disk not ready
72 Disk-media error
74 Rename across disks
75 Path/file access error
76 Path not found
Most of these errors relate to file and peripheral problems. You should be
able to deal with these quite easily. Error 24, device timeout, probably means
your printer is not switched on, for example.
Error 5 occurs, among other causes, when you try to do something for which your
computer does not have the hardware. You may have selected a graphics adapter
you do not have, or be using a Hercules adapter without having loaded
QBHERC.COM. If
Clinstat was installed correctly, the batch file loads QBHERC.COM for you. See
Sections 2 and 3.3.
If error 3, 4, 19, 20, 39, 50 or 56 occurs, this indicates a programming
error. Contact Martin Bland.
18 References
Armitage P, Berry G. (1987) Statistical Methods in Medical Research.
Blackwell, Oxford.
Bland JM. (1984) Using a microcomputer as a visual aid in the teaching of
statistics. The Statistician, 33, 253-259.
Bland JM. (1985) Computer simulation used to illustrate two statistical
principles. Teaching Statistics, 7(3), 74-78.
Bland JM. (1986) Computer simulation of a clinical trial as an aid to
teaching the concept of statistical significance. Statistics in Medicine, 5,
193-197.
Bland JM, Altman DG. (1986) Statistical methods for assessing agreement
between two methods of clinical measurement. Lancet, i, 307-310.
Bland M. (1987) An Introduction to Medical Statistics. Oxford University
Press, Oxford.
Conover WJ. (1980) Practical Nonparametric Statistics, 2nd. ed. Wiley, New
York.
Fleiss JL. (1981) Statistical Methods for Rates and Proportions, 2nd. ed.
Wiley, New York.
Gardner M, Altman DG. (1989) Statistics with Confidence. BMJ, London.
Kendall MG, Stuart A. (1968) The Advanced Theory of Statistics, Vol 3, 2nd.
ed. Griffin, London.
Maxwell AE. (1980) Comparing the classification of subjects by two independent
judges. British Journal of Psychiatry, 116, 651-5.
Microsoft Corporation. (1987) QuickBASIC 4.0.
Pocock SJ. (1983) Clinical Trials: a Practical Approach. Wiley, Chichester.