1 Overview 1.1 What is Clinstat? Clinstat is an interactive program which can be used to carry out the analysis of small data sets and as an aid to learning statistics. Clinstat is menu-driven, so there are no commands to learn. It carries out all the main univariate, unifactorial statistical calculations: regression and correlation, t tests, chi-squared tests, analysis of variance, rank methods, etc., using data from data files or from direct keyboard entry. It has a self-checking data entry program and many data editing options. There are many more powerful statistical packages available, such as SPSS, SAS, GLIM, etc., but Clinstat's simple menu-driven format makes it especially suitable for those new to statistical analysis and so particularly useful as a teaching aid. In addition, Clinstat includes a number of programs specifically written for teaching and learning statistics. These carry out simulations to illustrate concepts such as standard error and the central limit theorem, and draw the main probability distributions for parameters of your choice. Clinstat was written originally for medical researchers who came for statistical advice, hence its name "Clinical Statistics". It was developed as an aid for subsidiary calculations on tables produced by other programs in survey analysis, for use in student projects and for use in teaching students about the principles of statistics. With the development of the ubiquitous PC, it is possible to use the program interactively in teaching where each student runs the program as the class proceeds. 1.2 About this manual Clinstat does not really need a manual, as its menu-driven format makes most operations straightforward and self-evident. The manual describes the functions available in Clinstat, but is not necessary for day-to-day use of the program. The manual contains some explanations of the statistical procedures it describes, but these are very limited.
References are given to textbooks and to particular papers in the literature where appropriate. Most references are to An Introduction to Medical Statistics, (Bland, 1987). Clinstat was developed in conjunction with this book and most of the analyses, simulations and graphics which the book contains were done using it. Clinstat will also do many analyses which are not included in Bland (1987), and for these appropriate references will be given. 1.3 Machine requirements Clinstat is an interactive data input, editing and statistical analysis package for IBM compatible microcomputers. It is written and compiled in Microsoft QuickBasic. It requires an IBM compatible computer with at least one floppy disk drive. It can use two disk drives, a hard disk and a printer if available. You can run Clinstat from your hard disk or from floppy disks. Clinstat supports CGA, EGA, VGA and Hercules video graphics adapters, and supports colour graphics on all except Hercules. If there is a maths coprocessor, Clinstat will find and use it. 1.4 Menus and questions Clinstat is a menu-driven program, that is, it gives you lists of options and you choose one of these by typing its number. These are arranged in a hierarchy, so that choosing an option may lead to another menu giving more detailed options, and so on. For example, the regression option from the main menu leads to a menu including data input, data editing, summary statistics and plots, regression, correlation, etc. Choosing the data input option leads to a menu with options to get data from keyboard, read data from a disk file, etc. If you choose the disk file option, Clinstat then asks for the name of the file and gets the data. Although this sounds complicated, it is much simpler for the beginner than a program which is driven by commands, which have to be learned from a manual or help facility. Clinstat does not have a help facility because it does not need one. 
In almost all Clinstat menus, the first option is "0 none required", which means that you do not want any of the options which follow. Typing "0" RETURN will move Clinstat on to the next step, which is usually a higher level menu. Thus you can quickly return to the main menu and use a different procedure or quit altogether. Sometimes Clinstat asks a question, such as "Do you want to label your variables?" The answer is almost always yes or no. Clinstat only requires the first character, "y" or "n", and will accept upper or lower case. Only for such things as file names is the whole word needed. If you accidentally go down a path you do not want, typing RETURN will usually halt the procedure and get you back to the menu. Some sub-menus have an option to quit the procedure instead. 1.5 Variables and cases Many of these programs refer to variables and cases. Suppose we record the age, sex, height and weight for each of 15 people. Then age, sex, height and weight will be variables and each person will be a case. We have a value for each variable obtained for each case. If you are setting up a data file on disk, you are strongly recommended to have a case number as one of your variables. The case number used by Clinstat to refer to cases is the sequence number of the case in the computer's memory and may not be useful for identifying cases to match with your paper records. 1.6 Data capacity Clinstat will handle a data matrix with 10,000 numbers (cases times variables), with a maximum of 100 variables and 500 cases. Cross-tabulations are limited to 20 rows and 20 columns, group comparisons to 20 groups. 1.7 Missing data Programs which use data from disk have a provision for setting a missing data code. When a disk file is set up, a missing data code, say 9999, can be defined for any variable. Then if the variable is not known for a case, the missing data code is entered instead.
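As an illustration only, the idea of a per-variable missing data code can be sketched in Python. This is not Clinstat's own code: the variable names, the dictionary layout and the helper functions are hypothetical, and only the code value 9999 comes from the text.

```python
# Hypothetical per-variable missing data codes; None means no code was defined.
missing_codes = {"age": 9999, "sex": None, "height": 9999, "weight": 9999}

# Two made-up cases; the second has age and weight recorded as missing.
cases = [
    {"age": 34, "sex": 1, "height": 172, "weight": 70},
    {"age": 9999, "sex": 2, "height": 160, "weight": 9999},
]

def is_missing(variable, value):
    """True if the value equals the missing data code defined for the variable."""
    code = missing_codes.get(variable)
    return code is not None and value == code

def missing_variables(case):
    """List the variables whose value is the missing data code in this case."""
    return [name for name, value in case.items() if is_missing(name, value)]

for number, case in enumerate(cases, start=1):
    print(number, missing_variables(case))
```

A variable with no code defined (here, sex) is never treated as missing, whatever value it holds.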
When new variables are created by the edit programs, they can also be assigned missing data codes and, for example, a new variable which is the sum of two old variables will be given the missing data code when either old variable is missing. The regression, group comparison and survival data programs read only two or three variables from disk. They will automatically exclude any case which has either variable missing. The cross-tabulation program does not recognize missing data codes. In this program many variables may be read at once, and excluding cases with one of them missing could lead to unacceptable loss of information. Two methods can be used to deal with missing data in this program. There is a restrictions facility which can restrict a table to cases having values for a variable between specified limits. There are also facilities to edit a table, deleting or combining rows or columns. It will be seen that the missing data code is of limited value if you are only going to tabulate and do chi-squared tests. 1.8 The "Restrictions" feature Several of these programs have a restrictions feature. This means that operations can be restricted to cases having values for variables between certain limits. For example, having examined the relationship between height and weight you may wish to do this for men aged over 30 years. After the program asks you to enter restrictions, it will ask for a variable number. Tell it which variable you want. In the example in 1.5 above, sex is variable 2, so type 2. It then asks for permissible values for the variable. These may be either separate values or inclusive ranges. The lower and upper limits of a range must be separated by 'TO'; the first limit, 'TO' and the second limit are each followed by RETURN. There is a maximum of 10 separate values or ranges. When the permissible values have been entered, type 'NO' or 'N'. For the example, if sex had been coded 1 for male we would have 1 RETURN N RETURN.
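The selection that these entries build up can be pictured with a short Python sketch. It is an illustration under stated assumptions, not Clinstat's implementation: the function names and the sample cases are invented, variables are numbered from 1 in the order age, sex, height, weight, and the age range 30 to 200 is used to pick out the older men.

```python
MAX_ENTRIES = 10  # the manual allows at most 10 separate values or ranges

def make_restriction(*entries):
    """Each entry is a single value or a (low, high) inclusive range."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("at most 10 values or ranges per variable")
    # Store a single value v as the range (v, v) so all entries look alike.
    return [e if isinstance(e, tuple) else (e, e) for e in entries]

def satisfies(case, restrictions):
    """case: list of values; restrictions: {variable number (1-based): ranges}."""
    return all(
        any(low <= case[var - 1] <= high for low, high in ranges)
        for var, ranges in restrictions.items()
    )

# Made-up cases with variables in the order age, sex, height, weight.
cases = [
    [45, 1, 178, 82],
    [28, 1, 171, 69],
    [52, 2, 163, 60],
]

# Variable 2 (sex) must equal 1; variable 1 (age) must lie in 30 to 200.
restrictions = {2: make_restriction(1), 1: make_restriction((30, 200))}
selected = [c for c in cases if satisfies(c, restrictions)]
print(selected)
```

A case is kept only if every restricted variable falls in at least one of its permitted values or ranges.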
This is then repeated for another variable. For the example, males over 30 could be selected by 30 RETURN TO RETURN 200 RETURN N RETURN. 200, of course, is greater than any possible age. This goes on until a variable request receives a reply of 'NO'. The program lists the restrictions and asks whether they are correct. If not, they can be re-entered. If yes, the program proceeds to select cases which meet these restrictions. 1.9 Clinstat data files Clinstat data sets consist of two ASCII text files, filename.DAT and filename.CLB. The main file, filename.DAT, holds the actual data in free format. Each line corresponds to a case, the variables being separated by two spaces and the line being terminated by a carriage return. This file can be read by many other statistical programs, so having started with Clinstat it is possible to move to another program if more advanced statistical procedures are needed. Similarly, Clinstat can read data produced by other programs. The second file, filename.CLB, is the Clinstat labels file. For each data matrix, the following information is stored:

    the number of variables
    the number of cases
    variable labels (up to 15 characters)
    input limits (minimum and maximum allowable values for each variable)
    missing data codes (0,0 if code absent, 1,code if present)
    number of value labels
    value label codes for each variable, start and count
    value labels
    0 0 0 0 (future use)

Clinstat can read files created by other programs provided they are in the free ASCII format described above. The input program (main menu option 1) is used to create a CLB file to go with the main data file. 1.10 Clinstat graphics Clinstat produces a number of graphs: scatter diagrams, histograms, survival curves, probability distributions, etc. These may be displayed using the CGA, EGA, VGA or Hercules graphics adapters. Of these, the CGA, EGA and VGA options support colour, the Hercules does not.
For computers without any graphics adapter, some graphs are shown as character graphics; only the simulation and distribution plotting programs do not do this. If the log file option is used, it is the character graphics form which is stored on the log file. If a non-graphics printer (e.g. a daisy wheel) is used, the character graphics form is printed. Clinstat will print some graphs on a suitable dot matrix, laser or inkjet printer. When the graph is displayed, the prompt line at the bottom of the screen reads: Press space bar to continue, S to save graph on disk and P to print Pressing the "P" key will erase the prompt line and start the printer. The graph will be in landscape mode, i.e. turned on its side. This is an 8 pin print, so any dot matrix printer should work. Hewlett Packard compatible laser and desk jet printers will print graphs. On some PCs this is a very slow procedure, particularly 086 computers with Hercules screens. Do not worry if nothing seems to happen for a while. If you find graph printing too slow, try saving the graphs to disk (see below) and printing them as a batch job using the graph retrieval program, main menu option 2. Graphics printing is not available on the teaching programs, main menu option 9. Clinstat will save some graphs on disk and retrieve them. When the graph is displayed, the prompt line at the bottom of the screen reads: Press space bar to continue, S to save graph on disk and P to print Pressing the "S" key will erase the prompt line and ask for the name of the destination file. The information required to draw the graph is then stored on the file. The prompt line reappears when the graph has been saved. The graph can be retrieved using main menu option 2, sub menu option 5. This program reads the graphics information file and draws the graph. It can be used to print the graph as described above. Clinstat graphs are stored as text, not as a screen image. They can only be retrieved within Clinstat. 
If you want to incorporate your graph into a word processor file, you should use one of the memory resident programs which can be used to capture screens, such as Word Perfect's grab or Lotus' scr. Clinstat's graphics screens are accessible to these on most computers. Graphics saving is not available on the teaching programs, main menu option 9. 1.11 Errors in programs and user's guide This User's Guide is correct at the date of release. However, these programs are continually being modified to improve speed and data capacity, remove errors and add new facilities. Thus the current program disks may contain features not described in the user's guide. The programs will still work as described here, however. Every effort is made to ensure that there are no bugs in the programs. The programs are still under development, and there are sure to be some. If you find one, tell Martin Bland, who will endeavour to fix it. 2 Installing Clinstat 2.1 Clinstat on floppy disks You can run Clinstat from floppy disks. For systems with two floppy drives, use a: for the program and b: for data disks and choose the "two floppy disk drives" option from the setup menu (see below). Otherwise, use the single drive option. Clinstat will tell you when to change disks. Copy the Clinstat disks and keep the originals as backup. You can then use the copies to run Clinstat. You may make as many copies as you wish. 2.2 Installing Clinstat on your fixed disk You can run Clinstat from the fixed disk c:. If you have an IBM graphics adapter (CGA, EGA or VGA), or no graphics at all, install Clinstat as follows. Put yourself in the root directory by typing

    c:
    cd \

Put Clinstat disk 1 in drive a: and type

    a:install

Clinstat will now set up a directory called CLINSTAT on the hard disk c: and a file CLINSTAT.BAT in the root directory. It will ask you to put in each disk in turn. You may have more than one disk image on the same physical disk.
If Clinstat has been supplied on three 3½" disks, each disk will contain two disk images. If Clinstat has been supplied on high density disks, each disk will contain three disk images. When asked to put in Disk 2, you should leave the first disk in place, as the Disk 2 files are on it. Press any key to continue. On the hard disk, Clinstat requires 1.6 megabytes of storage. Alternatively, you can set up your own directory and copy all the files to it using DOS. Note that Clinstat must be started from within its own directory as it reads files there. The command from within the directory is

    cl

If you have a Hercules graphics adapter, proceed as above except that you must type

    a:installh

instead of

    a:install

If you use DOS to copy Clinstat, the starting command from within the directory is then

    clh

This is a batch file which runs a file called QBHERC.COM, which enables the QuickBasic in which Clinstat was written to use the Hercules graphics card. 3 Starting Clinstat 3.1 Starting Clinstat from the fixed disk Unless it has been installed in some special way, you can run Clinstat from DOS by typing CLINSTAT at the DOS prompt. This will work if Clinstat has been installed on your hard disk using the install program, for example. If a menu shell has been installed, follow the appropriate instructions for the menu. 3.2 Starting Clinstat from the floppy disk To run Clinstat from the floppy disk, put disk 1 in drive a: and type

    A:
    CL

(CLH if you are using a Hercules graphics adapter). Clinstat will start. Clinstat will tell you when you need to change disks. The package resides on six 360K disks:

    Disk 1: Start, input and edit data files, installation.
    Disk 2: Copy, merge and join disk files, tabulation and cross-tabulation from data files, two way table analysis using chi-squared tests, comparison of two proportions, Cohen's Kappa.
    Disk 3: Regression, correlation and paired data analysis.
    Disk 4: Independent samples comparisons.
    Disk 5: Survival analysis, random sampling, SMR calculations.
    Disk 6: Simulations, probability distributions.

You may have more than one disk image on the same physical disk. If Clinstat has been supplied on three 3½" disks, each disk will contain two disk images. If Clinstat has been supplied on high density disks, each disk will contain three disk images. 3.3 The set up menu Clinstat produces a banner screen including the release date and version details and also the current date supplied by DOS. The current date and time are also printed at the start of your hard copy. The details of the system are now given. When running Clinstat for the first time, make sure that these are correct. In particular, make sure that the graphics adapter is correct. If you try to create a graph on a video adapter which your computer does not have, you will get an error code 5 message. If you want to change the setup, type "Y" RETURN and the change setup menu will appear. Choose the item you wish to change and when you have finished choose 0 to return to the main program. Note that colour screens are not supported with the Hercules video adapter. 3.4 Printers and log files Clinstat will next give three options for output: to screen only, to log file or to printer. You will only be offered the printer option if there is one in your setup, and the log file option if you have a hard disk or twin floppy drive system. If you want printed output, you can send this directly to the printer, which will print your results as you go along. You can stop Clinstat printing and start it again as you wish. Alternatively, you can put the output onto a log file. This is a disk file which records the results of calculations as you go along. It can then be printed from DOS or can be edited using edlin or a word processor. See your DOS or word processor manual for details of reading, editing and printing ASCII files. The log file is quiet.
It is particularly useful in teaching laboratories, etc., where several computers may share one printer and the noise of printers may distract others. To send output to a log file, choose option 2. You are then asked for the name of the log file. PC file names consist of a name of up to eight letters and numbers, optionally followed by a dot and an extension of up to three letters and numbers. You are strongly recommended to use the extension ".DAT" for all data files and ".LOG" for all log files. Other extensions may have special meanings and lead to errors. For example, Clinstat automatically creates a file filename.CLB, containing the variable labels, limits, etc., to go with the data file filename.DAT. It is essential that your log file has a different name to your data files, or unpredictable errors will occur. All the output produced by Clinstat will then be stored on the floppy disk for printing later. As with the printer, you can stop output going to the log file and restart it as you wish. 3.5 Printing the log file When you have quit Clinstat you can print your log file. Make sure your computer is connected to a printer. Make sure that the printer is switched on and ready. To print your file you must use commands from the disk operating system (DOS). From the DOS prompt type PRINT filename.LOG The computer may ask you NAME OF LIST DEVICE [PRN]: The printer should be PRN so just press RETURN. The computer will now print your output file. As an alternative, you can print the log file directly from option 2 of the main menu. This will terminate the Clinstat session, however. 
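The file naming advice in Section 3.4 can be checked before a session with a short Python sketch. It is only an illustration: the function names are hypothetical, the pattern follows the manual's description (up to eight letters and numbers, then an optional extension of up to three), and it assumes that "a different name" means the complete file name, compared without regard to case.

```python
import re

# 1-8 letters or digits, optionally a dot and an extension of 1-3 letters or digits.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]{1,8}(\.[A-Za-z0-9]{1,3})?")

def is_valid_name(filename):
    """True if the whole name fits the pattern described in the manual."""
    return NAME_PATTERN.fullmatch(filename) is not None

def check_log_name(log_name, data_name):
    """Return a list of problems with a proposed log file name (empty if none)."""
    problems = []
    if not is_valid_name(log_name):
        problems.append("not a valid file name")
    if not log_name.upper().endswith(".LOG"):
        problems.append("extension is not .LOG")
    if log_name.upper() == data_name.upper():
        problems.append("same name as the data file")
    return problems

print(check_log_name("RESULTS.LOG", "SAMPLE1.DAT"))
print(check_log_name("SAMPLE1.DAT", "SAMPLE1.DAT"))
```

An empty list means the proposed log file name passes all three checks.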
3.6 The main Clinstat menu The main Clinstat menu looks like this:

    Programs available:
    0 quit Clinstat
    1 input and edit data files
    2 copy, join and merge files, create subfiles, retrieve saved graph, print log file
    3 tabulation and cross-tabulation (data on disk file only)
    4 chi-squared and other tests on two-way tables
    5 regression, correlation, paired comparisons (disk file or keyboard)
    6 comparing two or more groups (disk file or keyboard)
    7 survival data (disk file or keyboard)
    8 miscellaneous calculations
    9 simulations and other demonstrations of statistical ideas
    Program required = ?

The particular Clinstat program you want is chosen from this list or from a sub menu reached from this. For example, keying 5 RETURN will start the regression program. Keying 8 RETURN leads to a second menu listing the miscellaneous programs. The first of these, random sampling and allocation, is started by keying 1 RETURN. The following sections of the manual describe each of the Clinstat programs. The sub menus are arranged as follows:

    1 input and edit data files
        1 set up new data file
        2 add to, list and edit existing data file
    2 copy, join and merge files, create subfiles, retrieve saved graph, print log file
        1 copy a Clinstat file
        2 join two Clinstat files (same variables, different cases)
        3 merge two Clinstat files (different variables, same cases)
        4 create a subfile
        5 retrieve Clinstat graphs
    3 tabulation and cross-tabulation (data on disk file only)
    4 chi-squared and other tests on two-way tables
        1 chi-squared tests and other two-way table analysis
        2 combine several 2x2 tables
        3 compare two proportions
        4 Cohen's kappa
    5 regression, correlation, paired comparisons (disk file or keyboard)
    6 comparing two or more groups (disk file or keyboard)
        1 from individual observations
        2 t tests from means and standard deviations
        3 ANOVA and multiple comparisons from means and standard deviations
    7 survival data (disk file or keyboard)
    8 miscellaneous calculations
        1 random sampling and allocation
        2 determination of sample size
        3 histogram, mean and standard deviation (disk file or keyboard)
        4 calculation of standardized mortality ratios from mortality rates
        5 confidence intervals for standardized mortality ratios
    9 simulations and other demonstrations of statistical ideas
        1 tossing coins
        2 pintable simulation
        3 people on boxes - simulation of mean and variance
        4 central limit simulation
        5 sampling distribution of mean
        6 sum of squares simulation
        7 confidence interval simulation
        8 probability distributions
        9 simulations of clinical trials

4 Data input to a disk file and editing the file, main menu option 1 4.1 Starting a new data file from the keyboard After main menu option 1, choose option 1, set up a new data file. Choose the option to start a new file from scratch. The program asks for the full path name for the new file. This means the directory and file name. See your DOS manual for more information. If you specify a filename only, with no path, the file will go into the Clinstat directory. You are strongly recommended to use the extension .DAT for your data files. A new file will destroy any previous file of that name, so Clinstat warns you and asks you if you wish to continue. If in doubt, quit and use DOS to check. Otherwise, answer Y and the program will go on. 4.1.1 Variable labels Clinstat asks for the number of variables. This is the number of separate pieces of information recorded for each case or record. Remember that, although Clinstat denotes each case by the case number corresponding to the order in which cases were entered, it is often useful to have a separate study number which corresponds to the original paper record. Clinstat asks for variable labels, up to 15 characters. These are not used to refer to the variable, that is, you do not have to type them in every time you use the variable. Clinstat uses the variable number for this purpose.
The variable label is printed whenever the variable is used, so it serves as a check and an aide-memoire. Clinstat checks the number of characters and a warning is given for more than 15. The program truncates the name and gives you the option to type it in again. After the labels are all entered, a menu allows you to list labels, change them, add and delete them. This last facility is useful if you have made a mistake in counting the number of variables or have missed one out. Variables can be inserted at any point in the variable list. 4.1.2 Missing data codes The program next asks for missing data codes. These are optional and you may have them for some variables and not others. You can use the same missing data code for all the variables if you like. 999 is a good missing data code, as you are unlikely to type this by mistake. Zero or blank are not good codes as they make it very difficult to trap keying errors in data entry. 4.1.3 Input limits The program then asks for minimum and maximum input limits. These are optional, the default being -1E38 and +1E38 (i.e. 1 times 10 to the power 38). The limits need not include the missing data code. During data input, they trap many errors arising from either hitting RETURN too soon or not at all and from hitting keys twice. 4.1.4 Value labels Clinstat will store value labels. These are labels of up to seven characters attached to each possible value of a variable. For example, a variable "sex" could be coded "1" or "2" and these could have labels "male" and "female" respectively. These labels are retrieved by the tabulation program and printed on cross-tabulations. They are also retrieved as group labels by the group comparison program. Value labels are optional. You do not need to have any at all. If you use them, you do not need to have labels for all variables. Missing data codes are automatically labelled "missing" unless you stipulate otherwise. The same value labels are often used by several variables.
For example, the codes "yes", "no", "dt know" might be used by many variables in a file. In Clinstat they are only put in once, together with the list of variables which uses them. Up to 200 value labels can be stored. After the labels and limits have been written to disk, the program proceeds to the data input and edit menu, as described below. 4.2 Setting up a file from an existing ASCII data matrix You can read any rectangular data matrix, case by case, with Clinstat. The data should be an ASCII file in free format, the numbers being separated by spaces, commas or carriage returns. If you choose this option in the input program, Clinstat asks for the name of the file and the numbers of cases and variables it contains. You then put in variable labels and the optional missing data codes and input limits as described above. Clinstat writes these onto the .CLB file. It does not write anything on your data file. The input program then leads to the data input and edit menu, which you would usually want to quit. 4.3 Adding to or editing an existing data file For an old file, the program asks the name of the file and then reads the numbers of variables and cases, the variable labels, the missing data codes and input limits from disk. It then proceeds to the data input and edit menu. 4.4 Putting in data at the data input and edit menu You arrive at the main data editing menu via the new file input or old file input and edit options. Depending on whether the file has any data, the menu offers options to input and edit data, edit labels and write the file to disk, or in addition to sort data, find frequency distributions, recode, create and delete variables. To put in more data, choose option 1. The data input section presents a menu offering options to enter more data, list and change data entered, list labels, update the disk file, and control printing. Option 1 enables you to add cases to the file. Data are put in one case at a time. 
Clinstat prints both number and name for each variable and asks for the value to be entered. A 'NO' or 'N' will tell the program that no more data are to be entered and the menu will reappear. A value outside the input limits and not equal to the missing data code will cause the computer to beep and the program will ask for that variable again. After data input, data can be listed, errors corrected, and data stored on disk. When the data are on the disk file they are safe. 4.5 Using more than one computer for data input If you have several computers available, in a computer classroom for example, you can have several people putting in your data at once, onto separate disks. The files can then be joined together. To use more than one computer to enter data from different sets of forms, first use one computer to set up the variable names, missing data codes and input limits for your data file. When the second menu comes up, ready to put in data, do not put in any data, but quit Clinstat. Copy the Clinstat label file filename.CLB from the original disk to your other computers using a floppy disk. Exactly how you do this depends on the type of computer, whether they are networked, etc. You should use different file names for each set of cases, such as SAMPLE1.CLB, SAMPLE2.CLB, SAMPLE3.CLB. You can then type in data on each disk. Make sure each subject is only entered once. When all the data are in, use the file utilities in Clinstat to join the files and produce a composite file on another disk. This is option 2 on the main Clinstat menu. Use the option for joining two data files described below. 4.6 Data file editing 4.6.1 Editing features available Clinstat provides comprehensive data checking, data correcting and data manipulation facilities. The Clinstat main menu described above leads to the editing program. This will read the data matrix, missing data codes, labels and limits from disk.
The main menu provides options for listing data, listing data with a specified value for a given variable, correcting data, finding frequency distributions of variables, correcting variable labels, missing data codes and input limits, recoding variables, controlling the printer, deleting cases and variables, and creating new variables as functions of the old ones. The corrected data file is stored on disk. 4.6.2 Checking and correcting data errors (data cleaning) The principal use of the program is to check and correct data entered using the input program. Typical use would be to find the frequency distribution for each variable. This gives the frequency of every value, rather than for scale intervals. (If you want that, see Sections 9.4, 10.4). Thus, for example, the frequency distribution of the study number given to each subject in a study will reveal any duplicate entries. The frequency distribution for height will reveal any values which are suspiciously high or low, etc. When a value which may be wrong has been identified, we can list all cases with that particular value of the variable. We can then check back to the original documents, find the correct code, and use the error correction facility to put in the correct value. The program will delete a duplicate case, or put in a case which has been omitted. 4.6.3 Changing labels, limits and missing data codes. Option 2 on the input and edit menu enables you to inspect and change the variable labels, value labels, missing data codes and input limits. 4.6.4 Recoding variables The other use of the program is to prepare data for analysis. This is done by recoding and by creating new variables. For example, suppose we have entered age in years, but want to tabulate other variables by age in 10 year age groups. We can recode age from years to decades. We want codes 0 - 9 to be replaced by 1, codes 10 - 19 to be replaced by 2, etc. The recode option enables us to do this. 
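The recoding of age in years into 10 year age groups can be sketched in Python. This is an illustration only, not the way Clinstat itself works: the function name, the rule format of (low, high, new code) and the sample ages are all hypothetical.

```python
def recode(values, rules):
    """Replace each value by the new code of the first rule whose inclusive
    range contains it; values matching no rule are left unchanged."""
    recoded = []
    for value in values:
        for low, high, new_code in rules:
            if low <= value <= high:
                value = new_code
                break  # each original value is recoded at most once
        recoded.append(value)
    return recoded

# Rules (0, 9, 1), (10, 19, 2), ..., (90, 99, 10): old codes 0-9 become 1, etc.
decade_rules = [(low, low + 9, low // 10 + 1) for low in range(0, 100, 10)]

ages = [4, 17, 30, 39, 62, 99]
print(recode(ages, decade_rules))  # [1, 2, 4, 4, 7, 10]
```

The sketch builds a new list and leaves the original ages untouched, so the old and new codes can be listed side by side and compared.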
First we define the new code 1 to be all old codes 0 to 9. The program puts a 1 wherever there is a value between 0 and 9 for age. We then define another new code, 2, to be old codes 10 to 19, and all values between 10 and 19 for age will be replaced by 2, and so on until all the old codes have been recoded. Note that we must not start by changing 90 - 99 into 10, 80 - 89 into 9, and so on until 0 - 9 into 1, as the last step will change everything to 1. If in any doubt, create a new variable equal to age (see below) and recode that. You can then compare the old and new code using the data listing facility to check that the recoding is what you expected. 4.6.5 Creating new variables out of old New variables are created as functions of the old. For example, we can create a new variable "age in months" = "age in years" times 12. We can create a copy of a variable for recoding by new variable = old variable + constant, and setting the constant as zero. We can also add, subtract and multiply variables. We can divide one variable by another, transform a variable by logarithm, exponent, raise to any power (e.g. -1 gives reciprocal, 0.5 gives square root), sine, cosine, arcsine, arctangent, integer part. These enable many other functions to be created, e.g. tangent, arccosine etc. Clinstat will also find any linear combination of variables: i.e. any sum of variables multiplied by constants. In addition, we can do such things as change time in hours and minutes to minutes (multiply by 0.01, take the integer part to give the hours, subtract the hours, multiply by 100 to give the minutes, then multiply the hours by 60 and add). The program will create new variables until the data space is full. After this, any more new variables must replace existing ones, but this is unlikely to be a serious limitation. Procedures such as the hours to minutes calculation described above use a lot of intermediate variables which can be over-written.
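The hours to minutes calculation can be followed step by step in a short Python sketch. It assumes, as the description implies, that a time such as 14 hours 30 minutes was entered as the single number 1430; the function name is hypothetical, and the final rounding is an addition here to absorb floating point error, not something the manual specifies.

```python
import math

def hhmm_to_minutes(hhmm):
    """Convert a time coded as hours*100 + minutes into minutes."""
    scaled = hhmm * 0.01              # 1430 -> 14.30
    hours = math.floor(scaled)        # integer part gives the hours: 14
    fraction = scaled - hours         # subtract the hours: about 0.30
    minutes = round(fraction * 100)   # multiply by 100 to give minutes: 30
    return hours * 60 + minutes       # multiply hours by 60 and add

print(hhmm_to_minutes(1430))  # 870
```

Each intermediate name (scaled, hours, fraction, minutes) corresponds to one of the intermediate variables that the calculation would create in Clinstat.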
In any case these redundant variables should be deleted using the option for deleting variables. The variable creation program preserves missing data codes. You define a code for the new variable and any cases with missing data for an old variable have the new variable set to the missing data code. 4.6.6 Sorting data Clinstat will also sort data into the order given by any of the variables. Note that this will change all the case numbers. You should make sure that your own, external case number is one of the variables. 4.6.7 Storing the updated file When the data matrix is correct, it is saved on disk. It is a good idea to use a different name to the original file name, perhaps by adding a number. This helps make it clear what the file contains. If you use the old file name then the old file will be destroyed. More than one copy of the file can be made by changing the data disk or file name after writing is complete and running the data file to disk option again. 5 File handling utilities, main menu option 2 5.1 Copy a file File copy copies a file from one disk to another, and is self-explanatory. On floppy based systems, once the program is running the program disk must be removed to make way for the data disk bearing the file to be copied. You can also do this using the DOS copy command. Copy 'filename.' to get the labels file as well. 5.2 Join two files Join files will join two files with the same variables in the same order, but different cases. It is useful if you have used more than one computer to put in your data. 5.3 Merge two files Merge files will put together two files with the same cases in the same order, but different variables. The files must be on the same disk. Use the file copy utility if they are not. This program enables you to add variables to your data set. Set up a new file with the new variables and enter the data using the input program. Make sure the cases are in the same order.
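Why the case order matters can be seen from a sketch of what a merge amounts to. The Python below is an illustration only and says nothing about Clinstat's file format: each file is represented as a list of cases, and it is assumed (hypothetically) that both files carry a study number in a known column so that the ordering can be checked.

```python
def merge_files(first, second, id_first=0, id_second=0):
    """Put two data matrices side by side, case by case.

    Assumptions: each file is a list of cases (lists of values), and both
    hold a study number in columns id_first and id_second respectively.
    """
    if len(first) != len(second):
        raise ValueError("the files hold different numbers of cases")
    merged = []
    for case_no, (a, b) in enumerate(zip(first, second), start=1):
        if a[id_first] != b[id_second]:
            raise ValueError(f"case {case_no}: study numbers differ")
        # Keep every variable from the first file and drop the duplicated
        # study number from the second.
        merged.append(a + b[:id_second] + b[id_second + 1:])
    return merged


heights = [[1, 170], [2, 165]]
weights = [[1, 60], [2, 55]]
print(merge_files(heights, weights))
```

A merge is purely positional, so without a check of this kind two files in different orders would be combined silently into wrong cases.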
If you have a study number for each case, the editing program can be used to sort the data matrix into case number order. Then use this program to merge the files together. Any duplicated variables can be deleted in the editing program or by creating a subfile as described below. 5.4 Create a subfile Creating a subfile is a very useful procedure. This will create a new data file containing all or some selected variables, and all or some cases. You will need a separate disk for the subfile. Variables are selected by entering the required variable numbers. Cases are selected by the restriction procedure described in 1.8 above. This means, for example, that if sex is a variable you can produce a subfile of females only, and another of males only. Regression analysis could then be done for sexes separately. 5.5 Retrieve a Clinstat graph saved on disk Scatter diagrams and histograms and survival curves can be saved as disk files. Clinstat saves the text required to create the graph rather than a screen image. This program retrieves the graphics file and recreates the graph. This can then be printed if you have a suitable printer (see Section 1.10). 6 Tabulation and cross-tabulation, main menu option 3 6.1 Program limitations The program lists data and produces one way and two way tables. There is no limit on the size of one way tables, but two way tables are limited to 20 rows and 20 columns. Variables with more than 20 different possible values cannot be cross-tabulated. After a two way table has been found, a set of statistical procedures including chi-squared tests, Fisher's exact test, etc. is available. 6.2 Multi-way tables The most powerful feature of the program is the ability to restrict the data to certain cases. For example, we might tabulate a respiratory symptom, present or absent, by sex. It may be that the two are related because more males than females smoke cigarettes. 
We can repeat the tabulation restricting our attention to non-smokers, and then to smokers only, to see if the relationship still exists. In this way we can build up a three way table, using the third variable to make the restriction. In the same way four way or five way tables can be built up. The restriction facility can be used for data listing, one-way and two-way tables. It is described in detail in 1.8 above. 6.3 Reading the data file The program is entered from the Clinstat main menu option 3. A list of options appears, data input being option 1. This produces a request for the file name. The program reads and lists the variable labels. It may then tell you that there is not enough room for all the variables. If there is enough room, it will read in the data file. Otherwise, you will be forced to select variables required for this analysis. The program asks for a list of the variables you require and gives opportunity to change this before reading the data. Even though only a few variables may be used, the variable numbers used by this program are those on the original disk file. Thus we may choose variables 1, 5, 7, 23, and these are the numbers the program will use, not 1, 2, 3, 4. After the data has been read from disk the message "Setting up arrays" is shown. The computer is finding the number of possible values for each variable. If a variable has more than 20 values a message is shown warning that this cannot be cross-tabulated. When the menu returns the data entry is complete. Return to the main menu. 6.4 Data listing and restrictions on the data If data listing is selected from the main menu, the program menu will offer options for listing without and with restrictions. If restrictions are chosen, Clinstat will request 'Enter restrictions'. These are entered as described in 1.8 above. First the variable number is requested, then the permissible values for a variable (maximum of 10). The list of values is terminated by 'NO'. 
The permissible values may be single values or the lower and upper limits of a range. These are separated by 'TO'. The RETURN key must be pressed after 'TO' and after each limit. More than 10 values will halt the procedure. This is repeated for another variable, and so on until the variable request receives a reply of 'NO'. The program then lists the restrictions. There is a facility to change these if you make a mistake. Cases will be selected if they satisfy all the restriction procedures. The listing program will then ask for all variables or some, all cases or some, and list the data in the usual way. If no cases meet the restrictions a message 'No such cases exist' will be printed. 6.5 One-way tables The one-way table option will ask for the variable to be entered. It will then ask for restrictions as described in 1.8 and 6.4 above. It will then list the possible values of the variable, the number of cases with each value, and the percentage. 6.6 Two-way tables The two-way table option will ask for the row variable and column variable. It will then ask for restrictions as described in 1.8 and 6.4 above. It will then list the two-way table, giving the values of the variables at the start of the relevant row and the top of the relevant column. It will then ask whether you want percentages, chi-squared tests etc. 'NO' returns to the menu. 'YES' leads to a menu of options. This is described in detail under 'Two-way table analysis', Section 7 below. To return to the tabulation menu, request to enter a new table. 7 Two-way table analysis, main menu options 3 and 4 7.1 Facilities and limitations The two way table analysis options can be used for contingency tables produced by the cross-tabulation program, main menu option 3, or for tables keyed in directly using the keyboard entry version, main menu option 4. The program carries out chi-squared tests for two way tables, including options for trend, Yates' correction, partitioning chi-squared and McNemar's test. 
It also carries out Fisher's exact test, gives row and column percentages, expected frequencies and residuals, and provides extensive table editing. The limit on table size is 20 rows and 20 columns. 7.2 Data input Data entry is from disk file via the tabulator program, main menu option 3 (see Section 6), or directly from the keyboard using option 1 of main menu option 4. The main menu offers the range of options. The enter new table option gets a new two way table from the keyboard using the main menu 4 version, or goes back to the cross-tabulation menu if main menu option 3 is used (see Section 6). The keyboard data entry asks for the number of rows and columns, then for the data, row by row. When the data have been entered the two-way table is displayed, with row and column totals. The rows are numbered on the left, and the columns across the top. 7.3 Editing the table The editing option gives a menu offering change one cell, delete row or column, combine rows or columns, reorder rows or columns, display current table and reinstate original table. The change cell option enables input errors to be corrected. The delete row and column and combine row and column options enable the table to be manipulated to remove small expected values for valid chi-squared tests. The reordering options are for use in trend analysis and for partitioning chi-squared. The edited table can be displayed at any time during editing, and the original table can be returned as it was before editing if an error has been made. 7.4 Percentages Row percentages displays the table with each cell as a percentage of the row total. Column percentages gives each cell as a percentage of the column total. Percentages are given to one decimal place. 7.5 Chi-squared tests The chi-squared test is the usual chi-squared test for a two-way table, testing the null hypothesis of no association (Bland, 1987, Section 13.1).
The chi-squared statistic, degrees of freedom and associated probability are given, together with the number of cells with expected values less than 5. If there are too many such cells, the expected values can be displayed and the edit routine used to combine or delete rows and columns as appropriate (Bland, 1987, Section 13.2). For two by two tables with small expected values Yates' continuity correction is available (Bland, 1987, Section 13.6), or Fisher's exact test (Bland, 1987, Section 13.5). 7.6 Fisher's exact test Fisher's exact test (Bland, 1987, Section 13.5) gives the probability of the observed table and of each table which is more extreme. More extreme tables are printed, with their probabilities. The total of these is the total one-sided probability and this is doubled to give the approximate two-sided probability. The program chooses which cells to increment and which to decrement; the order of the rows and columns does not matter. 7.7 Chi-squared for trend Chi-squared for trend (Bland, 1987, Section 13.4) carries out the usual trend analysis. You have the option of choosing the row and column variables. If you do not choose them, they are set to 1, 2, 3, 4, ..., i.e. the row and column categories are regarded as being equally spaced. The program gives the chi-squared for linear trend, about linear trend and the total chi-squared, together with their degrees of freedom and probabilities. 7.8 Tests for matched samples, McNemar's and Stuart-Maxwell tests Two tests are given: McNemar's test (Bland, 1987, Section 13.8) and the Stuart-Maxwell test (Maxwell, 1970). McNemar's test gives the test for equality of proportions in matched samples. It is assumed that the two variables are in the same order, e.g. first row and column both yes, second row and column both no. The table is printed in this format as a reminder.
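With the table laid out in this way, McNemar's statistic depends only on the two discordant cells (yes/no and no/yes). The sketch below shows the standard textbook formulae in Python, without and with a continuity correction; it is not taken from Clinstat, and the function name mcnemar is hypothetical. The probability uses the fact that a chi-squared variable with one degree of freedom is the square of a Standard Normal deviate.

```python
import math


def mcnemar(table):
    """McNemar's chi-squared (1 d.f.) for a 2 by 2 table of matched pairs.

    table[0][1] and table[1][0] are the discordant cells.
    """
    b, c = table[0][1], table[1][0]
    if b + c == 0:
        raise ValueError("no discordant pairs")
    plain = (b - c) ** 2 / (b + c)
    corrected = (abs(b - c) - 1) ** 2 / (b + c)

    def probability(x):
        # P(chi-squared with 1 d.f. > x) = P(|Z| > sqrt(x))
        return math.erfc(math.sqrt(x / 2))

    return plain, probability(plain), corrected, probability(corrected)


print(mcnemar([[40, 15], [5, 40]]))
```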
The program gives the chi-squared statistic testing equality without and with a continuity correction, the probability in each case, and the number of cells with expected value less than 5. For testing equality of the row and column totals in variables with more than two categories, a test suggested by Stuart and by Maxwell is provided. The numbers of rows and columns must be equal. For a 2 x 2 table this is the same as McNemar's test. 7.9 Partitioning chi-squared Partitioning chi-squared is complicated. You should refer to a book (Armitage and Berry, 1987, Section 12.2) for details. A short menu offers options to list partition chi-squares, to sum partition chi-squares, and to type a comment. The list option gives the chi-squared statistic for each partition. The partition is defined by a row and column. It refers to the two by two table defined by that row and column element as the second row, second column cell, the sum of the row up to but not including this element as the second row, first column, the sum of the column up to but not including this element as the first row, second column cell, and the sum of all the elements in both rows before and columns above this element as the first row, first column cell. The sum partition option asks for the partitions to be added, in terms of the row and column of the defining element for each partition. It then prints the chi-squared statistic for each partition, and their sum with degrees of freedom and probability. 7.10 Residuals The display residuals option gives three possible definitions of residual about the no-association model: observed - expected, (observed - expected) / square root expected, and as adjusted by division by their standard deviations. Only one will be displayed, the program returning immediately to the main menu.
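The three definitions of residual can be compared in a short Python sketch. The first two follow directly from the description above. For the third it is assumed, as in the usual textbook definition of the adjusted residual, that the standard deviation of observed - expected is the square root of expected x (1 - row proportion) x (1 - column proportion); the manual does not give the formula Clinstat uses, and the function name residuals is hypothetical.

```python
import math


def residuals(table):
    """Raw, standardised and adjusted residuals about the no-association model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    raw, standardised, adjusted = [], [], []
    for i, row in enumerate(table):
        raw_row, std_row, adj_row = [], [], []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            difference = observed - expected
            # Assumed standard deviation of the difference (see lead-in).
            sd = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            raw_row.append(difference)
            std_row.append(difference / math.sqrt(expected))
            adj_row.append(difference / sd)
        raw.append(raw_row)
        standardised.append(std_row)
        adjusted.append(adj_row)
    return raw, standardised, adjusted


raw, standardised, adjusted = residuals([[10, 20], [30, 40]])
print(raw[0][0], round(standardised[0][0], 4), round(adjusted[0][0], 4))
```

For a two by two table every adjusted residual has the same absolute value, and its square equals the chi-squared statistic for the table.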
8 Calculations for discrete and binary data, main menu option 4 8.1 Combining several 2 by 2 tables by the Mantel-Haenszel method Combining several 2 by 2 tables by the Mantel-Haenszel method (Armitage and Berry, 1987, Section 16.2) is only available in a keyboard entry version. The program asks for the number of tables to be entered. After the number of two by two tables, the program asks for the first table, row by row. This is displayed and the option is given to re-enter the table if an error has been made. If the table is correct, the next table is requested and displayed. When all the tables have been entered, the combined chi-squared statistic for no association is given, with its degrees of freedom (always 1) and probability. The Standard Normal deviate which some texts prefer is the square root of the chi-squared statistic. The probability is the same. 8.2 Comparison of two proportions and odds ratio 8.2.1 Facilities and limitations This program compares two proportions using the large-sample Normal approximation (Bland, 1987, Section 8.6, Section 9.8). Input is from the keyboard only and printer output is provided. 8.2.2 Data input The program first asks for the denominators of the two proportions, then the numerators. We often want to compare several sets of proportions using the same denominators, so after the first data input there is a facility to enter numerators only, keeping the denominators from the previous calculation. 8.2.3 Output The program prints the two proportions, their difference, the standard error of the difference, a 95% confidence interval for the difference (Bland, 1987, Section 8.6), and a Normal test of the null hypothesis of equal proportions, with the associated two-tailed probability (Bland, 1987, Section 9.8). A check is made on the adequacy of the approximation.
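The quantities listed above can be reproduced with the usual large-sample formulae. The Python sketch below is an illustration under stated assumptions rather than Clinstat's code: the confidence interval uses the separate-sample standard error with the Normal multiplier 1.96, and the significance test uses the standard error based on the pooled proportion, which is the common textbook choice. The function name compare_proportions is hypothetical.

```python
import math


def compare_proportions(r1, n1, r2, n2):
    """Large-sample comparison of two proportions r1/n1 and r2/n2."""
    p1, p2 = r1 / n1, r2 / n2
    difference = p1 - p2
    # Standard error for the confidence interval (proportions kept separate).
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    interval = (difference - 1.96 * se, difference + 1.96 * se)
    # Standard error under the null hypothesis (assumed: pooled proportion).
    pooled = (r1 + r2) / (n1 + n2)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = difference / se_null
    probability = math.erfc(abs(z) / math.sqrt(2))  # two-tailed
    return difference, se, interval, z, probability


print(compare_proportions(30, 100, 20, 100))
```

The denominators come first in Clinstat's dialogue; the argument order here (numerator, denominator for each group) is simply a convenience of the sketch.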
The comparison of two proportions program also calculates large sample confidence intervals for a single proportion (Bland, 1987, Section 8.4), a single odds, the ratio of two proportions and the odds ratio (Gardner and Altman, 1989, Section 6). 8.3 Cohen's Kappa, main menu option 4 8.3.1 Facilities and limitations This program takes a square two-way table from the keyboard. Cohen's Kappa, weighted Kappa (Fleiss, 1981, Section 13.1) and the test suggested by Stuart and Maxwell (see Maxwell, 1970) for equality of marginal totals can be calculated. The limit on table size is 20 rows and columns. 8.3.2 Data input and editing The program asks for the number of rows of the square two-way table. It then asks for data row by row. When all the data has been entered the table is printed. Errors in data entry can be corrected by the edit table option. The only facility provided is to change one cell of the table. The table can be printed again at any time. 8.3.3 Cohen's Kappa Cohen's Kappa is printed assuming that the row and column categories are in the same order. This means that cells along the main diagonal of the table are "agree", and other cells are "disagree". Kappa is printed with its standard error and an approximate 95% confidence interval. A one-sided test of significance for no agreement is also given. Note that this may give a significant difference when the confidence interval includes zero. 8.3.4 Weighted Kappa Weights are put in under option 4. The data must be put in before the weights, as the program assumes the same number of rows for weights as for the last two-way table entered. Errors in input can be corrected by the edit weights option, and the weights can be printed at any time. Weights are zero for cells with perfect agreement and increase as disagreement becomes more important. For example, a variable with categories "good", "fair" and "poor" might have weights 0 for "good" & "good", 1 for "good" & "fair" and 2 for "good" & "poor". 
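The effect of the weights is easiest to see in a small calculation. The Python sketch below is not Clinstat code. It treats the weights as disagreement weights, as in the "good", "fair", "poor" example, and computes kappa as one minus the ratio of observed to expected weighted disagreement. With weights of 0 on the diagonal and 1 elsewhere this reduces to the unweighted Kappa. Standard errors are omitted, and the function names are hypothetical.

```python
def kappa(table, weights=None):
    """Kappa for a square table of counts, using disagreement weights.

    With weights=None every disagreement counts equally (unweighted Kappa).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if weights is None:
        weights = [[0 if i == j else 1 for j in range(k)] for i in range(k)]
    observed = sum(
        weights[i][j] * table[i][j] for i in range(k) for j in range(k)
    ) / n
    expected = sum(
        weights[i][j] * row_totals[i] * col_totals[j]
        for i in range(k) for j in range(k)
    ) / n ** 2
    return 1 - observed / expected


def equal_interval_weights(k):
    """Weights 0, 1, 2, ... by distance from the main diagonal."""
    return [[abs(i - j) for j in range(k)] for i in range(k)]


ratings = [[5, 1, 0], [1, 5, 1], [0, 1, 5]]
print(round(kappa(ratings), 4))
print(round(kappa(ratings, equal_interval_weights(3)), 4))
```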
If no weights have been entered, Clinstat assumes weights at equal intervals with agreement along the main diagonal, like this:

---------------------------
|     |      Column       |
| Row |   1     2     3   |
|-----+-------------------|
|  1  |   0     1     2   |
|  2  |   1     0     1   |
|  3  |   2     1     0   |
---------------------------

The same weights can be used for weighted Kappa on successive two-way tables, without re-entering the weights, unless the size of the table is changed. Weighted Kappa is printed with the same statistics as Kappa. 8.3.5 Test equality of margins, Stuart-Maxwell test The Stuart-Maxwell test tests the null hypothesis that the marginal distributions are equal, i.e. that the numbers put in each category by the two judges would be the same. The result is given as a chi-squared statistic. This is also available in the contingency table program, option 1 in the Clinstat sub menu. 9 Regression, correlation, and paired data comparisons, main menu option 5 9.1 Facilities and limitations This program is started by option 5 from the Clinstat main menu. It will accept data from the keyboard or from a disk file, and will save data on disk. This program performs calculations on a pair of variables, denoted by X and Y. It will do simple linear regression (with options such as residual plots), correlation and rank correlation (Spearman's and Kendall's), produce basic statistics (mean, median, variance, etc), plot histograms and scatter diagrams and carry out paired t tests, sign test and Wilcoxon test. The main regression, correlation and paired data analysis menu offers data input, listing and editing, summary statistics and plots, regression, correlation and rank correlation, tests on paired data, and storage of data on disk. 9.2 Data input Data input offers 3 options: from keyboard, from disk, or from disk with restrictions. If the data are to be put in via the keyboard, option 1, Clinstat first asks whether you want to name your two variables. The default option is "X" and "Y".
The program will then ask for the data case by case, first the value of X then Y. "N" or "NO" terminates the data entry. If the data are on a Clinstat disk file, the program will ask for the file name and read the variable labels. It will then ask which variables are required as X and Y. If restrictions are used the procedure of 1.8 is followed. It then reads the data from disk and returns control to the menu. Cases with missing data for either variable are omitted. 9.3 List and edit data Option 2 of the regression menu provides a number of listing and editing options. The data can be listed on the screen or on the printer or log file. Data editing enables errors to be corrected. Cases can be changed, added or deleted. Cases are referred to by the case number. As an aid to data checking, a scatter diagram can be plotted. This option plots Y against X in high resolution graphics, fully labelled. Single points are shown by a cross; multiple points are represented by a small number showing the number of coincident observations. The data transformations option offers log (base 10), exponent of 10 (antilog), and raise to any power for either X or Y. If an invalid transformation has been requested, such as log (0) or (-1) to the power .5, a message is given and the transformation is aborted. More extensive transformations are available in the data file editing program obtained from Clinstat main menu option 1 if you want them. 9.4 Summary statistics and plots Option 3 of the regression menu, summary statistics and plots, produces a further menu, offering histograms of X, Y and X-Y, Normal plots of X, Y and X-Y, frequency distributions and summary statistics. Summary statistics provide mean, minimum and maximum, median, variance, standard deviation, standard error of mean (Bland, 1987, Section 8.2), and corrected sum of squares for X and Y (Bland, 1987, Section 4.7), and the corrected sum of products of X and Y (Bland, 1987, Section 11.2, Section 11.8). 
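The definitions behind these summary statistics can be set out in a few lines. The Python sketch below is an illustration only; the function name summary is hypothetical. "Corrected" means measured about the mean, so the corrected sum of squares of X is the sum of (x - mean of X) squared, and the corrected sum of products is the sum of (x - mean of X)(y - mean of Y).

```python
import math
import statistics


def summary(x, y):
    """Summary statistics for X plus the corrected sum of products of X and Y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_squares = sum((v - mean_x) ** 2 for v in x)   # corrected sum of squares
    variance = sum_squares / (n - 1)
    sd = math.sqrt(variance)
    return {
        "mean": mean_x,
        "minimum": min(x),
        "maximum": max(x),
        "median": statistics.median(x),
        "variance": variance,
        "sd": sd,
        "se_mean": sd / math.sqrt(n),
        "sum_squares": sum_squares,
        "sum_products": sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)),
    }


print(summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```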
Summary statistics are available for X-Y also, used when looking at differences between the same quantity measured under two different conditions and in comparisons of different methods of measurement and repeatability studies. Histograms (Bland, 1987, Section 4.3) are available for X, Y and X-Y. You can choose the interval size and starting point yourself or let Clinstat do it for you. Clinstat will also calculate and print the frequencies themselves. For frequency distributions, too, the choice of interval can be left to the program or done by the user. A histogram of the frequency distribution is optional. This enables you to draw any histogram you like. At the point where Clinstat asks whether you wish to set your own scales, it displays a menu which also offers to switch a Normal Distribution curve (Bland, 1987, Section 7.2) option on and a mean and standard deviation option on. Both these options can be used at the same time. The Normal curve option draws a Normal Distribution curve of the same mean and variance as the histogram, standardised to have the same area. This provides a rough idea of the fit of the data to the Normal Distribution and enables comparison of this with the shape of the Normal plot. The mean and standard deviation option marks the position of the mean along the horizontal axis, together with the mean ± one and two standard deviations. This illustrates the meaning of standard deviation and is useful for showing how it behaves with symmetrical and skew distributions (Bland, 1987, Section 4.7). The options remain selected until you turn them off again or return to the main Clinstat menu. Three kinds of scatter diagram are printed. A Normal plot is a plot of the variable against the corresponding percentile of the Normal Distribution (Bland, 1987, Section 7.5). If the data follow a Normal Distribution the Normal plot will be a straight line.
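The coordinates of a Normal plot can be sketched as follows. This Python fragment is an illustration, not Clinstat's method: it assumes the i-th smallest of n observations is paired with the Standard Normal deviate for the cumulative proportion i / (n + 1), which is one common plotting position, and the manual does not say which one Clinstat uses. The function name normal_plot_points is hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each ordered observation with its expected Standard Normal score.

    Assumption: plotting position i / (n + 1) for the i-th smallest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    scores = [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(scores, ordered))


for score, value in normal_plot_points([4.1, 3.2, 5.0, 4.4, 3.9]):
    print(round(score, 3), value)
```

Plotting the second column against the first gives the Normal plot; points from Normally distributed data scatter about a straight line.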
As a guide, Clinstat draws the line along which the points are expected to lie if the data are Normally distributed. This option is rather slow on some PCs. The straightforward scatter diagram of Y against X is also available (Bland, 1987, Section 5.7). Clinstat will plot the difference between X and Y against the mean of X and Y. This is useful in the analysis of paired data, where we often want to know whether the magnitude of the difference is related to the magnitude of the measurement itself (Bland, 1987, Section 10.2, Section 15.1, Section 15.3). Clinstat will optionally draw a line at zero difference, useful in gauging whether there is a tendency for differences to be in one direction. It will also draw mean difference and mean difference ± 2 standard deviation lines, useful in the analysis of studies comparing two methods of measurement (Bland, 1987, Section 15.3, Bland and Altman, 1986). 9.5 Simple linear regression and regression through the origin Option 4 of the regression menu performs linear regression, the fitting of a straight line relationship between Y and X (Bland, 1987, Section 11.2). Simple linear regression gives the least squares regression equation of Y on X. This is given with the standard errors of slope and intercept, 95% confidence interval for the slope, t test for the null hypothesis that slope = 0, and the residual sum of squares (Bland, 1987, Section 11.5). A menu of options follows, including analysis of variance table (Armitage and Berry, 1987, Section 9.1), prediction of Y from X and X from Y (both expected value and future observation, with standard errors) (Bland, 1987, Section 11.6), scatter plots with regression line and prediction errors (Bland, 1987, Section 11.6), residuals listing, plot against X, Normal plot and histogram (Bland, 1987, Section 11.7). The histogram has the Normal Distribution curve and mean and standard deviation options.
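The main quantities reported for simple linear regression follow from the corrected sums of squares and products. The Python sketch below shows the standard least squares formulae; it is an illustration rather than Clinstat's code, and the function name simple_regression is hypothetical. The confidence interval for the slope is omitted because it needs a t distribution table, which the Python standard library does not provide.

```python
import math


def simple_regression(x, y):
    """Least squares regression of Y on X with standard errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = syy - sxy ** 2 / sxx
    residual_variance = residual_ss / (n - 2)
    se_slope = math.sqrt(residual_variance / sxx)
    se_intercept = math.sqrt(residual_variance * (1 / n + mean_x ** 2 / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "residual_ss": residual_ss,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t": slope / se_slope,   # tests slope = 0 on n - 2 degrees of freedom
        "df": n - 2,
    }


print(simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```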
The regression through the origin option gives the same output as for simple linear regression, except that the intercept is constrained to be zero (Armitage and Berry, 1987, Section 9.3). Note that this will produce silly answers if the model is inappropriate, which it usually is. Switch X and Y enables the variables to be exchanged so that regression of X on Y can be done. Labels are switched if they are used, otherwise the old Y is labelled X and vice versa. Listing the data should remove any confusion. This option is also useful if the data have been written in order Y, X on the original recording sheet, as the data can be entered in this way. 9.6 Correlation and rank correlation The correlation option, option 5 of the regression menu, offers correlation and rank correlation coefficients. The first option in the correlation section gives the product moment correlation coefficient, r, its degrees of freedom, 95% confidence interval and the two sided probability of r under the null hypothesis of zero correlation (Bland, 1987, Section 11.10, Section 11.11). Fisher's z transformation and its standard error are also given (Gardner and Altman, 1989, Section 5). This is a Normally distributed transformation of r, provided both variables themselves follow Normal distributions. It is the basis of the confidence interval and could be used to compare correlation coefficients in different samples. Two rank correlation coefficients are available, Spearman's rho and Kendall's tau b (Bland, 1987, Section 12.4, Section 12.5). Only significance tests of the null hypothesis are given; no confidence intervals are available. 9.7 Paired comparisons, paired t method, sign test, Wilcoxon paired test Option 6 offers three ways of looking at differences between X and Y, the paired t confidence interval and test, the sign test or Binomial test and the Wilcoxon signed rank test.
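All three methods start from the differences X - Y. The Python sketch below illustrates two of them with the standard formulae; it is not Clinstat's code and the function names are hypothetical. For the sign test it is assumed that zero differences are dropped and that the two-sided probability is twice the smaller Binomial tail, which is the usual convention but is not stated in this manual.

```python
import math


def paired_t(x, y):
    """Mean difference, its standard error, t statistic and degrees of freedom."""
    differences = [a - b for a, b in zip(x, y)]
    n = len(differences)
    mean_d = sum(differences) / n
    sd = math.sqrt(sum((d - mean_d) ** 2 for d in differences) / (n - 1))
    se = sd / math.sqrt(n)
    return mean_d, se, mean_d / se, n - 1


def sign_test(x, y):
    """Exact two-sided Binomial probability for the signs of X - Y.

    Assumption: pairs with zero difference are omitted.
    """
    differences = [a - b for a, b in zip(x, y) if a != b]
    n = len(differences)
    positive = sum(1 for d in differences if d > 0)
    smaller = min(positive, n - positive)
    tail = sum(math.comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


before = [10, 12, 14, 16]
after = [9, 10, 13, 13]
print(paired_t(before, after))
print(sign_test(before, after))
```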
The sign test and Wilcoxon tests provide tests of the non-parametric hypotheses corresponding to that of the paired t test. 9.7.1 Paired t method The t option gives the mean difference, its standard error, a 95% confidence interval, and a test of the null hypothesis that the mean difference is zero (Bland, 1987, Section 10.2). 9.7.2 Sign test The sign test tests the null hypothesis that differences are equally likely to be positive as negative. The sign test gives the exact probability of the data under the null hypothesis, using the Binomial Distribution (Bland, 1987, Section 9.2). 9.7.3 Wilcoxon matched-pairs signed rank test For samples of size less than or equal to 25, an exact probability table is used in the Wilcoxon test, giving "P>0.05", "P<0.05", or "P<0.01". For larger samples a Normal approximation is used (Bland, 1987, Section 12.3). 9.8 Storing data on disk Saving data on disk enables you to keep the data which has been keyed into this program. On twin floppy drive systems, a data disk is needed in Drive B:. On one drive systems it must be exchanged for the program disk. The labels can be changed at this point if you wish. The file created can be read by the general data editing program as if it had been entered under the general input program from Clinstat main menu option 1. 10 Independent sample comparisons, main menu option 5 10.1 Facilities and limitations This program is started from the Clinstat main menu option 5 and then choosing sub menu option 1. It will accept data from the keyboard or from disk and save data on disk. Functions include basic statistics and plots, t and F tests, and rank tests. Up to 20 groups can be compared. This program carries out several comparison procedures for a continuous variable between two or more independent groups. The program displays a menu offering data input, data editing, summary statistics and plots, t and F methods, and rank tests and storage of data on a disk file.
Two other programs in the sub menu from main menu option 5 enable similar comparisons to be made using means already calculated. F and t tests between two groups can be done given data in the form of mean, standard deviation and number for each group. Multiple comparisons between more than two means can be done from the same information, or from means and the residual sum of squares from an analysis of variance. 10.2 Data input Option 1, data input, leads to the data input menu, which offers keyboard input, data from disk, data from disk with restrictions. If the data are entered via the keyboard, the program asks how many groups there are. It then asks whether you want to label groups and variables, the defaults being "Group 1", "Group 2", etc. and "X" for the variable. Data are then entered for each group, case by case, the group being terminated by "NO" or "N". If input is from disk, Clinstat asks for the file name and reads the file size and labels, which are displayed. Two variables are requested: the variable to be analysed and the group variable. For example, if we wish to compare the blood pressures of men and women in a sample, blood pressure would be the variable to be analysed and sex would be the grouping variable. If restrictions are used, the procedure of 1.8 is followed. Cases with missing data for either group or analysis variable are omitted. The program reads the data and sorts it into groups, each group being defined by one value of the grouping variable. It then prints the number of groups and the number in each group with the defining value of the grouping variable. If there are value labels for the group variable on the label file, these will become the group labels. Otherwise, the next option is to label the groups, e.g. as "male" and "female" in the above example. The default is "Group 1", "Group 2", etc. The same labels can be retained for the next analysis. 10.3 Listing and editing Data can be listed on the screen or on the printer.
Listing is group by group. If the data were read from disk, the grouping procedure may have changed the order. Editing offers options to change a case, delete a case, or add a case, with the option to list the data at any point. Each correction option asks for the group number and for change or deletion the case number within the group. If you can't remember the case number, 0 will return you to the edit menu and you can list the data to check. Deleting all the cases in a group deletes the group. Groups can be combined. Three standard transformations are available: log to base 10, exponent of 10 (antilog) and raise to a power. The last allows such transformations as square root (power = 0.5) and reciprocal (power = -1). If an invalid procedure is met, e.g. log (0) or -1 to power 0.5, a message is printed and the transformation aborted. 10.4 Summary statistics and plots For each group the number, mean, median, minimum, maximum, variance, standard deviation, standard error of the mean and sum of squares are printed. The plots available are a scatter plot of groups, a histogram or normal plot for one group and a histogram or normal plot of within-group residuals. Within-group residuals are the difference between the observation and the group mean. Thus they enable us to assess assumptions of Normal Distribution etc. without these being obscured by difference between groups. At the point where Clinstat asks whether you wish to set your own scales, it displays a menu which also offers to switch a Normal Distribution curve option on and a mean and standard deviation option on. Both these options can be used at the same time. The Normal curve option draws a Normal Distribution curve of the same mean and variance as the histogram, standardised to have the same area. This provides a rough idea of the fit of the Normal and enables comparison of this with the shape of the Normal plot (Bland, 1987, Section 7.5). 
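What "standardised to have the same area" means can be shown numerically. A frequency histogram of n observations with interval width w has total area n x w, so a Normal curve of the same area has height n x w x (Normal density). The Python sketch below evaluates that height at the interval midpoints; it is an illustration under this interpretation, not Clinstat's drawing routine, and the function name normal_curve_heights is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_curve_heights(values, start, width, intervals):
    """Height of the fitted Normal curve at each histogram interval midpoint.

    The curve has the mean and standard deviation of the data and is scaled
    by n * width so that it encloses the same area as a frequency histogram.
    """
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    midpoints = [start + (i + 0.5) * width for i in range(intervals)]
    return [(m, n * width * fitted.pdf(m)) for m in midpoints]


for midpoint, height in normal_curve_heights([1, 2, 3, 4, 5], 0.5, 1, 5):
    print(midpoint, round(height, 3))
```

The heights are on the same scale as the interval frequencies, so they can be compared directly with the bars of the histogram.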
The mean and standard deviation option marks the position of the mean along the horizontal axis, together with the mean ± one and two standard deviations. This illustrates the meaning of standard deviation and is useful for showing how it behaves with symmetrical and skew distributions. The options remain selected until you turn them off again or return to the main Clinstat menu. The scatter plot option gives a "dot plot" for each group. Coincident points are separated and shown side by side. This plot can be used to assess the assumption of uniform variance. Unless there are many groups, the group labels are printed on the plot. The histogram intervals can be set by you or calculated by Clinstat. The frequency distribution option also enables you to set your own interval or leave it to Clinstat. You can plot the histogram if you wish. A Normal plot is a plot of the variable against the corresponding percentile of the Normal Distribution (Bland, 1987, Section 7.5). If the data follow a Normal Distribution the Normal plot will be a straight line. As a guide, Clinstat draws the line along which the points are expected to lie if the data are Normally distributed. This option is rather slow on some PCs. 10.5 Normal Distribution methods Option 4 from the independent comparisons menu offers several procedures which require that the observations be Normally distributed: t and F methods and Bartlett's test for uniformity of variances. 10.5.1 Two-sample t and F tests The t test option is for comparing two groups. If there are more than two groups, the program asks which are to be compared; otherwise, it compares groups 1 and 2 immediately. Three comparisons are made. Firstly, the means are compared using the two-sample t test assuming uniform variance within the groups.
The mean difference, its standard error, a 95% confidence interval for the difference, a t test of the null hypothesis of difference = 0, with degrees of freedom and two tailed probability, and the within groups variance are printed (Bland, 1987, Section 10.3). Secondly, the variances are compared using an F test (Armitage and Berry, 1987, Section 7.7). Again the degrees of freedom and probability are given. Thirdly, the means are compared using an approximate t test, without assuming uniform variance, using the Satterthwaite approximation which reduces the degrees of freedom. The same statistics are printed as for the uniform variance test. The approximate test is less powerful when the variances are in fact uniform. Both forms of the comparison of means, from individual observations and from means and standard deviations, include the large sample Normal confidence interval and test of significance. The effect of the large sample assumption can thus be explored. 10.5.2 One-way analysis of variance This carries out the usual one-way analysis of variance for unequal sized groups (Armitage and Berry, 1987, Section 7.1). The ANOVA table is printed. All the groups are automatically included. After the analysis of variance a number of procedures are available. The means for each group may be presented, with standard errors and 95% confidence intervals based on the residual variance. A linear contrast may be tested, either using the usual F test for pre-defined orthogonal contrasts or Scheffé's test for a contrast picked out as large (Armitage and Berry, 1987, Section 7.4). The contrast may be defined by giving coefficients or by giving two sets of groups to be contrasted. The value of the contrast, its standard error, and 95% confidence interval are given, together with the variance ratio and the probabilities for the two tests. Several multiple comparison procedures are offered. Straightforward group differences, standard errors and 95% (Student) confidence intervals are given.
Two least significant difference methods, Student's and Fisher's, are available for any chosen significance level. Two Studentized range tests, Tukey's and the Newman-Keuls test (Armitage and Berry, 1987, Section 7.4), are available for probabilities 0.05 and 0.01 only. These are only for equal sized groups. For unequal groups Gabriel's test (Kendall and Stuart, 1968, Section 35.54) is given. This is a multiple comparison procedure which can be used with groups of any size. It says that two groups are significantly different if every subset of groups containing this pair has a sum of squares large enough to give an overall significant difference between the groups if groups outside the subset made no contribution. The test is not widely known because the calculations required are very time consuming, but it is a useful technique. On some PCs this is slow if there are many groups. 10.5.3 Bartlett's test Bartlett's test tests the homogeneity of variance in Normal populations (Armitage and Berry, 1987, Section 4.6). The result is printed as a chi-squared statistic, with degrees of freedom and associated probability. All the groups are automatically included. 10.5.4 Confidence intervals for means The program calculates confidence intervals for each group mean, using that group's own standard deviation. If you want to calculate a confidence interval using the combined standard deviation of all the groups, you should use one way analysis of variance (Section ) and then choose option one from its sub-menu. The confidence interval uses the t method, so data are assumed to be from a Normal distribution. For larger samples (say > 30) this is equivalent to the Normal method, 1.96 standard errors on either side of the mean, and for samples greater than 100 the Normal distribution assumption can be ignored (Bland 1987, ch10).
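The large sample Normal method mentioned above (the mean plus or minus 1.96 standard errors) can be sketched in a few lines of Python. This is an illustration only, not Clinstat's own code: the function name normal_ci_for_mean and the example data are hypothetical, and because it uses the Normal quantile rather than the t quantile it is only appropriate for larger samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def normal_ci_for_mean(values, level=0.95):
    """Large-sample confidence interval for a mean: mean +/- z * standard error."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)                 # uses this group's own SD (n - 1 divisor)
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for a 95% interval
    return m - z * se, m + z * se

# Made-up data for one group, purely for illustration.
group = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7]
low, high = normal_ci_for_mean(group)
print(round(low, 2), round(high, 2))
```

For small groups such as the one in this example, the t method described in the text would give a somewhat wider interval.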
10.6 Methods based on ranks Three rank tests are available: the Mann-Whitney U test (an exact equivalent of the Wilcoxon two sample test), the Kruskal-Wallis one-way analysis of variance by ranks and the Kolmogorov-Smirnov two sample test. 10.6.1 Mann-Whitney U test The Mann-Whitney U test (Bland, 1987, Section 12.2) compares any two groups. If there are only two groups the program proceeds directly to the analysis; if there are more than two groups it asks which pair are to be compared. The program prints the value of U and the two group sizes. If both groups have size less than 20, an exact probability table is used, which results in the probability being recorded as ">0.05", "<0.05" or "<0.01". For larger groups, the Normal approximation is used and the two-tailed probability for this is printed. 10.6.2 Kruskal-Wallis test The Kruskal-Wallis one-way analysis of variance by ranks (Conover, 1980, Section 5.2) compares all groups automatically. If there are three groups each with size less than or equal to five, an exact probability table is used, which results in the probability being recorded as ">0.05", "<0.05", or "<0.01". Otherwise, a chi-squared approximation is used, and the probability associated with this is printed. 10.6.3 Kolmogorov-Smirnov test The Kolmogorov-Smirnov test (Conover, 1980, Section 6.3) compares any two groups, chosen as for the Mann-Whitney test. The program prints the value of D and the probability as ">0.05", "<0.05" or "<0.01". If both groups have size less than 25 an exact table is used. Otherwise an approximation is used. 10.7 Save data on disk This option asks for the file name. The variable labels may be changed if you wish. The data are then stored in the usual Clinstat format. Group names are stored as value labels. 11 Survival analysis, main menu option 7 11.1 Facilities and limitations This program is started from the Clinstat main menu option 7.
It will accept censored survival data from the keyboard or from disk and save data on disk. Functions include survival probabilities, survival curves, and logrank tests. Up to 20 groups can be held, but only two compared in any one operation. Data can be stored on a Clinstat disk file. The program uses the terminology "death" and "withdrawal" to describe the possible end points of the survival time, but of course survival analysis has many other applications. "Death" denotes a definite outcome, "withdrawal" a censored observation. In the study of time to conception, for example, the definite outcome or "death" would be the start of a life. 11.2 Data input Option 1, data input, leads to the data input menu, which offers keyboard input, data from disk, data from disk with restrictions. If the data are entered via the keyboard, the program asks how many groups there are. It then asks whether you want to label groups and variables, the defaults being "Group 1", "Group 2", etc., "time" for the survival time variable and "Outcome" for the variable which records the type of outcome, e.g. withdrawal or death. Data are then entered for each group, case by case, the group being terminated by "NO" or "N". For each case the time is entered, then "D" or "W" to denote death or withdrawal. If input is from disk, Clinstat asks for the file name and reads the file size and labels, which are displayed. Three variables are requested: the time variable, the outcome variable and the group variable. Clinstat also asks for the codes in the outcome variable which represent death or withdrawal. For example, these might be 1 for a death and 2 for a withdrawal or censoring. If restrictions are used, the procedure of 1.8 is followed. The program reads the data and sorts them into groups, each group being defined by one value of the grouping variable. Cases with missing data for time, outcome or group variable are omitted. 
It then prints the number of groups and the number in each group with the defining value of the grouping variable. If there are value labels for the group variable on the data file, these become the group labels. Otherwise, the next option is to label the groups, e.g. as "Treated" and "Control". The default is "Group 1", "Group 2", etc. The same labels can be retained for the next analysis. 11.3 Listing and editing Data can be listed on the screen or on the printer. Listing is group by group. If the data were read from disk, the grouping procedure may have changed the order. Editing offers options to change a case, delete a case, or add a case, with the option to list the data at any point. Each correction option asks for the group number and for change or deletion the case number within the group. If you can't remember the case number, 0 will return you to the edit menu and you can list the data to check. Deleting all the cases in a group deletes the group. Groups can be combined. 11.4 Survival curves Survival curves can be plotted for a single group or for two groups. For two groups, the second survival curve is shown as a broken line and in a different colour on colour monitors. Censored observations are indicated by short vertical lines at the censoring point (Bland, 1987, Section 15.6). 11.5 Logrank test and standard errors Option 4 from the survival analysis menu offers the logrank test (Armitage and Berry, 1987, Section 14.6), survival probabilities (Armitage and Berry, 1987, Section 14.5), and a standard error and confidence interval for a survival rate (Greenwood's method, Armitage and Berry, 1987, Section 14.4). The logrank test is a nonparametric test of the significance of the difference in survival between two groups. It can be done between any two groups. If there are more than two groups, Clinstat asks which groups you wish to compare.
The numbers of deaths observed and expected under the null hypothesis are printed for each group, with the chi-squared test statistic and probability. The survival probabilities by the Kaplan-Meier method for censored data are printed for any group, with the details of the calculation. 11.6 Save data on disk This option asks for the file name. Three variables are stored: the survival time, the group and the outcome. Outcome is either 1 for a death or 2 for a withdrawal. The variable labels may be changed if you wish. The data are then stored in the usual Clinstat format. Group names are stored as value labels. 12 Random numbers, sampling and allocation, main menu option 8 12.1 Facilities and limitations This program will print random digits in sets of 100 or 1000 (printer only), or a random permutation of the numbers 1 to N for any N up to 1000. It will carry out three types of random allocation for up to 1000 subjects: unconstrained, in equal groups, or in equal subsets within equal groups. It will carry out two types of random sampling: for fixed sampling probability or fraction (any number) and fixed sample size (population up to 1000). 12.2 Random digits The program will print 100 random digits in a 10 x 10 matrix with row and column numbers. It will also print a page of random digits. This option is only available on the printer and prints 1000 random digits in groups of four. The program will also print a random permutation of the numbers from 1 to any given number up to 1000. 12.3 Random allocation Random allocation (Bland, 1987, Section 2.2) can be unconstrained. This gives random allocation into any number of groups, which are labelled 1, 2, 3,..., etc., for a given total number of subjects. The group sizes may not be equal. Equal groups random allocation may be selected. This allocates up to 1000 subjects into any number of groups. The groups will be of equal size, so the number of groups must divide the number of subjects.
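The difference between unconstrained and equal groups allocation can be illustrated with a short Python sketch. It is not Clinstat's algorithm, which this manual does not describe; the function names are hypothetical, and shuffling a list of group labels is just one simple way to obtain groups of equal size.

```python
import random

def unconstrained_allocation(n_subjects, n_groups, rng=random):
    """Assign each subject independently to a group 1..n_groups; sizes may differ."""
    return [rng.randint(1, n_groups) for _ in range(n_subjects)]

def equal_groups_allocation(n_subjects, n_groups, rng=random):
    """Assign subjects so that every group has the same size."""
    if n_subjects % n_groups != 0:
        raise ValueError("the number of groups must divide the number of subjects")
    per_group = n_subjects // n_groups
    labels = [g for g in range(1, n_groups + 1) for _ in range(per_group)]
    rng.shuffle(labels)          # random order, fixed group sizes
    return labels

rng = random.Random(1)           # fixed seed so the listing can be reproduced
print(unconstrained_allocation(12, 3, rng))
print(equal_groups_allocation(12, 3, rng))
```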
Finally, subjects may be allocated to groups in blocks, so that each subset of patients contains equal numbers in each group. If this is used, then whenever a trial is stopped there will be approximately equal numbers in the groups (Pocock, 1983, Section 5). This option allocates up to 1000 subjects into any number of groups within subsets. Thus if there are two groups and subsets of 10 subjects, the first 10 subjects will have 5 allocated to group 1 and 5 to group 2. The next 10 subjects will also be split in this way, and so on. The number of groups must divide the number in a subset, which must divide the total number to be allocated. 12.4 Simple random sampling There are two simple random sample options. The first chooses a simple random sample (Bland, 1987, Section 3.4) for a given sampling probability. This chooses a sample from a population of given size for a given probability. Each subject is chosen independently with this probability. The sample size obtained is printed at the end. Alternatively, a simple random sample of fixed size can be found. This option gives a random sample of required size for a population of up to 1000 subjects. 13 Determination of sample size using power calculations 13.1 Facilities and limitations This program enables investigation of sample size options for the comparison of two means, the comparison of two proportions and the detection of a correlation between two variables. The method is based on large sample significance tests and the power against the alternative hypothesis (Bland 1987, Section 9.9, Section 9.10, Armitage and Berry, 1987, Section 6.6). All sample size calculations are approximate, and where small samples are indicated these will be inaccurate as large sample formulae are used. There is no allowance for the effect of degrees of freedom in the comparison of means, for example. The program's first menu gives a choice between comparison of means, comparison of proportions, or detection of correlation.
The second menu depends on this choice. As appropriate, the significance level, ratio of group one sample size to group two sample size, power and standard deviation can be changed. Sample size can then be estimated given population difference or correlation (the alternative hypothesis), or difference or correlation detectable with a given sample size. Three plots are also available: power against sample size, power against difference or correlation, and sample size against difference or correlation. These graphs can be printed and saved on disk for retrieval. For comparisons of means and proportions, the sample size used is the total sample size, the size of the two groups combined. The ratio of one group size to the other is initially set at 1:1, but can be changed to allow consideration of other schemes. 13.2 Sample size for the comparison of two means Power, significance level, sample size ratio, and standard deviation are set separately from the menu. Given these, the sample size required to detect a difference or the difference detectable with a particular sample size can be found. When comparing two means, the standard deviation of the observations is important. Clinstat sets this to 1.0. If you know what it should be, from a pilot study or previous publications, you can put it in at the next menu. Otherwise, you can interpret the difference between means in terms of a number of standard deviations. 13.3 Sample size for the comparison of two proportions The standard error of the difference between two proportions depends on the magnitude of the proportions as well as their difference, so for all options comparing proportions at least one of the population proportions must be specified. You need to know roughly what the proportions involved are going to be. Power, significance level, and sample size ratio are set separately from the menu. Given these, the sample size required to detect a difference can be found. 
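For two groups of equal size, one common large sample formula for the number needed in each group can be sketched in Python as follows. This is an illustration under stated assumptions, not necessarily the calculation Clinstat performs: the function name is hypothetical, the sample size ratio is fixed at 1:1, and the result is the number per group, so the total sample size that Clinstat reports would be twice this.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate number per group to detect p1 versus p2 (two-sided test).

    Uses n = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2,
    a standard large-sample approximation for equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # required power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return ceil(n)

# Hypothetical planning values: 30% in one group against 50% in the other.
n = n_per_group_two_proportions(0.30, 0.50)
print("per group:", n, "total:", 2 * n)
```

The formula shows why both proportions are needed: the numerator changes with the proportions themselves, not only with their difference.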
Because the standard error of a proportion depends on the proportion itself, both proportions have to be entered rather than the difference between them. The difference from a given proportion detectable with a particular sample size can also be found. 13.4 Sample size for the detection of a correlation Power and significance level are set separately from the menu. Given these, the sample size required to detect a correlation or the correlation detectable with a particular sample size can be found. 13.5 Sample size for the mean difference or comparison of two means in paired or matched samples When comparing two means in paired samples or the same sample on two occasions, the standard deviation of the differences between pairs of observations on the same subject or matched pair is very important. Clinstat presents results based on this standard deviation. You can either start with this or with the standard deviation of the observations between subjects (the usual standard deviation) and the correlation coefficient between the first and second measurements, from which the standard deviation of differences can be calculated. Clinstat sets the standard deviation of differences to 1.0. If you know what any of these should be, from a pilot study or from previous publications, you can put it in at the next menu. If you put in the standard deviation of the observations between subjects and the correlation coefficient between paired measurements, Clinstat calculates the standard deviation of differences from them. Otherwise, you can interpret the difference between means in terms of a number of standard deviations of differences, as for two independent means (). This may not be very helpful, and some data is usually required before anything useful can be obtained from this option. Note that you need either the standard deviation of the differences or both the standard deviation between subjects and the correlation coefficient between paired measurements. 
The standard deviation between subjects is insufficient without the correlation. Power, significance level, sample size ratio, and standard deviation are set separately from the menu. Given these, the sample size required to detect a difference or the difference detectable with a particular sample size can be found. 13.6 Power against difference or correlation This is a plot showing power on the vertical axis against difference or correlation on the horizontal axis. The program asks for the total sample size. For differences between means and proportions this is then split into n1 and n2 according to the sample size ratio set separately from the menu. The plot is symmetrical about zero for the difference between two means and for correlation, as positive and negative differences can be found with equal power. It is not symmetrical for the difference between two proportions. The proportion for group 1 is fixed and the other proportion cannot be less than zero or greater than one. 13.7 Power against sample size This is a plot showing power on the vertical axis against total sample size on the horizontal axis. The program asks for the difference between two means, the two proportions p1 and p2, or the correlation. For differences between means and proportions the total sample size is split into n1 and n2 according to the sample size ratio set separately from the menu. 13.8 Sample size against difference or correlation This is a plot showing total sample size on the vertical axis against difference or correlation on the horizontal axis. The power is set separately from the menu. The program asks for the maximum size in which you are interested, and plots from zero to this or above. It uses a rounding algorithm for the scale which means that it often gives sample sizes bigger than those asked for. For the difference between two proportions, the program asks for the proportion in group 1, p1. 
For differences between means and proportions the total sample size is split into n1 and n2 according to the sample size ratio set separately from the menu. The plot is symmetrical about zero for the difference between two means and for correlation, as positive and negative differences can be found with equal power. It is not symmetrical for the difference between two proportions. The proportion for group 1 is fixed and the other proportion cannot be less than zero or greater than one. 14 Histogram, mean and standard deviation for a single variable, main menu option 8 14.1 Facilities and limitations This program is started from option 8 from the Clinstat main menu, sub menu option 3. It will accept data from the keyboard or from a disk file, and will save data on disk. This program performs calculations on a single continuous variable. It will produce basic statistics (mean, median, variance, etc), plot histograms and scatter diagrams. The main summary statistics menu offers data input, listing and editing, summary statistics and plots, and storage of data on disk. 14.2 Data input Data input offers 3 options: from keyboard, from disk, or from disk with restrictions. If the data are to be put in via the keyboard, option 1, Clinstat first asks whether you want to name your variables. The default option is "X". The program will then ask for the data case by case. "N" or "NO" terminates the data entry. If the data are on a Clinstat disk file, the program will ask for the file name and read the variable labels. It will then ask which variable is required. If restrictions are used the procedure of 1.8 is followed. It then reads the data from disk and returns control to the menu. Cases with missing data for the variable are omitted. 14.3 List and edit data Option 2 of the single variable menu provides a number of listing and editing options. The data can be listed on the screen or on the printer or log file. Data editing enables errors to be corrected. 
Cases can be changed, added or deleted. Cases are referred to by the case number. The data transformations option offers log (base 10), exponent of 10 (antilog), and raise to any power. If an invalid transformation has been requested, such as log (0) or (-1) to the power 0.5, a message is given and the transformation is aborted. More extensive transformations are available in the data file editing program obtained from Clinstat main menu option 1 if you want them. 14.4 Summary statistics and plots Option 3 of the single variable menu, summary statistics and plots, produces a further menu, offering histogram, Normal plot, frequency distribution and summary statistics. Summary statistics provide mean, minimum and maximum, median, variance, standard deviation, standard error of mean, and corrected sum of squares (Bland, 1987, Section 4.5, Section 4.6, Section 4.7). You can choose the interval size and starting point for the histogram (Bland, 1987, Section 4.3) yourself or let Clinstat do it for you. A Normal Distribution curve (Bland, 1987, Section 7.2, Section 7.4) and the position of the mean and the mean ± one and two standard deviations (Bland, 1987, Section 4.7) can be added to the histogram if required. Clinstat will also calculate and print the frequencies themselves. For the frequency distribution, too, the choice of interval can be left to the program or done by the user. A histogram of the frequency distribution is optional. This enables you to draw any histogram you like. A Normal plot is a plot of the variable against the corresponding percentile of the Normal Distribution (Bland, 1987, Section 7.5). If the data follow a Normal Distribution the Normal plot will be a straight line. As a guide, Clinstat draws the line along which the points are expected to lie if the data are Normally distributed. This option is rather slow on some PCs. 14.5 Confidence intervals for the mean The program calculates a confidence interval for the mean.
The confidence interval uses the t method, so data are assumed to be from a Normal distribution. For larger samples (say > 30) this is equivalent to the Normal method, 1.96 standard errors on either side of the mean, and for samples greater than 100 the Normal distribution assumption can be ignored (Bland 1987, ch10). 14.6 Storing data on disk Saving data on disk enables you to keep the data which have been keyed into this program. On twin floppy drive systems, a data disk is needed in Drive B:. On one drive systems it must be exchanged for the program disk. The variable label can be changed at this point if you wish. The file created can be read by the general data editing program as if it had been entered under the general input program from Clinstat main menu option 1. 15 Standardized Mortality Ratios, main menu option 8 15.1 Calculation of Standardized Mortality Ratios This program calculates a Standardized Mortality Ratio (Bland, 1987, Section 16.3) from a standard set of mortality rates and an age distribution for the study population. The SMR, its standard error and a 95% confidence interval are calculated. For small samples the confidence interval is found by the exact Poisson method (Gardner and Altman, 1989, Section 6). Further SMRs can be calculated for other diseases by entering new standard rates, and for other populations by entering a new age distribution. 15.2 Confidence intervals and significance tests for Standardized Mortality Ratios This program calculates a confidence interval for a Standardized Mortality Ratio from the observed and expected frequencies. It is suitable for small samples as exact Poisson methods are used (Gardner and Altman, 1989, Section 6). The SMR, a 95% confidence interval, and a test of the null hypothesis that the SMR is one are calculated. The program also compares two SMRs. The confidence interval for the ratio is calculated, as this is much easier to do in the small sample case than the interval for the difference.
A significance test of the null hypothesis that the two SMRs are equal is also given. On some PCs this may be slow if the observed frequencies are large. 16 Simulations and other demonstrations of statistical principles, main menu option 9 Clinstat includes a number of programs developed for computer aided learning and teaching. These can be used in private study, in group exercises if a computer lab or classroom is available, or as demonstrations in the lecture theatre, using a Kodak Datashow or similar device. These programs do not produce any printer or log file output. 16.1 Tossing coins This very simple program illustrates the concept of randomness and probability. Any number of coins can be tossed. First toss a single coin. If you repeat this several times you will see that we cannot predict whether Head or Tail will show. However, if we toss several coins, say 10, we can be fairly sure that we will get some heads and some tails. If we toss a large number of coins, we see that about half the coins show heads and half show tails. We can predict what will happen over many trials, "in the long run", but not the single trial (Bland, 1987, Section 6.1). 16.2 Pintable simulation This program is based on a program for the Commodore Pet by I. J. Wood (1981). It represents Bernoulli's experiment of a ball falling down a series of pins, bouncing in either direction. This simulates a Binomial distribution (Bland, 1987, Section 6.4, Bland, 1984). Because this is a computer, the ball does not have to have the same probability of bouncing to right and left. After the pintable, the program will print a histogram for the number of bounces to the right and compare the observed probabilities for this distribution with those predicted by the Binomial. 16.3 People on boxes - simulation of mean and variance This program illustrates the effect of addition and subtraction upon mean and variance (Bland, 1987, Section 6.6, Bland 1984).
The random variables used are human height, the heights of boxes and the depths of holes. The program will draw a screen of stick people, with their mean height, standard deviation and variance. It will do the same for random boxes and holes. Constants are represented by the constant boxes, all identical, and constant holes, again identical. Something can be added to human height by getting the people to stand on a box. If the boxes are all the same we add a constant. The mean increases but the variability is unchanged. If the boxes are random and independent of the heights, the mean of the sum is the sum of the means and the variance of the sum is the sum of the variances of human height and box height. If the heights of the boxes are not independent of the heights of the people this is not necessarily true. If the box is chosen so that the person can reach a light bulb, the short people must find the big boxes and so the box and people heights will be negatively correlated. The variance of the sum is not the sum of the variances; it is reduced. Something can be subtracted from human height by getting the people to stand in a hole. If the holes are all of the same depth we subtract a constant. The mean decreases but the variability is unchanged. If the holes are random and independent of the heights, the mean of the difference is the difference between the means. The variability is increased, however, and the variance of the difference is the sum of the variances of human height and hole depth. If the depths of the holes are not independent of the heights of the people this is not necessarily true. If the hole is chosen so that the person can hide, the tall people must find the deepest holes, and so height and hole depth will be positively correlated. As for the boxes, the variance of the difference is no longer the sum of the variances; it is reduced.
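The behaviour described above can be checked numerically with a small Python sketch. It is not the Clinstat simulation itself: the heights, box heights and hole depths are made-up Normal variables, and dependence is created simply by sorting, which is an assumption chosen for illustration. The underlying identity is that the variance of X plus or minus Y equals the variance of X, plus the variance of Y, plus or minus twice their covariance.

```python
import random
from statistics import mean, pvariance

def covariance(x, y):
    """Population covariance (divisor n), to match pvariance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def variance_of_combination(x, y, sign):
    """Variance of x + y (sign=+1) or x - y (sign=-1), computed directly."""
    return pvariance([a + sign * b for a, b in zip(x, y)])

rng = random.Random(0)                                  # fixed seed for repeatability
heights = [rng.gauss(170, 10) for _ in range(2000)]     # made-up human heights, cm
sizes = [rng.gauss(40, 5) for _ in range(2000)]         # made-up box heights / hole depths, cm
sum_of_variances = pvariance(heights) + pvariance(sizes)

# Independent boxes or holes: both variances are close to the sum of the variances.
print(variance_of_combination(heights, sizes, +1), sum_of_variances)
print(variance_of_combination(heights, sizes, -1), sum_of_variances)

# Short people take the big boxes: the variance of height + box is reduced.
boxes_matched = sorted(sizes, reverse=True)
print(variance_of_combination(sorted(heights), boxes_matched, +1))

# Tall people take the deep holes: the variance of height - hole is reduced.
holes_matched = sorted(sizes)
print(variance_of_combination(sorted(heights), holes_matched, -1))
```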
16.4 Central limit simulation The Central Limit Theorem states that if we have the sum of several independent, identically-distributed random variables, this sum tends towards a Normal Distribution as the number of variables increases (Bland, 1987, Section 7.2). A consequence of this is that the mean of a large sample will come from a Normal distribution whatever the distribution of the observations themselves. This program illustrates this using a Uniform or Rectangular distribution to produce the observations. All possible numbers between 0 and 1 are equally likely. This is produced by the RND(X) function familiar to BASIC programmers. The program first asks for the number of observations to be added then the number of sums to be generated. The graphic display constructs a histogram of the Uniform variable. As each number is generated, another block is added to the histogram. When it is complete, the corresponding Normal Distribution curve is drawn. This has the same mean and variance and is scaled to have the same area as the histogram. The program then asks you if you want to do it again. If you answer "Y" the program asks for the number to be added, if "N" it returns to the simulations menu. Start with 1 observation and 400 runs. The graphic display constructs a histogram of the Uniform variable, showing a roughly rectangular shape. The Normal Distribution curve looks quite unlike the histogram. Now try two observations added. The histogram is now triangular and the Normal Distribution curve fits it better, but not well. Look at the tails of the distributions to see the poor fit there. As you increase the number of observations added, you will see that the fit improves rapidly, until by the time six Uniform variables are added the two distributions are very close indeed. As you increase the number in the sum, you can increase the number of runs, too (Bland, 1984). 
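The same experiment can be tried outside Clinstat with a short Python sketch. It is only an illustration: the function names are hypothetical, Python's random generator stands in for the BASIC RND function, and a crude text histogram replaces the graphic display. The sum of n Uniform(0, 1) observations has mean n/2 and variance n/12, which gives the Normal curve to compare against.

```python
import random
from statistics import mean, pvariance

def simulate_sums(n_added, n_runs, rng):
    """Generate n_runs sums, each of n_added Uniform(0, 1) observations."""
    return [sum(rng.random() for _ in range(n_added)) for _ in range(n_runs)]

def text_histogram(values, low, high, bins=12, width=50):
    """Return one row of '#' characters per interval, scaled to the tallest bar."""
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / (high - low) * bins), bins - 1)
        counts[index] += 1
    tallest = max(counts)
    return ["#" * round(width * c / tallest) for c in counts]

rng = random.Random(1)                 # fixed seed for repeatability
for n_added in (1, 2, 6):
    sums = simulate_sums(n_added, 400, rng)
    # Theory: the sum has mean n_added / 2 and variance n_added / 12.
    print(n_added, round(mean(sums), 3), round(pvariance(sums), 3))
    print("\n".join(text_histogram(sums, 0, n_added)))
```

With one observation the rows are roughly equal in length, with two they form a triangle, and with six they are already close to the bell shape.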
16.5 Sampling distribution of mean and proportion
This program plots a histogram like that for the central limit theorem. Instead of the sum of Uniform random variables, it calculates the mean of Normal random variables with mean = 0 and standard deviation = 1. The mean and standard deviation of the resulting sampling distribution are printed. The standard deviation is, of course, the estimated standard error of the mean (Bland, 1987, Section 8.1, Section 8.2; Bland, 1984). Start with a single observation, to show the underlying distribution with standard deviation = 1. Then increase the sample size to 4, to 9 and to 16. The standard deviations will be approximately 1/2, 1/3 and 1/4, showing that the standard error is inversely proportional to the square root of the number of observations.

The program also plots a histogram of Binomial proportions. The population proportion is set by the user. The program generates proportions from samples of specified size. The mean and standard deviation of the resulting sampling distribution are printed. The standard deviation is, of course, the estimated standard error of the proportion (Bland, 1987, Section 8.4). This program illustrates both standard error and the Normal approximation to the Binomial distribution. It needs larger samples than the means option. A good demonstration uses a population proportion of 0.3, and sample size 10 with 300 samples, sample size 40 with 300 samples, sample size 90 with 300 samples, sample size 160 with 200 samples, and sample size 250 with 200 samples.

16.6 Sum of squares simulation
This program shows why we use n-1 as the divisor when estimating variance (Bland, 1987, Section 4.7, Section 4A.1, Section 6A.2; Bland, 1985). It takes random samples from a population to show that the sum of squares about the sample mean is proportional to the number of observations minus one, whereas the sum of squares about the population mean (which is, unusually, known in this case) is proportional to the number of observations itself.
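The claim about the two sums of squares can be checked with a short Python sketch. This is an illustration under stated assumptions, not a reproduction of Clinstat: the population (the integers 1 to 100) and the function names are hypothetical, and samples are drawn with replacement so that the n-1 and n results hold exactly on average, which may differ from the way Clinstat draws its samples.

```python
import random


def population_variance(population):
    """Mean squared difference from the population mean."""
    mu = sum(population) / len(population)
    return sum((x - mu) ** 2 for x in population) / len(population)


def average_sums_of_squares(population, n, runs=20000, seed=1):
    """Average, over many samples of size n, the sum of squares about
    the sample mean and the sum of squares about the population mean."""
    rng = random.Random(seed)
    mu = sum(population) / len(population)
    about_sample_mean = 0.0
    about_population_mean = 0.0
    for _ in range(runs):
        # Assumption: sampling with replacement from the listed population.
        sample = [rng.choice(population) for _ in range(n)]
        xbar = sum(sample) / n
        about_sample_mean += sum((x - xbar) ** 2 for x in sample)
        about_population_mean += sum((x - mu) ** 2 for x in sample)
    return about_sample_mean / runs, about_population_mean / runs


if __name__ == "__main__":
    population = list(range(1, 101))  # hypothetical population
    sigma2 = population_variance(population)
    for n in (2, 3, 4, 5):
        ss_sample, ss_pop = average_sums_of_squares(population, n)
        # Each ratio is the average sum of squares in units of the variance.
        print(n, round(ss_sample / sigma2, 2), round(ss_pop / sigma2, 2))
```

For each n, the first ratio should be close to n-1 and the second close to n, which is why dividing the sum of squares about the sample mean by n-1 gives an unbiased estimate of the variance.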
This is a menu-driven program. The population can be listed and the population mean and sum of squares shown, together with the mean squared difference from the mean, which is the population variance. We use small samples to estimate this. You can draw a simple random sample of any size, say 4. The sample chosen is printed, together with a number of quantities, including the sum of squares about the sample mean. This sum of squares divided by n-1 is the usual estimate of the population variance. This sum of squares divided by n is also given. Many calculators give the square root of this as σn. Try it again and you will get a different sample. A simulation run will generate and aggregate the results from many such samples. Try a simulation run with sample size 2 and 50 runs. You get the same quantities averaged over the 50 samples. This can be repeated using sample sizes 3, 4 and 5. A summary of the runs so far shows how these estimates of variance change with n.

16.7 Confidence interval simulation
This program illustrates the behaviour of confidence intervals (Bland, 1987, Section 8.3; Bland, 1985). It has a menu structure similar to that of the sum of squares simulation. It draws random samples from a fixed population and calculates means with 95% confidence intervals. The population can be listed and the true mean found. A single random sample can be drawn, and the 95% confidence interval calculated. For most samples this should include the population mean. A simulation run gives a graphic display, showing the value of the variable along the horizontal scale, with the population mean marked by a vertical line. In the long run, one in twenty confidence intervals excludes the population mean.

16.8 Probability distributions
This program plots any of the following distributions: Normal, Binomial, Poisson, t, Chi-squared and F (Bland, 1987, Section 6.4, Section 6.7, Section 7.2, Section 7A).
The Normal is plotted for any mu and sigma over the range -6 to +6, the Binomial for any p and n <= 200, the Poisson for any mu <= 100, the t for any degrees of freedom over the range -6 to +6, the Chi-squared for any degrees of freedom up to 30, and the F for any degrees of freedom over the range 0 to 6. The Binomial, Poisson and t plots have a Normal distribution curve option.

16.9 Simulations of clinical trials
There are two simulated clinical trials. The first has a fixed sample size of 55 patients; the second can have any sample size (Bland, 1986).

16.9.1 Clinical trial with fixed sample size
For this trial you have to use a table of random numbers or some other method to allocate 55 subjects to two treatments. The random allocation program could be used. Clinstat prints a table of results. You must decide what to do about several patients without full data. This program can be used to illustrate the concept of significance in class. Calculate chi-squared tests for survival versus death and for degree of improvement. Some will be significant and some not. In fact, there is no difference in mortality but there is a true difference in the degree of improvement. The sample size is too small to be sure of detecting this, however.

16.9.2 Clinical trial with variable sample size
This program does the randomization for you. You can vary the sample size. The chi-squared test is built in, together with some table editing functions to get a valid and appropriate test. You can print the actual model on which the simulation is based and repeat the simulation, using the same model again or a different one.

17 Error messages
Clinstat traps some errors before they happen; for others it uses the QuickBasic run-time error codes (Microsoft, 1987). Some of these should never happen; others will be labelled as they occur.
A full list of these codes follows:

3 RETURN without GOSUB
4 Out of DATA
5 Illegal function call
6 Overflow
7 Out of memory
9 Subscript out of range
11 Division by zero
14 Out of string space
16 String formula too complex
19 No RESUME
20 RESUME without error
24 Device timeout
25 Device fault
27 Out of paper
39 CASE ELSE expected
40 Variable required
50 FIELD overflow
51 Internal error
52 Bad file name or number
53 File not found
54 Bad file mode
55 File already open
56 FIELD statement active
57 Device I/O error
58 File already exists
59 Bad record length
64 Bad file name
67 Too many files
68 Device unavailable
69 Communication buffer overflow
70 Permission denied
71 Disk not ready
72 Disk-media error
73 Rename across disks
75 Path/file access error
76 Path not found

Most of these errors relate to file and peripheral problems. You should be able to deal with these quite easily. Error 24, device timeout, probably means your printer is not switched on, for example. Error 5 occurs, among other causes, when you try to do something for which your computer does not have the hardware. You may be trying to use a graphics adapter you do not have, or using a Hercules adapter without having loaded QBHERC.COM. If Clinstat was installed correctly, the batch file loads QBHERC.COM for you. See Sections 2 and 3.3. If errors 3, 4, 19, 20, 39, 50 or 56 occur, this indicates a programming error. Contact Martin Bland.

18 References
Armitage P, Berry G. (1987) Statistical Methods in Medical Research. Blackwell, Oxford.
Bland JM. (1984) Using a microcomputer as a visual aid in the teaching of statistics. The Statistician, 33, 253-259.
Bland JM. (1985) Computer simulation used to illustrate two statistical principles. Teaching Statistics, 7(3), 74-78.
Bland JM. (1986) Computer simulation of a clinical trial as an aid to teaching the concept of statistical significance. Statistics in Medicine, 5, 193-197.
Bland JM, Altman DG. (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-310.
Bland M. (1987) An Introduction to Medical Statistics. Oxford University Press, Oxford.
Conover WJ. (1980) Practical Nonparametric Statistics, 2nd ed. Wiley, New York.
Fleiss JL. (1981) Statistical Methods for Rates and Proportions, 2nd ed. Wiley, New York.
Gardner M, Altman DG. (1989) Statistics with Confidence. BMJ, London.
Kendall MG, Stuart A. (1968) The Advanced Theory of Statistics, Vol 3, 2nd ed. Griffin, London.
Maxwell AE. (1980) Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-5.
Microsoft Corporation. (1987) QuickBASIC 4.0.
Pocock SJ. (1983) Clinical Trials: a Practical Approach. Wiley, Chichester.