Arch. Toxicol. 38, 13-25 (1977)
Archives of Toxicology, © by Springer-Verlag 1977
Statistical Problems in Mutagenicity Tests*
J. Vollmar
Boehringer Mannheim GmbH, Medizinische Forschung, Sandhofer Straße 112, D-6800 Mannheim, Federal Republic of Germany
Abstract. Everyone involved in planning and carrying out biological experiments has to concern himself with statistical methods. Descriptive statistics (e.g. calculation of the mean) help to present results briefly and comprehensively, and statistical test procedures permit decisions to be made even in an uncertain situation. The choice of the appropriate statistical procedure is, however, associated with considerable problems. These problems are discussed in detail using the dominant lethal test as an example. Means of solving them, together with the necessary consequences for the planning of the experiment and its evaluation, are discussed, and practical recommendations are given that also apply to other mutagenicity tests.
Key words: Mutagenicity tests - Statistics in testing mutagenicity - Biometrics - Dominant lethal test.
Zusammenfassung. Anyone who plans and carries out biological experiments is compelled to come to terms with statistical methods. Descriptive statistics (e.g. calculation of the mean) help to present experimental results concisely and clearly, while statistical test procedures make it possible to reach a decision even in a situation of uncertainty. The choice of the statistical procedures permissible in each case is, however, associated with major problems. These problems are treated in detail using the dominant lethal test as an example, approaches to solving them are discussed together with the necessary consequences for the planning and evaluation of experiments, and practical recommendations are given, also for other mutagenicity tests.
Attempts have been made for several years to develop various biological test systems for detecting the mutagenic potency of chemical agents. The results of these developments have been presented for 2 test systems by Ehling (1977) and Röhrborn (1976). Although the problems and possible solutions are shown from the biological side, problems arising in the statistical planning and evaluation of mutagenicity tests are dealt with only superficially. Since evaluation in the pertinent literature on mutagenicity tests is frequently carried out with the usual and generally known tests (e.g. t-test and χ²-test), there appear to be no statistical problems at all or, if there are any, they seem negligible compared with the experimental and theoretical biological problems. Everyone thinking and acting on this basis can remain comfortable as long as he is working with strongly mutagenic substances. In these cases, the "statistics" supply the necessary cosmetics to make the results of such an experiment "scientific". Statistics thus become a sort of bikini: "what one sees is exciting and stimulating, but the essentials remain hidden". In routine investigations with substances of weak mutagenic action in particular, the essentials remain hidden and the statistical analyses often produce an outcome that contradicts "common sense". If at this point one consults a statistician, one very soon realizes that statistical problems do exist in mutagenicity tests. These problems appear to multiply without limit when several statisticians are consulted. The experimental worker is now obliged to come to an understanding of statistical methods (Fig. 1).
* Presented at the 3rd Meeting of the Gesellschaft für Umwelt-Mutationsforschung e. V., D-8042 Neuherberg, July 1-2, 1976
General Procedure for the Construction and Analysis of Biological Experiments
The following discussion is not intended to be a patent solution to these problems. Experimenter and statistician can only produce a rational solution by working together. This joint solution must take into account the theoretical and experimental biological requirements as well as the statistical ones and is expressed, in accordance with general experience, by the following multistage procedure.
1st Stage: System Analysis (Biological and Statistical Model)
In this stage, an attempt is made to summarize the biological knowledge in a model. In the construction of a biological model it is essential that all theoretically permitted possibilities be taken into account. Differing theories lead to different models and thus to different starting conditions. The corresponding statistical model, in turn, naturally depends on the biological model. This direct dependence between the biological and the mathematical/statistical model can be further distorted by misunderstandings between the statistician and the experimenter, so that both, starting from different premises, commit an incalculable error. This error can only be avoided by continuous discussion between the biologist and the statistician.
2nd Stage: Hypothesis Formation
The results of the choice of the model lead to the formation of hypotheses. Thus, mutagenicity can, for example, be clearly defined by the chosen model and expressed mathematically/statistically. Furthermore, at this stage the variables to be investigated, chosen as a measure of mutagenicity, are characterized so that their distribution functions can be investigated in "control studies". An interaction between Stage 1 (system analysis) and Stage 2 (hypothesis formation) then begins, which is completed only when the biological and statistical models have been adequately tested. In this way a reasonable hypothesis formation is possible.
Fig. 1. Flow diagram of a biological experiment (recoverable box labels: 1st Stage, System Analysis, Biological and Statistical Model; 2nd Stage, Hypothesis; 4th Stage, Evaluation)
3rd Stage: Experimental Plan (Designing the Experiment and the Statistical Methods of Evaluation)
The trial plan is set up on the basis of the model and the hypothesis to be tested. In this trial plan, both the type and extent of the experimental procedure to be carried out and the documentation of the results obtained are laid down. This naturally assumes that no systematic interfering factor can falsify the results of the investigation. One such interfering factor is the non-random allocation of the experimental animals to the various treatment groups. Random means that truly random mechanisms (such as drawing lots, random numbers, etc.) must be used separately for each experiment. The arbitrary selection of the animals is not "random" in the statistical sense, so that any statistical analysis based on it, and the associated generalizations, are worthless.
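The random allocation called for above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `allocate_randomly`, the animal labels, the group names, and the seed value are hypothetical and do not come from the paper.

```python
import random


def allocate_randomly(animal_ids, groups, seed=None):
    """Shuffle the animals with a random mechanism and deal them out to the groups in turn.

    `seed` is a hypothetical convenience: recording it in the trial plan makes
    the allocation reproducible. A new seed should be drawn for each experiment.
    """
    rng = random.Random(seed)
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)
    allocation = {group: [] for group in groups}
    for position, animal in enumerate(shuffled):
        # Dealing in turn keeps the group sizes as equal as possible.
        allocation[groups[position % len(groups)]].append(animal)
    return allocation


# Hypothetical example: 20 males, two groups.
males = [f"M{i:02d}" for i in range(1, 21)]
plan = allocate_randomly(males, ["control", "treated"], seed=1977)
```

The point of the design is that the experimenter never picks an animal for a group by hand; only the shuffled order decides.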
4th Stage: Evaluation of the Experiment and Interpretation of the Results
If an experiment has been carried out according to a pre-determined trial plan with a corresponding sample size, then the data obtained from this experiment are analysed. Statistics supply the necessary procedures for summarizing and compressing the essential information. The statistical procedures selected depend directly on the selected model and the trial plan. They supply statements associated with probabilities (known as "probabilities of error"). These statistical statements can be transferred directly to the biological model and thus make it possible to interpret the experimental results. If, however, there are contradictions between the results of the investigation and the model or the underlying biological premises, then it is necessary to reconsider the stages of the procedure and to modify them.
The various stages just shown are valid for every experimental biological problem. It is important that, in presenting the results and the corresponding analysis, all these stages are clearly shown. Only then is it possible to compare the results of various experimenters. Deviations in any of the 4 stages lead to different results. If, for example, the biological model is not defined clearly enough, this cannot be circumvented by statistical tricks; in other words, decisions that the biologist must make cannot be pushed onto the statistician. This is, however, done in many cases.
Example: Dominant Lethal Test
The general procedures described above for the construction and analysis of biological experiments will now be exemplified by means of the dominant lethal test (DLT) in male mice.
1st Stage: Biological and Statistical Model
Starting from the model described in the literature (Krüger, 1970), with its recommended method of evaluation on the one hand and biologically (factually) inexplicable results on the other, Vollmar and Stucky (1975, 1977) tried to check the premises of the model using a large amount of data from control animals. For this purpose, however, it was necessary to develop a biological "basic model" that could be accepted unanimously by all experimenters. As Ehling (1977) has already described in detail, male animals are treated. The "measuring instrument" for the mutagenic action is the female mouse, so that first of all this "measuring instrument" must be considered. The biological processes that take place inside the female mouse are shown diagrammatically in Figure 2. It must be noted that all the stages described in Figure 2 can occur and every observed result in this system can clearly be attributed to a given stage in development.
Fig. 2. Biological model
In this biological model, the mutagenic potency of a chemical agent may be defined by the following properties:
1. An increase in the probability that a fertilized ovum will not be implanted.
2. An increase in the probability that an implant dies.
The first property is related to the pre-implantation and the second property to the post-implantation egg-loss. Both characteristics indicate the mutagenicity of the substance being investigated. No matter how clear and self-evident the diagram presented in Figure 2 may be, it is difficult to transfer this model to the results of the DLT. The number of ovulated oocytes can only be determined indirectly from the number of corpora lutea. Furthermore, differentiation between "non-fertilizable oocytes", "non-fertilized", and "non-implanted ova" is impossible in the routine DLT procedure, so that these three different possibilities can be combined, on the basis of their
identical pattern, to give what may be termed "quasi pre-implantation egg-loss". The biological model from Figure 2 must therefore be translated into the reality of the routine experiment, i.e. the reduced model shown in Figure 3 must be used as a realistic basis for further consideration.
Fig. 3. Reduced biological model (recoverable labels: C, number of corpora lutea; number of implants; L, number of live implants; number of dead implants)
All the frequencies shown in this model, such as the corpora lutea graviditatis and the live and total implants, can be counted directly or, as in the case of the quasi pre-implantation egg-loss, determined indirectly as the number of corpora lutea graviditatis minus the number of implants. In this model, an indication of the mutagenic potency of a chemical substance is characterized by:
1. A rise in the probability for the occurrence of "quasi pre-implantation egg-loss".
2. A rise in the probability for the occurrence of dead implants.
Although the second finding, i.e. the greater probability for the occurrence of dead implants, is a direct indication of mutagenicity, the first finding leaves open whether the effect is mutagenic or cytotoxic. Further investigations to clarify these special questions must be undertaken. Details have already been given by Ehling (1977), referring to the method of investigation developed by Kratochvil (1975).
Our 1st Stage, namely the construction of a biological and statistical model, is still incomplete. If the present biological model appears to be simple, perhaps indeed trivial, it should be noted at this point that this "trivial" picture is the outcome of protracted and intense discussions and investigations by biologists and statisticians. This biological system analysis, in association with investigations on 3514 control animals¹ of the NMRI mouse strain (Grafe) and 3561 control (101 × C3H)F1 hybrid mice (Ehling), led to the development of the following statistical model:
The sample unit in the DLT in male mice is the male or, with 1:1 mating, the female allotted to it. The development stage of the ovum after which an effect due to mutagenicity can no longer be excluded is embodied indirectly in the number of corpora lutea graviditatis, and for this reason statistical considerations are based on the corpora lutea graviditatis of the mother. From Figure 3 the following can be postulated: the probabilities that a "quasi pre-implantation egg-loss" or a "post-implantation egg-loss" develops from an ovulated oocyte serve as parameters for the statistical model. They are estimated by the ratios:
\[
P_{CQ} = \frac{\text{Number of quasi pre-implantation egg-losses}}{\text{Number of corpora lutea graviditatis}}, \qquad
\frac{\text{Number of implants}}{\text{Number of corpora lutea graviditatis}}, \qquad
P_{CL} = \frac{\text{Number of live implants}}{\text{Number of corpora lutea graviditatis}}
\]
¹ Dr. A. Grafe, Mannheim, and Dr. U. H. Ehling, Neuherberg, are thanked for supplying the data on these control animals.
A clear "test result" can be attributed to each male that has been paired with a female. An alteration in these data compared with a "normal value" (or better, "spontaneous value") by a given amount, for example a reduction of at least 15% in P_CL (the ratio of live implants to corpora lutea graviditatis), may be interpreted as an expression of mutagenicity. Hypotheses can now be developed which can be tested by experiment. This example clearly shows the close interrelationship between biological and statistical models. An alteration of any of the details in this model can lead to other hypotheses and thus to other experimental plans and systems of analysis. However, Stage 1 (system analysis) is not the sole concern of this paper, and the other stages will now be dealt with.
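The ratios of the statistical model can be computed per female from three counts. The following is a minimal sketch, assuming one record per female; the class name `FemaleCounts`, its field names, the dictionary keys, and the validity check are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FemaleCounts:
    """Counts recorded for one female (hypothetical record layout)."""
    corpora_lutea: int   # corpora lutea graviditatis, i.e. ovulated oocytes
    implants: int        # total implants
    live_implants: int   # live implants

    def ratios(self):
        c, i, live = self.corpora_lutea, self.implants, self.live_implants
        # Assumed plausibility check: counts must be nested and c positive.
        if not (0 <= live <= i <= c and c > 0):
            raise ValueError("expected 0 <= live implants <= implants <= corpora lutea, c > 0")
        quasi_pre_losses = c - i  # determined indirectly, as in the text
        return {
            "quasi_pre_losses_per_cl": quasi_pre_losses / c,
            "implants_per_cl": i / c,
            "live_per_cl": live / c,  # the ratio called P_CL in the text
        }


# Hypothetical female: 12 corpora lutea, 10 implants, 9 of them alive.
example = FemaleCounts(corpora_lutea=12, implants=10, live_implants=9).ratios()
```

Because every oocyte either implants or counts as a quasi pre-implantation loss, the first two ratios always sum to 1 for a given female.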
2nd Stage: Hypothesis Formation
On the basis of the biological and the corresponding statistical model, two test hypotheses can be formed, which will be designated H0 (null hypothesis) and H1 (alternative). These two hypotheses are based on the comparison of 2 groups. Group T is treated with the test substance and the other group, C, is untreated. The hypotheses for the DLT are:
H0: Substance group T and control group C do not differ with regard to mutagenic potency, i.e. the substance tested is not mutagenic.
H1: The substance group T exhibits at least 15% fewer live implants than the control group C, i.e. the test substance is mutagenic.
Statistically, this can be formulated as follows:
\[
H_0:\; P_T = P_C \qquad\qquad H_1:\; P_T \le P_C\,(1 - 0.15)
\]
where P = the probability that a live implant will result from an ovulated oocyte.
If one wishes to investigate the 2 probabilities for quasi pre-implantation egg-loss and post-implantation egg-loss individually, then this can be done by a statistical comparison of the corresponding probabilities. The biological term "mutagenicity" is thus also statistically clearly characterized.
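As a worked illustration of the alternative hypothesis, take a purely hypothetical control probability of P_C = 0.80 (an assumed value, not one reported in the paper). The threshold for the treated group is then

\[
P_C\,(1 - 0.15) = 0.80 \times 0.85 = 0.68 ,
\]

so that H1 corresponds to P_T ≤ 0.68. The 15% is thus a relative reduction of the control probability, not a reduction by 15 percentage points (which would give 0.65).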
3rd Stage: Planning the Experiment
The experimental plan on which the DLT should be based cannot be given at once. On the one hand, there are the experimental problems which have already been dealt with extensively by Ehling (1977) (e.g. method of pairing, duration of the treatment period, etc.), and on the other hand, the permissible risk with which the above hypothesis can be tested must be determined. This means that errors of the first and second type must be considered. What are these errors? The error of the 1st type is characterized by the probability that H1 (mutagenicity of the chemical substance) will be accepted although no mutagenicity is present. The probability of this error is also called the "significance level" and is generally designated α; it is the same as the probability of a false positive decision. The error of the 2nd type consists of making a false negative decision; its probability is designated β. β is thus the probability that one will accept H0 although H1 applies. The size of these two errors must be defined before the experiment is carried out; they influence the sample size necessary for the planned experiment. In order to determine the sample size finally, however, further parameters must be known. One parameter is known as the "relevant mutagenic effect"; it is designated Δ and is intended to show the threshold above which the biologist speaks of mutagenicity. Δ characterizes the desired sensitivity of the DLT. For example, in a comparison of treated and control groups, a lowering by at least 15% of the probability for the occurrence of live implants may be designated as mutagenic. Although α, β, and Δ can be freely selected by the experimenter before the experiment, the fourth parameter that affects the sample size is determined by the animals chosen and the experimental conditions in the laboratory concerned. This parameter is known as the "variability" of the experiment.
This variability can be separated into its individual components which, in turn, can be attributed to the various influencing parameters (Table 1). Before the overall variability can be analyzed more closely, a detailed investigation of all the possible influencing parameters is necessary. In doing this, one has to differentiate between systematic and random factors. Systematic errors lead to distortion of the results (thus, for example, a pairing period of 10 days means that there are no results for certain days because experience has shown that fertilization takes place in the first 3 or 4 days); chance errors, on the other hand, increase the inaccuracy of the results. If the possible influencing variables are known and the degree of their effect has been carefully examined, then the complete test system can be optimized to such an extent that the greatest accuracy can be achieved with a foreseeable amount of effort. A 100% accuracy cannot, however, be achieved in any biological experiment, so that more or less large systematic and random errors remain. The systematic error is accounted for by comparing a control group with a treated group of animals, both groups being subject to identical systematic factors. The systematic error should, however, not be so large that hardly any statement is possible. (For example, if in cytogenetic investigations "gaps" and "breaks" are not differentiated correctly as a matter of principle, statistical analysis of the material obtained is no longer possible.)
Table 1. Selection of parameters with systematic and/or random errors that can occur in the DLT
Recoverable column headings: Effect(s) on the parameter; Type of error (Random).
Recoverable entries: maintenance conditions; length of pairing period; variability within and between the animals (P_CL, P_CQ, P_ID); variability of the measured results; measurable results on certain days could be missing.
The random errors can be allowed for by choosing a corresponding sample size, taking into account α, β, and Δ. If one designates the total chance error as σ and the necessary sample size as n, then n depends on σ, α, β, and Δ,
i.e. the larger σ is, the larger n is; the smaller α, β, and Δ are, the larger n must be. In words, this means the following: low variability in the biological system is equivalent to a low experimental effort, while reliable statements require a greater effort. In order to be able to detect small differences one also has to increase the experimental effort. These plausible associations are frequently not fully taken into account, so that, in some circumstances, impermissible generalizations are made, for example that with a fixed sample size n and a pre-determined significance level α a definite effect can be demonstrated. In this way, frequently neither the influencing variables and their effect nor the errors of the 2nd type are taken into account. More details on this problem have been reported (Grafe and Vollmar, 1977). Unfortunately, the above relationship between sample size, variability, errors of the 1st and 2nd type, and biologically relevant differences is not linear, but an extremely complex, frequently unknown, mathematical function. In many cases, as with the DLT, the sample size cannot be determined directly. Extensive and costly simulation runs on a computer are necessary in order to tackle this problem of the sample size. For the mouse strains NMRI-Kisslegg and (101 × C3H)F1, sample sizes have been determined (for detailed information see Vollmar and Stucky, 1977). The data for the simulation runs were taken from the roughly 7000 untreated control animals already mentioned. If one assumes a type 1 error of α = 0.05 and an equally large type 2 error of β = 0.05, then the sample sizes given in Table 2 are obtained.
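The idea of determining the sample size by simulation can be sketched as follows. This is not the procedure of Vollmar and Stucky (1977), whose details are not given here; it is a simplified illustration under loudly stated assumptions: control values are resampled from a small hypothetical pool of per-female live-implant ratios, the treatment effect is modelled crudely by multiplying each ratio by (1 - Δ), and the groups are compared with a one-sided rank-sum test using midranks and a normal approximation. All names and numbers in the block are hypothetical.

```python
import math
import random

# Hypothetical pool of per-female ratios (live implants / corpora lutea).
HYPOTHETICAL_POOL = [0.5, 0.6, 0.7, 0.75, 0.8, 0.8, 0.85, 0.9, 0.9, 1.0]


def midranks(values):
    """Ranks 1..N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_p_lower(treated, control):
    """One-sided p-value for 'treated tends to be lower' (normal approximation)."""
    n1, n2 = len(treated), len(control)
    n = n1 + n2
    pooled = list(treated) + list(control)
    w = sum(midranks(pooled)[:n1])  # rank sum of the treated group
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)  # tie-corrected variance
    if variance == 0:
        return 1.0
    z = (w - n1 * (n + 1) / 2 + 0.5) / math.sqrt(variance)  # continuity correction
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z


def estimated_power(pool, n, delta, alpha=0.05, n_sim=500, seed=0):
    """Fraction of simulated experiments (n animals per group) that reject H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.choice(pool) for _ in range(n)]
        treated = [rng.choice(pool) * (1 - delta) for _ in range(n)]
        if rank_sum_p_lower(treated, control) <= alpha:
            rejections += 1
    return rejections / n_sim


def smallest_n(pool, delta, alpha=0.05, beta=0.05, candidates=range(5, 201, 5)):
    """First candidate group size whose estimated power reaches 1 - beta."""
    for n in candidates:
        if estimated_power(pool, n, delta, alpha) >= 1 - beta:
            return n
    return None
```

A call such as `smallest_n(HYPOTHETICAL_POOL, delta=0.15)` scans the candidate group sizes; with real control data in place of the hypothetical pool, and a realistic model of the treatment effect, the same loop shows why smaller Δ or larger variability drives the required n upward.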
Table 2. Sample sizes for DLT in the male mouse (recoverable column heading: (101 × C3H)F1; the numerical entries did not survive)
a Lowering of the probability that a live implant will arise from an ovulated oocyte by Δ%
The different numbers arise from the different sensitivities required. One can clearly see that for the detection of weak mutagenicity a considerably larger sample is required which, under extreme conditions, can very rapidly reach the limits of practicability. The sample sizes in the two columns show the differences between the investigational conditions in two different laboratories with two different strains of mice. However, the degree to which the mutagenicity of the substance is expressed in both test systems is ignored. In this way, the experimental plan for a given laboratory can be laid down clearly. It is advisable, when drawing up the trial plan, to ensure that the data collection is as standardized as possible. This can be done with the aid of record sheets in which the experimental results can be directly entered and which, in addition, ensure the possibility of computer analysis.
4th Stage: Experiment Evaluation and Interpretation
The 4th Stage, i.e. evaluation, is now reached. Both the determination of the sample size and the trial plan depend on the statistical procedures selected. In my opinion, the modification of the non-parametric Wilcoxon test is most suitable for the DLT. This procedure has been described (Stucky and Vollmar, 1976). In accordance with the biological model, various stages in the development of the oocyte and sperm are followed. A multiple-step procedure has been found practicable. The comparison of the treated and control groups should take place successively in 4 steps. If significance is found at any step, the procedure should be stopped at this step with a corresponding conclusion. The further statistical evaluation steps should not be carried out, as the necessary preconditions are no longer fulfilled and the results can no longer be clearly interpreted. In any case, the fundamental requirement is the strictly random allocation of the animals included in the trial to the different groups. The first 3 steps in the statistical analysis are intended to exclude differences not due to mutagenic action. The 4th step is the actual test for mutagenic action. The individual steps and the corresponding conclusions are:
Step 1: Comparison of Death Rates (= number of dead males / number of males used), i.e. investigation for toxic effect. The dose of substance used should be so low that no visible toxic effects occur. If there is a significant difference on comparison of the death rates, then the analysis should be stopped at this step with the conclusion "dose in the lethal toxic range".
Step 2: Comparison of the Fertilization Rates (= number of fertilized females / number of females used). If significant differences between control and treated animals are observed, then a cytotoxic and, under certain circumstances, a mutagenic action of the substance in the males can be deduced. Depending on the practical determination of the number of fertilized females (e.g. the presence of a plug or the presence of implants), this comparison can also detect the total pre-implantation egg-loss. Separation of a cytotoxic action from a mutagenic action is impossible in this experimental procedure. Since the cytotoxic action cannot be excluded at this point, the analysis should be stopped here.
Step 3: Comparison of the Number of Corpora lutea graviditatis, i.e. the number of ovulated oocytes. In the case of a significant difference between control and treated groups, the presence of other, possibly previously unknown, actions can be deduced (at a given significance level). The analysis should be stopped at this point with the statement that differences in the number of corpora lutea can cause differences in the conversion possibilities, independent of the treatment concerned.
Step 4: Comparison of the Ratios
\[
P_{CL} = \frac{\text{Number of live implants}}{\text{Number of corpora lutea graviditatis}}, \qquad
P_{CQ} = \frac{\text{Number of quasi pre-implantation egg-losses}}{\text{Number of corpora lutea graviditatis}}, \qquad
P_{ID} = \frac{\text{Number of dead implants}}{\text{Number of implants}}
\]
in each case related to one mother. In case of significant differences between the treated and control groups, a mutagenic action of the substance can be deduced. If these differences occur with the ratio P_CQ, then there is quasi pre-implantation egg-loss, and if they occur with the ratio P_ID, then there is post-implantation egg-loss. If they occur only with P_CL, then a weak pre- and post-implantation egg-loss can be concluded.
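The stopping logic of the four steps can be written down compactly. The sketch below is only a control-flow illustration: it assumes that one p-value per step has already been obtained with a suitable test, and the function name, dictionary keys, and conclusion strings are hypothetical paraphrases of the text.

```python
def stepwise_evaluation(p_values, alpha=0.05):
    """Walk through the 4 steps in order and stop at the first significant one.

    `p_values` maps hypothetical step names to p-values computed elsewhere.
    Returns (step at which the analysis stopped or None, conclusion).
    """
    steps = [
        ("death_rate", "dose in the lethal toxic range"),
        ("fertilization_rate", "cytotoxic action cannot be excluded"),
        ("corpora_lutea", "groups differ in the number of corpora lutea, independent of treatment"),
        ("ratios", "mutagenic action of the substance"),
    ]
    for step, conclusion in steps:
        if p_values[step] <= alpha:
            # Later steps are deliberately not evaluated once a step is significant.
            return step, conclusion
    return None, "no mutagenic action detectable at the pre-determined significance level"


# Hypothetical p-values: the analysis stops at step 2 although step 4 is also small.
stopped_at, conclusion = stepwise_evaluation(
    {"death_rate": 0.50, "fertilization_rate": 0.01, "corpora_lutea": 0.90, "ratios": 0.001}
)
```

The early return mirrors the requirement that a significant result in steps 1 to 3 prevents any statement about mutagenicity from step 4.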
If this consecutive procedure finishes with no significant difference at any of the evaluation steps, then one concludes that in this experiment a mutagenic action of the substance was not detectable at the pre-determined level of significance.
Testing of the death and fertilization rates is best carried out with a fourfold (2 × 2) table. The most suitable statistical procedure is the exact test according to Fisher and Yates (Stucky and Vollmar, 1975). With respect to the statistical analysis of the corpora lutea graviditatis, it must be remembered that in the mouse strains examined none of the usual distribution forms (i.e. normal, binomial, negative binomial, or Poisson distribution) is present. This excludes the use of the well-known parametric methods. An adequate procedure is the generalization of the linear rank test with ties (Stucky and Vollmar, 1976). This procedure can also be used for the comparison of transition probabilities, i.e. the ratios P_CL, P_CQ, and P_ID. The alternative hypothesis should be formulated as a two-tailed test for the fertilization rate and the corpora lutea graviditatis, and as a one-tailed test for the death rate and the ratios.
The aim of the dominant lethal test is the investigation of a potential mutagenic action on various development stages of spermatogenesis. In order to investigate these different stages, the male, which has been treated once with the substance, is used at various pairing periods. The statistical investigations are carried out separately for each individual pairing period. However, the results belonging to an excluded male should also be excluded from the previous pairing periods in order to ensure a balanced analysis of the entire experiment. Furthermore, with 1:1 pairing the dependence of the female on the corresponding male is included in the statistical evaluation.
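For the fourfold table of death rates, a one-tailed exact test can be sketched with the hypergeometric distribution. This is a minimal illustration of Fisher's exact test for a 2 × 2 table, not the 2 × c procedure of Stucky and Vollmar (1975); the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-tailed exact p-value for the fourfold table

                 event   no event
        treated    a        b
        control    c        d

    i.e. the probability, with all margins fixed, of observing at least
    `a` events in the treated group.
    """
    treated_total = a + b
    events_total = a + c
    n = a + b + c + d
    upper = min(treated_total, events_total)
    favourable = sum(
        comb(treated_total, k) * comb(n - treated_total, events_total - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, events_total)


# Hypothetical example: 5 of 20 treated males died, 0 of 20 controls.
p_death = fisher_exact_greater(5, 15, 0, 20)
```

For the two-tailed comparison recommended for the fertilization rate, the tail probabilities in both directions would have to be combined, which this sketch does not do.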
Conclusion
The problems of statistical analysis and the experience in statistical processing of biological experiments shown with the aid of this example permit the following conclusions: Rapid, i.e. short-term, statistical solutions are generally impossible. The procedures described in the literature are frequently not adequately substantiated. A reasonable concept can only be produced by intense co-operation between the experimenter and the statistician on a basis of mutual trust. The following 4 stages, (a) model development, (b) hypothesis formation, (c) trial design, and (d) evaluation, should be worked through successively with the same intensity and thoroughness. The notion that one can confront the statistician with a fait accompli (according to the motto, "first the experiment, then, if no clear trend is visible, hand the data to the statistician") is just as erroneous as the commonly held opinion that one should not trouble the statistician with "unnecessary" background information. Biological omissions cannot be corrected with legitimate statistical methods. It is therefore imperative that the statistician be confronted with the problems as early as possible, so that a joint solution can be developed.
References
Ehling, U. H.: Dominant lethal mutations in male mice. Arch. Toxicol. 38, 1-11 (1977)
Grafe, A., Vollmar, J.: Small numbers in mutagenicity tests. Arch. Toxicol. 38, 27-34 (1977)
Kratochvil, J.: Präimplantativer Verlust dominanter Letalmutationen nach Behandlung der männlichen Maus mit MMS. GSF Bericht B 565, 5-17 (1975)
Krüger, J.: Statistical methods in mutation research. In: Chemical mutagenesis in mammals and man (Vogel, Röhrborn, Eds.), pp. 480-502. Berlin-Heidelberg-New York: Springer 1970
Röhrborn, G.: Zytogenetische Untersuchungen bei Säugetieren. Paper presented at the 3rd Meeting of Gesellschaft für Umwelt-Mutationsforschung e. V. (GUM) (1976)
Stucky, W., Vollmar, J.: Ein Verfahren zur exakten Auswertung von 2 x c-Häufigkeitstafeln. Biom. Z. 17, 147-162 (1975)
Stucky, W., Vollmar, J.: Probabilities for tied linear rank tests. J. Statist. Comp. Simulat. 5, 73-81 (1976)
Vollmar, J., Stucky, W.: Zur statistischen Auswertung des Dominanten Letaltests bei der männlichen Maus. GSF Bericht B 565, 24-30 (1975)
Vollmar, J., Stucky, W.: The dominant lethal test. In: Statistics in testing mutagenicity (J. Vollmar, R. J. Lorenz, Eds.). Stuttgart-New York: Fischer 1977
Received December 21, 1976