CLINICAL TOXICOLOGY, 15(5), pp. 559-569 (1979)
Clinical Toxicology Downloaded from informahealthcare.com by UB Giessen on 10/30/14 For personal use only.
Research Design from the Biostatistician's Viewpoint
DAVID S. SALSBURG, Ph.D. Department of Clinical Research Pfizer, Inc. Groton, Connecticut 06340
Copyright © 1980 by Marcel Dekker, Inc.
It was R. A. Fisher who first suggested that the design of experiments was a statistical topic, and it might pay to review the historical context in which this occurred. When Fisher arrived at Rothamsted Agricultural Research Station in 1912, agricultural experiments involving differing strains of grain and artificial fertilizers had been run in England, Denmark, and the United States for almost 100 years. The result had been a large number of journal articles and bitter debates among the protagonists of various points of view. There were agricultural researchers who spent their efforts deriving indices of fertility, and the supporters of one index had nothing but scorn for those who used another. Other protagonists had come to the conclusion that each strain of wheat responded to different patterns of fertilizer in different ways, while still others were convinced that the patterns of rainfall and early spring temperatures caused one fertilizer to be better than another.
In the meantime, the agricultural experimental stations had been collecting data. Fisher found waiting for him 94 years' worth of daily rainfall records, complete weighings of grain and straw from 18 fields that had been planted with various combinations of wheat strains and fertilizer over a similar period, and a vast haphazard collection of weights and counts of potatoes. For almost 100 years, anyone who could gain control of research facilities had pursued his own special kind of experiment, often without any attempt at nonexperimental controls. As a result, it appeared
that Old Red Cluster wheat responded to sulphate of soda, but only when in combination with sulphate of ammonia and when planted in a field that had seen Red Club wheat treated with chloride of ammonia the year before.
Fisher introduced a simple idea to this mess of pottage. He suggested that experiments be designed so that the data accumulated could be fit to a mathematical model of the effect expected. Thus, if you wished to investigate whether sulphate of magnesia worked better on Old Red Cluster than on Red Club, you planted the two crops so that you could estimate the yield of each, both without the fertilizer and with the fertilizer, under nearly identical conditions. He introduced the concepts of additive effects and of interaction among effects, and gave them unambiguous meaning.
What has this to do with good laboratory practices and toxicology? Simply this: the mathematical model that lies behind an experimental design will require certain laboratory practices in order to execute the design, while the existence of other laboratory practices may make some models (and their associated designs) impossible. If you do not mind my being blunt, some of the standard practices of toxicology imply contradictory designs. Because of this, a significant proportion of the toxicology literature resembles the agricultural literature from before 1912. This is particularly true for studies of teratology and lifetime feeding studies of carcinogenicity.
Before I pursue these general ideas, let me show how the use of mathematical models can affect the design of a 90-day toxicology study. I have picked this because, of the many standard toxicological procedures I have investigated, the 90-day study has one of the most consistent and rational structures. We can view the 90-day toxicity study in three different ways in order to evolve a mathematical model:
(A) The purpose of the study is to discover the potentially toxic activity of the compound.
(B) The compound is assumed to be toxic at a sufficiently high dose, and the purpose of the study is to discover that dose at which those toxic events do not occur.
(C) The compound is assumed to be toxic at a sufficiently high dose and to carry a risk of toxicity at all doses, and the purpose of the study is to profile the toxic effects associated with doses to be encountered in human use.
In some sense, these three views represent an increasing sophistication about the meaning of chronic toxicity. Because of this, the mathematical models they imply can be nested within one another. Before I propose such models, let me make what may be an outlandish statement to those of you who may soon recoil at my abstract notation. The mathematical model is a crude and false simplification of reality. It is created not because we think it is a perfect description of the experimental situation. We create it because we want to identify the main effects we can expect to see. We use the model to estimate
the number of experimental units we need in order to detect main effects of the degree we consider worth noting, and to determine what concomitant variables must be controlled in order to make sufficiently accurate estimates of the main effects.
A mathematical model which describes the simple elements of the first view is one which assigns to each type of lesion a probability of occurrence in a given animal. If there is no effect of the compound, then those probabilities are the same across all doses of the compound that would be used. If there is an effect, then we might restrict attention to only those effects that increase probability with increasing doses. What does this imply about the design? The theory of multinomial probability distributions, which is the simplest version we can use, says that we can detect a small increase in probability best if the control animals have very low probabilities of lesion and if the animals are divided half and half between controls and the highest possible dose. If you don't like this design (and I, for one, do not), it may be because you believe in a more complicated version of the 90-day toxicology study than this simple model will allow.
If (B) is your choice, you are looking for a "no effect" dose. Then your mathematical model has to involve some sort of dose-response curve. The model may consist of a family of dose-response curves. If we restrict the model to a specific family, we can get away with only two or three doses in order to estimate the parameters of that model. If, however, we wish to distinguish between different families of dose-response curves, we need enough doses to distinguish the shape characteristics of the dose-response curve. If, in particular, you are looking at families of dose-response curves that include thresholds, you should recognize that the problem of finding optimum designs for the estimation of thresholds remains a major unsolved problem in statistics.
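The point about needing enough doses to tell families of dose-response curves apart can be illustrated with a small sketch. It is an illustration only, not part of the original talk: the two families chosen (logistic and probit curves, each linear in log10 dose), the doses, the lesion rates, and all function names are assumptions made for the example. With only two doses, each two-parameter family passes exactly through both observed rates, so the data cannot choose between them, yet the two curves disagree noticeably at doses outside the observed range.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit family


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(y):
    return 1.0 / (1.0 + math.exp(-y))


def probit(p):
    return _STD_NORMAL.inv_cdf(p)


def inv_probit(y):
    return _STD_NORMAL.cdf(y)


def fit_two_points(link, d1, p1, d2, p2):
    """Solve link(p) = a + b*log10(dose) exactly through two (dose, rate) points."""
    x1, x2 = math.log10(d1), math.log10(d2)
    y1, y2 = link(p1), link(p2)
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b * x1
    return a, b


def predict(inverse_link, params, dose):
    """Predicted lesion probability at a dose for a fitted (a, b) pair."""
    a, b = params
    return inverse_link(a + b * math.log10(dose))


if __name__ == "__main__":
    # Hypothetical lesion rates at two doses (illustrative numbers only).
    d1, p1, d2, p2 = 10.0, 0.10, 100.0, 0.50
    logistic_fit = fit_two_points(logit, d1, p1, d2, p2)
    probit_fit = fit_two_points(probit, d1, p1, d2, p2)
    for dose in (1.0, 10.0, 100.0, 1000.0):
        print(f"dose={dose:7.1f}  "
              f"logistic={predict(inv_logit, logistic_fit, dose):.4f}  "
              f"probit={predict(inv_probit, probit_fit, dose):.4f}")
```

Both fitted curves reproduce the two observed rates, so only additional doses, placed where the families differ in shape, could separate them.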
In addition to the need for a large number of doses, the need to distinguish between slight effects at doses close together requires that the probability of lesion in the control group be fairly high (10% or better) and that the effects of the middle dose of the compound run around 50%. Thus (A) and (B) require drastically different designs. It is foolish to expect to apply the view of (A) to a design created with (B) in mind, and any compromise between the two designs can only be inadequate to both.
View (C) suggests a mathematical model that requires more than the occurrence or nonoccurrence of a lesion. It requires that the lesions be characterized or scaled by severity. The degree of severity of the lesion then becomes the measure which is modeled to a dose-response curve. If you seek to model the severity of lesion associated with expected use doses, then you need experimental doses near the expected use. You also need doses higher and lower than the expected use in order to estimate the "slope" of the dose-response curve with maximum efficiency.
The current design of the 90-day chronic toxicology study is most in keeping with the requirements of view (C). We use three doses and
a control group. The high dose is designed to produce lesions that are specific to the compound being tested. The toxicologist examines the middle and low doses for the occurrence of the same lesion to determine the severity of those lesions, if he sees them. These two intermediate doses are designed to bracket the maximum expected use dose. If he is lucky, he will be able to call one the minimum toxic dose and the other the maximum nonobservable effect dose. One or both of these will then be used to propose maximum human exposures. His dose-response curve is crude. It contains four possible values: nothing expected, nothing seen, minimal effect seen, major effects seen, and he is forced to characterize the effects of any dose into one of those four categories. But it is a dose-response pattern and, as such, it is part of a mathematical model.
Unfortunately, not all toxicological studies are as well designed as the 90-day toxicology study. Other designs often include bits and pieces of various ideas, each one of which seems good at the time, but many of which are optimum only for conflicting models. For instance, it is my impression, reading through the FDA's Good Laboratory Practices regulations, that the writers of those regulations thought of toxicity studies primarily as means of discovering unknown toxic effects. The mathematical models implied in such thinking are ones that allow for estimates of the probability that an animal will have a particular lesion. Care is to be taken, by the use of randomization of assignment, by careful tabulation of all tissues, etc., to be sure that the estimates of probability will be unbiased. But there are peripheral, if not contradictory, ideas in the Good Laboratory Practices also. For instance, great care is to be taken to ensure that the amount of compound ingested is accurately determined.
However, the amount of random error involved in estimating probabilities of effects from counts of animals is often much greater than any reasonable errors involved in dose levels. Thus a certain amount of the care required for feed analysis will be lost in the random noise associated with the observed incidence of lesions. In fact, if we model a toxicity study in terms of the probability of lesions and think of it as a situation for standard statistical tests of hypothesis about changes in those probabilities as a result of test compounds, then some of the tightness implied by the GLP can be weakened without affecting the ability of the study to reach conclusions very much.
Tables 1 and 2 might be entitled "How sloppy can you get?" They represent a robustness analysis of a controlled toxicology trial. Robustness analyses are commonplace in operations research. The idea is to set up a mathematical model of the decision process and then to determine how great a change is needed in the parameters of that model before the decisions that result are materially changed.
If you will bear with a little mathematical notation, Table 1 shows a model in which N animals have been put on a given dose of the compound and some are lost, due to autolysis, escape, animal handler error, or what have you. It requires only that the probability that a given animal will be lost be independent of both the dose and the
TABLE 1. Mathematical Model of Error, Independent of Dose or Lesion

Numbers of animals    With lesion    Without lesion    Total
Observed                   X               U              W
Not observed (a)           Y               V           (N - W)
Total                      Z            (N - Z)           N

(a) Lost to observation due to autolysis, escape, etc.

X | W ~ Binomial(W, p)
X | Z ~ Binomial(Z, r)
W ~ Binomial(N, r)
Z ~ Binomial(N, p)
X ~ Binomial(N, rp)

Then
(1) X/W is an unbiased estimator of p
(2) [not legible in the source]
(3) The effect of r < 1 is to reduce power

Example: Observe
X = 12 animals with bile duct inflammation
W = 47 animals with livers available for histopathological examination
N - W = 3 animals lost due to autolysis
12/47 = .255 estimates p. The value of r is unknown. If r = .90, and if the true value of p = .25, then variance of X/W = .0042, and .128 ≤ p ≤ .382 with 95% confidence.
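The arithmetic of the Table 1 example can be retraced with a short sketch. It is an illustration rather than the author's own computation: it assumes that the variance of X/W is approximated by p(1 - p) divided by the expected number of observed animals, N times r, and that the 95% interval is the usual normal approximation (estimate plus or minus 1.96 standard errors). N = 50 follows from W = 47 and N - W = 3; the function name `table1_example` is hypothetical.

```python
import math


def table1_example(x, w, n, r, p_true, z=1.96):
    """Retrace the Table 1 example under the assumptions stated above.

    x: animals observed with the lesion
    w: animals observed in total
    n: animals started on the dose
    r: assumed probability that an animal is NOT lost to observation
    p_true: assumed true probability of the lesion
    """
    p_hat = x / w                                  # X/W, the estimator of p
    variance = p_true * (1.0 - p_true) / (n * r)   # p(1 - p) / E(W), with E(W) = N * r
    half_width = z * math.sqrt(variance)           # normal-approximation half-width
    return p_hat, variance, (p_hat - half_width, p_hat + half_width)


if __name__ == "__main__":
    p_hat, variance, (low, high) = table1_example(x=12, w=47, n=50, r=0.90, p_true=0.25)
    print(f"estimate of p       = {p_hat:.3f}")
    print(f"variance of X/W     = {variance:.4f}")
    print(f"approx. 95% interval = {low:.3f} to {high:.3f}")
```

Under these assumptions the variance rounds to the table's .0042, and the interval agrees with the table's .128 to .382 to within rounding of the intermediate values.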

lesion being examined. Then, although Z of the N animals would have come down with the lesion, we can only observe X of them and only U of those who are without lesion. We have no way of knowing the numbers Y and V. If (1 - r) is the probability of loss, it turns out that our best unbiased estimator of the probability of that lesion is X/W. In other words, it is best to ignore the animals lost to observation. The example shows what this might mean in a specific case.
However, the loss of animals will make it more difficult to detect a slight increase in the probability of lesion due to the compound at test. How much more difficult? Let me defer an answer to that and first consider a slightly more complicated model of error. Let us assume that, somehow, the lesion itself induces a further loss of animals to observation (Table 2). Suppose, for instance, that
TABLE 2. Mathematical Model of Error, One Dependent on Lesion but Independent of Dose, Another Independent of Lesion or Dose

Numbers of animals     With lesion    Without lesion    Total
Observed                    X               U              W
Loss due to lesion          X'
Not observed                Y               V           (N - W)
Total                       Z            (N - Z)           N

X | X + X' ~ B(X + X', s)
X + X' | W ~ B(W, p)
X + X' | Z ~ B(Z, r)
X ~ B(N, rsp)
U ~ B(N, r(1 - p))

Then
(1) There is no unbiased estimator of p
(2) The "best" test statistics are X/N or U/N
(3) E(X/N) = rsp, E(U/N) = r(1 - p)
(4) Var(X/N) = rsp(1 - rsp)/N
(5) The effects of r < 1, s < 1 are to reduce power and increase bias

Example: X = 12, W = 47 as before. We do not know
(1 - s) = probability of loss due to lesion, or
(1 - r) = probability of loss independent of lesion.
But let us suppose s = .95, r = .95. That is, half the losses are due to the lesion. Then X/N = .240 and U/N = .76. Variance (X/N) = .0035, and the product .124 ≤ rsp ≤ .356 with 95% confidence.
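The figures in the Table 2 example can be retraced in the same way. This is a sketch under stated assumptions, not the author's computation: it uses item (4) of the table, Var(X/N) = rsp(1 - rsp)/N, takes N = 50 and p = .25 from the Table 1 example (the Table 2 example does not restate them), and uses the normal approximation for the interval. The function name `table2_example` is hypothetical.

```python
import math


def table2_example(x, n, r, s, p_true, z=1.96):
    """Retrace the Table 2 example under the assumptions stated above.

    r: assumed probability of NOT being lost independently of the lesion
    s: assumed probability of NOT being lost because of the lesion
    """
    statistic = x / n                       # X/N, which estimates the product r*s*p
    rsp = r * s * p_true                    # expected value of X/N, item (3)
    variance = rsp * (1.0 - rsp) / n        # item (4)
    half_width = z * math.sqrt(variance)    # normal-approximation half-width
    return statistic, rsp, variance, (statistic - half_width, statistic + half_width)


if __name__ == "__main__":
    stat, rsp, variance, (low, high) = table2_example(x=12, n=50, r=0.95, s=0.95, p_true=0.25)
    print(f"X/N                  = {stat:.3f}")
    print(f"E(X/N) = rsp         = {rsp:.4f}  (true p assumed to be 0.25)")
    print(f"variance of X/N      = {variance:.4f}")
    print(f"approx. 95% interval = {low:.3f} to {high:.3f} for rsp")
```

The gap between rsp and the assumed p is the bias the table refers to: X/N centres on rsp, which falls below p whenever r or s is less than 1.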
it is a lethal lesion and some of the animals thus affected are lost to autolysis. Now, the animals with lesion fall into three categories: X, those observed; X', those lost because of the lesion; and Y, those lost due to the original random occurrence of chance losses. Remember, we can never know the values of X' and Y. In the example, only X = 12 is known. In this particular model it turns out that we cannot get an unbiased estimator of the probability of lesion. We can test if that probability has changed because the fraction X/N estimates a function of that probability, and differences in that function will reflect differences in the underlying probabilities of lesion only. But, as in the first and simpler model, reductions in the number of observable
events decrease the power. Unlike the simpler model, they also increase the bias of the estimators.
FIGURE 1. Equal power contours for 80% power. Power = Prob{detection of a test-compound-induced lesion}; Prob{lesion in controls} = p0; Prob{additional loss of tissue due to lesion} = 1 - s. Horizontal axis: k-fold increase in Prob{lesion} due to test compound. [Graph not reproduced.]
How badly is the power reduced? To answer this question I have plotted equal power contours for 80% power when the probability of lesion in the controls is 1 and 5% (Fig. 1). The upper edge of this graph represents the case where r = 1, where no animals are lost to error. An 80% power curve consists of those points where we are 80% or more sure of being able to detect an effect if we use a statistical test of hypothesis of no effect at the nominal 5% level. This graph represents 50 animals per dose and compares a single dose group to controls. The lines are plotted for s = 1.0, 0.8, and 0.6. At s = 1.0 there are no losses due to lesion. Thus the point where the s = 1.0 line meets the upper edge represents the degree of effect we are 80% sure of detecting if we have no losses of animals at all. Note that if the underlying probability of lesion in the controls is 0.01, we need a true increase of better than 10-fold to be 80% sure of getting a statistical significance. Thus this corner of the graph suggests that standard statistical tests of hypothesis are
not of much value for slight increments in rare types of lesions. If the toxicologist sees, at a given dose, only a few highly unusual and unexpected lesions, he should not expect verification from a statistical test, since his common sense will warn him that something has happened long before a statistical test will. However, when the underlying probability of lesion is 5% or more, statistical tests tend to become more useful. For such a situation, the 80% power line for s = 1.0 crosses the upper edge to the left of the 2-fold increase line. But notice how steep these curves are for "r" greater than 0.75. It would appear from this graph that one could lose up to 25% of the animals due to error without materially affecting the decision process. Furthermore, if we shift to the line for s = 0.8, it appears that a further loss of 20% of the animals with the lesion will not affect the decision very much.
So, how sloppy can you get? If we view a toxicology study as a test of no effect, then it appears that we can lose about 25% of the animals to random error and another 0.75% = (.20)(.75)(.05)(100) to random error associated with the lesion being investigated. The validity of these conclusions, of course, depends upon the validity of the model, and it in turn depends upon the validity of the view that generated the model. You, as the scientists whom I ask to believe these results, should examine the model and its assumptions very carefully to determine if they are valid.
Let me end with an example of how what appears to be a good laboratory practice can, in fact, introduce a considerable amount of error into the conclusions (Table 3). It is common practice in a lifetime carcinogenic study to collect standard sections of specific organs for histopathological examination. In addition, slices are taken of suspicious "bumps." On the face of it, this seems like a reasonable thing to do.
Suspicious "bumps" are candidates for malignant tumors and should be examined under the microscope. Let us model this procedure and see what it does to the statistical tests of hypothesis. The rule is "One standard slice from a given organ plus slices from any suspicious 'bumps.'" Suppose there are "m" possible sites for such bumps. For instance, suppose we are looking at mammary tumors and any one of eight mammary glands can have a bump. Suppose further that the probability of such a bump occurring is "r," while the probability that a given slice of tissue will contain a microscopic tumor is "p." If Y is the number of bumps that are found to contain malignant cells, and if Z is 1 or 0, depending upon whether the routine slice did or did not contain malignant cells, then a given animal will be declared to have a malignant tumor if either Z or Y is greater than zero. Thus the number of animals found to have malignant tumors will be a binomial variate with probability that any one is found indicated by "s." In the example, p, the true probability of tumor, is the same for all groups, but r, the probability of unrelated tissue masses, increases with dose.
This is the model. What are some of its consequences? Suppose we are looking for mammary tumors in a strain of rats (like the
TABLE 3. Mathematical Model of the Effects of Rule: One Standard Slice of a Given Organ Plus Slices from Any Gross Suspicious "Bumps"

m = Maximum number of possible "bumps"
X = # of "bumps" actually sliced
    X ~ B(m, r)
Y = # of "bump" slices found to contain malignant growths under the microscope
    Y | X ~ B(X, p)
    Y ~ B(m, rp)
Z = 0 if standard slice has no malignant growths, 1 otherwise
    Prob{Z = 1} = p
W = # of animals with at least one malignant growth discovered
    W ~ B(n, s), s = 1 - (1 - rp)^m (1 - p)

Example: m = 8, p = .20 (p constant across all doses), r changes with dose

Dose        r_i     s_i     Probability of significance when compared to controls
Controls    .05     .262
Low         .10     .319    20%
High        .20     .423    54%
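The s column of Table 3 follows directly from the formula s = 1 - (1 - rp)^m (1 - p) given in the table. The short sketch below simply evaluates that formula for the example values (m = 8, p = .20); the function name `detection_probability` is a hypothetical label chosen for the illustration.

```python
def detection_probability(r, p, m):
    """Probability that an animal is declared to have a malignant tumor.

    r: probability that a given site produces a gross "bump" that is sliced
    p: probability that any one slice contains a malignant growth
    m: number of possible "bump" sites
    Implements s = 1 - (1 - r*p)**m * (1 - p) from Table 3.
    """
    return 1.0 - (1.0 - r * p) ** m * (1.0 - p)


if __name__ == "__main__":
    m, p = 8, 0.20
    for dose, r in (("Controls", 0.05), ("Low", 0.10), ("High", 0.20)):
        s = detection_probability(r, p, m)
        print(f"{dose:8s}  r = {r:.2f}  s = {s:.3f}")
```

The three values come out as .262, .319, and .423, matching the table, while the true tumor probability p stays at .20 throughout; with r = 0 the formula reduces to s = p.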
Sprague-Dawley) which have a 20% chance of having a mammary tumor in the controls. Suppose we allow each of the eight mammary glands to have an opportunity to have a bump. Now, suppose the compound at test tends to produce benign fibrous masses in the mammary gland, the probability of such a mass increasing with dose. The probability of malignancy remains the same with increasing doses, but the fibrous masses are of such heterogeneous material as to allow for the random occurrence of malignant microscopic tumors in them with the same probability they can occur in surrounding tissue. If the incidence of these fibrous masses rises from 5 to 10 to 20% with increasing doses, then the probability that a given animal will be observed with a malignant tumor rises from .262 to .423 without any change in the true probability that an animal will have a malignant tumor. Thus the use of this rule can produce an apparent increase in tumorigenesis that will be statistically significant over 50% of the time.
Is such a possibility far-fetched? It need not occur only with mammary tumors. If the compound at test produces nodules in the liver, some of these may be sufficiently suspicious on gross pathology to
call for sections. Or suppose a compound-induced lesion tends to create a lump in the subcutaneous tissue of an animal prone to have tumors there. All this model requires is that the gross "bump" have histopathology similar to that of surrounding tissue. How many situations can occur?
I started this talk with a historical look at agricultural research in 1912. A great deal of toxicology is more sophisticated than that literature. However, there are places in toxicology where the chaotic confusion of agriculture in 1912 is all too real. Unfortunately, we do not have an R. A. Fisher to untangle it for us, but perhaps we can use some of his insights to eliminate the more obvious errors in our designs.
QUESTION AND ANSWER SESSION
Q: Can you think as a statistician, along with establishing criteria and standards for migration, let's say, for materials, from packaging material into foods, say at 50 parts per billion, are you able to think in terms of using, say, Mantel-Bryan procedures on some standardized basis at some level?
A: Since Nathan Mantel and I have had some very strong disagreements on this subject, I should make clear where I stand. I don't think it's sensible to use statistical techniques to project risk associated with very small doses of something on the basis of doses that are well beyond that level. Paul Levy, one of the leading probabilists, once said that prediction is very hard, especially the future, and what he meant was that extrapolation doesn't make any sense. I can recognize the bind that the FDA has felt itself locked into by the Delaney clause, but my feeling is the Mantel-Bryan procedure or any other procedure is purely black magic. The fact that you do go through a complicated mathematical procedure and you come up with a value of one in a million as a probability doesn't mean that you have one in a million probability; it means you have come up with a complicated mathematical procedure that has produced a number. If you would like a more simple procedure, take the lowest dose you have used, divide by the number of lymphomas you saw in the mice, multiply by n, and divide by 3,000. It is the same sort of procedure. I think that you people, you toxicologists, should stand up on your hind legs and scream and holler. You are being bamboozled by a purely mathematical procedure which is designed to ease the conscience of regulators and Congressmen who aren't willing to face the consequences of what the Delaney clause really says. That's a strong position.
Comment: Thank you very much, I think you understood my question. A lot of people are looking toward this kind of thing to get out of the hassle that's been going on. If you could set a number, as we do with migration, I don't know what that number would be. You don't like to think of those things.
Comment: I work in a purely research environment so I have the luxury of thinking of the scientific meaning of these things, especially since my company doesn't make any of these products. I would like to concentrate on one thing though. The Mantel-Bryan extrapolation has taken us away from a very important question, and that is: can we use these lifetime carcinogen studies to project the potential toxicity of human use of compounds like drugs and compounds that are faced in the work place, where you are actually going to deal with levels that are close to those you can use in animals? I think we should give a great deal of serious thought as to whether we can use statistics for those purposes.
Q: How do you like to see the handling of protocols with respect to statistical review? Do you expect, in your corporation, to see every protocol that is written? Do you decide up front which ones you will see? Do you sign off on every one of them? How is that handled in your shop?
A: Let me first point out that we have a small statistical operation at Pfizer, and it is associated with the Department of Clinical Research, and we provide consulting services to the toxicology people. We haven't yet reached any formal structures as to what's going to be done with the toxicology protocols. The clinical research protocols are definitely signed off by a statistician, who then becomes personally responsible for all of the mistakes in the design that he can holler about when the study comes back. I think that we are searching, as well as everybody else, as to what the mechanism would be.