STATISTICS IN MEDICINE, VOL. 11, 1075-1089 (1992)

ORGANIZATION AND ANALYSIS OF SAFETY DATA USING A MULTIVARIATE APPROACH

CHRISTY CHUANG-STEIN AND NOEL R. MOHBERG
Research Support Biostatistics, 9164-32-2, The Upjohn Company, Kalamazoo, MI 49001, U.S.A.

AND DAVID M. MUSSELMAN*
Product Support, 9159-298-236, The Upjohn Company, Kalamazoo, MI 49001, U.S.A.

SUMMARY

The collection of safety data is an important part of clinical trials. These safety data are often described and reported in great detail with expenditure of substantial effort and energy. Because of the wide variety of data that require scrutiny from the safety perspective, however, statistical comparisons of the safety profiles of different treatments often lack focus and structure and result in situations where the comparisons for each individual item lack power and thus are inconclusive. In this paper, we propose to organize the safety data into a more manageable form by consolidating them into a number of $K$ classes characterized by body systems and determined in conjunction with the underlying disease as well as the treatments involved. Within each class, we propose assignment to each patient of an overall intensity grade based on all relevant information. The consolidation of the safety data as proposed provides an informative summary for the safety profile of each treatment. The analysis of such organized data concentrates on comparison of the mean intensity grades for different treatments within the $K$ classes simultaneously with use of scores that reflect the acceptability of the various intensity levels to an individual. Furthermore, we demonstrate that the proposed multivariate comparison has much higher power than the univariate one to detect differences in certain cases. We provide examples to illustrate the proposed procedure.

1. INTRODUCTION

Although most of the attention in statistical analyses of clinical trial data focuses on efficacy of the study treatment relative to a control, a substantial portion of the data, in fact, pertain to safety concerns. From our experiences, it is not unlikely that 70 per cent to 80 per cent of the data collected and effort expended concern safety, but these data have little relevance in demonstrating efficacy. Because of the surprising lack of analytic tools to handle the broad spectrum of safety data collected, we direct our attention in this paper to the organization and analysis of these data from clinical trials.

*Present address: The USCI Division, Billerica, MA 01821, U.S.A.

0277-6715/92/081075-15$07.50 © 1992 by John Wiley & Sons, Ltd.

Received September 1991 Revised November 1991


C. CHUANG-STEIN, N. R. MOHBERG AND D. M. MUSSELMAN

Statisticians have historically emphasized the analysis of efficacy data in clinical trials (see Peace¹). In some situations, however, it is impossible to separate efficacy clearly from safety. For example, in many psychopharmacological trials, the same instrument measures efficacy and safety. Thus, statistical procedures that address safety in these situations have efficacy implications as well.

It is understandable that statistical methodology for analysing safety data is scant because safety data come from many sources and may differ considerably in nature. They may include observations by the clinicians (signs), complaints of untoward events from the patients (symptoms), clinical laboratory assay reports, and physiological test results such as ECG and CT scans. Because there usually are no definite rules governing collection of safety data, their recording is often both amorphous and irregular. Furthermore, since there are so many different medical events that may be reported, existing statistical methods to compare treatments in regard to each individual event are frequently inappropriate. In fact, it is often difficult to define a safety analysis that is not in part data-driven.

The most common method to analyse the safety data is to compare the incidence of each itemized medical event as mentioned above (see Peace¹) or to compare the mean changes from baseline if one suspects some laboratory abnormalities. An example of the former is the comparison of the incidence rates of nausea between the test drug and the control. Such a comparison will possess all the desirable statistical properties when planned in advance in the protocol with safety concerns clearly specified. Unfortunately, statisticians often encounter a number of reports from different sources that have safety implications and they are asked to make comparisons based on these reports. Several statistical problems ensue.
For example, can one compare an elevated liver enzyme on one drug directly with a report of nausea and jaundice on the other drug? Can one deal with ECG and clinical laboratory data simultaneously? In addition to the question as to what one can compare, there is an inherent multiple hypothesis testing problem that can erode the p-values and reduce all but the most dramatic differences to statistical non-significance. Ironically, this problem pervades almost all NDAs (new drug applications) that entail several dozen or even hundreds of comparisons on the reported medical event rates. The validity of the conclusions based on such comparisons is often questionable.

Brown et al.² proposed a method to compare the safety of two antibiotics based on laboratory data obtained during and after therapy. They first selected tolerable limits for laboratory results and ranked abnormalities according to the frequencies of abnormal values detected. They then analysed pairs of related test results concurrently. Sogliero-Gilbert et al.³ proposed a way to combine laboratory results to study the laboratory abnormality profiles of drugs. They constructed a score, called the Genie score, for each functional group for each patient as well as an overall score. Gilbert et al.⁴ also used Genie scores to compare drug safety in controlled clinical trials. While the concern of the above authors was exclusively with laboratory abnormalities, our objective in this paper is to consider all safety data including laboratory results. Thus, the basis for the comparison of treatment safety proposed in this paper is much broader than those available in the literature.

In this paper, we propose to structure the massive safety data into a more manageable framework by consolidating them into a number of $K$ classes characterized by body systems and determined in conjunction with the underlying disease as well as the treatment(s) involved.
Within each class, we propose assignment to each patient of an overall intensity grade based on all relevant information. The determination of the intensity grade allows for the combination of information from different sources. The analysis of such organized data concentrates on comparison of the mean intensity grades for different treatments within the $K$ classes simultaneously with use of scores that reflect the acceptability, to an individual, of the various intensity levels. The approach lends itself to conducting a comparison specified a priori and facilitates the understanding of the study outcome as regards drug safety. Furthermore, we demonstrate that the proposed multivariate comparison has much higher power than each of the existing itemized comparisons to detect differences in certain cases. We provide examples to illustrate the proposed procedure.

2. A PROPOSAL TO ORGANIZE AND ANALYSE THE SAFETY DATA

To organize the safety data, we assume that we can summarize the available information regarding patients' safety experience in a study into $K$ classes characterized by body systems and determined in conjunction with the underlying disease as well as the treatment(s) involved. An example of such classes is: cardiovascular; hepatic; haematologic; CNS; pulmonary; gastrointestinal; neurologic; endocrine; and metabolic. (Depending on the circumstances, one might find it desirable to treat hepatic as part of gastrointestinal.) The determination of the $K$ classes is the first step under our approach and should precede any data analysis.

Within each class, we assume that we can determine an overall intensity grade (for example, none, grade 1, grade 2, etc.) for each patient, based on clinical observations as well as those test results considered as related physiologically to the body system representative of the class. For convenience, we call the various intensity grades 'intensity levels' or 'levels' for short. For example, we call grade 0, which suggests no adverse reaction, 'level 0'. We label grade 1 as 'level 1', etc. For our presentation, we assume the same number of intensity levels for all classes and denote it by $J$. The process of consolidating the safety information in this way requires extensive input from medical personnel, as we illustrate with the example in Section 3. Like the determination of the $K$ classes, the decision on intensity grades should be objective and take place before any data scrutiny.

To analyse the consolidated safety data, we assume that within class $i$ ($i = 1, \ldots, K$), individuals can assign scores $\{w_{ij},\ j = 0, \ldots, J-1\}$ to the $J$ intensity levels in such a way that $w_{i0} = 0$ ($i = 1, \ldots, K$) and that $\{w_{ij}\}$ reflect the acceptability, to the individuals, of the various intensity grades (we assume high scores suggest low acceptability).
One can assign scores separately within each class without taking into account the relative seriousness of the $K$ classes. One can also assign scores that reflect not only the acceptability of the various intensity levels within a class, but also the seriousness of the various classes relative to one another. Since the assignment of the scores pertains to an individual's perspective, and therefore has a subjective component, results from the statistical comparisons discussed in this section are best interpreted at the individual decision level. We wish to point out here that this subjectiveness is deliberately built into the approach to allow treatment selection to be made at the individual level.

Let $x_{ij}$ denote the number of individuals whose intensity level within the $i$th class is $j$. Let $p_{ij}$ denote the probability of the $j$th level within the $i$th class and $N$ the total number of patients in a treatment group. It is easy to see that $\sum_j x_{ij} = N$ for all $i$. We emphasize that under the proposed method, we use all relevant information to arrive at a single intensity grade for a patient within a class. Let $f_{ij} = x_{ij}/N$ denote the observed proportion of the $j$th intensity level within the $i$th class, and $s_i = \sum_j w_{ij} f_{ij}$ the observed mean score for the $i$th class. We can easily show that the expected value of $s_i$ is $m_i = \sum_j w_{ij} p_{ij}$. Furthermore, let $S' = (s_1, \ldots, s_K)$; then $E(S') = (m_1, \ldots, m_K)$. Denote $\mathrm{cov}(S) = V = (v_{ij})$. We now show how to compute $V$.


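Define $W$ and $F$ as

$$W = \begin{pmatrix} w_1' & 0 & \cdots & 0 \\ 0 & w_2' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_K' \end{pmatrix}, \qquad F = (f_{10}, \ldots, f_{1,J-1}, \ldots, f_{K0}, \ldots, f_{K,J-1})',$$

where $w_i' = (w_{i0}, \ldots, w_{i,J-1})$, so that $W$ is a $K \times KJ$ matrix and $F$ is a $KJ \times 1$ vector.

We can show that $S = WF$ and $\mathrm{cov}(S) = W\,\mathrm{cov}(F)\,W'$, where $\mathrm{cov}(F) = (\mathrm{cov}(f_{ij}, f_{i'j'}))$ is the covariance matrix of $F$. Thus, our task is to find $\mathrm{cov}(F)$. Assuming that $(x_{i0}, x_{i1}, \ldots, x_{i,J-1})$ within the $i$th class has a multinomial distribution with $\{p_{ij}\}$ as the multinomial probabilities, we have

$$\mathrm{cov}(f_{ij}, f_{ij'}) = \frac{1}{N}\,(\delta_{jj'}\,p_{ij} - p_{ij}\,p_{ij'}),$$

where $\delta_{jj'} = 1$ if $j = j'$ and $0$ otherwise.

To compute $\mathrm{cov}(f_{ij}, f_{i'j'})$ for $i \neq i'$, we need to consider the joint distribution of the intensity levels in the $i$th and the $i'$th classes, for which $\{x_{ij},\ j = 0, \ldots, J-1\}$ and $\{x_{i'j'},\ j' = 0, \ldots, J-1\}$ are the observed marginal totals. Let $p_{ij,i'j'}$ denote the probability that the intensity levels in the $i$th and the $i'$th classes are at the levels of $j$ and $j'$, respectively, and $x_{ij,i'j'}$, $f_{ij,i'j'}$ the corresponding observed frequencies and proportions. Then $\{x_{ij,i'j'},\ j, j' = 0, \ldots, J-1\}$ has a multinomial distribution with parameters $\{p_{ij,i'j'},\ j, j' = 0, \ldots, J-1\}$. We can then compute the covariance between $f_{ij}$ and $f_{i'j'}$ as

$$\mathrm{cov}(f_{ij}, f_{i'j'}) = \mathrm{cov}\Big(\sum_{l=0}^{J-1} f_{ij,i'l},\ \sum_{k=0}^{J-1} f_{ik,i'j'}\Big) = \frac{1}{N}\,(p_{ij,i'j'} - p_{ij}\,p_{i'j'}).$$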
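As a worked consequence of the within-class and between-class covariances of the observed proportions (a restatement that follows from the multinomial assumptions already made, not an extra assumption), the elements of $W\,\mathrm{cov}(F)\,W'$ can be written directly in terms of the scores. For two different classes $i \neq i'$, the $(i, i')$ element is

$$\sum_{j=0}^{J-1}\sum_{j'=0}^{J-1} w_{ij}\,w_{i'j'}\,\mathrm{cov}(f_{ij}, f_{i'j'}) = \frac{1}{N}\Big(\sum_{j=0}^{J-1}\sum_{j'=0}^{J-1} w_{ij}\,w_{i'j'}\,p_{ij,i'j'} - m_i\,m_{i'}\Big),$$

and the $i$th diagonal element is

$$\sum_{j=0}^{J-1}\sum_{j'=0}^{J-1} w_{ij}\,w_{ij'}\,\mathrm{cov}(f_{ij}, f_{ij'}) = \frac{1}{N}\Big(\sum_{j=0}^{J-1} w_{ij}^2\,p_{ij} - m_i^2\Big).$$

In words, $V$ is $1/N$ times the covariance matrix of the vector of $K$ scores attached to a single patient's intensity levels, which is what one would expect given that $S$ is the average of $N$ such vectors, one per patient.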


Substituting the observed proportions for the true proportions in $\mathrm{cov}(f_{ij}, f_{ij'})$ and $\mathrm{cov}(f_{ij}, f_{i'j'})$, we obtain an estimate for $V$, which we denote as $\hat V$.

Now suppose we have two treatments and that we want to compare their safety profiles. We can compute $S^{(i)}$ and $\hat V^{(i)}$ ($i = 1, 2$) for the two treatments using the same weights $\{w_{ij}\}$. We can construct a multivariate test based on $S^{(i)}$ as

$$T = (S^{(1)} - S^{(2)})'\,(\hat V^{(1)} + \hat V^{(2)})^{-1}\,(S^{(1)} - S^{(2)}) \qquad (1)$$

which, if the two treatments have identical safety profiles, has an asymptotic $\chi^2$-distribution with $K$ degrees of freedom (d.f.). We form the statistic $T$ by comparing the two treatments with respect to their mean intensity grades within the $K$ classes simultaneously.

Two special cases regarding the choice of $\{w_{ij}\}$ deserve mentioning. First is when $w_{ij} = 1$ for all $i$ and $j$ such that $j \neq 0$. In this case, $S^{(i)}$ is the vector that contains the proportions of individuals who receive the $i$th treatment and who have an intensity grade of 1 or above in each of the $K$ classes. Second is when $\{w_{ij}\}$ not only reflect the acceptability of the various intensity levels within a class, but also reflect the seriousness of the $K$ classes with respect to one another and to the individual making the decision. In this case, we can use a single mean score to summarize the safety profile of a treatment by adding the components in $S^{(i)}$ together. If we let $\mathbf{1}$ be a $K \times 1$ vector of 1's, so that $\mathbf{1}' = (1, 1, \ldots, 1)$, then we can use $\mathbf{1}'(S^{(1)} - S^{(2)})$ to quantify the difference in the overall safety profiles between the two treatments. We construct a test statistic as

$$T = (S^{(1)} - S^{(2)})'\,\{\mathbf{1}\,(\mathbf{1}'(\hat V^{(1)} + \hat V^{(2)})\mathbf{1})^{-1}\,\mathbf{1}'\}\,(S^{(1)} - S^{(2)}), \qquad (2)$$

which, if the two treatments have identical safety profiles, has an asymptotic $\chi^2$-distribution with 1 d.f.
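The computations of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' software: it assumes patient-level data (one intensity level per class per patient), uses only the standard library, and all function names and the toy data are hypothetical. The plug-in estimate of $V$ is computed as $1/N$ times the empirical covariance of the per-patient score vectors, which is what substituting observed for true proportions amounts to.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer d.f."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting values: d.f. = 2 (even case) or d.f. = 1 (odd case).
    q, a = (math.exp(-h), 1.0) if df % 2 == 0 else (math.erfc(math.sqrt(h)), 0.5)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while 2 * a < df:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def mean_scores_and_cov(grades, weights):
    """Return (S, V_hat) for one treatment group.

    grades  : one list per patient holding that patient's level in each class
    weights : weights[i][j] is the score w_ij for level j of class i
    """
    n, k = len(grades), len(weights)
    scores = [[weights[i][g[i]] for i in range(k)] for g in grades]
    s = [sum(row[i] for row in scores) / n for i in range(k)]
    # Plug-in estimate: (1/N) * (mean cross-product of scores - s_i * s_i').
    v = [[(sum(r[i] * r[m] for r in scores) / n - s[i] * s[m]) / n
          for m in range(k)] for i in range(k)]
    return s, v


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def compare(grades1, grades2, weights):
    """Return (T, p) for statistic (1) and (T, p) for statistic (2)."""
    k = len(weights)
    s1, v1 = mean_scores_and_cov(grades1, weights)
    s2, v2 = mean_scores_and_cov(grades2, weights)
    d = [a - b for a, b in zip(s1, s2)]
    pooled = [[v1[i][m] + v2[i][m] for m in range(k)] for i in range(k)]
    t_k = sum(di * xi for di, xi in zip(d, solve(pooled, d)))
    t_1 = sum(d) ** 2 / sum(map(sum, pooled))
    return (t_k, chi2_sf(t_k, k)), (t_1, chi2_sf(t_1, 1))


# Hypothetical toy data: 6 patients per group, K = 2 classes, J = 3 levels.
group1 = [[0, 0], [1, 0], [2, 1], [0, 0], [1, 1], [0, 2]]
group2 = [[0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, 0]]
w = [[0, 1, 2], [0, 1, 2]]
(t_multi, p_multi), (t_overall, p_overall) = compare(group1, group2, w)
print(t_multi, p_multi, t_overall, p_overall)
```

With samples this small the asymptotic $\chi^2$ reference distribution is of course not trustworthy; the toy data only exercise the arithmetic. The sketch raises an error when the pooled covariance matrix is singular, which happens, for example, if a class has no non-zero grades in either group.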

3. AN EXAMPLE

In this section, we use data from two randomized clinical trials to illustrate the proposed procedure. Both trials were double-blind studies that compared two drugs for the treatment of angina pectoris. For convenience, we call the two drugs drug A and drug B. One study had 19 patients on drug A and 21 on drug B, while the other had 24 on drug A and 22 on drug B. Because these two studies had the same protocol, we combined data from the two studies to compare the two drugs with respect to their safety profiles.

3.1. Determination of classes

The data collected included clinical signs and symptoms, electrocardiogram examinations and the regular safety laboratory assays (that is, haematology, chemistries and urinalyses). For convenience, we use the term 'clinical event' to represent both clinical signs and symptoms. Based on the disease treated and the adverse reactions expected of the two drugs, we used the following ten classes: cardiovascular; haematologic; gastrointestinal/hepatic; genitourinary/renal; neurologic/psychiatric; pulmonary; special senses; metabolic/nutritional; dermatologic and musculoskeletal.

3.2. Determination of the intensity grade within a class

Determination of the intensity grade for each patient within a class is the most challenging component of the proposed procedure. With the data collected, we created a set of rules to determine these grades. While we recognize that there is no definitive way to determine the grades, we feel that the rules we chose were both rational and consistent with ordinary medical judgement. Our rules are:

(a) We considered reports of clinical events as primary and those based on laboratory findings and physiological testing as secondary. (In other situations, it might be desirable to rely more on laboratory findings. In either case, it is important to decide, before looking at the data, an order based on certainty of diagnosis and anticipated adverse experience, which may include clinical impression, signs, symptoms, laboratory tests, etc.) Thus, if a patient had a clinical event in a class and also had a laboratory abnormality related to the same class, we determined the grade based on the intensity of the clinical event. In situations where the clinical event was innocuous but the laboratory results suggested a more severe problem, such as nocturia and sharply elevated creatinine, we sought additional medical input. To incorporate laboratory data into the determination of the intensity grades, we specified the relationships between certain laboratory parameters and the 10 classes. Table I indicates such relationships, where the number '1' suggests a primary relationship and the number '2' suggests a secondary one. For example, we considered both ALKP and bilirubin to have a primary relationship to the GI/Hepatic class and a secondary relationship to haematology.

Table I. Relationships between certain biochemical assays and the 10 identified classes

Columns (classes): CV; Haem; GI/Hep; GU/Ren; Neur/Psyc; Pul; SS; Met/Nut; Der; MS.
Rows (assays): Albumin; Alk. Phosphatase; Amylase; Bilirubin; BUN; Calcium; Chloride; CK; Creatinine; Glucose; LDH; Inorg. Phosphate; Potassium; Total Protein; SGOT; SGPT; Sodium.
Entries: 1 suggests a primary relationship; 2 suggests a secondary relationship.
[The cell-by-cell entries of this table could not be recovered from the source.]

(b) We accepted a treating physician's judgement on whether a clinical event was unrelated, possibly related or probably related to the treatment. (A cautionary note in using a physician's judgement in this context is that one has to be sure that the treating physician was not biased in favour of any treatment because of a financial relationship with the pharmaceutical sponsor.) When assigning the intensity grade, we treated possible and probable relationships the same. Table II provides an outline of how we graded the intensity based on a single reported clinical event.
In Table II, a 'yes' to the question 'related to treatment' includes both a possible and a probable relationship. For example, with mild intensity of a clinical event judged as unrelated to treatment, we assigned grade 0. The same event, if judged as treatment-related, would receive grade 1. Generally speaking, the grade increases as the intensity of the clinical event increases. Also, those events related to the treatment have an intensity grade one level higher than the corresponding events not related to the treatment.

Table II. Rules for grading a single adverse clinical event

Intensity of a     Related to     Grade
clinical event     treatment      assigned
Mild               No             0
Mild               Yes            1
Moderate           No             1
Moderate           Yes            2
Severe             No             2
Severe             Yes            3
Intolerable        No             3
Intolerable        Yes            4

(c) We followed the treatment emergent philosophy with regard to temporal relationships for the laboratory and ECG data and considered observations from both the treatment and the close follow-up periods. We used results reported prior to the treatment to reflect changes in the status. For example, if a pretreatment laboratory or electrocardiogram abnormality did not deteriorate during the course of the treatment, we disregarded the abnormality. On the other hand, if a laboratory assay normal prior to the treatment became abnormal during the course of the treatment, we assigned grade 1 to the abnormality. If an abnormality was incompatible with life, we assigned grade 2. The latter situation did not occur in our example. Unlike the laboratory or ECG results, we graded clinical events according to Table II regardless of their documentation prior to treatment.

(d) When a patient had more than two separate reports of a clinical event, or more than two clinical events that fell in the same class (with classes defined in Section 3.1), we raised the intensity grade for that class by 1. For example, if a patient had three reports of mild headache related to treatment, we would assign grade 2 for the corresponding class. Also, if a patient had headache, dizziness and somnolence, all of which were mild and related to treatment, we would raise the grade from 1 to 2.

The Appendix gives a detailed example of how we determined the intensity grades for one patient.
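The grading of clinical events under Table II and rule (d) can be sketched as a small lookup. The function names and the data layout are hypothetical, and two points are assumptions rather than statements from the text: the class grade starts from the worst single-event grade before rule (d) is applied, and no upper cap is imposed on the raised grade. Laboratory and ECG findings, rule (a)'s request for additional medical input, and rule (c) are not covered.

```python
# Table II as a lookup: (intensity, related to treatment) -> grade assigned.
SINGLE_EVENT_GRADE = {
    ("mild", False): 0, ("mild", True): 1,
    ("moderate", False): 1, ("moderate", True): 2,
    ("severe", False): 2, ("severe", True): 3,
    ("intolerable", False): 3, ("intolerable", True): 4,
}


def is_related(judgement):
    """Possible and probable relationships are treated the same (rule (b))."""
    return judgement.lower() in {"possible", "probable"}


def grade_event(intensity, judgement):
    """Grade one reported clinical event according to Table II."""
    return SINGLE_EVENT_GRADE[(intensity.lower(), is_related(judgement))]


def class_grade(events):
    """Grade one patient within one class from a list of (intensity, judgement).

    Assumption: the starting point is the worst single-event grade.
    Rule (d): more than two reports or events in the class raise the grade by 1.
    """
    if not events:
        return 0
    worst = max(grade_event(intensity, judgement) for intensity, judgement in events)
    return worst + 1 if len(events) > 2 else worst


# Three reports of mild, treatment-related headache, as in the example for rule (d).
three_headaches = [("mild", "probable")] * 3
print(class_grade(three_headaches))
```

Encoding Table II as a dictionary keeps the rules auditable and easy to fix before any data are examined, in line with the recommendation that grading decisions precede data scrutiny.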
We wish to point out that grade levels are not equally spaced; it takes much more of an adverse reaction to move up a grade at the high end of the grade scale. Using the above rules, we obtained data similar to those displayed in Table III. We slightly modified the original data to better illustrate the proposed procedure. In Table III, we use the following four levels to summarize the study patients' safety experience within each class: none, grades 1, 2 and 3.

Table III. Results from consolidating the safety data from two randomized studies

Class                      Drug   None   Grade 1   Grade 2   Grade 3
Cardiovascular             A      29     3         8         3
                           B      38     4         1         0
Haematologic               A      40     3         0         0
                           B      40     3         0         0
Gastrointestinal/hepatic   A      37     3         2         1
                           B      36     3         4         0
Genitourinary/renal        A      40     2         1         0
                           B      38     2         3         0
Neurologic/psychiatric     A      26     4         13        0
                           B      15     11        10        7
Pulmonary                  A      37     4         2         0
                           B      41     1         1         0
Special senses             A      41     1         1         0
                           B      41     0         2         0
Metabolic/nutritional      A      40     1         2         0
                           B      39     2         2         0
Dermatologic               A      41     1         1‡
                           B      41     0         2‡
Musculoskeletal            A      39     3         1‡
                           B      37     4         2‡

‡ Grades 2 and 3 combined; the split between the two columns could not be recovered from the source.

To compare the two drugs with respect to their overall safety profiles, we first used the statistic in (1) with the same equally-spaced scores $w_{ij}$ for each of the 10 classes, that is, $w_{i0} = 0$, $w_{i1} = 1$, $w_{i2} = 2$ and $w_{i3} = 3$. The test statistic $T$ has a value of 21.96, which, upon comparison to a $\chi^2$-distribution with 10 degrees of freedom, yields a p-value of 0.015 and suggests that the overall safety profiles for the two drugs differ at the 5 per cent level. The use of the $\chi^2$-distribution in this example serves as an approximation because of the small frequencies for some intensity categories.

A closer look at the data in Table III reveals the possible sources of difference between the two drugs. The two drugs differ most for the cardiovascular and the neurologic/psychiatric classes. While drug A has a higher incidence of cardiovascular events, drug B has a higher incidence of neurologic/psychiatric events. Applying Pearson's $\chi^2$-test to the corresponding 2 x 4 subtables in Table III, we obtain p-values of 0.02 (cardiovascular) and 0.004 (neurologic/psychiatric). If we collapse grades 1, 2 and 3 in Table III and compare the proportions of non-zero grades for these two drugs, we obtain p-values of 0.02 for both classes. All the above comparisons are significant at the 5 per cent level. Nevertheless, these results need cautious interpretation since the tests concerning the cardiovascular and neurologic/psychiatric classes are data-driven. A possible remedy is to adjust the significance level for each test using Bonferroni's criterion, in anticipation of a test to be conducted within each class.

To see how the differences between the two drugs in the various classes affect the overall comparison when we combine the 10 classes with use of the statistic in (2), we computed the 1 d.f. statistic under various choices of $\{w_{ij}\}$. The $\{w_{ij}\}$ we considered all had the form $w_{ij} = c_i w_j$, where $c_i$ reflects the relative seriousness of the 10 classes to an individual and $w_0 = 0$, $w_1 = 1$, $w_2 = 2$, $w_3 = 3$. We found the computed statistic robust to the choice of $c_i$ for $i \neq 1$ and $i \neq 5$. This is due to the comparable distributions of the intensity level of the two treatments in the corresponding 8 classes. Thus, for presentation, we set $c_i = 1$ for $i \neq 1$ and $i \neq 5$.
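The univariate comparisons for the cardiovascular and neurologic/psychiatric classes can be checked with a short, standard-library sketch. The counts are those read from Table III for these two classes (the rows were reassembled from the extracted table on the assumption that every row sums to the 43 patients per drug); the function names are hypothetical. The last part of the block looks only at these two classes and shows how class multipliers $c_1$ and $c_5$ in $w_{ij} = c_i w_j$ change the sign of their combined contribution to $\mathbf{1}'(S^{(1)} - S^{(2)})$; the other eight classes, which also enter statistic (2), are ignored here.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer d.f."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    q, a = (math.exp(-h), 1.0) if df % 2 == 0 else (math.erfc(math.sqrt(h)), 0.5)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while 2 * a < df:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def pearson_chi2(table):
    """Pearson chi-square statistic and p-value for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, chi2_sf(stat, df)


def collapse(table):
    """Merge grades 1 to 3 into a single non-zero column."""
    return [[row[0], sum(row[1:])] for row in table]


def mean_score(counts, w=(0, 1, 2, 3)):
    """Mean score of one drug in one class under equally-spaced scores."""
    return sum(c * s for c, s in zip(counts, w)) / sum(counts)


# Counts (None, Grade 1, Grade 2, Grade 3) for drug A then drug B, from Table III.
cardio = [[29, 3, 8, 3], [38, 4, 1, 0]]
neuro = [[26, 4, 13, 0], [15, 11, 10, 7]]

p_cardio = pearson_chi2(cardio)[1]
p_neuro = pearson_chi2(neuro)[1]
p_cardio_2x2 = pearson_chi2(collapse(cardio))[1]
p_neuro_2x2 = pearson_chi2(collapse(neuro))[1]

# Differences in mean score, drug A minus drug B, within each class.
d_cardio = mean_score(cardio[0]) - mean_score(cardio[1])
d_neuro = mean_score(neuro[0]) - mean_score(neuro[1])


def two_class_difference(c1, c5):
    """Contribution of the two classes to 1'(S_A - S_B) under w_ij = c_i * w_j."""
    return c1 * d_cardio + c5 * d_neuro


print(p_cardio, p_neuro, p_cardio_2x2, p_neuro_2x2)
print(d_cardio, d_neuro, two_class_difference(1, 1))
```

The p-values come out close to the 0.02 and 0.004 quoted for the 2 x 4 subtables and to the 0.02 quoted for the collapsed tables. With equally-spaced scores the two mean-score differences are equal in size and opposite in sign, so with $c_1 = c_5$ they cancel, while a larger $c_1$ tips the combined difference against drug A and a larger $c_5$ tips it against drug B.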
On the other hand, the computed statistic was sensitive to the values of $c_1$ (cardiovascular) and $c_5$ (neurologic/psychiatric), a result due to the opposite configurations of the intensity level within these two classes. We therefore concentrated on the relative magnitude of $c_1$ to $c_5$. In the first half of Table IV, we report the p-values obtained under various choices of $c_1$ with $c_5$ set to 1. Using different $c_1$ and $c_5$ amounts to weighting the two classes differently in computing the overall test statistic in (2).

Table IV. P-values of the statistic in (2) under various choices of $c_1$ and $c_5$, with $c_i = 1$ for all $i$ such that $i \neq 1$ and $i \neq 5$

Choice of $c_1$ and $c_5$      P-value
$c_1 = 1$,   $c_5 = 1$         0.743
$c_1 = 2$,   $c_5 = 1$         0.475
$c_1 = 3$,   $c_5 = 1$         0.171
$c_1 = 4$,   $c_5 = 1$         0.076
$c_1 = 4.7$, $c_5 = 1$         0.049
$c_1 = 5$,   $c_5 = 1$         0.041

$c_1 = 1$,   $c_5 = 2$         0.256
$c_1 = 1$,   $c_5 = 4$         0.078
$c_1 = 1$,   $c_5 = 5$         0.058
$c_1 = 1$,   $c_5 = 6$         0.047

The results in Table IV are revealing. When an individual regards the cardiovascular and neurologic/psychiatric events as equally serious (that is, $c_1 = 1$), the comparison fails to conclude differences in the two drugs' overall safety profiles at the 5 per cent level despite our earlier findings. This is because the observed differences between the two treatment groups in these two key classes occurred in opposite directions and in the combined statistic they balanced one another. With the cardiovascular events regarded as more serious through the use of a higher $c_1$ in Table IV, however, the associated p-value decreased. When $c_1 = 5$, the p-value is 0.041, which suggests that drug B has a more favourable safety profile. In other words, if an individual weighs the cardiovascular events as at least five times more serious than the neurologic/psychiatric events of the same intensity level, he/she would choose drug B based on the available data.

We also reversed the roles of $c_1$ and $c_5$ by setting $c_1 = 1$ in the second half of Table IV. Note that as $c_5$ increases, drug A becomes more appealing. When $c_5$ increases beyond 6, drug A is judged as more favourable. Thus, depending on how an individual weighs one class against the other, he/she will obtain different conclusions on which drug has a more favourable safety profile from his/her perspective. This dependence, a result of weighing the acceptability of the various classes against one another, demonstrates how the acceptability issue translates to that of treatment selection.

In this example, the highest intensity grades assigned were several 3's. There were a few instances where the intensity grades were close to grade 4.
Within a class, this draws attention to the latitude within the steps for determination of the intensity grade. It is true that the creation of more intensity grades is likely to increase the sensitivity of the grading system, but it will also lead to insoluble decision problems. Our goal in this paper was to determine the intensity grades within a class while recognizing the necessity for compromises. It is possible that the choice of intensity grades may vary among different researchers. Our experience, however, suggests that as long as the choice is unbiased, consistent, reasonable and made before any data analysis, it is unlikely to lead to misinterpretation of study results. As a safeguard against possible bias in determining the intensity grades, we recommend that individuals blind to the treatments determine these grades. The ideal situation is when a group of experts in the target disease area jointly decide the rules that govern the assignment of intensity grades.

4. OTHER APPLICATIONS

Table V. Summaries of the overall safety data in 8 classes from two treatments

Class            Treatment   None   Grade 1   Grade 2   p-value*   p-value†
Renal            1           139    16        5         0.308      0.144
                 2           138    10        2
Psychiatric      1           128    21        11        0.162      0.116
                 2           130    16        4
Hepatic          1           131    25        4         0.313      0.133
                 2           132    16        2
Cardiovascular   1           152    7         1         0.307      0.154
                 2           147    3         0
CNS              1           120    30        10        0.325      0.135
                 2           123    20        7
Haematologic     1           150    9         1         0.255      0.129
                 2           146    3         1
GI               1           144    12        4         0.245      0.124
                 2           142    7         1
Metabolic        1           141    14        5         0.272      0.116
                 2           140    8         2

* P-value from Pearson's $\chi^2$-test applied to the original 2 x 3 table
† P-value from Pearson's $\chi^2$-test applied to the collapsed 2 x 2 table

In addition to a multivariate comparison of the safety profiles of two treatments, the proposed analytic procedure has an additional appeal, as illustrated by the data in Table V. Table V gives the safety summaries of two treatments in 8 classes. Within each class, there are 3 intensity levels (none, grade 1 and grade 2). A total of 160 patients received treatment 1 and 150 patients received treatment 2. The numbers in the table represent the numbers of patients in each treatment group whose overall safety experience within a particular class was at the level described by the column heading. For each class, there are two p-values in the table. The first is that from Pearson's $\chi^2$-test applied to the original 2 x 3 table and the second is that from the same test, but applied to the 2 x 2 table obtained by collapsing grade 1 with grade 2. Although treatment 1 consistently had a higher incidence of non-zero grades in all classes, all the p-values are greater than 10 per cent. The univariate test that compares each class separately did not provide much evidence for a statistically significant difference between the two treatments at the 5 per cent level.

Combining grade 1 with grade 2, treatment group 1 totalled 175 cases of non-zero grades in the 8 classes. The corresponding number for treatment group 2 is 92. Various joint distributions of the 8 classes can lead to the marginal distributions in Table V. We considered three such distributions; the first two represent two extreme situations while the third is somewhere between the two extremes.

Case 1: Each patient in treatment group 1 experienced at least one treatment-related event. Fifteen patients who experienced grade 1 psychiatric events also experienced an adverse CNS event at the same level. As for patients who received treatment 2, 92 experienced exactly one treatment-related event. With this scenario, the total number of patients who experienced any treatment-related events is the highest.

Case 2: Forty patients in treatment group 1 and 27 in treatment group 2 are responsible for all the cases documented. To be specific, in treatment group 1, patients 1 to 21 experienced adverse renal events, patients 1 to 32 experienced adverse psychiatric events, patients 1 to 29 experienced adverse hepatic events, patients 1 to 8 experienced adverse cardiovascular events, etc. Similarly, in treatment group 2, patients 1 to 12 experienced adverse renal events, patients 1 to 20 experienced adverse psychiatric events, patients 1 to 18 experienced adverse hepatic events, etc. With this scenario, the total number of patients who reported any treatment-related events is the lowest.

Case 3: The decision on who, in each of the two treatment groups, experienced what adverse reaction(s) was made separately within each class with the use of random numbers. In this case, the total number of patients who experienced any treatment-related events is between those of Case 1 and Case 2.

We made several choices of $\{w_{ij}\}$. In Table VI, we give the p-values for the 8 d.f. $T$-statistics obtained under four different choices of $\{w_{ij}\}$ that satisfy $w_{ij} = w_j$. The p-values were obtained by comparing the $T$-statistics to a $\chi^2$-distribution with 8 d.f.

Table VI. P-values of the 8 d.f. statistic applied to the data in Table V

Choice of weights
$(w_0, w_1, w_2)$     Case 1   Case 2   Case 3
First choice          0.017    0.020    0.016
Second choice         0.011    0.013    0.010
Third choice          0.018    0.021    0.016
Fourth choice         0.031    0.033    0.026

[The four weight triples that label the rows could not be recovered from the source.]

The comparison concludes a significant difference at the 5 per cent level for each of the four choices of $\{w_{ij}\}$ and for all three cases. It does so by pooling together non-significant differences that nevertheless demonstrate consistency. The significance level generally decreases with more weight placed on the severe category.
This is not unexpected: the frequencies for this category are usually lower than those for the other categories, so a comparison based mostly on the incidence of this category will have less power.

We also compared the two treatments using the 1 d.f. statistic. For this statistic, we considered seven choices of {w_ij}. The first four used the same sets of scores reported in Table VI. In other words, we treated the eight classes symmetrically. The fifth to the seventh choices involved ranking the eight classes to reflect their relative seriousness to an individual and setting w_ij as the product of c_i and w_j, with w_j as one of the first three choices in Table VI. The ranking led us to set the class-specific scores {c_i} as follows: cardiovascular = 6, renal and hepatic = 5, haematologic = 4, metabolic = 3, GI = 2, CNS and psychiatric = 1, where a higher c_i indicates a more serious and less tolerable class. The test statistics were referred to a χ²-distribution with 1 d.f. P-values for the comparisons under all seven choices and for all three cases are much smaller than 0.01, again suggesting a highly significant difference between the overall safety outcomes of the two treatments.

This example demonstrates how the proposed procedure can combine consistent, yet non-significant, differences into an overall significant difference. The ability to do so is particularly important when two treatments have efficacy judged as equivalent and it is important for a patient to choose a treatment that, from his/her perspective, has a more favourable safety outcome.
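
The two families of scores used in this example can be sketched in a few lines of Python. The class-specific scores c_i below are the ones listed in the text; the within-class scores w_j and the intensity-level names are hypothetical placeholders, since the Table VI values are not reproduced here. This is an illustrative sketch, not the computation used for the reported p-values.

```python
# Class-specific seriousness scores c_i, as listed in the text.
CLASS_SCORES = {
    "cardiovascular": 6,
    "renal": 5,
    "hepatic": 5,
    "haematologic": 4,
    "metabolic": 3,
    "GI": 2,
    "CNS": 1,
    "psychiatric": 1,
}

# HYPOTHETICAL within-class intensity scores w_j and level names.
# The actual choices are those of Table VI, which is not reproduced here.
INTENSITY_SCORES = {"mild": 1, "moderate": 2, "severe": 3}


def symmetric_scores(class_names, intensity_scores):
    """Return w[i][j] = w_j: every class uses the same intensity scores."""
    return {cls: dict(intensity_scores) for cls in class_names}


def product_scores(class_scores, intensity_scores):
    """Return w[i][j] = c_i * w_j for every class i and intensity level j."""
    return {
        cls: {level: c * w for level, w in intensity_scores.items()}
        for cls, c in class_scores.items()
    }


w_symmetric = symmetric_scores(CLASS_SCORES, INTENSITY_SCORES)
w_product = product_scores(CLASS_SCORES, INTENSITY_SCORES)
```

With the product form, a severe cardiovascular grade carries six times the weight of a severe CNS grade, which is how the ranking of classes enters the 1 d.f. comparison.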

5. COMMENTS
In this paper, we advocated combining the diverse safety data to provide a more informative picture for the safety profile of a treatment. We also proposed a method for treatment selection that incorporates an individual’s evaluation regarding the acceptability of the various intensity levels within a functional class as well as among different classes. The procedure extends easily to compare the safety profiles of more than two treatments.

The crucial steps in consolidating the safety data are determination of the relevant classes and the overall intensity grade for each patient within a class. With few clinical events reported and with the laboratory data dominating the decision on intensity grades, one can use procedures like those proposed by Sogliero-Gilbert et al.3 to determine the extent of abnormality.

As for the statistical comparison of the safety profiles, the proposed method relies upon an individual’s ability to quantify the degree of acceptability of the various intensity levels. Because different individuals will likely choose different scores that can lead to different conclusions, the interpretation of the comparison results is best done at the individual level.

The statistic in (1) compares the safety profiles of two treatments at the individual class level. If a specific sign, symptom or laboratory abnormality is considered detrimental to the target population, it will have a great influence in the determination of the intensity grade for the corresponding class. As a result, a substantial difference in the incidence of the sign, symptom or laboratory abnormality will translate to a substantial difference in the mean intensity grades for the corresponding class. The latter is likely to show up in the statistic in (1). In contrast, if one is interested in differences in the incidence of any sign, symptom or laboratory abnormality, one should perform an exploratory analysis instead of that proposed in this paper.
For notational convenience, we assumed in Section 2 the same number of intensity levels J for all classes. We can remove this assumption without affecting the results. Allowing different classes to have different numbers of intensity levels makes the proposed approach more flexible in handling classes that might differ substantially in nature.

We recommended determining the classes based on body systems and the anticipated adverse experience. Since under the hypothesis of equal safety profiles the statistic in (1) has an asymptotic χ²-distribution with K degrees of freedom, the power of the comparison is affected by the choice of the number of classes. Nevertheless, one can always specify a priori examination of only a subset of classes that have primary importance to the target patient population. Thus, the degrees of freedom for the resulting statistic will be the number of classes in the subset. On the other hand, the 1 d.f. statistic in (2) is less affected by the choice of K. One can handle the less important classes by assigning them near-0 scores when computing the 1 d.f. statistic.

The procedure proposed in this paper compares the overall safety profiles between two treatments with a set of scores {w_ij}. It is quite likely that two treatments, although with quite different safety outlooks in regard to some classes, fail to demonstrate any significant difference in their overall safety profiles with the statistic in (2). This phenomenon was illustrated in the example in Section 3. This is a direct consequence of the statistic’s weighing the various classes and producing one single average difference between the two treatments. On the other hand, if one is interested in comparing the differences at the individual class level, one should use the statistic described in (1).
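
The effect of the degrees of freedom on the comparison can be illustrated with the upper tail of the χ²-distribution. The sketch below uses only the Python standard library and the closed-form tail for integer degrees of freedom; the function name `chi2_sf` and the statistic value 12.0 are illustrative assumptions, not quantities from the paper.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with
    integer degrees of freedom df (closed-form series, no external library)."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2 - 1} y^i / i!
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{i=1}^{(df-1)/2} y^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


# Hypothetical statistic value, referred to K = 8 classes and to a
# pre-specified subset of 3 classes.
statistic = 12.0
p_all_classes = chi2_sf(statistic, 8)
p_subset = chi2_sf(statistic, 3)
```

If most of the treatment difference sits in a few classes, so that restricting to them changes the statistic little, the same value is judged against fewer degrees of freedom and yields a smaller p-value. That is the sense in which the choice of K affects power.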

Aggregating experience to decide the safety profile of a treatment is not unknown to practising physicians, who routinely need to evaluate each patient’s need to arrive at an individualized treatment strategy. The process of information aggregation and treatment selection, although existing for many decades, is not an exact one. On the other hand, if done properly, it provides a reasonable solution in a field which, by nature, is not precise. What we propose in this paper is to formalize this process. The primary advantage of the proposed procedure results from quantifying the relative seriousness of the intensity levels within different classes and ascertaining how this quantification translates to treatment selection.

A possible refinement of the analytic procedure is to include an individual’s prognostic factors when estimating the probabilities of the various intensity levels within different classes. That is, instead of using the population-based probability estimates in computing the statistics, one can use individualized probability estimates. One can easily include an individual’s prognostic factors or pre-treatment information for this purpose with use of logistic regression models such as those suggested in McCullagh,5 among others.

The process of sorting out relevant safety data to arrive at appropriate intensity grades can be tedious. Nevertheless, once a logical plan is in place, one can write computer programs to automate this process. The programs can systematically sift through various numerical results, key words and clinical event descriptions. With modern computing facilities, the implementation of such programs is quite feasible. In our opinion, with safety data in clinical trials, the effort to implement their analysis is minute compared to that of collecting them.

Our approach, through the choice of {w_ij}, has a subjective component. While it is crucial to have the classes and the overall intensity grades determined in an objective manner, the assignment of the scores is left to be made at the individual level. In other words, subjectiveness is deliberately built into the procedure so that treatment selection can reflect an individual’s evaluation of the various intensity levels and possibly the acceptability of the various classes. This subjectiveness does create some ambivalence when one attempts to use this approach to make a population-oriented treatment comparison as in a new drug application. One possible solution for the latter is to use w_ij = j for each class and use the statistic in (1). Many researchers (for example, Agresti6) have opted for equally-spaced scores when analysing ordinal data. This somewhat conventional choice of scores for ordered categories can eliminate the subjective nature of the scores.

In the paper, we concentrated on results from one clinical trial. One can generalize the approach to data from studies that are similar in nature, that is, studies with the same treatments and the same target population. To do so, one needs to apply the same set of rules to determine intensity grades for the same set of classes. The latter should not be a problem since the determination of the classes and the intensity grades should only depend on the underlying disease and the treatments involved and not on data obtained in any trial.
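
The refinement based on individualized probability estimates can be sketched with a proportional odds (cumulative logit) model, one member of the family discussed by McCullagh, in which logit P(Y ≤ j | x) = θ_j − η with η a linear predictor built from the individual’s prognostic factors. The cutpoints and linear predictor values below are hypothetical, and the scores are the equally spaced choice w_ij = j; this is a sketch of the idea, not a fitted model.

```python
import math


def category_probs(cutpoints, eta):
    """Category probabilities under logit P(Y <= j) = theta_j - eta.

    cutpoints: increasing thresholds theta_1 < ... < theta_{J-1}
    eta: linear predictor for one individual (larger means worse grades)
    Returns a list of J probabilities, lowest intensity level first.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cumulative = [1.0 / (1.0 + math.exp(-(t - eta))) for t in cutpoints]
    cumulative.append(1.0)  # P(Y <= J) = 1
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


def expected_score(probs, scores):
    """Mean score sum_j w_j * P(Y = j) for one class and one individual."""
    return sum(p * s for p, s in zip(probs, scores))


# HYPOTHETICAL cutpoints for a class with J = 3 intensity levels.
THETA = [0.0, 1.5]
EQUAL_SCORES = [1, 2, 3]  # equally spaced scores w_ij = j

low_risk = expected_score(category_probs(THETA, -1.0), EQUAL_SCORES)
high_risk = expected_score(category_probs(THETA, 1.0), EQUAL_SCORES)
```

Under this parameterization an individual with a larger linear predictor has more probability on the higher intensity levels, so the individualized mean score that would replace the population-based estimate is larger.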

APPENDIX: DETERMINATION OF THE GRADES FOR A PATIENT
This is an example to illustrate how the data of a patient would be used to grade his safety experience in the study. The patient under consideration had the following abnormalities: (1) on the electrocardiogram at baseline, he had an interventricular conduction defect, inverted T waves and abnormal ST waves suggestive of an old myocardial infarct or myocardial ischemia, but no changes were observed while on medication; (2) on the haematology report, he had a normal number of leukocytes at screen and baseline, but the number was elevated after 4 weeks of medication; (3) urinalysis revealed an elevated protein content and an elevated number of red blood cells after 4 weeks of medication; (4) on the safety forms, he reported perianal pain, headache, tiredness, and arthritis.

CVD:
    No clinical events, score = 0
    EKG: QRS-IVCD, T-Inverted, ST-abnormal, old infarct, myocardial ischemia at baseline. No changes at follow-up, score = 0
    Overall CVD score = 0
Haematology:
    No clinical events, score = 0
    Leukocytes normal at screen, normal at wk 0, high at wk 4, score = 1
    Overall Haematology score = 1
GI/Hepatic:
    Perianal pain, wk 7/8, severe, not related, score = 2
    No abnormal labs, score = 0
    Overall GI/Hepatic score = 2
GU/Renal:
    No clinical events, score = 0
    Urine Protein high wk 4; RBCHPF high wk 4, score = 1
    Overall GU/Renal score = 1
Neuro/Psych:
    Headache, wk 5, mild, related, score = 1; headache, wk 7/8, mild, not related, score = 0
    Tiredness, wk 1.5, 2, 2.5, 3, 4, 5, 7/8, mild, related, score = 2 (raised 1 grade because of frequency of report > 2)
    No abnormal Neuro/Psych labs, score = 0
    Overall Neuro/Psych score = 2
Pulmonary: No abnormalities, overall Pulm score = 0
Special senses: No abnormalities, overall SS score = 0
Metab/nutrit: No abnormalities, overall M/N score = 0
Dermatology: No abnormalities, overall Derm score = 0
Musculoskeletal:
    Arthritis, wk 1.5, wk 4, severe, not related, score = 2
    No lab abnormalities, score = 0
    Overall MS score = 2
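
The grading in this example lends itself to automation. The sketch below encodes the component scores listed above and takes the overall class grade as the maximum of its component scores. That rule is inferred from the worked example (it reproduces every overall score shown) rather than stated explicitly in the text, and the function names are hypothetical.

```python
# Component scores for the example patient, copied from the appendix.
COMPONENT_SCORES = {
    "CVD": [0, 0],                # clinical events, EKG
    "Haematology": [0, 1],        # clinical events, leukocytes
    "GI/Hepatic": [2, 0],         # perianal pain, labs
    "GU/Renal": [0, 1],           # clinical events, urinalysis
    "Neuro/Psych": [1, 0, 2, 0],  # headache (related), headache (not related), tiredness, labs
    "Pulmonary": [0],
    "Special senses": [0],
    "Metab/nutrit": [0],
    "Dermatology": [0],
    "Musculoskeletal": [2, 0],    # arthritis, labs
}


def raise_for_frequency(base_score, n_reports, threshold=2):
    """Raise a score by one grade when an event is reported more than
    `threshold` times, as done for tiredness in the example."""
    return base_score + 1 if n_reports > threshold else base_score


def overall_grade(component_scores):
    """ASSUMED rule: overall class grade = worst (maximum) component score."""
    return max(component_scores) if component_scores else 0


overall = {cls: overall_grade(scores) for cls, scores in COMPONENT_SCORES.items()}

# Tiredness: mild and related (base score 1), reported at 7 visits.
tiredness_score = raise_for_frequency(1, 7)
```

In practice the component scores themselves would come from rules applied to laboratory values, key words and event descriptions; only the final aggregation step is shown here.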

ACKNOWLEDGEMENT

The authors wish to thank an editor and two referees for their helpful comments and suggestions which have greatly improved the paper.

REFERENCES
1. Peace, K. E. ‘Design, monitoring, and analysis issues relative to adverse events’, Drug Information Journal, 21, 21-28 (1987).
2. Brown, K. R., Getson, A. J., Gould, A. L., Martin, C. M. and Ricci, F. M. ‘Safety of Cefoxitin: An approach to the analysis of laboratory data’, Reviews of Infectious Diseases, 1, 228-231 (1979).
3. Sogliero-Gilbert, G., Mosher, K. and Zubkoff, L. ‘A procedure for the simplification and assessment of lab parameters in clinical trials’, Drug Information Journal, 20, 279-296 (1986).
4. Gilbert, G. S., Ting, N. and Zubkoff, L. ‘A statistical comparison of drug safety in controlled clinical trials: The Genie score as an objective measure of lab abnormalities’, Drug Information Journal, 25, 81-96 (1991).
5. McCullagh, P. ‘Regression models for ordinal data (with discussion)’, Journal of the Royal Statistical Society, Series B, 42, 109-142 (1980).
6. Agresti, A. Analysis of Ordinal Categorical Data, Wiley, New York, 1984.
